SIQ-1-35B

LLM by Wortega · Model page

Wortega's 34.7B MoE LLM fine-tuned from Qwen3.6-35B-A3B for agentic coding and autonomous research tasks.

Base model

Qwen/Qwen3.6-35B-A3B

Model Card


SIQ-1-tiny-35b 🪽

A tiny universal agent: autoresearch, coding, reasoning.

SIQ-1-tiny-35b is a tiny MoE (35B total but only ~3B active per token) distilled to be a strong universal agent: equally at home running autonomous ML research (autoresearch), writing and debugging code, driving tool-use / agentic workflows, and doing hard reasoning. Despite its 3B active footprint, it matches or beats much larger peers on core reasoning, sycophancy resistance, and agentic coding, at a lower token cost.

Autoresearch duel (head-to-head)

In a controlled three-way autoresearch test on openai/parameter-golf, each model drove the same Pi-Agent loop (edit train_gpt.py -> train (300s) -> eval val_bpb -> keep/revert) on its own 1xA6000 for 2h. SIQ-1-tiny-35b reached val_bpb 1.767 (12 experiments, full 2h), neck-and-neck with Claude Opus 4.8 (~1.76) and far ahead of GLM-5.2 (2.078). GLM stagnated on the baseline: its only hypothesis was "add depth" (which hurt the metric), and it stopped emitting actions after ~65 min. SIQ instead climbed via LR-schedule and capacity edits (warmdown 1200->800, matrix_lr 0.04->0.05, ...). Note that val_bpb on a single A6000 is not comparable to the official 8xH100 leaderboard; this is a relative head-to-head under identical conditions.
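The edit -> train -> eval -> keep/revert loop described above can be sketched as a greedy hill-climb on val_bpb. This is a minimal illustration, not the Pi-Agent implementation: `keep_or_revert_loop`, `propose_edit`, and `train_and_eval` are hypothetical names, and the toy effect sizes below are made up purely to exercise the logic.

```python
from typing import Callable, List, Tuple


def keep_or_revert_loop(
    baseline_bpb: float,
    propose_edit: Callable[[int], str],
    train_and_eval: Callable[[List[str]], float],
    n_experiments: int,
) -> Tuple[float, List[str]]:
    """Greedy hill-climb: an edit is kept only if it lowers val_bpb."""
    best_bpb = baseline_bpb
    kept: List[str] = []
    for step in range(n_experiments):
        edit = propose_edit(step)              # the agent proposes one edit
        bpb = train_and_eval(kept + [edit])    # train briefly, then evaluate
        if bpb < best_bpb:                     # lower val_bpb is better: keep
            best_bpb = bpb
            kept.append(edit)
        # otherwise the edit is reverted, i.e. never added to `kept`
    return best_bpb, kept


# Toy stand-in for a real training run (hypothetical effect sizes).
toy_effects = {
    "warmdown 1200->800": -0.05,
    "add depth": +0.10,
    "matrix_lr 0.04->0.05": -0.03,
}
toy_edits = list(toy_effects)


def toy_eval(applied: List[str]) -> float:
    return round(2.0 + sum(toy_effects[e] for e in applied), 3)


best, kept_edits = keep_or_revert_loop(2.0, lambda i: toy_edits[i], toy_eval, 3)
print(best, kept_edits)
```

In the toy run the "add depth" edit raises the metric and is reverted, while the two schedule edits are kept.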

It is the winning arm of a controlled SFT / RFT / DPO / offline-GRPO post-training study on Qwen3.6-35B-A3B: RFT on the judge-top-half wins on both ideation quality and agentic ability.

Performance

On the full 198-question GPQA-Diamond (all models served as Q4_K_M GGUF, greedy decoding at temp 0, identical harness), SIQ-1-tiny-35b is Pareto-best: the highest accuracy and the fewest tokens (figure below). A 3B-active model edges out both the Qwen3.6-35B base and Nex-N2-mini while spending fewer tokens per question.

| Benchmark | SIQ-1-tiny-35b | Nex-N2-mini | Qwen3.6-35B |
|---|---|---|---|
| General & Reasoning | | | |
| GPQA-Diamond (Q4, co-measured) | 70.2 | 67.2 | 68.2 |
| GPQA-Diamond (bf16, full eval) | 90.2 | 82.6 | - |
| IFEval (inst-loose) | 89.5 | 89.1 | - |
| tok/question (GPQA, mean) | 3158 ✅ | 3363 | 3500 |
| Agentic coding | | | |
| vibetest (Claude-judge, /10) | 9.21 | 8.12 | - |
| Ideation (autoresearch) | | | |
| Opus-judge ideation (/100) | 30.2 | - | 10.2 (base) |

bf16 + tuned harness scores higher (90.2 GPQA); the Q4 row is the apples-to-apples co-measured comparison shown in the figure. Terminal-Bench 2.1 (Harbor, terminus-2, k=5) is in progress.
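The "Pareto-best" claim can be checked mechanically from the Q4 co-measured row and the tok/question row. The sketch below is only an illustration: `pareto_front` is a hypothetical helper, and the numbers are copied from the table above.

```python
from typing import Dict, List, Tuple


def pareto_front(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """points maps name -> (accuracy, tokens); higher accuracy and fewer tokens are better."""
    front = []
    for name, (acc, tok) in points.items():
        dominated = any(
            (a >= acc and t <= tok) and (a > acc or t < tok)
            for other, (a, t) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# GPQA-Diamond (Q4, co-measured) accuracy and mean tokens per question.
q4_rows = {
    "SIQ-1-tiny-35b": (70.2, 3158),
    "Nex-N2-mini": (67.2, 3363),
    "Qwen3.6-35B": (68.2, 3500),
}
print(pareto_front(q4_rows))  # ['SIQ-1-tiny-35b']
```

A model is on the front when no other model is at least as good on both axes and strictly better on one; here a single model dominates the other two.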

BullshitBench v2: pushback vs. sycophancy

Score 0–2 (Clear Pushback = 2 / Partial = 1 / Accepted = 0). Panel: claude-sonnet-4.6 + gpt-5.2 + gemini-3.1-pro (mean); the judge sees the final answer only (CoT stripped); no system prompt, temp 0.7.

| Model | Avg (/2) | Clear Pushback | Partial | Accepted |
|---|---|---|---|---|
| SIQ-1-tiny-35b (high/think) | 1.047 | 45 | 17 | 38 |
| Nex-N2-Pro (free) | 1.040 | 33 | 43 | 24 |

A near-tie on the mean, but different profiles: SIQ is polarized (cleanly exposes the BS 45× or fully buys it 38×); Nex hedges (rarely fully accepts, but rarely pushes back hard either; mostly Partial). Reference (official bullshit-benchmark, different panel, n=55, not co-measured): Opus 4.8 ≈ 1.96, GPT-5.5 ≈ 0.92.
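As a rough illustration of the 0–2 scoring, the sketch below averages each judge's labels and then averages across the panel. The aggregation order is an assumption (the card only says "mean"), `panel_mean` is a hypothetical helper, and the labels are toy data, not benchmark outputs.

```python
from statistics import mean
from typing import Dict, List

# Label-to-score mapping as stated in the scoring rule above.
LABEL_SCORE = {"Clear Pushback": 2, "Partial": 1, "Accepted": 0}


def panel_mean(labels_by_judge: Dict[str, List[str]]) -> float:
    """Mean over judges of each judge's mean 0-2 score (assumed aggregation)."""
    per_judge = [
        mean(LABEL_SCORE[label] for label in labels)
        for labels in labels_by_judge.values()
    ]
    return float(mean(per_judge))


# Toy labels for two prompts from two hypothetical judges.
toy_panel = {
    "judge_a": ["Clear Pushback", "Accepted"],
    "judge_b": ["Partial", "Accepted"],
}
print(panel_mean(toy_panel))  # 0.75
```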

Reasoning modes & system prompts

Qwen3-format hybrid reasoning, toggled per request via chat_template_kwargs.enable_thinking:

| Mode | Toggle | Behavior | Use for |
|---|---|---|---|
| Thinking | enable_thinking: true (default) | emits <think> … </think>, then the answer | hard reasoning, math, agent planning |
| No-think | enable_thinking: false | answers directly | instruction-following, high-throughput |
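When the raw completion text still contains the `<think> … </think>` block, a client may want to separate the reasoning from the final answer. The helper below is a minimal sketch under that assumption; `split_think` is a hypothetical name, and servers launched with a reasoning parser may already return the reasoning in a separate field, in which case it is unnecessary.

```python
import re
from typing import Tuple

# Non-greedy match so only the first <think> block is captured.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer); reasoning is '' when no <think> block is present."""
    match = _THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer


print(split_think("<think>2 + 2 = 4</think>\nThe answer is 4."))
print(split_think("Paris"))
```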

Reasoning effort is a trained control: putting "Reasoning effort: low | medium | high" in the system prompt scales the chain length (use high for hard reasoning). For objective reasoning use greedy decoding (temp 0), which beats temp 0.7 by ~8 pts.
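The two controls above (effort in the system prompt, thinking via `chat_template_kwargs`) can be combined in one request builder. This is a sketch, not part of the model's tooling: `build_request` and `EFFORTS` are hypothetical names, and the returned dict is shaped to be splatted into an OpenAI-compatible `chat.completions.create(...)` call.

```python
from typing import Any, Dict, List, Optional

EFFORTS = ("low", "medium", "high")


def build_request(
    user_msg: str,
    effort: Optional[str] = "high",
    thinking: bool = True,
) -> Dict[str, Any]:
    """Build messages plus extra_body for one chat request."""
    messages: List[Dict[str, str]] = []
    if effort is not None:
        if effort not in EFFORTS:
            raise ValueError(f"effort must be one of {EFFORTS}, got {effort!r}")
        # The effort level is passed as plain text in the system prompt.
        messages.append({"role": "system", "content": f"Reasoning effort: {effort}"})
    messages.append({"role": "user", "content": user_msg})
    return {
        "messages": messages,
        # The thinking toggle travels in the request body, not in the prompt.
        "extra_body": {"chat_template_kwargs": {"enable_thinking": thinking}},
    }


hard = build_request("Prove that sqrt(2) is irrational.")
fast = build_request("Summarize this in one line.", effort=None, thinking=False)
print(hard["messages"][0]["content"], fast["extra_body"])
```

A call would then look like `client.chat.completions.create(model="SIQ-1-tiny-35b", temperature=0.0, **hard)`.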

Copy-paste system prompts:

1 · Hard reasoning: greedy + high effort

Reasoning effort: high. Think step by step inside <think>...</think>, then give the final answer.

2 · Autoresearch ideator: propose a train.py edit to cut val_bpb

Reasoning effort: high. You are an autoresearch ideator.
Given the current train.py and its measured val_bpb under a fixed compute budget, propose ONE concrete,
high-impact edit that should reduce val_bpb. Reason inside <think>...</think>, then output:
- a one-line hypothesis,
- the edit as a minimal unified diff,
- the expected effect and how to verify it.

3 · Fast / instruction-following: no-think

(no system prompt; set enable_thinking=false, and the model answers directly with no <think> block)

Usage

🪽 Try it now (no install): hosted ZeroGPU demo → AlexWortega/hermes-agent-zerogpu

llama.cpp (GGUF, single 48 GB GPU; Q4_K_M ≈ 21 GB)

These are the exact flags we serve with:

docker run -d --gpus all --network host -v /models:/m ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /m/SIQ-1-35B.Q4_K_M.gguf --alias SIQ-1-tiny-35b \
  -ngl 99 -c 131072 -np 4 --jinja --host 0.0.0.0 --port 8080
  • --jinja is required: it applies the Qwen3 chat template (<think> + tool tags) and makes the enable_thinking toggle work.
  • -ngl 99 puts all layers on the GPU; -c 131072 is the total context, split across the -np 4 slots (≈32k per slot; agentic loops need the headroom). Drop to -c 65536 if you only do short reasoning. The server is OpenAI-compatible on :8080.
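The per-slot figure follows from dividing the total context by the slot count. The snippet below is just that arithmetic, assuming an even split as described above; `tokens_per_slot` is a hypothetical helper, not a llama.cpp API.

```python
def tokens_per_slot(total_ctx: int, n_parallel: int) -> int:
    """Context tokens available to each slot, assuming an even split."""
    if n_parallel < 1:
        raise ValueError("n_parallel must be at least 1")
    return total_ctx // n_parallel


print(tokens_per_slot(131072, 4))  # 32768, the ~32k per slot quoted above
print(tokens_per_slot(65536, 4))   # 16384 with the smaller -c setting
```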

sglang (bf16 safetensors, e.g. 2× 48 GB)

python -m sglang.launch_server \
  --model-path AlexWortega/SIQ-1-35B \
  --tp 2 --context-length 131072 \
  --reasoning-parser qwen3 --tool-call-parser qwen3 \
  --host 0.0.0.0 --port 8080

Call it (OpenAI-compatible); these are the params we run

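from openai import OpenAI

# Local OpenAI-compatible server; the API key is a placeholder.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="x")
r = client.chat.completions.create(
    model="SIQ-1-tiny-35b",
    messages=[{"role": "system", "content": "Reasoning effort: high"},
              {"role": "user", "content": "..."}],
    temperature=0.0, top_p=0.95,      # greedy (temp 0) for reasoning
    # top_k and chat_template_kwargs are not named arguments of the OpenAI
    # client, so they are sent through extra_body.
    extra_body={"top_k": 40,
                "chat_template_kwargs": {"enable_thinking": True}})  # False -> no-think
print(r.choices[0].message.content)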

Sampling: reasoning → temperature 0 (greedy); general/creative → temp 0.7, top_p 0.95, top_k 40. Files: merged bf16 *.safetensors + GGUF Q4_K_M / Q5_K_M / Q8_0 (+ MTP f16).
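The two sampling presets can be kept in one small helper. This is a sketch with hypothetical names (`sampling_kwargs`, the task labels "reasoning" and "general"); `top_k` is placed under `extra_body` because the OpenAI Python client does not take it as a named argument.

```python
from typing import Any, Dict


def sampling_kwargs(task: str) -> Dict[str, Any]:
    """Return keyword arguments for chat.completions.create for a given task type."""
    if task == "reasoning":
        # Greedy decoding: top_p / top_k have no effect at temperature 0.
        return {"temperature": 0.0}
    if task == "general":
        return {"temperature": 0.7, "top_p": 0.95, "extra_body": {"top_k": 40}}
    raise ValueError(f"unknown task type: {task!r}")


print(sampling_kwargs("reasoning"))
print(sampling_kwargs("general"))
```

If a request also sets `chat_template_kwargs`, merge it into the same `extra_body` dict rather than passing `extra_body` twice.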

Author: Wortega (user AlexWortega)
Details
Downloads: 322
Likes: 44
Access: Open Source
Task: text-generation
Parameters: 34.7B
Trending: 43
License: apache-2.0
Library: transformers
Created: Jun 14, 2026
Updated: Jun 17, 2026
Languages: en