SIQ-1-35B
Wortega's 34.7B MoE LLM fine-tuned from Qwen3.6-35B-A3B for agentic coding and autonomous research tasks.
SIQ-1-tiny-35b 🪽
A tiny universal agent: autoresearch, coding, reasoning.
SIQ-1-tiny-35b is a tiny MoE (35B total, but only ~3B active per token) distilled to be a strong universal agent: equally at home running autonomous ML research (autoresearch), writing and debugging code, tool-use / agentic workflows, and hard reasoning. Despite its 3B active footprint, it matches or beats much larger peers on core reasoning, sycophancy resistance, and agentic coding, at a lower token cost.
Autoresearch duel (head-to-head)
In a controlled three-way autoresearch test on openai/parameter-golf, each model drove the same Pi-Agent loop
(edit train_gpt.py -> train (300s) -> eval val_bpb -> keep/revert) on its own 1xA6000 for 2h.
SIQ-1-tiny-35b reached val_bpb 1.767 (12 experiments, full 2h), neck-and-neck with Claude
Opus 4.8 (~1.76) and far ahead of GLM-5.2 (2.078). GLM stagnated on the baseline: its only hypothesis was
"add depth" (which hurt the metric), and it stopped emitting actions after ~65 min. SIQ instead climbed via LR-schedule
and capacity edits (warmdown 1200->800, matrix_lr 0.04->0.05, ...). (val_bpb on a single A6000 is not comparable to
the official 8xH100 leaderboard; this is the relative head-to-head under identical conditions.)
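The keep/revert loop described above can be sketched as a greedy hill-climb on val_bpb. This is a minimal illustration, not the Pi-Agent harness: `keep_revert_loop` and `run_experiment` are hypothetical names, the callback stands in for "edit train_gpt.py, train for 300s, evaluate", and every number is made up.

```python
from typing import Callable, List, Tuple


def keep_revert_loop(
    baseline_bpb: float,
    run_experiment: Callable[[int], float],
    n_experiments: int,
) -> Tuple[float, List[bool]]:
    """Keep an edit only if it lowers val_bpb; otherwise revert it."""
    best = baseline_bpb
    kept: List[bool] = []
    for i in range(n_experiments):
        # Hypothetical stand-in for: apply edit, train 300s, evaluate val_bpb.
        bpb = run_experiment(i)
        if bpb < best:
            best = bpb          # keep: the edit becomes the new baseline
            kept.append(True)
        else:
            kept.append(False)  # revert: discard the edit, baseline unchanged
    return best, kept


# Made-up val_bpb results for five experiments (lower is better).
fake_results = [1.95, 2.10, 1.90, 1.92, 1.85]
best, kept = keep_revert_loop(2.00, lambda i: fake_results[i], len(fake_results))
print(best, kept)
```

Because only improvements are kept, the tracked val_bpb never gets worse; a model that keeps proposing harmful edits simply stays on the baseline.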
It is the winning arm of a controlled SFT / RFT / DPO / offline-GRPO post-training study on Qwen3.6-35B-A3B:
RFT on the judge-top-half wins on both ideation quality and agentic ability.
Performance
On the full 198-question GPQA-Diamond, with all models served as Q4_K_M GGUF under greedy decoding (temp 0) and an identical harness, SIQ-1-tiny-35b is Pareto-best: the highest accuracy and the fewest tokens (figure below). A 3B-active model edges out both the full 35B base and Nex-N2-mini while spending fewer tokens per question.
| Benchmark | SIQ-1-tiny-35b | Nex-N2-mini | Qwen3.6-35B |
|---|---|---|---|
| General & Reasoning | | | |
| GPQA-Diamond (Q4, co-measured) | 70.2 | 67.2 | 68.2 |
| GPQA-Diamond (bf16, full eval) | 90.2 | 82.6 | - |
| IFEval (inst-loose) | 89.5 | 89.1 | - |
| tok/question (GPQA, mean) | 3158 (lowest) | 3363 | 3500 |
| Agentic coding | | | |
| vibetest (Claude-judge, /10) | 9.21 | 8.12 | - |
| Ideation (autoresearch) | | | |
| Opus-judge ideation (/100) | 30.2 | - | 10.2 (base) |
bf16 + tuned harness scores higher (90.2 GPQA); the Q4 row is the apples-to-apples co-measured comparison shown in the figure. Terminal-Bench 2.1 (Harbor, terminus-2, k=5) is in progress.
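To make "Pareto-best" concrete for the Q4 co-measured comparison: a model is on the Pareto front if no other model is at least as accurate and at least as cheap in tokens while being strictly better on one of the two. The sketch below checks this using the accuracy and tokens-per-question values from the table; `pareto_front` is a hypothetical helper, not part of any evaluation harness.

```python
from typing import Dict, List, Tuple


def pareto_front(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """points maps name -> (accuracy, tokens); higher accuracy and fewer tokens are better."""
    front = []
    for name, (acc, tok) in points.items():
        dominated = any(
            a >= acc and t <= tok and (a > acc or t < tok)
            for other, (a, t) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# (GPQA-Diamond Q4 accuracy, mean tokens per question) from the table above.
q4 = {
    "SIQ-1-tiny-35b": (70.2, 3158),
    "Nex-N2-mini": (67.2, 3363),
    "Qwen3.6-35B": (68.2, 3500),
}
front = pareto_front(q4)
print(front)
```

With these three points the front contains a single model, because SIQ-1-tiny-35b is better on both axes at once.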
BullshitBench v2: pushback vs. sycophancy
Score 0-2 (Clear Pushback = 2 / Partial = 1 / Accepted = 0). Panel: claude-sonnet-4.6 + gpt-5.2 + gemini-3.1-pro (mean); the judge sees the final answer only (CoT stripped); no system prompt, temp 0.7.
| model | avg /2 | Clear Pushback | Partial | Accepted |
|---|---|---|---|---|
| SIQ-1-tiny-35b (high/think) | 1.047 | 45 | 17 | 38 |
| Nex-N2-Pro (free) | 1.040 | 33 | 43 | 24 |
A tie on the mean, but different profiles: SIQ is polarized (cleanly exposes the BS 45× or fully buys it 38×); Nex hedges (rarely fully accepts, but rarely pushes back hard either; mostly Partial). Reference (official bullshit-benchmark, different panel, n=55, not co-measured): Opus 4.8 ≈ 1.96, GPT-5.5 ≈ 0.92.
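As an illustration of the 0-2 scale, the sketch below maps each judge label to its score and averages over every judge and question. The labels are invented and `panel_score` is a hypothetical helper; the benchmark's actual aggregation may differ in its details.

```python
from typing import List

SCORE = {"Clear Pushback": 2, "Partial": 1, "Accepted": 0}


def panel_score(labels_per_judge: List[List[str]]) -> float:
    """Mean 0-2 score over all judges and questions (one label list per judge)."""
    scores = [SCORE[label] for judge in labels_per_judge for label in judge]
    return sum(scores) / len(scores)


# Invented labels: three judges, four questions each.
judges = [
    ["Clear Pushback", "Partial", "Accepted", "Clear Pushback"],
    ["Clear Pushback", "Accepted", "Accepted", "Partial"],
    ["Partial", "Partial", "Accepted", "Clear Pushback"],
]
avg = panel_score(judges)
print(avg)
```

Two models can land on the same mean with very different label mixes, which is why the table reports the per-label counts alongside the average.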
Reasoning modes & system prompts
Qwen3-format hybrid reasoning, toggled per request via chat_template_kwargs.enable_thinking:
| mode | toggle | behavior | use for |
|---|---|---|---|
| Thinking | enable_thinking: true (default) | emits <think> ... </think>, then the answer | hard reasoning, math, agent planning |
| No-think | enable_thinking: false | answers directly | instruction-following, high-throughput |
Reasoning effort is a trained control: a line of the form Reasoning effort: low | medium | high in the system prompt scales the
chain length (use high for hard reasoning). For objective reasoning, use greedy decoding (temp 0); it beats temp 0.7 by ~8 pts.
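In thinking mode the reasoning arrives inside <think> ... </think> ahead of the answer, so downstream code often wants the two separated. A minimal sketch, assuming the raw completion text still contains the tags (a server started with a reasoning parser may already return the reasoning in a separate field); `split_reasoning` is a hypothetical helper.

```python
import re
from typing import Tuple

# Non-greedy match so only the first <think>...</think> block is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty when no <think> block exists."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


raw = "<think>2 + 2 is 4.</think>\nThe answer is 4."
reasoning, answer = split_reasoning(raw)
print(repr(reasoning), repr(answer))
```

The same function passes no-think output through unchanged, so one code path can handle both modes.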
Copy-paste system prompts:
1 · Hard reasoning: greedy + high effort
Reasoning effort: high. Think step by step inside <think>...</think>, then give the final answer.
2 · Autoresearch ideator: propose a train.py edit to cut val_bpb
Reasoning effort: high. You are an autoresearch ideator.
Given the current train.py and its measured val_bpb under a fixed compute budget, propose ONE concrete,
high-impact edit that should reduce val_bpb. Reason inside <think>...</think>, then output:
- a one-line hypothesis,
- the edit as a minimal unified diff,
- the expected effect and how to verify it.
3 · Fast / instruction-following: no-think
(no system prompt; set enable_thinking=false; the model answers directly, no <think> block)
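The prompts above pair naturally with the enable_thinking toggle. One way to bundle them, sketched here with a hypothetical `PRESETS` table and `build_request` helper (neither ships with the model), is to produce the keyword arguments for an OpenAI-compatible chat.completions.create call.

```python
from typing import Dict, Optional, Tuple

# preset name -> (system prompt or None, enable_thinking)
PRESETS: Dict[str, Tuple[Optional[str], bool]] = {
    "hard_reasoning": (
        "Reasoning effort: high. Think step by step inside "
        "<think>...</think>, then give the final answer.",
        True,
    ),
    "no_think": (None, False),  # no system prompt, direct answer
}


def build_request(preset: str, user_prompt: str) -> dict:
    """Build messages and extra_body for one of the presets above."""
    system, thinking = PRESETS[preset]
    messages = []
    if system is not None:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user_prompt})
    return {
        "messages": messages,
        "extra_body": {"chat_template_kwargs": {"enable_thinking": thinking}},
    }


req = build_request("no_think", "Summarize this in one line.")
print(req)
```

The result can be unpacked into the client call, for example `client.chat.completions.create(model="SIQ-1-tiny-35b", **req)`. The ideator prompt can be added as a third preset in the same way.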
Usage
🪽 Try it now (no install): hosted ZeroGPU demo -> AlexWortega/hermes-agent-zerogpu
llama.cpp (GGUF, single 48 GB GPU; Q4_K_M ≈ 21 GB)
These are the exact flags we serve with:
docker run -d --gpus all --network host -v /models:/m ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /m/SIQ-1-35B.Q4_K_M.gguf --alias SIQ-1-tiny-35b \
  -ngl 99 -c 131072 -np 4 --jinja --host 0.0.0.0 --port 8080
--jinja required (Qwen3 chat template: <think> + tool tags; enables enable_thinking). -ngl 99 all layers on GPU; -c 131072 total context split across -np 4 slots (≈32k/slot; agentic loops need the headroom). Drop to -c 65536 if you only do short reasoning. OpenAI-compatible on :8080.
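The per-slot figure is plain arithmetic: the total context is divided evenly across the parallel slots. A small sketch (`tokens_per_slot` and `fits` are hypothetical helpers, and the request sizes are made-up examples) shows the budget for both -c settings and a simple check that a prompt plus its generation budget fits in one slot.

```python
def tokens_per_slot(total_ctx: int, n_slots: int) -> int:
    """Context window available to each parallel slot."""
    return total_ctx // n_slots


def fits(prompt_tokens: int, max_new_tokens: int, slot_ctx: int) -> bool:
    """True if the prompt plus the generation budget stays within one slot."""
    return prompt_tokens + max_new_tokens <= slot_ctx


full = tokens_per_slot(131072, 4)   # -c 131072 -np 4
short = tokens_per_slot(65536, 4)   # -c 65536  -np 4
print(full, short)

# Made-up request: 20k prompt tokens plus 8k generated tokens.
print(fits(20000, 8000, full), fits(20000, 8000, short))
```

A long agentic transcript that fits a slot under the larger setting can overflow under the smaller one, which is the headroom the flag note refers to.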
sglang (bf16 safetensors, e.g. 2× 48 GB)
python -m sglang.launch_server \
  --model-path AlexWortega/SIQ-1-35B \
  --tp 2 --context-length 131072 \
  --reasoning-parser qwen3 --tool-call-parser qwen3 \
  --host 0.0.0.0 --port 8080
Call it (OpenAI-compatible): these are the params we run
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="x")
r = client.chat.completions.create(
    model="SIQ-1-tiny-35b",
    messages=[{"role": "system", "content": "Reasoning effort: high"},
              {"role": "user", "content": "..."}],
    temperature=0.0, top_p=0.95,  # greedy (temp 0) for reasoning
    # top_k is not a named argument of the OpenAI SDK, so it is sent via extra_body
    extra_body={"top_k": 40,
                "chat_template_kwargs": {"enable_thinking": True}})  # False -> no-think
print(r.choices[0].message.content)
Sampling: reasoning -> temperature 0 (greedy); general/creative -> temp 0.7, top_p 0.95, top_k 40.
Files: merged bf16 *.safetensors + GGUF Q4_K_M / Q5_K_M / Q8_0 (+ MTP f16).