gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF
Versión GGUF de un ajuste de Gemma 4 12B para programación agéntica, uso de herramientas y tareas de terminal.
Modelo base
Tarjeta del Modelo
🐣 Tiny footprint, big brain — a local coding & tool-using agent for everyone
No matter your GPU. No matter your RAM. With ~4.5 GB of VRAM or unified memory free, you can run your own private, offline coding agent right now. 🚀 v2 is the big agentic upgrade — it reads, reasons, uses tools, and works through multi-step technical tasks before it acts. 🧠🛠️ All local, all yours, no API, no cloud.
📊 The headline — it works as an agent (tau2-bench)
v2 is built for coding + agentic work — writing code, running commands, using tools, debugging, multi-step
technical tasks. The clearest signal is tau2-bench telecom, an agentic tool-use benchmark whose
diagnose → fix → verify loop mirrors real terminal/debugging work:
| tau2-bench telecom · 20 tasks · local, same harness, all Q8_0 | score |
|---|---|
official gemma-4-12B-it (base) |
~15% |
| 🟢 Gemma4-12B v2 (this model) | ~55% |
→ Roughly 3.5× higher than the base model on technical-agentic tasks. 🎯 Want the full story — why telecom, how the two models fail differently, the honest caveats, and the trade-offs (including general knowledge)? It's all broken down further below. 👇
🚀 Announcements
📌 Hitting a problem? Please check my pinned discussion first. ~99% of issues are a client/sampler config, not
the weights — and they have a quick fix there. For example: garbled or repeating 0000… output almost always
means no repetition penalty (set rep_pen 1.1, temp 1.0); and leaked <|tool_call> / <|channel> tokens mean
your front-end isn't parsing Gemma 4's native tool format (use llama.cpp --jinja). If your question isn't covered,
don't hesitate to open a discussion — I read them and reply as fast as I can. 💬
📦 No Q2_K this release. I finished a Q2_K (imatrix) build, but it didn't hold up under real stress-testing, so I'm holding it back — I only ship a quant once I'm confident it's genuinely good. Smallest reliable option is Q3_K_M; Q4_K_M is the recommended sweet spot. 🙏
🔮 v3 is already on the way. Honestly? Even I didn't expect the post-training jump to be this large — so I'm pushing further. v3 keeps the coding + agentic focus and aims higher still. Stay tuned! 🎉
🐘 And a bigger sibling is coming — Qwen3.6-27B. I've also started fine-tuning Qwen3.6-27B with the same coding + agentic recipe, for those of you who do have the headroom and want more raw capability. But I haven't forgotten what this project is about: a 27B may be too heavy for some of your GPUs / RAM. So this is not a replacement — I'm pushing v3 (this 12B line) in parallel, at the same time, and it will only get stronger. 💪 No matter your hardware, you'll have a model that fits. 💚
💚 A personal note — thank you, and a few honest words (please read)
First, a huge thank-you for all the data and help you've shared. 🙏 The bittersweet part: none of us saw it coming that Fable 5 would be retired — and only my own dataset holds Fable 5's genuine, self-authored chain-of-thought. So for every dataset the community contributed, I rebuilt the missing reasoning from scratch with Opus 4.8 (xhigh). It may diverge from the original Fable 5 traces, but it was the only workable path — and the improvement turned out really, really huge (it nearly launched me out of my chair 😄). The benchmark numbers are right above. 👆
Second — I've tried to reply to every community comment, and I've openly owned v1's training problems. Truly, thank you: your feedback is what lets me improve. 💚
Because v1 hit #1 trending, it also attracted some bad words / trolling. I'll say this gently but firmly: real criticism is always welcome here — pure insults are not. This is a local model that lets anyone run a capable AI on tiny RAM/VRAM, at zero API cost and fully private; I even open-sourced the full safetensors master to study and build on. If something's off, open a discussion about the actual problem — I genuinely want to hear it and I'll act on it. But comments that are only insults help no one, and I'll remove them without hesitation. 🙏
Please remember: I'm one person — not a lab shipping an "open" model for marketing or to monetize later. I don't advertise. I build this for you on my own time and my own money: synthesizing data, reviewing and cleaning it by hand, splitting and re-segmenting it (this round I even built a dynamic context-window pass to keep the agent's read-before-act steps intact), reading the latest papers, then training → evaluating → training → evaluating. It burned through an entire Claude Max 20× plan (I keep a separate Pro for my own work), and v2 alone cost 40+ hours — even with Opus 4.8, the data threw constant curveballs I had to verify myself. Thank you, truly. 🐾
🔬 The benchmarks, in detail (tau2-bench)
I evaluated v2 on tau2-bench (an agentic tool-use benchmark). I did not run the whole suite — it's very time-consuming — so I focused on the single domain that best matches what v2 is for.
Why tau2-bench telecom? Telecom troubleshooting makes the agent diagnose with read/inspect tools → pinpoint the
issue → apply a fix → verify it — structurally the same loop as real terminal/debugging work
(check state → diagnose → fix → confirm). That's exactly what this model is meant to be good at, which makes it the
right yardstick for v2 (much more so than a shopping/customer-service domain).
| tau2-bench telecom · 20 tasks · local, same harness, all Q8_0 | score |
|---|---|
official gemma-4-12B-it (base) |
~15% |
| 🟢 Gemma4-12B v2 (this model) | ~55% |
→ Roughly 3.5× higher than the base model on technical-agentic tasks. 🎯
Grounded, not made-up. Independently, a coding/terminal fabrication probe (tasks that deliberately tempt the
model to invent file paths / function signatures / values) found v2 grounds before it acts just like the base —
it grep/read/ls first, and doesn't make things up (0% fabrication, on par with the base model).
The interesting part — how they fail. The base model gives up early: on this run it bailed to a human agent
10 times (transfer_to_human) instead of finishing the fix. v2 keeps going — it stays in the loop and works the
problem the way a much bigger model would, which is exactly why it solves so many more. It's not perfect yet: it still
flails a little sometimes (over-trying, retrying). And some of the remaining misses are actually a bug in the
benchmark's own APN tool (it throws on inputs it should handle gracefully), not the model. To be clear: I will not
patch the benchmark's tools or leak its test questions just to inflate my score — I'd rather report an honest number
and improve the model itself. More training is coming in v3. 🔧
About retail (customer-service shopping): on tau2-bench retail, the base model scores a bit higher than v2. This
is fully expected and by design. Retail is pure customer-service (look up a user, process an order) — not what this
model is for. v2 is specialized for coding / terminal / technical-agentic work, and on those (telecom) it
dramatically outperforms the base. Need a customer-service bot? This isn't it. Need a local coding/agentic model?
It is. 💚
Let's keep it honest about scale. Today's frontier models — think mimo-v2.5-pro or Opus 4.8 — all land 90%+ on this telecom benchmark. They're also enormous. For a 12B model, my rough guess is that v3 might top out somewhere around 60–70% (emphasis on guess — I haven't even started v3 yet). So let's be clear-eyed: there's still a real gap to the frontier. But keep the scale in mind — this is a 12B model running on your own machine, and narrowing that gap as much as possible at this size is the whole point. 💪
And the trade-off — there's no free lunch. I also ran a general-knowledge benchmark (MMLU-Pro), and v2 lands
a little below the base model there. That's completely normal and expected for a focused fine-tune: when you
push hard on coding + agentic, you trade a sliver of broad-knowledge breadth for it. Need a generalist? Try my own
general-purpose Claude Opus 4.6/4.8 distillation
— or the original google/gemma-4-12B-it base. Need a local coding/agentic worker? That's what v2 is tuned for.
🔬 Methodology, honestly: these are local, same-harness, relative numbers (all models tested at Q8_0, greedy decoding, self-simulated user, 20 tasks). They are not directly comparable to published tau2-bench leaderboard figures (different user-simulator, full task sets, full precision) — local self-eval runs systematically lower than published scores. Read them as "v2 vs the base model under identical conditions", which is the comparison that actually matters here.
📚 What's new in v2 (training)
v2 continues from the v1 coder and adds a big agentic push — the piece v1 was missing:
- 🛠️ Agentic / terminal — real multi-step tool-use trajectories (read → reason → act → verify), in Gemma 4's native tool protocol. This is what drove the tau2-bench telecom jump, and it fixes v1's "stops after the first step" behavior.
- 💻 Coding — verified chain-of-thought over Python tasks (real CoT, gated on passing tests) plus the Fable-5-redo set for the hard cases.
- 📚 General — a curated slice of reasoning/instruction data to keep broad competence.
All reasoning is distilled CoT (see the personal note above on how the Fable 5 traces were rebuilt with Opus 4.8).
📦 Pick your size (GGUF quants)
| Quant | Size | Vibe |
|---|---|---|
| 🟡 Q3_K_M | 5.7 GB | great for 8 GB VRAM |
| 🔵 Q4_K_M | 6.87 GB | the sweet spot 👌 (recommended) |
| 🟣 Q6_K | 9.11 GB | near-lossless |
| ⚪ Q8_0 | 11.8 GB | basically full quality |
ℹ️ No Q2_K this release — it didn't pass stress-testing yet (see Announcements). Smallest reliable quant = Q3_K_M.
🚀 How to run it
Option A — llama.cpp (recommended) 🦙
⚠️ Needs a recent llama.cpp (this is the
gemma4_unifiedarchitecture — older builds won't load it).
@echo off
cd /d C:\llama.cpp
llama-server.exe ^
-m C:\models\gemma4-v2-Q4_K_M.gguf ^
--ctx-size 16384 ^
--n-gpu-layers 99 ^
--no-mmap -fa on ^
--jinja ^
--temp 1.0 --top-p 0.95 --top-k 64 ^
--host 0.0.0.0 --port 18080
pause
- 🛠️ Agentic use: pass your tools via the OpenAI
toolsfield (works with--jinja). v2 emits structured tool-calls in Gemma 4's native protocol and is happy in agent loops (read/grep/edit/run, then verify). - 🖱️ One-click apps: LM Studio / Jan / Ollama — import the GGUF, pick a quant, go.
🧠 Thinking mode
v2 thinks in Gemma's native thought channel before answering (keep enable_thinking=true, the default chat template
handles it). Recommended sampling: temp 1.0, top_p 0.95, top_k 64; for coding you can also go greedy (temp 0).
⚠️ Good to know
- Specialized for coding / terminal / agentic. General-knowledge facts/numbers should still be double-checked.
- Reduced refusals: task-focused training, not safety-aligned — add your own guardrails for production. Use responsibly. 🙏
- English-centric.
📚 Base & License
- License: Apache 2.0. Gemma 4 is released by Google under Apache 2.0 (unlike the older Gemma 1/2/3 terms), so this fine-tune is Apache 2.0 too — free to use, modify, and redistribute. 🎉
- Base model:
google/gemma-4-12B-it. - Personal/hobby project — shared as-is, no warranty. Built with time, care, and a lot of coffee. Have fun! 🐾✨
⚡ Speculative decoding (MTP draft) — verified build
The MTP/ folder ships the Gemma 4 multi-token-prediction draft (unsloth's GGUF conversion of Google's official
gemma-4-12B-it-assistant) for speculative decoding. Gemma 4 MTP is in llama.cpp mainline (PR #23398) — no fork
needed — but the gemma4-assistant loader is build-sensitive right now, so please use the exact build below:
- ✅ Verified working: llama.cpp
b9553(commit9e3b928fd). I reproduced it withgemma4-v2-Q8_0+ theMTP-Q8_0draft: loads cleanly and accelerates generation (~88 → ~180 tok/s on a simple deterministic prompt; expect ~1.2–1.3× on real coding/thinking). Lossless either way. - ⚠️ Newer builds (e.g. b9702 / b9717) currently crash while loading the draft with
invalid vector subscript. This is an upstream regression in thegemma4-assistantloader path, not a problem with these GGUFs — the same files load fine on b9553. Stick with b9553 until it's fixed upstream.
Working command on b9553 (note the older flag names — --model-draft, not --spec-draft-model):
llama-server -m gemma4-v2-Q8_0.gguf ^
--model-draft MTP\gemma-4-12B-it-MTP-Q8_0.gguf ^
--spec-type draft-mtp --spec-draft-n-max 4 ^
-ngl 99 -ngld 99 -fa on --jinja
ℹ️ The
Gemma4Assistant requires ctx_other to be set (this is normal during memory fitting)line is harmless. The draft is the generic Gemma 4 assistant (not retrained for v2), so acceptance is a touch lower than a model-specific draft would give — still 100% lossless. On small-VRAM cards, Q8 main + long context + the draft can be tight; drop to Q6_K/Q4_K_M or a smaller--ctx-sizeif you hit OOM.
Regístrate para leer casos de estudio completos, acceder a métricas detalladas y recibir todos los reportes.
Regístrate para leer casos de estudio completos, acceder a métricas detalladas y recibir todos los reportes.