YL

gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF

LLMpor Yuxin Lu·Página del modelo

Versión GGUF de un ajuste de Gemma 4 12B para programación agéntica, uso de herramientas y tareas de terminal.

Share:

Modelo base

google/gemma-4-12B-it

Tarjeta del Modelo

🐣 Tiny footprint, big brain — a local coding & tool-using agent for everyone

No matter your GPU. No matter your RAM. With ~4.5 GB of VRAM or unified memory free, you can run your own private, offline coding agent right now. 🚀 v2 is the big agentic upgrade — it reads, reasons, uses tools, and works through multi-step technical tasks before it acts. 🧠🛠️ All local, all yours, no API, no cloud.


📊 The headline — it works as an agent (tau2-bench)

v2 is built for coding + agentic work — writing code, running commands, using tools, debugging, multi-step technical tasks. The clearest signal is tau2-bench telecom, an agentic tool-use benchmark whose diagnose → fix → verify loop mirrors real terminal/debugging work:

tau2-bench telecom · 20 tasks · local, same harness, all Q8_0 score
official gemma-4-12B-it (base) ~15%
🟢 Gemma4-12B v2 (this model) ~55%

→ Roughly 3.5× higher than the base model on technical-agentic tasks. 🎯 Want the full storywhy telecom, how the two models fail differently, the honest caveats, and the trade-offs (including general knowledge)? It's all broken down further below. 👇


🚀 Announcements

📌 Hitting a problem? Please check my pinned discussion first. ~99% of issues are a client/sampler config, not the weights — and they have a quick fix there. For example: garbled or repeating 0000… output almost always means no repetition penalty (set rep_pen 1.1, temp 1.0); and leaked <|tool_call> / <|channel> tokens mean your front-end isn't parsing Gemma 4's native tool format (use llama.cpp --jinja). If your question isn't covered, don't hesitate to open a discussion — I read them and reply as fast as I can. 💬

📦 No Q2_K this release. I finished a Q2_K (imatrix) build, but it didn't hold up under real stress-testing, so I'm holding it back — I only ship a quant once I'm confident it's genuinely good. Smallest reliable option is Q3_K_M; Q4_K_M is the recommended sweet spot. 🙏

🔮 v3 is already on the way. Honestly? Even I didn't expect the post-training jump to be this large — so I'm pushing further. v3 keeps the coding + agentic focus and aims higher still. Stay tuned! 🎉

🐘 And a bigger sibling is coming — Qwen3.6-27B. I've also started fine-tuning Qwen3.6-27B with the same coding + agentic recipe, for those of you who do have the headroom and want more raw capability. But I haven't forgotten what this project is about: a 27B may be too heavy for some of your GPUs / RAM. So this is not a replacement — I'm pushing v3 (this 12B line) in parallel, at the same time, and it will only get stronger. 💪 No matter your hardware, you'll have a model that fits. 💚


💚 A personal note — thank you, and a few honest words (please read)

First, a huge thank-you for all the data and help you've shared. 🙏 The bittersweet part: none of us saw it coming that Fable 5 would be retired — and only my own dataset holds Fable 5's genuine, self-authored chain-of-thought. So for every dataset the community contributed, I rebuilt the missing reasoning from scratch with Opus 4.8 (xhigh). It may diverge from the original Fable 5 traces, but it was the only workable path — and the improvement turned out really, really huge (it nearly launched me out of my chair 😄). The benchmark numbers are right above. 👆

Second — I've tried to reply to every community comment, and I've openly owned v1's training problems. Truly, thank you: your feedback is what lets me improve. 💚

Because v1 hit #1 trending, it also attracted some bad words / trolling. I'll say this gently but firmly: real criticism is always welcome here — pure insults are not. This is a local model that lets anyone run a capable AI on tiny RAM/VRAM, at zero API cost and fully private; I even open-sourced the full safetensors master to study and build on. If something's off, open a discussion about the actual problem — I genuinely want to hear it and I'll act on it. But comments that are only insults help no one, and I'll remove them without hesitation. 🙏

Please remember: I'm one person — not a lab shipping an "open" model for marketing or to monetize later. I don't advertise. I build this for you on my own time and my own money: synthesizing data, reviewing and cleaning it by hand, splitting and re-segmenting it (this round I even built a dynamic context-window pass to keep the agent's read-before-act steps intact), reading the latest papers, then training → evaluating → training → evaluating. It burned through an entire Claude Max 20× plan (I keep a separate Pro for my own work), and v2 alone cost 40+ hours — even with Opus 4.8, the data threw constant curveballs I had to verify myself. Thank you, truly. 🐾


🔬 The benchmarks, in detail (tau2-bench)

I evaluated v2 on tau2-bench (an agentic tool-use benchmark). I did not run the whole suite — it's very time-consuming — so I focused on the single domain that best matches what v2 is for.

Why tau2-bench telecom? Telecom troubleshooting makes the agent diagnose with read/inspect tools → pinpoint the issue → apply a fix → verify it — structurally the same loop as real terminal/debugging work (check state → diagnose → fix → confirm). That's exactly what this model is meant to be good at, which makes it the right yardstick for v2 (much more so than a shopping/customer-service domain).

tau2-bench telecom · 20 tasks · local, same harness, all Q8_0 score
official gemma-4-12B-it (base) ~15%
🟢 Gemma4-12B v2 (this model) ~55%

→ Roughly 3.5× higher than the base model on technical-agentic tasks. 🎯

Grounded, not made-up. Independently, a coding/terminal fabrication probe (tasks that deliberately tempt the model to invent file paths / function signatures / values) found v2 grounds before it acts just like the base — it grep/read/ls first, and doesn't make things up (0% fabrication, on par with the base model).

The interesting part — how they fail. The base model gives up early: on this run it bailed to a human agent 10 times (transfer_to_human) instead of finishing the fix. v2 keeps going — it stays in the loop and works the problem the way a much bigger model would, which is exactly why it solves so many more. It's not perfect yet: it still flails a little sometimes (over-trying, retrying). And some of the remaining misses are actually a bug in the benchmark's own APN tool (it throws on inputs it should handle gracefully), not the model. To be clear: I will not patch the benchmark's tools or leak its test questions just to inflate my score — I'd rather report an honest number and improve the model itself. More training is coming in v3. 🔧

About retail (customer-service shopping): on tau2-bench retail, the base model scores a bit higher than v2. This is fully expected and by design. Retail is pure customer-service (look up a user, process an order) — not what this model is for. v2 is specialized for coding / terminal / technical-agentic work, and on those (telecom) it dramatically outperforms the base. Need a customer-service bot? This isn't it. Need a local coding/agentic model? It is. 💚

Let's keep it honest about scale. Today's frontier models — think mimo-v2.5-pro or Opus 4.8 — all land 90%+ on this telecom benchmark. They're also enormous. For a 12B model, my rough guess is that v3 might top out somewhere around 60–70% (emphasis on guess — I haven't even started v3 yet). So let's be clear-eyed: there's still a real gap to the frontier. But keep the scale in mind — this is a 12B model running on your own machine, and narrowing that gap as much as possible at this size is the whole point. 💪

And the trade-off — there's no free lunch. I also ran a general-knowledge benchmark (MMLU-Pro), and v2 lands a little below the base model there. That's completely normal and expected for a focused fine-tune: when you push hard on coding + agentic, you trade a sliver of broad-knowledge breadth for it. Need a generalist? Try my own general-purpose Claude Opus 4.6/4.8 distillation — or the original google/gemma-4-12B-it base. Need a local coding/agentic worker? That's what v2 is tuned for.

🔬 Methodology, honestly: these are local, same-harness, relative numbers (all models tested at Q8_0, greedy decoding, self-simulated user, 20 tasks). They are not directly comparable to published tau2-bench leaderboard figures (different user-simulator, full task sets, full precision) — local self-eval runs systematically lower than published scores. Read them as "v2 vs the base model under identical conditions", which is the comparison that actually matters here.


📚 What's new in v2 (training)

v2 continues from the v1 coder and adds a big agentic push — the piece v1 was missing:

  • 🛠️ Agentic / terminal — real multi-step tool-use trajectories (read → reason → act → verify), in Gemma 4's native tool protocol. This is what drove the tau2-bench telecom jump, and it fixes v1's "stops after the first step" behavior.
  • 💻 Coding — verified chain-of-thought over Python tasks (real CoT, gated on passing tests) plus the Fable-5-redo set for the hard cases.
  • 📚 General — a curated slice of reasoning/instruction data to keep broad competence.

All reasoning is distilled CoT (see the personal note above on how the Fable 5 traces were rebuilt with Opus 4.8).


📦 Pick your size (GGUF quants)

Quant Size Vibe
🟡 Q3_K_M 5.7 GB great for 8 GB VRAM
🔵 Q4_K_M 6.87 GB the sweet spot 👌 (recommended)
🟣 Q6_K 9.11 GB near-lossless
Q8_0 11.8 GB basically full quality

ℹ️ No Q2_K this release — it didn't pass stress-testing yet (see Announcements). Smallest reliable quant = Q3_K_M.


🚀 How to run it

Option A — llama.cpp (recommended) 🦙

⚠️ Needs a recent llama.cpp (this is the gemma4_unified architecture — older builds won't load it).

@echo off
cd /d C:\llama.cpp
llama-server.exe ^
  -m C:\models\gemma4-v2-Q4_K_M.gguf ^
  --ctx-size 16384 ^
  --n-gpu-layers 99 ^
  --no-mmap -fa on ^
  --jinja ^
  --temp 1.0 --top-p 0.95 --top-k 64 ^
  --host 0.0.0.0 --port 18080
pause
  • 🛠️ Agentic use: pass your tools via the OpenAI tools field (works with --jinja). v2 emits structured tool-calls in Gemma 4's native protocol and is happy in agent loops (read/grep/edit/run, then verify).
  • 🖱️ One-click apps: LM Studio / Jan / Ollama — import the GGUF, pick a quant, go.

🧠 Thinking mode

v2 thinks in Gemma's native thought channel before answering (keep enable_thinking=true, the default chat template handles it). Recommended sampling: temp 1.0, top_p 0.95, top_k 64; for coding you can also go greedy (temp 0).


⚠️ Good to know

  • Specialized for coding / terminal / agentic. General-knowledge facts/numbers should still be double-checked.
  • Reduced refusals: task-focused training, not safety-aligned — add your own guardrails for production. Use responsibly. 🙏
  • English-centric.

📚 Base & License

  • License: Apache 2.0. Gemma 4 is released by Google under Apache 2.0 (unlike the older Gemma 1/2/3 terms), so this fine-tune is Apache 2.0 too — free to use, modify, and redistribute. 🎉
  • Base model: google/gemma-4-12B-it.
  • Personal/hobby project — shared as-is, no warranty. Built with time, care, and a lot of coffee. Have fun! 🐾✨

⚡ Speculative decoding (MTP draft) — verified build

The MTP/ folder ships the Gemma 4 multi-token-prediction draft (unsloth's GGUF conversion of Google's official gemma-4-12B-it-assistant) for speculative decoding. Gemma 4 MTP is in llama.cpp mainline (PR #23398) — no fork needed — but the gemma4-assistant loader is build-sensitive right now, so please use the exact build below:

  • Verified working: llama.cpp b9553 (commit 9e3b928fd). I reproduced it with gemma4-v2-Q8_0 + the MTP-Q8_0 draft: loads cleanly and accelerates generation (~88 → ~180 tok/s on a simple deterministic prompt; expect ~1.2–1.3× on real coding/thinking). Lossless either way.
  • ⚠️ Newer builds (e.g. b9702 / b9717) currently crash while loading the draft with invalid vector subscript. This is an upstream regression in the gemma4-assistant loader path, not a problem with these GGUFs — the same files load fine on b9553. Stick with b9553 until it's fixed upstream.

Working command on b9553 (note the older flag names — --model-draft, not --spec-draft-model):

llama-server -m gemma4-v2-Q8_0.gguf ^
  --model-draft MTP\gemma-4-12B-it-MTP-Q8_0.gguf ^
  --spec-type draft-mtp --spec-draft-n-max 4 ^
  -ngl 99 -ngld 99 -fa on --jinja

ℹ️ The Gemma4Assistant requires ctx_other to be set (this is normal during memory fitting) line is harmless. The draft is the generic Gemma 4 assistant (not retrained for v2), so acceptance is a touch lower than a model-specific draft would give — still 100% lossless. On small-VRAM cards, Q8 main + long context + the draft can be tight; drop to Q6_K/Q4_K_M or a smaller --ctx-size if you hit OOM.

Autor
YL
Yuxin Lu
Usuario
yuxinlu1
Detalles
Descargas21.7K
Me gusta214
AccesoCódigo Abierto
Tareatext-generation
Tendencia198
Licenciaapache-2.0
Libreríagguf
Creado19 jun 2026
Actualizado19 jun 2026
Ver en Hugging Face
Entiende todo el contexto.

Regístrate para leer casos de estudio completos, acceder a métricas detalladas y recibir todos los reportes.