gemma-4-12B-coder-fable5-composer2.5-v1-GGUF
Yuxin Lu's GGUF-quantized coding and reasoning fine-tune of gemma-4-12B-it, optimized for local inference with llama.cpp.
Base model
Model Card
license: apache-2.0 base_model: google/gemma-4-12B-it library_name: gguf pipeline_tag: text-generation tags: [gemma4, coding, code, reasoning, thinking, gguf, llama.cpp, local-llm]
๐ป Gemma4-12B-Coder (GGUF) โ Composer 2.5 ร Fable 5 โจ
๐ฃ Tiny footprint, big brain โ a local coding model for everyone
No matter your GPU. No matter your RAM. If you've got ~4.5 GB of VRAM or unified memory free, you can run your own private, offline coding assistant right now. ๐ This is the v1 / code edition โ distilled from real chain-of-thought so it thinks through a problem before writing the solution. ๐ง ๐ป All local, all yours, no API, no cloud.
๐ฏ What it is
A focused fine-tune of Gemma 4 12B on verifiable Python coding data โ every training example's reasoning leads to code that actually passed its tests. The result reasons in the open (edge cases, complexity, approach) and then emits a clean, runnable solution. ๐
๐ Announcements
๐๐ฅ BIG NEWS โ v2 drops EARLY! I'm pushing it ahead of schedule: tomorrow, 5โ8 PM (US Pacific), sharp. โฐ
It lands in both formats at once, in two repos โ GGUF (ready to run) + the full safetensors master
(build / fine-tune on top). v2 is agentic + coding focused โ the piece v1 was missing.
A sneak peek ๐ (yep, I'm spoiling it early). When I saw v2's tau2-bench telecom result โ an agentic tool-use
benchmark where the model has to diagnose โ fix โ verify, exactly like real terminal/debugging work โ I literally got
launched out of my chair (โฆokay, kidding ๐). The jump in actually solving the problem is wild:
| tau2-bench telecom ยท local, same harness, Q8_0 | score |
|---|---|
official gemma-4-12B-it (base) |
~15% |
| ๐ข v2 (dropping tomorrow) | ~55% |
The base model tends to give up early (hands the problem off to a human); v2 keeps going and works it the way a much bigger model would. Full benchmark details land in the v2 card tomorrow. ๐ง
โ safetensors master (this v1 model) is UP. Full-precision weights are live โ yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1 โ roll your own GGUF / MLX / AWQ quants or fine-tune straight from the master. ๐
๐ฃ Context length fixed: now 256K (was 131K) โ thanks, community! ๐
A community member spotted that this model was reporting only a 131K context window. That turned out to be
the well-known upstream Gemma 4 metadata bug โ Google's initial config.json shipped with
max_position_embeddings: 131072 instead of the real 262144 (256K), and that value got baked into a lot of
downstream finetunes and quants (including this one) before it was fixed upstream.
The weights were always fine โ it was purely a metadata field. All GGUF quants have been re-patched to the
full 256K context (gemma4.context_length = 262144). Just re-download if you grabbed an earlier copy. ๐
๐ Training data (the interesting part ๐ณ)
This is a distillation of two complementary chain-of-thought sources, both over verifiable Python coding tasks (algorithmic / function-level problems that come with deterministic tests):
- ๐ฅ Main set โ Composer 2.5 real CoT. Genuine, model-authored reasoning traces. The teacher solved each problem, its code was run against the task's tests, and only the passing solutions were kept. So the reasoning you're learning from leads to code that actually works.
- ๐ฅ Aux set โ Fable 5 (released today! ๐). A clever twist: we took the problems where Composer 2.5 got it wrong and handed them to Fable 5 to redo โ re-deriving a fresh, self-consistent chain-of-thought and a correct solution, again gated on passing the tests. This recovers the hard cases the main teacher missed. These traces are synthetic (rationalized CoT), and are tagged separately so the two sources stay distinguishable.
The recipe: real CoT for the bulk of solid coverage, plus synthetic "second-attempt" CoT to patch the failures โ both verified by execution before anything entered training. โ
๐ฆ Pick your size (GGUF quants)
| Quant | Size | Vibe |
|---|---|---|
| ๐ข Q2_K | 4.5 GB | tiniest โ runs almost anywhere |
| ๐ก Q3_K_M | 5.7 GB | great for 8 GB VRAM โ much better than Q2 |
| ๐ต Q4_K_M | 6.87 GB | the sweet spot ๐ (recommended) |
| ๐ฃ Q6_K | 9.11 GB | near-lossless |
| โช Q8_0 | 11.8 GB | basically full quality |
๐งฎ "Will it fit?" โ context length cheat-sheet
Rough estimates ๐ค (assumes q8_0 KV cache + ~1.5 GB overhead; use q4_0 KV cache for โ2ร more context!).
Max context is 256K. "โ" = won't fit, pick a smaller quant. โ๏ธ
| Your VRAM / unified mem | ๐ข Q2_K (4.5G) | ๐ก Q3_K_M (5.7G) | ๐ต Q4_K_M (6.87G) | ๐ฃ Q6_K (9.11G) | โช Q8_0 (11.8G) |
|---|---|---|---|---|---|
| 8 GB | ~16K ctx | ~10K | tight (~2โ4K) | โ | โ |
| 12 GB | ~48K | ~38K | ~30K | ~12K | โ |
| 16 GB | ~80K | ~72K | ~64K | ~44K | ~22K |
| 24 GB | ~200K | ~160K | ~128K | ~110K | ~88K |
| 32 GB | 256K (max) ๐ | 256K | 256K | ~230K | ~190K |
๐ก Apple Silicon / integrated GPUs with unified memory count too โ same numbers, just slower than a dGPU. ๐ก Low on room? Drop a quant or switch KV cache to
q4_0and your context roughly doubles.
๐ How to run it (super easy)
Option A โ llama.cpp (recommended) ๐ฆ
- Grab a quant above (e.g.
โฆ-Q4_K_M.gguf) andllama-serverfrom llama.cpp.โ ๏ธ Needs a recent llama.cpp (this is the
gemma4_unifiedarchitecture โ older builds won't load it). - Run a server (Windows
.batshown โ tweak--port,--ctx-sizeto taste):
@echo off
cd /d C:\llama.cpp
llama-server.exe ^
-m C:\models\gemma4-coding-Q4_K_M.gguf ^
--ctx-size 16384 ^
--n-gpu-layers 99 ^
--no-mmap ^
-fa on ^
--cache-type-k q8_0 --cache-type-v q8_0 ^
--temp 1.0 --top-p 0.95 --top-k 64 ^
--host 0.0.0.0 --port 18080
pause
- Open
http://localhost:18080and chat. ๐ (Tip: bump--ctx-sizeper the table; useq4_0KV for more.)
Option B โ one-click apps ๐ฑ๏ธ
Works in LM Studio, Jan, Ollama, etc. โ just import the GGUF, pick your quant, go. ๐พ
๐ง Thinking mode
This model thinks in Gemma's native thought channel before answering โ exactly how it was trained. Keep
enable_thinking=true (the default chat template handles it). Recommended sampling: temp 1.0, top_p 0.95, top_k 64.
For coding you can also go greedy (temp 0) for more deterministic solutions.
โ ๏ธ Good to know
- Reduced refusals: the training data is task-focused with no safety hedging, so this refuses less than the base model. It is not safety-aligned โ add your own guardrails for production. Use responsibly. ๐
- Specialized for Python / algorithmic coding. Reasoning quality is strongest in that domain; general-knowledge facts/numbers should still be double-checked.
- English-centric.
๐ Base & License
- License: Apache 2.0. Gemma 4 is released by Google under Apache 2.0 (unlike the older Gemma 1/2/3 terms), so this fine-tune is Apache 2.0 too โ free to use, modify, and redistribute. ๐
- Base model:
google/gemma-4-12B-it. - Personal/hobby project โ shared as-is, no warranty. Have fun, and happy hacking! ๐พโจ
Sign up to read complete case studies, access detailed metrics, and unlock all use cases.
Sign up to read complete case studies, access detailed metrics, and unlock all use cases.