Qwen3.6-27B-MTP-pi-tune-GGUF
GGUF quantization of Qwen3.6-27B fine-tuned with multi-token prediction and pi-tuning for speculative decoding and agentic reasoning.
Base model
Model Card
<h1 style="margin:0;font-size:30px;font-weight:800;color:white;border:none;line-height:1.15;letter-spacing:-0.01em;">πͺ Qwen3.6-27B-MTP-pi-tune</h1>
[!TIP] For the strongest Pi-style coding-agent behavior, use the reasoning-trained release:
bytkim/Qwen3.6-27B-MTP-pi-reasoning-GGUF. See the technical writeup for the broader evaluation context. This no-thinking tune remains useful when you specifically want the lower-latency direct / instruct path.
</div>
</div>
</div>
</div>
π‘ 1. Model Overview
| Attribute | Details |
|---|---|
| Base model | Qwen/Qwen3.6-27B |
| Release format | GGUF |
| Runtime target | llama.cpp-compatible local inference |
| Tuning focus | Harness fluency, coding-agent tasks, terminal workflows, tool use, repository work |
| Fine-tuning style | 4-bit QLoRA SFT on private passed agent trajectories |
| Technical writeup | Qwen3.6 27B reasoning writeup |
| Reasoning data policy | Internal reasoning traces were not exported into the SFT rows |
| Recommended quant | Q4_K_M as the default starting point |
[!NOTE] Qwen3.6-27B is a causal language model with a vision encoder. Image and video understanding are supported by pairing the language-model GGUF with the compatible Qwen3.6
mmproj-F16.ggufsidecar (see Β§4 Multi-modal inference). The MTP draft heads are kept atQ8_0precision inside every quant via--tensor-type nextn=q8_0at quantize time β speculative decoding works at any quant level, not justQ8_0/bf16.
π§© 2. Why This Tune
Qwen3.6 supports both a thinking inference mode (which emits a <thinking>...</thinking> block before responding) and a non-thinking mode (where the model answers and acts directly). This release is fine-tuned specifically for the no-thinking path β the mode where agent loops actually live. In a PI-style harness running tool-call loops, every thinking token is wall time the harness can't dispatch the next action against, so the tune is shaped to make the no-thinking path quality where it counts: tool calls, repository edits, terminal commands, verifier feedback, and structured output.
This carries forward Qwen3.6's existing agentic-coding posture β frontend workflows, repository-level reasoning, and tool calling β but pulls the quality into the inference mode that local agent runtimes can budget for.
- Terminal and shell task execution.
- Repository inspection, patching, and test iteration.
- Tool-call-shaped interactions and structured outputs.
- DevOps runbooks, environment setup, and debugging loops.
- Coding tasks where command use, file edits, and verifier feedback matter.
π¦ 3. Quantizations
Recommended starting point: Q4_K_M.
| Quant | File size | VRAM (approx) | Suggested use |
|---|---|---|---|
Q2_K |
~11 GB | ~13 GB | Smallest memory footprint; quality tradeoffs are expected. |
Q3_K_S |
~12 GB | ~15 GB | Low-memory 3-bit option. |
Q3_K_M |
~14 GB | ~16 GB | Balanced 3-bit option. |
Q3_K_L |
~15 GB | ~17 GB | Higher-quality 3-bit option. |
Q4_K_S |
~16 GB | ~18 GB | Smaller 4-bit option. |
Q4_K_M |
~17 GB | ~19 GB | Default recommendation for most local use. Comfortable on a 24 GB GPU. |
Q5_K_S |
~19 GB | ~21 GB | Higher-quality 5-bit option. |
Q5_K_M |
~20 GB | ~22 GB | Strong quality/memory tradeoff; near the upper edge of a 24 GB GPU. |
Q6_K |
~22 GB | ~25 GB | High-quality local inference if you have the memory. |
Q8_0 |
~29 GB | ~32 GB | Highest-precision quantized option. |
bf16 |
~55 GB | ~58 GB | BF16 GGUF reference, if present. |
VRAM figures are rough estimates for GPU-offload inference (-ngl 99 -fa) at a moderate context (~32k) with quantized KV cache; they scale up with longer contexts.
[!IMPORTANT] Every quant in this release ships with the MTP
nextnprediction heads stored atQ8_0precision, regardless of the overall quant target. That means speculative decoding works at any quant level β pick the smallest one that fits your VRAM and you still get the MTP throughput profile described in Β§6.
Some files may still be uploading. Check the Files tab for the exact artifacts currently available.
π 4. Quickstart
Run with llama.cpp (standard launch β works on any build):
# 128k context shown; base model natively supports 256k and is extensible to ~1M via RoPE scaling.
# Sampling values match Qwen3.6's recommended non-thinking-mode defaults β this is the inference
# path the tune was trained for, so prefer these.
llama-server -hf bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF:Q4_K_M \
--jinja -ngl 99 -fa -c 131072 \
--temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 \
--presence-penalty 1.5
Run with upstream llama.cpp + MTP speculative decoding (ggml-org/llama.cpp, MTP support merged in PR #22673):
# The nextn prediction heads in this release activate via upstream's draft-mtp speculator.
# -np must be 1 with MTP (parallel slots are not yet supported alongside MTP).
llama-server -hf bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF:Q4_K_M \
--spec-type draft-mtp \
--spec-draft-n-max 3 \
-np 1 \
--jinja -ngl 99 -fa -c 131072 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 \
--presence-penalty 1.5
Run with Ollama:
ollama run hf.co/bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF:Q4_K_M
Download a single GGUF file:
hf download bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF \
Qwen3.6-27B-MTP-pi-tune-Q4_K_M.gguf \
--local-dir .
Download the whole repo:
hf download bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF --local-dir .
Multi-modal inference (image + video)
This release is compatible with the Qwen3.6 mmproj-F16.gguf sidecar for vision-language inference. A single mmproj file pairs with every quant in this release; the projector is architecturally tied to the base model's vision tower, not to the LM quant level, so download it once and reuse it.
The compatible mmproj can be downloaded from unsloth/Qwen3.6-27B-MTP-GGUF. The fine-tune in this release is language-only β the vision encoder weights have not been touched. Image / video understanding is therefore inherited unchanged from the upstream Qwen3.6-27B base model; this release does not claim to improve it, only to preserve it.
# Pull the LM weights from this repo
hf download bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF \
Qwen3.6-27B-MTP-pi-tune-Q4_K_M.gguf \
--local-dir .
# Pull the compatible mmproj sidecar
hf download unsloth/Qwen3.6-27B-MTP-GGUF \
mmproj-F16.gguf \
--local-dir .
# Launch llama-server with vision attached
llama-server -m ./Qwen3.6-27B-MTP-pi-tune-Q4_K_M.gguf \
--mmproj ./mmproj-F16.gguf \
--jinja -ngl 99 -fa -c 131072 \
--temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 \
--presence-penalty 1.5
For a quick text-and-image session without spinning up a server:
llama-mtmd-cli -m ./Qwen3.6-27B-MTP-pi-tune-Q4_K_M.gguf \
--mmproj ./mmproj-F16.gguf
Use as an OpenAI-compatible API
llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint, so any client written against the OpenAI SDK can point at it directly β no client changes required:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="not-needed",
)
resp = client.chat.completions.create(
model="Qwen3.6-27B-MTP-pi-tune",
messages=[
{"role": "system", "content": "You are a precise coding agent."},
{"role": "user", "content": "Write a Python function that merges overlapping intervals."},
],
)
print(resp.choices[0].message.content)
The same endpoint accepts tools=[...] for function calling and supports streaming via stream=True.
𧬠5. Training & Data Notes
Tuned on real agent traces, not synthetic generations or distilled chat β every training row is the trail of an assistant that actually executed the task end-to-end through a PI-style harness, exported as Qwen-compatible ChatML rows with tool schemas and runtime prompts preserved.
High-level task coverage spanned:
Terminal and shell-environment agent tasks.
Tool / function-calling interactions.
Multi-language code editing and repair tasks.
Repository issue resolution and test-driven patching.
Coding and API integration tasks.
Shell, package, migration, ops, and verifier-driven tasks.
Specific dataset names and training-row counts are intentionally omitted from this initial card.
β‘ 6. MTP Throughput
MTP stands for Multi-Token Prediction. The model drafts likely future tokens and the runtime accepts them when they agree with the main decode path. On local agent work this matters because long reasoning, code generation, tool-call setup, and shell-oriented turns otherwise spend most of their wall time waiting on generation.
The numbers below describe the current local profile of the release. They are representative figures from internal runs against the PI harness β full benchmark publication is forthcoming and will replace these with task-success-rate tables in Β§7.
Raw decode profile
Agentic throughput
The agentic number is different from raw tokens/sec. It measures real task throughput across agent runs β including model generation, tool calls, shell commands, package installs, file I/O, and verifier-facing work.
Effective output throughput is computed as:
sum(output tokens) / sum(agent execution duration)
That makes it a more realistic agent-workflow number than plain decode speed β it includes time spent operating through the harness, not just time spent generating text.
π 7. Coding Eval Benchmarks
The follow-up release will cover task-success rates across the high-level areas listed in Β§5: terminal/shell agent tasks, tool & function calling, multi-language code editing, repository issue resolution, and coding/API integration tasks.
[!IMPORTANT] Throughput figures above are from the local MTP-enabled run. Task success rates should be reported only from completed eval runs, not inferred from speed.
π― 8. Recommended Use Cases
- Local coding-agent experiments.
- Tool-heavy chat and function-calling experiments.
- DevOps troubleshooting and runbook drafting.
- Repository navigation, patch planning, and test iteration.
- Long-context engineering workflows where local inference is preferred.
β οΈ 9. Limitations
- This is a community release for research, evaluation, and workflow exploration.
- Low-bit quantizations may reduce instruction following and tool-call reliability.
- Coding-eval success rates are not finalized in this initial card.
- This card does not claim safety alignment beyond the behavior inherited from the base model and the fine-tuning data.
π 10. License
Released under the Apache 2.0 license, inherited from the upstream Qwen3.6-27B base model. You are free to use, modify, and redistribute the model and its derivatives subject to the terms of that license.
π 11. Acknowledgements
Thanks to the Qwen team for the Qwen3.6 base model and its MTP design, to the ggml-org / llama.cpp maintainers for native multi-token-prediction support in upstream, and to the broader open-source quantization tooling community whose work makes local-first inference of frontier models possible.
</div>
</div>
</div>
</div>
Sign up to read complete case studies, access detailed metrics, and unlock all use cases.
Sign up to read complete case studies, access detailed metrics, and unlock all use cases.