How many parameters does Qwen3.6-27B-MTP-pi-tune-GGUF have?

An exact parameter count is not listed for Qwen3.6-27B-MTP-pi-tune-GGUF; the model name and the Qwen/Qwen3.6-27B base model indicate roughly 27 billion parameters. See the Hugging Face model page for full specifications.

Who created Qwen3.6-27B-MTP-pi-tune-GGUF?

Qwen3.6-27B-MTP-pi-tune-GGUF was published by Thomas Kim on Hugging Face.

Qwen3.6-27B-MTP-pi-tune-GGUF

Name: Qwen3.6-27B-MTP-pi-tune-GGUF
Author: Thomas Kim


GGUF quantization of Qwen3.6-27B fine-tuned with multi-token prediction and pi-tuning for speculative decoding and agentic reasoning.

Base model

Qwen/Qwen3.6-27B

Model Description

🪐 Qwen3.6-27B-MTP-pi-tune

[!TIP] For the strongest Pi-style coding-agent behavior, use the reasoning-trained release: bytkim/Qwen3.6-27B-MTP-pi-reasoning-GGUF. See the technical writeup for the broader evaluation context. This no-thinking tune remains useful when you specifically want the lower-latency direct / instruct path.


💡 1. Model Overview

Attribute	Details
Base model	`Qwen/Qwen3.6-27B`
Release format	GGUF
Runtime target	llama.cpp-compatible local inference
Tuning focus	Harness fluency, coding-agent tasks, terminal workflows, tool use, repository work
Fine-tuning style	4-bit QLoRA SFT on private passed agent trajectories
Technical writeup	Qwen3.6 27B reasoning writeup
Reasoning data policy	Internal reasoning traces were not exported into the SFT rows
Recommended quant	`Q4_K_M` as the default starting point

[!NOTE] Qwen3.6-27B is a causal language model with a vision encoder. Image and video understanding are supported by pairing the language-model GGUF with the compatible Qwen3.6 mmproj-F16.gguf sidecar (see §4 Multi-modal inference). The MTP draft heads are kept at Q8_0 precision inside every quant via --tensor-type nextn=q8_0 at quantize time — speculative decoding works at any quant level, not just Q8_0 / bf16.

🧩 2. Why This Tune

Qwen3.6 supports both a thinking inference mode (which emits a <think>...</think> block before responding) and a non-thinking mode (where the model answers and acts directly). This release is fine-tuned specifically for the no-thinking path, the mode where agent loops actually live. In a Pi-style harness running tool-call loops, every thinking token is wall time during which the harness cannot dispatch the next action, so the tune concentrates quality on the no-thinking path where it counts: tool calls, repository edits, terminal commands, verifier feedback, and structured output.

This carries forward Qwen3.6's existing agentic-coding posture — frontend workflows, repository-level reasoning, and tool calling — but pulls the quality into the inference mode that local agent runtimes can budget for.

Terminal and shell task execution.
Repository inspection, patching, and test iteration.
Tool-call-shaped interactions and structured outputs.
DevOps runbooks, environment setup, and debugging loops.
Coding tasks where command use, file edits, and verifier feedback matter.

📦 3. Quantizations

Recommended starting point: Q4_K_M.

Quant	File size	VRAM (approx)	Suggested use
`Q2_K`	~11 GB	~13 GB	Smallest memory footprint; quality tradeoffs are expected.
`Q3_K_S`	~12 GB	~15 GB	Low-memory 3-bit option.
`Q3_K_M`	~14 GB	~16 GB	Balanced 3-bit option.
`Q3_K_L`	~15 GB	~17 GB	Higher-quality 3-bit option.
`Q4_K_S`	~16 GB	~18 GB	Smaller 4-bit option.
`Q4_K_M`	~17 GB	~19 GB	Default recommendation for most local use. Comfortable on a 24 GB GPU.
`Q5_K_S`	~19 GB	~21 GB	Higher-quality 5-bit option.
`Q5_K_M`	~20 GB	~22 GB	Strong quality/memory tradeoff; near the upper edge of a 24 GB GPU.
`Q6_K`	~22 GB	~25 GB	High-quality local inference if you have the memory.
`Q8_0`	~29 GB	~32 GB	Highest-precision quantized option.
`bf16`	~55 GB	~58 GB	BF16 GGUF reference, if present.

VRAM figures are rough estimates for GPU-offload inference (-ngl 99 -fa) at a moderate context (~32k) with quantized KV cache; they scale up with longer contexts.

[!IMPORTANT] Every quant in this release ships with the MTP nextn prediction heads stored at Q8_0 precision, regardless of the overall quant target. That means speculative decoding works at any quant level: pick whichever quant fits your VRAM and you still get the MTP throughput profile described in §6.
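
As a rough aid for reading the table above, the sketch below picks the highest-precision quant whose estimated VRAM fits a given budget. The VRAM figures are copied from the table (approximate, ~32k context); the `pick_quant` helper and its `headroom_gb` safety margin are illustrative assumptions, not part of the release.

```python
from typing import Optional

# Approximate VRAM needs in GB, copied from the quantization table
# (rough estimates at ~32k context with quantized KV cache).
QUANT_VRAM_GB = [
    ("Q2_K", 13), ("Q3_K_S", 15), ("Q3_K_M", 16), ("Q3_K_L", 17),
    ("Q4_K_S", 18), ("Q4_K_M", 19), ("Q5_K_S", 21), ("Q5_K_M", 22),
    ("Q6_K", 25), ("Q8_0", 32), ("bf16", 58),
]


def pick_quant(vram_gb: float, headroom_gb: float = 2.0) -> Optional[str]:
    """Return the largest quant that fits in vram_gb minus headroom, or None.

    headroom_gb is a hypothetical safety margin for longer contexts and
    other GPU consumers; tune it for your own setup.
    """
    budget = vram_gb - headroom_gb
    fitting = [name for name, need in QUANT_VRAM_GB if need <= budget]
    # The list is ordered from smallest to largest, so the last fit is the largest.
    return fitting[-1] if fitting else None


print(pick_quant(24))                 # largest quant within a 22 GB budget
print(pick_quant(24, headroom_gb=4))  # more conservative choice
```

Longer contexts raise the real requirement, so treat the result as a starting point and confirm with an actual load.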

Some files may still be uploading. Check the Files tab for the exact artifacts currently available.

🚀 4. Quickstart

Run with llama.cpp (standard launch — works on any build):

# 128k context shown; base model natively supports 256k and is extensible to ~1M via RoPE scaling.
# Sampling values match Qwen3.6's recommended non-thinking-mode defaults — this is the inference
# path the tune was trained for, so prefer these.
llama-server -hf bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF:Q4_K_M \
  --jinja -ngl 99 -fa -c 131072 \
  --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 \
  --presence-penalty 1.5

Run with upstream llama.cpp + MTP speculative decoding (ggml-org/llama.cpp, MTP support merged in PR #22673):

# The nextn prediction heads in this release activate via upstream's draft-mtp speculator.
# -np must be 1 with MTP (parallel slots are not yet supported alongside MTP).
llama-server -hf bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF:Q4_K_M \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  -np 1 \
  --jinja -ngl 99 -fa -c 131072 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 \
  --presence-penalty 1.5
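
If you launch the server from a script, the two command lines above can be assembled from one place. This is a minimal sketch: the `build_llama_server_argv` helper is hypothetical, and it only uses the flags shown in the commands above, enforcing the `-np 1` constraint when MTP is enabled.

```python
import shlex

REPO = "bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF"


def build_llama_server_argv(quant="Q4_K_M", ctx=131072, mtp=False,
                            draft_n_max=3, parallel=1):
    """Assemble a llama-server argv list mirroring the quickstart commands."""
    if mtp and parallel != 1:
        # The quickstart notes that parallel slots are not supported with MTP.
        raise ValueError("MTP speculative decoding requires -np 1")
    argv = ["llama-server", "-hf", f"{REPO}:{quant}"]
    if mtp:
        argv += ["--spec-type", "draft-mtp",
                 "--spec-draft-n-max", str(draft_n_max),
                 "-np", "1",
                 "--cache-type-k", "q8_0", "--cache-type-v", "q8_0"]
    elif parallel != 1:
        argv += ["-np", str(parallel)]
    # Shared flags and the non-thinking-mode sampling defaults from the quickstart.
    argv += ["--jinja", "-ngl", "99", "-fa", "-c", str(ctx),
             "--temp", "0.7", "--top-p", "0.8", "--top-k", "20", "--min-p", "0",
             "--presence-penalty", "1.5"]
    return argv


# Print a copy-pasteable command; pass the list to subprocess.run to launch it.
print(shlex.join(build_llama_server_argv(mtp=True)))
```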

Run with Ollama:

ollama run hf.co/bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF:Q4_K_M

Download a single GGUF file:

hf download bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF \
  Qwen3.6-27B-MTP-pi-tune-Q4_K_M.gguf \
  --local-dir .

Download the whole repo:

hf download bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF --local-dir .

Multi-modal inference (image + video)

This release is compatible with the Qwen3.6 mmproj-F16.gguf sidecar for vision-language inference. A single mmproj file pairs with every quant in this release; the projector is architecturally tied to the base model's vision tower, not to the LM quant level, so download it once and reuse it.

The compatible mmproj can be downloaded from unsloth/Qwen3.6-27B-MTP-GGUF. The fine-tune in this release is language-only — the vision encoder weights have not been touched. Image / video understanding is therefore inherited unchanged from the upstream Qwen3.6-27B base model; this release does not claim to improve it, only to preserve it.

# Pull the LM weights from this repo
hf download bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF \
  Qwen3.6-27B-MTP-pi-tune-Q4_K_M.gguf \
  --local-dir .

# Pull the compatible mmproj sidecar
hf download unsloth/Qwen3.6-27B-MTP-GGUF \
  mmproj-F16.gguf \
  --local-dir .

# Launch llama-server with vision attached
llama-server -m ./Qwen3.6-27B-MTP-pi-tune-Q4_K_M.gguf \
  --mmproj ./mmproj-F16.gguf \
  --jinja -ngl 99 -fa -c 131072 \
  --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 \
  --presence-penalty 1.5
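
With the mmproj attached, an image can be sent to the server's OpenAI-compatible chat endpoint as an `image_url` content part carrying a base64 data URI. The sketch below only builds the request with the standard library; it assumes the server listens on `localhost:8080`, and the `build_image_request` helper name is hypothetical.

```python
import base64
import json
import urllib.request


def build_image_request(image_bytes, prompt, mime="image/png",
                        base_url="http://localhost:8080/v1"):
    """Build a POST request with one text part and one inline image part."""
    data_uri = f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")
    payload = {
        "model": "Qwen3.6-27B-MTP-pi-tune",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_uri}},
            ],
        }],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send it, read the image with `open(path, "rb").read()`, pass the bytes to `build_image_request`, and hand the result to `urllib.request.urlopen`; the reply body is JSON in the usual chat-completion shape.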

For a quick text-and-image session without spinning up a server:

llama-mtmd-cli -m ./Qwen3.6-27B-MTP-pi-tune-Q4_K_M.gguf \
  --mmproj ./mmproj-F16.gguf

Use as an OpenAI-compatible API

llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint, so any client written against the OpenAI SDK can point at it directly — no client changes required:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
)

resp = client.chat.completions.create(
    model="Qwen3.6-27B-MTP-pi-tune",
    messages=[
        {"role": "system", "content": "You are a precise coding agent."},
        {"role": "user",   "content": "Write a Python function that merges overlapping intervals."},
    ],
)
print(resp.choices[0].message.content)

The same endpoint accepts tools=[...] for function calling and supports streaming via stream=True.
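
A minimal sketch of both features, using only the standard library so the request shape is visible. The `run_shell` tool is a hypothetical example, not something shipped with this release, and `iter_stream_text` assumes the usual OpenAI-style server-sent-event framing (`data: {...}` lines ending with `data: [DONE]`).

```python
import json

# Hypothetical tool definition in the OpenAI function-calling schema.
RUN_SHELL_TOOL = {
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}


def build_payload(user_text, stream=False):
    """Request body for /v1/chat/completions with one tool attached."""
    return {
        "model": "Qwen3.6-27B-MTP-pi-tune",
        "messages": [{"role": "user", "content": user_text}],
        "tools": [RUN_SHELL_TOOL],
        "stream": stream,
    }


def iter_stream_text(lines):
    """Yield text deltas from the decoded lines of a streamed reply."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        choices = json.loads(data).get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text
```

When the model decides to call the tool, the non-streamed reply carries the call under `choices[0].message.tool_calls` instead of plain text; the harness runs it and returns the result as a `tool` role message.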

🧬 5. Training & Data Notes

Tuned on real agent traces, not synthetic generations or distilled chat: every training row is the trail of an assistant that actually executed the task end-to-end through a Pi-style harness, exported as Qwen-compatible ChatML rows with tool schemas and runtime prompts preserved.

High-level task coverage spanned:

Terminal and shell-environment agent tasks.
Tool / function-calling interactions.
Multi-language code editing and repair tasks.
Repository issue resolution and test-driven patching.
Coding and API integration tasks.
Shell, package, migration, ops, and verifier-driven tasks.

Specific dataset names and training-row counts are intentionally omitted from this initial card.

⚡ 6. MTP Throughput

MTP stands for Multi-Token Prediction. The model drafts likely future tokens and the runtime accepts them when they agree with the main decode path. On local agent work this matters because long reasoning, code generation, tool-call setup, and shell-oriented turns otherwise spend most of their wall time waiting on generation.
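
The accept step can be illustrated with a toy greedy verifier. This is a simplified sketch of the general speculative-decoding idea, not llama.cpp's implementation: it assumes greedy decoding, and `accept_drafts` is a hypothetical helper. With sampling enabled, real runtimes use a probabilistic acceptance rule instead of exact equality.

```python
def accept_drafts(draft_tokens, verified_tokens):
    """Return the tokens emitted in one speculative step (greedy variant).

    verified_tokens[i] is the main model's choice at position i, computed
    in a single pass over the prefix plus draft_tokens[:i], so it has
    exactly len(draft_tokens) + 1 entries.
    """
    if len(verified_tokens) != len(draft_tokens) + 1:
        raise ValueError("need one verified token per draft, plus one")
    emitted = []
    for drafted, verified in zip(draft_tokens, verified_tokens):
        if drafted != verified:
            # First disagreement: keep the main model's token and stop.
            emitted.append(verified)
            return emitted
        emitted.append(drafted)
    # Every draft matched, so the extra verified token comes for free.
    emitted.append(verified_tokens[-1])
    return emitted


print(accept_drafts([5, 6, 7], [5, 6, 9, 1]))  # two drafts accepted, then a correction
print(accept_drafts([5, 6, 7], [5, 6, 7, 8]))  # all drafts accepted plus a bonus token
```

Each step emits at least one token and at most one more than the draft length, which is where the throughput gain comes from when the drafts are usually right.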

The numbers below describe the current local profile of the release. They are representative figures from internal runs against the Pi harness; full benchmark publication is forthcoming and will replace these with task-success-rate tables in §7.

Raw decode profile

Agentic throughput

The agentic number is different from raw tokens/sec. It measures real task throughput across agent runs — including model generation, tool calls, shell commands, package installs, file I/O, and verifier-facing work.

Effective output throughput is computed as:

sum(output tokens) / sum(agent execution duration)

That makes it a more realistic agent-workflow number than plain decode speed — it includes time spent operating through the harness, not just time spent generating text.
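
The formula above can be sketched as follows. The run data is made up for illustration, and `effective_output_throughput` is a hypothetical helper name.

```python
def effective_output_throughput(runs):
    """Pooled tokens per second over (output_tokens, duration_seconds) pairs."""
    runs = list(runs)
    total_tokens = sum(tokens for tokens, _ in runs)
    total_seconds = sum(seconds for _, seconds in runs)
    if total_seconds <= 0:
        raise ValueError("total agent execution duration must be positive")
    return total_tokens / total_seconds


# Illustrative values only: one generation-heavy run and one tool-heavy run.
example_runs = [(1200, 60.0), (300, 90.0)]
print(effective_output_throughput(example_runs))
```

Pooling the sums weights long runs more heavily than averaging per-run rates would, so a run that spends most of its time in tool calls pulls the figure down in proportion to its duration.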

📊 7. Coding Eval Benchmarks

The follow-up release will cover task-success rates across the high-level areas listed in §5: terminal/shell agent tasks, tool & function calling, multi-language code editing, repository issue resolution, and coding/API integration tasks.

[!IMPORTANT] Throughput figures above are from the local MTP-enabled run. Task success rates should be reported only from completed eval runs, not inferred from speed.

🎯 8. Recommended Use Cases

Local coding-agent experiments.
Tool-heavy chat and function-calling experiments.
DevOps troubleshooting and runbook drafting.
Repository navigation, patch planning, and test iteration.
Long-context engineering workflows where local inference is preferred.

⚠️ 9. Limitations

This is a community release for research, evaluation, and workflow exploration.
Low-bit quantizations may reduce instruction following and tool-call reliability.
Coding-eval success rates are not finalized in this initial card.
This card does not claim safety alignment beyond the behavior inherited from the base model and the fine-tuning data.

📜 10. License

Released under the Apache 2.0 license, inherited from the upstream Qwen3.6-27B base model. You are free to use, modify, and redistribute the model and its derivatives subject to the terms of that license.

🙏 11. Acknowledgements

Thanks to the Qwen team for the Qwen3.6 base model and its MTP design, to the ggml-org / llama.cpp maintainers for native multi-token-prediction support in upstream, and to the broader open-source quantization tooling community whose work makes local-first inference of frontier models possible.


Author: Thomas Kim
User: bytkim

Details

Downloads: 80K
Likes: 116
Access: Open Source
Task: text-generation
Trending: 42
License: apache-2.0
Library: gguf
Created: Jun 2, 2026
Updated: Jun 15, 2026
