Ornith-1.0-9B-MTP-GGUF
protoLabsAI's GGUF quantization of Ornith-1.0-9B with multi-token prediction support for faster speculative decoding.
Base model
Model Description
GGUF builds of deepreinforce-ai/Ornith-1.0-9B
with the KL-distilled MTP draft head from
protoLabsAI/Ornith-1.0-9B-MTP baked
in — so llama.cpp does lossless multi-token (self-)speculative decoding out of the box, no
separate draft model required.
~1.4–1.7× single-stream decode speedup on a single RTX A6000, distribution-lossless. The head's per-token acceptance on llama.cpp matches the vLLM reference (0.766 here vs 0.762).
Just want the base model with no MTP? Use
deepreinforce-ai/Ornith-1.0-9B-GGUF. These files add thenextnhead on top of the same trunk.
Files
| File | Form | Size | Use |
|---|---|---|---|
ornith-9b-mtp-kl-Q8_0.gguf |
bundled (trunk + head) | 9.8 GB | highest quality / biggest relative speedup |
ornith-9b-mtp-kl-Q6_K.gguf |
bundled | 7.6 GB | near-lossless quant |
ornith-9b-mtp-kl-Q5_K_M.gguf |
bundled | 6.6 GB | balanced |
ornith-9b-mtp-kl-Q4_K_M.gguf |
bundled | 5.8 GB | fastest k-quant |
ornith-9b-mtp-kl-IQ4_XS.gguf |
bundled (imatrix) | 5.5 GB | low VRAM, near-Q4 quality |
ornith-9b-mtp-kl-IQ3_M.gguf |
bundled (imatrix) | 4.7 GB | lower VRAM |
ornith-9b-mtp-kl-IQ2_M.gguf |
bundled (imatrix) | 3.9 GB | very low VRAM (~5 GB to serve) |
ornith-9b-mtp-kl-BF16.gguf |
bundled (full precision) | 18.4 GB | the master; re-quantize from this |
mtp-ornith-9b-mtp-kl-Q8_0.gguf |
standalone draft head | 2.4 GB | attach to a base GGUF via --model-draft |
The IQ quants are i-quants built with an importance matrix (calibrated on the trunk) for
quality at low bit-rates, with the MTP nextn head pinned to Q8_0 so speculative-decode
acceptance holds even on the 2-bit trunk (verified ~0.81–0.84 accept on IQ2_M–IQ4_XS, on par
with the k-quants). Serve them exactly like the k-quants (--spec-type draft-mtp).
Requires llama.cpp ≥ b9616 (Qwen3.5 qwen35 arch + --spec-type draft-mtp).
Run
Bundled (recommended) — the head travels in the file:
llama-server --model ornith-9b-mtp-kl-Q4_K_M.gguf \
--n-gpu-layers 99 --ctx-size 8192 --flash-attn on --jinja \
--spec-type draft-mtp --spec-draft-n-max 3
Standalone draft — pair the small head with any base Ornith-9B GGUF:
llama-server --model ornith-1.0-9b-Q4_K_M.gguf \
--model-draft mtp-ornith-9b-mtp-kl-Q8_0.gguf \
--spec-type draft-mtp --spec-draft-n-max 3 \
--n-gpu-layers 99 --ctx-size 8192 --flash-attn on --jinja
--spec-draft-n-max is the draft depth: 2 maximizes acceptance, 3 maximizes
throughput, 4 starts to regress. Tune per workload.
Benchmarks (RTX A6000, ctx 8192, flash-attn, greedy; 6-prompt code+general mix)
n-max sweep (Q8_0)
| config | decode tok/s | acceptance | speedup |
|---|---|---|---|
| base (no MTP) | 71.0 | — | 1.00× |
| MTP n-max 2 | 118.3 | 0.766 | 1.67× |
| MTP n-max 3 | 122.6 | 0.651 | 1.73× |
| MTP n-max 4 | 120.8 | 0.565 | 1.70× |
Across quants (MTP n-max 3)
| quant | base tok/s | MTP tok/s | speedup | acceptance |
|---|---|---|---|---|
| Q4_K_M | 105.4 | 145.3 | 1.38× | 0.659 |
| Q8_0 | 71.0 | 122.6 | 1.73× | 0.651 |
Acceptance is quant-stable (~0.65 @ n-max 3 even with the Q4 head). Q4_K_M is fastest in absolute terms; the relative MTP gain grows with precision (Q8's slow bandwidth-bound baseline has more to gain from the parallel verify).
"Lossless" — read this
MTP speculative decoding is distribution-lossless: every drafted token is verified against the target, so the output distribution is unchanged. It is not bitwise-identical to plain decode at greedy/temp 0 — the batched verification path computes target logits in a different floating-point reduction order than sequential decoding, which can flip a greedy argmax and fork the text. Both outputs are equally valid and equal quality; this is expected llama.cpp behavior, not a defect of these weights.
How these were built
# 1. graft the mtp.* head into the base trunk (15 tensors, 1 nextn layer)
python graft.py --donor protoLabsAI/Ornith-1.0-9B-MTP \
--target deepreinforce-ai/Ornith-1.0-9B --out ./ornith-9b-mtp-kl
# 2. convert (the converter remaps mtp.* -> blk.<32>.nextn.* automatically)
python convert_hf_to_gguf.py ./ornith-9b-mtp-kl --outfile out/...-BF16.gguf --outtype bf16
python convert_hf_to_gguf.py ./ornith-9b-mtp-kl --outfile out/ --outtype q8_0 --mtp # standalone draft
# 3. quantize
llama-quantize out/...-BF16.gguf out/...-Q4_K_M.gguf Q4_K_M
The graft.py recipe and the KL-distillation details live in the head repo
protoLabsAI/Ornith-1.0-9B-MTP.
Common error: wrong number of tensors expected 442 got 427
(or got 426 for the smaller quants — the gap is the 15 mtp.* head tensors.)
This happens if you run convert_hf_to_gguf.py directly on the base
deepreinforce-ai/Ornith-1.0-9B without grafting the head first. The base keeps
mtp_num_hidden_layers: 1 in its config.json (text_config) but ships none of the mtp.*
weights — so the converter writes block_count = 33 / nextn_predict_layers = 1 into the
GGUF metadata (declaring the blk.32 MTP layer) while leaving those 15 tensors empty. llama.cpp
then expects 442 tensors and finds 427 → load fails.
Fix: graft the head into the trunk before converting (step 1 above), then convert with no
--mtp flag. Note that only 4 of the 15 head tensors are named blk.32.nextn.* (eh_proj,
enorm, hnorm, shared_head_norm); the other 11 land as ordinary blk.32.* layer tensors
(attn_*, ffn_*, the norms) — so grepping for nextn shows only 4, but the head is complete.
Don't want to graft? You don't have to build the bundled file at all — run the base GGUF with
--model-draft mtp-ornith-9b-mtp-kl-Q8_0.gguf --spec-type draft-mtp. Functionally identical.
Provenance & license
- Base:
deepreinforce-ai/Ornith-1.0-9B(MIT) — a Qwen3.5-9B hybrid (linear-attention + full-attention) fine-tune. - MTP head:
protoLabsAI/Ornith-1.0-9B-MTP(MIT) — KL-distilled against Ornith's own hidden states. - These GGUFs are a derivative of both; MIT. Built by protoLabs.studio.
Sign up to read complete case studies, access detailed metrics, and unlock all use cases.
Sign up to read complete case studies, access detailed metrics, and unlock all use cases.