P

Ornith-1.0-9B-MTP-GGUF

LLMby protoLabsAI·Model page

protoLabsAI's GGUF quantization of Ornith-1.0-9B with multi-token prediction support for faster speculative decoding.

Share:

Base model

deepreinforce-ai/Ornith-1.0-9B

Model Description

GGUF builds of deepreinforce-ai/Ornith-1.0-9B with the KL-distilled MTP draft head from protoLabsAI/Ornith-1.0-9B-MTP baked in — so llama.cpp does lossless multi-token (self-)speculative decoding out of the box, no separate draft model required.

~1.4–1.7× single-stream decode speedup on a single RTX A6000, distribution-lossless. The head's per-token acceptance on llama.cpp matches the vLLM reference (0.766 here vs 0.762).

Just want the base model with no MTP? Use deepreinforce-ai/Ornith-1.0-9B-GGUF. These files add the nextn head on top of the same trunk.

Files

File Form Size Use
ornith-9b-mtp-kl-Q8_0.gguf bundled (trunk + head) 9.8 GB highest quality / biggest relative speedup
ornith-9b-mtp-kl-Q6_K.gguf bundled 7.6 GB near-lossless quant
ornith-9b-mtp-kl-Q5_K_M.gguf bundled 6.6 GB balanced
ornith-9b-mtp-kl-Q4_K_M.gguf bundled 5.8 GB fastest k-quant
ornith-9b-mtp-kl-IQ4_XS.gguf bundled (imatrix) 5.5 GB low VRAM, near-Q4 quality
ornith-9b-mtp-kl-IQ3_M.gguf bundled (imatrix) 4.7 GB lower VRAM
ornith-9b-mtp-kl-IQ2_M.gguf bundled (imatrix) 3.9 GB very low VRAM (~5 GB to serve)
ornith-9b-mtp-kl-BF16.gguf bundled (full precision) 18.4 GB the master; re-quantize from this
mtp-ornith-9b-mtp-kl-Q8_0.gguf standalone draft head 2.4 GB attach to a base GGUF via --model-draft

The IQ quants are i-quants built with an importance matrix (calibrated on the trunk) for quality at low bit-rates, with the MTP nextn head pinned to Q8_0 so speculative-decode acceptance holds even on the 2-bit trunk (verified ~0.81–0.84 accept on IQ2_M–IQ4_XS, on par with the k-quants). Serve them exactly like the k-quants (--spec-type draft-mtp).

Requires llama.cpp ≥ b9616 (Qwen3.5 qwen35 arch + --spec-type draft-mtp).

Run

Bundled (recommended) — the head travels in the file:

llama-server --model ornith-9b-mtp-kl-Q4_K_M.gguf \
  --n-gpu-layers 99 --ctx-size 8192 --flash-attn on --jinja \
  --spec-type draft-mtp --spec-draft-n-max 3

Standalone draft — pair the small head with any base Ornith-9B GGUF:

llama-server --model ornith-1.0-9b-Q4_K_M.gguf \
  --model-draft mtp-ornith-9b-mtp-kl-Q8_0.gguf \
  --spec-type draft-mtp --spec-draft-n-max 3 \
  --n-gpu-layers 99 --ctx-size 8192 --flash-attn on --jinja

--spec-draft-n-max is the draft depth: 2 maximizes acceptance, 3 maximizes throughput, 4 starts to regress. Tune per workload.

Benchmarks (RTX A6000, ctx 8192, flash-attn, greedy; 6-prompt code+general mix)

n-max sweep (Q8_0)

config decode tok/s acceptance speedup
base (no MTP) 71.0 1.00×
MTP n-max 2 118.3 0.766 1.67×
MTP n-max 3 122.6 0.651 1.73×
MTP n-max 4 120.8 0.565 1.70×

Across quants (MTP n-max 3)

quant base tok/s MTP tok/s speedup acceptance
Q4_K_M 105.4 145.3 1.38× 0.659
Q8_0 71.0 122.6 1.73× 0.651

Acceptance is quant-stable (~0.65 @ n-max 3 even with the Q4 head). Q4_K_M is fastest in absolute terms; the relative MTP gain grows with precision (Q8's slow bandwidth-bound baseline has more to gain from the parallel verify).

"Lossless" — read this

MTP speculative decoding is distribution-lossless: every drafted token is verified against the target, so the output distribution is unchanged. It is not bitwise-identical to plain decode at greedy/temp 0 — the batched verification path computes target logits in a different floating-point reduction order than sequential decoding, which can flip a greedy argmax and fork the text. Both outputs are equally valid and equal quality; this is expected llama.cpp behavior, not a defect of these weights.

How these were built

# 1. graft the mtp.* head into the base trunk (15 tensors, 1 nextn layer)
python graft.py --donor protoLabsAI/Ornith-1.0-9B-MTP \
                --target deepreinforce-ai/Ornith-1.0-9B --out ./ornith-9b-mtp-kl
# 2. convert (the converter remaps mtp.* -> blk.<32>.nextn.* automatically)
python convert_hf_to_gguf.py ./ornith-9b-mtp-kl --outfile out/...-BF16.gguf --outtype bf16
python convert_hf_to_gguf.py ./ornith-9b-mtp-kl --outfile out/ --outtype q8_0 --mtp   # standalone draft
# 3. quantize
llama-quantize out/...-BF16.gguf out/...-Q4_K_M.gguf Q4_K_M

The graft.py recipe and the KL-distillation details live in the head repo protoLabsAI/Ornith-1.0-9B-MTP.

Common error: wrong number of tensors expected 442 got 427

(or got 426 for the smaller quants — the gap is the 15 mtp.* head tensors.)

This happens if you run convert_hf_to_gguf.py directly on the base deepreinforce-ai/Ornith-1.0-9B without grafting the head first. The base keeps mtp_num_hidden_layers: 1 in its config.json (text_config) but ships none of the mtp.* weights — so the converter writes block_count = 33 / nextn_predict_layers = 1 into the GGUF metadata (declaring the blk.32 MTP layer) while leaving those 15 tensors empty. llama.cpp then expects 442 tensors and finds 427 → load fails.

Fix: graft the head into the trunk before converting (step 1 above), then convert with no --mtp flag. Note that only 4 of the 15 head tensors are named blk.32.nextn.* (eh_proj, enorm, hnorm, shared_head_norm); the other 11 land as ordinary blk.32.* layer tensors (attn_*, ffn_*, the norms) — so grepping for nextn shows only 4, but the head is complete.

Don't want to graft? You don't have to build the bundled file at all — run the base GGUF with --model-draft mtp-ornith-9b-mtp-kl-Q8_0.gguf --spec-type draft-mtp. Functionally identical.

Provenance & license

  • Base: deepreinforce-ai/Ornith-1.0-9B (MIT) — a Qwen3.5-9B hybrid (linear-attention + full-attention) fine-tune.
  • MTP head: protoLabsAI/Ornith-1.0-9B-MTP (MIT) — KL-distilled against Ornith's own hidden states.
  • These GGUFs are a derivative of both; MIT. Built by protoLabs.studio.
Author
P
protoLabsAI
Organization
protoLabsAI
Details
Downloads9.5K
Likes41
AccessOpen Source
Tasktext-generation
Trending41
Licensemit
CreatedJun 27, 2026
UpdatedJun 28, 2026
View on Hugging Face
Get the full context.

Sign up to read complete case studies, access detailed metrics, and unlock all use cases.

Ornith-1.0-9B-MTP-GGUF — AI Model Details | Applied