KVzap-mlp-Qwen3-8B

By NVIDIA

NVIDIA's KVzap-mlp is a 75M-parameter MLP that predicts an importance score for each KV pair in Qwen3-8B's KV cache, enabling faster inference through KV cache compression.

KVzap is a fast, adaptive, and faithful KV cache pruning method that aims to accelerate LLM inference during both prefilling and decoding. It applies a lightweight model to the hidden states to predict an importance score for every KV pair, then prunes the pairs whose score falls below a given threshold, following the Dynamic Memory Sparsification (DMS) inference strategy.
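The thresholding step described above can be illustrated with a minimal sketch. It is not the kvpress implementation: the function name prune_by_threshold and the score values are hypothetical, and the scores are simply given here rather than predicted from hidden states. The sketch assumes that a KV pair is kept when its score is at or above the threshold, and it reports the fraction of pairs pruned.

```python
from typing import List, Tuple


def prune_by_threshold(scores: List[float], threshold: float) -> Tuple[List[int], float]:
    """Return the indices of KV pairs to keep and the fraction of pairs pruned.

    Hypothetical helper for illustration only; a pair is pruned when its
    score is strictly below the threshold.
    """
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    pruned_fraction = 1.0 - len(kept) / len(scores) if scores else 0.0
    return kept, pruned_fraction


# Made-up importance scores, one per KV pair, for two different inputs.
redundant_scores = [-1.0, -6.5, -7.2, -0.3, -5.1, -8.0, -9.4, -2.2]
dense_scores = [-1.0, -3.5, -2.2, -0.3, -5.1, -1.0, -3.9, -2.2]

# The same threshold yields a different amount of pruning for each input.
kept_redundant, ratio_redundant = prune_by_threshold(redundant_scores, threshold=-4.0)
kept_dense, ratio_dense = prune_by_threshold(dense_scores, threshold=-4.0)

print(kept_redundant, f"{ratio_redundant:.2%}")
print(kept_dense, f"{ratio_dense:.2%}")
```

Because the cut is made on a score threshold rather than on a fixed budget, the amount removed depends on the input, which is one way to read the "adaptive" property: in this toy example the first input loses most of its pairs while the second loses only one.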

The method was introduced in the paper KVzap: Fast, Adaptive, and Faithful KV Cache Pruning.

KVzap is trained as a fast approximation of KVzip+, using 1.2M samples from Nemotron-Pretraining-Dataset-sample. Training code is available in the kvpress repository.

Usage

KVzap can be used with the kvpress library, through the custom KVPressTextGenerationPipeline, which is automatically registered as a transformers pipeline with the name kv-press-text-generation when kvpress is imported:

import requests
from transformers import pipeline
from kvpress import KVzapPress, DMSPress

model = "Qwen/Qwen3-8B"
pipe = pipeline("kv-press-text-generation", model=model, device_map="auto", dtype="auto")
press = DMSPress(KVzapPress(model_type="mlp"), threshold=-4)

# Prefilling compression only, thinking disabled
press.decoding = False
# Use the text of the arXiv abstract page (raw HTML response) as the long context
context = requests.get("https://arxiv.org/abs/2601.07891").text
question = "\n What is this article about in 2 sentences ?"
answer = pipe(context, question=question, press=press)["answer"]
print(f"Compression ratio: {press.compression_ratio:.2%}\nAnswer: {answer}")

# Prefilling and decoding compression, thinking enabled
press.decoding = True
prompt = "What is the best hardware to run LLMs and why ?"
answer = pipe(prompt, press=press, enable_thinking=True, max_new_tokens=2000)["answer"]
print(f"Compression ratio: {press.compression_ratio:.2%}\nAnswer: {answer}")

Citation

If you use KVzap in your research, please cite the following paper:

@article{jegou2025kvzap,
  title={KVzap: Fast, Adaptive, and Faithful KV Cache Pruning},
  author={Jegou, Simon and Jeblick, Maximilian},
  journal={arXiv preprint arXiv:2601.07891},
  year={2025},
  url={https://arxiv.org/abs/2601.07891}
}

Author

NVIDIA (verified organization, nvidia)

Details

Downloads: 548.7K
Likes: 4
Access: Open Source
Task: other
Parameters: 76M
License: apache-2.0
Library: transformers
Created: Dec 3, 2025
Updated: Jan 21, 2026