KVzap-mlp-Qwen3-8B

NVIDIA's KVzap-mlp is a 75M-parameter MLP that predicts importance scores for the KV cache entries of Qwen3-8B, accelerating inference through KV cache compression.

Model Card

KVzap is a fast, adaptive, and faithful KV cache pruning method aiming to accelerate LLM inference in both prefilling and decoding. It applies a lightweight model to the hidden states to predict importance scores for every KV pair and prunes the ones with a score below a given threshold, following the Dynamic Memory Sparsification (DMS) inference strategy.
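
The score-and-threshold step described above can be illustrated with a toy sketch in pure Python. Everything here is an assumption made for illustration: the function names (`mlp_scores`, `keep_mask`), the weights, the ReLU activation, and the tiny dimensions are hypothetical and are not taken from the KVzap checkpoint or the kvpress code. The sketch also assumes one score per token and per KV head.

```python
def mlp_scores(hidden, w1, w2):
    """Tiny two-layer MLP: one hidden state in, one score per KV head out.

    w1 holds one weight row per inner unit, w2 one weight row per KV head.
    ReLU is an illustrative choice, not necessarily KVzap's activation.
    """
    inner = [max(0.0, sum(h * w for h, w in zip(hidden, row))) for row in w1]
    return [sum(a * w for a, w in zip(inner, row)) for row in w2]


def keep_mask(hidden_states, w1, w2, threshold):
    """mask[t][head] is True when the KV pair of token t on that head is kept."""
    return [[score >= threshold for score in mlp_scores(h, w1, w2)]
            for h in hidden_states]


# Made-up weights and hidden states, chosen only to make the arithmetic easy.
W1 = [[1.0, 0.0], [0.0, 1.0]]    # 2 inner units over a 2-dim hidden state
W2 = [[1.0, -1.0], [0.5, 0.5]]   # 2 KV heads
HIDDEN = [[2.0, 0.5], [0.0, 3.0], [1.0, 2.0]]  # 3 tokens

mask = keep_mask(HIDDEN, W1, W2, threshold=0.0)
print(mask)  # [[True, True], [False, True], [False, True]]
```

In this toy run the first head drops two of three tokens while the second head keeps all of them, which is the sense in which a threshold, unlike a fixed budget, lets the number of retained pairs vary with the input.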

The method was introduced in the paper KVzap: Fast, Adaptive, and Faithful KV Cache Pruning.

KVzap is trained as a fast approximation of KVzip+, using 1.2M samples from Nemotron-Pretraining-Dataset-sample. Training code is available in the kvpress repository.

Usage

KVzap can be used with the kvpress library, through the custom KVPressTextGenerationPipeline, which is automatically registered as a transformers pipeline with the name kv-press-text-generation when kvpress is imported:

import requests
from transformers import pipeline
from kvpress import KVzapPress, DMSPress

model = "Qwen/Qwen3-8B"
pipe = pipeline("kv-press-text-generation", model=model, device_map="auto", dtype="auto")
press = DMSPress(KVzapPress(model_type="mlp"), threshold=-4)

# Prefilling compression only, thinking disabled
press.decoding = False
context = requests.get("https://arxiv.org/abs/2601.07891").text
question = "\n What is this article about in 2 sentences ?"
answer = pipe(context, question=question, press=press)["answer"]
print(f"Compression ratio: {press.compression_ratio:.2%}\nAnswer: {answer}")

# Prefilling and decoding compression, thinking enabled
press.decoding = True
prompt = "What is the best hardware to run LLMs and why ?"
answer = pipe(prompt, press=press, enable_thinking=True, max_new_tokens=2000)["answer"]
print(f"Compression ratio: {press.compression_ratio:.2%}\nAnswer: {answer}")
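
To make the printed compression ratio easier to interpret, here is a minimal sketch of how a fixed threshold can yield different ratios on different inputs. It assumes the ratio is the fraction of KV pairs pruned, which is how the value is used above but is not verified against the kvpress implementation. The helper `compression_ratio` and both score lists are hypothetical, not output from the model.

```python
def compression_ratio(scores, threshold):
    """Fraction of KV pairs whose score falls below the threshold."""
    pruned = sum(1 for s in scores if s < threshold)
    return pruned / len(scores)


# Hypothetical importance scores for two prompts of equal length.
redundant_prompt = [-6.2, -5.8, -5.1, -4.7, -4.3, -3.0, -1.2, 0.4]
dense_prompt = [-4.5, -3.6, -2.9, -2.2, -1.8, -0.9, 0.3, 1.1]

for threshold in (-6, -4, -2):
    print(
        f"threshold={threshold}: "
        f"redundant={compression_ratio(redundant_prompt, threshold):.2%}, "
        f"dense={compression_ratio(dense_prompt, threshold):.2%}"
    )
# threshold=-6: redundant=12.50%, dense=0.00%
# threshold=-4: redundant=62.50%, dense=12.50%
# threshold=-2: redundant=75.00%, dense=50.00%
```

Raising the threshold prunes more pairs, and at the same threshold a prompt with many low-scoring pairs is compressed more than one whose pairs mostly score high.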

Citation

If you use KVzap in your research, please cite the following paper:

@article{jegou2025kvzap,
  title={KVzap: Fast, Adaptive, and Faithful KV Cache Pruning},
  author={Jegou, Simon and Jeblick, Maximilian},
  journal={arXiv preprint arXiv:2601.07891},
  year={2025},
  url={https://arxiv.org/abs/2601.07891}
}

Details

Author: NVIDIA (verified organization, nvidia)
Downloads: 548.7K
Likes: 4
Access: Open source
Task: other
Parameters: 76M
License: apache-2.0
Library: transformers
Created: 3 Dec 2025
Updated: 21 Jan 2026