How many parameters does KVzap-mlp-Qwen3-8B have?

KVzap-mlp-Qwen3-8B has approximately 76 million parameters (under 0.1 billion).

Who created KVzap-mlp-Qwen3-8B?

KVzap-mlp-Qwen3-8B was published by NVIDIA on Hugging Face.

KVzap-mlp-Qwen3-8B

Name: KVzap-mlp-Qwen3-8B
Author: NVIDIA

NVIDIA's KVzap-mlp is a lightweight MLP (roughly 76M parameters) that reads Qwen3-8B hidden states and predicts an importance score for each KV pair, so that low-scoring pairs can be pruned from the KV cache for faster inference.

Model Description

KVzap is a fast, adaptive, and faithful KV cache pruning method aiming to accelerate LLM inference in both prefilling and decoding. It applies a lightweight model to the hidden states to predict importance scores for every KV pair and prunes the ones with a score below a given threshold, following the Dynamic Memory Sparsification (DMS) inference strategy.
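The thresholding step described above can be illustrated with a minimal sketch. It is not the kvpress implementation: the function name `prune_by_threshold` and the example scores are hypothetical, and the importance scores are assumed to have already been produced by the lightweight model. It only shows why a fixed threshold gives an input-dependent (adaptive) compression ratio.

```python
def prune_by_threshold(scores, threshold):
    """Keep KV pairs whose score is at or above the threshold.

    scores: one predicted importance score per KV pair (hypothetical values).
    Returns the kept indices and the fraction of KV pairs that were pruned.
    """
    keep = [i for i, score in enumerate(scores) if score >= threshold]
    pruned_fraction = 1.0 - len(keep) / len(scores) if scores else 0.0
    return keep, pruned_fraction


# Hypothetical scores for six KV pairs of one attention head.
example_scores = [-1.2, -6.5, -3.9, -4.0, -8.1, 0.3]
kept, ratio = prune_by_threshold(example_scores, threshold=-4)
print(f"Kept indices: {kept}, compression ratio: {ratio:.2%}")
```

Because the number of scores below the threshold depends on the input, two prompts of the same length can end up with different compression ratios.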

The method was introduced in the paper KVzap: Fast, Adaptive, and Faithful KV Cache Pruning.

KVzap is trained as a fast approximation of KVzip+, using 1.2M samples from Nemotron-Pretraining-Dataset-sample. Training code is available in the kvpress repository.

Usage

KVzap can be used with the kvpress library, through the custom KVPressTextGenerationPipeline, which is automatically registered as a transformers pipeline with the name kv-press-text-generation when kvpress is imported:

import requests
from transformers import pipeline
from kvpress import KVzapPress, DMSPress

model = "Qwen/Qwen3-8B"
pipe = pipeline("kv-press-text-generation", model=model, device_map="auto", dtype="auto")
press = DMSPress(KVzapPress(model_type="mlp"), threshold=-4)

# Prefilling compression only, thinking disabled
press.decoding = False
context = requests.get("https://arxiv.org/abs/2601.07891").text
question = "\n What is this article about in 2 sentences ?"
answer = pipe(context, question=question, press=press)["answer"]
print(f"Compression ratio: {press.compression_ratio:.2%}\nAnswer: {answer}")

# Prefilling and decoding compression, thinking enabled
press.decoding = True
prompt = "What is the best hardware to run LLMs and why ?"
answer = pipe(prompt, press=press, enable_thinking=True, max_new_tokens=2000)["answer"]
print(f"Compression ratio: {press.compression_ratio:.2%}\nAnswer: {answer}")
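To interpret the printed compression ratio, a rough back-of-the-envelope estimate of KV cache size can help. The sketch below makes several assumptions that do not come from this model card: the Qwen3-8B configuration values (36 layers, 8 KV heads, head dimension 128), a 2-byte cache dtype such as bfloat16, and a compression ratio defined as the fraction of KV pairs removed. The helper names are hypothetical, and the result is an upper bound on potential savings, since it ignores the scoring model's overhead and how pruned entries are actually stored.

```python
def kv_cache_bytes(num_tokens, num_layers=36, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Estimate the uncompressed KV cache size in bytes.

    Defaults are assumed Qwen3-8B values; check the model config before
    relying on them. The factor 2 accounts for keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def compressed_kv_cache_bytes(num_tokens, compression_ratio, **config):
    """Estimate the cache size after removing a fraction of the KV pairs."""
    return kv_cache_bytes(num_tokens, **config) * (1.0 - compression_ratio)


full = kv_cache_bytes(32768)
half = compressed_kv_cache_bytes(32768, compression_ratio=0.5)
print(f"Full cache: {full / 2**30:.2f} GiB, at 50% compression: {half / 2**30:.2f} GiB")
```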

Citation

If you use KVzap in your research, please cite the following paper:

@article{jegou2025kvzap,
  title={KVzap: Fast, Adaptive, and Faithful KV Cache Pruning},
  author={Jegou, Simon and Jeblick, Maximilian},
  journal={arXiv preprint arXiv:2601.07891},
  year={2025},
  url={https://arxiv.org/abs/2601.07891}
}

Details

Author: NVIDIA (verified organization, nvidia)
Downloads: 548.7K
Likes: 4
Access: Open Source
Task: other
Parameters: 76M
License: apache-2.0
Library: transformers
Created: Dec 3, 2025
Updated: Jan 21, 2026
