MiniMax-M3-GGUF
MiniMax-M3-GGUF is a GGUF-quantized multimodal mixture-of-experts model by Unsloth AI for coding, video understanding, and agentic tasks.
- Experimental GGUF support for MiniMax-M3.
- Jun 12 Update: You can now run MiniMax M3 in Unsloth Studio. See our Guide.
- Example of MiniMax M3 (5-bit GGUF) running in Unsloth Studio:
MiniMax-M3 support in llama.cpp is preliminary and not yet in a released build. To run these GGUFs, build llama.cpp from PR #24523:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/24523/head:minimax-m3
git checkout minimax-m3
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j --target llama-cli llama-server
Then run a quant. The model is large (~428B parameters), so either offload all layers to your GPUs with -ngl 99 or keep the weights in CPU RAM:
./build/bin/llama-cli -hf unsloth/MiniMax-M3-GGUF:UD-IQ1_M
Note: MiniMax Sparse Attention is not supported yet, so inference falls back to dense attention.
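When VRAM is too small for the whole model, a common llama.cpp pattern for MoE models is to offload the layers to the GPU while keeping the expert weights in system RAM. The following is a minimal sketch, assuming the PR branch built above still exposes mainline llama.cpp's `--cpu-moe` flag and that it behaves the same way for the MiniMax-M3 architecture. Neither point has been verified against this PR.

```shell
# Sketch only: assumes ./build/bin/llama-cli was built from the PR branch above
# and that --cpu-moe (a mainline llama.cpp flag) works for MiniMax-M3.
MODEL="unsloth/MiniMax-M3-GGUF:UD-IQ1_M"

if [ -x ./build/bin/llama-cli ]; then
  # -ngl 99   : offload all layers to the GPU(s)
  # --cpu-moe : keep the MoE expert weights in CPU RAM
  ./build/bin/llama-cli -hf "$MODEL" -ngl 99 --cpu-moe
else
  echo "llama-cli not found; build it from the PR branch first" >&2
fi
```

If some VRAM is left over, mainline llama.cpp also offers `--n-cpu-moe N`, which keeps only the expert weights of the first N layers on the CPU. Whether that flag is usable with this experimental build is likewise an assumption.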
MiniMax-M3
Highlights:
- Native Multimodality: M3 undergoes mixed-modality training from the very first step, enabling deeper semantic fusion across text, image, and video.
- Context Scaling via Sparse Attention: M3 introduces MiniMax Sparse Attention (MSA) to improve long-context efficiency. At 1M context, M3 delivers 9× prefill and 15× decode speedups compared to M2, reducing per-token compute to 1/20.
- Coding & Cowork Capability: M3 achieves frontier-level performance across long-horizon agentic benchmarks, excelling in both coding and cowork.
Model Details
| Property | Value |
| --- | --- |
| Architecture | MoE + MSA (MiniMax Sparse Attention) |
| Total Parameters | ~428B |
| Activated Parameters | ~23B |
| Experts | 128 (4 active per token) |
| Layers | 60 |
| Context Length | 1M tokens |
| Modalities | Text, Image, Video |
| Precision | bfloat16 |
| Transformers | ≥ 4.52.4 (trust_remote_code=True) |
| License | MiniMax Community License |
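The parameter counts in the table translate into a rough memory footprint for the weights. The sketch below is back-of-the-envelope arithmetic only: `estimate_gb` is a hypothetical helper, the 5-bit figure is an illustrative uniform bit width rather than a measured file size, and the KV cache and activations are ignored. Real GGUF quants mix precisions per tensor, so actual sizes will differ.

```shell
# Approximate weight size in GB (1 GB = 10^9 bytes), rounded down.
# Hypothetical helper: size = parameters * bits_per_weight / 8.
estimate_gb() {
  params_b=$1   # parameter count, in billions
  bits=$2       # bits per weight
  echo $(( params_b * bits / 8 ))
}

estimate_gb 428 16   # all weights in bfloat16: prints 856
estimate_gb 428 5    # all weights at a uniform 5 bits: prints 267
estimate_gb 23 16    # ~23B activated parameters in bfloat16: prints 46
```

The last line suggests why MoE models remain usable with weights in CPU RAM: only the activated parameters are read for each token, a small fraction of the total.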
How to Use
M3 supports two reasoning modes:
- thinking — for complex reasoning, agentic tasks, and long-horizon collaboration.
- non-thinking — for latency-sensitive scenarios such as chat and code completion.
Local Deployment
Download the model:
hf download MiniMaxAI/MiniMax-M3 --local-dir MiniMax-M3
You can also get model weights from ModelScope.
Inference Parameters
We recommend the following parameters for best performance: temperature=1.0, top_p=0.95, top_k=40. Default system prompt:
You are a helpful assistant. Your name is MiniMax-M3 and was built by MiniMax.
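For the GGUF builds, the recommended sampling parameters and the default system prompt can be passed on the command line. This is a minimal sketch, assuming the PR build of `llama-cli` accepts mainline llama.cpp's sampling flags (`--temp`, `--top-p`, `--top-k`) and the `-sys` system-prompt flag. The quant tag is the one used in the earlier run command.

```shell
# Sketch only: assumes the PR build keeps mainline llama.cpp's CLI flags.
SYSTEM_PROMPT="You are a helpful assistant. Your name is MiniMax-M3 and was built by MiniMax."

if [ -x ./build/bin/llama-cli ]; then
  # Recommended sampling: temperature=1.0, top_p=0.95, top_k=40
  ./build/bin/llama-cli -hf unsloth/MiniMax-M3-GGUF:UD-IQ1_M \
    --temp 1.0 --top-p 0.95 --top-k 40 \
    -sys "$SYSTEM_PROMPT"
fi
```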