Unlimited-OCR-GGUF
GGUF quantization by Sahil Chachra of Baidu's Unlimited-OCR model, a multilingual vision-language model for document parsing.
Base model: baidu/Unlimited-OCR
Model description
GGUF quantizations of baidu/Unlimited-OCR, a 3B vision-language OCR model that pushes DeepSeek-OCR one step further (one-shot, long-horizon document parsing). This repo contains a full spread of K-quants and i-quants of the language model plus the vision projector (mmproj) needed for image input.
⚠️ Requires a DeepSeek-OCR-aware llama.cpp build (PR #17400). Unlimited-OCR uses the DeepSeek-OCR architecture (a SAM+CLIP DeepEncoder vision tower with a DeepSeek-V2 MoE text decoder). Support is not yet merged into upstream `main`, so stock llama.cpp will not load these files. Build the PR branch (instructions below).
Files
Every run needs two files: one language model GGUF (pick a quant) plus the shared vision projector. The projector is fp16 and identical for all quants.
| File | Quant | Bits | Size | Notes |
|---|---|---|---|---|
| `Unlimited-OCR-BF16.gguf` | BF16 | 16 | 5.47 GiB | Full-precision conversion. The base every quant is made from; reference quality. |
| `Unlimited-OCR-Q8_0.gguf` | Q8_0 | 8 | 2.91 GiB | Near-lossless. Best quality short of BF16; recommended if you have the disk/RAM. |
| `Unlimited-OCR-Q6_K.gguf` | Q6_K | 6 | 2.43 GiB | Very high quality, essentially indistinguishable from Q8_0 for OCR. |
| `Unlimited-OCR-Q5_K_M.gguf` | Q5_K_M | 5 | 2.07 GiB | High quality. Great balance when you can spare a bit more than Q4. |
| `Unlimited-OCR-Q5_K_S.gguf` | Q5_K_S | 5 | 1.95 GiB | High quality, slightly smaller than Q5_K_M. |
| `Unlimited-OCR-Q4_K_M.gguf` | Q4_K_M | 4 | 1.82 GiB | Recommended default: best overall size/quality trade-off. |
| `Unlimited-OCR-Q4_K_S.gguf` | Q4_K_S | 4 | 1.68 GiB | Slightly smaller than Q4_K_M with a small quality cost. |
| `Unlimited-OCR-Q3_K_M.gguf` | Q3_K_M | 3 | 1.45 GiB | Compact. Usable when memory is tight; some quality loss. |
| `Unlimited-OCR-IQ4_XS.gguf` | IQ4_XS | 4 | 1.53 GiB | i-quant: smaller than Q4_K_S at similar quality (built with imatrix). |
| `Unlimited-OCR-IQ4_NL.gguf` | IQ4_NL | 4 | 1.59 GiB | i-quant (non-linear): 4-bit tuned for ARM/edge; good on Jetson/Apple. |
| `Unlimited-OCR-IQ3_M.gguf` | IQ3_M | 3 | 1.35 GiB | i-quant: solid 3-bit quality for the size (imatrix). |
| `Unlimited-OCR-IQ3_XXS.gguf` | IQ3_XXS | 3 | 1.24 GiB | i-quant: very small 3-bit; noticeable quality loss but runnable. |
| `Unlimited-OCR-IQ2_M.gguf` | IQ2_M | 2 | 1.15 GiB | i-quant: smallest here; experimental, lowest quality, for tight memory only. |
Vision projector (required for all of the above):
| File | Type | Size |
|---|---|---|
| `mmproj-Unlimited-OCR-F16.gguf` | F16 | 774.27 MiB |
Sizes are the on-disk GGUF sizes. The vision encoder is kept at F16 (not quantized) — it is small and quantizing it hurts OCR accuracy. i-quants were built with an importance matrix (imatrix) computed from a general-text calibration set.
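Because every run needs one quant plus the projector, the disk footprint is the sum of the two files. The sketch below copies the sizes from the tables above and picks the largest quant that fits a given budget. The function names are illustrative and not part of any tool, and on-disk size is only a lower bound on runtime memory, since the KV cache and compute buffers add to it.

```python
# On-disk sizes in GiB, copied from the Files table above.
QUANT_SIZES_GIB = {
    "BF16": 5.47, "Q8_0": 2.91, "Q6_K": 2.43, "Q5_K_M": 2.07, "Q5_K_S": 1.95,
    "Q4_K_M": 1.82, "Q4_K_S": 1.68, "Q3_K_M": 1.45, "IQ4_XS": 1.53,
    "IQ4_NL": 1.59, "IQ3_M": 1.35, "IQ3_XXS": 1.24, "IQ2_M": 1.15,
}
MMPROJ_GIB = 774.27 / 1024  # projector size from the table, MiB -> GiB


def total_download_gib(quant: str) -> float:
    """Disk needed for one quant plus the shared vision projector."""
    return QUANT_SIZES_GIB[quant] + MMPROJ_GIB


def largest_quant_within(budget_gib: float):
    """Return the largest quant whose two files fit in budget_gib, or None."""
    fitting = [q for q in QUANT_SIZES_GIB if total_download_gib(q) <= budget_gib]
    return max(fitting, key=QUANT_SIZES_GIB.get) if fitting else None


print(f"{total_download_gib('Q4_K_M'):.2f} GiB")  # 2.58 GiB
print(largest_quant_within(2.6))                  # Q4_K_M
```

Picking by size alone ignores the quality notes in the table, so treat the result as a starting point rather than a recommendation.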
Build llama.cpp with DeepSeek-OCR support
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
git fetch origin pull/24975/head:pr24975 && git checkout pr24975
cmake -B build -DCMAKE_BUILD_TYPE=Release # add -DGGML_CUDA=ON for NVIDIA
cmake --build build -j --target llama-mtmd-cli llama-server
Quick start
Download one quant + the projector (you always need both):
huggingface-cli download sahilchachra/Unlimited-OCR-GGUF \
--include "Unlimited-OCR-Q4_K_M.gguf" "mmproj-Unlimited-OCR-F16.gguf" --local-dir ./uocr
Run it on an image:
./build/bin/llama-mtmd-cli \
-m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
--mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf \
--image document.png \
-p "<|grounding|>Convert the document to markdown." \
--temp 0
Use `--temp 0` for OCR (deterministic). Add `-n 4096` (or more) for long/dense documents.
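If you drive the CLI from a script, it can help to assemble the argument list in one place. This is a minimal sketch that uses only the flags shown above (`-m`, `--mmproj`, `--image`, `-p`, `--temp`, `-n`); the helper name `build_mtmd_command` and its defaults are assumptions made for illustration.

```python
import shlex


def build_mtmd_command(model, mmproj, image, prompt, max_tokens=None,
                       binary="./build/bin/llama-mtmd-cli"):
    """Assemble the llama-mtmd-cli argument list for one image."""
    argv = [binary, "-m", model, "--mmproj", mmproj,
            "--image", image, "-p", prompt,
            "--temp", "0"]  # deterministic decoding, as advised for OCR
    if max_tokens is not None:
        argv += ["-n", str(max_tokens)]  # raise for long/dense documents
    return argv


argv = build_mtmd_command(
    "./uocr/Unlimited-OCR-Q4_K_M.gguf",
    "./uocr/mmproj-Unlimited-OCR-F16.gguf",
    "document.png",
    "<|grounding|>Convert the document to markdown.",
    max_tokens=4096,
)
# Print a copy-pasteable command; to execute it instead, pass argv to
# subprocess.run(argv, capture_output=True, text=True, check=True).
print(shlex.join(argv))
```

Passing a list rather than a shell string avoids quoting problems with the `<|...|>` tokens in the prompt.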
Prompting guide
Unlimited-OCR uses the DeepSeek-OCR prompt vocabulary. The prompt is just an instruction;
prefix it with <|grounding|> whenever you also want bounding boxes for what was read.
| Task | Prompt (-p) |
|---|---|
| Document → Markdown (layout-aware, with boxes) | `<\|grounding\|>Convert the document to markdown.` |
| Plain text OCR (just the text, no layout) | `Free OCR.` |
| OCR with bounding boxes | `<\|grounding\|>…` |
| Native Unlimited-OCR parse | `document parsing.` |
| Parse a figure / chart / diagram | `Parse the figure.` |
| Describe the image (general VQA) | `Describe this image in detail.` |
| Find specific text (referring grounding) | `<\|grounding\|>Locate <\|ref\|>…<\|/ref\|> in the image.` |
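The prompts are plain strings, so they are easy to build programmatically. The sketch below covers only the two patterns that this README spells out in full: the `<|grounding|>` prefix and the `<|ref|>…<|/ref|>` locate prompt from the worked examples. The helper names are hypothetical.

```python
GROUNDING = "<|grounding|>"  # prefix that asks the model to emit boxes too


def with_grounding(instruction: str) -> str:
    """Prefix an instruction so the output includes bounding boxes."""
    return GROUNDING + instruction


def locate_prompt(target: str) -> str:
    """Build a referring-grounding prompt for one target string."""
    return f"{GROUNDING}Locate <|ref|>{target}<|/ref|> in the image."


print(with_grounding("Convert the document to markdown."))
print(locate_prompt("Invoice Number"))
```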
Worked examples
1) Document → clean Markdown (tables, headings, reading order):
./build/bin/llama-mtmd-cli -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
--mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf \
--image invoice.png --temp 0 -n 4096 \
-p "<|grounding|>Convert the document to markdown."
2) Just the raw text, no layout / no boxes:
./build/bin/llama-mtmd-cli -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
--mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf \
--image receipt.jpg --temp 0 -p "Free OCR."
3) Locate a specific string and get its box:
./build/bin/llama-mtmd-cli -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
--mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf \
--image form.png --temp 0 \
-p "<|grounding|>Locate <|ref|>Invoice Number<|/ref|> in the image."
Understanding the output (grounding tokens)
With <|grounding|>, the model interleaves the recognized text with detection boxes:
<|det|>title [37, 64, 464, 132]<|/det|>INVOICE #2026-0623
<|det|>text [37, 194, 350, 247]<|/det|>Bill To: Sahil Chachra
<|det|>text [37, 483, 329, 543]<|/det|>Total Due: $44.00
Each [x1, y1, x2, y2] is the bounding box (top-left → bottom-right) of that span, in the
coordinate space of the model's input image. Drop the <|det|>...<|/det|> tags if you only
want the text, or parse them to overlay boxes / build a layout. Without <|grounding|> you get
plain text (or Markdown) with no box tags.
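Both options, dropping the tags and parsing them, can be sketched with a regular expression. The pattern below assumes the output looks exactly like the sample above (a label, then four comma-separated integers in square brackets, then the text); real output may vary, so treat `parse_grounded` and `strip_boxes` as illustrative helpers rather than a complete parser.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    label: str
    box: tuple   # (x1, y1, x2, y2), top-left to bottom-right
    text: str


# <|det|>label [x1, y1, x2, y2]<|/det|>text, up to the next <|det|> or the end
DET_RE = re.compile(
    r"<\|det\|>\s*(?P<label>[^\[\]]*?)\s*\[(?P<box>[^\]]*)\]\s*<\|/det\|>"
    r"(?P<text>.*?)(?=<\|det\|>|\Z)",
    re.DOTALL,
)


def parse_grounded(output: str) -> list:
    """Split grounded output into (label, box, text) spans."""
    spans = []
    for m in DET_RE.finditer(output):
        box = tuple(int(v) for v in m.group("box").split(","))
        spans.append(Span(m.group("label"), box, m.group("text").strip()))
    return spans


def strip_boxes(output: str) -> str:
    """Remove every <|det|>...<|/det|> tag, keeping only the text."""
    return re.sub(r"<\|det\|>.*?<\|/det\|>", "", output, flags=re.DOTALL)


SAMPLE = (
    "<|det|>title [37, 64, 464, 132]<|/det|>INVOICE #2026-0623\n"
    "<|det|>text [37, 194, 350, 247]<|/det|>Bill To: Sahil Chachra\n"
    "<|det|>text [37, 483, 329, 543]<|/det|>Total Due: $44.00"
)
for span in parse_grounded(SAMPLE):
    print(span.label, span.box, span.text)
```

To overlay the boxes on the original file, you would still need to map them from the model's input coordinate space back to the source image size, which this sketch does not attempt.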
Tip (long documents): Unlimited-OCR targets one-shot long-horizon parsing. For multi-page scans, run page-by-page and concatenate. If output ever repeats/loops on a dense page, add a mild repetition penalty, e.g. `--repeat-penalty 1.05`, and keep `--temp 0`.
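The page-by-page approach can be sketched as a small loop. Here `ocr_page` stands for whatever callable you use to run the model on one image (for example a wrapper around `llama-mtmd-cli`); it is a placeholder, not an existing API. The `looks_looped` check is an assumed heuristic for spotting repeated output, not something the model or llama.cpp provides.

```python
from typing import Callable, Iterable


def parse_pages(pages: Iterable[str], ocr_page: Callable[[str], str],
                separator: str = "\n\n") -> str:
    """Run OCR on each page image in order and join the results."""
    return separator.join(ocr_page(page).strip() for page in pages)


def looks_looped(text: str, min_repeats: int = 4) -> bool:
    """Heuristic: True if the last min_repeats non-empty lines are identical."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) < min_repeats:
        return False
    return len(set(lines[-min_repeats:])) == 1


# Stand-in OCR function so the sketch runs without a model.
fake_ocr = lambda path: f"text of {path}\n"
print(parse_pages(["page1.png", "page2.png"], fake_ocr))
```

A page flagged by `looks_looped` is a candidate for a rerun with the repetition penalty mentioned above.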
Serving (OpenAI-compatible API)
./build/bin/llama-server \
-m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
--mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf \
-c 8192 --host 0.0.0.0 --port 8080
Call it with an image (base64 data URL):
IMG=$(base64 -w0 document.png)
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"temperature": 0,
"messages": [{ "role": "user", "content": [
{ "type": "text", "text": "<|grounding|>Convert the document to markdown." },
{ "type": "image_url", "image_url": { "url": "data:image/png;base64,'"$IMG"'" } }
]}]
}'
The Python OpenAI SDK works the same way: point base_url at http://localhost:8080/v1, then send a
text part with the prompt above and an image_url part with the data URL.
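As a dependency-free alternative to the SDK, the same request can be built with the Python standard library. This is a sketch that mirrors the curl call above; `build_chat_payload` and `post_chat` are illustrative names, and the response parsing assumes the usual OpenAI chat-completions shape (`choices[0].message.content`).

```python
import base64
import json
import urllib.request


def build_chat_payload(image_bytes: bytes, prompt: str,
                       mime: str = "image/png") -> dict:
    """Build the chat-completions body: one text part plus one image part."""
    data_url = f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "temperature": 0,
        "messages": [{"role": "user", "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": data_url}},
        ]}],
    }


def post_chat(payload: dict, base_url: str = "http://localhost:8080/v1") -> str:
    """POST the payload to a running llama-server and return the reply text."""
    request = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]  # assumed response shape


# Usage, with the server from the previous step running:
#   with open("document.png", "rb") as f:
#       payload = build_chat_payload(f.read(), "<|grounding|>Convert the document to markdown.")
#   print(post_chat(payload))
```

Encoding the image in Python also sidesteps `base64 -w0`, a GNU coreutils flag that other `base64` implementations may not accept.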
About the model
- Architecture: `DeepseekOCRForCausalLM`, i.e. DeepEncoder vision (SAM-ViT-B + CLIP-L/14, 1024×1024 input, 16× downsample) → linear projector → DeepSeek-V2 MoE text decoder (12 layers, hidden 1280, 64 routed + 2 shared experts, 6 experts/token).
- Task: multilingual OCR / document parsing for single images, multi-page documents, and PDFs (one-shot long-horizon parsing). The original supports gundam (crop) and base resolution modes.
- License: MIT (inherited from the base model).
How these were made
- Converted `baidu/Unlimited-OCR` to GGUF with the PR #17400 `convert_hf_to_gguf.py`. The converter targets DeepSeek-OCR, so the config's top-level `architectures` was set to `DeepseekOCRForCausalLM` and `language_config.architectures` to `DeepseekV2ForCausalLM` (the model is otherwise byte-identical to DeepSeek-OCR's tensor layout).
- Exported the text decoder (BF16) and the vision tower (`--mmproj`, F16) separately.
- Built an importance matrix from a general-text corpus and produced the K-/i-quants with `llama-quantize`.
- Verified: the BF16 GGUF + mmproj correctly OCR a test document (text + grounding boxes) via `llama-mtmd-cli` before quantizing.
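The config change in the first step can be sketched as a small dictionary patch. This is an illustration of the edit described above, not the script that was actually used; it assumes the usual Hugging Face convention that `architectures` is a list, and the starting values in the example are placeholders.

```python
import copy


def patch_config(config: dict) -> dict:
    """Return a copy of a config dict with the architecture names the converter expects."""
    patched = copy.deepcopy(config)
    patched["architectures"] = ["DeepseekOCRForCausalLM"]
    language = patched.setdefault("language_config", {})
    language["architectures"] = ["DeepseekV2ForCausalLM"]
    return patched


# Placeholder input: the real config.json has many more keys.
original = {
    "architectures": ["PlaceholderArch"],
    "language_config": {"architectures": ["PlaceholderLM"], "hidden_size": 1280},
}
patched = patch_config(original)
print(patched["architectures"], patched["language_config"]["architectures"])
```

In practice you would load `config.json` with `json.load`, apply the patch, and write it back before running the converter.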
Limitations
- Needs the PR #17400 llama.cpp build until DeepSeek-OCR support lands in `main`.
- Very low-bit i-quants (IQ3_XXS, IQ2_M) trade real accuracy for size; prefer Q4_K_M or higher for production OCR.
- The vision encoder runs in fp16 regardless of the chosen text quant.
Credits
- Base model: baidu/Unlimited-OCR (MIT) — builds on deepseek-ai/DeepSeek-OCR.
- GGUF / DeepSeek-OCR llama.cpp support: ggml-org/llama.cpp#17400.
- Quantized by sahilchachra.