Who created Unlimited-OCR-GGUF?

Unlimited-OCR-GGUF was published by Sahil Chachra on Hugging Face.

Sahil Chachra's GGUF quantization of Baidu's Unlimited-OCR, a multilingual vision-language model for document parsing.

Base model

baidu/Unlimited-OCR

Model description

GGUF quantizations of baidu/Unlimited-OCR, a 3B vision-language OCR model that pushes DeepSeek-OCR one step further (one-shot, long-horizon document parsing). This repo contains a full spread of K-quants and i-quants of the language model plus the vision projector (mmproj) needed for image input.

⚠️ Requires a DeepSeek-OCR–aware llama.cpp build (PR #17400). Unlimited-OCR uses the DeepSeek-OCR architecture (a SAM+CLIP DeepEncoder vision tower with a DeepSeek-V2 MoE text decoder). Support is not yet merged into upstream main — stock llama.cpp will not load these files. Build the PR branch (instructions below).

Files

Every run needs two files: one language model GGUF (pick a quant) plus the shared vision projector. The projector is fp16 and identical for all quants.

File	Quant	Bits	Size	Notes
`Unlimited-OCR-BF16.gguf`	BF16	16	5.47 GiB	Full-precision conversion. The base every quant is made from; reference quality.
`Unlimited-OCR-Q8_0.gguf`	Q8_0	8	2.91 GiB	Near-lossless. Best quality short of BF16; recommended if you have the disk/RAM.
`Unlimited-OCR-Q6_K.gguf`	Q6_K	6	2.43 GiB	Very high quality, essentially indistinguishable from Q8_0 for OCR.
`Unlimited-OCR-Q5_K_M.gguf`	Q5_K_M	5	2.07 GiB	High quality. Great balance when you can spare a bit more than Q4.
`Unlimited-OCR-Q5_K_S.gguf`	Q5_K_S	5	1.95 GiB	High quality, slightly smaller than Q5_K_M.
`Unlimited-OCR-Q4_K_M.gguf`	Q4_K_M	4	1.82 GiB	Recommended default — best overall size/quality trade-off.
`Unlimited-OCR-Q4_K_S.gguf`	Q4_K_S	4	1.68 GiB	Slightly smaller than Q4_K_M with a small quality cost.
`Unlimited-OCR-Q3_K_M.gguf`	Q3_K_M	3	1.45 GiB	Compact. Usable when memory is tight; some quality loss.
`Unlimited-OCR-IQ4_XS.gguf`	IQ4_XS	4	1.53 GiB	i-quant: smaller than Q4_K_S at similar quality (built with imatrix).
`Unlimited-OCR-IQ4_NL.gguf`	IQ4_NL	4	1.59 GiB	i-quant (non-linear): 4-bit tuned for ARM/edge; good on Jetson/Apple.
`Unlimited-OCR-IQ3_M.gguf`	IQ3_M	3	1.35 GiB	i-quant: solid 3-bit quality for the size (imatrix).
`Unlimited-OCR-IQ3_XXS.gguf`	IQ3_XXS	3	1.24 GiB	i-quant: very small 3-bit; noticeable quality loss but runnable.
`Unlimited-OCR-IQ2_M.gguf`	IQ2_M	2	1.15 GiB	i-quant: smallest here; experimental, lowest quality — for tight memory only.

Vision projector (required for all of the above):

File	Type	Size
`mmproj-Unlimited-OCR-F16.gguf`	F16	774.27 MiB

Sizes are the on-disk GGUF sizes. The vision encoder is kept at F16 (not quantized) — it is small and quantizing it hurts OCR accuracy. i-quants were built with an importance matrix (imatrix) computed from a general-text calibration set.
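
As a rough aid for choosing a file, the sketch below picks the largest language-model quant that fits a given size budget once the shared projector is added. The sizes are copied from the tables above; the function names and the idea of using file size as a proxy for quality are assumptions made for illustration, and runtime memory will be higher than the on-disk size (KV cache, compute buffers).

```python
# On-disk sizes in GiB, copied from the table above.
QUANT_SIZES_GIB = {
    "BF16": 5.47, "Q8_0": 2.91, "Q6_K": 2.43, "Q5_K_M": 2.07, "Q5_K_S": 1.95,
    "Q4_K_M": 1.82, "Q4_K_S": 1.68, "IQ4_NL": 1.59, "IQ4_XS": 1.53,
    "Q3_K_M": 1.45, "IQ3_M": 1.35, "IQ3_XXS": 1.24, "IQ2_M": 1.15,
}
MMPROJ_GIB = 774.27 / 1024  # projector size from the table, MiB -> GiB


def total_download_gib(quant):
    """Size of one quant plus the projector, which every run needs."""
    return QUANT_SIZES_GIB[quant] + MMPROJ_GIB


def pick_quant(budget_gib):
    """Return the largest quant that fits the budget, or None if none fits.

    Assumption: a larger file means better quality. This is only a heuristic.
    """
    fitting = [q for q in QUANT_SIZES_GIB if total_download_gib(q) <= budget_gib]
    if not fitting:
        return None
    return max(fitting, key=QUANT_SIZES_GIB.get)
```

For example, `pick_quant(2.6)` selects `Q4_K_M`, whose file plus the projector comes to roughly 2.58 GiB.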

Build llama.cpp with DeepSeek-OCR support

git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
git fetch origin pull/17400/head:pr17400 && git checkout pr17400
cmake -B build -DCMAKE_BUILD_TYPE=Release        # add -DGGML_CUDA=ON for NVIDIA
cmake --build build -j --target llama-mtmd-cli llama-server

Quick start

Download one quant + the projector (you always need both):

huggingface-cli download sahilchachra/Unlimited-OCR-GGUF \
  --include "Unlimited-OCR-Q4_K_M.gguf" "mmproj-Unlimited-OCR-F16.gguf" --local-dir ./uocr

Run it on an image:

./build/bin/llama-mtmd-cli \
  -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf \
  --image document.png \
  -p "<|grounding|>Convert the document to markdown." \
  --temp 0

Use --temp 0 for OCR (deterministic). Add -n 4096 (or more) for long/dense documents.

Prompting guide

Unlimited-OCR uses the DeepSeek-OCR prompt vocabulary. The prompt is just an instruction; prefix it with <|grounding|> whenever you also want bounding boxes for what was read.

Task	Prompt (`-p`)
Document → Markdown (layout-aware, with boxes)	`<|grounding|>Convert the document to markdown.`
Plain text OCR (just the text, no layout)	`Free OCR.`
OCR with bounding boxes	`<|grounding|>…`
Native Unlimited-OCR parse	`document parsing.`
Parse a figure / chart / diagram	`Parse the figure.`
Describe the image (general VQA)	`Describe this image in detail.`
Find specific text (referring grounding)	`<|grounding|>Locate <|ref|>Invoice Number<|/ref|> in the image.`

Worked examples

1) Document → clean Markdown (tables, headings, reading order):

./build/bin/llama-mtmd-cli -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf \
  --image invoice.png --temp 0 -n 4096 \
  -p "<|grounding|>Convert the document to markdown."

2) Just the raw text, no layout / no boxes:

./build/bin/llama-mtmd-cli -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf \
  --image receipt.jpg --temp 0 -p "Free OCR."

3) Locate a specific string and get its box:

./build/bin/llama-mtmd-cli -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf \
  --image form.png --temp 0 \
  -p "<|grounding|>Locate <|ref|>Invoice Number<|/ref|> in the image."

Understanding the output (grounding tokens)

With <|grounding|>, the model interleaves the recognized text with detection boxes:

<|det|>title [37, 64, 464, 132]<|/det|>INVOICE #2026-0623
<|det|>text  [37, 194, 350, 247]<|/det|>Bill To: Sahil Chachra
<|det|>text  [37, 483, 329, 543]<|/det|>Total Due: $44.00

Each [x1, y1, x2, y2] is the bounding box (top-left → bottom-right) of that span, in the coordinate space of the model's input image. Drop the <|det|>...<|/det|> tags if you only want the text, or parse them to overlay boxes / build a layout. Without <|grounding|> you get plain text (or Markdown) with no box tags.
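
The tag format above lends itself to a small parser. The following is a minimal sketch that assumes the output looks exactly like the sample (a label, a four-integer box, then the text on the same line); the function names are hypothetical and real output may need a more tolerant pattern.

```python
import re

# Pattern assumed from the sample output: <|det|>label [x1, y1, x2, y2]<|/det|>text
DET_RE = re.compile(
    r"<\|det\|>\s*(\w+)\s*\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]\s*<\|/det\|>"
    r"(.*?)(?=<\|det\|>|\n|\Z)"
)


def parse_grounded(output):
    """Return one dict per detected span: label, (x1, y1, x2, y2) box, and text."""
    spans = []
    for m in DET_RE.finditer(output):
        box = tuple(int(m.group(i)) for i in range(2, 6))
        spans.append({"label": m.group(1), "box": box, "text": m.group(6).strip()})
    return spans


def strip_det_tags(output):
    """Drop the <|det|>...<|/det|> tags and keep only the recognized text."""
    return re.sub(r"<\|det\|>.*?<\|/det\|>", "", output)
```

`parse_grounded` gives the boxes needed for an overlay or a layout, while `strip_det_tags` is the "text only" path described above.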

Tip — long documents: Unlimited-OCR targets one-shot long-horizon parsing. For multi-page scans, run page-by-page and concatenate. If output ever repeats/loops on a dense page, add a mild repetition penalty, e.g. --repeat-penalty 1.05, and keep --temp 0.
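
The page-by-page approach can be scripted. This sketch wraps the llama-mtmd-cli invocation from the quick start, using only flags that appear in this card; the helper names are hypothetical, and it assumes the recognized text arrives on stdout and that page files sort correctly by name (for example zero-padded numbers).

```python
import subprocess

PROMPT = "<|grounding|>Convert the document to markdown."


def build_command(image, model, mmproj, cli="./build/bin/llama-mtmd-cli",
                  max_tokens=4096, repeat_penalty=None):
    """Build the argument list for one page; flags mirror the examples in this card."""
    cmd = [cli, "-m", str(model), "--mmproj", str(mmproj),
           "--image", str(image), "-p", PROMPT,
           "--temp", "0", "-n", str(max_tokens)]
    if repeat_penalty is not None:
        # Only added on request, e.g. 1.05 when a dense page starts to loop.
        cmd += ["--repeat-penalty", str(repeat_penalty)]
    return cmd


def parse_pages(page_images, model, mmproj, **kwargs):
    """Run the CLI once per page, in sorted order, and join the results."""
    parts = []
    for image in sorted(page_images):
        result = subprocess.run(build_command(image, model, mmproj, **kwargs),
                                capture_output=True, text=True, check=True)
        parts.append(result.stdout.strip())
    return "\n\n".join(parts)
```

A call such as `parse_pages(["page_001.png", "page_002.png"], "./uocr/Unlimited-OCR-Q4_K_M.gguf", "./uocr/mmproj-Unlimited-OCR-F16.gguf")` reloads the model for every page, so the server route may be preferable for large batches.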

Serving (OpenAI-compatible API)

./build/bin/llama-server \
  -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf \
  -c 8192 --host 0.0.0.0 --port 8080

Call it with an image (base64 data URL):

IMG=$(base64 -w0 document.png)
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "temperature": 0,
  "messages": [{ "role": "user", "content": [
    { "type": "text", "text": "<|grounding|>Convert the document to markdown." },
    { "type": "image_url", "image_url": { "url": "data:image/png;base64,'"$IMG"'" } }
  ]}]
}'

Python (OpenAI SDK) is identical — point base_url at http://localhost:8080/v1, send a text part with the prompt above and an image_url part with the data URL.
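
A minimal sketch of that Python call follows. The message layout matches the curl request above; the helper names, the placeholder API key, and the model name passed to the SDK are assumptions (the SDK requires both values, and this sketch assumes the server was started without an API key and serves its single loaded model).

```python
import base64


def build_messages(image_bytes,
                   prompt="<|grounding|>Convert the document to markdown.",
                   mime="image/png"):
    """Build the chat messages: one text part and one image_url part (data URL)."""
    data_url = f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }]


def ocr_image(path, base_url="http://localhost:8080/v1"):
    """Send one image to the local llama-server and return the generated text."""
    # Imported here so build_messages works even without the SDK installed.
    from openai import OpenAI

    client = OpenAI(base_url=base_url, api_key="not-needed")  # placeholder key (assumption)
    with open(path, "rb") as f:
        messages = build_messages(f.read())
    response = client.chat.completions.create(
        model="unlimited-ocr",  # placeholder name (assumption)
        messages=messages,
        temperature=0,
    )
    return response.choices[0].message.content
```

Usage is then `print(ocr_image("document.png"))` with the server from the previous step running.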

About the model

Architecture: DeepseekOCRForCausalLM — DeepEncoder vision (SAM-ViT-B + CLIP-L/14, 1024×1024 input, 16× downsample) → linear projector → DeepSeek-V2 MoE text decoder (12 layers, hidden 1280, 64 routed + 2 shared experts, 6 experts/token).
Task: multilingual OCR / document parsing — single image, multi-page, and PDF (one-shot long-horizon parsing). The original supports gundam (crop) and base resolution modes.
License: MIT (inherited from the base model).
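
To give a feel for the vision-token budget implied by the architecture line, here is a small arithmetic sketch. It assumes a 16-pixel patch size for the SAM-ViT-B stage (not stated in this card) and reads "16× downsample" as a 16-fold reduction in token count; any separator or newline tokens the runtime adds are not counted.

```python
def vision_tokens(side=1024, patch=16, compression=16):
    """Estimate vision tokens for a square input image.

    patch=16 is an assumption (the usual SAM-ViT-B patch size);
    compression=16 is the 16x downsample named in the architecture line.
    """
    patches = (side // patch) ** 2   # 1024 / 16 = 64 per side -> 4096 patches
    return patches // compression    # 4096 / 16 = 256 tokens
```

Under these assumptions a 1024×1024 page costs about 256 vision tokens, which leaves most of an 8192-token context for the generated text.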

How these were made

Converted baidu/Unlimited-OCR to GGUF with the convert_hf_to_gguf.py from PR #17400. The converter targets DeepSeek-OCR, so the config's top-level architectures field was set to DeepseekOCRForCausalLM and language_config.architectures to DeepseekV2ForCausalLM (the model's tensor layout is otherwise identical to DeepSeek-OCR's).
Exported the text decoder (BF16) and the vision tower (--mmproj, F16) separately.
Built an importance matrix from a general-text corpus and produced the K-/i-quants with llama-quantize.
Verified: the BF16 GGUF + mmproj correctly OCR a test document (text + grounding boxes) via llama-mtmd-cli before quantizing.
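
The quantization step above can be sketched as a loop that generates one llama-quantize command per file. This is an illustration, not the author's actual script: the imatrix file name and the tool path are hypothetical, and passing the imatrix only to the i-quants is an assumption based on the notes in the file table.

```python
QUANTS = ["Q8_0", "Q6_K", "Q5_K_M", "Q5_K_S", "Q4_K_M", "Q4_K_S", "Q3_K_M",
          "IQ4_XS", "IQ4_NL", "IQ3_M", "IQ3_XXS", "IQ2_M"]


def quantize_commands(src="Unlimited-OCR-BF16.gguf", imatrix="imatrix.dat",
                      tool="./build/bin/llama-quantize"):
    """Return one llama-quantize argument list per target quant type."""
    commands = []
    for quant in QUANTS:
        out = src.replace("BF16", quant)  # e.g. Unlimited-OCR-Q4_K_M.gguf
        cmd = [tool]
        if quant.startswith("IQ"):
            # Assumption: only the i-quants receive the importance matrix.
            cmd += ["--imatrix", imatrix]
        cmd += [src, out, quant]
        commands.append(cmd)
    return commands
```

Each list can be handed to `subprocess.run(cmd, check=True)`; the vision projector is not part of this loop because it stays at F16.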

Limitations

Needs the PR #17400 llama.cpp build until DeepSeek-OCR support lands in main.
Very low-bit i-quants (IQ3_XXS, IQ2_M) trade real accuracy for size — prefer Q4_K_M or higher for production OCR.
The vision encoder runs in fp16 regardless of the chosen text quant.

Credits

Base model: baidu/Unlimited-OCR (MIT) — builds on deepseek-ai/DeepSeek-OCR.
GGUF / DeepSeek-OCR llama.cpp support: ggml-org/llama.cpp#17400.
Quantized by sahilchachra.

Author: Sahil Chachra
Username: sahilchachra

Details

Downloads: 35.4K
Likes: 69
Access: Open source
Task: image-text-to-text
Trending: 49
License: MIT
Library: gguf
Created: 23 Jun 2026
Updated: 1 Jul 2026

Languages

multilingual
