
Unlimited-OCR-GGUF

Multimodal · by Sahil Chachra

Sahil Chachra's GGUF quantization of Baidu's Unlimited-OCR, a multilingual vision-language model for document parsing.


Base model

baidu/Unlimited-OCR

Model Description

GGUF quantizations of baidu/Unlimited-OCR, a 3B vision-language OCR model that pushes DeepSeek-OCR one step further (one-shot, long-horizon document parsing). This repo contains a full spread of K-quants and i-quants of the language model plus the vision projector (mmproj) needed for image input.

⚠️ Requires a DeepSeek-OCR–aware llama.cpp build (PR #17400). Unlimited-OCR uses the DeepSeek-OCR architecture (a SAM+CLIP DeepEncoder vision tower with a DeepSeek-V2 MoE text decoder). Support is not yet merged into upstream main — stock llama.cpp will not load these files. Build the PR branch (instructions below).

Files

Every run needs two files: one language model GGUF (pick a quant) plus the shared vision projector. The projector is fp16 and identical for all quants.

File                         Quant    Bits  Size      Notes
Unlimited-OCR-BF16.gguf      BF16     16    5.47 GiB  Full-precision conversion. The base every quant is made from; reference quality.
Unlimited-OCR-Q8_0.gguf      Q8_0     8     2.91 GiB  Near-lossless. Best quality short of BF16; recommended if you have the disk/RAM.
Unlimited-OCR-Q6_K.gguf      Q6_K     6     2.43 GiB  Very high quality, essentially indistinguishable from Q8_0 for OCR.
Unlimited-OCR-Q5_K_M.gguf    Q5_K_M   5     2.07 GiB  High quality. Great balance when you can spare a bit more than Q4.
Unlimited-OCR-Q5_K_S.gguf    Q5_K_S   5     1.95 GiB  High quality, slightly smaller than Q5_K_M.
Unlimited-OCR-Q4_K_M.gguf    Q4_K_M   4     1.82 GiB  Recommended default: best overall size/quality trade-off.
Unlimited-OCR-Q4_K_S.gguf    Q4_K_S   4     1.68 GiB  Slightly smaller than Q4_K_M with a small quality cost.
Unlimited-OCR-Q3_K_M.gguf    Q3_K_M   3     1.45 GiB  Compact. Usable when memory is tight; some quality loss.
Unlimited-OCR-IQ4_XS.gguf    IQ4_XS   4     1.53 GiB  i-quant: smaller than Q4_K_S at similar quality (built with imatrix).
Unlimited-OCR-IQ4_NL.gguf    IQ4_NL   4     1.59 GiB  i-quant (non-linear): 4-bit tuned for ARM/edge; good on Jetson/Apple.
Unlimited-OCR-IQ3_M.gguf     IQ3_M    3     1.35 GiB  i-quant: solid 3-bit quality for the size (imatrix).
Unlimited-OCR-IQ3_XXS.gguf   IQ3_XXS  3     1.24 GiB  i-quant: very small 3-bit; noticeable quality loss but runnable.
Unlimited-OCR-IQ2_M.gguf     IQ2_M    2     1.15 GiB  i-quant: smallest here; experimental, lowest quality, for tight memory only.

Vision projector (required for all of the above):

File                            Type  Size
mmproj-Unlimited-OCR-F16.gguf   F16   774.27 MiB

Sizes are the on-disk GGUF sizes. The vision encoder is kept at F16 (not quantized) — it is small and quantizing it hurts OCR accuracy. i-quants were built with an importance matrix (imatrix) computed from a general-text calibration set.
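Because every run needs one quant plus the projector, the real download is the sum of the two files. The sketch below copies the sizes from the tables above and picks the largest quant that fits a disk budget. The function names are illustrative, not part of any tool. File size is only a rough proxy for quality, and runtime memory is higher than the on-disk size because of the KV cache and compute buffers.

```python
from typing import Optional

# On-disk sizes in GiB, copied from the file table above.
QUANT_SIZES_GIB = {
    "BF16": 5.47, "Q8_0": 2.91, "Q6_K": 2.43, "Q5_K_M": 2.07, "Q5_K_S": 1.95,
    "Q4_K_M": 1.82, "Q4_K_S": 1.68, "Q3_K_M": 1.45, "IQ4_XS": 1.53,
    "IQ4_NL": 1.59, "IQ3_M": 1.35, "IQ3_XXS": 1.24, "IQ2_M": 1.15,
}
MMPROJ_GIB = 774.27 / 1024  # the projector is listed in MiB


def total_download_gib(quant: str) -> float:
    """Size of one language-model quant plus the shared vision projector."""
    return QUANT_SIZES_GIB[quant] + MMPROJ_GIB


def largest_quant_within(budget_gib: float) -> Optional[str]:
    """Largest quant whose total download fits the budget, or None if none fits."""
    fitting = [q for q in QUANT_SIZES_GIB if total_download_gib(q) <= budget_gib]
    return max(fitting, key=QUANT_SIZES_GIB.get) if fitting else None


if __name__ == "__main__":
    print(f"Q4_K_M + mmproj: {total_download_gib('Q4_K_M'):.2f} GiB")
    print("Best fit for 2.5 GiB:", largest_quant_within(2.5))
```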

Build llama.cpp with DeepSeek-OCR support

git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
git fetch origin pull/17400/head:pr17400 && git checkout pr17400
cmake -B build -DCMAKE_BUILD_TYPE=Release        # add -DGGML_CUDA=ON for NVIDIA
cmake --build build -j --target llama-mtmd-cli llama-server

Quick start

Download one quant + the projector (you always need both):

huggingface-cli download sahilchachra/Unlimited-OCR-GGUF \
  --include "Unlimited-OCR-Q4_K_M.gguf" "mmproj-Unlimited-OCR-F16.gguf" --local-dir ./uocr
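The same download can be scripted from Python. This is a minimal sketch that assumes the huggingface_hub package is installed; files_for and download are illustrative helper names, and the file naming pattern is taken from the file table above.

```python
REPO_ID = "sahilchachra/Unlimited-OCR-GGUF"
MMPROJ_FILE = "mmproj-Unlimited-OCR-F16.gguf"


def files_for(quant: str) -> list:
    """The two files every run needs: the chosen quant and the shared projector."""
    return [f"Unlimited-OCR-{quant}.gguf", MMPROJ_FILE]


def download(quant: str = "Q4_K_M", local_dir: str = "./uocr") -> str:
    """Fetch only the two required files and return the local folder path."""
    # Imported lazily so files_for() works without the dependency installed.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=files_for(quant),
        local_dir=local_dir,
    )


if __name__ == "__main__":
    print(download("Q4_K_M"))
```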

Run it on an image:

./build/bin/llama-mtmd-cli \
  -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf \
  --image document.png \
  -p "<|grounding|>Convert the document to markdown." \
  --temp 0

Use --temp 0 for OCR (deterministic). Add -n 4096 (or more) for long/dense documents.
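For batch jobs it can help to wrap the command in a small script. The sketch below builds the same argument list as the command above and runs it with subprocess. build_cmd and run_ocr are hypothetical helpers, the paths are the ones used in this quick start, and it assumes the generated text arrives on stdout while logs go to stderr.

```python
import subprocess

BINARY = "./build/bin/llama-mtmd-cli"
MODEL = "./uocr/Unlimited-OCR-Q4_K_M.gguf"
MMPROJ = "./uocr/mmproj-Unlimited-OCR-F16.gguf"


def build_cmd(image: str, prompt: str, max_tokens: int = 4096) -> list:
    """Argument list mirroring the quick-start command, with -n set explicitly."""
    return [
        BINARY,
        "-m", MODEL,
        "--mmproj", MMPROJ,
        "--image", image,
        "-p", prompt,
        "--temp", "0",          # deterministic decoding for OCR
        "-n", str(max_tokens),  # raise for long or dense documents
    ]


def run_ocr(image: str, prompt: str, max_tokens: int = 4096) -> str:
    """Run the CLI once and return whatever it wrote to stdout (assumed to be the text)."""
    result = subprocess.run(
        build_cmd(image, prompt, max_tokens),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(run_ocr("document.png", "<|grounding|>Convert the document to markdown."))
```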


Prompting guide

Unlimited-OCR uses the DeepSeek-OCR prompt vocabulary. The prompt is just an instruction; prefix it with <|grounding|> whenever you also want bounding boxes for what was read.

Task                                             Prompt (-p)
Document → Markdown (layout-aware, with boxes)   <|grounding|>Convert the document to markdown.
Plain text OCR (just the text, no layout)        Free OCR.
OCR with bounding boxes                          <|grounding|>…
Native Unlimited-OCR parse                       document parsing.
Parse a figure / chart / diagram                 Parse the figure.
Describe the image (general VQA)                 Describe this image in detail.
Find specific text (referring grounding)         <|grounding|>Locate <|ref|>Invoice Number<|/ref|> in the image.

Worked examples

1) Document → clean Markdown (tables, headings, reading order):

./build/bin/llama-mtmd-cli -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf \
  --image invoice.png --temp 0 -n 4096 \
  -p "<|grounding|>Convert the document to markdown."

2) Just the raw text, no layout / no boxes:

./build/bin/llama-mtmd-cli -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf \
  --image receipt.jpg --temp 0 -p "Free OCR."

3) Locate a specific string and get its box:

./build/bin/llama-mtmd-cli -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf \
  --image form.png --temp 0 \
  -p "<|grounding|>Locate <|ref|>Invoice Number<|/ref|> in the image."

Understanding the output (grounding tokens)

With <|grounding|>, the model interleaves the recognized text with detection boxes:

<|det|>title [37, 64, 464, 132]<|/det|>INVOICE #2026-0623
<|det|>text  [37, 194, 350, 247]<|/det|>Bill To: Sahil Chachra
<|det|>text  [37, 483, 329, 543]<|/det|>Total Due: $44.00

Each [x1, y1, x2, y2] is the bounding box (top-left → bottom-right) of that span, in the coordinate space of the model's input image. Drop the <|det|>...<|/det|> tags if you only want the text, or parse them to overlay boxes / build a layout. Without <|grounding|> you get plain text (or Markdown) with no box tags.
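A minimal parsing sketch follows, assuming the output has exactly the shape shown above: a label, then a four-integer box inside the det tags, then the text. Real output may differ in detail, so treat the regular expression as a starting point. Span, parse_grounding and strip_det_tags are illustrative names.

```python
import re
from typing import List, NamedTuple, Tuple

SAMPLE = (
    "<|det|>title [37, 64, 464, 132]<|/det|>INVOICE #2026-0623\n"
    "<|det|>text  [37, 194, 350, 247]<|/det|>Bill To: Sahil Chachra\n"
    "<|det|>text  [37, 483, 329, 543]<|/det|>Total Due: $44.00\n"
)

# label, four integers, then the text up to the next <|det|> tag or end of output
DET_RE = re.compile(
    r"<\|det\|>\s*([^\[\]<>]*?)\s*"
    r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]\s*<\|/det\|>"
    r"(.*?)(?=<\|det\|>|\Z)",
    re.DOTALL,
)


class Span(NamedTuple):
    label: str
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2), top-left to bottom-right
    text: str


def parse_grounding(output: str) -> List[Span]:
    """Split grounded output into (label, box, text) spans."""
    spans = []
    for label, x1, y1, x2, y2, text in DET_RE.findall(output):
        spans.append(Span(label, (int(x1), int(y1), int(x2), int(y2)), text.strip()))
    return spans


def strip_det_tags(output: str) -> str:
    """Drop every <|det|>...<|/det|> tag and keep only the recognized text."""
    return re.sub(r"<\|det\|>.*?<\|/det\|>", "", output, flags=re.DOTALL)


if __name__ == "__main__":
    for span in parse_grounding(SAMPLE):
        print(span.label, span.box, span.text)
```

parse_grounding is the piece to build on for overlays or layout reconstruction; strip_det_tags covers the text-only case.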

Tip — long documents: Unlimited-OCR targets one-shot long-horizon parsing. For multi-page scans, run page-by-page and concatenate. If output ever repeats/loops on a dense page, add a mild repetition penalty, e.g. --repeat-penalty 1.05, and keep --temp 0.
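The page-by-page approach with a retry on looping output could be sketched as follows. The loop check (the last few non-empty lines are all identical) is a crude heuristic chosen for illustration, not something the model card prescribes. ocr_page stands for any callable you supply that OCRs one page, for example a wrapper around llama-mtmd-cli that appends --repeat-penalty when a value is passed.

```python
from typing import Callable, Iterable, Optional

# Signature assumed for the caller-supplied function: (image_path, repeat_penalty) -> text
OcrFn = Callable[[str, Optional[float]], str]


def looks_looped(text: str, min_repeats: int = 4) -> bool:
    """Heuristic: True if the last `min_repeats` non-empty lines are all the same."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < min_repeats:
        return False
    return len(set(lines[-min_repeats:])) == 1


def ocr_pages(pages: Iterable[str], ocr_page: OcrFn,
              retry_penalty: float = 1.05, separator: str = "\n\n") -> str:
    """OCR each page separately, retrying once with a mild penalty if output loops."""
    parts = []
    for path in pages:
        text = ocr_page(path, None)  # first attempt: no repetition penalty
        if looks_looped(text):
            text = ocr_page(path, retry_penalty)
        parts.append(text.strip())
    return separator.join(parts)
```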


Serving (OpenAI-compatible API)

./build/bin/llama-server \
  -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf \
  -c 8192 --host 0.0.0.0 --port 8080

Call it with an image (base64 data URL):

IMG=$(base64 -w0 document.png)   # GNU coreutils; on macOS use: base64 -i document.png
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "temperature": 0,
  "messages": [{ "role": "user", "content": [
    { "type": "text", "text": "<|grounding|>Convert the document to markdown." },
    { "type": "image_url", "image_url": { "url": "data:image/png;base64,'"$IMG"'" } }
  ]}]
}'

The Python OpenAI SDK works the same way: point base_url at http://localhost:8080/v1, then send one text part with the prompt above and one image_url part with the data URL.
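A minimal Python sketch of that request, assuming the openai package is installed and llama-server is running as started above. The model name and API key below are placeholders on the assumption that the local server does not check them, and build_messages and ocr_image are illustrative helper names.

```python
import base64


def build_messages(image_bytes: bytes, prompt: str, mime: str = "image/png") -> list:
    """One user message holding a text part and a base64 data-URL image part."""
    data_url = f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }]


def ocr_image(path: str,
              prompt: str = "<|grounding|>Convert the document to markdown.",
              base_url: str = "http://localhost:8080/v1") -> str:
    """Send one image to the local server and return the generated text."""
    from openai import OpenAI  # pip install openai

    client = OpenAI(base_url=base_url, api_key="not-needed")  # placeholder key
    with open(path, "rb") as f:
        messages = build_messages(f.read(), prompt)
    response = client.chat.completions.create(
        model="unlimited-ocr",  # placeholder name (assumption: not validated locally)
        messages=messages,
        temperature=0,
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(ocr_image("document.png"))
```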

About the model

  • Architecture: DeepseekOCRForCausalLM. DeepEncoder vision (SAM-ViT-B + CLIP-L/14, 1024×1024 input, 16× downsample) → linear projector → DeepSeek-V2 MoE text decoder (12 layers, hidden 1280, 64 routed + 2 shared experts, 6 experts/token).
  • Task: multilingual OCR / document parsing — single image, multi-page, and PDF (one-shot long-horizon parsing). The original supports gundam (crop) and base resolution modes.
  • License: MIT (inherited from the base model).

How these were made

  1. Converted baidu/Unlimited-OCR to GGUF with the PR #17400 convert_hf_to_gguf.py. The converter targets DeepSeek-OCR, so the config's top-level architectures field was set to DeepseekOCRForCausalLM and language_config.architectures to DeepseekV2ForCausalLM; apart from these names, the model's tensor layout is identical to DeepSeek-OCR's.
  2. Exported the text decoder (BF16) and the vision tower (--mmproj, F16) separately.
  3. Built an importance matrix from a general-text corpus and produced the K-/i-quants with llama-quantize.
  4. Verified: the BF16 GGUF + mmproj correctly OCR a test document (text + grounding boxes) via llama-mtmd-cli before quantizing.

Limitations

  • Needs the PR #17400 llama.cpp build until DeepSeek-OCR support lands in main.
  • Very low-bit i-quants (IQ3_XXS, IQ2_M) trade real accuracy for size — prefer Q4_K_M or higher for production OCR.
  • The vision encoder runs in fp16 regardless of the chosen text quant.

Credits

Author: Sahil Chachra (sahilchachra)

Details

Downloads: 35.4K
Likes: 69
Access: Open Source
Task: image-text-to-text
Trending: 49
License: mit
Library: gguf
Created: Jun 23, 2026
Updated: Jul 1, 2026
Languages: multilingual