Vision-capable LLMs vs. OCR for long-document (including charts, images, tables, etc.) QA [D]

I benchmarked vision-capable LLMs (the "just attach the PDF and let the model read it" pattern) against OCR-based pipelines on 30 long, image-heavy PDFs from MMLongBench-Doc (https://github.com/mayubo2333/MMLongBench-Doc). There were 171 questions in total, using Claude Sonnet 4.5 as the LLM.

Post-retry results:

Approach	Accuracy	$/query
LlamaCloud premium + full-context	59.6%	$0.1885
Azure premium + full-context	58.5%	$0.2051
Azure basic + full-context	54.4%	$0.1062
Agentic RAG	53.2%	$0.0827
Native PDF (vision LLM)	52.0%	$0.2552
LlamaCloud basic + full-context	50.9%	$0.1049

Native PDF came 5th of 6 on accuracy and was the most expensive arm at $0.2552 per query.

Two findings:

Vision underperformed on chart-heavy and table-heavy pages, the territory that the "vision LLMs make OCR obsolete" claim most often points to. Premium OCR with layout extraction held up better there.

The native-PDF arm had a 7% intrinsic failure rate (related to PDF file size) that survived retries. There were 27 first-pass failures, with 5 attempts of exponential backoff per failed query. Fifteen recovered, and 12 stayed permanently broken. These were concentrated in two specific PDFs that fail for predictable transport-layer reasons (the blog identifies them). OCR-based arms had a 0% intrinsic failure rate after retries.

Caveats: 30 docs is a small sample. I ran McNemar’s pairwise test to determine which gaps are real and which are within noise. Only 3 of 15 head-to-head gaps are statistically distinguishable at α = 0.05, so the order in the table is partly noise. The vision-versus-OCR finding survives the test.

Full writeup: https://www.surfsense.com/blog/agentic-rag-vs-long-context-llms-benchmark

submitted by /u/Uiqueblhats

Source Read original →

Vision-capable LLMs vs. OCR for long-document (including charts, images, tables, etc.) QA [D]

Empowering Businesses with AI: Smart Tools, Smarter Business Decisions.

follow us

Popular Tag

Popular Post

NVIDIA’s Cosmos-Framework Tutorial: Designing…

Hackers can use 9…

AI chip maker SambaNova…