ByteDance study finds that asking LMMs questions beats making them transcribe text for long-document training
Multimodal AI models are supposed to handle ever-longer documents, but how they’re trained to do so usually stays a trade secret. A new study shows that character recognition as a training task actually hurts performance and that question-answer pairs work far better.
Researchers from ByteDance Seed and the Hong Kong University of Science and Technology (HKUST) studied how image-language models can be trained efficiently on long documents. The result is a model called MMProLong, built on Alibaba’s open Qwen2.5-VL, that beats much larger competitors.
Modern multimodal AI models need to handle increasingly long inputs: entire PDF collections of rendered pages, hours of video, or agents that remember their tasks across many steps. AI labs like OpenAI, Google, and Alibaba tout context windows of up to 1 million tokens, capable of holding not just text but thousands of page images or video frames. But according to the authors, technical reports barely reveal what data a model should see and in what mix.
Asking questions teaches more than transcribing text
At first glance, the study’s central finding seems obvious. For a multimodal model to learn to find the right spot in a 100-page document, having it transcribe the text of every page barely helps. It’s more effective to ask questions whose answers are buried somewhere in those pages.
The researchers tested both approaches head-to-head. In one setup, the model had to perform text recognition either across all pages of a document or for a few selected pages, while the remaining pages stayed in context as distractions.
In the other setup, the researchers used a separate model (Seed 2.0 from ByteDance) to generate question-answer pairs for individual sections of a document. The question then went into training alongside the entire document, forcing the model to locate the relevant passage within a long context.
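The difference between the two setups can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' pipeline: the `Page` and `Sample` types, the function names, and the prompt wording are all hypothetical, and the question-answer pair is assumed to come from a separate generator model, as described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    image_path: str  # rendered page image (hypothetical field)
    text: str        # ground-truth page text (hypothetical field)


@dataclass
class Sample:
    images: List[str]  # every page of the document stays in context
    prompt: str
    target: str


def transcription_sample(pages: List[Page], target_indices: List[int]) -> Sample:
    """Text-recognition task: transcribe the selected pages.

    The unselected pages remain in the context as distractors.
    """
    wanted = ", ".join(str(i + 1) for i in target_indices)
    prompt = f"Transcribe the text of page(s) {wanted}."  # assumed wording
    target = "\n\n".join(pages[i].text for i in target_indices)
    return Sample([p.image_path for p in pages], prompt, target)


def qa_sample(pages: List[Page], question: str, answer: str) -> Sample:
    """Question-answer task: the question was written for one section,
    but the model sees the whole document and must locate that section."""
    return Sample([p.image_path for p in pages], question, answer)
```

In both cases the model receives the same page images. Only the prompt and the target differ, which is what makes the head-to-head comparison possible.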
Pure text recognition as a training task actually worsened performance compared to the starting point. Question-answer training, on the other hand, brought clear gains. The model only learns to navigate long texts when it has to filter out and categorize information with a specific goal.
Diversity beats specialization
Three more findings turned up in the experiments. Feeding the model mainly very long documents at the top end of the context window isn’t worth it. A broader mix of shorter and longer examples works more reliably. Long-context ability isn’t a skill tied to a specific length but requires flexible searching across different distances.
The real bottleneck also turns out to be finding the relevant passage, not reasoning about it. A mix weighted toward extraction tasks with a smaller share of calculation tasks delivered the best results.
The third finding is surprising because it contradicts common practice with text-only language models. Adding short training examples doesn’t appear strictly necessary. The model largely kept its short-task abilities even when trained only on long question-answer data. The format of the data itself probably helps: even when the context is very long, the task is still framed as a question-answer interaction in the familiar instruction-following format.
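The first two findings amount to a sampling plan for the training data. The Python sketch below shows one way such a plan could look. The weights and length buckets are placeholders chosen for illustration, not figures from the study, and `sample_mix` is a hypothetical helper.

```python
import random
from typing import List, Tuple

# Assumed values for illustration only; the article gives no exact ratios.
TASK_WEIGHTS = {"extraction": 0.7, "calculation": 0.3}
LENGTH_BUCKETS = [8_000, 32_000, 64_000, 128_000]  # context lengths in tokens


def sample_mix(n: int, seed: int = 0) -> List[Tuple[str, int]]:
    """Draw n (task, context_length) pairs for a training mix.

    Tasks are weighted toward extraction; lengths are spread across
    short and long buckets instead of clustering at the maximum.
    """
    rng = random.Random(seed)
    tasks = list(TASK_WEIGHTS)
    weights = [TASK_WEIGHTS[t] for t in tasks]
    return [
        (rng.choices(tasks, weights=weights)[0], rng.choice(LENGTH_BUCKETS))
        for _ in range(n)
    ]
```

Each drawn pair would then select a question type and a document of roughly the given length when assembling a training example.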
Small but stable up to 512,000 tokens
With this recipe and a fairly modest training budget, MMProLong beats several much larger open models like InternVL3-38B and Gemma3-27B. The model was trained with contexts of only 128,000 tokens but stays stable at input lengths of 256,000 and even 512,000 tokens, while the original model degrades sharply in those ranges.
This ability also transfers to tasks the model was never specifically trained on, like understanding long videos. In an extra transfer experiment, the recipe proved effective on the stronger Qwen3-VL-8B too, even though that model is already built for long contexts.
The study is also interesting because it comes from an entirely different camp than DeepSeek's widely discussed work on the same problem. DeepSeek tries to extend the long-term memory of AI models by processing texts as images and compressing them heavily, most recently with an encoder that re-sorts visual information by content. ByteDance Seed takes the opposite approach: optimize the training data instead of the architecture.