For developers and artists building visual AI tools, Zyphra has just released Zamba2-VL, a new family of open vision-language models designed to drastically speed up generation. The release includes three variants: 1.2B, 2.7B, and 7B parameters. All models utilise a hybrid SSM–Transformer backbone known as Zamba2.
Vision-language models (VLMs) process images and text simultaneously to answer questions about charts, documents, and photographs. While most open-source alternatives rely on dense Transformers for language processing, Zamba2-VL substitutes this with a hybrid state-space design. The objective is to maintain competitive accuracy while significantly reducing latency.
What is Zamba2-VL
The architecture adheres to the established LLaVA-style template for VLMs. A pre-trained vision encoder converts image patches into feature vectors. A lightweight MLP adapter then maps these features into the language model’s space. The language model subsequently processes an interleaved sequence of visual and textual tokens. The system supports both single and multi-image analysis, as well as object grounding.
Zyphra couples each Zamba2 backbone with the Vision Transformer from Qwen2.5-VL. This encoder was selected for two specific technical properties: it employs 2D rotary position embeddings and supports native dynamic-resolution processing. A two-layer MLP adapter bridges the encoder and the main backbone.
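The encoder-to-adapter-to-backbone flow described above can be sketched with toy dimensions. This is a minimal illustration, not Zyphra's implementation: the sizes, the GELU activation, the random weights, and the names `mlp_adapter`, `make_linear`, `VISION_DIM`, and `LM_DIM` are all assumptions made for the example.

```python
import math
import random

def gelu(x: float) -> float:
    # tanh approximation of GELU; the activation choice is an assumption
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def linear(vec, weight, bias):
    # weight is laid out as [out_dim][in_dim]
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]

def make_linear(in_dim, out_dim, rng):
    # random toy weights standing in for trained parameters
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def mlp_adapter(patch_features, layer1, layer2):
    # two-layer MLP: vision feature space -> language model embedding space
    projected = []
    for feat in patch_features:
        hidden = [gelu(h) for h in linear(feat, *layer1)]
        projected.append(linear(hidden, *layer2))
    return projected

rng = random.Random(0)
VISION_DIM, LM_DIM = 8, 16  # hypothetical toy sizes

layer1 = make_linear(VISION_DIM, LM_DIM, rng)
layer2 = make_linear(LM_DIM, LM_DIM, rng)

# stand-in for the vision encoder output: 4 image patches
patch_features = [[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(4)]
vision_tokens = mlp_adapter(patch_features, layer1, layer2)

# stand-ins for text embeddings before and after the image in the prompt
text_before = [[0.0] * LM_DIM for _ in range(3)]
text_after = [[0.0] * LM_DIM for _ in range(2)]

# the language model sees one interleaved sequence in a shared embedding space
sequence = text_before + vision_tokens + text_after
print(len(sequence), len(sequence[0]))
```

The point of the sketch is that, after the adapter, visual and textual tokens have the same width, so the backbone can process them as a single sequence.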
The Architecture
The Zamba2 backbone is where the design diverges from standard VLMs. It combines Mamba2 state-space layers with shared Transformer blocks. The Mamba2 layers operate in linear time using a fixed-size state, while a small number of shared attention layers are interspersed between them. Because the shared blocks reuse the same weights at several depths, each invocation applies its own LoRA adapter, which lets the shared weights specialise per position.
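One way to picture the interleaving is as a layer schedule. The sketch below is illustrative only: the layer counts, the interval, the number of shared blocks, and the names `LayerSpec` and `build_layout` are hypothetical and do not reflect the published Zamba2 configuration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LayerSpec:
    kind: str          # "mamba2" or "shared_attention"
    shared_block: int  # which shared weight set is reused (-1 for mamba2)
    lora_id: int       # LoRA adapter specific to this invocation (-1 for mamba2)

def build_layout(num_mamba_layers: int, attention_every: int, num_shared_blocks: int):
    """Insert a shared attention block before every `attention_every`-th Mamba2 layer."""
    layout = []
    invocation = 0
    for i in range(num_mamba_layers):
        if i > 0 and i % attention_every == 0:
            # the same weights are reused cyclically, but each call gets its own LoRA
            layout.append(
                LayerSpec("shared_attention", invocation % num_shared_blocks, invocation)
            )
            invocation += 1
        layout.append(LayerSpec("mamba2", -1, -1))
    return layout

# hypothetical toy configuration
layout = build_layout(num_mamba_layers=12, attention_every=3, num_shared_blocks=2)
for spec in layout:
    print(spec)
```

In this toy layout, shared block 0 is invoked twice but with two different LoRA ids, which is the mechanism that lets shared weights behave differently at different depths.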
The Mamba2 layers handle the bulk of computation efficiently. The shared attention layers preserve the in-context retrieval capabilities that pure-SSM models typically sacrifice. This hybrid approach trades the full expressivity of attention for the efficiency of state-space models.
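The fixed-size state can be illustrated with a deliberately simplified recurrence. This is a time-invariant diagonal state-space scan, not Mamba2 itself (Mamba2 uses input-dependent parameters and optimised kernels); the function name `ssm_scan` and all coefficients are assumptions chosen for the example.

```python
def ssm_scan(inputs, decay, in_proj, out_proj):
    """Run h_t = decay * h_{t-1} + in_proj * x_t and y_t = out_proj . h_t over a sequence."""
    state = [0.0] * len(decay)  # fixed-size state, independent of sequence length
    outputs = []
    for x in inputs:  # one constant-cost update per token: linear time overall
        state = [a * h + b * x for a, h, b in zip(decay, state, in_proj)]
        outputs.append(sum(c * h for c, h in zip(out_proj, state)))
    return outputs, state

# hypothetical toy coefficients for a 3-dimensional state
decay = [0.9, 0.5, 0.1]
in_proj = [1.0, 1.0, 1.0]
out_proj = [1.0, 1.0, 1.0]

short_out, short_state = ssm_scan([1.0] * 10, decay, in_proj, out_proj)
long_out, long_state = ssm_scan([1.0] * 10_000, decay, in_proj, out_proj)

# the state carried between tokens has the same size for both sequences
print(len(short_state), len(long_state))
```

Everything the layer remembers about the past must fit in that state, which is why pure-SSM models tend to struggle with exact in-context retrieval and why the shared attention layers are kept.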
Zamba2-VL utilises the Mistral v0.1 tokenizer. It was trained on 100B tokens comprising vision-text and pure-text data sourced from open web datasets.
Model Quality and Benchmarks
The research team assessed Zamba2-VL across 14 benchmarks covering chart, diagram, and document analysis. They also evaluated general perception, reasoning, and visual counting. All scores derive from Zyphra’s evaluation harness, built upon VLMEvalKit. The report compares performance against the Molmo2, Qwen3-VL, and InternVL3.5 families.
| Eval | Zamba2-VL-2.7B | InternVL3.5-2B | Qwen3-VL-2B | Molmo2-4B | Qwen3-VL-4B |
|---|---|---|---|---|---|
| DocVQA (test) | 90.9 | 89.4 | 93.3 | 87.8 | 95.3 |
| ChartQA (test) | 79.6 | 81.6 | 78.7 | 86.1 | 81.8 |
| OCRBench | 73.6 | 83.4 | 84.1 | 62.0 | 84.1 |
| CountBenchQA | 87.5 | 70.0 | 87.9 | 91.2 | 87.3 |
| PixMoCount (test) | 82.5 | 32.8 | 55.7 | 87.0 | 89.2 |
| MMMU (val) | 37.7 | 49.9 | 40.9 | 48.8 | 51.4 |
| MathVista (mini) | 51.0 | 61.4 | 51.8 | 56.5 | 63.6 |
InternVL3.5-2B and Qwen3-VL-2B occupy a similar size class, whereas Molmo2-4B and Qwen3-VL-4B are larger.
The performance pattern is mixed. Counting is the model’s strongest area: Zyphra reports that Zamba2-VL-1.2B scores 62.5 on PixMoCount, compared with 32.8 for InternVL3.5-1B and 17.7 for PerceptionLM-1B. Document understanding is also solid, with the 2.7B model scoring 90.9 on DocVQA. However, the 2.7B model trails both the similar-size and the larger baselines on knowledge-heavy reasoning tasks such as MMMU and MathVista.
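To make the mixed pattern concrete, the short script below computes the gap between Zamba2-VL-2.7B and the better of the two similar-size baselines on each benchmark. The scores are copied from the table above; the helper name `gap_to_best_peer` is just for illustration.

```python
# (Zamba2-VL-2.7B, InternVL3.5-2B, Qwen3-VL-2B), copied from the table above
SCORES = {
    "DocVQA (test)": (90.9, 89.4, 93.3),
    "ChartQA (test)": (79.6, 81.6, 78.7),
    "OCRBench": (73.6, 83.4, 84.1),
    "CountBenchQA": (87.5, 70.0, 87.9),
    "PixMoCount (test)": (82.5, 32.8, 55.7),
    "MMMU (val)": (37.7, 49.9, 40.9),
    "MathVista (mini)": (51.0, 61.4, 51.8),
}

def gap_to_best_peer(zamba: float, *peers: float) -> float:
    # positive means Zamba2-VL-2.7B is ahead of the best similar-size baseline
    return round(zamba - max(peers), 1)

gaps = {name: gap_to_best_peer(*row) for name, row in SCORES.items()}
leads = [name for name, gap in gaps.items() if gap > 0]

for name, gap in gaps.items():
    print(f"{name:<20} {gap:+.1f}")
print("Ahead of both 2B baselines on:", leads)
```

This only restates the table as differences; it says nothing about latency, which is the axis the model is designed to win on.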
Why Inference is Faster
Inference speed is where Zamba2-VL delivers its primary advantage. Transformer attention scales quadratically with sequence length. Multimodal inputs expand these sequences rapidly; a single high-resolution image can introduce several thousand vision tokens, while a short video clip can generate tens of thousands.
Zamba2-VL avoids most of the growing KV cache inherent to attention, since only its few shared attention layers keep one. From its Mamba2 layers it inherits near-linear-time prefill and a fixed-size recurrent state. On a 32k-token prefill, Zyphra reports that it leads the score-versus-TTFT (time-to-first-token) plot: no Transformer-based VLM in the comparison matched its score at similar latency, and the latency gap is at least an order of magnitude.
The efficiency advantage is most pronounced at the 1.2B and 2.7B scales, the range targeted for on-device and edge deployment.
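A back-of-the-envelope calculation shows why long multimodal sequences are costly for attention. The Transformer configuration and the recurrent state size below are hypothetical values chosen for illustration, not the dimensions of Zamba2-VL or of any baseline in the report.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    # keys and values (factor 2) are stored per token, per layer, per KV head
    return 2 * seq_len * num_layers * num_kv_heads * head_dim * bytes_per_value

def causal_attention_pairs(seq_len):
    # each token attends to itself and every earlier token: n(n+1)/2 pairs
    return seq_len * (seq_len + 1) // 2

def recurrent_state_bytes(num_layers, state_values_per_layer, bytes_per_value=2):
    # a fixed-size state does not depend on the sequence length at all
    return num_layers * state_values_per_layer * bytes_per_value

# hypothetical dense Transformer configuration (fp16 values)
CONFIG = dict(num_layers=32, num_kv_heads=8, head_dim=128)
# hypothetical recurrent state: 64k values per layer
STATE = dict(num_layers=32, state_values_per_layer=65_536)

for n in (1_000, 8_000, 32_000):
    cache_mib = kv_cache_bytes(n, **CONFIG) / 2**20
    state_mib = recurrent_state_bytes(**STATE) / 2**20
    pairs = causal_attention_pairs(n)
    print(f"{n:>6} tokens: KV cache {cache_mib:8.1f} MiB, "
          f"recurrent state {state_mib:5.1f} MiB, attention pairs {pairs:,}")
```

The cache grows linearly with sequence length and the number of attention pairs grows quadratically, while the recurrent state stays constant. A hybrid model sits in between, because its few attention layers still keep a cache.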
Use Cases With Examples
The practical value shows up in specific workflows. Document and form extraction, such as invoice parsing or receipt digitisation at scale, builds on the strong DocVQA results. Retail and inventory counting aligns with the PixMoCount and CountBenchQA strengths. Grounding support enables pointing to objects within product or UI images. On-device assistants benefit from the low time-to-first-token, with the 1.2B model targeting phones and edge boxes. Long visual inputs, such as multi-page PDFs, gain the most from linear-time prefill.
Getting Started
The three models are available in the Zyphra Zamba2-VL collection on Hugging Face. Inference runs via Zyphra’s transformers fork, based on transformers v4.57.1. Optimised Mamba2 kernels require a CUDA GPU for optimal latency.
Install the fork and its core dependencies:
pip install "transformers @ git+https://github.com/Zyphra/transformers.git@zamba2-vl"
pip install qwen-vl-utils==0.0.2
pip install flash_attn
Optimised Mamba2 kernels require two additional packages:
pip install --no-build-isolation "causal-conv1d @ git+https://github.com/Zyphra/z-causal-conv1d.git@zamba2-vl"
pip install --no-build-isolation "mamba-ssm @ git+https://github.com/Zyphra/mamba.git@zamba2-vl"
Then load the model and run a single-image query:




