While the latest wave of AI reasoning breakthroughs has relied on throwing billions of parameters at hard problems, VibeThinker-3B suggests that a different approach can work. Created by researchers at Sina Weibo Inc. in China, this compact 3-billion-parameter model demonstrates that a carefully post-trained small model can rival massive counterparts on specific tasks.
Released under an open-source MIT licence, VibeThinker-3B matches the performance of models hundreds of times its size on verifiable challenges, including mathematics, coding, and STEM subjects. For makers and artists building tools, this means high-fidelity reasoning engines are now viable for local deployment and cost-sensitive applications without needing massive cloud clusters.
What is VibeThinker-3B?
VibeThinker-3B is a dense model constructed upon the Qwen2.5-Coder-3B base. It was not trained from scratch but post-trained using supervised fine-tuning, reinforcement learning, and self-distillation.
The training methodology continues the Spectrum-to-Signal Principle (SSP) established in the earlier VibeThinker-1.5B. Supervised Fine-Tuning (SFT) creates a broad range of valid reasoning paths, termed the ‘Spectrum.’ Reinforcement Learning then amplifies only the correct paths, creating the ‘Signal.’
The model is a specialist by design, targeting reasoning tasks where a verifier can confirm the answer. The research team advises using larger general models for open-domain knowledge retrieval. VibeThinker-3B focuses strictly on the reasoning aspect.
It runs on standard infrastructure. The model weights require transformers>=4.54.0. For faster inference, the team recommends vLLM==0.10.1 or SGLang>=0.4.9.post6. The BF16 weights are approximately 6 GB, fitting comfortably on a single GPU.
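The 6 GB figure follows from simple arithmetic, since BF16 stores each parameter in 2 bytes. A minimal sketch, assuming a round 3 billion parameters (the real count will differ slightly) and counting the weights only, not the KV cache or activations; the helper name is illustrative:

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the raw weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

# Assumption: exactly 3e9 parameters; BF16 uses 2 bytes per parameter.
print(weight_memory_gb(3e9))  # 6.0
```

Long reasoning traces also need room for the KV cache, so the practical memory requirement is higher than the weight size alone.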
Benchmark Performance
On the AIME26 benchmark, VibeThinker-3B scores 94.3. According to the technical report, this is comparable to DeepSeek V3.2 (671B) and Kimi K2.5 (1T).
On LiveCodeBench v6, it achieves 80.2 Pass@1. On OJBench, a coding benchmark, it scores 38.6, which is lower than the largest models. On HMMT25 it scores 89.3, and on BruMO25 it reaches 93.8. On IMO-AnswerBench, a set of 400 problems at the level of the International Mathematical Olympiad, it scores 76.4.
The following table compares it against significantly larger reasoning models. The row marked ‘+CLR’ applies the report’s test-time scaling method, Claim-Level Reliability Assessment (CLR).
| Model | Params | AIME26 | HMMT25 | IMO-Ans | LCBv6 | GPQA-D |
|---|---|---|---|---|---|---|
| VibeThinker-3B | 3B | 94.3 | 89.3 | 76.4 | 80.2 | 70.2 |
| VibeThinker-3B +CLR | 3B | 97.1 | 95.4 | 80.6 | — | 72.9 |
| GPT-OSS (high) | 120B | 93.2 | 90.0 | 75.6 | 81.9 | 80.1 |
| DeepSeek V3.2 | 671B | 94.2 | 90.2 | 78.3 | 80.8 | 82.4 |
| GLM-5 | 744B | 95.8 | 97.9 | 82.5 | 85.5 | 86.0 |
| Kimi K2.5 | 1T | 93.3 | 95.4 | 81.8 | 85.0 | 87.6 |
The pattern is consistent. On verifiable mathematics and code, the 3B model sits near the top tier. On GPQA-Diamond, a knowledge-heavy benchmark, it scores 70.2 against 80.1 to 87.6 for the larger models, so the gap remains visible.
The team also conducted an out-of-distribution coding test using recent LeetCode weekly and biweekly contests from April 25 to May 31, 2026. The model passed 123 of 128 first-attempt Python submissions, achieving a 96.1% acceptance rate on unseen problems.
Inside the Spectrum-to-Signal Pipeline
The post-training pipeline consists of four stages, each addressing a specific weakness of small reasoning models.
First is curriculum-based two-stage SFT. Stage 1 covers math, code, STEM, dialogue, and instruction following broadly. Stage 2 shifts to harder, longer-horizon samples filtered by reasoning length and difficulty. Diversity-Exploring Distillation preserves multiple valid solution paths through both stages.
Second is multi-domain Reasoning RL. The team reuses MaxEnt-Guided Policy Optimization (MGPO). MGPO weights prompts near the model’s current capability boundary, where correct and incorrect rollouts coexist. Training runs sequentially across Math, Code, and STEM.
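The boundary-weighting idea can be illustrated with a small sketch. It assumes each prompt's difficulty is estimated from the pass rate of its sampled rollouts, and that the weight decays with the KL divergence between that pass rate and 0.5, the point where correct and incorrect rollouts are evenly mixed. The function names, the `lam` sharpness parameter, and the exact formula are illustrative assumptions, not the report's implementation.

```python
import math

def entropy_deviation(p: float, p0: float = 0.5, eps: float = 1e-6) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(p0)."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0) for all-correct / all-wrong prompts
    return p * math.log(p / p0) + (1.0 - p) * math.log((1.0 - p) / (1.0 - p0))

def prompt_weight(num_correct: int, num_rollouts: int, lam: float = 4.0) -> float:
    """Weight in (0, 1]; largest when about half of the rollouts are correct."""
    p = num_correct / num_rollouts
    return math.exp(-lam * entropy_deviation(p))

# Hypothetical group of 16 rollouts per prompt.
for correct in (0, 4, 8, 12, 16):
    print(correct, round(prompt_weight(correct, 16), 3))
```

Under these assumptions, prompts the model always solves or always fails contribute little, while prompts it solves about half the time dominate the update.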
A notable detail is that VibeThinker-3B drops progressive context expansion. The researchers found that high-truncation warm-ups hurt long reasoning at this scale. Consequently, RL uses a single 64K long-context window throughout.
Math RL adds a Long2Short stage. It redistributes reward among correct trajectories by length. Shorter correct answers receive higher reward, while longer ones receive lower, keeping the group mean unchanged. The goal is to reduce redundant tokens without losing accuracy.
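One way to picture the Long2Short redistribution is a zero-mean bonus applied only to correct rollouts. The sketch below is an assumption about how such a scheme could look, not the report's formula: the function name, the `alpha` strength, and the linear scaling by length are all hypothetical.

```python
def long2short_rewards(rewards, lengths, correct, alpha=0.2):
    """Shift reward toward shorter correct rollouts; the group mean is preserved."""
    idx = [i for i, ok in enumerate(correct) if ok]
    out = list(rewards)
    if len(idx) < 2:
        return out  # nothing to redistribute
    mean_len = sum(lengths[i] for i in idx) / len(idx)
    spread = max(abs(lengths[i] - mean_len) for i in idx)
    if spread == 0:
        return out  # all correct rollouts have the same length
    for i in idx:
        # Zero-mean bonus in [-alpha, alpha]: positive below the mean length.
        out[i] += alpha * (mean_len - lengths[i]) / spread
    return out

rewards = [1.0, 1.0, 1.0, 0.0]           # last rollout is incorrect
lengths = [2000, 4000, 6000, 5000]       # token counts, illustrative
correct = [True, True, True, False]
print(long2short_rewards(rewards, lengths, correct))
```

Because the bonuses sum to zero across the correct rollouts, the total reward of the group is unchanged, and incorrect rollouts are left untouched.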
Third, Offline Self-Distillation merges the RL checkpoints back into one student model. Fourth, Instruct RL improves instruction adherence. This stage explains the 93.4 IFEval and 74.5 IFBench scores, showing that reasoning tuning did not break controllability.
CLR: Scaling at Test Time, Not Parameter Count
Claim-Level Reliability Assessment (CLR) is the report’s test-time scaling method. It runs on answer-verifiable tasks and adds no parameters.
The procedure has two steps. The model first generates K = 32 trajectories per problem. From each, it extracts M = 5 decision-relevant claims plus a final answer.
The model then acts as its own verifier. It validates or falsifies each claim, producing binary verdicts. CLR maps these into a nonlinear trajectory reliability score, where one weak claim sharply lowers the weight.
Answers are clustered by equivalence, and the highest reliability-weighted answer wins. The full flow runs 8 times, and the averaged Pass@1 is reported. CLR lifts AIME26 to 97.1 and BruMO25 to 99.2.
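The voting step can be sketched in a few lines. This is a toy illustration under stated assumptions: the multiplicative `penalty` stands in for the report's nonlinear reliability mapping, `canonical` stands in for a real answer-equivalence check, and the claim verdicts are hard-coded rather than produced by the model.

```python
from collections import defaultdict

def trajectory_reliability(verdicts, penalty=0.1):
    """Map binary claim verdicts to a score in (0, 1].

    Assumption: every falsified claim multiplies the score by `penalty`,
    so a single failed claim already cuts the weight tenfold.
    """
    score = 1.0
    for ok in verdicts:
        if not ok:
            score *= penalty
    return score

def clr_vote(trajectories, canonical=str.strip):
    """Pick the answer whose equivalence cluster has the largest total reliability.

    `trajectories` is a list of (final_answer, claim_verdicts) pairs.
    """
    totals = defaultdict(float)
    for answer, verdicts in trajectories:
        totals[canonical(answer)] += trajectory_reliability(verdicts)
    return max(totals, key=totals.get)

samples = [
    ("42", [True, True, True, True, True]),
    ("41", [True, False, True, True, True]),
    ("41 ", [True, True, False, True, True]),
    ("41", [False, True, True, True, True]),
]
print(clr_vote(samples))  # 42
```

A plain majority vote would return 41 here. With reliability weighting, the single fully verified trajectory (weight 1.0) outweighs three trajectories that each contain a falsified claim (total weight 0.3).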
Use Cases With Examples
The research team frames VibeThinker-3B as a specialist, so use cases follow the verifiable-reasoning boundary.
- Competitive math tutoring: It solves AIME and HMMT-style problems with full chains of reasoning. A study tool could generate worked solutions and self-check answers locally.
- Algorithmic coding help: The 96.1% LeetCode acceptance rate suggests strong one-shot Python generation. An IDE assistant could draft contest-style solutions and run hidden tests.
- Cost-sensitive RL or agent backends: A 3B model is cheap to serve at scale. Teams running many verifiable subtasks could route them here instead of a 600B+ model.
- On-device reasoning: BF16 weights fit one consumer GPU. Edge or offline deployments gain a reasoning engine without cloud calls.
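The routing idea behind these use cases fits in a few lines. A minimal sketch, assuming each task carries a flag that says whether an automatic checker exists; the `Task` class, `pick_model` function, and the large-model name are placeholders rather than anything defined in the report:

```python
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    has_verifier: bool  # True when an automatic checker can confirm the answer

SMALL_MODEL = "WeiboAI/VibeThinker-3B"
LARGE_MODEL = "large-general-model"  # placeholder name, not a real model ID

def pick_model(task: Task) -> str:
    """Send verifiable reasoning to the 3B specialist, everything else to a generalist."""
    return SMALL_MODEL if task.has_verifier else LARGE_MODEL

print(pick_model(Task("Solve x^2 - 5x + 6 = 0.", has_verifier=True)))
print(pick_model(Task("Summarise this week's AI news.", has_verifier=False)))
```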
Running It: Quick Start
Serving with vLLM exposes an OpenAI-compatible endpoint:
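```bash
# Install vLLM (the team recommends vllm==0.10.1) and start the server
pip install vllm
vllm serve "WeiboAI/VibeThinker-3B"

# In a second terminal, once the server is up:
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "WeiboAI/VibeThinker-3B",
    "messages": [{"role":"user","content":"Prove there are infinitely many primes."}],
    "temperature": 1.0, "top_p": 0.95
  }'
```

Direct Transformers usage mirrors the official card:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("WeiboAI/VibeThinker-3B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "WeiboAI/VibeThinker-3B", torch_dtype="bfloat16", device_map="auto")

msgs = [{"role": "user", "content": "Your prompt"}]
text = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)

# The source snippet stops here. The lines below are an assumed completion:
# sampling settings match the curl example, and max_new_tokens is illustrative.
inputs = tok(text, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=4096, do_sample=True,
                     temperature=1.0, top_p=0.95)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```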
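Because the vLLM endpoint follows the OpenAI chat-completions format, the same request can be sent from Python with the standard library alone. A minimal sketch, assuming the vLLM server is listening on localhost:8000; the helper names `build_payload` and `ask` are illustrative.

```python
import json
import urllib.request

URL = "http://localhost:8000/v1/chat/completions"

def build_payload(prompt: str, temperature: float = 1.0, top_p: float = 0.95) -> dict:
    """Mirror the JSON body used in the curl example."""
    return {
        "model": "WeiboAI/VibeThinker-3B",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
    }

def ask(prompt: str, url: str = URL) -> str:
    """POST the request and return the assistant's reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Inspect the request body without contacting the server.
print(json.dumps(build_payload("Prove there are infinitely many primes."), indent=2))
```

With the server running, `ask("Prove there are infinitely many primes.")` should return the model's reply as a string.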




