JetBrains has unveiled Mellum2, a 12-billion-parameter Mixture-of-Experts model built specifically for software engineering workflows. Released under the permissive Apache 2.0 license, it is not a generalist chatbot meant to replace existing frontier models. Instead, it functions as a high-speed specialist, optimised for code generation, debugging, and agentic reasoning within larger AI systems. Its main appeal is that it handles complex coding tasks at the per-token compute cost of a much smaller model, which suits low-latency pipelines, local deployment, and routing decisions in multi-model architectures.
Architecture
Mellum2 utilises a Mixture-of-Experts (MoE) structure with 12B total parameters, of which only 2.5B are activated per token. This keeps per-token compute close to that of a 2.5B dense model while drawing on the capacity of the larger parameter set for specialised tasks; the full 12B weights still have to be stored. The architecture has 64 experts, 8 of which are activated for every token processed.
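To make the "8 of 64 experts per token" idea concrete, here is a minimal sketch of generic softmax-then-top-k routing. It is not JetBrains' router: the article does not describe the gating function, so the softmax, the renormalisation step, and the random logits standing in for a learned router are all assumptions made for illustration.

```python
import math
import random

NUM_EXPERTS = 64  # from the article
TOP_K = 8         # from the article


def route_token(router_logits, top_k=TOP_K):
    """Return [(expert_index, gate_weight), ...] for a single token.

    Generic softmax-then-top-k routing (an assumption; Mellum2's real
    router may normalise or select experts differently).
    """
    # Numerically stable softmax over all expert logits.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep only the top_k most probable experts.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Renormalise so the selected gate weights sum to 1.
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]


random.seed(0)
# Random logits stand in for the output of a learned router layer.
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
assignment = route_token(logits)

print(len(assignment))         # 8 experts selected for this token
print(TOP_K / NUM_EXPERTS)     # 0.125 of the experts run per token
print(round(2.5 / 12, 3))      # 0.208 of all parameters are active per token
```

The active-parameter share (about 21%) is higher than the active-expert share (12.5%) because non-expert weights such as attention and embeddings run for every token regardless of routing.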
Key technical specifications:
- Layers: 28
- Hidden size: 2304
- MoE experts: 64 total, 8 active per token
- Attention: Grouped-Query Attention (GQA) configured with 32 query heads and 4 KV heads
- Sliding Window Attention (SWA): Used in three of every four layers with a window of 1,024 tokens; the fourth layer uses full attention
- Context length: 131,072 tokens
- Multi-Token Prediction (MTP) head: Acts as an auxiliary pre-training objective and a built-in draft model for speculative decoding
- Precision: bfloat16
- Vocabulary size: 98,304
The model processes natural language and code but lacks multimodal capabilities, meaning it cannot ingest images or video.
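The attention settings listed above matter most at long context, where the key-value cache dominates memory. The following back-of-the-envelope sketch estimates cache size from the published figures. Two things are assumptions rather than facts from the article: the head dimension of 72 (hidden size 2304 divided by 32 query heads, which the real model need not follow) and the idea that sliding-window layers only retain the most recent 1,024 tokens in cache.

```python
MIB = 2 ** 20


def kv_cache_bytes(context_len, layers=28, kv_heads=4, head_dim=72,
                   window=1024, swa_every=4, bytes_per_value=2):
    """Estimate KV-cache size in bytes for one sequence.

    layers, kv_heads, window and the 3-of-4 SWA layout come from the article.
    head_dim=72 is an ASSUMPTION (2304 / 32); bytes_per_value=2 is bfloat16.
    """
    full_layers = layers // swa_every   # one full-attention layer per group of four
    swa_layers = layers - full_layers   # the remaining layers use a sliding window
    per_token = 2 * kv_heads * head_dim * bytes_per_value  # keys + values, one layer

    full = full_layers * context_len * per_token
    # Assumes a windowed layer caches at most `window` tokens.
    swa = swa_layers * min(context_len, window) * per_token
    return full + swa


ctx = 131_072
with_swa = kv_cache_bytes(ctx) / MIB
all_full = kv_cache_bytes(ctx, window=ctx) / MIB  # as if every layer used full attention

print(with_swa)   # 1031.625
print(all_full)   # 4032.0
```

Under these assumptions, the sliding-window layout cuts the full-context cache to roughly a quarter of what 28 full-attention layers would need, since only 7 layers grow with sequence length.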
Pre-Training
The model was pre-trained on approximately 10.6 trillion tokens across three phases. The data curriculum shifted gradually from diverse web content toward curated code and mathematical datasets.
Training employed the Muon optimizer using FP8 hybrid precision. The learning rate followed a Warmup-Hold-Decay schedule, linearly decaying to zero. Following the initial pre-training phase, the base model’s context window was extended to 128K tokens using a layer-selective YaRN method before post-training commenced.
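A Warmup-Hold-Decay schedule with linear decay to zero can be sketched in a few lines. The article gives only the schedule's shape, so the peak learning rate, the step counts, and the choice of a linear warmup below are hypothetical placeholders, not JetBrains' actual hyperparameters.

```python
def whd_lr(step, total_steps, peak_lr=1e-3, warmup_steps=1_000, hold_steps=6_000):
    """Warmup-Hold-Decay learning rate at a given step.

    All default values are HYPOTHETICAL; only the three-stage shape and the
    linear decay to zero come from the article.
    """
    decay_start = warmup_steps + hold_steps
    if total_steps <= decay_start:
        raise ValueError("total_steps must leave room for a decay stage")

    if step < warmup_steps:
        # Warmup: ramp linearly from 0 to the peak (linear ramp is an assumption).
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # Hold: stay at the peak learning rate.
        return peak_lr
    # Decay: fall linearly from the peak to zero at the final step.
    progress = (step - decay_start) / (total_steps - decay_start)
    return peak_lr * max(0.0, 1.0 - progress)


for s in (0, 500, 1_000, 7_000, 8_500, 10_000):
    print(s, whd_lr(s, total_steps=10_000))
```

With these placeholder values the rate rises for the first 1,000 steps, stays flat until step 7,000, and reaches zero at step 10,000.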
The Model Family
JetBrains released six distinct checkpoints covering the entire training lifecycle:
| Checkpoint | Description |
|---|---|
| Mellum2-12B-A2.5B-Base-Pretrain | Base checkpoint prior to long-context extension |
| Mellum2-12B-A2.5B-Base | Final base model after context extension |
| Mellum2-12B-A2.5B-Instruct-SFT | Supervised fine-tuned instruction checkpoint |
| Mellum2-12B-A2.5B-Thinking-SFT | Supervised thinking checkpoint |
| Mellum2-12B-A2.5B-Instruct | RL-tuned instruction model |
| Mellum2-12B-A2.5B-Thinking | RL-tuned thinking model |
Post-training consisted of two stages: supervised fine-tuning (SFT), followed by reinforcement learning with verifiable rewards (RLVR) focused on mathematics, executable coding, tool use, instruction following, reasoning, and knowledge tasks.
The Instruct variant provides direct answers without externalising a chain of thought, making it suitable for low-latency tasks like tool invocation and direct instruction following. Conversely, the Thinking variant outputs an explicit reasoning trace before delivering a final answer, which is preferable for complex debugging, multi-step planning, or agentic flows where step-by-step logic is critical.
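In a multi-model pipeline, the split described above could be expressed as a small dispatch rule. The sketch below is purely illustrative: the checkpoint names come from the table, but the task labels and the `pick_checkpoint` helper are hypothetical and not part of any JetBrains API.

```python
INSTRUCT = "Mellum2-12B-A2.5B-Instruct"
THINKING = "Mellum2-12B-A2.5B-Thinking"

# Task labels are HYPOTHETICAL; the grouping follows the article's description.
LOW_LATENCY_TASKS = {"tool_invocation", "instruction_following"}
REASONING_TASKS = {"complex_debugging", "multi_step_planning", "agentic_flow"}


def pick_checkpoint(task: str) -> str:
    """Map a task label to the checkpoint the article suggests for it."""
    if task in REASONING_TASKS:
        return THINKING   # explicit reasoning trace before the final answer
    if task in LOW_LATENCY_TASKS:
        return INSTRUCT   # direct answer, no externalised chain of thought
    raise ValueError(f"unknown task label: {task!r}")


print(pick_checkpoint("tool_invocation"))
print(pick_checkpoint("complex_debugging"))
```

Raising on unknown labels, rather than silently defaulting to one variant, is a design choice for this sketch; a real router might instead fall back to the cheaper Instruct model.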
Benchmark Results
The following figures are self-reported by JetBrains, comparing Mellum2 against open-weight models ranging from 4B to 14B parameters.
Coding Performance:
| Benchmark | Mellum2 Instruct | Qwen3.5 (4B) | Qwen3.5 (9B) | Ministral 3 (14B) | OLMo-3 (7B) | Seed-Coder (8B) |
|---|---|---|---|---|---|---|
| LiveCodeBench v6 | 37.2 | 51.0 | 63.7 | 42.4 | 28.2 | 28.1 |
| EvalPlus | 78.4 | 69.4 | 71.8 | 74.1 | 67.3 | 73.8 |
| MultiPL-E | 67.1 | 51.0 | 67.1 | 71.5 | 36.1 | 77.0 |
Tool Use Performance:
| Benchmark | Mellum2 Instruct | Qwen3.5 (4B) | Qwen3.5 (9B) | Ministral 3 (14B) | OLMo-3 (7B) |
|---|---|---|---|---|---|
| BFCL v3 | 66.3 | 64.1 | 70.5 | 52.7 | 41.9 |
| BFCL v4 | 44.2 | 52.0 | 60.6 | 38.8 | 19.8 |
Math Performance:
| Benchmark | Mellum2 Instruct | Qwen3.5 (4B) | Qwen3.5 (9B) | Ministral 3 (14B) | OLMo-3 (7B) |
|---|---|---|---|---|---|
| AIME 2025+2026 | 41.7 | 38.3 | 58.3 | 33.3 | 40.0 |
| GSM-Plus | 80.5 | 85.2 | 87.9 | 86.6 | 85.8 |
Knowledge and Conversational Performance:
| Benchmark | Mellum2 Instruct | Qwen3.5 (4B) | Qwen3.5 (9B) | Ministral 3 (14B) | OLMo-3 (7B) |
|---|---|---|---|---|---|
| MMLU-Redux | 78.1 | 87.5 | 91.1 | 85.9 | 71.8 |
| GPQA Diamond | 40.9 | 76.8 | 79.8 | 58.6 | 40.9 |
| IFEval | 75.8 | 82.1 | 83.9 | 67.3 | 83.2 |
| MixEval | 62.2 | 65.9 | 71.1 | 71.2 | 59.4 |
Benchmark notes:
- EvalPlus represents the mean of HumanEval+ and MBPP+
- AIME scores are the mean of AIME 2025 and AIME 2026 (30 questions each)
- BFCL v4 is the macro-average of five subtasks: v1, v2, v3, web search, and memory
- Seed-Coder
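Because the AIME figure pools only 60 questions, it helps to translate the reported percentages back into question counts. The sketch below does that for the Math Performance table, assuming each score is plain accuracy over the 60 questions; if JetBrains averaged several sampling runs, the true counts need not be whole numbers.

```python
AIME_QUESTIONS = 60  # 30 from AIME 2025 + 30 from AIME 2026, per the benchmark notes

# Scores copied from the Math Performance table.
reported = {
    "Mellum2 Instruct": 41.7,
    "Qwen3.5 (4B)": 38.3,
    "Qwen3.5 (9B)": 58.3,
    "Ministral 3 (14B)": 33.3,
    "OLMo-3 (7B)": 40.0,
}


def implied_correct(score_pct, n=AIME_QUESTIONS):
    """Nearest whole number of correct answers consistent with a percentage.

    ASSUMES the score is single-run accuracy over n questions.
    """
    return round(score_pct * n / 100)


for model, score in reported.items():
    k = implied_correct(score)
    print(f"{model}: about {k}/{AIME_QUESTIONS} ({100 * k / AIME_QUESTIONS:.1f}%)")
```

Each question is worth about 1.7 points at this sample size, so under this reading the gap between Mellum2 Instruct (41.7) and OLMo-3 (40.0) corresponds to a single question.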