JetBrains Releases Mellum2: A 12B MoE Model for Fast, Specialized Tasks in Multi-Model AI Pipelines

JetBrains has unveiled Mellum2, a 12-billion-parameter Mixture-of-Experts model designed to accelerate software engineering workflows. Released under the permissive Apache 2.0 license, it is not a generalist chatbot meant to replace existing frontier models. Instead, it functions as a high-speed specialist, optimised for code generation, debugging, and agentic reasoning within larger AI systems. Its primary value lies in handling coding tasks at roughly the per-token compute cost of a much smaller model, which suits low-latency pipelines, local deployment, and routing decisions in multi-model architectures.

Architecture

Mellum2 uses a Mixture-of-Experts (MoE) structure with 12B total parameters, of which only 2.5B are activated per token. This keeps per-token compute close to that of a 2.5B dense model while drawing on the capacity of the larger parameter set for specialised tasks; the full 12B parameters must still be stored. The architecture has 64 experts, 8 of which are activated for every token processed.
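To make the "8 of 64 experts per token" idea concrete, here is a minimal sketch of top-k expert routing for a single token. Only the expert counts come from the article; the softmax over the selected logits, the random router scores, and the function name route_token are illustrative assumptions, not documented Mellum2 behaviour.

```python
import math
import random

NUM_EXPERTS = 64  # total experts, from the article
TOP_K = 8         # experts active per token, from the article


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts.

    Weights are a softmax over the selected logits only. This normalisation
    is an assumption for illustration, not a published Mellum2 detail.
    """
    if len(router_logits) < top_k:
        raise ValueError("need at least top_k router logits")
    # Indices of the top_k largest logits, highest first.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Numerically stable softmax restricted to the chosen experts.
    max_logit = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - max_logit) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# Hypothetical router scores for one token.
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routing = route_token(logits)
for expert, weight in routing:
    print(f"expert {expert:2d} weight {weight:.3f}")
```

Only the eight selected experts run their feed-forward computation for this token, which is why the active parameter count is far below the total.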

Key technical specifications:

Layers: 28
Hidden size: 2304
MoE experts: 64 total, 8 active per token
Attention: Grouped-Query Attention (GQA) configured with 32 query heads and 4 KV heads
Sliding Window Attention (SWA): Implemented across three of every four layers with a window size of 1,024, while the remaining layer employs full attention
Context length: 131,072 tokens
Multi-Token Prediction (MTP) head: Acts as an auxiliary pre-training objective and a built-in draft model for speculative decoding
Precision: bfloat16
Vocabulary size: 98,304
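The attention settings above mostly matter for memory at long context. The following sketch estimates how many key/value positions need to be cached at the full context length. It assumes that the fourth layer of each group of four is the full-attention one and that a sliding-window layer only keeps its last 1,024 positions; the article states neither the layer ordering nor the head dimension, so head_dim is left as a parameter and the value 128 in the usage line is purely hypothetical.

```python
NUM_LAYERS = 28     # from the article
SWA_WINDOW = 1024   # from the article
CONTEXT = 131072    # from the article
KV_HEADS = 4        # from the article
QUERY_HEADS = 32    # from the article
BYTES_BF16 = 2      # bfloat16 is 2 bytes per value


def layer_kinds(num_layers=NUM_LAYERS, period=4):
    """Label each layer 'swa' or 'full'.

    Assumption: the last layer of every group of `period` uses full attention.
    """
    return ["full" if (i + 1) % period == 0 else "swa" for i in range(num_layers)]


def cached_positions(seq_len, kinds, window=SWA_WINDOW):
    """Total cached token positions summed over all layers."""
    return sum(seq_len if kind == "full" else min(seq_len, window) for kind in kinds)


def kv_cache_bytes(seq_len, head_dim, kinds):
    """Approximate KV cache size: K and V tensors for every cached position."""
    return 2 * KV_HEADS * head_dim * BYTES_BF16 * cached_positions(seq_len, kinds)


kinds = layer_kinds()
mixed = cached_positions(CONTEXT, kinds)
all_full = CONTEXT * NUM_LAYERS
print("full-attention layers:", kinds.count("full"))
print("query heads per KV head:", QUERY_HEADS // KV_HEADS)
print("cached positions, mixed vs all-full:", mixed, all_full)
# head_dim=128 is a hypothetical value, not taken from the article.
print("approx. cache bytes:", kv_cache_bytes(CONTEXT, 128, kinds))
```

Under these assumptions only 7 of the 28 layers cache the whole sequence, so the cache at full context is roughly a quarter of what an all-full-attention stack would need.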

The model processes natural language and code but lacks multimodal capabilities, meaning it cannot ingest images or video.

Pre-Training

The model was pre-trained on approximately 10.6 trillion tokens across three phases. The data curriculum gradually shifted focus from diverse web content toward curated code and mathematical datasets.

Training employed the Muon optimizer using FP8 hybrid precision. The learning rate followed a Warmup-Hold-Decay schedule, with the final decay running linearly to zero. After pre-training, the base model’s context window was extended to 128K (131,072) tokens using a layer-selective YaRN method before post-training commenced.
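A Warmup-Hold-Decay schedule can be sketched in a few lines. The article only states that the decay is linear and ends at zero; the linear warmup shape, the step counts, and the peak learning rate below are hypothetical values chosen for illustration.

```python
def whd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps):
    """Learning rate at `step` for a Warmup-Hold-Decay schedule.

    Warmup: linear ramp from 0 to peak_lr (assumed shape).
    Hold:   constant at peak_lr.
    Decay:  linear ramp from peak_lr down to 0 at total_steps.
    """
    if decay_steps <= 0 or warmup_steps < 0:
        raise ValueError("decay_steps must be positive and warmup_steps non-negative")
    if warmup_steps + decay_steps > total_steps:
        raise ValueError("warmup and decay do not fit into total_steps")
    if not 0 <= step <= total_steps:
        raise ValueError("step out of range")
    hold_end = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < hold_end:
        return peak_lr
    return peak_lr * (total_steps - step) / decay_steps


# Hypothetical settings, not taken from the article.
for s in (0, 50, 100, 500, 900, 1000):
    print(s, whd_lr(s, total_steps=1000, peak_lr=1e-3, warmup_steps=100, decay_steps=200))
```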

The Model Family

JetBrains released six distinct checkpoints covering the entire training lifecycle:

Checkpoint	Description
Mellum2-12B-A2.5B-Base-Pretrain	Base checkpoint prior to long-context extension
Mellum2-12B-A2.5B-Base	Final base model after context extension
Mellum2-12B-A2.5B-Instruct-SFT	Supervised fine-tuned instruction checkpoint
Mellum2-12B-A2.5B-Thinking-SFT	Supervised thinking checkpoint
Mellum2-12B-A2.5B-Instruct	RL-tuned instruction model
Mellum2-12B-A2.5B-Thinking	RL-tuned thinking model

Post-training consisted of two stages: supervised fine-tuning (SFT), followed by reinforcement learning with verifiable rewards (RLVR) focused on mathematics, executable coding, tool use, instruction following, reasoning, and knowledge tasks.

The Instruct variant provides direct answers without externalising a chain of thought, making it suitable for low-latency tasks like tool invocation and direct instruction following. Conversely, the Thinking variant outputs an explicit reasoning trace before delivering a final answer, which is preferable for complex debugging, multi-step planning, or agentic flows where step-by-step logic is critical.
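In a multi-model pipeline, the choice between the two variants can be made by a small routing step. The sketch below maps a task label to one of the checkpoint names listed above; the task labels and the function pick_checkpoint are hypothetical and would need to match whatever classification an actual pipeline performs.

```python
INSTRUCT = "Mellum2-12B-A2.5B-Instruct"
THINKING = "Mellum2-12B-A2.5B-Thinking"

# Hypothetical task labels, grouped by the use cases the article describes.
LOW_LATENCY_TASKS = {"tool_call", "instruction_following"}
REASONING_TASKS = {"debugging", "multi_step_planning", "agentic_flow"}


def pick_checkpoint(task_kind):
    """Return the checkpoint name suited to a task label."""
    if task_kind in REASONING_TASKS:
        return THINKING   # explicit reasoning trace before the answer
    if task_kind in LOW_LATENCY_TASKS:
        return INSTRUCT   # direct answer, no externalised chain of thought
    raise ValueError(f"unknown task kind: {task_kind!r}")


print(pick_checkpoint("tool_call"))
print(pick_checkpoint("debugging"))
```

Raising on unknown labels rather than silently defaulting is a design choice for this sketch; a production router might instead fall back to the lower-latency variant.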

Benchmark Results

The following figures are self-reported by JetBrains, comparing Mellum2 against open-weight models ranging from 4B to 14B parameters.

Coding Performance:

Benchmark	Mellum2 Instruct	Qwen3.5 (4B)	Qwen3.5 (9B)	Ministral 3 (14B)	OLMo-3 (7B)	Seed-Coder (8B)
LiveCodeBench v6	37.2	51.0	63.7	42.4	28.2	28.1
EvalPlus	78.4	69.4	71.8	74.1	67.3	73.8
MultiPL-E	67.1	51.0	67.1	71.5	36.1	77.0

Tool Use Performance:

Benchmark	Mellum2 Instruct	Qwen3.5 (4B)	Qwen3.5 (9B)	Ministral 3 (14B)	OLMo-3 (7B)
BFCL v3	66.3	64.1	70.5	52.7	41.9
BFCL v4	44.2	52.0	60.6	38.8	19.8

Math Performance:

Benchmark	Mellum2 Instruct	Qwen3.5 (4B)	Qwen3.5 (9B)	Ministral 3 (14B)	OLMo-3 (7B)
AIME 2025+2026	41.7	38.3	58.3	33.3	40.0
GSM-Plus	80.5	85.2	87.9	86.6	85.8

Knowledge and Conversational Performance:

Benchmark	Mellum2 Instruct	Qwen3.5 (4B)	Qwen3.5 (9B)	Ministral 3 (14B)	OLMo-3 (7B)
MMLU-Redux	78.1	87.5	91.1	85.9	71.8
GPQA Diamond	40.9	76.8	79.8	58.6	40.9
IFEval	75.8	82.1	83.9	67.3	83.2
MixEval	62.2	65.9	71.1	71.2	59.4

Benchmark notes:

EvalPlus represents the mean of HumanEval+ and MBPP+
AIME scores are the mean of AIME 2025 and AIME 2026 (30 questions each)
BFCL v4 is the macro-average of five subtasks: v1, v2, v3, web search, and memory
Seed-Coder
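The aggregation rules in the notes above are simple averages and can be sketched as follows. The per-year AIME split and the BFCL subtask scores used here are hypothetical inputs, since the article reports only the combined figures, and the AIME helper assumes each score is a plain single-pass accuracy over 30 questions.

```python
def mean_of_two(a, b):
    """Plain mean, as used for EvalPlus (HumanEval+ and MBPP+) and AIME."""
    return (a + b) / 2


def macro_average(scores):
    """Unweighted mean over subtasks, regardless of subtask size."""
    if not scores:
        raise ValueError("no scores given")
    return sum(scores.values()) / len(scores)


def aime_combined(correct_2025, correct_2026, questions_per_year=30):
    """Combined AIME percentage from per-year counts of correct answers."""
    return 100 * mean_of_two(correct_2025 / questions_per_year,
                             correct_2026 / questions_per_year)


# Hypothetical split: 13 and 12 correct answers out of 30 each.
print(round(aime_combined(13, 12), 1))

# Hypothetical BFCL v4 subtask scores, for illustration only.
subtasks = {"v1": 60.0, "v2": 50.0, "v3": 40.0, "web_search": 30.0, "memory": 20.0}
print(macro_average(subtasks))
```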
AI Maestro is an independent British AI publication. We test what we recommend.