JetBrains Releases Mellum2: A 12B MoE Model for Fast, Specialized Tasks in Multi-Model AI Pipelines

By AI Maestro June 2, 2026 3 min read
JetBrains has released Mellum2, a 12-billion-parameter Mixture-of-Experts model built to accelerate software engineering workflows. It is released under the permissive Apache 2.0 license and is not a generalist chatbot meant to replace existing frontier models. Instead, it is a high-speed specialist, optimised for code generation, debugging, and agentic reasoning within larger AI systems. Its main appeal is handling complex coding tasks at roughly the per-token compute cost of a much smaller model, which suits low-latency pipelines, local deployment, and routing decisions in multi-model architectures.

Architecture

Mellum2 uses a Mixture-of-Experts (MoE) structure with 12B total parameters, of which only 2.5B are active per token. Per-token compute is therefore comparable to that of a 2.5B dense model, while the full 12B parameter set provides extra capacity for specialised tasks. All 12B parameters must still be held in memory. The model has 64 experts, 8 of which are activated for each token.
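The routing idea described above can be sketched in a few lines. This is a generic top-k router, not JetBrains' implementation: the softmax over the selected logits, the random router scores, and the function name `route_token` are assumptions made for illustration. Only the 64/8 expert counts and the 12B/2.5B parameter figures come from the article.

```python
import math
import random

NUM_EXPERTS = 64  # total experts (from the article)
TOP_K = 8         # experts activated per token (from the article)


def route_token(router_logits, top_k=TOP_K):
    """Return [(expert_index, weight)] for the top_k experts of one token.

    Assumption: weights are a softmax over the selected logits only.
    The article does not say how Mellum2 normalises its router scores.
    """
    # Pick the k experts with the highest router scores.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Numerically stable softmax restricted to the selected experts.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical router scores for a single token.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)

active_expert_fraction = TOP_K / NUM_EXPERTS  # 8 / 64 = 0.125
active_param_fraction = 2.5 / 12              # roughly 0.208
```

The active-parameter share (about 21%) is larger than the active-expert share (12.5%). A likely explanation is that components such as attention and embeddings run for every token regardless of routing, though the article does not break the parameter count down.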

Key technical specifications:

  • Layers: 28
  • Hidden size: 2304
  • MoE experts: 64 total, 8 active per token
  • Attention: Grouped-Query Attention (GQA) configured with 32 query heads and 4 KV heads
  • Sliding Window Attention (SWA): Applied in three of every four layers with a window of 1,024 tokens; the remaining layer in each group of four uses full attention
  • Context length: 131,072 tokens
  • Multi-Token Prediction (MTP) head: Acts as an auxiliary pre-training objective and a built-in draft model for speculative decoding
  • Precision: bfloat16
  • Vocabulary size: 98,304
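To see why the GQA and sliding-window settings matter for long contexts, here is a rough KV-cache estimate. It is a back-of-the-envelope sketch under stated assumptions: the head dimension is not given in the article, so `head_dim=72` (hidden size 2304 divided by 32 query heads) is a guess, and the function name and parameters are hypothetical. The layer count, KV head count, window size, 3-in-4 SWA ratio, and bfloat16 precision come from the list above.

```python
def kv_cache_bytes(context_len, num_layers=28, kv_heads=4, head_dim=72,
                   window=1024, swa_per_group=3, group_size=4,
                   bytes_per_value=2):
    """Estimate KV-cache size in bytes for one sequence.

    head_dim=72 is an ASSUMPTION (2304 / 32); the article does not state it.
    bytes_per_value=2 corresponds to bfloat16.
    """
    # Keys and values: 2 tensors per layer, each kv_heads * head_dim wide.
    per_token_per_layer = 2 * kv_heads * head_dim * bytes_per_value
    swa_layers = num_layers * swa_per_group // group_size
    full_layers = num_layers - swa_layers
    # A sliding-window layer only needs to keep the last `window` tokens.
    swa_tokens = min(context_len, window)
    cached_tokens = full_layers * context_len + swa_layers * swa_tokens
    return per_token_per_layer * cached_tokens


full_context = 131_072
with_swa = kv_cache_bytes(full_context)
all_full_attention = kv_cache_bytes(full_context, swa_per_group=0)

print(f"with SWA:  {with_swa / 2**30:.2f} GiB")
print(f"full only: {all_full_attention / 2**30:.2f} GiB")
```

Under these assumptions, the cache at the full 131,072-token context is about 1.0 GiB, against about 3.9 GiB if every layer used full attention. The real figures depend on the actual head dimension and cache layout.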

The model processes natural language and code but lacks multimodal capabilities, meaning it cannot ingest images or video.

Pre-Training

The model was pre-trained on approximately 10.6 trillion tokens across three phases. The data curriculum gradually shifted from diverse web content toward curated code and mathematical datasets.

Training used the Muon optimizer with FP8 hybrid precision. The learning rate followed a Warmup-Hold-Decay schedule, with a linear decay to zero in the final stage. After pre-training and before post-training, the base model’s context window was extended to 128K (131,072) tokens using a layer-selective YaRN method.
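A Warmup-Hold-Decay schedule with linear decay to zero can be written as a small function. This is a minimal sketch of the general shape only: the linear warmup, the function name, and all step counts and learning-rate values below are assumptions, since the article does not give the actual hyperparameters.

```python
def whd_learning_rate(step, peak_lr, warmup_steps, hold_steps, decay_steps):
    """Warmup-Hold-Decay learning rate at a given step.

    Assumes linear warmup from zero; the article only specifies that the
    decay stage is linear and ends at zero.
    """
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to the peak.
        return peak_lr * step / warmup_steps
    if step < warmup_steps + hold_steps:
        # Hold: stay at the peak learning rate.
        return peak_lr
    # Decay: fall linearly from the peak to zero, then stay at zero.
    decay_progress = (step - warmup_steps - hold_steps) / decay_steps
    return peak_lr * max(0.0, 1.0 - decay_progress)


# Hypothetical schedule: 100 warmup, 200 hold, 500 decay steps.
schedule = [whd_learning_rate(s, 1.0, 100, 200, 500) for s in range(0, 801, 50)]
```

Sampling the function every 50 steps, as in the last line, is a quick way to check the three stages before wiring it into a training loop.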

The Model Family

JetBrains released six distinct checkpoints covering the entire training lifecycle:

| Checkpoint | Description |
|---|---|
| Mellum2-12B-A2.5B-Base-Pretrain | Base checkpoint prior to long-context extension |
| Mellum2-12B-A2.5B-Base | Final base model after context extension |
| Mellum2-12B-A2.5B-Instruct-SFT | Supervised fine-tuned instruction checkpoint |
| Mellum2-12B-A2.5B-Thinking-SFT | Supervised thinking checkpoint |
| Mellum2-12B-A2.5B-Instruct | RL-tuned instruction model |
| Mellum2-12B-A2.5B-Thinking | RL-tuned thinking model |

Post-training consisted of two stages: supervised fine-tuning (SFT), followed by reinforcement learning with verifiable rewards (RLVR) focused on mathematics, executable coding, tool use, instruction following, reasoning, and knowledge tasks.

The Instruct variant provides direct answers without externalising a chain of thought, making it suitable for low-latency tasks like tool invocation and direct instruction following. Conversely, the Thinking variant outputs an explicit reasoning trace before delivering a final answer, which is preferable for complex debugging, multi-step planning, or agentic flows where step-by-step logic is critical.
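In a multi-model pipeline, the choice between the two variants could be made by a simple dispatcher. The sketch below is illustrative only: the task labels, the fallback to Instruct, and the function name `pick_checkpoint` are hypothetical, and only the two checkpoint names and the task-to-variant pairing come from the article.

```python
INSTRUCT = "Mellum2-12B-A2.5B-Instruct"
THINKING = "Mellum2-12B-A2.5B-Thinking"

# Hypothetical task labels, grouped following the article's guidance.
THINKING_TASKS = {"debugging", "multi_step_planning", "agentic_flow"}
INSTRUCT_TASKS = {"tool_invocation", "instruction_following"}


def pick_checkpoint(task_type):
    """Choose a checkpoint name for a task label.

    Unknown labels fall back to Instruct, an arbitrary default chosen
    here to favour lower latency.
    """
    if task_type in THINKING_TASKS:
        return THINKING
    return INSTRUCT


for task in ("tool_invocation", "debugging", "summarise_diff"):
    print(f"{task} -> {pick_checkpoint(task)}")
```

A production router would likely also weigh latency budgets and prompt length. A static lookup like this is just the simplest starting point.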

Benchmark Results

The following figures are self-reported by JetBrains, comparing Mellum2 against open-weight models ranging from 4B to 14B parameters.

Coding Performance:

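| Benchmark | Mellum2 Instruct | Qwen3.5 (4B) | Qwen3.5 (9B) | Ministral 3 (14B) | OLMo-3 (7B) | Seed-Coder (8B) |
|---|---|---|---|---|---|---|
| LiveCodeBench v6 | 37.2 | 51.0 | 63.7 | 42.4 | 28.2 | 28.1 |
| EvalPlus | 78.4 | 69.4 | 71.8 | 74.1 | 67.3 | 73.8 |
| MultiPL-E | 67.1 | 51.0 | 67.1 | 71.5 | 36.1 | 77.0 |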

Tool Use Performance:

| Benchmark | Mellum2 Instruct | Qwen3.5 (4B) | Qwen3.5 (9B) | Ministral 3 (14B) | OLMo-3 (7B) |
|---|---|---|---|---|---|
| BFCL v3 | 66.3 | 64.1 | 70.5 | 52.7 | 41.9 |
| BFCL v4 | 44.2 | 52.0 | 60.6 | 38.8 | 19.8 |

Math Performance:

| Benchmark | Mellum2 Instruct | Qwen3.5 (4B) | Qwen3.5 (9B) | Ministral 3 (14B) | OLMo-3 (7B) |
|---|---|---|---|---|---|
| AIME 2025+2026 | 41.7 | 38.3 | 58.3 | 33.3 | 40.0 |
| GSM-Plus | 80.5 | 85.2 | 87.9 | 86.6 | 85.8 |

Knowledge and Conversational Performance:

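| Benchmark | Mellum2 Instruct | Qwen3.5 (4B) | Qwen3.5 (9B) | Ministral 3 (14B) | OLMo-3 (7B) |
|---|---|---|---|---|---|
| MMLU-Redux | 78.1 | 87.5 | 91.1 | 85.9 | 71.8 |
| GPQA Diamond | 40.9 | 76.8 | 79.8 | 58.6 | 40.9 |
| IFEval | 75.8 | 82.1 | 83.9 | 67.3 | 83.2 |
| MixEval | 62.2 | 65.9 | 71.1 | 71.2 | 59.4 |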

Benchmark notes:

  • EvalPlus represents the mean of HumanEval+ and MBPP+
  • AIME scores are the mean of AIME 2025 and AIME 2026 (30 questions each)
  • BFCL v4 is the macro-average of five subtasks: v1, v2, v3, web search, and memory
  • Seed-Coder
