Introducing North Mini Code: Cohere’s First Model For Developers

By AI Maestro June 9, 2026 4 min read

For makers and artists building the next generation of software, the release of North Mini Code signals a shift from generic chatbots to specialised engineering partners. Cohere built this 30-billion-parameter Mixture-of-Experts model, which activates just 3 billion parameters per token, specifically to handle complex agentic software engineering tasks. Available on Hugging Face under the Apache 2.0 license, it is the inaugural model in Cohere’s new family, trained exclusively for terminal-based workflows and high-quality code generation.

Performance and Capabilities

On Artificial Analysis’ Coding Index, North Mini Code scores 33.4. This result places it ahead of Qwen3.5 (35B-A3B), Gemma 4 (26B-A4B), and Devstral Small 2 (24B Dense). It also outperforms significantly larger competitors such as Nemotron 3 Super (120B-A12B), Mistral Small 4 (119B-A6B), and Devstral 2 (123B). The model ranks among the strongest open-source coding options within its size class, suggesting that efficiency does not require sacrificing capability.

Built for Real-World Agents

Effective code agents require models that remain robust across diverse tooling environments. Cohere trained North Mini Code using multiple scaffolds rather than optimising for a single interface. This strategy ensures the model serves as a reliable foundation for various harnesses, including OpenCode, which relies on fine-grained, individually typed tools returning structured JSON responses.

Technical Architecture

North Mini Code is a decoder-only Transformer-based sparse Mixture-of-Experts model. Its attention mechanism interleaves sliding-window attention with RoPE and global attention without positional embeddings in a 3:1 ratio. The feed-forward block consists of an MoE structure with 128 experts, activating eight per token, utilising SwiGLU activation functions. The router applies a sigmoid activation to logits before top-k selection, preceded by a single dense layer.
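The routing and layer layout described above can be illustrated with a small sketch. It is not Cohere’s implementation: the function names are hypothetical, renormalising the selected gate weights is an assumption, and so is the exact ordering of the 3:1 interleave (three sliding-window layers, then one global layer). Only the expert count, the top-k value, and the sigmoid-before-top-k order come from the article.

```python
import math
import random

NUM_EXPERTS = 128  # from the article
TOP_K = 8          # experts activated per token, from the article


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def route(logits: list[float], k: int = TOP_K) -> dict[int, float]:
    """Return {expert_index: gate_weight} for the k highest-scoring experts."""
    # Sigmoid is applied to every router logit before top-k selection.
    scores = [sigmoid(z) for z in logits]
    top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    # Assumption of this sketch: selected gates are renormalised to sum to 1.
    total = sum(scores[i] for i in top)
    return {i: scores[i] / total for i in top}


def attention_pattern(num_layers: int, ratio: int = 3) -> list[str]:
    """Layer types for a ratio:1 interleave of sliding-window and global attention."""
    # Assumed ordering: `ratio` sliding-window (RoPE) layers, then one global
    # layer without positional embeddings, repeated.
    return [
        "global_nope" if (i + 1) % (ratio + 1) == 0 else "sliding_rope"
        for i in range(num_layers)
    ]


random.seed(0)
gates = route([random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)])
print(len(gates), "experts selected for this token")
print(attention_pattern(8))
```

Because sigmoid scores each expert independently, unlike a softmax over all 128 logits, one expert’s gate does not shrink merely because another expert’s logit grows.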

Post-Training Strategy

The development pipeline employs a two-stage cascaded supervised fine-tuning (SFT) approach followed by reinforcement learning with verifiable rewards (RLVR). The initial SFT stage mixes programming, reasoning, and instruction-following data, with code datasets accounting for 70% of trainable tokens. The second stage focuses strictly on agentic and reasoning-driven samples, with code data making up 61% of trainable tokens. This mixture includes only tool calls and completions verified as executable and correct.

The team utilised over 70,000 verifiable tasks across approximately 5,000 unique repositories, maintaining disjoint subsets for synthetic data generation and RLVR. Context lengths of 64K and 128K were used for the first and second SFT stages respectively. This “long-to-longer” cascade prevents high-quality code data from being dominated by non-code tokens in early stages, avoiding behavioural conflicts and ensuring robust performance on long-context tasks.

Generalising Across Harnesses

To ensure usability in unpredictable software development settings, the model was exposed to diverse coding harnesses during the second SFT stage. While the primary benchmark data focused on SWE-Agent, the inclusion of additional harness data yielded a 10% performance gain on the OpenCode evaluation without degrading SWE-Agent results. Notably, the model achieved 61.0% pass@1 on mini-SWE-Agent, demonstrating that harnesses with overlapping tool capabilities share sufficient representational structure for positive transfer.

Similarly, for Terminal-Bench, the model was primed with a small amount of plain-text chat data to adapt to Terminus 2, where interactions occur via chat turns rather than native tool calling. The training process emphasised sufficient variation in harnesses to force the model to establish genuine links between instructions and behaviours, preventing simple template regurgitation.

Asynchronous Reinforcement Learning

Coding-agent rollouts are notoriously variable in length, often causing synchronous RL loops to idle while waiting for the longest trajectories. Cohere decoupled sampling from learning by running a trainer alongside a vLLM sidecar that serves rollouts continuously. Policy weights are exported every few learner steps, ensuring the sampler remains only slightly off-policy at any given moment.
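To see why periodic weight export keeps the sampler “only slightly off-policy”, here is a toy, deterministic simulation. It is a sketch under stated assumptions: `sampler_lag` and `export_every` are hypothetical names, the article does not give the export interval, and rollouts already in flight when weights are swapped are ignored.

```python
def sampler_lag(learner_steps: int, export_every: int) -> list[int]:
    """Policy lag of the sampler, in learner updates, at the start of each step."""
    sampler_version = 0  # number of learner updates baked into the served weights
    lags = []
    for step in range(learner_steps):
        # `step` updates have been applied before this learner step begins.
        lags.append(step - sampler_version)
        if (step + 1) % export_every == 0:
            # Export fresh weights to the rollout server after this update.
            sampler_version = step + 1
    return lags


lags = sampler_lag(learner_steps=8, export_every=4)
print(lags)
print("worst-case lag:", max(lags))
```

In this simplified model the lag never exceeds `export_every - 1` updates, which is the quantity an importance-sampling correction then has to absorb.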

The team implemented a windowed First-in-First-Out (FIFO) queue to manage data distribution, allowing a small fraction of the queue to be consumed in completion order to drain stragglers. Training utilised CISPO, a log-likelihood objective with token-level importance sampling correction. This approach aggregates loss at the token level, ensuring that gradient signals scale with trajectory length and that long agentic traces are not down-weighted relative to short ones.
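One possible reading of the windowed FIFO queue is sketched below. The article does not specify the mechanism, so this interpretation is an assumption: rollouts are held in submission order, and the consumer may take any finished rollout from the first `window` slots, so a slow rollout at the head does not block the ones just behind it. The class and field names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rollout:
    rollout_id: int
    done: bool = False


class WindowedFifo:
    """FIFO queue that allows out-of-order consumption within a small window."""

    def __init__(self, window: int) -> None:
        self.window = window
        self.pending: deque = deque()  # rollouts in submission order

    def submit(self, rollout: Rollout) -> None:
        self.pending.append(rollout)

    def pop_ready(self) -> Optional[Rollout]:
        """Return the oldest finished rollout among the first `window` slots."""
        for index, rollout in enumerate(self.pending):
            if index >= self.window:
                break  # never reach past the window, preserving rough FIFO order
            if rollout.done:
                del self.pending[index]
                return rollout
        return None  # nothing inside the window has finished yet


queue = WindowedFifo(window=2)
rollouts = [Rollout(i) for i in range(4)]
for r in rollouts:
    queue.submit(r)

rollouts[1].done = True       # rollout 0 is still running
taken = queue.pop_ready()     # rollout 1 is inside the window, so it is served
print(taken.rollout_id)
```

The window size trades off ordering against throughput: a window of 1 is a strict FIFO, while a window as large as the queue consumes purely in completion order.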

A single multi-environment online RL training run spanned both terminal-based tasks and software engineering tasks. Each batch consisted of 512 rollouts with a group size of eight. The model received binary rewards derived from unit-test-based verifiers, and invalid tool calls or unparseable outputs received a reward of zero. This penalty sharply reduced the rate of hallucinated or malformed tool calls during the initial training steps.
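The reward scheme and the token-level objective can be sketched in plain Python. This is a simplified illustration, not Cohere’s training code: it computes loss values only (no autograd, so the clipped importance weight is simply treated as a constant), the group-relative advantage is an assumption borrowed from common practice, and the clip bounds `CLIP_LOW` and `CLIP_HIGH` are hypothetical values.

```python
import math
from statistics import mean, pstdev

BATCH_ROLLOUTS = 512  # from the article
GROUP_SIZE = 8        # from the article
CLIP_LOW, CLIP_HIGH = 0.0, 2.0  # hypothetical importance-weight clip bounds


def reward(tests_passed: bool, output_valid: bool) -> float:
    """Binary reward; invalid tool calls or unparseable outputs score zero."""
    return 1.0 if (output_valid and tests_passed) else 0.0


def group_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages (assumed; the article only states the group size)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-6) for r in rewards]


def token_level_loss(new_logps, old_logps, advantages) -> float:
    """Clipped importance-weighted log-likelihood loss, averaged over all tokens."""
    total, n_tokens = 0.0, 0
    for new, old, adv in zip(new_logps, old_logps, advantages):
        for lp_new, lp_old in zip(new, old):
            ratio = math.exp(lp_new - lp_old)  # token-level importance weight
            weight = max(CLIP_LOW, min(ratio, CLIP_HIGH))  # constant w.r.t. gradients
            total += weight * adv * lp_new
            n_tokens += 1
    # Dividing by the total token count (not per trajectory) means a long
    # trajectory contributes in proportion to its length.
    return -total / n_tokens


groups_per_batch = BATCH_ROLLOUTS // GROUP_SIZE
rewards = [reward(passed, True) for passed in [True, False, False, False, False, False, True, False]]
print(groups_per_batch, [round(a, 2) for a in group_advantages(rewards)])
```

With per-trajectory averaging, a 4-token trace and a 4,000-token trace would carry equal weight; dividing by the total token count avoids that.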

Results

The RLVR training improved the final model’s performance by 7.9% (absolute) pass@1 on Terminal-Bench v2 and 3.0% (absolute) on SWE-Bench. Joint training across both environments produced stronger results than training on each separately and demonstrated better generalisation to out-of-distribution tasks.

Key takeaways

  • North Mini Code delivers top-tier coding performance for its size class, beating significantly larger models like Nemotron 3 Super.
  • A two-stage SFT cascade (64K, then 128K context) supports long-context agentic workflows, while asynchronous RL with a windowed FIFO queue keeps training from idling on the longest rollouts.
  • Exposure to diverse harnesses during training enables the model to generalise effectively across different tooling environments without benchmark degradation.
  • Joint multi-environment RL training yields superior results compared to single-task optimisation, improving both terminal and software engineering capabilities.
