MiniMax M3: Open-weight model with a million-token context challenges proprietary leaders


By AI Maestro, June 1, 2026

Key Points

  • Chinese AI company MiniMax has released M3, a new open-weight model that combines strong coding performance, native multimodality, and a one-million-token context window.
  • The new “MiniMax Sparse Attention” architecture only processes relevant data blocks. According to MiniMax, this cuts per-token compute at one million tokens of context to one-twentieth of its predecessor’s and speeds up input processing more than ninefold.
  • In benchmarks and long-running autonomy tests, M3 posts results on par with top models like Opus 4.7 and GPT-5.5, according to MiniMax. The model is available via API, and the weights will be published shortly.

Chinese AI company MiniMax has released its new model M3. It’s billed as the first open-weight model to combine top-tier coding performance, a one-million-token context window, and native multimodality.

According to MiniMax, that combination was previously out of reach for open models and reserved for proprietary systems like Opus 4.7, GPT-5.5, or Gemini 3.1 Pro. A new attention mechanism makes the leap possible by stretching the context window to one million tokens without letting compute costs spiral out of control. In internal tests, M3 also planned, debugged, and self-corrected on its own over many hours.

Benchmarks put M3 in proprietary territory

On SWE-Bench Pro, an established software development benchmark, M3 scores 59 percent according to MiniMax. That puts it ahead of GPT-5.5 and Gemini 3.1 Pro, but just behind Opus 4.7. M3 also lands in proprietary-class territory on terminal tasks and tool use. On BrowseComp, which tests autonomous web search, it scores 83.5 points and pulls ahead of Opus 4.7 at 79.3. Anthropic has since shipped Opus 4.8, a somewhat stronger model.

To get closer to real developer workflows, MiniMax built a simulator framework that mimics typical behavior patterns. These include refining requirements, discussing solution approaches, reacting to intermediate results, and carrying tasks across multiple contexts. This exposes the model to multi-turn collaboration during training, not just single, clearly defined prompts.

Three tests show long-running autonomy

MiniMax describes three internal experiments designed to show how these capabilities work together. In the first, the team had M3 independently reproduce a paper on LLM fine-tuning. The model worked for nearly twelve hours without intervention, produced 18 commits and 23 figures, and confirmed the paper’s key findings.

In the second test, M3 was asked to optimize a compute kernel for matrix multiplications on Nvidia Hopper GPUs, one of the most compute-intensive building blocks in large-model inference. Experienced teams typically need one to two weeks for this, according to MiniMax. M3 got only a task description, a benchmark script, and a non-functional code skeleton with no reference solution to copy from. After about 24 hours, the model had pushed Hopper hardware utilization from 7.6 to 71.3 percent. Most other tested models gave up after a few dozen attempts, while M3 worked through several plateaus and didn’t reach its best solution until attempt 145.

When optimizing an FP8 kernel, M3 reaches 71.3 percent of Hopper peak performance after 147 runs, pulling ahead of Opus 4.7. Anthropic’s model needs far fewer runs, though.

In the third test, PostTrainBench, M3 was tasked with independently training four base models. That meant synthesizing data, running the training, evaluating the results, and iterating, all without human input. The model landed just behind Opus 4.7 and GPT-5.5 but well ahead of the remaining tested models.

MiniMax says M3 was trained with mixed modalities from the start. So-called interleaved data, where text and images are woven together within a sequence, turned out to matter more than initially expected. After reworking the data pipeline, training scales to the order of 100 trillion tokens.

A new attention mechanism makes million-token context affordable

The technical foundation is a new attention variant called MiniMax Sparse Attention (MSA). Classic full attention compares every token against every other token, so compute costs grow quadratically with input length. MSA avoids this by calculating attention scores only for selected segments rather than every token pair.
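To make the scaling argument concrete, the following sketch counts query-key score computations for full attention and for a scheme in which each query only attends to a fixed number of blocks. The function names, the block size, and the number of blocks per query are illustrative assumptions, not values published by MiniMax, and the count ignores the cost of the filtering step itself.

```python
def full_attention_scores(num_tokens: int) -> int:
    """Score computations when every token is compared with every token."""
    return num_tokens * num_tokens


def sparse_attention_scores(num_tokens: int, block_size: int, blocks_per_query: int) -> int:
    """Score computations when each query attends to a fixed number of blocks.

    block_size and blocks_per_query are hypothetical parameters chosen for
    illustration; the article does not state the values MSA uses.
    """
    # A query can never attend to more tokens than the context contains.
    attended_tokens = min(num_tokens, block_size * blocks_per_query)
    return num_tokens * attended_tokens


if __name__ == "__main__":
    for n in (10_000, 100_000, 1_000_000):
        full = full_attention_scores(n)
        sparse = sparse_attention_scores(n, block_size=128, blocks_per_query=64)
        print(f"{n:>9} tokens: full={full:.2e} sparse={sparse:.2e} ratio={full / sparse:.1f}x")
```

With a fixed per-query budget, the sparse count grows linearly with input length while the full count grows quadratically, so the gap widens as the context gets longer.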

The stored context, known as the key-value cache (KV cache), gets split into blocks. A preliminary filtering step decides which blocks are actually relevant to the current query. Only those blocks go into the full calculation.
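The block filtering idea can be sketched in a few lines. This is a minimal illustration under stated assumptions: it summarizes each block by the mean of its key vectors and keeps the top-k blocks by dot product with the query. MiniMax has not published how its filtering step scores blocks, so the summary method, the `top_k` parameter, and all function names here are hypothetical.

```python
def split_into_blocks(keys, block_size):
    """Split a list of key vectors into consecutive blocks."""
    return [keys[i:i + block_size] for i in range(0, len(keys), block_size)]


def block_summary(block):
    """Mean key vector of a block (an assumed, cheap block representation)."""
    dim = len(block[0])
    return [sum(key[d] for key in block) / len(block) for d in range(dim)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_blocks(query, keys, block_size, top_k):
    """Return the indices of the top_k blocks most relevant to the query."""
    blocks = split_into_blocks(keys, block_size)
    scores = [dot(query, block_summary(block)) for block in blocks]
    ranked = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0], [-0.9, -0.1]]
    query = [1.0, 0.0]
    # Only the selected blocks would go on to the full attention calculation.
    print(select_blocks(query, keys, block_size=2, top_k=1))
```

The point of the design is that the filter touches one summary per block rather than every key, so its cost stays small relative to the full calculation it avoids.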

There’s also a change at the GPU computation level. Normally, the model loads the matching KV blocks from memory for each individual query, and many blocks get fetched multiple times. MSA flips the logic and processes blocks sequentially. For each block, all queries that need it get batched together. Each block only has to be read from memory once, in a contiguous access pattern instead of scattered jumps. MiniMax says its implementation runs more than four times faster than competing open-source alternatives.
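The reordering described above amounts to inverting a mapping: instead of "which blocks does each query need", the computation is organized around "which queries need each block". The sketch below models only that bookkeeping and the resulting number of block loads. It is not MiniMax's GPU kernel, and the function names and sample data are hypothetical.

```python
from collections import defaultdict


def group_queries_by_block(selected_blocks):
    """Invert a query -> blocks mapping into a block -> queries mapping.

    selected_blocks[q] is the list of block indices that query q needs.
    """
    queries_for_block = defaultdict(list)
    for query_id, block_ids in enumerate(selected_blocks):
        for block_id in block_ids:
            queries_for_block[block_id].append(query_id)
    # Sorting by block index mirrors a sequential pass over memory.
    return dict(sorted(queries_for_block.items()))


def count_block_loads(selected_blocks):
    """Return (loads when iterating per query, loads when iterating per block)."""
    per_query = sum(len(block_ids) for block_ids in selected_blocks)
    per_block = len({b for block_ids in selected_blocks for b in block_ids})
    return per_query, per_block


if __name__ == "__main__":
    selected = [[0, 2], [0, 1], [0, 2], [1, 2]]
    print(group_queries_by_block(selected))
    print(count_block_loads(selected))
```

In this toy case, four queries that each need two blocks trigger eight loads when processed query by query, but only three when each block is read once and shared by every query that needs it.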

All told, M3 needs just one-twentieth of its predecessor’s compute per token at one million tokens of context. Input prompts are processed more than nine times faster, and responses are generated more than fifteen times faster.

Pricing and availability

M3 is available through the MiniMax API. Requests up to 512,000 input tokens are billed at the standard rate; longer contexts cost more. A thinking mode can be toggled on or off per request. The token plan starts at $20 per month for roughly 1.7 billion tokens and goes up to $120 for 9.8 billion tokens. Model weights and a technical report will be published on Hugging Face and GitHub within the next ten days, MiniMax says.
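For a rough sense of what the token plans cost per token, the quoted figures can be converted to a price per million tokens. The plan labels below are hypothetical, the token counts are the approximate values given above, and the calculation ignores the surcharge for contexts beyond 512,000 input tokens.

```python
def dollars_per_million_tokens(monthly_price_usd: float, monthly_tokens: float) -> float:
    """Effective price per one million tokens for a flat monthly plan."""
    return monthly_price_usd / (monthly_tokens / 1_000_000)


# Labels are placeholders; prices and token counts are the article's rough figures.
plans = {
    "entry": (20, 1.7e9),
    "top": (120, 9.8e9),
}

if __name__ == "__main__":
    for label, (price, tokens) in plans.items():
        rate = dollars_per_million_tokens(price, tokens)
        print(f"{label}: about ${rate:.4f} per million tokens")
```

Both tiers work out to a little over one cent per million tokens, with the larger plan marginally higher per token on these rounded figures.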

MiniMax has also updated its in-house agent app, MiniMax Code, which is set to go open-source as well.

About three months ago, MiniMax released M2.7, a model the company said was actively involved in its own development, running autonomous optimization loops over more than 100 rounds and handling 30 to 50 percent of the workflow for MiniMax’s internal RL team.
