Mistral AI Releases Leanstral 1.5: An Apache-2.0 Lean 4 Code Agent Model Solving 587 of 672 PutnamBench Problems

Mistral AI has launched Leanstral 1.5, a code agent model designed for Lean 4. The system is licensed under Apache 2.0 and a free API endpoint is now available. It updates the earlier Leanstral-2603 model and belongs to the Mistral Small 4 family.

Leanstral 1.5 targets automated theorem proving and proof engineering. It operates within Lean 4, a proof assistant that mechanically checks every logical step. Lean 4 is expressive enough to state mathematical objects as complex as perfectoid spaces, as well as properties of Rust program fragments.

Architecture and Specifications

The model uses a mixture-of-experts architecture. It routes each token to a few specialised sub-networks to keep compute low while maintaining high capacity. Leanstral contains 128 experts, activating four per token.
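As a rough illustration of the routing step, the sketch below picks the four highest-scoring experts out of 128 for one token and normalises their weights. The softmax-over-selected-scores scheme, the function name `route_token`, and the random scores are assumptions for illustration; the announcement does not describe Leanstral's actual router.

```python
import math
import random

NUM_EXPERTS = 128  # expert count quoted in the article
TOP_K = 4          # experts activated per token, per the article

def route_token(scores, k=TOP_K):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected scores only (an assumed scheme).
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)  # subtract the peak for numerical stability
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Hypothetical router scores for a single token.
rng = random.Random(0)
router_scores = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
chosen = route_token(router_scores)
print(chosen)
```

Only the selected experts run for that token, which is why the per-token compute tracks the active parameters rather than the total.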

Total size is 119 billion parameters, with 6.5 billion activated per token. Context length reaches 256k tokens. Input accepts text and images, while output remains text only.

Training Methodology

Training occurred in three stages: mid-training, supervised fine-tuning, and reinforcement learning with CISPO. Two reinforcement-learning environments shaped the model’s agentic behaviour.

In the multiturn environment, the model receives a theorem statement. It must prove or disprove it. The system submits a proof, reads Lean compiler feedback, and refines across attempts until success or budget exhaustion.
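The submit, read feedback, refine cycle can be sketched as a small loop. This is a minimal sketch under stated assumptions: `propose` stands in for a model call and `check` for a Lean compiler run, both hypothetical callables, and the budget is counted in attempts rather than tokens.

```python
from typing import Callable, Optional, Tuple

def refine_until_checked(
    statement: str,
    propose: Callable[[str, str], str],
    check: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 8,
) -> Optional[str]:
    """Return the first candidate the checker accepts, or None if the budget runs out."""
    feedback = ""
    for _ in range(max_attempts):
        candidate = propose(statement, feedback)  # hypothetical model call
        ok, feedback = check(candidate)           # hypothetical compiler call
        if ok:
            return candidate
    return None

# Stub demo: the "model" revises its answer once it has seen an error message.
def fake_propose(statement, feedback):
    return "revised" if feedback else "draft"

def fake_check(candidate):
    if candidate == "revised":
        return True, ""
    return False, "error: unsolved goals"

result = refine_until_checked("some theorem statement", fake_propose, fake_check)
print(result)
```

In the environment described above the candidate may be either a proof or a disproof; the loop shape is the same in both cases.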

In the code agent environment, Leanstral functions inside a raw filesystem. It edits files, runs bash commands, and uses the Lean language server. That server exposes goals, errors, and type information in real time.

This setup allows the model to complete partial proofs, build auxiliary lemmas, and persist through context compaction. Compaction compresses earlier context so long tasks fit the window. Correctness is verified by Mistral’s fork of SafeVerify against target theorems.
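One simple way to picture compaction is to fold older messages into a single summary once the running context exceeds a budget. The sketch below is illustrative only: the whitespace token count, the placeholder summariser, and the `keep_recent` parameter are assumptions, and the article does not say how Leanstral's compaction is implemented.

```python
def count_tokens(text):
    """Crude token count by whitespace (assumption; real tokenizers differ)."""
    return len(text.split())

def default_summary(old_messages):
    """Placeholder summariser: keeps the first 20 characters of each old message."""
    return "SUMMARY: " + " | ".join(m[:20] for m in old_messages)

def compact(messages, budget, keep_recent=2, summarize=default_summary):
    """Fold older messages into one summary when the context exceeds `budget` tokens."""
    total = sum(count_tokens(m) for m in messages)
    if total <= budget or len(messages) <= keep_recent:
        return list(messages)
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(old)] + recent

# Made-up agent history: two long early turns, two short recent ones.
history = [
    "goal: prove lemma A " + "detail " * 50,
    "error: unknown identifier " + "trace " * 50,
    "fixed the import",
    "lemma A now compiles",
]
compacted = compact(history, budget=40)
print(len(history), "->", len(compacted))
```

The recent turns survive verbatim, so the agent keeps its immediate working state while the older material shrinks.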

Benchmark Results

The Mistral team reports that Leanstral 1.5 saturates miniF2F. It reaches 100% on both the validation and test sets. It solves 587 of 672 PutnamBench problems.

The model sets a new state-of-the-art on the FATE-H and FATE-X algebra benchmarks. Mistral lists 87% on FATE-H and 34% on FATE-X. On FLTEval, pass@1 rises from 21.9 to 28.9. Pass@8 rises from 31.9 to 43.2.

FLTEval is built from real pull requests to the Fermat’s Last Theorem repository. On it, Leanstral surpasses Opus 4.6’s 39.6 at one-seventh the cost. It also widens its lead over open-source models three to ten times larger. Pass@8 means eight attempts are allowed per problem.
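To make the pass@k definition concrete, the sketch below counts a problem as solved when any of its first k attempts is verified. The outcomes are made up, and this direct definition is an assumption: benchmark reports sometimes use a statistical estimator over more samples, and the article does not say which method Mistral used.

```python
def pass_at_k(attempts_by_problem, k):
    """Percentage of problems with at least one success in the first k attempts."""
    solved = sum(1 for attempts in attempts_by_problem.values() if any(attempts[:k]))
    return 100.0 * solved / len(attempts_by_problem)

# Made-up outcomes for three problems, eight attempts each (True = verified).
toy = {
    "p1": [True] + [False] * 7,
    "p2": [False] * 5 + [True] + [False] * 2,
    "p3": [False] * 8,
}
print(pass_at_k(toy, 1))
print(pass_at_k(toy, 8))
```

Here `p2` only counts at k = 8, which is the kind of gap that separates the pass@1 and pass@8 figures quoted above.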

Benchmark	Leanstral 1.5	Detail
miniF2F (val + test)	100%	Saturated, per Mistral
PutnamBench	587 / 672	~$4 per problem
FATE-H	87%	New state-of-the-art
FATE-X	34%	New state-of-the-art
FLTEval pass@1	28.9	Up from 21.9
FLTEval pass@8	43.2	Beats Opus 4.6’s 39.6

On PutnamBench, Leanstral edges out Seed-Prover 1.5 at its high setting by 7 problems. It does so at about $4 per problem. Mistral estimates that Seed-Prover's high setting costs $300 or more per problem.

That setting runs a budget of 10 H20-days per problem. Mistral also compares against Goedel-Architect and AxProverBase. It notes Aleph Prover costs roughly $54 to $68 per problem.
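The cost gap is easier to see as ratios. The short calculation below uses only the figures quoted in the article; the dictionary labels are ad hoc, and the Seed-Prover figure is Mistral's lower-bound estimate, so its ratio is a floor.

```python
LEANSTRAL_SOLVED = 587
LEAD_OVER_SEED_PROVER = 7

# USD per problem, as quoted in the article.
COST_PER_PROBLEM = {
    "Leanstral 1.5": 4,
    "Aleph Prover (lower figure)": 54,
    "Aleph Prover (upper figure)": 68,
    "Seed-Prover 1.5 high": 300,  # Mistral's estimate, "or more"
}

seed_prover_solved = LEANSTRAL_SOLVED - LEAD_OVER_SEED_PROVER
base = COST_PER_PROBLEM["Leanstral 1.5"]
ratios = {name: cost / base for name, cost in COST_PER_PROBLEM.items()}

print(seed_prover_solved)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```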

Test-time scaling is the model's defining behaviour. Raising the token budget per attempt lifts PutnamBench pass@8. The Mistral team reports 44 problems solved at a 50k-token budget, 244 at 200k, 493 at 1M, and 587 at 4M.
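Expressed as a share of the 672 PutnamBench problems, the reported scaling curve looks like this. The calculation uses only the numbers quoted above; the variable names are ad hoc.

```python
TOTAL_PROBLEMS = 672

# Tokens per attempt -> problems solved, as reported by Mistral.
solved_by_budget = {50_000: 44, 200_000: 244, 1_000_000: 493, 4_000_000: 587}

rates = {
    budget: round(100 * solved / TOTAL_PROBLEMS, 1)
    for budget, solved in solved_by_budget.items()
}
for budget, rate in rates.items():
    print(f"{budget:>9,} tokens: {rate}%")
```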

Case Studies and Use Cases

Leanstral was trained mainly on mathematics, but it also verifies code. The Mistral team documents two case studies that matter for engineers.

First, Leanstral proved O(log n) time complexity for a real AVL tree implementation. AVL trees are self-balancing binary search trees. The proof used structural induction and monadic time tracking via the TimeM monad. It ran over 2.7 million tokens across 22 compactions. It established a bound near 48 steps per height unit, plus a constant.
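The idea behind monadic time tracking is to pair every result with a step count, so that a cost bound becomes a statement about the returned value. The Python sketch below imitates that style on a toy lookup; it is not the Lean proof. It charges one tick per visited node (the proven bound is about 48 steps per height unit plus a constant), and a midpoint-built balanced tree stands in for an AVL tree. All names are hypothetical.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def build_balanced(keys):
    """Build a height-balanced BST from sorted keys (a stand-in for an AVL tree)."""
    if not keys:
        return None
    mid = len(keys) // 2
    return Node(keys[mid], build_balanced(keys[:mid]), build_balanced(keys[mid + 1:]))

def height(node):
    return 0 if node is None else 1 + max(height(node.left), height(node.right))

def contains(node, key):
    """Return (found, steps): the answer paired with one tick per node visited."""
    if node is None:
        return False, 0
    if key == node.key:
        return True, 1
    found, steps = contains(node.left if key < node.key else node.right, key)
    return found, steps + 1

tree = build_balanced(list(range(1000)))
h = height(tree)
worst = max(contains(tree, k)[1] for k in range(-1, 1001))
print(h, worst)
```

Because the step count never exceeds the height, and the height of a balanced tree grows logarithmically in the number of keys, the lookup cost is O(log n). The Lean proof establishes the same shape of argument formally, with its own constants.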

Second, Leanstral found real bugs in open-source code. An automated pipeline used Aeneas to translate Rust into Lean. Leanstral inferred user intent and generated correctness properties. It attempted each property in four tries, then the negation in four more.
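The prove-then-disprove step can be sketched as follows. This is a minimal sketch, assuming a hypothetical `try_prove` callable that returns True when a verified proof is found; the outcome labels and the `Not (...)` string encoding of negation are illustrative, not the pipeline's actual interface.

```python
def classify_property(prop, try_prove, tries=4):
    """Try to prove `prop` up to `tries` times, then its negation; report the outcome."""
    if any(try_prove(prop) for _ in range(tries)):
        return "holds"
    if any(try_prove(f"Not ({prop})") for _ in range(tries)):
        return "violated"
    return "undecided"

# Stub prover: pretends these two goals are the only provable ones.
provable = {"roundtrip", "Not (no_overflow)"}

def stub_prover(goal):
    return goal in provable

for name in ["roundtrip", "no_overflow", "terminates"]:
    print(name, "->", classify_property(name, stub_prover))
```

A proved negation is what marks a property as violated. Whether a violation is a genuine bug still depends on whether the inferred property matched the author's intent, which fits the gap between the violated-property and bug counts reported below.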

Across 57 repositories, it flagged 47 violated properties and 11 genuine bugs. Five were previously unreported on GitHub. One bug sat in the sign function for zigzag decoding in `datrs/varinteger`. On input `Std.U64.MAX`, the expression `(value + 1)` overflowed. That caused crashes in debug mode and silent corruption in release.
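The overflow can be reproduced in miniature by simulating 64-bit unsigned arithmetic. The decoder shape below, which negates `(value + 1) / 2` for odd inputs, is an assumption for illustration and not a copy of the `datrs/varinteger` source; the function names are hypothetical.

```python
U64_MAX = (1 << 64) - 1

def decode_wrapping(value):
    """Release-mode analogue: `value + 1` wraps around modulo 2**64."""
    if value & 1:
        return -(((value + 1) & U64_MAX) // 2)
    return value // 2

def decode_checked(value):
    """Debug-mode analogue: overflow in `value + 1` aborts instead of wrapping."""
    if value == U64_MAX:
        raise OverflowError("value + 1 overflows u64")
    return decode_wrapping(value)

def decode_safe(value):
    """Overflow-free zigzag decode: (value >> 1) XOR -(value & 1)."""
    return (value >> 1) ^ -(value & 1)

print(decode_wrapping(U64_MAX))  # wrapped result
print(decode_safe(U64_MAX))      # intended result
```

Under these assumptions the wrapping variant returns 0 for `U64_MAX`, where zigzag decoding should give -2^63, while the checked variant raises. That mirrors the split between silent corruption in release builds and crashes in debug builds.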

Practical use cases follow directly from these examples. Dev teams can complete partial proofs inside a repository. They can generate correctness properties for a function automatically. They can stress-test Rust code by proving or disproving inferred invariants.

Getting Started: Code and Deployment

The simplest path is Mistral Vibe, Mistral’s agent CLI. Leanstral runs on Mistral’s free plan. Enable ‘Labs models’ in your account, then create an API key.

Install Vibe, add the Lean agent, then launch it:

# 1. Set up Mistral Vibe
uv tool install mistral-vibe
uv tool upgrade mistral-vibe
vibe --setup

# 2. Inside vibe, install Leanstral, then leave vibe
/leanstall
exit

# 3. Launch the Lean agent
vibe --agent lean

For self-hosting, install vLLM 0.24.0 or newer, then serve the weights:

# Installs mistral_common >= 1.11.5 automatically
uv pip install -U vllm --torch-backend=auto

vllm serve mistralai/Leanstral-1.5-119B-A6B \
  --max-model-len 200000 \
  --tensor-parallel-size 4 \
  --attention-backend FLASH_ATTN_MLA \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --reasoning-parser mistral

Call the server through the OpenAI-compatible client. Set `reasoning_effort` to `high` for complex prompts, or `none` for speed:

from openai import OpenAI

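# Point the OpenAI client at your vLLM server.
# Assumption: vLLM's default address (localhost, port 8000); change it to match your deployment.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")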

TEMP = 1.0
MAX_TOK = 32000
REASONING = "high"  # switch to 'none' for faster answers

model = client.models.list().data[0].id

messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "Define the transition rules as an inductive proposition in Lean 4."}
    ]},
]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=TEMP,
    max_tokens=MAX_TOK,
    reasoning_effort=REASONING,
)

print(response.choices[0].message.content)
print(response.choices[0].message.reasoning)
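
For readers new to Lean, the prompt above asks for a relation defined by its rules. The block below is a hand-written illustration of what an inductive proposition for transition rules can look like, using a made-up three-state machine; it is not output from Leanstral, and the names `State` and `Step` are hypothetical.

```lean
-- A made-up three-state machine.
inductive State where
  | idle
  | running
  | done

-- `Step s t` holds exactly when the machine may move from `s` to `t`.
inductive Step : State → State → Prop where
  | start  : Step State.idle State.running
  | finish : Step State.running State.done

-- Each constructor is a rule, so proving a transition means naming its rule.
example : Step State.idle State.running := Step.start
```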

Leanstral also supports OpenAI-style tool calling. You can expose a function such as `lean_run_code` to compile snippets. Mistral further recommends the `lean-lsp-mcp` server for tighter Lean integration.
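A tool definition in the OpenAI function-calling format might look like the sketch below. Only the tool name `lean_run_code` comes from the article; the description, the `code` parameter, the `handle_tool_call` helper, and the stub runner are assumptions for illustration, and the stub does not actually invoke Lean.

```python
import json

# Tool schema in the OpenAI function-calling format (parameters are assumed).
lean_run_code_tool = {
    "type": "function",
    "function": {
        "name": "lean_run_code",
        "description": "Compile a Lean 4 snippet and return compiler diagnostics.",
        "parameters": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Lean 4 source to compile."},
            },
            "required": ["code"],
        },
    },
}

def handle_tool_call(name, arguments_json, runners):
    """Decode a tool call's JSON arguments and dispatch to the matching local runner."""
    args = json.loads(arguments_json)
    return runners[name](**args)

# Stub runner: a real one would run the Lean compiler and return its output.
def fake_lean_runner(code):
    return f"received {len(code)} characters"

output = handle_tool_call(
    "lean_run_code",
    '{"code": "example : True := trivial"}',
    {"lean_run_code": fake_lean_runner},
)
print(output)
```

To use it, pass `tools=[lean_run_code_tool]` to `client.chat.completions.create`, read the name and JSON arguments from each entry in `response.choices[0].message.tool_calls`, and send the runner's output back as a `tool` message.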

What it means

For developers working with formal methods, the shift is practical rather than theoretical. At roughly $4 per problem on PutnamBench, against the $54 to $300 or more that Mistral cites for competing provers, competition-level proving comes within reach of teams without large compute budgets. The ability to infer user intent and generate correctness properties means engineers can automate parts of the verification pipeline that usually require manual theorem construction. The integer overflow found in `datrs/varinteger`, which silently corrupts data in release builds, shows that the tool can move beyond pure mathematics into tangible software reliability issues.
