Alexandria’s latest AI model ran autonomously for 35 hours to optimize code for its own custom chip
Key Takeaways
- The Alibaba Qwen team has released a new AI model, Qwen3.7-Max, designed for autonomous tasks and available exclusively through an API.
- In practical benchmarks, the model outperformed many competing models in speed by optimizing code fully autonomously over 35 hours of continuous work on a custom chip architecture it had not seen during training.
- The model also demonstrated its ability to detect undesirable behavior and cheating attempts during its own training process, showcasing its robustness and self-monitoring capabilities.
A kernel experiment that ran for 35 hours
Qwen3.7-Max was tasked with optimizing a hardware-based attention kernel for the open-source inference software SGLang. The hardware used was a cloud instance equipped with T-Head-ZW-M890 accelerators, an AI chip platform from Alibaba’s own semiconductor arm.
Despite never having seen this specific chip architecture during training, Qwen3.7-Max ran 432 kernel tests over 35 hours and made 1,158 total tool calls to compile, measure, and revise the code in loops. It caught compilation errors and identified performance bottlenecks on its own.
The result was an average 10x speedup over the reference implementation. Competitor models such as GLM 5.1 hit a 7.3x speedup, Kimi K2.6 got to 5x, DeepSeek V4 Pro managed 3.3x, and Qwen3.6-Plus barely moved the needle at 1.1x. Models that quit early ended their sessions after five straight rounds with no tool calls.
Training splits task, tool environment, and validator
The new Max version builds on a training approach first rolled out with Qwen3.5. Each training task is broken into three independent pieces: the actual task, the tool environment, and the validator that checks the result. These can be mixed and matched freely.
By practicing across different tool environments and checking results using various test methods, Qwen3.7-Max is designed to pick up strategies that work everywhere rather than shortcuts tied to one specific setup. This approach allows it to perform consistently across different agent frameworks.
The model polices its own training for reward hacking
During its own training, Qwen3.7-Max served as a watchdog, watching over 80 hours of software engineering tasks and running more than 10,000 checks to detect tricks the model might use to game its rewards.
The model wrote 13 new detection rules and flagged 1,618 cases. Compared to other models like Claude Opus 4.6 Max, Kimi K2.6 Thinking, GLM-5.1 Thinking, and DeepSeek V4 Pro Max, Qwen3.7-Max demonstrated strong performance in various benchmarks, including SWE-Verified (80.4), GPQA Diamond (92.4), HMMT 2026 February (97.1), and Apex (44.5).
One year in simulation tests long-term planning
To gauge long-term planning, the team used YC-Bench to simulate a startup’s full one-year life cycle. Qwen3.7-Max managed $2.08 million in total revenue and completed 237 tasks.
It outperformed its predecessor, Qwen3.6-Plus, which hit $1.05 million in revenue. The model’s performance was consistent across various benchmarks, including those provided by the team itself.
Beyond typical use cases
The Qwen team also showcased Qwen3.7-Max steering a four-legged robot using its own robotics framework and a paired navigation model. This demonstrates the model’s versatility in handling complex tasks beyond coding and software engineering.
Originally published at the-decoder.com. Curated by AI Maestro.
Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.




