For makers and artists, the lesson is stark: if your creative stack relies on rigid scaffolding around outdated base models, it will likely become obsolete before you even finish the project.
Two recent papers in Nature demonstrate that specialised AI agents can match or exceed physicians in diagnosing conditions and crafting treatment plans within simulated environments. Crucially, both systems achieved these results while running on underlying language models that are now considered outdated.
The German system MIRA surpassed doctors in diagnosing conditions such as pancreatitis and pneumonia. Meanwhile, Google’s AMIE produced more accurate treatment and testing plans than primary care physicians. These findings suggest that while AI is closing the gap on clinical value, the specific architectural tricks used to get there may not age well.
MIRA operates as an autonomous agent in a sealed virtual hospital
Medical Intelligence for Reasoning and Action, or MIRA, was developed by researchers at TUD Dresden and Heidelberg University. Unlike standard chat interfaces, this system functions as an autonomous agent inside a sealed, virtual electronic health record. It has access to more than 85,000 options across eleven different tools.
The workflow is comprehensive: it ingests patient histories, orders lab work, microbiology tests, and imaging, interprets the results, generates differential diagnoses, and writes treatment plans including prescriptions, surgical planning, and hospital admissions.
The team validated MIRA against more than 500 real emergency department cases drawn from the public MIMIC-IV dataset. In this setup, a second AI agent played the patient, disclosing only information present in the actual medical record.
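The disclosure rule for the simulated patient can be illustrated with a minimal sketch. In the study the patient is played by a second AI agent; here that agent is swapped for a plain dictionary lookup, and the class name, record fields, and fallback reply are all hypothetical, chosen only to show the constraint that nothing outside the record is revealed.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedPatient:
    """Hypothetical stand-in for the patient agent: answers only from the record."""

    record: dict[str, str]
    disclosed: list[str] = field(default_factory=list)

    def answer(self, topic: str) -> str:
        # Normalise the question topic so lookups are case-insensitive.
        key = topic.strip().lower()
        if key in self.record:
            self.disclosed.append(key)
            return self.record[key]
        # Anything not in the record is never invented.
        return "I don't know."


# Illustrative record fields, not taken from MIMIC-IV.
patient = SimulatedPatient(
    record={"chief complaint": "abdominal pain", "duration": "two days"}
)
```

Asking `patient.answer("Duration")` returns the stored value, while a topic missing from the record falls through to the fixed fallback reply.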
Across eight disease categories, MIRA reached the correct diagnosis 88.9 percent of the time, measured against the documented diagnoses in the dataset. In a direct head-to-head comparison involving a subset of 311 cases under identical conditions, MIRA scored 87.8 percent. Four experienced specialists reached 78.1 percent, while a mixed team of residents and specialists managed 71.1 percent.
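For readers who want the head-to-head gaps spelled out, the short calculation below works them out in percentage points. It uses only the three accuracy figures reported for the 311-case subset; the variable and function names are illustrative.

```python
# Accuracy on the 311-case head-to-head subset, in percent (from the article).
mira = 87.8
specialists = 78.1
mixed_team = 71.1


def gap_points(a: float, b: float) -> float:
    """Difference between two accuracies, in percentage points."""
    return round(a - b, 1)


gap_vs_specialists = gap_points(mira, specialists)  # 9.7
gap_vs_mixed_team = gap_points(mira, mixed_team)    # 16.7
```

These are differences in percentage points, not relative improvements.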
MIRA performed best on appendicitis (98.6 percent) and pancreatitis (92.3 percent). Both the AI and the doctors struggled more with pneumonia (72.4 percent) and urinary tract infections (77.6 percent).
Safety was also scrutinised. Blinded specialist reviewers who did not know whether a recommendation came from MIRA or a human found no dangerous drug interactions, no incorrect dosing for patients with impaired kidney function, and no risky painkiller prescriptions. MIRA was nearly perfect at capturing a patient’s current medications and nailed the question of whether a patient needed hospitalization, missing not a single case that required it. Performance remained steady even when test patients spoke only German or French, or acted particularly anxious. The source code is available on GitHub.
AMIE pairs two agents with clinical guidelines
Google’s AMIE takes a different approach, focusing on managing patients across multiple visits. The system consists of two parts: a conversational agent that handles the fast, friendly dialogue with the patient, and a second agent that works in the background, thinking more carefully and cross-referencing the case against medical guidelines.
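The split between a fast dialogue agent and a slower background agent can be sketched roughly as follows. This is not Google's implementation: the class names, the canned reply, and the tiny guideline table are assumptions made for illustration, and both agents are reduced to plain Python where the real system uses language models.

```python
from dataclasses import dataclass

# Hypothetical guideline store: condition -> recommended step (placeholder text).
GUIDELINES = {"condition-a": "recommended step for condition-a"}


@dataclass
class CasePlan:
    condition: str
    steps: list[str]
    guideline_checked: bool


class DialogueAgent:
    """Fast, patient-facing side: records messages and keeps the chat moving."""

    def __init__(self) -> None:
        self.transcript: list[str] = []

    def reply(self, patient_message: str) -> str:
        self.transcript.append(patient_message)
        return "Thank you. Could you tell me more about that?"


class ReasoningAgent:
    """Slower background side: drafts a plan and checks it against guidelines."""

    def plan(self, transcript: list[str], condition: str,
             guidelines: dict[str, str]) -> CasePlan:
        step = guidelines.get(condition)
        if step is None:
            # No matching guideline: return an empty plan flagged as unchecked.
            return CasePlan(condition, [], False)
        return CasePlan(condition, [step], True)


dialogue = DialogueAgent()
dialogue.reply("I have had a cough for a week.")
plan = ReasoningAgent().plan(dialogue.transcript, "condition-a", GUIDELINES)
```

Keeping the two roles separate means the patient-facing side never waits on the guideline check, which is the design idea the article describes.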
In a tightly controlled study, Google compared AMIE with 21 primary care physicians across 100 cases spanning multiple visits. The benchmark was the UK’s NICE Guidance and BMJ Best Practice guidelines. Actors portrayed patients via text chat.
According to the study, AMIE matched the physicians on treatment decisions but beat them on plan accuracy and guideline adherence. At the first visit, AMIE’s overall plan was rated appropriate in 95 percent of cases, compared to 72 percent for the physicians. Both specialist reviewers and the patient actors preferred AMIE more often than the human doctors.
To test drug knowledge, the team built a dedicated benchmark called RxQA, based on two national drug formularies and verified by licensed pharmacists. AMIE outscored the primary care physicians on the harder questions. However, the test was tough for both sides; even on the easier questions, the best score stayed below 75 percent.
Both teams warn against jumping to conclusions
The authors are clear about the limits of their findings. MIRA recommended “care that deviated from best practices” for a “small but non-zero” share of patients. The simulated patient’s answers may also have been “more structured than real speech of patients in emergency departments.” Furthermore, it cannot be ruled out entirely that the freely available MIMIC-IV dataset was already part of the training data for the models used. If so, the measured performance would represent a ceiling rather than a realistic estimate. The comparison physicians also worked within the German emergency department system, which differs from other countries.
The AMIE developers call their study a “milestone” but stress that neither the case selection nor the text-only conversations reflect a real clinic. The system shows “promising capabilities” but is “not ready for real-world translation.” More work is needed to address “latent reasoning errors” that can creep into the system’s hidden reasoning steps.
Jakob Kather, whose research group co-developed MIRA, told the Financial Times: “We are getting a preview of how AI could transform medicine.” He compared AI agents like these to an airplane’s autopilot: “These systems can support and relieve medical professionals by taking over routine tasks, but ultimate responsibility will always remain with the physicians.”
Independent experts temper the excitement
Researchers not involved in either study praised the careful methodology but pointed out that these are simulations. Catherine Pope, a professor of medical sociology at the University of Oxford, told the FT that this is “some remove from the messy, complex, human world of everyday healthcare.”
Julie Jacko, a professor of health informatics at the University of Edinburgh, said many of the reported advantages came down to “precision and completeness of plans” rather than “clear differences in clinical correctness.” The study “demonstrates performance against a structured standard rather than fully capturing the complexity of real clinical decision-making.”
Scaffolding helps weak models most, stronger ones don’t need it
One of the most revealing findings is buried in AMIE’s supplementary experiments. As often happens with peer-reviewed studies, both systems rely on older AI models. AMIE still runs on Google’s older Gemini 1.5 Flash. MIRA uses OpenAI’s GPT-4o and o1-preview. All of these have since been surpassed by newer generations.
Google’s researchers swapped out individual components to figure out what actually drives performance: the elaborate scaffolding of two-agent architecture, guideline matching, and specialized training, or simply the underlying language model.
With the older Gemini 1.5 Flash, the specialized setup delivered the big performance boost the study describes. But when the researchers dropped the same setup onto the newer Gemini 2.5 Flash, the advantage almost vanished.
The specialized system, in other words, compensates for the older model’s weaknesses by forcing structured reasoning, making it cite guidelines, and suppressing hallucinations. A stronger model can do all of that on its own. The paper acknowledges that AMIE’s value shrinks as the base model improves. In fact, newer general-purpose models like Gemini 2.5 Pro, o3, and GPT-5 already score “largely comparable” to the full AMIE system on the RxQA drug test.
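The component-swapping experiment described above amounts to an ablation grid. The sketch below only enumerates such a grid and contains no scores. It assumes a full on/off combination of the three scaffolding components for each base model, which may be broader than what the paper actually ran. The identifier names are hypothetical.

```python
from itertools import product

BASE_MODELS = ["Gemini 1.5 Flash", "Gemini 2.5 Flash"]
COMPONENTS = ["two_agent_architecture", "guideline_matching", "specialized_training"]


def ablation_grid(base_models: list[str], components: list[str]) -> list[dict]:
    """List every base model paired with every on/off combination of components."""
    grid = []
    for model in base_models:
        for flags in product([False, True], repeat=len(components)):
            grid.append({"base_model": model, **dict(zip(components, flags))})
    return grid


configs = ablation_grid(BASE_MODELS, COMPONENTS)
```

Comparing the all-off and all-on configurations for each base model is what would show whether the scaffolding gain shrinks on the newer model.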
In practice, AMIE appears to have been overtaken by the pace of AI development. It’s a pattern that keeps repeating: scaffolding around language models becomes redundant as stronger models arrive, sometimes because the scaffolding itself feeds into training data for the next generation. That doesn’t make the ideas behind it worthless. In coding, and increasingly in other areas, scaffolding tools like Claude Code, OpenAI Codex, and Claude Cowork give models access to tools, context, and memory, and even stronger models still perform better on complex tasks with that kind of support. But the scaffolding has to keep up with model performance, or it eventually becomes dead weight.
MIRA lacks this kind of analysis. Part of its architecture, though, is less about patching model weaknesses and more about connecting the AI to a hospital’s clinical systems. That part wouldn’t become obsolete with stronger models.
Key takeaways
- Specialised AI agents can match or beat doctors in simulated clinical tasks, but they currently rely on base models that are already outdated.
- The specific architectural “scaffolding” used to boost performance on older models often becomes redundant when newer, stronger general-purpose models are applied.
- Researchers emphasise that these systems are not ready for real-world translation, citing simulated patient answers that are more structured than real speech, and latent reasoning errors.
- While the technology offers a preview of future automation, ultimate responsibility for medical decisions will remain with human physicians.