Import AI 455: Automating AI Research

Disclosure: Some links in this article are affiliate links. AI Maestro may earn a commission if you make a purchase, at no…

By AI Maestro May 10, 2026 6 min read
Import AI 455: Automating AI Research

“`html




Import AI 455: Automating AI Research

Import AI

Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe.

AI systems are about to start building themselves. What does that mean?

I’m writing this post because when I look at all the publicly available information, I reluctantly come to the view that there’s a likely chance (60%+) that no-human-involved AI R&D — an AI system powerful enough that it could plausibly autonomously build its own successor — happens by the end of 2028. This is a big deal. It’s hard for me to wrap my head around it. The implications are so large that I feel dwarfed by them, and I’m not sure society is ready for the kinds of changes implied by achieving automated AI R&D.

The purpose of this essay is to enumerate why I think the takeoff towards fully automated AI R&D is happening. I’ll discuss some of the consequences of this, but mostly I expect to spend the majority of this essay discussing the evidence for this belief, and will spend most of 2026 working through the implications.

In terms of timing, I don’t expect this to happen in 2026. But I think we could see an example of a “model end-to-end trains it successor” within a year or two — certainly a proof-of-concept at the non-frontier model stage, though frontier models may be harder (they’re a lot more expensive and are the product of a lot of humans working extremely hard).

My reasoning for this stems primarily from public information: papers on arXiv, bioRxiv, and NBER, as well as observing the products being deployed into the world by the frontier companies. From this data I arrive at the conclusion that all the pieces are in place for automating the production of today’s AI systems — the engineering components of AI development. And if scaling trends continue, we should prepare for models to get creative enough that they may be able to substitute for human researchers at having creative ideas for novel research paths, thus pushing forward the frontier themselves, as well as refining what is already known.

Upfront caveat: For much of this piece I’m going to try to assemble a mosaic view of AI progress out of things that have happened with many individual benchmarks. As anyone who studies benchmarks knows, all benchmarks have some idiosyncratic flaws. The important thing to me is the aggregate trend which emerges through looking at all of these datapoints together, and you should assume that I am aware of the drawbacks of each individual datapoint.

Let’s go through some of the evidence:

The coding singularity — capabilities over time

AIs have revolutionized the production of code. This has happened due to two related trends: AI systems have gotten better at writing complicated real-world code, and AI systems have gotten much better at chaining together many linear coding tasks (e.g., writing code, then testing it) independent of human oversight.

Two things that exemplify this trend are SWE-Bench and the METR time horizons plot.

Solving real-world software engineering problems — SWE-Bench

SWE-Bench is a widely used coding test which evaluates how well AI systems can solve real world GitHub issues. When SWE-Bench launched in late 2023 the best score at the time was Claude 2 which had an overall success rate of ~2%. Claude Mythos Preview gets 93.9%, effectively saturating the benchmark.

Measuring an AI system’s ability to complete tasks that take people a long time — METR

METR makes a plot that tells us about the complexity of tasks AIs can complete, measured by how many hours a skilled human would take to do them. The key measure here is one which tells you the rough time horizon over which AI systems can be 50% reliable at a basket of tasks.

Here, progress has been extremely striking: In 2022, GPT 3.5 could do tasks that might take a person about ~30 seconds. In 2023, this rose to 4 minutes with GPT-4. In 2024, this rose to 40 minutes (o1). In 2025, it reached ~6 hours (GPT 5.2 (High)). In 2026, it has already risen to ~12 hours (Opus 4.6).

Ajeya Cotra, a longtime AI forecaster who works at METR, thinks it isn’t unreasonable to expect AI systems to do tasks that take ~100 hours by the end of 2026.

The more skilled AI systems get and the better they get at working independently of us, the more they can help automate chunks of AI R&D

Key ingredients in delegation are a) confidence in the skills of the person, and b) confidence in their ability to work independently of you in a way that is aligned with your intentions.

When we look at the competency of AI at coding, it seems that AI systems are getting far more skilled and also able to work independently of people for longer and longer periods before needing re-calibration. This correlates with what we see around us — engineers and researchers are now delegating larger and larger chunks of their work to AI systems, and as capabilities rise, so too does the complexity and importance of the work being delegated.

AI is getting good at core science skills essential to AI R&D

Think about modern science — a huge amount of it is about specifying a direction where you want to generate some empirical information, running experiments to generate that information, then sanity-checking the results of the experiment. The combination of advances in coding over time combined with the general world modeling capabilities of LLMs has yielded tools that are already helping to speed up human scientists and partially automate aspects of R&D broadly.

Implementing entire scientific papers and doing the experiments — CORE-Bench

CORE-Bench is a Computational Reproducibility Agent Benchmark. This benchmark challenges AI systems to “reproduce the results of a research paper given its repository. The agent must install libraries, packages, and dependencies and run the code. If the code runs successfully, the agent needs to search through all outputs to answer the task questions.” CORE-Bench was introduced in September 2024 and the best scoring system at the time was a GPT-4o model in a scaffold called CORE-Agent which scored ~21.5% on the hardest set of tasks in the benchmark.

In December 2025 one of the authors of CORE-Bench declared the benchmark ‘solved’, with an Opus 4.5 model achieving 95.5%.

Building entire machine learning systems to solve Kaggle competitions — MLE-Bench

MLE-Bench is an OpenAI-built benchmark which examines how well AI systems can compete (offline) in “75 diverse Kaggle competitions across a variety of domains, including natural language processing, computer vision, and signal processing.” At launch in October 2024, the top scoring system (an o1 model inside an agent scaffold) got 16.9%. As of February 2026, the best scoring system (Gemini3 inside an agent harness with search) gets 64.4%.

Kernel design — some of the types of work include:

  • Using DeepSeek’s models to try to build better GPU kernels
  • Automating the conversion of PyTorch modules to CUDA code
  • Meta using LLMs to automate the generation of optimized Triton kernels for use within its infrastructure
  • Using LLMs to help write kernels for non-standard hardware like Huawei’s Ascend chips (“AscendCraft”)
  • Fine-tuning open weight models for GPU kernel design (“Cuda Agent”, “Kernel Engine”)

Concluding thoughts:

We have seen significant progress in the engineering and scientific capabilities of AI systems. These advancements suggest that we are nearing a point where AI systems could autonomously build their successors, leading to transformative changes in how research is conducted.

If this prediction holds true, it will require society to prepare for new norms in research methodology and ethical considerations around automated AI R&D. The implications are vast and complex, making the task of preparing for such a future challenging but essential.

Key Takeaways

  • AIs have become adept at coding tasks, particularly those involving real-world software engineering problems.
  • AI systems can now perform increasingly complex scientific experiments and replicate research results with high reliability.
  • The automation of key AI development tasks is underway, suggesting that we are on a path to automated AI R&D by the end of 2028 or soon after.

“`

Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.

Name
Scroll to Top