I Let a Small Model Train on Its Own Mistakes. It Reached 80% on HumanEval and Beat GPT-3.5 on Math

“`html I Let a Small Model Train on Its Own Mistakes. It Reached 80% on HumanEval and Beat GPT-3.5 on Math Summary…

By AI Maestro May 15, 2026 1 min read
I Let a Small Model Train on Its Own Mistakes. It Reached 80% on HumanEval and Beat GPT-3.5 on Math

“`html




I Let a Small Model Train on Its Own Mistakes. It Reached 80% on HumanEval and Beat GPT-3.5 on Math

Summary of Findings

A few months ago, I encountered an issue with the DeepSeek-R1 paper regarding models improving through verifiable rewards. This led me to wonder if a model could teach itself by correcting its own mistakes without human intervention.

The Plan

I chose Qwen 2.5 base as my starting point and aimed to create pairs of (broken attempt, working attempt) for the model to learn from. I ran HumanEval tests on this initial setup but found that it had a significant drop in performance when compared to its original state.

Unexpected Results

I realized that the issue lay in how the grader was evaluating the model’s attempts. After fixing this, Qwen 2.5 base improved from scoring 25 problems on HumanEval to 112 problems correct. This improvement came after fine-tuning on its own mistakes.

Beyond Baseline

Further experimentation with larger models and different architectures showed consistent improvements. For instance, Qwen 2.5 14B base improved from 95 points to 106 points on HumanEval, demonstrating that this method is not specific to any particular model or architecture.

Math Problems

I also tested the approach with math problems and found that it could match models like GPT-3.5 in solving math tasks. By adjusting the training loop, I was able to create a scenario where the model would gradually move towards harder problems as it failed repeatedly.

Key Takeaways

  • The method of fine-tuning on self-generated pairs can lead to significant improvements in models’ performance across various tasks.
  • This approach is not limited to specific architectures or model families, indicating its broad applicability within the AI landscape.
  • However, there are limitations such as needing a sufficient number of correct attempts for fine-tuning and potential issues with certain types of data like code versus real-world applications.

Open source code and instructions for reproduction can be found here.

“`

Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.

Name
Scroll to Top