Making AI chatbots helpful weakens their ability to simulate human behavior, large-scale study finds
A large-scale study shows that the training process turning raw language models into helpful chatbots also weakens their ability to mimic human behavior. The effect gets worse with each new generation.
Language models are increasingly used as stand-ins for human test subjects to predict reactions to policy measures, simulate clinical training for psychiatrists, or model how students learn.
A new study from an international research consortium, including scientists from Helmholtz Munich, arrives at an inconvenient finding: the very training steps that turn language models into useful assistants make them worse at modeling human behavior.
The study builds on Psych-201, a new dataset of transcripts from behavioral experiments. It covers about 208,000 participants and roughly 26 million individual responses from hundreds of experiments, which makes it several times larger than any previous collection of its kind.
Each data point captures a participant’s full run through an experiment, along with detailed metadata like age, nationality, questionnaire responses, and other traits. The dataset was assembled through an open research collaboration involving researchers from more than 35 institutions.
Base models beat their fine-tuned counterparts
The researchers compared models from the Qwen3, Llama3, and OLMo 3 families, testing both base models and their various post-trained variants. Base models are trained only to predict the next word in text.
From there, extra training produces the versions tuned for instruction-following, step-by-step reasoning, or image processing. The metric: how well each model predicts the actual answers human participants gave.
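The article does not spell out the exact scoring rule. As a minimal sketch, assume the score is the average negative log-likelihood a model assigns to the option each participant actually chose, where lower is better. The function name and the toy probabilities below are hypothetical and not taken from the study:

```python
import math

def mean_negative_log_likelihood(predicted_probs, human_choices):
    """Average negative log-probability assigned to the observed human choices.

    predicted_probs: list of dicts mapping each answer option to a probability.
    human_choices: list of the options the participants actually picked.
    """
    if len(predicted_probs) != len(human_choices):
        raise ValueError("need one prediction per observed choice")
    total = 0.0
    for probs, choice in zip(predicted_probs, human_choices):
        # Floor the probability so a ruled-out option does not produce log(0).
        total -= math.log(max(probs.get(choice, 0.0), 1e-12))
    return total / len(human_choices)

# Toy data, invented for illustration: two trials with options "A" and "B".
human = ["A", "B"]
spread_out = [{"A": 0.6, "B": 0.4}, {"A": 0.5, "B": 0.5}]
overconfident = [{"A": 0.99, "B": 0.01}, {"A": 0.99, "B": 0.01}]

nll_spread = mean_negative_log_likelihood(spread_out, human)
nll_overconfident = mean_negative_log_likelihood(overconfident, human)
print(nll_spread, nll_overconfident)
```

Under this assumed metric, the overconfident predictor scores worse because it puts almost no probability on the second participant's answer, even though it is nearly certain about the first.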
The result holds across all families and sizes. Base models predict human behavior better than their post-trained descendants. The effect shows up for every common training objective, hitting hardest with reasoning models, followed by instruction tuning and vision extensions. In nearly every head-to-head comparison, the base model outperforms its specialized variant.
One obvious counter-explanation: maybe assistant models just answer more deterministically and fail to capture the natural spread of human behavior. The researchers tested this with an accuracy analysis on a subset of tasks with discrete answer options. Post-trained models still performed worse, making higher determinism unlikely as the sole explanation.
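Why an accuracy check helps here can be illustrated with a small sketch. It assumes accuracy means the share of trials in which the model's most probable option matches the human answer; the article does not give the study's exact definition, and the function name and toy data are hypothetical:

```python
def top_choice_accuracy(predicted_probs, human_choices):
    """Share of trials where the model's most probable option matches the human answer."""
    if len(predicted_probs) != len(human_choices):
        raise ValueError("need one prediction per observed choice")
    hits = 0
    for probs, choice in zip(predicted_probs, human_choices):
        top_option = max(probs, key=probs.get)  # option with the highest probability
        hits += top_option == choice
    return hits / len(human_choices)

# Toy data, invented for illustration: four trials with options "A" and "B".
human = ["A", "A", "B", "A"]
graded = [{"A": 0.7, "B": 0.3}] * 4          # spreads probability across options
deterministic = [{"A": 1.0, "B": 0.0}] * 4   # always fully certain of "A"

acc_graded = top_choice_accuracy(graded, human)
acc_deterministic = top_choice_accuracy(deterministic, human)
print(acc_graded, acc_deterministic)
```

Both predictors get the same accuracy because only the top-ranked option counts. A measure of this kind does not penalize a model for being overly certain, so a post-trained model that still scores lower on it is picking the wrong option more often, not merely expressing too little spread.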
The gap widens with every generation
While base models steadily improve from Qwen2 through Qwen2.5 to Qwen3, getting better at predicting human behavior with each generation, the gap to their derived assistant models keeps growing. Ongoing advances in post-training are making the divergence from human behavior worse.
The biggest distortion shows up in language tasks and reasoning. The researchers offer a plausible explanation: base models are, at their core, models of human language and therefore well-calibrated for language processing tasks. Post-training techniques like reinforcement learning from human feedback push them away from that original objective toward more user-friendly or normatively correct answers.
The same thing happens with reasoning. Human decisions are shaped by heuristics and systematic biases that base models apparently pick up. Reasoning training optimizes for logically correct answers instead, overwriting exactly the human quirks that matter for behavioral simulation.
A popular shortcut doesn’t work
A second finding concerns a widely used technique: giving language models participant-specific information to put them into a particular role. In the study, this took the form of an interview format where demographic details about each person were prepended before the experiment. Where available, the prompts included age, gender, nationality, education, clinical diagnoses, and questionnaire scores.
The effect was practically zero. That held even when the analysis was limited to developmental psychology experiments, where age-related differences should be informative. Earlier work had shown that persona prompts can produce human-like response distributions at the population level. But the new study questions whether they actually predict individual behavior or just look plausible on the surface.
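The interview-style persona prefix described above could be assembled roughly as follows. This is a sketch only: the question wording, field names, and both helper functions are hypothetical, and the study's actual prompt template is not reproduced in the article.

```python
def build_persona_prefix(participant):
    """Render the available participant attributes as a short interview transcript."""
    questions = [
        ("age", "How old are you?"),
        ("gender", "What is your gender?"),
        ("nationality", "What is your nationality?"),
        ("education", "What is your highest level of education?"),
    ]
    lines = []
    for key, question in questions:
        value = participant.get(key)
        if value is None:
            continue  # include a detail only where it is available
        lines.append(f"Interviewer: {question}")
        lines.append(f"Participant: {value}")
    return "\n".join(lines)

def build_prompt(participant, experiment_transcript):
    """Prepend the persona interview, if any, to the experiment transcript."""
    prefix = build_persona_prefix(participant)
    if not prefix:
        return experiment_transcript
    return f"{prefix}\n\n{experiment_transcript}"

# Hypothetical participant record with only two attributes available.
example = build_prompt(
    {"age": 34, "nationality": "German"},
    "You will now choose between two options.",
)
print(example)
```

Comparing a model's predictions with and without such a prefix on the same participant's responses is one way to measure whether the persona details add any predictive value for individuals.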
Centaur shows targeted training can still help
The authors see their findings as a variation of a known problem: extra training toward specific goals can degrade abilities acquired during pretraining. To test whether this is a hard limit, they looked at Centaur – a model specifically fine-tuned on a portion of the behavioral data.
Centaur showed much higher agreement with human behavior even on new tasks that weren’t part of its training. So extra training can help, but only when it targets behavioral modeling rather than logical correctness.
For research practice, the takeaway is clear: the convenient, readily available assistant models aren’t automatically the best choice for behavioral simulations. The researchers recommend either raw base models or variants trained specifically for behavioral simulation. Code and data are available on Hugging Face and GitHub.
That chatbot models have their pitfalls as digital test subjects isn’t new. A recent study of nine open-source language models found that optimizing for more human-sounding output comes at the cost of factual precision, and a classifier unmasked AI responses with 70 to 80 percent accuracy. The persona trick also worked worse than expected.
Another study found that models can barely pose as weak or strong learners on command, with their hit rates shifting by less than a percentage point. And when it comes to reasoning, a deep gap persists anyway: an analysis of more than 170,000 reasoning traces showed that reasoning models think differently from humans, falling into a kind of sequential autopilot.