Why you shouldn’t leave model selection on default in Copilot, Gemini and other AI tools


By AI Maestro, May 24, 2026 (3 min read)

Key Points

  • An experiment shows that Microsoft Copilot makes up country-specific stereotypes when analyzing text data instead of actually looking at what the data says.
  • In tests using simulated answers about career goals, the AI in standard mode claimed Italians were more interested in art than Brits. The problem: the underlying datasets for both countries were identical.
  • The experiment ran Copilot in “Auto” mode, which is supposed to pick the best model for a given task. It didn’t. Reasoning models handled the task just fine, but users need to know how and when to switch to a reasoning model depending on the tool. Most users likely don’t.

An experiment shows how Microsoft’s AI assistant Copilot applies stereotypes when analyzing data instead of actually reading it. Thinking models solve the task, but in some tools users have to know to switch to them manually.

Microsoft Copilot has become the go-to tool for quick data analysis at many companies. But an experiment by mathematician Adam Kucharski shows that when analyzing text data, the tool can spit out results that have nothing to do with the actual data. Instead, it falls back on stereotypes baked into the underlying language model.

For the test, Kucharski created 2,000 simulated free-text responses about emotions and labeled them “UK.” He then copied the same 2,000 responses and labeled them “US.” The combined 4,000 entries were shuffled and handed to Copilot in “Auto” mode for analysis.

The result: Copilot delivered a detailed summary of how US and UK respondents supposedly differed. “Based on the dataset you shared, US and UK responses differ mainly in tone, intensity, and wording style, even though they express similar emotional states,” the tool concluded. But the data was identical.
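The control setup described above can be reproduced in a few lines. The following is a minimal sketch, not Kucharski's actual code: the function names `build_control` and `groups_identical` and the three sample responses are hypothetical stand-ins. It copies the same responses under each label, shuffles the rows, and then confirms that every group holds exactly the same set of answers.

```python
import random
from collections import Counter


def build_control(responses, labels, seed=0):
    """Copy the same responses under every label, then shuffle the rows."""
    rows = [(label, text) for label in labels for text in responses]
    random.Random(seed).shuffle(rows)
    return rows


def groups_identical(rows):
    """Return True if every label holds the same multiset of responses."""
    by_label = {}
    for label, text in rows:
        by_label.setdefault(label, Counter())[text] += 1
    counters = list(by_label.values())
    return all(c == counters[0] for c in counters[1:])


# Hypothetical stand-in responses, not the data used in the experiment.
responses = ["I feel anxious about work", "I am happy today", "I feel calm"]
rows = build_control(responses, ["UK", "US"])
print(len(rows), groups_identical(rows))
```

If `groups_identical` returns `True`, any difference a tool reports between the groups cannot come from the data.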

Copilot sees Italians as artists and Americans as business people

In a second experiment, Kucharski pushed harder. He had a language model generate 200 statements about career goals and copied the dataset five times for the US, UK, France, Germany, and Italy.

Copilot again produced country-specific differences: Italians were three times more likely to show interest in arts careers than Brits, and Americans were 1.5 times more business-oriented than the French. All five groups contained the same clichéd and biased statements.

When Kucharski asked Copilot to dig deeper, the tool first ran a simple keyword-based count. As expected, it returned identical results for all countries. But Copilot ignored its own finding. Instead, it offered a quantified analysis that once again showed made-up differences, this time with completely fabricated percentages.
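A keyword-based count of the kind mentioned above might look like the sketch below. The keyword lists, the sample statements, and the function name `keyword_counts` are all assumptions made for illustration; the article does not say which keywords Copilot used. The code matches whole words rather than substrings, so that "startup" is not counted as "art".

```python
import re
from collections import Counter

# Hypothetical keyword sets; a real analysis would need a vetted coding scheme.
KEYWORDS = {
    "arts": {"art", "music", "design"},
    "business": {"business", "startup", "finance"},
}


def keyword_counts(rows):
    """Count, per country, how many statements mention each category."""
    counts = {}
    for country, text in rows:
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        tally = counts.setdefault(country, Counter())
        for category, words in KEYWORDS.items():
            if tokens & words:  # at least one whole-word match
                tally[category] += 1
    return counts


# Hypothetical statements, copied unchanged for each country.
statements = [
    "I want to run my own business",
    "I hope to work in music and design",
    "I want a stable job",
]
countries = ["US", "UK", "France", "Germany", "Italy"]
rows = [(country, text) for country in countries for text in statements]

for country, tally in keyword_counts(rows).items():
    print(country, dict(tally))
```

Because each country receives the same statements, every country ends up with the same tally, which is the result a plain count should give on duplicated data.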

Copilot’s Auto mode is the main culprit

The analysis ran in “Auto” mode, which Microsoft says should pick the best model on its own. It obviously didn’t. Most users probably stick with this default in Copilot and in other tools too. The version Kucharski tested is the standard Copilot that comes with a Microsoft 365 Business account. The majority of Copilot users most likely run this version.

“Which means there’s a real risk that people are currently using AI to produce analysis that bears no resemblance to what people actually said,” Kucharski writes. If these kinds of analyses were applied to real datasets, groups with no actual differences could end up looking worlds apart, all because of the language model’s built-in assumptions about demographic groups.

Thinking models get it right

I repeated the career goals test with Microsoft Copilot and Google’s new Gemini Flash 3.5 model. In both cases, the fast models (“Instant” / Auto, Flash 3.5) responded with country stereotypes instead of catching that the data is identical.

ChatGPT Instant and Claude Opus 4.7 automatically kicked into extended reasoning mode, wrote Python code to analyze the dataset, and spotted the duplicates. Manually switching Copilot and Gemini to their more capable thinking models also caught the duplication.

Even thinking models aren’t a free pass for data analysis, though. Catching identical data works mostly when the duplication is obvious, Kucharski says. With real datasets, where, say, British and American respondents give similar but not identical answers, counting tools like Python scripts might not be enough. The model might then fall back on its built-in biases. That is the real issue: you don’t know when the model has hit its limits, and it’s difficult to tell afterward whether that happened or how much it skewed the results.

Anyone who goes with their gut when picking a prompt or model also risks hindsight bias: after the fact, it always feels obvious that a different model would have nailed it. Kucharski recommends writing down what result you expect before switching models and running simple sanity checks before trusting any AI-generated analysis.
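One possible sanity check is a label-shuffling (permutation) test. This is a standard statistical technique offered here as an illustration, not a method the article attributes to Kucharski. The sketch assumes each response has already been coded as 1 or 0 by a transparent rule (for example, "mentions an arts career"); the function names `rate` and `shuffle_check` are hypothetical. It measures how often randomly reassigned country labels produce a gap at least as large as the observed one.

```python
import random


def rate(flags):
    """Share of responses flagged 1."""
    return sum(flags) / len(flags)


def shuffle_check(flags_a, flags_b, n_shuffles=2000, seed=0):
    """Return the observed gap and the share of random relabelings
    whose gap is at least as large."""
    rng = random.Random(seed)
    observed = abs(rate(flags_a) - rate(flags_b))
    pooled = list(flags_a) + list(flags_b)
    n_a = len(flags_a)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        gap = abs(rate(pooled[:n_a]) - rate(pooled[n_a:]))
        if gap >= observed:
            hits += 1
    return observed, hits / n_shuffles


# Hypothetical coded data: two groups with identical flags.
uk = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
italy = list(uk)
print(shuffle_check(uk, italy))
```

A share close to 1 means random labels produce gaps like the observed one all the time, so a claimed country difference deserves suspicion. The check only tests the counts; it says nothing about whether the coding of each response was itself biased.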
