Google Gemini 1.5 and Flash LLMs Show Significant Advances Hidden in Research

By AI Maestro June 12, 2026 4 min read
For builders and artists who rely on generative models, the ground has shifted. Google has quietly upgraded its Gemini 1.5 models, but without the usual fanfare: the details sit in a research paper that has been silently expanded on the company’s own servers. The announcement at Google I/O focused on the Pro and Flash variants, yet the documentation hosted on Google DeepMind servers has grown significantly since the public release. As a result, developers are currently working with models that are more capable than what was officially detailed.

Consider the documentation itself. The version of the paper titled “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” published on arXiv.org on 25 April 2024, runs to 77 pages. The counterpart hosted directly on Google’s infrastructure is now 153 pages long. The page count has therefore almost doubled (153 / 77 ≈ 1.99) in a matter of weeks, which suggests the addition of substantial material on the Flash model and other improvements that were not covered in the initial conference presentation.

While Google may have reasons for this selective disclosure, the implications for creators are clear. The expanded paper reports significant gains in mathematical reasoning and in adherence to long, multi-part instructions, moving beyond simple text generation into deeper utility.

Superior Mathematical Reasoning

A standout feature in the updated research is the focus on advancing mathematical reasoning. Google has trained a specialised variant of the Pro model designed to tackle open-ended quantitative problems. Unlike standard models that rush to a conclusion, this approach allocates additional inference time to allow the system to explore a wider range of possibilities, mimicking the contemplative process of a human mathematician.

“In training Gemini 1.5 Pro, we want to understand how far we can push the quantitative reasoning capabilities of LLMs, with the goal of solving increasingly challenging and open-ended problems. Solving such problems typically requires extensive domain-specific study. In addition, mathematicians often benefit from extended periods of contemplation while formulating solutions; written mathematics tends to focus on the final result, obscuring the rich thought processes that precede it … We aim to emulate this by training a math-specialized model and providing it additional inference time computation, allowing it to explore a wider range of possibilities.”

The results of this strategy are evident in benchmark testing. The standard Gemini 1.5 Pro model achieved a score of 67.7 on the MATH benchmark, positioning it midway between Anthropic’s Claude 3 Opus and OpenAI’s GPT-4 Turbo. However, the dedicated Math-Specialized 1.5 Pro model outperforms both competitors across the board, excelling in tests such as AIME 2024, Math Odyssey, HiddenMath, and IMO-Bench.
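To make the idea of “additional inference time computation” more concrete, here is a minimal sketch of one common way to spend extra compute at inference: sample several candidate answers and keep the most frequent one (majority voting). This is an illustration only. The article and the quoted passage do not say which technique Google uses, and `make_toy_solver` is a hypothetical stand-in for a model call, not a real API.

```python
import random
from collections import Counter
from typing import Callable


def majority_vote(sample_answer: Callable[[], str], num_samples: int) -> str:
    """Draw num_samples candidate answers and return the most frequent one."""
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    votes = Counter(sample_answer() for _ in range(num_samples))
    # most_common(1) returns [(answer, count)]; ties go to the answer seen first.
    return votes.most_common(1)[0][0]


def make_toy_solver(correct: str, p_correct: float, seed: int) -> Callable[[], str]:
    """Hypothetical stand-in for a model call: right with probability p_correct."""
    rng = random.Random(seed)
    wrong_answers = ["41", "43", "44"]  # assumed distractors, for illustration

    def solver() -> str:
        if rng.random() < p_correct:
            return correct
        return rng.choice(wrong_answers)

    return solver


if __name__ == "__main__":
    solver = make_toy_solver(correct="42", p_correct=0.5, seed=0)
    # A single sample is the "rush to a conclusion" baseline.
    print("1 sample:  ", majority_vote(solver, num_samples=1))
    # More samples cost more inference time but let the vote settle.
    print("64 samples:", majority_vote(solver, num_samples=64))
```

The design trade-off is the one the article describes: each extra sample costs inference time, and in return the final answer depends less on any single unlucky generation.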

Enhanced Instruction Following

Another area where the gap between the April and May versions of the paper is stark is instruction following. The initial report detailed 406 human-rated prompts covering tasks like content generation, summarisation, and coding. The updated server-hosted version also reports results on a separate set of 1,326 longer prompts, which adds enterprise tasks such as information extraction, data/table understanding, and multi-document summarisation.

“In addition, they capture enterprise tasks such as information extraction, data/table understanding, and multi-document summarization. These prompts are long, 307 words on average. They have between one to tens of instructions with a mean of about 8. Different from the Gemini 1.0 Technical Report (Gemini-Team et al., 2023), we also use another set of 406 prompts from human raters that covers varied topics and instruction types. These prompts are shorter, 66 words on average, with one to more than a dozen instructions (average count is five).

“For evaluation, human annotators were asked to rate whether a response follows (or not) each of the instructions present in the prompt. We aggregate these human judgements into two metrics: per-instruction accuracy (the percentage of instructions over the full evaluation set that are followed) and full-response accuracy (percentage of prompts where every instruction was followed).

“Our results for the two prompt sets are shown in Table 13. The Gemini 1.5 models show strong gains on the set of 1326 long and enterprise prompts: the 1.5 Pro model has 32% improvement in response accuracy from the 1.0 Pro model, by fully following 59% of all the long prompts. Even the smaller 1.5 Flash model has a 24% increase here. At instruction-level, the 1.5 Pro model reaches 90% accuracy.

“For the set of 406 shorter prompts, the Gemini 1.5 models follow 86-87% of the diverse instructions. 65% of the prompts were fully followed, performing similar to Gemini 1.0 Pro.”

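For makers dealing with complex, multi-step workflows, these results mean the new models are measurably less likely to miss specific constraints within a long prompt. There is still room to improve: 1.5 Pro follows 90% of individual instructions but fully satisfies only 59% of the long prompts.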
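The two metrics defined in the quoted passage, per-instruction accuracy and full-response accuracy, can be sketched in a few lines. This is a minimal illustration of the definitions as quoted, not the paper's evaluation code. The function name, the data layout (one list of pass/fail ratings per prompt) and the toy ratings are assumptions made for the example.

```python
from typing import Sequence


def instruction_following_metrics(
    ratings: Sequence[Sequence[bool]],
) -> tuple[float, float]:
    """Return (per_instruction_accuracy, full_response_accuracy) as fractions.

    ratings[i][j] is True when the response to prompt i followed instruction j.
    """
    if not ratings or any(len(prompt) == 0 for prompt in ratings):
        raise ValueError("every prompt needs at least one rated instruction")

    total_instructions = sum(len(prompt) for prompt in ratings)
    followed_instructions = sum(sum(prompt) for prompt in ratings)
    # A prompt counts as fully followed only if every instruction was followed.
    fully_followed_prompts = sum(all(prompt) for prompt in ratings)

    per_instruction = followed_instructions / total_instructions
    full_response = fully_followed_prompts / len(ratings)
    return per_instruction, full_response


if __name__ == "__main__":
    # Toy ratings for three prompts (made up for illustration).
    toy_ratings = [
        [True, True, True],
        [True, False],
        [True, True, True, False, True],
    ]
    per_instruction, full_response = instruction_following_metrics(toy_ratings)
    print(f"per-instruction accuracy: {per_instruction:.0%}")
    print(f"full-response accuracy:   {full_response:.0%}")
```

In the toy data, 8 of 10 instructions are followed, yet only one of three prompts is followed in full. The same effect explains how a model can reach 90% per-instruction accuracy while fully following 59% of long prompts: with about eight instructions per prompt, a single miss fails the whole response.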

Key takeaways

  • Google has silently expanded its Gemini 1.5 research paper on DeepMind servers to 153 pages, roughly doubling the 77-page arXiv version released in April.
  • The updated models demonstrate superior mathematical reasoning, with the Math-Specialized Pro version outperforming both GPT-4 Turbo and Claude 3 Opus on advanced benchmarks.
  • Instruction following capabilities have improved drastically, with the 1.5 Pro model achieving 90% per-instruction accuracy on complex, enterprise-grade prompts.
