New ways to balance cost and reliability in the Gemini API
Today, we are introducing two new service tiers for the Gemini API: Flex and Priority. These options give you more control over both cost and reliability through a single, unified interface.
As AI systems evolve from simple chat interactions into complex autonomous agents, developers often need to balance two distinct types of tasks:
- Background processes: High-volume workflows like data enrichment or “thinking” operations that don’t require immediate responses.
- User-facing features: Features such as chatbots and co-pilots where high reliability is essential.
Previously, managing both required splitting your architecture between a standard synchronous service and the asynchronous Batch API. The new Flex and Priority tiers bridge this gap by allowing you to route background jobs to Flex and interactive tasks to Priority, all using standard synchronous endpoints. This simplifies management without compromising on cost or performance benefits.
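The routing decision described above can be sketched in a few lines. This is a minimal illustration, not an official SDK call: the `choose_service_tier` helper and its task-type labels are hypothetical, and only the tier names come from this post.

```python
def choose_service_tier(task_type: str) -> str:
    """Pick a service tier based on how latency-sensitive the task is."""
    if task_type in ("data_enrichment", "background_thinking"):
        return "flex"      # latency-tolerant background work, half price
    if task_type in ("chatbot", "copilot"):
        return "priority"  # user-facing, highest reliability
    return "standard"      # everything else stays on the default tier


print(choose_service_tier("chatbot"))  # prints "priority"
```

Because both tiers share the same synchronous endpoints, this is the only branch point: the request shape and response handling stay identical either way.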
Flex Inference: Scale innovation for 50% less
Flex Inference is our new cost-optimized tier designed for latency-tolerant workloads, without the overhead of batch processing.
- Half the price of the Standard API: Pay 50% of the standard rate in exchange for lower request priority, accepting reduced reliability and potentially higher latency.
- Synchronous simplicity: Flex uses a synchronous interface, allowing you to use familiar endpoints without managing input/output files or polling for job completion.
- Best for: Background CRM updates, large-scale research simulations, and any task where the model “browses” or “thinks” in the background.
To start using Flex, simply configure the `service_tier` parameter in your request:
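As a rough sketch of what such a request might look like, the snippet below builds a `GenerateContent`-style JSON body. The placement of `service_tier` at the top level and the literal value `"flex"` are assumptions based on the tier name, not confirmed API details.

```python
import json

# Hypothetical request body: contents follows the standard
# GenerateContent shape; service_tier is the new field this post
# describes, and "flex" is an assumed value derived from the tier name.
request_body = {
    "contents": [
        {"role": "user", "parts": [{"text": "Enrich this CRM record..."}]}
    ],
    "service_tier": "flex",  # cost-optimized, latency-tolerant tier
}

payload = json.dumps(request_body)
```

Because the call is synchronous, there are no input/output files to stage and no job status to poll: the response comes back on the same request, just with lower priority.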
The Flex tier is available on all paid tiers and can be used with both `GenerateContent` and Interactions API requests.
Priority Inference: Highest reliability for critical apps
The new Priority Inference tier offers our highest level of assurance at a premium price point. This ensures that your most important traffic remains uninterrupted, even during peak platform usage.
- Top priority: Requests are served with the highest criticality, ensuring they receive the best possible service.
- Graceful degradation: If your traffic exceeds Priority limits, overflow requests will be automatically handled at the Standard tier to maintain application availability and ensure business continuity.
- Transparent feedback: The API response indicates which tier served your request, providing clear visibility into performance and billing.
- Best for: Real-time customer support bots, live content moderation pipelines, and any time-sensitive task that needs immediate attention.
To use Priority Inference, set the `service_tier` parameter accordingly:
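The sketch below mirrors the Flex example and also checks which tier actually served the request, since the post notes the response reports this. The response field name `service_tier` and the value `"priority"` are illustrative assumptions, not confirmed API details.

```python
import json

# Hypothetical request body with the assumed "priority" tier value.
request_body = {
    "contents": [
        {"role": "user", "parts": [{"text": "Classify this support ticket."}]}
    ],
    "service_tier": "priority",
}
payload = json.dumps(request_body)

# Mocked response standing in for the real API reply; per this post,
# the response indicates which tier served the request, so overflow
# beyond Priority limits would show up here as "standard".
mock_response = {"service_tier": "priority", "candidates": []}
served_tier = mock_response.get("service_tier", "standard")
if served_tier != "priority":
    print(f"Request overflowed to the {served_tier} tier")
```

Checking the served tier in this way lets you monitor how often your traffic exceeds Priority limits, which is useful for capacity planning and for reconciling billing.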
Priority Inference is available to users with Tier 2 or Tier 3 paid projects across both `GenerateContent` and Interactions API endpoints.
To see how these tiers work in practice, check out the cookbook for runnable code examples. Visit the Gemini API documentation to view the full pricing breakdown and start optimizing your production tiers today.
Key Takeaways
- The Flex tier is designed for cost optimization with a 50% price reduction compared to the Standard API.
- The Priority tier serves critical traffic with the highest reliability, even during peak usage; overflow requests fall back to the Standard tier, so your application never loses service.
- The Flex tier is available on all paid projects, while Priority requires a Tier 2 or Tier 3 project; both support `GenerateContent` and Interactions API requests.
Originally published at blog.google. Curated by AI Maestro.