Getting more from each token: How Copilot improves context handling and model routing

For makers and artists who rely on AI in their craft, efficiency is no longer only about spending less on tokens. It is about making every interaction count. As Copilot evolves into an agentic system that handles planning, editing, debugging, and tool orchestration across extended sessions, the goal is to make Copilot more economical with its own context. That means cutting the material Copilot repeats from turn to turn and choosing the right model for the task at hand.

We are addressing this through two main avenues: refining the Copilot harness to focus more on the actual work, and expanding the Auto system so Copilot can autonomously select the optimal model without requiring manual intervention. This article details the improvements to the VS Code harness and the ongoing rollout of Auto across the platform.

Reducing repetition with smart caching and deferred tools

In extended GitHub Copilot sessions within VS Code, the harness has typically sent the model a large amount of recurring data on every turn: instructions, repository context, conversation history, tool definitions, and current task state. Some of this is essential, but much of it can now be cached, deferred, or loaded only when strictly necessary.

Two new features in GitHub Copilot for VS Code handle this optimisation. Prompt caching allows the system to reuse model state for repeated prompt prefixes rather than recomputing them on every request. Tool search enables the model to load tool definitions on demand, rather than flooding the context window with every full tool schema on every turn.

On-demand tool loading matters more as agents use more tools. A session might require access to MCP tools, terminal commands, file operations, and workspace search. Loading every full tool definition upfront adds a fixed cost to each turn, even if only a few tools are relevant. With tool search, Copilot keeps a broad toolset available while sending only the schemas the model actually needs.
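To make the idea of on-demand tool loading concrete, here is a minimal sketch. It is not Copilot's implementation: the `Tool` and `ToolRegistry` classes, the keyword-overlap ranking, and the three example tools are all assumptions made for illustration. The point is only that the full schema is looked up after a search, rather than being sent on every turn.

```python
from dataclasses import dataclass

# Hypothetical stopword list so common words do not count as matches.
STOPWORDS = {"a", "the", "in", "from", "to", "of", "for"}


def _keywords(text):
    """Lowercase a string and split it into a set of meaningful words."""
    words = text.lower().replace("_", " ").split()
    return {w for w in words if w not in STOPWORDS}


@dataclass
class Tool:
    name: str
    summary: str  # short description that is cheap to keep around
    schema: dict  # full definition, sent to the model only when needed


class ToolRegistry:
    """Illustrative registry: search by keyword, load schemas on demand."""

    def __init__(self, tools):
        self._tools = {tool.name: tool for tool in tools}

    def search(self, query, limit=2):
        query_words = _keywords(query)
        scored = []
        for tool in self._tools.values():
            overlap = len(query_words & _keywords(tool.name + " " + tool.summary))
            if overlap:
                scored.append((overlap, tool.name))
        # Highest overlap first; ties broken alphabetically for stable output.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [name for _, name in scored[:limit]]

    def load_schemas(self, names):
        return [self._tools[name].schema for name in names]


registry = ToolRegistry([
    Tool("read_file", "read a file from the workspace",
         {"name": "read_file", "parameters": {"path": "string"}}),
    Tool("run_terminal", "run a shell command in the terminal",
         {"name": "run_terminal", "parameters": {"command": "string"}}),
    Tool("search_workspace", "search text across workspace files",
         {"name": "search_workspace", "parameters": {"query": "string"}}),
])

matches = registry.search("run a command in the terminal")
schemas_for_context = registry.load_schemas(matches)
```

In this sketch only one of the three schemas would be placed in the context for that request, while the other two remain available to later searches.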

For a deeper technical analysis of the implementation, including cache-control breakpoints and provider-specific tool search, see the VS Code technical deep dive.

How Auto model selection fits into the workflow

Auto addresses a fundamental question: which model is the best fit for this specific task right now?

Following your initial prompt, Copilot analyses task intent and current model health to select the most appropriate engine. A quick explanation, a focused edit, and a complex multi-file change do not all require the same level of reasoning. Auto makes this decision automatically, sparing you the need to tune model settings manually.

Evaluations show that no single model consistently outperforms all others across every task. Often, a more efficient model achieves the same outcome as a stronger one, while higher-tier models are only essential for tasks requiring deep reasoning. Auto learns where enhanced reasoning improves results and routes accordingly, aiming not to trade quality for cost, but to use the model that best fits the work.

The mechanics of Auto model selection

Auto relies on two primary signals: the real-time health of available models and the nature of the work being requested.

Real-time model health: A dynamic engine monitors model availability, utilisation, speed, error rates, and cost. A model might be capable of handling a task, but not necessarily the best choice at that specific moment. Auto considers current system conditions to route to a model that is both capable and ready.
Task-aware routing with HyDRA: This routing model evaluates factors such as reasoning depth, code complexity, debugging difficulty, and tool orchestration needs. HyDRA identifies models meeting the quality bar and selects the best fit among them.

Combined, these signals prevent a one-size-fits-all approach. The objective is not to send every task to the largest model or every task to the cheapest one, but to choose the model that aligns with the work.
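The way the two signals combine can be sketched as a two-step filter. This is an illustration under stated assumptions, not how HyDRA or the health engine is built: the `ModelStatus` fields, the 1 to 3 capability scale, the 5% error threshold, and the "cheapest qualifying model wins" rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelStatus:
    name: str
    capability: int       # hypothetical 1-3 scale of reasoning strength
    available: bool       # health signal: is the model reachable right now
    error_rate: float     # health signal: fraction of recent failed requests
    relative_cost: float  # hypothetical cost unit


def pick_model(models, required_capability, max_error_rate=0.05):
    """Keep models that are healthy and meet the quality bar, then pick one."""
    qualified = [
        m for m in models
        if m.available
        and m.error_rate <= max_error_rate
        and m.capability >= required_capability
    ]
    if not qualified:
        raise RuntimeError("no healthy model meets the quality bar")
    # Assumption: among qualifying models, prefer lower cost, then lower error rate.
    return min(qualified, key=lambda m: (m.relative_cost, m.error_rate))


models = [
    ModelStatus("small", capability=1, available=True, error_rate=0.01, relative_cost=1.0),
    ModelStatus("medium", capability=2, available=True, error_rate=0.20, relative_cost=3.0),
    ModelStatus("large", capability=3, available=True, error_rate=0.02, relative_cost=10.0),
]

quick_answer = pick_model(models, required_capability=1)
multi_file_change = pick_model(models, required_capability=2)
```

Here the quick task goes to `small`, while the harder task skips `medium` because its error rate is currently too high and lands on `large`: a capable model that is not healthy at that moment is not chosen.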

Implementing Auto in real-world workflows

Optimising routing in tests is only half the battle. To make Auto useful in daily work, we had to account for how developers actually use Copilot: conversations lengthen, context accumulates, tasks shift, and users work in many languages.

Cache-aware routing. While switching models on every turn sounds flexible, it can undermine efficiency. If a conversation stays on the same model, the prompt prefix can be cached and reused. Switching models mid-conversation breaks this cache, which can cost more than the routing change saves. Auto avoids this by routing at natural cache boundaries: at the start of a conversation when there is no cache to lose, and after compaction when older turns are summarised and the prefix resets. Between these points, the selected model remains fixed to allow the cache to build.
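The cache-boundary rule described above can be sketched in a few lines. The `StickyRouter` class and the `choose_model` callable are hypothetical names for illustration; the sketch only shows the policy of re-routing at the start of a conversation and after compaction, and keeping the model fixed in between.

```python
class StickyRouter:
    """Illustrative router that only changes model at cache boundaries."""

    def __init__(self, choose_model):
        # choose_model is a hypothetical callable: task description -> model name
        self._choose_model = choose_model
        self._current = None

    def route(self, task, compacted=False):
        # Boundary 1: first turn, so there is no cache to lose.
        # Boundary 2: right after compaction, when the prefix has reset anyway.
        if self._current is None or compacted:
            self._current = self._choose_model(task)
        return self._current


def choose_model(task):
    # Stand-in for a real routing decision.
    return "large" if "refactor" in task else "small"


router = StickyRouter(choose_model)
first = router.route("explain this function")
second = router.route("refactor across modules")               # stays on the cached model
third = router.route("refactor across modules", compacted=True)  # boundary: re-route
```

The second call would have picked a different model if routed fresh, but the router holds the earlier choice so the cached prefix stays valid until the next boundary.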

Routing across languages. Copilot serves developers globally, so routing must work in languages other than English. We trained the routing model on conversations across 16 language families, including CJK and European languages. Evaluations showed routing accuracy stayed within four points of the English baseline across language groups, with no statistically significant quality gap.

Learning when escalation matters. Instead of labelling tasks as simply “easy” or “hard,” we trained the router to identify where models actually diverge. For training queries, responses from less capable and more capable models were scored across quality dimensions. The router learns when a stronger model adds value and when a more efficient model can produce an equally good result. For context-dependent messages in long agentic sessions, the router is trained on complete multi-turn conversations, including original user intent, recent assistant responses, and metadata.
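One way to picture how such training labels might be derived is shown below. This is a guess at the general shape, not the actual training pipeline: the quality dimensions, the 1 to 5 score scale, and the `margin` threshold are assumptions.

```python
from statistics import mean


def escalation_label(efficient_scores, strong_scores, margin=0.5):
    """Return True when the stronger model's response is clearly better.

    Both arguments map a quality dimension to a score; the dimension names
    and the margin are illustrative assumptions.
    """
    gaps = [strong_scores[dim] - efficient_scores[dim] for dim in efficient_scores]
    return mean(gaps) > margin


# A simple explanation: both models score about the same, so no escalation.
simple_query = escalation_label(
    {"correctness": 4, "completeness": 4},
    {"correctness": 4, "completeness": 5},
)

# A tricky debugging task: the stronger model is clearly ahead.
hard_query = escalation_label(
    {"correctness": 2, "completeness": 3},
    {"correctness": 5, "completeness": 5},
)
```

Labels of this kind mark where the models actually diverge, which is a different question from whether a task merely looks easy or hard.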

The expansion of Auto with task intent

Auto with task intent is already live in Visual Studio Code, github.com, and mobile. It provides Copilot with more signal regarding the type of work you are doing—whether coding, debugging, planning, or using tools—allowing for a better model choice.

We are continuing to expand this experience across Copilot surfaces. Next steps include bringing Auto with task intent to more platforms and adding ways for teams to set Auto as the default.

Auto with task intent is arriving in Copilot CLI, GitHub App, and additional IDEs.
Copilot Free and Student plans will be simplified to leverage Auto as the sole model selection option.
Admin controls will allow organisations to set Auto as the default or enforce it as the only option.

Maximising value from your AI credits

Copilot is becoming more efficient by default, but specific habits can help your credits stretch further.

Start with Auto. Auto is a strong default for many tasks because it selects a model based on your intent without requiring manual selection every time.
Keep context focused. Start a new session when switching tasks, compact long-running sessions when needed, and specify the files you want Copilot to use. Less unnecessary context means more session time dedicated to actual work.
Avoid changing settings mid-session. Switching models, reasoning levels, context size, or tool configuration can break cache reuse and force context rebuilding. Set up the session as desired and keep related work together.
Plan before parallelising. For larger tasks, ask Copilot to plan first. While parallel agents are useful when work can truly be split, they consume credits simultaneously, so use them deliberately.
Use only the tools you need. Tools and MCP servers are powerful, but broad toolsets add extra context. Enable only what is relevant to the task. Use agent finder in GitHub Copilot to streamline tool usage.
Check your usage. The AI usage page shows where credits are going across features and models. In Copilot CLI, session-level usage can help you spot expensive patterns while working.
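To see why changing settings mid-session can be costly, the following sketch counts how much of a request would need fresh processing under a simple prefix cache. It is a simplified model, not Copilot's or any provider's caching logic: each list item stands in for a chunk of the prompt, and the assumption is that cached state is reusable only on the same model and only for an identical leading prefix.

```python
def shared_prefix_len(previous, current):
    """Count how many leading items two prompts have in common."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


def chunks_to_process(previous, current, same_model=True):
    """Chunks that cannot be served from the cache (simplified assumption)."""
    cached = shared_prefix_len(previous, current) if same_model else 0
    return len(current) - cached


turn_1 = ["instructions", "tools", "repo_context", "user_1"]
turn_2 = turn_1 + ["assistant_1", "user_2"]

# Same model, same setup: only the two new chunks need processing.
steady = chunks_to_process(turn_1, turn_2)

# Switching model: nothing can be reused in this simplified model.
after_model_switch = chunks_to_process(turn_1, turn_2, same_model=False)

# Changing the tool configuration alters an early chunk, so everything
# after it falls out of the shared prefix.
turn_2_new_tools = ["instructions", "tools_v2", "repo_context",
                    "user_1", "assistant_1", "user_2"]
after_tool_change = chunks_to_process(turn_1, turn_2_new_tools)
```

Because the cache depends on an unchanged prefix, a change near the start of the prompt, such as a different tool configuration, invalidates almost everything that follows it.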

For the full guide, refer to How to get more out of your AI credits.

Get started

Auto model selection is available today across supported Copilot experiences. To learn more, see the documentation.

Key takeaways

Efficiency is about intelligence, not just cost. Improvements in prompt caching and deferred tool loading ensure Copilot spends fewer tokens on repetition and more on actual task execution.
Auto selects the right engine for the job. By combining real-time model health checks with task-aware routing via HyDRA, the system automatically chooses the best balance of quality and speed without manual tuning.
Workflow habits impact credit usage. Maintaining consistent model selection during a session preserves cache efficiency, while starting fresh for new tasks and using only necessary tools keeps costs down.
