Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents
We recently introduced VAKRA, a tool-grounded, executable benchmark for evaluating how well AI agents reason and act in enterprise-like environments.
Unlike traditional benchmarks that test isolated skills, VAKRA measures compositional reasoning across APIs and documents, using full execution traces to assess whether agents can reliably complete multi-step workflows.
VAKRA provides an executable environment where agents interact with more than 8,000 locally hosted APIs backed by real databases spanning 62 domains, along with domain-aligned document collections. Tasks can require reasoning chains of 3 to 7 steps that combine structured API interaction with unstructured retrieval under natural-language tool-use constraints.
As shown below, models perform poorly on VAKRA. In this post, we provide additional details about the tasks in VAKRA and analyze the failure modes we observed across them.
Task Description
As shown below, the VAKRA benchmark comprises four tasks, each testing a different set of capabilities.

Capability 1: API Chaining using Business Intelligence APIs
This capability includes 2,077 test instances across 54 domains, requiring the use of tools from the SLOT-BIRD and SEL-BIRD collections (Elder et al., 2026). Compared to the setup in Elder et al., the tool universe in SLOT-BIRD and SEL-BIRD is expanded through the inclusion of a larger number of domains. Each domain is restricted to one tool collection, and tasks involve chaining 1–12 tool calls to arrive at the final answer.
{
  "query": "Which football team has a build-up play speed of 31, build-up play dribbling of 53, and build-up play passing of 32?",
  "tool_calls": [
    {
      "name": "get_data",
      "arguments": {"tool_universe_id": "486ea46224d1-aeb8037c5e78"},
      "label": "retrieved_data_1"
    },
    {
      "name": "select_data_equal_to",
      "arguments": {"data_label": "retrieved_data_1", "key_name": "play_speed", "value": 31},
      "label": "FILTERED_DF_0"
    },
    {
      "name": "select_data_equal_to",
      "arguments": {"data_label": "FILTERED_DF_0", "key_name": "play_dribble", "value": 53},
      "label": "FILTERED_DF_1"
    },
    {
      "name": "select_data_equal_to",
      "arguments": {"data_label": "FILTERED_DF_1", "key_name": "play_passing", "value": 32},
      "label": "FILTERED_DF_2"
    },
    {
      "name": "get_team_name",
      "arguments": {"data_label": "FILTERED_DF_2", "n": 1}
    }
  ],
  "answer": "FC Barcelona"
}
Fig 2: Data sample from SEL-BIRD collection
| Data Preview |
|---|
| {"handle": "retrieved_data_1", "num_records": 2, "key_details": [{"name": "team_name", "dtype": "str", "first_3_values": ["FC Barcelona", "Manchester City"]}, {"name": "play_speed", "dtype": "int32", "first_3_values": [31, 40]}, {"name": "play_dribble", "dtype": "int32", "first_3_values": [53, 30]}, {"name": "play_passing", "dtype": "int32", "first_3_values": [32, 16]}]} |
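To make the labeling mechanism concrete, here is a minimal Python sketch of how such a chain could be executed. It is not the benchmark's implementation: the in-memory `ROWS` table and the `run_chain` helper are hypothetical stand-ins for the MCP tool server, and the rows simply mirror the two records in the data preview. The point it illustrates is that each call stores its output under its `label`, and later calls refer to that label through `data_label`.

```python
from typing import Any

# Hypothetical in-memory stand-in for the tool server's backing data.
# The rows mirror the two records shown in the data preview.
ROWS = [
    {"team_name": "FC Barcelona", "play_speed": 31, "play_dribble": 53, "play_passing": 32},
    {"team_name": "Manchester City", "play_speed": 40, "play_dribble": 30, "play_passing": 16},
]

TOOL_CALLS = [
    {"name": "get_data",
     "arguments": {"tool_universe_id": "486ea46224d1-aeb8037c5e78"},
     "label": "retrieved_data_1"},
    {"name": "select_data_equal_to",
     "arguments": {"data_label": "retrieved_data_1", "key_name": "play_speed", "value": 31},
     "label": "FILTERED_DF_0"},
    {"name": "select_data_equal_to",
     "arguments": {"data_label": "FILTERED_DF_0", "key_name": "play_dribble", "value": 53},
     "label": "FILTERED_DF_1"},
    {"name": "select_data_equal_to",
     "arguments": {"data_label": "FILTERED_DF_1", "key_name": "play_passing", "value": 32},
     "label": "FILTERED_DF_2"},
    {"name": "get_team_name",
     "arguments": {"data_label": "FILTERED_DF_2", "n": 1}},
]


def run_chain(tool_calls: list[dict[str, Any]]) -> Any:
    """Execute tool calls in order, storing each labeled output for later calls."""
    store: dict[str, Any] = {}
    result: Any = None
    for call in tool_calls:
        name, args = call["name"], call["arguments"]
        if name == "get_data":
            # Assumption: the real tool loads the domain's data; here we copy ROWS.
            result = list(ROWS)
        elif name == "select_data_equal_to":
            source = store[args["data_label"]]
            result = [row for row in source if row[args["key_name"]] == args["value"]]
        elif name == "get_team_name":
            source = store[args["data_label"]]
            names = [row["team_name"] for row in source[: args["n"]]]
            # Return a bare string when exactly one name was requested and found.
            result = names[0] if args["n"] == 1 and names else names
        else:
            raise ValueError(f"unknown tool: {name}")
        if "label" in call:
            store[call["label"]] = result
    return result


print(run_chain(TOOL_CALLS))  # FC Barcelona
```

Because every filter reads from the label written by the previous step, an agent that references a wrong or stale label breaks the chain even if each individual call is well formed.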
Capability 2: Tool Selection using Dashboard APIs
This capability includes 1,597 instances across 17 domains, requiring tools from an expanded REST-BIRD collection (Elder et al., 2026). These tools use endpoint-style interfaces: highly specific, query-aligned endpoints that encapsulate most of the computation. They are served as REST APIs from a FastAPI server, which is wrapped by the MCP server. The task requires selecting the correct APIs from the domain-specific tool set (as shown in the example in Figure 1). Each domain contains between 6 and 328 tools (116 on average). As in the previous task, the get_data tool configures the MCP server to expose only the relevant domain-specific APIs.
Capability 3: Multi-Hop Reasoning using Dashboard APIs
Capability 3 has 869 test instances drawn from 38 domains. These instances again rely on the REST-BIRD API collection but add multi-hop reasoning (see the example in Figure 1). Multi-hop questions require multiple pieces of supporting evidence to be extracted and combined to reach an answer. Each query in this segment requires between one and five logical hops. The distribution of question types in the test set is shown below in Figure 4.

Capability 4: Multi-Hop, Multi-Source Reasoning and Policy Adherence
Capability 4 includes 644 instances across 41 domains and is also built on the REST-BIRD API collection. Figure 4 above shows the distribution of hybrid hops for test queries without policies. This capability contains the most complex queries, with the following characteristics:
- Multi-Source: This segment adds document indexes per domain. Queries in this capability may require information from these document indexes as well as from API calls. As in Capability 3, the queries are multi-hop. The required information source is determined per hop, so a question may, for example, entail three logical hops with the sources API – RAG (Document Retrieval) – API.
- Multi-Turn: This segment of the dataset also adds multi-turn conversations to the setting. Each instance is a dialog with multiple turns. The data is released as context-response pairs, where the context encodes the current dialog history and the agent is only responsible for answering the current turn.
- Tool-usage Policies: A subset of these instances includes tool-use policies that the agent is required to follow. These policies take the form of plain-text instructions about the knowledge sources that the agent is allowed to access and under which circumstances. For example:
If a user's query pertains to Technology & Software, which is/are about Topics focusing on codebases, software platforms, applications, and user interactions in tech, make sure you try answering them by only using document retrievers. Do not use other types of tools.
The baseline agent in the project repo enforces these policies through a simple addition to the prompt:
"You are a helpful assistant with access to tools.\n Tool Usage Constraint: {additional_instructions}."
Agent builders are free to choose any other constraint-enforcement mechanism.
Evaluation Framework
Evaluation Metric
The VAKRA Evaluator operates over two key inputs for each sample: a predicted final response and the corresponding tool-call trajectory. The tool calls from the predicted trajectory are executed in the same environment as the ground truth to verify intermediate tool outputs.
- For Capability 4 tasks, policy adherence is first verified programmatically (this step is not applied to other capabilities).
- The predicted tool call sequence is then compared against the ground truth sequence.
- Only samples with valid trajectories proceed to final response evaluation.

- Specifically, we first perform a programmatic check, verifying whether all information present in the ground-truth tool responses is recovered by the predicted tool responses. This check may be inconclusive in cases involving partial matches, semantic equivalence, or differences in representation (e.g., ordering, aggregation, or formatting).
- In such cases, we apply a secondary LLM-based evaluation with a prompt adapted from the CRAG framework (Yang et al., 2024). It determines whether the predicted trajectory captures all required information despite structural differences, even if that information was obtained through a different sequence of tool calls.
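The two-stage trajectory check can be sketched in Python as follows. This is an illustrative approximation, not the VAKRA Evaluator's code: the `atoms` and `programmatic_check` helpers are hypothetical, and comparing normalized leaf values while ignoring keys is a simplifying assumption. The sketch returns "pass" when every ground-truth value appears in the predicted tool responses and "inconclusive" otherwise, which is the case that would be handed to the LLM-based evaluation.

```python
from typing import Any


def atoms(value: Any) -> set[str]:
    """Flatten a nested tool response into a set of normalized leaf values."""
    if isinstance(value, dict):
        return set().union(*(atoms(v) for v in value.values()))
    if isinstance(value, (list, tuple, set)):
        return set().union(*(atoms(v) for v in value))
    return {str(value).strip().lower()}


def programmatic_check(gt_responses: list[Any], pred_responses: list[Any]) -> str:
    """Return "pass" if all ground-truth values are recovered, else "inconclusive"."""
    required = atoms(gt_responses)
    recovered = atoms(pred_responses)
    # Order and nesting are ignored, so a different call sequence can still pass.
    return "pass" if required <= recovered else "inconclusive"


# Same information reached through a longer call sequence: passes.
gt = [[{"team_name": "FC Barcelona"}]]
pred = [[{"team_name": "FC Barcelona", "play_speed": 31}], "FC Barcelona"]
print(programmatic_check(gt, pred))  # pass

# Same number in a different representation: deferred to the LLM-based check.
print(programmatic_check([{"avg": 31}], [{"average": 31.0}]))  # inconclusive
```

The second example shows why a fallback is needed: exact matching treats `31` and `31.0` as different values even though they carry the same information.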
Final Response Evaluation
For trajectories that pass the previous check, the final response is evaluated using an LLM-based judge. This step ensures that the response is (i) grounded in the predicted tool outputs, and (ii) factually consistent with the ground truth answer, accounting for potential variations in phrasing or structure.
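Putting the stages together, the gating logic can be summarized with a short Python sketch. The three checks are passed in as callables because their real implementations (programmatic policy verification, trajectory comparison, LLM judge) live in the evaluator. The `evaluate_sample` function, the `capability` field, and the returned dictionary layout are hypothetical names chosen for this example. Treating a policy violation as an immediate failure is also an assumption.

```python
from typing import Any, Callable, Optional

Sample = dict[str, Any]
Check = Callable[[Sample], bool]


def evaluate_sample(
    sample: Sample,
    policy_check: Check,
    trajectory_check: Check,
    response_judge: Check,
) -> dict[str, Optional[object]]:
    """Run the checks in order and report the first stage that fails, if any."""
    # Policy adherence is verified only for Capability 4 samples.
    if sample["capability"] == 4 and not policy_check(sample):
        return {"passed": False, "failed_stage": "policy"}
    # Only samples with valid trajectories proceed to final response evaluation.
    if not trajectory_check(sample):
        return {"passed": False, "failed_stage": "trajectory"}
    if not response_judge(sample):
        return {"passed": False, "failed_stage": "final_response"}
    return {"passed": True, "failed_stage": None}


# Usage with stub checks standing in for the real evaluator components.
result = evaluate_sample(
    {"capability": 2},
    policy_check=lambda s: False,      # never consulted for Capability 2
    trajectory_check=lambda s: True,
    response_judge=lambda s: True,
)
print(result)  # {'passed': True, 'failed_stage': None}
```

Recording the failed stage alongside the pass/fail outcome makes it possible to separate tool-use errors from answer-synthesis errors when analyzing failure modes.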
Key Takeaways
- VAKRA provides a comprehensive benchmark for evaluating AI agents’ ability to reason and act in enterprise-like environments.
- The VAKRA framework allows for both final response evaluation and analysis of the tool execution trajectory, providing deeper insights into agent performance.
- Models often struggle with multi-hop reasoning tasks, which require combining structured API interactions with unstructured document retrieval under natural-language constraints.
Originally published at huggingface.co. Curated by AI Maestro.