EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios

What this means for makers and artists

For developers building voice assistants, the landscape has shifted from generic testing to rigorous, domain-specific validation. The latest EVA-Bench release moves beyond simple script reading to simulate the friction of real enterprise environments. Creators of audio and conversational AI must now account for three distinct sectors—airlines, IT support, and healthcare HR—each demanding precise handling of complex policies, authentication hurdles, and structured data. This benchmark forces models to navigate workflows where a single misstep in a confirmation code or policy interpretation breaks the interaction, providing a much sterner reality check than previous evaluations.

Expanding the scope across three domains

Previously limited to a single sector, the benchmark now covers 213 evaluation scenarios and 121 tools across three domains: Airline Customer Service Management (CSM), Enterprise IT Service Management (ITSM), and Healthcare HR Service Delivery (HRSD). This represents a roughly fourfold increase in scenario coverage compared to the original release. To check both solvability and difficulty, every scenario was run against three leading frontier models: OpenAI GPT-5.4, Google Gemini 3.1 Pro, and Anthropic Claude Opus 4.6. The full datasets are open-source and can be loaded with the Hugging Face datasets library:

from datasets import load_dataset

# Airline Customer Service Management (CSM) — 50 scenarios
airline = load_dataset("ServiceNow-AI/eva-bench", "airline", split="test")
# Enterprise IT Service Management (ITSM) — 80 scenarios
itsm = load_dataset("ServiceNow-AI/eva-bench", "itsm", split="test")
# Healthcare HR Service Delivery (HRSD) — 83 scenarios
hrsd = load_dataset("ServiceNow-AI/eva-bench", "medical", split="test")

Design principles for realistic testing

The creation of these datasets followed five core principles to ensure they reflect actual enterprise usage rather than idealised test cases.

Voice-first scope

Not every business process is suitable for voice interaction. The team identified tasks actually handled over the phone in real-world practice, selecting only the most frequent flows to keep scenarios grounded in authentic call patterns.

Realism

Tool schemas mirror production APIs, while policies are drawn from genuine enterprise constraints. In the Healthcare HRSD domain, scenarios are rooted in actual US healthcare administration, incorporating specific elements like NPI numbers, FMLA regulations, and insurance coverage rules.

Variety

Repeating identical tasks yields little insight. Each workflow is sampled across three scenario types: single-intent calls, multi-intent calls with up to four requests per conversation, and adversarial calls in which the caller tries to bypass troubleshooting or have urgency misclassified. Crucially, unsatisfiable goals are included: real call traffic is rarely all happy paths, and models often struggle more with impossible requests than with achievable ones.

Authentication

Authentication remains a primary failure point for voice agents. Every domain incorporates authentication flows, calibrated to the specific task. For instance, OTP-based elevation appears only where a production system would genuinely require it, rather than being applied uniformly.
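The idea of calibrating authentication to the task can be sketched as a per-tool requirement table. This is a minimal illustration only: the tool names, the three auth levels, and the `can_call` helper are hypothetical and are not taken from the EVA-Bench tool schemas.

```python
from enum import IntEnum


class AuthLevel(IntEnum):
    NONE = 0      # no verification needed
    IDENTITY = 1  # caller identity verified (assumed: some identifier check)
    OTP = 2       # elevated with a one-time passcode


# Hypothetical tools and requirements, for illustration only.
REQUIRED_AUTH = {
    "get_flight_status": AuthLevel.NONE,
    "rebook_flight": AuthLevel.IDENTITY,
    "reset_mfa_device": AuthLevel.OTP,
}


def can_call(tool: str, session_level: AuthLevel) -> bool:
    """Return True if the session's auth level meets the tool's requirement."""
    return session_level >= REQUIRED_AUTH[tool]


# OTP elevation is demanded only by the tool that needs it.
print(can_call("rebook_flight", AuthLevel.IDENTITY))     # True
print(can_call("reset_mfa_device", AuthLevel.IDENTITY))  # False
```

Keeping the requirement next to the tool, rather than applying one level everywhere, is what lets an evaluation distinguish an agent that over-authenticates from one that under-authenticates.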

Reproducibility

To distinguish genuine capability gaps from random chance, every scenario is engineered to have exactly one correct resolution path. User goal construction ensures the simulator possesses all necessary information to behave consistently, and generation logic eliminates cases where multiple valid action sequences could achieve the same outcome.

Scenario generation methodology

Scenarios are created using SyGra, a graph-based synthetic data pipeline powered by GPT-5.4. Each scenario consists of three components, generated jointly so that they remain consistent with one another:

User goal

To ensure reproducibility, the user simulator must behave identically across runs. Vague intents lead to inconsistent judgments. Instead, the user goal is structured as a decision tree covering every potential situation. It specifies exactly what the user should request, including a negotiation sequence detailing when to push back, ask for alternatives, or accept offers. Edge cases, such as accepting a standby flight versus an alternate airport, are handled with explicit instructions. The resolution condition requires concrete evidence of completed actions, like a confirmation number, rather than verbal promises, ensuring the simulator stays on the line until the action is verified.
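A structured user goal of this kind might look like the sketch below. The `UserGoal` class, its field names, and the reaction labels are assumptions made for illustration; the actual EVA-Bench goal format is not shown in this article.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class UserGoal:
    request: str
    # Explicit reaction for each offer the agent might make (hypothetical labels).
    negotiation: dict[str, str] = field(default_factory=dict)
    # Fallback so the simulator never has to improvise.
    default_reaction: str = "ask_for_alternatives"
    # Field that must be present before the simulator ends the call.
    resolution_evidence: str = "confirmation_number"

    def react(self, offer: str) -> str:
        """Look up the scripted reaction to an agent offer."""
        return self.negotiation.get(offer, self.default_reaction)

    def is_resolved(self, agent_turn: dict) -> bool:
        """Resolved only when concrete evidence is present, not on a promise."""
        return bool(agent_turn.get(self.resolution_evidence))


goal = UserGoal(
    request="rebook cancelled flight",
    negotiation={"standby_flight": "decline", "alternate_airport": "accept"},
)

print(goal.react("standby_flight"))                        # decline
print(goal.react("travel_voucher"))                        # ask_for_alternatives
print(goal.is_resolved({"text": "I'll sort that out"}))    # False
print(goal.is_resolved({"confirmation_number": "ABC123"})) # True
```

Because every offer maps to a fixed reaction, two runs of the simulator against the same agent behaviour take the same branch, which is the property the decision-tree design is after.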

Initial scenario database

This represents the backend state the agent’s tools will query and modify. It is generated alongside the user goal to ensure every entity referenced—such as booking IDs, account details, and credentials—exists and is consistent within the database.

Expected final database state (ground truth)

The expected outcome is derived by running the generation LLM against the agent instructions, user goal, and initial database to produce a full action trace. As the LLM executes write tool calls, the database updates incrementally, establishing the terminal state as the ground truth for verifiers.
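The replay idea can be sketched as follows, under the assumption that each write call is a simple record of table, key, and updated fields. The call format, the example booking, and the function names are hypothetical; the real tool executor is not described here.

```python
from __future__ import annotations

import copy


def apply_write(db: dict, call: dict) -> None:
    """Apply one write call of the form {"table", "key", "updates"} in place."""
    db[call["table"]][call["key"]].update(call["updates"])


def final_state(initial_db: dict, trace: list[dict]) -> dict:
    """Replay a trace of write calls on a copy of the initial database."""
    db = copy.deepcopy(initial_db)
    for call in trace:
        apply_write(db, call)
    return db


def verify(initial_db: dict, agent_trace: list[dict], expected_db: dict) -> bool:
    """An agent passes if its writes lead to the expected terminal state."""
    return final_state(initial_db, agent_trace) == expected_db


# Hypothetical scenario data.
initial_db = {"bookings": {"BK1": {"flight": "EV100", "status": "cancelled"}}}
reference_trace = [
    {"table": "bookings", "key": "BK1",
     "updates": {"flight": "EV200", "status": "confirmed"}},
]
expected_db = final_state(initial_db, reference_trace)

print(verify(initial_db, reference_trace, expected_db))  # True
print(verify(initial_db, [], expected_db))               # False: nothing was written
```

Comparing terminal states rather than transcripts means the verifier only cares about what was actually changed in the backend.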

Joint generation is critical because these components are deeply interdependent. Independent generation could introduce silent errors, such as a case ID in the user goal that does not exist in the database, which would corrupt the evaluation. A multi-stage validation loop enforces consistency after each generation attempt:

Structural check: Validates the scenario database against a Pydantic schema to catch type errors and missing fields.
LLM-based validator: Checks consistency holistically, ensuring user-facing details match database records, cross-references are valid, and authentication data is configured correctly.
LLM-based trace verification: Checks the full conversation trace for policy compliance, correct action sequencing, completion of terminal actions, and the absence of alternative write paths that would introduce non-determinism.
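The deterministic parts of this loop can be sketched without any model calls. The example below uses plain Python type checks in place of a Pydantic schema so that it runs with the standard library alone, and it adds the kind of cross-reference check described above (a case ID in the user goal that is missing from the database). The field names and sample records are hypothetical, and the two LLM-based stages are not represented.

```python
from __future__ import annotations

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"case_id": str, "employee_id": str, "priority": int}


def structural_errors(record: dict) -> list[str]:
    """Report missing fields and type mismatches for one database record."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            errors.append(f"wrong type for {name}: expected {expected_type.__name__}")
    return errors


def dangling_references(goal_case_ids: list[str], database: dict) -> list[str]:
    """Return case IDs mentioned in the user goal that the database lacks."""
    known = {rec["case_id"] for rec in database["cases"] if "case_id" in rec}
    return [cid for cid in goal_case_ids if cid not in known]


database = {"cases": [{"case_id": "CS001", "employee_id": "E42", "priority": 2}]}
bad_record = {"case_id": "CS002", "priority": "high"}

print(structural_errors(database["cases"][0]))          # []
print(structural_errors(bad_record))
print(dangling_references(["CS001", "CS999"], database))  # ['CS999']
```

Running cheap structural checks first keeps obviously malformed scenarios away from the slower LLM-based stages.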

Further validation

All scenarios underwent multiple rounds of manual review. Reviewers confirmed that policies were applied consistently, user goals were specific enough to admit exactly one resolution, expected final states were internally consistent, and adversarial scenarios clearly identified policy violations. Ambiguous records were corrected or discarded.

As a final pass, three frontier models—OpenAI GPT-5.4, Google Gemini 3.1 Pro, and Anthropic Claude Opus 4.6—were run on text-only versions of the scenarios. For any scenario where a model scored zero, the team manually investigated whether the failure stemmed from genuine model error or a dataset issue, such as an ambiguous policy or a bug in the tool executor. Records with identified issues were corrected or removed, ensuring all selected samples are solvable by at least one frontier model.
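The triage step can be sketched as a simple filter over per-model scores. The scenario IDs, the placeholder model names, and the scores below are invented for illustration, and treating any non-zero score as "solved" is an assumption rather than the benchmark's stated criterion.

```python
from __future__ import annotations


def triage(scores: dict[str, dict[str, float]]) -> tuple[list[str], list[str]]:
    """Split scenarios into those needing manual review and those unsolved.

    scores maps scenario id -> {model name -> score}.
    """
    needs_review = sorted(
        sid for sid, by_model in scores.items()
        if any(score == 0 for score in by_model.values())
    )
    unsolved = sorted(
        sid for sid, by_model in scores.items()
        if all(score == 0 for score in by_model.values())
    )
    return needs_review, unsolved


# Illustrative scores only; not real benchmark results.
scores = {
    "itsm_001": {"model_a": 1.0, "model_b": 1.0, "model_c": 1.0},
    "itsm_002": {"model_a": 0.0, "model_b": 1.0, "model_c": 1.0},
    "itsm_003": {"model_a": 0.0, "model_b": 0.0, "model_c": 0.0},
}

needs_review, unsolved = triage(scores)
print(needs_review)  # ['itsm_002', 'itsm_003']
print(unsolved)      # ['itsm_003']
```

Scenarios in the first list go to a human to decide between model error and dataset issue; those in the second list are the candidates for correction or removal if no model can solve them.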

Deep dive: ITSM and Healthcare HRSD

The airline dataset targets its own set of challenges, and the new ITSM and Healthcare HRSD datasets were chosen to probe different axes of difficulty. All three require accurate transcription of structured named entities, such as confirmation codes and employee identifiers, but they differ in their primary challenge and in the number of tools involved.

Multilingual support

Evaluating only in English offers limited insight into performance in other languages. Speech recognition accuracy and conversational fluency can degrade in language-specific ways, causing a high-performing English agent to fail completely elsewhere. To provide real insight into multilingual deployments, the benchmark is expanding to adapt not just the conversation language but the entire evaluation pipeline to the target culture. This includes localising:

Names of locations referenced in scenarios
User names and email addresses
Phone number formats

This enables the user simulator to provide an authentic experience in the language of choice. Beyond the dataset, metrics and judges are also being updated to build a trustworthy evaluation across languages.
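One of the localised elements, phone number formats, can be sketched as a per-locale pattern. The `LOCALE_PROFILES` table and `format_phone` helper are hypothetical and only illustrate the idea; they do not reflect how EVA-Bench stores locale data.

```python
# Hypothetical per-locale profiles; '#' marks a digit slot.
LOCALE_PROFILES = {
    "en-US": {"phone_pattern": "+1 (###) ###-####"},
    "fr-FR": {"phone_pattern": "+33 # ## ## ## ##"},
}


def format_phone(digits: str, locale: str) -> str:
    """Render national digits in the display format of the given locale."""
    pattern = LOCALE_PROFILES[locale]["phone_pattern"]
    if not digits.isdigit() or len(digits) != pattern.count("#"):
        raise ValueError(f"expected {pattern.count('#')} digits for {locale}")
    remaining = iter(digits)
    return "".join(next(remaining) if ch == "#" else ch for ch in pattern)


print(format_phone("5550100123", "en-US"))  # +1 (555) 010-0123
print(format_phone("612345678", "fr-FR"))   # +33 6 12 34 56 78
```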

Get the data

EVA-Bench is fully open-source under the MIT license. The dataset, evaluation framework, and leaderboard are publicly available on the Hugging Face page. Users can load records directly using the datasets library.

Key takeaways

Three distinct domains, fourfold growth: The benchmark now spans Airline, ITSM, and Healthcare HRSD with 213 scenarios across 121 tools, offering a much broader test of voice agent adaptability.
Rigorous joint generation: Scenarios are created using a graph-based pipeline where user goals, databases, and ground truths are generated together to eliminate inconsistencies and ensure reproducibility.
Frontier model validation: Every scenario was validated against GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6 to ensure solvability while maintaining a high difficulty standard.
Real-world constraints: The datasets prioritise realistic enterprise constraints, including specific authentication flows, adversarial user behaviour, and unsatisfiable goals.
