New Microsoft tool lets devs spin up AI behavior tests using text descriptions

When developers and researchers build custom AI systems, generic safety checks are no longer enough. The real challenge now is ensuring your…

By AI Maestro June 2, 2026 2 min read
New Microsoft tool lets devs spin up AI behavior tests using text descriptions

When developers and researchers build custom AI systems, generic safety checks are no longer enough. The real challenge now is ensuring your specific model adheres to your unique product policies, constraints, and business rules. Microsoft has launched a new open-source tool called ASSERT to bridge that gap, allowing teams to generate rigorous, application-specific behaviour tests simply by writing natural language descriptions.

Turning text descriptions into automated tests

ASSERT, which stands for Adaptive Spec-driven Scoring for Evaluation and Regression Testing, automates the creation of test cases. Instead of manually writing thousands of scenarios, developers provide high-level goals, policies, or intended behaviours in plain English. The framework then translates these inputs into structured sets of acceptable and unacceptable actions.

The tool generates problem scenarios, runs them against the target system, and scores the results. Crucially, it records the full execution path, including intermediate steps and tool calls. This visibility allows engineers to inspect exactly where and why a system fails to meet expectations.

Customising evaluations for real-world contexts

The framework accepts system context, available tools, and specific constraints to tailor the evaluation. For instance, a developer configuring a document research agent could specify rules such as:

  • Never sending emails to individuals outside the organisation.
  • Restricting access to confidential data to C-level executives only.
  • Providing concise summaries that account for prior context.

ASSERT uses these directives to automatically generate test cases that verify compliance with these rules on an ongoing basis.

Why specific context matters

Microsoft notes that broad, general-purpose evaluations often miss the nuances required when an AI model is shaped by an application’s specific context, policies, and tools. Generic benchmarks cannot fully capture whether a system behaves correctly within a proprietary ecosystem.

“One of the things we’ve learned is that evaluations are absolutely critical to making good decisions,” said Sarah Bird, chief product officer of Responsible AI at Microsoft. “Because if you don’t understand the behavior of the AI system, it’s really hard to know if it’s meeting your organization’s bar […] What we found is that if you really want to have a trustworthy system, you should evaluate many more dimensions that are application-specific.”

Bird highlighted that ASSERT supports the entire lifecycle of an AI system. It can be used during the initial build phase, after deployment, and for continuous monitoring to ensure behaviour remains aligned over time.

This release arrives as the industry shifts focus from raw capability to repeatable testing and regression checks. While researchers have long relied on general benchmarks like Stanford’s HELM, MLCommons’ AILuminate, and groups such as METR to measure model performance, the need for targeted, context-aware validation is growing as models become more integrated into specific workflows.

Key takeaways

  • ASSERT converts natural language policy descriptions into automated, scored test cases tailored to specific application contexts.
  • The tool provides deep visibility into AI decision paths, recording intermediate actions and tool calls to pinpoint failures.
  • Evaluations must move beyond general benchmarks to include application-specific dimensions for truly trustworthy systems.
  • The framework supports testing at every stage, from initial development through to continuous post-deployment monitoring.

Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.

Name
Scroll to Top