LLM Guard scored 0/8 on a USENIX 2025 multi-turn jailbreak. Here’s what caught it instead.

**What Happened:**
LLM Guard scored 0 out of 8 on Crescendo, a multi-turn jailbreak from USENIX 2025 designed to evade monitors that only inspect the text of each turn. Every individual turn looked benign; the harmful intent only emerged across the sequence, and LLM Guard had no way to connect the turns and notice the anomaly.

**Why It Matters:**
The key insight: LLM Guard scored each prompt independently and kept no memory of earlier turns, so it had no mechanism for seeing how the conversation, and the model's state, evolved over time. Arc Sentry, by contrast, flagged Crescendo at Turn 3, when its score rose from 0.031 to 0.232 on a prompt that looked normal. Inspecting only the text of each turn is not enough; a monitor also needs to observe the model's state and how it changes across turns.
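To make the contrast concrete, here is a minimal sketch of the two styles of check. It is not LLM Guard's or Arc Sentry's code: the thresholds (`PER_TURN_THRESHOLD`, `DELTA_THRESHOLD`) and the names `stateless_flags` and `StatefulMonitor` are hypothetical. The only numbers taken from the text are 0.031 and 0.232, treated here as the scores of two consecutive turns.

```python
from dataclasses import dataclass, field

# Hypothetical thresholds, chosen only for illustration.
PER_TURN_THRESHOLD = 0.5  # stateless: flag a turn only if its own score is high
DELTA_THRESHOLD = 0.1     # stateful: flag a turn whose score jumps this much over the previous turn


def stateless_flags(scores: list[float]) -> list[bool]:
    """Judge each turn in isolation, with no memory of earlier turns."""
    return [s >= PER_TURN_THRESHOLD for s in scores]


@dataclass
class StatefulMonitor:
    """Remembers earlier scores and flags sudden increases between consecutive turns."""
    delta_threshold: float = DELTA_THRESHOLD
    history: list[float] = field(default_factory=list)

    def observe(self, score: float) -> bool:
        # The first turn has no predecessor, so it can never be flagged as a jump.
        flagged = bool(self.history) and (score - self.history[-1]) >= self.delta_threshold
        self.history.append(score)
        return flagged


# Scores from the text, treated as two consecutive turns.
scores = [0.031, 0.232]
print(stateless_flags(scores))  # [False, False]: neither turn looks bad on its own
monitor = StatefulMonitor()
print([monitor.observe(s) for s in scores])  # [False, True]: the jump is flagged
```

Both turns sit well below a per-turn threshold, so a memoryless check passes them; only a check that remembers the previous turn sees the change.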

– **LLM Guard’s Independence:** LLM Guard scores each input independently, without maintaining context or history, which is why it missed this attack.
– **State-Based Monitoring:** Arc Sentry detected the anomaly because it was looking at the evolving state of the model rather than just the output from individual turns. This underscores the importance of state-aware monitoring for multi-turn attacks.
– **Future Guidance:** Crescendo shows that future AI monitoring needs to cover both what each individual turn says and how the model's internal state evolves over multiple interactions.
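Arc Sentry's internals are not described here, so the sketch below only illustrates the general idea of state-based monitoring under loudly stated assumptions: each turn is represented by a hypothetical state vector (for example, pooled hidden activations from some layer), the first `warmup` turns define a baseline, and later turns are scored by cosine distance from that baseline. `DriftScorer` and `cosine_distance` are illustrative names, not a real API.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity: 0.0 means same direction, 1.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("state vectors must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


class DriftScorer:
    """Hypothetical scorer: how far has the model's state moved from its early-conversation baseline?"""

    def __init__(self, warmup: int = 2) -> None:
        self.warmup = warmup
        self._baseline_states: list[list[float]] = []

    def score(self, state: list[float]) -> float:
        # During warm-up, collect baseline states and report no drift.
        if len(self._baseline_states) < self.warmup:
            self._baseline_states.append(state)
            return 0.0
        # Component-wise centroid of the baseline turns.
        centroid = [sum(col) / len(col) for col in zip(*self._baseline_states)]
        return cosine_distance(centroid, state)


# Toy 2-D "states": the conversation drifts slowly, then swings sharply.
scorer = DriftScorer(warmup=2)
turns = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
print([round(scorer.score(t), 3) for t in turns])
```

The resulting per-turn drift values are the kind of signal a monitor could then track turn over turn; which layer to read, how to pool activations, and how long the warm-up should be are all open design choices.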

The divergence between these two approaches (LLM Guard’s independent scoring versus Arc Sentry’s state-based monitoring) illustrates why combining different types of checks is essential for robust security.
