The attack doesn’t come from your users.
It comes from your agent’s environment, the emails it reads, the webpages it visits, the documents it retrieves, the database rows it queries.
Every piece of external content your agent processes is a potential instruction source. And your agent has no way to tell the difference between data it was sent to process and commands it should follow.
This is not theoretical. It is happening in production systems right now.
Once you give an agent tools, email access, browser access, API calls, memory writes, the stakes change completely. A poisoned document doesn’t just return bad text. It tells your agent what to do next. And your agent does it.
We tested this. Arc Gate blocked 100% of agentic tool poisoning attacks across 54 scenarios from ETH Zurich’s AgentDojo benchmark. 99% on 200 blind test cases from University of Illinois InjecAgent. 0% false positives on legitimate workflows.
Arc Sentry caught a USENIX 2025 multi-turn jailbreak at Turn 3. LLM Guard caught 0 out of 8 turns on the same attack.
The difference is architecture. Text classifiers read what the prompt says. Arc Gate enforces where instructions are allowed to come from. Arc Sentry reads what the model’s internal state does, before generate() is even called.
If your agent touches the real world, you need a runtime governance layer.
Finance agent demo — no signup: https://web-production-6e47f.up.railway.app/finance-demo
Arc Gate — hosted proxy, one URL change: https://github.com/9hannahnine-jpg/arc-gate — $29/month
Arc Sentry — self-hosted models: https://github.com/9hannahnine-jpg/arc-sentry — pip install arc-sentry
submitted by /u/Turbulent-Tap6723
[link] [comments]
Originally published at reddit.com. Curated by AI Maestro.
Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.




