Introduction
I have been exploring how to use a geometric framework, specifically the E8 lattice, to route safety decisions in language models (LLMs). My goal was to see whether this high-dimensional mathematical substrate could eliminate the need for bloated, latency-heavy LLM judges.
The Architecture: STE-Snapped E8 Policy Heads
To achieve this, I trained a supervised classifier head directly on top of MiniLM sentence embeddings, projecting them into E8 lattice coordinates while preserving continuous gradient learning via a Straight-Through Estimator (STE). The architecture now looks like this (a minimal sketch of the head follows the pipeline):
request → MiniLM → E8 soft-blend head (STE-snapped) → Rule-margin hybrid controller → JSON template
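For the curious, here is a minimal PyTorch sketch of such a head, under stated assumptions: the 384-dimensional input matches MiniLM (all-MiniLM-L6-v2); the label count, layer sizes, and the meaning of "soft-blend" (a fixed mix of snapped and soft coordinates) are my guesses, not the post's actual configuration. The nearest-point decoder uses the standard Conway-Sloane construction of E8 as D8 ∪ (D8 + ½).

```python
import torch
import torch.nn as nn


def _closest_d8(x: torch.Tensor) -> torch.Tensor:
    """Nearest point in D8 = {v in Z^8 : sum(v) even} (Conway-Sloane decoder)."""
    f = torch.round(x)
    err = x - f
    # If the rounded point has an odd coordinate sum, re-round the coordinate
    # with the largest rounding error to the other side.
    k = err.abs().argmax(dim=-1, keepdim=True)
    g = err.gather(-1, k)
    step = torch.where(g >= 0, torch.ones_like(g), -torch.ones_like(g))
    fixed = f.scatter_add(-1, k, step)
    odd = (f.sum(dim=-1, keepdim=True) % 2).bool()
    return torch.where(odd, fixed, f)


def snap_e8(x: torch.Tensor) -> torch.Tensor:
    """Snap each row of x (shape [..., 8]) to its nearest E8 lattice point."""
    a = _closest_d8(x)              # candidate from the D8 coset
    b = _closest_d8(x - 0.5) + 0.5  # candidate from the D8 + 1/2 coset
    a_closer = ((x - a) ** 2).sum(-1, keepdim=True) <= ((x - b) ** 2).sum(-1, keepdim=True)
    return torch.where(a_closer, a, b)


class E8PolicyHead(nn.Module):
    """Projects a MiniLM embedding to R^8, snaps to E8 via an STE, classifies."""

    def __init__(self, in_dim: int = 384, n_labels: int = 16, blend: float = 0.5):
        super().__init__()
        self.proj = nn.Linear(in_dim, 8)
        self.cls = nn.Linear(8, n_labels)
        self.blend = blend  # assumed reading of "soft-blend": snapped/soft mix

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        z = self.proj(emb)
        # Straight-through estimator: the forward pass uses the snapped lattice
        # point, the backward pass treats the snap as the identity.
        z_snap = z + (snap_e8(z) - z).detach()
        z_mix = self.blend * z_snap + (1.0 - self.blend) * z
        return self.cls(z_mix)
```

The STE is the load-bearing trick here: the snap is piecewise constant, so without it the projection layer would receive zero gradient almost everywhere and the head could not be trained end to end.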
Clean Success: Phase 37 Holdouts
We expanded the suite to 28 policy cases and used a hybrid controller that combines the E8 head with a margin threshold of $0.20$ to trigger human escalation or rule overrides (the controller logic is sketched after the results). On clean data, generalization across unseen policy families was excellent:
- Exact Label Match: 0.979
- Decision Match: 0.986
- Policy Match: 0.979
- Unsafe Allow: 0.000
- Over-Refusal: 0.014
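As a rough illustration of the controller, here is a hedged sketch. The post does not specify the margin definition, so I assume the gap between the top two softmax probabilities of the E8 head; the verdict strings and the returned dict (standing in for the JSON template) are likewise illustrative.

```python
import torch

MARGIN_THRESHOLD = 0.20  # below this confidence margin, the head is not trusted


def route(logits: torch.Tensor, rule_verdict: str | None = None) -> dict:
    """Hybrid rule-margin controller for a single request (1-D logits).

    Deterministic rules win outright; otherwise a margin gate decides whether
    the E8 head's answer is decisive enough to act on.
    """
    if rule_verdict is not None:
        # An audited deterministic rule fired: override the geometric head.
        return {"decision": rule_verdict, "source": "rule_override"}
    probs = torch.softmax(logits, dim=-1)
    top2 = probs.topk(2).values
    margin = (top2[0] - top2[1]).item()
    if margin < MARGIN_THRESHOLD:
        # Ambiguous geometry: escalate rather than risk an unsafe allow.
        return {"decision": "escalate_to_human", "source": "margin_gate"}
    return {"decision": int(probs.argmax().item()), "source": "e8_head"}
```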
The Crash: Adversarial Evasion (Phase 38)
To test the robustness of the architecture, we subjected it to a 40-case adversarial suite covering evasion techniques, indirect harm, multilingual attacks, and policy-priority conflicts. The raw geometric head buckled under these attacks; the hybrid controller, however, kept the safety-critical metrics intact:
- Exact Label Match: 0.950
- Unsafe Allow: 0.000
- Harmful Miss: 0.000
- Benign Block: n/a
The Transfer Deficit: Phase 40
To see whether the E8 geometric head could learn adversarial robustness, we trained it on adversarial data while holding out one entire adversarial family at a time (the protocol is sketched after the results). This helped fit the boundary for seen adversarial vectors, but it failed to transfer to unseen ones:
- Exact Label Match: 0.467 (Direct Head)
- Unsafe Allow: 0.533
- Harmful Miss: 0.533
- Benign Block: n/a
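For concreteness, the holdout protocol looks roughly like the sketch below; `train_head` and `evaluate` are hypothetical stand-ins for the actual training and metric code, and the family labels are illustrative.

```python
def leave_one_family_out(cases, train_head, evaluate):
    """Phase-40-style protocol: hold out one adversarial family at a time.

    `cases` is an iterable of (embedding, label, family) triples; `train_head`
    fits a fresh E8 head and `evaluate` returns metrics such as exact-label
    match and unsafe-allow rate (both hypothetical helpers).
    """
    families = sorted({family for _, _, family in cases})
    results = {}
    for held_out in families:
        train = [(e, y) for e, y, f in cases if f != held_out]
        test = [(e, y) for e, y, f in cases if f == held_out]
        head = train_head(train)                  # sees every family but one
        results[held_out] = evaluate(head, test)  # transfer to the unseen family
    return results
```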
Key Takeaways
- The hybrid rule layer significantly improved safety under adversarial conditions.
- Direct geometric heads are not safe controllers and can leak unsafe allows.
- Robustness to unseen adversarial strategies requires an additional, audited deterministic rule layer (a minimal sketch follows).
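To make the last point concrete, an audited rule layer can be as simple as a human-reviewed list of pattern-to-verdict pairs that short-circuits the head; the patterns below are illustrative, not the actual rule set, and the verdict feeds the `rule_verdict` parameter of the controller sketched earlier.

```python
import re

# Each entry is reviewed and version-controlled before deployment
# (hence "audited"); all patterns here are hypothetical examples.
RULES: list[tuple[re.Pattern, str]] = [
    (re.compile(r"\bignore (all )?previous instructions\b", re.I), "refuse"),
    (re.compile(r"\bsynthesi[sz]e .*(nerve agent|toxin)\b", re.I), "refuse"),
]


def rule_verdict(text: str) -> str | None:
    """Return a deterministic verdict if any audited rule fires, else None."""
    for pattern, verdict in RULES:
        if pattern.search(text):
            return verdict
    return None  # no rule fired; fall through to the margin-gated E8 head
```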
Originally published at reddit.com.