Investigating a Transformer’s Forward Activations Through a Lossy Dual E8 (E16) Lattice Bottleneck
I explored whether it is feasible to route a transformer’s forward activations through a lossy Dual E8 lattice bottleneck and inject them back into the residual stream, and where the boundary of generative stability lies.
The Mechanism
A standard LLM represents its internal states as high-dimensional float vectors. Instead of applying a typical scalar quantization scheme such as INT4, I mapped these activations onto a conceptual torus via a sinusoidal mapping and projected them onto Dual E8 lattice hemispheres.
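The post doesn’t spell out the projection, so the following is a minimal sketch of one plausible reading rather than the exact code: an elementwise sine as the torus map, a standard E8 nearest-point decoder applied per 8-dimensional block, and an arcsine as the approximate inverse. The `scale` factor and the 8-wide block layout are assumptions.

```python
import torch

def nearest_d8(x):
    """Nearest point of D8 (integer vectors with even coordinate sum) per 8-dim block."""
    f = torch.round(x)
    err = x - f
    odd = (f.sum(dim=-1).long() % 2 != 0)
    # Where the coordinate sum is odd, re-round the worst coordinate the other way.
    idx = err.abs().argmax(dim=-1, keepdim=True)
    step = torch.where(err.gather(-1, idx) >= 0,
                       torch.ones_like(idx, dtype=x.dtype),
                       -torch.ones_like(idx, dtype=x.dtype))
    fixed = f.scatter_add(-1, idx, step)
    return torch.where(odd.unsqueeze(-1), fixed, f)

def nearest_e8(x):
    """Nearest E8 point: the better of the D8 point and the shifted D8 + 1/2 point."""
    a = nearest_d8(x)
    b = nearest_d8(x - 0.5) + 0.5
    use_b = ((x - b) ** 2).sum(-1, keepdim=True) < ((x - a) ** 2).sum(-1, keepdim=True)
    return torch.where(use_b, b, a)

def lattice_bottleneck(h, scale=4.0):
    """Sinusoidal 'torus' map -> per-block E8 quantization -> approximate inverse."""
    x = torch.sin(h.float())
    blocks = x.reshape(*x.shape[:-1], -1, 8)      # hidden dim assumed divisible by 8
    q = nearest_e8(blocks * scale) / scale        # snap to the scaled E8 lattice
    y = torch.asin(q.clamp(-1.0, 1.0))            # pull back to activation space
    return y.reshape(h.shape).to(h.dtype)
```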
The β = 0.20 Sweep (Qwen2.5-0.5B)
Sweeping the blend ratio β from 0.10 to 0.50 across layers 8–13 of `Qwen2.5-0.5B` revealed a sharp phase transition (an injection sketch follows the list):
- **β ≥ 0.25**: Generation succumbs to heavy repetition pressure and semantic drift. The geometry acts as an attractor, trapping the decoding process (“loop-lock”).
- **β = 0.20**: This is the highest injection ratio of lossy geometric signal that maintains both numerical activation fidelity (Avg Cosine > 0.99) and open-ended generation quality (low repeated n-grams).
- **β ≤ 0.10**: The perturbation is largely absorbed by the transformer’s layer normalizations, making the intervention invisible.
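The post doesn’t show how the signal was injected beyond naming the blend ratio and layers 8–13. A minimal forward-hook sketch, assuming the `lattice_bottleneck` helper above and a linear blend `(1 − β) · h + β · q(h)` on each layer’s output hidden states, could look like this:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

BETA = 0.20
LAYERS = range(8, 14)                       # layers 8–13, as in the sweep

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

def make_hook(beta):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        patched = (1.0 - beta) * hidden + beta * lattice_bottleneck(hidden)
        if isinstance(output, tuple):
            return (patched,) + tuple(output[1:])
        return patched
    return hook

handles = [model.model.layers[i].register_forward_hook(make_hook(BETA)) for i in LAYERS]

ids = tok("The E8 lattice is", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))

# Removing the hooks restores the unmodified baseline for comparison runs.
for h in handles:
    h.remove()
```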
Here are the data from a 300-iteration sweep:
| β | Min Cosine | Avg Cosine | Max MSE | Rep-3g (Repetition Rate) |
|---|---|---|---|---|
| 0.10 | 0.9972 | 0.9979 | 0.0024 | 0.134 |
| 0.20 | 0.9907 | 0.9916 | 0.0106 | 0.093 |
| 0.25 | 0.9839 | 0.9865 | 0.0171 | 0.084 |
| 0.30 | 0.9648 | 0.9771 | 0.0255 | 0.190 |
| 0.50 | 0.9171 | 0.9288 | 0.0850 | 0.412 |
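The metric definitions aren’t stated explicitly; one common reading, assumed here, is a per-position cosine over the hidden dimension (Min/Avg Cosine), the worst per-position MSE (Max MSE), and Rep-3g as the fraction of duplicate trigrams in the generated tokens:

```python
import torch

def activation_fidelity(h_base, h_patched):
    """Per-position cosine over the hidden dim; report min/avg cosine and worst per-position MSE."""
    cos = torch.nn.functional.cosine_similarity(h_base, h_patched, dim=-1)
    mse = ((h_base - h_patched) ** 2).mean(dim=-1)
    return cos.min().item(), cos.mean().item(), mse.max().item()

def rep_3g(token_ids):
    """Rep-3g: fraction of trigrams in a generation that repeat an earlier trigram."""
    grams = [tuple(token_ids[i:i + 3]) for i in range(len(token_ids) - 2)]
    return 0.0 if not grams else 1.0 - len(set(grams)) / len(grams)
```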
Semantic scoring (evaluating prompt relevance and similarity to the unmodified baseline; one possible scorer is sketched after the table):
| β | Avg Cosine | Rep-3g | Relevance | Patched-to-Baseline Sim |
|---|---|---|---|---|
| 0.10 | 0.9980 | 0.223 | 0.781 | 0.889 |
| 0.20 | 0.9918 | 0.075 | 0.752 | 0.854 |
| 0.25 | 0.9871 | 0.232 | 0.717 | 0.801 |
| 0.30 | 0.9760 | 0.392 | 0.725 | 0.764 |
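The post doesn’t say how relevance and baseline similarity were scored. A typical setup embeds the prompt and both generations with a sentence encoder and takes cosine similarities; the encoder name below is an assumed stand-in, not the one used in the original runs:

```python
from sentence_transformers import SentenceTransformer, util

# Assumed stand-in scorer; the post does not name one.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_scores(prompt, baseline_text, patched_text):
    p, b, g = embedder.encode([prompt, baseline_text, patched_text], convert_to_tensor=True)
    relevance = util.cos_sim(g, p).item()        # prompt relevance of the patched generation
    baseline_sim = util.cos_sim(g, b).item()     # patched-to-baseline similarity
    return relevance, baseline_sim
```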
Generalization Across Larger Models
The β = 0.20 boundary generalizes to larger models (`Qwen2.5-1.5B` and `Qwen2.5-3B`) along the activation-cosine axis:
| Model | β | Min Cosine | Avg Cosine | Max MSE | Rep-3g |
|---|---|---|---|---|---|
| 1.5B | 0.10 | 0.9988 | 0.9989 | 0.0027 | 0.267 |
| 1.5B | 0.20 | 0.9862 | 0.9939 | 0.0105 | 0.128 |
| 1.5B | 0.25 | 0.9904 | 0.9919 | 0.0166 | 0.398 |
| 1.5B | 0.30 | 0.9733 | 0.9815 | 0.0235 | 0.307 |
| 1.5B | 0.40 | 0.9368 | 0.9551 | 0.0487 | 0.191 |
| 3B (4-bit) | 0.10 | 0.9964 | 0.9976 | 0.0122 | 0.033 |
| 3B (4-bit) | 0.20 | 0.9861 | 0.9904 | 0.0455 | 0.115 |
| 3B (4-bit) | 0.25 | 0.9604 | 0.9799 | 0.0654 | 0.043 |
| 3B (4-bit) | 0.30 | 0.9702 | 0.9778 | 0.0987 | 0.050 |
| 3B (4-bit) | 0.40 | 0.9158 | 0.9390 | 0.1728 | 0.025 |
Note: In the 3B model, repetition pressure remained low across all sweeps, but the activation cosine degraded in the same pattern at β ≥ 0.25.
Storage Compression Prototypes
Using the Dual E8/E16 lattice as a computational substrate also yields high theoretical storage efficiency in early prototypes (a minimal quantization sketch follows this list):
- KV Cache (8×): FP16 KV cache compressed to INT8 coordinates, reducing footprint from 0.21 MB to 0.02 MB.
- Weights (112×): Projected a dense $[4864, 896]$ MLP weight matrix down to a 0.07 MB E16 footprint. (Cosine similarity of the uncalibrated weight matrix multiplication was limited to $\sim$0.078, indicating that quantization-aware training is mandatory for parameter viability.) A pre-projected decompression bypass was designed to run matrix multiplications directly against lattice coordinates without upcasting, avoiding memory-bandwidth bottlenecks.
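The post gives only the headline ratios, so the sketch below covers just the coordinate-quantization step for the KV cache, assuming the `nearest_e8` decoder from the first sketch and a per-tensor scale; the packing behind the reported 8× figure isn’t described in the post and isn’t shown here.

```python
import torch

def compress_kv(kv_fp16, target=4.0):
    """Quantize an FP16 KV tensor to int8 E8-lattice coordinates (last dim assumed divisible by 8)."""
    x = kv_fp16.float()
    s = target / x.abs().max().clamp_min(1e-8)           # per-tensor scale so coordinates fit in int8
    q = nearest_e8((x * s).reshape(*x.shape[:-1], -1, 8))
    coords = (2.0 * q).to(torch.int8)                     # doubling makes half-integer points integral
    return coords.reshape(x.shape), s

def decompress_kv(coords, s):
    return (coords.float() / (2.0 * s)).to(torch.float16)
```

Doubling the lattice point turns the half-integer E8 coordinates into exact integers, which is why they survive the int8 cast without loss.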




