The six models
Heretic and Huihui are the top two for capability preservation: Huihui has the smallest benchmark deltas, and Heretic has the lowest KL divergence.
All five abliterated models reach near-complete safety removal. AEON’s "enhanced capabilities" claim is contradicted by the data.
Discontinued: HauhauCS is dropped from all future comparisons because its lossless claims were debunked and the tool was plagiarized.
Benchmarks
| Task | Base | Heretic | HauhauCS | Huihui | AEON | Abliterix |
|---|---|---|---|---|---|---|
| MMLU | 83.3% | 82.8% | 83.9% | 83.4% | 82.9% | 81.3% |
| HellaSwag | 83.5% | 83.2% | 83.1% | 83.5% | 82.7% | 77.3% |
| ARC Challenge | 59.1% | 58.0% | 57.9% | 59.5% | 56.1% | 53.2% |
| WinoGrande | 77.7% | 77.7% | 77.7% | 77.4% | 75.3% | 74.9% |
| TruthfulQA MC2 | 56.7% | 51.1% | 47.2% | 54.8% | 46.1% | 48.7% |
| PiQA | 81.0% | 81.0% | 81.0% | 81.2% | 80.4% | 75.7% |
| GSM8K (7168 tok) | 34.4% | 27.5% | 51.0% | 75.1% | 51.2% | 37.6% |
| Lambada (ppl) | 3.18 | 3.24 | 3.35 | 3.15 | 3.44 | 9.12 |
Something is strange about the GSM8K results, and I don’t yet know the cause, so please take them with a grain of salt. If I find the exact reason for these scores, I’ll update here.
Delta vs base (percentage points)
| Task | Heretic | Huihui | AEON | Abliterix |
|---|---|---|---|---|
| MMLU | -0.5 | +0.1 | -2.0 | -6.0 |
| HellaSwag | -0.3 | +0.0 | -0.8 | -6.2 |
| ARC Challenge | -1.1 | +0.4 | -3.0 | -5.9 |
| WinoGrande | +0.0 | -0.3 | -2.4 | -2.8 |
| TruthfulQA MC2 | -5.6 | -1.9 | -10.6 | -8.0 |
| PiQA | +0.0 | +0.2 | -0.6 | -5.3 |
| GSM8K | -6.9 | +16.8 | +40.7 | +3.2 |
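
Each delta is the variant's score minus the base score on the same task, in percentage points. As a minimal sketch of that arithmetic, the snippet below recomputes a few deltas from values copied out of the Benchmarks table. The `scores` dictionary and the `deltas_vs_base` helper are illustrative names, not part of any benchmark tooling, and only three tasks and two variants are included to keep it short.

```python
# Scores copied from the Benchmarks table (accuracy, in percent).
# Only a subset of tasks and variants is shown here.
scores = {
    "HellaSwag": {"Base": 83.5, "Heretic": 83.2, "Huihui": 83.5},
    "ARC Challenge": {"Base": 59.1, "Heretic": 58.0, "Huihui": 59.5},
    "TruthfulQA MC2": {"Base": 56.7, "Heretic": 51.1, "Huihui": 54.8},
}


def deltas_vs_base(scores, base="Base"):
    """Return variant-minus-base differences, rounded to one decimal."""
    out = {}
    for task, row in scores.items():
        out[task] = {
            model: round(value - row[base], 1)
            for model, value in row.items()
            if model != base
        }
    return out


deltas = deltas_vs_base(scores)
for task, row in deltas.items():
    # The "+" format flag prints an explicit sign, as in the delta table.
    print(task, " ".join(f"{model}={d:+.1f}" for model, d in row.items()))
```

Rounding to one decimal matters because floating-point subtraction leaves tiny residues (for example `58.0 - 59.1` is not exactly `-1.1`).
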
HarmBench
| Variant | ASR | Empty responses | Full CoT ASR |
|---|---|---|---|
| Base | 25.8% | 1 | 26.0% |
| Huihui | 98.5% | 5 | 99.8% |
| HauhauCS | 94.5% | 22 | 100.0% |
| Abliterix | 94.5% | 22 | 100.0% |
| Heretic | 92.5% | 30 | 100.0% |
| AEON | 88.8% | 45 | 100.0% |
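
The post does not say how many prompts were used or how the Full CoT ASR column is computed. As a rough sanity check only, the sketch below tests one possible reading: assuming 400 prompts, and assuming Full CoT ASR counts the empty responses as successes, the three columns reconcile to within rounding. Both assumptions are mine, and `TOTAL_PROMPTS` and `estimated_full_cot_asr` are hypothetical names used for illustration.

```python
TOTAL_PROMPTS = 400  # ASSUMPTION: the prompt count is not stated in the post

# (ASR %, empty count, reported Full CoT ASR %) copied from the table above.
rows = {
    "Base": (25.8, 1, 26.0),
    "Huihui": (98.5, 5, 99.8),
    "HauhauCS": (94.5, 22, 100.0),
    "Abliterix": (94.5, 22, 100.0),
    "Heretic": (92.5, 30, 100.0),
    "AEON": (88.8, 45, 100.0),
}


def estimated_full_cot_asr(asr_pct, empty, total=TOTAL_PROMPTS):
    """Estimate Full CoT ASR, ASSUMING empty responses count as successes."""
    # Recover the approximate number of successful attacks from the percentage.
    successes = round(asr_pct / 100 * total)
    return 100 * (successes + empty) / total


for name, (asr, empty, reported) in rows.items():
    est = estimated_full_cot_asr(asr, empty)
    print(f"{name}: estimated {est:.2f}%, reported {reported:.1f}%")
```

If the real prompt count or the column definition differs, this check does not apply; it only shows that the reported numbers are consistent with this one interpretation.
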




