“`html
Summary
I was getting a bit frustrated by the relatively slow prompt processing speed (PP/s) of 27B and 31B dense models on my Bosgame M5 Strix Halo, so I scrambled together a setup to work around it.
Details
- In short:
- Strix Halo alone (124GB UMA VRAM) is already nice, but adding one or two eGPUs helps when running the recently popular 27B or 31B dense models.
- The native bandwidth limit of eGPUs can be mitigated. I put together a 2-slot NVLink setup (cheaper than the 3-slot version) with a simple cooling mod on the 3090s, which may give up to several times better PP/s and token generation speed (TG/s) on small dense models.
- Riser cables give the eGPUs enough slot flexibility to fit a 2-slot NVLink bridge, with a small mod, on typical motherboard-style PCIe 3090 cards.
- For 27B dense models, power efficiency is better when running on Strix Halo alone via llama.cpp than on all three GPUs combined.
- NVLink does nothing for llama.cpp's layer split. I tried the recent -sm tensor mode and gained around 30% in TG/s, but the drop in PP/s was too large, so I stopped there and continued with vLLM on the dual 3090s.
- Adding eGPUs can improve performance for running larger dense models.
- NVLink does not significantly benefit llama.cpp’s layer splits, but it helps in managing bandwidth limits with multiple GPUs.
- Varying quantization and KV cache settings can lead to different results in power efficiency and model performance.
Test Environment
| Recipe | Quantization | KV cache | Context per sequence | Concurrency (total context) | Drafter |
|---|---|---|---|---|---|
| docker-compose-dual (small, INT4 Standard) | AutoRound INT4 | fp8_e5m2 | 131K | 4 (total ~524K) | MTP=3 |
| turbo (High-Concurrency) | AutoRound INT4 | TQ3 (3-bit) | 262K | 4 (total ~1048K) | MTP=3 |
| mixed-bf16 (Precision, kinda Q6 feeling) | Mixed (INT4+8) | bfloat16 | 110K | 2 (total ~220K) | MTP=3 |
| mixed-fp8 (Sweet Spot) | Mixed (INT4+8) | fp8_e5m2 | 131K | 2 (total ~262K) | MTP=2 |
| autoround INT8 (Largest) | AutoRound INT8 | fp8_e5m2 | 115K | 1 (total ~115K) | MTP=3 |
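
The "total" figures in the table appear to be the per-sequence context multiplied by the concurrency. A minimal sketch of that arithmetic, assuming this is how the totals were derived (the dictionary and function names are illustrative, not part of any recipe):

```python
# Per-sequence context (in thousands of tokens) and concurrency,
# copied from the table above. Names here are hypothetical.
RECIPES = {
    "docker-compose-dual": (131, 4),
    "turbo": (262, 4),
    "mixed-bf16": (110, 2),
    "mixed-fp8": (131, 2),
    "autoround INT8": (115, 1),
}


def total_context_k(context_k: int, concurrency: int) -> int:
    """Aggregate context budget in thousands of tokens.

    Assumes every concurrent sequence may use the full context window.
    """
    return context_k * concurrency


TOTALS = {name: total_context_k(ctx, n) for name, (ctx, n) in RECIPES.items()}

if __name__ == "__main__":
    for name, total in TOTALS.items():
        print(f"{name}: ~{total}K tokens total")
```

This makes the trade-off in the table easier to see: the lower-precision KV cache recipes leave room for more concurrent sequences or longer contexts in the same memory.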
Results
Power efficiency: For 27B dense models, the eGPU setup has better power efficiency. However, when running the 122B model, Strix Halo alone running llama.cpp was actually more power efficient.
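
The post does not say how power efficiency was measured. One common way to compare setups is tokens generated per joule, that is, throughput divided by average power draw. The sketch below assumes that definition; the function name is hypothetical, and the numbers in the usage example are placeholders, not measurements from these tests.

```python
def tokens_per_joule(tokens_per_second: float, avg_power_watts: float) -> float:
    """Tokens generated per joule of energy.

    Assumed definition: throughput (tokens/s) divided by average power (W).
    Since 1 W = 1 J/s, the result is in tokens per joule.
    """
    if avg_power_watts <= 0:
        raise ValueError("average power must be positive")
    return tokens_per_second / avg_power_watts


def more_efficient(a: tuple, b: tuple) -> str:
    """Return 'a' or 'b' for the setup with more tokens per joule.

    Each argument is a (tokens_per_second, avg_power_watts) pair.
    Ties resolve to 'a'.
    """
    return "a" if tokens_per_joule(*a) >= tokens_per_joule(*b) else "b"


if __name__ == "__main__":
    # Placeholder values for illustration only, not measured results.
    setup_a = (20.0, 100.0)
    setup_b = (60.0, 600.0)
    print(more_efficient(setup_a, setup_b))
```

A faster setup is not automatically the more efficient one: in the placeholder example, the second setup has three times the throughput but draws six times the power.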

Key Takeaways
“`




