Scrambling to max StrixHalo (+NVLink dual eGPU 3090 mod)


By AI Maestro May 22, 2026 2 min read

Summary

I was getting a bit frustrated by the relatively slow prompt processing (PP/s) of 27B and 31B dense models on my Bosgame M5 Strix Halo, so I did some scrambling to get around it.

Details

  • In short:
    • Strix Halo alone (124GB UMA VRAM) is already nice, but adding one or two eGPUs can help with the recently popular 27B or 31B dense models.
    • The native bandwidth limit of eGPUs can be mitigated. I scrambled together a 2-slot NVLink setup (cheaper than a 3-slot one) with a simple cooling mod on the 3090s, which might give up to several times better PP/s and TG/s on small dense models.
    • A riser cable gives the eGPU enough slot flexibility to fit a 2-slot NVLink bridge, with a small mod, on typical motherboard PCIe 3090 cards.
    • For 27B dense models, power efficiency is better when running on Strix Halo alone via llama.cpp than on all three GPUs combined.
    • NVLink does nothing for llama.cpp's layer split. I tried the recent -sm tensor mode and gained around 30% in TG/s, but the drop in PP/s was too large, so I stopped and continued with vLLM on the dual 3090s.
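As a rough illustration of the vLLM-on-dual-3090 route, here is a minimal sketch that only assembles a `vllm serve` command line and prints it; it does not launch anything. The values come from the docker-compose-dual recipe in the Test Environment table. The model name is a hypothetical placeholder, mapping "Concurrency" to `--max-num-seqs` and "131K" to 131072 tokens are my assumptions, and the MTP drafter setting is left out because the post does not show how it was configured.

```python
import shlex


def build_vllm_serve_command(model, tensor_parallel_size, kv_cache_dtype,
                             max_model_len, max_num_seqs):
    """Return an argv list for `vllm serve`; nothing is executed here."""
    return [
        "vllm", "serve", model,
        # Split each layer's weights across both 3090s (tensor parallelism).
        "--tensor-parallel-size", str(tensor_parallel_size),
        # KV cache storage format, as listed in the recipe table.
        "--kv-cache-dtype", kv_cache_dtype,
        # Per-request context limit in tokens.
        "--max-model-len", str(max_model_len),
        # Assumption: the table's "Concurrency" maps to max concurrent sequences.
        "--max-num-seqs", str(max_num_seqs),
    ]


cmd = build_vllm_serve_command(
    model="your-org/your-27b-int4-model",  # hypothetical placeholder
    tensor_parallel_size=2,                # dual 3090
    kv_cache_dtype="fp8_e5m2",
    max_model_len=131072,                  # assumption: "131K" = 128 * 1024
    max_num_seqs=4,
)
print(shlex.join(cmd))
```

Tensor parallelism exchanges activations between the GPUs at every layer, which is why GPU-to-GPU bandwidth (and so NVLink) matters here, while a layer split only hands activations over at the split boundary.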

    Test Environment

    Recipe | Quantization | KV cache | Context | Concurrency | Drafter
    docker-compose-dual (small, INT4 Standard) | AutoRound INT4 | fp8_e5m2 | 131K | 4 (total ~524K) | MTP=3
    turbo (High-Concurrency) | AutoRound INT4 | TQ3 (3-bit) | 262K | 4 (total ~1048K) | MTP=3
    mixed-bf16 (Precision, kinda Q6 feeling) | Mixed (INT4+8) | bfloat16 | 110K | 2 (total ~220K) | MTP=3
    mixed-fp8 (Sweet Spot) | Mixed (INT4+8) | fp8_e5m2 | 131K | 2 (total ~262K) | MTP=2
    autoround INT8 (Largest) | AutoRound INT8 | fp8_e5m2 | 115K | 1 (total ~115K) | MTP=3
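The "total" figures in the Concurrency column look like per-request context multiplied by the number of concurrent requests. A small sketch to check that arithmetic, with the recipe names and numbers copied from the table (the dictionary layout and function name are just illustrative):

```python
# (context in K tokens, concurrency) per recipe, copied from the table.
recipes = {
    "docker-compose-dual": (131, 4),
    "turbo": (262, 4),
    "mixed-bf16": (110, 2),
    "mixed-fp8": (131, 2),
    "autoround INT8": (115, 1),
}


def total_context_k(context_k, concurrency):
    """Total KV-cache token budget in K tokens across all concurrent requests."""
    return context_k * concurrency


totals = {name: total_context_k(ctx, conc) for name, (ctx, conc) in recipes.items()}

for name, total in totals.items():
    print(f"{name}: ~{total}K tokens")
```

This reproduces the ~524K, ~1048K, ~220K, ~262K and ~115K totals listed in the table.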

    Results

    Power efficiency: for 27B dense models, the eGPU setup is more power efficient. For the 122B model, however, Strix Halo alone running llama.cpp was actually more power efficient.

    [Figure: Power efficiency of 27B dense models]
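The post does not spell out how power efficiency was computed. One common way to express it is tokens per joule, which is throughput (tokens/s) divided by wall power (watts). The sketch below assumes that definition, and the numbers in it are made-up placeholders, not measurements from this setup.

```python
def tokens_per_joule(tokens_per_second, watts):
    """Throughput per unit of power: (tokens/s) / (J/s) = tokens per joule."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return tokens_per_second / watts


# Placeholder values for illustration only; NOT measured results.
example_setups = {
    "setup_a": {"tokens_per_second": 20.0, "watts": 100.0},
    "setup_b": {"tokens_per_second": 60.0, "watts": 600.0},
}

efficiency = {
    name: tokens_per_joule(s["tokens_per_second"], s["watts"])
    for name, s in example_setups.items()
}

for name, value in efficiency.items():
    print(f"{name}: {value:.3f} tokens/J")
```

With these placeholder inputs the faster setup comes out less efficient per joule, which is the kind of trade-off that raw TG/s or PP/s numbers alone would hide.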

    Key Takeaways

    • Adding one or two eGPUs can improve PP/s and TG/s on 27B and 31B dense models.
    • NVLink does not help llama.cpp's layer split, but it helps mitigate the eGPU bandwidth limit when the model is split across both 3090s in vLLM.
    • Different quantization and KV cache settings lead to different power efficiency and model performance.
