Scrambling to max StrixHalo (+NVLink dual eGPU 3090 mod)


Summary

I was getting frustrated by the relatively slow prompt processing (PP/s) of 27B and 31B dense models on my Bosgame M5 Strix Halo, so I did some scrambling to work around it.

Details

In short:

Strix Halo alone (124GB UMA VRAM) is already nice, but adding one or two eGPUs helps when running the recently popular 27B or 31B dense models.
The native bandwidth limit of eGPUs can be mitigated. I scrambled together a 2-slot NVLink setup (cheaper than 3-slot) with a simple cooling mod on the 3090s, which might give up to several times better PP/s and TG/s on small dense models.
A riser cable gives the eGPU enough slot flexibility to fit a 2-slot NVLink bridge, with a small mod, on typical motherboard-PCIe 3090 cards.
The power efficiency of 27B dense models is better when running on Strix Halo alone via llama.cpp than on all three GPUs combined.
NVLink does nothing for llama.cpp's layer split. I tried the recent -sm tensor mode and gained around 30% in TG/s, but the PP/s drop was too large, so I stopped and continued with vLLM on the dual 3090s.
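
One way to see why the GPU interconnect matters so little for layer split and so much for tensor split is to estimate how many bytes cross between GPUs per generated token. The sketch below is a rough illustration, not a measurement: it assumes a Megatron-style tensor split with two all-reduces per layer, and the hidden size and layer count are hypothetical placeholders rather than the dimensions of any model tested here.

```python
# Rough per-token inter-GPU traffic for two ways of splitting a dense model.
# All model dimensions below are HYPOTHETICAL placeholders, not measurements.


def layer_split_bytes(hidden_size: int, bytes_per_elem: int, num_gpus: int) -> int:
    """Layer split: one hidden-state vector crosses each GPU boundary per token."""
    return (num_gpus - 1) * hidden_size * bytes_per_elem


def tensor_split_bytes(num_layers: int, hidden_size: int, bytes_per_elem: int) -> int:
    """Tensor split (assumed Megatron-style): two all-reduces per layer,
    each over one hidden-state vector. Counts payload only, no protocol overhead."""
    return num_layers * 2 * hidden_size * bytes_per_elem


HIDDEN = 5120   # hypothetical hidden size
LAYERS = 64     # hypothetical layer count
FP16_BYTES = 2  # bytes per activation element at 16-bit precision

layer_traffic = layer_split_bytes(HIDDEN, FP16_BYTES, num_gpus=2)
tensor_traffic = tensor_split_bytes(LAYERS, HIDDEN, FP16_BYTES)

print(f"layer split : {layer_traffic} bytes/token")
print(f"tensor split: {tensor_traffic} bytes/token")
print(f"ratio       : {tensor_traffic // layer_traffic}x")
```

Under these assumptions the tensor split moves roughly two orders of magnitude more data between GPUs per token, which is consistent with a fast link only paying off once the work is split within layers.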

Test Environment

Recipe	Quantization	KV cache	Context	Concurrency	Drafter
docker-compose-dual (small, INT4 Standard)	AutoRound INT4	fp8_e5m2	131K	4 (total ~524K)	MTP=3
turbo (High-Concurrency)	AutoRound INT4	TQ3 (3-bit)	262K	4 (total ~1048K)	MTP=3
mixed-bf16 (Precision, kinda Q6 feeling)	Mixed (INT4+8)	bfloat16	110K	2 (total ~220K)	MTP=3
mixed-fp8 (Sweet Spot)	Mixed (INT4+8)	fp8_e5m2	131K	2 (total ~262K)	MTP=2
autoround INT8 (Largest)	AutoRound INT8	fp8_e5m2	115K	1 (total ~115K)	MTP=3
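
The "total" figures in the table are simply context length times concurrency. The sketch below reproduces that arithmetic and, as a rough extra, compares the nominal KV cache footprint of each recipe. The bit widths are nominal values (16 for bfloat16, 8 for fp8_e5m2, 3 for TQ3 as labeled in the table) and ignore any scale or metadata overhead, which is an assumption; the result is only meaningful as a relative number for one and the same model.

```python
# Recipes from the table: (name, KV cache dtype, context in K tokens, concurrency)
RECIPES = [
    ("docker-compose-dual", "fp8_e5m2", 131, 4),
    ("turbo", "TQ3", 262, 4),
    ("mixed-bf16", "bfloat16", 110, 2),
    ("mixed-fp8", "fp8_e5m2", 131, 2),
    ("autoround INT8", "fp8_e5m2", 115, 1),
]

# Nominal bits per cached element; ignores scale/metadata overhead (assumption).
KV_BITS = {"bfloat16": 16, "fp8_e5m2": 8, "TQ3": 3}


def total_context_k(context_k: int, concurrency: int) -> int:
    """Total context across all concurrent sequences, in K tokens."""
    return context_k * concurrency


def relative_kv_footprint(dtype: str, context_k: int, concurrency: int) -> float:
    """KV cache size expressed as the equivalent number of K tokens stored in bfloat16."""
    return total_context_k(context_k, concurrency) * KV_BITS[dtype] / KV_BITS["bfloat16"]


for name, dtype, context_k, concurrency in RECIPES:
    total = total_context_k(context_k, concurrency)
    footprint = relative_kv_footprint(dtype, context_k, concurrency)
    print(f"{name:22s} total ~{total}K tokens, ~{footprint}K bf16-equivalent")
```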

Results

Power efficiency: For 27B dense models, the eGPU setup has better power efficiency. However, when running the 122B model, Strix Halo alone on llama.cpp was actually more power efficient.

Figure: Power efficiency of 27B dense models
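
The post does not say how power efficiency was computed. A common choice, assumed here, is throughput divided by power draw, which gives tokens per joule. The sketch below shows that calculation; the configuration names and readings are hypothetical placeholders, not numbers from these tests.

```python
def tokens_per_joule(tokens_per_second: float, watts: float) -> float:
    """Power efficiency as throughput per watt (tokens/s divided by J/s)."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return tokens_per_second / watts


# HYPOTHETICAL readings for illustration only: (tokens per second, wall power in watts)
readings = {
    "config_a": (20.0, 100.0),
    "config_b": (60.0, 600.0),
}

efficiency = {name: tokens_per_joule(tps, w) for name, (tps, w) in readings.items()}
best = max(efficiency, key=efficiency.get)

for name, value in efficiency.items():
    print(f"{name}: {value:.3f} tokens/J")
print(f"most efficient: {best}")
```

With this metric a faster setup can still lose if its power draw grows faster than its throughput, which is the kind of trade-off the results describe.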

Key Takeaways

Adding one or two eGPUs can improve PP/s and TG/s when running 27B or 31B dense models.
NVLink does nothing for llama.cpp's layer split, but it can help mitigate the eGPU bandwidth limit when the work is spread across the two 3090s.
Quantization and KV cache settings change the achievable context length and concurrency, and they also affect performance and power efficiency.
