Drastically improve prompt processing speed for --n-cpu-moe partially offloaded models

I was tuning gpt-oss-120b-F16.gguf with llama.cpp on a 24 GB RTX 3090 and found that increasing the physical micro-batch size (-ub) can significantly improve prompt processing throughput, as long as you also raise --n-cpu-moe enough to keep the run inside VRAM.

Results

ubatch	n-cpu-moe	prefill	generation
256	25	240.03 tok/s	33.14 tok/s
512 (default)	26	380.27 tok/s	32.29 tok/s
2048	25	1112.54 tok/s	32.96 tok/s
4096	26	1682.47 tok/s	32.38 tok/s
8192	28	2090.68 tok/s	30.05 tok/s

Compared with the llama.cpp default -ub 512, prompt processing went from about 380 tok/s to about 2091 tok/s, roughly a 5.5x gain. Compared with the smaller -ub 256 run, it was about an 8.7x gain. Token generation dropped from about 32.3 tok/s at default settings to 30.1 tok/s at -ub 8192, a modest reduction of about 7%.
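The ratios quoted above can be rechecked directly from the table. The snippet below is only a small arithmetic sketch: the numbers are copied from the results table, and the dictionary layout and the helper name `ratio` are my own choices for illustration.

```python
# Throughput figures copied from the results table (tokens per second).
results = {
    256:  {"n_cpu_moe": 25, "prefill": 240.03,  "generation": 33.14},
    512:  {"n_cpu_moe": 26, "prefill": 380.27,  "generation": 32.29},
    2048: {"n_cpu_moe": 25, "prefill": 1112.54, "generation": 32.96},
    4096: {"n_cpu_moe": 26, "prefill": 1682.47, "generation": 32.38},
    8192: {"n_cpu_moe": 28, "prefill": 2090.68, "generation": 30.05},
}


def ratio(metric, ubatch, baseline):
    """Return the throughput at `ubatch` divided by the throughput at `baseline`."""
    return results[ubatch][metric] / results[baseline][metric]


prefill_vs_default = ratio("prefill", 8192, 512)       # gain over -ub 512
prefill_vs_256 = ratio("prefill", 8192, 256)           # gain over -ub 256
generation_loss = 1 - ratio("generation", 8192, 512)   # fractional slowdown

print(f"prefill gain vs -ub 512: {prefill_vs_default:.2f}x")
print(f"prefill gain vs -ub 256: {prefill_vs_256:.2f}x")
print(f"generation slowdown vs -ub 512: {generation_loss:.1%}")
```

Most of the gain is already there at -ub 4096; the step from 4096 to 8192 adds less, and it also uses a longer prompt.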

The catch is that the larger ubatch needs more GPU compute workspace. On my machine, -ub 4096 needed --n-cpu-moe 26, and -ub 8192 needed --n-cpu-moe 28. So this is a throughput trade: moving a few more MoE layers to CPU makes enough room for the bigger batch, allowing prompt-heavy workloads to get dramatically faster while generation gets a little slower.
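Whether the trade pays off depends on how prompt-heavy the workload is. The following is a rough back-of-the-envelope sketch, not a measurement: it assumes the table's prefill and generation rates hold constant for any prompt length (they will not, exactly), and the two workload sizes are hypothetical numbers chosen only for illustration.

```python
def total_seconds(prompt_tokens, generated_tokens, prefill_tps, generation_tps):
    """Naive end-to-end time: prompt processing time plus generation time."""
    return prompt_tokens / prefill_tps + generated_tokens / generation_tps


# Rates from the results table (tokens per second).
DEFAULT = {"prefill_tps": 380.27, "generation_tps": 32.29}   # -ub 512, --n-cpu-moe 26
LARGE = {"prefill_tps": 2090.68, "generation_tps": 30.05}    # -ub 8192, --n-cpu-moe 28

# Hypothetical prompt-heavy request: long document in, short answer out.
doc_default = total_seconds(8000, 500, **DEFAULT)
doc_large = total_seconds(8000, 500, **LARGE)

# Hypothetical generation-heavy request: short prompt, long answer.
chat_default = total_seconds(200, 2000, **DEFAULT)
chat_large = total_seconds(200, 2000, **LARGE)

print(f"8000 in / 500 out:  default {doc_default:.1f} s, large ubatch {doc_large:.1f} s")
print(f"200 in / 2000 out:  default {chat_default:.1f} s, large ubatch {chat_large:.1f} s")
```

Under these assumptions the large ubatch wins clearly on the prompt-heavy request and loses slightly on the generation-heavy one, which matches the intuition that this setting is for long prompts.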

Note that the first four prefill figures come from pp4096 runs (4096-token prompts), while the -ub 8192 figure comes from a pp8192 run, so the last row is not a like-for-like comparison. Treat this as an informal tuning result rather than a perfectly controlled benchmark.
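To repeat a sweep like this, something along the following lines could generate the benchmark commands. This is a sketch under several assumptions that are not stated in the post: that the runs used llama-bench from a recent llama.cpp build that accepts --n-cpu-moe, that -ngl 99 is used to offload all remaining layers, and that the logical batch size (-b) has to be raised to at least the ubatch value, since as far as I know llama.cpp caps the physical batch at the logical one. The model path is a placeholder, and the script only prints the commands rather than running them.

```python
import shlex

# Model file named in the post; the path is an assumption, adjust as needed.
MODEL = "gpt-oss-120b-F16.gguf"

# (ubatch, n_cpu_moe, prompt_tokens) taken from the results table and the
# note that the last row is a pp8192 run.
SWEEP = [
    (256, 25, 4096),
    (512, 26, 4096),
    (2048, 25, 4096),
    (4096, 26, 4096),
    (8192, 28, 8192),
]


def bench_command(ubatch, n_cpu_moe, prompt_tokens):
    """Build one llama-bench argument list for a sweep point."""
    # Assumption: keep -b at its usual default of 2048 unless the ubatch is larger.
    batch = max(2048, ubatch)
    return [
        "llama-bench",
        "-m", MODEL,
        "-ngl", "99",                      # assumption: offload all layers to GPU
        "--n-cpu-moe", str(n_cpu_moe),     # keep expert weights of N layers on CPU
        "-b", str(batch),
        "-ub", str(ubatch),
        "-p", str(prompt_tokens),
    ]


for point in SWEEP:
    print(shlex.join(bench_command(*point)))
```

If a run fails with an out-of-memory error, raising the --n-cpu-moe value for that row by one and retrying is the adjustment the post describes.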

Conclusion

This trick can narrow the gap in prompt processing performance between less powerful hardware like my RTX 3090 and more capable machines such as the DGX Spark. Had I known about it earlier, it might have influenced my decision about whether to buy a high-end machine for these specific tasks.

Key Takeaways

Increasing -ub from 512 to 8192 raised prompt processing speed by about 5.5x (380 tok/s to 2091 tok/s).
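Larger -ub values need more VRAM, so --n-cpu-moe may have to be raised to fit: 26 at -ub 4096 and 28 at -ub 8192 on a 24 GB RTX 3090.
The cost is a small drop in generation speed, about 7% at -ub 8192, in exchange for much faster prompt processing.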
