Drastically improve prompt processing speed for --n-cpu-moe partially offloaded models

Summary

I was tuning gpt-oss-120b-F16.gguf with llama.cpp on a 24 GB RTX 3090 and found that increasing the physical micro-batch size (-ub) can significantly improve prompt processing throughput. The catch is that a larger -ub needs more GPU memory, and raising --n-cpu-moe is how I made room for it.

Details

The llama.cpp defaults are -b 2048 and -ub 512; I included that default run as its own row. Here are some informal llama-bench results:

ubatch (-ub)	--n-cpu-moe	prefill (tok/s)	generation (tok/s)
256	25	240.03	33.14
512 (default)	26	380.27	32.29
2048	25	1112.54	32.96
4096	26	1682.47	32.38
8192	28	2090.68	30.05

Compared to the default -ub 512 run, prompt processing went from about 380 tok/s to about 2091 tok/s at -ub 8192, a 5.5x improvement. Measured against the smaller -ub 256 run (about 240 tok/s), the gain is larger still, at about 8.7x.
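The ratios above can be rechecked with a few lines of Python. This is only a sketch of the arithmetic: the throughput numbers are copied from the table, and the `speedup` helper is a name made up for illustration.

```python
# Prefill throughput in tok/s, copied from the table, keyed by -ub value.
prefill_tok_s = {
    256: 240.03,
    512: 380.27,   # llama.cpp default -ub
    2048: 1112.54,
    4096: 1682.47,
    8192: 2090.68,
}


def speedup(ubatch: int, baseline: int) -> float:
    """Prefill throughput at `ubatch` divided by throughput at `baseline`."""
    return prefill_tok_s[ubatch] / prefill_tok_s[baseline]


# Print each run relative to the default (-ub 512) and to the smallest batch.
for ub in sorted(prefill_tok_s):
    print(
        f"-ub {ub:>4}: {speedup(ub, 512):.2f}x vs -ub 512, "
        f"{speedup(ub, 256):.2f}x vs -ub 256"
    )
```

Keep in mind that the -ub 8192 row was measured with a longer prompt than the others, so these ratios are rough guides rather than like-for-like comparisons.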

However, a larger ubatch needs more GPU compute workspace. On my machine, -ub 4096 required --n-cpu-moe 26, and -ub 8192 needed --n-cpu-moe 28. In other words, keeping the expert weights of more MoE layers on the CPU frees enough GPU memory for the bigger batch, while generation throughput stays close to the default.

Note: The first four prefill points are from a run with pp4096, and the 8192 ubatch point is from a run with pp8192. These results should be treated as informal tuning rather than precise benchmarks.
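To repeat a sweep like this, something along these lines could generate the llama-bench command lines. Treat it as a sketch under assumptions: the (-ub, --n-cpu-moe, prompt length) triples come from the table and the note above, but the -ngl 99 setting, the 128-token generation length, and setting -b to at least -ub are my guesses, since the post does not say what was used. `bench_command` is a hypothetical helper, not part of llama.cpp.

```python
import shlex

# (ubatch, n_cpu_moe, prompt_tokens) triples taken from the table and the note.
RUNS = [
    (256, 25, 4096),
    (512, 26, 4096),
    (2048, 25, 4096),
    (4096, 26, 4096),
    (8192, 28, 8192),
]


def bench_command(model, ubatch, n_cpu_moe, prompt_tokens, gen_tokens=128):
    """Build an argument list for one llama-bench run (hypothetical helper)."""
    # Assumption: the logical batch (-b) should be at least as large as -ub,
    # otherwise the larger micro-batch cannot take effect.
    batch = max(2048, ubatch)
    return [
        "llama-bench",
        "-m", model,
        "-ngl", "99",                     # assumption: offload all layers...
        "--n-cpu-moe", str(n_cpu_moe),    # ...but keep N layers' experts on CPU
        "-b", str(batch),
        "-ub", str(ubatch),
        "-p", str(prompt_tokens),
        "-n", str(gen_tokens),            # assumption: generation test length
    ]


# Print the commands instead of running them, so they can be reviewed first.
for ubatch, n_cpu_moe, prompt_tokens in RUNS:
    cmd = bench_command("gpt-oss-120b-F16.gguf", ubatch, n_cpu_moe, prompt_tokens)
    print(shlex.join(cmd))
```

The right --n-cpu-moe value for each -ub depends on the GPU and model, so expect to adjust the triples until each run fits in memory.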

Increasing ubatch can narrow the prompt-processing gap between locally run models like gpt-oss-120b and hosted services such as Anthropic's Claude. It's a useful strategy for getting the most out of limited GPU memory on prompt-heavy tasks.

Key Takeaways

Raising the -ub parameter can significantly improve prompt processing speed.
This requires increasing --n-cpu-moe to manage GPU memory constraints.
The trade-off is between faster prompt processing and slightly slower generation speeds, especially with very large batches.

Originally published at reddit.com. Curated by AI Maestro.
