Drastically improve prompt processing speed for --n-cpu-moe partially offloaded models

Disclosure: Some links in this article are affiliate links. AI Maestro may earn a commission if you make a purchase, at no…

By AI Maestro May 12, 2026 1 min read

Summary

I was tuning gpt-oss-120b-F16.gguf with llama.cpp on a 24 GB RTX 3090 and found that increasing the physical micro-batch size (-ub) can significantly improve prompt processing throughput. The larger micro-batch needs more GPU memory, so this article also shows how raising --n-cpu-moe frees up enough room for it.

Details

The llama.cpp defaults are -b 2048 and -ub 512; I included that default run as its own point in the chart. Here are some informal llama-bench results:

| ubatch        | n-cpu-moe | prefill (tok/s) | generation (tok/s) |
|---------------|-----------|-----------------|--------------------|
| 256           | 25        | 240.03          | 33.14              |
| 512 (default) | 26        | 380.27          | 32.29              |
| 2048          | 25        | 1112.54         | 32.96              |
| 4096          | 26        | 1682.47         | 32.38              |
| 8192          | 28        | 2090.68         | 30.05              |

Compared to the default -ub 512 run, prompt processing went from about 380 tok/s to around 2091 tok/s at -ub 8192, which is a 5.5x improvement. The smaller batch (-ub 256) was slower than the default at about 240 tok/s, so measured against that run the -ub 8192 result is about 8.7x faster.
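The ratios can be checked directly from the table. The short Python sketch below divides each run's prefill throughput by a chosen baseline; the names prefill_tok_s and speedup are illustrative, not part of llama.cpp.

```python
# Prefill throughput in tok/s keyed by -ub value, copied from the results table.
prefill_tok_s = {
    256: 240.03,
    512: 380.27,   # llama.cpp default -ub
    2048: 1112.54,
    4096: 1682.47,
    8192: 2090.68,
}


def speedup(ub, baseline_ub=512):
    """Return prefill throughput at `ub` relative to the run at `baseline_ub`."""
    return prefill_tok_s[ub] / prefill_tok_s[baseline_ub]


if __name__ == "__main__":
    for ub in sorted(prefill_tok_s):
        print(f"-ub {ub}: {speedup(ub):.2f}x vs -ub 512, "
              f"{speedup(ub, 256):.2f}x vs -ub 256")
```

A value below 1.0 means that run was slower than the baseline, which is what the -ub 256 row shows when compared with the default.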

However, the larger ubatch needs more GPU compute workspace. On my machine, -ub 4096 required --n-cpu-moe 26, and -ub 8192 needed --n-cpu-moe 28. This means moving more MoE layers to CPU can make room for the bigger batch while still keeping generation throughput close to the default.
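To get a feel for when this trade pays off, here is a rough estimate in Python of total request time as prompt time plus generation time, using the measured throughputs for the default and the -ub 8192 runs. It is a crude model, not a benchmark: it assumes throughput is constant regardless of prompt length, and the token counts in the example are hypothetical workloads, not measurements.

```python
# (prefill tok/s, generation tok/s) keyed by -ub, copied from the results table.
results = {
    512: (380.27, 32.29),    # default -ub, --n-cpu-moe 26
    8192: (2090.68, 30.05),  # -ub 8192, --n-cpu-moe 28
}


def estimated_seconds(ub, prompt_tokens, generated_tokens):
    """Crude estimate: prompt time plus generation time at constant throughput."""
    prefill, generation = results[ub]
    return prompt_tokens / prefill + generated_tokens / generation


if __name__ == "__main__":
    # Hypothetical workloads: (prompt tokens, generated tokens).
    workloads = {"prompt-heavy": (8192, 256), "generation-heavy": (4096, 8192)}
    for name, (prompt_tokens, generated_tokens) in workloads.items():
        for ub in results:
            seconds = estimated_seconds(ub, prompt_tokens, generated_tokens)
            print(f"{name}, -ub {ub}: about {seconds:.1f} s")
```

Under these assumptions the large micro-batch wins clearly on the prompt-heavy workload, while on the generation-heavy one the slightly lower generation speed outweighs the prefill gain.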

Note: The first four prefill points are from a run with pp4096, and the 8192 ubatch point is from a run with pp8192. These results should be treated as informal tuning rather than precise benchmarks.
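For anyone who wants to repeat a similar sweep, the Python sketch below builds one llama-bench command line per run without executing anything. Several parts are assumptions rather than the original setup: the exact commands were not recorded here, the -b value is set to at least the -ub value as a guess, the model path is a placeholder, and the --n-cpu-moe option is assumed to be accepted by your llama-bench build.

```python
import shlex

# (ubatch, n-cpu-moe, prompt tokens) per run, taken from the table and the note above.
RUNS = [
    (256, 25, 4096),
    (512, 26, 4096),
    (2048, 25, 4096),
    (4096, 26, 4096),
    (8192, 28, 8192),
]


def llama_bench_command(model_path, ubatch, n_cpu_moe, prompt_tokens):
    """Build an argv list for one llama-bench run (nothing is executed here)."""
    # Assumption: keep the logical batch (-b) at least as large as the
    # micro-batch (-ub); the original -b values for the sweep are not known.
    batch = max(2048, ubatch)
    return [
        "llama-bench",
        "-m", model_path,
        "-p", str(prompt_tokens),
        "-b", str(batch),
        "-ub", str(ubatch),
        "--n-cpu-moe", str(n_cpu_moe),
    ]


if __name__ == "__main__":
    model = "gpt-oss-120b-F16.gguf"  # placeholder path; adjust for your system
    for ubatch, n_cpu_moe, prompt_tokens in RUNS:
        print(shlex.join(llama_bench_command(model, ubatch, n_cpu_moe, prompt_tokens)))
```

The printed lines can be pasted into a shell one at a time; if a run fails with an out-of-memory error, raising --n-cpu-moe for that run is the adjustment described above.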

Increasing ubatch can narrow the prompt processing gap between a locally hosted model like gpt-oss-120b and hosted services such as Anthropic's Claude. It's a useful way to put GPU memory to work on prompt-heavy tasks.

Key Takeaways

  • Raising the -ub parameter can significantly improve prompt processing speed.
  • Larger -ub values may require increasing --n-cpu-moe to fit within GPU memory (26 at -ub 4096 and 28 at -ub 8192 on a 24 GB RTX 3090).
  • The trade-off is faster prompt processing against slightly slower generation, most visibly at -ub 8192 (about 30 tok/s versus about 32 tok/s at the default).