Summary
I was tuning gpt-oss-120b-F16.gguf with llama.cpp on a 24 GB RTX 3090 and found that increasing the physical micro-batch size (-ub) can significantly improve prompt-processing throughput. The catch is VRAM: larger micro-batches need more GPU workspace, so I also had to raise --n-cpu-moe to keep more MoE expert layers on the CPU.
Details
The llama.cpp defaults are -b 2048 and -ub 512; I included that default run as its own row in the table. Here are some informal llama-bench results:
| ubatch | n-cpu-moe | prefill (tok/s) | generation (tok/s) |
|---|---|---|---|
| 256 | 25 | 240.03 | 33.14 |
| 512 (default) | 26 | 380.27 | 32.29 |
| 2048 | 25 | 1112.54 | 32.96 |
| 4096 | 26 | 1682.47 | 32.38 |
| 8192 | 28 | 2090.68 | 30.05 |
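For reference, here is a minimal sketch of the kind of llama-bench invocation behind a row like the -ub 8192 one. The model path, -ngl value, and -p/-n lengths are my assumptions, and --n-cpu-moe support in llama-bench depends on your build (check llama-bench --help):

```bash
# Sketch of the -ub 8192 row: -p sets prompt length (pp8192), -n generation
# length. -ngl 99 offloads everything, then --n-cpu-moe pulls 28 MoE expert
# layers back to the CPU to free VRAM for the large micro-batch.
llama-bench \
  -m ./gpt-oss-120b-F16.gguf \
  -ngl 99 \
  --n-cpu-moe 28 \
  -b 8192 -ub 8192 \
  -p 8192 -n 128
```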
Compared to the default -ub 512 run, prompt processing went from about 380 tok/s to around 2091 tok/s, a roughly 5.5x improvement. Measured against the smallest batch (-ub 256, about 240 tok/s), the gap is even wider: roughly 8.7x.
However, a larger ubatch needs more GPU compute workspace. On my machine, -ub 4096 required --n-cpu-moe 26, and -ub 8192 needed --n-cpu-moe 28. Moving more MoE expert layers to the CPU frees enough VRAM for the bigger batch while keeping generation throughput close to the default.
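The same flags carry over to llama-server for day-to-day use. A minimal sketch with the tuned -ub 8192 settings, where the model path, context size, host, and port are placeholders of my choosing:

```bash
# Serving with the tuned batch and MoE-offload settings; -c, --host, and
# --port are assumptions, adjust for your setup.
llama-server \
  -m ./gpt-oss-120b-F16.gguf \
  -ngl 99 \
  --n-cpu-moe 28 \
  -b 8192 -ub 8192 \
  -c 16384 \
  --host 127.0.0.1 --port 8080
```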
Note: The first four prefill points are from a run with pp4096, and the 8192 ubatch point is from a run with pp8192. These results should be treated as informal tuning rather than precise benchmarks.
The trick of increasing ubatch can narrow the prompt-processing gap between a local setup running open-weight models like gpt-oss-120b and hosted services such as Anthropic's Claude. It's a useful way to put spare GPU memory to work on prompt-heavy tasks.
Key Takeaways
- Raising the -ub parameter can significantly improve prompt processing speed; see the sweep sketch after this list.
- This requires increasing --n-cpu-moe to manage GPU memory constraints.
- The trade-off is between faster prompt processing and slightly slower generation speeds, especially with very large batches.
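To find the right pairing on your own hardware, a sweep like the sketch below can help. The ubatch/n-cpu-moe pairs mirror my table and will differ for other GPUs and quants; the model path and -p/-n lengths are assumptions:

```bash
# Hypothetical sweep over the ubatch / n-cpu-moe pairs from the table above.
# If a configuration runs out of VRAM, raise its n-cpu-moe value and retry.
for cfg in 256:25 512:26 2048:25 4096:26 8192:28; do
  ub=${cfg%%:*}     # micro-batch size (part before the colon)
  ncmoe=${cfg##*:}  # MoE layers kept on CPU (part after the colon)
  llama-bench \
    -m ./gpt-oss-120b-F16.gguf \
    -ngl 99 \
    --n-cpu-moe "$ncmoe" \
    -b "$ub" -ub "$ub" \
    -p 4096 -n 128
done
```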