Summary
I was tuning gpt-oss-120b-F16.gguf with llama.cpp on a 24 GB RTX 3090 and found that increasing the physical micro-batch size (-ub) can significantly improve prompt processing throughput. The larger micro-batch needs more GPU memory, so this article also covers how raising --n-cpu-moe frees up room for it.
Details
The llama.cpp defaults are -b 2048 and -ub 512; I included that default run as its own row in the results below. Here are some informal llama-bench results:
| ubatch | n-cpu-moe | prefill (tok/s) | generation (tok/s) |
|---|---|---|---|
| 256 | 25 | 240.03 | 33.14 |
| 512 (default) | 26 | 380.27 | 32.29 |
| 2048 | 25 | 1112.54 | 32.96 |
| 4096 | 26 | 1682.47 | 32.38 |
| 8192 | 28 | 2090.68 | 30.05 |
Compared to the default -ub 512 run, prompt processing went from about 380 tok/s to about 2091 tok/s at -ub 8192, which is a 5.5x improvement. The smaller batch (-ub 256) was slower than the default at about 240 tok/s, so measured against that run the gain at -ub 8192 is about 8.7x.
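To double-check the ratios, here is a minimal Python sketch that recomputes the prefill speedups from the table above. The dictionary and the function name are just illustrative; the numbers are copied from the table.

```python
# Prefill throughput in tok/s for each -ub value, copied from the table above.
prefill = {256: 240.03, 512: 380.27, 2048: 1112.54, 4096: 1682.47, 8192: 2090.68}


def speedup(ubatch, baseline):
    """Return how many times faster `ubatch` is than `baseline` for prefill."""
    return prefill[ubatch] / prefill[baseline]


for ub in sorted(prefill):
    print(
        f"-ub {ub:>4}: {speedup(ub, 512):.2f}x vs -ub 512, "
        f"{speedup(ub, 256):.2f}x vs -ub 256"
    )
```

Values below 1.0 mean the configuration is slower than the chosen baseline, which is the case for -ub 256 against the default.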
However, a larger ubatch needs more GPU memory for its compute workspace. On my machine, -ub 4096 required --n-cpu-moe 26, and -ub 8192 needed --n-cpu-moe 28. In other words, moving the MoE weights of more layers to the CPU makes room for the bigger batch while keeping generation throughput within about 7% of the default.
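The generation side of that trade-off can be checked the same way. This is a small sketch, with illustrative names, that expresses each run's generation throughput as a percentage change from the default -ub 512 run, using the numbers from the table above.

```python
# Generation throughput in tok/s for each -ub value, copied from the table above.
generation = {256: 33.14, 512: 32.29, 2048: 32.96, 4096: 32.38, 8192: 30.05}


def percent_change(ubatch, baseline=512):
    """Percentage change in generation speed relative to the baseline run."""
    return (generation[ubatch] / generation[baseline] - 1.0) * 100.0


for ub in sorted(generation):
    print(f"-ub {ub:>4}: {percent_change(ub):+.1f}% vs -ub 512")
```

A negative value means generation got slower than the default; only the -ub 8192 run, which also has the highest --n-cpu-moe, comes out clearly negative.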
Note: The first four prefill points are from a run with pp4096, and the 8192 ubatch point is from a run with pp8192. These results should be treated as informal tuning rather than precise benchmarks.
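For anyone who wants to repeat the sweep, the sketch below builds one llama-bench command line per row of the table. It is not the exact invocation I ran. Several things in it are assumptions: the model path is taken to be in the current directory, -n 128 is just the llama-bench default generation length, and -b is raised to match -ub because, as far as I know, llama.cpp caps the micro-batch at the logical batch size. The script only prints the commands and does not run them.

```python
import shlex

# Assumption: the GGUF file sits in the current directory.
MODEL_PATH = "gpt-oss-120b-F16.gguf"

# (ubatch, n_cpu_moe, prompt_tokens) per run, from the table and the note above.
CONFIGS = [
    (256, 25, 4096),
    (512, 26, 4096),
    (2048, 25, 4096),
    (4096, 26, 4096),
    (8192, 28, 8192),
]


def bench_command(ubatch, n_cpu_moe, prompt_tokens, gen_tokens=128):
    """Build a llama-bench command line for one configuration."""
    # Assumption: keep -b at least as large as -ub so the micro-batch is not capped.
    batch = max(2048, ubatch)
    args = [
        "llama-bench",
        "-m", MODEL_PATH,
        "-p", str(prompt_tokens),   # prompt length (pp4096 or pp8192)
        "-n", str(gen_tokens),      # generation length (assumed default)
        "-b", str(batch),
        "-ub", str(ubatch),
        "--n-cpu-moe", str(n_cpu_moe),
    ]
    return shlex.join(args)


for config in CONFIGS:
    print(bench_command(*config))
```

The right --n-cpu-moe value depends on how much VRAM is free, so the values here are only a starting point for a 24 GB card.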
Increasing ubatch can narrow the prompt processing gap between a self-hosted model like gpt-oss-120b and hosted services such as Anthropic's Claude. It’s a useful way to get more out of limited GPU memory on prompt-heavy tasks.
Key Takeaways
- Raising the -ub parameter can significantly improve prompt processing speed.
- This requires increasing --n-cpu-moe to manage GPU memory constraints.
- The trade-off is between faster prompt processing and slightly slower generation speeds, especially with very large batches.




