Drastically improve prompt processing speed for --n-cpu-moe partially offloaded models

By AI Maestro May 12, 2026 1 min read
A British Reddit user shared a technique that significantly speeds up prompt processing for large language models such as GPT-120B when the model is only partially offloaded to the GPU, which matters most for long or complex prompts. By adjusting the `--n-cpu-moe` and `-ub` parameters, they achieved substantial prompt-throughput gains on an RTX 3090 GPU.

The key insight is that increasing the physical micro-batch size (`-ub`) can dramatically speed up prompt processing, provided you also raise `--n-cpu-moe` so that more layers keep their MoE expert weights on the CPU. A larger micro-batch needs more VRAM for its compute buffers, and moving more expert layers to the CPU frees that space. The trade-off is that generation speed may drop slightly while prompt processing throughput improves significantly. For this user's setup, raising `-ub` from 512 to 8192 gave a nearly 6x increase in prompt processing throughput while keeping the model within VRAM limits.
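As a rough illustration of the two configurations, the sketch below assembles a baseline and a tuned command line. It assumes the flags belong to llama.cpp's `llama-server`, which the summary does not state explicitly. The model path, the `--n-cpu-moe` values (24 and 28), and the helper name `build_llama_server_cmd` are hypothetical placeholders, not the user's actual settings. Setting `-b` (logical batch size) to match `-ub` is also an assumption, made because llama.cpp caps the micro-batch at the logical batch size.

```python
import shlex


def build_llama_server_cmd(model_path, n_cpu_moe, ubatch, n_gpu_layers=99):
    """Assemble a llama-server argument list for a partially offloaded MoE model.

    Hypothetical helper for illustration only; the values passed in below
    are placeholders, not measured or recommended settings.
    """
    if ubatch <= 0:
        raise ValueError("ubatch must be positive")
    if n_cpu_moe < 0:
        raise ValueError("n_cpu_moe must be non-negative")
    return [
        "llama-server",
        "-m", model_path,
        "-ngl", str(n_gpu_layers),        # offload every layer that fits to the GPU
        "--n-cpu-moe", str(n_cpu_moe),    # keep expert weights of the first N layers on the CPU
        "-b", str(ubatch),                # assumption: logical batch raised to match -ub
        "-ub", str(ubatch),               # physical micro-batch size
    ]


# Placeholder values: tune --n-cpu-moe upward until the larger -ub fits in VRAM.
baseline = build_llama_server_cmd("model.gguf", n_cpu_moe=24, ubatch=512)
tuned = build_llama_server_cmd("model.gguf", n_cpu_moe=28, ubatch=8192)

print(shlex.join(baseline))
print(shlex.join(tuned))
```

In practice the right `--n-cpu-moe` value depends on the model and the card, so it is worth raising it one step at a time and watching VRAM usage rather than copying the numbers above.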

  • Raising `--n-cpu-moe` alongside a larger micro-batch (`-ub`) can significantly speed up prompt processing for large language models like GPT-120B, especially with long or complex prompts.
  • The technique uses both GPU memory and CPU resources to optimize prompt-heavy workloads, at a modest cost in generation speed.
  • The user notes that knowing this trick might have influenced their decision about buying a more powerful machine such as the DGX Spark, since it delivers much better throughput than the default settings on the same kind of setup.
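Whether the trade-off pays off depends on how prompt-heavy a request is. The sketch below models total request time as prompt time plus generation time. All throughput figures are hypothetical: the 6x prompt-processing ratio mirrors the "nearly 6x" reported above, while the absolute numbers and the size of the generation slowdown are assumptions chosen only to show the shape of the trade-off.

```python
def request_seconds(prompt_tokens, gen_tokens, pp_tps, tg_tps):
    """Estimate wall-clock time: prompt processing plus token generation."""
    return prompt_tokens / pp_tps + gen_tokens / tg_tps


# Hypothetical throughputs in tokens per second (not measurements).
BASE_PP, BASE_TG = 100.0, 25.0    # default -ub
TUNED_PP, TUNED_TG = 600.0, 20.0  # larger -ub, more expert layers on the CPU

# Long prompt, short answer: prompt processing dominates.
long_baseline = request_seconds(32000, 500, BASE_PP, BASE_TG)
long_tuned = request_seconds(32000, 500, TUNED_PP, TUNED_TG)

# Short prompt, same answer length: generation dominates.
short_baseline = request_seconds(200, 500, BASE_PP, BASE_TG)
short_tuned = request_seconds(200, 500, TUNED_PP, TUNED_TG)

print(f"long prompt:  {long_baseline:.1f}s -> {long_tuned:.1f}s")
print(f"short prompt: {short_baseline:.1f}s -> {short_tuned:.1f}s")
```

Under these assumed numbers the tuned settings win clearly on the long prompt and lose slightly on the short one, which is why the technique is described as suiting prompt-heavy workloads.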
