Drastically improve prompt processing speed for --n-cpu-moe partially offloaded models


A British Reddit user shared a technique for significantly improving prompt processing speed on large, partially offloaded language models like GPT-120B, especially when handling long or complex prompts. By adjusting the `--n-cpu-moe` and `-ub` parameters, they achieved substantial throughput gains on their RTX 3090 GPU.

The key insight is that increasing the physical micro-batch size (`-ub`) can dramatically speed up prompt processing. A larger micro-batch needs more VRAM, so you also have to raise `--n-cpu-moe`, which keeps the MoE expert weights of more layers on the CPU, to make room for it. The trade-off is that generation speed may drop slightly, because more expert layers run on the CPU, while prompt processing throughput improves significantly. For their setup, the user found that raising `-ub` from 512 to 8192 gave a nearly 6x increase in prompt processing throughput while keeping the model within VRAM limits.
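As a rough illustration of the adjustment described above, the sketch below builds a baseline and a tuned `llama-server` command line. The model path, the `--n-cpu-moe` layer counts, and the helper name `build_llama_server_cmd` are placeholders, not values from the original post; the right layer count depends on the model and on how much VRAM is free. The sketch also assumes that the logical batch size (`-b`) should be at least as large as `-ub`, since a smaller `-b` would cap the effective micro-batch.

```python
import shlex


def build_llama_server_cmd(model_path, n_cpu_moe, ubatch, n_gpu_layers=99):
    """Return a llama-server command line as one shell-quoted string."""
    # Assumption: keep the logical batch (-b) at least as large as the
    # physical micro-batch (-ub), so -b does not cap the effective -ub.
    batch = max(ubatch, 2048)
    args = [
        "llama-server",
        "-m", model_path,
        "-ngl", str(n_gpu_layers),      # offload all layers to the GPU...
        "--n-cpu-moe", str(n_cpu_moe),  # ...but keep the experts of N layers on the CPU
        "-b", str(batch),
        "-ub", str(ubatch),
    ]
    return shlex.join(args)


# Hypothetical values: the path and the layer counts are placeholders.
baseline = build_llama_server_cmd("model.gguf", n_cpu_moe=24, ubatch=512)
tuned = build_llama_server_cmd("model.gguf", n_cpu_moe=28, ubatch=8192)

print(baseline)
print(tuned)
```

In practice you would probably raise `--n-cpu-moe` one step at a time until the larger micro-batch fits in VRAM, then measure prompt processing and generation speed before settling on a value.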

In short, raising `--n-cpu-moe` together with a larger `-ub` can significantly speed up prompt processing for large models like GPT-120B, most noticeably on long or complex prompts.
The technique uses CPU resources to free GPU memory for larger batches, which suits prompt-heavy workloads and costs only a modest amount of generation speed.
The user adds that knowing this trick might have influenced their decision to buy a more powerful machine such as the DGX Spark, since the tuned settings deliver much higher prompt throughput than the defaults on a similar setup.
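To see when the trade-off between faster prompt processing and slightly slower generation pays off, the sketch below estimates total request time as prompt time plus generation time. All throughput figures are invented for illustration: only the roughly 6x prompt speedup comes from the post, while the baseline speeds and the 10% generation slowdown are assumptions.

```python
def total_seconds(prompt_tokens, gen_tokens, pp_tps, tg_tps):
    """Estimate request time: prompt processing plus token generation."""
    return prompt_tokens / pp_tps + gen_tokens / tg_tps


# Hypothetical throughputs in tokens per second (not measured values).
BASE_PP, BASE_TG = 200.0, 30.0      # assumed default settings
TUNED_PP, TUNED_TG = 1200.0, 27.0   # ~6x prompt speed, assumed 10% slower generation

# Prompt-heavy request: long prompt, short answer.
long_base = total_seconds(32000, 500, BASE_PP, BASE_TG)
long_tuned = total_seconds(32000, 500, TUNED_PP, TUNED_TG)

# Generation-heavy request: short prompt, long answer.
short_base = total_seconds(200, 2000, BASE_PP, BASE_TG)
short_tuned = total_seconds(200, 2000, TUNED_PP, TUNED_TG)

print(f"prompt-heavy:     {long_base:.1f}s -> {long_tuned:.1f}s")
print(f"generation-heavy: {short_base:.1f}s -> {short_tuned:.1f}s")
```

Under these assumed numbers the tuned settings win clearly on the prompt-heavy request and lose slightly on the generation-heavy one, which is why the technique is best suited to long prompts.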

