How do I get the superfast DFlash / MTP tokens per second that I'm seeing on here? Dual 3090s

A Reddit user with two RTX 3090s described how they reached high token generation rates using two speculative decoding techniques, DFlash and MTP (multi-token prediction). Their system pairs the dual 3090s with an AMD Ryzen 9 9900X processor, 32 GB of RAM, and the latest NVIDIA drivers.

The user followed specific instructions to enable peer-to-peer (P2P) communication between the two GPUs, which let them use the hardware effectively.
They tested DFlash and MTP with different models from the LLaMA family. With DFlash, they reached around 40 tokens per second (t/s) using a forked version of NVIDIA’s driver. With MTP, they reached about 50 t/s on a recent build of llama.cpp.
For comparison, they noted that their initial Qwen3.5-27B setup generated approximately 40 t/s without any optimizations. They attributed the speed improvements to the new speculative decoding techniques and to more efficient memory management through P2P communication.
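
To make the draft-then-verify idea behind speculative decoding concrete, here is a minimal sketch in plain Python. It is not DFlash, MTP, or llama.cpp code: `target_model` and `draft_model` are hypothetical stand-ins (simple arithmetic rules rather than neural networks), and the greedy accept-or-correct rule is one common variant chosen for illustration. The point it shows is that the output matches ordinary greedy decoding with the large model while needing fewer verification passes.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token prefix to the next token (greedy).
Model = Callable[[List[int]], int]


def target_model(prefix: List[int]) -> int:
    """Hypothetical large model: a fixed deterministic rule standing in for an LLM."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_model(prefix: List[int]) -> int:
    """Hypothetical cheap drafter: agrees with the target except at every 4th position."""
    guess = target_model(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Draft k tokens, then keep the longest prefix the target agrees with.

    Returns the tokens and the number of verification rounds. In a real system
    each round is a single batched forward pass of the large model; here it is
    simulated with per-position calls for clarity.
    """
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(tokens) < goal:
        # 1) The drafter proposes k tokens, each conditioned on its own guesses.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2) The target checks the proposal position by position.
        target_passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first wrong guess, then stop
                break
            accepted.append(tok)
        else:
            # Every draft token matched, so the target's next token comes for free.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], target_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    baseline = greedy_decode(target_model, prompt, 24)
    fast, passes = speculative_decode(target_model, draft_model, prompt, 24, k=4)
    print("same output as greedy:", fast == baseline)
    print("target passes:", passes, "vs", 24, "for plain greedy decoding")
```

The speedup in practice depends on how often the drafter's guesses are accepted and on how cheap the drafter is relative to the large model, which is why results vary between techniques and models.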

### Takeaways
- DFlash and MTP can reach high token generation rates (about 50 t/s with MTP in this report) on an optimized hardware configuration.
- According to the user, efficient GPU-to-GPU communication via P2P contributed to the performance of their dual 3090 setup.
- The user credited speculative decoding and better memory management as the key factors behind these speeds.
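
As a rough way to compare the reported figures, the short sketch below turns tokens per second into a relative speedup and a per-token latency. The numbers are the approximate values quoted above (40 t/s baseline, about 40 t/s with DFlash, about 50 t/s with MTP); the helper names are illustrative, and because the runs may have used different models, the ratios are only indicative.

```python
def speedup(baseline_tps: float, optimized_tps: float) -> float:
    """Relative throughput gain of an optimized run over a baseline run."""
    return optimized_tps / baseline_tps


def ms_per_token(tps: float) -> float:
    """Average time spent per generated token, in milliseconds."""
    return 1000.0 / tps


# Approximate figures as reported in the post (tokens per second).
reported = {"baseline": 40.0, "DFlash": 40.0, "MTP": 50.0}

if __name__ == "__main__":
    for name, tps in reported.items():
        ratio = speedup(reported["baseline"], tps)
        print(f"{name}: {tps:.0f} t/s, {ms_per_token(tps):.1f} ms/token, {ratio:.2f}x baseline")
```

On these figures, only the MTP run is faster than the unoptimized baseline; the DFlash figure matches it.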
