Gemma 4 MTP vs DFlash on 1x H100: dense vs MoE results

Key Takeaways

For the dense model, MTP (Multi-Token Prediction) is faster than DFlash at both concurrency levels.
For the MoE model, the ranking flips: DFlash outperforms MTP in both throughput and speedup over baseline decoding.
Gains over baseline were more pronounced on the MoE model because of its lower active parameter count during inference.

Before deploying either model, benchmark both approaches on your own setup to see which is more effective. Results vary with the model, prompts, hardware, and serving configuration.
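
When running that comparison yourself, it can help to reduce each run to a throughput figure and a speedup over baseline decoding. The following is a minimal sketch of that arithmetic, not the harness used for the results above. The names `RunResult`, `speedup_over_baseline`, and `pick_fastest` are hypothetical, and the numbers in the usage section are placeholders for illustration only, not measurements.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RunResult:
    """One benchmark run: output tokens generated and wall-clock seconds."""
    name: str
    output_tokens: int
    seconds: float

    @property
    def tokens_per_second(self) -> float:
        # Throughput of this run in output tokens per second.
        return self.output_tokens / self.seconds


def speedup_over_baseline(baseline: RunResult, candidate: RunResult) -> float:
    """Ratio of candidate throughput to baseline throughput (>1 means faster)."""
    return candidate.tokens_per_second / baseline.tokens_per_second


def pick_fastest(baseline: RunResult, candidates: List[RunResult]) -> Tuple[str, float]:
    """Return (name, speedup) for the candidate with the highest speedup."""
    scored = [(c.name, speedup_over_baseline(baseline, c)) for c in candidates]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Placeholder values for illustration only; substitute your own measurements.
    baseline = RunResult("baseline", output_tokens=10_000, seconds=100.0)
    runs = [
        RunResult("mtp", output_tokens=10_000, seconds=50.0),
        RunResult("dflash", output_tokens=10_000, seconds=40.0),
    ]
    for run in runs:
        print(run.name, round(speedup_over_baseline(baseline, run), 2))
    print("fastest:", pick_fastest(baseline, runs))
```

Repeat the comparison for each concurrency level and each model you plan to serve, keeping the prompts and serving configuration identical across runs so the speedups are comparable.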

For more detailed analysis and further reading: