**What Happened:**
A user on Reddit, **youcloudsofdoom**, asked about the best throughput achievable with Qwen 3.6 (35B parameters) running on dual NVIDIA GPUs (listed in the post as A100 (3090)). Before the MTP merge, they reported roughly 1500 tokens/s of prompt processing (pp) and 120 tokens/s of generation (tg). After testing the merged MTP model, generation speed dropped sharply, to approximately 80 tg. The user said they would rather stick with their CPU fallback setup, at a reported 3500 pp and 80 tg, until someone offers an alternative solution.
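For readers unfamiliar with the shorthand, "pp" (prompt processing) and "tg" (token generation) are both throughput figures in tokens per second, measured over the two phases of inference. A minimal sketch of how such numbers are derived (the function name and the sample token counts and timings are illustrative, not taken from the post):

```python
def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput in tokens/s for one inference phase
    (prompt evaluation or token generation)."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return n_tokens / elapsed_s

# Hypothetical run: a 3000-token prompt evaluated in 2.0 s and
# 240 new tokens generated in 2.0 s would be reported as:
pp = tokens_per_second(3000, 2.0)  # 1500.0 tokens/s prompt processing
tg = tokens_per_second(240, 2.0)   # 120.0 tokens/s generation
print(f"{pp:.0f} pp, {tg:.0f} tg")
```

Note that pp and tg are measured separately because prompt evaluation is batch-parallel while generation is sequential, which is why a change like the MTP merge can affect one figure far more than the other.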
**Why It Matters:**
This discussion underscores the ongoing effort to improve the inference efficiency of large language models (LLMs), particularly on GPU hardware. The user's experience shows how much performance can vary across model configurations and why finding optimal settings for a specific use case matters. The thread also serves as a venue for users to share insights on getting the most out of Qwen 3.6, whether through alternative models or fine-tuning strategies.
**Takeaways:**
– The MTP merge has caused a noticeable drop in generation speed for Qwen 3.6 on GPU hardware.
– Users are actively seeking optimal configurations to maximize model efficiency and output quality.
– There is a need for community-driven experimentation and sharing of best practices to support users in their AI endeavors.
Originally published at reddit.com. Curated by AI Maestro.