MI50s Qwen 3.6 27B @52.8 tps TG @1569 tps PP (no MTP, no Quant)


By AI Maestro, May 13, 2026
Key Takeaways

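  • On AMD MI50 GPUs, Qwen 3.6 27B reaches 52.8 tokens per second (tps) for token generation (TG) and 1569 tps for prompt processing (PP), per the headline figures. The summary also quotes a throughput of 362.03 tps without saying which metric it measures.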
  • The inference engine is a vLLM fork (v0.20.1) running on ROCm 7.2.1.
  • The results are for single inference, measured with two prompts of 1k and 15k tokens. The model ran unquantized, and MTP and DFlash were not used, owing to their limitations and the author’s preference for full precision.
  • In the author’s assessment, the setup is fully usable with existing agentic harnesses such as Claude Code and Hermes.
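
To put the headline rates in perspective, here is a rough back-of-the-envelope sketch of what they imply for the two prompt sizes mentioned above. It assumes the PP and TG rates stay constant regardless of context length, which real hardware rarely does, and the 500-token response length is a hypothetical value chosen only for illustration. The function name is likewise made up for this example.

```python
# Rough latency estimate from the headline figures.
# Assumptions (not from the original benchmark): rates are constant across
# context lengths, and the response is a hypothetical 500 tokens long.

PP_TPS = 1569.0  # prompt processing rate from the headline
TG_TPS = 52.8    # token generation rate from the headline


def estimate_latency_s(prompt_tokens, output_tokens, pp_tps=PP_TPS, tg_tps=TG_TPS):
    """Return (prefill, decode, total) time in seconds for one request."""
    prefill_s = prompt_tokens / pp_tps   # time to ingest the prompt
    decode_s = output_tokens / tg_tps    # time to generate the response
    return prefill_s, decode_s, prefill_s + decode_s


if __name__ == "__main__":
    for prompt_tokens in (1_000, 15_000):
        prefill_s, decode_s, total_s = estimate_latency_s(prompt_tokens, 500)
        print(
            f"{prompt_tokens:>6} prompt tokens: "
            f"prefill {prefill_s:.1f}s, decode {decode_s:.1f}s, total {total_s:.1f}s"
        )
```

Under these assumptions, prefill and decode take roughly the same time for the 15k-token prompt, which is why both numbers matter for long-context agentic use.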

There is still room for improvement in both the software and hardware stacks. For instance, a PCIe switch could potentially reduce latency, and DFlash/MTP implementations with less overhead could be adopted in the future.

For Makers and Artists:

  • The result shows that even older GPUs such as the MI50 can run a 27B-parameter language model at usable speeds, which can benefit the development and execution of AI-powered creative tools.
  • Making use of this model could enable artists to incorporate more sophisticated and powerful text generation capabilities into their workflows without significant hardware upgrades.

As the field continues to evolve, we can expect even better performance from models like Qwen in terms of both speed and capability. This opens up exciting possibilities for integrating advanced AI tools directly into creative pipelines, enabling new forms of collaboration between humans and machines.
