Orthrus-Qwen3-8B: up to 7.8× tokens/forward on Qwen3-8B, frozen backbone, provably identical output distribution


By AI Maestro May 15, 2026 1 min read

**What Happened:** A new model, Orthrus-Qwen3-8B, has been released, achieving up to 7.8× tokens per forward (TPF) on the Qwen3-8B base model, a significant improvement over existing approaches such as speculative decoding and diffusion models. Its key feature is a trainable diffusion attention module injected into each layer of the frozen AR Transformer, which keeps the output distribution identical to that of the original model.
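To make the "frozen backbone plus trainable module" idea concrete, here is a minimal toy sketch (not the released code; class names and the scalar math are illustrative assumptions). The adapter is initialized to zero and added residually, so at initialization each layer reproduces the base model exactly, and training only ever touches the adapter:

```python
# Toy sketch of the adapter-injection pattern described in the post.
# All names (FrozenLayer, DraftAdapter, layer_forward) are hypothetical.

class FrozenLayer:
    """Stands in for a pre-trained AR Transformer layer; weights never change."""
    def __init__(self, w):
        self.w = w  # frozen weight

    def forward(self, x):
        return self.w * x  # toy stand-in for the layer's transform


class DraftAdapter:
    """Stands in for the trainable diffusion attention module."""
    def __init__(self):
        self.a = 0.0  # zero-init: the augmented layer starts identical to the base

    def forward(self, x):
        return self.a * x

    def train_step(self, grad, lr=0.1):
        self.a -= lr * grad  # only the adapter is ever updated


def layer_forward(frozen, adapter, x):
    # Residual injection: adapter output is added on top of the frozen path.
    return frozen.forward(x) + adapter.forward(x)
```

The zero-initialized residual path is one common way adapter methods guarantee the base model's behavior at the start of training; the frozen weight `w` is never written to.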

**Why It Matters:** This result demonstrates that high throughput is feasible in AI language models without modifying the base architecture. Because the backbone stays frozen, Orthrus-Qwen3-8B avoids the biases or hallucinations that can arise from altering a pre-trained model's weights. The results are particularly noteworthy because they outperform other state-of-the-art methods, such as speculative decoding and diffusion-based approaches, which often suffer accuracy degradation when base model parameters change.

**Takeaways:**
1. **High Throughput Without Weight Modifications:** Orthrus-Qwen3-8B shows that it’s possible to achieve very high throughput (up to 7.8× TPF) without changing the weights of a pre-trained model, preserving its original knowledge and behavior.
2. **Provably Identical Output Distribution:** The model maintains the same output distribution as the base model, ensuring consistency in performance across different tasks.
3. **Minimal Training Overhead:** Achieving such high throughput required only 16% of the parameters to be trained and less than one billion tokens, demonstrating that this approach is efficient in terms of computational resources.
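The "16% of parameters trained" figure is just a ratio of trainable to total parameters. A minimal sketch of that bookkeeping, with a hypothetical parameter split chosen only to land near the reported figure (the real module sizes are not given in the post):

```python
# Compute the trainable-parameter fraction over a frozen backbone plus adapters.

def trainable_fraction(modules):
    """modules: iterable of (param_count, is_trainable) pairs."""
    total = sum(n for n, _ in modules)
    trainable = sum(n for n, t in modules if t)
    return trainable / total

# Hypothetical split: an ~8B frozen backbone plus adapters sized so that
# roughly 16% of all parameters are trainable (1.5B / 9.5B ≈ 0.158).
example = [
    (8_000_000_000, False),   # frozen Qwen3-8B backbone
    (1_500_000_000, True),    # trainable diffusion attention modules (assumed size)
]
print(f"{trainable_fraction(example):.1%}")
```

The same accounting is what frameworks report when only adapter modules carry gradients while the backbone is frozen.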


Originally published at reddit.com. Curated by AI Maestro.

