Can a 5090 with Qwen3.6 achieve > 3,000 tok/s? Bring your pitchforks (Open-dLLM)

Some background: Fred Zhangzhi Peng, Shuibai Zhang, and Alex Tong worked on converting autoregressive (AR) models to diffusion models, and it is already working on older models.

Link to the Open-dLLM and Open-Diffusion Large Language Model

I forked the codebase and ran it through opencode overnight with free deepseek-flash / GLM5.1 to add Qwen3.6 support, because the codebase is more than 6 months old. I also got the AI to mash LDLM and a more recent paper into the mix: Link to the paper by Viacheslav Meshchaninov, Alexander Shabalin, Egor Chimbulatov, Nikita Gushchin, Ilya Koziev, and Alexander Korotin.

I asked it to build a config for the Qwen3.6 model, upgrade it with LDLM, and spit out some throughput numbers under “honest” assumptions. The big one is sequence length: throughput is likely to fall off with longer outputs.

Inference Throughput (Qwen3.6 LDLM, untrained, RTX 5090 32GB)

Model	Hidden Dim	Trainable Params	Diffusion Steps	Throughput
Qwen3.6-35B-A3B	2048	1.39B	10	3,238 tok/s
Qwen3.6-35B-A3B	2048	1.39B	4	~6,500 tok/s
Qwen3.6-27B	5120	6.75B	10	745 tok/s
Qwen3.6-27B	5120	6.75B	4	~1,500 tok/s
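To put the table in wall-clock terms, here is a minimal sketch that converts the reported tok/s into time per 64-token sequence and checks what strict 1/steps scaling would predict for 4 steps. The numbers are copied from the table; the sequence length and batch size come from the caveats. The function names `latency_ms` and `strict_scaling` are made up for illustration and are not from the repository.

```python
# Figures copied from the table above. SEQ_LEN is the benchmark sequence
# length stated in the caveats (64 tokens, batch size 1).
SEQ_LEN = 64

measured_10_step = {"Qwen3.6-35B-A3B": 3238.0, "Qwen3.6-27B": 745.0}
reported_4_step = {"Qwen3.6-35B-A3B": 6500.0, "Qwen3.6-27B": 1500.0}


def latency_ms(tok_per_s: float, seq_len: int = SEQ_LEN) -> float:
    """Wall-clock time to generate one sequence, in milliseconds."""
    return 1000.0 * seq_len / tok_per_s


def strict_scaling(tok_per_s: float, from_steps: int, to_steps: int) -> float:
    """Throughput if generation time were exactly proportional to step count.

    This is an assumption for the sake of the check, not a measured property.
    """
    return tok_per_s * from_steps / to_steps


for name, tps in measured_10_step.items():
    projected = strict_scaling(tps, from_steps=10, to_steps=4)
    print(
        f"{name}: {latency_ms(tps):.1f} ms per {SEQ_LEN}-token sequence "
        f"at 10 steps; strict 1/steps scaling would give {projected:.1f} tok/s "
        f"at 4 steps (table reports ~{reported_4_step[name]:.0f})"
    )
```

The table's 4-step figures are about 2x the 10-step ones, while strict 1/steps scaling would give 2.5x. That would be consistent with some fixed per-sequence cost that does not shrink with the step count, but that reading is a guess and not something the post states.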

Assumptions & Caveats

Untrained weights: These benchmarks use randomly initialized Perceiver/decoder/diffusion-head weights. A trained model will have identical throughput but produce coherent output. Quality benchmarks (perplexity, HumanEval) will be published after training completes.
No encoder in the loop: The frozen Qwen3.6 encoder is not used during generation — it’s only needed for training (to produce latent targets). At inference, the diffusion head denoises random noise, then the Perceiver decoder maps latents to tokens. The encoder is deleted before benchmarking (del autoencoder.token_encoder).
Seq len = 64: The benchmark uses a short sequence length (64 tokens). Longer sequences will reduce throughput proportionally. The 4-step throughput numbers (marked ~ in the table) are linear extrapolations from the 10-step measurements, not separate measurements.
Batch size = 1: Single-sequence generation only. Throughput scales near-linearly with batch size for the 35B-A3B (dim=2048 fits easily in VRAM), less so for the 27B (dim=5120).
CPU RAM requirement: While the encoder is not used at inference, it must fit in system RAM during training (~54GB for 27B, ~22GB for 35B-A3B in bf16). The Qwen3.6 architecture uses Triton kernels (flash-linear-attention) that cannot run on CPU, so the encoder forward pass during training requires GPU offloading — a multi-GPU setup is recommended for training.
Qwen3.6 requires trust_remote_code=True: The model uses custom architecture code (Qwen3_5ForConditionalGeneration) that is not in standard transformers releases. Ensure your transformers version supports it (>=4.54).
35B-A3B is MoE: Only 3B of its 35B parameters are active per token, and it has a much smaller hidden dim (2048) than the 27B dense model (5120). This is why its LDLM trainable components are roughly 5x smaller (1.39B vs 6.75B) and roughly 4x faster (3,238 vs 745 tok/s at 10 steps).
Not an apples-to-apples comparison with AR models: The diffusion model generates all tokens in parallel across N diffusion steps, while AR generates one token at a time. The "tok/s" metric favors diffusion for short sequences but does not reflect output quality, which depends on training convergence.
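For anyone who wants to sanity-check a tok/s figure under the same assumptions (batch size 1, seq len 64), here is a minimal timing sketch. It is not the repository's benchmark script. `measure_tok_per_s`, `generate`, and `fake_generate` are hypothetical names, and `generate` stands in for whatever call runs all diffusion steps plus decoding. On a CUDA device you would pass `torch.cuda.synchronize` as `sync` so that queued GPU work is included in the timing; the default no-op lets the sketch run anywhere.

```python
import time
from typing import Callable


def measure_tok_per_s(
    generate: Callable[[int, int], object],
    batch_size: int = 1,
    seq_len: int = 64,
    warmup: int = 2,
    iters: int = 10,
    sync: Callable[[], None] = lambda: None,
) -> float:
    """Return generated tokens per second over `iters` timed calls.

    `generate(batch_size, seq_len)` is a hypothetical stand-in for the full
    sampling call. `sync` should block until pending device work finishes
    (for example torch.cuda.synchronize on CUDA); the default does nothing.
    """
    # Warm-up calls keep one-time costs (allocation, kernel compilation)
    # out of the timed region.
    for _ in range(warmup):
        generate(batch_size, seq_len)
    sync()

    start = time.perf_counter()
    for _ in range(iters):
        generate(batch_size, seq_len)
    sync()  # wait for the last call to finish before stopping the clock
    elapsed = time.perf_counter() - start

    return batch_size * seq_len * iters / elapsed


def fake_generate(batch_size: int, seq_len: int) -> None:
    """Dummy workload so the sketch runs without a GPU or a model."""
    time.sleep(0.005)


if __name__ == "__main__":
    print(f"{measure_tok_per_s(fake_generate):.0f} tok/s (dummy workload)")
```

Counting `batch_size * seq_len` tokens per call matches how a parallel decoder is credited for every position at once, which is also why the number is not directly comparable to an AR model's tok/s.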

Code is here – with git issues enabled

Link to the Open-dLLM repository

wandb training metrics

Link to wandb run for Qwen3.6-35B-A3B-LDLM

If anyone has spare VAST.AI credits / Azure credits / Google credits, let me know and I’ll hook you up.

Key Takeaways

The Qwen3.6-35B-A3B LDLM configuration reaches 3,238 tokens per second on an RTX 5090 at 10 diffusion steps, with a sequence length of 64 tokens and batch size 1.
The benchmark uses untrained weights and has no encoder in the loop during inference, so it measures only diffusion-head and decoder speed, not output quality.
Throughput is expected to drop as sequences get longer. The 35B-A3B configuration (hidden dim 2048, 1.39B trainable parameters) runs roughly 4x faster than the 27B one (hidden dim 5120, 6.75B trainable parameters) at the same number of diffusion steps.
