Can a 5090 with Qwen3.6 achieve > 3,000 tok/s? Bring your pitchforks (Open-dLLM)

By AI Maestro, May 16, 2026

Some background: Fred Zhangzhi Peng, Shuibai Zhang, and Alex Tong worked on converting autoregressive (AR) models into diffusion models, and the approach already works on older models.

Link to the Open-dLLM and Open-Diffusion Large Language Model

I forked the codebase and ran it through opencode overnight with the free deepseek-flash / GLM5.1 models to add Qwen3.6 support, because the codebase is more than 6 months old. I also got the AI to mash up LDLM and a more recent paper into the mix: Link to the paper by Viacheslav Meshchaninov, Alexander Shabalin, Egor Chimbulatov, Nikita Gushchin, Ilya Koziev, and Alexander Korotin.

I asked it to build a config for the Qwen3.6 models, upgrade it with LDLM, and report some throughput numbers under “honest” assumptions. The big one is sequence length: throughput is likely to fall off with longer outputs.
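For readers who want to sanity-check a tok/s figure themselves, here is a minimal timing sketch. It is not the benchmark script from the fork: `measure_throughput` and `fake_generate` are hypothetical names, and the stand-in generator just sleeps for 10 ms instead of running a model.

```python
import time

def measure_throughput(generate_fn, seq_len, batch_size=1, warmup=2, runs=5):
    """Return tokens per second, averaged over `runs` timed calls."""
    # Warm-up calls are excluded from timing (caches, lazy init, etc.).
    for _ in range(warmup):
        generate_fn()
    start = time.perf_counter()
    for _ in range(runs):
        generate_fn()
    elapsed = time.perf_counter() - start
    # Each call is assumed to produce batch_size sequences of seq_len tokens.
    return (seq_len * batch_size * runs) / elapsed

def fake_generate():
    """Hypothetical stand-in for a real generation call."""
    time.sleep(0.01)

tok_per_s = measure_throughput(fake_generate, seq_len=64)
print(f"{tok_per_s:.0f} tok/s")
```

When timing real GPU work with PyTorch, calling `torch.cuda.synchronize()` before reading the timer matters, because CUDA kernels launch asynchronously and the clock would otherwise stop early.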

Inference Throughput (Qwen3.6 LDLM, untrained, RTX 5090 32GB)

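| Model | Dim | Trainable Params | Diffusion Steps | Throughput |
|---|---|---|---|---|
| Qwen3.6-35B-A3B | 2048 | 1.39B | 10 | 3,238 tok/s |
| Qwen3.6-35B-A3B | 2048 | 1.39B | 4 | ~6,500 tok/s |
| Qwen3.6-27B | 5120 | 6.75B | 10 | 745 tok/s |
| Qwen3.6-27B | 5120 | 6.75B | 4 | ~1,500 tok/s |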
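The throughput figures can be converted into per-sequence latency, which makes the step scaling easier to read. The sketch below assumes the 64-token sequence length and batch size 1 stated in the caveats, and fits latency = overhead + per_step × steps through the 10-step and 4-step rows. The fixed-overhead model is an assumption on my part, not something the post states. It is one reading consistent with the numbers, since a purely step-proportional scaling of the 35B-A3B row would give 3,238 × 10 / 4 ≈ 8,095 tok/s rather than ~6,500.

```python
SEQ_LEN = 64  # tokens per generated sequence, taken from the caveats

def latency_ms(tok_per_s, seq_len=SEQ_LEN):
    """Milliseconds to generate one sequence at the given throughput."""
    return 1000.0 * seq_len / tok_per_s

def fit_linear(steps_a, tps_a, steps_b, tps_b):
    """Solve latency = overhead + per_step * steps from two data points."""
    lat_a, lat_b = latency_ms(tps_a), latency_ms(tps_b)
    per_step = (lat_a - lat_b) / (steps_a - steps_b)
    overhead = lat_a - per_step * steps_a
    return overhead, per_step

# (steps, tok/s) pairs from the table; the 4-step values are approximate.
rows = {
    "Qwen3.6-35B-A3B": (10, 3238, 4, 6500),
    "Qwen3.6-27B": (10, 745, 4, 1500),
}

for name, (sa, ta, sb, tb) in rows.items():
    overhead, per_step = fit_linear(sa, ta, sb, tb)
    print(f"{name}: {latency_ms(ta):.1f} ms at {sa} steps, "
          f"~{per_step:.2f} ms/step, ~{overhead:.1f} ms fixed")
```

Because the 4-step rows are rounded estimates, the fitted overhead and per-step values are rough at best.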

Assumptions & Caveats

  • Untrained weights: These benchmarks use randomly initialized Perceiver/decoder/diffusion-head weights. A trained model will have identical throughput but produce coherent output. Quality benchmarks (perplexity, HumanEval) will be published after training completes.
  • No encoder in the loop: The frozen Qwen3.6 encoder is not used during generation — it’s only needed for training (to produce latent targets). At inference, the diffusion head denoises random noise, then the Perceiver decoder maps latents to tokens. The encoder is deleted before benchmarking (del autoencoder.token_encoder).
  • Seq len = 64: The benchmark uses a short sequence length (64 tokens). Longer sequences will reduce throughput proportionally. The 4-step throughput numbers are linear extrapolations from the 10-step measurements.
  • Batch size = 1: Single-sequence generation only. Throughput scales near-linearly with batch size for the 35B-A3B (dim=2048 fits easily in VRAM), less so for the 27B (dim=5120).
  • CPU RAM requirement: While the encoder is not used at inference, it must fit in system RAM during training (~54GB for 27B, ~22GB for 35B-A3B in bf16). The Qwen3.6 architecture uses Triton kernels (flash-linear-attention) that cannot run on CPU, so the encoder forward pass during training requires GPU offloading — a multi-GPU setup is recommended for training.
  • Qwen3.6 requires trust_remote_code=True: The model uses custom architecture code (Qwen3_5ForConditionalGeneration) that is not in standard transformers releases. Ensure your transformers version supports it (>=4.54).
  • 35B-A3B is MoE: Only 3B of its 35B parameters are active per token, and it has a much smaller hidden dim (2048) than the 27B dense model (5120). This is why its LDLM trainable components are roughly 5x smaller (1.39B vs 6.75B parameters) and roughly 4x faster (3,238 vs 745 tok/s at 10 steps).
  • Not an apples-to-apples comparison with AR models: The diffusion model generates all tokens in parallel across N diffusion steps, while AR generates one token at a time. The "tok/s" metric favors diffusion for short sequences but does not reflect output quality, which depends on training convergence.
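The parallel-versus-sequential point in the last caveat can be illustrated with a toy pass counter. This is not the Open-dLLM code: both functions are hypothetical stand-ins with no real model behind them, and the "denoiser" and "decoder" are placeholder arithmetic. The only thing the sketch shows is how many model passes each scheme needs for the same output length.

```python
import random

def diffusion_generate(seq_len, steps, vocab_size, seed=0):
    """Toy diffusion loop: every pass updates all positions at once."""
    rng = random.Random(seed)
    latents = [rng.gauss(0.0, 1.0) for _ in range(seq_len)]  # start from noise
    passes = 0
    for _ in range(steps):
        # Placeholder denoiser: one pass touches the whole sequence.
        latents = [0.5 * x for x in latents]
        passes += 1
    # Placeholder decoder: map each latent to a token id.
    tokens = [int(abs(x) * 1e6) % vocab_size for x in latents]
    return tokens, passes

def ar_generate(seq_len, vocab_size, seed=0):
    """Toy autoregressive loop: one pass per generated token."""
    rng = random.Random(seed)
    tokens, passes = [], 0
    for _ in range(seq_len):
        tokens.append(rng.randrange(vocab_size))
        passes += 1
    return tokens, passes

d_tokens, d_passes = diffusion_generate(seq_len=64, steps=10, vocab_size=1000)
a_tokens, a_passes = ar_generate(seq_len=64, vocab_size=1000)
print(f"diffusion: {len(d_tokens)} tokens in {d_passes} passes")
print(f"AR:        {len(a_tokens)} tokens in {a_passes} passes")
```

The pass count alone says nothing about cost per pass or output quality, which is why the tok/s comparison with AR models should be read with care.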

Code is here – with git issues enabled

Link to the Open-dLLM repository

wandb training metrics

Link to wandb run for Qwen3.6-35B-A3B-LDLM

If anyone has spare VAST.AI credits / Azure credits / Google credits, let me know and I’ll hook you up.

Key Takeaways

  • The Qwen3.6-35B-A3B LDLM configuration reaches 3,238 tok/s on an RTX 5090 at 10 diffusion steps, with a sequence length of 64 tokens and batch size 1.
  • This benchmark uses untrained weights and has no encoder in the loop during inference, focusing on the diffusion head and decoder performance.
  • Longer sequences are expected to reduce throughput proportionally. The 35B-A3B variant (dim 2048, 1.39B trainable parameters) is roughly 4x faster than the 27B variant (dim 5120, 6.75B trainable parameters), and its throughput is expected to scale near-linearly with batch size.
