Some background: Fred Zhangzhi Peng, Shuibai Zhang, and Alex Tong worked on converting autoregressive (AR) models to diffusion models, and the approach already works with older models.
Link to the Open-dLLM and Open-Diffusion Large Language Model
I forked the codebase and ran it through opencode overnight with the free deepseek-flash / GLM5.1 models to add Qwen3.6 support, because the codebase is more than 6 months old. I also had the AI mix in LDLM and a more recent paper: Link to the paper by Viacheslav Meshchaninov, Alexander Shabalin, Egor Chimbulatov, Nikita Gushchin, Ilya Koziev, and Alexander Korotin.
I asked it to build a config for the Qwen3.6 model, add the LDLM upgrade, and report some output numbers under “honest” assumptions. The big one is sequence length: throughput is likely to fall off with longer outputs.
Inference Throughput (Qwen3.6 LDLM, untrained, RTX 5090 32GB)
| Model | Dim | Trainable Params | Diffusion Steps | Throughput |
|---|---|---|---|---|
| Qwen3.6-35B-A3B | 2048 | 1.39B | 10 | 3,238 tok/s |
| Qwen3.6-35B-A3B | 2048 | 1.39B | 4 | ~6,500 tok/s |
| Qwen3.6-27B | 5120 | 6.75B | 10 | 745 tok/s |
| Qwen3.6-27B | 5120 | 6.75B | 4 | ~1,500 tok/s |
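The table reports tokens per second. As a rough sketch, those figures can be converted to per-sequence latency, assuming the 64-token sequence length and batch size of 1 stated in the caveats. The helper names (`latency_ms`, `scale_by_steps`) are hypothetical and are not part of the codebase; only the throughput numbers come from the table.

```python
# Hypothetical helpers for reading the table; not code from the repository.
SEQ_LEN = 64  # tokens per generated sequence, as stated in the caveats

# (model, diffusion steps) -> reported throughput in tok/s
REPORTED = {
    ("Qwen3.6-35B-A3B", 10): 3238.0,
    ("Qwen3.6-35B-A3B", 4): 6500.0,  # approximate, extrapolated in the post
    ("Qwen3.6-27B", 10): 745.0,
    ("Qwen3.6-27B", 4): 1500.0,      # approximate, extrapolated in the post
}


def latency_ms(tok_per_s, seq_len=SEQ_LEN):
    """Wall-clock time to generate one sequence, in milliseconds."""
    return 1000.0 * seq_len / tok_per_s


def scale_by_steps(tok_per_s, steps_from, steps_to):
    """Throughput if generation time were strictly proportional to step count."""
    return tok_per_s * steps_from / steps_to


for (model, steps), tps in REPORTED.items():
    print(f"{model:>16} steps={steps:>2} {tps:>7.0f} tok/s {latency_ms(tps):6.1f} ms/seq")

strict = scale_by_steps(REPORTED[("Qwen3.6-35B-A3B", 10)], 10, 4)
print(f"35B-A3B at 4 steps under strict 1/steps scaling: {strict:.0f} tok/s")
```

Strict 1/steps scaling would give about 8,095 tok/s for the 35B-A3B at 4 steps, which is higher than the table's ~6,500. One possible reading is that the 4-step figures include a fixed per-sequence cost (for example the decoder pass) that does not shrink with the step count, but the post does not say how the extrapolation was done.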
Assumptions & Caveats
- Untrained weights: These benchmarks use randomly initialized Perceiver/decoder/diffusion-head weights. A trained model will have identical throughput but produce coherent output. Quality benchmarks (perplexity, HumanEval) will be published after training completes.
- No encoder in the loop: The frozen Qwen3.6 encoder is not used during generation; it’s only needed for training (to produce latent targets). At inference, the diffusion head denoises random noise, then the Perceiver decoder maps latents to tokens. The encoder is deleted before benchmarking (`del autoencoder.token_encoder`).
- Seq len = 64: The benchmark uses a short sequence length (64 tokens). Longer sequences will reduce throughput proportionally. The 4-step throughput numbers are linear extrapolations from the 10-step measurements.
- Batch size = 1: Single-sequence generation only. Throughput scales near-linearly with batch size for the 35B-A3B (dim=2048 fits easily in VRAM), less so for the 27B (dim=5120).
- CPU RAM requirement: While the encoder is not used at inference, it must fit in system RAM during training (~54GB for 27B, ~22GB for 35B-A3B in bf16). The Qwen3.6 architecture uses Triton kernels (flash-linear-attention) that cannot run on CPU, so the encoder forward pass during training requires GPU offloading — a multi-GPU setup is recommended for training.
- Qwen3.6 requires `trust_remote_code=True`: The model uses custom architecture code (`Qwen3_5ForConditionalGeneration`) that is not in standard transformers releases. Ensure your `transformers` version supports it (>=4.54).
- 35B-A3B is MoE: Only 3B of its 35B parameters are active per token, giving it a much smaller hidden dim (2048) than the 27B dense model (5120). This is why the LDLM trainable components are 5x smaller and 4x faster.
- Not an apples-to-apples comparison with AR models: The diffusion model generates all tokens in parallel across N diffusion steps, while AR generates one token at a time. The "tok/s" metric favors diffusion for short sequences but does not reflect output quality, which depends on training convergence.
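Some of the numbers in these caveats can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes bf16 stores 2 bytes per parameter and counts weights only (no activations or optimizer state). The function name `weights_gb` is hypothetical; the parameter counts and throughput figures are the ones quoted in the post.

```python
# Back-of-the-envelope checks; weights_gb is a hypothetical helper.
BYTES_PER_PARAM_BF16 = 2  # bf16 uses 16 bits per parameter


def weights_gb(n_params, bytes_per_param=BYTES_PER_PARAM_BF16):
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes), weights only."""
    return n_params * bytes_per_param / 1e9


# Encoder weight footprint in bf16, using total parameter counts.
print(f"27B dense:     {weights_gb(27e9):.0f} GB")
print(f"35B-A3B total: {weights_gb(35e9):.0f} GB")

# Ratios behind the "5x smaller and 4x faster" statement.
trainable_ratio = 6.75 / 1.39   # trainable params, 27B vs 35B-A3B
throughput_ratio = 3238 / 745   # tok/s at 10 steps, 35B-A3B vs 27B
print(f"trainable params ratio: {trainable_ratio:.2f}x")
print(f"throughput ratio:       {throughput_ratio:.2f}x")
```

The 27B estimate matches the ~54GB quoted above, and the ratios come out at roughly 4.9x and 4.3x. The same arithmetic gives about 70GB for all 35B parameters of the MoE model, so the ~22GB figure presumably depends on something not stated in the post (for example quantization or loading only part of the model).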
Code is here – with git issues enabled
Link to the Open-dLLM repository
wandb training metrics
Link to wandb run for Qwen3.6-35B-A3B-LDLM
If anyone has spare VAST.AI credits / Azure credits / Google credits, let me know and I’ll hook you up.
Key Takeaways
- The Qwen3.6-35B-A3B LDLM configuration reaches 3,238 tokens per second at 10 diffusion steps, for a sequence length of 64 tokens and a batch size of 1 on an RTX 5090.
- This benchmark uses untrained weights and has no encoder in the loop during inference, focusing on the diffusion head and decoder performance.
- Longer sequences are expected to reduce throughput proportionally. Throughput should scale near-linearly with batch size for the 35B-A3B and less so for the 27B. The 35B-A3B's smaller hidden dim (2048 vs. 5120) makes its trainable LDLM components about 5x smaller and about 4x faster.




