I added native MTP to exo for Qwen3.6 MLX models; here are the exactness and speed results

I opened my first contribution to exo: native multi-token prediction support for Qwen3.6-style MLX checkpoints. I hope it is useful.

The personal motivation was simple: I am waiting for Mac Studios to arrive and wanted to use exo as a local distributed inference cluster across them. Native MTP looked like one of the pieces worth getting right before that setup lands.

For supported model cards, it should work out of the box. The macOS setting is on by default, and the CLI path enables native MTP unless EXO_NATIVE_MTP_ENABLED=0 is set. The current native-MTP path is single-node only: if exo distributes a model across multiple machines, it falls back to the normal path for now.

The part I cared about most was exactness. MTP heads draft candidate tokens, but the target model still verifies them before anything is emitted. For greedy decode, the goal is the same token IDs as the target-only path. For sampling, the path uses speculative probability-ratio acceptance for the request’s temperature/top_p/top_k/min_p distribution.
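
For readers unfamiliar with probability-ratio acceptance, here is a minimal sketch of the standard speculative-sampling rule for a single drafted token. It is an illustration, not code from the PR: the function name `accept_or_resample` and the plain-list probability vectors are assumptions, and `p_target` and `q_draft` are assumed to already reflect the request's temperature/top_p/top_k/min_p settings.

```python
import random


def accept_or_resample(draft_token, p_target, q_draft, rng):
    """Verify one drafted token with the standard speculative-sampling rule.

    p_target and q_draft are probability lists over the same vocabulary
    (hypothetical inputs; a real implementation works on device tensors).
    Returns (token, accepted).
    """
    p = p_target[draft_token]
    q = q_draft[draft_token]
    # Accept the draft with probability min(1, p / q).
    if q > 0 and rng.random() < min(1.0, p / q):
        return draft_token, True
    # On rejection, resample from the residual max(0, p - q), renormalized
    # implicitly by passing the residual as sampling weights.
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    if sum(residual) == 0.0:
        # Defensive fallback: the distributions coincide, so use the target.
        residual = list(p_target)
    token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return token, False


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.7, 0.2, 0.1]  # target distribution (made-up numbers)
    q = [0.3, 0.5, 0.2]  # draft distribution (made-up numbers)
    draft = rng.choices(range(3), weights=q, k=1)[0]
    print(accept_or_resample(draft, p, q, rng))
```

The point of the rule is that the emitted token is distributed according to the target distribution regardless of how good the draft is; a poor draft only lowers the acceptance rate.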

Short version from the current broad sweep:

Model	Mode	Mean tok/s	vs MTP off	Acceptance
27B native-MTP	MTP off	17.27	1.00x	n/a
27B native-MTP	K=1	29.56	1.71x	85.7%
27B native-MTP	K=2	34.06	1.97x	75.4%
27B native-MTP	K=3	33.79	1.96x	66.4%
35B-A3B native-MTP	MTP off	85.14	1.00x	n/a
35B-A3B native-MTP	K=1	98.59	1.16x	55.8%
35B-A3B native-MTP	K=2	92.27	1.08x	38.3%
35B-A3B native-MTP	K=3	80.53	0.95x	27.4%
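
The "vs MTP off" column can be reproduced from the mean tok/s values, and the acceptance column gives a rough idea of how many tokens each verify step yields. The sketch below does both. It rests on an assumption not stated in the post: that "Acceptance" means accepted draft tokens divided by proposed draft tokens, so a step with K drafts emits about 1 + K * acceptance tokens. The helper names are hypothetical.

```python
# Baselines (MTP off) and MTP rows copied from the table above.
BASELINE_TOK_S = {"27B": 17.27, "35B-A3B": 85.14}
ROWS = [
    # (model, K, mean tok/s, acceptance)
    ("27B", 1, 29.56, 0.857),
    ("27B", 2, 34.06, 0.754),
    ("27B", 3, 33.79, 0.664),
    ("35B-A3B", 1, 98.59, 0.558),
    ("35B-A3B", 2, 92.27, 0.383),
    ("35B-A3B", 3, 80.53, 0.274),
]


def speedup(tok_s, baseline_tok_s):
    """Throughput relative to the MTP-off baseline."""
    return tok_s / baseline_tok_s


def tokens_per_step(k, acceptance):
    """ASSUMPTION: acceptance = accepted drafts / proposed drafts.

    Each verify step then emits the accepted drafts plus one target token.
    """
    return 1.0 + k * acceptance


def implied_step_cost(k, acceptance, tok_s, baseline_tok_s):
    """How much slower one draft+verify step must be than a plain decode
    step for the measured speedup to come out, under the assumption above."""
    return tokens_per_step(k, acceptance) / speedup(tok_s, baseline_tok_s)


if __name__ == "__main__":
    for model, k, tok_s, acc in ROWS:
        base = BASELINE_TOK_S[model]
        print(
            f"{model:8s} K={k} speedup={speedup(tok_s, base):.2f}x "
            f"tokens/step={tokens_per_step(k, acc):.2f} "
            f"step cost={implied_step_cost(k, acc, tok_s, base):.2f}x"
        )
```

Under that assumption, 27B at K=2 works out to roughly 2.5 tokens per verify step, while 35B-A3B at K=3 yields only about 1.8, which is consistent with larger K not paying for itself on the MoE model.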

So the practical result is:

On an M5 Max with 48GB RAM, 27B: 17.27 -> 34.06 tok/s at K=2, +97.2% / 1.97x.
35B-A3B: 85.14 -> 98.59 tok/s at K=1, +15.8% / 1.16x.
Works out of the box for supported single-node native-MTP model cards; set EXO_NATIVE_MTP_ENABLED=0, or use the native settings dialog, to opt out.

The PR also includes the product plumbing around it:

model cards expose native-MTP default/max K;
/v1/models reports native-MTP capability;
supported model cards dispatch native MTP by default when the local checkpoint has recoverable MTP weights and the instance is placed on one node;
final generation stats report drafter_kind="native_mtp" and num_draft_tokens;
temperature/top_p/top_k/min_p are threaded into the drafter instead of forcing the path to be greedy-only.

The implementation work was mostly systems cleanup: one-pass prompt/MTP cache setup for the 35B MoE/GDN path, hidden-state-only target-body calls where logits are not consumed, MLX-side accepted-prefix counting, K=1 concat avoidance, and overlap between MTP draft/cache evaluation and verifier graph construction.
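
To make "accepted-prefix counting" concrete for greedy decode, here is a minimal pure-Python sketch. It is not the PR's MLX code: the function names are hypothetical, and plain lists stand in for device arrays. The idea is that a draft token only counts if every draft before it also matched, which is a cumulative product over the match mask followed by a sum, so it can be computed without a Python-side loop over positions.

```python
from itertools import accumulate
from operator import mul


def accepted_prefix_len(draft_tokens, target_tokens):
    """Length of the longest draft prefix that matches the target's picks."""
    matches = [int(d == t) for d, t in zip(draft_tokens, target_tokens)]
    # The running product stays 1 until the first mismatch, then 0 forever.
    return sum(accumulate(matches, mul))


def greedy_verify(draft_tokens, target_tokens):
    """Tokens to emit for one step.

    target_tokens is assumed to hold the target model's greedy pick at each
    drafted position plus one extra position, so it has len(draft) + 1 items.
    """
    n = accepted_prefix_len(draft_tokens, target_tokens)
    # Accepted drafts, then the target's own token at the first mismatch
    # (or the extra token if everything was accepted).
    return draft_tokens[:n] + [target_tokens[n]]


if __name__ == "__main__":
    print(greedy_verify([5, 7, 9], [5, 7, 2, 4]))
```

Every emitted token is either a draft that equals the target's greedy pick or the target's pick itself, which is why the output can match the target-only path token for token.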

Current scope/limitations:

enabled only for model cards that explicitly declare native-MTP metadata;
native-MTP dispatch is single-node in this PR; multi-node distributed placement still uses the normal path;
stateful logits processors such as repetition/presence/frequency penalties are not routed through native MTP yet;
K>=4 is not enabled.
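
The scope rules above amount to a dispatch gate. The sketch below is only one reading of them: `NativeMtpCard`, `choose_native_mtp_k`, and every parameter name are hypothetical and do not come from exo's codebase. The only identifier taken from the post is the EXO_NATIVE_MTP_ENABLED environment variable.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class NativeMtpCard:
    """Hypothetical stand-in for a model card's native-MTP metadata."""
    default_k: Optional[int]  # None means the card declares no native MTP
    max_k: Optional[int]


def choose_native_mtp_k(
    card: NativeMtpCard,
    num_nodes: int,
    has_mtp_weights: bool,
    uses_stateful_penalties: bool,
    requested_k: Optional[int] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[int]:
    """Return the K to draft with, or None to use the normal decode path."""
    env = os.environ if env is None else env
    # On by default; only an explicit "0" opts out.
    if env.get("EXO_NATIVE_MTP_ENABLED", "1") == "0":
        return None
    # The card must explicitly declare native-MTP metadata.
    if card.default_k is None or card.max_k is None:
        return None
    # Single-node only, and the checkpoint must carry usable MTP weights.
    if num_nodes != 1 or not has_mtp_weights:
        return None
    # Repetition/presence/frequency penalties are not routed through MTP yet.
    if uses_stateful_penalties:
        return None
    k = card.default_k if requested_k is None else requested_k
    if k < 1:
        return None
    # K>=4 is not enabled, so clamp to the card's max and to 3.
    return min(k, card.max_k, 3)


if __name__ == "__main__":
    card = NativeMtpCard(default_k=2, max_k=3)
    print(choose_native_mtp_k(card, 1, True, False, env={}))
```

Returning None rather than raising mirrors the fallback behaviour described in the post: an ineligible request still runs, just without drafting.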

TL;DR:

On my M5 Max laptop with 48GB RAM, 27B: 17.27 -> 34.06 tok/s at K=2, +97.2% / 1.97x.
35B-A3B: 85.14 -> 98.59 tok/s at K=1, +15.8% / 1.16x.
Works out of the box for supported single-node native-MTP model cards; set EXO_NATIVE_MTP_ENABLED=0, or use the native settings dialog, to opt out.

Key Takeaways

The PR adds native multi-token prediction support for Qwen3.6-style MLX checkpoints in exo.
The author's motivation is an upcoming exo cluster of Mac Studios, although native MTP in this PR runs on a single node only.
The PR includes model-card metadata for default/max K, native-MTP capability reporting in /v1/models, generation stats, and the EXO_NATIVE_MTP_ENABLED opt-out.
For the 27B model, K=2 (34.06 tok/s, 1.97x) and K=3 (33.79 tok/s, 1.96x) roughly double throughput relative to MTP off (17.27 tok/s); K=1 reaches 1.71x.
For the 35B-A3B model, K=1 is the best setting (98.59 tok/s, 1.16x); K=2 gains less (1.08x) and K=3 is slower than MTP off (0.95x).

What it means for makers and artists:

On a single Mac, supported Qwen3.6-style checkpoints generate faster with native MTP: about 1.97x for the 27B model and about 1.16x for 35B-A3B in the author's sweep.
The speedup needs no extra setup on one machine; models that exo distributes across multiple machines fall back to the normal path for now.
The 35B-A3B results show that a larger K is not always better: acceptance falls from 55.8% at K=1 to 27.4% at K=3, and K=3 ends up slower than MTP off, so the best K depends on the model.