I opened my first contribution to exo: native multi-token prediction support for Qwen3.6-style MLX checkpoints. I hope it is useful.
The personal motivation was simple: I am waiting for Mac Studios to arrive and wanted to use exo as a local distributed inference cluster across them. Native MTP looked like one of the pieces worth getting right before that setup lands.
For supported model cards, it should work out of the box. The macOS setting is on by default, and the CLI path enables native MTP unless EXO_NATIVE_MTP_ENABLED=0 is set. The current native-MTP path is single-node only: if exo distributes a model across multiple machines, it falls back to the normal path for now.
The part I cared about most was exactness. MTP heads draft candidate tokens, but the target model still verifies them before anything is emitted. For greedy decode, the goal is the same token IDs as the target-only path. For sampling, the path uses speculative probability-ratio acceptance for the request’s temperature/top_p/top_k/min_p distribution.
Short version from the current broad sweep:
| Model | Mode | Mean tok/s | vs MTP off | Acceptance |
|---|
| 27B native-MTP | MTP off | 17.27 | 1.00x | n/a |
| 27B native-MTP | K=1 | 29.56 | 1.71x | 85.7% |
| 27B native-MTP | K=2 | 34.06 | 1.97x | 75.4% |
| 27B native-MTP | K=3 | 33.79 | 1.96x | 66.4% |
| 35B-A3B native-MTP | MTP off | 85.14 | 1.00x | n/a |
| 35B-A3B native-MTP | K=1 | 98.59 | 1.16x | 55.8% |
| 35B-A3B native-MTP | K=2 | 92.27 | 1.08x | 38.3% |
| 35B-A3B native-MTP | K=3 | 80.53 | 0.95x | 27.4% |
So the practical result is:
- On M5 Max 48GB RAM, 27B: 17.27 -> 34.06 tok/s at K=2, +97.2% / 1.97x.
- 35B-A3B: 85.14 -> 98.59 tok/s at K=1, +15.8% / 1.16x.
- Works out of the box for supported single-node native-MTP model cards; set
EXO_NATIVE_MTP_ENABLED=0, or use the native settings dialog to opt-out.
The PR also includes the product plumbing around it:
- model cards expose native-MTP default/max K;
/v1/models reports native-MTP capability;- supported model cards dispatch native MTP by default when the local checkpoint has recoverable MTP weights and the instance is placed on one node;
- final generation stats report
drafter_kind="native_mtp" and num_draft_tokens; - temperature/top_p/top_k/min_p are threaded into the drafter instead of forcing the path to be greedy-only.
The implementation work was mostly systems cleanup: one-pass prompt/MTP cache setup for the 35B MoE/GDN path, hidden-state-only target-body calls where logits are not consumed, MLX-side accepted-prefix counting, K=1 concat avoidance, and overlap between MTP draft/cache evaluation and verifier graph construction.
Current scope/limitations:
- enabled only for model cards that explicitly declare native-MTP metadata;
- native-MTP dispatch is single-node in this PR; multi-node distributed placement still uses the normal path;
- stateful logits processors such as repetition/presence/frequency penalties are not routed through native MTP yet;
- K>=4 is not enabled.
TL;DR:
- On my M5 Max 48GB RAM laptop: 27B: 17.27 -> 34.06 tok/s at K=2, +97.2% / 1.97x.
- 35B-A3B: 85.14 -> 98.59 tok/s at K=1, +15.8% / 1.16x.
- Works out of the box for supported single-node native-MTP model cards; set
EXO_NATIVE_MTP_ENABLED=0, or use the native settings dialog to opt-out.
Preview