MTP for Qwen3.6-35B-A3B on 6GB VRAM laptop: not worth it


By AI Maestro · May 17, 2026 · 4 min read

I have an Asus gaming laptop from 2021 that I bought used for £500 last year. I wanted to see whether the recently merged MTP support in llama.cpp is worth using on such a VRAM-constrained device with the Qwen3.6-35B-A3B model.

Hardware

  • Asus ROG Zephyrus G14 laptop, 2021 model
  • AMD Ryzen 7 5800HS with Radeon Graphics (8 CPU cores / 16 threads)
  • NVIDIA RTX 3060 Laptop GPU, 6GB VRAM
  • 24GB RAM (DDR4 3200 MT/s), 1TB SSD

Software

  • Linux Mint 22.2 (based on Ubuntu 24.04) with the Cinnamon desktop running on the Radeon iGPU (thus the 3060 was dedicated to llama.cpp only)
  • llama.cpp version: 9198 (a6d6183db) built from current master branch with GNU 13.3.0 for Linux x86_64
  • CUDA 12.0 installed from Ubuntu repositories

Test Setup

  • Unsloth Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL model (pushing the maximum this system can run; I used the same model file for both MTP and non-MTP, varying only the command-line arguments so that the MTP part of the model went unused in the non-MTP runs)
  • q8_0 quantization for the main KV cache (I don’t want to compromise on quality too much)
  • context size 65536 (enough to do agentic coding on e.g. Pi or Dirac, or run Hermes Agent)
  • for MTP, I used --spec-draft-n-max 2 (I know that 3 might be slightly better in some cases, but decided to stick to this to make the results comparable)
  • mmap enabled (it’s the only way I can run this model without freezing my machine…)

I varied these parameters:

  • MTP vs non-MTP (including/omitting MTP specific CLI parameters)
  • ubatch size: 512, 1024, 1536, 2048
  • draft model KV cache quantization: either q8_0 or q4_0 (always same for both K & V)
  • --fit-target set to the lowest value (in steps of 64) that works without OOM errors
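The whole sweep boils down to toggling a few flags. Here is a minimal Python sketch of the grid (not the exact script I ran; sampling flags omitted for brevity, and the fit-target values are the ones that worked for me, per the results table):

```python
from itertools import product

def build_args(mtp, ubatch, dkv_quant, fit_target,
               model="Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf"):
    """Argument list for one llama-server benchmark run (sampling flags omitted)."""
    args = [
        "build/bin/llama-server",
        "-m", model,
        "--threads", "8",
        "-ub", str(ubatch),
        "--parallel", "1",
        "-c", "65536",
        "-ctk", "q8_0", "-ctv", "q8_0",
        "--fit-target", str(fit_target),
    ]
    if mtp:
        # MTP runs additionally quantize the draft KV cache and enable
        # speculative decoding via the model's MTP head
        args += [
            "-ctkd", dkv_quant, "-ctvd", dkv_quant,
            "--spec-type", "draft-mtp",
            "--spec-draft-n-max", "2",
        ]
    return args

# The grid from this post: 4 non-MTP ubatch sizes, then MTP runs with
# fit-target = ubatch - 64 (the lowest values that avoided OOM for me).
runs = [build_args(False, ub, None, 0) for ub in (512, 1024, 1536, 2048)]
runs += [build_args(True, ub, q, ub - 64)
         for ub, q in product((512, 1024), ("q8_0", "q4_0"))]
```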

Here is an example of a full llama-server command (MTP 1 in the table below):

build/bin/llama-server \
-m Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf \
--threads 8 \
-ub 512 \
--parallel 1 \
--fit-target 448 \
-c 65536 \
-ctk q8_0 \
-ctv q8_0 \
-ctkd q8_0 \
-ctvd q8_0 \
--chat-template-kwargs '{"preserve_thinking": true}' \
--temp 0.6 \
--top-p 0.95 \
--min-p 0.0 \
--top-k 20 \
--repeat-penalty 1.0 \
--presence-penalty 0.0 \
--spec-type draft-mtp \
--spec-draft-n-max 2
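Once the server is up, you can sanity-check it over its OpenAI-compatible /v1/chat/completions endpoint. A minimal sketch, assuming the default port 8080; the sampling values mirror the command line above, and as far as I know llama.cpp's server accepts its extra fields (top_k, min_p) alongside the standard OpenAI ones:

```python
import json
from urllib import request

def build_payload(prompt, max_tokens=256):
    """Chat request matching the sampling settings used in the benchmark."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.6,
        "top_p": 0.95,
        "min_p": 0.0,
        "top_k": 20,
    }

def ask(prompt, url="http://127.0.0.1:8080/v1/chat/completions"):
    """POST one chat request and return the assistant's reply text."""
    req = request.Request(url,
                          data=json.dumps(build_payload(prompt)).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```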

Results

| Setup | ubatch size | draft KV quant | fit-target | mtp-bench TG (tok/s) | mtp-bench accept % | Summarization PP (tok/s) | Summarization TG (tok/s) | Summarization accept % |
|---|---|---|---|---|---|---|---|---|
| No MTP 1 | 512 | n/a | 0 | 25.0 | n/a | 178 | 23.8 | n/a |
| No MTP 2 | 1024 | n/a | 0 | 23.1 | n/a | 292 | 22.5 | n/a |
| No MTP 3 | 1536 | n/a | 0 | 24.5 | n/a | 299 | 24.4 | n/a |
| No MTP 4 | 2048 | n/a | 0 | 23.0 | n/a | 436 | 26.1 | n/a |
| MTP 1 | 512 | q8_0 | 448 | 27.3 | 81.5 | 143 | 26.1 | 76.5 |
| MTP 2 | 1024 | q8_0 | 960 | 18.7 | 82.7 | 138 | 25.9 | 72.0 |
| MTP 3 | 512 | q4_0 | 448 | 26.4 | 81.5 | 139 | 25.3 | 73.4 |
| MTP 4 | 1024 | q4_0 | 960 | 25.4 | 82.7 | 198 | 23.7 | 73.7 |

I also tried higher ubatch values with MTP, but the results were so bad (TG 10-15 tok/s, probably due to running out of RAM and swapping) that I aborted those runs.

Verdict

  • The baseline “No MTP 4” with ubatch=2048 is clearly the best non-MTP setup. It reached PP speeds over 400 tok/s and TG speeds of 23-26 tok/s.
  • The “MTP 1” run with ubatch=512 reached the best TG speed (over 27 tok/s) in mtp-bench but was only tied with “No MTP 4” on summarization TG. Its PP speed was much lower than in any non-MTP setup.
  • Increasing the ubatch size with MTP improves PP a bit, especially in the “MTP 4” setup, which also used q4_0 quantization for the draft KV cache. But this practically eliminated the TG benefit, and PP was still less than half of the best non-MTP result.
  • In short: MTP is not worth it in this setting. A tiny TG increase in some cases, but always a giant drop in PP speed. If PP speeds for MTP are later improved in llama.cpp (this was listed as a known issue in the PR), this might change.
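A rough speculative-decoding model makes the flat TG numbers plausible. If each drafted token is accepted independently with probability a (a simplification: llama.cpp's reported acceptance rate is an aggregate, not a per-token probability), then a step with draft length n emits 1 + a + a² + … + aⁿ tokens on average:

```python
def tokens_per_step(a, n):
    """Expected tokens emitted per verification step when each of n drafted
    tokens is accepted independently with probability a, acceptance stops at
    the first rejection, and the verifier always contributes one token:
    E = 1 + a + a**2 + ... + a**n = (1 - a**(n + 1)) / (1 - a)."""
    return (1 - a ** (n + 1)) / (1 - a)

# "MTP 1" summarization numbers: ~76.5% acceptance, draft length 2
expected = tokens_per_step(0.765, 2)  # ~2.35 tokens per step
```

At a ≈ 0.765 and n = 2 this gives roughly 2.35 tokens per step, so TG only improves if drafting plus batched verification costs less than about 2.35 plain decode steps; the tied TG here suggests that on this hardware it doesn't.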

Observations

  • I was surprised to see that using q4_0 quantization for the draft model KV cache had negligible impact on draft model accuracy. This saves a tiny bit of VRAM, so might be a useful trick for very VRAM constrained setups.
  • There is a bit of unexplained variation between measurements, probably due to random chance, CPU/GPU thermal throttling, etc. Not too bad, but take the numbers with a grain of salt.
  • VRAM is obviously very tight from the start. The MTP VRAM overhead easily pushes the system into a badly performing scenario.
  • The --fit and --fit-target options don’t seem to take the MTP overhead into account; you need to reserve some memory for MTP, and the amount depends mainly on the ubatch size. Thus you have to set --fit-target manually if you want to squeeze the maximum performance out of your limited VRAM. In my case, setting fit-target to a number a bit less than the ubatch size seemed to work, but YMMV.
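To put a rough number on the draft KV cache saving: llama.cpp's q8_0 format packs 32 values into 34 bytes and q4_0 into 18 bytes. Assuming a hypothetical MTP head of a single extra layer with 4 KV heads of dimension 128 (the real Qwen3.6-35B-A3B shapes may well differ), the q4_0 draft cache at the full 64k context saves only a few dozen MiB:

```python
def kv_cache_mib(n_ctx, n_layer, n_kv_heads, head_dim, bytes_per_elem):
    """K + V cache size in MiB: two tensors of n_ctx * n_kv_heads * head_dim
    elements per layer."""
    elems = 2 * n_layer * n_ctx * n_kv_heads * head_dim
    return elems * bytes_per_elem / 1024 ** 2

# llama.cpp block formats: q8_0 packs 32 values into 34 bytes, q4_0 into 18
Q8_0, Q4_0 = 34 / 32, 18 / 32

# Hypothetical MTP head: one extra layer, GQA with 4 KV heads of dim 128
# (assumed shapes, not the real Qwen3.6-35B-A3B config)
saving_mib = (kv_cache_mib(65536, 1, 4, 128, Q8_0)
              - kv_cache_mib(65536, 1, 4, 128, Q4_0))  # 32.0 MiB at 64k
```

A saving of this order is consistent with the "tiny bit of VRAM" observed above, but still potentially useful when you are counting every MiB.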

Notes

This post was constructed from 100% organic ingredients. No AIs were harmed in the process.

My second post here. Happy to answer any questions.





Originally published at reddit.com. Curated by AI Maestro.
