MTP for Qwen3.6-35B-A3B on 6GB VRAM laptop: not worth it


By AI Maestro · May 17, 2026 · 4 min read

I have an Asus gaming laptop from 2021 that I bought used for £500 last year. I wanted to see whether the recently merged MTP support in llama.cpp is worth using on such a VRAM-constrained device with the Qwen3.6-35B-A3B model.

Hardware

  • Asus ROG Zephyrus G14 laptop, 2021 model
  • AMD Ryzen 7 5800HS with Radeon Graphics (8 CPU cores / 16 threads)
  • NVIDIA RTX 3060 Laptop GPU, 6GB VRAM
  • 24GB RAM (DDR4 3200 MT/s), 1TB SSD

Software

  • Linux Mint 22.2 (based on Ubuntu 24.04) with the Cinnamon desktop running on the Radeon iGPU (thus the 3060 was dedicated to llama.cpp only)
  • llama.cpp version: 9198 (a6d6183db) built from current master branch with GNU 13.3.0 for Linux x86_64
  • CUDA 12.0 installed from Ubuntu repositories

Test Setup

  • Unsloth Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL model (pushing the maximum this system can run; I used the same model file for both MTP and non-MTP, varying only the command-line arguments so that the MTP part of the model went unused in the non-MTP runs)
  • q8_0 quantization for the main KV cache (I don’t want to compromise on quality too much)
  • context size 65536 (enough to do agentic coding on e.g. Pi or Dirac, or run Hermes Agent)
  • for MTP, I used --spec-draft-n-max 2 (I know that 3 might be slightly better in some cases, but decided to stick to this to make the results comparable)
  • mmap enabled (it’s the only way I can run this model without freezing my machine…)

I varied these parameters:

  • MTP vs non-MTP (including/omitting MTP specific CLI parameters)
  • ubatch size: 512, 1024, 1536, 2048
  • draft model KV cache quantization: either q8_0 or q4_0 (always same for both K & V)
  • --fit-target set to the lowest value (in steps of 64) that works without OOM errors
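The whole sweep boils down to toggling a few flags. Here is a minimal Python sketch of the grid (not the exact script I ran; sampling flags omitted for brevity, and the fit-target values are the ones that worked for me, per the results table):

```python
from itertools import product

def build_args(mtp, ubatch, dkv_quant, fit_target,
               model="Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf"):
    """Argument list for one llama-server benchmark run (sampling flags omitted)."""
    args = [
        "build/bin/llama-server",
        "-m", model,
        "--threads", "8",
        "-ub", str(ubatch),
        "--parallel", "1",
        "-c", "65536",
        "-ctk", "q8_0", "-ctv", "q8_0",
        "--fit-target", str(fit_target),
    ]
    if mtp:
        # MTP runs additionally quantize the draft KV cache and enable
        # speculative decoding via the model's MTP head
        args += [
            "-ctkd", dkv_quant, "-ctvd", dkv_quant,
            "--spec-type", "draft-mtp",
            "--spec-draft-n-max", "2",
        ]
    return args

# The grid from this post: 4 non-MTP ubatch sizes, then MTP runs with
# fit-target = ubatch - 64 (the lowest values that avoided OOM for me).
runs = [build_args(False, ub, None, 0) for ub in (512, 1024, 1536, 2048)]
runs += [build_args(True, ub, q, ub - 64)
         for ub, q in product((512, 1024), ("q8_0", "q4_0"))]
```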

Here is an example of a full llama-server command (MTP 1 in the table below):

build/bin/llama-server \
-m Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf \
--threads 8 \
-ub 512 \
--parallel 1 \
--fit-target 448 \
-c 65536 \
-ctk q8_0 \
-ctv q8_0 \
-ctkd q8_0 \
-ctvd q8_0 \
--chat-template-kwargs '{"preserve_thinking": true}' \
--temp 0.6 \
--top-p 0.95 \
--min-p 0.0 \
--top-k 20 \
--repeat-penalty 1.0 \
--presence-penalty 0.0 \
--spec-type draft-mtp \
--spec-draft-n-max 2
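Once the server is up, you can sanity-check it over its OpenAI-compatible /v1/chat/completions endpoint. A minimal sketch, assuming the default port 8080; the sampling values mirror the command line above, and as far as I know llama.cpp's server accepts its extra fields (top_k, min_p) alongside the standard OpenAI ones:

```python
import json
from urllib import request

def build_payload(prompt, max_tokens=256):
    """Chat request matching the sampling settings used in the benchmark."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.6,
        "top_p": 0.95,
        "min_p": 0.0,
        "top_k": 20,
    }

def ask(prompt, url="http://127.0.0.1:8080/v1/chat/completions"):
    """POST one chat request and return the assistant's reply text."""
    req = request.Request(url,
                          data=json.dumps(build_payload(prompt)).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```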

Results

| Setup | ubatch size | draft KV quant | fit-target | mtp-bench TG (tok/s) | mtp-bench accept % | Summarization PP (tok/s) | Summarization TG (tok/s) | Summarization accept % |
|---|---|---|---|---|---|---|---|---|
| No MTP 1 | 512 | n/a | 0 | 25.0 | n/a | 178 | 23.8 | n/a |
| No MTP 2 | 1024 | n/a | 0 | 23.1 | n/a | 292 | 22.5 | n/a |
| No MTP 3 | 1536 | n/a | 0 | 24.5 | n/a | 299 | 24.4 | n/a |
| No MTP 4 | 2048 | n/a | 0 | 23.0 | n/a | 436 | 26.1 | n/a |
| MTP 1 | 512 | q8_0 | 448 | 27.3 | 81.5 | 143 | 26.1 | 76.5 |
| MTP 2 | 1024 | q8_0 | 960 | 18.7 | 82.7 | 138 | 25.9 | 72.0 |
| MTP 3 | 512 | q4_0 | 448 | 26.4 | 81.5 | 139 | 25.3 | 73.4 |
| MTP 4 | 1024 | q4_0 | 960 | 25.4 | 82.7 | 198 | 23.7 | 73.7 |

I also tried higher ubatch values with MTP, but the results were so bad (TG 10-15 tok/s, probably due to running out of RAM and swapping) that I aborted those runs.

Verdict

  • The baseline “No MTP 4” with ubatch=2048 is clearly the best non-MTP setup. It reached PP speeds over 400 tok/s and TG speeds of 23-26 tok/s.
  • The “MTP 1” run with ubatch=512 reached the best TG speed (over 27 tok/s) in mtp-bench but was only tied with “No MTP 4” on summarization TG. Its PP speed was much lower than in any non-MTP setup.
  • Increasing the ubatch size with MTP improves PP a bit, especially in the “MTP 4” setup, which also used q4_0 quantization for the draft KV cache. But this practically eliminated the TG benefit, and PP was still less than half of the best non-MTP result.
  • In short: MTP is not worth it in this setting. A tiny TG increase in some cases, but always a giant drop in PP speed. If PP speeds for MTP are later improved in llama.cpp (this was listed as a known issue in the PR), this might change.
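A rough speculative-decoding model makes the flat TG numbers plausible. If each drafted token is accepted independently with probability a (a simplification: llama.cpp's reported acceptance rate is an aggregate, not a per-token probability), then a step with draft length n emits 1 + a + a² + … + aⁿ tokens on average:

```python
def tokens_per_step(a, n):
    """Expected tokens emitted per verification step when each of n drafted
    tokens is accepted independently with probability a, acceptance stops at
    the first rejection, and the verifier always contributes one token:
    E = 1 + a + a**2 + ... + a**n = (1 - a**(n + 1)) / (1 - a)."""
    return (1 - a ** (n + 1)) / (1 - a)

# "MTP 1" summarization numbers: ~76.5% acceptance, draft length 2
expected = tokens_per_step(0.765, 2)  # ~2.35 tokens per step
```

At a ≈ 0.765 and n = 2 this gives roughly 2.35 tokens per step, so TG only improves if drafting plus batched verification costs less than about 2.35 plain decode steps; the tied TG here suggests that on this hardware it doesn't.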

Observations

  • I was surprised to see that using q4_0 quantization for the draft model KV cache had negligible impact on draft model accuracy. This saves a tiny bit of VRAM, so might be a useful trick for very VRAM constrained setups.
  • There is a bit of unexplained variation between measurements, probably due to random chance, CPU/GPU thermal throttling, etc. Not too bad, but take the numbers with a grain of salt.
  • VRAM is obviously very tight from the start. The MTP VRAM overhead easily pushes the system into a badly performing scenario.
  • The --fit and --fit-target options don’t seem to take the MTP overhead into account; you need to reserve some memory for MTP, and the amount depends mainly on the ubatch size. Thus you have to set --fit-target manually if you want to squeeze the maximum performance out of your limited VRAM. In my case, setting fit-target to a number a bit less than the ubatch size seemed to work, but YMMV.
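To put a rough number on the draft KV cache saving: llama.cpp's q8_0 format packs 32 values into 34 bytes and q4_0 into 18 bytes. Assuming a hypothetical MTP head of a single extra layer with 4 KV heads of dimension 128 (the real Qwen3.6-35B-A3B shapes may well differ), the q4_0 draft cache at the full 64k context saves only a few dozen MiB:

```python
def kv_cache_mib(n_ctx, n_layer, n_kv_heads, head_dim, bytes_per_elem):
    """K + V cache size in MiB: two tensors of n_ctx * n_kv_heads * head_dim
    elements per layer."""
    elems = 2 * n_layer * n_ctx * n_kv_heads * head_dim
    return elems * bytes_per_elem / 1024 ** 2

# llama.cpp block formats: q8_0 packs 32 values into 34 bytes, q4_0 into 18
Q8_0, Q4_0 = 34 / 32, 18 / 32

# Hypothetical MTP head: one extra layer, GQA with 4 KV heads of dim 128
# (assumed shapes, not the real Qwen3.6-35B-A3B config)
saving_mib = (kv_cache_mib(65536, 1, 4, 128, Q8_0)
              - kv_cache_mib(65536, 1, 4, 128, Q4_0))  # 32.0 MiB at 64k
```

A saving of this order is consistent with the "tiny bit of VRAM" observed above, but still potentially useful when you are counting every MiB.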

Notes

This post was constructed from 100% organic ingredients. No AIs were harmed in the process.

My second post here. Happy to answer any questions.





Originally published at reddit.com. Curated by AI Maestro.
