ByteShape Qwen3.6-35B-A3B: 30% Faster Than Unsloth IQ on a 6GB VRAM Laptop
TL;DR
- The ByteShape quant generates text about 30% faster than the similarly sized Unsloth UD-IQ4_XS quant.
- The gain is in text generation (TG) only: prompt processing (PP) is about 4% slower with ByteShape.
Introduction
A few days ago, I experimented with MTP on a 6GB VRAM laptop. That didn’t work so well; CPU offload hurts MTP performance badly. Now I’ve tried the new ByteShape quants for Qwen3.6-35B-A3B, which are claimed to be both smaller and faster than others while still having excellent quality.
Hardware
- Asus ROG Zephyrus G14 laptop, 2021 model
- AMD Ryzen 7 5800HS with Radeon Graphics (8 CPU cores / 16 threads)
- NVIDIA RTX 3060 Laptop GPU, 6GB VRAM
- 24GB RAM (DDR4 3200 MT/s), 1TB SSD
Software
- Linux Mint 22.2 (based on Ubuntu 24.04) with the Cinnamon desktop running on the Radeon iGPU (thus the 3060 was dedicated to llama.cpp only)
- llama.cpp version: 9203 (87589042c) built from current master branch with GNU 13.3.0 for Linux x86_64
- CUDA 12.0 installed from Ubuntu repositories
Test Setup
I fixed the following for all experiments:
- context size 65536 (enough to do agentic coding on e.g. Pi or Dirac, or run Hermes Agent)
- mmap off, mlock on, ubatch size 2048 (gives much better PP speed than the default 512)
- no mmproj (no image input support needed for now)
Configuration
My models-preset.ini contents:
version = 1
[m]
m = /proj/llms/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf
fit = true
fit-target = 64
c = 65536
chat-template-kwargs = {"preserve_thinking": true}
temp = 0.6
top-p = 0.95
min-p = 0.0
ctx-checkpoints = 64
flash-attn = on
b = 2048
ub = 2048
jinja = true
ctk = q8_0
ctv = q8_0
threads = 6
parallel = 1
cache-ram = 4096
mmap = false
mlock = true
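A preset like this is easy to mistype, so here is a minimal sketch of a sanity check using Python's standard `configparser`. It assumes the file is plain INI syntax and only looks at a shortened copy of the preset above; the `load_preset` helper and the `__top__` section name are my own inventions, and the script knows nothing about which keys llama.cpp actually accepts.

```python
import configparser
import json

# Shortened copy of the preset above, for illustration only.
PRESET_TEXT = """\
version = 1
[m]
m = /proj/llms/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf
c = 65536
chat-template-kwargs = {"preserve_thinking": true}
b = 2048
ub = 2048
mmap = false
mlock = true
"""

def load_preset(text, section):
    """Return one preset section as a plain dict of strings."""
    # Only "=" separates keys from values, so the ":" inside the JSON
    # value is left alone. Interpolation is disabled for the same reason.
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # keep keys exactly as written
    # "version = 1" sits above the first section header, which configparser
    # rejects, so park it under a throwaway section name.
    parser.read_string("[__top__]\n" + text)
    return dict(parser[section])

preset = load_preset(PRESET_TEXT, "m")

# The chat template kwargs must be valid JSON, and the sizes must be integers.
kwargs = json.loads(preset["chat-template-kwargs"])
context_size = int(preset["c"])
ubatch_size = int(preset["ub"])

print(context_size, ubatch_size, kwargs)
```

If the JSON or one of the numbers is malformed, the script fails with an exception pointing at the offending value, which is quicker than finding out after a long model load.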
Benchmark Results
I used a test prompt of approximately 10k tokens, followed by 1.5-2k tokens of generation. I ran the benchmark twice and the numbers were nearly identical.
| | Unsloth | ByteShape | Δ |
|---|---|---|---|
| PP tok/s | 585 | 564 | -4% |
| TG tok/s | 25.4 | 33.1 | +30% |
The ByteShape quant, despite being a bit larger than the Unsloth quant, is about 30% faster on text generation (33.1 vs 25.4 tok/s)! PP speed is about 4% lower for ByteShape though.
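To put the two deltas in perspective, here is a small back-of-the-envelope sketch that recomputes the percentages from the table and estimates the wall-clock time for one request. The request shape (10,000 prompt tokens, 2,000 generated tokens) is an assumption taken from the upper end of my test prompt, and the estimate ignores everything except raw PP and TG speed.

```python
# Speeds from the table above, in tokens per second.
speeds = {
    "Unsloth": {"pp": 585.0, "tg": 25.4},
    "ByteShape": {"pp": 564.0, "tg": 33.1},
}

# Assumed request shape, roughly matching the benchmark prompt.
PROMPT_TOKENS = 10_000
GEN_TOKENS = 2_000

def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

def total_seconds(pp, tg):
    """Time to process the prompt plus time to generate the reply."""
    return PROMPT_TOKENS / pp + GEN_TOKENS / tg

pp_delta = pct_change(speeds["Unsloth"]["pp"], speeds["ByteShape"]["pp"])
tg_delta = pct_change(speeds["Unsloth"]["tg"], speeds["ByteShape"]["tg"])
totals = {name: total_seconds(s["pp"], s["tg"]) for name, s in speeds.items()}

print(f"PP delta: {pp_delta:+.1f}%")
print(f"TG delta: {tg_delta:+.1f}%")
for name, seconds in totals.items():
    print(f"{name}: {seconds:.1f} s per request")
```

Under these assumptions the whole request takes roughly 96 seconds with Unsloth and 78 seconds with ByteShape. Generation dominates the total, which is why the small PP loss barely matters for this kind of workload.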
Observations
- Part of the difference may be explained by imatrix (IQ) vs regular (Q) quants. Unsloth UD-IQ4_XS is imatrix, and I understand that these are slower to compute on CPU. A better comparison would be against the ByteShape GPU-5 quant, which is also imatrix in my understanding. But I wanted an upgrade over UD-IQ4_XS and definitely got it!
- I noticed that my TG performance seems to degrade over time by ~10% or more, even though the setup stays the same. I suspect that repeatedly suspending and resuming the laptop somehow hurts, but I haven’t figured out the reason; it’s not just memory pressure building up AFAICT. Rebooting the machine gives me the best performance, so I did that before benchmarking.
- I haven’t made any detailed quality measurements between the models. The ByteShape model seems very similar; possibly the thinking output is generally somewhat shorter than with Unsloth, but that could be a measurement error. I hope that someone does an independent comparison between ByteShape and other quants in terms of output quality, because their claims seem to be a bit too good to be true!
Notes
This post was assembled from 100% biodegradable bytes. No AIs were harmed in the process.