ByteShape Qwen3.6-35B-A3B: 30% faster than Unsloth IQ on a 6GB VRAM laptop


TL;DR

The ByteShape quant generates text about 30% faster than the similarly sized Unsloth UD-IQ4_XS quant (33.1 vs 25.4 tok/s).
The improvement applies to text generation (TG) only, not to prompt processing (PP).
PP is about 4% slower with ByteShape (564 vs 585 tok/s), which is slight next to the TG gain.

Introduction

A few days ago, I experimented with MTP (multi-token prediction) on a 6GB VRAM laptop. That didn’t work so well; CPU offload hurts MTP performance badly. Now I’ve tried out the new ByteShape quants for Qwen3.6-35B-A3B, which are claimed to be both smaller and faster than others while still having excellent quality.

Hardware

Asus ROG Zephyrus G14 laptop, 2021 model
AMD Ryzen 7 5800HS with Radeon Graphics (8 CPU cores / 16 threads)
NVIDIA RTX 3060 Laptop GPU, 6GB VRAM
24GB RAM (DDR4 3200 MT/s), 1TB SSD

Software

Linux Mint 22.2 (based on Ubuntu 24.04) with the Cinnamon desktop running on the Radeon iGPU (thus the 3060 was dedicated to llama.cpp only)
llama.cpp version: 9203 (87589042c) built from current master branch with GNU 13.3.0 for Linux x86_64
CUDA 12.0 installed from Ubuntu repositories

Test Setup

I fixed the following for all experiments:

context size 65536 (enough to do agentic coding on e.g. Pi or Dirac, or run Hermes Agent)
mmap off, mlock on, ubatch size 2048 (gives much better PP speed than the default 512)
no mmproj (no image input support needed for now)

Configuration

My models-preset.ini contents:

version = 1
[m]
m = /proj/llms/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf
fit = true
fit-target = 64
c = 65536
chat-template-kwargs = {"preserve_thinking": true}
temp = 0.6
top-p = 0.95
min-p = 0.0
ctx-checkpoints = 64
flash-attn = on
b = 2048
ub = 2048
jinja = true
ctk = q8_0
ctv = q8_0
threads = 6
parallel = 1
cache-ram = 4096
mmap = false
mlock = true

Benchmark Results

I used a test prompt of approximately 10k tokens, followed by 1.5-2k tokens of generation. I ran the benchmark twice per model and got nearly identical numbers both times.
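The post does not say how the tok/s figures were collected; they may simply have been read from the server log. As a rough sketch of one way to get comparable numbers, the snippet below assumes a llama-server instance listening on the default http://127.0.0.1:8080 and relies on the `timings` object that its `/completion` endpoint returns. The helper names `run_once` and `speeds` are made up for illustration.

```python
import json
import urllib.request


def run_once(prompt, n_predict=2000, url="http://127.0.0.1:8080/completion"):
    """Send one completion request to a running llama-server (assumed URL/port)."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def speeds(response):
    """Return (PP tok/s, TG tok/s) computed from the response's timings object.

    Note: with prompt caching, prompt_n only counts tokens that were actually
    processed, so a repeated prompt should be run against a fresh cache.
    """
    t = response["timings"]
    pp = t["prompt_n"] / (t["prompt_ms"] / 1000.0)
    tg = t["predicted_n"] / (t["predicted_ms"] / 1000.0)
    return pp, tg


if __name__ == "__main__":
    # Hypothetical file holding the ~10k-token test prompt.
    with open("prompt.txt", encoding="utf-8") as f:
        pp, tg = speeds(run_once(f.read()))
    print(f"PP {pp:.0f} tok/s, TG {tg:.1f} tok/s")
```

Running this once per model after a reboot, with the same prompt file, would give one PP and one TG figure per quant.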

	Unsloth	ByteShape	Δ
PP tok/s	585	564	-4%
TG tok/s	25.4	33.1	+30%

Despite being a bit larger than the Unsloth quant, the ByteShape quant is over 30% faster on text generation! PP speed is slightly lower for ByteShape, though (about 4%).
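For reference, the Δ column is a plain relative change against the Unsloth baseline. The sketch below reproduces it from the table values and also estimates the wall time of one benchmark run. The 1,750 generated tokens are an assumption (the midpoint of the 1.5-2k range above), and the function names are illustrative only.

```python
def pct_change(baseline, candidate):
    """Relative change of candidate vs baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


def wall_seconds(pp_tok_s, tg_tok_s, prompt_tokens=10_000, gen_tokens=1_750):
    """Estimated time for one run: prompt processing plus generation.

    gen_tokens=1_750 is an assumed midpoint, not a measured value.
    """
    return prompt_tokens / pp_tok_s + gen_tokens / tg_tok_s


# (Unsloth, ByteShape) values copied from the results table.
results = {"PP tok/s": (585.0, 564.0), "TG tok/s": (25.4, 33.1)}

for metric, (unsloth, byteshape) in results.items():
    print(f"{metric}: {pct_change(unsloth, byteshape):+.1f}%")

unsloth_s = wall_seconds(585.0, 25.4)
byteshape_s = wall_seconds(564.0, 33.1)
print(f"Estimated run time: Unsloth {unsloth_s:.0f} s, ByteShape {byteshape_s:.0f} s")
```

Under these assumptions a run takes roughly 86 s with Unsloth and 71 s with ByteShape, so the small PP loss is far outweighed by the TG gain whenever generation dominates the run time.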

Observations

Part of the difference may be explained by imatrix (IQ) vs regular (Q) quants. Unsloth UD-IQ4_XS is imatrix, and I understand that these are slower to compute on CPU. A better comparison would be against the ByteShape GPU-5 quant, which is also imatrix in my understanding. But I wanted an upgrade over UD-IQ4_XS and definitely got it!
I noticed that my TG performance seems to degrade over time by ~10% or more without any change to the setup. I suspect that repeatedly suspending and resuming the laptop somehow hurts, but I haven’t figured out the reason; it’s not just memory pressure building up AFAICT. Rebooting the machine gives me the best performance, so I did that before benchmarking.
I haven’t made any detailed quality measurements between the models. The ByteShape model seems very similar; possibly the thinking output is generally somewhat shorter than with Unsloth, but that could be a measurement error. I hope that someone does an independent comparison between ByteShape and other quants in terms of output quality, because their claims seem to be a bit too good to be true!

Notes

This post was assembled from 100% biodegradable bytes. No AIs were harmed in the process.
