ByteShape Qwen3.6-35B-A3B: 30% faster than Unsloth IQ on 6GB VRAM laptop

By AI Maestro, May 22, 2026

TL;DR

  • The ByteShape quant generates text about 30% faster than the similarly sized Unsloth UD-IQ4_XS quant (33.1 vs. 25.4 tok/s).
  • Prompt processing (PP) is about 4% slower with ByteShape (564 vs. 585 tok/s).
  • The PP difference is slight, but the text generation (TG) gain is notable.

Introduction

A few days ago, I experimented with MTP on a 6GB VRAM laptop. That didn’t work so well; CPU offload hurts MTP performance badly. Now I’ve tried out the new ByteShape quants for Qwen3.6-35B-A3B, which are claimed to be both smaller and faster than others while still having excellent quality.

Hardware

  • Asus ROG Zephyrus G14 laptop, 2021 model
  • AMD Ryzen 7 5800HS with Radeon Graphics (8 CPU cores / 16 threads)
  • NVIDIA RTX 3060 Laptop GPU, 6GB VRAM
  • 24GB RAM (DDR4 3200 MT/s), 1TB SSD

Software

  • Linux Mint 22.2 (based on Ubuntu 24.04) with the Cinnamon desktop running on the Radeon iGPU (thus the 3060 was dedicated to llama.cpp only)
  • llama.cpp version: 9203 (87589042c) built from current master branch with GNU 13.3.0 for Linux x86_64
  • CUDA 12.0 installed from Ubuntu repositories

Test Setup

I fixed the following for all experiments:

  • context size 65536 (enough to do agentic coding on e.g. Pi or Dirac, or run Hermes Agent)
  • mmap off, mlock on, ubatch size 2048 (gives much better PP speed than the default 512)
  • no mmproj (no image input support needed for now)

Configuration

My models-preset.ini contents:

version = 1
[m]
m = /proj/llms/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf
fit = true
fit-target = 64
c = 65536
chat-template-kwargs = {"preserve_thinking": true}
temp = 0.6
top-p = 0.95
min-p = 0.0
ctx-checkpoints = 64
flash-attn = on
b = 2048
ub = 2048
jinja = true
ctk = q8_0
ctv = q8_0
threads = 6
parallel = 1
cache-ram = 4096
mmap = false
mlock = true
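
As a quick sanity check, the preset can be read back to confirm it matches the fixed settings listed under Test Setup. The sketch below uses Python's standard configparser; it shows only one way to read such a file from Python and makes no claim about how llama.cpp itself parses it. The embedded text is an abbreviated copy of the preset, and the "global" wrapper section and the load_preset helper are hypothetical names introduced for this example.

```python
import configparser
import json

# Abbreviated copy of the preset shown above (only the keys checked below).
PRESET_TEXT = """\
version = 1
[m]
m = /proj/llms/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf
c = 65536
chat-template-kwargs = {"preserve_thinking": true}
ub = 2048
mmap = false
mlock = true
"""


def load_preset(text):
    """Parse preset text with Python's configparser (hypothetical helper)."""
    parser = configparser.ConfigParser(interpolation=None)
    # "version = 1" comes before any section header, which configparser
    # rejects, so wrap it in a placeholder section named "global".
    parser.read_string("[global]\n" + text)
    return parser


preset = load_preset(PRESET_TEXT)
model = preset["m"]

settings = {
    "context": model.getint("c"),
    "ubatch": model.getint("ub"),
    "mmap": model.getboolean("mmap"),
    "mlock": model.getboolean("mlock"),
    "template_kwargs": json.loads(model["chat-template-kwargs"]),
}

for name, value in settings.items():
    print(f"{name}: {value}")
```

When comparing two quants, only the model path should differ between runs; checking the remaining keys this way helps rule out an accidental configuration change as the cause of a speed difference.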

Benchmark Results

I used a test prompt of approximately 10k tokens, followed by 1.5-2k tokens of generation. I ran the benchmark twice and got nearly identical numbers both times.

Metric      Unsloth   ByteShape   Δ
PP tok/s    585       564         -4%
TG tok/s    25.4      33.1        +30%

The ByteShape quant, despite being a bit larger than the Unsloth one, is just over 30% faster at text generation! Its PP speed is slightly lower, though.
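
To make the arithmetic behind the Δ column explicit, here is a small sketch that recomputes the relative changes from the measured speeds and estimates end-to-end time for one request. The workload sizes are assumptions taken from the benchmark description (10k prompt tokens, and the upper end of the 1.5-2k generation range); the function names are made up for illustration, and the estimate ignores any fixed overhead such as model loading.

```python
# Measured speeds from the results table (tokens per second).
UNSLOTH = {"pp": 585.0, "tg": 25.4}
BYTESHAPE = {"pp": 564.0, "tg": 33.1}

# Assumed workload: ~10k prompt tokens, 2k generated tokens.
PROMPT_TOKENS = 10_000
GEN_TOKENS = 2_000


def pct_delta(baseline, value):
    """Relative change of value versus baseline, in percent."""
    return (value - baseline) / baseline * 100.0


def wall_time(speeds, prompt_tokens, gen_tokens):
    """Seconds to process the prompt and then generate the reply."""
    return prompt_tokens / speeds["pp"] + gen_tokens / speeds["tg"]


pp_delta = pct_delta(UNSLOTH["pp"], BYTESHAPE["pp"])
tg_delta = pct_delta(UNSLOTH["tg"], BYTESHAPE["tg"])
t_unsloth = wall_time(UNSLOTH, PROMPT_TOKENS, GEN_TOKENS)
t_byteshape = wall_time(BYTESHAPE, PROMPT_TOKENS, GEN_TOKENS)
time_saved = (1.0 - t_byteshape / t_unsloth) * 100.0

print(f"PP delta: {pp_delta:+.1f}%")
print(f"TG delta: {tg_delta:+.1f}%")
print(f"Unsloth total:   {t_unsloth:.1f} s")
print(f"ByteShape total: {t_byteshape:.1f} s")
print(f"Time saved:      {time_saved:.1f}%")
```

Under these assumptions, generation dominates the total time (roughly 96 s for Unsloth versus 78 s for ByteShape), which is why the small PP regression barely matters for this workload.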

Observations

  • Part of the difference may be explained by IQ-type vs. regular Q-type quants. Unsloth UD-IQ4_XS is an IQ-type quant (these are often called "imatrix" quants, although an importance matrix can be applied to regular Q quants as well), and I understand that IQ types are slower to compute on CPU. A better comparison would be against the ByteShape GPU-5 quant, which is also an IQ type in my understanding. But I wanted an upgrade over UD-IQ4_XS and definitely got it!
  • I noticed that my TG performance seems to degrade over time by ~10% or more without changing the setup. I suspect suspending and then awakening the laptop repeatedly somehow hurts, but I haven’t figured out the reason; it’s not just memory pressure building up AFAICT. Rebooting the machine brings me the best performance, so I did that before benchmarking.
  • I haven’t made any detailed quality measurements between the models. The ByteShape model seems very similar; possibly the thinking output is generally somewhat shorter than with Unsloth, but that could be a measurement error. I hope that someone does an independent comparison between ByteShape and other quants in terms of output quality, because their claims seem to be a bit too good to be true!
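
Since TG speed can drift by around 10% between a fresh boot and a long-running session, it is worth checking whether such drift alone could account for the measured gap. The sketch below assumes the worst case, where only the ByteShape run is degraded, and treats the 10% figure as an illustrative value rather than a measurement.

```python
# Measured TG speeds (tokens per second).
UNSLOTH_TG = 25.4
BYTESHAPE_TG = 33.1

# Assumed drift: ByteShape loses 10% TG speed while Unsloth stays fresh.
DRIFT = 0.10

degraded_tg = BYTESHAPE_TG * (1.0 - DRIFT)
remaining_lead = (degraded_tg / UNSLOTH_TG - 1.0) * 100.0

# Drift at which the two quants would generate at the same speed.
break_even_drift = (1.0 - UNSLOTH_TG / BYTESHAPE_TG) * 100.0

print(f"Degraded ByteShape TG: {degraded_tg:.1f} tok/s")
print(f"Remaining lead:        {remaining_lead:.1f}%")
print(f"Break-even drift:      {break_even_drift:.1f}%")
```

Even in that worst case ByteShape keeps a lead of about 17%, and the drift would have to exceed roughly 23% to erase the gap entirely.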

Notes

This post was assembled from 100% biodegradable bytes. No AIs were harmed in the process.