Qwen3.6-35B-A3B Q4 262k context on 8GB 3070 Ti = +30tps

Here are some insights from my experimentation with Qwen3.6-35B-A3B, a large language model (LLM) running on an NVIDIA RTX 3070 Ti GPU with 8GB of VRAM.

Key Takeaways

The model reached a context of up to 1M tokens in my tests (on Ubuntu Server); beyond that point, performance degrades noticeably.
Forcing the model's layers into VRAM is not recommended, as it leads to poor performance and memory exhaustion on an 8GB card.
Running a Linux-based environment like Ubuntu Server instead of Windows 11 can significantly improve inference throughput by reducing system RAM usage.

[Screenshot: an illustration of a common misunderstanding of the MoE (Mixture of Experts) model architecture]

Here are some performance numbers for different configurations:

Platform	Inference speed (tokens/s)	System RAM usage	Highest context size reached
Windows 11	<27, drops quickly beyond 100k context	28 GB+ (full)	512K with turbo4 quant for KV
Ubuntu Server (fresh install)	>34, often peaks around 37	22 GB (full)	1M on IQ4_NL_XL with turbo4 quant for KV
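
To put the table in perspective, here is a minimal Python sketch of the arithmetic behind the comparison. It uses only the two throughput bounds from the table (below 27 tps on Windows 11, above 34 tps on Ubuntu Server). The function names and the 2,000-token response length are made up for illustration and are not part of my setup.

```python
def relative_gain(baseline_tps: float, new_tps: float) -> float:
    """Fractional throughput gain of new_tps over baseline_tps."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return (new_tps - baseline_tps) / baseline_tps


def generation_seconds(n_tokens: int, tps: float) -> float:
    """Time to generate n_tokens at a steady rate of tps tokens/second."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    return n_tokens / tps


# Bounds taken from the table: Windows 11 stays below 27 tps,
# Ubuntu Server stays above 34 tps.
WINDOWS_TPS = 27.0
UBUNTU_TPS = 34.0

if __name__ == "__main__":
    # Because 27 is an upper bound and 34 a lower bound, this is a floor
    # on the real gain, not an exact measurement.
    gain = relative_gain(WINDOWS_TPS, UBUNTU_TPS)
    print(f"Ubuntu vs Windows: at least {gain:.1%} faster")

    # Hypothetical 2,000-token response, chosen only for illustration.
    for name, tps in (("Windows 11", WINDOWS_TPS), ("Ubuntu Server", UBUNTU_TPS)):
        print(f"{name}: {generation_seconds(2000, tps):.0f} s")
```

With these bounds the gain works out to roughly a quarter more tokens per second. The gap at long context is likely larger, since the Windows figure drops quickly beyond 100k.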

I have an older GPU that I use to drive the host OS, which keeps the 3070 Ti dedicated to running Qwen. This setup has been working well so far.

For those looking to optimize their own LLM setups, these are some observations:

Because this is an MoE model, only approximately 3.5 billion parameters need to be in VRAM at runtime.
Tweaking engine parameters, such as forcing all layers into VRAM or adjusting other runtime settings, can actually hurt performance.
Running a lightweight Linux environment instead of Windows 11 can free up significant system memory, leading to higher inference throughput.
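
As a rough back-of-the-envelope check on the first observation, the sketch below estimates weight sizes in Python. The 3.5 billion figure is the one quoted above and the 35 billion total is read off the model name. The 4.5 bits per weight is an assumption standing in for "Q4" (real quantization formats vary), and the estimate ignores the KV cache, activations, and runtime overhead.

```python
GIB = 1024 ** 3


def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size in GiB of n_params weights stored at bits_per_weight."""
    if n_params < 0 or bits_per_weight <= 0:
        raise ValueError("n_params must be >= 0 and bits_per_weight > 0")
    return n_params * bits_per_weight / 8 / GIB


ASSUMED_BITS_PER_WEIGHT = 4.5  # assumption: rough average for a 4-bit quant
ACTIVE_PARAMS = 3.5e9          # figure quoted in the observations above
TOTAL_PARAMS = 35e9            # read off the "35B" in the model name
VRAM_GIB = 8.0                 # RTX 3070 Ti

if __name__ == "__main__":
    active = weight_gib(ACTIVE_PARAMS, ASSUMED_BITS_PER_WEIGHT)
    total = weight_gib(TOTAL_PARAMS, ASSUMED_BITS_PER_WEIGHT)
    print(f"~3.5B parameters: {active:.1f} GiB")
    print(f"~35B parameters:  {total:.1f} GiB")
    print(f"Whole model fits in {VRAM_GIB:.0f} GiB of VRAM: {total <= VRAM_GIB}")
```

Under these assumptions the full set of weights is more than twice the size of the card's VRAM, while the roughly 3.5B parameters fit with room to spare. That is consistent with forcing every layer into VRAM being counterproductive on an 8GB card.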

Note: These numbers are based on my specific hardware and configuration. Individual results may vary depending on the model’s quantization level, batch size, and other factors.
