Qwen3.6-35B-A3B Q4 262k context on 8GB 3070 Ti = +30tps
Here are some insights from my experimentation with Qwen3.6-35B-A3B, a large language model (LLM) running on an NVIDIA RTX 3070 Ti GPU with 8GB of VRAM.
Key Takeaways
- The model handled up to 1 million context tokens in my tests (on Ubuntu Server), but performance starts to degrade noticeably past that point.
- Forcing all of the model's layers into VRAM is not recommended: it leads to poor performance and memory exhaustion.
- Running a Linux-based environment like Ubuntu Server instead of Windows 11 can significantly improve inference throughput by reducing system RAM usage: in my tests, >34 tps at about 22GB versus <27 tps at 28GB+.
Here are some performance numbers for different configurations:
| Platform | Inference speed (tps) | System memory usage (GB) | Highest context size reached (tokens) |
|---|---|---|---|
| Windows 11 | <27, drops quickly beyond 100k context | 28+ (full) | 512K with turbo4 quant for KV |
| Ubuntu Server (fresh install) | >34, often peaks around 37 | 22 (full) | 1M on IQ4_NL_XL with turbo4 quant for KV |
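The table gives bounds rather than exact figures, so the gap between the two platforms can only be stated as a floor. A minimal sketch of that arithmetic, using only the numbers in the table (the variable names are mine):

```python
# Bounds taken from the table; these are inequalities, not exact measurements.
windows_tps_upper = 27.0   # Windows 11: "<27 tps"
ubuntu_tps_lower = 34.0    # Ubuntu Server: ">34 tps"
windows_ram_gb = 28.0      # Windows 11: "28GB+"
ubuntu_ram_gb = 22.0       # Ubuntu Server: "22GB"

# An upper bound on Windows and a lower bound on Ubuntu give a floor on the speedup.
min_speedup = ubuntu_tps_lower / windows_tps_upper

# Windows is reported as "28GB+", so this is the minimum difference in RAM use.
min_ram_saved_gb = windows_ram_gb - ubuntu_ram_gb

print(f"Speedup: at least {min_speedup:.2f}x")
print(f"System RAM freed: at least {min_ram_saved_gb:.0f} GB")
```

With these inputs the floor works out to roughly a 1.26x speedup and 6 GB of system RAM freed.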
I have an older GPU that I use for the host OS, which keeps the 3070 Ti dedicated to running Qwen. This setup has been working well so far.
For those looking to optimize their own LLM setups, these are some observations:
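- The model only needs approximately 3.5 billion parameters in VRAM at runtime (the "A3B" in the name refers to the roughly 3B parameters active per token); the rest can stay in system RAM.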
- Messing with engine parameters like forcing all layers into VRAM or adjusting other runtime settings can actually hinder performance.
- Running a lightweight Linux environment instead of Windows 11 can free up significant system memory, leading to higher inference throughput.
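To make the VRAM observation concrete, here is a back-of-the-envelope estimate of weight sizes. It is a sketch under assumptions: the 4.5 bits per weight figure is my assumption for a 4-bit quant with block scales, not a measured value for this model file, and the estimate ignores the KV cache, activations, and runtime overhead.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes)."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9

# Assumption: about 4.5 bits per weight for a 4-bit quant including block scales.
BITS_PER_WEIGHT = 4.5

total_gb = weights_gb(35.0, BITS_PER_WEIGHT)   # all 35B parameters
active_gb = weights_gb(3.5, BITS_PER_WEIGHT)   # the ~3.5B parameters kept in VRAM

print(f"Full model weights:    ~{total_gb:.1f} GB")
print(f"VRAM-resident weights: ~{active_gb:.1f} GB")
```

Under these assumptions the full weights come to about 19.7 GB, far more than 8GB of VRAM, while the VRAM-resident part is about 2.0 GB, which leaves most of the card free for the KV cache. The full-model figure is in the same range as the system RAM usage reported in the table.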
Note: These numbers are based on my specific hardware and configuration. Individual results may vary depending on the model’s quantization level, batch size, and other factors.
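For anyone comparing their own setup against these numbers, a simple way to get a comparable tps figure is to time a few generations and divide tokens by seconds. This is a minimal sketch, not the method used for the numbers above: `generate` is a hypothetical callable you would wrap around your own inference engine, and the result is end-to-end throughput (it includes prompt processing time).

```python
import time
from typing import Callable


def tokens_per_second(generate: Callable[[], int], runs: int = 3) -> float:
    """Average end-to-end throughput over several runs.

    `generate` is a hypothetical callable: it runs one generation with your
    engine and returns the number of tokens it produced.
    """
    total_tokens = 0
    total_seconds = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        total_tokens += generate()
        total_seconds += time.perf_counter() - start
    return total_tokens / total_seconds


if __name__ == "__main__":
    # Stand-in for a real engine call: pretends to emit 50 tokens in ~0.1 s.
    def fake_generate() -> int:
        time.sleep(0.1)
        return 50

    print(f"{tokens_per_second(fake_generate):.1f} tps")
```

Averaging over several runs at the same context length and batch size keeps the comparison fair, since throughput changes with both.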
Originally published at reddit.com. Curated by AI Maestro.