2 old RTX 2080 Ti with 22GB VRAM each: Qwen3.6 27B at 38 tokens/s with F16 KV cache


By AI Maestro · May 15, 2026 · 1 min read

**What Happened:**
A user named **snapo84** shared their current AI model setup on r/LocalLLaMA: two older RTX 2080 Ti GPUs, each upgraded to 22GB of VRAM for 44GB in total. The model is Qwen3.6 (a Qwen variant) with 27B parameters, and the setup generates an impressive 38 tokens per second while using an F16 (unquantized) KV cache.

**Why It Matters:**
This post shows that a large model like Qwen can run well on relatively old hardware, specifically RTX 2080 Ti GPUs. The user demonstrates how optimizations such as IQ4_XS quantization, an F16 KV cache, and the `--split-mode tensor` option push throughput well beyond the defaults without requiring newer or more powerful hardware. This is all the more notable given that the setup runs with a 150W power limit, so efficient use of the available compute and memory is what makes the result possible.
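For context, a llama.cpp launch combining these options might look like the sketch below. The GGUF filename and context size are placeholders rather than values from the post, and `--split-mode tensor` is quoted as the poster wrote it; whether that value is accepted depends on the llama.cpp build in use.

```bash
# Hypothetical llama-server launch combining the options mentioned in the post.
# The GGUF filename and context size are placeholders, not values from the post.
llama-server \
  -m ./qwen-27b-iq4_xs.gguf \
  -ngl 99 \
  --split-mode tensor \
  --cache-type-k f16 --cache-type-v f16 \
  -c 32768
```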

**Takeaways:**
- **Old GPUs Can Handle Large Models:** Running Qwen3.6 27B on two RTX 2080 Ti cards with upgraded VRAM shows that older hardware can still serve cutting-edge models.
- **Optimizations Matter:** The combination of IQ4_XS quantization, an F16 KV cache, and `--split-mode tensor` delivers a significant speedup over the default settings.
- **Resource Management is Key:** Filling 95% of VRAM with `--fit on` did not provide a meaningful benefit over manually setting the context length, so budgeting VRAM by hand still pays off (see the back-of-the-envelope estimate below).
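To see why budgeting VRAM by hand helps, here is a rough estimate of where the memory goes. IQ4_XS works out to roughly 4.25 bits per weight; the layer count, KV-head count, and head dimension below are assumed values for illustration only and are not taken from the post.

```bash
# Rough VRAM estimate: IQ4_XS weights plus an F16 KV cache at a 32k context.
# Architecture numbers (48 layers, 8 KV heads, head dim 128) are assumptions.
awk 'BEGIN {
  w  = 27e9 * 4.25 / 8 / 1e9;                # quantized weights: ~14.3 GB
  kv = 2 * 32768 * 48 * 8 * 128 * 2 / 1e9;   # K+V cache in F16: ~6.4 GB
  printf "weights ~%.1f GB  kv ~%.1f GB  total ~%.1f GB\n", w, kv, w + kv;
}'
```

Under these assumptions the total lands around 21 GB, well inside the 44 GB the two upgraded cards provide, which leaves room for compute buffers and a longer context.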


Originally published at reddit.com. Curated by AI Maestro.
