24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)

“`html I recently ran two models with a substantial number of parameters—Qwen 3.6 (35B-A3B) and Gemma 4 (26B-A4B)—on an older GTX 1080…

By AI Maestro May 13, 2026 1 min read
24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)

“`html

I recently ran two models with a substantial number of parameters—Qwen 3.6 (35B-A3B) and Gemma 4 (26B-A4B)—on an older GTX 1080 graphics card, which has only 8 GB VRAM. To achieve this, I used the llama.cpp library with a custom configuration to fit within the available memory constraints.

The key finding was that forcing the token embedding table onto the GPU with the flag --override-tensor-draft "token_embd\.weight=CUDA0" resulted in significant speed improvements. Without this adjustment, Gemma 4’s MTP (model training procedure) introduced a bottleneck due to its reliance on keeping the embedding table on CPU.

  • The trick was offloading cold expert weights into system RAM and streaming them over PCIe to the GPU while keeping hot layers and the key-value cache on the GPU. This setup maximized the use of available memory and computational resources.
  • Forcing the token embedding table onto the GPU with a specific flag enabled the model to avoid this performance bottleneck, leading to substantial speed improvements.
  • The overall process required pinning an NVIDIA build to a specific branch (Pascal is going legacy), forcing GCC version 14 for CUDA 12.9 compatibility, and patching CUDA’s math_functions.h to ensure glibc 2.41 compatibility. These steps were necessary to support the GPU setup.

“`

“`


Originally published at reddit.com. Curated by AI Maestro.

Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.

Name
Scroll to Top