Running DeepSeek-V4 locally with 4x legacy RTX 2080 Ti ($2k budget setup). Custom Turing kernels, W8A8 quantization, and 255 prefill tok/s!

By AI Maestro May 20, 2026 1 min read
A Reddit user shared how they run the DeepSeek-V4-Flash model locally on a budget machine built around four RTX 2080 Ti GPUs. The setup cost less than $2,500 and reaches around 255 tokens per second during prefill (prompt processing).
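To put the prefill figure in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes throughput stays constant at the reported 255 tok/s regardless of prompt length, which is a simplification. The prompt lengths are hypothetical examples, not numbers from the original post.

```python
# Reported prefill throughput from the post, in tokens per second.
PREFILL_TOK_PER_S = 255


def prefill_seconds(prompt_tokens, tok_per_s=PREFILL_TOK_PER_S):
    """Estimate the time to process a prompt, assuming constant throughput."""
    return prompt_tokens / tok_per_s


# Hypothetical prompt lengths, chosen only for illustration.
for prompt_tokens in (1_000, 8_000, 32_000):
    seconds = prefill_seconds(prompt_tokens)
    print(f"{prompt_tokens:>6} tokens -> about {seconds:.1f} s before the first output token")
```

In practice, attention cost grows with context length, so long prompts would likely take longer than this linear estimate suggests.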

  • The author optimized custom CUDA kernels for Turing GPUs to accelerate W8A8 (INT8 weights and activations) matrix multiplication, a critical operation for the model’s performance.
  • Heterogeneous inference splits the model’s memory footprint between 1TB of system RAM and four GPUs with 22GB of VRAM each.
  • A pipelined execution strategy reduces the communication overhead that is typical of MoE (Mixture of Experts) models.
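The W8A8 idea from the list above can be illustrated in plain Python. This is a minimal sketch of the arithmetic only, not the author's CUDA kernels. It assumes symmetric quantization with one scale per activation row (token) and one per weight row (output channel), which is a common scheme but is not confirmed by the post. The function names and sample values are hypothetical.

```python
def quantize_rows(matrix):
    """Symmetric per-row INT8 quantization: returns (integer rows, per-row scales)."""
    q_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Round to the nearest integer and clamp to the symmetric INT8 range.
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a8_matmul(activations, weights):
    """Compute activations @ weights^T using 8-bit operands.

    activations: [tokens][hidden] floats, quantized per token.
    weights:     [out_features][hidden] floats, quantized per output channel.
    """
    qa, a_scales = quantize_rows(activations)
    qw, w_scales = quantize_rows(weights)
    out = []
    for a_row, a_scale in zip(qa, a_scales):
        out_row = []
        for w_row, w_scale in zip(qw, w_scales):
            # Integer dot product (a GPU kernel would accumulate in INT32).
            acc = sum(x * y for x, y in zip(a_row, w_row))
            # Dequantize with the product of the two scales.
            out_row.append(acc * a_scale * w_scale)
        out.append(out_row)
    return out


def float_matmul(activations, weights):
    """Full-precision reference for comparison."""
    return [[sum(x * y for x, y in zip(a, w)) for w in weights] for a in activations]


# Hypothetical toy inputs: 2 tokens, hidden size 4, 3 output features.
acts = [[0.5, -1.0, 0.25, 2.0], [1.5, 0.0, -0.75, 0.1]]
wts = [[0.2, -0.4, 0.1, 0.3], [-1.0, 0.5, 0.25, 0.05], [0.0, 0.8, -0.6, 0.9]]

approx = w8a8_matmul(acts, wts)
exact = float_matmul(acts, wts)
max_err = max(abs(p - q) for pr, qr in zip(approx, exact) for p, q in zip(pr, qr))
print(f"max abs error vs. float matmul: {max_err:.4f}")
```

The inner dot product runs entirely on small integers, which is what lets a GPU kernel use its integer matrix units and halve memory traffic relative to FP16. The only floating-point work is the final rescale.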

This achievement demonstrates that even legacy hardware can be leveraged for running state-of-the-art language models like DeepSeek-V4-Flash. The open-source nature of the project allows for further exploration and improvement by the community.


Originally published at reddit.com. Curated by AI Maestro.
