Running DeepSeek-V4 locally with 4x legacy RTX 2080 Ti ($2k budget setup). Custom Turing kernels, W8A8 quantization, and 255 prefill tok/s!

A Reddit user shared how they ran the DeepSeek-V4-Flash model locally on a budget machine built around four RTX 2080 Ti GPUs. The setup cost less than $2,500 and reached around 255 prefill tokens per second (tok/s).

They wrote custom CUDA kernels for Turing, optimized for W8A8 (INT8 weights and activations) matrix multiplication, an operation critical to the model’s performance.
They used heterogeneous inference, splitting the model’s memory footprint between 1TB of system RAM and four GPUs with 22GB of VRAM each.
They implemented a pipelined execution strategy to reduce communication overhead, which is common in MoE (Mixture of Experts) models such as DeepSeek-V4-Flash.
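
The post does not spell out the quantization scheme, so the following is only a minimal sketch of what W8A8 matrix multiplication means in general, not the author's kernel. It assumes symmetric quantization with one scale per activation row and one scale per weight output column. The function names `quantize_symmetric` and `w8a8_matmul` are hypothetical. The real work would happen in CUDA on the GPU; this pure-Python version only illustrates the arithmetic: 8-bit operands, integer accumulation, and a final rescale to floating point.

```python
def quantize_symmetric(values, num_bits=8):
    """Quantize floats to signed integers with one shared scale (zero-point 0)."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def w8a8_matmul(activations, weights):
    """Approximate activations @ weights with INT8 operands.

    activations: M x K list of rows, quantized per row (assumed scheme).
    weights:     K x N list of rows, quantized per output column (assumed scheme).
    """
    k, n = len(weights), len(weights[0])

    # Quantize each weight column once; in practice this is done offline.
    w_quantized, w_scales = [], []
    for j in range(n):
        column = [weights[i][j] for i in range(k)]
        q, s = quantize_symmetric(column)
        w_quantized.append(q)
        w_scales.append(s)

    output = []
    for row in activations:
        # Activations are quantized at run time, one scale per row.
        a_quantized, a_scale = quantize_symmetric(row)
        out_row = []
        for j in range(n):
            # Integer multiply-accumulate, the part a GPU kernel would accelerate.
            acc = sum(a * w for a, w in zip(a_quantized, w_quantized[j]))
            # Dequantize: undo both scales to return to floating point.
            out_row.append(acc * a_scale * w_scales[j])
        output.append(out_row)
    return output


activations = [[1.0, -2.0, 0.5], [0.25, 0.75, -1.5]]
weights = [[0.5, -1.0], [1.5, 0.25], [-0.75, 2.0]]
approx = w8a8_matmul(activations, weights)
```

The result in `approx` tracks the full-precision product closely while every multiply inside the inner loop operates on small integers, which is the trade-off W8A8 makes: a small accuracy loss in exchange for half the memory traffic of 16-bit weights and access to integer matrix hardware.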

This achievement demonstrates that even legacy hardware can be leveraged for running state-of-the-art language models like DeepSeek-V4-Flash. The open-source nature of the project allows for further exploration and improvement by the community.

