A Reddit user shared their success running the DeepSeek-V4-Flash model locally on a budget machine built around four RTX 2080 Ti GPUs. The setup cost less than $2,500 and reached around 255 tokens per second during prefill.

- They wrote and optimized custom CUDA kernels for the Turing architecture to accelerate W8A8 (INT8 weights, INT8 activations) matrix multiplication, a critical operation for the model's performance.
- They used heterogeneous inference, splitting the model's memory footprint between 1TB of system RAM and four GPUs with 22GB of VRAM each.
- They implemented a pipelined execution strategy to reduce communication overhead, a common cost in MoE (Mixture of Experts) models.

This result shows that even older hardware can run state-of-the-art language models like DeepSeek-V4-Flash. Because the project is open source, the community can explore and improve on it further.
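To make the W8A8 idea concrete, here is a minimal pure-Python sketch of the arithmetic only. It is not the author's CUDA kernel: the function names, the symmetric per-row scaling scheme, and the toy inputs are all assumptions chosen for illustration. Both operands are quantized to the INT8 range, the dot products are accumulated as integers, and the result is rescaled to floating point once per output element.

```python
def quantize_row(row):
    """Symmetric INT8 quantization of one row; returns (int values, scale)."""
    max_abs = max(abs(v) for v in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in row]
    return q, scale


def w8a8_matmul(activations, weights):
    """Approximate activations @ weights^T using INT8 operands.

    activations: M rows of length K (assumed: one scale per row/token)
    weights:     N rows of length K (assumed: one scale per output channel)
    """
    a_quant = [quantize_row(r) for r in activations]
    w_quant = [quantize_row(r) for r in weights]
    out = []
    for a_q, a_scale in a_quant:
        out_row = []
        for w_q, w_scale in w_quant:
            # Integer multiply-accumulate, the part a GPU kernel would speed up.
            acc = sum(x * y for x, y in zip(a_q, w_q))
            # Dequantize once per output element.
            out_row.append(acc * a_scale * w_scale)
        out.append(out_row)
    return out


# Toy inputs (hypothetical values, not taken from the model).
activations = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
weights = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]

approx = w8a8_matmul(activations, weights)
exact = [[sum(a * w for a, w in zip(ar, wr)) for wr in weights]
         for ar in activations]
max_error = max(abs(p - e)
                for pr, er in zip(approx, exact)
                for p, e in zip(pr, er))
```

On these inputs the quantized result stays close to the full-precision product, which is the trade W8A8 makes: a small accuracy loss in exchange for integer math and half the memory traffic of 16-bit weights.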
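The post does not say how tensors were assigned to devices, so the following is only a sketch of one simple way a RAM/VRAM split could be planned: a greedy first-fit pass that fills each GPU up to its budget and spills whatever is left to system RAM. The function name, the per-GPU reserve, and every size in the example are hypothetical.

```python
def plan_placement(tensor_sizes_gb, gpu_budgets_gb, reserve_gb=2.0):
    """Greedy first-fit placement across GPUs with spill to CPU RAM.

    tensor_sizes_gb: list of (name, size_gb), highest priority first
    gpu_budgets_gb:  VRAM per GPU in GB
    reserve_gb:      assumed headroom per GPU for activations and KV cache
    """
    free = [budget - reserve_gb for budget in gpu_budgets_gb]
    placement = {}
    for name, size in tensor_sizes_gb:
        for gpu, available in enumerate(free):
            if size <= available:
                free[gpu] -= size
                placement[name] = f"cuda:{gpu}"
                break
        else:
            # No GPU has room, so this group lives in system RAM.
            placement[name] = "cpu"
    return placement


# Hypothetical weight groups; the sizes are invented for the example.
groups = [
    ("attn_and_shared", 12.0),
    ("experts_0", 18.0),
    ("experts_1", 18.0),
    ("experts_2", 18.0),
    ("experts_3", 18.0),
    ("experts_4", 18.0),
]
placement = plan_placement(groups, gpu_budgets_gb=[22.0, 22.0, 22.0, 22.0])
```

Listing the most frequently used weights first keeps them in VRAM, while rarely touched expert weights fall through to the much larger pool of system RAM.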
Originally published at reddit.com. Curated by AI Maestro.




