I have (even faster) DeepSeek V4 Pro at home


By AI Maestro · May 15, 2026 · 2 min read


DeepSeek V4 Pro Update

A few days ago, I shared my at-home DeepSeek V4 Pro setup. Here's an update:

Running DeepSeek V4 in ktransformers

I managed to run this model using the ktransformers framework, specifically with the sglang integration and kt-kernel. I followed the tutorial for DeepSeek V4 Flash and adjusted some options (NUMA, cores) for my hardware: an Epyc 9374F processor and an RTX Pro 6000 Max-Q GPU.
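For orientation, the launch looked roughly like the sketch below. The numactl pinning and the generic sglang server flags are standard; I am not reproducing the kt-kernel tutorial's exact options here, so treat that part as a placeholder and take the real flag names from the ktransformers tutorial.

```bash
# Rough launch sketch, NOT the exact tutorial command. numactl pinning and the
# generic sglang flags are standard; the kt-kernel-specific options (NUMA/core
# counts) are omitted here, so copy them from the ktransformers tutorial.
numactl --cpunodebind=0 --membind=0 \
  python -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V4-Pro \
    --host 0.0.0.0 --port 30000
    # ...plus the kt-kernel options adjusted for your hardware
```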

Performance Testing

To check how well the model performs in this environment, I used llama-benchy. Here are the results for different context depths:

Depth 0:

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:------|-----:|----:|---------:|----------:|-------------:|--------------:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 | 39.76 ± 0.00 | | 12878.44 ± 0.00 | 12877.59 ± 0.00 | 12878.44 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 | 7.54 ± 0.00 | 8.00 ± 0.00 | | | |

Depth 2048:

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:------|-----:|----:|---------:|----------:|-------------:|--------------:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d2048 | 45.13 ± 0.00 | | 56726.85 ± 0.00 | 56725.93 ± 0.00 | 56726.85 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d2048 | 7.32 ± 0.00 | 8.00 ± 0.00 | | | |

Depth 4096:

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:------|-----:|----:|---------:|----------:|-------------:|--------------:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d4096 | 45.75 ± 0.00 | | 100729.28 ± 0.00 | 100728.46 ± 0.00 | 100729.28 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d4096 | 7.29 ± 0.00 | 8.00 ± 0.00 | | | |

Depth 8192:

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:------|-----:|----:|---------:|----------:|-------------:|--------------:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d8192 | 45.97 ± 0.00 | | 189354.94 ± 0.00 | 189354.03 ± 0.00 | 189354.94 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d8192 | 7.25 ± 0.00 | 8.00 ± 0.00 | | | |

Depth 16384:

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:------|-----:|----:|---------:|----------:|-------------:|--------------:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d16384 | 46.16 ± 0.00 | | 365997.22 ± 0.00 | 365996.26 ± 0.00 | 365997.22 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d16384 | 7.17 ± 0.00 | 8.00 ± 0.00 | | | |

Depth 32768:

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:------|-----:|----:|---------:|----------:|-------------:|--------------:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d32768 | 46.18 ± 0.00 | | 720687.13 ± 0.00 | 720685.67 ± 0.00 | 720687.13 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d32768 | 7.07 ± 0.00 | 8.00 ± 0.00 | | | |

Depth 65536:

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:------|-----:|----:|---------:|----------:|-------------:|--------------:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d65536 | 46.09 ± 0.00 | | 1433019.29 ± 0.00 | 1433016.42 ± 0.00 | 1433019.29 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d65536 | 6.80 ± 0.00 | 7.00 ± 0.00 | | | |

Depth 131072:

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:------|-----:|----:|---------:|----------:|-------------:|--------------:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d131072 | 45.81 ± 0.00 | | 2872297.51 ± 0.00 | 2872296.30 ± 0.00 | 2872297.51 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d131072 | 6.38 ± 0.00 | 7.00 ± 0.00 | | | |

During the first test at a depth of 64k (which took over 20 minutes), llama-benchy did not report any results even though sglang had finished processing the request, so I had to abort that run.

Update: It turns out llama-benchy applies the depth setting during the warmup phase as well, so it was processing the 64k context twice: once for warmup and again for the measured run. Adding --no-warmup --no-adapt-prompt fixed this; now only one context pass is performed.
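With those flags in place, the sweep boiled down to something like the sketch below. Only --no-warmup and --no-adapt-prompt are taken from the run above; every other flag name is my guess at llama-benchy's interface and may be spelled differently in the real tool.

```bash
# Hypothetical invocation shape. Only --no-warmup and --no-adapt-prompt are
# confirmed above; --url/--model/--tests/--depths are placeholder names.
llama-benchy \
  --url http://localhost:30000 \
  --model deepseek-ai/DeepSeek-V4-Pro \
  --tests pp512,tg32 \
  --depths 0,2048,4096,8192,16384,32768,65536,131072 \
  --no-warmup --no-adapt-prompt
```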

This entire process runs using the original model files, without any need for conversion.
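As a sanity check, the ttfr numbers line up with prompt-processing throughput over the full prefill: at each depth, ttfr ≈ (depth + 512) / (pp t/s). For example, at d2048 that gives 2560 / 45.13 ≈ 56.7 s against a reported 56726.85 ms, and at d65536 it gives 66048 / 46.09 ≈ 1433 s against 1433019.29 ms. In other words, the pp512 t/s column reflects sustained prefill speed over the entire context, not just the final 512 tokens.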

Resource Usage

• GPU VRAM usage: 90815MiB / 97887MiB
• GPU power usage: ~100W during PP (prompt processing), ~150W during TG (token generation)
• RAM usage: 907.5GB / 1152GB
• CPU + motherboard power usage: ~400W
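For anyone reproducing this, the GPU-side figures above are easy to watch live with a standard nvidia-smi poll (the RAM figure came from the usual system tools):

```bash
# Poll VRAM use and board power once per second during the PP/TG phases.
nvidia-smi --query-gpu=memory.used,memory.total,power.draw --format=csv -l 1
```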

Key Takeaways

• Prompt-processing throughput rises slightly with depth (39.76 t/s at depth 0 to ~46 t/s) and then holds steady out to 128k, while token generation degrades only gently, from ~7.5 to ~6.4 t/s.
• The original model files run as-is; no conversion step is required.
• Benchmark flags matter: without --no-warmup --no-adapt-prompt, the full context is processed twice and long-depth runs become impractically slow.

For more details, see my previous post: I have DeepSeek V4 Pro at home.

Thanks for reading!



Originally published at reddit.com.