# DeepSeek V4 Pro Update
A few days ago, I shared my DeepSeek V4 Pro setup at home. Here’s an update:
## Running DeepSeek V4 in ktransformers
I managed to run this model with the ktransformers framework, specifically using sglang together with kt-kernel. I followed the tutorial for DeepSeek V4 Flash and adjusted some options (NUMA, core counts) for my hardware: an EPYC 9374F CPU and an RTX Pro 6000 Max-Q GPU.
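For reference, a minimal sketch of what such a launch looks like is below. Only standard sglang arguments are shown; the kt-kernel-specific flags (NUMA binding, CPU core counts) come from the DeepSeek V4 Flash tutorial and are hardware-specific, so treat this as an assumption rather than the exact command used.

```bash
# Minimal launch sketch, not the exact command: kt-kernel/NUMA/core options
# come from the DeepSeek V4 Flash tutorial and depend on the hardware.
# numactl spreads allocations across NUMA nodes, which matters on EPYC boards.
numactl --interleave=all \
  python -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V4-Pro \
    --host 0.0.0.0 \
    --port 30000 \
    --trust-remote-code
```

Once up, the server exposes an OpenAI-compatible endpoint on the given port, which is what the benchmark below talks to.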
## Performance Testing
To check how well the model performs in this environment, I used llama-benchy. The invocation and the results for each context depth are below.
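Roughly, the sweep looked like the following sketch. Apart from `--no-warmup --no-adapt-prompt`, which are discussed further down, the flag names here are illustrative placeholders of mine rather than llama-benchy's documented options.

```bash
# Hypothetical invocation: only --no-warmup and --no-adapt-prompt are taken
# from this post; the remaining flag names are illustrative placeholders.
llama-benchy \
  --base-url http://localhost:30000/v1 \
  --model deepseek-ai/DeepSeek-V4-Pro \
  --pp 512 --tg 32 \
  --depth 0,2048,4096,8192,16384,32768,65536,131072 \
  --no-warmup --no-adapt-prompt
```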
**Depth 0**

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:---|---:|---:|---:|---:|---:|---:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 | 39.76 ± 0.00 | | 12878.44 ± 0.00 | 12877.59 ± 0.00 | 12878.44 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 | 7.54 ± 0.00 | 8.00 ± 0.00 | | | |

**Depth 2048**

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:---|---:|---:|---:|---:|---:|---:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d2048 | 45.13 ± 0.00 | | 56726.85 ± 0.00 | 56725.93 ± 0.00 | 56726.85 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d2048 | 7.32 ± 0.00 | 8.00 ± 0.00 | | | |

**Depth 4096**

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:---|---:|---:|---:|---:|---:|---:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d4096 | 45.75 ± 0.00 | | 100729.28 ± 0.00 | 100728.46 ± 0.00 | 100729.28 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d4096 | 7.29 ± 0.00 | 8.00 ± 0.00 | | | |

**Depth 8192**

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:---|---:|---:|---:|---:|---:|---:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d8192 | 45.97 ± 0.00 | | 189354.94 ± 0.00 | 189354.03 ± 0.00 | 189354.94 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d8192 | 7.25 ± 0.00 | 8.00 ± 0.00 | | | |

**Depth 16384**

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:---|---:|---:|---:|---:|---:|---:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d16384 | 46.16 ± 0.00 | | 365997.22 ± 0.00 | 365996.26 ± 0.00 | 365997.22 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d16384 | 7.17 ± 0.00 | 8.00 ± 0.00 | | | |

**Depth 32768**

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:---|---:|---:|---:|---:|---:|---:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d32768 | 46.18 ± 0.00 | | 720687.13 ± 0.00 | 720685.67 ± 0.00 | 720687.13 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d32768 | 7.07 ± 0.00 | 8.00 ± 0.00 | | | |

**Depth 65536**

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:---|---:|---:|---:|---:|---:|---:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d65536 | 46.09 ± 0.00 | | 1433019.29 ± 0.00 | 1433016.42 ± 0.00 | 1433019.29 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d65536 | 6.80 ± 0.00 | 7.00 ± 0.00 | | | |

**Depth 131072**

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:---|---:|---:|---:|---:|---:|---:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d131072 | 45.81 ± 0.00 | | 2872297.51 ± 0.00 | 2872296.30 ± 0.00 | 2872297.51 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d131072 | 6.38 ± 0.00 | 7.00 ± 0.00 | | | |
During the test at a depth of 64k (which took over 20 minutes), llama-benchy did not report any results even though sglang had finished processing the request, so I had to abort the run. At ~46 t/s of prompt processing, a single 64k pass already takes roughly 24 minutes, and as it turned out, the benchmark was doing two of them.
Update: it turns out that llama-benchy applies the depth setting during the warmup phase as well, meaning it processed the 64k context twice: once for warming up and again for the actual test. Adding `--no-warmup --no-adapt-prompt` fixed this, so only one context pass is performed per test.
This entire process runs using the original model files without any need for conversion.

## Resource Usage

- GPU VRAM usage: 90815 MiB / 97887 MiB
- GPU power draw: ~100 W during prompt processing (PP), ~150 W during token generation (TG)
- RAM usage: 907.5 GB / 1152 GB
- CPU + motherboard power draw: ~400 W

## Key Takeaways

- Prompt-processing throughput actually rises slightly at larger context depths (39.8 t/s at depth 0 vs. ~46 t/s from 8k onward), while generation speed falls off only gradually (7.5 t/s down to 6.4 t/s at 128k).
- The original model files run as-is; no conversion is needed.
- Flags like `--no-warmup --no-adapt-prompt` are necessary for accurate benchmarking.
For more details, see my previous post: "I have DeepSeek V4 Pro at home".
Thanks for reading!


