# DeepSeek V4 Pro Update
A few days ago, I shared my DeepSeek V4 Pro setup at home. Here’s an update:
## Running DeepSeek V4 in ktransformers
I managed to run this model with the ktransformers framework, specifically using sglang together with kt-kernel. I followed the tutorial for DeepSeek V4 Flash and adjusted some options (NUMA, core counts) for my hardware: an EPYC 9374F CPU and an RTX Pro 6000 Max-Q GPU.
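For reference, a minimal sketch of what such a launch looks like is below. Only standard sglang arguments are shown; the kt-kernel-specific flags (NUMA binding, CPU core counts) come from the DeepSeek V4 Flash tutorial and are hardware-specific, so treat this as an assumption rather than the exact command used.

```bash
# Minimal launch sketch, not the exact command: kt-kernel/NUMA/core options
# come from the DeepSeek V4 Flash tutorial and depend on the hardware.
# numactl spreads allocations across NUMA nodes, which matters on EPYC boards.
numactl --interleave=all \
  python -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V4-Pro \
    --host 0.0.0.0 \
    --port 30000 \
    --trust-remote-code
```

Once up, the server exposes an OpenAI-compatible endpoint on the given port, which is what the benchmark below talks to.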
## Performance Testing
To check how well the model performs in this environment, I used llama-benchy. The invocation and the results for each context depth are below.
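Roughly, the sweep looked like the following sketch. Apart from `--no-warmup --no-adapt-prompt`, which are discussed further down, the flag names here are illustrative placeholders of mine rather than llama-benchy's documented options.

```bash
# Hypothetical invocation: only --no-warmup and --no-adapt-prompt are taken
# from this post; the remaining flag names are illustrative placeholders.
llama-benchy \
  --base-url http://localhost:30000/v1 \
  --model deepseek-ai/DeepSeek-V4-Pro \
  --pp 512 --tg 32 \
  --depth 0,2048,4096,8192,16384,32768,65536,131072 \
  --no-warmup --no-adapt-prompt
```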
**Depth 0**

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:---|---:|---:|---:|---:|---:|---:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 | 39.76 ± 0.00 | | 12878.44 ± 0.00 | 12877.59 ± 0.00 | 12878.44 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 | 7.54 ± 0.00 | 8.00 ± 0.00 | | | |

**Depth 2048**

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:---|---:|---:|---:|---:|---:|---:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d2048 | 45.13 ± 0.00 | | 56726.85 ± 0.00 | 56725.93 ± 0.00 | 56726.85 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d2048 | 7.32 ± 0.00 | 8.00 ± 0.00 | | | |

**Depth 4096**

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:---|---:|---:|---:|---:|---:|---:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d4096 | 45.75 ± 0.00 | | 100729.28 ± 0.00 | 100728.46 ± 0.00 | 100729.28 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d4096 | 7.29 ± 0.00 | 8.00 ± 0.00 | | | |

**Depth 8192**

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:---|---:|---:|---:|---:|---:|---:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d8192 | 45.97 ± 0.00 | | 189354.94 ± 0.00 | 189354.03 ± 0.00 | 189354.94 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d8192 | 7.25 ± 0.00 | 8.00 ± 0.00 | | | |

**Depth 16384**

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:---|---:|---:|---:|---:|---:|---:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d16384 | 46.16 ± 0.00 | | 365997.22 ± 0.00 | 365996.26 ± 0.00 | 365997.22 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d16384 | 7.17 ± 0.00 | 8.00 ± 0.00 | | | |

**Depth 32768**

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:---|---:|---:|---:|---:|---:|---:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d32768 | 46.18 ± 0.00 | | 720687.13 ± 0.00 | 720685.67 ± 0.00 | 720687.13 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d32768 | 7.07 ± 0.00 | 8.00 ± 0.00 | | | |

**Depth 65536**

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:---|---:|---:|---:|---:|---:|---:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d65536 | 46.09 ± 0.00 | | 1433019.29 ± 0.00 | 1433016.42 ± 0.00 | 1433019.29 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d65536 | 6.80 ± 0.00 | 7.00 ± 0.00 | | | |

**Depth 131072**

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:---|---:|---:|---:|---:|---:|---:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d131072 | 45.81 ± 0.00 | | 2872297.51 ± 0.00 | 2872296.30 ± 0.00 | 2872297.51 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d131072 | 6.38 ± 0.00 | 7.00 ± 0.00 | | | |
During the test at a depth of 64k (which took over 20 minutes), llama-benchy did not report any results even though sglang had finished processing the request, so I had to abort the run. At ~46 t/s of prompt processing, a single 64k pass already takes roughly 24 minutes, and as it turned out, the benchmark was doing two of them.
Update: it turns out that llama-benchy applies the depth setting during the warmup phase as well, meaning it processed the 64k context twice: once for warming up and again for the actual test. Adding `--no-warmup --no-adapt-prompt` fixed this, so only one context pass is performed per test.
This entire process runs using the original model files without any need for conversion.

## Resource Usage

- GPU VRAM usage: 90815 MiB / 97887 MiB
- GPU power draw: ~100 W during prompt processing (PP), ~150 W during token generation (TG)
- RAM usage: 907.5 GB / 1152 GB
- CPU + motherboard power draw: ~400 W

## Key Takeaways

- Prompt-processing throughput actually rises slightly at larger context depths (39.8 t/s at depth 0 vs. ~46 t/s from 8k onward), while generation speed falls off only gradually (7.5 t/s down to 6.4 t/s at 128k).
- The original model files run as-is; no conversion is needed.
- Flags like `--no-warmup --no-adapt-prompt` are necessary for accurate benchmarking.
For more details, see my previous post: "I have DeepSeek V4 Pro at home".
Thanks for reading!


