DGX Spark agentic usage numbers

“`html

DGX Spark agentic usage numbers

Key Takeaways

The DGX Spark environment is now supporting an AGENT-like model with better performance.
QuantTrio/Qwen3.6-35B-A3B-AWQ: Single stream throughput is around 35.6 tps, yielding ~60 concurrent requests for a context length of 180,000 tokens.
RedHatAI/Qwen3.6-35B-A3B-NVFP4: The single-stream performance is 51 tps at a context length of 30k tokens with an output of 5000 tokens. For four concurrent requests, it achieves ~139 tps.

The RedHatAI/Qwen model also includes several additional features such as:

MTP Avg Draft acceptance rate: Approximately 78% for draft acceptances.
Tool calls: Currently broken, requiring investigation or alternative solutions.

The RedHatAI/Qwen model is running in a Docker container with specific configurations to optimize its performance. The key configuration options include:

- --quantization compressed-tensors
- --moe-backend flashinfer_cutlass
- --tensor-parallel-size 1
- --gpu-memory-utilization 0.87
- --max-model-len 180072
- --max-num-seqs 16
- --max-num-batched-tokens 16384
- --kv-cache-dtype fp8_e4m3

The script used to benchmark the model’s performance is provided below:

#!/bin/bash

# Functionality: Test for 4-way concurrent requests with a 30K-token prompt.
# Output:
# - Per-request details including time spent, tokens processed, and TPS (tokens per second).
# - Aggregate information like wall time and total completion.

TTFT=0
DECODE=0
USAGE=0

for i in {1..4}; do
  read FIRST LAST < /tmp/timing_${i}.txt
  TTFT=$(echo "scale=3; $FIRST - $START" | bc)
  DECODE=$(echo "scale=3; $LAST - $FIRST" | bc)
  
  # Extract usage information from the JSONL file.
  USAGE=$(jq -s 'map(select(.usage != null)) | last.usage // {}' /tmp/stream_${i}.jsonl 2>/dev/null)

  PROMPT=$(echo "$USAGE" | jq -r '.prompt_tokens // 0')
  COMP=$(echo "$USAGE" | jq -r '.completion_tokens // 0')

  TPS=$(echo "scale=2; if ($DECODE > 0) $COMP / $DECODE else 0" | bc -l 2>/dev/null || echo "0")

  TOTAL_COMP=$((TOTAL_COMP + COMP))
done

# Print aggregate results.
printf "Wall time: %ss\n" "$ELAPSED"
printf "Total completion: %s tokens\n" "$TOTAL_COMP"
printf "Aggregate TPS: %.2f\n" "$(echo "scale=2; $TOTAL_COMP / $ELAPSED" | bc)"

This script runs four parallel requests and collects timing information for each request. It then calculates the wall time, total completion in tokens, and aggregate throughput (TPS).

“`

Source Read original →

Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.

DGX Spark agentic usage numbers

Key Takeaways

Empowering Businesses with AI — Smart Tools, Smarter Business Decisions.

follow us

Popular Tag

Popular Post

Alphabet plans to raise…

Nvidia chases $200B CPU…

Kaximia on channeling aggression,…