“`html
Key Takeaways
- The DGX Spark environment is now supporting an AGENT-like model with better performance.
- QuantTrio/Qwen3.6-35B-A3B-AWQ: Single stream throughput is around 35.6 tps, yielding ~60 concurrent requests for a context length of 180,000 tokens.
- RedHatAI/Qwen3.6-35B-A3B-NVFP4: The single-stream performance is 51 tps at a context length of 30k tokens with an output of 5000 tokens. For four concurrent requests, it achieves ~139 tps.
The RedHatAI/Qwen model also includes several additional features such as:
- MTP Avg Draft acceptance rate: Approximately 78% for draft acceptances.
- Tool calls: Currently broken, requiring investigation or alternative solutions.
The RedHatAI/Qwen model is running in a Docker container with specific configurations to optimize its performance. The key configuration options include:
- --quantization compressed-tensors - --moe-backend flashinfer_cutlass - --tensor-parallel-size 1 - --gpu-memory-utilization 0.87 - --max-model-len 180072 - --max-num-seqs 16 - --max-num-batched-tokens 16384 - --kv-cache-dtype fp8_e4m3
The script used to benchmark the model’s performance is provided below:
#!/bin/bash
# Functionality: Test for 4-way concurrent requests with a 30K-token prompt.
# Output:
# - Per-request details including time spent, tokens processed, and TPS (tokens per second).
# - Aggregate information like wall time and total completion.
TTFT=0
DECODE=0
USAGE=0
for i in {1..4}; do
read FIRST LAST < /tmp/timing_${i}.txt
TTFT=$(echo "scale=3; $FIRST - $START" | bc)
DECODE=$(echo "scale=3; $LAST - $FIRST" | bc)
# Extract usage information from the JSONL file.
USAGE=$(jq -s 'map(select(.usage != null)) | last.usage // {}' /tmp/stream_${i}.jsonl 2>/dev/null)
PROMPT=$(echo "$USAGE" | jq -r '.prompt_tokens // 0')
COMP=$(echo "$USAGE" | jq -r '.completion_tokens // 0')
TPS=$(echo "scale=2; if ($DECODE > 0) $COMP / $DECODE else 0" | bc -l 2>/dev/null || echo "0")
TOTAL_COMP=$((TOTAL_COMP + COMP))
done
# Print aggregate results.
printf "Wall time: %ss\n" "$ELAPSED"
printf "Total completion: %s tokens\n" "$TOTAL_COMP"
printf "Aggregate TPS: %.2f\n" "$(echo "scale=2; $TOTAL_COMP / $ELAPSED" | bc)"
This script runs four parallel requests and collects timing information for each request. It then calculates the wall time, total completion in tokens, and aggregate throughput (TPS).
“`
Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.




