Running the Same Models Across Different Hardware
I wanted to see how the same set of models performs across different hardware configurations. This article details a benchmarking study in which I ran that set on three different rigs: a Strix Halo, an RTX 3090, and an RTX 5070.
Dataset
The dataset consisted of 55 runs across five backends (rocm, vulkan, cpu, cuda, vllm-cuda), with models ranging from 0.35B parameters (LFM2.5) to 35B-A3B (Qwen3.5 MoE). The workloads covered short-prompt chat, long-context RAG, code generation with a long output, and an agent-style workload at concurrency 1 and 4.
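To make the shape of these runs concrete, here is a minimal sketch of how a workload matrix like this could be expressed. The names, token budgets, and concurrency values are illustrative assumptions, not the exact parameters of the harness behind these numbers.

```python
# Hypothetical sketch of the workload/backend matrix described above.
# Token budgets and names are illustrative, not the study's actual values.
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    prompt_tokens: int       # approximate prompt length fed to the model
    max_output_tokens: int   # decode budget per request
    concurrency: int         # simultaneous requests

WORKLOADS = [
    Workload("short-prompt-chat", prompt_tokens=256,   max_output_tokens=512,  concurrency=1),
    Workload("long-context-rag",  prompt_tokens=16384, max_output_tokens=512,  concurrency=1),
    Workload("code-gen-long-out", prompt_tokens=1024,  max_output_tokens=4096, concurrency=1),
    Workload("agent",             prompt_tokens=2048,  max_output_tokens=1024, concurrency=1),
    Workload("agent-parallel",    prompt_tokens=2048,  max_output_tokens=1024, concurrency=4),
]

BACKENDS = ["rocm", "vulkan", "cpu", "cuda", "vllm-cuda"]
```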
Key Findings
- The RTX 5070 (Vulkan) outperformed the RTX 3090 (CUDA) on every model that fit within its 12 GiB of VRAM, with the notable exception of long-context tasks such as RAG.
- On models requiring more than 12 GiB of VRAM, such as Qwen3.6-27B for chat, the RTX 3090 outperformed both Strix ROCm and Strix Vulkan by significant margins, often more than 2x in throughput (a rough fit check is sketched after this list).
- The gap between Strix Vulkan and Strix ROCm was generally small but consistent, with Vulkan often around 5% faster. A notable exception was Gemma-4-26B-A4B, where Vulkan beat ROCm by about 10%, which may come down to the state of kernel optimization for the gfx1151 architecture.
- Quantization cost varied significantly across models. Qwen3.6-27B chat showed the widest spread, with throughput differing by roughly 14% at Q2 and more than 28% at Q6 relative to the Q4 baseline.
- Reasoning models often look slower than they really are when only the visible output token rate is measured. Qwen3.5/3.6, for example, writes to a hidden reasoning_content channel that accounts for a large share of the total decode rate but never shows up in the user-facing output (see the measurement sketch after this list).
- The CPU on Strix provided usable performance for batch workloads like those encountered in coding assistant tasks, though it remained slower than its GPU counterparts.
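As a quick sanity check on the 12 GiB cutoff mentioned above, here is a rough, weight-only memory estimate. It ignores KV cache, activations, and runtime overhead, and the 4.5 bits/weight figure is just an assumed value for a Q4-style quantization.

```python
def est_weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-only memory estimate in GiB (ignores KV cache and overhead)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

# A hypothetical 27B dense model at ~4.5 bits/weight (Q4-style quant):
print(f"{est_weight_gib(27, 4.5):.1f} GiB")  # ~14.1 GiB, already over a 12 GiB card
```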
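To see the hidden reasoning channel in practice, one option is to count streamed deltas on both fields. Below is a minimal sketch that assumes an OpenAI-compatible server (vLLM-style) which attaches reasoning_content as an extra field on each streamed delta; the base URL, model id, and the one-chunk-per-token approximation are all assumptions for illustration, not the setup used in the study.

```python
# Minimal sketch: rough decode rate including hidden reasoning tokens.
# Assumes an OpenAI-compatible endpoint that streams an extra
# `reasoning_content` field on each delta (not part of the official schema).
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # hypothetical local server

visible = 0
hidden = 0
start = time.monotonic()

stream = client.chat.completions.create(
    model="qwen3-27b",  # hypothetical model id
    messages=[{"role": "user", "content": "Explain paged attention briefly."}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if delta.content:
        visible += 1
    # Fall back to None if the server doesn't expose reasoning_content.
    if getattr(delta, "reasoning_content", None):
        hidden += 1

elapsed = time.monotonic() - start  # end-to-end, so prefill time is included
print(f"visible-only rate: {visible / elapsed:.1f} chunks/s")
print(f"total decode rate: {(visible + hidden) / elapsed:.1f} chunks/s")
```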
For more details and the methodology behind these benchmarks, you can visit my public page: https://calebcoffie.com/benchmarks. The full write-up is available here: https://calebcoffie.com/blog/introducing-open-weight-model-benchmarks.
Additionally, I haven't yet run vLLM on Strix due to a timeout issue with its backend-readiness checks. Similarly, the 70-130B models that only the Strix can run have not been benchmarked yet.
This study aims to provide an additional data point for users interested in comparing various hardware configurations and their performance with different AI models.




