Did 30 runs of llama-bench to find optimal settings for my use case (Frigate and HomeAssistant) on my MI60 GPU with 32 GB of VRAM – two models tested, Gemma4 and Qwen3.6 – figured I’d share in case it helps anyone else


By AI Maestro May 23, 2026 2 min read

I’m running llama.cpp using this Docker container: https://github.com/mixa3607/ML-gfx906 (it’s just a lot easier than building from source, which I was doing previously). The MI60 (or MI50) is a real pain in the behind to get working on Ubuntu 24.04. That container has it up in minutes, a real timesaver.

Anyway, my personal use case for LLMs is primarily Frigate, which uses one to review camera footage and cut down on “notification noise” (it’s like having a human review footage to determine what I need to know about and what I don’t). The other use is HomeAssistant. I ditched all my Alexa devices and replaced them with this (it’s amazing).

I wanted to be sure I was getting the absolute most out of my hardware for speed and efficiency, so I had Claude write me a script to batch-test two models, Gemma 4 26B.A4B Q4_1 and Qwen3 35B.A3B Q4_0, at three KV cache pre-fill depths (0, 1,000, and 6,000 tokens). Every run used a fixed 512-token prompt and 128 generated tokens, and llama-bench repeated each test 5 times internally for statistical stability.

The knobs turned were:

  • Flash attention: on vs. off
  • KV cache quantisation: three levels (f16 default, q8_0, and q4_0)
  • ubatch size: four values (512, 2048, 4096, and 8192)
  • Logical batch size: two values (2048 and 8192)
  • CPU thread count: three values (8, 12, and 24)
  • GGML_ROCM_FORCE_MMQ: 1 vs. 0, switching between quantised matmul kernels and rocBLAS GEMM
  • HSA_ENABLE_SDMA: enabled vs. disabled, switching between DMA and blit-copy memory transfers

Sections 1 through 7 each varied exactly one parameter while holding all others at the production baseline, so any performance change could be attributed cleanly to a single cause. Section 8 then stacked three combinations of the most promising individual results (SDMA disabled with q8_0 KV, SDMA disabled with q4_0 KV, and SDMA disabled plus MMQ off plus q8_0 KV) to see whether the gains compounded or cancelled when applied together.

The production llama-server container was stopped before each run to ensure exclusive GPU access. Each model configuration was launched as a fresh throwaway container from the same image used in production, with identical device mappings, volume mounts, and environment variables.
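The sweep design described above (one knob changed at a time, then a few stacked combinations) can be sketched in Python. This is not the script Claude wrote for the post, which isn't shown. The baseline values, the model path, and the helper names `configs` and `build_command` are assumptions for illustration, since the post doesn't list the production baseline. The llama-bench flags used (`-p`, `-n`, `-r`, `-d`, `-fa`, `-ctk`, `-ctv`, `-ub`, `-b`, `-t`) are the ones I'd expect in a recent build; check `llama-bench --help` on your image. The two environment variable names are taken from the post as written.

```python
import shlex

# Hypothetical baseline: the post does not list the production values.
BASELINE = {
    "fa": 1, "kv": "f16", "ub": 512, "b": 2048, "t": 8,
    "GGML_ROCM_FORCE_MMQ": "1", "HSA_ENABLE_SDMA": "1",
}

# Values per knob, as listed in the post.
SWEEP = {
    "fa": [0, 1],
    "kv": ["f16", "q8_0", "q4_0"],
    "ub": [512, 2048, 4096, 8192],
    "b": [2048, 8192],
    "t": [8, 12, 24],
    "GGML_ROCM_FORCE_MMQ": ["1", "0"],
    "HSA_ENABLE_SDMA": ["1", "0"],
}

# The three stacked combinations described for section 8.
COMBOS = [
    {"HSA_ENABLE_SDMA": "0", "kv": "q8_0"},
    {"HSA_ENABLE_SDMA": "0", "kv": "q4_0"},
    {"HSA_ENABLE_SDMA": "0", "GGML_ROCM_FORCE_MMQ": "0", "kv": "q8_0"},
]

ENV_KEYS = ("GGML_ROCM_FORCE_MMQ", "HSA_ENABLE_SDMA")


def configs():
    """Yield the baseline, then one-knob-at-a-time variants, then the stacks."""
    yield dict(BASELINE)
    for knob, values in SWEEP.items():
        for value in values:
            if value != BASELINE[knob]:
                yield {**BASELINE, knob: value}
    for combo in COMBOS:
        yield {**BASELINE, **combo}


def build_command(model_path, cfg):
    """Return (env, argv) for one llama-bench run of the given config."""
    env = {key: cfg[key] for key in ENV_KEYS}
    argv = [
        "llama-bench", "-m", model_path,
        "-p", "512", "-n", "128", "-r", "5",  # fixed prompt, generation, repeats
        "-d", "0,1000,6000",                  # KV cache pre-fill depths
        "-fa", str(cfg["fa"]),
        "-ctk", cfg["kv"], "-ctv", cfg["kv"],  # same quantisation for K and V
        "-ub", str(cfg["ub"]), "-b", str(cfg["b"]), "-t", str(cfg["t"]),
    ]
    return env, argv


if __name__ == "__main__":
    # Placeholder model path; print each run as a shell-style command line.
    for cfg in configs():
        env, argv = build_command("/models/example.gguf", cfg)
        prefix = " ".join(f"{k}={v}" for k, v in env.items())
        print(prefix, shlex.join(argv))
```

With these value lists the generator yields 15 configurations per model: one baseline, 11 single-knob variants, and 3 stacks. Whether that matches how the original script counted its 30 runs is a guess. In a containerised setup like the one in the post, each printed command would be run inside a fresh container, with the environment variables passed through to it.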

Benchmark sweep script executed 30 total runs across 8 sections

Key Takeaways

  • I found optimal settings for my use case: LLMs for home monitoring (Frigate) and HomeAssistant.
  • The testing comprised 30 runs across eight sections, varying flash attention, KV cache quantisation level, batch and ubatch size, CPU thread count, and two ROCm environment variables.
  • The results showed significant speed improvements for both HomeAssistant (voice commands in under 1.2 seconds) and Frigate (review summaries in under 18 seconds).

Originally published at reddit.com. Curated by AI Maestro.
