I’m running llama.cpp using this docker container: https://github.com/mixa3607/ML-gfx906 (it’s just a lot easier than building from source, which I was doing previously). The MI60 (or MI50) is a real pain in the behind to get working with Ubuntu 24.04. That container has it up in minutes, a real timesaver.

Anyway, my personal use case for LLMs is primarily for Frigate to review camera footage and cut down on “notification noise” (it’s like having a human review footage to determine what I need to know about and what I don’t). The other use is for HomeAssistant. I ditched all my Alexa devices and replaced them with this (it’s amazing).

I wanted to be sure I was getting the absolute most out of my hardware for speed and efficiency. I had Claude write me a script that would do batch testing of two models, Gemma 4 26B.A4B Q4_1 and Qwen3 35B.A3B Q4_0, against three KV cache pre-fill depths (0, 1,000, and 6,000 tokens) with a fixed 512-token prompt and 128 generation tokens per run, each repeated 5 times internally by llama-bench for statistical stability.

The knobs turned were: flash attention on vs. off; KV cache quantisation at three levels (f16 default, q8_0, and q4_0); ubatch size at four values (512, 2048, 4096, and 8192); logical batch size at two values (2048 and 8192); CPU thread count at three values (8, 12, and 24); and two ROCm-specific environment variables.
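The original script isn't shown, so here is a rough sketch of how a sweep like this could be laid out in Python. It only builds `llama-bench` command lines and runs nothing. The baseline values, the one-knob-at-a-time strategy, the choice to quantise K and V caches together, and the model file names are all assumptions on my part. The two ROCm environment variables are left out because the post doesn't name them. The flags used (`-p`, `-n`, `-d`, `-r`, `-fa`, `-ctk`, `-ctv`, `-ub`, `-b`, `-t`) are the ones I believe current `llama-bench` builds accept; check `llama-bench --help` in your container before relying on them.

```python
import shlex

# Knob values come from the post; the baseline picked here is an assumption.
BASELINE = {"fa": 1, "ctk": "f16", "ub": 512, "b": 2048, "t": 8}
SWEEPS = {
    "fa": [0, 1],                    # flash attention off / on
    "ctk": ["f16", "q8_0", "q4_0"],  # KV cache type (applied to both K and V)
    "ub": [512, 2048, 4096, 8192],   # ubatch size
    "b": [2048, 8192],               # logical batch size
    "t": [8, 12, 24],                # CPU threads
}
DEPTHS = "0,1000,6000"  # KV cache pre-fill depths in tokens


def one_at_a_time(baseline, sweeps):
    """Return the baseline config plus one config per non-baseline knob value."""
    configs = [dict(baseline)]
    for knob, values in sweeps.items():
        for value in values:
            if value != baseline[knob]:
                configs.append({**baseline, knob: value})
    return configs


def build_command(model_path, cfg):
    """Build a llama-bench argument list for one configuration."""
    return [
        "llama-bench",
        "-m", model_path,
        "-p", "512",   # fixed prompt length
        "-n", "128",   # generated tokens per run
        "-d", DEPTHS,
        "-r", "5",     # repetitions handled by llama-bench itself
        "-fa", str(cfg["fa"]),
        "-ctk", cfg["ctk"],
        "-ctv", cfg["ctk"],
        "-ub", str(cfg["ub"]),
        "-b", str(cfg["b"]),
        "-t", str(cfg["t"]),
    ]


if __name__ == "__main__":
    # Hypothetical file names; substitute your own GGUF paths.
    for model in ["gemma-q4_1.gguf", "qwen3-q4_0.gguf"]:
        for cfg in one_at_a_time(BASELINE, SWEEPS):
            print(shlex.join(build_command(model, cfg)))
```

Varying one knob at a time keeps the run count small (10 configurations per model with these values) at the cost of missing interactions between knobs; a full grid would be 2 × 3 × 4 × 2 × 3 = 144 configurations per model.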
Key Takeaways
- I found the optimal llama.cpp settings for my use case: LLMs for home monitoring (Frigate) and HomeAssistant.
- The testing comprised 30 runs across eight sections, varying flash attention, KV cache quantisation level, batch size, CPU thread count, and ROCm environment variables.
- The results showed significant improvements in the speed of both HomeAssistant (reduced to less than 1.2 seconds for voice commands) and Frigate (less than 18 seconds for review summaries).
Originally published at reddit.com. Curated by AI Maestro.