“`html
Optimizing Qwen 3.6 27B MTP for Hermes Agent on a Single RTX 3090
The updated version of the model, running as the backend for the Hermes agent, performs significantly better with a few specific flags and the latest llama.cpp release.
Hardware/Software Setup
- RTX 3090 (24GB VRAM) — currently undervolted to keep temperatures down
- CPU: Ryzen 7 5700G / 64GB RAM
- Qwen3.6-27B-IQ4_NL.gguf model file
- llama-server (compiled from source, commit #23234) — running the latest b9200 update
- Hermes agent capped at a 64K context window to limit spillover
Initial Performance Issues
The standard recommended MTP flags (--spec-draft-n-max 6 and --spec-draft-p-min 0.75) performed poorly. The agent's workflows are rigid by nature, and these settings gave weak results in agentic loops:
- Prompt processing: ~560 t/s
- Token generation: 17.06 t/s on short tasks and ~9.5 t/s during heavy context reasoning loops
- Draft acceptance rate: hovered around 22–26%
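A rough way to see why a long lookahead hurts at low acceptance is the textbook speculative-decoding estimate of tokens produced per verification pass. The sketch below assumes each drafted token is accepted independently with probability alpha, which is a simplification and not necessarily how llama.cpp computes its reported acceptance rate; the function name is made up for illustration.

```python
def expected_tokens_per_step(alpha: float, n_draft: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes (hypothetically) that each drafted token is accepted
    independently with probability `alpha`. The result counts the
    accepted drafts plus the one token the main model always yields.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    if n_draft < 0:
        raise ValueError("n_draft must be non-negative")
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^n_draft
    return (1.0 - alpha ** (n_draft + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # Acceptance figures taken from the post; treating them as
    # per-token probabilities is an assumption of this sketch.
    for alpha, n_draft in [(0.24, 6), (0.70, 3)]:
        tokens = expected_tokens_per_step(alpha, n_draft)
        print(f"alpha={alpha:.2f}, n_draft={n_draft}: {tokens:.2f} tokens/pass")
```

Under this simplified model, six drafted tokens at about 24% acceptance yield only around 1.3 tokens per pass, so most of the drafting work is thrown away.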
Finding the Optimal Configuration
To address these issues, we need to make some adjustments:
- Drop the lookahead to 3 tokens
- Remove the p-min threshold
- Limit parallel slots to 1
- Enable the --flash-attn option
- Use q8_0 for both cache types (--cache-type-k and --cache-type-v)
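The adjustments above could be assembled into a launch command roughly as follows. This is a sketch, not the author's exact invocation: the --spec-draft-n-max spelling is taken from the post and may differ between llama.cpp builds, 64K is assumed to mean 65536 tokens, "removing the p-min threshold" is interpreted as omitting --spec-draft-p-min, and the helper name is hypothetical.

```python
import shlex


def build_llama_server_args(model_path: str, ctx_size: int = 65536) -> list[str]:
    """Return an argv list for llama-server with the tuned settings.

    Hypothetical helper; flag spellings for the speculative options
    follow the post and should be checked against `llama-server --help`.
    """
    return [
        "llama-server",
        "-m", model_path,
        "-c", str(ctx_size),          # assumed: 64K context = 65536 tokens
        "--spec-draft-n-max", "3",    # lookahead dropped from 6 to 3
        # --spec-draft-p-min is intentionally omitted (threshold removed)
        "--parallel", "1",            # a single slot
        "--flash-attn", "on",         # older builds take the bare flag instead
        "--cache-type-k", "q8_0",
        "--cache-type-v", "q8_0",
    ]


if __name__ == "__main__":
    args = build_llama_server_args("Qwen3.6-27B-IQ4_NL.gguf")
    # Print a copy-pasteable shell command rather than launching anything.
    print(shlex.join(args))
```

Printing the command instead of executing it keeps the sketch safe to run on a machine without llama.cpp installed; pass the list to subprocess.run to actually start the server.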
Results After Optimizations
The optimized configuration, combined with the latest b9200 update from llama.cpp, resulted in substantial improvements:
- Prompt processing: jumped to ~991 t/s, allowing the RTX 3090 to efficiently handle large system prompts.
- Token generation: hit a peak of 27.44 t/s on short tasks and stabilized at 13.69 t/s during heavy context loops, where the agent switches between tool calls and main memory operations.
- Draft acceptance rate: maintained an impressive ~70% on standard turns.
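To put the before and after figures side by side, the speedups can be computed directly from the numbers quoted in this post. The snippet is only arithmetic on those reported values; the dictionary and function names are illustrative.

```python
def speedup(before: float, after: float) -> float:
    """Ratio of the optimized throughput to the baseline throughput."""
    if before <= 0:
        raise ValueError("baseline throughput must be positive")
    return after / before


# (baseline t/s, optimized t/s) as reported in the post
BENCHMARKS = {
    "prompt processing": (560.0, 991.0),
    "generation, short tasks": (17.06, 27.44),
    "generation, heavy context": (9.5, 13.69),
}

if __name__ == "__main__":
    for name, (before, after) in BENCHMARKS.items():
        print(f"{name}: {speedup(before, after):.2f}x")
```

That works out to roughly 1.77x for prompt processing, 1.61x for short-task generation, and 1.44x for heavy-context generation.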
With these optimizations, the RTX 3090 delivers roughly 1.4x to 1.8x its earlier throughput, even within the limits of a 24GB VRAM card. Recent llama.cpp updates, paired with retuned flags, can yield substantial gains without degrading the model's output or usability.
- The optimized flags and the b9200 update significantly improved prompt processing and token generation for the Hermes agent on a single RTX 3090.
- Limiting parallel slots to one and shortening the lookahead improved throughput and draft acceptance, especially in multi-turn agentic workflows.
- The latest llama.cpp release (b9200) provided improvements that helped these settings reach their gains in throughput and efficiency.
“`
Originally published at reddit.com. Curated by AI Maestro.




