Benchmarking the new b9200 update: Optimizing Qwen 3.6 27B mtp for Hermes Agent on a single RTX 3090

Optimizing Qwen 3.6 27B mtp for Hermes Agent on a Single RTX 3090

Running as the backend for the Hermes agent, Qwen 3.6 27B with MTP performs markedly better on the latest llama.cpp release (b9200) once a handful of server flags are changed.

Hardware/Software Setup

GPU: RTX 3090 (24GB VRAM), undervolted to keep temperatures down
CPU: Ryzen 7 5700G with 64GB RAM
Model: Qwen3.6-27B-IQ4_NL.gguf
Server: llama-server compiled from source (commit #23234), running the latest b9200 update
Agent: Hermes agent capped at a maximum of 64K context to limit spillover

Initial Performance Issues

The standard recommended MTP flags (--spec-draft-n-max 6 and --spec-draft-p-min 0.75) performed poorly here. Agent workflows are rigid by nature, and these settings gave weak results in agentic loops:

Prompt processing: ~560 t/s
Token generation: 17.06 t/s on short tasks and ~9.5 t/s during heavy context reasoning loops
Draft acceptance rate: hovered around 22–26%

Making the Optimal Configuration

To address these issues, we need to make some adjustments:

Drop the lookahead to 3 tokens (--spec-draft-n-max 3)
Remove the p-min threshold (--spec-draft-p-min)
Limit parallel slots to 1
Enable the --flash-attn option
Use q8_0 for both cache types (--cache-type-k and --cache-type-v)
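
As a rough sketch, the five adjustments above can be collected into a single llama-server command line. The helper name build_server_args is hypothetical and the model path is a placeholder. The --spec-draft-* flag names are taken as quoted in this article. Setting the p-min threshold to 0 is an assumption about what "remove" means here, since simply omitting the flag may fall back to a nonzero default. Check llama-server --help on your own build before relying on any of this.

```python
import shlex


def build_server_args(model_path: str) -> list[str]:
    """Hypothetical helper: assemble llama-server arguments for the tuned setup."""
    return [
        "llama-server",
        "-m", model_path,
        # Lookahead reduced from 6 to 3 draft tokens (flag name as quoted in the article).
        "--spec-draft-n-max", "3",
        # Assumption: "removing" the threshold is expressed as 0 rather than omitting the flag.
        "--spec-draft-p-min", "0",
        # One slot only, so the whole context budget goes to a single agent session.
        "--parallel", "1",
        # Recent builds take on/off/auto here; older builds treat it as a bare switch.
        "--flash-attn", "on",
        # Quantize both halves of the KV cache to q8_0.
        "--cache-type-k", "q8_0",
        "--cache-type-v", "q8_0",
    ]


args = build_server_args("Qwen3.6-27B-IQ4_NL.gguf")
print(shlex.join(args))
```

Building the arguments as a list rather than a single string keeps each flag and its value paired, which makes it easier to toggle one setting at a time when re-running a benchmark.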

Results After Optimizations

The optimized configuration, combined with the latest b9200 update from llama.cpp, resulted in substantial improvements:

Prompt processing: jumped from ~560 t/s to ~991 t/s, allowing the RTX 3090 to efficiently handle large system prompts.
Token generation: peaked at 27.44 t/s on short tasks (up from 17.06 t/s) and stabilized at 13.69 t/s (up from ~9.5 t/s) during heavy context loops, where the agent switches between tool calls and main memory operations.
Draft acceptance rate: held at ~70% on standard turns, up from 22–26%.
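
To put the before and after figures side by side, here is a small sketch that computes the speedup for each throughput metric. The numbers are the ones quoted in this article. Several of them are approximate (marked with ~ in the text), so the ratios are approximate too.

```python
# (before, after) throughput in tokens per second, as quoted in the article.
results = {
    "prompt processing": (560.0, 991.0),
    "generation, short tasks": (17.06, 27.44),
    "generation, heavy context": (9.5, 13.69),
}


def speedup(before: float, after: float) -> float:
    """Ratio of tuned throughput to baseline throughput."""
    return after / before


speedups = {name: round(speedup(b, a), 2) for name, (b, a) in results.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```

This works out to roughly 1.77x for prompt processing, 1.61x for short-task generation, and 1.44x for heavy-context generation.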

With these changes, the RTX 3090 delivers a clear performance boost despite the constraints of a 24GB VRAM card. Recent llama.cpp updates, paired with tuned flags, can yield substantial gains without compromising the model’s integrity or usability.

The optimized flags and the b9200 update significantly improved prompt processing and token generation for the Hermes agent on a single RTX 3090.
Limiting parallel slots to one and shortening the lookahead led to better performance metrics, especially in multi-turn agentic workflows.
The latest llama.cpp release (b9200) provided the improvements that made these gains in throughput and efficiency possible.

AI Maestro is an independent British AI publication. We test what we recommend, and we write it the way we would say it.
