Optimizing speed & quality on Qwen3.6 27b

Users are seeking advice on optimizing the performance of Qwen3.6 27B, a large language model, when it is deployed behind agentic harnesses such as Pi/Hermes.

The primary concern is keeping inference fast without giving up precision or efficiency across different hardware configurations.
One user reports roughly 300-500 tokens per second (tok/s) for prompt processing and roughly 22-30 tok/s for token generation at a 100k-token context window, using 40 GB of VRAM alongside 4-channel DDR4 system RAM.
The user asks whether this setup is already close to optimal or whether different flags or variables in the `llama-server` command could improve it further.
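To put the reported throughput in perspective, the figures can be turned into rough wall-clock times. The sketch below is plain arithmetic and not a benchmark: it assumes a full 100k-token prompt is processed from scratch with no cache reuse, that both rates stay constant for the whole request, and that the reply is 500 tokens long (a hypothetical length chosen only for illustration). The function names are made up for this example.

```python
def seconds_for(tokens: int, tok_per_s: float) -> float:
    """Seconds needed to handle `tokens` at a steady rate of `tok_per_s`."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return tokens / tok_per_s


def latency_range(prompt_tokens, output_tokens, pp_range, tg_range):
    """Return (best, worst) end-to-end seconds for one request.

    pp_range and tg_range are (low, high) tok/s pairs for prompt
    processing and token generation respectively.
    """
    pp_low, pp_high = pp_range
    tg_low, tg_high = tg_range
    # Best case pairs the fastest rates, worst case the slowest.
    best = seconds_for(prompt_tokens, pp_high) + seconds_for(output_tokens, tg_high)
    worst = seconds_for(prompt_tokens, pp_low) + seconds_for(output_tokens, tg_low)
    return best, worst


if __name__ == "__main__":
    # Rates come from the post; the 500-token reply is an assumption.
    best, worst = latency_range(100_000, 500, (300, 500), (22, 30))
    print(f"best case:  {best:.0f} s")
    print(f"worst case: {worst:.0f} s")
```

Under these assumptions, prompt processing dominates: a cold 100k-token prompt alone takes between 200 s and about 333 s, while the 500-token reply adds well under half a minute. That suggests that, for long-context agent workloads, prompt-processing speed and prompt caching may matter more to perceived latency than generation speed.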
