Users are seeking advice on tuning the performance of Qwen3.6 27B, a large language model they run behind agent harnesses such as Pi/Hermes.
- The main concern is balancing inference speed against precision and efficiency across different hardware configurations.
- One user reports roughly 300-500 tokens per second (tok/s) for prompt processing and roughly 22-30 tok/s for token generation at a 100k context window, using 40GB of VRAM alongside 4-channel DDR4 system RAM.
- The user asks whether this setup is already close to optimal, or whether different flags or variables in the `llama-server` command could improve it further.
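To put the reported throughput in perspective, here is a rough back-of-the-envelope sketch in Python. It converts the tok/s ranges from the post into wall-clock time. It assumes throughput stays constant over the whole request and that nothing is served from a prompt cache. The 500-token reply length is a hypothetical value chosen for illustration and does not come from the post.

```python
def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time needed to handle `tokens` at a steady throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


def latency_range(tokens: int, slow_rate: float, fast_rate: float) -> tuple[float, float]:
    """Return (best_case, worst_case) seconds for a throughput range."""
    return seconds_for(tokens, fast_rate), seconds_for(tokens, slow_rate)


PROMPT_TOKENS = 100_000  # a completely full 100k context, as in the post
REPLY_TOKENS = 500       # hypothetical reply length, not from the post

# Reported ranges: 300-500 tok/s prompt processing, 22-30 tok/s generation.
prefill_best, prefill_worst = latency_range(PROMPT_TOKENS, 300, 500)
decode_best, decode_worst = latency_range(REPLY_TOKENS, 22, 30)

print(f"prefill: {prefill_best:.0f}-{prefill_worst:.0f} s")
print(f"decode:  {decode_best:.0f}-{decode_worst:.0f} s")
```

Under these assumptions, processing a full 100k-token prompt from scratch takes about 200 to 333 seconds, while a 500-token reply takes about 17 to 23 seconds. By this estimate, prompt processing rather than generation dominates the wait whenever a long context has to be reprocessed.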
Originally published at reddit.com. Curated by AI Maestro.