Very happy with Qwen 3.5 122B output. But is slowness expected?


By AI Maestro May 17, 2026 1 min read

I’m running the 122-billion-parameter Qwen 3.5, specifically Qwen3.5-122B-A10B-Q5_K_M, on DGX Spark (128 GB contiguous memory).

I’m (very!) impressed with the general knowledge output. I can talk to it in multiple languages, and don’t feel the need to consult online frontier models for any encyclopaedic, general "handyman" or other day-to-day questions. My local Qwen seems sufficient.

That said, the output seems slow, around 19 tokens/s. Is this speed expected? I’m running the model from llama-server (latest compile as of yesterday), and the chat UI is Open WebUI. Are there any speed optimizations I can make in this setup without compromising the quality of the output?

nice -n -10 ./llama-server -m ~/modelki/Qwen3.5-122B-A10B-Q5_K_M-00001-of-00003.gguf --alias "Qwen3.5_122" --fit on -ngl 999 --min-p 0.01 --temp 0.6 --top-p 0.95 --ctx-size 262144 --port 8002 --jinja --host 0.0.0.0 --flash-attn on
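As a sanity check on whether 19 tokens/s is expected: decode speed for a model like this is usually memory-bandwidth-bound, so a rough ceiling can be estimated from the active parameter count and the quantization. The figures below are assumptions, not measurements — ~273 GB/s for DGX Spark's unified memory, ~10B active parameters per token for the A10B MoE configuration, and ~5.5 bits/weight as a Q5_K_M average.

```python
# Back-of-envelope decode-speed ceiling for a bandwidth-bound MoE model.
# All three inputs are assumptions (see lead-in), not measured values.

BANDWIDTH_GBS = 273.0    # assumed DGX Spark unified-memory bandwidth, GB/s
ACTIVE_PARAMS = 10e9     # ~10B active parameters per token (the "A10B" part)
BITS_PER_WEIGHT = 5.5    # rough Q5_K_M average

# Bytes of weights that must be read from memory per decoded token.
bytes_per_token = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8

# If every decode step reads the active weights once, bandwidth caps speed at:
ceiling_tok_s = BANDWIDTH_GBS * 1e9 / bytes_per_token

print(f"weights read per token: {bytes_per_token / 1e9:.1f} GB")
print(f"theoretical ceiling:    {ceiling_tok_s:.0f} tok/s")
```

Under these assumptions the ceiling comes out around 40 tok/s; real decode also pays for KV-cache reads, attention, and routing overhead, so ~19 tok/s is roughly half the ideal figure — slow-ish but not obviously broken.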
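For the optimization question, a few commonly suggested llama-server knobs are sketched below as a variant of the command above. The specific values are assumptions to experiment with, not tuned settings for this machine, and KV-cache quantization is a (usually small) quality trade-off.

```shell
# Illustrative variant of the launch command above; values are assumptions.
#
# --ctx-size 32768       : a 262144-token context allocates a very large KV
#                          cache; a smaller context reduces memory traffic
# --cache-type-k/v q8_0  : quantize the KV cache (small quality trade-off)
# --ubatch-size 2048     : mainly speeds up prompt processing, not generation

nice -n -10 ./llama-server \
  -m ~/modelki/Qwen3.5-122B-A10B-Q5_K_M-00001-of-00003.gguf \
  --alias "Qwen3.5_122" -ngl 999 --flash-attn on --jinja \
  --min-p 0.01 --temp 0.6 --top-p 0.95 \
  --ctx-size 32768 --cache-type-k q8_0 --cache-type-v q8_0 \
  --ubatch-size 2048 --host 0.0.0.0 --port 8002
```

Benchmarking each change in isolation (e.g. with llama-bench or a fixed prompt) is the safest way to see which, if any, actually moves the tokens/s number on this hardware.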

submitted by /u/breksyt


Originally published at reddit.com. Curated by AI Maestro.
