CPU is just a secondhand 10900X. I'm using 128k context with an unquantized KV cache. The model is at q8_0 to mitigate some weird behavior I was seeing at lower quants.
Speed is very slow, at around 50 t/s prompt processing (pp) and 10 t/s token generation (tg), but usable for coding agent workflows.
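For a rough sense of what those rates mean in wall-clock time, here's a back-of-the-envelope sketch. The request sizes are made up for illustration. It also assumes the rates stay constant at any context depth (in practice they tend to drop as the context fills) and that nothing is served from a prompt cache.

```python
def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time to process `tokens` at a steady rate."""
    return tokens / tokens_per_second


PP_TPS = 50.0  # prompt processing rate reported above
TG_TPS = 10.0  # token generation rate reported above

# Hypothetical request sizes, chosen only for illustration.
prompt_tokens = 20_000
reply_tokens = 500

prefill_s = seconds_for(prompt_tokens, PP_TPS)
decode_s = seconds_for(reply_tokens, TG_TPS)

# Worst case: a completely cold prefill of the whole window,
# assuming "128k" means 128 * 1024 = 131072 tokens.
full_context_s = seconds_for(128 * 1024, PP_TPS)

print(f"prefill: {prefill_s:.0f} s, decode: {decode_s:.0f} s")
print(f"cold 128k prefill: {full_context_s / 60:.0f} min")
```

Under those assumptions a fresh 20k-token prompt costs a few minutes before the first output token, and a fully cold 128k prefill would take the better part of an hour, so keeping the prompt prefix stable between agent turns matters a lot at these speeds.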
Anybody else running MoE models in this size class on relatively low-end hardware? For my purposes, speed is less important than accuracy, as long as it doesn’t literally take all day. Any other models you’d recommend I try, or additional optimization tips that could help within my constraints? I wish they’d released the draft model for MTP on this model, but it looks like they declined to do so for 2.7.
My ik_llama flags — sorry for the funny formatting, this is pasted out of my vibe coded NixOS config:
"${ik-llama-cuda}/bin/llama-server"
+ " -m ${modelPath}"
+ " --host 0.0.0.0"
+ " --port ${toString cfg.port}"
+ " -c ${toString cfg.contextLength}"
+ " -ngl 999"
+ " --cpu-moe"
+ " -sm graph"
+ " -fa on"
+ " -t 16"
+ " -tb 16"
+ " -b 4096"
+ " -ub 4096"
+ " -np 1"
+ " -muge"
+ " -ger"
+ " --jinja"
+ " --metrics"
+ " --temp 1.0"
+ " --top-p 0.95"
+ " --top-k 40"
+ " --min-p 0.01"

submitted by /u/wombweed
Originally published at reddit.com. Curated by AI Maestro.




