Tesla P40 running qwen 3.6

Does anyone know why qwen 3.6 MTP spec decoding won’t work with Tesla P40 when the K cache is quantized?

I was able to get mtp qwen 3.6 27B Q5 running at 20t/s on my tesla p40. But only after removing any quantization of the K cache (running at F16). I had no trouble running turbo3 k cache without MTP on the turboquant fork of llama.cpp, but using the atomic fork to get MTP working it would only give garbage output characters with any kind of q4_0, turbo3 on K cache.

Anyone know what’s up with that? Here’s my powershell start script

$env:TERM = "xterm-256color" $Host.UI.SupportsVirtualTerminal $env:CUDA_VISIBLE_DEVICES = "1" $env:GGML_PRINT_STATS = "1" $env:LLAMA_CUDA_F16 = "1" $tit='P40-QWEN3.6-27B-DENSE-Q5KXL-MTP' $host.ui.RawUI.WindowTitle = $tit $Host.UI.RawUI.BackgroundColor='DarkGray' $env:CUDA_PATH = "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4" $env:PATH = "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\libnvvp;" + $env:PATH G:\code\atomic-llama-cpp-turboquant\build\bin\llama-server.exe ` --log-file c:\logs\$tit-$(Get-Date -Format "yyyyMMddHHmmss").log ` --log-prefix ` --log-timestamps ` --spec-type nextn --draft-max 6 --draft-min 1 ` --model "g:\models\Qwen3.6-27B-UD-Q5_K_XL.gguf" ` -md "g:\models\Qwen3.6-27B-UD-Q5_K_XL.gguf" ` --timeout 3300 ` --host 192.168.99.3 ` --port 9902 ` -np 1 ` --no-mmap ` --gpu-layers 999 ` -c 45000 ` -b 174 ` -ub 174 ` --top-k 20 --top-p 0.95 --min-p 0.05 ` --repeat-penalty 1.0 ` --presence-penalty 1.5 ` --cache-type-k f16 ` --cache-type-v turbo3 pause

submitted by /u/PairOfRussels

Source Read original →

Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.

Tesla P40 running qwen 3.6

Empowering Businesses with AI — Smart Tools, Smarter Business Decisions.

follow us

Popular Tag

Popular Post

Apple approves Poke as…

Meta steals a tactic…

Nemotron 3.5 Content Safety:…