Quick Update on Running MTP Models with LLAMA.cpp Docker Images
This article follows up from a previous discussion about running MTP models with LLAMA.cpp. There have been notable improvements and updates to both the MTP pull request and the main LLAMA.cpp branch, including support for image processing and various bug fixes.
What’s New: MTP Models
I recently built new Docker images tailored for running these MTP models on my local machine. This makes it easier to manage and run them without needing to keep up with the latest official builds, which are still in progress.
- CUDA 13 Server Image:
havenoammo/llama:cuda13-server - CUDA 12 Server Image:
havenoammo/llama:cuda12-server - Vulkan Server Image:
havenoammo/llama:vulkan-server - Intel Server Image:
havenoammo/llama:intel-server - ROCM Server Image:
havenoammo/llama:rocm-server
If you are already using LLAMA.cpp Docker images, switching to these new ones will be straightforward. They provide a convenient way to run MTP models until official builds support this feature.
MTP Models Release and Quantization Details
Unsloth has released new MTP models for Qwen 3.6, which have been quantized at lower levels (e.g., Q3_K and Q4_K). I kept my versions at a Q8 quantization level to maintain better prediction performance.
My previous models are now obsolete, as the new versions have been updated to match these released by Unsloth.
How to Use These Docker Images with MTP Models
To run one of my MTP models using the provided Docker image, you can use a command like this:
docker run --gpus all --rm \
-p 8080:8080 \
-v ./models:/models \
havenoammo/llama:cuda13-server \
-m /models/Qwen3.6-27B-MTP-UD-Q8_K_XL.gguf \
--port 8080 \
--host 0.0.0.0 \
-n -1 \
--parallel 1 \
--ctx-size 262144 \
--fit-target 844 \
--mmap \
-ngl -1 \
--flash-attn on \
--metrics \
--temp 1.0 \
--min-p 0.0 \
--top-p 0.95 \
--top-k 20 \
--jinja \
--ubatch-size 512 \
--batch-size 2048 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--spec-type mtp \
--spec-draft-n-max 3
Adjust the flags as needed for your specific use case. The key parameters are --spec-type mtp and --spec-draft-n-max 3, which define how the model operates with MTP.
Key Takeaways
- New Docker images have been built for running various LLAMA.cpp models, including those supporting MTP functionality.
- MTP models for Qwen 3.6 have been released by Unsloth and are now available through a quantized version of the model.
- The provided Docker images can be used to run these MTP models without needing to keep up with official builds.
Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.




