llama.cpp docker images to run MTP models

Quick Update on Running MTP Models with LLAMA.cpp Docker Images

This article follows up from a previous discussion about running MTP models with LLAMA.cpp. There have been notable improvements and updates to both the MTP pull request and the main LLAMA.cpp branch, including support for image processing and various bug fixes.

What’s New: MTP Models

I recently built new Docker images tailored for running these MTP models on my local machine. This makes it easier to manage and run them without needing to keep up with the latest official builds, which are still in progress.

CUDA 13 Server Image: havenoammo/llama:cuda13-server
CUDA 12 Server Image: havenoammo/llama:cuda12-server
Vulkan Server Image: havenoammo/llama:vulkan-server
Intel Server Image: havenoammo/llama:intel-server
ROCM Server Image: havenoammo/llama:rocm-server

If you are already using LLAMA.cpp Docker images, switching to these new ones will be straightforward. They provide a convenient way to run MTP models until official builds support this feature.

MTP Models Release and Quantization Details

Unsloth has released new MTP models for Qwen 3.6, which have been quantized at lower levels (e.g., Q3_K and Q4_K). I kept my versions at a Q8 quantization level to maintain better prediction performance.

My previous models are now obsolete, as the new versions have been updated to match these released by Unsloth.

How to Use These Docker Images with MTP Models

To run one of my MTP models using the provided Docker image, you can use a command like this:

docker run --gpus all --rm \
  -p 8080:8080 \
  -v ./models:/models \
  havenoammo/llama:cuda13-server \
  -m /models/Qwen3.6-27B-MTP-UD-Q8_K_XL.gguf \
  --port 8080 \
  --host 0.0.0.0 \
  -n -1 \
  --parallel 1 \
  --ctx-size 262144 \
  --fit-target 844 \
  --mmap \
  -ngl -1 \
  --flash-attn on \
  --metrics \
  --temp 1.0 \
  --min-p 0.0 \
  --top-p 0.95 \
  --top-k 20 \
  --jinja \
  --ubatch-size 512 \
  --batch-size 2048 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --spec-type mtp \
  --spec-draft-n-max 3

Adjust the flags as needed for your specific use case. The key parameters are --spec-type mtp and --spec-draft-n-max 3, which define how the model operates with MTP.

Key Takeaways

New Docker images have been built for running various LLAMA.cpp models, including those supporting MTP functionality.
MTP models for Qwen 3.6 have been released by Unsloth and are now available through a quantized version of the model.
The provided Docker images can be used to run these MTP models without needing to keep up with official builds.

Source Read original →

Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.

llama.cpp docker images to run MTP models

Quick Update on Running MTP Models with LLAMA.cpp Docker Images

What’s New: MTP Models

MTP Models Release and Quantization Details

How to Use These Docker Images with MTP Models

Key Takeaways

Empowering Businesses with AI — Smart Tools, Smarter Business Decisions.

follow us

Popular Tag

Popular Post

How to Speed Up…

Alphabet plans to raise…

Nvidia chases $200B CPU…