Experts first llama.cpp

This is for everyone with 12GB of VRAM.

Hi, I created a fork of llama.cpp with an experimental implementation that offloads by experts instead of by layers. The reason is that I own an RTX 2060 with 12GB of VRAM. That sounds like a lot, but it is too little for dense models, which is why I mainly use MoE models. The problem is that you still need to offload some layers to the CPU.

As you all surely know, Qwen3.6-35B-A3B uses only 8 experts per token; the rest are unused for that token, so why not fill VRAM with individual experts instead of complete layers full of unused experts?
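To make the idea concrete, here is a minimal sketch of expert-level placement: given a usage profile and a VRAM budget, keep the most frequently selected (layer, expert) pairs on the GPU. This is not the fork's actual code. The function name `pick_hot_experts`, the uniform expert size, and the toy numbers are all assumptions for illustration.

```python
from collections import Counter


def pick_hot_experts(usage, expert_mib, budget_mib):
    """Greedily choose the most-used (layer, expert) pairs that fit the budget.

    Assumption: every expert has the same size (expert_mib), so the budget
    simply translates into a number of free slots.
    """
    slots = int(budget_mib // expert_mib)
    # Sort by descending usage; ties are broken by (layer, expert) for determinism.
    ranked = sorted(usage.items(), key=lambda kv: (-kv[1], kv[0]))
    return {key for key, _ in ranked[:slots]}


# Toy profile (hypothetical): how often each (layer, expert) pair was selected.
usage = Counter({
    (0, 3): 90, (0, 7): 75,
    (1, 2): 60, (1, 5): 5,
    (2, 1): 40, (2, 6): 2,
})

hot = pick_hot_experts(usage, expert_mib=10, budget_mib=40)
print(sorted(hot))
```

With a budget of four expert slots, the rarely used pairs (1, 5) and (2, 6) stay on the CPU, whereas layer-level placement would have to load them along with the rest of their layer.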

I started by building a UI to monitor which experts are used. It already showed me that the first layers matter more to have in VRAM than the last ones, because they switch experts more frequently than the others. Unfortunately, --n-cpu-moe in llama.cpp leaves the first layers on the CPU, so I tried -ot instead, but that's another story. With that optimized setup, I was able to reach about 22 tk/s, compared with 19 tk/s using the default --n-cpu-moe. (Remember, the 2060 has only about half the CUDA cores of a 3060.)

I only run Q6 models, since the degradation from lower quants is visible in coding. My context (KV cache) is not quantized for the same reason, and because of Java development, I need a big context window of 100k.

However, with my expert variant and a hit rate of about 62%, it increased to 26 tk/s. The break-even point was at a 42% hit rate, meaning that 42% of the experts chosen for the prompt were already in my cache on the GPU. While testing smaller VRAM budgets (there is a built-in argument to limit the VRAM usage), another use case came to mind: with a good profile, you can reduce VRAM usage a lot without sacrificing speed.
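The hit rate here can be read as the share of expert selections that were served from the GPU cache. A minimal sketch of that calculation, assuming experts are identified by (layer, expert) pairs and using made-up sample data (the function name `hit_rate` is hypothetical, not part of the fork):

```python
def hit_rate(selected, resident):
    """Fraction of expert selections that were already resident on the GPU."""
    if not selected:
        return 0.0
    hits = sum(1 for key in selected if key in resident)
    return hits / len(selected)


# Hypothetical data: experts currently cached on the GPU ...
resident = {(0, 3), (0, 7), (1, 2)}
# ... and the experts the router picked while generating (repeats count).
selected = [(0, 3), (0, 1), (1, 2), (1, 4), (0, 7), (0, 3), (1, 5), (1, 2)]

rate = hit_rate(selected, resident)
print(f"hit rate: {rate:.1%}")
```

In this toy run, 5 of 8 selections hit the cache. For the throughput figures quoted in the post, 26 tk/s is roughly an 18% gain over 22 tk/s and roughly a 37% gain over 19 tk/s.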

Now to my question: would anyone like to give it a test? I would really like to know how it behaves on a 3060/4060 or similar. (CUDA is a requirement, as is Qwen 35B A3B or Gemma 26B A4B.) So far, it has only been tested on Linux.

Really, I'm not trying to earn stars or anything like that. I don't care; I just want to know how much it increases token generation on which NVIDIA graphics card.

You would need to do the following: check out and build https://github.com/adrianhoehne/llama.cpp

Start it with the additional arguments:

./build/bin/llama-server --moe-layer-perf-out experts.json \
  --cpu-moe \
  --ctx-size 100000 \
  --parallel 1

Then start a prompt and wait. This will take longer than usual because every expert is still on the CPU.

After that, change the arguments to:

./build/bin/llama-server --moe-hot-cache experts.json \
  --moe-hot-cache-max-mib -1 \
  --moe-hot-cache-auto-reserve-mib 1024 \
  --moe-hot-cache-update-rate 0.10 \
  --cpu-moe \
  --ctx-size 100000 \
  --parallel 1

Then start measuring.

I also added a view of which experts are used to the llama.cpp web UI:

[Image: button for the UI]

submitted by /u/comanderxv

Key Takeaways

– A user created a fork of llama.cpp with an experimental implementation that places individual experts, rather than whole layers, in VRAM to make better use of the limited VRAM on their RTX 2060.
– With a hit rate of about 62%, the expert variant raised token generation to about 26 tk/s, compared with 19 tk/s for the default --n-cpu-moe setup and about 22 tk/s for an optimized -ot setup.
– The user is seeking feedback from others with CUDA-capable NVIDIA graphics cards (such as a 3060 or 4060) running Qwen 35B A3B or Gemma 26B A4B, to understand how it performs on these systems.
– To run this experiment, one needs to check out and build the repository https://github.com/adrianhoehne/llama.cpp, then start the server with the specific arguments given in the post.
