This is for all with 12GB VRAM.

Hi, I created a fork of llama.cpp. I started by creating a UI to monitor which experts are used. This already showed me that the first layers are more important to have in VRAM than the last ones; the reason is that they change experts more frequently than the others.

Unfortunately, I only run Q6 models, since the degradation at coding is visible with lower quantizations. My context is not quantized (same reason), and because of Java development, I need a big context window of 100k.

However, with my expert variant and a hit rate of about 62%, it increased to 26 tokens/s. The break-even point was at a 42% hit rate. This means that 42% of the experts chosen for the prompt were already in my cache on the GPU.

As I tested smaller VRAM sizes (there is a built-in argument to specify the VRAM usage), another use case came to mind. With a good profile, you can reduce the VRAM usage a lot without sacrificing speed.

Now, to my question. Is there anyone who would like to give it a test? I would really like to know how it behaves on a 3060/4060 or similar. (CUDA is a requirement.)

Really, I don’t want to earn any stars or so. I don’t care; I just want to know how much it increases token generation on which NVIDIA graphics card.

It would need the following:

Check out and build https://github.com/adrianhoehne/llama.cpp

Start it with the additional arguments:

Then start a prompt and wait. This will take longer than usual because every expert is still on the CPU.

After that, exchange the arguments to:

And start the measurement.

I also added a view of which experts are used to the Llama UI.
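To make the "hit rate" and "profile" ideas in the post concrete, here is a minimal sketch in Python. It is not the fork's code: the trace format (a list of `(layer, expert_id)` activations) and the names `build_profile` and `hit_rate` are assumptions made up for illustration. The idea is that a warm-up run records which experts a prompt activates, the most frequently used ones are kept on the GPU up to a budget, and the hit rate is the share of later activations that find their expert already cached.

```python
from collections import Counter

# Hypothetical trace format (an assumption, not the fork's format):
# one entry per expert activation, written as (layer, expert_id).
Activation = tuple[int, int]


def build_profile(trace: list[Activation], budget: int) -> set[Activation]:
    """Pick the `budget` most frequently activated (layer, expert) pairs."""
    counts = Counter(trace)
    return {pair for pair, _ in counts.most_common(budget)}


def hit_rate(trace: list[Activation], cached: set[Activation]) -> float:
    """Fraction of activations whose expert is already in the GPU cache."""
    if not trace:
        return 0.0
    hits = sum(1 for pair in trace if pair in cached)
    return hits / len(trace)


# Warm-up run: every expert is still on the CPU, we only record usage.
warmup = [(0, 1), (0, 1), (0, 2), (1, 5), (1, 5), (1, 7), (2, 3)]
profile = build_profile(warmup, budget=2)

# Measurement run: how often does the cached profile serve an activation?
measured = [(0, 1), (1, 5), (1, 7), (0, 1), (2, 3)]
rate = hit_rate(measured, profile)
print(f"hit rate: {rate:.0%}")  # prints "hit rate: 60%"
```

In this toy example the budget of two cached experts covers three of five activations. By the post's numbers, a real run would need to stay above a hit rate of roughly 42% before the expert cache pays off.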
Key Takeaways
– A user created a fork of llama.cpp with an experimental implementation that keeps experts instead of whole layers in VRAM, to make the most of the limited VRAM on their RTX 2060.
– With a hit rate of about 62%, the expert variant reached about 26 tokens per second; the break-even point against the default implementation was a hit rate of 42%.
– The user is seeking feedback from others with CUDA-capable NVIDIA graphics cards (a 3060/4060 or similar) and mentions the models Qwen3.6-35B-A3B and Gemma 26B A4B.
– To run this experiment, one needs to check out and build the repository https://github.com/adrianhoehne/llama.cpp, then start the server with specific arguments.





