Local LLM autocomplete + agentic coding on a single 16GB GPU + 64GB RAM

“`html

Local LLM Autocomplete and Agentic Coding on a Single GPU

Local LLM Autocomplete + Agentic Coding on a Single 16GB GPU + 64GB RAM

Today I set up a full coding toolbox on a single RTX 5080 (with RAM offloading) that’s actually viable.

Why these models:

Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K_L: This model is still the best for infill imo. I tried Gemma4 E4B and Qwen3.5 9B/4B, both produce weird suggestions.
unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL: This model is good at agentic coding at Q8 if you give it a good prompt. At Q4 it’s not usable, and lower quants have noticeable quality issues.

Because of its 3B active params, the Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K_L model is still fast and fits into the remaining 8GB VRAM. The unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL model, on the other hand, performs well with a good prompt but requires more RAM for optimal performance.

Commands:

bash llama-server -hf bartowski/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K_L \ -ngl 99 --no-mmap --ctx-size 0 -ctk q8_0 -ctv q8_0 \ -np 1 --temp 0.5 --top-p 0.95 --top-k 20 --min-p 0.0 --port 8081
Note: I have no idea which hyperparameters to use for Qwen2.5, maybe someone will enlighten me and I’ll edit the post.

bash llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL \ --no-mmap --no-mmproj -fitt 0 -ngl 99 --cpu-moe \ -b 2048 -ub 2048 --jinja \ --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.01
llama.cpp autofits the model and I get ~145k context with this command. You can use -ctv q8_0 -ctk q8_0 if you want more context.

The 35B-A3B model runs at a speed of:

pp4096 | 2093.93 ± 22.64 tg128 | 35.29 ± 0.48

The RTX 5080 with RAM offloading is now a viable setup for running multiple large language models.
Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K_L is the preferred model for autocomplete tasks due to its performance and VRAM efficiency.
The 35B-A3B model, when run with a good prompt, can handle complex autocompletion tasks efficiently.
“`
Source Read original →
Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.
Please enable JavaScript in your browser to complete this form.
Name
First
Last
Email Name
Email
AI Maestro is an independent British AI publication. We test what we recommend. More about us →

Local LLM autocomplete + agentic coding on a single 16GB GPU + 64GB RAM