Local LLM autocomplete + agentic coding on a single 16GB GPU + 64GB RAM

By AI Maestro, May 12, 2026

Today I set up a full local coding toolbox (autocomplete plus agentic coding) on a single RTX 5080 with 16GB of VRAM, offloading to 64GB of system RAM, and it's actually viable.

Why these models:

  • Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K_L: This model is still the best for infill imo. I tried Gemma4 E4B and Qwen3.5 9B/4B, and they all produce weird suggestions.
  • unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL: This model is good at agentic coding at Q8 if you give it a good prompt. Quants below Q8 have noticeable quality issues, and at Q4 it's not usable.

Because it has only 3B active params, the Qwen3.6-35B-A3B model is still fast even with its MoE expert weights kept in system RAM (--cpu-moe). That leaves about 8GB of VRAM free, and Qwen2.5-Coder-7B-Instruct at Q6_K_L fits into it. The trade-off is that the Q8 quant of the 35B model takes up a large share of the 64GB of system RAM.
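To make the memory split concrete, here is a rough back-of-the-envelope sketch. It assumes weight size is roughly parameters times bits per weight, uses about 6.56 bits per weight for Q6_K and 8.5 for Q8_0, and assumes about 7.6B parameters for the 7B model. The actual Q6_K_L and UD-Q8_K_XL files mix tensor types, so treat the results as ballpark figures only; the function name is made up for this example.

```python
# Rough size of quantized weights: parameters x bits-per-weight / 8.
# The bits-per-weight values are approximations for the base quant types;
# real GGUF files (Q6_K_L, UD-Q8_K_XL) mix tensor types and will differ.

GIB = 1024 ** 3


def approx_weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB (hypothetical helper)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / GIB


# Assumed inputs: ~7.6B params at ~6.56 bpw (Q6_K), 35B params at ~8.5 bpw (Q8_0).
autocomplete_gib = approx_weights_gib(7.6, 6.56)
agent_gib = approx_weights_gib(35.0, 8.5)

print(f"7B  @ ~Q6_K: ~{autocomplete_gib:.1f} GiB")
print(f"35B @ ~Q8_0: ~{agent_gib:.1f} GiB")
```

Under these assumptions the autocomplete model comes out at roughly 6 GiB, which fits in 8GB of VRAM with some room for context, while the 35B model comes out at roughly 35 GiB, which cannot fit in 16GB of VRAM and has to live mostly in system RAM.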

Commands:

llama-server -hf bartowski/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K_L \
  -ngl 99 --no-mmap --ctx-size 0 -ctk q8_0 -ctv q8_0 \
  -np 1 --temp 0.5 --top-p 0.95 --top-k 20 --min-p 0.0 --port 8081

Note: I have no idea which sampling parameters to use for Qwen2.5. Maybe someone will enlighten me and I'll edit the post.
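For a quick check that the autocomplete server works outside an editor, a minimal sketch of a fill-in-the-middle request might look like this. It assumes the server from the command above is listening on 127.0.0.1:8081 and uses llama-server's /infill endpoint with the input_prefix and input_suffix fields; the helper names and the n_predict default are choices made for this example, not something from the original setup.

```python
import json
import urllib.request

# Assumption: the autocomplete server runs locally on the port used above.
INFILL_URL = "http://127.0.0.1:8081/infill"


def build_infill_payload(prefix: str, suffix: str, n_predict: int = 64) -> dict:
    """Build the JSON body for a non-streaming fill-in-the-middle request."""
    return {
        "input_prefix": prefix,   # code before the cursor
        "input_suffix": suffix,   # code after the cursor
        "n_predict": n_predict,   # cap on generated tokens (example value)
        "stream": False,          # ask for a single JSON response
    }


def request_infill(prefix: str, suffix: str, url: str = INFILL_URL,
                   timeout: float = 30.0) -> str:
    """Send the request and return the generated middle section."""
    body = json.dumps(build_infill_payload(prefix, suffix)).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["content"]
```

With the server running, something like request_infill("def add(a, b):\n    return ", "\n") should return the model's suggestion for the gap between prefix and suffix.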

llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL \
  --no-mmap --no-mmproj -fitt 0 -ngl 99 --cpu-moe \
  -b 2048 -ub 2048 --jinja \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.01

llama.cpp autofits the model and I get ~145k context with this command. You can use -ctv q8_0 -ctk q8_0 if you want more context.
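Why a q8_0 KV cache buys more context can be sketched with a little arithmetic. The formula below assumes a plain attention stack where every layer stores K and V for every token; models with other layer types will store less, and the layer and head counts in the test values are made up for illustration, not the real architecture of either model. The only format facts used are that f16 takes 16 bits per element and q8_0 takes 8.5 (32 int8 values plus one f16 scale per block).

```python
F16_BITS = 16.0
Q8_0_BITS = 8.5  # 34 bytes per block of 32 elements


def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bits_per_elem: float) -> float:
    """KV cache size for a standard attention stack (hypothetical helper)."""
    # K and V each hold n_kv_heads * head_dim values per token per layer.
    elems = 2 * n_ctx * n_layers * n_kv_heads * head_dim
    return elems * bits_per_elem / 8


def context_gain(from_bits: float, to_bits: float) -> float:
    """How much more context fits in the same memory after quantizing."""
    return from_bits / to_bits


gain = context_gain(F16_BITS, Q8_0_BITS)
print(f"q8_0 KV cache fits ~{gain:.2f}x the context of f16 in the same memory")
```

So, under these assumptions, switching both K and V from f16 to q8_0 gives a bit under twice the context for the same cache budget.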

The 35B-A3B model runs at these speeds (pp = prompt processing, tg = token generation, in tokens per second):

test   | t/s
pp4096 | 2093.93 ± 22.64
tg128  | 35.29 ± 0.48
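To get a feel for what these numbers mean for an agent turn, here is a small sketch that converts them into wall-clock time. It assumes throughput stays at the benchmarked values, which is optimistic: speeds usually drop as the context gets deeper, and prompt caching can cut the prompt-processing part for follow-up turns. The token counts in the example are made up.

```python
# Throughput figures taken from the benchmark above (tokens per second).
PP_TOKENS_PER_S = 2093.93  # prompt processing (pp4096)
TG_TOKENS_PER_S = 35.29    # token generation (tg128)


def estimate_seconds(prompt_tokens: int, output_tokens: int,
                     pp: float = PP_TOKENS_PER_S,
                     tg: float = TG_TOKENS_PER_S) -> float:
    """Naive time estimate: prompt time plus generation time."""
    return prompt_tokens / pp + output_tokens / tg


# Hypothetical agent turn: 20k tokens of context, 500 tokens of output.
turn_seconds = estimate_seconds(20_000, 500)
print(f"~{turn_seconds:.0f} s for a 20k-token prompt and 500 generated tokens")
```

For this made-up turn the estimate lands a little under 25 seconds, with more than half of it spent on generation rather than on reading the prompt.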

  • A single RTX 5080 with RAM offloading is a viable setup for running an autocomplete model and an agentic coding model side by side.
  • Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K_L is the preferred model for autocomplete because of its infill quality and because it fits into the remaining 8GB of VRAM.
  • The 35B-A3B model at Q8, given a good prompt, handles agentic coding tasks well and still runs at about 35 tokens per second.
