Vulkan or CPU llama cpp backend for local llm for coding/code assist


By AI Maestro May 12, 2026 2 min read

Hi all,

I recently started a new job where we're doing Python development on a CI/CD metadata consolidation library for analytics. We have strict rules prohibiting the use of any hosted models such as Claude, Codex, or GitHub Copilot (free or paid). My laptop has 32GB of dual-channel DDR5-5200 RAM and an i7-1365U, running Ubuntu.

I tried several approaches to set up a local LLM for code assistance. First, I ran llama.cpp with the Vulkan backend on a Qwen variant, which hit out-of-memory (OOM) errors while ingesting a 340-line file with the context limit set to 24k. Next, I tried the GitHub Copilot extension with Ollama as the local backend, but couldn't get code interactions working with the same Qwen model.
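For anyone puzzling over the OOM at 24k context: on an iGPU laptop the KV cache alone can eat a large slice of the shared memory. A back-of-the-envelope estimate (the layer/head numbers below are illustrative for a ~7B-class model, not taken from the post):

```python
def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV-cache size: 2 tensors (K and V) per layer,
    each of shape (n_ctx, n_kv_heads, head_dim)."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Illustrative numbers (assumed): 32 layers, 8 KV heads (GQA),
# head_dim 128, fp16 cache, 24k context.
gib = kv_cache_bytes(24576, 32, 8, 128, 2) / 2**30
print(f"{gib:.2f} GiB")  # prints "3.00 GiB"
```

Three gigabytes for the cache alone, before weights and compute buffers, is easily enough to tip a Vulkan allocation over the edge on shared iGPU memory; a quantized KV cache (`bytes_per_elem=1`) roughly halves it.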

I then tried the py-codex extension in VS Code, which left both the development environment and the chat window unresponsive, even though the localhost URL was still live.

After these failures, I turned to LM Studio with a CPU backend for Qwen 3.5 (both the 4B and 9B variants) using the Roo extension in VS Code. This setup works, but it feels suboptimal compared to what we're aiming for.
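Since the LM Studio route is at least working, it can also be scripted against directly: LM Studio exposes an OpenAI-compatible HTTP server locally. A minimal standard-library sketch (the default port 1234 and the model name are assumptions; check LM Studio's server tab for the actual values):

```python
import json
import urllib.request

def build_chat_request(prompt: str, model: str = "qwen-4b",
                       base: str = "http://localhost:1234/v1"):
    """Build an OpenAI-style chat-completions request for a local server.
    Model name and base URL are placeholders, not values from the post."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }
    return urllib.request.Request(
        base + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# To actually send it (requires the local server to be running):
# with urllib.request.urlopen(build_chat_request("Explain this pytest file")) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```

Because the wire format is OpenAI-compatible, the same request shape works against llama.cpp's `llama-server` too, which makes it easy to A/B the backends without changing editor tooling.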

We have a codebase that will be demoed as an MVP in two to three weeks. The files are mostly pytest test cases, some reaching 6000 lines of code. During this period we'll be refactoring and improving the library.
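Those 6000-line files are worth sizing against the context window before picking a model. A rough chars-per-token heuristic (the ~4 chars/token ratio and the average line length are assumptions, not measurements from the codebase):

```python
def approx_tokens(n_chars: int) -> int:
    """Rough heuristic: ~4 characters per token for English text and code."""
    return n_chars // 4

# A 6000-line test file at an assumed average of 40 characters per line:
n_chars = 6000 * 40              # 240,000 characters
print(approx_tokens(n_chars))    # prints 60000 -- well beyond a 24k context
```

In other words, even the 24k context that OOM'd cannot hold one of the largest files whole; any workable setup here needs chunking or retrieval rather than pasting entire files into the prompt.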

I’ve been grappling with several questions related to developer experience:

  • How can I best represent our project’s documentation using tools like pdoc or similar?
  • What is a suitable model and backend for local LLM-based code assistance, and what kind of integration would be needed in terms of extensions?
  • Are there any specific tools like LMStudio, Opencode, or Pi that could streamline this process?
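On the pdoc question in the first bullet: modern pdoc can render a package's docstrings to static HTML in one command. A minimal invocation for reference (package and output paths are placeholders):

```shell
# Install pdoc and generate static HTML docs; paths are placeholders.
pip install pdoc
pdoc ./my_package -o ./docs
```

The generated site is plain HTML, so it can double as a browsable reference for teammates and as clean source material to feed a local model.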

I’m looking for guidance on how to improve the developer experience without relying on external models. Any insights or recommendations would be greatly appreciated.

Key Takeaways

  • Both the Vulkan and CPU backends have limitations when handling large contexts on modest laptop hardware.
  • llama.cpp can run out of memory in long-context scenarios, especially on integrated GPUs with limited shared memory.
  • LM Studio's CPU backend is a viable fallback, but may lack the performance or features needed for day-to-day development.

Originally published at reddit.com. Curated by AI Maestro.
