How to Run LLMs Locally with Ollama: The Complete 2026 Setup Guide

Step-by-step guide to running LLMs locally with Ollama in 2026 — hardware requirements, model selection, API setup, LiteLLM integration, and performance tuning.

By AI Maestro May 11, 2026 5 min read

Running large language models on your own hardware is no longer a graduate-student-with-a-server-room proposition. Ollama has made local LLM deployment genuinely accessible — a single command installs the runtime, another pulls a model, and you have a local inference endpoint that any application can use. This guide covers everything from first install to a production-grade local setup.

Why Run LLMs Locally?

Before getting into the how, the why matters. Local LLMs make sense when:

  • Privacy: Your data doesn’t leave your machine. Medical records, confidential business data, personal writing — none of it touches a third-party server.
  • Cost at volume: Once you’ve made the hardware investment, inference is essentially free. At high token volumes, local beats cloud API on cost by a significant margin.
  • Offline operation: Air-gapped systems, remote locations, unstable connectivity.
  • Experimentation: Swap models freely, try fine-tuned variants, test quantisation levels — no API charges, no rate limits.
  • Latency: For some use cases, local inference (even on consumer hardware) is faster than round-tripping to a cloud API because there’s no network overhead.

Hardware Requirements

The most important variable is VRAM for GPU inference. The rule of thumb: at 4-bit quantisation (Q4), a model needs roughly 0.6GB of VRAM per billion parameters, which includes a little headroom for the context cache. So:

  • 7B model at Q4: ~4-5GB VRAM (runs on a GTX 1080 and most modern laptop GPUs)
  • 13B model at Q4: ~8-9GB VRAM (a tight fit on an 8GB RTX 3070; comfortable on a 16GB RTX 4060 Ti)
  • 34B model at Q4: ~20-22GB VRAM (RTX 3090, RTX 4090)
  • 70B model at Q4: ~40-44GB VRAM (requires dual 3090s, an 80GB A100, or a Mac M-series with 64GB+ unified memory)
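
The figures above can be approximated with a few lines of arithmetic. The sketch below is only a rough estimator: the 4.5 bits per weight and the 10% overhead are illustrative assumptions chosen to land inside the ranges listed, not measured values, and the function name is hypothetical.

```python
def estimate_vram_gb(params_billion: float,
                     bits_per_weight: float = 4.5,
                     overhead: float = 0.10) -> float:
    """Rough VRAM estimate in GB for a quantised model.

    bits_per_weight and overhead are assumptions for illustration: 4-bit
    formats store slightly more than 4 bits per weight, and the context
    cache and runtime buffers need some headroom on top of the weights.
    """
    # One billion parameters at 8 bits per weight is about 1 GB.
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb * (1 + overhead)


for size in (7, 13, 34, 70):
    print(f"{size}B at Q4: ~{estimate_vram_gb(size):.1f} GB")
```

Real usage also depends on context length and the specific model, so treat the result as a first filter rather than a guarantee.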

Mac M-series chips are unusually well suited to local LLMs because their unified memory architecture lets the GPU address most of the system RAM. An M4 Max with 128GB can run a 70B model at reasonable speed, something that would otherwise require multiple high-end or data-centre GPUs on x86.

CPU inference is also an option. It’s slow — typically 2-10 tokens per second for 7B models on a modern CPU — but it works, and any machine with 16GB of RAM can run smaller models.

Installing Ollama

Ollama runs on Linux, macOS, and Windows. Installation takes under a minute.

Linux

curl -fsSL https://ollama.com/install.sh | sh

macOS

Download the app from ollama.com/download, or install it with Homebrew:

brew install ollama

Windows

Download the installer from ollama.com/download. After installation, Ollama runs as a background service and is accessible at http://localhost:11434.

Docker

docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Pulling and Running Models

Ollama’s model library covers the major open-weight families. Use ollama pull to download, ollama run for interactive chat.

# General purpose — excellent quality/speed balance
ollama pull qwen2.5:7b       # 7B, Chinese-English bilingual, strong coder
ollama pull llama3.3:70b     # Meta's 70B — near-frontier quality
ollama pull mistral-nemo:12b # Mistral's 12B — fast, European
ollama pull phi4:14b         # Microsoft's 14B — surprisingly capable

# Coding specialists
ollama pull deepseek-coder-v2:16b
ollama pull qwen2.5-coder:7b

# Vision/multimodal
ollama pull llama3.2-vision:11b

# Run interactively
ollama run qwen2.5:7b
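
To check from code which models have already been pulled, one option is Ollama's native /api/tags endpoint. The sketch below uses only the standard library and assumes the server is on the default address; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address; adjust if yours differs


def model_names(tags_response: dict) -> list:
    """Extract model names from the JSON body returned by /api/tags."""
    return [m["name"] for m in tags_response.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL) -> list:
    """Ask a running Ollama server which models are available locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))


# Usage (needs a running Ollama server):
#   print(list_local_models())
```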

Choosing a quantisation level

Quantisation reduces model size and VRAM usage at a small quality cost. The levels Ollama uses:

  • Q8: Near-full quality, ~1GB per billion params. Use when you have the VRAM.
  • Q4_K_M: Best quality-size balance for most use cases. Default in most Ollama tags.
  • Q3_K_M: Further size reduction, noticeable quality drop on complex reasoning.
  • Q2_K: Smallest, significant quality loss. Only for very constrained hardware.
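
As a way of reasoning about the trade-off, the sketch below picks the highest-quality level whose weights fit a given VRAM budget. The bits-per-weight figures and the 1.5GB headroom are rough assumptions for illustration only (real file sizes vary by model), and best_quant is a hypothetical helper, not part of Ollama.

```python
# Approximate bits per weight for each level, ordered from best quality down.
# These are assumed round numbers, not values published by Ollama.
APPROX_BITS = {"Q8_0": 8.5, "Q4_K_M": 4.8, "Q3_K_M": 3.9, "Q2_K": 3.0}


def best_quant(params_billion: float, vram_gb: float, headroom_gb: float = 1.5):
    """Return the highest-quality level whose weights fit, or None if none do."""
    budget = vram_gb - headroom_gb  # leave room for context cache and buffers
    # Dicts preserve insertion order, so this walks from best quality down.
    for level, bits in APPROX_BITS.items():
        if params_billion * bits / 8 <= budget:
            return level
    return None


print(best_quant(13, 12))  # a 13B model on a 12GB card
```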

Using the API

Ollama exposes an OpenAI-compatible REST API, so most existing code written against OpenAI’s chat completions API works with Ollama once you change the base URL and the model name.

# OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:7b",
    "messages": [{"role": "user", "content": "Explain how transformers work"}]
  }'

# Python with openai library
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": "Write me a Python function to parse JSON"}]
)
print(response.choices[0].message.content)
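
For interactive use it is often nicer to print tokens as they arrive rather than wait for the full response. The following is a minimal sketch of streaming through the same OpenAI-compatible endpoint; stream_reply is a hypothetical helper, and it assumes the client is an openai.OpenAI instance configured as above.

```python
def stream_reply(client, model: str, prompt: str) -> str:
    """Print a reply as it is generated and return the full text.

    `client` is expected to be an openai.OpenAI instance pointed at Ollama.
    """
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # yield incremental chunks instead of one final response
    )
    parts = []
    for chunk in stream:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # content can be None or empty on some chunks
            print(delta, end="", flush=True)
            parts.append(delta)
    print()
    return "".join(parts)


# Usage, assuming the openai package is installed and Ollama is running:
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
#   stream_reply(client, "qwen2.5:7b", "Explain how transformers work")
```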

Production Setup: Ollama + LiteLLM

For anything beyond personal use, add LiteLLM as a proxy layer. It gives you:

  • Single endpoint, multiple backend models
  • Rate limiting and concurrency controls
  • Cost tracking and logging
  • Automatic fallback to cloud API when local is overloaded
  • OpenAI-compatible API that any client library can use

pip install 'litellm[proxy]'

# Start proxy with config
litellm --config config.yaml --port 4000

# config.yaml
model_list:
  - model_name: local-fast
    litellm_params:
      model: ollama/qwen2.5:7b
      api_base: http://localhost:11434

  - model_name: local-powerful
    litellm_params:
      model: ollama/llama3.3:70b
      api_base: http://localhost:11434

  - model_name: cloud-fallback
    litellm_params:
      model: anthropic/claude-3-5-haiku-20241022
      api_key: sk-ant-...

litellm_settings:
  fallbacks:
    - local-fast:
        - cloud-fallback
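
Once the proxy is running, clients address models by the aliases from config.yaml rather than by their Ollama tags. The sketch below sends a request to the proxy using only the standard library. It assumes the proxy listens on port 4000 as started above; build_request and ask are hypothetical helper names, and the API key is only needed if you have configured the proxy to require one.

```python
import json
import urllib.request
from typing import Optional

PROXY_URL = "http://localhost:4000"  # port passed to `litellm --port`


def build_request(model_alias: str, prompt: str,
                  api_key: Optional[str] = None) -> urllib.request.Request:
    """Build an OpenAI-style chat request addressed to the proxy."""
    body = {"model": model_alias,
            "messages": [{"role": "user", "content": prompt}]}
    headers = {"Content-Type": "application/json"}
    if api_key:  # only required when the proxy enforces authentication
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{PROXY_URL}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def ask(model_alias: str, prompt: str, api_key: Optional[str] = None) -> str:
    """Send the request and return the assistant's reply text."""
    req = build_request(model_alias, prompt, api_key)
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


# Usage (needs the proxy and Ollama running):
#   print(ask("local-fast", "Summarise what a proxy layer does"))
```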

Performance Tuning

A few settings make a significant difference to throughput:

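  • Parallel requests: OLLAMA_NUM_PARALLEL=4 lets each loaded model serve up to four requests concurrently. The default varies with the Ollama version and available memory, so set it explicitly if you depend on it.
  • Loaded models: OLLAMA_MAX_LOADED_MODELS=2 keeps two models hot in VRAM simultaneously.
  • GPU layers: Ollama auto-detects, but you can force a layer count with PARAMETER num_gpu in a Modelfile.
  • Flash attention: Set OLLAMA_FLASH_ATTENTION=1 if your version does not enable it by default. On supported GPUs it can noticeably reduce VRAM use for long contexts.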
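
These settings are environment variables read by the Ollama server, so they must be set for the server process rather than for the client. The sketch below shows one way to launch ollama serve from Python with them applied. The values are examples only, and the helper names are hypothetical.

```python
import os
import subprocess


def ollama_env(num_parallel: int = 4, max_loaded_models: int = 2,
               flash_attention: bool = True) -> dict:
    """Copy the current environment and add Ollama tuning variables."""
    env = dict(os.environ)
    env["OLLAMA_NUM_PARALLEL"] = str(num_parallel)
    env["OLLAMA_MAX_LOADED_MODELS"] = str(max_loaded_models)
    env["OLLAMA_FLASH_ATTENTION"] = "1" if flash_attention else "0"
    return env


def start_server() -> subprocess.Popen:
    """Start `ollama serve` with the tuned environment (ollama must be on PATH)."""
    return subprocess.Popen(["ollama", "serve"], env=ollama_env())
```

If Ollama already runs as a background service, setting the same variables in the service's own configuration is the more usual route, since a second server cannot bind to the same port.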

Common Issues and Fixes

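  • Slow inference on first token: The model is still loading into memory. Ollama unloads idle models after five minutes by default, so raise keep_alive on your requests (or set OLLAMA_KEEP_ALIVE) to keep models warm.
  • Out of VRAM: The model is offloading layers to CPU. Check ollama ps: anything other than “100% GPU” in the PROCESSOR column means the model is split between CPU and GPU. Try a more aggressively quantised variant.
  • Crashes on large context: Reduce num_ctx in the model parameters. The default is small (2048 in older releases, larger in newer ones); some models support 128K, but VRAM usage scales with context length.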
  • Windows: model path too long: Ollama models default to %USERPROFILE%\.ollama. Set OLLAMA_MODELS env var to a shorter path.
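
Two of the fixes above, keeping a model warm and capping context length, can be applied per request through Ollama's native /api/chat endpoint. The following is a minimal sketch using only the standard library. The default values for num_ctx and keep_alive are arbitrary examples, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str,
                       num_ctx: int = 4096, keep_alive: str = "30m") -> dict:
    """Request body for Ollama's native /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,                  # return one JSON object, not a stream
        "options": {"num_ctx": num_ctx},  # cap context length to bound VRAM use
        "keep_alive": keep_alive,         # keep the model loaded after replying
    }


def chat(model: str, prompt: str, **kwargs) -> str:
    """POST the payload to a local Ollama server and return the reply text."""
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(build_chat_payload(model, prompt, **kwargs)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["message"]["content"]


# Usage (needs a running Ollama server with the model pulled):
#   print(chat("qwen2.5:7b", "Hello", num_ctx=2048, keep_alive="1h"))
```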

Key Takeaways

  • Ollama is the easiest path to local LLM deployment in 2026. One-command install, huge model library, OpenAI-compatible API.
  • A GPU with 12GB+ VRAM runs genuinely capable 13B models smoothly. 24GB gets you to 34B.
  • Mac M-series chips are the best consumer hardware for large models due to unified memory.
  • Add LiteLLM as a proxy for any production setup — it adds fallbacks, rate limiting, and model routing with minimal overhead.
  • Start with qwen2.5:7b or mistral-nemo:12b for day-to-day use. Pull llama3.3:70b when you need serious capability.
