How to Run LLMs Locally with Ollama: The Complete 2026 Setup Guide

Running large language models on your own hardware is no longer a graduate-student-with-a-server-room proposition. Ollama has made local LLM deployment genuinely accessible: a single command installs the runtime, another pulls a model, and you have a local inference endpoint that any application can use. This guide covers everything from first install to a production-grade local setup.

Why Run LLMs Locally?

Before getting into the how, the why matters. Local LLMs make sense when:

Privacy: Your data doesn’t leave your machine. Medical records, confidential business data, personal writing: none of it touches a third-party server.
Cost at volume: Once you’ve made the hardware investment, the marginal cost of inference is little more than electricity. At high token volumes, that can undercut cloud API pricing by a significant margin.
Offline operation: Air-gapped systems, remote locations, unstable connectivity.
Experimentation: Swap models freely, try fine-tuned variants, and test quantisation levels with no API charges and no rate limits.
Latency: For some use cases, especially short requests, local inference (even on consumer hardware) is faster than round-tripping to a cloud API because there’s no network overhead.

Hardware Requirements

The most important variable is VRAM for GPU inference. The rule of thumb: at 4-bit quantisation (Q4), a model needs roughly 0.6GB of VRAM per billion parameters once you include runtime overhead, and more as the context window grows. So:

7B model at Q4: ~4-5GB VRAM, runs on a GTX 1080, most modern laptop GPUs
13B model at Q4: ~8-9GB VRAM, RTX 4060 Ti (16GB); an 8GB card such as the RTX 3070 is borderline
34B model at Q4: ~20-22GB VRAM, RTX 3090, RTX 4090
70B model at Q4: ~40-44GB VRAM, requires dual 3090s, an 80GB A100, or a Mac M-series with 64GB+ unified memory
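The sizing figures in the list above can be approximated with a few lines of arithmetic. The sketch below is only a rough estimate: the function name `estimate_vram_gb`, the 4.5 bits-per-weight default, and the 10% overhead allowance are assumptions chosen for illustration, not values taken from Ollama, and real usage also grows with context length.

```python
def estimate_vram_gb(params_billions: float,
                     bits_per_weight: float = 4.5,
                     overhead: float = 1.1) -> float:
    """Rough VRAM estimate in GB (assumed formula, not an Ollama API).

    params_billions * bits_per_weight / 8 gives the weight size in GB;
    `overhead` is an assumed 10% allowance for runtime buffers.
    """
    weights_gb = params_billions * bits_per_weight / 8
    return round(weights_gb * overhead, 1)


for size in (7, 13, 34, 70):
    print(f"{size}B: ~{estimate_vram_gb(size)} GB")
```

With these assumed defaults, each estimate lands inside the corresponding range listed above. Treat the result as a first filter before checking the actual download size of a model tag.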

Mac M-series chips are unusually good for local LLMs because their unified memory architecture lets the GPU address most of system RAM. An M4 Max with 128GB can run a 70B model at reasonable speed, something that would otherwise require multiple consumer GPUs or a data-centre card on x86.

CPU inference is also an option. It’s slow, typically 2-10 tokens per second for 7B models on a modern CPU, but it works, and any machine with 16GB of RAM can run smaller models.

Installing Ollama

Ollama runs on Linux, macOS, and Windows. Installation takes under a minute.

Linux / macOS

curl -fsSL https://ollama.com/install.sh | sh

Windows

Download the installer from ollama.com/download. After installation, Ollama runs as a background service and is accessible at http://localhost:11434.

Docker

docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
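Whichever install route you take, it helps to confirm that the server is answering before going further. The sketch below queries the `/api/version` endpoint using only the standard library; the helper names `parse_version` and `ollama_version` are hypothetical, and the default URL assumes the standard port shown above.

```python
import json
import urllib.request


def parse_version(body: bytes) -> str:
    """Extract the version string from an /api/version response body."""
    return json.loads(body)["version"]


def ollama_version(base_url: str = "http://localhost:11434", timeout: float = 3.0):
    """Return the server's version string, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return parse_version(resp.read())
    except (OSError, ValueError, KeyError):
        # Connection refused, timeout, malformed JSON, or missing field.
        return None


if __name__ == "__main__":
    version = ollama_version()
    print(f"Ollama {version} is running" if version else "Ollama is not reachable")
```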

Pulling and Running Models

Ollama’s model library covers the major open-weight families. Use ollama pull to download, ollama run for interactive chat.

# General purpose, excellent quality/speed balance
ollama pull qwen2.5:7b       # 7B, Chinese-English bilingual, strong coder
ollama pull llama3.3:70b     # Meta's 70B, near-frontier quality
ollama pull mistral-nemo:12b # Mistral's 12B, fast, European
ollama pull phi4:14b         # Microsoft's 14B, surprisingly capable

# Coding specialists
ollama pull deepseek-coder-v2:16b
ollama pull qwen2.5-coder:7b

# Vision/multimodal
ollama pull llama3.2-vision:11b

# Run interactively
ollama run qwen2.5:7b

Choosing a quantisation level

Quantisation reduces model size and VRAM usage at a small quality cost. The levels Ollama uses:

Q8_0: Near-full quality, ~1GB per billion params. Use when you have the VRAM.
Q4_K_M: Best quality-size balance for most use cases. Default in most Ollama tags.
Q3_K_M: Further size reduction, noticeable quality drop on complex reasoning.
Q2_K: Smallest, significant quality loss. Only for very constrained hardware.
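One way to turn the trade-off above into a decision is to pick the highest-quality level that still fits your VRAM budget. The sketch below does that under stated assumptions: the bits-per-weight figures are approximate illustrative values, the 10% overhead is a guess, and `pick_quant` is a hypothetical helper, not part of Ollama.

```python
# Approximate bits per weight for common GGUF quantisation levels.
# Illustrative assumptions; real file sizes vary by model architecture.
APPROX_BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
    "Q2_K": 3.4,
}


def pick_quant(params_billions: float, vram_gb: float, overhead: float = 1.1):
    """Return the highest-quality level whose estimated size fits, else None."""
    # Dicts preserve insertion order, so this walks from highest to lowest quality.
    for level, bits in APPROX_BITS_PER_WEIGHT.items():
        needed_gb = params_billions * bits / 8 * overhead
        if needed_gb <= vram_gb:
            return level
    return None


print(pick_quant(7, 12))   # 7B model, 12GB card
print(pick_quant(13, 8))   # 13B model, 8GB card
print(pick_quant(70, 24))  # 70B model, 24GB card
```

A result of `None` means no listed level fits under these assumptions, which is the cue to choose a smaller model rather than a harsher quantisation.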

Using the API

Ollama exposes an OpenAI-compatible REST API, so most existing code written against OpenAI’s chat completions API works with Ollama once you change the base URL and model name.

# OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:7b",
    "messages": [{"role": "user", "content": "Explain how transformers work"}]
  }'

# Python with openai library
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": "Write me a Python function to parse JSON"}]
)
print(response.choices[0].message.content)
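Alongside the OpenAI-compatible route, Ollama has a native `/api/chat` endpoint whose non-streaming reply carries timing fields such as `eval_count` (tokens generated) and `eval_duration` (nanoseconds). The sketch below uses them to measure generation speed. The helper names `chat_native` and `tokens_per_second` are hypothetical, and the sample values are made up for illustration.

```python
import json
import urllib.request


def chat_native(model: str, prompt: str, base_url: str = "http://localhost:11434") -> dict:
    """POST to Ollama's native /api/chat endpoint and return the parsed JSON reply."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON object instead of a stream of chunks
    }
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def tokens_per_second(reply: dict) -> float:
    """Generation speed from eval_count (tokens) and eval_duration (nanoseconds)."""
    return reply["eval_count"] / reply["eval_duration"] * 1e9


# Sample timing fields shaped like a native reply (values are made up).
sample = {"eval_count": 120, "eval_duration": 4_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")
```

With a server running, `tokens_per_second(chat_native("qwen2.5:7b", "Explain how transformers work"))` gives a measured figure for your own hardware.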

Production Setup: Ollama + LiteLLM

For anything beyond personal use, add LiteLLM as a proxy layer. It gives you:

Single endpoint, multiple backend models
Rate limiting and concurrency controls
Cost tracking and logging
Automatic fallback to cloud API when local is overloaded
OpenAI-compatible API that any client library can use

pip install 'litellm[proxy]'

# Start proxy with config
litellm --config config.yaml --port 4000

# config.yaml
model_list:
  - model_name: local-fast
    litellm_params:
      model: ollama/qwen2.5:7b
      api_base: http://localhost:11434

  - model_name: local-powerful
    litellm_params:
      model: ollama/llama3.3:70b
      api_base: http://localhost:11434

  - model_name: cloud-fallback
    litellm_params:
      model: anthropic/claude-3-5-haiku-latest  # LiteLLM's provider/model form
      api_key: sk-ant-...

litellm_settings:
  fallbacks:
    - local-fast:
        - cloud-fallback
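Once the proxy is up, clients address the aliases from config.yaml rather than the underlying model names. The sketch below builds such a request with the standard library, assuming the proxy listens on port 4000 as in the command above. The helper names and the placeholder API key are hypothetical; if you have configured a key on the proxy, use that instead.

```python
import json
import urllib.request


def build_proxy_request(alias: str, prompt: str,
                        proxy_url: str = "http://localhost:4000",
                        api_key: str = "sk-anything") -> urllib.request.Request:
    """Build a chat completion request addressed to a LiteLLM model alias."""
    body = {"model": alias, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{proxy_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # placeholder key
        },
    )


def send(req: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


req = build_proxy_request("local-fast", "Summarise this log line")
print(req.get_method(), req.full_url)
```

Calling `send(req)` against a running proxy returns the reply text. The openai client shown earlier should work the same way if its base URL points at the proxy instead of Ollama.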

Performance Tuning

A few settings make a significant difference to throughput:

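Parallel requests: OLLAMA_NUM_PARALLEL=4 lets each loaded model serve up to four requests concurrently. The default depends on the Ollama version and available memory, and each parallel slot increases the memory needed for context.
Loaded models: OLLAMA_MAX_LOADED_MODELS=2 keeps two models hot in VRAM simultaneously.
GPU layers: Ollama auto-detects how many layers to offload, but you can force a value with PARAMETER num_gpu in a Modelfile.
Flash attention: Set OLLAMA_FLASH_ATTENTION=1 if your version does not enable it by default on your GPU. It can reduce VRAM use by ~30% for long contexts.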
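These variables are read by the Ollama server process, not by clients, so they need to be set before the server starts. The sketch below assembles such an environment in Python. `ollama_env` is a hypothetical helper, and the values are the examples from the list above rather than recommendations.

```python
import os


def ollama_env(num_parallel: int = 4, max_loaded_models: int = 2,
               flash_attention: bool = True) -> dict:
    """Copy the current environment and add the Ollama tuning variables."""
    env = dict(os.environ)
    env["OLLAMA_NUM_PARALLEL"] = str(num_parallel)
    env["OLLAMA_MAX_LOADED_MODELS"] = str(max_loaded_models)
    env["OLLAMA_FLASH_ATTENTION"] = "1" if flash_attention else "0"
    return env


env = ollama_env()
for name in sorted(k for k in env if k.startswith("OLLAMA_")):
    print(f"{name}={env[name]}")

# To start the server with these settings (requires ollama on PATH):
#   import subprocess
#   subprocess.Popen(["ollama", "serve"], env=env)
```

If Ollama already runs as a background service, set the same variables in that service's configuration instead of launching a second server.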

Common Issues and Fixes

Slow inference on first token: Model is loading. Keep Ollama running between requests to keep models warm.
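Out of VRAM: The model is offloading layers to CPU. Check ollama ps: the PROCESSOR column should read “100% GPU”, and a CPU/GPU split means part of the model is running on the CPU. Try a more aggressively quantised variant or a smaller model.
Crashes on large context: Reduce num_ctx in the model parameters. The default is small (2048 tokens in older releases, larger in recent ones); some models support 128K, but VRAM usage scales with context length.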
Windows: model path too long: Ollama models default to %USERPROFILE%\.ollama. Set OLLAMA_MODELS env var to a shorter path.
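Two of the fixes above, keeping a model warm and limiting the context window, can also be applied per request through the native API's `keep_alive` field and `options.num_ctx`. The sketch below only builds the request body for `/api/generate`. `generate_payload` is a hypothetical helper, and the 30-minute and 4096-token defaults are arbitrary example values.

```python
import json


def generate_payload(model: str, prompt: str, num_ctx: int = 4096,
                     keep_alive: str = "30m") -> dict:
    """Build a body for POST /api/generate with context and keep-alive set."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,         # how long the model stays loaded afterwards
        "options": {"num_ctx": num_ctx},  # context window in tokens for this request
    }


print(json.dumps(generate_payload("qwen2.5:7b", "Hello"), indent=2))
```

Lowering `num_ctx` here is a quick way to test whether context size is behind a crash before changing the model's stored parameters.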

Key Takeaways

Ollama is the easiest path to local LLM deployment in 2026. One-command install, huge model library, OpenAI-compatible API.
A GPU with 12GB+ VRAM runs genuinely capable 13B models smoothly. 24GB gets you to 34B.
Mac M-series chips are the best consumer hardware for large models due to unified memory.
Add LiteLLM as a proxy for any production setup: it adds fallbacks, rate limiting, and model routing with minimal overhead.
Start with qwen2.5:7b or mistral-nemo:12b for day-to-day use. Pull llama3.3:70b when you need serious capability.
