Running large language models on your own hardware is no longer a graduate-student-with-a-server-room proposition. Ollama has made local LLM deployment genuinely accessible — a single command installs the runtime, another pulls a model, and you have a local inference endpoint that any application can use. This guide covers everything from first install to a production-grade local setup.
Why Run LLMs Locally?
Before getting into the how, the why matters. Local LLMs make sense when:
- Privacy: Your data doesn’t leave your machine. Medical records, confidential business data, personal writing — none of it touches a third-party server.
- Cost at volume: Once you’ve made the hardware investment, the marginal cost of inference is little more than electricity. At high token volumes, local beats a cloud API on cost by a significant margin.
- Offline operation: Air-gapped systems, remote locations, unstable connectivity.
- Experimentation: Swap models freely, try fine-tuned variants, test quantisation levels — no API charges, no rate limits.
- Latency: For some use cases, local inference (even on consumer hardware) is faster than round-tripping to a cloud API because there’s no network overhead.
Hardware Requirements
The most important variable for GPU inference is VRAM. The rule of thumb: at 4-bit quantisation (Q4), a model needs roughly 0.6GB of VRAM per billion parameters once some headroom for context is included. So:
- 7B model at Q4: ~4-5GB VRAM — runs on a GTX 1080, most modern laptop GPUs
- 13B model at Q4: ~8-9GB VRAM — RTX 3070, RTX 4060 Ti
- 34B model at Q4: ~20-22GB VRAM — RTX 3090, RTX 4090
- 70B model at Q4: ~40-44GB VRAM — requires dual 3090s, an A100, or Mac M-series with 64GB+ unified memory
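The figures in this list can be approximated with a few lines of arithmetic. The sketch below is illustrative only: the function name and the 20% overhead factor (standing in for the KV cache and runtime buffers) are assumptions chosen to land near the numbers above, not measurements.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float = 4.0,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate in GB for a quantised model.

    Weights take (parameters x bits / 8) bytes; `overhead` is an assumed
    multiplier for the KV cache, activations and runtime buffers.
    """
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb * overhead


# Print an estimate for each model size mentioned in the list.
for size in (7, 13, 34, 70):
    print(f"{size}B at 4-bit: ~{estimate_vram_gb(size):.1f} GB")
```

Real usage varies with context length and the exact quantisation format, so treat the result as a lower bound when shopping for hardware.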
Mac M-series chips are unusually good for local LLMs because their unified memory architecture lets the GPU address most of system RAM. An M4 Max with 128GB can run a 70B model at reasonable speed, something that otherwise takes multiple high-end GPUs or a data-centre card.
CPU inference is also an option. It’s slow — typically 2-10 tokens per second for 7B models on a modern CPU — but it works, and any machine with 16GB of RAM can run smaller models.
Installing Ollama
Ollama runs on Linux, macOS, and Windows. Installation takes under a minute.
Linux / macOS
curl -fsSL https://ollama.com/install.sh | sh
Windows
Download the installer from ollama.com/download. After installation, Ollama runs as a background service and is accessible at http://localhost:11434.
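Whichever install route you use, it can help to confirm that the service is actually answering on port 11434. The sketch below uses only the Python standard library and queries Ollama's `/api/tags` endpoint, which lists locally available models. The function name and the 2-second timeout are illustrative choices.

```python
import json
import urllib.error
import urllib.request


def list_local_models(base_url: str = "http://localhost:11434",
                      timeout: float = 2.0):
    """Return the names of locally available models, or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            data = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply.
        return None
    return [m["name"] for m in data.get("models", [])]
```

Calling `list_local_models()` returns `None` when the server is not reachable and an empty list when it is running but no models have been pulled yet.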
Docker
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Pulling and Running Models
Ollama’s model library covers the major open-weight families. Use ollama pull to download, ollama run for interactive chat.
# General purpose — excellent quality/speed balance
ollama pull qwen2.5:7b # 7B, Chinese-English bilingual, strong coder
ollama pull llama3.3:70b # Meta's 70B — near-frontier quality
ollama pull mistral-nemo:12b # Mistral's 12B — fast, European
ollama pull phi4:14b # Microsoft's 14B — surprisingly capable
# Coding specialists
ollama pull deepseek-coder-v2:16b
ollama pull qwen2.5-coder:7b
# Vision/multimodal
ollama pull llama3.2-vision:11b
# Run interactively
ollama run qwen2.5:7b
Choosing a quantisation level
Quantisation reduces model size and VRAM usage at a small quality cost. The levels Ollama uses:
- Q8: Near-full quality, ~1GB per billion params. Use when you have the VRAM.
- Q4_K_M: Best quality-size balance for most use cases. Default in most Ollama tags.
- Q3_K_M: Further size reduction, noticeable quality drop on complex reasoning.
- Q2_K: Smallest, significant quality loss. Only for very constrained hardware.
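One way to turn this list into a decision is to pick the highest-quality level whose estimated footprint fits your VRAM. The sketch below is a rough heuristic, not something Ollama provides: the bits-per-weight values are approximations based on typical GGUF file sizes, and the 20% overhead factor is an assumption.

```python
# Approximate bits per weight for each level, ordered from highest to
# lowest quality. These are illustrative assumptions, not official figures.
BITS_PER_WEIGHT = {
    "Q8": 8.5,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
    "Q2_K": 3.4,
}


def pick_quant(params_billion, vram_gb, overhead=1.2):
    """Return the highest-quality level whose estimate fits, or None."""
    for level, bits in BITS_PER_WEIGHT.items():
        # Weights in GB, scaled by an assumed overhead for cache and buffers.
        needed_gb = params_billion * bits / 8 * overhead
        if needed_gb <= vram_gb:
            return level
    return None


print(pick_quant(13, 10))  # 13B model on a 10GB card
print(pick_quant(70, 24))  # 70B model on a 24GB card
```

A `None` result means even the smallest level is estimated not to fit, in which case a smaller model is usually a better choice than partial CPU offload.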
Using the API
Ollama exposes an OpenAI-compatible REST API, so most existing code that works with OpenAI’s API works with Ollama once you change the base URL and model name.
# OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "qwen2.5:7b",
"messages": [{"role": "user", "content": "Explain how transformers work"}]
}'
# Python with openai library
from openai import OpenAI
# api_key is required by the client but ignored by Ollama
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": "Write me a Python function to parse JSON"}],
)
print(response.choices[0].message.content)
Production Setup: Ollama + LiteLLM
For anything beyond personal use, add LiteLLM as a proxy layer. It gives you:
- Single endpoint, multiple backend models
- Rate limiting and concurrency controls
- Cost tracking and logging
- Automatic fallback to cloud API when local is overloaded
- OpenAI-compatible API that any client library can use
pip install 'litellm[proxy]'
# Start proxy with config
litellm --config config.yaml --port 4000
# config.yaml
model_list:
  - model_name: local-fast
    litellm_params:
      model: ollama/qwen2.5:7b
      api_base: http://localhost:11434
  - model_name: local-powerful
    litellm_params:
      model: ollama/llama3.3:70b
      api_base: http://localhost:11434
  - model_name: cloud-fallback
    litellm_params:
      model: anthropic/claude-3-5-haiku-latest
      api_key: sk-ant-...
litellm_settings:
  fallbacks:
    - local-fast:
        - cloud-fallback
Performance Tuning
A few settings make a significant difference to throughput:
- Parallel requests: OLLAMA_NUM_PARALLEL=4 allows multiple concurrent requests per model. Default is 1.
- Loaded models: OLLAMA_MAX_LOADED_MODELS=2 keeps two models hot in VRAM simultaneously.
- GPU layers: Ollama auto-detects, but you can force a layer count with num_gpu in a Modelfile.
- Flash attention: Enabled by default on supported GPUs; if your version does not enable it, set OLLAMA_FLASH_ATTENTION=1. Can reduce VRAM use by ~30% for long contexts.
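The environment variables above apply to the whole server. Per-request parameters such as `num_gpu` and `num_ctx` can also be passed in the `options` field of Ollama's native `/api/chat` endpoint, and `keep_alive` controls how long the model stays loaded afterwards. A minimal sketch using only the standard library follows. The helper names and default values are illustrative.

```python
import json
import urllib.request


def build_chat_payload(model, prompt, num_ctx=4096, num_gpu=None, keep_alive="10m"):
    """Build a request body for Ollama's native /api/chat endpoint."""
    options = {"num_ctx": num_ctx}
    if num_gpu is not None:
        options["num_gpu"] = num_gpu  # number of layers to place on the GPU
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,           # one JSON object instead of streamed chunks
        "options": options,
        "keep_alive": keep_alive,  # how long the model stays loaded afterwards
    }


def chat(payload, base_url="http://localhost:11434", timeout=300):
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["message"]["content"]
```

With a server running, `chat(build_chat_payload("qwen2.5:7b", "Hello", num_ctx=8192))` would send a single request with a larger context window without touching the Modelfile.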
Common Issues and Fixes
- Slow inference on first token: Model is loading. Keep Ollama running between requests to keep models warm.
- Out of VRAM: The model is offloading layers to CPU (check ollama ps: the PROCESSOR column should read “100% GPU”). Try a more aggressively quantised variant.
- Crashes on large context: Reduce num_ctx in the model parameters. The default is modest (2048 in many releases); some models support 128K, but VRAM usage scales with context length.
- Windows: model path too long: Ollama models default to %USERPROFILE%\.ollama. Set the OLLAMA_MODELS env var to a shorter path.
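The offload check can also be scripted. Ollama's `/api/ps` endpoint reports each loaded model's total `size` and the portion held in VRAM as `size_vram`. The sketch below flags models that are not fully on the GPU. The function names and the sample data are illustrative.

```python
def gpu_fraction(model_entry):
    """Fraction of a loaded model resident in VRAM (1.0 = fully on GPU)."""
    size = model_entry.get("size", 0)
    if not size:
        return 0.0
    return model_entry.get("size_vram", 0) / size


def offloaded_models(ps_response, threshold=0.999):
    """Names of loaded models that are partly or fully running on the CPU."""
    return [
        m["name"]
        for m in ps_response.get("models", [])
        if gpu_fraction(m) < threshold
    ]


# Illustrative sample shaped like an /api/ps response (sizes in bytes).
sample = {
    "models": [
        {"name": "qwen2.5:7b", "size": 5_000_000_000, "size_vram": 5_000_000_000},
        {"name": "llama3.3:70b", "size": 45_000_000_000, "size_vram": 22_000_000_000},
    ]
}
print(offloaded_models(sample))
```

Against a live server, the same function can be fed the parsed JSON from `http://localhost:11434/api/ps`.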
Key Takeaways
- Ollama is the easiest path to local LLM deployment in 2026. One-command install, huge model library, OpenAI-compatible API.
- A GPU with 12GB+ VRAM runs genuinely capable 13B models smoothly. 24GB gets you to 34B.
- Mac M-series chips are the best consumer hardware for large models due to unified memory.
- Add LiteLLM as a proxy for any production setup — it adds fallbacks, rate limiting, and model routing with minimal overhead.
- Start with qwen2.5:7b or mistral-nemo:12b for day-to-day use. Pull llama3.3:70b when you need serious capability.