Running a large language model used to mean paying OpenAI a monthly subscription and calling it done. In 2026, the landscape looks completely different. You can run a capable 70B-parameter model on a machine that fits under your desk, rent dedicated GPU capacity by the hour, or stitch together hybrid setups that balance cost, privacy, and capability. This guide cuts through the marketing and gives you the real numbers, real trade-offs, and real-world scenarios for each approach.
The Four Paths
There are essentially four ways to put an LLM to work today:
- Cloud API — pay per token to providers like Anthropic, OpenAI, Google, or Mistral
- Local self-hosting — run open-weight models on your own hardware via Ollama, LM Studio, or llama.cpp
- GPU rental — rent dedicated or on-demand GPU capacity from RunPod, Vast.ai, Lambda Labs, or cloud providers
- Hybrid — local for cheap tasks, API for hard ones (what most serious users end up doing)
Cloud API: The Easy Button
Cloud APIs are the path of least resistance. You get state-of-the-art models with no hardware costs, zero maintenance, and immediate access to new model versions on release day. The trade-off is cost at scale and data leaving your infrastructure.
Typical 2026 pricing (per 1M tokens, input/output)
- Claude Sonnet 4.5: $3 / $15 — strong reasoning, excellent coding, 200K context
- GPT-4o: $2.50 / $10 — fast, multimodal, broad capability
- Gemini 1.5 Pro: $1.25 / $5 — 2M context window, competitive pricing
- Mistral Large: $2 / $6 — European data residency option, strong multilingual
- Claude Haiku 3.5: $0.25 / $1.25 — fastest Anthropic model, great for high-volume simple tasks
At low volume (under 1M tokens/day), cloud API is almost always the right answer. The maths only starts hurting at serious scale — a busy application sending 50M tokens/day to GPT-4o is spending ~$13,000 per month on output alone.
Best for: Prototyping, low-to-medium volume production, tasks requiring the latest models, anything where you need a 200K+ context window reliably.
Local Self-Hosting: The Control Freak’s Choice
Running models locally using Ollama, LM Studio, or llama.cpp has become dramatically more accessible in 2026. Llama 3.3 70B, Mistral Nemo, Qwen 2.5 72B, and DeepSeek R2 are all capable of handling real-world tasks on consumer hardware.
Hardware reality check
- RTX 4090 (24GB VRAM): Comfortably runs 7B models at full precision, 13B at Q4, 34B at Q2. ~£1,400 new. One-time cost.
- RTX 4070 Ti Super (16GB VRAM): Best price-per-VRAM on the market right now. 7B models smoothly, 13B quantised. ~£700.
- Mac M4 Max (128GB unified memory): Can run 70B models via Metal. Exceptional performance per watt. ~£3,000.
- Dual EPYC + 256GB RAM (CPU inference): Slow but works for 70B+ models without GPU spend. Token rates around 2-5 tok/s.
The real cost of local isn’t hardware — it’s electricity, maintenance, and the tokens-per-second ceiling. A 4090 generates roughly 60-80 tok/s on a 7B model but drops to 8-12 tok/s on a 70B Q4 model. That’s fine for single-user use, painful for concurrent multi-user loads.
Ollama: The standard for local deployment
Ollama has become the de facto standard for local LLM serving. Installation is a single command, model management is handled via ollama pull, and it exposes an OpenAI-compatible API. The model library covers everything from Llama 3.3 to Gemma 3, Phi-4, Qwen 2.5, and DeepSeek R1.
# Install and run a capable 7B model
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5:7b
ollama run qwen2.5:7bFor production setups, pair Ollama with LiteLLM as a proxy layer — you get rate limiting, model fallbacks, logging, and an OpenAI-compatible endpoint that any existing application can hit without code changes.
Best for: Privacy-sensitive workloads, high-volume repetitive tasks where you’ve already made the hardware investment, experimentation with open-weight models, offline/air-gapped environments.
GPU Rental: Elastic Horsepower
GPU rental sits between cloud API and full self-hosting. You get dedicated GPU capacity — often the same A100s or H100s the cloud providers use — without the per-token overhead of API pricing. You’re paying for GPU-hours, not tokens.
Current rental landscape (2026)
- RunPod: A100 SXM (80GB) ~$1.29/hr on spot, ~$1.79/hr on-demand. H100s available at ~$2.49/hr. Simple web UI, good Docker support.
- Vast.ai: Peer-to-peer GPU marketplace. RTX 4090s for ~$0.35/hr, A100s from ~$0.80/hr. Variable availability, best price if you can tolerate some instability.
- Lambda Labs: H100 clusters for serious training runs. Reserved pricing from $2.49/hr. More stable than spot markets, Kubernetes-ready.
- AWS/GCP/Azure: On-demand A100s at $3-4/hr. Expensive but enterprise-grade SLA, existing IAM integration, compliance frameworks.
The economics of GPU rental make sense for: model fine-tuning runs (short bursts of intensive compute), inference at scale on open-weight models where token-based API pricing would be more expensive, and batch processing jobs where throughput matters more than latency.
A rough breakeven point: if you’d spend more than £80/month on cloud API tokens, a rented A100 at £50/month (40 hours) running a capable open-weight model starts to make economic sense — assuming the model quality is adequate for your use case.
Best for: Fine-tuning, batch inference jobs, teams needing more throughput than local hardware provides, compliance contexts where cloud API data sharing is restricted but you still need scale.
The Hybrid Approach (What Most Serious Users Do)
The dirty secret of LLM deployment is that most practitioners end up with a hybrid setup. Local Ollama handles the cheap, high-volume, privacy-sensitive workloads (document summarisation, code generation, embeddings). Cloud API handles the tasks that genuinely need frontier model capability (complex reasoning, nuanced writing, novel problem-solving).
LiteLLM makes this easy to orchestrate — one endpoint, model fallback chains, cost logging, and a simple router that can send queries to the right backend based on complexity signals.
# LiteLLM config: local for cheap, cloud for hard
model_list:
- model_name: default
litellm_params:
model: ollama/qwen2.5:7b
api_base: http://localhost:11434
- model_name: hard
litellm_params:
model: claude-sonnet-4-5
api_key: sk-ant-...The Decision Matrix
| Factor | Cloud API | Local | GPU Rental | Hybrid |
|---|---|---|---|---|
| Setup time | Minutes | Hours | 30 min | Day+ |
| Model quality | Frontier | Near-frontier | Near-frontier | Both |
| Data privacy | Low | Total | Medium | Configurable |
| Cost at 10M tok/day | High | Electricity only | Medium | Low-medium |
| Latency | Low | Variable | Low | Variable |
| Maintenance | None | High | Low | Medium |
| Best context window | 2M+ (Gemini) | ~128K | Model-dependent | 2M+ (API) |
Key Takeaways
- Cloud API wins on simplicity and frontier capability. It’s the right starting point for almost everyone.
- Local self-hosting pays off at volume and for privacy-sensitive use cases. Ollama + LiteLLM is the current standard stack.
- GPU rental is underused — it’s the best option for fine-tuning runs and batch inference at scale.
- Hybrid setups combining local Ollama for cheap tasks and cloud API for hard ones deliver the best economics at production scale.
- The 70B quantised models available in 2026 are genuinely competitive with GPT-3.5-era performance at zero marginal cost once hardware is paid for.
Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.




