LLM Tips, Tricks & Workarounds Practitioners Actually Use in 2026

Practical LLM tips and tricks from practitioners: prompting patterns, reliability techniques, context management, cost reduction, and local model optimisation.

By AI Maestro May 11, 2026 6 min read
LLM Tips, Tricks & Workarounds Practitioners Actually Use in 2026

The best LLM knowledge isn’t in the official documentation. It’s in the Reddit threads where someone figured out why their model keeps ignoring a constraint, the Discord messages where a developer shares the prompt structure that finally got reliable JSON output, and the GitHub issues where someone documents the exact token budget that causes context degradation. This guide collects the practical knowledge that practitioners have accumulated — the tricks that actually work.

Prompting Tricks That Actually Help

Put the important instruction at the bottom, not the top

This goes against intuition but is consistently observed across Claude, GPT-4o, and Llama models: the model pays more attention to instructions at the end of the prompt than the beginning, particularly in long prompts. If you have a critical constraint (“never mention competitor products”, “always respond in JSON”), put it in the final paragraph of your system prompt, not the first sentence.

Use “think step by step” selectively

Chain-of-thought prompting (“think step by step”) genuinely improves accuracy on reasoning tasks — the research consistently shows 10-30% improvement on maths and logic. But it hurts on simple tasks because it generates unnecessary tokens and sometimes introduces errors by overcomplicating a straightforward answer. Use it for complex reasoning, not for classification or simple generation.

XML tags beat markdown for structure in long prompts

When your prompt has multiple sections (context, instructions, examples, input data), XML-style tags produce more reliable model behaviour than markdown headers, particularly with Claude:

<context>
You are a code reviewer for a financial trading system...
</context>
<task>
Review the following function for security issues...
</task>
<code>
def process_trade(...)
</code>

The model treats tagged sections as distinct semantic zones rather than continuous text.

Few-shot examples are more powerful than you think

A single good example of the output format you want dramatically outperforms even detailed written instructions. If you want JSON output in a specific structure, showing one example of the structure beats writing “output a JSON object with the following fields…” by a wide margin. Three examples is often the sweet spot — enough to establish the pattern, not so many that you burn tokens.

Negative space instructions

Telling the model what NOT to do is often more effective than telling it what to do. Instead of “write a concise summary”, try “write a summary. Do not include background information, do not explain what the document is about, do not use hedging language like ‘appears to’ or ‘seems to’. Just the key facts.” The negatives constrain the output more precisely than positive instructions.

Reliability Tricks

Ask for confidence when you need reliability

Adding “If you are uncertain about any fact in your response, explicitly flag it with [UNCERTAIN]” dramatically improves the reliability of factual outputs. Models are better calibrated when explicitly asked to express uncertainty than when you simply trust their output. This works particularly well for technical documentation and code explanation.

Structured output: use JSON mode, not just “output JSON”

Asking for JSON in the prompt doesn’t guarantee valid JSON — models regularly produce malformed output, especially for complex schemas. Use the structured output / JSON mode APIs where available (OpenAI, Anthropic, Gemini all support this). For Ollama, use the format: json parameter. This constrains token sampling to valid JSON syntax and eliminates the problem entirely.

# Ollama structured output
curl http://localhost:11434/api/generate   -d '{"model": "qwen2.5:7b", "format": "json", "prompt": "Extract the names and dates from this text..."}'

Use a verification step for critical outputs

For anything where accuracy matters, use two model calls: one to generate the answer, a second to verify it. The second call has a simple system prompt: “You are a critic. Given the following question and answer, identify any factual errors, logical issues, or missing information.” The cost is doubled but so is reliability. This pattern is common in production AI pipelines.

Context Window Management

The context degradation problem

Most models exhibit accuracy degradation in the middle of very long contexts — a phenomenon sometimes called “lost in the middle.” Information at the beginning and end of a long context is recalled more reliably than information buried in the middle. For documents with critical information spread throughout, consider extracting and moving the most important sections to the end of the context.

Sliding window for long conversations

Long chat sessions accumulate context that hurts performance. A common production pattern: summarise older turns into a compact “conversation so far” block and replace the raw history. This keeps the effective context manageable while preserving the semantic content of previous turns. LiteLLM has built-in sliding window support.

Local Model Tricks

Keep models loaded in VRAM

Ollama unloads models from VRAM after a period of inactivity. For production use, ping the model with a lightweight request every 5 minutes to keep it warm. Cold starts for a 70B model can take 15-30 seconds — unacceptable for interactive applications.

Context-cache equivalence in Ollama

Ollama caches the KV-cache for repeated system prompts. If you use the same system prompt across requests, subsequent calls with that prompt are significantly faster — the prefill is essentially free. Structure your prompts to have a stable system prompt and variable user turns to maximise this.

Use smaller models for routing

A common pattern for efficient local setups: a small 3B-7B model acts as a router, classifying incoming queries by complexity and task type, then routing to the appropriate larger model. The router itself is cheap to run and dramatically reduces the load on expensive 70B inference.

Quality Evaluation Tricks

LLM-as-judge

For systematic quality evaluation without human labelling, use a capable model (Claude Sonnet, GPT-4o) as a judge. Feed it pairs of outputs and ask it to choose the better one, or score outputs against explicit criteria. This scales to thousands of evaluations at much lower cost than human evaluation and produces surprisingly consistent results on well-defined tasks.

Prompt regression testing

Treat prompts like code. Version-control your prompts, maintain a test suite of inputs and expected outputs, and run the test suite every time you modify a prompt. This prevents the common failure mode where optimising a prompt for one use case silently breaks it for another.

Cost Reduction Tricks

Batch API for non-real-time workloads

OpenAI, Anthropic, and Google all offer batch processing APIs at 50% of standard pricing. If your workload doesn’t need real-time responses — document analysis, bulk summarisation, overnight processing — batch mode halves your bill with no quality change.

Cache stable portions of your prompts

Anthropic’s Claude API supports prompt caching — stable portions of your system prompt are cached and not charged at full price on repeated calls. For applications with long system prompts (instructions, examples, context), this can reduce costs by 80%+ on the cached portion. Enable it with the cache_control parameter on content blocks.

Tiered model routing by task complexity

Not every task needs a frontier model. Routing simple tasks (classification, basic summarisation, extraction) to a fast cheap model (Claude Haiku, Gemini Flash, or local 7B) and only sending complex tasks to the full-power model can reduce API costs by 60-70% without perceptible quality loss on the simple tasks.

Key Takeaways

  • Put critical instructions at the end of your system prompt, not the beginning.
  • Use structured output / JSON mode APIs rather than asking for JSON in plain text.
  • A verification step doubles reliability for critical outputs at the cost of doubled inference time.
  • Keep local Ollama models warm with a lightweight ping every 5 minutes to avoid cold-start delays.
  • Use batch APIs for 50% cost reduction on non-real-time workloads.
  • Prompt caching in the Anthropic API can cut costs by 80%+ on applications with long stable system prompts.
  • LLM-as-judge is the most scalable approach to systematic quality evaluation without human labelling.

Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.

Name
Scroll to Top