Evaluated a RAG chatbot and the most expensive model was the worst performer. Notes on what actually moved the needle.


By AI Maestro · May 15, 2026 · 1 min read


We had a customer support RAG bot. Standard setup: ChromaDB, system prompt, an LLM doing generation. Nobody had actually measured the response quality.

The only thing resembling evaluation was a keyword-matching script producing numbers that looked like scores but meant nothing.
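To illustrate why that kind of score is meaningless, here is a minimal sketch of a keyword-matching "eval" (the function name and keywords are hypothetical, not the actual script):

```python
# Hypothetical sketch of a keyword-matching "eval" script.
# It counts keyword hits, so a confidently wrong answer that
# happens to mention the right words scores perfectly.
def keyword_score(response: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords that appear in the response."""
    response_lower = response.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in response_lower)
    return hits / len(expected_keywords) if expected_keywords else 0.0

# A factually wrong answer still gets a perfect score:
print(keyword_score("We do not offer refunds on any plan.", ["refund", "plan"]))  # 1.0
```

The score measures surface overlap, not correctness or grounding, which is why the numbers "meant nothing."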

I went in to fix this properly. Sharing what I found because most of it was not where I expected.

Key Takeaways

  • Retrieval problems disguised as LLM issues were the key problem, not model quality.
  • A simple LLM evaluator (Claude Haiku 4.5 via OpenRouter) provided more meaningful insights than keyword matching.
  • Deduplicating chunks and enforcing grounding rules improved accuracy with no significant loss of helpfulness.
  • A model sweep showed that cheaper models like Gemma 4 beat the default production model on quality while also cutting cost.
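The LLM-evaluator takeaway can be sketched as a minimal judge harness. Everything here is an assumption for illustration: the model slug, the rubric wording, and the JSON-verdict parsing are not the actual setup, only the general shape of an OpenAI-compatible call to OpenRouter:

```python
# Minimal LLM-as-judge sketch (assumptions: OpenRouter's OpenAI-compatible
# chat completions endpoint, a hypothetical model slug, an illustrative rubric).
import json
import urllib.request

RUBRIC = (
    "Score the support answer 1-5 for groundedness in the provided context "
    "and 1-5 for helpfulness. Reply with JSON only: "
    '{"groundedness": int, "helpfulness": int, "reason": str}'
)

def build_judge_messages(question: str, context: str, answer: str) -> list[dict]:
    """Assemble the judge prompt from one eval case."""
    user = f"Question:\n{question}\n\nContext:\n{context}\n\nAnswer:\n{answer}"
    return [{"role": "system", "content": RUBRIC},
            {"role": "user", "content": user}]

def parse_verdict(raw: str) -> dict:
    """Pull the JSON verdict out of the reply, tolerating surrounding prose."""
    start, end = raw.find("{"), raw.rfind("}") + 1
    return json.loads(raw[start:end])

def judge(question: str, context: str, answer: str, api_key: str) -> dict:
    req = urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps({
            "model": "anthropic/claude-haiku-4.5",  # hypothetical slug
            "messages": build_judge_messages(question, context, answer),
        }).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return parse_verdict(body["choices"][0]["message"]["content"])
```

Even a cheap judge model scoring groundedness against the retrieved context surfaces retrieval failures that keyword matching cannot see.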

This entire evaluation was done using Neo AI Engineer. It built the eval harness, handled checkpointed runs, dealt with timeout and context limit issues, and consolidated results. I reviewed everything manually and made the calls on what to ship.

Full walkthrough write-up in the comments if anyone wants to replicate it on their own system.





Originally published at reddit.com. Curated by AI Maestro.
