After 6 months of running AI agents in production I think the framework you pick barely matters. The thing that kills them is something else.

Disclosure: Some links in this article are affiliate links. AI Maestro may earn a commission if you make a purchase, at no…

By AI Maestro May 23, 2026 3 min read

I’ve been running about 30 agents in production for paying customers over the last six months, and I’m convinced the framework debate is mostly a distraction.

What actually decides whether your agent works in production is something almost nobody talks about on this sub, and it isn’t in the framework. Here’s what I’ve seen kill more agents than every framework bug combined:

  • The agent gets stuck in a loop. It calls the same tool 200 times in four minutes because something downstream returned ambiguous data and the LLM decided to retry forever. Your OpenAI bill goes from $3 a day to $400 in one afternoon, and by the time you notice you’ve burned a grand. You can’t even tell which agent did it because there’s no audit trail.
  • Your VPS reboots overnight for kernel patches. Every agent that was mid-task loses everything. The support agent has no memory of yesterday’s tickets; the research crew has forgotten what they were investigating; the pipeline agent restarts from scratch. None of these are framework problems. They’re memory and state problems.
  • A customer complains the agent gave them wrong info three days ago. You go to debug. There’s no record of what the agent saw, what it decided, or which tool calls it made. The framework didn’t log that because frameworks aren’t observability tools. You shrug and refund.
  • You scaled to 15 agents working together. Two have conflicting beliefs about the same customer because their memory isn’t shared. The customer gets two different answers in the same conversation depending on which agent replies first.
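
The reboot failure above comes down to task state living only in process memory. Below is a minimal sketch of checkpointing agent state to disk, assuming SQLite is an acceptable store and that the state is JSON-serialisable. `CheckpointStore`, the `checkpoints` table, and the default file name are hypothetical names chosen for illustration, not part of any framework mentioned here.

```python
import json
import sqlite3


class CheckpointStore:
    """Hypothetical on-disk store holding the latest state for each agent."""

    def __init__(self, path="agent_state.db"):
        # A file-backed database survives process restarts and host reboots.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "agent_id TEXT PRIMARY KEY, "
            "state TEXT NOT NULL, "
            "updated_at TEXT DEFAULT CURRENT_TIMESTAMP)"
        )
        self.conn.commit()

    def save(self, agent_id, state):
        # One row per agent; the write is committed before save() returns.
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO checkpoints (agent_id, state) VALUES (?, ?)",
                (agent_id, json.dumps(state)),
            )

    def load(self, agent_id, default=None):
        # Returns the last saved state, or `default` if the agent never saved one.
        row = self.conn.execute(
            "SELECT state FROM checkpoints WHERE agent_id = ?", (agent_id,)
        ).fetchone()
        return json.loads(row[0]) if row else default

    def close(self):
        self.conn.close()
```

An agent would call `save()` after each completed step and `load()` on startup, so a restart resumes from the last step rather than from scratch. This only covers a single host; sharing memory between agents on different machines needs a networked store.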

What I think the real stack is:

  • The framework just orchestrates LLM calls. Use whatever your team likes. It’s the cheap layer.
  • A persistent memory layer that survives crashes, restarts, and redeployments, so the agent has actual continuity. This is the layer that decides whether your agent is a toy or a product.
  • Loop detection at the runtime layer, not bolted on as a wrapper around the framework. Something that catches your agent making the same call too many times in a row and stops it before the bill explodes.
  • An audit trail of every decision the agent made, with a hash chain so you can prove later what happened when the customer pushes back. Screenshots and logs aren’t enough when ten thousand dollars is on the line.
  • Shared memory between agents in the same team so they’re not having different conversations about the same customer.
  • Cost tracking per agent so you actually know which one ran away with your budget.
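
To make the loop-detection and cost-tracking items concrete, here is a rough sketch of a guard that sits between agents and their tools. It assumes every tool call and every LLM charge is routed through it. The class names, the `max_repeats` threshold, and the `budget_usd` cap are all hypothetical values picked for illustration.

```python
import hashlib
import json
from collections import defaultdict


class LoopDetected(RuntimeError):
    """An agent repeated the same tool call too many times in a row."""


class BudgetExceeded(RuntimeError):
    """An agent spent more than its allotted budget."""


class RuntimeGuard:
    """Hypothetical guard tracking repeated calls and spend per agent."""

    def __init__(self, max_repeats=5, budget_usd=20.0):
        self.max_repeats = max_repeats
        self.budget_usd = budget_usd
        self._last_call = {}               # agent_id -> fingerprint of last call
        self._streak = defaultdict(int)    # agent_id -> consecutive identical calls
        self.spend = defaultdict(float)    # agent_id -> dollars spent so far

    @staticmethod
    def _fingerprint(tool, args):
        # Same tool with the same arguments yields the same fingerprint.
        payload = json.dumps([tool, args], sort_keys=True, default=str)
        return hashlib.sha256(payload.encode()).hexdigest()

    def check_call(self, agent_id, tool, args):
        """Call before running a tool; raises once the streak passes the limit."""
        fp = self._fingerprint(tool, args)
        if self._last_call.get(agent_id) == fp:
            self._streak[agent_id] += 1
        else:
            self._last_call[agent_id] = fp
            self._streak[agent_id] = 1
        if self._streak[agent_id] > self.max_repeats:
            raise LoopDetected(
                f"{agent_id} called {tool} {self._streak[agent_id]} times in a row"
            )

    def record_cost(self, agent_id, usd):
        """Call after each billed request; raises once the agent is over budget."""
        self.spend[agent_id] += usd
        if self.spend[agent_id] > self.budget_usd:
            raise BudgetExceeded(
                f"{agent_id} has spent ${self.spend[agent_id]:.2f}"
            )
        return self.spend[agent_id]
```

This only catches exact repeats of the same call. An agent that varies its arguments slightly on each retry would slip past, so a real guard would likely also need a rate limit or a fuzzier comparison.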

When I compare the agents that survived production with the ones that died, the difference is never that they picked the right framework. It’s that they had this layer underneath, either built carefully in-house or borrowed from somewhere.
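
As one illustration of what building part of that layer in-house might involve, here is a minimal sketch of a hash-chained audit log. It assumes SHA-256 over a JSON encoding of each entry. `AuditLog`, `GENESIS`, and the field names are hypothetical, and a real system would write entries to durable storage rather than keep them in a list.

```python
import hashlib
import json
import time

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash, body):
    # Each hash covers the entry body plus the hash of the entry before it.
    payload = json.dumps(body, sort_keys=True, default=str)
    return hashlib.sha256((prev_hash + payload).encode()).hexdigest()


class AuditLog:
    """Hypothetical append-only log where each entry commits to its predecessor."""

    FIELDS = ("ts", "agent_id", "event", "detail")

    def __init__(self):
        self.entries = []

    def append(self, agent_id, event, detail):
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS
        body = {
            "ts": time.time(),
            "agent_id": agent_id,
            "event": event,      # e.g. "tool_call" or "decision"
            "detail": detail,    # what the agent saw or chose
        }
        entry = dict(body, prev=prev_hash, hash=_digest(prev_hash, body))
        self.entries.append(entry)
        return entry

    def verify(self):
        """Recompute every hash; False means an entry was altered or removed."""
        prev_hash = GENESIS
        for entry in self.entries:
            body = {k: entry[k] for k in self.FIELDS}
            if entry["prev"] != prev_hash:
                return False
            if entry["hash"] != _digest(prev_hash, body):
                return False
            prev_hash = entry["hash"]
        return True
```

Editing any past entry changes its hash and breaks every link after it, which is what lets you show the record was not rewritten after the fact. The chain is only as trustworthy as the latest hash, so that value would need to be stored somewhere the operator cannot quietly change.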

Full disclosure: I’m building one of these tools. There are others: Mem0 and Zep and Letta in the memory space; Helicone and LangSmith in observability. Mix and match. Use one or build your own. Just please stop arguing about whether LangChain or CrewAI is better when the thing eating your production agents has nothing to do with either of them.

What’s been your worst production agent failure? Curious what other people have actually hit.

Key Takeaways

  • The framework just orchestrates LLM calls. It doesn’t matter which one you pick as long as it works for your team.
  • A persistent memory layer is crucial to ensure agents have continuity and don’t lose state on restarts or failures.
  • Loop detection at the runtime level stops an agent that keeps making the same call before the bill explodes.
  • An audit trail of every decision the agent made lets you show what actually happened when a customer pushes back, especially when money is at stake.





Originally published at reddit.com. Curated by AI Maestro.
