How small can the orchestration model in an agent be? (separating it from code-gen — that obviously wants a big model)
I’m building a local-first agent (a plain ReAct loop: think, pick a tool, observe, repeat) on a llama.cpp backend, and I want to be precise about a question that usually just gets answered with “it depends”.
Splitting into two jobs:
- (a) Heavy one-shot generation: write a 400-line module, refactor a big file. That wants a big model, no argument. In my setup I route this to a dedicated coding model; I don’t ask the loop model to do it.
- (b) The orchestration loop itself: read this, decide which tool, call it with the right arguments, look at the result, react. This post is only about (b).
For (b): how small can that model get before the loop stops being trustworthy? My balance point right now is Qwen3.6-35B-A3B (MoE, ~3B active) — the lightest setup where the loop holds up, still fine on a 12GB card with 30 expert offload (running 40 t/s prompt gen). Below that it degrades, and I’ve been trying to pin down what degrades first.
What degrades?
- The model gets the intent right but botches the call. Examples from smaller models I tested:
  - Passes overwrite=true to an append_file tool that has no such parameter.
  - Calls grep_search with an output_mode arg that doesn’t exist; it generalized it from a different tool.
  - Tries to invoke a conclusion “tool” that was never a tool, because finishing the task feels like an action.
  - Passes overwrite again to yet another tool, having learned the wrong lesson from an earlier call.
The model doesn’t reason incorrectly. It’s a problem with tool-call discipline. The 35B-A3B does this rarely; small dense models do it constantly.
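The failures listed above share one shape: the tool name or an argument name is not part of the tool's real interface. Here is a minimal sketch of catching that before the call runs, assuming tools are plain Python functions kept in a registry. The `append_file` and `grep_search` functions below are hypothetical stand-ins with invented parameters, not the actual tools from the repo, and `check_call` is an illustrative helper name.

```python
import inspect

# Hypothetical stand-in tools; the real tools' parameters are not shown in the post.
def append_file(path, text):
    return f"appended {len(text)} chars to {path}"

def grep_search(pattern, path="."):
    return f"searched {path} for {pattern}"

TOOLS = {"append_file": append_file, "grep_search": grep_search}

def check_call(tool_name, args):
    """Return None if the call is well-formed, else an error string
    that can be fed back to the model as the observation."""
    tool = TOOLS.get(tool_name)
    if tool is None:
        # Covers the 'conclusion' case: a tool that was never registered.
        return f"unknown tool '{tool_name}'; available: {sorted(TOOLS)}"
    sig = inspect.signature(tool)
    # Covers invented arguments such as overwrite or output_mode.
    unknown = sorted(set(args) - set(sig.parameters))
    if unknown:
        return f"{tool_name} has no parameter(s) {unknown}; signature is {tool_name}{sig}"
    try:
        sig.bind(**args)  # raises TypeError if a required argument is missing
    except TypeError as exc:
        return f"{tool_name}: {exc}"
    return None

print(check_call("append_file", {"path": "a.txt", "text": "hi", "overwrite": True}))
print(check_call("conclusion", {}))
```

Returning the error as text rather than raising keeps the loop going and gives the model the real signature to correct against. This sketch assumes the tools do not accept `**kwargs`; a tool that did would need a different check.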
Things I tried to push the floor lower:
- Exposing the exact tool signature in the system prompt: generated tool_name(arg1, arg2, opt=default) straight from the function, next to each tool, so the model sees the precise parameter list and, by omission, which parameters do NOT exist. Subjectively it helped a lot; not measured rigorously yet.
- Repetition watchdogs: small models get stuck repeating the same failing (tool, args) call while the observation keeps erroring; their model of the state has drifted. I fingerprint recent actions and inject a “stop, change strategy” hint after N identical failures. Works, but it’s a band-aid.
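Both mitigations can be sketched in a few lines of Python. This is an illustration under assumptions, not the repo's implementation: `render_signature`, `RepetitionWatchdog`, and the example `grep_search` tool are hypothetical names, and "N identical failures" is interpreted here as N consecutive failures with the same fingerprint.

```python
import inspect
import json
from collections import deque

def render_signature(fn):
    """Render tool_name(arg1, arg2, opt=default) from the function itself."""
    return f"{fn.__name__}{inspect.signature(fn)}"

# Hypothetical tool, used only to show the rendered form.
def grep_search(pattern, path=".", max_hits=50):
    ...

class RepetitionWatchdog:
    """Flags the Nth consecutive identical failing (tool, args) call."""

    def __init__(self, max_repeats=3):
        self.max_repeats = max_repeats
        self.recent = deque(maxlen=max_repeats)

    @staticmethod
    def fingerprint(tool_name, args):
        # sort_keys makes the fingerprint independent of argument order.
        return tool_name + ":" + json.dumps(args, sort_keys=True, default=str)

    def record(self, tool_name, args, failed):
        """Return a hint string if the model looks stuck, else None."""
        if not failed:
            self.recent.clear()  # any success resets the streak
            return None
        self.recent.append(self.fingerprint(tool_name, args))
        stuck = (len(self.recent) == self.max_repeats
                 and len(set(self.recent)) == 1)
        if stuck:
            self.recent.clear()  # start counting again after hinting
            return f"stop, change strategy: the same call failed {self.max_repeats} times"
        return None

print(render_signature(grep_search))
```

The rendered string would go next to each tool's description in the system prompt, and the hint returned by `record` would be appended to the next observation.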
What I’m after:
- For the orchestration role specifically — smallest model you actually trust in a loop?
- Is tool-call discipline the first thing that breaks for you too, or does something else go first?
- Better ways to make small models viable here — stricter tool schemas, light fine-tuning?
Repo’s here if useful — still rough: https://github.com/homoagens/pragma
You can probably go smaller than people think — if you fix tool-call discipline instead of just reaching for a bigger model.
Key Takeaways
- The smallest model the author currently trusts in the loop is Qwen3.6-35B-A3B, a MoE model with roughly 3B active parameters.
- Tool-call discipline issues are often the first thing to break as models get smaller.
- Exact tool signatures in the system prompt and repetition watchdogs helped in the author's setup; stricter tool schemas and light fine-tuning are raised as open questions, not tested fixes.
Originally published at reddit.com. Curated by AI Maestro.