How small can the orchestration model in an agent be? (separating it from code-gen, that obviously wants a big model)

I’m building a local-first agent, a plain ReAct loop (think, pick a tool, observe, repeat) on a llama.cpp backend, and I want to be precise about a question that usually just gets answered with “it depends”.
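For reference, here is a minimal sketch of the loop shape I mean. Everything in it is illustrative: `run_loop`, `decide`, and the `read_file` tool are hypothetical names, and `scripted_decide` is a stub standing in for the model, not a llama.cpp call.

```python
from typing import Callable

# Hypothetical tool; it only echoes its argument for the sake of the sketch.
def read_file(path: str) -> str:
    return f"<contents of {path}>"

TOOLS: dict[str, Callable[..., str]] = {"read_file": read_file}

def run_loop(decide, task: str, max_steps: int = 8) -> str:
    """decide(history) returns ("call", tool_name, kwargs) or ("finish", answer, {})."""
    history = [("task", task)]
    for _ in range(max_steps):
        kind, name, kwargs = decide(history)       # think + pick a tool
        if kind == "finish":
            return name                            # second slot carries the answer
        try:
            observation = TOOLS[name](**kwargs)    # act
        except Exception as exc:                   # bad calls become observations
            observation = f"ERROR: {exc!r}"
        history.append((name, observation))        # observe, then repeat
    return "gave up: step budget exhausted"

# Stub in place of the model: read one file, then finish with what was read.
def scripted_decide(history):
    if len(history) == 1:
        return ("call", "read_file", {"path": "notes.txt"})
    return ("finish", history[-1][1], {})

result = run_loop(scripted_decide, "summarize notes.txt")
```

The part that matters for this post is the `try` block: an unknown tool or an invented argument does not crash the loop, it comes back as an error observation the model has to react to.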

Splitting into two jobs:

(a) Heavy one-shot generation: write a 400-line module, refactor a big file. That wants a big model, no argument. In my setup I route this to a dedicated coding model; I don’t ask the loop model to do it.
(b) The orchestration loop itself: read this, decide which tool, call it with the right arguments, look at the result, react. This post is only about (b).

For (b): how small can that model get before the loop stops being trustworthy? My balance point right now is Qwen3.6-35B-A3B (MoE, ~3B active), the lightest setup where the loop holds up. It still runs fine on a 12GB card with 30 expert offload (running at 40 t/s prompt gen). Below that it degrades, and I’ve been trying to pin down what degrades first.

What degrades?

The model gets the intent right but botches the call. Examples from smaller models I tested:

Passes overwrite=true to an append_file tool that has no such parameter.
Calls grep_search with an output_mode arg that doesn’t exist; it generalized it from a different tool.
Tries to invoke a conclusion “tool” that was never a tool, because finishing the task feels like an action.
Passes overwrite again to yet another tool, having learned the wrong lesson from an earlier call.

The model doesn’t reason incorrectly. It’s a problem with tool-call discipline. The 35B-A3B does this rarely; small dense models do it constantly.
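All of the failures listed above can be detected mechanically before the tool runs. Below is a minimal sketch, assuming tools are plain Python functions without `**kwargs`. `check_call` and the two tool bodies are hypothetical and only mirror the tool names from the examples.

```python
import inspect
from typing import Optional

def append_file(path: str, text: str) -> str:
    """Hypothetical tool: note there is no `overwrite` parameter."""
    return f"appended {len(text)} chars to {path}"

def grep_search(pattern: str, path: str = ".") -> str:
    """Hypothetical tool: note there is no `output_mode` parameter."""
    return f"searched {path} for {pattern}"

TOOLS = {"append_file": append_file, "grep_search": grep_search}

def check_call(name: str, kwargs: dict) -> Optional[str]:
    """Return None if the call is well-formed, else an error message for the model."""
    if name not in TOOLS:
        return f"unknown tool {name!r}; available: {sorted(TOOLS)}"
    sig = inspect.signature(TOOLS[name])
    # Arguments the model invented (assumes the tool takes no **kwargs).
    unknown = sorted(set(kwargs) - set(sig.parameters))
    if unknown:
        return f"{name} has no parameter(s) {unknown}; signature is {name}{sig}"
    try:
        sig.bind(**kwargs)  # raises TypeError on missing required arguments
    except TypeError as exc:
        return f"{name}: {exc}; signature is {name}{sig}"
    return None
```

Here `check_call("append_file", {"path": "a.txt", "text": "x", "overwrite": True})` returns a message naming `overwrite` and restating the real signature, and `check_call("conclusion", {})` reports an unknown tool. Feeding that message back as the observation gives the model something specific to correct, rather than a raw traceback.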

Things I tried to push the floor lower:

Exposing the exact tool signature in the system prompt: tool_name(arg1, arg2, opt=default), generated straight from the function and shown next to each tool, so the model sees the precise parameter list and, by omission, which parameters do NOT exist. Subjectively it helped a lot; not measured rigorously yet.
Repetition watchdogs: small models get stuck repeating the same failing (tool, args) call while the observation keeps erroring, because their model of the state has drifted. I fingerprint recent actions and inject a “stop, change strategy” hint after N identical failures. Works, but it’s a band-aid.
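Both mitigations fit in a few lines. This is a sketch under assumptions, not the code from the repo: `render_signature`, `RepetitionWatchdog`, the hint wording, and the example `grep_search` tool are all hypothetical, and "N identical failures" is read here as N in a row.

```python
import inspect
import json
from collections import deque

def render_signature(fn) -> str:
    """Render tool_name(arg1, arg2, opt=default) from the function itself."""
    parts = []
    for p in inspect.signature(fn).parameters.values():
        if p.default is inspect.Parameter.empty:
            parts.append(p.name)
        else:
            parts.append(f"{p.name}={p.default!r}")
    return f"{fn.__name__}({', '.join(parts)})"

class RepetitionWatchdog:
    """Flags `limit` identical failing (tool, args) calls in a row."""

    def __init__(self, limit: int = 3):
        self.limit = limit
        self.recent = deque(maxlen=limit)

    def record(self, tool: str, kwargs: dict, failed: bool):
        """Return a hint to inject into the context, or None."""
        if not failed:
            self.recent.clear()  # any success resets the streak
            return None
        # Fingerprint: tool name plus canonically ordered arguments.
        fingerprint = (tool, json.dumps(kwargs, sort_keys=True, default=str))
        self.recent.append(fingerprint)
        if len(self.recent) == self.limit and len(set(self.recent)) == 1:
            self.recent.clear()
            return (f"Stop, change strategy: {tool} failed {self.limit} times "
                    "with the same arguments.")
        return None

# Hypothetical tool used only to show the rendered signature.
def grep_search(pattern, path=".", ignore_case=False):
    return []

signature_line = render_signature(grep_search)
```

`signature_line` is `grep_search(pattern, path='.', ignore_case=False)`, which is the string that would sit next to the tool description in the system prompt. The watchdog returns `None` for the first two identical failures and the hint on the third, then starts counting again.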

What I’m after:

For the orchestration role specifically, smallest model you actually trust in a loop?
Is tool-call discipline the first thing that breaks for you too, or does something else go first?
Better ways to make small models viable here: stricter tool schemas, light fine-tuning?

Repo’s here if useful, still rough: https://github.com/homoagens/pragma

You can probably go smaller than people think, if you fix tool-call discipline instead of just reaching for a bigger model.

Key Takeaways

In this setup, the smallest model trusted in the loop so far is Qwen3.6-35B-A3B (MoE, ~3B active parameters).
Tool-call discipline, not reasoning, was the first thing to break as models got smaller.
Exact tool signatures in the system prompt and repetition watchdogs helped; stricter tool schemas and light fine-tuning are still open questions.
