“`html
Head-to-head benchmarking of two open-weight models: Needle (26M) and Qwen3-0.6B
We ran a head-to-head comparison between the two models, focusing on tool-calling tasks using only a 4-core CPU without any GPU assistance or model selection bias.
Experimental Setup
- Setup: 50 queries across five tiers (simple, paraphrased, implicit, ambiguous, edge cases including foreign language and a ‘don’t call any tool’ trap).
- Tools: Used three mock tools for evaluation.
Key Metrics
| Metric | Needle (26M) | Qwen3-0.6B |
|---|---|---|
| tool_match overall | 72.0% | 56.0% |
| parse_success | 84.0% | 54.0% |
| args_match | match | 97.2% | 100.0% |
| mean latency | 10.9s | 47.9s |
Analysis of Failure Shapes
- Needle: Fails by choosing the wrong tool, but when it does pick a tool, args are right 97% of the time. Its main issue is that it sometimes routes system commands to search_web instead of run_command.
- Qwen3-0.6B: Fails by not calling a tool at all; every single one of its 22 misses is due to parse failures where it answers in prose instead of emitting
<tool_call>tags. When it does emit a call, args are perfect 100% of the time.
Tier Breakdown
- T1 and T2 (literal and paraphrased): Both models perform equally well with ~95% accuracy. Needle slightly outperforms Qwen3-0.6B in this tier.
- T3 (implicit, like ‘should I bring an umbrella in Amsterdam?’): Qwen3-0.6B falls off a cliff here, dropping to 80% accuracy while Needle just maps the intent without needing tools.
- T5 (edge cases): The only tier where Qwen3-0.6B wins is T5, by a margin of 10 points. This victory was due to handling Hindi and French queries cleanly, which broke Needle’s tokenizer for Devanagari fragments.
Additional Notes
- Needle: Scored 8% initially because it echoed the word ‘properties’ back as an argument value. Fixing this by writing a converter increased accuracy to 72%, with no other changes required.
- Qwen3-0.6B: Had issues emitting EOS and exceeded its token budget, leading to long runtimes (up to 256 tokens). Switching to the
tokenizer.apply_chat_template(tools=...)method withenable_thinking=Falsereduced runtime to ~37 seconds.
Conclusion and Future Work
- Needle: A dispatcher model that performs well for single-shot tool routing but lacks any conversational ability. It is suitable for tasks where a fixed palette of tools is required.
- Qwen3-0.6B: A small general-purpose model with some conversational capabilities, outperforming Needle in most tiers due to its ability to handle prose responses and provide helpful but incomplete information.
Key Takeaways
- The models diverge significantly in their failure patterns: Needle tends to choose the wrong tool, while Qwen3-0.6B often fails by not calling any tool at all.
- T5 is where Qwen3-0.6B wins decisively due to its ability to handle edge cases like foreign language queries and provide useful but incomplete responses.
- Qwen3-0.6B has a longer runtime issue related to token management, which can be mitigated with proper configuration adjustments.
View the full post and results here.
“`
Originally published at reddit.com. Curated by AI Maestro.
Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.




