“`html

Benchmarked Needle 26M vs Qwen3-0.6B on CPU

Head-to-head benchmarking of two open-weight models: Needle (26M) and Qwen3-0.6B

We ran a head-to-head comparison between the two models, focusing on tool-calling tasks using only a 4-core CPU without any GPU assistance or model selection bias.

Experimental Setup

Setup: 50 queries across five tiers (simple, paraphrased, implicit, ambiguous, edge cases including foreign language and a ‘don’t call any tool’ trap).
Tools: Used three mock tools for evaluation.

Key Metrics

Metric	Needle (26M)	Qwen3-0.6B
tool_match overall	72.0%	56.0%
parse_success	84.0%	54.0%
args_match \| match	97.2%	100.0%
mean latency	10.9s	47.9s

Analysis of Failure Shapes

Needle: Fails by choosing the wrong tool, but when it does pick a tool, args are right 97% of the time. Its main issue is that it sometimes routes system commands to search_web instead of run_command.
Qwen3-0.6B: Fails by not calling a tool at all; every single one of its 22 misses is due to parse failures where it answers in prose instead of emitting <tool_call> tags. When it does emit a call, args are perfect 100% of the time.

Tier Breakdown

T1 and T2 (literal and paraphrased): Both models perform equally well with ~95% accuracy. Needle slightly outperforms Qwen3-0.6B in this tier.
T3 (implicit, like ‘should I bring an umbrella in Amsterdam?’): Qwen3-0.6B falls off a cliff here, dropping to 80% accuracy while Needle just maps the intent without needing tools.
T5 (edge cases): The only tier where Qwen3-0.6B wins is T5, by a margin of 10 points. This victory was due to handling Hindi and French queries cleanly, which broke Needle’s tokenizer for Devanagari fragments.

Additional Notes

Needle: Scored 8% initially because it echoed the word ‘properties’ back as an argument value. Fixing this by writing a converter increased accuracy to 72%, with no other changes required.
Qwen3-0.6B: Had issues emitting EOS and exceeded its token budget, leading to long runtimes (up to 256 tokens). Switching to the tokenizer.apply_chat_template(tools=...) method with enable_thinking=False reduced runtime to ~37 seconds.

Conclusion and Future Work

Needle: A dispatcher model that performs well for single-shot tool routing but lacks any conversational ability. It is suitable for tasks where a fixed palette of tools is required.
Qwen3-0.6B: A small general-purpose model with some conversational capabilities, outperforming Needle in most tiers due to its ability to handle prose responses and provide helpful but incomplete information.

Key Takeaways

The models diverge significantly in their failure patterns: Needle tends to choose the wrong tool, while Qwen3-0.6B often fails by not calling any tool at all.
T5 is where Qwen3-0.6B wins decisively due to its ability to handle edge cases like foreign language queries and provide useful but incomplete responses.
Qwen3-0.6B has a longer runtime issue related to token management, which can be mitigated with proper configuration adjustments.

View the full post and results here.

“`

Source Read original →

Benchmarked Needle 26M vs Qwen3-0.6B on CPU function calling, 50 queries across 5 difficulty tiers. The 23x smaller model wins on accuracy and is 4.4x faster.

Head-to-head benchmarking of two open-weight models: Needle (26M) and Qwen3-0.6B

Experimental Setup

Key Metrics

Analysis of Failure Shapes

Tier Breakdown

Additional Notes

Conclusion and Future Work

Key Takeaways

Empowering Businesses with AI: Smart Tools, Smarter Business Decisions.

follow us

Popular Tag

Popular Post

NVIDIA’s Cosmos-Framework Tutorial: Designing…

Hackers can use 9…

AI chip maker SambaNova…

Head-to-head benchmarking of two open-weight models: Needle (26M) and Qwen3-0.6B

Experimental Setup

Key Metrics

Analysis of Failure Shapes

Tier Breakdown

Additional Notes

Conclusion and Future Work

Key Takeaways

Related articles

Empowering Businesses with AI: Smart Tools, Smarter Business Decisions.

follow us

Popular Tag

Popular Post

NVIDIA’s Cosmos-Framework Tutorial: Designing…

Hackers can use 9…

AI chip maker SambaNova…