Needle: We Distilled Gemini Tool Calling Into a 26M Model


By AI Maestro · May 12, 2026 · 2 min read


We have open-sourced Needle, a function-calling (tool use) model with just 26 million parameters. It runs at 6000 tokens per second prefill and 1200 tokens per second decode on consumer devices.

A key insight from our work is that agentic experiences are built on tool calling, not reasoning. The right primitive for this is cross-attention, and at this scale feed-forward network (FFN) layers are wasteful. Needle is an experimental run designed specifically for single-shot function calling on consumer devices like phones, watches, or glasses.
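To make the "cross-attention instead of FFN" idea concrete, here is a minimal numpy sketch of a decoder block whose second sub-layer is cross-attention over the tool-definition tokens rather than an FFN. The function names, dimensions, and random weights are illustrative assumptions, not Needle's actual architecture; the point is only the dataflow, where knowledge is read from the context instead of being stored in FFN weights.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: queries attend over keys/values."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def attention_only_block(x, context, d=64, rng=np.random.default_rng(0)):
    """Hypothetical no-FFN decoder block: self-attention over the decoded
    tokens, then cross-attention over the tool-schema tokens from the
    prompt. Weights are random; this illustrates the dataflow only."""
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    x = x + attention(x @ Wq, x @ Wk, x @ Wv)                  # self-attention
    Uq, Uk, Uv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    return x + attention(x @ Uq, context @ Uk, context @ Uv)   # cross-attention

tokens = np.zeros((8, 64))    # decoded tokens so far
tools = np.zeros((32, 64))    # embedded tool definitions from the prompt
out = attention_only_block(tokens, tools)
print(out.shape)  # (8, 64)
```

Because every fact the model needs (the available tools and their parameters) sits in `context`, the block can route it with attention rather than memorize it in per-layer FFN weights.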

Training Details

  • Pre-trained on 200 billion tokens on 16 TPU v6e chips (27 hours)
  • Post-trained on 2 billion tokens of synthesized function-calling data (45 minutes)
  • The dataset was synthesized with Gemini and covers 15 tool categories, including timers, messaging, navigation, and smart-home functions.
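For a sense of what single-shot function-calling data looks like, here is a hypothetical sample in the spirit of the synthesized dataset described above: one tool schema, one user request, one target call. The field names and schema are assumptions for illustration; the actual Needle training format may differ.

```python
import json

# Hypothetical single-shot function-calling sample (illustrative schema,
# not Needle's actual training format).
sample = {
    "tools": [{
        "name": "set_timer",
        "description": "Start a countdown timer",
        "parameters": {"duration_seconds": "integer", "label": "string"},
    }],
    "user": "Set a 10 minute pasta timer",
    "target": {
        "name": "set_timer",
        "arguments": {"duration_seconds": 600, "label": "pasta"},
    },
}

# The model sees `tools` + `user` and must emit `target` in one shot.
print(json.dumps(sample["target"]))
```

Every argument the model must produce is either in the tool schema or the user turn, which is exactly the setting where in-context attention can stand in for memorized FFN knowledge.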

You can test Needle right now and fine-tune it on your Mac or PC via the following link: https://github.com/cactus-compute/needle.

The full write-up on the architecture is available here: https://github.com/cactus-compute/needle/blob/main/docs/simple_attention_networks.md.

We observed that the “no FFN” finding extends beyond function calling to any task where the model has access to external structured knowledge, such as retrieval-augmented generation (RAG) and tool use. The model does not need to memorize facts in FFN weights if they are provided in the input. Experimental results supporting this observation will be published soon.

While Needle outperforms models like FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M in single-shot function calling, those models are broader in scope and excel in conversational settings. We encourage you to test Needle on your own tools via the playground and fine-tune accordingly.

Needle is part of a broader effort to make on-device AI practical. We also build Cactus, an open-source inference engine for mobile and wearables, which we previously discussed: https://news.ycombinator.com/item?id=44524544.

Everything is licensed under the MIT license. Weights are available at https://huggingface.co/Cactus-Compute/needle.

  • Needle is a lightweight, 26M parameter model for function-calling tasks on consumer devices.
  • The model demonstrates that single-shot function calling can be handled effectively without FFN layers.
  • We observed that this approach generalizes to tasks involving external structured knowledge, such as RAG and tool use.





Originally published at reddit.com. Curated by AI Maestro.
