Holo3.1: Fast & Local Computer Use Agents

For makers and artists building custom automation tools, the latest Holo3.1 update means your computer-use agents can finally run privately on a…

By AI Maestro June 2, 2026 3 min read
Holo3.1: Fast & Local Computer Use Agents

For makers and artists building custom automation tools, the latest Holo3.1 update means your computer-use agents can finally run privately on a user’s own machine without sending data to the cloud. This release bridges the gap between high-end server performance and local execution, allowing developers to deploy robust agents directly on Windows or Mac devices using consumer hardware.

Robustness Across Real-World Environments

Derived from the Qwen family, Holo3.1 prioritises stability in the specific settings where automation tools are actually used, rather than just in controlled testing labs. During the transition from evaluation to production, teams consistently found that strong results in one environment did not translate to others. Whether deploying on mobile devices, integrating into third-party agent stacks, or using different execution frameworks, each scenario introduced unique distribution shifts that previous versions struggled to handle.

Mobile Automation Gains

The update significantly expands capabilities beyond browser and desktop control, delivering major improvements for mobile environments. On the AndroidWorld benchmark, the 35B-A3B model jumps from 67% to 79.3%, while the smaller 4B and 9B variants see a rise from 58% to 72%.

Seamless Integration with Agent Stacks

To assist teams embedding Holo into external frameworks, the new version adds native support for function-calling protocols alongside the structured JSON outputs already present in Holo3. Testing across OSWorld and an internal suite covering e-commerce, business software, and collaboration workflows shows that function-calling and native execution now perform at near-parity levels. Furthermore, when evaluated within the Holotab product harness, Holo3.1 delivers an improvement of over 25% compared to the previous generation.

Efficient Model Sizes for Private Deployment

To facilitate local and on-device inference, new model sizes are available including 0.8B, 4B, and 9B variants. These offer a cost-effective and private alternative for deployment, sitting alongside the larger 35B-A3B model which retains state-of-the-art performance.

Local Inference Without Compromise

This release marks the first time quantized weights are shipped. Starting with the 35B-A3B checkpoints, users can now access FP8, Q4 GGUF, and NVFP4 formats. For the NVFP4 option, NVIDIA’s Model Optimizer was used in a W4A16 configuration. These checkpoints enable rapid local inference with minimal degradation in model performance. FP8 and NVFP4 achieve identical OSWorld scores, sitting only about two points below the full-precision BF16 checkpoint.

The speed increases are substantial. On DGX Spark hardware, NVFP4 W4A16 delivers 1.41× the total token throughput of FP8 and 1.74× that of BF16.

Running on Consumer Hardware

Q4 GGUF checkpoints are also released specifically for deploying Computer Use Agents on consumer hardware. The agent runs locally on a Windows or Mac machine, while the model can execute on that same device—reference numbers for Apple Silicon are included—or on a DGX Spark on the local network. In both scenarios, execution remains fully private and local, with no data leaving the user’s network.

On Spark, combining agent harness optimisations developed with NVIDIA and the NVFP4 quantisation yields a compound ~2× end-to-end speedup over the FP8 baseline. This cuts the average step time from 6.8 seconds down to 3.3 seconds. On DGX Spark, vLLM with NVFP4 achieves the highest request rate in both Default and Fast modes, outperforming Q4 GGUF and FP8. These enhancements will be included in an upcoming desktop agent harness.

Available Model Sizes

The Holo3.1 family is offered in four distinct sizes:

  • Holo3.1-0.8B – Optimised for ultra-lightweight local agents
  • Holo3.1-4B – Designed for cost-efficient deployment
  • Holo3.1-9B – Balances performance and latency
  • Holo3.1-35B-A3B – Delivers state-of-the-art performance

Optimised FP8, NVFP4, and Q4 GGUF checkpoints are also available for local and edge deployment.

Developers can access further details via the Technical Blog, the Holo Models API, or the Hugging Face Collection.

Key takeaways

  • Holo3.1 introduces native function-calling protocols and improved mobile performance, raising scores on AndroidWorld from 58% to 72% for smaller models.
  • New quantised checkpoints (FP8, NVFP4, and Q4 GGUF) allow for fast, private inference on consumer hardware without significant accuracy loss.
  • Local deployment on Windows or Mac is now viable, with end-to-end speedups of approximately 2× on compatible hardware reducing step times to under four seconds.

Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.

Name
Scroll to Top