Meet OpenJarvis: A Local-First Framework for On-Device Personal AI Agents with Tools, Memory, and Learning


By AI Maestro | June 4, 2026 | 5 min read

For creators and developers, the rise of OpenJarvis signals a shift away from dependency on cloud APIs toward true on-device autonomy. By running inference, memory, and learning loops locally, this framework allows artists and makers to build personal AI agents that respect privacy and reduce costs, all while approaching the performance of top-tier cloud models.

Researchers from Stanford University and Lambda Labs have released OpenJarvis, an open-source system designed to execute personal AI agents entirely on local hardware. The accompanying paper demonstrates that models configured through this framework achieve results within 3.2 percentage points of the best cloud-based counterparts. Crucially, this performance comes with roughly 800 times lower marginal API costs and approximately 4 times lower latency under standard benchmark protocols. This work builds upon the team’s earlier Intelligence Per Watt study, which found that local models already handle 88.7% of single-turn queries at interactive speeds, with efficiency improving 5.3 times between 2023 and 2025.

Model Overview & Access

OpenJarvis is not a single pre-trained model but a flexible framework capable of composing any supported model with a configurable agent stack. The system has been evaluated across 11 local models drawn from four major families.

Property | Value
License | Apache 2.0
Framework release | March 12, 2026
Paper | arXiv:2605.17172 (posted May 16, 2026)
Repository | github.com/open-jarvis/OpenJarvis
Stars / forks | ~5.4k / ~1.2k (June 2026)
Languages | Python (~83%), Rust (~9%), TypeScript (~7%)
Evaluated models | 11 local models across 4 families: Qwen3.5, Gemma4, Nemotron, Granite
Cloud baselines | Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro
Supported engines | Ollama, vLLM, SGLang, llama.cpp, Apple Foundation Models, Exo (among others)
Context window | Model-dependent
Installation | Single command; ~3 minutes on broadband
Hardware | Tested on 7 platforms, from Mac Mini M4 to NVIDIA DGX Spark

Architecture: Five Primitives and a Spec

OpenJarvis breaks down a personal AI system into five typed primitives, unified by a single declarative configuration object known as a spec.

  • Intelligence — defines the model, weights, generation parameters, and quantization format.
  • Engine — manages the inference runtime (such as Ollama or vLLM), batching, KV-cache settings, and hardware path.
  • Agents — controls the reasoning loop (ReAct or CodeAct), system prompts, tool-use policy, and turn limits.
  • Tools & Memory — handles external interfaces, retrieval backends, 25+ data connectors, and 32+ messaging channels, featuring native MCP support and interchangeable memory backends.
  • Learning — acts as the optimizer that updates the spec from traces, accepting LoRA, DSPy, GEPA, or LLM-guided spec search.

Each primitive is independently swappable, and the spec serializes all five into a TOML file. This modularity means two specs can share the same agent and tool configuration while differing only in model and engine, allowing identical behavior to run on a Mac Mini and a workstation without rewriting prompts.
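
To make the "same agent, different model and engine" idea concrete, here is a minimal Python sketch of a spec as five typed parts. It is not the OpenJarvis schema: every class name, field name, and value below (including the model tags and quantization labels) is a hypothetical stand-in chosen for illustration.

```python
from dataclasses import dataclass, fields, replace


# Hypothetical stand-ins for the five primitives; not the real OpenJarvis types.
@dataclass(frozen=True)
class Intelligence:
    model: str
    quantization: str


@dataclass(frozen=True)
class Engine:
    runtime: str


@dataclass(frozen=True)
class Agent:
    loop: str  # e.g. "react" or "codeact"
    max_turns: int


@dataclass(frozen=True)
class ToolsMemory:
    tools: tuple
    memory_backend: str


@dataclass(frozen=True)
class Learning:
    optimizer: str


@dataclass(frozen=True)
class Spec:
    intelligence: Intelligence
    engine: Engine
    agent: Agent
    tools_memory: ToolsMemory
    learning: Learning


# A small-machine spec (all values are illustrative assumptions).
mac_mini = Spec(
    intelligence=Intelligence(model="qwen3.5-9b", quantization="q4"),
    engine=Engine(runtime="ollama"),
    agent=Agent(loop="react", max_turns=8),
    tools_memory=ToolsMemory(tools=("calendar", "notes"), memory_backend="sqlite"),
    learning=Learning(optimizer="spec-search"),
)

# Swap only the model and the engine; agent, tools, and learning carry over.
workstation = replace(
    mac_mini,
    intelligence=Intelligence(model="qwen3.5-122b", quantization="fp8"),
    engine=Engine(runtime="vllm"),
)


def changed_primitives(a: Spec, b: Spec) -> list:
    """Return the names of the primitives that differ between two specs."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


print(changed_primitives(mac_mini, workstation))  # ['intelligence', 'engine']
```

Because the prompts and tool policy live in the agent and tools parts, the second spec reuses them untouched, which is the portability property the article describes.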

LLM-guided spec search is the framework’s second major contribution. It is a local–cloud collaboration in which a frontier cloud model acts as a teacher. At search time, the teacher reads traces, diagnoses failure clusters, and proposes edits across Intelligence, Engine, Agents, and Tools & Memory. An edit is accepted only if it improves the target failure cluster without causing meaningful regressions elsewhere, a safeguard the team calls the gate, which has a default tolerance of 1%. The optimized spec then runs entirely on-device during inference, with zero cloud calls. Because the teacher is used only at search time, its cost amortizes: at 100 queries per day, it falls below $0.001 per query within six months.
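
The gate and the amortization claim can both be sketched in a few lines of Python. This is a reading of the description above, not the paper's implementation: the function names are hypothetical, the 1% tolerance is assumed to mean one absolute accuracy point, and "six months" is taken as 182 days.

```python
def gate_accepts(before, after, target, tolerance=0.01):
    """Decide whether a proposed spec edit passes the gate.

    before/after map failure-cluster names to accuracy in [0, 1].
    Assumption: the tolerance is an absolute accuracy difference.
    """
    # The edit must strictly improve the cluster it was aimed at.
    if after[target] <= before[target]:
        return False
    # No other cluster may regress by more than the tolerance.
    return all(
        before[name] - after[name] <= tolerance
        for name in before
        if name != target
    )


def amortized_teacher_cost(search_cost_usd, queries_per_day, days):
    """Spread a one-time search cost over every query served in the period."""
    return search_cost_usd / (queries_per_day * days)


baseline = {"tool_calls": 0.60, "coding": 0.80}
good_edit = {"tool_calls": 0.70, "coding": 0.795}  # small regression, within tolerance
bad_edit = {"tool_calls": 0.70, "coding": 0.75}    # regression of five points

print(gate_accepts(baseline, good_edit, target="tool_calls"))  # True
print(gate_accepts(baseline, bad_edit, target="tool_calls"))   # False
```

Under these assumptions, 100 queries per day over 182 days is 18,200 queries, so the stated threshold of $0.001 per query corresponds to a one-time search bill of at most about $18. That dollar figure is derived here from the article's numbers and is not reported in the article.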

Previous approaches (GEPA, DSPy, LoRA) typically optimize one primitive at a time, and prompt optimizers alone recover only about 5 percentage points of the cloud–local gap. LLM-guided spec search recovers 13–32 percentage points because it edits across primitives jointly, at 7–11 times lower optimization cost than single-primitive baselines. The four-primitive move space contributes 5.5–16.5 percentage points, and the LLM proposer adds about 10 percentage points on average over an evolutionary search at the same move space.

Capabilities & Performance

OpenJarvis was evaluated across 8 benchmarks spanning 508 tasks, covering tool calling (ToolCall-15), agentic workflows (PinchBench), coding (LiveCodeBench), customer service (τ-Bench V2, τ²-Bench Telecom), general assistance (GAIA), and deep research (LiveResearchBench, DeepResearchBench).

The swap test: Replacing the intended cloud model with Qwen3.5-9B in existing frameworks like OpenClaw or Hermes Agent drops accuracy by 25–39 percentage points. However, when using the same model under an OpenJarvis spec, the residual drop shrinks to 5.6–16.5 percentage points—recovering 56–77% of the portability loss.
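
The recovery percentage is the share of the original accuracy drop that the spec wins back. The short Python sketch below computes it for the endpoints of the two reported ranges; pairing the endpoints this way is an assumption made for illustration, since the article does not list per-benchmark drops.

```python
def recovered_fraction(drop_without_spec, drop_with_spec):
    """Fraction of the portability loss recovered, given drops in percentage points."""
    return (drop_without_spec - drop_with_spec) / drop_without_spec


# Endpoint pairs taken from the reported ranges (25–39 pp and 5.6–16.5 pp).
for without_spec, with_spec in [(25.0, 5.6), (39.0, 16.5)]:
    share = recovered_fraction(without_spec, with_spec)
    print(f"{without_spec} pp -> {with_spec} pp: {share:.0%} recovered")
```

The endpoint pairs give roughly 78% and 58%, close to but not identical to the reported 56–77%, which presumably comes from per-benchmark pairs rather than from range endpoints.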

The accuracy frontier: The best single local model, Qwen3.5-122B, reaches 80.3% average accuracy versus Claude Opus 4.6 at 83.5%—a gap of just 3.2 percentage points. Local specs match or exceed cloud performance on 4 of 8 benchmarks: ToolCall-15, PinchBench, LiveCodeBench, and τ-Bench V2.

Cost and latency: Local configurations form the accuracy–efficiency frontier. Qwen3.5-122B delivers its 80.3% accuracy at roughly a thousandth of a cent per query, compared to $0.009 per query for Claude Opus 4.6—an approximately 800× marginal API-cost advantage. End-to-end latency drops by roughly 4× on agentic workloads, although the paper notes single-shot prompts can still favor cloud serving.
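
The cost figures can be cross-checked with simple arithmetic. The Python sketch below derives the local per-query cost implied by the cloud price and the 800× ratio, then scales both to a day of use. The 100 queries per day workload is borrowed from the amortization example earlier in the article; the function names are hypothetical.

```python
CLOUD_COST_PER_QUERY_USD = 0.009  # Claude Opus 4.6, as reported in the article
COST_ADVANTAGE = 800              # approximate marginal API-cost ratio


def implied_local_cost(cloud_cost_usd, advantage):
    """Local per-query cost implied by the cloud cost and the reported ratio."""
    return cloud_cost_usd / advantage


def daily_cost(cost_per_query_usd, queries_per_day):
    """Marginal cost of one day of queries at a fixed per-query price."""
    return cost_per_query_usd * queries_per_day


local = implied_local_cost(CLOUD_COST_PER_QUERY_USD, COST_ADVANTAGE)
print(f"local: ${local:.8f} per query ({local * 100:.6f} cents)")
print(f"cloud: ${daily_cost(CLOUD_COST_PER_QUERY_USD, 100):.2f} per day")
print(f"local: ${daily_cost(local, 100):.6f} per day")
```

The implied local cost is about $0.00001125 per query, or roughly 0.0011 cents, which agrees with "roughly a thousandth of a cent". These are marginal API costs only; hardware and electricity are not included in this arithmetic.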

Search gains: LLM-guided spec search improves the Qwen3.5-9B student to 100% on PinchBench, 83% on LiveCodeBench, and 91% on LiveResearchBench. Across the full eight-benchmark suite, average gains per student model range from 13.1 to 31.5 percentage points. The authors report that these gains survive their robustness checks, including reward-weight variants, search-seed variance, and random restarts.

How to Use It

Installation requires a single command. On macOS, Linux, or WSL2:

curl -fsSL https://open-jarvis.github.io/OpenJarvis/install.sh | bash

Windows users run an equivalent PowerShell script. The installer provisions uv, a Python virtual environment, Ollama, and a starter model in about three minutes on broadband. A desktop GUI is available as a .dmg, .exe, .deb, .rpm, or .AppImage from the releases page.

After installation, running jarvis starts a chat session. Starter presets cover common workflows:

jarvis init --preset morning-digest-mac    # daily briefing with TTS
jarvis init --preset deep-research         # multi-hop research with citations
jarvis init --preset code-assistant        # agent with code execution and shell access
jarvis init --preset scheduled-monitor     # stateful agent on a schedule

The framework ships with eight built-in agents across three execution modes—on-demand, scheduled, and continuous. It connects to 25+ data sources (Gmail, Calendar, iMessage, Notion, Obsidian, Slack, GitHub, and others) and exposes agents over 32+ messaging channels (WhatsApp, Telegram, Discord, iMessage, Signal, and others).

Skills can be imported from external catalogs (about 150 from Hermes Agent and about 13,700 community skills from OpenClaw), all following the agentskills.io specification. The command jarvis optimize skills --policy dspy refines them using local trace history.

Key takeaways

  • OpenJarvis achieves near-cloud accuracy (within 3.2 percentage points of the best cloud model) while delivering roughly 800 times lower marginal API costs and about 4 times lower latency on agentic workloads.
  • The framework uses a modular spec-based architecture with five swappable primitives, allowing the same agent logic to run across different hardware and models.
  • LLM-guided spec search optimizes local agents by using a cloud model as a teacher to edit configurations, recovering 13–32 percentage points of the accuracy gap.
  • Installation is streamlined via a single command, supporting major platforms and offering preset agents for research, coding, and monitoring tasks.
