Local models now triage the OpenClaw repository without cost
The removal of Anthropic's latest flagship model, Claude Fable 5, highlights the need for independent AI stacks. This shift has driven the adoption of local models like Gemma and Qwen to run classification tasks within agent harnesses. This method differs from traditional classifiers like BERT. A local model operating inside an agent harness can use structured outputs to assign labels. The team chose this approach because they already possessed the necessary local models and harness infrastructure, and they expect similar setups to grow as local model capabilities improve.
The project began with contributions to the OpenClaw repository. OpenClaw receives hundreds of issues and pull requests daily, all of which need triage, prioritisation, and routing to maintainers. Onur, a maintainer of this specific vertical, must react quickly to critical errors. Large closed models like GPT-5, Opus, or Sonnet handle this triage easily, but Onur also has access to 128 GB of unified memory on an NVIDIA GB10 GPU. He set out to build a real-time notification system that uses local open-weight models to filter incoming items and alert him only on the issues he is responsible for.
Using a ChatGPT Pro plan to trigger jobs on every new issue would exhaust the monthly quota. Running the system every two or six hours would batch issues, sacrificing real-time notifications for delayed processing. Running the logic on local hardware offers near-instantaneous alerts at no financial cost beyond electricity.
Categorizing issues and PRs
The team defined a finite set of labels representing the categories required for triage. A local model classifies each issue into one of these groups, such as local_models, self_hosted_inference, acp, agent_runtime, codex, ui_tui, and so on.
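A closed label set maps naturally onto a structured-output schema. The following is a minimal sketch, not the team's actual schema: the `labels` field name and both helper functions are hypothetical, the label list contains only the categories named above, and it assumes an item may carry several labels.

```python
# Labels named in the article; the real set is longer ("and so on").
LABELS = [
    "local_models",
    "self_hosted_inference",
    "acp",
    "agent_runtime",
    "codex",
    "ui_tui",
]


def build_label_schema(labels):
    """Return a JSON Schema that only admits labels from the closed set."""
    return {
        "type": "object",
        "properties": {
            "labels": {
                "type": "array",
                "items": {"type": "string", "enum": list(labels)},
                "uniqueItems": True,
            }
        },
        "required": ["labels"],
        "additionalProperties": False,
    }


def validate_labels(payload, labels=LABELS):
    """Minimal check mirroring the schema, without a JSON Schema library."""
    if not isinstance(payload, dict) or set(payload) != {"labels"}:
        return False
    values = payload["labels"]
    if not isinstance(values, list):
        return False
    if not all(isinstance(v, str) and v in labels for v in values):
        return False
    # Reject duplicates, matching "uniqueItems" in the schema.
    return len(set(values)) == len(values)
```

Constraining the output to an enum means a model cannot invent a category, so anything outside the set is rejected before it reaches the notification step.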
Classifying pull requests required a different approach. A single request to a Chat Completions endpoint with a tool JSON schema would have sufficed in 2023, but current tooling supports agents. The team tested gemma-4-26b-a4b and qwen3.6-35b-a3b. With performance optimisations, both models generate hundreds of tokens per second locally.
An agent harness drives the classification run. The team bundles pi as a harness capable of calling local model endpoints. The agent receives the PR title, body, and a truncated excerpt of the diff in the initial prompt. It can then use the bash tool for read-only operations on the OpenClaw repository if it needs to inspect the codebase, or the final_json tool to submit the classification result.
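As a rough illustration of the initial prompt, the sketch below renders a title, body, and truncated diff excerpt into one string. The function names, the prompt wording, and the 4000-character budget are assumptions made for this example and are not taken from localpager.

```python
def truncate_diff(diff: str, max_chars: int) -> str:
    """Keep the head of the diff and state how much was dropped."""
    if len(diff) <= max_chars:
        return diff
    omitted = len(diff) - max_chars
    return f"{diff[:max_chars]}\n[diff truncated: {omitted} characters omitted]"


def render_prompt(title: str, body: str, diff: str, max_diff_chars: int = 4000) -> str:
    """Build the first message the agent sees (hypothetical layout)."""
    excerpt = truncate_diff(diff, max_diff_chars)
    return "\n\n".join([
        f"Title: {title}",
        f"Body:\n{body}",
        f"Diff excerpt:\n{excerpt}",
        "Inspect the repository with the bash tool if needed, then call final_json.",
    ])
```

Telling the model that the diff was cut gives it a reason to look at the repository itself instead of guessing from a partial excerpt.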
Granting full bash access to a local model in a high-throughput setting is unsafe. A prompt-injected issue could steer the model away from classification. Instead, the team uses reposhell. This is a restricted shell similar to bash that allows only read-only operations like ls, find, cat, and grep on the OpenClaw repository. The model believes it is using bash, but any disallowed operation is rejected.
    reposhell bound cwd=/repo/openclaw repos=openclaw
    type help for allowed commands; exit or quit to leave
    reposhell /repo/openclaw> help
    allowed: pwd, ls, find, rg, grep, sed -n, cat, head, tail, wc -l, git status --short, git show --name-only, git grep, git ls-files
    search: rg -n -i "lm studio" or grep -R -n -i "lm studio" .
    files: rg --files -g "*.ts" or git ls-files src
    examples: rg -n reposhell README.md | sed is not allowed; use one simple command at a time
    reposhell /repo/openclaw> head README.md
    # 🦞 OpenClaw — Personal AI Assistant
    <p align="center">
    <picture>
    <source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/openclaw/openclaw/main/docs/assets/openclaw-logo-text-dark.svg">
    <img src="https://raw.githubusercontent.com/openclaw/openclaw/main/docs/assets/openclaw-logo-text.svg" alt="OpenClaw" width="500">
    </picture>
    </p>
    <p align="center">
    reposhell /repo/openclaw> curl localhost
    reposhell policy denied command: unsupported command "curl"
    exit_code=2
    reposhell /repo/openclaw>
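The policy shown in this session can be approximated with an allowlist check on the parsed command. The sketch below is not reposhell's implementation: the function name, the data structures, and most error messages are hypothetical, and it assumes commands are later executed as an argument vector, never through a real shell.

```python
import shlex

# Allowlist taken from the help output above.
SIMPLE = {"pwd", "ls", "rg", "grep", "cat", "head", "tail"}
REQUIRED_FIRST_ARG = {"sed": "-n", "wc": "-l"}
# git subcommand -> flag that must be present (None means no required flag)
GIT_SUBCOMMANDS = {"status": "--short", "show": "--name-only",
                   "grep": None, "ls-files": None}
SHELL_OPERATORS = {"|", "||", "&&", ";", "&", ">", ">>", "<"}
# find actions that execute programs or write files (assumed denied)
FIND_DENIED = {"-exec", "-execdir", "-ok", "-okdir", "-delete",
               "-fprint", "-fprint0", "-fprintf", "-fls"}


def check_command(line):
    """Return (allowed, reason) for one command line."""
    try:
        argv = shlex.split(line)
    except ValueError:
        return False, "unparseable command"
    if not argv:
        return False, "empty command"
    if any(tok in SHELL_OPERATORS for tok in argv):
        return False, "use one simple command at a time"
    cmd, args = argv[0], argv[1:]
    if cmd in SIMPLE:
        return True, "ok"
    if cmd == "find":
        if any(a in FIND_DENIED for a in args):
            return False, "find action not allowed"
        return True, "ok"
    if cmd in REQUIRED_FIRST_ARG:
        flag = REQUIRED_FIRST_ARG[cmd]
        if args[:1] == [flag]:
            return True, "ok"
        return False, f"only {cmd} {flag} is allowed"
    if cmd == "git":
        sub = args[0] if args else ""
        if sub not in GIT_SUBCOMMANDS:
            return False, f'unsupported git subcommand "{sub}"'
        flag = GIT_SUBCOMMANDS[sub]
        if flag is not None and flag not in args:
            return False, f"git {sub} requires {flag}"
        return True, "ok"
    return False, f'unsupported command "{cmd}"'
```

A check like this is only a first filter. Several allowed tools have options that run programs or write files, so a real restricted shell would also need per-command argument rules and confinement to the repository directory.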
A specific session example demonstrates why this matters. The model qwen3.6-35b-a3b classified openclaw/openclaw#84621, titled "Fix Kimi tool-call rewriting stop reason handling". The thinking block showed the model initially considering coding_agent_integrations because the changed path extensions/kimi-coding seemed plausible. The model used reposhell to inspect the local repository with simple read-only commands like ls extensions, ls extensions/kimi-coding, and cat extensions/kimi-coding/package.json. That package metadata revealed the extension was actually @openclaw/kimi-provider, an OpenClaw Kimi provider plugin. The model corrected the final labels to inference_api and tool_calling, explicitly excluding coding_agent_integrations.
The team bundles a specific pi configuration that performs only read-only operations and returns classification output. They call it localpager-agent, named after localpager, the main project. Each PR and issue generates a prompt, which is passed to the CLI alongside other arguments:
    localpager-agent \
      --model "<model-id>" \
      --base-url "<openai-compatible-base-url>" \
      --session-dir "<session-output-dir>" \
      --final-schema "<runtime-schema.json>" \
      --tools bash,final_json \
      --reposhell-socket "<reposhell.sock>" \
      --reposhell-default-repo "<repo-id>" \
      --reposhell-visible-repos "<repo-id>[,<repo-id>...]" \
      -p "$(cat <rendered-prompt.md>)"
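A worker that launches this command per item could assemble the arguments as a list. This is a sketch under assumptions: `build_agent_argv` and its parameter names are hypothetical, and only the flags shown in the command above are used.

```python
def build_agent_argv(model, base_url, session_dir, schema_path,
                     socket_path, default_repo, visible_repos, prompt):
    """Assemble the localpager-agent command line as an argument list."""
    return [
        "localpager-agent",
        "--model", model,
        "--base-url", base_url,
        "--session-dir", session_dir,
        "--final-schema", schema_path,
        "--tools", "bash,final_json",
        "--reposhell-socket", socket_path,
        "--reposhell-default-repo", default_repo,
        "--reposhell-visible-repos", ",".join(visible_repos),
        # The prompt is one argv element, so its content is never shell-parsed.
        "-p", prompt,
    ]
```

Passing the list to `subprocess.run(argv)` without `shell=True` keeps untrusted issue text, which ends up inside the prompt, from being interpreted by a shell.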
Processing incoming PRs and issues
The orchestration between incoming PRs and the final Discord notification is simple. Only the classification step involves an LLM:
- The system uses openclaw/gitcrawl to act as a local mirror for the repository. When a new PR or issue arrives, each item is normalised into the same shape and written into localpager’s SQLite database. If the item is new, localpager creates a classification job for it.
- A worker claims jobs from that queue. It builds a GitHub context object containing the issue or PR title, body, labels, author, state, and optionally comments, changed files, and selected diff excerpts. The local model does not need to browse GitHub or open the URL itself. It is handed all relevant context.
- The context object is rendered into a prompt and passed to localpager-agent as described previously. The agent can think and use reposhell, but must eventually output a classification result in the defined schema.
- The output is stored back in the localpager SQLite database and relayed to Discord based on the notification policy configured by the user.
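The queueing steps above can be sketched with Python's built-in sqlite3 module. The table layout, column names, and function names below are assumptions for illustration and do not reflect localpager's actual schema.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS items (
    id INTEGER PRIMARY KEY,
    repo TEXT NOT NULL,
    number INTEGER NOT NULL,
    kind TEXT NOT NULL,          -- 'issue' or 'pr'
    title TEXT NOT NULL,
    body TEXT NOT NULL,
    UNIQUE (repo, number)
);
CREATE TABLE IF NOT EXISTS jobs (
    id INTEGER PRIMARY KEY,
    item_id INTEGER NOT NULL REFERENCES items(id),
    status TEXT NOT NULL DEFAULT 'queued',
    result_json TEXT
);
"""


def open_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def ingest(conn, repo, number, kind, title, body):
    """Store a normalised item; create a job only if the item is new."""
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO items (repo, number, kind, title, body) "
            "VALUES (?, ?, ?, ?, ?)",
            (repo, number, kind, title, body),
        )
        if cur.rowcount == 0:  # item already known, nothing to enqueue
            return None
        job = conn.execute("INSERT INTO jobs (item_id) VALUES (?)", (cur.lastrowid,))
        return job.lastrowid


def claim_job(conn):
    """Mark the oldest queued job as running and return (job_id, item_id)."""
    with conn:
        row = conn.execute(
            "SELECT id, item_id FROM jobs WHERE status = 'queued' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        cur = conn.execute(
            "UPDATE jobs SET status = 'running' WHERE id = ? AND status = 'queued'",
            (row[0],),
        )
        # rowcount is 0 if another worker claimed the job first.
        return row if cur.rowcount == 1 else None


def complete_job(conn, job_id, result):
    """Store the classification result produced by the agent."""
    with conn:
        conn.execute(
            "UPDATE jobs SET status = 'done', result_json = ? WHERE id = ?",
            (json.dumps(result), job_id),
        )
```

The uniqueness constraint on (repo, number) is what makes ingestion idempotent in this sketch: seeing the same item twice from the mirror does not create a second job.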
The architecture is semi-agentic. Labelling is done agentically, while sending a notification is handled by deterministic rules. This makes the notification pipeline faster by removing the need for inference for the most straightforward parts of the task. Local inference is free, but each task incurs a resource contention cost, so GPU bandwidth should be reserved for tasks where inference is absolutely needed. Keeping the notification step deterministic also reduces the chance of errors when alerts are sent.
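A deterministic notification rule can be as small as a set intersection. In this sketch the policy is assumed to be a set of watched labels per user; the function names and the message format are hypothetical.

```python
def matched_labels(labels, watched_labels):
    """Return the assigned labels the user has subscribed to, sorted."""
    return sorted(set(labels) & set(watched_labels))


def format_alert(repo, number, title, matched):
    """Build a one-line alert, or return None when nothing matches."""
    if not matched:
        return None
    return f"[{', '.join(matched)}] {repo}#{number}: {title}"
```

No model call is involved at this stage, so the same classification result always produces the same alert decision.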
Can local models triage PRs?
The first local versions of this system were noisy. The initial model tested, gemma-4-e4b-it, was useful for getting the end-to-end local pipeline working, but it tended to assign too many unrelated labels to a PR or issue. False positive labels made the Discord feed noisy and failed to focus attention on the right issues. This pushed the team toward testing larger local models, including gemma-4-26b-a4b and qwen3.6-35b-a3b, on a 330-row evaluation set.
For early prompt work, the team used DeepSeek-V4-Flash through the antirez DS4 implementation to create the earlier dataset labels. That setup used the DS4 server over CUDA. The team eventually abandoned DS4 as the labeler because it did not label consistently across runs. They also ruled it out as the main localpager-agent model because it was too large to achieve sufficient throughput on their hardware. The DS4 server provided around 14 tokens per second with a maximum concurrency of 1.
To test model performance, the team selected 330 GitHub issues and PRs and generated labels for them. Each item was labelled five times (three times with GPT-5.5 and twice with Opus 4.8). A label was accepted only when the models agreed. This process involved hand adjudication, improving label definitions, and highlighting internal product design choices for the models. It produced a set of stable, reproducible labels to compare smaller models against.
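The agreement step might look like the sketch below. The article does not state the exact rule, so unanimity across all five runs is an assumption here, and `consolidate` is a hypothetical name. Labels that only some runs proposed are set aside for hand adjudication.

```python
from collections import Counter


def consolidate(runs):
    """Split labels into unanimous ones and ones needing adjudication.

    `runs` is a list of label lists, one per labelling run for a single item.
    """
    counts = Counter(label for run in runs for label in set(run))
    accepted = sorted(l for l, c in counts.items() if c == len(runs))
    disputed = sorted(l for l, c in counts.items() if c < len(runs))
    return accepted, disputed
```

For example, if all five runs assign `inference_api` but only two assign `tool_calling`, the first is accepted and the second goes to a human reviewer.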
No prompt optimisation was required for gemma-4-26b-a4b or qwen3.6-35b-a3b to get useful results on this evaluation set. Using the same routing prompt, Gemma achieved higher recall and lower wall-clock time per row, while Qwen achieved higher precision, higher exact match, and fewer false positives. The team also ran DeepSeek-V4-Flash on the same set as a reference. It had the fewest false positives, but the model size and throughput made it impractical for executing these tasks in real time on the NVIDIA GB10. Since each row can have multiple labels, false positives and false negatives are total label counts across all rows. The Qwen results below are after retrying structured-output failures where the model ran out of output tokens before calling final_json. For Gemma and Qwen, repeated-run metrics report mean ± sample standard deviation across three runs. DeepSeek-V4-Flash was run once as a reference.
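The metrics described here can be computed as in the following sketch. It assumes micro-averaged precision and recall, meaning label counts are pooled across all rows, which matches the description of false positives and false negatives as totals. The article does not confirm the averaging method, and the function names are hypothetical.

```python
from statistics import mean, stdev


def score(gold, predicted):
    """Multi-label metrics with counts totalled across all rows."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal length")
    tp = fp = fn = exact = 0
    for g, p in zip(gold, predicted):
        g, p = set(g), set(p)
        tp += len(g & p)   # labels correctly assigned
        fp += len(p - g)   # labels assigned but not in the reference
        fn += len(g - p)   # reference labels the model missed
        exact += g == p    # row counts only if the label sets are identical
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "exact_match": exact / len(gold),
        "false_positives": fp,
        "false_negatives": fn,
    }


def summarize(values):
    """Mean and sample standard deviation across repeated runs."""
    return mean(values), stdev(values)
```

`statistics.stdev` uses the n - 1 denominator, which is the sample standard deviation the repeated-run figures refer to. It needs at least two values, so a model that was run only once has no spread to report.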




