For creators and developers building autonomous agents, the ability to refine instruction sets without manual rewriting is a game-changer. Microsoft SkillOpt offers exactly that: a system that automates the improvement of AI skills through a cycle of testing, reflection, and updating. The following guide demonstrates how to implement this workflow using the SearchQA dataset. We configure the environment to connect with OpenAI-compatible models, establish a baseline performance metric, and then execute an optimisation loop. Throughout the process, we monitor accuracy gains, track token consumption, and visualise how the agent’s capabilities evolve from a generic starting point to a specialised expert.
SkillOpt Environment Configuration
The first step involves preparing the runtime environment. We import necessary libraries for file handling and data processing, then secure the OpenAI API key. The configuration specifies gpt-4o as the optimiser and gpt-4o-mini as the target model. We clone the Microsoft SkillOpt repository and install dependencies, ensuring the backend is set to communicate via the OpenAI-compatible API endpoint.
import os, re, json, glob, subprocess, pathlib, difflib
try:
from google.colab import userdata
OPENAI_KEY = userdata.get("OPENAI_API_KEY")
except Exception:
OPENAI_KEY = os.environ.get("OPENAI_API_KEY", "")
OPENAI_KEY = OPENAI_KEY or "sk-PASTE-YOUR-KEY-HERE"
assert OPENAI_KEY.startswith("sk-"), "Set a real OpenAI key (Colab Secrets -> OPENAI_API_KEY)."
OPTIMIZER_MODEL = "gpt-4o"
TARGET_MODEL = "gpt-4o-mini"
RUN = "outputs/searchqa_adv"
LIMIT = 24
RUN_KNOBS = dict(num_epochs=2, batch_size=8, minibatch=4, merge_batch=4,
workers=2, lr=4, lr_sched="cosine", limit=LIMIT)
if not pathlib.Path("/content/SkillOpt/scripts/train.py").exists():
subprocess.run("git clone --depth 1 https://github.com/microsoft/SkillOpt.git",
shell=True, cwd="/content")
subprocess.run('pip -q install -e . && pip -q install "openai>=1.0" pandas matplotlib',
shell=True, cwd="/content/SkillOpt")
os.chdir("/content/SkillOpt")
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://api.openai.com/v1"
os.environ["AZURE_OPENAI_API_KEY"] = OPENAI_KEY
os.environ["AZURE_OPENAI_AUTH_MODE"] = "openai_compatible"
SPLIT = "data/searchqa_id_split"
CFG = "configs/searchqa/default.yaml"
COMMON = ["--azure_openai_endpoint","https://api.openai.com/v1",
"--cfg-options","model.backend=azure_openai",
"model.azure_openai_auth_mode=openai_compatible"]Establishing the Baseline
Before attempting any optimisation, we must quantify the starting performance. We define utility functions to execute commands and parse the resulting accuracy metrics. The script locates the initial seed skill file, which contains a generic instruction to answer questions based on provided context. We evaluate this untrained skill against the unseen validation split of the SearchQA dataset. This establishes a hard and soft accuracy baseline against which all future improvements will be measured.
def run_cli(args, tag):
print("\n" + "#"*80 + f"\n# {tag}\n# $ " + " ".join(args) + "\n" + "#"*80)
p = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
buf = []
for line in p.stdout:
print(line, end=""); buf.append(line)
p.wait(); return "".join(buf)
def parse_acc(txt):
m = re.search(r"Results:\s*hard=([\d.]+)\s+soft=([\d.]+)", txt)
if m: return {"hard": float(m.group(1)), "soft": float(m.group(2))}
g = re.findall(r"hard=([\d.]+)", txt)
return {"hard": float(g[-1]), "soft": None} if g else None
seed = "skillopt/envs/searchqa/skills/initial.md"
if not pathlib.Path(seed).exists():
seed = "baseline_skill.md"; pathlib.Path(seed).write_text("You answer questions from the given context.\n")
base_out = run_cli(["python","scripts/eval_only.py","--config",CFG,
"--skill",seed,"--split","valid_unseen","--split_dir",SPLIT,
"--target_model",TARGET_MODEL,*COMMON,
"env.workers=1",f"env.limit={LIMIT}"],
"BASELINE EVAL (env seed skill, no training)")
base = parse_acc(base_out)Training and Visualisation
We initiate the core optimisation loop using the configured models. The process involves rollout, reflection, aggregation, selection, and validation-based gating. Key parameters include two epochs, a batch size of eight, and a cosine learning rate schedule. We also enable slow updates and meta-skill capabilities to ensure stability. Once training concludes, we load the history log to generate a dashboard. This visualisation tracks accuracy on both training and validation sets, monitors the edit budget alongside the learning rate, and charts cumulative token usage over the training steps.
k = RUN_KNOBS
train_out = run_cli(["python","scripts/train.py","--config",CFG,"--split_dir",SPLIT,
"--optimizer_model",OPTIMIZER_MODEL,"--target_model",TARGET_MODEL,"--out_root",RUN,
*COMMON,
"train.train_size=0",
f"train.num_epochs={k['num_epochs']}", f"train.batch_size={k['batch_size']}",
f"gradient.minibatch_size={k['minibatch']}", f"gradient.merge_batch_size={k['merge_batch']}",
f"gradient.analyst_workers={k['workers']}",
f"optimizer.learning_rate={k['lr']}", f"optimizer.lr_scheduler={k['lr_sched']}",
"optimizer.use_slow_update=true", "optimizer.use_meta_skill=true",
f"env.workers={k['workers']}", f"env.limit={k['limit']}"],
"TRAIN (rollout->reflect->aggregate->select->update->gate; slow-update + meta-skill)")
import pandas as pd, matplotlib.pyplot as plt
hist = json.loads(pathlib.Path(f"{RUN}/history.json").read_text())
df = pd.json_normalize(hist)
print("\nhistory.json columns:", list(df.columns))
def col(*cands):
for c in cands:
for actual in df.columns:
if c in actual.lower(): return actual
return None
c_step = col("step")
x = df[c_step] if c_step else range(len(df))
c_tr, c_va = col("train_acc","train_hard","train"), col("val_acc","val_hard","valid","val")
c_lr, c_tok = col("edit_budget","lr","learning_rate","budget"), col("token","cost")
fig, ax = plt.subplots(1, 3, figsize=(16,4))
if c_tr: ax[0].plot(x, df[c_tr], "o-", label="train acc")
if c_va: ax[0].plot(x, df[c_va], "s-", label="val acc (gate)")
if base and base["hard"] is not None: ax[0].axhline(base["hard"], ls="--", c="grey", label="baseline (seed)")
ax[0].set_title("Skill accuracy over steps"); ax[0].set_xlabel("step"); ax[0].legend(); ax[0].grid(alpha=.3)
if c_lr: ax[1].plot(x, df[c_lr], "d-", c="purple")
ax[1].set_title("Edit-budget / LR schedule (cosine)"); ax[1].set_xlabel("step"); ax[1].grid(alpha=.3)
if c_tok: ax[2].plot(x, pd.to_numeric(df[c_tok],errors="coerce").cumsum(), c="darkorange")
ax[2].set_title("Cumulative token usage"); ax[2].set_xlabel("step"); ax[2].grid(alpha=.3)
plt.tight_layout(); plt.savefig(f"{RUN}/training_dashboard.png", dpi=120); plt.show()Inspecting Skill Evolution
Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.




