A Coding Implementation on Microsoft SkillOpt for Instrumented Prompt Optimization, Skill Evolution Analysis, and Baseline Comparison

For creators and developers building autonomous agents, the ability to refine instruction sets without manual rewriting is a game-changer. Microsoft SkillOpt offers exactly that: a system that automates the improvement of AI skills through a cycle of testing, reflection, and updating. The following guide demonstrates how to implement this workflow using the SearchQA dataset. We configure the environment to connect with OpenAI-compatible models, establish a baseline performance metric, and then execute an optimisation loop. Throughout the process, we monitor accuracy gains, track token consumption, and visualise how the agent’s capabilities evolve from a generic starting point to a specialised expert.

SkillOpt Environment Configuration

The first step involves preparing the runtime environment. We import necessary libraries for file handling and data processing, then secure the OpenAI API key. The configuration specifies gpt-4o as the optimiser and gpt-4o-mini as the target model. We clone the Microsoft SkillOpt repository and install dependencies, ensuring the backend is set to communicate via the OpenAI-compatible API endpoint.

Copy Code

import os, re, json, glob, subprocess, pathlib, difflib
try:
   from google.colab import userdata
   OPENAI_KEY = userdata.get("OPENAI_API_KEY")
except Exception:
   OPENAI_KEY = os.environ.get("OPENAI_API_KEY", "")
OPENAI_KEY = OPENAI_KEY or "sk-PASTE-YOUR-KEY-HERE"
assert OPENAI_KEY.startswith("sk-"), "Set a real OpenAI key (Colab Secrets -> OPENAI_API_KEY)."
OPTIMIZER_MODEL = "gpt-4o"
TARGET_MODEL    = "gpt-4o-mini"
RUN  = "outputs/searchqa_adv"
LIMIT = 24
RUN_KNOBS = dict(num_epochs=2, batch_size=8, minibatch=4, merge_batch=4,
                workers=2, lr=4, lr_sched="cosine", limit=LIMIT)
if not pathlib.Path("/content/SkillOpt/scripts/train.py").exists():
   subprocess.run("git clone --depth 1 https://github.com/microsoft/SkillOpt.git",
                  shell=True, cwd="/content")
   subprocess.run('pip -q install -e . && pip -q install "openai>=1.0" pandas matplotlib',
                  shell=True, cwd="/content/SkillOpt")
os.chdir("/content/SkillOpt")
os.environ["AZURE_OPENAI_ENDPOINT"]  = "https://api.openai.com/v1"
os.environ["AZURE_OPENAI_API_KEY"]   = OPENAI_KEY
os.environ["AZURE_OPENAI_AUTH_MODE"] = "openai_compatible"
SPLIT = "data/searchqa_id_split"
CFG   = "configs/searchqa/default.yaml"
COMMON = ["--azure_openai_endpoint","https://api.openai.com/v1",
         "--cfg-options","model.backend=azure_openai",
         "model.azure_openai_auth_mode=openai_compatible"]

Establishing the Baseline

Before attempting any optimisation, we must quantify the starting performance. We define utility functions to execute commands and parse the resulting accuracy metrics. The script locates the initial seed skill file, which contains a generic instruction to answer questions based on provided context. We evaluate this untrained skill against the unseen validation split of the SearchQA dataset. This establishes a hard and soft accuracy baseline against which all future improvements will be measured.

Copy Code

def run_cli(args, tag):
   print("\n" + "#"*80 + f"\n# {tag}\n# $ " + " ".join(args) + "\n" + "#"*80)
   p = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
   buf = []
   for line in p.stdout:
       print(line, end=""); buf.append(line)
   p.wait(); return "".join(buf)
def parse_acc(txt):
   m = re.search(r"Results:\s*hard=([\d.]+)\s+soft=([\d.]+)", txt)
   if m: return {"hard": float(m.group(1)), "soft": float(m.group(2))}
   g = re.findall(r"hard=([\d.]+)", txt)
   return {"hard": float(g[-1]), "soft": None} if g else None
seed = "skillopt/envs/searchqa/skills/initial.md"
if not pathlib.Path(seed).exists():
   seed = "baseline_skill.md"; pathlib.Path(seed).write_text("You answer questions from the given context.\n")
base_out = run_cli(["python","scripts/eval_only.py","--config",CFG,
                   "--skill",seed,"--split","valid_unseen","--split_dir",SPLIT,
                   "--target_model",TARGET_MODEL,*COMMON,
                   "env.workers=1",f"env.limit={LIMIT}"],
                  "BASELINE EVAL (env seed skill, no training)")
base = parse_acc(base_out)

Training and Visualisation

We initiate the core optimisation loop using the configured models. The process involves rollout, reflection, aggregation, selection, and validation-based gating. Key parameters include two epochs, a batch size of eight, and a cosine learning rate schedule. We also enable slow updates and meta-skill capabilities to ensure stability. Once training concludes, we load the history log to generate a dashboard. This visualisation tracks accuracy on both training and validation sets, monitors the edit budget alongside the learning rate, and charts cumulative token usage over the training steps.

Copy Code

k = RUN_KNOBS
train_out = run_cli(["python","scripts/train.py","--config",CFG,"--split_dir",SPLIT,
   "--optimizer_model",OPTIMIZER_MODEL,"--target_model",TARGET_MODEL,"--out_root",RUN,
   *COMMON,
   "train.train_size=0",
   f"train.num_epochs={k['num_epochs']}", f"train.batch_size={k['batch_size']}",
   f"gradient.minibatch_size={k['minibatch']}", f"gradient.merge_batch_size={k['merge_batch']}",
   f"gradient.analyst_workers={k['workers']}",
   f"optimizer.learning_rate={k['lr']}", f"optimizer.lr_scheduler={k['lr_sched']}",
   "optimizer.use_slow_update=true", "optimizer.use_meta_skill=true",
   f"env.workers={k['workers']}", f"env.limit={k['limit']}"],
   "TRAIN (rollout->reflect->aggregate->select->update->gate; slow-update + meta-skill)")
import pandas as pd, matplotlib.pyplot as plt
hist = json.loads(pathlib.Path(f"{RUN}/history.json").read_text())
df = pd.json_normalize(hist)
print("\nhistory.json columns:", list(df.columns))
def col(*cands):
   for c in cands:
       for actual in df.columns:
           if c in actual.lower(): return actual
   return None
c_step = col("step")
x = df[c_step] if c_step else range(len(df))
c_tr, c_va = col("train_acc","train_hard","train"), col("val_acc","val_hard","valid","val")
c_lr, c_tok = col("edit_budget","lr","learning_rate","budget"), col("token","cost")
fig, ax = plt.subplots(1, 3, figsize=(16,4))
if c_tr: ax[0].plot(x, df[c_tr], "o-", label="train acc")
if c_va: ax[0].plot(x, df[c_va], "s-", label="val acc (gate)")
if base and base["hard"] is not None: ax[0].axhline(base["hard"], ls="--", c="grey", label="baseline (seed)")
ax[0].set_title("Skill accuracy over steps"); ax[0].set_xlabel("step"); ax[0].legend(); ax[0].grid(alpha=.3)
if c_lr: ax[1].plot(x, df[c_lr], "d-", c="purple")
ax[1].set_title("Edit-budget / LR schedule (cosine)"); ax[1].set_xlabel("step"); ax[1].grid(alpha=.3)
if c_tok: ax[2].plot(x, pd.to_numeric(df[c_tok],errors="coerce").cumsum(), c="darkorange")
ax[2].set_title("Cumulative token usage"); ax[2].set_xlabel("step"); ax[2].grid(alpha=.3)
plt.tight_layout(); plt.savefig(f"{RUN}/training_dashboard.png", dpi=120); plt.show()

Inspecting Skill Evolution

Source Read original →

A Coding Implementation on Microsoft SkillOpt for Instrumented Prompt Optimization, Skill Evolution Analysis, and Baseline Comparison

SkillOpt Environment Configuration

Establishing the Baseline

Training and Visualisation

Inspecting Skill Evolution

Empowering Businesses with AI: Smart Tools, Smarter Business Decisions.

follow us

Popular Tag

Popular Post

Sakana AI Releases Fugu-Cyber:…

Ruff v0.16.0

Meet Open Dreamer: A…

SkillOpt Environment Configuration

Establishing the Baseline

Training and Visualisation

Inspecting Skill Evolution

Related articles

Empowering Businesses with AI: Smart Tools, Smarter Business Decisions.

follow us

Popular Tag

Popular Post

Sakana AI Releases Fugu-Cyber:…

Ruff v0.16.0

Meet Open Dreamer: A…