Microsoft’s SkillOpt boosts GPT-5.5 by using nothing but a trained Markdown file

Disclosure: Some links in this article are affiliate links. AI Maestro may earn a commission if you make a purchase, at no…

By AI Maestro June 13, 2026 4 min read
Microsoft’s SkillOpt boosts GPT-5.5 by using nothing but a trained Markdown file

For creators and developers relying on AI agents, the latest research suggests that complex procedural knowledge can be injected into a system using nothing more than a concise, optimised text file. Microsoft and researchers from three Chinese universities have demonstrated that a frozen large language model can achieve significant performance leaps on structured tasks by treating its instruction documents as trainable states. Rather than altering the model’s weights or relying on hand-crafted rules, their method, SkillOpt, uses a separate optimiser to iteratively refine a Markdown file until it acts as a robust set of skills.

Treating text as a trainable state

SkillOpt operates by decoupling the target model from the learning process. A second language model acts as the optimiser, reading logs from the agent’s execution to identify recurring errors and successful patterns. It proposes specific edits to the instruction document—adding, removing, or replacing passages—which are only accepted if they improve performance on a held-out validation set.

The approach maps established deep learning concepts onto the text level. A learning rate mechanism limits the number of edits per step, while a scheduler reduces step size across training epochs. Rejected edits are stored in a buffer to serve as negative examples for future reflection. Furthermore, a slow update phase at the end of each epoch ensures stable edit directions, mimicking gradient smoothing in traditional training.

This design offers a clean separation between training and deployment. Once the optimisation phase is complete, the optimiser model is discarded. At inference time, the target model simply receives a plain Markdown file containing between 300 and 2,000 tokens as context.

Consistent gains across benchmarks

The team evaluated their approach on six benchmarks covering search, spreadsheets, document analysis, mathematics, and embodied action. Seven systems served as target models, including GPT-5.5 and the much smaller Qwen3.5-4B. Tasks were executed in direct chat as well as within agent environments like Codex and Claude Code.

Across every combination tested, SkillOpt either led or tied with the best comparison result. This holds against handwritten skills, single-pass LLM-generated skills, and specialised methods such as Trace2Skill, TextGrad, GEPA, and EvoSkill. On GPT-5.5 in direct chat, the average performance across all six benchmarks increased by approximately 23 points.

The most substantial improvements appeared on tasks with strict format requirements and tool use, such as spreadsheet editing. Smaller models also benefited, which the authors interpret as evidence that a well-trained skill delivers procedural knowledge these models lack in their base weights.

Transferability across models and environments

A key finding is the method’s transferability. A skill trained on a larger model also improves smaller models within the same family. For instance, a spreadsheet skill trained in the Codex loop worked unchanged in Claude Code, lifting performance there to the same level as a skill trained directly in that environment. Similarly, a math skill optimised on olympiad problems delivered gains on a related benchmark without any retraining.

Ablation studies explain why the method remains stable. Without a bounded edit budget, the skill drifts too far with each revision. Without the buffer for rejected edits, the optimiser repeats the same failed attempts. Removing the slow update at the end of each epoch cost the SpreadsheetBench more than twenty points, the largest drop in the entire experiment. Only the combination of bounded step size, validation gating, negative feedback, and long-term consolidation makes skill training behave like a controlled optimisation process.

Compact documents drive the improvement

The final skills remain compact; the finished documents rarely exceed 2,000 tokens, and the improvements result from just one to four accepted edits across four training epochs. On OfficeQA, the largest gain stemmed from a single accepted change.

The learned rules read as if an experienced practitioner had jotted them down after working with the benchmark. For spreadsheets, the skill learns to check the worksheet structure first and write directly evaluated values into the entire target range instead of using Excel formulas. In ALFWorld, it keeps a log of visited locations and avoids heading to the goal before picking up the target object. For document questions, it anchors the question to the right table row before accepting an answer. None of these rules refer to a specific task; they describe procedures.

The authors acknowledge that the method depends on reliable automatic scoring. For open-ended tasks where success is hard to measure, the validation step would require human or model-based judgments. SkillOpt also deliberately optimises a single document rather than a skill library, which could become a bottleneck for highly varied domains.

Positioning in the self-improvement landscape

While most current self-improvement approaches eventually tweak model weights, SkillOpt takes a remarkably lean path. OpenClaw-RL, a framework from Princeton researchers, uses follow-up signals from every interaction—such as user responses or test results—as a live training source. MetaClaw pulls compact behavioural rules from failed tasks and injects them into the prompt, updating weights only during idle phases via reinforcement learning. One parallel to SkillOpt: weaker models benefit the most in both cases because they lack procedural knowledge that a rule or skill can supply directly.

Other groups go further. AutoTTS allows a coding agent to search for better reasoning control algorithms on its own, shifting the human role from designing rules to designing the environment. Meta’s Hyperagents optimise the very mechanism they use to improve themselves. SkillOpt, by contrast, keeps the model frozen and changes nothing but a readable text file.

Key takeaways

  • SkillOpt demonstrates that procedural knowledge can be optimised via a separate model refining a Markdown file, without altering the target model’s weights.
  • The method achieved an average 23-point performance boost on GPT-5.5 across six benchmarks, particularly excelling in tasks requiring strict formatting and tool use.
  • Skills trained on larger models transfer effectively to smaller models and different environments, proving the robustness of the learned procedural rules.
  • Success relies on a controlled optimisation process involving bounded edit budgets, validation gating, and long-term consolidation to prevent skill drift.

Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.

Name
Scroll to Top