Microsoft’s SkillOpt Turns Agent Skill Docs into Trainable Parameters for Self‑Evolving AI

Microsoft’s newly open‑source SkillOpt framework treats an agent’s skill document as external weights, applying a rollout‑reflect‑edit‑gate training loop with textual learning rates and rejected‑edit buffers, enabling self‑evolving skills that achieve optimal or tied‑optimal results across 52 model‑benchmark‑environment combinations.

Machine Heart
Machine Heart
Machine Heart
Microsoft’s SkillOpt Turns Agent Skill Docs into Trainable Parameters for Self‑Evolving AI

In modern AI‑agent development, programmers spend extensive effort hand‑crafting skill files (e.g., CLAUDE.md, Codex skill files, various system prompts). This manual trial‑and‑error process is analogous to tuning a single prompt, but the artifact is now a full document, making iteration costly and contradictory to the goal of delegating work to smarter AI.

SkillOpt: Training Skill Documents as External Weights

Microsoft open‑sourced SkillOpt , a framework that treats an agent’s skill document as “external weights.” By applying the same gradient‑descent logic used for neural‑network parameters, SkillOpt iteratively improves the textual skill file without altering the underlying model.

Training Loop

Rollout (forward pass) : A frozen target model executes a batch of tasks using the current skill document, recording the full execution trace (messages, tool calls, validation feedback, final score). This trace serves as the evidence analogous to a forward‑propagation output.

Reflect (backward pass) : An independent optimizer model analyses the trace. Failed minibatches reveal rules that need correction; successful minibatches confirm effective rules that should remain unchanged. This step computes a “textual gradient” indicating how the skill document should be modified.

Edit (parameter update) : Based on the gradient, the optimizer proposes structured edits— add, delete, or replace —to the skill document.

Gate (validation) : Candidate edits are evaluated on a held‑out validation set; only those that strictly improve performance are accepted, preventing over‑fitting.

The loop runs for multiple epochs, each containing several steps, mirroring conventional neural‑network training.

Key Design Mechanisms

Textual learning rate : Limits the number of edit operations per step (default lr=4) to avoid catastrophic forgetting of previously learned rules. Ablation shows performance drops of 2–4 points on benchmarks when this constraint is removed.

Rejected‑edit buffer : Stores edits rejected by the gate. The optimizer can later reference these “failed attempts,” providing negative gradient information and improving subsequent edit proposals. Removing the buffer reduces SpreadsheetBench scores from 77.5 % to 72.9 %.

Slow Update : At the end of each epoch, accepted edits are aggregated and a larger‑scale update is applied, similar to learning‑rate warm‑up or periodic big steps in deep learning.

Meta Skill : The optimizer maintains its own meta‑skill document, recording experience such as “focus on tool‑call format for this benchmark.” This meta‑skill evolves across epochs, allowing the optimizer itself to improve.

Experimental Evaluation

SkillOpt was evaluated on 7 target models (GPT‑5.5, GPT‑5.4, GPT‑5.4‑mini, GPT‑5.4‑nano, GPT‑5.2, Qwen3.5‑4B, Qwen3.6‑35B‑A3B), 6 benchmarks (SearchQA, SpreadsheetBench, OfficeQA, DocVQA, LiveMathematicianBench, ALFWorld), and 3 execution environments (direct dialogue, OpenAI Codex, Anthropic Claude Code), yielding 52 model‑benchmark‑environment combinations.

Across all 52 combos, SkillOpt achieved the best or tied‑best score. Notable average gains include +23.5 points for GPT‑5.5 in direct dialogue, +24.9 points for the smallest GPT‑5.4‑nano model, and environment‑specific lifts of up to +58.3 points on SpreadsheetBench.

Baseline Comparison

Six baselines were tested: no skill, human‑written skill, one‑shot LLM‑generated skill, Trace2Skill, TextGrad, and GEPA. SkillOpt outperformed the strongest baseline on every benchmark, e.g., +1.9 pts on SearchQA, +4.4 pts on SpreadsheetBench, and +9.2 pts on LiveMath.

Transfer Experiments

Cross‑model transfer: Skills trained on GPT‑5.4 improved GPT‑5.4‑nano by 15.2 points without retraining.

Cross‑environment transfer: SpreadsheetBench skills trained in Codex boosted performance by 31.8 points when applied in Claude Code.

Self‑optimization: Using GPT‑5.4‑nano as both target and optimizer still yielded a +10.4‑point gain on SpreadsheetBench.

Deployment simplicity: Only the final best_skill.md is needed at inference time; no optimizer model or memory module is required, incurring zero runtime overhead.

Visualization of Skill Evolution

An ALFWorld case study (target model GPT‑5.4‑mini, optimizer GPT‑5.5) shows the skill document evolving over four training steps. New rules such as “treat any generic target container as valid,” “maintain a strictly numbered searched set to avoid revisiting locations,” and “expand search radius after consecutive misses” were automatically extracted from failure traces.

These edits raised the hard‑difficulty test accuracy from 70.9 % to 85.8 %. The slow‑update mechanism rescued a step where validation performance initially dropped, and the gate rejected an over‑aggressive edit that improved training loss but not validation, mirroring scientific hypothesis testing.

Conclusion

SkillOpt demonstrates that an agent’s entire skill set can be treated as a trainable artifact, enabling systematic self‑improvement without additional inference cost. The framework’s design—textual learning rates, rejected‑edit buffers, slow updates, and meta‑skills—provides a robust training pipeline that consistently outperforms existing text‑optimization baselines and transfers effectively across models and environments.

Source code, documentation, and the accompanying arXiv paper (https://arxiv.org/abs/2605.23904) are publicly available.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

AI agentsMicrosoftbenchmark evaluationself‑evolving skillsSkillOpttextual optimization
Machine Heart
Written by

Machine Heart

Professional AI media and industry service platform

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.