Breaking the Agent Training Bottleneck: Open‑Source ClawGym Data, Training, and Evaluation Pipeline
ClawGym provides a complete open‑source framework for Claw‑style personal agents, linking a 13.5 K synthetic task dataset, black‑box rollout training, sandbox‑parallel reinforcement learning, and a rigorously verified benchmark of 200 tasks, and demonstrates that synthetic data can lift a 30 B model beyond a 235 B baseline.
Motivation and Challenges
Large language models are moving from answering questions to completing tasks, but Claw‑style personal agents face a harder problem than traditional environment training because they must operate in a persistent, stateful workspace composed of files, scripts, logs, and external tools. Success depends on the final workspace state, not merely on a textual claim of completion, making data, training, and evaluation considerably more difficult.
ClawGym Architecture
ClawGym‑SynData : the first large‑scale synthetic dataset for Claw agents, containing 13.5 K executable tasks.
ClawGym‑Agents : agents trained on OpenClaw black‑box rollouts, exploring sandbox‑parallel reinforcement learning.
ClawGym‑Bench : a benchmark of 200 high‑quality tasks covering six workspace scenario categories for reliable capability diagnosis.
ClawGym‑SynData: Data Construction
Task creation follows a four‑step pipeline—task generation, resource preparation, verification design, and quality assessment—using two complementary synthesis routes.
Persona‑driven top‑down synthesis starts from user intents and constructs diverse scenarios such as file organization, data analysis, and report generation.
Skill‑grounded bottom‑up synthesis extracts reusable tool abilities from OpenClaw skills, filters and composes them to ensure generated tasks are executable within the system’s capabilities.
Both routes are combined to produce tasks that are realistic (user‑driven) and executable (skill‑grounded). For each task a lightweight mock workspace (Markdown, JSON, CSV, YAML, config files, logs) is automatically generated, providing the initial state and data conditions required for execution.
Verification mixes code‑based checks (file paths, schema compliance, numerical correctness) with rubric‑based assessment (report clarity, completeness, alignment with user intent), ensuring that a task is considered solved only when the workspace meets all specified criteria.
ClawGym‑Agents: Training from Black‑Box Rollouts
Using OpenClaw black‑box rollouts, the framework collects multi‑turn interaction trajectories rather than simplifying the agent loop. Collected trajectories are aggregated, cleaned, and filtered to remove system messages and anomalies, retaining high‑quality samples based on verifier scores.
Each retained trajectory averages 13.00 interaction rounds, 18.67 K tokens, 15.82 tool calls, and 3.25 tool types, providing rich supervision that includes planning, file checks, tool execution, environment feedback, and iterative adjustments.
These trajectories are used to fine‑tune Qwen‑3 series models (ClawGym‑4B, ClawGym‑8B, ClawGym‑30B‑A3B) with loss masking on environment feedback, and to explore sandbox‑parallel reinforcement learning where each task runs in an isolated sandbox and receives a reward from the code verifier.
ClawGym‑Bench: Evaluation Benchmark
ClawGym‑Bench comprises 200 carefully screened tasks that assess an agent’s ability to execute in a real workspace. Tasks are selected for difficulty and discriminative power, and undergo a human‑LLM collaborative review to eliminate ambiguities and verification gaps.
The benchmark covers six typical workspace domains: productivity & collaboration, systems & automation, analysis & reasoning, content & domain support, planning & knowledge management, and software development.
Beyond a single score, ClawGym‑Bench enables detailed analysis of model behavior across different capabilities, such as file state understanding, tool selection, long‑range execution, and adherence to fine‑grained output requirements.
Experimental Results
Training on ClawGym‑SynData consistently improves performance of open‑source models on Claw‑style tasks. After fine‑tuning, ClawGym‑4B, ClawGym‑8B, and ClawGym‑30B‑A3B achieve scores of 47.73, 50.24, and 56.82 on ClawGym‑Bench, respectively, all surpassing their base models.
Notably, ClawGym‑30B‑A3B outperforms the larger Qwen‑3‑235B‑A23B, showing that high‑quality agent interaction data can partially compensate for model size.
ClawGym‑Bench exhibits strong discriminative ability: average scores range from 35.02 (Qwen‑3‑8B) to 77.81 (Claude‑4.7‑Opus), forming a clear capability gradient.
When evaluated on the external PinchBench benchmark, ClawGym‑Agents trained solely on synthetic data achieve a notable 86.00 score for the 30 B model, indicating that the learned skills transfer to unseen environments.
Behavioral Analysis
The key challenges for Claw agents go beyond mere tool invocation:
Organizing tool calls into coherent workflows that update the workspace step by step.
Recovering from errors in long‑range execution, such as missing files or failed commands, by using feedback to adjust actions.
Generating and validating concrete workspace artifacts (CSV, JSON, reports) that must satisfy field‑level, formulaic, and cross‑file consistency requirements.
Thus, Claw‑style tasks evaluate an agent’s ability to maintain and evolve a stateful workspace, not just its language or single‑step tool skills.
Conclusion
ClawGym closes the loop for Claw‑style personal agents by providing large‑scale executable data, black‑box trajectory training, and a rigorous benchmark. The framework demonstrates that synthetic, high‑quality interaction data can substantially boost the execution capabilities of models, even allowing a 30 B model to surpass a 235 B baseline, and highlights the importance of workspace‑centric evaluation for future agent research.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Machine Learning Algorithms & Natural Language Processing
Focused on frontier AI technologies, empowering AI researchers' progress.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
