Breaking the Agent Training Bottleneck: Open‑Source ClawGym Data, Training, and Evaluation Pipeline

ClawGym provides a complete open‑source framework for Claw‑style personal agents, linking a 13.5 K synthetic task dataset, black‑box rollout training, sandbox‑parallel reinforcement learning, and a rigorously verified benchmark of 200 tasks, and demonstrates that synthetic data can lift a 30 B model beyond a 235 B baseline.

Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Breaking the Agent Training Bottleneck: Open‑Source ClawGym Data, Training, and Evaluation Pipeline

Motivation and Challenges

Large language models are moving from answering questions to completing tasks, but Claw‑style personal agents face a harder problem than traditional environment training because they must operate in a persistent, stateful workspace composed of files, scripts, logs, and external tools. Success depends on the final workspace state, not merely on a textual claim of completion, making data, training, and evaluation considerably more difficult.

ClawGym Architecture

ClawGym‑SynData : the first large‑scale synthetic dataset for Claw agents, containing 13.5 K executable tasks.

ClawGym‑Agents : agents trained on OpenClaw black‑box rollouts, exploring sandbox‑parallel reinforcement learning.

ClawGym‑Bench : a benchmark of 200 high‑quality tasks covering six workspace scenario categories for reliable capability diagnosis.

ClawGym‑SynData: Data Construction

Task creation follows a four‑step pipeline—task generation, resource preparation, verification design, and quality assessment—using two complementary synthesis routes.

Persona‑driven top‑down synthesis starts from user intents and constructs diverse scenarios such as file organization, data analysis, and report generation.

Skill‑grounded bottom‑up synthesis extracts reusable tool abilities from OpenClaw skills, filters and composes them to ensure generated tasks are executable within the system’s capabilities.

Both routes are combined to produce tasks that are realistic (user‑driven) and executable (skill‑grounded). For each task a lightweight mock workspace (Markdown, JSON, CSV, YAML, config files, logs) is automatically generated, providing the initial state and data conditions required for execution.

Verification mixes code‑based checks (file paths, schema compliance, numerical correctness) with rubric‑based assessment (report clarity, completeness, alignment with user intent), ensuring that a task is considered solved only when the workspace meets all specified criteria.

Image
Image

ClawGym‑Agents: Training from Black‑Box Rollouts

Using OpenClaw black‑box rollouts, the framework collects multi‑turn interaction trajectories rather than simplifying the agent loop. Collected trajectories are aggregated, cleaned, and filtered to remove system messages and anomalies, retaining high‑quality samples based on verifier scores.

Each retained trajectory averages 13.00 interaction rounds, 18.67 K tokens, 15.82 tool calls, and 3.25 tool types, providing rich supervision that includes planning, file checks, tool execution, environment feedback, and iterative adjustments.

These trajectories are used to fine‑tune Qwen‑3 series models (ClawGym‑4B, ClawGym‑8B, ClawGym‑30B‑A3B) with loss masking on environment feedback, and to explore sandbox‑parallel reinforcement learning where each task runs in an isolated sandbox and receives a reward from the code verifier.

ClawGym‑Bench: Evaluation Benchmark

ClawGym‑Bench comprises 200 carefully screened tasks that assess an agent’s ability to execute in a real workspace. Tasks are selected for difficulty and discriminative power, and undergo a human‑LLM collaborative review to eliminate ambiguities and verification gaps.

The benchmark covers six typical workspace domains: productivity & collaboration, systems & automation, analysis & reasoning, content & domain support, planning & knowledge management, and software development.

Beyond a single score, ClawGym‑Bench enables detailed analysis of model behavior across different capabilities, such as file state understanding, tool selection, long‑range execution, and adherence to fine‑grained output requirements.

Image
Image

Experimental Results

Training on ClawGym‑SynData consistently improves performance of open‑source models on Claw‑style tasks. After fine‑tuning, ClawGym‑4B, ClawGym‑8B, and ClawGym‑30B‑A3B achieve scores of 47.73, 50.24, and 56.82 on ClawGym‑Bench, respectively, all surpassing their base models.

Notably, ClawGym‑30B‑A3B outperforms the larger Qwen‑3‑235B‑A23B, showing that high‑quality agent interaction data can partially compensate for model size.

ClawGym‑Bench exhibits strong discriminative ability: average scores range from 35.02 (Qwen‑3‑8B) to 77.81 (Claude‑4.7‑Opus), forming a clear capability gradient.

When evaluated on the external PinchBench benchmark, ClawGym‑Agents trained solely on synthetic data achieve a notable 86.00 score for the 30 B model, indicating that the learned skills transfer to unseen environments.

Image
Image

Behavioral Analysis

The key challenges for Claw agents go beyond mere tool invocation:

Organizing tool calls into coherent workflows that update the workspace step by step.

Recovering from errors in long‑range execution, such as missing files or failed commands, by using feedback to adjust actions.

Generating and validating concrete workspace artifacts (CSV, JSON, reports) that must satisfy field‑level, formulaic, and cross‑file consistency requirements.

Thus, Claw‑style tasks evaluate an agent’s ability to maintain and evolve a stateful workspace, not just its language or single‑step tool skills.

Conclusion

ClawGym closes the loop for Claw‑style personal agents by providing large‑scale executable data, black‑box trajectory training, and a rigorous benchmark. The framework demonstrates that synthetic, high‑quality interaction data can substantially boost the execution capabilities of models, even allowing a 30 B model to surpass a 235 B baseline, and highlights the importance of workspace‑centric evaluation for future agent research.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

benchmarkreinforcement learningsynthetic datapersonal agentsOpenClawagent trainingClawGym
Machine Learning Algorithms & Natural Language Processing
Written by

Machine Learning Algorithms & Natural Language Processing

Focused on frontier AI technologies, empowering AI researchers' progress.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.