Rewriting Claude Code in 90k Lines of Python: How CheetahClaws Tests Harness Scaling

The article analyzes why AI agents need system‑level scaling, explains the UC Berkeley "Harness" framework, and details how the open‑source CheetahClaws project rewrites Claude Code in Python to evaluate system scaling across memory, context, routing, orchestration and governance components.

SuanNi

Jun 1, 2026

Why Agents Need More Than Model Scaling

AI agents become stronger not only by scaling the underlying model but also by scaling the surrounding execution framework, called the Harness. UC Berkeley researchers argue that agents should be treated as a system‑scaling problem, separating model improvements (R) from system components such as memory (M), context construction (C), skill routing (S), orchestration (O) and governance (G).

Harness Decomposition

The authors split an agent system into six interacting components: reasoning base R, memory store M, context builder C, skill routing layer S, orchestration loop O, and verification/governance layer G. Performance over a time horizon H is expressed as P(H) = Φ(R,M,C,S,O,G), where model scaling improves R and system scaling improves the other five.
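To make the separation of model scaling from system scaling concrete, the following is a minimal sketch of a single-factor ablation over the six components. The `HarnessConfig` class, the `single_factor_ablations` helper and every component label are hypothetical names chosen for illustration; they are not taken from the Berkeley paper or from CheetahClaws.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class HarnessConfig:
    """One point in the (R, M, C, S, O, G) design space; values are hypothetical labels."""
    reasoning: str      # R: underlying model
    memory: str         # M: memory store
    context: str        # C: context builder
    skills: str         # S: skill routing layer
    orchestration: str  # O: orchestration loop
    governance: str     # G: verification/governance layer


def single_factor_ablations(base, alternatives):
    """Return configs that differ from `base` in exactly one component."""
    variants = []
    for field_name, options in alternatives.items():
        for option in options:
            if option != getattr(base, field_name):
                variants.append(replace(base, **{field_name: option}))
    return variants


base = HarnessConfig("model-a", "none", "full-history", "static", "single-loop", "manual")
variants = single_factor_ablations(
    base,
    {"reasoning": ["model-a", "model-b"], "memory": ["none", "project-scope"]},
)
for variant in variants:
    print(variant)
```

Because each variant changes one component while holding the other five fixed, a difference in measured P(H) between `base` and a variant can be attributed to that component alone, under the assumption that the components do not interact strongly.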

Memory is further broken into precision, durability, retrievability and verifiability; context into relevance, compactness, traceability and refresh policy. Each factor is a system‑level lever.
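As an illustration of how the memory factors can become explicit fields rather than implicit behavior, here is a small sketch of a memory record. The mapping of fields to factors, the `MemoryRecord` class and the helper functions are assumptions made for this example, not the CheetahClaws data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class MemoryRecord:
    key: str              # lookup handle (retrievability)
    value: str
    confidence: float     # how much the entry is trusted (precision)
    source: str           # where the entry came from (verifiability)
    created_at: datetime  # when it was written (durability / freshness)


def is_stale(record, now, max_age):
    """True when the record is older than the allowed age."""
    return now - record.created_at > max_age


def find_conflicts(records):
    """Return the sorted keys that are stored with more than one distinct value."""
    values_by_key = {}
    for record in records:
        values_by_key.setdefault(record.key, set()).add(record.value)
    return sorted(key for key, values in values_by_key.items() if len(values) > 1)


now = datetime(2026, 6, 1, tzinfo=timezone.utc)
records = [
    MemoryRecord("test_cmd", "pytest -q", 0.9, "README.md", now - timedelta(days=40)),
    MemoryRecord("test_cmd", "make test", 0.6, "chat session", now - timedelta(days=1)),
    MemoryRecord("style", "black", 0.8, "pyproject.toml", now - timedelta(days=2)),
]
conflicts = find_conflicts(records)
stale_flags = [is_stale(r, now, timedelta(days=30)) for r in records]
```

In this example `conflicts` contains only `test_cmd`, and only the first record is stale. A harness could use such signals to ask for confirmation before trusting a memory entry.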

CheetahClaws: Engineering the Theory

CheetahClaws is an open‑source Python reimplementation of the Claude Code agent framework. It exposes the core loop in a readable codebase of roughly 90k lines and supports arbitrary models. The article contrasts it with Claude Code (a 12 MB TypeScript/Node package tightly coupled to Anthropic’s API) and OpenClaw (a personal‑assistant tool). Against these, CheetahClaws offers model‑agnostic execution, a 740‑line readable agent loop, zero‑build deployment, and extensive tooling: 33 built‑in tools, MCP integration, multi‑modal plugins, dual‑layer context compression, offline speech, cloud session sync, and bridge adapters for Telegram, WeChat, Slack and QQ.

Key engineering highlights include:

Model‑agnostic support for 8 cloud providers (Anthropic, OpenAI, Google, DeepSeek, Alibaba, Moonshot, Zhipu, MiniMax) covering 200+ models, plus local Ollama, LM Studio and vLLM endpoints.

Dual‑scope memory (user and project) with confidence, freshness, source metadata, conflict detection, and a /memory consolidate command.

Context compression that targets the four context factors: relevance, compactness, traceability and refresh policy.

Multi‑agent capability with typed sub‑agents (coder, reviewer, researcher) isolated via git worktree and background execution.

Four permission modes (auto, accept‑all, manual, plan) with prompt‑injection detection, credential filtering and secure stdio sandboxing.
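The permission modes in the last item can be pictured as a small decision function that runs before every tool call. The article does not define what each mode does, so the semantics below are assumptions made for illustration, as are the `Mode` and `Decision` enums, the `decide` function and the tool names.

```python
from enum import Enum


class Mode(Enum):
    AUTO = "auto"
    ACCEPT_ALL = "accept-all"
    MANUAL = "manual"
    PLAN = "plan"


class Decision(Enum):
    ALLOW = "allow"
    ASK = "ask"
    DENY = "deny"


# Hypothetical tool names; a real harness would derive this from tool metadata.
READ_ONLY_TOOLS = frozenset({"read_file", "grep"})


def decide(mode, tool_name):
    """Map a permission mode and a tool call to a decision (assumed semantics)."""
    read_only = tool_name in READ_ONLY_TOOLS
    if mode is Mode.ACCEPT_ALL:
        return Decision.ALLOW  # assumed: approve every call
    if mode is Mode.MANUAL:
        return Decision.ASK    # assumed: confirm every call with the user
    if mode is Mode.PLAN:
        # assumed: planning may inspect the workspace but never change it
        return Decision.ALLOW if read_only else Decision.DENY
    # Mode.AUTO, assumed: read-only calls pass, everything else needs confirmation
    return Decision.ALLOW if read_only else Decision.ASK


for mode in Mode:
    print(mode.value, decide(mode, "read_file").value, decide(mode, "write_file").value)
```

Keeping the policy in one pure function makes the governance layer easy to test and to audit, independent of which model sits behind the agent.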

Comparative Evaluation

Claude Code is stronger in its rich React/Ink UI, built‑in tools, enterprise features and reliable production binary. CheetahClaws is stronger in model flexibility, readable code, zero‑build deployment, an extensive plugin system, sophisticated context compression, and security hardening (environment‑variable token binding, CSRF cookies, session ownership, and sandboxing of plugins, MCP and the filesystem).

OpenClaw targets personal assistants across messaging platforms, while CheetahClaws focuses on AI‑assisted coding in terminal environments.

Assessment Agenda

The authors propose a next‑generation benchmark that goes beyond single‑score task success. It should measure cost, token usage, tool calls, retry counts, failure modes, memory accuracy, drift, safety, and governance traceability. Current benchmarks (SWE‑bench, AgentBench, WebArena, Terminal‑Bench) mix model and Harness effects, making it hard to isolate improvements.
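A benchmark of this kind needs a per-run record richer than a pass/fail bit. The sketch below shows one possible shape for such a record and a summary over several runs; the `RunRecord` fields, the `summarize` function and the sample values are hypothetical and cover only a subset of the proposed measurements.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    task_id: str
    success: bool
    cost_usd: float
    tokens: int
    tool_calls: int
    retries: int
    failure_mode: Optional[str] = None  # e.g. "stale_memory"; None on success


def summarize(runs):
    """Aggregate several runs into a multi-dimensional report instead of one score."""
    if not runs:
        raise ValueError("at least one run is required")
    count = len(runs)
    return {
        "success_rate": sum(1 for r in runs if r.success) / count,
        "mean_cost_usd": sum(r.cost_usd for r in runs) / count,
        "total_tokens": sum(r.tokens for r in runs),
        "mean_tool_calls": sum(r.tool_calls for r in runs) / count,
        "total_retries": sum(r.retries for r in runs),
        "failure_modes": dict(Counter(r.failure_mode for r in runs if r.failure_mode)),
    }


# Illustrative values only, not measured results.
runs = [
    RunRecord("t1", True, 0.25, 1000, 5, 0),
    RunRecord("t2", False, 0.75, 3000, 9, 2, "stale_memory"),
]
summary = summarize(runs)
print(summary)
```

Reporting failure modes alongside cost and token usage is what lets two harnesses with the same success rate be told apart.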

They argue that stronger models alone will not solve system failures such as stale memory, over‑permissive tools, missing provenance, unverified retrievals, or unsafe actions. System‑level evaluation remains essential despite higher cost.

Conclusion

CheetahClaws demonstrates how the Harness design can be concretely realized, making system‑scale trade‑offs explicit and discussable. Both model scaling and system scaling must progress in parallel for agents to truly become stronger.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.