Agent Harness Model Achieves Frontier Performance at <1% Compute Cost – Introducing Macaron‑V1‑Preview
A 30‑person lab trained Macaron‑V1‑Preview, a 749B‑parameter Agent model, on fewer than 300 GPUs at less than 1% of the compute cost of comparable models. The model matches state‑of‑the‑art performance on real‑world Agent evaluations such as LivingBench, VitaBench, A2UI and PinchBench.
Introduction
Mind Lab, a 30‑person laboratory incubated by the Guangdong‑Hong‑Kong‑Macao Greater Bay Area National Technology Innovation Center, trained the 749B‑parameter Agent model Macaron‑V1‑Preview on fewer than 300 GPUs, bringing the compute cost to under 1% of that of similarly sized models.
The model is built on the GLM5.1 base, activates 40B parameters, and employs a Mixture‑of‑LoRA (MoL) architecture designed for deep post‑training on Agent Harness scenarios.
Model Overview and Evaluation
Macaron‑V1‑Preview consists of a 744B base model plus five 1B LoRA adapters dedicated to chat, life tasks, coding, OpenClaw tasks, and routing. Unlike earlier large‑model releases that highlight single abilities (e.g., math or code), this model presents a more complete, Agent‑native capability stack that optimizes for real‑task flows, tool usage, interaction loops, and user‑feedback refinement.
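The parameter counts quoted above can be reconciled with a few lines of arithmetic. This is only a check of the stated figures (744B base, five 1B adapters); the variable names are illustrative.

```python
# Parameter counts as stated in the article, in billions.
BASE_PARAMS_B = 744
ADAPTER_PARAMS_B = 1
NUM_ADAPTERS = 5

total_b = BASE_PARAMS_B + NUM_ADAPTERS * ADAPTER_PARAMS_B
adapter_share = NUM_ADAPTERS * ADAPTER_PARAMS_B / total_b

print(f"total parameters: {total_b}B")                  # 749B, the headline figure
print(f"adapter share of total: {adapter_share:.2%}")   # well under 1% of all weights
```

The adapters therefore account for a very small fraction of the weights, which is why they can be trained and shipped separately from the base.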
Benchmark results show the model achieving state‑of‑the‑art (SOTA) scores on several real‑world Agent evaluations:
LivingBench (long‑chain life‑task benchmark) – SOTA.
VitaBench (defined by Meituan, covering dining, shopping, travel, etc.) – SOTA.
Google A2UI protocol – first open‑source model to support the protocol, delivering high‑quality dynamic UI generation within 5 seconds.
PinchBench (OpenClaw personal assistant benchmark) – 92.5 points, the best among open‑source models.
Additional tasks such as τ³‑bench (customer‑service tool calls), SWE‑Verified (code repair), and Terminal2 (terminal interaction) – performance comparable to leading closed‑source models.
These results demonstrate that the model can operate effectively in realistic life‑scene tasks, handling user preferences, multi‑step reasoning, and dynamic feedback.
MoL Architecture and Continuous Learning
The core innovation is the Mixture‑of‑LoRA (MoL) architecture. During post‑training, different tasks (chat, tool use, reasoning, coding) require distinct skill sets and reasoning chains, and naively merging them into one set of weights can cause one capability to degrade another. MoL addresses this by grouping tasks with shared skills into a single LoRA adapter while assigning divergent tasks to separate adapters on the same base. Similar tasks can then reinforce each other, and disparate tasks can evolve independently.
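The idea of several adapters sharing one base can be sketched in plain Python. This is a toy illustration, not the released implementation: it uses the standard LoRA form (base output plus a low‑rank update B·A·x), and the class names, task names, and all numbers are assumptions made up for the example.

```python
from dataclasses import dataclass

Matrix = list[list[float]]
Vector = list[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


@dataclass
class LoRAAdapter:
    """Low-rank update delta_W = B @ A, with A of shape (r, d_in), B of shape (d_out, r)."""
    a: Matrix
    b: Matrix
    scale: float = 1.0

    def delta(self, x: Vector) -> Vector:
        return [self.scale * y for y in matvec(self.b, matvec(self.a, x))]


class MixtureOfLoRA:
    """One frozen base weight shared by several independently trained adapters."""

    def __init__(self, base: Matrix, adapters: dict[str, LoRAAdapter],
                 task_to_adapter: dict[str, str]):
        self.base = base
        self.adapters = adapters
        self.task_to_adapter = task_to_adapter

    def forward(self, x: Vector, task: str) -> Vector:
        adapter = self.adapters[self.task_to_adapter[task]]
        base_out = matvec(self.base, x)
        return [h + d for h, d in zip(base_out, adapter.delta(x))]


# Toy 2x2 base with rank-1 adapters; every number here is invented.
base = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    "code": LoRAAdapter(a=[[1.0, 0.0]], b=[[0.5], [0.0]]),
    "life": LoRAAdapter(a=[[0.0, 1.0]], b=[[0.0], [-0.5]]),
}
# Tasks with shared skills map to one adapter; divergent tasks get their own.
task_to_adapter = {"bug_fix": "code", "refactor": "code", "trip_planning": "life"}

model = MixtureOfLoRA(base, adapters, task_to_adapter)
x = [2.0, 4.0]
print(model.forward(x, "bug_fix"))        # [3.0, 4.0]
print(model.forward(x, "trip_planning"))  # [2.0, 2.0]
```

Because the base matrix is never modified, updating the `code` adapter cannot change what `trip_planning` computes, which is the isolation property the paragraph above describes.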
Five expert LoRAs are deployed:
Chat – general conversation.
Life – personal life tasks.
Code – programming assistance.
Claw – OpenClaw‑style tasks.
L4 – routing adapter.
Routing is exposed as a tool‑call API compatible with OpenAI standards. The default entry point is the L4 adapter, which registers a router_tool. A central registry stores metadata for each LoRA; adding a new expert only requires updating this metadata.
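A registry‑driven router tool might look roughly like the following. The outer structure follows the function‑tool format used by OpenAI‑style chat APIs; the registry fields, the `expert` argument, and the choice to list only the four task experts as routable targets are assumptions for illustration, not the released schema.

```python
import json

# Hypothetical registry; field names are illustrative only.
LORA_REGISTRY = {
    "chat": {"description": "General conversation."},
    "life": {"description": "Personal life tasks."},
    "code": {"description": "Programming assistance."},
    "claw": {"description": "OpenClaw-style tasks."},
}


def build_router_tool(registry: dict) -> dict:
    """Build an OpenAI-style function-tool definition from registry metadata."""
    summary = "; ".join(f"{name}: {meta['description']}" for name, meta in registry.items())
    return {
        "type": "function",
        "function": {
            "name": "router_tool",
            "description": f"Hand the task to an expert adapter. Experts: {summary}",
            "parameters": {
                "type": "object",
                "properties": {
                    "expert": {"type": "string", "enum": sorted(registry)},
                },
                "required": ["expert"],
            },
        },
    }


def parse_router_call(arguments_json: str, registry: dict) -> str:
    """Validate the JSON arguments of a router_tool call and return the expert name."""
    expert = json.loads(arguments_json)["expert"]
    if expert not in registry:
        raise ValueError(f"unknown expert: {expert!r}")
    return expert


tool = build_router_tool(LORA_REGISTRY)
print(json.dumps(tool, indent=2))

# Adding an expert only touches the metadata; the tool definition is rebuilt from it.
extended = {**LORA_REGISTRY, "travel": {"description": "Hypothetical new expert."}}
print(build_router_tool(extended)["function"]["parameters"]["properties"]["expert"]["enum"])
```

Deriving the tool definition from the registry keeps the two from drifting apart, which matches the claim that adding an expert requires only a metadata update.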
During an Agent loop, explicit routing invokes the appropriate expert through the router tool; control then returns to L4 for the next user turn (implicit routing). When the active LoRA changes, the existing KV cache is kept rather than recomputed. This costs a modest amount of quality, which remains within acceptable bounds for Agent switching.
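The switching behavior can be pictured as a small state machine. The sketch below is an assumption‑laden toy: the `Session` class, its method names, and the string "tokens" are invented, and the real KV cache holds tensors rather than labels. It only shows that a switch changes the active adapter without discarding cache entries written by a previous one.

```python
from dataclasses import dataclass, field

DEFAULT_ADAPTER = "l4"  # the routing adapter is the entry point for every user turn


@dataclass
class Session:
    active: str = DEFAULT_ADAPTER
    # Each entry records which adapter produced it; nothing is recomputed on a switch.
    kv_cache: list[tuple[str, str]] = field(default_factory=list)

    def append(self, token: str) -> None:
        """Extend the cache with an entry computed by the currently active adapter."""
        self.kv_cache.append((self.active, token))

    def route(self, expert: str) -> None:
        """Explicit routing: switch adapters but keep the cache built so far."""
        self.active = expert

    def end_turn(self) -> None:
        """Implicit routing: the next user turn starts at the default adapter again."""
        self.active = DEFAULT_ADAPTER


session = Session()
session.append("user: fix my build")      # read by L4
session.route("code")                     # L4 called the router tool
session.append("assistant: here is a patch")
session.end_turn()
session.append("user: thanks")            # back on L4, earlier cache still in place

print([adapter for adapter, _ in session.kv_cache])  # ['l4', 'code', 'l4']
```

The mixed provenance of the cache entries is where the modest quality loss comes from: the expert attends over keys and values that a different adapter computed.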
Post‑Training Engineering and Self‑Evolution
Three key engineering advances enable the model’s efficiency and adaptability:
Stabilizing RL on the 744B sparse base. The team introduced Rollout Routing Replay (R3), which records the expert ID selected for each token during rollout and reconstructs the same path during training, masking tokens whose paths cannot be reproduced. This aligns expert trajectories and mitigates gradient pollution.
Embedding the Agent Harness directly into training. The Harness Context Protocol (HCP) standardizes how the harness exposes task metadata, memory state, and routing instructions to the model. By feeding the exact production‑level harness into RL rollouts, discrepancies between training and serving are eliminated.
Self‑evolution loop. Using AutoResearch on HCP configurations, the model iteratively improves its prompts, tool usage, and trajectory selection. The loop consists of:
Prompt evolution – language‑space refinement based on environment feedback.
Trajectory selection – discovering new execution paths unlocked by revised prompts.
Context learning – distilling successful trajectories back into model parameters, turning previously prompt‑only capabilities into out‑of‑the‑box behavior.
This self‑evolution contributed the largest single performance gain on VitaBench between the checkpoint and the final release.
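The Rollout Routing Replay step from the first item above can be sketched as follows. The function names are hypothetical, and the rule used here for "cannot be reproduced" (a missing record or an expert ID outside the valid range) is an assumption; the article does not say how the team decides this.

```python
from typing import Optional, Sequence

ExpertIds = Optional[Sequence[int]]


def replay_routing(recorded: Sequence[ExpertIds], num_experts: int):
    """Return (forced_experts, loss_mask) for one rollout.

    recorded[t] holds the expert IDs chosen for token t during rollout,
    or None if nothing was logged for that token.
    """
    forced, mask = [], []
    for ids in recorded:
        reproducible = (
            ids is not None
            and len(ids) > 0
            and all(0 <= i < num_experts for i in ids)
        )
        forced.append(tuple(ids) if reproducible else None)
        mask.append(1.0 if reproducible else 0.0)
    return forced, mask


def masked_mean_loss(per_token_loss: Sequence[float], mask: Sequence[float]) -> float:
    """Average the loss over tokens whose routing path was replayed."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(l * m for l, m in zip(per_token_loss, mask)) / kept


# Toy rollout: token 2 has no record, token 3 names an expert that does not exist.
recorded = [(0, 3), (1, 2), None, (2, 9)]
forced, mask = replay_routing(recorded, num_experts=8)
loss = masked_mean_loss([0.5, 1.5, 4.0, 2.0], mask)

print(forced)  # [(0, 3), (1, 2), None, None]
print(mask)    # [1.0, 1.0, 0.0, 0.0]
print(loss)    # 1.0
```

In a real trainer the `forced` list would override the router's own top‑k choice during the forward pass, so that gradients flow through the same experts that generated the sampled tokens.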
Efficient Training Techniques and Deployment
To keep compute under 1% of typical costs, Mind Lab combined several efficiency‑focused techniques: LoRA adapters, Dynamic Sparse Attention (DSA), Multi‑Token Prediction (MTP), ultra‑low‑rank matrix adapters, and parallel mixed linear attention. The MinT infrastructure extends DeepSeek V4’s three‑layer cache to a four‑layer cache that includes object storage (OSS), allowing training and serving of up to a million LoRA adapters, with scalability toward tens of millions.
Because MinT moves only the lightweight LoRA adapters, rather than full model weights, during training, evaluation, deployment, and rollback, loading is nearly ten times faster.
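A tiered adapter cache backed by object storage could work along these lines. The article states only that there are four layers and that one of them is object storage; the tier names, capacities, LRU eviction, and promote‑on‑hit policy below are all assumptions chosen to keep the sketch small.

```python
from collections import OrderedDict
from typing import Optional


class Tier:
    """One cache layer with least-recently-used eviction (capacity None = unbounded)."""

    def __init__(self, name: str, capacity: Optional[int]):
        self.name = name
        self.capacity = capacity
        self.items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        if key in self.items:
            self.items.move_to_end(key)  # mark as recently used
            return self.items[key]
        return None

    def put(self, key: str, value: bytes) -> None:
        self.items[key] = value
        self.items.move_to_end(key)
        if self.capacity is not None and len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry


class TieredAdapterCache:
    """Look up adapters fastest tier first and copy hits into every faster tier."""

    def __init__(self, tiers: list[Tier]):
        self.tiers = tiers  # ordered fastest to slowest; the last one is the backing store

    def load(self, adapter_id: str) -> tuple[bytes, str]:
        for depth, tier in enumerate(self.tiers):
            weights = tier.get(adapter_id)
            if weights is not None:
                for faster in self.tiers[:depth]:
                    faster.put(adapter_id, weights)
                return weights, tier.name
        raise KeyError(adapter_id)


# Hypothetical tier names and sizes.
oss = Tier("oss", None)
oss.put("adapter-a", b"weights-a")
oss.put("adapter-b", b"weights-b")
cache = TieredAdapterCache([Tier("gpu", 1), Tier("host", 2), Tier("disk", 4), oss])

hits = [cache.load(name)[1] for name in ("adapter-a", "adapter-a", "adapter-b", "adapter-a")]
print(hits)  # ['oss', 'gpu', 'oss', 'host']
```

Evicting from a fast tier simply drops the copy, since the unbounded object‑storage tier always retains the adapter. That is what lets the total number of adapters far exceed what any single machine can hold.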
Macaron‑V1‑Preview’s weights, including the five expert LoRAs and routing metadata, are publicly released on Hugging Face, and an open preview environment is hosted for direct comparison with other leading models.
Conclusion
The Macaron‑V1‑Preview model demonstrates that a modestly sized lab can produce a 750B‑scale Agent model with frontier performance while consuming a fraction of the usual compute budget. Its MoL architecture, robust post‑training pipeline, and self‑evolution mechanisms provide a reproducible blueprint for building scalable, resource‑efficient Agent systems.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
SuanNi
A community for AI developers that aggregates large-model development services, models, and compute power.