Tagged articles
9 articles
PaperAgent
May 25, 2026 · Artificial Intelligence

DeepSeek’s Harness: How Agent Harness Engineering Is Shaping the Next LLM Agent Era

The article surveys DeepSeek’s Harness initiative, presenting the Binding‑Constraint Thesis, three‑stage evolution from prompt to harness engineering, the ETCLOVG seven‑layer architecture, and concrete benchmark evidence that harness‑only improvements far outweigh model upgrades, while detailing security, observability, and governance considerations for reliable LLM agents.

AI Architecture · Agent Evaluation · Agent Harness Engineering
0 likes · 12 min read
ITPUB
May 16, 2026 · Artificial Intelligence

Managing AI‑Generated Code with an Agent‑Based Evaluation Framework: Lessons from Refactoring 310 K Lines

The authors show how, when over 90% of a codebase is produced by AI, a unified "people‑align → human‑machine‑align" approach driven by evaluation agents turns technical debt into incremental business work, enabling continuous refactoring, AI‑friendly standards, and a sustainable engineering environment.

AI coding · AI governance · Agent Evaluation
0 likes · 21 min read
Meituan Technology Team
May 7, 2026 · R&D Management

Managing AI‑Generated Code with Agent‑Based Evaluation: Refactoring 310K Lines of Code

When over 90% of a codebase is produced by AI, system quality hinges on constraining the AI rather than on generation speed. This article details how a team used an agent‑based evaluation framework, unified standards, and incremental refactoring to turn 310,000 lines of AI‑written code into a maintainable, low‑debt system.

AI coding · AI governance · Agent Evaluation
0 likes · 21 min read
AntData
Apr 28, 2026 · Artificial Intelligence

Iterative Agent Evaluation Skill: Automating Bad‑Case Diagnosis with AI Pre‑Annotation

The article presents an end‑to‑end, eight‑phase automated evaluation pipeline for large‑model agents that replaces manual bad‑case inspection with AI‑assisted pre‑annotation, cutting analysis time from a full day to about 30 minutes (an efficiency gain of over 90%) while enabling iterative knowledge‑base refinement.

AI Pre‑annotation · Agent Evaluation · Automated Pipeline
0 likes · 20 min read
Alibaba Cloud Developer
Mar 27, 2026 · Artificial Intelligence

How OpenClaw Empowers a Self‑Evolving Bank Manager Assistant

This article details a three‑day deep dive into OpenClaw, demonstrating how a self‑iterating AI assistant for bank relationship managers can be built, validated, and refined through autonomous agent communication, scheduled tasks, and memory‑driven reflection.

AI agents · Agent Evaluation · Memory Architecture
0 likes · 20 min read
PaperAgent
Dec 23, 2025 · Artificial Intelligence

CATArena: A Competitive Benchmark That Turns Agent Scoring into Evolutionary Learning

CATArena introduces a tournament‑style evaluation framework where AI agents iteratively code, compete, and improve across classic board games, using three‑dimensional quantitative scores to measure strategy programming, global learning, and generalization, and reveals how different LLM‑based agents learn and adapt over multiple rounds.

AI Benchmark · Agent Evaluation · CATArena
0 likes · 8 min read
DataFunTalk
Jul 14, 2025 · Artificial Intelligence

Can Kimi K2 Beat Claude and Gemini in Coding and Agent Tasks?

This in‑depth review examines Kimi K2’s new focus on agent and coding abilities, comparing its performance on 3D HTML generation, code generation, and real‑world agent tasks against Claude 4 and Gemini 2.5, while also evaluating cost, openness, and practical usability for developers.

AI coding · Agent Evaluation · Kimi K2
0 likes · 15 min read