DeliAutoResearch Cuts Human Effort to 2 Hours – Knowledge Accumulation Is the Real Bottleneck

DeepSeek researcher Chen Deli reports that, using his DeliAutoResearch skill and a suite of AI agents, he produced a 46‑page research paper in six days with only two hours of human effort. He argues that the true limits of autonomous research lie in continuous knowledge accumulation and reliable self‑evaluation rather than in model capability.

DataFunTalk

May 27, 2026

DeepSeek researcher Chen Deli published a research overview describing how his DeliAutoResearch skill, combined with DeepSeek‑V4‑Pro and GPT‑Image2, generated a 46‑page LaTeX paper in six days. The process took six iterations (four for V1, one for V2, and one for V3), about 108 agent calls, and 648,000 tokens, and produced 2,234 lines of LaTeX. The final manuscript contains 103 verified references, seven figures, and four tables, and is 538 KB in size.
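As a quick sanity check, the averages implied by these figures can be worked out directly. The snippet below uses only the numbers reported above; the variable names are illustrative, the call count is given as "about 108" so the per-call average is approximate, and treating the six days as 144 wall-clock hours is an assumption.

```python
# Figures as reported in the overview; variable names are illustrative.
agent_calls = 108        # reported as "about 108", so averages are approximate
total_tokens = 648_000
latex_lines = 2_234
pages = 46
elapsed_days = 6         # assumption: treated as 6 * 24 wall-clock hours
human_hours = 2

tokens_per_call = total_tokens / agent_calls
lines_per_page = latex_lines / pages
human_share = human_hours / (elapsed_days * 24)

print(f"tokens per agent call: {tokens_per_call:.0f}")
print(f"LaTeX lines per page:  {lines_per_page:.1f}")
print(f"human share of elapsed time: {human_share:.1%}")
```

On these numbers, each agent call averaged roughly 6,000 tokens, and the two hours of human effort come to about 1.4% of the six-day span.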

The paper introduces an L1–L5 autonomy classification for research agents, analogous to the SAE levels for autonomous driving:

L1 – Basic autocomplete: early GitHub Copilot‑style prediction of the next line of code.

L2 – Task execution: chat models (e.g., ChatGPT, Claude) with tool integration that decompose tasks but require human approval at each step.

L3 – Multi‑step execution: agents such as Claude Code or Cursor Agent that can autonomously perform 10–100 steps, requesting human review only at critical points.

L4 – Domain‑restricted full autonomy: the agent receives only a research goal and final‑result evaluation; it can conduct experiments, write code, and draft papers but cannot choose its own research questions.

L5 – Fully autonomous agenda: the agent selects topics, allocates resources, accumulates knowledge across domains, and conducts long‑term research. This level remains speculative, with the main challenges identified as continuous knowledge accumulation, reliable self‑evaluation, and scalable architecture.

Current industry practice has reached roughly L4, while L5 is still a vision.
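One way to make the taxonomy concrete is to encode the levels as an ordered enumeration and map a few capability flags onto them. This is a minimal sketch, not something taken from the paper: the `AgentTraits` flags and the `classify` function are hypothetical names chosen to mirror the level descriptions above.

```python
from dataclasses import dataclass
from enum import IntEnum


class AutonomyLevel(IntEnum):
    L1_AUTOCOMPLETE = 1
    L2_TASK_EXECUTION = 2
    L3_MULTI_STEP = 3
    L4_DOMAIN_AUTONOMY = 4
    L5_AUTONOMOUS_AGENDA = 5


@dataclass(frozen=True)
class AgentTraits:
    """Hypothetical capability flags that mirror the level descriptions."""
    uses_tools: bool = False
    runs_steps_without_approval: bool = False
    works_from_goal_only: bool = False
    chooses_own_questions: bool = False


def classify(traits: AgentTraits) -> AutonomyLevel:
    # Check the most demanding capability first so the highest level wins.
    if traits.chooses_own_questions:
        return AutonomyLevel.L5_AUTONOMOUS_AGENDA
    if traits.works_from_goal_only:
        return AutonomyLevel.L4_DOMAIN_AUTONOMY
    if traits.runs_steps_without_approval:
        return AutonomyLevel.L3_MULTI_STEP
    if traits.uses_tools:
        return AutonomyLevel.L2_TASK_EXECUTION
    return AutonomyLevel.L1_AUTOCOMPLETE


coding_agent = AgentTraits(uses_tools=True, runs_steps_without_approval=True)
print(classify(coding_agent).name)
```

Using `IntEnum` keeps the levels comparable, so a check such as `classify(agent) >= AutonomyLevel.L4_DOMAIN_AUTONOMY` reads naturally.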

Beyond the autonomy levels, the paper categorises four dominant architectural patterns for autonomous research agents:

Single‑agent loop: early approaches such as ReAct, Reflexion, LATS, and Tree of Thoughts iterate through reasoning‑action‑observation cycles. This pattern is simple and efficient but limited on complex tasks.

Multi‑agent collaboration: frameworks like CAMEL, AutoGen, and MetaGPT distribute work among multiple agents, offering diverse perspectives and error correction at the cost of higher coordination overhead.

Hierarchical scheduling: exemplified by Claude Code and Devin, this pattern decomposes long‑term research into layered plans and suits extensive, high‑complexity investigations.

Tool‑enhanced execution: agents such as SWE‑Agent integrate execution environments, web browsers, APIs, databases, and multimodal tools. The design of the Agent‑Computer Interface (ACI) directly influences performance.
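The reasoning-action-observation cycle behind the single-agent loop can be sketched in a few lines. Everything here is an illustrative assumption rather than the code of any system named above: `stub_policy` stands in for a model call, and `TOOLS` is a hypothetical one-entry tool registry.

```python
from typing import Callable

# Hypothetical tool registry: tool name -> function from string to string.
TOOLS: dict[str, Callable[[str], str]] = {
    "word_count": lambda text: str(len(text.split())),
}


def stub_policy(goal: str, history: list[tuple[str, str, str]]) -> tuple[str, str, str]:
    """Stand-in for a model call: returns (thought, action, action_input)."""
    if not history:
        return ("I should measure the text first.", "word_count", goal)
    last_observation = history[-1][2]
    return ("I have what I need.", "finish", f"The goal has {last_observation} words.")


def run_loop(goal: str, policy=stub_policy, max_steps: int = 5) -> str:
    history: list[tuple[str, str, str]] = []  # (thought, action, observation)
    for _ in range(max_steps):
        thought, action, action_input = policy(goal, history)  # reason
        if action == "finish":
            return action_input
        observation = TOOLS[action](action_input)              # act
        history.append((thought, action, observation))         # observe
    return "stopped: step budget exhausted"


print(run_loop("survey autonomous research agents"))
```

The `max_steps` budget is the only termination guard in this sketch; a real agent would replace `stub_policy` with a model call and register more tools.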

The author stresses that no pattern is universally superior; the choice should match the task’s requirements. Simple, short tasks favour single‑agent loops; complex tasks that need multiple viewpoints benefit from multi‑agent collaboration; long‑duration research needs hierarchical scheduling; and tasks that interact with external tools rely on tool‑enhanced execution. In practice, mixed architectures are common.
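This matching guidance can be written down as a small selection function. The sketch below is an illustration only: `TaskProfile` and its three flags are hypothetical descriptors, and the function returns a list so that mixed architectures fall out naturally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical task descriptors mirroring the guidance above."""
    needs_external_tools: bool = False
    long_running: bool = False
    needs_multiple_perspectives: bool = False


def suggest_patterns(task: TaskProfile) -> list[str]:
    patterns = []
    if task.long_running:
        patterns.append("hierarchical scheduling")
    if task.needs_multiple_perspectives:
        patterns.append("multi-agent collaboration")
    if task.needs_external_tools:
        patterns.append("tool-enhanced execution")
    # Simple, short tasks fall through to the cheapest option.
    return patterns or ["single-agent loop"]


quick_fix = TaskProfile()
literature_study = TaskProfile(long_running=True, needs_external_tools=True)
print(suggest_patterns(quick_fix))
print(suggest_patterns(literature_study))
```

A task that sets several flags gets several patterns back, which corresponds to the mixed architectures the author describes as common in practice.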

Using the proposed framework, the paper evaluates 17 mainstream autonomous‑research systems across a six‑dimensional feature matrix (scalability, cost, reliability, etc.). The analysis shows the field has progressed from fragile early prototypes to L4‑level domain‑specific systems. Code‑centric agents exhibit the highest maturity, while scientific agents are beginning to produce verifiable new findings.

The study identifies six open research questions:

Cognitive‑loop traps: agents may fall into repetitive, ineffective strategies without self‑termination.

Context limits: fixed context windows (4K to 1M tokens) hinder long‑term research.

Innovation assessment: lack of automated metrics for originality and value.

Reproducibility: model randomness and prompt sensitivity impede repeatability.

Safety and ethics: dual‑use risks, autonomous escalation, and academic integrity concerns.

Cost: single‑task expenses can exceed 50 units, exacerbating research inequality.
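To illustrate the first item in this list, a simple guard against cognitive-loop traps is to watch the recent action history and stop when the same action keeps recurring. This is a minimal heuristic sketch, not a method from the paper: the `detect_loop` name, the window size, and the repeat threshold are all arbitrary assumptions.

```python
from collections import Counter


def detect_loop(actions: list[str], window: int = 6, max_repeats: int = 3) -> bool:
    """Return True if any action occurs at least `max_repeats` times
    within the last `window` actions."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = actions[-window:]
    counts = Counter(recent)
    return any(n >= max_repeats for n in counts.values())


healthy = ["search", "read", "summarise", "write"]
stuck = ["search:llm agents", "search:llm agents", "read", "search:llm agents"]

print(detect_loop(healthy))
print(detect_loop(stuck))
```

An agent loop could call such a check after every step and terminate or escalate to a human when it returns `True`. Exact string matching only catches literal repeats, so near-duplicate actions would need a fuzzier comparison.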

Finally, the author argues that the primary bottleneck for achieving true L5 autonomy is not model capability but the ability to continuously accumulate knowledge and perform reliable self‑evaluation, alongside scaling the underlying architecture.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
