
Levels of AGI: A Framework for Evaluating Artificial General Intelligence

The article presents Google DeepMind's AGI evaluation framework: six guiding principles, nine representative definitions, and a two-dimensional classification spanning six levels (Level 0–5) of performance and generality, with associated autonomy levels. The aim is a common language for model comparison, risk assessment, and progress tracking.

Rare Earth Juejin Tech Community

Abstract

Google DeepMind proposes a framework for assessing Artificial General Intelligence (AGI) models and their precursors, rating systems by performance, generality, and autonomy across multiple levels. The goal is to create a common language for comparing AGI models, evaluating risks, and tracking development, analogous to the autonomous‑driving maturity ladder.

Original Information

Source: https://arxiv.org/pdf/2311.02462.pdf

Introduction

AGI refers to AI systems that can perform at or above human level across a broad range of tasks. Recent advances in machine learning, especially large language models (LLMs), have moved the discussion from philosophical debate to concrete implementation, with some claiming that current LLMs already constitute AGI.

Experts offer diverse definitions of AGI, reflecting its relevance to AI goals, predictions, and risks. The pursuit of human‑level intelligence remains a central, often implicit, objective in AI research.

AGI Definitions: Nine Representative Case Studies

Case 1 – Turing Test: Early attempt to assess machine intelligence through imitation, but modern LLMs can pass limited versions without guaranteeing true AGI.

Case 2 – Strong AI: Philosophical claim that a suitably programmed computer could possess consciousness; no scientific consensus on verification.

Case 3 – Analogies to the Human Brain: Defines AGI as a system surpassing human brain complexity and speed, capable of knowledge acquisition and reasoning across domains.

Case 4 – Human‑Level Cognitive Performance: Emphasizes the ability to perform any cognitive task a human can, without requiring a physical embodiment.

Case 5 – Ability to Learn Tasks: Highlights meta‑cognitive learning as essential for AGI, allowing adaptation to new tasks.

Case 6 – Economically Valuable Work: Defines AGI as a highly autonomous system that outperforms humans on economically valuable tasks, providing a measurable benchmark.

Case 7 – Flexible and General: Cites Wozniak's “Coffee Test” (making coffee in an unfamiliar kitchen) and a suite of five diverse tasks (e.g., movie understanding, programming) as benchmarks of flexibility and generality.

Case 8 – Artificial Capable Intelligence (ACI): Mustafa Suleyman's proposal to test AI on economically productive multi‑step tasks, such as turning a capital investment into profit, as a practical AGI benchmark.

Case 9 – SOTA LLMs as Generalists: Argues that state‑of‑the‑art LLMs (GPT‑4, Bard, Llama 2, Claude) exhibit sufficient generality to be considered AGI, though performance aspects are under‑emphasized.

Six Principles for Defining AGI

Focus on Capability, Not Process: Exclude requirements for human‑like thinking or subjective consciousness.

Guarantee Generality and Performance: Both breadth across domains and depth of performance are essential.

Emphasize Cognitive and Metacognitive Tasks: Ability to learn new tasks and seek human assistance when needed is a key indicator.

Potential Over Deployment: A system demonstrates AGI if it can perform required capabilities, regardless of real‑world deployment constraints.

Prioritize Ecological Validity: Tasks should align with real‑world human values, including economic, social, and artistic contributions.

Track AGI Pathways, Not a Single Endpoint: Define measurable indicators for each level, identify associated risks, and adapt interaction paradigms accordingly.

AGI Capability Level Classification Framework (Two Dimensions: Performance × Generality)

Level 0 – No AI

Narrow: Simple software such as compilers.

General: Human‑in‑the‑loop computing, such as Amazon Mechanical Turk.

Level 1 – Emerging

Narrow: Rule‑based systems (e.g., SHRDLU).

General: Early AGI examples like ChatGPT, Bard, Llama 2.

Level 2 – Competent

Narrow: Toxicity/malicious‑content detectors, voice assistants (Siri, Alexa), VQA systems, Watson, and SOTA LLMs on a subset of tasks.

General: Not yet achieved.

Level 3 – Expert

Narrow: Spell/grammar checkers, image generators (Imagen, DALL‑E 2).

General: Not yet achieved.

Level 4 – Virtuoso

Narrow: Deep Blue, AlphaGo.

General: Not yet achieved.

Level 5 – Superhuman

Narrow: AlphaFold, AlphaZero, Stockfish.

General: Not yet achieved.
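The two‑dimensional grid above can be sketched as a small data structure. This is a minimal illustration, not code from the paper: the names `Performance`, `AGIRating`, and the example placements are my own choices; the percentile notes in the comments paraphrase the paper's level definitions.

```python
from dataclasses import dataclass
from enum import IntEnum


class Performance(IntEnum):
    """Performance dimension: depth of capability relative to skilled humans."""
    NO_AI = 0
    EMERGING = 1     # equal to or somewhat better than an unskilled human
    COMPETENT = 2    # at least 50th percentile of skilled adults
    EXPERT = 3       # at least 90th percentile
    VIRTUOSO = 4     # at least 99th percentile
    SUPERHUMAN = 5   # outperforms all humans


LEVEL_NAMES = {0: "No AI", 1: "Emerging", 2: "Competent",
               3: "Expert", 4: "Virtuoso", 5: "Superhuman"}


@dataclass(frozen=True)
class AGIRating:
    """One cell of the grid: a performance level plus a breadth flag."""
    performance: Performance
    general: bool  # True = General column, False = Narrow column

    def label(self) -> str:
        breadth = "General" if self.general else "Narrow"
        return f"Level {int(self.performance)} {breadth} ({LEVEL_NAMES[self.performance]})"


# Placements taken from the table above.
EXAMPLES = {
    "SHRDLU": AGIRating(Performance.EMERGING, general=False),
    "ChatGPT": AGIRating(Performance.EMERGING, general=True),
    "AlphaGo": AGIRating(Performance.VIRTUOSO, general=False),
    "AlphaFold": AGIRating(Performance.SUPERHUMAN, general=False),
}

for system, rating in EXAMPLES.items():
    print(f"{system}: {rating.label()}")
```

Note that a system is rated per cell, not globally: the same LLM can sit at Competent (or higher) for some narrow tasks while remaining Emerging on the General column, which is why the table lists "sub‑tasks of SOTA LLMs" under Level 2 Narrow.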

Conclusion

Sharing a common AGI definition and evaluation framework will facilitate model comparison, risk mitigation, policy standardization, and clearer research goals, helping stakeholders understand the current position on the path toward AGI.

Appendix – Dartmouth Artificial Intelligence Conference

The founding proposal was authored by J. McCarthy, M. L. Minsky, N. Rochester, and C. E. Shannon, together with other scholars interested in artificial intelligence.

Tags: machine learning, AI evaluation, AGI, risk assessment, artificial general intelligence
Written by

Rare Earth Juejin Tech Community

Juejin, a tech community that helps developers grow.
