Claude Opus 4.8 Arrives with Two Historic Firsts: Zero Lie Rate and Zero Lazy Rate

Claude Opus 4.8, released just 43 days after Opus 4.7 at the same price, tops the GDPval‑AA leaderboard with 1890 Elo, 121 points ahead of GPT‑5.5, while using 15% fewer steps and 35% fewer tokens. It reports a 0% lie rate and a 0% lazy rate, leads on SWE‑Bench Pro, ProgramBench and FrontierSWE, and introduces massively parallel agent workflows that rewrote 750k lines of production code in 11 days. Meanwhile, Anthropic is preparing the upcoming Claude Mythos and has reached a $965 billion valuation.

DataFunTalk

May 29, 2026

Claude Opus 4.8 was announced just 43 days after Opus 4.7 and keeps the same pricing as its predecessor. It went straight to the top of the GDPval‑AA leaderboard.

Benchmark Superiority

On GDPval‑AA, a leaderboard for real‑world agent capability, Opus 4.8 scores 1890 Elo, a 137‑point jump over Opus 4.7 and 121 points ahead of GPT‑5.5. The 121‑point gap translates to an estimated 67% win rate against GPT‑5.5. The new model also completes the same tasks with 15% fewer steps and 35% fewer tokens.
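The link between an Elo gap and a win rate can be checked with a short calculation. The sketch below assumes GDPval‑AA uses the standard logistic Elo model with a 400‑point scale, which the article does not state; the function name is illustrative.

```python
def elo_expected_score(rating_gap: float) -> float:
    """Expected score of the higher-rated side under the standard Elo model.

    Assumption: a 400-point logistic scale, as in chess-style Elo.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


# Gaps quoted in the article.
gap_vs_gpt55 = 121   # Opus 4.8 vs GPT-5.5
gap_vs_opus47 = 137  # Opus 4.8 vs Opus 4.7

win_vs_gpt55 = elo_expected_score(gap_vs_gpt55)
win_vs_opus47 = elo_expected_score(gap_vs_opus47)

print(f"Expected score vs GPT-5.5:  {win_vs_gpt55:.1%}")
print(f"Expected score vs Opus 4.7: {win_vs_opus47:.1%}")
```

Under that assumption, a 121‑point gap gives an expected score of roughly two thirds, consistent with the 67% figure quoted for GPT‑5.5. The same formula applied to the 137‑point gap gives a slightly higher figure against Opus 4.7.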

Honesty Metrics – Two Historic Firsts

The “lie rate” (frequency of silently ignoring defective outputs) drops from 0.40 for Opus 4.5 and 0.25 for Opus 4.7 to 0.00 for Opus 4.8, making it the first model to achieve a perfect score. Likewise, the “lazy rate” (probability of giving a wrong answer when a problem requires deeper investigation) falls from 25% in Opus 4.7 to 0% in Opus 4.8.

Software‑Engineering Benchmarks

On the SWE‑Bench Pro suite, Opus 4.8 achieves 69.2% accuracy, a full 10‑percentage‑point lead over GPT‑5.5. On the more demanding ProgramBench, Opus 4.8 outperforms Opus 4.7 across all token‑budget tiers: at a low 1M‑token budget it reaches a pass rate of about 79.5%, whereas Opus 4.7 only reaches that level at a 5M‑token budget.

FrontierSWE – Pushing the Human‑Capability Ceiling

When evaluated on the FrontierSWE leaderboard (which requires building a PostgreSQL server in Zig, rewriting Git, and creating a native Lua compiler), Opus 4.8 wins 83% of the matches, decisively beating both GPT‑5.5 and Opus 4.7.

Massive Parallel Coding with Claude Code

Using Claude Code, a team of hundreds of agents rewrote a 750k‑line Rust codebase from scratch in 11 days, achieving a 99.8% test‑suite pass rate with virtually no line‑by‑line human review. The process generated over 6,000 commits, and the AI autonomously merged changes after resolving conflicts.

Dynamic Workflows and Effort Control

Opus 4.8 introduces a five‑level “effort” selector (Low → Max) and an “ultracode” tier that automatically decides whether to launch a full fleet of sub‑agents. The dynamic workflow system can split a large task into dozens of sub‑tasks, run them in parallel, and then iteratively reconcile results. Because dynamic workflows consume far more tokens than a normal session, Anthropic recommends starting with small‑scope tasks.
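The split, parallel run, and reconcile pattern described above can be sketched generically. This is not Anthropic's implementation or API: every function name here is hypothetical, the "sub‑agent" is a stand‑in that only labels its inputs, and the sketch does a single reconcile pass rather than the iterative reconciliation the article mentions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_task(items, chunk_size):
    """Split a list of work items into fixed-size sub-tasks."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def run_subtask(chunk):
    """Hypothetical stand-in for a sub-agent: marks each item as processed."""
    return {name: f"processed:{name}" for name in chunk}


def reconcile(results):
    """Merge sub-task outputs and list items where two sub-tasks disagree."""
    merged, conflicts = {}, []
    for result in results:
        for name, value in result.items():
            if name in merged and merged[name] != value:
                conflicts.append(name)
            merged[name] = value
    return merged, conflicts


def run_workflow(items, chunk_size=2, max_workers=4):
    """Fan out sub-tasks to a thread pool, then fan in with one reconcile pass."""
    subtasks = split_task(items, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_subtask, subtasks))
    return reconcile(results)


merged, conflicts = run_workflow(["a.rs", "b.rs", "c.rs", "d.rs", "e.rs"])
print(len(merged), "items merged;", len(conflicts), "conflicts")
```

In a real system each sub‑task would consume its own model context, which is why the token cost grows with the number of sub‑tasks and why starting with small scopes is the cautious choice.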

Anthropic’s Valuation and the Upcoming Claude Mythos

Alongside the technical leap, Anthropic closed a $65 billion Series H round, pushing its valuation to $965 billion and briefly surpassing OpenAI. The company also teased the imminent release of Claude Mythos, which is expected to further extend Anthropic’s AI frontier.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
