Claude Opus 4.8 Arrives with Two Historic Firsts: Zero Lie Rate and Zero Lazy Rate

Claude Opus 4.8, released just 43 days after Opus 4.7 at the same price, tops the GDPval‑AA leaderboard with 1890 Elo, 121 points ahead of GPT‑5.5, while using 15 % fewer steps and 35 % fewer tokens. It reports a 0 % lie rate and a 0 % lazy rate, leads on SWE‑Bench Pro, ProgramBench and FrontierSWE, and introduces massively parallel agent workflows that rewrote 750 k lines of production code in 11 days. Anthropic, meanwhile, is preparing the upcoming Claude Mythos and has reached a $965 billion valuation.

DataFunTalk

Claude Opus 4.8 was announced just 43 days after Opus 4.7 at the same price as its predecessor, and it immediately took first place on the global AI leaderboard.

Benchmark Superiority

On the real‑world agent capability leaderboard GDPval‑AA, Opus 4.8 scores 1890 Elo, a 137‑point jump over Opus 4.7 and 121 points ahead of GPT‑5.5. Under the standard Elo formula, a 121‑point gap corresponds to an expected head‑to‑head win rate of roughly 67 %. The new model also completes the same tasks with 15 % fewer steps and 35 % fewer tokens.
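The relationship between an Elo gap and a win rate can be checked with a few lines of Python. This is a minimal sketch that assumes GDPval‑AA uses the conventional logistic Elo model with a 400‑point scale, which the article does not state; the function name and the derived ratings for GPT‑5.5 and Opus 4.7 (computed from the quoted gaps) are illustrative.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard logistic Elo model.

    Assumption: the conventional 400-point scale, where a 400-point lead
    means 10:1 odds. The leaderboard's exact scale is not given in the article.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Ratings derived from the figures quoted in the article.
opus_48 = 1890.0
gpt_55 = opus_48 - 121.0   # 121 points behind Opus 4.8
opus_47 = opus_48 - 137.0  # 137 points behind Opus 4.8

win_vs_gpt_55 = elo_expected_score(opus_48, gpt_55)
win_vs_opus_47 = elo_expected_score(opus_48, opus_47)

print(f"Expected win rate vs GPT-5.5:  {win_vs_gpt_55:.1%}")
print(f"Expected win rate vs Opus 4.7: {win_vs_opus_47:.1%}")
```

Under that assumption, the 121‑point lead over GPT‑5.5 works out to about 67 %, which matches the quoted estimate. The larger 137‑point lead over Opus 4.7 implies a slightly higher expected win rate.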

Honesty Metrics – Two Historic Firsts

The “lie rate” (frequency of silently ignoring defective outputs) drops from 0.40 for Opus 4.5 and 0.25 for Opus 4.7 to 0.00 for Opus 4.8, making it the first model to achieve a perfect score. Likewise, the “lazy rate” (probability of giving a wrong answer when a problem requires deeper investigation) falls from 25 % for Opus 4.7 to 0 % for Opus 4.8.

Software‑Engineering Benchmarks

On the SWE‑Bench Pro suite, Opus 4.8 achieves 69.2 % accuracy, a full 10 percentage‑point lead over GPT‑5.5. On the more demanding ProgramBench, Opus 4.8 outperforms Opus 4.7 across all token‑budget tiers: with a budget of only 1 M tokens it reaches a pass rate of about 79.5 %, a level Opus 4.7 reaches only at a 5 M‑token budget, or five times as many tokens.

FrontierSWE – Pushing the Human‑Capability Ceiling

When evaluated on the FrontierSWE leaderboard (which requires building a PostgreSQL server in Zig, rewriting Git, and creating a native Lua compiler), Opus 4.8 wins 83 % of the matches, decisively beating both GPT‑5.5 and Opus 4.7.

Massive Parallel Coding with Claude Code

Using Claude Code, a team of hundreds of agents rewrote a 750 k‑line Rust codebase from scratch in 11 days, achieving a 99.8 % test‑suite pass rate with virtually no line‑by‑line human review. The process generated over 6,000 commits, and the agents merged changes autonomously after resolving conflicts.

Dynamic Workflows and Effort Control

Opus 4.8 introduces a five‑level “effort” selector (Low → Max) and an “ultracode” tier that automatically decides whether to launch a full fleet of sub‑agents. The dynamic workflow system can split a large task into dozens of sub‑tasks, run them in parallel, and then iteratively reconcile results. Because dynamic workflows consume far more tokens than a normal session, Anthropic recommends starting with small‑scope tasks.
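The split, parallel run, and reconcile loop described above can be illustrated with a small fan‑out/fan‑in sketch. It is not Anthropic's implementation and calls no real agent API: `split_task`, `run_subagent`, and `run_workflow` are hypothetical names, and the sub‑agent is a stub that fails once on one sub‑task to simulate a conflict that needs another round.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Result:
    subtask: str
    attempt: int
    ok: bool


def split_task(task: str, parts: int) -> list[str]:
    """Hypothetical planner: cut one large task into `parts` sub-tasks."""
    return [f"{task} / part {i}" for i in range(1, parts + 1)]


def run_subagent(subtask: str, attempt: int) -> Result:
    """Stub sub-agent (an assumption for illustration only).

    Part 3 fails on its first attempt to simulate a conflict that
    has to be resolved in a later reconciliation round.
    """
    ok = not (subtask.endswith("part 3") and attempt == 1)
    return Result(subtask, attempt, ok)


def run_workflow(task: str, parts: int = 4, max_rounds: int = 3,
                 max_workers: int = 4) -> tuple[dict[str, Result], list[str]]:
    """Run sub-tasks in parallel, then retry the failed ones each round."""
    pending = split_task(task, parts)
    done: dict[str, Result] = {}
    for attempt in range(1, max_rounds + 1):
        if not pending:
            break
        # Fan out: every pending sub-task runs concurrently in this round.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results = list(pool.map(lambda s: run_subagent(s, attempt), pending))
        # Fan in: keep successes, carry failures into the next round.
        for r in results:
            if r.ok:
                done[r.subtask] = r
        pending = [r.subtask for r in results if not r.ok]
    return done, pending


done, pending = run_workflow("migrate module")
print(f"{len(done)} sub-tasks reconciled, {len(pending)} unresolved")
```

Each round re‑runs only the sub‑tasks that failed, and `max_rounds` bounds the extra work. That bound matters because, as noted above, dynamic workflows consume far more tokens than a normal session.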

Anthropic’s Valuation and the Upcoming Claude Mythos

Alongside the technical leap, Anthropic closed a $65 billion Series H round, pushing its valuation to $965 billion and briefly surpassing that of OpenAI. The company also teased the imminent release of Claude Mythos, which is expected to push Anthropic's frontier further.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
