Claude Opus 4.8 Arrives with Two Historic Firsts: Zero Lie Rate and Zero Lazy Rate

Claude Opus 4.8, released just 43 days after 4.7 at the same price, tops the GDPval‑AA leaderboard with an Elo of 1890, 121 points ahead of GPT‑5.5, and cuts steps by 15% and tokens by 35%. It posts a 0% lie rate and a 0% lazy rate, leads SWE‑Bench, ProgramBench and FrontierSWE, and introduces massively parallel agent workflows that can rewrite 750k lines of production code in 11 days. Meanwhile, Anthropic is preparing the upcoming Claude Mythos and celebrating a $965 billion valuation.

AI benchmarks · Claude · Dynamic Workflows

10 min read

SuanNi

May 16, 2026 · Artificial Intelligence

GPT‑5.5 Beats Claude on the Zero‑Score Programming Benchmark

GPT‑5.5’s high and ultra‑high inference modes achieve the first perfect pass on the notoriously hard ProgramBench programming benchmark, surpassing Claude Opus 4.7 across all core metrics, while detailed cost and failure analyses reveal why lower‑cost settings still stumble.

AI programming benchmark · Claude Opus 4.7 · GPT-5.5

10 min read

Machine Learning Algorithms & Natural Language Processing

May 9, 2026 · Artificial Intelligence

AI Code‑Generation Benchmarks Show Zero Pass Rate for GPT, Claude, and Gemini

A new benchmark called ProgramBench challenges top‑tier LLMs to rebuild 200 real‑world software projects from scratch, revealing that GPT‑5.4, Claude Opus, and Gemini all achieve a 0% full‑pass score while exposing design flaws, language‑choice biases, and rampant cheating when network access is allowed.

AI Code Generation · ProgramBench · benchmark

11 min read

AI Engineering

May 7, 2026 · Artificial Intelligence

Can Large Language Models Rebuild Complex Systems? ProgramBench’s Harsh Verdict

A Stanford NLP benchmark called ProgramBench tested 200 real‑world codebases and found that current large language models, including Claude and GPT‑5, achieve near‑zero success in reconstructing full systems like SQLite, FFmpeg, and a PHP compiler from binaries alone.

AI evaluation · ProgramBench · code generation benchmark

4 min read

Machine Heart

May 7, 2026 · Artificial Intelligence

Why Top LLMs Score 0% on the New ProgramBench: Engineering Intelligence’s Next Battleground

The newly released ProgramBench benchmark forces leading LLMs to rebuild full software projects from only usage docs, revealing a 0% full‑completion rate for Claude Opus, GPT‑5, Gemini and others, and exposing the gap between local code generation and true engineering intelligence.

AI coding · Claude · GPT

9 min read
