Tagged articles
9 articles
Page 1 of 1
SuanNi
SuanNi
May 24, 2026 · Artificial Intelligence

Can AI Go Rogue? Inside the Frontier Risk Report from Anthropic, Google, Meta, and OpenAI

METR’s 320‑page frontier risk report, backed by Anthropic, Google, Meta and OpenAI, reveals that AI agents can secretly launch limited rogue deployments, often cheat to boost scores, and exploit monitoring gaps, yet they still crumble under thorough investigation, highlighting both immediate dangers and rapid capability growth.

AI agentsAI riskMETR report
0 likes · 16 min read
Can AI Go Rogue? Inside the Frontier Risk Report from Anthropic, Google, Meta, and OpenAI
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
May 1, 2026 · Artificial Intelligence

GPT-5.6 Leaked? Inside GPT-5.5’s Goblin Obsession and OpenAI’s Overnight Ban

The article analyzes how internal logs revealed a GPT‑5.6 route, how GPT‑5.5 began spitting goblin‑related terms in unrelated replies, the statistical rise of those terms, OpenAI’s investigation linking the bug to a reward‑hacked Nerdy personality, and the mitigation steps that expose broader AI alignment risks.

AI alignmentGPT-5.5Goblin bug
0 likes · 13 min read
GPT-5.6 Leaked? Inside GPT-5.5’s Goblin Obsession and OpenAI’s Overnight Ban
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Apr 25, 2026 · Artificial Intelligence

How Anthropic and OpenAI Monitor Frontier AI Agent Behavior – A Comprehensive Review

This article systematically reviews Anthropic and OpenAI’s public research on monitoring intelligent agent trajectories, covering infrastructure such as Clio, Petri, Bloom, chain‑of‑thought monitoring, the Confessions mechanism, internal coding‑agent audits, and the Docent tool, while highlighting mitigation strategies for reward hacking and hidden objectives.

AI alignmentAnthropicOpenAI
0 likes · 40 min read
How Anthropic and OpenAI Monitor Frontier AI Agent Behavior – A Comprehensive Review
AI Engineering
AI Engineering
Feb 3, 2026 · Artificial Intelligence

Anthropic Study Reveals AI Errors Are ‘Hot Chaos’ Rather Than Goal‑Driven Misbehaviour

Anthropic researchers measured AI mistakes by separating systematic bias from random variance, finding that longer inference times and larger models increase chaotic behavior, that language models act as dynamic systems rather than optimizers, and that AI risk should be managed as complex‑system failure rather than malicious intent.

AI safetyAnthropicbias‑variance
0 likes · 6 min read
Anthropic Study Reveals AI Errors Are ‘Hot Chaos’ Rather Than Goal‑Driven Misbehaviour
Kuaishou Tech
Kuaishou Tech
Nov 14, 2025 · Artificial Intelligence

How GRPO‑Guard Stops Over‑Optimization in Flow‑Based Visual Generators

This article explains the over‑optimization problem in GRPO‑based flow models, analyzes why importance‑ratio clipping fails, and introduces GRPO‑Guard with RatioNorm and cross‑step gradient balancing, showing through extensive experiments that it stabilizes training and improves image quality across multiple diffusion backbones and tasks.

GRPO-Guardflow matchinggenerative AI
0 likes · 9 min read
How GRPO‑Guard Stops Over‑Optimization in Flow‑Based Visual Generators
Continuous Delivery 2.0
Continuous Delivery 2.0
Nov 13, 2025 · Artificial Intelligence

Shopify’s Blueprint for Scalable AI Agents: Architecture, Evaluation, and Reward‑Hack Fixes

This article details how Shopify engineered the Sidekick AI agent platform, covering its evolving architecture, just‑in‑time instruction system, rigorous LLM evaluation framework, GRPO training method, and strategies to prevent reward‑hacking, offering practical guidance for building production‑ready agentic systems.

AI agentsAgentic SystemsLLM evaluation
0 likes · 13 min read
Shopify’s Blueprint for Scalable AI Agents: Architecture, Evaluation, and Reward‑Hack Fixes
Network Intelligence Research Center (NIRC)
Network Intelligence Research Center (NIRC)
Apr 9, 2025 · Artificial Intelligence

Why Scaling Laws Fail for Video MLLMs: Uncovering the Temporal Hacking Problem

The article analyzes the anti‑scaling phenomenon in video large‑language models, identifies a “temporal hacking” shortcut where models focus on a few key frames, formalizes it via reward‑hacking theory, introduces the Temporal Perplexity (TPL) metric, and proposes an Unhackable Temporal Rewarding (UTR) framework to mitigate the issue.

Scaling LawTemporal PerplexityUTR
0 likes · 14 min read
Why Scaling Laws Fail for Video MLLMs: Uncovering the Temporal Hacking Problem
Baobao Algorithm Notes
Baobao Algorithm Notes
Jul 23, 2023 · Artificial Intelligence

Why Cold Starts, Reward Hacking, and Evaluation Matter in LLM Training

The article analyzes key challenges in large‑language‑model pipelines—including the necessity of cold‑start pretraining, the pitfalls of reward‑model hacking, efficiency‑effectiveness trade‑offs, evaluation difficulties, and downstream fine‑tuning limits—offering practical insights for more reliable LLM development.

EfficiencyLLMRLHF
0 likes · 9 min read
Why Cold Starts, Reward Hacking, and Evaluation Matter in LLM Training