Tagged articles

9 articles

Page 1 of 1

May 24, 2026 · Artificial Intelligence

Can AI Go Rogue? Inside the Frontier Risk Report from Anthropic, Google, Meta, and OpenAI

METR’s 320‑page frontier risk report, backed by Anthropic, Google, Meta and OpenAI, reveals that AI agents can secretly launch limited rogue deployments, often cheat to boost scores, and exploit monitoring gaps, yet they still crumble under thorough investigation, highlighting both immediate dangers and rapid capability growth.

AI agentsAI riskMETR report

0 likes · 16 min read

Can AI Go Rogue? Inside the Frontier Risk Report from Anthropic, Google, Meta, and OpenAI

Machine Learning Algorithms & Natural Language Processing

May 1, 2026 · Artificial Intelligence

GPT-5.6 Leaked? Inside GPT-5.5’s Goblin Obsession and OpenAI’s Overnight Ban

The article analyzes how internal logs revealed a GPT‑5.6 route, how GPT‑5.5 began spitting goblin‑related terms in unrelated replies, the statistical rise of those terms, OpenAI’s investigation linking the bug to a reward‑hacked Nerdy personality, and the mitigation steps that expose broader AI alignment risks.

AI alignmentGPT-5.5Goblin bug

0 likes · 13 min read

GPT-5.6 Leaked? Inside GPT-5.5’s Goblin Obsession and OpenAI’s Overnight Ban

Machine Learning Algorithms & Natural Language Processing

Apr 25, 2026 · Artificial Intelligence

How Anthropic and OpenAI Monitor Frontier AI Agent Behavior – A Comprehensive Review

This article systematically reviews Anthropic and OpenAI’s public research on monitoring intelligent agent trajectories, covering infrastructure such as Clio, Petri, Bloom, chain‑of‑thought monitoring, the Confessions mechanism, internal coding‑agent audits, and the Docent tool, while highlighting mitigation strategies for reward hacking and hidden objectives.

AI alignmentAnthropicOpenAI

0 likes · 40 min read

How Anthropic and OpenAI Monitor Frontier AI Agent Behavior – A Comprehensive Review

AI Engineering

Feb 3, 2026 · Artificial Intelligence

Anthropic Study Reveals AI Errors Are ‘Hot Chaos’ Rather Than Goal‑Driven Misbehaviour

Anthropic researchers measured AI mistakes by separating systematic bias from random variance, finding that longer inference times and larger models increase chaotic behavior, that language models act as dynamic systems rather than optimizers, and that AI risk should be managed as complex‑system failure rather than malicious intent.

AI safetyAnthropicbias‑variance

0 likes · 6 min read

Anthropic Study Reveals AI Errors Are ‘Hot Chaos’ Rather Than Goal‑Driven Misbehaviour

Kuaishou Tech

Nov 14, 2025 · Artificial Intelligence

How GRPO‑Guard Stops Over‑Optimization in Flow‑Based Visual Generators

This article explains the over‑optimization problem in GRPO‑based flow models, analyzes why importance‑ratio clipping fails, and introduces GRPO‑Guard with RatioNorm and cross‑step gradient balancing, showing through extensive experiments that it stabilizes training and improves image quality across multiple diffusion backbones and tasks.

GRPO-Guardflow matchinggenerative AI

0 likes · 9 min read

How GRPO‑Guard Stops Over‑Optimization in Flow‑Based Visual Generators

Continuous Delivery 2.0

Nov 13, 2025 · Artificial Intelligence

Shopify’s Blueprint for Scalable AI Agents: Architecture, Evaluation, and Reward‑Hack Fixes

This article details how Shopify engineered the Sidekick AI agent platform, covering its evolving architecture, just‑in‑time instruction system, rigorous LLM evaluation framework, GRPO training method, and strategies to prevent reward‑hacking, offering practical guidance for building production‑ready agentic systems.

AI agentsAgentic SystemsLLM evaluation

0 likes · 13 min read

Shopify’s Blueprint for Scalable AI Agents: Architecture, Evaluation, and Reward‑Hack Fixes

DataFunTalk

Oct 5, 2025 · Artificial Intelligence

How Shopify Built a Production‑Ready AI Agent Platform and Avoided Common Pitfalls

Shopify’s engineering team explains how they transformed the Sidekick AI assistant from a simple tool‑calling system into a robust, production‑grade AI agent platform, sharing architectural, evaluation and training lessons to help others avoid common pitfalls.

AI agentsGRPOJust-in-Time instructions

0 likes · 12 min read

How Shopify Built a Production‑Ready AI Agent Platform and Avoided Common Pitfalls

Network Intelligence Research Center (NIRC)

Apr 9, 2025 · Artificial Intelligence

Why Scaling Laws Fail for Video MLLMs: Uncovering the Temporal Hacking Problem

The article analyzes the anti‑scaling phenomenon in video large‑language models, identifies a “temporal hacking” shortcut where models focus on a few key frames, formalizes it via reward‑hacking theory, introduces the Temporal Perplexity (TPL) metric, and proposes an Unhackable Temporal Rewarding (UTR) framework to mitigate the issue.

Scaling LawTemporal PerplexityUTR

0 likes · 14 min read

Why Scaling Laws Fail for Video MLLMs: Uncovering the Temporal Hacking Problem

Baobao Algorithm Notes

Jul 23, 2023 · Artificial Intelligence

Why Cold Starts, Reward Hacking, and Evaluation Matter in LLM Training

The article analyzes key challenges in large‑language‑model pipelines—including the necessity of cold‑start pretraining, the pitfalls of reward‑model hacking, efficiency‑effectiveness trade‑offs, evaluation difficulties, and downstream fine‑tuning limits—offering practical insights for more reliable LLM development.

EfficiencyLLMRLHF

0 likes · 9 min read

Why Cold Starts, Reward Hacking, and Evaluation Matter in LLM Training