How a Near‑Invisible Image Can Make GPT‑5.4 and Claude Opus 4.6 Spread False Claims

Researchers from ETH Zurich show that tiny, human‑imperceptible perturbations to a single image can fool leading visual language models—including GPT‑5.4, Claude Opus 4.6, and Grok—into confidently delivering fabricated answers, enabling misinformation amplification, defamation, content‑filter evasion, and large‑scale AI authority laundering.

Machine Heart

Visual language models (VLMs) such as GPT‑5.4, Claude Opus 4.6, Grok and others have become the default authority for image verification, e‑commerce recommendation, and content moderation. The ETH Zurich team led by Florian Tramèr asked what happens if the image the AI “sees” is subtly altered in ways invisible to the human eye.

The paper "Laundering AI Authority with Adversarial Examples" (arXiv:2605.04261) demonstrates that classic Projected Gradient Descent (PGD) adversarial perturbations, part of a line of adversarial-example research dating back to 2014, combined with transfer attacks on publicly available CLIP models, can make state‑of‑the‑art VLMs confidently produce wrong answers. The authors name this phenomenon AI Authority Laundering.
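To make the mechanism concrete, here is a minimal sketch of a targeted PGD loop in embedding space. It is not the authors' code. The linear map `W` is a toy stand-in for a surrogate image encoder such as CLIP, and the budget `eps = 8/255`, the step size `alpha`, the step count, and all function names are assumptions chosen for illustration. The idea shown is only the general one: nudge the pixels, within a small L-infinity budget, so that the surrogate embedding moves toward a chosen target embedding.

```python
import numpy as np

def embed(W, x):
    """Toy surrogate encoder (assumption): linear map, then L2 normalisation."""
    z = W @ x
    return z / np.linalg.norm(z)

def cosine_grad(W, x, target):
    """Gradient of cos(embed(x), target) w.r.t. x; target must be unit-norm."""
    z = W @ x
    norm = np.linalg.norm(z)
    dz = target / norm - (z @ target) * z / norm**3
    return W.T @ dz

def pgd_targeted(W, x, target, eps=8 / 255, alpha=1 / 255, steps=100):
    """Targeted PGD: raise similarity to `target` inside an L-inf ball of radius eps."""
    x_adv = x.copy()
    for _ in range(steps):
        g = cosine_grad(W, x_adv, target)
        x_adv = x_adv + alpha * np.sign(g)        # signed gradient ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project back onto the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep pixels in the valid range
    return x_adv

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 256))            # hypothetical encoder weights
x_src = rng.uniform(0.2, 0.8, size=256)   # "source" image, flattened
x_tgt = rng.uniform(0.2, 0.8, size=256)   # image whose embedding we imitate
target = embed(W, x_tgt)

x_adv = pgd_targeted(W, x_src, target)
sim_before = float(embed(W, x_src) @ target)
sim_after = float(embed(W, x_adv) @ target)
linf = float(np.max(np.abs(x_adv - x_src)))
print(f"cosine similarity before: {sim_before:.3f}, after: {sim_after:.3f}")
print(f"max per-pixel change: {linf:.4f}")
```

In a real transfer attack the gradient would come from backpropagating through an actual open-weights encoder rather than from a closed-form expression, and success against a closed model depends on how well the surrogate's embedding space matches the target's. That dependence is not modelled here.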

Case 1 – Amplifying misinformation: After historic photos (e.g., Apollo moon landing, 9/11, Trump’s alleged shooting) are perturbed, the models answer questions about their authenticity with high confidence, falsely claiming the images are fabricated.

Case 2 – Defaming individuals: A news screenshot about a drug‑trafficking arrest was perturbed so that its embedding resembled Elon Musk. When asked “who is in the article?”, Grok 4.2, Qwen 3.6 Plus and Gemini 3.1 Pro all identified Musk, even when the original headline named the actual person. Similar attacks caused Grok to generate a “handcuffed Musk” image when prompted to depict a more culpable criminal.

Case 3 – Bypassing NSFW filters: Ten images flagged by two NSFW detectors (98‑99% confidence) were shifted toward embeddings of toy dolls and teddy bears. ChatGPT then judged the images suitable for social media and praised their “high interaction potential”. In a finer‑grained test, 81% of “undressing” requests for perturbed female images were approved, while the displayed output remained the original female picture.

Reverse‑image search engines (Google, Bing, Yandex) also misidentified the adversarial versions, mapping a doctored Donald Trump picture to Elon Musk.

The authors stress that no new attack algorithms are required; the threat relies on well‑known techniques that have existed for over a decade. Consequently, the reported success rates represent a lower bound on attacker capability. As VLMs are increasingly embedded in high‑trust workflows—fact‑checking, moderation, recommendation—the adversarial example problem shifts from an academic benchmark curiosity to a concrete, deployable security risk.
