
From Image Captioning to Detective‑Style Perception: Pixel‑Searcher Beats Closed‑Source Models

Pixel‑Searcher introduces an agentic search‑driven visual perception framework that integrates web‑based evidence with pixel‑level grounding, and the new WebEyes benchmark demonstrates its superiority over existing open‑ and closed‑source multimodal models across localization, segmentation, and VQA tasks.

AIWalker

May 17, 2026


Problem: Multimodal models miss knowledge‑intensive visual queries

When a query about a photo hinges on facts that are not in the pixels, such as one that chains the launch month of a phone, the brand acquired for $2.7 B, and that brand's global ambassador, the model fails: the decisive clues are absent from both the visual cues and its frozen internal knowledge.

Existing visual perception models rely either on visible cues or on static internal knowledge, neither of which can answer queries that need up‑to‑date or long‑tail factual information.

Perception paradigms

Visual‑Cue Segmentation : uses explicit visual cues (basic mode).

Reasoning Segmentation : leverages internal model knowledge (intermediate mode).

Perception Deep Research : adds an agentic search component that fetches web evidence, builds an evidence chain, and feeds it to downstream tasks (localization, segmentation, VQA).

WebEyes benchmark

WebEyes is a new benchmark that requires models to incorporate external knowledge to answer visual queries. It defines three complementary tasks:

Search‑based Grounding : predict a bounding box for a knowledge‑intensive query.

Search‑based Segmentation : predict a pixel‑level mask for the target.

Search‑based VQA : choose the correct knowledge‑based description for a highlighted object.

The dataset is built through a five‑stage automated pipeline. In the final “Automatic Filtering” stage, 38.2 % of candidate questions that can be answered without external knowledge are discarded, ensuring every remaining item requires web evidence.
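The filtering rule can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual pipeline: the `Candidate` type, the `answered_without_search` flag, and the toy questions are all hypothetical, standing in for whatever check decides that a question is answerable with no web access.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    question: str
    # Hypothetical flag: True if the question was answered correctly
    # without any external (web) knowledge.
    answered_without_search: bool


def automatic_filter(candidates):
    """Keep only questions that were NOT answerable without external knowledge."""
    kept = [c for c in candidates if not c.answered_without_search]
    discard_rate = 1 - len(kept) / len(candidates) if candidates else 0.0
    return kept, discard_rate


# Toy pool (invented examples, not benchmark data).
pool = [
    Candidate("Which object is the red mug?", True),
    Candidate("Which object belongs to the brand acquired for $2.7 B?", False),
    Candidate("Which person is that brand's global ambassador?", False),
]
kept, rate = automatic_filter(pool)
print(len(kept), round(rate, 3))
```

On the real candidate pool, the article reports a discard rate of 38.2 % for this stage.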

WebEyes contains 120 images, 473 annotated instances, and 1,927 task samples across six balanced categories (ICON, Celebrities, Pop‑IP, etc.). It is the only benchmark that requires both general knowledge and web knowledge while also covering localization, segmentation, and VQA with fine‑grained evaluation.

Pixel‑Searcher workflow

Phase 1 – Agentic Search & Target Resolution

The goal is to answer “Who am I looking for?” Given a complex query, the system first plans three sub‑queries (e.g., launch month, acquired brand, ambassador identity). It then enters an adaptive SEARCH → REASON → RESOLVE loop:

SEARCH : fetch web evidence for each sub‑query.

REASON : infer the intermediate entity from the evidence.

RESOLVE : produce a hypothesis h containing entity name, visual class, and key evidence.

The loop may run a single round for simple questions or multiple rounds for complex, multi‑hop queries. The resulting hypothesis h structures the target as (entity e, visual class c, checkable clues K), preparing for visual binding.
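A rough sketch of how such a loop might be organized is shown below. It is an assumption‑laden illustration rather than the authors' implementation: `resolve_target`, the `search` and `reason` callables, the `max_rounds` cap, and the toy stubs are all hypothetical. Only the (entity, visual class, clues) shape of h comes from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    entity: str                                  # e: resolved entity name
    visual_class: str                            # c: kind of object to look for
    clues: list = field(default_factory=list)    # K: checkable clues


def resolve_target(sub_queries, search, reason, max_rounds=3):
    """Run SEARCH -> REASON -> RESOLVE until the target is resolved.

    Hypothetical callables supplied by the caller:
      search(query)    -> list of evidence strings
      reason(evidence) -> ((entity, visual_class) or None, follow_up_queries)
    """
    evidence = []
    pending = list(sub_queries)
    for round_idx in range(1, max_rounds + 1):
        for query in pending:                    # SEARCH
            evidence.extend(search(query))
        resolved, pending = reason(evidence)     # REASON
        if resolved is not None:                 # RESOLVE
            entity, visual_class = resolved
            return Hypothesis(entity, visual_class, list(evidence)), round_idx
    return None, max_rounds


# Toy stubs so the sketch runs end to end.
def toy_search(query):
    return [f"evidence for: {query}"]


def toy_reason(evidence):
    # Toy rule: resolve once three pieces of evidence have been gathered.
    if len(evidence) >= 3:
        return ("example ambassador", "person"), []
    return None, ["follow-up query"]


h, rounds = resolve_target(
    ["launch month", "acquired brand", "ambassador identity"],
    toy_search,
    toy_reason,
)
print(h.entity, h.visual_class, rounds)
```

With three sub‑queries the toy rule resolves in one round; starting from a single sub‑query it needs three rounds, which mirrors the single‑round versus multi‑hop behavior described above.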

Phase 2 – Agentic Grounding & Tool Use

With hypothesis h in hand, the system addresses “Where is this entity in the image?”. It uses the extracted clues to guide a grounding tool that proposes candidate regions, then performs evidence verification to select the region best matching both the web‑derived identity and visual appearance.
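The propose‑then‑verify step can be illustrated with a toy selector. Everything here is a stand‑in: the `Region` type, its `description` field, and the string‑matching `verify` score are hypothetical, and a real system would presumably score regions with a multimodal model rather than substring checks.

```python
from dataclasses import dataclass


@dataclass
class Region:
    box: tuple          # (x1, y1, x2, y2) in pixels
    description: str    # hypothetical text describing what the region shows


def verify(region, clues):
    """Toy verification score: fraction of clues found in the description."""
    if not clues:
        return 0.0
    text = region.description.lower()
    return sum(clue.lower() in text for clue in clues) / len(clues)


def select_region(candidates, clues):
    """Return the candidate with the highest score, or None if there are none."""
    return max(candidates, key=lambda r: verify(r, clues), default=None)


candidates = [
    Region((10, 20, 110, 220), "man in a blue suit holding a phone"),
    Region((150, 30, 260, 240), "woman in a red jacket with a logo"),
]
best = select_region(candidates, ["red jacket", "logo"])
print(best.box)
```

Keeping proposal and verification separate means the grounding tool can over‑generate candidates while the clues from Phase 1 decide among them.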

For segmentation, the selected bounding box is passed to a dedicated segmenter (e.g., SAM 3) to produce a refined pixel‑level mask.

Experimental validation

SearchGround localization

Pixel‑Searcher achieves the highest IoU among open‑source methods, raising Qwen3‑VL‑8B’s IoU from 26.81 to 34.17 and its thresholded localization accuracy from 32.61 to 41.30. Gains are especially large on high‑ambiguity categories such as Anime and ICON.
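For readers unfamiliar with the metric, box IoU is the area of overlap divided by the area of the union. The sketch below computes it along with a thresholded hit rate of the kind often reported next to mean IoU. The function names and the 0.5 threshold are assumptions for illustration, and the boxes are made‑up values, not benchmark data.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def summarize(preds, gts, threshold=0.5):
    """Mean IoU and the share of predictions with IoU >= threshold (assumed 0.5)."""
    ious = [box_iou(p, g) for p, g in zip(preds, gts)]
    mean_iou = sum(ious) / len(ious)
    hit_rate = sum(i >= threshold for i in ious) / len(ious)
    return mean_iou, hit_rate


preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (5, 0, 15, 10)]
mean_iou, hit_rate = summarize(preds, gts)
print(round(mean_iou, 4), hit_rate)
```

In this toy case one box matches exactly and the other overlaps by a third, so the two summaries diverge: the mean rewards partial overlap, while the hit rate counts only predictions that clear the threshold.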

SearchSeg segmentation

On the SearchSeg benchmark, Pixel‑Searcher records a gIoU of 39.17 and a cIoU of 32.41, the top open‑source result, confirming that accurate target resolution in Phase 1 benefits downstream segmentation even when using off‑the‑shelf tools.
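The article does not define gIoU and cIoU. A convention common in reasoning‑segmentation evaluation, assumed here, takes gIoU as the mean of per‑image IoUs and cIoU as the cumulative intersection over the cumulative union. The sketch below follows that assumption; masks are flat lists of 0/1 values, and scoring an empty union as 1.0 is a further choice made for this example.

```python
def mask_stats(pred, gt):
    """Intersection and union pixel counts for two flat binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union


def giou_ciou(preds, gts):
    """gIoU: mean of per-image IoUs. cIoU: summed intersection / summed union."""
    stats = [mask_stats(p, g) for p, g in zip(preds, gts)]
    # Assumed convention: an empty prediction on an empty target scores 1.0.
    per_image = [i / u if u else 1.0 for i, u in stats]
    giou = sum(per_image) / len(per_image)
    total_inter = sum(i for i, _ in stats)
    total_union = sum(u for _, u in stats)
    ciou = total_inter / total_union if total_union else 1.0
    return giou, ciou


# Toy masks: a small object segmented perfectly, a large one segmented poorly.
preds = [[1, 0], [1, 1, 1, 1, 0, 0]]
gts = [[1, 0], [0, 0, 1, 1, 1, 1]]
giou, ciou = giou_ciou(preds, gts)
print(round(giou, 4), round(ciou, 4))
```

Under this convention cIoU weights large objects more heavily, which is one way a method can score lower on cIoU than on gIoU.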

SearchVQA

Pixel‑Searcher reaches an overall accuracy of 42.24 %, surpassing many closed‑source models on the ICON and Anime categories.

Ablation study

Removing the “direct candidate” mechanism drops IoU from 34.17 to 20.14, with a comparable drop in gIoU, showing that the structured evidence generated in Phase 1 is critical. Removing other components, such as contradiction checking or reference matching, causes only minor fluctuations, indicating that they are complementary rather than essential.

Limitations and error analysis

Failure analysis reveals that most errors (304 out of 389 failed segmentation samples) stem from incorrect search or entity resolution, not from the final mask generation. Only 10 errors are attributable to the SAM 3 mask conversion.
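The counts above translate into shares as follows. This small calculation assumes the 10 mask‑conversion errors are part of the same 389 failed samples and that the categories do not overlap; the "other" bucket is simply the remainder and is not itemized in the article.

```python
failed = 389             # failed segmentation samples
search_or_entity = 304   # incorrect search or entity resolution
mask_conversion = 10     # SAM 3 mask conversion
# Remainder under the disjointness assumption stated above.
other = failed - search_or_entity - mask_conversion

breakdown = {
    "search / entity resolution": search_or_entity,
    "mask conversion": mask_conversion,
    "other (remainder)": other,
}
shares = {name: f"{count / failed:.1%}" for name, count in breakdown.items()}
for name, share in shares.items():
    print(f"{name}: {breakdown[name]} ({share})")
```

Under those assumptions, roughly 78 % of failures originate before any pixel is predicted, while mask conversion accounts for under 3 %.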

These findings point to three areas for improvement:

Improving evidence retrieval precision.

Enhancing identity reasoning under ambiguous or noisy information.

Making visual instance binding more robust to subtle visual variations.

Broader impact

The work demonstrates a paradigm shift from static models to dynamic agents that actively explore external information, establishing a closed‑loop “search‑reason‑verify” pipeline that bridges web knowledge and pixel‑level prediction. WebEyes enables quantitative evaluation of this capability and opens possibilities for long‑tail object recognition in autonomous driving, medical imaging with patient records, and cross‑modal product traceability in e‑commerce.

Reference

From Web to Pixels: Bringing Agentic Search into Visual Perception
