Can One Model Master All Audio‑Visual Tasks? Introducing Crab’s Unified Approach
This article presents Crab, a unified audio‑visual scene understanding model that leverages a novel display‑cooperation learning paradigm, introduces the AV‑UIE dataset with explicit reasoning steps, and demonstrates superior performance across temporal, spatial, pixel‑level, and spatio‑temporal tasks through extensive experiments and ablations.
