
Action Sequence Verification in Videos with CosAlignment Transformer (CAT)

The paper introduces Action Sequence Verification (ASV), a task that asks whether two videos follow the same ordered sequence of actions. To support it, the authors provide the new Chemical Sequence Verification dataset along with re‑annotated COIN‑SV and Diving48‑SV splits, and propose the CosAlignment Transformer (CAT), which combines intra‑step feature extraction, a Transformer‑based inter‑step encoder, and a sequence‑alignment loss. CAT outperforms prior baselines and can also serve as a pre‑training model for video retrieval and classification.

Xiaohongshu Tech REDtech

At CVPR 2022, ShanghaiTech University and the Xiaohongshu multimodal algorithm team introduced a novel task called Action Sequence Verification (ASV). The goal is to determine whether two videos present the same sequence of actions, moving beyond traditional video tasks that focus on single actions.

The task has practical applications in entertainment and sports, such as automatic scoring in diving competitions, where a candidate video can be compared against a standard reference. In industrial settings, it can be used for standard‑process monitoring.

To support ASV, the authors created a new first‑person dataset named Chemical Sequence Verification (CSV), which records procedural steps in chemistry experiments. CSV contains about 2,000 videos, over 100 step categories, and 18 atomic actions, providing a rich set of positive and negative pairs with variations such as added, missing, or reordered sub‑actions.

In addition, the existing COIN and Diving48 datasets were re‑annotated and re‑partitioned to better fit the ASV setting, resulting in COIN‑SV and Diving48‑SV.

The proposed method, CosAlignment Transformer (CAT), consists of three modules:

Intra‑step module: extracts sub‑action level features from frame‑wise feature maps.

Inter‑step module: adopts a Transformer encoder (inspired by ViT) to model temporal relationships between sub‑actions and obtain a global video representation.

Alignment module: introduces a Sequence Alignment Loss that aligns the feature sequences of a positive video pair, enforcing a one‑to‑one correspondence of sub‑actions across the two videos.
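The alignment idea above can be illustrated with a minimal sketch. This is not the paper's exact loss formulation: it assumes the loss is computed as a row‑wise cross‑entropy over a cosine‑similarity matrix between the two videos' sub‑action feature sequences, with the diagonal as the target, which enforces the one‑to‑one correspondence the article describes. The function names and the temperature parameter are illustrative, not from the paper.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """a: (T, D), b: (T, D) -> (T, T) pairwise cosine similarities."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def sequence_alignment_loss(a, b, temperature=0.1):
    """Cross-entropy that pushes sub-action i in video A toward
    sub-action i in video B (one-to-one alignment assumption)."""
    logits = cosine_similarity_matrix(a, b) / temperature
    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Target is the diagonal: step i aligns with step i.
    return -np.mean(np.diag(log_probs))

# Toy demo: a positive pair (nearly identical step features) should incur
# a much lower loss than a pair whose steps were reordered.
rng = np.random.default_rng(0)
steps = rng.normal(size=(5, 16))
aligned = steps + 0.01 * rng.normal(size=(5, 16))  # positive pair
reordered = steps[::-1]                            # same steps, wrong order
```

Intuitively, a reordered negative pair still contains high similarities, but they fall off the diagonal, so the loss rises; this is what lets the model distinguish "same steps, different order" pairs.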

Experiments on CSV, COIN‑SV, and Diving48‑SV demonstrate that CAT consistently outperforms traditional action‑recognition baselines. Ablation studies show that both the Transformer Encoder (TE) and Sequence Alignment (SA) components contribute positively to performance.

The authors note that the approach can also serve as a pre‑training model for downstream tasks such as video retrieval and classification, enhancing video search and recommendation systems.

Paper: https://arxiv.org/abs/2112.06447

Code: https://github.com/svip-lab/SVIP-Sequence-VerIfication-for-Procedures-in-Videos

computer vision · transformer · multimodal · video understanding · dataset · action verification
Written by

Xiaohongshu Tech REDtech

Official account of the Xiaohongshu tech team, sharing tech innovations and problem insights, advancing together.
