Boundary Content Graph Neural Network (BC‑GNN) for Temporal Action Proposal Generation
The Boundary Content Graph Neural Network (BC‑GNN) introduces a bipartite‑graph framework that jointly refines start/end boundary probabilities and segment‑content confidence, enabling more precise temporal action proposals and achieving state‑of‑the‑art results on ActivityNet‑1.3 and THUMOS14.
Recently, the computer‑vision conference ECCV 2020 announced its accepted papers. This article introduces a paper from the iQIYI team that proposes the Boundary Content Graph Neural Network (BC‑GNN). The method models the relationship between boundary prediction and content confidence using a graph neural network, producing more accurate temporal boundaries and reliable content confidence scores.
Temporal action proposal generation aims to locate high‑quality action segments in long, untrimmed videos. Existing approaches first predict start/end boundaries, then combine them into proposals, and finally assign a content confidence score. This pipeline ignores the interaction between boundary and content predictions.
BC‑GNN addresses this limitation by constructing a bipartite graph where candidate segment boundaries are nodes and segment contents are edges. A novel graph reasoning procedure updates both node and edge features, allowing the model to jointly predict boundary probabilities and content confidence. The approach achieves state‑of‑the‑art results on ActivityNet‑1.3 and THUMOS14 for both proposal generation and temporal action detection.
Method Overview
The overall framework consists of five stages:
Feature Encoding: A two‑stream network (spatial RGB branch and temporal optical‑flow branch) extracts D‑dimensional features for each video snippet, forming a T×D feature matrix.
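The encoding step can be sketched as concatenating the two streams' per-snippet descriptors into one T×D matrix; the snippet count and per-stream widths below are illustrative assumptions, not the paper's exact values.

```python
import numpy as np

# Sketch of the feature-encoding output, assuming per-snippet RGB and
# optical-flow descriptors are concatenated into one D-dimensional vector.
T = 100                                         # number of snippets (assumed)
rgb = np.random.randn(T, 200)                   # spatial-stream features (width assumed)
flow = np.random.randn(T, 200)                  # temporal-stream features (width assumed)
features = np.concatenate([rgb, flow], axis=1)  # T x D feature matrix, D = 400
print(features.shape)                           # (100, 400)
```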
Base Module: Two 1‑D convolutional layers enlarge the receptive field and serve as the backbone.
Graph Construction Module (GCM): For each snippet, the start and end timestamps become graph nodes, while the segment content becomes an edge. The bipartite graph connects start nodes N_s and end nodes N_e whenever t_e > t_s. Edge features are obtained by linearly interpolating the content feature matrix between the corresponding start and end positions, reshaping, and passing through a fully‑connected layer.
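A minimal sketch of this construction, assuming T temporal positions: every start timestamp t_s is linked to every later end timestamp t_e, and each edge's feature interpolates the T×D content features between the two positions before a fully connected layer (not shown). The sample count `n_samples` and all shapes are illustrative assumptions.

```python
import numpy as np

def build_edges(T):
    # Connect start node t_s to end node t_e whenever t_e > t_s.
    return [(t_s, t_e) for t_s in range(T) for t_e in range(T) if t_e > t_s]

def edge_feature(content, t_s, t_e, n_samples=8):
    # Linearly interpolate the T x D content features at n_samples evenly
    # spaced timestamps inside [t_s, t_e], then flatten; in the full model
    # this vector would pass through a fully connected layer.
    T, D = content.shape
    ts = np.linspace(t_s, t_e, n_samples)
    cols = [np.interp(ts, np.arange(T), content[:, d]) for d in range(D)]
    return np.stack(cols, axis=1).reshape(-1)   # (n_samples * D,)

content = np.random.randn(100, 32)              # T=100 snippets, D=32 (assumed)
edges = build_edges(100)
feat = edge_feature(content, t_s=10, t_e=40)
print(len(edges), feat.shape)                   # 4950 (256,)
```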
Graph Reasoning Module (GRM): Edge features are first updated by aggregating the features of their incident nodes. Node features are then updated by aggregating incident‑edge features through learned weight matrices and ReLU activations; the edge features are normalized before being used as weights in the node update.
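One reasoning step can be sketched as below. Random matrices stand in for the learned weights, the per-edge scalar score used for normalization is a placeholder, and the feature width is assumed; the sketch only illustrates the update order (edges from incident nodes, then nodes from normalized incident edges).

```python
import numpy as np

rng = np.random.default_rng(0)
T, F = 4, 8                                   # timestamps and feature width (assumed)
starts = rng.standard_normal((T, F))          # start-node features
ends = rng.standard_normal((T, F))            # end-node features
edges = [(s, e) for s in range(T) for e in range(T) if e > s]
edge_feat = rng.standard_normal((len(edges), F))
W_e = rng.standard_normal((3 * F, F)) * 0.1   # stand-in for a learned matrix
relu = lambda x: np.maximum(x, 0.0)

# 1) Edge update: each edge aggregates its own feature with both incident nodes.
src = np.array([s for s, _ in edges])
dst = np.array([e for _, e in edges])
edge_feat = relu(np.concatenate([edge_feat, starts[src], ends[dst]], axis=1) @ W_e)

# 2) Node update: each node aggregates incident-edge features, with the edge
#    weights normalized (softmax over placeholder scalar scores) beforehand.
scores = edge_feat.sum(axis=1)
new_starts = starts.copy()
for s in range(T):
    ids = np.where(src == s)[0]
    if ids.size:
        w = np.exp(scores[ids]); w /= w.sum()
        new_starts[s] = relu(w @ edge_feat[ids])
print(new_starts.shape)                       # (4, 8)
```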
Output Module: The updated node and edge features are used to predict boundary (start and end) probabilities and content confidence scores, which are combined into high‑quality proposals.
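As in many boundary-based proposal methods, the final proposal score can be formed by multiplying the boundary probabilities with the content confidence; this combination rule is an assumption for illustration, not necessarily the paper's exact formula.

```python
# Illustrative predictions for one candidate segment (assumed values).
p_start, p_end, confidence = 0.9, 0.8, 0.7

# Combine boundary probabilities with content confidence into one score.
proposal_score = p_start * p_end * confidence
print(round(proposal_score, 3))   # 0.504
```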
Experiments
BC‑GNN was evaluated on the ActivityNet‑1.3 and THUMOS14 datasets, outperforming existing approaches on both temporal action proposal generation and temporal action detection.
Ablation Study
Two key design choices were examined: converting the undirected bipartite graph into a directed graph and adding an explicit edge‑feature update step. Ablation results on ActivityNet‑1.3 demonstrate that both strategies contribute positively to performance.
Conclusion
The proposed BC‑GNN jointly models boundary and content predictions via graph neural networks, improving both boundary precision and content confidence. The approach can be extended to other tasks that involve coupled boundary and content estimation. Future work will explore more efficient designs to reduce the computational cost of the two‑stage pipeline commonly used in temporal action detection.
Paper link: https://arxiv.org/abs/2008.01432
iQIYI Technical Product Team