
Generating and Applying Social Relationship Graphs for Video Understanding

This talk presents recent research that combines dynamic video analysis with graph machine learning to generate social relationship graphs from video. It details hierarchical graph convolutional networks, multimodal feature fusion, weakly supervised training, experimental results, and applications such as enhanced video retrieval and storyline understanding.

DataFunSummit

Online social media platforms have created a huge demand for fine-grained retrieval and video semantic summarization services. Existing video understanding techniques lack deep semantic cues, and incorporating the social relationships of characters in videos can lead to more complete and accurate storyline comprehension.

The presentation first outlines the problem background, emphasizing that current methods focus on shallow visual actions and miss deeper semantic information such as interpersonal relationships.

It then reviews related work on social relationship recognition, from early image‑based datasets like PIPA and PISC to video‑oriented datasets such as MovieGraphs and ViSR, and describes the CVPR‑2019 MSTR framework that combines intra‑person, inter‑person, and person‑object graphs with TSN and GCN.

Building on these ideas, the authors propose a multimodal approach that fuses visual, textual (subtitles and live comments), and auditory cues to strengthen relationship classification, even when characters do not appear simultaneously.

The core contribution is a hierarchical graph convolutional network for social relationship graph generation, consisting of three modules:

1. A frame-level graph convolutional network that builds a sub-graph for each frame, integrating person nodes (C), pair nodes (P), background nodes (G), and temporal text nodes (T).

2. Multi-channel temporal accumulation, using two LSTMs to capture the dynamics of individual appearances (C) and pairwise interactions (P) across frames.

3. A segment-level graph convolutional network that aggregates the frame-level sub-graphs into richer segment-level graphs, additionally incorporating segment-level audio features.
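To make the first module concrete, here is a minimal numpy sketch of a single frame-level graph-convolution step over a toy sub-graph with the four node types described above. This is an illustrative assumption, not the authors' implementation: the adjacency layout, feature dimensions, and the `gcn_layer` helper are all hypothetical, and the LSTM accumulation and segment-level aggregation stages are omitted.

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One graph-convolution layer: relu(D^-1/2 (A + I) D^-1/2 @ H @ W)."""
    a_hat = adj + np.eye(adj.shape[0])            # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt      # symmetric normalization
    return np.maximum(0.0, a_norm @ feats @ weight)

# Toy frame-level sub-graph: two person nodes (C1, C2), one pair node (P),
# one background node (G), one temporal-text node (T).
adj = np.array([
    [0, 0, 1, 1, 1],   # C1 connects to P, G, T
    [0, 0, 1, 1, 1],   # C2 connects to P, G, T
    [1, 1, 0, 1, 1],   # P  connects to C1, C2, G, T
    [1, 1, 1, 0, 0],   # G  connects to C1, C2, P
    [1, 1, 1, 0, 0],   # T  connects to C1, C2, P
], dtype=float)
rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 8))    # per-node input features
weight = rng.normal(size=(8, 4))   # learnable projection

frame_embed = gcn_layer(adj, feats, weight)   # (5, 4) updated node states
```

In the full model, the per-frame C and P node states produced this way would feed the two LSTM channels of the second module before segment-level aggregation.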

The model is trained with weak supervision: only segment‑level relationship labels are available, so a specially designed loss function mitigates noise when supervising frame‑level networks.
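One common way to soften such label noise, shown below as a hypothetical stand-in for the paper's loss rather than its actual formulation, is to apply the segment label to every frame but re-weight each frame's cross-entropy by the model's own confidence in that label:

```python
import numpy as np

def weak_ce_loss(frame_logits, segment_label):
    """Cross-entropy of every frame against the single segment-level label,
    with each frame's contribution re-weighted by the model's confidence in
    that label, so frames where the relationship is not visible contribute
    little gradient noise. Illustrative only, not the authors' exact loss."""
    z = frame_logits - frame_logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)  # per-frame softmax
    p_label = probs[:, segment_label]      # prob. of the segment label per frame
    weights = p_label / p_label.sum()      # confident frames dominate the loss
    return float(-(weights * np.log(p_label + 1e-9)).sum())

rng = np.random.default_rng(1)
loss = weak_ce_loss(rng.normal(size=(6, 4)), segment_label=2)  # 6 frames, 4 classes
```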

Experimental results on the public ViSR dataset and a self‑constructed Bilibili dataset (which includes rich comment text) show significant improvements, especially for hostile relationships and cases with many characters where graph transitivity provides strong cues.

Applications of the generated social relationship graphs include:

Enhancing user experience by providing semantic storyline descriptions and causal linking of scenes.

Video person retrieval based on social context, where simple matrix operations and SVM classifiers combined with the relationship graph outperform state‑of‑the‑art visual‑only methods, particularly under occlusion or extreme pose variations.
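A sketch of how a relationship graph can assist retrieval with simple matrix operations: score each candidate by blending visual similarity with the cosine similarity between rows of the relationship adjacency matrix. This replaces the SVM classifier mentioned above with a plain linear blend for brevity; the function, data, and `alpha` weight are illustrative assumptions, not the paper's method.

```python
import numpy as np

def retrieve(visual_sim, rel_graph, query_rel, alpha=0.5):
    """Rank candidate person tracks by a convex blend of visual similarity
    and social-context similarity (cosine between relationship-graph rows)."""
    q = query_rel / (np.linalg.norm(query_rel) + 1e-9)
    g = rel_graph / (np.linalg.norm(rel_graph, axis=1, keepdims=True) + 1e-9)
    context_sim = g @ q                        # cosine similarity per candidate
    score = alpha * np.asarray(visual_sim) + (1 - alpha) * context_sim
    return np.argsort(-score)                  # best match first

# Candidate 2 is a weak visual match (e.g. occluded) but its relationship-graph
# row matches the query's social context, so it ranks first overall.
visual_sim = np.array([0.60, 0.50, 0.55])
rel_graph = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
ranking = retrieve(visual_sim, rel_graph, query_rel=np.array([0.0, 0.0, 1.0]))
```

The toy example illustrates the claimed robustness: social context rescues a candidate whose visual score alone would lose to a better-lit distractor.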

The future outlook highlights the trend toward dynamic, semantic multimodal graph models and the potential to extend scene graphs with richer temporal and semantic cues for finer‑grained understanding.

Tags: multimodal, video understanding, graph neural network, weak supervision, social relationship graph
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
