
LPR4M: A Large-Scale Multimodal Livestreaming Product Recognition Dataset and the RICE Cross‑View Semantic Alignment Model

This paper introduces LPR4M, a 4‑million‑pair multimodal dataset for livestreaming product recognition, and proposes the RICE model that combines instance‑level contrastive learning with patch‑level cross‑view semantic alignment, demonstrating state‑of‑the‑art performance on both LPR4M and MovingFashion benchmarks.

Kuaishou Tech

Background – In Kuaishou’s livestream e‑commerce, trust between anchors and viewers drives sales, but user interest varies with the specific products shown. To enable fine‑grained, real‑time product recommendation, each livestream frame must be accurately matched to its corresponding product.

Limitations of Existing Datasets – Prior datasets such as AsymNet, WAB and MovingFashion either lack textual modality, have limited scale (≈70K pairs), or focus only on fashion items, creating a large domain gap to real‑world livestream scenarios.

Our Contribution: LPR4M – We construct LPR4M, a 4,033,696‑pair (livestream segment, product image) dataset covering 34 everyday categories, with image, video and ASR text modalities. The dataset is 50× larger than the biggest public LPR dataset, exhibits a long‑tail distribution, and provides diverse variations in product size, visibility duration, and background clutter.

Problem Definition – Given a livestream video segment as the query, the goal is to retrieve its matching product image from a large gallery by learning discriminative cross‑modal representations.

Method: RICE Model

Instance‑level Contrastive Learning (ICL) – Images and video frames are split into non‑overlapping patches, projected to tokens, and processed by a shared ViT‑B/32 encoder. Global [CLS] embeddings are aligned with an InfoNCE loss.
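The instance‑level objective can be illustrated with a minimal NumPy sketch of symmetric InfoNCE over a batch of paired [CLS] embeddings. The temperature value and the exact symmetrization are assumptions for illustration, not taken from the paper:

```python
import numpy as np

def info_nce(img_emb, vid_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    img_emb, vid_emb: (B, D) arrays of global [CLS] embeddings; row i of
    each array is a matching (product image, livestream clip) pair.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    vid = vid_emb / np.linalg.norm(vid_emb, axis=1, keepdims=True)
    logits = img @ vid.T / temperature   # (B, B) similarity matrix
    labels = np.arange(len(img))         # positives lie on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[labels, labels].mean()

    # average the image->video and video->image retrieval losses
    return 0.5 * (xent(logits) + xent(logits.T))
```

Matched pairs on the diagonal are pulled together while all other in‑batch pairs serve as negatives.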

Patch‑level Semantic Alignment – A Pairwise Matching Decoder (PMD) performs self‑attention on image patches and cross‑attention between image and video patches to compute fine‑grained similarity.
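The cross‑attention step at the heart of the PMD can be sketched with a single attention head in NumPy. Learned Q/K/V projections, multiple heads, and the self‑attention stage are omitted for brevity; this only shows how image patches attend to video patches:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(img_patches, vid_patches):
    """Single-head cross-attention: image patches query video patches.

    img_patches: (N, D) image patch tokens, vid_patches: (M, D) video
    patch tokens. Returns the attended features and the (N, M)
    attention map used for fine-grained similarity.
    """
    d = img_patches.shape[1]
    attn = softmax(img_patches @ vid_patches.T / np.sqrt(d), axis=1)
    return attn @ vid_patches, attn
```

Each row of the attention map is a distribution over video patches, indicating which regions of the clip each image patch is matched to.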

Patch Feature Reconstruction (PFR) – Using the attention weights, the model learns to reconstruct image patch features from video patches, supervised by a reconstruction loss.
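A minimal sketch of the reconstruction idea, assuming an MSE objective (the paper's exact loss form and reconstruction target may differ): each image patch is reconstructed as the attention‑weighted sum of video patch features, and the error supervises the alignment.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pfr_loss(img_patches, vid_patches):
    """Reconstruct image patch features from video patches via the
    cross-attention weights, then score the reconstruction with MSE."""
    d = img_patches.shape[1]
    attn = softmax(img_patches @ vid_patches.T / np.sqrt(d), axis=1)
    recon = attn @ vid_patches                 # (N, D) reconstructions
    return np.mean((recon - img_patches) ** 2) # reconstruction loss
```

When the clip contains patches that closely match the image patches, attention concentrates on them and the loss approaches zero.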

Intent Product Detection (IPD) – Detected intent‑product bounding boxes replace video patches, focusing the model on the target item and suppressing background clutter. Both single‑frame (DETR‑based) and multi‑frame (temporal transformer) detectors are employed.

The final loss is a weighted sum of the ICL, PMD matching, and PFR reconstruction losses.

Implementation Details – Encoders share CLIP‑pretrained ViT‑B/32 weights; PMD is also initialized from CLIP. Ten frames are uniformly sampled from each video segment and resized to 224×224. Training uses Adam with cosine decay, batch size 256, learning rates 1e‑7 (shared encoder) and 1e‑4 (new modules), on eight NVIDIA Tesla V100 GPUs for ~90 hours (3 epochs).
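The paper states that ten frames are sampled uniformly from each segment; one common recipe (the exact index formula here is an assumption) is to take the center frame of each of ten equal‑length temporal bins:

```python
import numpy as np

def uniform_frame_indices(num_frames_in_clip, num_samples=10):
    """Pick `num_samples` frame indices spread evenly over a clip by
    taking the center of each of `num_samples` equal temporal bins."""
    bins = np.linspace(0, num_frames_in_clip, num_samples + 1)
    return ((bins[:-1] + bins[1:]) / 2).astype(int)
```

For a 300‑frame clip this yields one frame every 30 frames, starting at frame 15.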

Experiments

Evaluation is conducted on the LPR4M test set (20,079 query clips, 66,358 gallery images) using rank‑k retrieval accuracy.
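Rank‑k accuracy is the fraction of queries whose ground‑truth gallery image appears among the top‑k retrieved results. A compact NumPy implementation:

```python
import numpy as np

def rank_k_accuracy(sim, gt_index, k=1):
    """Fraction of queries whose ground-truth item is in the top-k.

    sim: (Q, G) similarity matrix between query clips and gallery
    images; gt_index: (Q,) index of each query's matching gallery image.
    """
    # rank of the ground-truth item = count of gallery items scored higher
    gt_scores = sim[np.arange(len(sim)), gt_index][:, None]
    ranks = (sim > gt_scores).sum(axis=1)  # 0 means retrieved first
    return float((ranks < k).mean())
```

With k=1 this is the R1 metric reported in the ablations below.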

RICE outperforms prior LPR methods (FashionNet, AsymNet, SEAM) and strong video‑understanding baselines.

On MovingFashion, RICE achieves the highest accuracy compared with NVAN and MGH.

Ablation studies show that PMD improves R1 by 2.3%, PFR adds a further 0.9%, IPD contributes another 1.0%, and incorporating Chinese‑CLIP text embeddings boosts performance further.

Visualization – Attention maps from PMD illustrate that the model correctly focuses on the target product even under occlusion or heavy background clutter.

Conclusion – LPR4M provides a realistic, large‑scale benchmark for livestream product recognition, and the RICE model demonstrates that combining instance‑level contrastive learning with patch‑level cross‑view alignment and intent‑product detection yields significant performance gains. We hope the dataset and baseline will stimulate further research in multimodal livestream analytics.

Tags: deep learning, product recognition, cross-view alignment, livestreaming, multimodal dataset