
Cross-Domain Product Representation (COPE): A Large-Scale Dataset and Baseline Model for Rich‑Content E‑Commerce

This paper introduces ROPE, the first large-scale cross-domain product recognition dataset, covering detail pages, short videos, and live streams, and proposes COPE, a dual-tower multimodal model that learns unified product embeddings with contrastive and classification losses, achieving superior retrieval and few-shot classification performance across domains.


Background – With users spending more time on short‑video and live‑stream media, e‑commerce has shifted toward rich‑content formats, creating challenges due to the huge variance in product appearance across domains. A unified product representation is essential for consistent search and recommendation.

Dataset Construction (ROPE) – The authors collected over 2 billion raw samples from detail pages, short videos, and live streams, then sampled 1 % for manual annotation using Chinese‑CLIP similarity filtering. After annotation and similarity‑based merging, the final dataset contains 189,958 products with 3,056,624 detail‑page images, 5,876,527 short‑video clips, and 3,495,097 live‑stream slices. The data are split into a training set (187,431 products) and a validation set (2,527 products), both with multimodal annotations.
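The similarity-based merging step can be sketched as a greedy clustering over Chinese-CLIP embeddings. The threshold value and the greedy first-match strategy below are illustrative assumptions, not details from the paper:

```python
import numpy as np

def merge_by_similarity(embeddings: np.ndarray, threshold: float = 0.9):
    """Greedy similarity-based merging: assign each sample to the first
    existing cluster whose representative embedding exceeds `threshold`
    cosine similarity; otherwise start a new cluster.
    `threshold` is an illustrative value, not from the paper."""
    # L2-normalize so dot products are cosine similarities
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    reps, labels = [], []
    for e in emb:
        sims = [float(e @ r) for r in reps]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))
        else:
            labels.append(len(reps))
            reps.append(e)
    return labels
```

At ROPE's scale such merging would run over approximate-nearest-neighbor indices rather than exhaustive comparison, but the clustering logic is the same.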

Evaluation Tasks – Two tasks are defined on ROPE: (1) cross‑domain product retrieval (six domain‑pair combinations) and (2) cross‑domain few‑shot (one‑shot) classification, each evaluated on the validation set.
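Cross-domain retrieval with R@1 reduces to nearest-neighbor search in the shared embedding space: queries come from one domain (e.g. live streams) and the gallery from another (e.g. detail pages). `recall_at_1` below is a hypothetical helper for illustration, not the authors' evaluation script:

```python
import numpy as np

def recall_at_1(query_emb, gallery_emb, query_ids, gallery_ids):
    """For each query embedding, retrieve the single most similar gallery
    embedding by cosine similarity and check whether it shares the
    query's product ID. Returns the fraction of correct top-1 hits."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    nearest = np.argmax(q @ g.T, axis=1)  # index of top-1 gallery item
    hits = [query_ids[i] == gallery_ids[j] for i, j in enumerate(nearest)]
    return float(np.mean(hits))
```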

Method (COPE) – COPE adopts a dual‑tower architecture with shared visual and textual encoders (X‑CLIP visual encoder with Cross‑frame Communication Transformer and Multi‑frame Integration Transformer; three‑layer RoBERTa text encoder). Domain‑specific projection layers map modality‑specific features into a common space, followed by a shared multimodal fusion encoder. Visual encoders process multiple frames (8 frames per video) and aggregate them; textual encoders process product titles.
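A minimal sketch of the visual branch, assuming mean-pooling for frame aggregation and plain linear domain-specific projections; the actual model uses transformer-based frame aggregation (X-CLIP) and a shared multimodal fusion encoder, and the dimensions here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM_IN, DIM_OUT = 512, 256  # illustrative feature sizes, not from the paper

# One projection per domain maps encoder features into the common space
projections = {d: rng.standard_normal((DIM_IN, DIM_OUT)) / np.sqrt(DIM_IN)
               for d in ("detail_page", "short_video", "live_stream")}

def embed(frame_features: np.ndarray, domain: str) -> np.ndarray:
    """Sketch of the visual branch: per-frame encoder features
    (n_frames x DIM_IN) are aggregated (mean-pooled here), passed through
    the domain-specific projection, and L2-normalized into the shared
    embedding space."""
    pooled = frame_features.mean(axis=0)   # aggregate the 8 sampled frames
    z = pooled @ projections[domain]       # domain-specific projection
    return z / np.linalg.norm(z)           # unit-norm product embedding
```

Because all three domains land in the same unit-norm space, similarities between any domain pair are directly comparable dot products.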

Loss Functions – Training combines a contrastive loss (using seven similarity measures across domains) and a product‑classification loss (softmax over product IDs). The final loss is a weighted sum of the two components.
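The two-part objective can be sketched as pairwise InfoNCE terms between domains plus a softmax cross-entropy over product IDs. The temperature, the equal pairwise weighting, and the weight `alpha` are illustrative assumptions; the paper's contrastive term uses seven similarity measures and its own weighting:

```python
import numpy as np

def info_nce(za, zb, temperature=0.07):
    """Contrastive loss between two domains' L2-normalized embeddings;
    row i of `za` and `zb` belong to the same product (the positive pair).
    `temperature` is an illustrative value, not from the paper."""
    logits = (za @ zb.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def total_loss(z_by_domain, class_logits, product_ids, alpha=1.0):
    """Weighted sum of cross-domain contrastive terms and a softmax
    product-ID classification term, mirroring the paper's two components."""
    domains = list(z_by_domain)
    contrastive = np.mean([info_nce(z_by_domain[a], z_by_domain[b])
                           for i, a in enumerate(domains)
                           for b in domains[i + 1:]])
    # softmax cross-entropy over product IDs
    logits = class_logits - class_logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -np.mean(log_probs[np.arange(len(product_ids)), product_ids])
    return contrastive + alpha * ce
```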

Implementation Details – RoBERTa and X‑CLIP are used for initialization. Training runs for 80 epochs with batch size 84, cosine‑decay learning rate schedule (warm‑up 2 epochs), and learning rates 5e‑5 (text), 5e‑7 (vision), 5e‑3 (other modules). AdamW optimizer with gradient accumulation is employed on 14 A10 GPUs.
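The warm-up-then-cosine-decay schedule can be expressed per step; `lr_at` is a hypothetical helper and the step counts in the usage below are illustrative:

```python
import math

def lr_at(step, total_steps, warmup_steps, base_lr):
    """Linear warm-up for `warmup_steps`, then cosine decay from
    `base_lr` to zero over the remaining steps. The paper applies such
    a schedule per module, with base LRs of 5e-5 (text), 5e-7 (vision),
    and 5e-3 (other modules) and a 2-epoch warm-up."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

With gradient accumulation, `step` counts optimizer updates rather than forward passes, so the schedule length should be computed after dividing by the accumulation factor.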

Experimental Results – COPE outperforms existing multimodal baselines (e.g., CLIP4CLIP, FashionCLIP) on both retrieval (R@1 up to 82.58 %) and classification (accuracy up to 59.84 %). It also generalizes well to other product datasets (Product1M, M5Product). Ablation studies show that the classification loss and balanced sampling strategy significantly boost performance.

Visualization – t‑SNE plots of 30 randomly sampled products demonstrate that embeddings of the same product from different domains cluster tightly, confirming effective cross‑domain alignment.

Conclusion – The authors provide the first large‑scale, multi‑domain e‑commerce dataset and a strong baseline model, paving the way for future research on cross‑domain product representation in rich‑content e‑commerce scenarios.

e-commerce, deep learning, contrastive learning, multimodal, dataset, cross-domain, product representation
Written by

Kuaishou Tech

Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
