
AlignRec: A Joint Training Framework for Aligning Multimodal Representations with Personalized Recommendation

AlignRec is a joint‑training framework that synchronizes multimodal encoders with personalized recommendation models through a staged alignment strategy and three specialized loss functions. It preserves both content and ID signals, achieves state‑of‑the‑art performance on multiple datasets, and releases improved pre‑processed multimodal features for the Amazon dataset.

Xiaohongshu Tech REDtech

At CIKM 2024, the Xiaohongshu middle‑platform algorithm team introduced AlignRec, an innovative joint‑training framework that aligns multimodal representation learning models with personalized recommendation models. The authors identify a training‑step mismatch: recommendation signals dominate joint training, causing loss of multimodal information.

AlignRec addresses this with a staged alignment strategy and three targeted loss functions, so the jointly trained model retains both multimodal and recommendation signals. Experiments on multiple datasets show that AlignRec outperforms existing state‑of‑the‑art (SOTA) methods, and the team releases pre‑processed features for the public Amazon dataset that surpass currently available open‑source features.

The presentation outlines the practical background of recommendation and e‑commerce work, then details the core challenges of multimodal recommendation: (1) aligning multimodal representations (both content‑modalities and ID‑modalities), (2) balancing learning speeds between content and ID modalities, and (3) evaluating the impact of multimodal features on recommendation performance.

AlignRec’s architecture consists of three modules:

Multimodal Encoder Module : a pretrained multimodal encoder (MMEnc) based on BEiT‑3, trained with mask‑then‑predict objectives (masked image modeling and masked language modeling). The CLS token is used as the unified multimodal item representation.
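The CLS‑pooling step can be sketched as follows. This is a minimal illustration, not AlignRec's actual code: the encoder is mocked with random hidden states, and the token counts and embedding dimension are assumptions.

```python
import numpy as np

def cls_pool(hidden_states: np.ndarray) -> np.ndarray:
    """Use the first (CLS) token of the encoder output, shape (seq_len, dim),
    as the unified multimodal item representation."""
    return hidden_states[0]

# Mock BEiT-3-style output over concatenated image+text tokens:
# 1 CLS token + 196 image patches + 32 text tokens, dim 64 (assumed sizes).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(1 + 196 + 32, 64))
item_repr = cls_pool(hidden)
print(item_repr.shape)  # (64,)
```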

Aggregation Module : builds a heterogeneous graph from ID and multimodal embeddings, applies LightGCN for multi‑layer aggregation, and outputs user/item ID embeddings as well as multimodal item/user embeddings.
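The multi‑layer aggregation follows LightGCN's scheme: neighbor averaging with no feature transforms or nonlinearities, with all layer outputs averaged at the end. The sketch below shows that propagation rule on a toy user–item graph; AlignRec's heterogeneous graph additionally mixes ID and multimodal node embeddings, which is omitted here.

```python
import numpy as np

def lightgcn_propagate(adj_norm: np.ndarray, emb: np.ndarray,
                       n_layers: int = 3) -> np.ndarray:
    """LightGCN-style propagation: repeatedly multiply by the normalized
    adjacency (no weights, no nonlinearity) and average all layer outputs."""
    layers = [emb]
    for _ in range(n_layers):
        emb = adj_norm @ emb
        layers.append(emb)
    return np.mean(layers, axis=0)

# Toy bipartite graph: 2 users, 3 items -> 5 nodes total (assumed sizes).
A = np.array([
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
], dtype=float)
deg = A.sum(axis=1)
adj_norm = A / np.sqrt(np.outer(deg, deg))  # symmetric D^-1/2 A D^-1/2
emb = np.eye(5)  # trivial 5-d node embeddings for illustration
out = lightgcn_propagate(adj_norm, emb)
print(out.shape)  # (5, 5)
```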

Fusion Module : fuses ID and multimodal embeddings to produce final user and item representations for top‑K retrieval.
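A hedged sketch of this fusion step: concatenation is one plausible fusion operator (the article does not specify the exact one), and items are scored by inner product for top‑K retrieval. All shapes and names are illustrative.

```python
import numpy as np

def fuse(id_emb: np.ndarray, mm_emb: np.ndarray) -> np.ndarray:
    """Fuse ID and multimodal embeddings; concatenation is an assumed choice."""
    return np.concatenate([id_emb, mm_emb], axis=-1)

def top_k(user_vec: np.ndarray, item_mat: np.ndarray, k: int = 2) -> np.ndarray:
    """Return indices of the k highest-scoring items by inner product."""
    scores = item_mat @ user_vec
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(1)
user = fuse(rng.normal(size=8), rng.normal(size=8))             # (16,)
items = fuse(rng.normal(size=(5, 8)), rng.normal(size=(5, 8)))  # (5, 16)
print(top_k(user, items, k=2))
```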

Training details include an InfoNCE loss to align content and ID embeddings, regularization terms to avoid representation collapse, and a weighted combination of recommendation loss, regularization loss, and the staged alignment losses during end‑to‑end training.
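The InfoNCE alignment term can be sketched like this: each item's content (multimodal) embedding is pulled toward its own ID embedding, with the other items in the batch serving as negatives. The temperature value and batch construction below are assumptions, not the paper's settings.

```python
import numpy as np

def info_nce(content: np.ndarray, ids: np.ndarray, tau: float = 0.1) -> float:
    """InfoNCE over a batch: diagonal pairs (same item) are positives,
    off-diagonal pairs are in-batch negatives."""
    c = content / np.linalg.norm(content, axis=1, keepdims=True)
    i = ids / np.linalg.norm(ids, axis=1, keepdims=True)
    logits = (c @ i.T) / tau                       # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))     # diagonal = matched pairs

rng = np.random.default_rng(2)
batch = rng.normal(size=(4, 16))
# Perfectly aligned pairs should give a much lower loss than random pairs.
aligned = info_nce(batch, batch)
random_pairs = info_nce(batch, rng.normal(size=(4, 16)))
print(aligned, random_pairs)
```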

To evaluate multimodal contributions, three intermediate metrics are proposed:

Zero‑Shot Recommendation : assesses whether multimodal features can reflect user interests based on historical interactions.

Item‑CF Recommendation : measures the ability of multimodal features alone to support collaborative‑filtering style item recommendation.

Mask Modality Recommendation : masks a portion of visual or textual modality to gauge each modality’s importance.
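The zero‑shot metric above can be sketched as a retrieval check using multimodal features alone, with no trained recommender. The mean‑pooling of history features and all function names here are illustrative assumptions.

```python
import numpy as np

def zero_shot_hit(history_feats: np.ndarray, target_idx: int,
                  all_feats: np.ndarray, k: int = 10) -> bool:
    """Build a user vector by mean-pooling the multimodal features of
    historically interacted items, then test whether the target item
    ranks in the top-k by cosine similarity."""
    user_vec = history_feats.mean(axis=0)
    user_vec /= np.linalg.norm(user_vec)
    feats = all_feats / np.linalg.norm(all_feats, axis=1, keepdims=True)
    ranking = np.argsort(-(feats @ user_vec))
    return target_idx in ranking[:k]

rng = np.random.default_rng(3)
catalog = rng.normal(size=(100, 32))  # one multimodal feature per item
history = catalog[[1, 2, 3]]          # a user's interacted items
print(zero_shot_hit(history, target_idx=2, all_feats=catalog, k=10))
```

Averaged over many users, the hit rate indicates how well the raw multimodal features alone reflect user interests.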

Extensive experiments on public multimodal datasets demonstrate that AlignRec achieves SOTA performance across all metrics, surpasses CLIP on intermediate evaluations, and provides superior multimodal features compared to existing open‑source baselines. Ablation studies on each module and hyper‑parameter analyses are also presented.

The authors summarize three main contributions: (1) a reusable multimodal recall paradigm validated with online A/B gains, (2) the AlignRec joint‑training approach with staged alignment and intermediate evaluation, and (3) upgraded multimodal data sources for the Amazon dataset to support future research.

Tags: AI, evaluation metrics, large-scale systems, joint training, multimodal recommendation, representation alignment
Written by

Xiaohongshu Tech REDtech

Official account of the Xiaohongshu tech team, sharing tech innovations and problem insights, advancing together.
