Advances in Information‑Flow Recommendation: Pre‑trained Models and Multimodal User‑Interface Modeling
This article reviews Huawei Noah's Ark Lab's work on modern information‑flow recommendation, covering the evolution from collaborative filtering to deep learning, the application of BERT‑based pre‑training for news ranking, multimodal user‑interface modeling, practical deployment challenges, and future research directions.
The talk, presented by Zhu Jieming from Huawei Noah's Ark Lab and edited by Zhang Aoyu (AWS), introduces the rapid development of recommendation systems from early collaborative filtering to current deep‑learning‑centric approaches, highlighting the growing difficulty of optimizing large models under inference constraints.
Huawei's Noah's Ark Lab, with sub‑labs spanning computer vision, speech, recommendation and search, decision reasoning, AI theory, and AI systems, conducts both fundamental AI research and product‑oriented technology empowerment, collaborating with partners in over ten countries and with 25 universities.
In the information‑flow recommendation scenario, Huawei applies its technology to diverse multimodal feeds such as phone home‑screen news, browser article/video waterfalls, and video‑app recommendations, emphasizing the shift toward multimodal, heterogeneous content.
The evolution of recommendation techniques is traced: early 2000s collaborative filtering, 2010s generalized linear models (FTRL, FM, BPR, RankSVM), and from 2015 onward deep learning models like YouTubeDNN, Wide&Deep, DeepFM, and DIN, with performance gains driven by larger datasets and GPU advances.
Since 2018, the pre‑training + fine‑tuning paradigm popularized by BERT in NLP and CV has been adopted to boost recommendation performance. The UNBERT model concatenates the user's click history with the candidate news into a single sequence, uses segment IDs and the [CLS] token for token‑level matching, pools the per‑news vectors, and adds a transformer layer on top for news (sentence)‑level matching; the whole model is trained on click‑through (CTR) data.
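The input construction described above can be sketched as follows (a minimal illustration, not Huawei's actual code; function and variable names are our own): history titles and the candidate title share one token sequence, with segment ID 0 marking the user side and 1 marking the candidate side.

```python
# Sketch of UNBERT-style input construction: clicked-news titles and the
# candidate title are concatenated into one sequence, with segment IDs
# distinguishing history (0) from candidate (1) so the BERT encoder can
# perform token-level matching through the [CLS] representation.

def build_unbert_input(history_titles, candidate_title, max_len=64):
    """history_titles: list of token lists for previously clicked news.
    candidate_title: token list for the candidate news.
    Returns (tokens, segment_ids), truncated to max_len."""
    tokens, segments = ["[CLS]"], [0]
    for title in history_titles:           # user side: segment 0
        tokens += title + ["[SEP]"]
        segments += [0] * (len(title) + 1)
    tokens += candidate_title + ["[SEP]"]  # news side: segment 1
    segments += [1] * (len(candidate_title) + 1)
    return tokens[:max_len], segments[:max_len]

tokens, segments = build_unbert_input(
    [["stocks", "rally"], ["rain", "expected"]],
    ["tech", "earnings", "beat"],
)
print(tokens)    # ['[CLS]', 'stocks', 'rally', '[SEP]', ...]
print(segments)  # [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```

The pooled per‑news vectors for the sentence‑level transformer layer would then be derived from the token positions belonging to each `[SEP]`‑delimited span.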
Experimental results on the Microsoft MIND dataset show that UNBERT and its improved version MINER achieve significant offline AUC gains over baselines (NRMS, NAML) and generalize better to cold‑start items. To serve online, the model size is reduced (to BERT‑mini), knowledge distillation is explored, and news embeddings are compressed to 50 dimensions.
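The distillation step mentioned above can be sketched with a standard temperature‑softened KL objective (the temperature and loss form here are generic assumptions, not the talk's exact recipe): a small student such as BERT‑mini is trained to match the teacher's output distribution.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax over a 1-D logit vector."""
    z = np.asarray(z, dtype=float) / T
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distill_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in standard knowledge distillation."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q)))) * T * T

# The loss is zero when the student reproduces the teacher exactly
# and grows as the two distributions diverge.
same = distill_loss([2.0, 0.5], [2.0, 0.5])
diff = distill_loss([2.0, 0.5], [0.5, 2.0])
print(round(same, 6), diff > same)  # → 0.0 True
```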
For practical deployment, UNBERT is combined with a DCN‑based CTR model, caching news embeddings to meet latency requirements, and dimensionality reduction is performed via a learned fully‑connected layer rather than PCA.
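A minimal sketch of this serving setup (dimensions, names, and the random weights are illustrative assumptions): news embeddings from the BERT tower are projected to a low‑dimensional vector by a learned fully‑connected layer rather than PCA, and cached by news ID so the online CTR model pays only a dictionary lookup per request.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((768, 50)) * 0.02   # learned projection (trained offline)
b = np.zeros(50)

cache = {}  # news_id -> 50-d vector, populated once per news item

def news_embedding(news_id, raw_768):
    """Return the cached 50-d vector, computing it on first access."""
    if news_id not in cache:
        cache[news_id] = raw_768 @ W + b    # FC layer stands in for PCA
    return cache[news_id]

vec = news_embedding("news_42", rng.standard_normal(768))
print(vec.shape)           # → (50,)
print("news_42" in cache)  # → True
```

Because the FC layer is trained jointly with the CTR objective, the reduced vectors keep click‑relevant information that an unsupervised projection like PCA could discard.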
Beyond text, a multimodal user‑interface modeling approach captures visual impressions of news cards (image layout, size, typography) using pretrained ResNet/CLIP features on patches and whole cards, integrating local and global impressions into the CTR model.
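The local‑plus‑global feature flow can be sketched as below. This is a toy stand‑in: the real system feeds patches and whole cards through pretrained ResNet/CLIP encoders, whereas here mean‑pooling plays the encoder's role so only the data flow is shown; the grid size and function names are assumptions.

```python
import numpy as np

def card_visual_features(card, grid=(2, 2)):
    """card: (H, W, 3) image array of a news card.
    Cuts the card into a grid of patches (local impressions), adds a
    whole-card feature (global impression), and concatenates both."""
    H, W, _ = card.shape
    gh, gw = grid
    ph, pw = H // gh, W // gw
    local = []
    for i in range(gh):
        for j in range(gw):
            patch = card[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
            local.append(patch.mean(axis=(0, 1)))  # per-patch "feature"
    global_feat = card.mean(axis=(0, 1))           # whole-card "feature"
    return np.concatenate(local + [global_feat])

card = np.random.rand(64, 96, 3)
feats = card_visual_features(card)
print(feats.shape)  # → (15,): 4 patches x 3 channels + 3 global
```

The concatenated vector is what would be fed, alongside text features, into the CTR model.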
Offline experiments on MIND demonstrate that incorporating visual card representations yields notable AUC improvements (up to several percentage points) over text‑only baselines.
Deployment challenges include limited image data coverage, engineering effort for real‑time image processing, and the need to fuse multimodal embeddings efficiently. Future work focuses on faster pre‑training and fine‑tuning pipelines, embedding‑only adaptation, and better utilization of contextual visual information.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.