Multilingual Content Understanding in UC International Feed Recommendation
This article presents a comprehensive overview of the challenges, requirements, and technical solutions for multilingual content understanding in UC's international information‑flow recommendation system, covering structured signal construction, low‑resource NLP techniques, transfer learning, quality modeling, and image‑based signal integration.
The talk motivates the need for multilingual content understanding in UC's international feed recommendation: recommendation relies on structured signals derived from both content understanding and user understanding, and this talk focuses on the content side.
Key challenges include supporting many languages; covering diverse tasks such as lexical analysis, classification, and tagging; and coping with limited annotated data — together these make it a typical low‑resource NLP problem.
To address these, the talk describes a pipeline for structured signal construction:
- basic attributes: language detection, quality control, timeliness;
- explicit interest signals: text classification, tag and keyword extraction;
- implicit interest signals: topic clustering, image clustering, and user representation via representation learning;
- specialized signals: hotspot detection, comment analysis, and image representation.
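These signal families can be sketched as a single per-item schema. The field names below are illustrative assumptions for exposition, not UC's actual data model:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ContentSignals:
    # Basic attributes
    language: str                      # detected language code, e.g. "hi"
    quality_score: float               # output of the quality pipeline
    publish_time: float                # unix timestamp, for timeliness
    # Explicit interest signals
    categories: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    # Implicit interest signals (cluster IDs from topic/image clustering)
    topic_cluster: Optional[int] = None
    image_cluster: Optional[int] = None

item = ContentSignals(language="id", quality_score=0.92, publish_time=1.7e9,
                      categories=["sports"], tags=["cricket"])
```

Downstream ranking code then consumes this one structure regardless of the item's language.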
Low‑resource lexical analysis uses the Universal Dependencies treebanks to standardize schemas across languages, enabling language‑agnostic downstream processing.
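Universal Dependencies treebanks ship in the CoNLL-U format, whose universal POS column is the same across languages — which is what makes downstream processing language-agnostic. A minimal sketch of reading that format (token ID, FORM, and UPOS are columns 1, 2, and 4 of the ten-column layout):

```python
def parse_conllu(text):
    """Extract (form, upos) pairs from a CoNLL-U block (Universal Dependencies)."""
    tokens = []
    for line in text.strip().splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token (1-2) and empty-node (1.1) lines
        tokens.append((cols[1], cols[3]))  # FORM, UPOS
    return tokens

sample = """# text = Dogs bark
1\tDogs\tdog\tNOUN\tNNS\t_\t2\tnsubj\t_\t_
2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\t_"""
print(parse_conllu(sample))  # [('Dogs', 'NOUN'), ('bark', 'VERB')]
```

Because UPOS tags like `NOUN` and `VERB` are shared across all UD treebanks, the same downstream code works whether the input was Hindi, Indonesian, or English.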
Multilingual entity recognition leverages Wikidata together with semi‑supervised data augmentation to train LSTM‑CRF models. Multilingual classification evolves from rule‑based seed methods, through transfer learning (sample transfer and feature transfer), to pre‑trained models (ELMo, BERT, GPT) fine‑tuned for specific tasks.
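One common way to bootstrap NER training data from a knowledge base like Wikidata is distant supervision: project entity labels onto raw text by longest-match dictionary lookup, then train the LSTM-CRF on the resulting silver data. A minimal sketch (the gazetteer entries and entity types here are toy examples):

```python
def distant_label(tokens, gazetteer):
    """Project entity labels onto tokens via longest-match gazetteer lookup,
    producing BIO tags usable as silver training data."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        for n in range(min(4, len(tokens) - i), 0, -1):  # longest match first
            span = " ".join(tokens[i:i + n])
            if span in gazetteer:
                etype = gazetteer[span]
                labels[i] = "B-" + etype
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + etype
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return labels

gaz = {"New Delhi": "LOC", "Virat Kohli": "PER"}  # e.g. mined from Wikidata
toks = "Virat Kohli visited New Delhi".split()
print(distant_label(toks, gaz))  # ['B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC']
```

The semi-supervised step then mixes this noisy silver data with the small amount of gold annotation the team can afford.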
Various transfer techniques are discussed, including translation‑based sample migration, MUSE word‑vector alignment, LASER sentence alignment, and BERT‑based fine‑tuning, all aimed at reducing dependence on labeled samples.
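The sentence-alignment route can be sketched as nearest-neighbor label transfer: once English and, say, Hindi sentences live in one shared embedding space (LASER-style), an unlabeled Hindi sentence inherits the label of its nearest labeled English neighbor. The vectors below are tiny stand-ins; real LASER embeddings are 1024-dimensional:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def transfer_label(query_vec, labeled):
    """Assign the label of the nearest labeled sentence in the shared
    multilingual space -- the idea behind alignment-based sample transfer."""
    best = max(labeled, key=lambda item: cosine(query_vec, item[0]))
    return best[1]

# Toy stand-ins for shared-space sentence embeddings.
labeled_en = [([0.9, 0.1, 0.0], "sports"),
              ([0.1, 0.8, 0.2], "politics")]
hindi_vec = [0.85, 0.2, 0.1]  # embedding of an unlabeled Hindi sentence
print(transfer_label(hindi_vec, labeled_en))  # sports
```

This turns a per-language annotation problem into a one-language annotation problem, which is the whole point of cross-lingual transfer.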
Semantic tagging combines expert‑driven seed collection with large‑scale supervised learning, and weak‑sample models use word‑embedding similarity to assign coarse tags without training data.
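The weak-sample idea can be sketched as follows: represent each tag by the centroid of its expert-chosen seed words in embedding space, represent the document by the centroid of its word vectors, and assign any tag whose centroid is close enough. The embeddings and threshold below are toy assumptions:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def centroid(vecs):
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def weak_tag(doc_vecs, tag_seeds, threshold=0.5):
    """Assign coarse tags by comparing the document centroid with each tag's
    seed-word centroid in embedding space -- no labeled training data needed."""
    doc = centroid(doc_vecs)
    return [tag for tag, seeds in tag_seeds.items()
            if cosine(doc, centroid(seeds)) >= threshold]

# Toy 2-dim word embeddings; real ones come from a multilingual word2vec model.
tag_seeds = {"cricket": [[1.0, 0.0], [0.9, 0.1]],
             "finance": [[0.0, 1.0], [0.1, 0.9]]}
doc_vecs = [[0.8, 0.2], [0.95, 0.05]]  # embeddings of the document's words
print(weak_tag(doc_vecs, tag_seeds))  # ['cricket']
```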
A large‑sample supervised model (ML‑DNN) based on FastText‑style word2vec averaging provides multi‑label probabilities for head‑topic coverage.
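The forward pass of such a model is simple: average the word embeddings (FastText-style), apply a linear layer, and put an independent sigmoid on each label so an item can carry several topics at once. A minimal sketch with hand-picked toy weights (a real model learns them end to end):

```python
import math

def ml_dnn_predict(word_vecs, weights, biases):
    """FastText-style forward pass: average word embeddings, then one linear
    layer with a sigmoid per label (multi-label, not a softmax over labels)."""
    avg = [sum(col) / len(word_vecs) for col in zip(*word_vecs)]
    probs = {}
    for label, w in weights.items():
        z = sum(x * wi for x, wi in zip(avg, w)) + biases[label]
        probs[label] = 1.0 / (1.0 + math.exp(-z))
    return probs

# Toy 2-dim embeddings and weights, for illustration only.
word_vecs = [[0.6, 0.1], [0.8, -0.1]]
weights = {"sports": [3.0, 0.0], "tech": [-3.0, 0.0]}
biases = {"sports": -1.0, "tech": 1.0}
probs = ml_dnn_predict(word_vecs, weights, biases)
print({k: round(v, 2) for k, v in probs.items()})
```

Because the labels are independent sigmoids, thresholding each probability separately yields the multi-label tag set for head topics.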
Content quality is ensured through a hybrid rule‑based and multi‑stage model pipeline that filters low‑quality items, merges sub‑models, and incorporates human review for ambiguous cases, all operating on language‑agnostic structured signals.
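A hybrid pipeline of that shape can be sketched as a cascade: cheap hard rules reject first, then merged sub-model scores decide, and anything in an ambiguous middle band is routed to human review. The rules, band, and merging below are illustrative assumptions:

```python
def quality_pipeline(item, model_scores, rules, review_band=(0.4, 0.6)):
    """Cascade: hard rules first, then merged sub-model scores; items whose
    merged score falls in an ambiguous band go to human review."""
    for rule in rules:                              # cheap, language-agnostic
        if rule(item):
            return "reject"
    merged = sum(model_scores) / len(model_scores)  # merge sub-model outputs
    low, high = review_band
    if merged < low:
        return "reject"
    if merged > high:
        return "pass"
    return "human_review"

# Example rules operating on structured signals, not raw language.
too_short = lambda item: len(item.get("text", "")) < 20
clickbait_title = lambda item: item.get("title", "").count("!") >= 3

item = {"title": "Local team wins final", "text": "A detailed match report ..."}
print(quality_pipeline(item, model_scores=[0.8, 0.9],
                       rules=[too_short, clickbait_title]))  # pass
```

Routing only the ambiguous band to reviewers keeps human cost bounded while the rules and models handle the clear-cut cases.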
Image information is exploited by extracting ResNet/VGG features, clustering images with K‑means, and using cluster IDs as implicit signals to improve recall for visual‑heavy formats like memes.
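The clustering step can be sketched with a minimal k-means; in production the points would be high-dimensional ResNet/VGG features (e.g. a 2048-dim pooled layer) rather than the toy 2-dim vectors used here:

```python
import math

def kmeans(points, k, iters=20):
    """Minimal k-means; the resulting cluster ID per item becomes an implicit
    recall signal for visual-heavy formats like memes."""
    centers = [list(p) for p in points[:k]]  # simple deterministic init
    assign = [0] * len(points)
    for _ in range(iters):
        # assign each point to its nearest center
        assign = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # move each center to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

# Toy 2-dim "image features" forming two obvious groups.
feats = [[0.1, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.9]]
print(kmeans(feats, k=2))  # [1, 1, 0, 0]
```

Items sharing a cluster ID can then be recalled for users who engaged with other items in that cluster, with no text understanding involved.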
Representation learning is also applied to user modeling: a dual‑tower architecture uses BERT embeddings for items and learns user embeddings from diverse behaviors (e.g., browser queries), enabling effective similarity‑based recommendations in sparse scenarios.
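At serving time a dual-tower setup reduces to a dot product between the user tower's output and each item tower's output. A minimal scoring sketch with toy embeddings (in the talk's setup the item vectors come from BERT and the user vectors are learned from behaviors such as browser queries):

```python
def recommend(user_vec, items, top_k=2):
    """Dual-tower retrieval sketch: score = dot(user embedding, item embedding),
    then return the top-k item IDs."""
    scored = sorted(items,
                    key=lambda it: -sum(u * v for u, v in zip(user_vec, it[1])))
    return [item_id for item_id, _ in scored[:top_k]]

user_vec = [0.7, 0.3]           # learned from queries, clicks, dwell time, ...
items = [("news_1", [0.9, 0.1]),
         ("news_2", [0.1, 0.9]),
         ("news_3", [0.5, 0.5])]
print(recommend(user_vec, items))  # ['news_1', 'news_3']
```

Because scoring is a dot product, item vectors can be pre-indexed for approximate nearest-neighbor search, which is what makes the approach viable in sparse, large-catalog scenarios.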
The summary concludes that multilingual content understanding requires language‑agnostic preprocessing, extensive transfer learning, and representation learning to overcome data scarcity and achieve robust recommendation performance.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.