
NLP Challenges and Tagging Solutions in Sina Weibo Feed

This article reviews the specific NLP difficulties encountered in Sina Weibo's feed—such as short text, informal language, and ambiguous user behavior—and details the multi‑stage tagging system, material library, multimodal modeling, multi‑task learning, and large‑scale pre‑training techniques used to address them.

DataFunTalk

According to 2019 statistics, Sina Weibo had 497 million monthly active users, 94% of them on mobile; at that scale, the feed stream presents several NLP challenges.

Challenges and Existing Issues

1. Posts are typically very short (under 100 characters), which makes classic topic modeling ineffective.
2. Language is informal and unstructured, complicating content analysis.
3. User search-behavior sequences are hard to capture accurately.
4. Feed interaction signals (clicks, dwell time, etc.) are noisy, limiting the effectiveness of LDA/PLSA and behavior-based models.

Tagging System

The system consists of post tags, user‑interest (profile) tags, and author tags. Post tags are divided into primary/secondary tags, entity tags, and keyword tags.

Primary/Secondary Tags: Primary tags map to broad channels (e.g., Finance, Law, IT). Secondary tags refine these channels (e.g., Finance → Investment, Stocks). They are used for channel distribution and coarse‑grained user profiling.

Entity Tags: Also called third‑level tags, sourced from manual collection, hot‑search queries, and model recognition. An entity‑recognition model combines BERT with a CRF layer, trained on annotated data.
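
At inference time, the CRF layer's job is to pick the globally best tag sequence given per-token scores from the encoder. A minimal Viterbi-decoding sketch in plain Python (the emission and transition scores below are illustrative stand-ins, not trained BERT weights):

```python
# Viterbi decoding over per-token emission scores plus tag-transition
# scores -- the decoding step a CRF layer performs on top of BERT.

def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions:   list of {tag: score} dicts, one per token (from the encoder)
    transitions: {(prev_tag, tag): score} transition scores
    tags:        list of all tag names
    """
    scores = dict(emissions[0])  # best score of any path ending in each tag
    back = []                    # backpointers: best previous tag per step
    for emit in emissions[1:]:
        new_scores, ptr = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: scores[p] + transitions.get((p, t), 0.0))
            new_scores[t] = scores[best_prev] + transitions.get((best_prev, t), 0.0) + emit[t]
            ptr[t] = best_prev
        back.append(ptr)
        scores = new_scores
    # Trace the best path backwards from the highest-scoring final tag
    last = max(tags, key=lambda t: scores[t])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

TAGS = ["B-ENT", "I-ENT", "O"]
# A strong negative score discourages I-ENT right after O (BIO constraint).
TRANS = {("O", "I-ENT"): -10.0, ("B-ENT", "I-ENT"): 1.0, ("I-ENT", "I-ENT"): 1.0}
emissions = [
    {"B-ENT": 2.0, "I-ENT": 0.1, "O": 0.5},
    {"B-ENT": 0.2, "I-ENT": 1.5, "O": 1.4},
    {"B-ENT": 0.1, "I-ENT": 0.2, "O": 2.5},
]
print(viterbi_decode(emissions, TRANS, TAGS))  # → ['B-ENT', 'I-ENT', 'O']
```

The transition scores are what distinguish a CRF from per-token softmax classification: they let the model reject locally plausible but globally invalid tag sequences.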

Keyword Tags: Extracted from noun phrases and user queries. Noun‑phrase extraction relies on dependency parsers (Stanford NLP, LTP, HanLP) and multiple tokenizers to ensure consistency. Queries must be high‑frequency, short, and filtered for boundary errors using entropy‑based methods.
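
The entropy-based boundary check can be sketched as follows: a candidate keyword whose neighboring characters on one side are nearly deterministic is probably a fragment of a longer phrase. The corpus string below is a toy example, not Weibo data:

```python
# Neighbor-entropy boundary check: low entropy on a side suggests the
# candidate does not end (or start) at a real word boundary.
import math
from collections import Counter

def boundary_entropy(candidate, corpus, side="right"):
    """Shannon entropy (bits) of the characters adjacent to `candidate`."""
    neighbors = Counter()
    start = corpus.find(candidate)
    while start != -1:
        idx = start - 1 if side == "left" else start + len(candidate)
        if 0 <= idx < len(corpus):
            neighbors[corpus[idx]] += 1
        start = corpus.find(candidate, start + 1)
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in neighbors.values())

corpus = "新浪微博热搜新浪微博推荐新浪新闻"
# "新浪微" is always followed by "博" -> zero right entropy, likely a fragment
print(boundary_entropy("新浪微", corpus, "right"))
# "新浪" is followed by varied characters -> higher entropy, plausible boundary
print(boundary_entropy("新浪", corpus, "right"))
```

Candidates whose left or right entropy falls below a tuned threshold are filtered out as boundary errors.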

Matching algorithms such as Trie, hash tables, and Aho‑Corasick efficiently match millions of keywords against posts, with tokenization validation to keep only compatible matches.
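A toy Aho‑Corasick implementation illustrating the failure‑link idea behind single‑pass multi‑keyword matching (a production system would use an optimized native implementation over millions of patterns):

```python
# Minimal Aho-Corasick automaton: a trie plus BFS-built failure links,
# matching all keywords against a text in one left-to-right pass.
from collections import deque

class AhoCorasick:
    def __init__(self, keywords):
        self.goto = [{}]   # outgoing trie edges per node
        self.fail = [0]    # failure link per node
        self.out = [[]]    # keywords ending at each node
        for word in keywords:
            node = 0
            for ch in word:
                if ch not in self.goto[node]:
                    self.goto[node][ch] = self._new_node()
                node = self.goto[node][ch]
            self.out[node].append(word)
        self._build_failure_links()

    def _new_node(self):
        self.goto.append({})
        self.fail.append(0)
        self.out.append([])
        return len(self.goto) - 1

    def _build_failure_links(self):
        queue = deque(self.goto[0].values())  # depth-1 nodes fail to the root
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                # Follow failure links until a node with an edge for ch
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit matches reachable through the failure link
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def find(self, text):
        """Return (start_index, keyword) for every match in text."""
        node, hits = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            hits += [(i - len(w) + 1, w) for w in self.out[node]]
        return hits

ac = AhoCorasick(["he", "she", "his", "hers"])
print(ac.find("ushers"))  # → [(1, 'she'), (2, 'he'), (2, 'hers')]
```

Matches found this way would then pass through the tokenization-validation step described above before a tag is kept.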

Material Library

A unified base material library is built from pre‑processed posts, supporting downstream tasks like post‑embedding tagging, segment‑recognition, and material‑rating models that combine BERT text embeddings with author and context features.

Multimodal Modeling

Short posts benefit from image information. Multimodal models, initially an LSTM text encoder paired with Inception‑ResNet‑V2 image features and later BERT‑based multimodal pre‑training, fuse text and image vectors to improve tag accuracy (e.g., distinguishing “apple” the fruit from Apple Inc.).
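A toy sketch of the fusion step: concatenate a text embedding with an image embedding and score tags with a linear layer plus softmax. Real systems use learned BERT/Inception features; the vectors and weights here are purely illustrative:

```python
# Late fusion of two modalities: [text ; image] concatenation followed
# by a linear scoring layer and a softmax over tags.
import math

def fuse_and_score(text_vec, image_vec, tag_weights):
    """Concatenate modalities; return {tag: probability} via softmax."""
    fused = text_vec + image_vec  # list concatenation, i.e. [text; image]
    logits = {tag: sum(w * x for w, x in zip(weights, fused))
              for tag, weights in tag_weights.items()}
    z = max(logits.values())  # subtract max for numerical stability
    exps = {tag: math.exp(v - z) for tag, v in logits.items()}
    total = sum(exps.values())
    return {tag: e / total for tag, e in exps.items()}

text = [0.5, 0.5]          # ambiguous "apple" text embedding (toy values)
fruit_image = [1.0, 0.0]   # image features resembling a fruit (toy values)
weights = {"fruit": [1, 0, 2, 0], "company": [0, 1, 0, 2]}
scores = fuse_and_score(text, fruit_image, weights)
print(scores)  # the image evidence tips the ambiguous text toward "fruit"
```

The point of the example is the disambiguation mechanism: when the text embedding alone is uninformative, the image half of the fused vector carries the decision.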

Multi‑Task Learning

Adding CTR and dwell‑time prediction tasks to the quality model yields modest performance gains.
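One common way to combine such auxiliary objectives is a weighted sum of per-task losses over a shared representation. The article does not specify the exact loss mix, so the weights and loss choices below are illustrative assumptions:

```python
# Hedged sketch of a multi-task objective: main quality loss plus
# weighted auxiliary CTR (binary cross-entropy) and dwell-time
# (squared-error) losses.
import math

def multi_task_loss(quality_pred, quality_label,
                    ctr_pred, click_label,
                    dwell_pred, dwell_label,
                    w_ctr=0.3, w_dwell=0.3):
    # Main task: squared error on the quality score
    l_quality = (quality_pred - quality_label) ** 2
    # Auxiliary task 1: binary cross-entropy on click-through
    l_ctr = -(click_label * math.log(ctr_pred)
              + (1 - click_label) * math.log(1 - ctr_pred))
    # Auxiliary task 2: squared error on dwell time (e.g. log-seconds)
    l_dwell = (dwell_pred - dwell_label) ** 2
    return l_quality + w_ctr * l_ctr + w_dwell * l_dwell
```

The auxiliary weights are hyperparameters: set too high, the side tasks dominate the quality objective; set near zero, the extra supervision signal is lost.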

Large‑Scale Pre‑Training

Both BERT and GPT‑2 are employed; GPT‑2 handles generation, while BERT provides contextual embeddings for similarity and tagging. T5 is also trained for unified NLP tasks. Model distillation, quantization, and TensorRT are explored to accelerate inference.
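For the distillation part, the standard recipe trains a small student to match the teacher's temperature-softened output distribution. A minimal sketch of those soft targets (logits and temperature are illustrative; real distillation also mixes in the hard-label loss):

```python
# Knowledge-distillation soft targets: cross-entropy between the
# teacher's and student's temperature-scaled softmax distributions.
import math

def softmax(logits, temperature=1.0):
    z = max(logits)  # subtract max for numerical stability
    exps = [math.exp((l - z) / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))
```

A temperature above 1 flattens the teacher distribution, exposing the relative scores of wrong classes ("dark knowledge") that a hard label would discard.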

Additional projects include machine translation, summarization, assisted writing, and spelling/grammar correction.

Conclusion

Sina Weibo’s feed NLP problems are tackled with a combination of traditional techniques (dependency parsing, keyword extraction) and modern pre‑trained models, moving toward multi‑task, multimodal, and multilingual solutions while still relying on accurate user‑behavior signals.

Tags: multimodal, NLP, tagging, pretraining, BERT, Weibo
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
