
Exploration and Practice of Text Representation Algorithms in the 58 Security Scenario

This article presents a comprehensive study of text representation techniques—from weighted word‑vector methods to supervised SimBert and unsupervised contrastive learning models—applied to large‑scale unstructured data in 58's information‑security workflows, evaluating their effectiveness for classification and content‑recall tasks.

58 Tech

This article describes the exploration and practice of text representation algorithms in 58's information-security (信安) scenario, motivated by the need to model the massive volume of unstructured text produced daily (e.g., rental, job, and social posts) for classification, clustering, and downstream semantic operations.

Background: 58's platforms generate tens of millions of text records daily. Vectorizing these texts enables fine-grained operations such as semantic classification, similarity-based recall, and user-relationship graph construction. The security team has applied various representation algorithms to address these needs.

Word-Vector-Based Text Semantic Representation: A simple yet effective method computes sentence vectors by weighting word embeddings (Word2Vec, GloVe, fastText) with BM25 scores. The workflow includes tokenization, stop-word removal, BM25 scoring, selection of top-N tokens, and aggregation using pretrained 200-dimensional Chinese word vectors (≈8 M entries). Experimental results on a porn-related recruitment dataset show slightly lower performance than a TextCNN baseline but offer easy integration for clustering tasks.
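The weighting-and-aggregation steps above can be sketched as follows. This is a minimal NumPy illustration, not the team's pipeline; the BM25 parameters, the `doc_freq` statistics, and the tiny embedding table are all assumed for the example:

```python
import math
from collections import Counter

import numpy as np

def bm25_weights(tokens, doc_freq, n_docs, avg_len, k1=1.5, b=0.75):
    """Score each distinct token of one tokenized post with BM25.

    `doc_freq` maps token -> number of posts containing it; the inputs
    here are illustrative stand-ins for corpus-level statistics.
    """
    tf = Counter(tokens)
    length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
    weights = {}
    for tok, freq in tf.items():
        df = doc_freq.get(tok, 0)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        weights[tok] = idf * freq * (k1 + 1) / (freq + length_norm)
    return weights

def sentence_vector(tokens, embeddings, weights, top_n=10):
    """Weighted average of word vectors over the top-N BM25-scored tokens."""
    ranked = sorted(set(tokens), key=lambda t: weights.get(t, 0.0), reverse=True)
    picked = [t for t in ranked if t in embeddings][:top_n]
    if not picked:  # nothing survived stop-word removal / OOV filtering
        return np.zeros(len(next(iter(embeddings.values()))))
    total = sum(weights[t] for t in picked)
    return sum(weights[t] * embeddings[t] for t in picked) / total
```

In the article's setting, `embeddings` would be the pretrained 200-dimensional Chinese vector table (≈8 M entries), and the resulting sentence vectors feed directly into clustering or similarity recall.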

Pretrained Text Semantic Representation Models: Recent BERT-family models (e.g., RoBERTa), pretrained on massive corpora, capture richer linguistic knowledge. However, sentence embeddings obtained by mean pooling occupy a non-smooth (anisotropic) space in which high-frequency and low-frequency tokens are embedded in different regions, so direct similarity calculations are unreliable. Fine-tuning with task-specific objectives is therefore essential.

1. SimBert-Based Model: Using roughly 100 k supervised similar-sentence pairs from business annotations, the authors replace the BERT backbone with RoBERTa and jointly train a Seq2Seq generation task and a similarity classification task. Experiments on recruitment-post recall show that SimBert outperforms the weighted word-vector method; however, because targets for the generation task are difficult to label in practice, a simplified SimBert variant without Seq2Seq is used, at only a slight performance cost.
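The similarity-classification side of this objective can be sketched as an in-batch loss over CLS vectors: each vector must identify its labeled partner among all other sentences in the batch. The following NumPy sketch reflects our reading of a SimBert-style retrieval objective, with an assumed scale factor, not the authors' implementation:

```python
import numpy as np

def simbert_retrieval_loss(cls_vecs, scale=30.0):
    """In-batch similarity loss over a (2N, d) batch of CLS vectors,
    where rows 2i and 2i+1 are a labeled similar pair."""
    v = cls_vecs / np.linalg.norm(cls_vecs, axis=1, keepdims=True)
    sims = scale * (v @ v.T)              # scaled cosine similarities
    np.fill_diagonal(sims, -1e9)          # a sentence may not match itself
    n = len(v)
    targets = np.arange(n) ^ 1            # partner index: 0<->1, 2<->3, ...
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(n), targets].mean()  # cross-entropy to partner
```

In training, minimizing this loss pulls each annotated pair together while pushing it away from every other post in the batch, which is what makes the learned CLS vectors usable for similarity recall.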

2. Contrastive-Learning-Based Model: To reduce reliance on costly labeled data, the authors adopt an unsupervised contrastive-learning approach (SimCSE) with dropout-based data augmentation and token-level synonym replacement. The training pipeline covers tokenization, augmentation, batch construction (32 SENT_a + 32 SENT_b), and a contrastive loss that masks the diagonal (self-similarity) entries. Results demonstrate that this unsupervised method surpasses the supervised SimBert, highlighting the benefit of harder negative sampling and semi-supervised augmentation.
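A minimal NumPy sketch of the dropout-augmented contrastive step: every sentence is "encoded" twice with independent dropout noise, the two views form a positive pair, all other in-batch sentences act as negatives, and the diagonal self-similarities are masked as described. The dropout rate and temperature below are illustrative assumptions:

```python
import numpy as np

def dropout_view(x, rate, rng):
    """Simulate the encoder's dropout: two calls yield two noisy views
    of the same sentences, which serve as positive pairs (SimCSE-style)."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def simcse_loss(view_a, view_b, temperature=0.05):
    """Contrastive loss over the stacked 2N-row batch; each row must
    retrieve the other view of its own sentence among all other rows."""
    z = np.vstack([view_a, view_b])
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sims = z @ z.T / temperature
    np.fill_diagonal(sims, -1e9)                     # mask self-similarity
    n = len(view_a)
    targets = np.concatenate([np.arange(n, 2 * n),   # row i   -> row i+N
                              np.arange(n)])         # row i+N -> row i
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(2 * n), targets].mean()
```

Because the positive pairs come from dropout noise rather than annotations, this objective needs no labeled data, which is exactly the cost advantage the article highlights over SimBert.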

Conclusion and Outlook: The study traces the evolution from weighted word vectors to supervised SimBert and finally to contrastive learning, emphasizing the importance of aligning model choice with business constraints. Future directions include exploring other pretrained models (T5, ELECTRA), applying BERT-flow techniques, and extending embeddings to graph-based user representations.

Tags: contrastive learning · information security · text representation · BERT · pretrained models · semantic embedding · SimCSE
Written by 58 Tech

Official tech channel of 58, a platform for tech innovation, sharing, and communication.