
Exploration and Practice of Text Representation Algorithms in the 58 Security Scenario

This article presents a comprehensive study of text representation techniques—from weighted word‑vector methods to supervised SimBert and unsupervised contrastive learning models—applied to large‑scale unstructured data in 58's information‑security workflows, evaluating their effectiveness for classification and content‑recall tasks.

58 Tech

This article describes the exploration and practice of text representation algorithms in 58's information-security (信安) scenario, motivated by the need to model the massive volume of unstructured text produced daily (e.g., rental, job, and social posts) for classification, clustering, and downstream semantic operations.

Background: 58's platforms generate tens of millions of text records daily. Vectorizing these texts enables fine-grained operations such as semantic classification, similarity-based recall, and user-relationship graph construction. The security team has applied various representation algorithms to address these needs.

Word-Vector-Based Text Semantic Representation: A simple yet effective method computes sentence vectors by weighting word embeddings (Word2Vec, GloVe, fastText) with BM25 scores. The workflow includes tokenization, stop-word removal, BM25 scoring, selection of top-N tokens, and aggregation using pretrained 200-dimensional Chinese word vectors (≈8 M entries). Experimental results on a porn-related recruitment dataset show slightly lower performance than a TextCNN baseline but offer easy integration for clustering tasks.
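The weighting-and-aggregation steps above can be sketched as follows. This is a minimal NumPy illustration, not the team's pipeline; the BM25 parameters, the `doc_freq` statistics, and the tiny embedding table are all assumed for the example:

```python
import math
from collections import Counter

import numpy as np

def bm25_weights(tokens, doc_freq, n_docs, avg_len, k1=1.5, b=0.75):
    """Score each distinct token of one tokenized post with BM25.

    `doc_freq` maps token -> number of posts containing it; the inputs
    here are illustrative stand-ins for corpus-level statistics.
    """
    tf = Counter(tokens)
    length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
    weights = {}
    for tok, freq in tf.items():
        df = doc_freq.get(tok, 0)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        weights[tok] = idf * freq * (k1 + 1) / (freq + length_norm)
    return weights

def sentence_vector(tokens, embeddings, weights, top_n=10):
    """Weighted average of word vectors over the top-N BM25-scored tokens."""
    ranked = sorted(set(tokens), key=lambda t: weights.get(t, 0.0), reverse=True)
    picked = [t for t in ranked if t in embeddings][:top_n]
    if not picked:  # nothing survived stop-word removal / OOV filtering
        return np.zeros(len(next(iter(embeddings.values()))))
    total = sum(weights[t] for t in picked)
    return sum(weights[t] * embeddings[t] for t in picked) / total
```

In the article's setting, `embeddings` would be the pretrained 200-dimensional Chinese vector table (≈8 M entries), and the resulting sentence vectors feed directly into clustering or similarity recall.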

Pretrained Text Semantic Representation Models: Recent BERT-family models (e.g., RoBERTa), pretrained on massive corpora, capture richer linguistic knowledge. However, sentence embeddings obtained by mean pooling occupy a non-smooth (anisotropic) space in which high-frequency and low-frequency tokens are embedded in different regions, so direct similarity calculations are unreliable. Fine-tuning with task-specific objectives is therefore essential.

1. SimBert-Based Model: Using roughly 100 k supervised similar-sentence pairs from business annotations, the authors replace the BERT backbone with RoBERTa and jointly train a Seq2Seq generation task and a similarity classification task. Experiments on recruitment-post recall show that SimBert outperforms the weighted word-vector method; however, because targets for the generation task are difficult to label in practice, a simplified SimBert variant without Seq2Seq is used, at only a slight performance cost.
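The similarity-classification side of this objective can be sketched as an in-batch loss over CLS vectors: each vector must identify its labeled partner among all other sentences in the batch. The following NumPy sketch reflects our reading of a SimBert-style retrieval objective, with an assumed scale factor, not the authors' implementation:

```python
import numpy as np

def simbert_retrieval_loss(cls_vecs, scale=30.0):
    """In-batch similarity loss over a (2N, d) batch of CLS vectors,
    where rows 2i and 2i+1 are a labeled similar pair."""
    v = cls_vecs / np.linalg.norm(cls_vecs, axis=1, keepdims=True)
    sims = scale * (v @ v.T)              # scaled cosine similarities
    np.fill_diagonal(sims, -1e9)          # a sentence may not match itself
    n = len(v)
    targets = np.arange(n) ^ 1            # partner index: 0<->1, 2<->3, ...
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(n), targets].mean()  # cross-entropy to partner
```

In training, minimizing this loss pulls each annotated pair together while pushing it away from every other post in the batch, which is what makes the learned CLS vectors usable for similarity recall.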

2. Contrastive-Learning-Based Model: To reduce reliance on costly labeled data, the authors adopt an unsupervised contrastive-learning approach (SimCSE) with dropout-based data augmentation and token-level synonym replacement. The training pipeline covers tokenization, augmentation, batch construction (32 SENT_a + 32 SENT_b), and a contrastive loss that masks the diagonal (self-similarity) entries. Results demonstrate that this unsupervised method surpasses the supervised SimBert, highlighting the benefit of harder negative sampling and semi-supervised augmentation.
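A minimal NumPy sketch of the dropout-augmented contrastive step: every sentence is "encoded" twice with independent dropout noise, the two views form a positive pair, all other in-batch sentences act as negatives, and the diagonal self-similarities are masked as described. The dropout rate and temperature below are illustrative assumptions:

```python
import numpy as np

def dropout_view(x, rate, rng):
    """Simulate the encoder's dropout: two calls yield two noisy views
    of the same sentences, which serve as positive pairs (SimCSE-style)."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def simcse_loss(view_a, view_b, temperature=0.05):
    """Contrastive loss over the stacked 2N-row batch; each row must
    retrieve the other view of its own sentence among all other rows."""
    z = np.vstack([view_a, view_b])
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sims = z @ z.T / temperature
    np.fill_diagonal(sims, -1e9)                     # mask self-similarity
    n = len(view_a)
    targets = np.concatenate([np.arange(n, 2 * n),   # row i   -> row i+N
                              np.arange(n)])         # row i+N -> row i
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(2 * n), targets].mean()
```

Because the positive pairs come from dropout noise rather than annotations, this objective needs no labeled data, which is exactly the cost advantage the article highlights over SimBert.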

Conclusion and Outlook: The study traces the evolution from weighted word vectors to supervised SimBert and finally to contrastive learning, emphasizing the importance of aligning model choice with business constraints. Future directions include exploring other pretrained models (T5, ELECTRA), applying BERT-flow techniques, and extending embeddings to graph-based user representations.

Tags: contrastive learning · information security · text representation · BERT · pretrained models · semantic embedding · SimCSE
Written by 58 Tech

Official tech channel of 58, a platform for tech innovation, sharing, and communication.