Artificial Intelligence · 8 min read

2021 Sohu Text Matching Competition: Model Design, Tricks, and Performance Analysis

This article details the authors' approach to the 2021 Sohu Text Matching competition, describing the task definition, data splits, model architectures (cross‑encoder and bi‑encoder), pretrained language models used, various training tricks, ensemble strategies, and the resulting evaluation scores.


Task Overview

The competition focused on matching text pairs of varying lengths under two evaluation criteria (a topic-based A class and an event-based B class) and three sub-tasks: short-short, short-long, and long-long matching. The training set contained 241,726 pairs, the validation set 13,825 pairs, and the test set 41,480 pairs, with notable class imbalance in the B class and strict constraints on model size (<2 GB) and inference latency (<500 ms).

#1 Competition Results

The authors achieved 1st place on the public leaderboard in the preliminary round and 2nd place in both the semi-final and final rounds, attaining an overall F1 of 0.7515 (A class 0.8032, B class 0.6999).

#2 Model Architecture

A shared pretrained language-model encoder was used for both the A and B tasks, followed by separate, simple fully-connected classifiers. Two encoder styles were explored:

Cross-encoder: the pair is packed into a single input, [CLS] WORD1 [SEP] WORD2 [SEP], allowing both intra- and inter-sentence interaction but limited by the encoder's maximum sequence length.

Bi-encoder: source and target are encoded independently as [CLS] WORD1 and [CLS] WORD2, enabling faster inference when the target side is fixed, since its representations can be precomputed.
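The two input layouts above can be sketched as plain string formatting (an illustrative sketch only; in practice a tokenizer inserts the special tokens, and these helper names are not from the authors' code):

```python
def cross_encoder_input(source, target):
    # One joint sequence: both texts attend to each other in a single
    # encoder pass, but the pair must fit the model's max length.
    return f"[CLS] {source} [SEP] {target} [SEP]"

def bi_encoder_inputs(source, target):
    # Two independent sequences: the fixed side can be encoded once
    # and cached, which speeds up inference.
    return f"[CLS] {source}", f"[CLS] {target}"
```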

#3 Pretrained Models and Techniques

Models such as NEZHA_base_wwm, RoFormer, RoBERTa-wwm-ext, MacBERT_base, and ERNIE_1.0 were employed. Two matching strategies were tried: concatenating source and target with a [SEP] token (cross-encoder style) and an SBERT-like comparison of sentence vectors. Additional tricks included:

Co‑Attention module in the bi‑encoder to enhance cross‑sentence interaction (≈2‑3% gain over SBERT).
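A minimal dot-product co-attention sketch in PyTorch, where each sentence's token states attend over the other sentence's states (hidden size and the module's exact design are assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Bidirectional attention between two token-state sequences."""
    def __init__(self, hidden=768):
        super().__init__()
        self.scale = hidden ** 0.5

    def forward(self, src, tgt):
        # src: (B, Ls, H), tgt: (B, Lt, H)
        sim = torch.bmm(src, tgt.transpose(1, 2)) / self.scale   # (B, Ls, Lt)
        # Each source token attends over target tokens, and vice versa.
        src_ctx = torch.bmm(torch.softmax(sim, dim=-1), tgt)      # (B, Ls, H)
        tgt_ctx = torch.bmm(torch.softmax(sim.transpose(1, 2), dim=-1), src)
        return src_ctx, tgt_ctx
```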

Lookahead + RAdam optimizer with weight decay and a linear warm-up learning-rate schedule to curb over-fitting.
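Lookahead and RAdam come from external optimizer libraries, but the linear warm-up schedule can be sketched as a plain multiplier on the base learning rate (the function name and the linear-decay tail are assumptions, not confirmed details of the authors' setup):

```python
def linear_warmup(step, warmup_steps, total_steps):
    """LR multiplier: ramp 0 -> 1 over warmup_steps, then decay
    linearly back to 0 by total_steps."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
```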

Data augmentation by swapping source and target and cross‑labeling between A and B classes.
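The swap-based augmentation amounts to doubling the pair list, since the matching label is symmetric (a minimal sketch; the tuple layout is an assumption):

```python
def augment_swap(pairs):
    """Double the training data by swapping source and target;
    the match/no-match label carries over unchanged."""
    return pairs + [(tgt, src, label) for src, tgt, label in pairs]
```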

Multi‑Sample Dropout: five parallel dropout layers (p=0.5) on BERT outputs before the final softmax.
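A Multi-Sample Dropout head can be sketched in PyTorch as five parallel dropout masks over the pooled encoder output, with logits averaged before the loss (hidden size and label count are assumptions):

```python
import torch
import torch.nn as nn

class MultiSampleDropoutHead(nn.Module):
    """Average classifier logits over several independent dropout masks."""
    def __init__(self, hidden=768, num_labels=2, n=5, p=0.5):
        super().__init__()
        self.dropouts = nn.ModuleList(nn.Dropout(p) for _ in range(n))
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, pooled):
        # Each branch sees a different dropout mask of the same input.
        logits = [self.classifier(d(pooled)) for d in self.dropouts]
        return torch.stack(logits).mean(dim=0)
```

At inference time dropout is a no-op, so the head collapses to a single linear classifier.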

Focal loss to address class imbalance.
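Focal loss down-weights easy examples so that the rare class contributes more to the gradient. A per-example binary sketch (the gamma/alpha defaults follow the original focal-loss paper, not values stated by the authors):

```python
import math

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """FL = -alpha_t * (1 - p_t)^gamma * log(p_t).
    p is the predicted probability of class 1; y is the true label."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))
```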

Additional MLM pre‑training on the competition corpus (mask probability 0.15).
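The continued-pretraining objective masks roughly 15% of tokens and asks the model to recover them. A simplified sketch (real BERT-style masking also sometimes substitutes random tokens or keeps the original; that 80/10/10 split is omitted here):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Replace ~mask_prob of tokens with [MASK]; return the masked
    sequence plus per-position labels (original token, or None to ignore)."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)   # model must predict the original token
        else:
            masked.append(tok)
            labels.append(None)  # position excluded from the MLM loss
    return masked, labels
```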

Stacking and ensembling: per-sub-task threshold tuning on the validation set (best cutoffs around 0.4), followed by a voting-based ensemble of multiple models.
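The two ensemble steps in the list above can be sketched in plain Python: pick the probability cutoff that maximizes validation F1 per sub-task, then hard-vote across models (a sketch under those assumptions, not the authors' code; the candidate grid is illustrative):

```python
def tune_threshold(probs, labels, candidates=None):
    """Return the cutoff from `candidates` that maximizes F1."""
    candidates = candidates or [i / 100 for i in range(20, 61)]

    def f1(th):
        preds = [int(p >= th) for p in probs]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        fn = sum((not p) and y for p, y in zip(preds, labels))
        return 2 * tp / max(1, 2 * tp + fp + fn)

    return max(candidates, key=f1)

def majority_vote(model_preds):
    """Hard voting: each row is one model's 0/1 predictions."""
    return [int(sum(col) * 2 > len(col)) for col in zip(*model_preds)]
```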

#4 Experimental Findings

The ensemble of diverse models and tricks yielded a >1% improvement over single models. Over-fitting was mitigated by Multi-Sample Dropout and stacking, while heavier techniques such as 5-fold cross-validation or multi-seed training offered minimal gains given the limited data size. Adversarial training (FGM) proved unstable in the final round.

The full code and detailed implementation are available at the provided GitHub repository.

Tags: machine learning, AI, NLP, competition, pretrained models, text matching, ensemble learning
Written by

Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
