A Survey of Text Classification and Intent Recognition: Industrial and Research Perspectives
This article reviews recent developments in text classification and intent recognition, comparing industrial practices such as business‑coupled feature engineering with research trends like pretrained language models, and provides references and practical insights for building effective NLP solutions.
Background
To update my technical approach, I surveyed recent advances in text classification and intent recognition, topics that are closely related yet rarely covered together in depth in either industry or academia.
Industrial Situation
Strong Business Coupling
In industry, intent recognition is tightly linked to business features; models must incorporate signals such as click‑through rates, query length, and other domain‑specific attributes, as demonstrated in Meituan and Tencent search systems.
These external features are fused with semantic models through Wide & Deep-style architectures, blending business signals with language understanding.
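As a concrete illustration of this fusion, here is a minimal numpy sketch of a Wide & Deep-style scorer: the "wide" part is a linear model over hand-crafted business signals (e.g., click-through rate, query length), and the "deep" part is a small MLP over a dense semantic embedding of the query. All feature names, dimensions, and weights are invented for illustration, not taken from any of the cited systems.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def wide_and_deep(business_feats, query_embedding, params):
    """Score = sigmoid(wide linear term + deep MLP term)."""
    wide = business_feats @ params["w_wide"]               # linear over business signals
    hidden = np.tanh(query_embedding @ params["w_deep1"])  # one hidden layer for the sketch
    deep = hidden @ params["w_deep2"]
    return sigmoid(wide + deep + params["bias"])

params = {
    "w_wide": rng.normal(size=3),        # 3 toy business features: CTR, query length, freshness
    "w_deep1": rng.normal(size=(8, 4)),  # 8-dim semantic embedding -> 4 hidden units
    "w_deep2": rng.normal(size=4),
    "bias": 0.0,
}

score = wide_and_deep(
    business_feats=np.array([0.12, 5.0, 0.3]),  # toy values
    query_embedding=rng.normal(size=8),
    params=params,
)
print(round(float(score), 3))  # a probability in (0, 1)
```

The appeal of this split in practice is modularity: the wide side can be retrained cheaply as business features drift, while the semantic side evolves on its own schedule.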
Large-scale models that fuse multiple representations, like the KDD21 Taobao vector search system, illustrate a heavyweight approach that assembles diverse features into one big model, though such solutions may be unnecessary for upstream intent tasks.
Semantic Understanding
Semantic models provide robust, generalizable understanding of user queries, handling misspellings and colloquial language, and can be modularly integrated with business rules for flexible engineering.
Despite their power, pretrained models such as BERT are not universally adopted in intent recognition due to cost‑benefit considerations; often simpler models or rule‑based methods suffice for many upstream tasks.
Search‑as‑Classification
The “search‑as‑classification” idea treats intent detection as a lookup problem, e.g., matching queries against a dictionary, which works well for sparse or rapidly changing categories.
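The lookup idea above can be sketched in a few lines. The intent lexicon, phrases, and queries below are invented for illustration; a production system would typically use a trie or inverted index rather than a linear scan.

```python
# "Search-as-classification": intent detection as dictionary lookup.
INTENT_LEXICON = {
    "weather": "weather_query",
    "forecast": "weather_query",
    "movie tickets": "ticket_booking",
    "train tickets": "ticket_booking",
}

def lookup_intent(query, lexicon=INTENT_LEXICON, default="other"):
    """Return the intent of the longest lexicon phrase found in the query."""
    q = query.lower()
    hits = [phrase for phrase in lexicon if phrase in q]
    if not hits:
        return default
    return lexicon[max(hits, key=len)]  # prefer the most specific match

print(lookup_intent("book movie tickets for tonight"))  # ticket_booking
print(lookup_intent("what's the weather forecast"))     # weather_query
print(lookup_intent("hello there"))                     # other
```

Because the "model" is just a dictionary, adding or retiring a category is an entry edit rather than a retraining job, which is exactly why this works well for sparse or rapidly changing categories.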
Research Situation
Overview
Recent surveys (e.g., the 2020 review "From Shallow to Deep Learning") trace the shift from shallow, feature-engineered models to deep architectures such as CNNs, RNNs, and attention-based approaches, noting that pretrained models dominate benchmark leaderboards but may not reflect real-world constraints.
Pretrained Model Dominance
Large pretrained models achieve state‑of‑the‑art results on standard datasets, yet their superiority can be dataset‑dependent; smaller models like TextCNN may outperform them on domain‑specific data.
Relying solely on benchmark performance can lead to suboptimal technology choices, emphasizing the need for task‑oriented data collection.
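For reference, a TextCNN forward pass (in the style of Kim's 2014 architecture) is small enough to sketch directly: convolve filters of several widths over token embeddings, max-pool each feature map over time, and classify the concatenated features. All sizes, weights, and token IDs below are toy values, not from any cited system.

```python
import numpy as np

rng = np.random.default_rng(1)

emb_dim, n_classes = 16, 3
vocab = rng.normal(size=(100, emb_dim))           # toy embedding table

def textcnn_forward(token_ids, filters, w_out):
    x = vocab[token_ids]                          # (seq_len, emb_dim)
    pooled = []
    for width, w in filters:                      # w: (width * emb_dim, n_maps)
        windows = np.stack([
            x[i:i + width].ravel() for i in range(len(x) - width + 1)
        ])                                        # (n_windows, width * emb_dim)
        fmap = np.maximum(windows @ w, 0.0)       # ReLU feature maps
        pooled.append(fmap.max(axis=0))           # max-over-time pooling
    feats = np.concatenate(pooled)
    return feats @ w_out                          # class logits

filters = [(width, rng.normal(size=(width * emb_dim, 4))) for width in (2, 3, 4)]
w_out = rng.normal(size=(12, n_classes))          # 3 widths * 4 maps = 12 features

logits = textcnn_forward(rng.integers(0, 100, size=10), filters, w_out)
print(logits.shape)  # (3,)
```

The entire parameter count here is a few thousand weights, which is why such a model can be trained quickly on domain-specific data and sometimes beats a large pretrained model there.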
Other Text Classification Research
Studies explore attention-focused CNNs, gating mechanisms that incorporate side information such as statistical features, and training techniques such as R-Drop and adversarial training (FGM/PGD) that boost performance without heavyweight pretrained models.
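To make the FGM recipe concrete, here is a sketch on a toy logistic-regression "model" standing in for a classifier's embedding layer; the perturbation follows the FGM rule x_adv = x + eps * grad / ||grad||, and training would then use both the clean and adversarial losses. The model, data, and epsilon are toy assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_input_grad(x, y, w):
    """Logistic loss and its gradient with respect to the input x."""
    p = sigmoid(x @ w)
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    grad_x = (p - y) * w  # dL/dx for logistic loss
    return loss, grad_x

def fgm_perturb(x, grad_x, eps=0.1):
    """FGM: step along the normalized input gradient."""
    norm = np.linalg.norm(grad_x)
    if norm == 0:
        return x
    return x + eps * grad_x / norm

rng = np.random.default_rng(2)
w = rng.normal(size=8)
x, y = rng.normal(size=8), 1.0

clean_loss, g = loss_and_input_grad(x, y, w)
x_adv = fgm_perturb(x, g)
adv_loss, _ = loss_and_input_grad(x_adv, y, w)
# A full training step would backprop on clean_loss + adv_loss.
print(adv_loss >= clean_loss)  # True
```

In text models the perturbation is applied to embeddings rather than raw tokens, which is why FGM slots into a classifier as a cheap add-on regularizer.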
Supplement
Additional high‑quality tutorials and code repositories for text classification are listed in the references.
Summary
Semantic understanding remains essential but should be made more universal and stable.
Business coupling can be achieved through feature engineering and rule‑based methods, not solely via deep models.
Pretrained models are not always necessary for intent recognition; downstream improvements often yield higher impact.
Current research trends focus on scenario‑specific challenges and specialized datasets.
Dataset design is a critical research direction for advancing text classification.
References
[1] Tencent Tech: Understanding Search Queries
[2] DNN+GBDT Query Category Prediction Fusion Model
[3] Daguan Data: User Search Intent Recognition
[4] Meituan Search: Query Understanding
[5] A Survey on Text Classification: From Shallow to Deep Learning
[6-8] Various Chinese blog translations of the survey
[9] Lite Transformer with Long-Short Range Attention
[10] 2021 AAAI Text Classification Papers
[11] ACT: An Attentive Convolutional Transformer for Efficient Text Classification
[12] Merging Statistical Feature via Adaptive Gate for Improved Text Classification
[13] Task-Aware Representation of Sentences for Generic Text Classification
[14] How to Fine-Tune BERT for Text Classification?
[15-17] Code repositories and articles on Chinese text classification
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.