A Survey of Text Data Augmentation Techniques in Natural Language Processing
This article systematically reviews recent developments in text data augmentation for natural language processing, covering common scenarios such as low‑resource learning and imbalanced classification, and detailing five major techniques: back‑translation, EDA, TF‑IDF‑based replacement, contextual augmentation, and language‑model‑based methods, together with experimental results and future directions.
The article, originally from the internal technical salon of Entropy‑Simple Technology's NLP team, provides a comprehensive review of text data augmentation methods that have emerged in recent years and discusses why they are essential for modern NLP tasks.
Why study text augmentation? It is valuable in low‑resource (few‑sample) settings, for handling class‑imbalance in classification tasks, in semi‑supervised learning, and for improving model robustness by exposing models to varied expressions of the same semantics.
Typical Techniques
1. Back‑Translation
Leveraging the rapid progress of machine translation, a source sentence is translated into an intermediate language and then back to the original language, producing a paraphrased version that retains the original meaning.
Original (Chinese): 文本数据增强技术在自然语言处理中属于基础性技术 ("Text data augmentation is a fundamental technique in natural language processing");
Japanese (pivot): テキストデータ拡張技術は、自然言語処理の基本的な技術です;
English (re‑translation): Text data extension technology is a basic technology of natural language processing;
Back to Chinese: 文本数据扩展技术是自然语言处理的基本技术 ("Text data expansion technology is a basic technique of natural language processing").
The round trip preserves the meaning while varying the wording (增强 "augmentation" becomes 扩展 "expansion"). Experiments show that back‑translation can improve BLEU scores for NMT models by ~1.7 and boost QA model performance by over 1% when used as data augmentation.
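The pivot‑and‑return pipeline above can be sketched as a function that composes two translation steps. The "translators" below are toy lookup tables standing in for a real MT system or translation API (any such system would do); the function itself is the reusable part.

```python
# Minimal back-translation sketch. ZH_TO_EN / EN_TO_ZH are toy phrase tables
# standing in for a real MT model; in practice each would be an NMT system.

def back_translate(text, forward, backward):
    """Translate into a pivot language, then back, to paraphrase `text`."""
    pivot = forward(text)
    return backward(pivot)

# Toy "translators" keyed on whole sentences, for illustration only.
ZH_TO_EN = {"文本数据增强技术是基础性技术": "Text data augmentation is a basic technique"}
EN_TO_ZH = {"Text data augmentation is a basic technique": "文本数据扩展技术是基本技术"}

paraphrase = back_translate(
    "文本数据增强技术是基础性技术",
    forward=ZH_TO_EN.get,
    backward=EN_TO_ZH.get,
)
print(paraphrase)  # a paraphrase that differs in wording but keeps the meaning
```

With a real MT system, the same sentence can yield several distinct paraphrases by varying the pivot language (Japanese, English, German, etc.), which is how back‑translation multiplies a small dataset.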
2. Easy Data Augmentation (EDA)
The Easy Data Augmentation (EDA) framework consists of four operations: synonym replacement (SR), random insertion (RI), random swap (RS), and random deletion (RD). These operations are analogous to image augmentations such as cropping and flipping.
Example transformations on the Chinese sentence "今天天气很好。" ("The weather is very nice today."):
SR: 今天天气不错。(synonym replacement: 好 "nice" → 不错 "not bad")
RI: 今天不错天气很好。(random insertion of 不错)
RS: 今天很好天气。(random swap of 天气 and 很好)
RD: 今天天气好。(random deletion of 很)
Empirical results on five public classification datasets show that EDA raises average accuracy by ~0.8% with the full training sets and by up to ~3% when only 500 training examples are available.
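The four EDA operations can be sketched compactly on a token list. This is a minimal illustration, not the reference implementation: the `synonyms` table is a toy thesaurus, where a real pipeline would use WordNet (English) or a Chinese synonym dictionary.

```python
import random

# Sketch of the four EDA operations (SR, RI, RS, RD) on tokenized text.
# `synonyms` maps a word to a list of replacements (toy thesaurus).

def synonym_replacement(tokens, synonyms, n=1):
    out = tokens[:]
    candidates = [i for i, t in enumerate(out) if t in synonyms]
    for i in random.sample(candidates, min(n, len(candidates))):
        out[i] = random.choice(synonyms[out[i]])
    return out

def random_insertion(tokens, synonyms, n=1):
    out = tokens[:]
    for _ in range(n):
        candidates = [t for t in out if t in synonyms]
        if not candidates:
            break
        word = random.choice(synonyms[random.choice(candidates)])
        out.insert(random.randrange(len(out) + 1), word)
    return out

def random_swap(tokens, n=1):
    out = tokens[:]
    for _ in range(n):
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(tokens, p=0.1):
    kept = [t for t in tokens if random.random() > p]
    return kept or [random.choice(tokens)]  # never return an empty sentence

random.seed(0)
sent = ["today", "the", "weather", "is", "good"]
syn = {"good": ["nice", "fine"]}
print(synonym_replacement(sent, syn))
```

Note the guard in `random_deletion`: with an aggressive deletion probability the sentence could vanish entirely, so one surviving token is always kept, as in the EDA paper's reference code.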
3. Non‑Core Word Replacement (TF‑IDF based)
Words are weighted by TF‑IDF; low‑importance words are replaced with synonyms, reducing the risk of altering key semantics. This idea first appeared in the UDA paper and was later formalized as a separate augmentation step.
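A minimal sketch of this idea, assuming a plain TF‑IDF computed from scratch over a tiny corpus (a real system would fit the statistics on the full training set): score each token, then replace only the lowest‑scoring positions. The synonym table is again a toy stand‑in.

```python
import math
from collections import Counter

# Sketch of TF-IDF-weighted "non-core" word replacement: score each token by
# TF-IDF and replace only the least informative ones, protecting key semantics.

def idf(corpus):
    """Inverse document frequency over a list of tokenized documents."""
    n = len(corpus)
    df = Counter(t for doc in corpus for t in set(doc))
    return {t: math.log(n / df[t]) for t in df}

def replace_non_core(tokens, idf_table, synonyms, k=1):
    tf = Counter(tokens)
    scores = {i: tf[t] * idf_table.get(t, 0.0) for i, t in enumerate(tokens)}
    out = tokens[:]
    # replace the k lowest-scoring positions that have a synonym available
    for i in sorted(scores, key=scores.get):
        if k == 0:
            break
        if out[i] in synonyms:
            out[i] = synonyms[out[i]][0]
            k -= 1
    return out

corpus = [["the", "model", "learns"], ["the", "data", "grows"], ["augmentation", "helps"]]
table = idf(corpus)
print(replace_non_core(["the", "model", "learns"], table, {"the": ["a"]}))
# "the" appears in 2 of 3 documents, so it has the lowest IDF and is swapped first
```

Because content words like "model" carry high TF‑IDF weight, they are touched last, which is exactly the property that makes this safer than uniform random replacement.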
4. Contextual Augmentation (C‑BERT)
Using a pretrained language model (e.g., BERT), a token is masked and the model predicts top‑k alternatives, generating multiple plausible variants while preserving the original label. Experiments with CNN, RNN, and Transformer classifiers demonstrate average gains of ~2 %.
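The mask‑and‑predict loop can be sketched with a pluggable predictor. Here `toy_predictor` is a stub standing in for a real masked language model's fill‑mask head (e.g., BERT via a fill‑mask pipeline); only the surrounding augmentation logic is the point.

```python
# Sketch of contextual (masked-LM) augmentation: mask one token, ask the LM
# for top-k fillers, and emit one variant sentence per candidate.

MASK = "[MASK]"

def contextual_augment(tokens, position, predict_top_k, k=3):
    """Mask the token at `position` and return k label-preserving variants."""
    masked = tokens[:position] + [MASK] + tokens[position + 1:]
    variants = []
    for candidate in predict_top_k(masked, k):
        variants.append(tokens[:position] + [candidate] + tokens[position + 1:])
    return variants

# Toy predictor standing in for a real masked LM; it ignores context.
def toy_predictor(masked_tokens, k):
    return ["good", "nice", "great"][:k]

for v in contextual_augment(["the", "weather", "is", "fine"], 3, toy_predictor, k=2):
    print(" ".join(v))
# the weather is good
# the weather is nice
```

A label‑conditioned variant (as in C‑BERT) would additionally feed the class label into the predictor so that, for example, a negative review is never "augmented" into a positive one.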
5. Language‑Model‑Based Augmentation (LAMBADA)
Large pretrained models such as GPT/GPT‑2 are fine‑tuned on a small target dataset, then used to generate new sentences. Filtering ensures the synthetic data follows the same distribution as the original. LAMBADA outperforms EDA, C‑BERT, and other baselines, achieving up to 50 % relative improvement on the ATIS dataset.
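The generate‑then‑filter step can be sketched as follows. Both stubs are hypothetical: `toy_confidence` stands in for the task classifier LAMBADA uses to score synthetic sentences, and the `generated` list stands in for the output of a fine‑tuned generator.

```python
# Sketch of LAMBADA's filtering step: keep only synthetic sentences that the
# task classifier confidently assigns to the intended label.

def lambada_filter(candidates, label, classifier_confidence, threshold=0.9):
    """Return the candidate sentences whose confidence for `label` passes the bar."""
    return [s for s in candidates if classifier_confidence(s, label) >= threshold]

# Toy confidence function: pretend sentences containing "flight" belong to an
# ATIS-style "book_flight" intent with high confidence.
def toy_confidence(sentence, label):
    return 0.95 if (label == "book_flight" and "flight" in sentence) else 0.3

generated = ["book a flight to boston", "what is the weather", "cheapest flight tomorrow"]
kept = lambada_filter(generated, "book_flight", toy_confidence)
print(kept)  # only the two flight-related sentences survive
```

The filter is what keeps the synthetic data on‑distribution: generation alone can drift off‑label, and discarding low‑confidence samples is the cheap correction.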
Future Directions
Beyond the five surveyed methods, the article mentions emerging research on controlled text style transfer and prototype‑editing models, which could serve as more flexible augmentation pipelines.
Practical Considerations for the Financial Domain
The authors plan to discuss the real‑world impact of these techniques in a live session on March 25, focusing on finance‑asset‑management applications, and invite readers to register via the QR code.
Conclusion
Text data augmentation is a fundamental, cost‑effective technique that consistently improves model performance, especially in few‑shot scenarios. It can be viewed as a regularizer, a form of transfer learning, a robustness enhancer, and a way to explore the underlying data manifold.
References to the original papers are listed at the end of the article.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.