
Applying UDA Semi‑Supervised Learning to Financial Text Classification: Experiments and Insights

This article investigates the practical performance of Google’s 2019 Unsupervised Data Augmentation (UDA) framework on real‑world financial text classification tasks, detailing experiments with limited labeled data, out‑of‑domain samples, noisy labels, and comparisons between BERT and lightweight TextCNN models.


The article introduces the challenge of scarce labeled data in vertical domains such as finance and proposes semi‑supervised learning, specifically the UDA (Unsupervised Data Augmentation) algorithm, as a promising solution.

It outlines three main contributions: (1) evaluating UDA on a financial text classification case study, (2) testing UDA on lightweight models, and (3) extending the original UDA study with experiments on out‑of‑domain data and mislabeled samples.

Background sections explain why financial NLP tasks suffer from limited annotations and high labeling costs, motivating the use of a small amount of expert‑labeled data combined with large amounts of unlabeled data.

The UDA technique is described: a supervised cross‑entropy loss on labeled data plus an unsupervised consistency loss that enforces consistent predictions between an unlabeled example and its augmented version, measured by KL divergence. The consistency assumption and the algorithm’s model‑agnostic nature are highlighted.
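The two‑term objective above can be sketched numerically. This is a minimal illustration, not the authors’ implementation: `uda_loss`, its arguments, and the weighting factor `lam` are names chosen here for clarity, and the sketch uses plain NumPy rather than a deep‑learning framework.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def uda_loss(labeled_logits, labels, unlabeled_logits, augmented_logits, lam=1.0):
    """Combined UDA objective: supervised cross-entropy on labeled data
    plus a KL-divergence consistency term on unlabeled/augmented pairs."""
    # Supervised term: standard cross-entropy against the gold labels.
    p = softmax(labeled_logits)
    ce = -np.log(p[np.arange(len(labels)), labels]).mean()
    # Consistency term: KL(p_orig || p_aug). In UDA the prediction on the
    # original unlabeled text is treated as a fixed target (no gradient).
    q = softmax(unlabeled_logits)    # predictions on raw unlabeled text
    r = softmax(augmented_logits)    # predictions on its augmented version
    kl = (q * (np.log(q) - np.log(r))).sum(axis=-1).mean()
    return ce + lam * kl
```

When the model predicts identically on an unlabeled example and its augmented version, the KL term vanishes and only the supervised loss remains; any disagreement adds a penalty scaled by `lam`.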

Experiments on public benchmarks (IMDb, Yelp, Amazon, DBpedia) show that UDA can approach or surpass SOTA performance with very few labeled examples, though gains diminish as more labeled data become available.

Real‑world experiments at Entropy‑Simple Technology focus on a six‑class financial short‑text classification task. Two models—BERT_base (pre‑trained on 10M finance documents) and TextCNN—are trained with and without UDA under varying amounts of labeled data, data‑augmentation strategies (EDA, back‑translation, TF‑IDF word replacement), and noisy‑label conditions.
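Of the augmentation strategies listed, EDA (Easy Data Augmentation) is the simplest: it applies token‑level edits such as synonym replacement, random insertion, random swap, and random deletion. A minimal sketch of two of those operations, assuming whitespace tokenization; the function names and parameters are illustrative, not from the article:

```python
import random

def random_swap(tokens, n_swaps=1, rng=None):
    # Swap two randomly chosen token positions, n_swaps times.
    rng = rng or random.Random()
    tokens = tokens[:]
    for _ in range(n_swaps):
        i = rng.randrange(len(tokens))
        j = rng.randrange(len(tokens))
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1, rng=None):
    # Drop each token with probability p, keeping at least one token.
    rng = rng or random.Random()
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else [rng.choice(tokens)]

def eda_augment(sentence, p_del=0.1, seed=0):
    """Produce one augmented variant of a sentence via swap + deletion."""
    rng = random.Random(seed)
    tokens = sentence.split()
    tokens = random_swap(tokens, n_swaps=1, rng=rng)
    tokens = random_deletion(tokens, p=p_del, rng=rng)
    return " ".join(tokens)
```

Each call yields a label‑preserving perturbation of the input sentence, which is exactly the kind of augmented example the UDA consistency loss compares against the original.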

Key findings include: UDA effectively leverages unlabeled data to achieve performance comparable to doubling the labeled set; the benefit is larger with fewer labels; lightweight TextCNN benefits almost as much as BERT; data‑augmentation choice strongly influences results, with EDA outperforming other methods; and UDA mitigates the impact of moderate label noise.

The article concludes that while UDA adds valuable signal from unlabeled data—especially in low‑resource settings—it does not fully replace the need for more labeled data or complementary techniques such as transfer learning, and that improving augmentation methods remains an important research direction.

Tags: data augmentation, BERT, semi-supervised learning, text classification, TextCNN, financial NLP, UDA
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
