
Practical Applications of Pretrained Language Models (BERT, GPT, ELMo) in NetEase Yanxuan NLP Tasks

The article reviews the principles of popular pretrained language models, compares their architectures, and details how NetEase Yanxuan applied BERT, GPT and ELMo to classification, matching, sequence labeling and generation tasks, presenting experimental results and deployment insights.

DataFunTalk

Since the release of BERT, pre‑training has become one of the most active directions in NLP. This article introduces the basic principles and usage of three common pretrained language models (ELMo, GPT, BERT) and reports their practical deployment in NetEase Yanxuan's NLP services, covering classification, text matching, sequence labeling and text generation.

Model structures

We selected three representative language models—ELMo, GPT and BERT—and compared them as shown in the table below.

| Language Model | BERT | GPT | ELMo |
|---|---|---|---|
| Model architecture | Transformer encoder | Transformer decoder | Bi‑LSTM |
| Pre‑training tasks | Masked LM & next sentence prediction | Standard language model (predict the next token) | Bidirectional language model (forward and backward prediction) |
| Recommended usage | Fine‑tuning | Fine‑tuning | Feature ensemble |
| Pros / cons | Bidirectional context, strong representation | Unidirectional context only | Weaker LSTM feature extraction, slower training |

The Transformer, introduced in the 2017 paper "Attention Is All You Need", replaces RNN/CNN with multi‑head self‑attention and achieves superior performance in machine translation and other tasks. Its scaled dot‑product attention maps a query against a set of (key, value) pairs in four steps: computing a similarity score between the query and each key, scaling the scores by √d_k, normalising them with a softmax, and taking the softmax‑weighted sum of the values.
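The four steps can be sketched in a few lines of NumPy (the shapes and numbers below are toy values for illustration only):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention in four steps: query-key similarity scores, scaling
    by sqrt(d_k), softmax normalisation, weighted sum of the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # similarity + scaling
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax
    return weights @ V                            # weighted sum of values

# Toy example: 2 queries attending over 3 (key, value) pairs.
Q = np.array([[1.0, 0.0], [0.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[1.0], [2.0], [3.0]])
out = scaled_dot_product_attention(Q, K, V)
```

Each output row is a convex combination of the value rows, so every entry lies strictly between the smallest and largest value.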

Usage modes

When applying a pretrained language model to a new NLP task, two common patterns are used:

Feature ensemble – obtain token embeddings from the pretrained model and feed them into a downstream model.

Fine‑tuning – keep the same network architecture as pre‑training and continue training on a small labelled dataset.

Empirical studies suggest that for ELMo, feature ensemble usually outperforms fine‑tuning, while for BERT, fine‑tuning is superior on sentence‑pair tasks such as matching.
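The difference between the two patterns can be sketched with a toy linear encoder standing in for the pretrained model (all weights below are random and purely illustrative; a real system would use ELMo or BERT here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained encoder: a fixed non-linear map.
W_pre = rng.normal(size=(8, 4))

def encode(x, W):
    return np.tanh(x @ W)

x = rng.normal(size=(2, 8))

# Pattern 1 -- feature ensemble: the encoder is frozen; its outputs
# are treated as fixed features for a separate downstream model.
features = encode(x, W_pre)           # no gradients flow into W_pre
W_head = rng.normal(size=(4, 3))      # downstream model, trained alone
logits_a = features @ W_head

# Pattern 2 -- fine-tuning: the encoder weights are copied and keep
# training together with the task head on the labelled data.
W_tuned = W_pre.copy()                # would also receive gradients
logits_b = encode(x, W_tuned) @ W_head
```

Before any task training the two patterns produce identical outputs; they diverge once gradients start updating `W_tuned` but not `W_pre`.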

Feature representation

Two strategies are common for extracting features from a pretrained model: (1) use only the top‑layer outputs, or (2) take a weighted combination of multiple layers. For BERT, the second‑to‑last layer yields the best sentence‑level similarity features, likely because the top layer sits too close to the pre‑training objectives.
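A minimal sketch of the two strategies over a fake stack of hidden states (with a real BERT this stack would come from requesting all hidden layers; shapes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Fake stack of hidden states: (num_layers, seq_len, hidden).
num_layers, seq_len, hidden = 12, 5, 16
layers = rng.normal(size=(num_layers, seq_len, hidden))

# Strategy 1: top layer only.
top = layers[-1]

# Strategy 2: weighted combination of all layers (weights would be
# learned; uniform here for illustration).
w = np.ones(num_layers) / num_layers
mixed = np.tensordot(w, layers, axes=1)   # sum_k w[k] * layers[k]

# Sentence embedding from the second-to-last layer, mean-pooled over
# tokens -- the variant the article found best for similarity.
sent_vec = layers[-2].mean(axis=0)
```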

Practical experiments

1. Text Classification

| Model | Data Size | Test F1 |
|---|---|---|
| ABL (attention Bi‑LSTM) | 150k | 0.9743 |
| BERT | 5k | 0.9612 |
| BERT | 20k | 0.9714 |
| BERT | 150k | 0.9745 |

The results show that BERT brings only a marginal improvement at full data size, because shallow semantic features are often sufficient for classification; its real advantage is data efficiency, reaching comparable F1 with far fewer labelled examples.

2. Text Matching

| Method | Precision | Recall | F1 | Latency per query |
|---|---|---|---|---|
| Siamese‑LSTM | 0.98 | 0.75 | 0.85 | <30 ms |
| BERT | 0.96 | 0.97 | 0.97 | >50 ms |

BERT outperforms the Siamese network on recall and F1, likely because its pre‑training includes next‑sentence prediction, which captures inter‑sentence relations; the trade‑off is higher per‑query latency.
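The architectural difference behind these numbers can be sketched as follows (the vectors and token strings are toy stand‑ins, not real encoder outputs):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Siamese style: each sentence is encoded independently, then compared
# with a distance metric. Embeddings can be precomputed, hence the low
# latency, but the sentences never interact at the token level.
a = np.array([1.0, 2.0, 0.0])   # embedding of sentence 1 (toy)
b = np.array([1.0, 2.0, 0.1])   # embedding of sentence 2 (toy)
siamese_score = cosine(a, b)

# BERT style: both sentences are fed jointly as one sequence and a
# classifier reads the [CLS] vector. Every token of one sentence can
# attend to every token of the other -- hence the recall gain -- but
# nothing can be precomputed, hence the higher latency.
pair_input = "[CLS] " + "sentence one tokens" + " [SEP] " + "sentence two tokens" + " [SEP]"
```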

3. Sequence Labeling (NER)

| Method | Precision | Recall | F1 | Latency per query |
|---|---|---|---|---|
| Feature ensemble (Bi‑LSTM + CRF) | 0.9686 | 0.8813 | 0.9220 | >100 ms |
| Fine‑tuning (multi‑layer fusion) | 0.9361 | 0.8801 | 0.9072 | <10 ms |
| Fine‑tuning (high‑layer only) | 0.9356 | 0.8368 | 0.8824 | <10 ms |

The feature ensemble yields the highest F1 but at over 100 ms per query; fine‑tuning trades about 1.5 points of F1 for sub‑10 ms latency and is therefore the better fit for online services.
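The CRF decoding step in the Bi‑LSTM + CRF ensemble can be illustrated with a minimal Viterbi decoder (the emission and transition scores below are toy values and the tag set is illustrative):

```python
import numpy as np

def viterbi(emissions, transitions):
    """Best tag path under a linear-chain CRF: per-token emission
    scores plus transition scores between adjacent tags."""
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cand[i, j]: best score ending in tag i then moving to tag j.
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: 3 tokens, tags {0: O, 1: B-ENT, 2: I-ENT}; the
# transition matrix forbids the invalid O -> I-ENT move.
em = np.array([[0.1, 2.0, 0.0],
               [0.2, 0.0, 1.5],
               [2.0, 0.1, 0.1]])
tr = np.array([[0.0, 0.0, -9.0],
               [0.0, 0.0,  1.0],
               [0.5, 0.0,  0.0]])
tags = viterbi(em, tr)   # → [1, 2, 0], i.e. B-ENT I-ENT O
```

The transition scores are what distinguish a CRF from per‑token softmax decoding: they let the model rule out invalid tag sequences globally.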

4. Generative Tasks

| Model | Craftsmanship | Style |
|---|---|---|
| Target (ground truth) | 先染后纺，色牢度高 (dyed before spinning, high colour fastness) | 经典格纹，帅气立领 (classic check, smart stand collar) |
| BERT generator | 针织工艺，精致细腻 (knitted craftsmanship, fine and delicate) | 经典版型，时尚百搭 (classic cut, fashionable and versatile) |
| GPT‑2 | 100%长绒棉，严格品控一家人满意 (100% long‑staple cotton, strict quality control to satisfy the whole family) | 学院风格，日系简约 (collegiate style, Japanese minimalism) |

In the Yanxuan scenario, BERT is used as the encoder of a seq2seq model for product copywriting, while GPT‑2 serves as a pure autoregressive generator.
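How a GPT‑style generator produces copy token by token can be sketched with greedy decoding over a toy bigram table standing in for GPT‑2 (the vocabulary and probabilities are invented purely for illustration):

```python
# Each step conditions on what has been emitted so far and picks the
# most probable next token; real systems usually use beam search or
# sampling instead of pure greedy decoding.
bigram = {
    "<s>":     {"classic": 0.6, "knit": 0.4},
    "classic": {"plaid": 0.7, "collar": 0.3},
    "knit":    {"craft": 1.0},
    "plaid":   {"</s>": 1.0},
    "collar":  {"</s>": 1.0},
    "craft":   {"</s>": 1.0},
}

def greedy_generate(start="<s>", max_len=10):
    tokens, cur = [], start
    for _ in range(max_len):
        cur = max(bigram[cur], key=bigram[cur].get)  # argmax next token
        if cur == "</s>":
            break
        tokens.append(cur)
    return tokens

print(greedy_generate())  # → ['classic', 'plaid']
```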

Beyond the tasks above, the pretrained models are also being explored for reading comprehension, text summarisation and other downstream tasks. To meet online QPS and latency requirements, model compression techniques such as knowledge distillation are applied, alongside lightweight variants like ALBERT.

Overall, the experiments demonstrate that pretrained language models can significantly improve performance on many NLP tasks in an e‑commerce setting, provided that the appropriate usage mode (feature ensemble vs fine‑tuning) and model optimisation are chosen.

Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
