Overview of Natural Language Processing Techniques and Their Evolution
This article provides a comprehensive overview of natural language processing: its definition, its historical development from one‑hot encoding to modern models such as word2vec, ELMo, GPT, and BERT, and the advantages, limitations, and key concepts of each technique.
1. Concept
NLP (Natural Language Processing) refers to the processing of unstructured textual data—such as Chinese, English, and other languages—so that machines can understand natural language. The basic functions are illustrated in the diagram below.
2. Background
Artificial intelligence is now widely applied, especially in image domains, but massive amounts of textual data also exist across companies, schools, and government agencies, and the web contains an enormous amount of text that remains under‑exploited. Understanding this natural language is crucial for mining its value, which makes NLP essential.
3. Development History
Stage One
1) One‑hot
Concept: each word in the corpus is encoded as a vector whose length equals the vocabulary size; only one position is 1, the rest are 0. The position of 1 represents the word.
Example: Dictionary: {"John":1, "likes":2, "to":3, "watch":4, "movies":5, "also":6, "football":7, "games":8, "Mary":9, "too":10}
One‑hot vectors:
John: [1,0,0,0,0,0,0,0,0,0]
likes: [0,1,0,0,0,0,0,0,0,0]
too: [0,0,0,0,0,0,0,0,0,1]
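The dictionary above can be encoded in a few lines of Python. This is a minimal sketch; the `one_hot` helper is illustrative, not a standard API:

```python
# Vocabulary in the same order as the example dictionary's indices.
vocab = ["John", "likes", "to", "watch", "movies",
         "also", "football", "games", "Mary", "too"]

def one_hot(word, vocab):
    """Return a vector of len(vocab) zeros with a 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("John", vocab))  # [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(one_hot("too", vocab))   # [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
```

Note how every vector has dimension equal to the vocabulary size with exactly one nonzero entry, which is why these features become extremely sparse as the vocabulary grows.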
Advantages:
1. Solves the difficulty of classifiers handling discrete data.
2. Expands feature representation to some extent.
Disadvantages:
1. Ignores word order, which is important.
2. Assumes words are independent.
3. Simple and crude; ignores word frequency and relationships.
4. Produces extremely sparse high‑dimensional features.
2) Bag‑of‑Words
Concept: based on one‑hot encoding, the vectors of all words in a sentence are summed to obtain a sentence vector.
Example: Sentence: "John likes to watch movies. Mary likes too". Using the 10‑word dictionary above, summing the one‑hot vectors of each token yields the sentence representation: [1,2,1,1,1,0,0,0,1,1].
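Summing the one‑hot vectors as described can be sketched directly. The `bag_of_words` helper below is illustrative; tokenization is assumed to be already done:

```python
# Same vocabulary order as the example dictionary.
vocab = ["John", "likes", "to", "watch", "movies",
         "also", "football", "games", "Mary", "too"]

def bag_of_words(tokens, vocab):
    """Sum one-hot vectors: each slot holds the token's term frequency."""
    vec = [0] * len(vocab)
    for tok in tokens:
        vec[vocab.index(tok)] += 1  # word order is discarded here
    return vec

tokens = ["John", "likes", "to", "watch", "movies", "Mary", "likes", "too"]
print(bag_of_words(tokens, vocab))  # [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
```

"likes" appears twice, so its slot holds 2, while the order of the words is lost entirely, which is exactly the limitation discussed below.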
Characteristics:
1. Counts term frequency, giving higher weight to frequently occurring words.
2. Struggles to determine word importance: in many cases, frequency is not positively correlated with significance.
Summary
Bag‑of‑words provides a vector representation of sentences but remains sparse and biased because it weights all words equally, including stop words such as the Chinese particles "的" and "了" that carry little semantic weight.
3) TF‑IDF
Term Frequency‑Inverse Document Frequency is a statistical weighting technique used in information retrieval and text mining. It increases with a term’s frequency in a document but decreases with its frequency across the corpus.
TF‑IDF helps mitigate the problems of bag‑of‑words by reducing the impact of common stop words.
To compute TF‑IDF, a basic corpus is required for probability distribution statistics.
Calculation:
Term Frequency: TF = (term count in document) / (total terms in document), or alternatively TF = (term count) / (count of the most frequent term in the document).
Inverse Document Frequency: IDF = log( (number of documents) / (documents containing the term + 1) ).
TF‑IDF = TF * IDF.
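The formulas above can be applied to a toy corpus. This is a minimal sketch with a made‑up three‑document corpus; the function names are illustrative:

```python
import math

# Toy corpus: three already-tokenized documents (hypothetical data).
docs = [
    ["John", "likes", "movies"],
    ["Mary", "likes", "football"],
    ["John", "watches", "football", "games"],
]

def tf(term, doc):
    """TF = term count in document / total terms in document."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """IDF = log(number of documents / (documents containing term + 1))."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (df + 1))

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "likes" appears in 2 of 3 documents, so log(3/3) = 0: its weight vanishes,
# while the rarer "movies" keeps a positive weight.
print(tf_idf("likes", docs[0], docs))
print(tf_idf("movies", docs[0], docs))
```

This shows the intended effect: a word common across the corpus is down‑weighted even if it is frequent within a single document.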
Summary
TF‑IDF balances term frequency within a document against its rarity across the corpus, reducing the influence of ubiquitous words.
4) N‑Gram
N‑Gram is a statistical language model that slides a window of size N over the text to generate contiguous fragments of N words (or characters).
The probability of the Nth word depends only on the preceding N‑1 words. Commonly used are bi‑grams and tri‑grams.
Example (2‑gram): the conditional probability P(务|业) is estimated as occurrences of the bigram "业务" divided by occurrences of "业", e.g. 2/3 ≈ 0.667.
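The same count‑and‑divide estimate works in any language. Here is a sketch on a toy English corpus (the Chinese example above is computed the same way):

```python
from collections import Counter

# Toy corpus; P(w2 | w1) = count(w1 w2) / count(w1).
tokens = "the cat sat on the mat the cat ran".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def bigram_prob(w1, w2):
    """Maximum-likelihood bigram estimate from raw counts."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("the", "cat"))  # 2 occurrences of "the cat" / 3 of "the"
```

Just as in the 业务 example, the result is 2/3 ≈ 0.667: two of the three occurrences of "the" are followed by "cat".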
Summary
N‑grams introduce limited word‑order information but still lack true semantic representation and cannot fully vectorize sentences.
Stage Two
1) word2vec
In 2013, Tomas Mikolov introduced CBOW and Skip‑gram models, which dramatically advanced NLP and deep learning applications. word2vec learns dense word embeddings via a shallow two‑layer neural network.
Pre‑trained embeddings are now standard for initializing the first layer of neural networks, especially when labeled data are scarce.
word2vec maps words to low‑dimensional vectors, enabling downstream tasks.
Skip‑gram predicts surrounding context given a target word.
CBOW predicts the target word from its surrounding context.
Training process: one‑hot vectors are fed into the network; after many samples, the hidden‑layer weights become the word embeddings.
Vocabulary size V; each word is represented by a V‑dimensional one‑hot vector.
During forward propagation, the one‑hot input selects the single row of the hidden‑layer weight matrix corresponding to its 1 position; that row becomes the word's embedding vector.
Skip‑gram training: predict context words from a single word.
CBOW training: predict the central word from surrounding words.
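The two training schemes differ only in how (input, target) pairs are cut from a window. The sketch below generates Skip‑gram pairs; `skipgram_pairs` and the window size are illustrative, not part of the original word2vec code:

```python
def skipgram_pairs(tokens, window=2):
    """Emit (center, context) pairs: the center word predicts each neighbor.
    CBOW would group the same window the other way: (context list, center)."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["John", "likes", "to", "watch", "movies"]
for pair in skipgram_pairs(tokens, window=1):
    print(pair)  # ('John', 'likes'), ('likes', 'John'), ('likes', 'to'), ...
```

Each pair becomes one training example (one‑hot input, one‑hot target) for the shallow network, so a single center word contributes several updates, which is one intuition behind Skip‑gram's slower but more thorough training.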
Summary
Skip‑gram yields more accurate embeddings at the cost of longer training, while CBOW is faster but slightly less precise.
Skip‑gram: one central word learns from many context words (student‑vs‑teachers).
CBOW: many context words learn from one central word (teachers‑vs‑students).
Stage Three
1) ELMo
ELMo (Embeddings from Language Models) was introduced in 2018 by Matthew Peters et al. It uses a forward and backward language model based on bi‑LSTM to generate contextualized word embeddings.
ELMo first computes context‑independent token embeddings (via a character‑level CNN in the original paper), then refines them through the bi‑LSTM language model to produce context‑dependent representations.
Advantages:
1. Combines a forward and a backward language model over a bi‑LSTM architecture.
2. Produces contextualized embeddings, addressing polysemy (one word, multiple meanings).
3. Can be fine‑tuned for various downstream tasks.
Disadvantages:
1. Not truly bidirectional: the forward and backward models are trained separately and their outputs simply combined.
2. LSTM feature extraction is weaker and slower than the Transformers that followed.
3. The model architecture may not match downstream task models, hindering direct transfer.
2) GPT
Large‑scale pre‑trained language models sparked a surge of interest. In June 2018, OpenAI released GPT (Generative Pre‑Training), extending the trend.
GPT‑1, 2, and 3 differ mainly in scale (parameter count and training data); the architecture remains a stack of Transformer decoder blocks.
GPT‑3 features 175 billion parameters, 31 authors, 45 TB of training data, and massive compute resources, dramatically influencing AI applications such as text generation, code generation, translation, and QA.
Disadvantages:
1. Cannot judge the validity of nonsensical prompts, producing meaningless answers.
2. Risks generating biased or harmful content learned from its massive training data.
3. Transformer limitations can cause repetition and a lack of long‑range coherence.
3) BERT
In October 2018, Google AI released BERT (Bidirectional Encoder Representations from Transformers), which quickly became a landmark in NLP.
BERT uses a Transformer encoder with self‑attention, enabling true bidirectional context modeling. It is trained with Masked‑LM and Next Sentence Prediction tasks.
Advantages:
1. The Transformer encoder with self‑attention provides true bidirectional modeling.
2. Masked‑LM enables token‑level pre‑training.
3. Next Sentence Prediction adds sentence‑level supervision.
4. Fine‑tuning requires relatively little compute.
Disadvantages:
1. The random masking strategy is coarse.
2. The [MASK] token never appears at inference time, creating a pre‑training/inference mismatch that can hurt performance.
3. Only 15% of tokens are predicted per batch, slowing convergence.
4. Large models consume enormous hardware resources.
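The masking strategy behind points 1–3 above can be sketched in a few lines. This is an illustrative simplification of the published BERT recipe (select ~15% of tokens; of those, 80% become [MASK], 10% a random token, 10% stay unchanged); the helper name and toy data are made up:

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """Return (masked sequence, {index: original token} targets)."""
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok          # the model must predict the original
            roll = rng.random()
            if roll < 0.8:
                masked[i] = "[MASK]"        # 80%: replace with [MASK]
            elif roll < 0.9:
                masked[i] = rng.choice(vocab)  # 10%: random replacement
            # else: 10% keep the token unchanged
    return masked, targets

vocab = ["the", "cat", "sat", "on", "mat"]
tokens = ["the", "cat", "sat", "on", "the", "mat"] * 5
masked, targets = mask_tokens(tokens, vocab)
print(len(targets), "of", len(tokens), "tokens selected for prediction")
```

Since only the selected ~15% of positions contribute to the loss, each batch yields fewer training signals than a left‑to‑right model, which is why convergence is slower.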
Numerous variants have been built on BERT, such as lightweight ALBERT, knowledge‑enhanced K‑BERT, and RoBERTa.
Attention visualization shows strong connections between related words such as "The" and "cat".
Summary
The article reviews the evolution of NLP from simple statistical encodings to deep contextual models, highlighting the strengths and weaknesses of each approach and illustrating how modern architectures like BERT, GPT, and ELMo have transformed language understanding.
Recruitment
The Zero technology team at Zhengcai Cloud (based in Hangzhou) seeks passionate engineers. The team of 300+ includes veterans from Alibaba, Huawei, NetEase, and fresh graduates from top universities, working on cloud‑native, blockchain, AI, low‑code platforms, middleware, big data, and more. Interested candidates can email zcy‑tc@cai‑inc.com.