
How Do BPE, WordPiece, and SentencePiece Shape Modern NLP Tokenization?

This article explains the fundamentals, workflows, examples, and trade‑offs of three major subword tokenization algorithms—Byte Pair Encoding, WordPiece, and SentencePiece—helping practitioners choose the right method for their large language model pipelines.

Code Mala Tang

In natural language processing (NLP), tokenization is a fundamental preprocessing step that bridges raw text and machine learning models. It splits text into smaller units called tokens, which are then converted into numeric IDs that serve as inputs to large language models (LLMs) and are mapped to embeddings that capture semantic information.

The choice of tokenization algorithm significantly impacts LLM performance and efficiency. This article examines three widely used subword tokenizers: Byte Pair Encoding (BPE), WordPiece, and SentencePiece, describing their principles, advantages, and limitations, with concrete examples.

1. Byte Pair Encoding (BPE)

BPE is a subword tokenization algorithm that balances vocabulary size and the ability to handle out‑of‑vocabulary (OOV) words. Unlike simple word‑or‑character tokenizers, BPE iteratively merges the most frequent character pairs or subwords, preserving common words as whole tokens while decomposing rare words into subword units.

1.1 How BPE Works

Pre‑tokenization: The input text is first split into smaller units, typically by spaces or punctuation. Example: "applied deep learning" → ["applied", "deep", "learning"].

Initial vocabulary: Starts with all individual characters. For the word "deep", the initial split is ["d", "e", "e", "p"].

Iterative merging: The most frequent character pair is merged. If ("e", "e") is most frequent, it becomes the new token "ee" and is added to the vocabulary.

Vocabulary update: Merging continues until a predefined vocabulary size is reached, adding a new token each time.

Final tokenization: Once the vocabulary is fixed, text is tokenized using the learned subword units.
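Once the merge table is learned, tokenizing a new word simply replays the merges in the order they were learned. The following minimal sketch illustrates this final step (the `apply_bpe` helper and the sample merge table are hypothetical, for illustration only):

```python
def apply_bpe(word, merges):
    """Tokenize one pre-tokenized word by replaying learned merges in priority order."""
    symbols = list(word)
    for a, b in merges:  # merges are applied in the order they were learned
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # collapse the matched pair in place
            else:
                i += 1
    return symbols

# Hypothetical merge table learned from some corpus:
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(apply_bpe("lowering", merges))  # ['low', 'er', 'i', 'n', 'g']
```

Note how the unseen word "lowering" decomposes into known subwords plus single characters, which is exactly how BPE avoids hard OOV failures.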

1.2 Example

Input text: "low lower lowest"

Step 1 – Initial character‑level split (BPE learns merges within word boundaries, so each word is split separately):

["l", "o", "w"], ["l", "o", "w", "e", "r"], ["l", "o", "w", "e", "s", "t"]

Step 2 – Count adjacent‑pair frequencies: the pair ("l", "o") appears 3 times, as does ("o", "w").

Step 3 – Merge one of the most frequent pairs, here ("l", "o") → new token "lo":

["lo", "w"], ["lo", "w", "e", "r"], ["lo", "w", "e", "s", "t"]

Step 4 – Re‑count frequencies; the most frequent pair is ("lo", "w"), also appearing 3 times.

Step 5 – Merge → new token "low":

["low"], ["low", "e", "r"], ["low", "e", "s", "t"]

Step 6 – Re‑count; the most frequent pair is now ("low", "e"), which appears twice (in "lower" and "lowest").

Step 7 – Merge → new token "lowe":

["low"], ["lowe", "r"], ["lowe", "s", "t"]

Step 8 – All remaining pairs now appear only once; breaking the tie in favor of ("lowe", "r") and merging yields the final token list:

["low"], ["lower"], ["lowe", "s", "t"]

1.3 Advantages and Limitations

Advantages: Because any unseen word can be decomposed into known subwords (ultimately single characters), BPE handles OOV words gracefully and adapts well to many languages.

Limitations: It requires a pre‑tokenization step, which can be problematic for languages without clear word boundaries (e.g., Chinese, Japanese).

2. WordPiece

WordPiece is a subword tokenization algorithm similar to BPE but differs in how token pairs are selected. Instead of merging the most frequent pairs, WordPiece merges pairs that maximize the likelihood of the training data, making it especially suitable for models like BERT.

2.1 How WordPiece Works

Initial vocabulary: Starts with a character‑level vocabulary, just like BPE.

Likelihood maximization: For each candidate pair, WordPiece computes the score P(tok1tok2) / (P(tok1) × P(tok2)), estimated from corpus counts, and merges the pair with the highest score. This favors pairs whose parts rarely occur apart, rather than pairs that are merely frequent.

Iterative merging: The selected pair is merged, and the process repeats until the desired vocabulary size is reached.

Final tokenization: Text is tokenized using the learned subword units.
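The scoring rule can be approximated from corpus counts as score(t1, t2) = count(t1t2) / (count(t1) × count(t2)). A minimal sketch on the "low lower lowest" corpus (the `wordpiece_scores` helper is hypothetical, and real WordPiece trainers add further details such as the "##" continuation prefix):

```python
from collections import Counter

def wordpiece_scores(words):
    """Score each adjacent pair by count(ab) / (count(a) * count(b))."""
    pair_counts, unit_counts = Counter(), Counter()
    for symbols, freq in words.items():
        for s in symbols:
            unit_counts[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    return {pair: c / (unit_counts[pair[0]] * unit_counts[pair[1]])
            for pair, c in pair_counts.items()}

# Same toy corpus as in the BPE example.
words = {tuple("low"): 1, tuple("lower"): 1, tuple("lowest"): 1}
scores = wordpiece_scores(words)
best = max(scores, key=scores.get)
print(best, scores[best])  # ('s', 't') 1.0
```

Unlike BPE, which would merge the frequent pair ("l", "o") first, this criterion picks ("s", "t"): "s" and "t" never occur apart in the corpus, so merging them increases the likelihood of the data the most.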

2.2 Advantages and Limitations

Advantages: WordPiece captures meaningful subword units effectively and is widely used in models such as BERT.

Limitations: Like BPE, it relies on a pre‑tokenization step, which can be challenging for certain languages.

3. SentencePiece

SentencePiece is designed to overcome the limitations of BPE and WordPiece for languages without explicit word boundaries. It treats the input text as a raw character stream, including spaces, thereby eliminating the need for pre‑tokenization. This makes SentencePiece versatile for multilingual and non‑space‑separated languages.

3.1 How SentencePiece Works

Input as character stream: The entire text is considered a continuous sequence of characters, spaces included.

Merging algorithm: It can use the same BPE‑style merging of frequent character pairs or employ a unigram language model that starts with a large token set and prunes it to the desired size.

Space handling: SentencePiece replaces spaces with the visible marker ▁ (U+2581, "lower one eighth block", which resembles an underscore), so tokenization is fully reversible.

3.2 Example

For the sentence "deep learning engineer":

SentencePiece may generate the tokens ["▁deep", "▁learning", "▁engineer"], where ▁ stands in for a space (with the default settings a marker is also prepended to the first word), allowing the original sentence to be reconstructed exactly.
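The space-marker convention can be illustrated in a few lines of Python (a sketch of the bookkeeping only, not of the actual SentencePiece library; `mark_spaces` and `restore` are hypothetical helpers, and "\u2581" is the ▁ marker):

```python
SPACE = "\u2581"  # ▁, the marker SentencePiece uses in place of spaces

def mark_spaces(text):
    """Prefix the text and replace spaces with the marker, as done before segmenting."""
    return SPACE + text.replace(" ", SPACE)

def restore(tokens):
    """Losslessly rebuild the original text from marked tokens."""
    return "".join(tokens).replace(SPACE, " ").lstrip(" ")

stream = mark_spaces("deep learning engineer")
print(stream)  # ▁deep▁learning▁engineer

# One possible segmentation of the stream:
tokens = ["\u2581deep", "\u2581learning", "\u2581engineer"]
print(restore(tokens))  # deep learning engineer
```

Because the marker survives inside the tokens, detokenization is a pure string operation and needs no language-specific rules.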

3.3 Advantages and Limitations

Advantages: SentencePiece is highly flexible, handling many languages seamlessly without pre‑tokenization.

Limitations: Representing spaces as part of tokens can make the output less intuitive for some downstream applications.

4. Comparison of BPE, WordPiece, and SentencePiece

BPE merges the most frequent adjacent pair at each step; WordPiece merges the pair that most increases the likelihood of the training data (pair frequency normalized by the frequencies of its parts); and SentencePiece treats the text as a raw character stream, spaces included, so it needs no pre‑tokenization and can build its vocabulary with either BPE‑style merging or a unigram language model.

Conclusion

Tokenization is a critical step in preparing text data for large language models. The three algorithms discussed—BPE, WordPiece, and SentencePiece—each have distinct strengths and are suited to different use cases. BPE and WordPiece are widely adopted in models like GPT and BERT, while SentencePiece offers greater flexibility for multilingual scenarios. Understanding their nuances enables practitioners to select the most appropriate tokenizer for their specific NLP tasks.

As language models evolve, tokenization techniques will continue to improve, focusing on efficiency, broader language coverage, and tighter integration with training pipelines.

Tags: tokenization, NLP, WordPiece, BPE, SentencePiece, subword