
Understanding Large Language Models: Tokens, Tokenization, and the Evolution from Markov Chains to Transformers

This article explains how generative AI models work by demystifying tokens, tokenization with tools like tiktoken, simple Markov‑chain training, the limitations of small context windows, and how modern LLMs use neural networks, transformers, and attention mechanisms to predict the next token.


Generative AI is everywhere, and many people have tried ChatGPT as a personal assistant, yet few understand how these models actually work. This article explains the inner workings of large language models (LLMs) in plain language, without heavy mathematics.

What is a Token? A token is the basic unit of text that an LLM processes. Tokens can be whole words, sub‑words, punctuation, or even spaces, and they are encoded efficiently using algorithms such as Byte‑Pair Encoding (BPE). For example, the open‑source GPT‑2 model has a vocabulary of 50,257 tokens.

To experiment with tokens in Python you can install OpenAI's tiktoken package:

pip install tiktoken

Then try the following in a Python REPL:

>>> import tiktoken
>>> encoding = tiktoken.get_encoding("gpt2")
>>> encoding.encode("The quick brown fox jumps over the lazy dog.")
[464, 2068, 7586, 21831, 18045, 625, 262, 16931, 3290, 13]
>>> encoding.decode([464, 2068, 7586, 21831, 18045, 625, 262, 16931, 3290, 13])
'The quick brown fox jumps over the lazy dog.'
>>> encoding.decode([464])
'The'
>>> encoding.decode([2068])
' quick'
>>> encoding.decode([13])
'.'

Notice that the same word can have different token IDs depending on surrounding spaces or capitalization, e.g.:

>>> encoding.encode('The')
[464]
>>> encoding.encode('the')
[1169]
>>> encoding.encode(' the')
[262]

Predicting the Next Token – given a sequence of tokens, an LLM predicts the probability distribution of the next token. In pseudocode:

predictions = get_token_predictions(["The", " quick", " brown", " fox"])

The function receives the tokenized user prompt and returns a probability for every token in the vocabulary (e.g., 50,257 probabilities for GPT‑2). Tokens that are likely continuations (like "jumps" after "The quick brown fox") receive higher probabilities, while unlikely tokens (like "potato") receive near‑zero probabilities.
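To make the idea of "a probability for every token in the vocabulary" concrete, here is a toy sketch (my own illustration with an invented four-word vocabulary, not GPT‑2's actual output): a model's final layer emits one raw score (a logit) per token, and a softmax converts those scores into probabilities that sum to 1.

```python
import math

# Hypothetical vocabulary and logits for the prompt "The quick brown fox".
vocab = ["jumps", "runs", "sleeps", "potato"]
logits = [4.0, 2.5, 1.0, -3.0]  # invented raw scores from the model

# Softmax: exponentiate each score, then normalize by the total.
exps = [math.exp(x) for x in logits]
total = sum(exps)
probs = {tok: e / total for tok, e in zip(vocab, exps)}

for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{tok:>8}: {p:.4f}")
```

Likely continuations such as "jumps" end up with most of the probability mass, while "potato" is pushed toward zero, exactly as described above.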

To generate longer text, the model repeatedly predicts a token, appends it to the input, and feeds the extended sequence back into the model:

def generate_text(prompt, num_tokens, hyperparameters):
    tokens = tokenize(prompt)
    for i in range(num_tokens):
        predictions = get_token_predictions(tokens)
        next_token = select_next_token(predictions, hyperparameters)
        tokens.append(next_token)
    return detokenize(tokens)  # decode the token IDs back into text

The select_next_token function can use greedy selection (choose the highest‑probability token) or sampling with hyper‑parameters such as temperature, top‑p, and top‑k to introduce creativity.
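As a sketch of how such a selector might look (a hypothetical helper, not any specific model's implementation; it assumes `predictions` is a dict mapping token IDs to non-zero probabilities):

```python
import math
import random

def select_next_token(predictions, temperature=1.0, top_k=None):
    """Sample a token ID from a {token_id: probability} dict.

    temperature < 1 sharpens the distribution (closer to greedy),
    temperature > 1 flattens it (more creative); top_k, if given,
    restricts sampling to the k most likely tokens.
    """
    items = sorted(predictions.items(), key=lambda kv: -kv[1])
    if top_k is not None:
        items = items[:top_k]
    # Re-weight probabilities by temperature in log space.
    weights = [math.exp(math.log(p) / temperature) for _, p in items]
    total = sum(weights)
    # Draw a random point in [0, total) and find the token it lands on.
    r = random.random() * total
    for (token_id, _), w in zip(items, weights):
        r -= w
        if r <= 0:
            return token_id
    return items[-1][0]
```

With `temperature` close to 0 this behaves like greedy selection; with `top_k=1` it is greedy by construction.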

Simple Markov‑Chain Training – as an illustration, a tiny model can be trained by counting how often each token follows another in a small dataset. A probability table is built from these counts, and next‑token prediction simply looks up the row corresponding to the last token. The article shows a concrete example with the tokens ["I", "you", "like", "apples", "bananas"] and three sentences, producing a 5×5 table of counts and derived probabilities.
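The counting approach can be sketched in a few lines of Python. The three sentences below are hypothetical stand-ins for the article's example data, using the same five-token vocabulary:

```python
from collections import defaultdict

# Hypothetical tiny corpus over the five-token vocabulary above.
sentences = [
    ["I", "like", "apples"],
    ["I", "like", "bananas"],
    ["you", "like", "bananas"],
]

# Count how often each token follows another.
counts = defaultdict(lambda: defaultdict(int))
for sentence in sentences:
    for current, nxt in zip(sentence, sentence[1:]):
        counts[current][nxt] += 1

# Convert counts to probabilities: each row of the table sums to 1.
table = {}
for current, followers in counts.items():
    total = sum(followers.values())
    table[current] = {nxt: n / total for nxt, n in followers.items()}

print(table["like"])  # "apples" ~1/3 of the time, "bananas" ~2/3
print(table["I"])     # always followed by "like"
```

Prediction is then a single dictionary lookup on the last token, which is exactly why the model's context window is one token.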

Because a Markov chain only looks at the last token, its context window is one token, which leads to incoherent text. Extending the context to two or three tokens dramatically increases the number of rows (e.g., 5² = 25 rows for two‑token contexts, 5³ = 125 rows for three‑token contexts) but still does not scale to realistic vocabularies. GPT‑2 uses a 1024‑token context window, which would require 5¹⁰²⁴ rows – an astronomically large table.
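The arithmetic behind this explosion is easy to verify: the number of rows in an explicit table is the vocabulary size raised to the length of the context.

```python
# Rows needed for an explicit probability table: vocab_size ** context_length.
vocab_size = 5  # the toy five-token vocabulary from the example above
for context_length in (1, 2, 3):
    print(context_length, vocab_size ** context_length)  # 5, 25, 125

# With GPT-2's 1024-token context window, even this toy vocabulary
# would need 5 ** 1024 rows:
rows = vocab_size ** 1024
print(len(str(rows)))  # a number with 716 digits
```

A table whose row count has hundreds of digits cannot be stored, which motivates replacing it with a learned function.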

Therefore, storing explicit probability tables is infeasible, and we replace the table with a neural network that approximates the probability function.

Neural Networks and Training – a neural network receives token IDs as input and outputs a probability distribution over the vocabulary. Training adjusts billions of parameters (~1.5 billion for GPT‑2, 175 billion for GPT‑3, and a rumored ~1.76 trillion for GPT‑4) using back‑propagation on massive text corpora. The process iterates until the model reliably predicts the next token.

Transformers and Attention – modern LLMs use the Transformer architecture, which consists of stacked layers that apply self‑attention to relate every token in the context window to every other token. Attention allows the model to capture long‑range dependencies and produce coherent continuations.
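A minimal, illustrative sketch of the core self-attention operation follows. This is scaled dot-product attention in its simplest form; real transformers add learned query/key/value projections, multiple heads, causal masking, and stacked layers, so treat this as a conceptual toy rather than GPT‑2's actual code:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each token's query is compared with every token's key; the
    resulting scores (after a softmax) weight a sum over the values,
    so every output position blends information from all positions."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq, seq) pairwise similarities
    # Row-wise softmax (shift by the max for numerical stability).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # weighted blend of values per token

# Hypothetical 3-token sequence with 4-dimensional embeddings.
rng = np.random.default_rng(42)
x = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
print(out.shape)  # (3, 4): one blended vector per input token
```

Because every token attends to every other token in the window, the model can relate words that are far apart, which is what lets it capture the long-range dependencies mentioned above.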

The article also discusses the misconception that LLMs possess true reasoning or intelligence. While they can generate impressive, seemingly original text by stitching together patterns learned during training, they lack genuine understanding and can hallucinate facts. Consequently, the author advises against using LLM‑generated content in production without human verification.

In summary, the piece walks the reader from the basics of tokenization, through simple probabilistic models, to the sophisticated neural‑network‑based Transformers that power today’s large language models.

Tags: artificial intelligence, LLM, transformer, tokenization, Markov chain
Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
