How Do Large Language Models Compress Massive Data? Limits and Techniques
This article explains how large language models act like a super‑library by compressing vast amounts of text using information‑theoretic concepts, probability‑based coding, autoregressive neural networks, and arithmetic coding, while discussing accuracy, compression ratios, and theoretical limits.
During a recent AI lecture, the speaker likened large language models (LLMs) to a super‑library where asking the model is equivalent to querying a massive knowledge base. This raises questions about whether such a "library" can truly contain all world information and how accurate that information is.
Basic Concepts of Data Compression
Data compression means representing data with fewer bits. Lossless compression restores the original data exactly after decompression.
A common approach is to encode data based on its probability distribution; Huffman coding and arithmetic coding are classic examples.
Information Content and Entropy
Two fundamental notions are information content and entropy.
Information content measures the uncertainty of a single event. An event with high probability (e.g., the letter "E" in English text) carries little information; a rare event carries a lot.
The information content can be expressed as:
\(I = -\log_2(p)\) where \(p\) is the probability of the event.
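The formula can be evaluated directly. The letter frequencies below are approximate figures for English text, used only for illustration:

```python
import math

def information_content(p: float) -> float:
    """Bits of information carried by an event with probability p."""
    return -math.log2(p)

# Common letter "E" (p ≈ 0.127): low information, about 3 bits.
print(information_content(0.127))
# Rare letter "Z" (p ≈ 0.00074): high information, about 10.4 bits.
print(information_content(0.00074))
```

A certain event (p = 1) carries 0 bits, and halving the probability adds exactly one bit, which is why the logarithm is taken base 2.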
Entropy is the average information content of all possible events, indicating the average uncertainty of a system. Higher entropy means more complex, less predictable information.
For a discrete random variable, entropy is calculated as:
\(H = -\sum_{i} p_i \log_2(p_i)\) where \(p_i\) is the probability of each possible outcome.
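A small sketch makes the definition concrete, using a few toy distributions:

```python
import math

def entropy(probs) -> float:
    """Average information content (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: exactly 1 bit
print(entropy([0.9, 0.1]))   # biased coin: about 0.47 bits — more predictable
print(entropy([1/6] * 6))    # fair die: about 2.58 bits
```

The biased coin has lower entropy than the fair one: the more predictable a source is, the fewer bits per symbol it needs on average.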
Example of Data Compression
Consider a text with a vocabulary of 256 symbols (8‑bit encoding). If each symbol is equally likely, each requires 8 bits – the baseline transmission method.
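The 8-bit baseline is exactly the entropy of a uniform distribution over 256 symbols, which can be checked numerically:

```python
import math

n = 256
# Uniform distribution: every one of the 256 symbols has probability 1/256.
h = -sum((1 / n) * math.log2(1 / n) for _ in range(n))
print(h)  # 8.0 bits per symbol
```

When the symbols are truly equiprobable, no code can do better than 8 bits per symbol; compression only becomes possible when the distribution is skewed or the data has structure a model can exploit.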
Autoregressive Neural Networks and Lossless Compression
Large language models such as GPT are autoregressive neural networks: they predict the next token conditioned on the tokens seen so far, effectively providing a probability distribution over the next symbol.
How Neural Networks Aid Compression
Traditional compression treats each data point independently with fixed‑length bits. Autoregressive networks learn the structure and regularities of data, allowing them to predict the next point and encode it more efficiently.
For example, if both parties share a trained network, they can use the predicted probability distribution to represent the next token with fewer bits.
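A minimal sketch of this idea, assuming a hypothetical next-token distribution (the tokens and probabilities below are illustrative, not from a real model):

```python
import math

# Hypothetical distribution over the next token, known to both sender
# and receiver because they share the same trained network.
predicted = {"the": 0.60, "a": 0.20, "cat": 0.15, "xyzzy": 0.05}

# Optimal code length for each candidate is its information content.
cost_bits = {token: -math.log2(p) for token, p in predicted.items()}
for token, bits in cost_bits.items():
    print(f"{token!r}: {bits:.2f} bits")
```

A confidently predicted token like "the" costs well under one bit, versus a flat 8 bits in the baseline scheme; only surprising tokens are expensive.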
Arithmetic Coding
Arithmetic coding maps each data point’s probability to a sub-interval of [0, 1), shrinking the interval as more symbols are processed. Example probabilities (these four sum to 0.845; the remaining mass belongs to other, unlisted symbols):
Character "0": 0.20
Character "1": 0.25
Character "2": 0.22
Character "3": 0.175
The interval size for each character is proportional to its probability; repeated interval subdivision yields a binary code for the symbol.
For instance, character "3" (probability 0.175) can be encoded after several interval refinements as (1, 0, 1), which uses only three bits.
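The interval-subdivision idea can be sketched as below. The leftover probability mass is assigned to a placeholder symbol and the symbol ordering is assumed (the article specifies neither), so the emitted bits need not match the (1, 0, 1) shown above; real coders also use integer arithmetic with renormalization rather than floats:

```python
def cumulative(probs):
    """Map each symbol to its sub-interval of [0, 1)."""
    cum, c = {}, 0.0
    for s, p in probs.items():
        cum[s] = (c, c + p)
        c += p
    return cum

def encode(symbols, probs):
    """Narrow [low, high) per symbol, then emit the shortest code inside it."""
    cum = cumulative(probs)
    low, high = 0.0, 1.0
    for s in symbols:
        lo, hi = cum[s]
        span = high - low
        low, high = low + span * lo, low + span * hi
    # Append bits until the code's dyadic interval fits inside [low, high).
    bits, lo2, hi2 = [], 0.0, 1.0
    target_mid = (low + high) / 2
    while not (low <= lo2 and hi2 <= high):
        mid = (lo2 + hi2) / 2
        if target_mid < mid:
            bits.append(0)
            hi2 = mid
        else:
            bits.append(1)
            lo2 = mid
    return bits

# "rest" is an assumed placeholder for the unlisted remaining probability mass.
PROBS = {"0": 0.20, "1": 0.25, "2": 0.22, "3": 0.175, "rest": 0.155}
print(encode("3", PROBS))   # a handful of bits, fewer than the 8-bit baseline
print(encode("33", PROBS))  # longer messages narrow the interval further
```

Each symbol multiplies the interval width by its probability, so the total code length approaches the sum of the symbols' information contents — the entropy bound.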
Compression Ratio
Using autoregressive networks can dramatically improve compression rates. Compared to the baseline 8‑bit per symbol, the model may represent the same data with as few as 3 bits per symbol.
During training, the model minimizes the negative log‑likelihood loss, effectively learning a lossless compression of the data distribution.
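The link between loss and code length is simple arithmetic: a cross-entropy loss measured in nats divides by ln 2 to give bits per token. The loss value below is hypothetical, chosen to match the roughly 3 bits per symbol mentioned above:

```python
import math

nll_nats = 2.08  # hypothetical average negative log-likelihood per token (nats)
bits_per_token = nll_nats / math.log(2)          # convert nats to bits
compression_vs_baseline = bits_per_token / 8     # vs. the 8-bit baseline
print(f"{bits_per_token:.2f} bits/token")
print(f"{compression_vs_baseline:.0%} of baseline size")
```

Lowering the training loss therefore directly lowers the achievable compressed size: the loss *is* the expected code length under arithmetic coding.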
Limits of Compression
Compression has theoretical limits; as datasets grow, the achievable compression ratio approaches an asymptote. When the model predicts the next token with higher precision, compression improves, eventually reaching the theoretical maximum.
For example, the Llama model compresses 5.6 TB of text to about 7.14 % of its original size: its code is roughly 1 MB, and the encoded data implied by the training loss amounts to about 0.4 TB (0.4 TB / 5.6 TB ≈ 7.14 %), yielding a substantial reduction in storage and transmission costs.
Through autoregressive networks, arithmetic coding, and continual model advances, LLMs can maintain information integrity while drastically reducing data size, and future developments are expected to push these limits even further.
Model Perspective
Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".