How Much Data Do You Need for a 10B LLM? Decoding Scaling Laws
This article explains how scaling laws can answer common LLM development questions—such as the data required for a 10B model, the model size achievable with 1 TB of data, and the optimal compute‑data‑model trade‑off for a fixed GPU budget—by presenting core formulas, practical derivations, and insights from OpenAI, DeepMind and Google.
In large‑model research, practitioners often ask four questions: the data needed to train a 10 B parameter model, the model size attainable with a given data amount (e.g., 1 TB), the best model‑data combination when limited to a fixed number of A100 GPUs, and the performance gain when scaling a 10 B model to 100 B. All of these can be answered using the theory of Scaling Laws.
Core Conclusions
For decoder-only models, three quantities are linked: compute C (FLOPs), model parameters P (excluding embeddings), and data size D (token count). The original Scaling Law paper (OpenAI, 2020) relates them through the approximation C ≈ 6 * P * D.
The final performance of a model is primarily determined by the total compute, parameter count, and data size, and is largely independent of the specific architecture (layer depth or width).
When the total parameter count is fixed, varying depth/width changes performance by less than 2 %.
Further observations:
When not constrained by the other two factors, model performance scales as a power law with each factor (compute, parameters, data).
Improving performance requires scaling both parameters and data together; the exact proportionality is still debated.
Scaling Laws apply not only to language models but also to multimodal and cross‑modal tasks.
Key Formula
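Following the form used in Scaling Laws for Autoregressive Generative Modeling, the loss as a function of compute C can be written as:
L(C) = L_inf + (C_0 / C)^alpha
Here L_inf, C_0, and alpha are constants fitted to experiments. The first term represents irreducible loss (data entropy, e.g., noise). The second term is the reducible loss that can be decreased by increasing compute; as compute approaches infinity, this term tends to zero and the overall loss approaches the irreducible limit.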
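A minimal sketch of how such a curve behaves, assuming the loss takes the form L(C) = L_inf + (C_0 / C)^alpha. The function name and all three constants are hypothetical and chosen only to show the shape; they are not fitted values from any paper.

```python
def loss_from_compute(compute, l_inf, c0, alpha):
    """Irreducible loss plus a power-law reducible term."""
    return l_inf + (c0 / compute) ** alpha


# Hypothetical constants, for illustration only.
L_INF, C0, ALPHA = 1.7, 1e8, 0.05

for c in (1e18, 1e20, 1e22, 1e24):
    loss = loss_from_compute(c, L_INF, C0, ALPHA)
    print(f"C = {c:.0e} FLOPs -> loss = {loss:.3f}")
```

Each 100x increase in compute shrinks the reducible term by the same constant factor, which is why the curve flattens toward L_inf without ever reaching it.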
Scaling Law in Practice: Compute‑Optimal Training
Because performance eventually saturates, simply adding more data without increasing model size yields diminishing returns. The practical workflow is:
Gather a large dataset (e.g., 1 TB of tokens).
Train a series of small models (0.001 B – 1 B parameters) to convergence on the same data.
Record the compute cost and resulting performance for each model.
Identify the compute‑optimal point where, for a given compute budget, the model‑data combination delivers the best performance.
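The last step can be sketched as a simple selection over recorded runs. This assumes each run was logged with its compute budget, parameter count, token count, and final loss; the record layout and every number below are made up for illustration.

```python
# Hypothetical records: (compute in FLOPs, parameters, tokens, final loss).
runs = [
    (1e18, 1e7, 1.7e10, 3.90),
    (1e18, 5e7, 3.3e9, 3.75),
    (1e18, 2e8, 8.3e8, 3.95),
    (1e19, 5e7, 3.3e10, 3.50),
    (1e19, 2e8, 8.3e9, 3.38),
    (1e19, 1e9, 1.7e9, 3.60),
]


def compute_optimal(runs):
    """For each compute budget, keep the (params, tokens, loss) with the lowest loss."""
    best = {}
    for budget, params, tokens, loss in runs:
        if budget not in best or loss < best[budget][2]:
            best[budget] = (params, tokens, loss)
    return best


for budget, (params, tokens, loss) in sorted(compute_optimal(runs).items()):
    print(f"C = {budget:.0e}: P = {params:.0e}, D = {tokens:.1e}, loss = {loss:.2f}")
```

In a real study the budgets would be matched only approximately, so runs would be bucketed or interpolated rather than grouped by exact equality.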
Empirical results show a clear power law between compute and performance. At the compute-optimal point, model parameters and data size are each a power law in compute as well, which appears as a straight line on log-log axes.
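A power law y = a * x^b is a straight line in log-log space, so it can be fitted with ordinary least squares on the logarithms. The sketch below uses only the standard library; the data points are synthetic, generated from a known exponent so the fit can be checked, and do not come from any experiment.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares on (log x, log y); return (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    slope = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly)) / sum(
        (u - mean_x) ** 2 for u in lx
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope


# Synthetic compute-optimal points: P_opt = 0.1 * C**0.5 (made up).
compute = [1e18, 1e19, 1e20, 1e21]
opt_params = [0.1 * c ** 0.5 for c in compute]

a, b = fit_power_law(compute, opt_params)
print(f"P_opt ~ {a:.3g} * C^{b:.3f}")
```

Once a and b are fitted on small models, the same expression can be extrapolated to a larger compute budget, which is how the small-model sweep informs the large run.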
Different research groups interpret the trade‑off differently:
OpenAI argues that model size is more important: its fitted exponents (P ∝ C^0.73) imply roughly a 5.5× increase in parameters but only about a 1.8× increase in data when compute is increased tenfold.
DeepMind (Chinchilla) and Google (PaLM) find that model parameters and data should be scaled equally, suggesting roughly a 3.2× (about √10) increase for both when compute grows tenfold.
For example, PaLM’s experiments increased compute by 10×, and the model size grew from 3.2 B to 10.7 B parameters, a factor of about 3.3 that is close to √10 and consistent with the equal-importance view.
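The two recommendations can be compared with a few lines of arithmetic. The sketch assumes compute is proportional to parameters times data, so that if parameters scale as C^a then data scales as C^(1 - a). The exponent 0.73 is the parameter exponent reported by OpenAI's 2020 paper and 0.5 is the equal-scaling case; the function name is hypothetical.

```python
def scale_factors(compute_multiplier, param_exponent):
    """Return (param_factor, data_factor) assuming P ~ C**a and D ~ C**(1 - a)."""
    param_factor = compute_multiplier ** param_exponent
    data_factor = compute_multiplier ** (1 - param_exponent)
    return param_factor, data_factor


for name, exponent in (("OpenAI (2020)", 0.73), ("Chinchilla / PaLM", 0.50)):
    p, d = scale_factors(10, exponent)
    print(f"{name}: 10x compute -> {p:.2f}x parameters, {d:.2f}x data")
```

With a tenfold compute increase this gives about 5.4× parameters and 1.9× data under the first exponent, and about 3.2× for both under equal scaling.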
Derivation of Compute‑Model‑Data Relationship
For a decoder‑only transformer, let:
C be total FLOPs,
P be the number of parameters (excluding embeddings, norm, and bias),
D be the total token count.
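The parameter count per layer follows from the hidden dimension (d) and the feed-forward dimension (d_ff). Each layer has four d × d attention matrices (query, key, value, output) and two d × d_ff MLP matrices, so with L layers the total parameter count is:
P = L * (4 * d^2 + 2 * d * d_ff)
With the common choice d_ff = 4 * d, this becomes P = 12 * L * d^2.
The forward-pass FLOPs per token for one layer consist of:
Query, key, and value projections: 3 * 2 * d * d = 6 * d^2 FLOPs,
Attention scores (query-key products): 2 * S * d, where S is the sequence length,
Score-value multiplication: 2 * S * d,
Output linear projection: 2 * d * d,
MLP up-projection and down-projection: 2 * d * d_ff + 2 * d_ff * d.
Summing over L layers and over the B * S tokens in a batch (B is the batch size) gives the forward compute:
C_forward ≈ 2 * B * S * L * (4 * d^2 + 2 * d * d_ff + 2 * d * S)
Dividing by the token count B * S yields the forward FLOPs per token, 2 * P + 4 * L * d * S. When S is small relative to d, the attention term is negligible and the forward pass costs about 2 * P FLOPs per token. The backward pass costs roughly twice the forward pass, so training costs about 6 * P FLOPs per token. Over D training tokens this leads to the compact relationship:
C ≈ 6 * P * D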
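The accounting in this section can be checked numerically. The sketch below assumes four d × d attention matrices and two d × d_ff MLP matrices per layer, counts a multiply-add as 2 FLOPs, and takes the backward pass as twice the forward pass. The model shape is hypothetical and does not describe any particular released model.

```python
def layer_params(d, d_ff):
    """Weights in one layer: Q, K, V, O projections plus the two MLP matrices."""
    return 4 * d * d + 2 * d * d_ff


def forward_flops_per_token(d, d_ff, n_layers, seq_len):
    """Forward FLOPs for one token, summed over all layers."""
    qkv = 6 * d * d              # three d x d projections
    scores = 2 * seq_len * d     # query-key products
    weighted = 2 * seq_len * d   # score-value multiplication
    out_proj = 2 * d * d
    mlp = 4 * d * d_ff           # up- and down-projection
    return n_layers * (qkv + scores + weighted + out_proj + mlp)


# Hypothetical shape, for illustration only.
d, d_ff, n_layers, seq_len = 4096, 16384, 32, 2048

P = n_layers * layer_params(d, d_ff)
train_flops_per_token = 3 * forward_flops_per_token(d, d_ff, n_layers, seq_len)
ratio = train_flops_per_token / (6 * P)

print(f"P = {P / 1e9:.2f} B parameters")
print(f"exact / (6 * P) = {ratio:.3f}")
```

For this shape the exact count exceeds 6 * P by a factor of 1 + S / (6 * d), so the approximation C ≈ 6 * P * D is tight when the sequence length is small relative to the hidden dimension and loosens as context grows.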
References
Scaling Laws for Neural Language Models (OpenAI, 2020)
Training Compute‑Optimal Large Language Models (Chinchilla, DeepMind)
PaLM 2 Technical Report
Scaling Laws for Autoregressive Generative Modeling
GPT‑4 Technical Report
Baichuan 2: Open Large‑scale Language Models
MindLLM: Pre‑training Lightweight Large Language Model from Scratch
LLaMA: Open and Efficient Foundation Language Models
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.