How Much Data Do You Need for a 10B LLM? Decoding Scaling Laws

This article explains how scaling laws can answer common LLM development questions—such as the data required for a 10B model, the model size achievable with 1 TB of data, and the optimal compute‑data‑model trade‑off for a fixed GPU budget—by presenting core formulas, practical derivations, and insights from OpenAI, DeepMind and Google.

Baobao Algorithm Notes

Nov 21, 2023

In large‑model research, practitioners often ask four questions: the data needed to train a 10 B parameter model, the model size attainable with a given data amount (e.g., 1 TB), the best model‑data combination when limited to a fixed number of A100 GPUs, and the performance gain when scaling a 10 B model to 100 B. All of these can be answered using the theory of Scaling Laws.

Core Conclusions

For decoder‑only models, the three quantities—compute (FLOPs), model parameters (excluding embeddings), and data size (token count)—are linked by a power‑law relationship derived in the original Scaling Law paper (OpenAI, 2020).

The final performance of a model is primarily determined by the total compute, parameter count, and data size, and is largely independent of the specific architecture (layer depth or width).

When the total parameter count is fixed, varying depth/width changes performance by less than 2 %.

Further observations:

When not constrained by the other two factors, model performance scales as a power law with each factor (compute, parameters, data).

Improving performance requires scaling both parameters and data together; the exact proportionality is still debated.

Scaling Laws apply not only to language models but also to multimodal and cross‑modal tasks.

Key Formula

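The loss is modeled as a constant plus a power law, in the form used by the scaling-law papers listed in the references:

L(x) = L_∞ + (x_0 / x)^α

Here x is the quantity being scaled (compute, in this discussion), and x_0 and α are fitted constants. The first term, L_∞, represents the irreducible loss (data entropy, e.g., noise). The second term is the reducible loss, which shrinks as compute increases; as compute approaches infinity, this term tends to zero and the overall loss approaches the irreducible limit.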
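A minimal sketch of this behavior, assuming the loss has the form L(C) = L_inf + (C0 / C)^alpha. The constants `l_inf`, `c0`, and `alpha` below are made-up illustrative values, not fitted results from any paper.

```python
def loss(compute, l_inf=1.7, c0=1.0e8, alpha=0.05):
    """Power law plus constant: L(C) = l_inf + (c0 / C) ** alpha.

    l_inf, c0 and alpha are hypothetical constants chosen for illustration.
    """
    return l_inf + (c0 / compute) ** alpha


# The reducible part shrinks as compute grows; the loss never drops below l_inf.
for c in (1e18, 1e20, 1e22, 1e24):
    total = loss(c)
    print(f"C={c:.0e}  loss={total:.4f}  reducible={total - 1.7:.4f}")
```

Each 100× step in compute multiplies the reducible term by the same factor (100^-alpha), which is why the curve looks like a straight line on log-log axes once the irreducible term is subtracted.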

Scaling Law in Practice: Compute‑Optimal Training

Because performance eventually saturates, simply adding more data without increasing model size yields diminishing returns. The practical workflow is:

Gather a large dataset (e.g., 1 T tokens).

Train a series of small models (0.001 B – 1 B parameters) to convergence on the same data.

Record the compute cost and resulting performance for each model.

Identify the compute‑optimal point where, for a given compute budget, the model‑data combination delivers the best performance.

Empirical results show a clear power‑law relationship between compute and performance. At the compute‑optimal point, the optimal model size and the optimal data size each follow a power law in compute as well, which appears as a straight line on log‑log axes.
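One way to turn small-model runs into a prediction for a larger budget is to fit a straight line in log-log space and extrapolate. The sketch below does this with a plain least-squares fit; the data points are synthetic (generated from P = 0.1 * C^0.5), and the function name `fit_power_law` is introduced here for illustration only.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b, done as a line fit on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    b = sxy / sxx                       # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)   # intercept in log space = log(a)
    return a, b


# Synthetic compute-optimal points (FLOPs, parameters); not real measurements.
compute = [1e18, 1e19, 1e20, 1e21]
params = [0.1 * c ** 0.5 for c in compute]

a, b = fit_power_law(compute, params)
predicted = a * 1e22 ** b  # extrapolate to a 10x larger budget
print(f"exponent={b:.3f}  predicted optimal params at 1e22 FLOPs={predicted:.3e}")
```

With real runs the points will scatter around the line, so the fitted exponent carries uncertainty that grows with the extrapolation distance.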

Different research groups interpret the trade‑off differently:

OpenAI argues that model size matters more: when compute increases tenfold, parameters should grow about 5.5× while data grows only about 1.8×.

DeepMind (Chinchilla) and Google (PaLM) find that model parameters and data should be scaled equally, which implies roughly a 3.16× (√10) increase for each when compute grows tenfold.

For example, in PaLM’s experiments a 10× increase in compute moved the optimal model size from 3.2 B to 10.7 B parameters, a factor of about 3.3×, which is consistent with the equal‑importance view.
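The two positions can be compared numerically. The sketch below assumes P ∝ C^a and D ∝ C^(1 - a), which follows if compute is proportional to P × D. The exponent 0.73 is the approximate value reported in the OpenAI 2020 paper, and 0.5 corresponds to equal scaling; `scale_factors` is a helper name introduced here.

```python
def scale_factors(compute_multiplier, param_exponent):
    """Return (parameter multiplier, data multiplier) for a compute multiplier.

    Assumes P ~ C**a and D ~ C**(1 - a), i.e. compute proportional to P * D.
    """
    data_exponent = 1.0 - param_exponent
    return (compute_multiplier ** param_exponent,
            compute_multiplier ** data_exponent)


# a = 0.73: parameter-heavy scaling; a = 0.50: parameters and data scaled equally.
for label, a in (("parameter-heavy, a=0.73", 0.73), ("equal scaling, a=0.50", 0.50)):
    p_mult, d_mult = scale_factors(10.0, a)
    print(f"{label}: params x{p_mult:.2f}, data x{d_mult:.2f}")
```

For a 10× compute increase this gives roughly 5.4× parameters and 1.9× data in the parameter-heavy case, and about 3.16× for both under equal scaling. In either case the two multipliers multiply back to 10.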

Derivation of Compute‑Model‑Data Relationship

For a decoder‑only transformer, let:

C be total FLOPs,

P be the number of parameters (excluding embeddings, norm, and bias),

D be the total token count.

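The parameter count per layer follows from the attention hidden dimension (d) and the feed‑forward dimension (d_ff). Each layer holds four d × d attention matrices (query, key, value, and output projections) and two MLP matrices of size d × d_ff. Assuming L layers, the total parameter count is:

P = L * (4 * d^2 + 2 * d * d_ff)

With the common choice d_ff = 4 * d, this becomes P = 12 * L * d^2.

Counting one multiply‑add as 2 FLOPs, the forward‑pass FLOPs per token for one layer consist of:

Input linear projections (Q, K, V): 3 * 2 * d * d = 6 * d^2 FLOPs,

Attention scores (Q·K^T): 2 * S * d, where S is the sequence length,

Score‑value multiplication: 2 * S * d,

Output linear projection: 2 * d * d,

MLP up‑projection and down‑projection: 2 * d * d_ff + 2 * d_ff * d.

Summing over L layers and multiplying by the B * S tokens in a batch of B sequences gives the forward compute for that batch:

C_forward ≈ 2 * B * S * L * (4 * d^2 + 2 * d * d_ff + 2 * d * S)

When S is much smaller than 6 * d (taking d_ff = 4 * d), the attention term 2 * d * S is negligible and the bracket reduces to P / L, so the forward pass costs about 2 * P FLOPs per token. The backward pass costs roughly twice as much as the forward pass, which gives about 6 * P FLOPs per token for training. Multiplying by the total token count D leads to the compact relationship:

C ≈ 6 * P * D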
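The widely used approximation C ≈ 6 * P * D can be checked numerically. The sketch below assumes four d × d attention matrices and two d × d_ff MLP matrices per layer, counts a multiply-add as 2 FLOPs, and assumes the backward pass costs about twice the forward pass. The configuration values are hypothetical and do not describe any specific model.

```python
def layer_params(d, d_ff):
    """Weights in one layer: 4 attention matrices (d x d) + 2 MLP matrices (d x d_ff)."""
    return 4 * d * d + 2 * d * d_ff


def forward_flops_per_token(n_layers, d, d_ff, seq_len):
    """Forward FLOPs per token, counting each multiply-add as 2 FLOPs."""
    projections = 2 * 4 * d * d        # Q, K, V and output projections
    attention = 2 * 2 * seq_len * d    # QK^T scores + score-value product
    mlp = 2 * 2 * d * d_ff             # up- and down-projection
    return n_layers * (projections + attention + mlp)


def training_flops(n_params, n_tokens):
    """C ~ 6 * P * D: forward ~ 2P per token, backward assumed ~ 2x forward."""
    return 6 * n_params * n_tokens


# Hypothetical configuration, chosen only to exercise the formulas.
n_layers, d, seq_len = 40, 5120, 2048
d_ff = 4 * d

P = n_layers * layer_params(d, d_ff)   # equals 12 * L * d^2 when d_ff = 4d
fwd = forward_flops_per_token(n_layers, d, d_ff, seq_len)
ratio = fwd / (2 * P)                  # how far the exact count is from 2P

print(f"P={P:.3e}  forward FLOPs/token={fwd:.3e}  ratio to 2P={ratio:.3f}")
print(f"training FLOPs for 1e12 tokens ~ {training_flops(P, 1e12):.3e}")
```

For this configuration the exact forward count exceeds 2 * P by a factor of about 1.07, equal to 1 + S / (6 * d), which shows how small the attention term is when the sequence length is modest relative to the hidden dimension.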

References

Scaling Laws for Neural Language Models (OpenAI, 2020)

Training Compute‑Optimal Large Language Models (Chinchilla, DeepMind)

PaLM 2 Technical Report

Scaling Laws for Autoregressive Generative Modeling

GPT‑4 Technical Report

Baichuan 2: Open Large‑scale Language Models

MindLLM: Pre‑training Lightweight Large Language Model from Scratch

LLaMA: Open and Efficient Foundation Language Models
