Llama 2: Open Foundation and Fine‑Tuned Chat Models – Overview, Training, and RLHF Details
This article provides a comprehensive English overview of Meta's Llama 2 family, describing the model sizes, pre‑training data, architectural improvements, supervised fine‑tuning, reinforcement learning with human feedback, safety evaluations, reward‑model training, and iterative optimization techniques used to produce the high‑performing Llama 2‑Chat models.
Original Information
Name: Llama 2: Open Foundation and Fine‑Tuned Chat Models
Translation: Llama 2 – Open‑source Foundation and Chat Models
Paper: https://arxiv.org/pdf/2307.09288.pdf
Code: https://huggingface.co/meta-llama
Date: 2023‑07‑19
LLaMA full name: Large Language Model Meta AI
Introduction
Large language models (LLMs) act as powerful AI assistants capable of complex reasoning across domains such as programming and creative writing, and they interact with users through intuitive chat interfaces.
"The method is conceptually simple, but high‑capacity LLMs demand massive compute and human‑annotation costs."
LLMs are built from massive self‑supervised corpora and aligned to human preferences via techniques such as Reinforcement Learning with Human Feedback (RLHF). While the training method is conceptually simple, the required compute limits development to a few large companies. Open‑source models like BLOOM, LLaMA‑1, and Falcon approach the performance of closed‑source models (GPT‑3, Chinchilla) but are not yet direct replacements for highly fine‑tuned closed models such as ChatGPT, Bard, or Claude.
Meta released Llama 2 (7B‑70B parameters) and Llama 2‑Chat (fine‑tuned for dialogue). Extensive usability and safety evaluations show Llama 2‑Chat outperforms other open‑source chat models on many benchmarks, and Meta recommends it as a viable alternative to proprietary models.
Published Llama 2 Versions for Research and Commercial Use
Llama 2: upgraded from Llama 1 with 40% more pre‑training data, a doubled context length (4,096 tokens), and grouped‑query attention (used in the 34B and 70B variants). Released in 7B, 13B, and 70B sizes; a 34B version was trained but not released.
Llama 2‑Chat: dialogue‑optimized version of the above, also in 7B, 13B, and 70B.
Meta notes that the new techniques carry potential risks, so safety testing and adjustments are required before deployment.
Llama 2‑Chat Training Process Diagram
The diagram shows that Llama 2‑Chat starts from a Llama 2 model pre‑trained on publicly available data, undergoes supervised fine‑tuning, and is then refined with RLHF via rejection sampling and Proximal Policy Optimization (PPO). Reward models are updated iteratively alongside the policy so that they stay in distribution during RLHF.
Pre‑training
Pre‑training Data
Meta assembled a 2‑trillion‑token corpus from publicly available sources (CommonCrawl, C4, GitHub, Wikipedia, Books, arXiv, StackExchange) while removing personal‑information‑rich sites. This data mix balances performance and cost and helps reduce hallucinations.
Pre‑training data sources include: CommonCrawl, C4, GitHub, Wikipedia, Books, arXiv, StackExchange.
Training Details
The architecture mirrors Llama 1, with pre‑normalization via RMSNorm, the SwiGLU activation, and rotary positional embeddings (RoPE), and adds a longer 4,096‑token context window and grouped‑query attention (GQA) in the larger variants. Training used the AdamW optimizer, a cosine learning‑rate schedule with a peak of 3 × 10⁻⁴ (7B/13B) or 1.5 × 10⁻⁴ (34B/70B) and 2,000 warm‑up steps, weight decay of 0.1, and gradient clipping at 1.0.
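Grouped‑query attention reduces the key/value cache by letting several query heads share one key/value head. A minimal NumPy sketch of the idea (single sequence, no causal masking, hypothetical shapes):

```python
import numpy as np

def grouped_query_attention(q, k, v, n_q_heads, n_kv_heads):
    """Minimal grouped-query attention sketch.

    q: (seq, n_q_heads, d)    -- one query per query head
    k, v: (seq, n_kv_heads, d) -- shared KV heads, n_kv_heads < n_q_heads
    """
    assert n_q_heads % n_kv_heads == 0
    group = n_q_heads // n_kv_heads          # query heads per shared KV head
    d = q.shape[-1]
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                      # map query head -> its KV head
        scores = q[:, h] @ k[:, kv].T / np.sqrt(d)
        # row-wise softmax over key positions
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[:, h] = weights @ v[:, kv]
    return out
```

With 8 query heads sharing 2 KV heads, the KV cache shrinks fourfold while the output keeps its full per‑query‑head shape.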
Table 1 compares Llama 2 to Llama 1 (image omitted for brevity).
Figure 5 shows the loss curve during Llama 2 training.
Tokenizer
Llama 2 uses the same BPE tokenizer as Llama 1 (SentencePiece implementation) with a 32k vocabulary, splitting numbers into individual digits and handling unknown UTF‑8 bytes.
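The digit‑splitting rule can be mimicked with a simple pre‑tokenization pass; this is only an illustration of the behavior, not the actual SentencePiece pipeline:

```python
import re

def pretokenize_digits(text: str):
    """Split text so every individual digit becomes its own chunk,
    mimicking the digit-splitting rule of the Llama tokenizer
    (illustrative stand-in, not the real implementation)."""
    # re.split with a capturing group keeps the digit delimiters;
    # filter out the empty strings it produces between adjacent digits.
    return [chunk for chunk in re.split(r"(\d)", text) if chunk]

pretokenize_digits("year 2023")  # -> ['year ', '2', '0', '2', '3']
```

Splitting numbers digit by digit keeps arithmetic‑relevant tokens uniform instead of memorizing arbitrary multi‑digit chunks.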
Training Hardware
Pre‑training ran on Meta’s Research Super Cluster (RSC) and an internal production cluster, both equipped with NVIDIA A100 GPUs. RSC uses InfiniBand (400 W GPU power limit) while the production cluster uses RoCE over Ethernet (350 W limit), both providing 200 Gbps interconnect.
Carbon Footprint
Meta estimates GPU power draw from utilization; the figures exclude interconnect and cooling power, and manufacturing the hardware adds further emissions not counted in the reported footprint.
Fine‑tuning
Llama 2‑Chat results from months of alignment research, including supervised fine‑tuning (SFT) and RLHF, both of which demand large compute and annotation resources.
Supervised fine‑tuning (SFT)
RLHF
Initialization and iterative reward modeling
Ghost Attention (GAtt) for multi‑turn dialogue control
Post‑fine‑tuning safety evaluation
Supervised Fine‑tuning (SFT)
Quality over Quantity
Meta found that a few thousand high‑quality SFT samples outperform millions of lower‑quality third‑party examples, and discarded the latter. Collection stopped after annotating 27,540 high‑quality prompt‑response pairs (excluding any Meta user data).
(The paper's example table shows two annotations: a helpfulness prompt with its response, and a safety prompt with its response.)
Manual inspection of 180 samples showed that SFT outputs often rival human‑written data, suggesting that improving SFT quality can outweigh sheer quantity.
Fine‑tuning Details
Training used a cosine learning‑rate schedule with an initial learning rate of 2 × 10⁻⁵, weight decay of 0.1, a batch size of 64, and a sequence length of 4,096 tokens. Prompts and answers were concatenated with a special separator token, and the loss was back‑propagated only on answer tokens (prompt tokens were zero‑masked). The model was fine‑tuned for 2 epochs.
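Masking the prompt so only response tokens contribute to the loss can be sketched as follows, assuming a per‑token cross‑entropy vector has already been computed:

```python
import numpy as np

def masked_sft_loss(token_losses, prompt_len):
    """Average cross-entropy over response tokens only.

    Prompt tokens are zero-masked so they contribute no gradient,
    matching the SFT setup where only answer tokens are trained on.
    """
    token_losses = np.asarray(token_losses, dtype=float)
    mask = np.zeros_like(token_losses)
    mask[prompt_len:] = 1.0                  # 1 for response, 0 for prompt
    return (token_losses * mask).sum() / mask.sum()

# Prompt occupies the first 2 positions; only the last 2 count.
masked_sft_loss([9.0, 9.0, 2.0, 4.0], prompt_len=2)  # -> 3.0
```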
RLHF – Reinforcement Learning with Human Feedback
RLHF aligns the model with human preferences by collecting binary comparisons of two model outputs for the same prompt. Annotators also rate the preference level (very good, good, etc.) and label safety.
Human Preference Data Collection
Annotators write a prompt, then choose the preferred answer from two model variants (sampled with different temperatures and hyper‑parameters). They also assign a preference rating and a safety label (preferred response safe, both safe, both unsafe, etc.). The resulting Meta dataset contains roughly 1.4 million binary comparisons, combined with existing open‑source preference data.
Reward Modeling
The reward model receives a prompt and a completion and outputs a scalar score reflecting usefulness and safety. Two separate reward models are trained (one for usefulness, one for safety) to avoid trade‑off conflicts.
Training uses a binary ranking loss with a margin that scales with the annotators’ rating granularity.
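The ranking loss described, with a margin that grows when annotators rate one response as clearly better, can be sketched as:

```python
import math

def ranking_loss(r_chosen, r_rejected, margin=0.0):
    """Binary ranking loss with a preference-dependent margin:
    L = -log(sigmoid(r_chosen - r_rejected - margin)).

    A larger margin (for 'significantly better' ratings) forces the
    reward model to put a bigger gap between the two responses.
    """
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected - margin))))
```

With equal scores and no margin the loss is log 2; adding a margin raises the loss until the chosen response's score clears the rejected one by at least that margin.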
Reward models are trained for 1 epoch with the same optimizer as the base model (AdamW, 0.1 weight decay, gradient clipping = 1.0). Learning‑rate schedule mirrors the base model, with a 3 % warm‑up and cosine decay to 10 % of the peak.
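The schedule described (3% linear warm‑up, then cosine decay to 10% of the peak) can be sketched as a small step‑to‑learning‑rate function; the argument names are illustrative:

```python
import math

def reward_model_lr(step, total_steps, peak_lr,
                    warmup_frac=0.03, floor_frac=0.10):
    """Cosine schedule sketch: linear warm-up over the first 3% of
    steps, then cosine decay from the peak down to 10% of the peak."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps        # linear warm-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    floor = peak_lr * floor_frac
    return floor + 0.5 * (peak_lr - floor) * (1 + math.cos(math.pi * progress))
```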
Iterative Fine‑tuning
Multiple RLHF versions (V1‑V5) were trained. Two main algorithms were used:
Proximal Policy Optimization (PPO) – the standard RLHF approach.
Rejection Sampling Fine‑tuning – sample K outputs, select the highest‑rewarded one, and treat it as a new gold standard for further fine‑tuning.
Early versions relied solely on rejection sampling; later versions combined both methods. The 70B model’s rejection‑sampled outputs were also used to improve smaller models.
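Rejection‑sampling fine‑tuning reduces to a best‑of‑K selection step; here `generate` and `reward` are hypothetical stand‑ins for the policy and the reward model:

```python
def rejection_sampling_step(prompt, generate, reward, k=4):
    """Best-of-K sketch: sample K candidate responses, score each with
    the reward model, and keep the highest-scoring one.

    The (prompt, best) pair then serves as a new gold-standard example
    for further supervised fine-tuning.
    """
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=lambda resp: reward(prompt, resp))
```

Because only the winning sample is trained on, the policy is nudged toward its own highest‑reward behavior without an explicit RL update.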
PPO used a batch size of 512, a clip threshold of 0.2, a mini‑batch size of 64, and a KL‑penalty coefficient β of 0.01 for the 7B/13B models and 0.005 for the 34B/70B models, trained for 200–400 iterations per model. Fully Sharded Data Parallel (FSDP) training handled the large memory footprint.
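The role of β is to penalize divergence from the SFT reference policy inside the PPO reward. A minimal per‑token sketch, using a sample‑based KL estimate:

```python
def ppo_reward(rm_score, logprob_policy, logprob_ref, beta=0.01):
    """PPO reward sketch: reward-model score minus a KL penalty that
    keeps the policy close to the SFT reference.

    beta matches the values reported: 0.01 for 7B/13B, 0.005 for
    34B/70B. The KL term here is the simple sample-based estimate
    log p_policy - log p_ref.
    """
    kl = logprob_policy - logprob_ref
    return rm_score - beta * kl
```

If the policy assigns a token much higher probability than the reference does, the penalty grows, discouraging reward hacking that drifts far from the SFT distribution.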
Appendix
Model URLs
Llama2‑Chinese‑13b‑Chat‑4bit: https://huggingface.co/FlagAlpha/Llama2‑Chinese‑13b‑Chat‑4bit
Atom‑7B: https://huggingface.co/FlagAlpha/Atom‑7B
Atom‑7B‑Chat: https://huggingface.co/FlagAlpha/Atom‑7B‑Chat
Llama‑2‑7b‑hf: https://huggingface.co/meta‑llama/Llama‑2‑7b‑hf
Llama‑2‑70b‑hf: https://huggingface.co/meta‑llama/Llama‑2‑70b‑hf
Llama‑2‑13b‑hf: https://huggingface.co/meta‑llama/Llama‑2‑13b‑hf
Llama‑2‑13b‑chat‑hf: https://huggingface.co/meta‑llama/Llama‑2‑13b‑chat‑hf
Llama‑2‑70b‑chat‑hf: https://huggingface.co/meta‑llama/Llama‑2‑70b‑chat‑hf
Llama‑2‑7b‑chat‑hf: https://huggingface.co/meta‑llama/Llama‑2‑7b‑chat‑hf
Llama‑2‑7b: https://huggingface.co/meta‑llama/Llama‑2‑7b
Llama‑2‑13b: https://huggingface.co/meta‑llama/Llama‑2‑13b
Llama‑2‑70b: https://huggingface.co/meta‑llama/Llama‑2‑70b
Llama‑2‑7b‑chat: https://huggingface.co/meta‑llama/Llama‑2‑7b‑chat
Llama‑2‑13b‑chat: https://huggingface.co/meta‑llama/Llama‑2‑13b‑chat
Llama‑2‑70b‑chat: https://huggingface.co/meta‑llama/Llama‑2‑70b‑chat
Glossary
Red Teaming – adversarial testing to uncover model vulnerabilities, bias, and safety issues.
PPO (Proximal Policy Optimization) – a gradient‑based policy‑optimization algorithm.
RMSNorm – root‑mean‑square layer normalization.
Cosine Learning‑Rate Decay – a schedule that smoothly reduces the learning rate following a cosine curve.
MPT (MosaicML Pretrained Transformer) – MosaicML's series of open‑source pre‑trained language models.
Ghost Attention (GAtt) – a technique that preserves adherence to a system instruction across multiple dialogue turns.
Bootstrap – statistical resampling method for estimating quantities such as the mean.
Temperature parameter – controls randomness and diversity in generative model outputs.
References
Original paper: Llama 2: Open Foundation and Fine‑Tuned Chat Models (arXiv:2307.09288).