Llama 2: Open Foundation and Fine‑Tuned Chat Models – Overview, Training, and RLHF Details
This article provides a comprehensive English overview of Meta's Llama 2 family, describing the model sizes, pre‑training data, architectural improvements, supervised fine‑tuning, reinforcement learning with human feedback, safety evaluations, reward‑model training, and iterative optimization techniques used to produce the high‑performing Llama 2‑Chat models.
Original Information
Name: Llama 2: Open Foundation and Fine‑Tuned Chat Models
Translation: Llama 2 – Open‑source Foundation and Chat Models
Paper: https://arxiv.org/pdf/2307.09288.pdf
Code: https://huggingface.co/meta-llama
Date: 2023‑07‑19
LLaMA full name: Large Language Model Meta AI
Introduction
Large language models (LLMs) act as powerful AI assistants capable of complex reasoning across domains such as programming and creative writing, and they interact with users through intuitive chat interfaces.
"The method is conceptually simple, but high‑capacity LLMs demand massive compute and human‑annotation costs."
LLMs are built from massive self‑supervised corpora and aligned to human preferences via techniques such as Reinforcement Learning with Human Feedback (RLHF). While the training method is conceptually simple, the required compute limits development to a few large companies. Open‑source models like BLOOM, LLaMA‑1, and Falcon approach the performance of closed‑source models (GPT‑3, Chinchilla) but are not yet direct replacements for highly fine‑tuned closed models such as ChatGPT, Bard, or Claude.
Meta released Llama 2 (7B‑70B parameters) and Llama 2‑Chat (fine‑tuned for dialogue). Extensive usability and safety evaluations show Llama 2‑Chat outperforms other open‑source chat models on many benchmarks, and Meta recommends it as a viable alternative to proprietary models.
Published Llama 2 Versions for Research and Commercial Use
Llama 2: upgraded from Llama 1 with 40% more pre‑training data, a doubled context length (4,096 tokens), and grouped‑query attention (used in the 34B and 70B variants). Released in 7B, 13B, and 70B sizes; a 34B version was trained but not released.
Llama 2‑Chat: dialogue‑optimized version of the above, also in 7B, 13B, and 70B.
Meta notes that the new techniques carry potential risks, so safety testing and adjustments are required before deployment.
Llama 2‑Chat Training Process Diagram
The diagram shows that Llama 2‑Chat starts from a Llama 2 model pre‑trained on publicly available data, undergoes supervised fine‑tuning, and is then refined with RLHF via rejection sampling and Proximal Policy Optimization (PPO). Reward models are updated iteratively alongside the policy so that they stay in distribution during RLHF.
Pre‑training
Pre‑training Data
Meta assembled a 2‑trillion‑token corpus from publicly available sources (CommonCrawl, C4, GitHub, Wikipedia, Books, arXiv, StackExchange) while removing personal‑information‑rich sites. This data mix balances performance and cost and helps reduce hallucinations.
Pre‑training data sources include: CommonCrawl, C4, GitHub, Wikipedia, Books, arXiv, StackExchange.
Training Details
The architecture mirrors Llama 1, with pre‑normalization via RMSNorm, the SwiGLU activation, and rotary positional embeddings (RoPE), and adds a longer 4,096‑token context window and grouped‑query attention (GQA) in the larger variants. Training used the AdamW optimizer, a cosine learning‑rate schedule with a peak of 3 × 10⁻⁴ (7B/13B) or 1.5 × 10⁻⁴ (34B/70B) and 2,000 warm‑up steps, weight decay of 0.1, and gradient clipping at 1.0.
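Grouped‑query attention reduces the key/value cache by letting several query heads share one key/value head. A minimal NumPy sketch of the idea (single sequence, no causal masking, hypothetical shapes):

```python
import numpy as np

def grouped_query_attention(q, k, v, n_q_heads, n_kv_heads):
    """Minimal grouped-query attention sketch.

    q: (seq, n_q_heads, d)    -- one query per query head
    k, v: (seq, n_kv_heads, d) -- shared KV heads, n_kv_heads < n_q_heads
    """
    assert n_q_heads % n_kv_heads == 0
    group = n_q_heads // n_kv_heads          # query heads per shared KV head
    d = q.shape[-1]
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                      # map query head -> its KV head
        scores = q[:, h] @ k[:, kv].T / np.sqrt(d)
        # row-wise softmax over key positions
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[:, h] = weights @ v[:, kv]
    return out
```

With 8 query heads sharing 2 KV heads, the KV cache shrinks fourfold while the output keeps its full per‑query‑head shape.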
Table 1 compares Llama 2 to Llama 1 (image omitted for brevity).
Figure 5 shows the loss curve during Llama 2 training.
Tokenizer
Llama 2 uses the same BPE tokenizer as Llama 1 (SentencePiece implementation) with a 32k vocabulary, splitting numbers into individual digits and handling unknown UTF‑8 bytes.
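The digit‑splitting rule can be mimicked with a simple pre‑tokenization pass; this is only an illustration of the behavior, not the actual SentencePiece pipeline:

```python
import re

def pretokenize_digits(text: str):
    """Split text so every individual digit becomes its own chunk,
    mimicking the digit-splitting rule of the Llama tokenizer
    (illustrative stand-in, not the real implementation)."""
    # re.split with a capturing group keeps the digit delimiters;
    # filter out the empty strings it produces between adjacent digits.
    return [chunk for chunk in re.split(r"(\d)", text) if chunk]

pretokenize_digits("year 2023")  # -> ['year ', '2', '0', '2', '3']
```

Splitting numbers digit by digit keeps arithmetic‑relevant tokens uniform instead of memorizing arbitrary multi‑digit chunks.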
Training Hardware
Pre‑training ran on Meta’s Research Super Cluster (RSC) and an internal production cluster, both equipped with NVIDIA A100 GPUs. RSC uses InfiniBand (400 W GPU power limit) while the production cluster uses RoCE over Ethernet (350 W limit), both providing 200 Gbps interconnect.
Carbon Footprint
Meta estimates GPU power draw from utilization; the figures exclude interconnect and cooling power, and manufacturing the hardware adds further emissions not counted in the reported footprint.
Fine‑tuning
Llama 2‑Chat results from months of alignment research, including supervised fine‑tuning (SFT) and RLHF, both of which demand large compute and annotation resources.
Supervised fine‑tuning (SFT)
RLHF
Initialization and iterative reward modeling
Ghost Attention (GAtt) for multi‑turn dialogue control
Post‑fine‑tuning safety evaluation
Supervised Fine‑tuning (SFT)
Quality over Quantity
Meta found that a few thousand high‑quality SFT samples outperform millions of lower‑quality third‑party examples, and discarded the latter. Collection stopped after annotating 27,540 high‑quality prompt‑response pairs (excluding any Meta user data).
(The paper's example table shows two annotations: a helpfulness prompt with its response, and a safety prompt with its response.)
Manual inspection of 180 samples showed that SFT outputs often rival human‑written data, suggesting that improving SFT quality can outweigh sheer quantity.
Fine‑tuning Details
Training used a cosine learning‑rate schedule with an initial learning rate of 2 × 10⁻⁵, weight decay of 0.1, a batch size of 64, and a sequence length of 4,096 tokens. Prompts and answers were concatenated with a special separator token, and the loss was back‑propagated only on answer tokens (prompt tokens were zero‑masked). The model was fine‑tuned for 2 epochs.
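Masking the prompt so only response tokens contribute to the loss can be sketched as follows, assuming a per‑token cross‑entropy vector has already been computed:

```python
import numpy as np

def masked_sft_loss(token_losses, prompt_len):
    """Average cross-entropy over response tokens only.

    Prompt tokens are zero-masked so they contribute no gradient,
    matching the SFT setup where only answer tokens are trained on.
    """
    token_losses = np.asarray(token_losses, dtype=float)
    mask = np.zeros_like(token_losses)
    mask[prompt_len:] = 1.0                  # 1 for response, 0 for prompt
    return (token_losses * mask).sum() / mask.sum()

# Prompt occupies the first 2 positions; only the last 2 count.
masked_sft_loss([9.0, 9.0, 2.0, 4.0], prompt_len=2)  # -> 3.0
```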
RLHF – Reinforcement Learning with Human Feedback
RLHF aligns the model with human preferences by collecting binary comparisons of two model outputs for the same prompt. Annotators also rate the preference level (very good, good, etc.) and label safety.
Human Preference Data Collection
Annotators write a prompt, then choose the preferred answer from two model variants (sampled with different temperatures and hyper‑parameters). They also assign a preference rating and a safety label (preferred response safe, both safe, both unsafe, etc.). The resulting Meta dataset contains roughly 1.4 million binary comparisons, combined with existing open‑source preference data.
Reward Modeling
The reward model receives a prompt and a completion and outputs a scalar score reflecting usefulness and safety. Two separate reward models are trained (one for usefulness, one for safety) to avoid trade‑off conflicts.
Training uses a binary ranking loss with a margin that scales with the annotators’ rating granularity.
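The ranking loss described, with a margin that grows when annotators rate one response as clearly better, can be sketched as:

```python
import math

def ranking_loss(r_chosen, r_rejected, margin=0.0):
    """Binary ranking loss with a preference-dependent margin:
    L = -log(sigmoid(r_chosen - r_rejected - margin)).

    A larger margin (for 'significantly better' ratings) forces the
    reward model to put a bigger gap between the two responses.
    """
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected - margin))))
```

With equal scores and no margin the loss is log 2; adding a margin raises the loss until the chosen response's score clears the rejected one by at least that margin.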
Reward models are trained for 1 epoch with the same optimizer as the base model (AdamW, 0.1 weight decay, gradient clipping = 1.0). Learning‑rate schedule mirrors the base model, with a 3 % warm‑up and cosine decay to 10 % of the peak.
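The schedule described (3% linear warm‑up, then cosine decay to 10% of the peak) can be sketched as a small step‑to‑learning‑rate function; the argument names are illustrative:

```python
import math

def reward_model_lr(step, total_steps, peak_lr,
                    warmup_frac=0.03, floor_frac=0.10):
    """Cosine schedule sketch: linear warm-up over the first 3% of
    steps, then cosine decay from the peak down to 10% of the peak."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps        # linear warm-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    floor = peak_lr * floor_frac
    return floor + 0.5 * (peak_lr - floor) * (1 + math.cos(math.pi * progress))
```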
Iterative Fine‑tuning
Multiple RLHF versions (V1‑V5) were trained. Two main algorithms were used:
Proximal Policy Optimization (PPO) – the standard RLHF approach.
Rejection Sampling Fine‑tuning – sample K outputs, select the highest‑rewarded one, and treat it as a new gold standard for further fine‑tuning.
Early versions relied solely on rejection sampling; later versions combined both methods. The 70B model’s rejection‑sampled outputs were also used to improve smaller models.
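Rejection‑sampling fine‑tuning reduces to a best‑of‑K selection step; here `generate` and `reward` are hypothetical stand‑ins for the policy and the reward model:

```python
def rejection_sampling_step(prompt, generate, reward, k=4):
    """Best-of-K sketch: sample K candidate responses, score each with
    the reward model, and keep the highest-scoring one.

    The (prompt, best) pair then serves as a new gold-standard example
    for further supervised fine-tuning.
    """
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=lambda resp: reward(prompt, resp))
```

Because only the winning sample is trained on, the policy is nudged toward its own highest‑reward behavior without an explicit RL update.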
PPO used a batch size of 512, a clip threshold of 0.2, a mini‑batch size of 64, and a KL‑penalty coefficient β of 0.01 for the 7B/13B models and 0.005 for the 34B/70B models, trained for 200–400 iterations per model. Fully Sharded Data Parallel (FSDP) training handled the large memory footprint.
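The role of β is to penalize divergence from the SFT reference policy inside the PPO reward. A minimal per‑token sketch, using a sample‑based KL estimate:

```python
def ppo_reward(rm_score, logprob_policy, logprob_ref, beta=0.01):
    """PPO reward sketch: reward-model score minus a KL penalty that
    keeps the policy close to the SFT reference.

    beta matches the values reported: 0.01 for 7B/13B, 0.005 for
    34B/70B. The KL term here is the simple sample-based estimate
    log p_policy - log p_ref.
    """
    kl = logprob_policy - logprob_ref
    return rm_score - beta * kl
```

If the policy assigns a token much higher probability than the reference does, the penalty grows, discouraging reward hacking that drifts far from the SFT distribution.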
Appendix
Model URLs
Llama2‑Chinese‑13b‑Chat‑4bit: https://huggingface.co/FlagAlpha/Llama2‑Chinese‑13b‑Chat‑4bit
Atom‑7B: https://huggingface.co/FlagAlpha/Atom‑7B
Atom‑7B‑Chat: https://huggingface.co/FlagAlpha/Atom‑7B‑Chat
Llama‑2‑7b‑hf: https://huggingface.co/meta‑llama/Llama‑2‑7b‑hf
Llama‑2‑70b‑hf: https://huggingface.co/meta‑llama/Llama‑2‑70b‑hf
Llama‑2‑13b‑hf: https://huggingface.co/meta‑llama/Llama‑2‑13b‑hf
Llama‑2‑13b‑chat‑hf: https://huggingface.co/meta‑llama/Llama‑2‑13b‑chat‑hf
Llama‑2‑70b‑chat‑hf: https://huggingface.co/meta‑llama/Llama‑2‑70b‑chat‑hf
Llama‑2‑7b‑chat‑hf: https://huggingface.co/meta‑llama/Llama‑2‑7b‑chat‑hf
Llama‑2‑7b: https://huggingface.co/meta‑llama/Llama‑2‑7b
Llama‑2‑13b: https://huggingface.co/meta‑llama/Llama‑2‑13b
Llama‑2‑70b: https://huggingface.co/meta‑llama/Llama‑2‑70b
Llama‑2‑7b‑chat: https://huggingface.co/meta‑llama/Llama‑2‑7b‑chat
Llama‑2‑13b‑chat: https://huggingface.co/meta‑llama/Llama‑2‑13b‑chat
Llama‑2‑70b‑chat: https://huggingface.co/meta‑llama/Llama‑2‑70b‑chat
Glossary
Red Teaming – adversarial testing to uncover model vulnerabilities, bias, and safety issues.
PPO (Proximal Policy Optimization) – a gradient‑based policy‑optimization algorithm.
RMSNorm – root‑mean‑square layer normalization.
Cosine Learning‑Rate Decay – a schedule that smoothly reduces the learning rate following a cosine curve.
MPT (MosaicML Pretrained Transformer) – MosaicML's series of open‑source pre‑trained language models.
Ghost Attention (GAtt) – a technique that preserves adherence to a system instruction across multiple dialogue turns.
Bootstrap – statistical resampling method for estimating quantities such as the mean.
Temperature parameter – controls randomness and diversity in generative model outputs.
References
Original paper: Llama 2: Open Foundation and Fine‑Tuned Chat Models (arXiv:2307.09288).