Artificial Intelligence

How ChatGPT Works: Model Architecture, Training Strategies, and RLHF

ChatGPT, OpenAI’s latest language model, builds on GPT‑3 with supervised fine‑tuning and Reinforcement Learning from Human Feedback (RLHF) using PPO, aligning model outputs with human preferences to address misalignment. This article covers the training pipeline, evaluation methods, and known limitations.

IT Architects Alliance

ChatGPT is OpenAI’s newest large‑scale language model. It improves on GPT‑3 by generating text in diverse styles with higher accuracy, richer detail, and better contextual coherence, with a particular emphasis on interactive, dialogue‑based use.

The model is first fine‑tuned on a small, high‑quality supervised dataset (SFT), then refined through Reinforcement Learning from Human Feedback (RLHF), where human annotators rank model outputs to train a Reward Model (RM), and finally the SFT model is further optimized with Proximal Policy Optimization (PPO), incorporating a KL‑penalty to prevent over‑optimization.
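The KL‑penalty mentioned above can be sketched as follows: the PPO reward is the reward model's score minus a penalty for drifting away from the frozen SFT policy. This is a minimal illustration; the `beta` coefficient and all probability values are made up, and real implementations apply this per token over full responses.

```python
import math

def rlhf_reward(rm_score, policy_logprob, sft_logprob, beta=0.02):
    """Reward used by PPO: RM score penalized for drifting from the SFT policy.

    A common per-token estimate of the KL term is
    log pi_policy(token) - log pi_SFT(token).
    """
    kl_estimate = policy_logprob - sft_logprob
    return rm_score - beta * kl_estimate

# If the tuned policy assigns a token higher probability than the SFT model
# does, the KL estimate is positive and the reward is reduced accordingly.
r = rlhf_reward(rm_score=1.5,
                policy_logprob=math.log(0.6),
                sft_logprob=math.log(0.5))
```

The penalty keeps the optimized policy close to the SFT model, which is what prevents PPO from over‑optimizing against the reward model.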

Misalignment—where a model’s objective function diverges from human expectations—is highlighted as a core challenge of next‑token‑prediction and masked‑language‑modeling training objectives, leading to issues such as unhelpful answers, fabricated facts, lack of explainability, and bias.
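The root of the issue is that pretraining only minimizes the cross‑entropy of the next token, not human preference. A toy version of that objective, with an illustrative vocabulary and probabilities (not from any real model):

```python
import math

def next_token_loss(predicted_probs, target_token):
    """Negative log-likelihood of the correct next token (cross-entropy)."""
    return -math.log(predicted_probs[target_token])

# The model assigns a probability to each candidate next token; the loss only
# cares about matching the training text, not about helpfulness or truth.
probs = {"cat": 0.7, "dog": 0.2, "car": 0.1}
loss = next_token_loss(probs, "cat")  # low loss: model already prefers "cat"
```

A model can drive this loss very low while still producing outputs humans find unhelpful or false, which is the gap RLHF tries to close.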

RLHF consists of three steps: (1) supervised fine‑tuning on collected prompt‑response pairs; (2) training the RM by having annotators rank multiple SFT outputs for each prompt; (3) optimizing the SFT policy with PPO against the RM: the environment samples random prompts, the policy generates responses, and the RM scores each response as the reward.
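Step (2) is typically trained with a pairwise ranking loss: for each annotated pair, the RM should score the preferred response above the rejected one. A minimal sketch with illustrative scalar scores:

```python
import math

def rm_ranking_loss(score_preferred, score_rejected):
    """-log sigmoid(r_w - r_l): small when the preferred response wins."""
    diff = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# A large positive margin gives near-zero loss; a reversed ranking is punished.
good = rm_ranking_loss(2.0, 0.5)  # preferred response scored higher: low loss
bad = rm_ranking_loss(0.5, 2.0)   # ranking violated: high loss
```

A ranking of K responses per prompt is decomposed into all pairwise comparisons, which is how the annotators' orderings become a scalar reward signal for PPO.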

Model performance is evaluated on three human‑rated criteria—helpfulness, truthfulness, and harmlessness—using prompts unseen during training, and on zero‑shot NLP tasks where an “alignment tax” is observed: alignment via RLHF can reduce performance on some tasks compared to the base model.

Key limitations include the subjectivity of annotator preferences, lack of controlled experiments to isolate RLHF benefits, potential bias from the data collection pipeline, uncertainty about RM stability across paraphrased prompts, and the risk of over‑optimizing to the RM, which can produce undesirable patterns.

Relevant literature cited includes the RLHF paper (Training language models to follow instructions with human feedback), summarization with human feedback, the original PPO algorithm, and alternative alignment approaches such as DeepMind’s Sparrow and GopherCite.

Large Language Models · ChatGPT · RLHF · PPO · AI alignment
Written by

IT Architects Alliance

Discussion and exchange on system, internet, large‑scale distributed, high‑availability, and high‑performance architectures, as well as big data, machine learning, AI, and architecture adjustments with internet technologies. Includes real‑world large‑scale architecture case studies. Open to architects who have ideas and enjoy sharing.
