
Technical Architecture and Training Process of ChatGPT

ChatGPT, a dialogue-focused language model, builds on the GPT family and employs techniques such as Reinforcement Learning from Human Feedback (RLHF), the TAMER framework, and a three-stage training pipeline (supervised fine‑tuning, reward modeling, and PPO reinforcement learning) to achieve advanced conversational capabilities.

DataFunSummit

ChatGPT is a dialogue‑oriented language model that generates intelligent responses based on user input, ranging from short phrases to long essays. GPT stands for Generative Pre‑trained Transformer.

This article is excerpted from a Zhihu post titled “ChatGPT: Development History, Principles, Technical Architecture Details, and Industry Future,” and focuses on ChatGPT’s technical architecture.

The GPT family has evolved from GPT‑1 and GPT‑2 to GPT‑3, with each successive model growing substantially in size: GPT‑1 has 12 Transformer layers, while GPT‑3 expands to 96.

InstructGPT/GPT‑3.5 (the predecessor of ChatGPT) introduces Reinforcement Learning from Human Feedback (RLHF), which enhances the model’s ability to adjust its outputs to human preferences. The criteria for judging how “good” an output is include truthfulness, harmlessness, and helpfulness.

The TAMER (Training an Agent Manually via Evaluative Reinforcement) framework incorporates human annotators into the learning loop, providing reward feedback to accelerate training, reduce convergence time, and lower data collection costs, without requiring annotators to have deep technical expertise.
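The idea behind TAMER can be sketched in a few lines: the human’s scalar feedback is used to fit a model H(s, a) of human reward, and the agent then acts greedily with respect to that learned model. The table-based H, learning rate, and feedback values below are illustrative assumptions, not the framework’s actual implementation:

```python
def update_h(h_table, state, action, human_feedback, lr=0.5):
    """Nudge the predicted human reward H(s, a) toward the observed feedback."""
    key = (state, action)
    old = h_table.get(key, 0.0)
    h_table[key] = old + lr * (human_feedback - old)

def greedy_action(h_table, state, actions):
    """Act greedily with respect to the learned human-reward model."""
    return max(actions, key=lambda a: h_table.get((state, a), 0.0))

# A human annotator only gives simple scalar feedback -- no RL expertise needed.
h = {}
update_h(h, "s0", "left", human_feedback=-1.0)   # annotator disliked "left"
update_h(h, "s0", "right", human_feedback=+1.0)  # annotator approved "right"
choice = greedy_action(h, "s0", ["left", "right"])
```

Because the agent optimizes predicted human approval directly rather than a sparse environment reward, feedback like this can speed up convergence, which is the cost saving the article describes.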

ChatGPT’s training consists of three stages:

Stage 1 – Supervised Fine‑Tuning (SFT): Human annotators write high‑quality answers to sampled prompts, and GPT‑3.5 is fine‑tuned on these demonstrations to improve instruction following.
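The objective in this stage is ordinary next-token prediction on the annotator-written answers. A minimal sketch of that cross-entropy loss on a toy vocabulary (the tiny two-token example is hypothetical, not OpenAI’s implementation):

```python
import math

def sft_loss(probs_per_step, target_tokens):
    """Average next-token cross-entropy over a human demonstration.

    probs_per_step: list of dicts mapping token -> predicted probability
    target_tokens: the gold tokens from the annotator-written answer
    """
    assert len(probs_per_step) == len(target_tokens)
    total = 0.0
    for probs, gold in zip(probs_per_step, target_tokens):
        total += -math.log(probs[gold])  # penalize low probability on the gold token
    return total / len(target_tokens)

# Toy example: two generation steps, model fairly confident in the gold tokens
steps = [{"Paris": 0.8, "London": 0.2}, {".": 0.9, "!": 0.1}]
loss = sft_loss(steps, ["Paris", "."])
```

Minimizing this loss pushes the model’s distribution toward the demonstrated answers, which is what “improving instruction following” means concretely here.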

Stage 2 – Reward Model (RM) Training: Annotators rank multiple model responses for each question; these rankings form pairwise data to train a reward model that scores answer quality.
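The rankings are typically turned into a pairwise comparison loss: the reward model should score the answer ranked higher above the one ranked lower. A hedged sketch of that standard pairwise objective, with plain scalars standing in for the reward model’s outputs:

```python
import math

def pairwise_rm_loss(r_chosen, r_rejected):
    """-log(sigmoid(r_chosen - r_rejected)): small when the preferred answer
    already outscores the rejected one, large when the ranking is violated."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# The loss shrinks as the reward gap widens in the right direction
good_margin = pairwise_rm_loss(2.0, 0.0)  # preferred answer scored higher
bad_margin = pairwise_rm_loss(0.0, 2.0)   # ranking violated
```

Training on many such pairs yields a scalar scoring function that can later stand in for human judgment in Stage 3.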

Stage 3 – Proximal Policy Optimization (PPO): Using the reward model, the policy is updated via reinforcement learning, iteratively improving the model’s performance.
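PPO’s central trick is its clipped surrogate objective, which stops any single update from moving the policy too far from the one that collected the rewards. A minimal single-sample sketch (the epsilon and advantage values are illustrative, not ChatGPT’s actual hyperparameters):

```python
def ppo_clip_objective(new_prob, old_prob, advantage, eps=0.2):
    """Clipped PPO surrogate for one action.

    ratio measures how much the update changed the action's probability;
    clipping removes the incentive to push it beyond the [1-eps, 1+eps] band.
    """
    ratio = new_prob / old_prob
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: the gain is capped once the ratio exceeds 1 + eps
capped = ppo_clip_objective(new_prob=0.9, old_prob=0.3, advantage=1.0)
```

Maximizing this objective with the reward model supplying the advantages nudges the policy toward higher-scored answers while keeping each step conservative.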

Repeating stages 2 and 3 iteratively yields progressively higher‑quality ChatGPT models.

Tags: ChatGPT, reinforcement learning, RLHF, language model, GPT, TAMER
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
