ChatGPT Technical Analysis Series – Part 2: GPT‑1, GPT‑2, and GPT‑3 (Encoder vs Decoder, Zero‑Shot, and Scaling)
This article reviews the evolution of the GPT family from GPT‑1 to GPT‑3, comparing encoder‑only and decoder‑only architectures, explaining the shift from supervised fine‑tuning to zero‑shot and few‑shot learning, and highlighting the architectural and training choices that enabled large‑scale language models.
In this second article of the "ChatGPT Technical Analysis" series, the author examines the GPT family (GPT‑1, GPT‑2, GPT‑3) and contrasts it with BERT, focusing on the encoder‑only vs decoder‑only split, zero‑shot learning, and scaling strategies.
1. Encoder vs Decoder: BERT vs GPT‑1
1.1 Timeline
2017 – Google introduces the Transformer, replacing recurrence and convolutions with attention.
June 2018 – OpenAI releases the first GPT model (decoder‑only), demonstrating the effectiveness of "pre‑training + fine‑tuning" for NLP.
October 2018 – Google releases BERT (encoder‑only), which outperforms GPT‑1 at a comparable parameter budget.
February 2019 – OpenAI launches GPT‑2 (1.5 B parameters) with zero‑shot capability, showing that larger models and more data can reduce the need for task‑specific fine‑tuning.
June 2020 – GPT‑3 is released with 175 B parameters; training cost reportedly exceeds $12 million, establishing a new performance ceiling.
1.2 GPT‑1 Design Philosophy
(1) Motivation
In computer vision, "pre‑training + fine‑tuning" has long been standard, but NLP lacked large labeled corpora and struggled to model textual semantics. The Transformer’s strong contextual modeling enabled the creation of GPT‑1.
(2) Pre‑training
GPT‑1 is trained on massive unlabeled text to learn "text continuation" (predict the next token given previous tokens). Its architecture follows the Transformer decoder, removing cross‑attention to the encoder.
Key attention mechanisms:
Decoder uses Masked‑Attention (each token sees only its left context).
Encoder uses Standard Attention (tokens see both left and right context).
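The difference between the two can be sketched with a causal (lower‑triangular) mask applied before the attention softmax. This is a minimal NumPy illustration, not the actual GPT‑1 implementation:

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Set disallowed positions to -inf before softmax so they get zero weight."""
    scores = np.where(mask, scores, -np.inf)
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # uniform attention scores, for illustration only
weights = masked_softmax(scores, causal_mask(4))
# Row 0 attends only to token 0; row 3 attends to all four tokens equally.
```

An encoder's standard attention is the same computation with an all‑True mask, which is why each token can also see its right context.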
(3) Fine‑tuning
After pre‑training, GPT‑1 is fine‑tuned on four supervised tasks:
Classification
Entailment
Similarity
Multiple‑Choice
All tasks share a unified input format: special start/delimiter/extract tokens are added, the structured text is fed to the model, and a linear head on the final hidden state produces the prediction. The linear head and the special token embeddings are the only newly introduced parameters; all of the pre‑trained weights are also updated during fine‑tuning.
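The unified format can be sketched as follows. The token names here are illustrative; the GPT‑1 paper uses learned start, delimiter, and extract embeddings rather than literal strings:

```python
def format_input(task: str, *texts: str) -> str:
    """Wrap raw text in the special tokens GPT-1 adds for each supervised task
    (hypothetical string form of the paper's learned special embeddings)."""
    START, DELIM, EXTRACT = "<start>", "<delim>", "<extract>"
    if task == "classification":
        (text,) = texts
        return f"{START} {text} {EXTRACT}"
    if task == "entailment":
        premise, hypothesis = texts
        return f"{START} {premise} {DELIM} {hypothesis} {EXTRACT}"
    raise ValueError(f"unsupported task: {task}")

# The hidden state at the <extract> position feeds the linear classification head.
example = format_input("entailment", "A dog runs.", "An animal moves.")
```

Similarity and multiple‑choice tasks follow the same pattern, concatenating each text pair or each answer option with delimiter tokens.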
The fine‑tuning objective combines both losses: L = L_{2} + λ · L_{1}, where L_{2} is the supervised task loss, L_{1} is the auxiliary language‑modeling (pre‑training) loss computed on the fine‑tuning corpus, and λ weights the auxiliary term.
(4) BERT vs GPT‑1
GPT‑1 uses the same parameter configuration as BERT‑Base (L=12, H=768, A=12) but employs masked attention, which requires richer data because the model can only see preceding tokens. BERT’s bidirectional attention gives it an advantage on many tasks, explaining why GPT models need to grow larger to catch up.
2. Zero‑Shot: GPT‑2
GPT‑2’s core idea is that with enough high‑quality data and a sufficiently large model, fine‑tuning can be omitted, yielding a universal language model.
Zero‑shot, one‑shot, and few‑shot definitions:
Zero‑shot : Provide only a task description and prompt.
One‑shot : Provide a description, one example, and a prompt.
Few‑shot : Provide a description, several examples, and a prompt.
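The three settings differ only in how many worked examples are placed in the prompt; no gradient updates are involved. A sketch of prompt construction (the `=>` format is a hypothetical convention for illustration):

```python
def build_prompt(description: str, examples: list[tuple[str, str]], query: str) -> str:
    """Assemble a zero-/one-/few-shot prompt purely from in-context text.
    An empty examples list yields a zero-shot prompt."""
    lines = [description]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # the model is asked to continue from here
    return "\n".join(lines)

zero_shot = build_prompt("Translate English to French.", [], "cheese")
few_shot = build_prompt(
    "Translate English to French.",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "hello",
)
```

The model "performs the task" simply by continuing the text after the final `=>`.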
GPT‑2 was trained on WebText, a corpus of web pages linked from Reddit posts, using community voting (a minimum karma threshold) as a quality filter. This data naturally contains task descriptions, prompts, and answers, enabling the model to learn to perform tasks without explicit fine‑tuning.
The GPT‑2 paper gives examples of naturally occurring English–French translation pairs found in the training data:

"I’m not the cleverest man in the world, but like they say in French: Je ne suis pas un imbecile [I’m not a fool]."

"I hate the word ‘perfume,’" Burr says. "It’s somewhat better in French: ‘parfum.’"

3. Scaling Up: GPT‑3
Building on GPT‑2’s zero‑shot results, OpenAI introduced GPT‑3, which alternates dense and locally banded sparse attention layers (as in the Sparse Transformer) and relies on few‑shot in‑context learning to further improve performance. The 175 B‑parameter model dramatically outperforms previous language models across a wide range of benchmarks.
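A locally banded causal mask restricts each token to a fixed window of recent positions, which is what makes those layers cheaper than dense attention. A minimal sketch (the window size here is an illustrative choice, not GPT‑3's actual configuration):

```python
import numpy as np

def banded_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal mask where token i attends only to the previous `window`
    positions (including itself), instead of all i+1 earlier positions."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = banded_causal_mask(6, window=3)
# Each row has at most 3 True entries, so per-layer attention cost grows
# roughly as O(n * window) rather than O(n^2).
```

GPT‑3 interleaves layers like this with fully dense causal layers, trading some receptive field per layer for compute.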
4. Summary
The GPT series demonstrates a clear trend: larger models and richer data enable the removal of task‑specific fine‑tuning, shifting from supervised learning toward universal language modeling. Core takeaways include:
Pre‑training + fine‑tuning solves the scarcity of labeled text.
GPT’s decoder‑only design relies on masked attention, demanding larger models and datasets to surpass encoder‑based BERT.
Zero‑shot (GPT‑2) and few‑shot (GPT‑3) in‑context learning demonstrate the feasibility of a single, general‑purpose LLM.
References
Radford et al., "Improving Language Understanding by Generative Pre-Training" (GPT‑1): https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
Radford et al., "Language Models are Unsupervised Multitask Learners" (GPT‑2): https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
Brown et al., "Language Models are Few-Shot Learners" (GPT‑3): https://arxiv.org/abs/2005.14165