ChatGPT Technical Analysis Series – Part 2: GPT‑1, GPT‑2, and GPT‑3 (Encoder vs Decoder, Zero‑Shot, and Scaling)
This article reviews the evolution of the GPT family from GPT‑1 to GPT‑3, comparing encoder‑only and decoder‑only architectures, explaining the shift from supervised fine‑tuning to zero‑shot and few‑shot learning, and highlighting the architectural and training choices that enabled large‑scale language models.
In this second article of the "ChatGPT Technical Analysis" series, the author examines the GPT family (GPT‑1, GPT‑2, GPT‑3) and contrasts it with BERT, focusing on the encoder‑only vs decoder‑only split, zero‑shot learning, and scaling strategies.
1. Encoder vs Decoder: BERT vs GPT‑1
1.1 Timeline
2017 – Google introduces the Transformer, replacing recurrence and convolutions with attention.
June 2018 – OpenAI releases the first GPT model (decoder‑only), demonstrating the effectiveness of "pre‑training + fine‑tuning" for NLP.
October 2018 – Google releases BERT (encoder‑only), which outperforms GPT‑1 at a comparable parameter budget.
February 2019 – OpenAI launches GPT‑2 (1.5 B parameters) with zero‑shot capability, showing that larger models and more data can reduce the need for task‑specific fine‑tuning.
June 2020 – GPT‑3 is released with 175 B parameters; training cost reportedly exceeds $12 million, establishing a new performance ceiling.
1.2 GPT‑1 Design Philosophy
(1) Motivation
In computer vision, "pre‑training + fine‑tuning" has long been standard, but NLP lacked large labeled corpora and struggled to model textual semantics. The Transformer’s strong contextual modeling enabled the creation of GPT‑1.
(2) Pre‑training
GPT‑1 is trained on massive unlabeled text to learn "text continuation" (predict the next token given previous tokens). Its architecture follows the Transformer decoder, removing cross‑attention to the encoder.
Key attention mechanisms:
Decoder uses Masked‑Attention (each token sees only its left context).
Encoder uses Standard Attention (tokens see both left and right context).
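The difference between the two can be sketched with a causal (lower‑triangular) mask applied before the attention softmax. This is a minimal NumPy illustration, not the actual GPT‑1 implementation:

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Set disallowed positions to -inf before softmax so they get zero weight."""
    scores = np.where(mask, scores, -np.inf)
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # uniform attention scores, for illustration only
weights = masked_softmax(scores, causal_mask(4))
# Row 0 attends only to token 0; row 3 attends to all four tokens equally.
```

An encoder's standard attention is the same computation with an all‑True mask, which is why each token can also see its right context.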
(3) Fine‑tuning
After pre‑training, GPT‑1 is fine‑tuned on four supervised tasks:
Classification
Entailment
Similarity
Multiple‑Choice
All tasks share a unified input format: special start/delimiter/extract tokens are added, the structured text is fed to the model, and a linear head on the final hidden state produces the prediction. The linear head and the special token embeddings are the only newly introduced parameters; all of the pre‑trained weights are also updated during fine‑tuning.
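The unified format can be sketched as follows. The token names here are illustrative; the GPT‑1 paper uses learned start, delimiter, and extract embeddings rather than literal strings:

```python
def format_input(task: str, *texts: str) -> str:
    """Wrap raw text in the special tokens GPT-1 adds for each supervised task
    (hypothetical string form of the paper's learned special embeddings)."""
    START, DELIM, EXTRACT = "<start>", "<delim>", "<extract>"
    if task == "classification":
        (text,) = texts
        return f"{START} {text} {EXTRACT}"
    if task == "entailment":
        premise, hypothesis = texts
        return f"{START} {premise} {DELIM} {hypothesis} {EXTRACT}"
    raise ValueError(f"unsupported task: {task}")

# The hidden state at the <extract> position feeds the linear classification head.
example = format_input("entailment", "A dog runs.", "An animal moves.")
```

Similarity and multiple‑choice tasks follow the same pattern, concatenating each text pair or each answer option with delimiter tokens.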
The fine‑tuning objective combines both losses: L = L_{2} + λ · L_{1}, where L_{2} is the supervised task loss, L_{1} is the auxiliary language‑modeling (pre‑training) loss computed on the fine‑tuning corpus, and λ weights the auxiliary term.
(4) BERT vs GPT‑1
GPT‑1 uses the same parameter configuration as BERT‑Base (L=12, H=768, A=12) but employs masked attention, which requires richer data because the model can only see preceding tokens. BERT’s bidirectional attention gives it an advantage on many tasks, explaining why GPT models need to grow larger to catch up.
2. Zero‑Shot: GPT‑2
GPT‑2’s core idea is that with enough high‑quality data and a sufficiently large model, fine‑tuning can be omitted, yielding a universal language model.
Zero‑shot, one‑shot, and few‑shot definitions:
Zero‑shot : Provide only a task description and prompt.
One‑shot : Provide a description, one example, and a prompt.
Few‑shot : Provide a description, several examples, and a prompt.
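The three settings differ only in how many worked examples are placed in the prompt; no gradient updates are involved. A sketch of prompt construction (the `=>` format is a hypothetical convention for illustration):

```python
def build_prompt(description: str, examples: list[tuple[str, str]], query: str) -> str:
    """Assemble a zero-/one-/few-shot prompt purely from in-context text.
    An empty examples list yields a zero-shot prompt."""
    lines = [description]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # the model is asked to continue from here
    return "\n".join(lines)

zero_shot = build_prompt("Translate English to French.", [], "cheese")
few_shot = build_prompt(
    "Translate English to French.",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "hello",
)
```

The model "performs the task" simply by continuing the text after the final `=>`.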
GPT‑2 was trained on WebText, a corpus of web pages linked from Reddit posts, using community voting (a minimum karma threshold) as a quality filter. This data naturally contains task descriptions, prompts, and answers, enabling the model to learn to perform tasks without explicit fine‑tuning.
The GPT‑2 paper gives examples of naturally occurring English–French translation pairs found in the training data:

"I’m not the cleverest man in the world, but like they say in French: Je ne suis pas un imbecile [I’m not a fool]."

"I hate the word ‘perfume,’" Burr says. "It’s somewhat better in French: ‘parfum.’"

3. Scaling Up: GPT‑3
Building on GPT‑2’s zero‑shot results, OpenAI introduced GPT‑3, which alternates dense and locally banded sparse attention layers (as in the Sparse Transformer) and relies on few‑shot in‑context learning to further improve performance. The 175 B‑parameter model dramatically outperforms previous language models across a wide range of benchmarks.
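A locally banded causal mask restricts each token to a fixed window of recent positions, which is what makes those layers cheaper than dense attention. A minimal sketch (the window size here is an illustrative choice, not GPT‑3's actual configuration):

```python
import numpy as np

def banded_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal mask where token i attends only to the previous `window`
    positions (including itself), instead of all i+1 earlier positions."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = banded_causal_mask(6, window=3)
# Each row has at most 3 True entries, so per-layer attention cost grows
# roughly as O(n * window) rather than O(n^2).
```

GPT‑3 interleaves layers like this with fully dense causal layers, trading some receptive field per layer for compute.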
4. Summary
The GPT series demonstrates a clear trend: larger models and richer data enable the removal of task‑specific fine‑tuning, shifting from supervised learning toward universal language modeling. Core takeaways include:
Pre‑training + fine‑tuning solves the scarcity of labeled text.
GPT’s decoder‑only design relies on masked attention, demanding larger models and datasets to surpass encoder‑based BERT.
Zero‑shot (GPT‑2) and few‑shot (GPT‑3) in‑context learning demonstrate the feasibility of a single, general‑purpose LLM.
References
Radford et al., "Improving Language Understanding by Generative Pre-Training" (GPT‑1): https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
Radford et al., "Language Models are Unsupervised Multitask Learners" (GPT‑2): https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
Brown et al., "Language Models are Few-Shot Learners" (GPT‑3): https://arxiv.org/abs/2005.14165