
Overview of T5 (Text-to-Text Transfer Transformer): Architecture, Variants, Experiments, and Applications

This article provides a comprehensive overview of Google's T5 model, detailing its unified text‑to‑text formulation, encoder‑decoder architecture, three model variants, attention mask designs, training strategies, model sizes, experimental results, and key contributions to natural language processing.


Basic Information

Paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

URL: https://arxiv.org/pdf/1910.10683.pdf

Full name: Text-to-Text Transfer Transformer (T5)

Code: https://github.com/google-research/text-to-text-transfer-transformer

Model Architecture

T5 is a sequence‑to‑sequence Transformer that treats every NLP task as a text‑to‑text problem, using task‑specific prefixes and a unified pre‑training objective.
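As a concrete illustration, casting a task into text-to-text form is just string formatting. The sketch below is illustrative (the helper name and dictionary are not from the T5 codebase), but the prefix strings themselves are the ones used in the paper:

```python
# Task prefixes from the T5 paper; every task becomes "prefix + input" -> "target text".
PREFIXES = {
    "translation": "translate English to German: ",
    "summarization": "summarize: ",
    "cola": "cola sentence: ",  # grammatical-acceptability classification
}

def to_text_to_text(task: str, text: str) -> str:
    """Cast any task instance as plain text by prepending its task prefix."""
    return PREFIXES[task] + text

to_text_to_text("translation", "That is good.")
# -> "translate English to German: That is good."
# The fine-tuned model is then trained to emit the answer as text, e.g. "Das ist gut."
```

Because the target is always text, even classification labels (e.g. "acceptable" / "not acceptable") are generated token by token rather than predicted by a task-specific head.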

Three Model Variants Compared

To cast every task as text-to-text, the authors experimented with three structures: Encoder-Decoder, Language model (decoder-only), and Prefix LM. In the paper's experiments the Encoder-Decoder architecture performed best across both understanding and generation tasks, so T5 adopts it.

Encoder-Decoder: Standard Transformer with bidirectional attention in the encoder and causal (unidirectional) attention in the decoder.

Language model (decoder-only): Autoregressive model in which each token attends only to previous tokens.

GPT series and most current large models are Decoder‑only.

Prefix LM: Combines bidirectional and unidirectional attention via a specially designed mask, allowing part of the input to be fully visible like an encoder and the rest to be causal like a decoder.

Three Attention Mechanisms Compared

The three architectures differ chiefly in their attention masks; the figure below shows the mask matrix for each variant.

Mask matrix symbols:

Dark cells indicate positions where the self‑attention mechanism is allowed to attend.

Light cells indicate prohibited attention.

Illustrations:

Left: Fully visible mask – each output step can attend to the entire input.

Center: Causal mask – prevents an output token from attending to future input tokens.

Right: Prefix causal mask – allows full visibility for a prefix portion of the input while keeping causality for the rest.

Different architectures mainly differ by the mask used in attention. With comparable computational cost, the Encoder‑Decoder model has roughly twice the parameters of the other structures.
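All three masks can be written down directly. A minimal, dependency-free sketch (function names are illustrative), where entry `[i][j]` is `True` when output position `i` may attend to input position `j`:

```python
def fully_visible_mask(n: int) -> list[list[bool]]:
    # Encoder-style: every position may attend to every other position.
    return [[True] * n for _ in range(n)]

def causal_mask(n: int) -> list[list[bool]]:
    # Decoder-style: position i attends only to positions j <= i.
    return [[j <= i for j in range(n)] for i in range(n)]

def prefix_lm_mask(n: int, prefix_len: int) -> list[list[bool]]:
    # Prefix LM: the first prefix_len positions are fully visible to
    # everyone; the remainder of the sequence stays causal.
    return [[j <= i or j < prefix_len for j in range(n)] for i in range(n)]
```

In practice such a boolean matrix is turned into additive attention biases (0 where allowed, a large negative value where prohibited) before the softmax.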

Experimental Path

After fixing the base architecture, the authors explored self‑supervised objectives, mask designs, and corruption rates to find the optimal pre‑training configuration.

High‑level Approaches

Comparison of three high‑level pre‑training strategies (left figure):

Prefix LM: Conditional text generation with full input and left‑to‑right output.

BERT‑style: Mask random tokens and predict them.

Deshuffling: Shuffle the text and train the model to reconstruct the original order.

Corruption Strategies

Methods for corrupting a portion of the input text (second figure):

Mask: Replace tokens with a special [M] token.

Replace spans: Group consecutive masked tokens into a single special token to improve efficiency.

Drop: Delete the selected tokens outright, without inserting any placeholder.
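The replace-spans strategy can be sketched in a few lines. The example reproduces the paper's "Thank you for inviting me to your party last week" illustration; the `<extra_id_N>` sentinel names follow the released T5 vocabulary, while the helper itself is an illustrative simplification (real pre-training samples the spans randomly rather than taking them as arguments):

```python
def span_corrupt(tokens: list[str], spans: list[tuple[int, int]]):
    """Replace each (start, length) span with a sentinel; the target
    lists each sentinel followed by the tokens it replaced."""
    inp, tgt, i, sid = [], [], 0, 0
    for start, length in spans:            # spans: sorted, non-overlapping
        inp += tokens[i:start] + [f"<extra_id_{sid}>"]
        tgt += [f"<extra_id_{sid}>"] + tokens[start:start + length]
        i = start + length
        sid += 1
    inp += tokens[i:]
    tgt += [f"<extra_id_{sid}>"]           # final sentinel terminates the target
    return inp, tgt

tokens = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(tokens, [(2, 2), (8, 1)])
# inp: Thank you <extra_id_0> me to your party <extra_id_1> week
# tgt: <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

Grouping each masked span into a single sentinel shortens both input and target sequences, which is the efficiency gain mentioned above.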

Corruption Rate

The paper evaluates four mask ratios: 10%, 15%, 25%, 50%; the 15% rate (the same used by BERT) yields the best performance.

Corruption Span Length

For the replace‑spans strategy, the authors test span lengths of 2, 3, 5, and 10 tokens; an average span length of 3 performs best.
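Under these settings the expected number of corrupted tokens and spans is simple arithmetic (an illustrative helper, not code from the T5 repository):

```python
def span_corruption_stats(seq_len: int,
                          corruption_rate: float = 0.15,
                          mean_span_len: int = 3) -> tuple[int, int]:
    """Expected number of corrupted tokens and spans for a sequence."""
    n_corrupted = round(seq_len * corruption_rate)
    n_spans = max(1, round(n_corrupted / mean_span_len))
    return n_corrupted, n_spans

span_corruption_stats(512)  # -> (77, 26): ~77 corrupted tokens in ~26 spans
```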

Model Configuration

Model Sizes

T5 is released in five sizes: Small (60 M parameters), Base (220 M), Large (770 M), 3 B (2.8 B), and 11 B (11 B).

Performance

Optimal Summary

Key findings for the best pre‑trained T5 model:

Objective: Span‑corruption with an average span length of 3 and a corruption probability of 15%.

Training steps: Continue pre‑training on the C4 corpus for 1 M steps (batch size 2¹¹), covering roughly 1 trillion tokens.

Model configurations (layer counts include both the encoder and decoder stacks):

Small: 12 layers (6 + 6), hidden size 512, 8 heads, ~60 M parameters.

Base: 24 layers (12 + 12), hidden size 768, 12 heads, ~220 M parameters.

Large: 48 layers (24 + 24), hidden size 1024, 16 heads, ~770 M parameters.

3 B and 11 B: 48 layers (24 + 24), hidden size 1024, 32 and 128 heads, 2.8 B and 11 B parameters respectively.

Multi‑task pre‑training (mixing supervised tasks) yields modest gains.

Fine‑tune on each downstream task.

Beam search decoding with beam size 4 and length penalty 0.6.
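Beam search with a length penalty can be sketched on a toy model. This is an illustrative implementation assuming the GNMT-style penalty ((5 + len) / 6)^alpha, which is commonly paired with alpha = 0.6; it is not the exact decoder from the T5 codebase:

```python
def beam_search(step_fn, bos, eos, beam_size=4, alpha=0.6, max_len=10):
    """step_fn(seq) -> {next_token: log_prob}. Returns the best sequence
    under a length-normalized score."""
    def score(logp, length):
        # GNMT-style length penalty: longer hypotheses are penalized less
        # harshly than plain log-prob, so alpha > 0 favors longer outputs.
        return logp / (((5 + length) / 6) ** alpha)

    beams, finished = [([bos], 0.0)], []
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            for tok, lp in step_fn(seq).items():
                candidates.append((seq + [tok], logp + lp))
        candidates.sort(key=lambda c: score(c[1], len(c[0])), reverse=True)
        beams = []
        for seq, logp in candidates:
            if seq[-1] == eos:
                finished.append((seq, logp))    # completed hypothesis
            else:
                beams.append((seq, logp))       # keep expanding
            if len(beams) == beam_size:
                break
        if not beams:
            break
    pool = finished if finished else beams
    return max(pool, key=lambda c: score(c[1], len(c[0])))[0]

def toy_step(seq):
    # Hand-written next-token log-probabilities for a tiny deterministic "model".
    table = {
        "<s>": {"a": -0.1, "b": -2.0},
        "a":   {"c": -0.1, "</s>": -2.3},
        "b":   {"</s>": -0.1},
        "c":   {"</s>": -0.1},
    }
    return table[seq[-1]]

beam_search(toy_step, "<s>", "</s>")  # -> ["<s>", "a", "c", "</s>"]
```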

Main Contributions of T5

Unified Text‑to‑Text Transfer

The biggest innovation is framing every NLP task—both understanding and generation—as a text‑to‑text problem, enabling a single model, loss function, and set of hyper‑parameters to be used across diverse tasks.

Allows the same model to be applied to translation, question answering, summarization, classification, etc., by simply adding task‑specific prefixes.

C4 (Colossal Clean Crawled Corpus)

The authors curated a 750 GB clean web‑text dataset from Common Crawl, called C4, which serves as the primary pre‑training corpus for T5.

Common Crawl provides raw web data; C4 removes boilerplate, non‑text, and offensive content to create a high‑quality training set.

Application Scenarios

T5 achieves state‑of‑the‑art results on tasks such as natural language summarization, machine translation, open‑domain question answering, and text classification, making it a versatile model for many NLP applications.

Related Deep‑Learning Concepts

SOTA (State of the Art): The best performing model on a given benchmark.

Transfer Learning: Leveraging knowledge from a source task to improve learning on a target task.

Emergence: Phenomena where scaling model size leads to sudden improvements on complex tasks.

Chain‑of‑Thought (CoT): Prompting large language models to reason step‑by‑step.

NLU vs. NLG: Natural Language Understanding focuses on interpreting text, while Natural Language Generation focuses on producing fluent text.

Tags: Artificial Intelligence, Transformer, NLP, Pretraining, T5, Text-to-Text
Written by

Rare Earth Juejin Tech Community

Juejin, a tech community that helps developers grow.
