
A Comprehensive Introduction to RNN, LSTM, Attention Mechanisms, and Transformers for Large Language Models

This article provides a thorough overview of large language models, explaining the relationship between NLP and LLMs, the evolution from RNN to LSTM, the fundamentals of attention mechanisms, and the architecture and operation of Transformer models, all illustrated with clear examples and diagrams.

Rare Earth Juejin Tech Community

Introduction

Today's booming large models such as GPT‑3 and BERT achieve unprecedented natural‑language processing capabilities thanks to massive parameters and data, and the attention mechanism is a key foundation that enables models to capture long‑range dependencies and greatly improve performance.

This article explains the basics of large models and attention mechanisms from both a popular and an academic perspective, covering RNN, its limitations, LSTM, the history and types of attention, and finally the Transformer model and its advantages over LSTM.

NLP and LLM: How They Relate

Large models that dominate the conversation are more accurately called Large Language Models (LLMs). NLP (Natural Language Processing) is a branch of AI that studies how computers understand, generate, and process human language, powering voice assistants, web search, spam filtering, and translation.

LLMs are powerful tools within NLP; by training language models we can solve many NLP tasks and enable computers to better understand and manipulate natural language.

First Generation Model: RNN

RNN (Recurrent Neural Network) is the most traditional deep‑learning model used in NLP and speech recognition. It processes sequences by maintaining a hidden state h that carries information from previous time steps.

The hidden state acts like a relay runner summarizing each episode of a story; each step passes its summary to the next, allowing the network to accumulate information across the sequence.
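This recurrence can be sketched in a few lines of NumPy. Note this is a minimal illustration of a single-layer RNN step, not a trainable implementation; the weight shapes and initialization are assumptions chosen for the example.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrent step: the new hidden state summarizes
    the previous state and the current input."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 4, 8, 5
W_xh = rng.normal(size=(input_dim, hidden_dim)) * 0.1
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                   # initial hidden state
for x_t in rng.normal(size=(seq_len, input_dim)):
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)  # h carries context forward

print(h.shape)  # (8,)
```

Because every step squashes the state through `tanh` and reuses the same `W_hh`, repeated multiplication is also what causes the vanishing-gradient problem discussed below.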

However, when the sequence becomes very long, earlier information can be forgotten, leading to the classic "long‑term dependency" problem.

Encoder‑Decoder Model

By combining an N‑to‑1 RNN (encoder) with a 1‑to‑N RNN (decoder), the Encoder‑Decoder (Seq2Seq) architecture can handle inputs and outputs of different lengths.

N‑to‑1

Used for classification or summarization tasks.

1‑to‑N

Two variants: one expands input (e.g., text generation, image enhancement) and the other extracts information (e.g., image captioning, music transcription).

Combining both yields the flexible Encoder‑Decoder (N‑to‑M) model.
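The N-to-M idea can be sketched with two plain RNNs: the encoder compresses N input steps into one context vector, and the decoder unrolls M output steps from it. This is a toy illustration under assumed dimensions, not a faithful Seq2Seq implementation (real decoders use learned start tokens and a stopping criterion).

```python
import numpy as np

def rnn_step(x_t, h, W_x, W_h, b):
    return np.tanh(x_t @ W_x + h @ W_h + b)

rng = np.random.default_rng(1)
d_in, d_h, d_out = 3, 6, 3
enc_Wx = rng.normal(size=(d_in, d_h)) * 0.1
enc_Wh = rng.normal(size=(d_h, d_h)) * 0.1
dec_Wx = rng.normal(size=(d_out, d_h)) * 0.1
dec_Wh = rng.normal(size=(d_h, d_h)) * 0.1
W_out = rng.normal(size=(d_h, d_out)) * 0.1
b = np.zeros(d_h)

# Encoder: N input steps -> one context vector (the final hidden state).
src = rng.normal(size=(5, d_in))               # N = 5
h = np.zeros(d_h)
for x_t in src:
    h = rnn_step(x_t, h, enc_Wx, enc_Wh, b)

# Decoder: unroll M steps from the context, feeding back its own output.
y_t, outputs = np.zeros(d_out), []
for _ in range(4):                             # M = 4
    h = rnn_step(y_t, h, dec_Wx, dec_Wh, b)
    y_t = h @ W_out
    outputs.append(y_t)

print(np.array(outputs).shape)  # (4, 3)
```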

RNN Drawbacks

When processing very long sequences, RNNs tend to forget early information due to gradient vanishing or exploding, making it difficult to capture long‑term dependencies.

Advanced Model: LSTM

LSTM (Long Short‑Term Memory) introduces three gates—input, forget, and output—to control information flow, effectively mitigating the long‑term dependency problem.

Inside a single LSTM cell, the three gates play the following roles:

Input Gate : decides which new information to store.

Forget Gate : decides which old information to discard.

Output Gate : decides which information to expose to the next layer.
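The three gates above can be written out as one cell update. This is a minimal sketch with assumed shapes (the four gate projections are stacked into one matrix `W` for brevity), not a production LSTM.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell step; W stacks the four gate projections."""
    z = np.concatenate([x_t, h_prev]) @ W + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input / forget / output gates
    g = np.tanh(g)                                # candidate cell update
    c = f * c_prev + i * g                        # discard old info, store new
    h = o * np.tanh(c)                            # expose part of the cell state
    return h, c

rng = np.random.default_rng(2)
d_in, d_h = 4, 8
W = rng.normal(size=(d_in + d_h, 4 * d_h)) * 0.1
b = np.zeros(4 * d_h)

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(6, d_in)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)  # (8,) (8,)
```

The additive cell update `c = f * c_prev + i * g` is the key design choice: gradients can flow through it without repeated squashing, which is why LSTMs handle long dependencies better than plain RNNs.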

LSTM handles long dependencies better than RNN, but still processes inputs sequentially, limiting computational efficiency.

LLM Foundation – Attention Mechanism

Attention provides an effective solution for long‑sequence processing and is a cornerstone of modern LLMs.

Key Milestones in Attention Development

First introduced in the 1990s for vision, the mechanism gained prominence with 2014’s "Recurrent Models of Visual Attention" and 2015’s "Neural Machine Translation by Jointly Learning to Align and Translate" (the first NLP application). The 2017 "Attention Is All You Need" paper replaced RNNs with self‑attention, sparking the LLM era.

What Is Attention?

Attention lets a model focus on the most relevant parts of the input when producing an output, similar to how humans read a paragraph by concentrating on key words.

Layperson’s View

Unlike an LSTM that reads sequentially, attention can jump to any relevant position, making it better at handling distant dependencies.

Technical View

Attention computes a weighted sum of values ( V ) based on the similarity between a query ( Q ) and keys ( K ), allowing the model to directly attend to any position.

Three stages:

Compute similarity scores between Q and each K .

Normalize scores with softmax to obtain attention weights.

Weight the V vectors and sum them to produce the attention output.
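The three stages above map directly onto scaled dot-product attention. A minimal NumPy sketch (the scaling by the square root of the key dimension follows "Attention Is All You Need"; the toy shapes are assumptions):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Stage 1: similarity scores between Q and each K (scaled dot product).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Stage 2: softmax turns the scores into weights that sum to 1.
    weights = softmax(scores)
    # Stage 3: weighted sum of the value vectors.
    return weights @ V, weights

rng = np.random.default_rng(3)
Q = rng.normal(size=(2, 8))  # 2 queries
K = rng.normal(size=(5, 8))  # 5 keys
V = rng.normal(size=(5, 8))  # 5 values

out, w = attention(Q, K, V)
print(out.shape)  # (2, 8)
```

Each output row is a mixture of all five value vectors, so any position can contribute regardless of distance.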

Key papers for deeper study:

"Neural Machine Translation by Jointly Learning to Align and Translate" (https://arxiv.org/pdf/1409.0473.pdf)

"Attention Is All You Need" (https://arxiv.org/pdf/1706.03762.pdf)

"Effective Approaches to Attention‑based Neural Machine Translation" (https://arxiv.org/pdf/1508.04025.pdf)

Types of Attention

Soft Attention

Considers all keys with continuous weights that can be learned via gradient descent; computationally heavier but fully differentiable.

Hard Attention

Selects a single key at each step; non‑differentiable and typically trained with reinforcement methods.

Self‑Attention

Queries, keys, and values all come from the same input sequence, allowing the model to capture relationships between any pair of positions. This is the core of the Transformer.
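Concretely, self-attention derives Q, K, and V from the same sequence via three learned projections. A short sketch with assumed dimensions:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(4)
seq_len, d = 5, 8
X = rng.normal(size=(seq_len, d))        # one input sequence

# Q, K, V are all projections of the same sequence X.
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

weights = softmax(Q @ K.T / np.sqrt(d))  # every position attends to every other
out = weights @ V
print(out.shape)  # (5, 8)
```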

Transformer

The Transformer relies on self‑attention to process sequences efficiently and has become the backbone of modern LLMs such as GPT and BERT.

Transformer Architecture

Layperson’s View

Imagine watching a movie where you can instantly recall any previous scene while watching a new one; the model does the same by attending to all positions simultaneously.

Technical View

The model consists of an Encoder‑Decoder stack. Each Encoder layer contains Multi‑Head Self‑Attention and a Feed‑Forward network. The Decoder adds a Masked Multi‑Head Attention before the regular Multi‑Head Attention.

Encoder

Each Encoder block has:

Multi‑Head Attention : runs several self‑attention heads in parallel, each with its own projection matrices, allowing the model to capture information from different representation subspaces.

Feed‑Forward Network : two linear layers with a non‑linear activation applied position‑wise, enabling further transformation of the attended representations.
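Putting the two sub-layers together, an Encoder block can be sketched as multi-head self-attention followed by the position-wise feed-forward network. This toy version omits residual connections and layer normalization, which a real Transformer block also includes; all shapes are assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads):
    seq_len, d = X.shape
    d_head = d // n_heads
    # Project once, then split the feature dimension into heads.
    def split(M):
        return (X @ M).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(W_q), split(W_k), split(W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ V                    # (n_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d)
    return concat @ W_o                            # final output projection

def feed_forward(X, W1, b1, W2, b2):
    # Position-wise FFN: two linear layers with a ReLU in between.
    return np.maximum(X @ W1 + b1, 0.0) @ W2 + b2

rng = np.random.default_rng(5)
seq_len, d, d_ff = 6, 8, 32
W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
W1, b1 = rng.normal(size=(d, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)) * 0.1, np.zeros(d)

X = rng.normal(size=(seq_len, d))
attended = multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads=2)
out = feed_forward(attended, W1, b1, W2, b2)
print(out.shape)  # (6, 8)
```

Each head gets its own slice of the feature dimension, which is what lets the heads specialize in different representation subspaces.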

Decoder

The Decoder mirrors the Encoder but adds:

Masked Multi‑Head Attention : prevents each position from attending to future tokens, ensuring autoregressive generation.

Multi‑Head Attention (Encoder‑Decoder) : attends to the Encoder’s output.

Feed‑Forward Network (same as in the Encoder).
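The masking in the Decoder's first sub-layer can be sketched as follows: scores for future positions are set to negative infinity before the softmax, so their attention weights become exactly zero. The toy shapes are assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(6)
seq_len, d = 4, 8
Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))

scores = Q @ K.T / np.sqrt(d)
# Mask the strict upper triangle: position t may only see positions <= t.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf
weights = softmax(scores)

print(np.round(weights, 2))  # upper triangle is all zeros
```

Zeroing the future positions is what makes generation autoregressive: at training time every position is predicted using only the tokens before it.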

Conclusion

Transformer models have achieved remarkable results in machine translation and many other NLP tasks, forming the core of leading large language models such as GPT and BERT. Understanding their principles helps predict future LLM developments and enables developers to better leverage these models in applications.

Tags: Artificial Intelligence, Transformer, Attention, NLP, LSTM, RNN
Written by

Rare Earth Juejin Tech Community

Juejin, a tech community that helps developers grow.
