
A Comprehensive Overview of Attention Mechanisms in Deep Learning

This article systematically reviews the history, core concepts, variants, and practical implementations of attention mechanisms—from early additive and multiplicative forms to self‑attention, multi‑head attention, and recent transformer‑based models—highlighting why attention has become fundamental in modern AI research.

Qunar Tech Salon

Attention, introduced for neural machine translation around 2014–2015, quickly became a cornerstone in both natural language processing and computer vision by enabling models to focus on the most relevant parts of the input.

The article first outlines the historical development of attention, citing seminal works such as Bahdanau et al.'s additive attention, Luong et al.'s multiplicative attention, and the hard/soft attention concepts from image captioning.

It then defines attention through a unified three‑step framework: a score function to measure similarity, an alignment function (often softmax) to produce attention weights, and a context‑vector generation step that aggregates weighted inputs.
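The three-step framework described above can be sketched in a few lines of numpy. This is a minimal, illustrative version (not the article's code): the score function here is a plain dot product, one of several possible choices covered later.

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(query, keys, values):
    """Generic attention: score, align, aggregate."""
    scores = keys @ query        # step 1: similarity score per memory slot
    weights = softmax(scores)    # step 2: alignment function -> attention weights
    context = weights @ values   # step 3: weighted sum -> context vector
    return context, weights

# Toy example: 3 memory slots with 4-dimensional keys and values.
rng = np.random.default_rng(0)
q = rng.normal(size=4)
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
ctx, w = attend(q, K, V)
```

The softmax guarantees the weights are non-negative and sum to one, so the context vector is a convex combination of the values.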

Two abstract perspectives are presented: alignment‑based models and memory‑based (Q‑K‑V) models, illustrating how classic attention mechanisms fit into these categories.
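In the memory-based (Q-K-V) view, attention becomes a single matrix operation over many queries at once; the alignment-based view is then the special case where keys and values coincide (both are the encoder hidden states). A hedged numpy sketch of that relationship:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def qkv_attention(Q, K, V):
    # Each row of Q attends over all rows of K; the resulting
    # weights (n_queries, n_keys) mix the rows of V.
    weights = softmax(Q @ K.T)
    return weights @ V

rng = np.random.default_rng(1)
H = rng.normal(size=(5, 8))   # 5 encoder states, dimension 8
S = rng.normal(size=(2, 8))   # 2 decoder queries
# Alignment-based attention is the special case K = V = H.
ctx = qkv_attention(S, H, H)  # shape (2, 8): one context vector per query
```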

The article examines detailed variants—including global vs. local attention, hard vs. soft attention, and different score functions such as dot‑product, scaled dot‑product, and additive—and explains their trade‑offs.
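The score functions mentioned above differ only in how similarity is computed; the rest of the pipeline is unchanged. In this sketch the weight matrices are random placeholders for learned parameters. The key trade-off: dividing by the square root of the dimension (scaled dot-product) keeps scores from growing with dimensionality, which would otherwise push the softmax into saturated, low-gradient regions; additive attention avoids this at the cost of an extra feed-forward computation.

```python
import numpy as np

d = 16
rng = np.random.default_rng(2)
q, k = rng.normal(size=d), rng.normal(size=d)
W = rng.normal(size=(d, d))                    # placeholder for a learned matrix
v = rng.normal(size=d)                         # placeholders for additive attention
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))

dot      = q @ k                               # dot-product (Luong)
scaled   = q @ k / np.sqrt(d)                  # scaled dot-product (Transformer)
general  = q @ W @ k                           # multiplicative / "general" (Luong)
additive = v @ np.tanh(Wq @ q + Wk @ k)        # additive (Bahdanau)
```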

Self‑attention and multi‑head attention are discussed as solutions to the limitations of RNNs and CNNs, providing constant‑time long‑range dependencies and parallelism.
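Multi-head attention can be sketched as follows: the model dimension is split across heads, each head runs scaled dot-product attention with its own projections, and the head outputs are concatenated. The projection matrices below are random stand-ins for learned parameters, so this shows the shape mechanics rather than a trained model.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, n_heads, rng):
    """X: (n_tokens, d_model). Returns (n_tokens, d_model)."""
    n, d = X.shape
    dh = d // n_heads                           # per-head dimension
    heads = []
    for _ in range(n_heads):
        Wq = rng.normal(size=(d, dh))
        Wk = rng.normal(size=(d, dh))
        Wv = rng.normal(size=(d, dh))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(dh))      # every token attends to every token
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1)       # concatenate head outputs

rng = np.random.default_rng(3)
X = rng.normal(size=(6, 32))                    # 6 tokens, model dimension 32
out = multi_head_self_attention(X, n_heads=4, rng=rng)
```

Because every token attends to every other token in one matrix multiply, the path length between any two positions is constant, which is the parallelism and long-range-dependency advantage over RNNs noted above.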

Transformer architecture is described, covering encoder self‑attention, masked decoder self‑attention, and encoder‑decoder attention, along with supporting components like positional encoding and residual connections.
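The masking in decoder self-attention can be illustrated concretely: scores for future positions are set to negative infinity before the softmax, so position i receives zero weight on any position j > i. A minimal sketch, not taken from the article:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(Q, K, V):
    """Causal (decoder) self-attention: position i attends only to j <= i."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    # True above the diagonal marks "future" positions to be hidden.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)   # exp(-inf) -> weight of 0
    weights = softmax(scores)
    return weights @ V, weights

rng = np.random.default_rng(4)
X = rng.normal(size=(4, 8))
out, W = masked_self_attention(X, X, X)
```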

Additional “flavored” attentions such as hierarchical attention networks, attention‑over‑attention, convolutional sequence‑to‑sequence with attention, weighted transformers, and Transformer‑XL are briefly introduced.

Finally, the article concludes that attention works because it effectively captures context, enabling models to perform weighted summations that mimic human focus, thereby improving performance across NLP, vision, and recommendation tasks.

Tags: deep learning, transformer, attention, NLP, machine translation, self-attention
Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
