
Attention Mechanisms — The Visual Guide to How Transformers Actually Work

Visual explainer of attention mechanisms in transformers. Understand Query-Key-Value, multi-head attention, self vs cross-attention through animated diagrams and intuitive analogies.

“Attention Is All You Need” — the 2017 paper that launched the transformer revolution. The title isn’t hyperbole. Attention is the core mechanism that makes GPT, Claude, Gemini, and every modern language model work. Everything else (feedforward layers, normalization, embeddings) is supporting infrastructure.

But most explanations jump to matrix math before you understand what attention is actually doing. Let’s start from the intuition.

1. What Attention Does

Imagine reading: “The cat sat on the mat because it was tired.” When you read “it,” your brain instantly connects it to “cat” — not to “mat.” That’s attention: figuring out which other tokens are relevant to the current token.

Attention — "Which Words Matter for This Word?"

Each token looks at every other token and decides how much to pay attention to it.

Tokens: The · cat · sat · on · the · mat

When processing "cat", the attention scores are:
The: 0.05   cat: 0.30   sat: 0.20   on: 0.03   the: 0.02   mat: 0.40
"cat" attends strongly to "mat" (semantic relationship: cats sit on mats) and "sat" (action). Attention scores are softmax-normalized: they sum to 1.0 across all tokens.

The key insight: without attention, the model processes each token independently. “It” has no way to know it refers to “cat.” Attention creates connections between every pair of tokens in the sequence, letting the model build relationships across the entire context.
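
To make the "scores sum to 1.0" point concrete, here is a tiny NumPy sketch of the softmax step. The raw similarity scores are made up (chosen so the resulting weights roughly match the diagram above), not taken from a trained model:

```python
import numpy as np

# Illustrative raw similarity scores for "cat" against each token.
# These numbers are invented for the example, not from a real model.
tokens = ["The", "cat", "sat", "on", "the", "mat"]
raw_scores = np.array([-0.5, 1.3, 0.9, -1.0, -1.4, 1.6])

# Softmax turns arbitrary scores into positive weights that sum to 1.0
weights = np.exp(raw_scores) / np.exp(raw_scores).sum()

for tok, w in zip(tokens, weights):
    print(f"{tok:>4}: {w:.2f}")   # ≈ 0.05, 0.30, 0.20, 0.03, 0.02, 0.40
print("sum:", round(weights.sum(), 4))  # 1.0
```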

2. Query, Key, Value

The mechanism uses three learned projections — Q, K, V — inspired by information retrieval. Each token generates all three, then uses them to compute attention scores with every other token.

Query, Key, Value — The Three Projections

Q (Query): "What am I looking for?" The current token's question. Each token generates a query vector.
K (Key): "What do I contain?" Every token's label. The query is matched against all keys to find relevance scores.
V (Value): "What information do I carry?" The actual content. Attention scores weight which values to mix.

Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V
Analogy: Library search. Query = your search terms. Keys = book titles/tags. Values = book contents. You match your query against keys to decide which books (values) to read. High-scoring books contribute more to your answer.

The √d_k scaling in the formula prevents the dot products from getting too large with high-dimensional vectors. Without it, softmax saturates: nearly all attention goes to one token and the gradients vanish. A small detail that made training stable.
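
Here is a minimal NumPy sketch of that formula, assuming Q, K, and V have already been projected for a single sequence (shapes and values are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # how well each query matches each key
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1.0
    return weights @ V, weights                     # weighted mix of values, plus the weights

# Toy example: 6 tokens, d_k = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.sum(axis=-1))  # (6, 8), every row of attn sums to 1.0
```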

3. Multi-Head Attention

One attention head captures one type of relationship. But language has many simultaneous relationships — syntactic, semantic, positional, coreference. Multiple heads run in parallel, each learning to focus on different patterns.

Multi-Head Attention — Multiple Perspectives

One attention head looks at one relationship. Multiple heads capture different patterns simultaneously.

Head 1: Syntactic structure ("cat" → "sat", subject-verb)
Head 2: Spatial relationships ("cat" → "mat", on the mat)
Head 3: Coreference ("it" → "cat", pronoun resolution)
Head 4: Positional proximity ("cat" → "The", adjacent tokens)
All heads concatenated → Linear projection → Combined representation
GPT-4 is estimated to have 96+ attention heads per layer across 120+ layers, which works out to over 11,000 attention patterns processing every token. Each head specializes in different linguistic relationships through training.

This is why transformers are so powerful: they’re not doing one thing really well — they’re doing dozens of things simultaneously and combining the results. Each head becomes a specialist, and the linear projection combines their findings into a single representation.
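
A compact sketch of that split-attend-concatenate-project pattern, using hypothetical sizes (4 heads, model dimension 32) and random matrices standing in for learned weights:

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 32, 4        # hypothetical sizes
d_head = d_model // n_heads                 # 8 dimensions per head

x = rng.normal(size=(seq_len, d_model))     # token representations
# One Q/K/V projection per head, plus a final output projection (random stand-ins here)
W_q, W_k, W_v = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
W_o = rng.normal(size=(d_model, d_model))

# Each head attends independently in its own low-dimensional subspace
heads = [attention(x @ W_q[h], x @ W_k[h], x @ W_v[h]) for h in range(n_heads)]

# Concatenate the heads and mix them with the output projection
out = np.concatenate(heads, axis=-1) @ W_o  # (seq_len, d_model)
print(out.shape)
```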

4. Self vs Cross Attention

There are two flavors of attention, and they serve different purposes. Self-attention (used in GPT, BERT) lets tokens attend to other tokens in the same sequence. Cross-attention lets tokens attend to a different sequence — crucial for translation and multimodal models.

Self-Attention vs Cross-Attention

Self-Attention
Input: the same sequence supplies Q, K, and V
Purpose: tokens within a sentence attend to each other
Used in: GPT, BERT, and all decoder-only and encoder-only models
Example: "The cat... it" → "it" attends to "cat"

Cross-Attention
Input: Q from one sequence, K/V from another
Purpose: the decoder attends to the encoder output
Used in: translation models, Whisper, DALL-E
Example: English→French: the French "chat" attends to the English "cat"

Modern decoder-only models (GPT-4, Claude, Llama) use only self-attention with causal masking — each token can only attend to previous tokens, not future ones. This forces left-to-right generation. Encoder-decoder models (T5, Whisper) use both self-attention in the encoder and cross-attention to connect decoder to encoder outputs.
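
A small sketch of causal masking on top of the same attention function: positions above the diagonal (future tokens) are set to -inf before the softmax, so they receive exactly zero weight. For cross-attention, Q would instead come from the decoder sequence while K and V come from the encoder output:

```python
import numpy as np

def causal_self_attention(Q, K, V):
    """Self-attention where token i may only attend to tokens 0..i."""
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Causal mask: future positions (above the diagonal) get -inf,
    # so softmax assigns them zero weight
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(5, 8))
_, attn = causal_self_attention(Q, K, V)
print(np.round(attn, 2))  # lower-triangular: no weight on future tokens
```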

5. Making It Scale

The original attention mechanism is O(n²) — quadratic in sequence length. That worked fine for 512-token sequences in 2017. It doesn’t work for 1M-token sequences in 2024. The field has been racing to make attention faster without losing quality.

The Evolution — Getting Faster

2017: Vanilla Attention. O(n²) memory and compute. Fine for short sequences, impossible for 100K+ tokens.
2020: Sparse Attention. Only attend to nearby tokens plus fixed stride patterns. GPT-3 used this. Tradeoff: misses some long-range dependencies.
2022: FlashAttention. Same math, better hardware utilization: tiled computation that fits in GPU SRAM. 2-4x faster and exact (not approximate).
2023: Multi-Query / Grouped-Query Attention. Share K/V heads across query heads. Llama 2 uses GQA. Fewer KV-cache entries mean cheaper inference and longer context (see the sketch after this timeline).
2024+: Ring Attention / Infinite Context. Distribute attention across multiple devices. Sequence length is limited by total cluster memory, not a single GPU. Enables 1M+ context.
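
Here is a rough sketch of the grouped-query idea: many query heads, but only a few K/V heads that groups of query heads share, so the KV cache shrinks. Head counts and weights are illustrative, not from any particular model:

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
seq_len, d_head = 6, 8
n_q_heads, n_kv_heads = 8, 2                # 4 query heads share each K/V head

Q = rng.normal(size=(n_q_heads, seq_len, d_head))
K = rng.normal(size=(n_kv_heads, seq_len, d_head))   # only 2 K/V heads to cache
V = rng.normal(size=(n_kv_heads, seq_len, d_head))

group_size = n_q_heads // n_kv_heads
outputs = []
for h in range(n_q_heads):
    kv = h // group_size                    # map each query head to its shared K/V head
    outputs.append(attention(Q[h], K[kv], V[kv]))

out = np.concatenate(outputs, axis=-1)      # (seq_len, n_q_heads * d_head)
print(out.shape)  # KV cache stores 2 heads instead of 8: 4x smaller
```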

FlashAttention was the breakthrough that changed everything. It doesn't approximate: it computes the exact same result but exploits the GPU memory hierarchy (SRAM vs HBM) to avoid memory bottlenecks. Before FlashAttention, long-context models were largely theoretical; after it, they became practical. Today's long-context models, such as Gemini with 1M tokens of context and Claude with 200K, build on FlashAttention-style kernels and their successors.
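
In practice you rarely write these kernels yourself. PyTorch 2.x, for example, exposes a fused scaled_dot_product_attention op that dispatches to a FlashAttention-style backend when the hardware and dtypes allow it, and falls back to slower exact paths otherwise. A minimal sketch with illustrative shapes:

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# (batch, heads, seq_len, head_dim) layout expected by the fused kernels
q = torch.randn(1, 8, 1024, 64, device=device, dtype=dtype)
k = torch.randn(1, 8, 1024, 64, device=device, dtype=dtype)
v = torch.randn(1, 8, 1024, 64, device=device, dtype=dtype)

# PyTorch picks the fastest available backend (FlashAttention, memory-efficient
# attention, or the plain math fallback); the result is the same exact
# softmax(QKᵀ/√d_k)·V either way
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```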