12 — Pay Attention

Self-attention, Q/K/V, multi-head attention, positional encodings, and the transformer block — the mechanism behind every modern language model, explained from scratch.

The RNN reads left to right, one token at a time, compressing everything into a fixed hidden state. To answer a question about token 1 when you’re at token 100, that information has to survive 99 hidden-state transitions. Often it doesn’t.

The transformer’s answer: don’t be sequential at all. Let every token look at every other token simultaneously. No bottleneck. No forgetting. No left-to-right constraint.

One mechanism makes all of this work: self-attention.


Why RNNs Hit a Wall #

Hidden states are information bottlenecks. An RNN at step $t$ holds a fixed-size vector $\mathbf{h}_t$ that summarises everything from steps $1$ to $t$. That’s a lot to cram into 256 numbers — and the earlier the token, the more compression it’s been through.

Attention sidesteps the bottleneck entirely. Instead of funnelling through a sequential chain, every output position can directly reach every input position. Distance doesn’t attenuate the signal.


The Core Idea #

At each position, the model does three things:

  1. Forms a query: “what am I looking for?”
  2. Checks keys at every position: “how well does each position match my query?”
  3. Reads values weighted by those matches: “what do I get from each position?”

The output at position $i$ is a weighted mixture of all the values, where the weights come from how well query $i$ matched each key.


Query, Key, Value #

Three learnable weight matrices transform each input embedding into Q, K, and V:

$$\mathbf{q}_i = \mathbf{x}_i \mathbf{W}_Q, \quad \mathbf{k}_i = \mathbf{x}_i \mathbf{W}_K, \quad \mathbf{v}_i = \mathbf{x}_i \mathbf{W}_V$$

The attention weight from position $i$ to position $j$ is how well query $i$ matches key $j$, normalised with softmax:

$$\alpha_{ij} = \text{softmax}_j\left(\frac{\mathbf{q}_i \cdot \mathbf{k}_j}{\sqrt{d_k}}\right)$$

The output at position $i$ is the softmax-weighted sum of all value vectors:

$$\mathbf{y}_i = \sum_j \alpha_{ij}\, \mathbf{v}_j$$

The values are what you receive. The query–key dot product decides how much of each value to take.
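
Here is the same computation as a minimal PyTorch sketch for a single position. The dimensions and the choice of query position are arbitrary:

import torch

d, d_k = 8, 4                                  # toy dimensions
W_Q, W_K, W_V = (torch.randn(d, d_k) for _ in range(3))
x = torch.randn(5, d)                          # five token embeddings

q = x[2] @ W_Q                                 # query for position i = 2
keys, values = x @ W_K, x @ W_V                # (5, d_k) each
alpha = torch.softmax(keys @ q / d_k ** 0.5, dim=0)   # weights over positions
y = alpha @ values                             # output at position 2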


Scaled Dot-Product Attention #

In matrix form — $n$ tokens, embedding dim $d$, key dim $d_k$:

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V}$$

The $n \times n$ matrix $\mathbf{Q}\mathbf{K}^\top$ holds every pairwise compatibility score. Softmax (row-wise) turns each row into a probability distribution. Multiplying by $\mathbf{V}$ blends the value vectors accordingly.

Why $\sqrt{d_k}$? If $\mathbf{q}$ and $\mathbf{k}$ have components $\sim \mathcal{N}(0,1)$, their dot product has variance $d_k$. For large $d_k$, scores blow up and softmax saturates — gradients vanish. Dividing by $\sqrt{d_k}$ keeps variance at 1.
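
A quick empirical check of that claim (the sample count is arbitrary):

import torch

d_k = 64
q = torch.randn(10_000, d_k)
k = torch.randn(10_000, d_k)
scores = (q * k).sum(dim=-1)            # 10,000 raw dot products
print(scores.var())                     # ~64, i.e. ~d_k
print((scores / d_k ** 0.5).var())      # ~1 after scaling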

This is $O(n^2)$ in time and memory — the famous quadratic cost of transformers. For $n = 4096$ tokens, that’s about 16.8 million score pairs. Expensive, but entirely parallel — no sequential dependency anywhere.
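
In code, the whole equation is a few lines. A minimal sketch, not a library API; the function name and shape conventions are mine:

import math
import torch

def attention(Q, K, V):
    # Q, K: (n, d_k); V: (n, d_v); scores is the full n x n matrix
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    weights = torch.softmax(scores, dim=-1)   # each row sums to 1
    return weights @ V                        # blend the values row-wise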


Multi-Head Attention #

One head looks for one type of relationship. Real transformers run $h$ heads in parallel, each learning different projections:

$$\text{MultiHead}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{Concat}(\text{head}_1,\ldots,\text{head}_h)\,\mathbf{W}_O$$

$$\text{head}_i = \text{Attention}(\mathbf{Q}\mathbf{W}_{Q,i},\ \mathbf{K}\mathbf{W}_{K,i},\ \mathbf{V}\mathbf{W}_{V,i})$$

Different heads discover different structure: one tracks subject–verb pairs, another handles pronoun references (“it” → the entity), a third attends locally. Nobody programmed this — it emerged from gradient descent on next-token prediction.
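
A common way to implement this is a single fused projection for Q, K, and V, with heads split out by reshaping. A minimal sketch (the class and variable names are mine):

import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must divide evenly across heads"
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused W_Q, W_K, W_V
        self.out = nn.Linear(d_model, d_model)       # W_O

    def forward(self, x):
        B, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)       # three (B, n, d) tensors
        # split d into (n_heads, d_k) so each head attends independently
        q, k, v = (t.view(B, n, self.n_heads, self.d_k).transpose(1, 2)
                   for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5   # (B, heads, n, n)
        y = torch.softmax(scores, dim=-1) @ v                # (B, heads, n, d_k)
        y = y.transpose(1, 2).reshape(B, n, d)               # concat the heads
        return self.out(y)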


Positional Encoding #

Self-attention is permutation-invariant. Shuffle the tokens and the attention scores don’t change — dot products only depend on token identity, not order. “Dog bites man” and “man bites dog” would be identical.

Fix: add a positional signal to each embedding before attention.

The original transformer used sinusoidal encodings — each position gets a unique signature across dimensions:

$$\text{PE}(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right), \qquad \text{PE}(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

Modern models use Rotary Position Embeddings (RoPE): rotate $\mathbf{q}$ and $\mathbf{k}$ by an angle proportional to their position before the dot product. This makes relative position directly affect attention scores — and it generalises to longer sequences than seen during training.
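
For reference, the sinusoidal table takes only a few lines to generate. A sketch, assuming an even $d$:

import torch

def sinusoidal_pe(n_pos, d):
    pos = torch.arange(n_pos).unsqueeze(1)      # (n_pos, 1)
    i = torch.arange(0, d, 2)                   # even dimension indices
    freq = 1.0 / 10000 ** (i / d)               # one frequency per sin/cos pair
    pe = torch.zeros(n_pos, d)
    pe[:, 0::2] = torch.sin(pos * freq)         # even dims get sin
    pe[:, 1::2] = torch.cos(pos * freq)         # odd dims get cos
    return pe                                   # added to the embeddings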


The Transformer Block #

One block does two things: attend, then transform each position independently.

x → LayerNorm → MultiHeadAttention → + (residual)
  → LayerNorm → FFN                 → + (residual) → out

LayerNorm normalises across the embedding dimension (not the batch). Works with batch size 1, works at any point in training.

FFN is two linear layers with a nonlinearity — typically 4× wider than the model dimension:

$$\text{FFN}(\mathbf{x}) = \text{GELU}(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)\,\mathbf{W}_2 + \mathbf{b}_2$$

The FFN is where most factual knowledge lives — it’s the “memory” learned during pretraining. Attention routes information; FFN transforms it.

Residual connections everywhere — same reason as ResNets: gradients flow through the skip path directly, and layers learn corrections rather than full transformations.

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn  = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff    = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),   # expand 4x, then
            nn.Linear(4 * d_model, d_model),              # project back to d
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        n = self.norm1(x)                                 # pre-norm, as in the diagram
        x = x + self.attn(n, n, n, attn_mask=mask)[0]     # residual around attention
        x = x + self.ff(self.norm2(x))                    # residual around FFN
        return x
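
Quick shape check:

block = TransformerBlock(d_model=768, n_heads=12)
x = torch.randn(2, 10, 768)    # batch of 2 sequences, 10 tokens each
print(block(x).shape)          # torch.Size([2, 10, 768]), shape preserved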

GPT: Stack and Sample #

GPT is a stack of transformer blocks with one addition: causal masking. Position $i$ can only attend to positions $\leq i$ — future tokens are masked to $-\infty$ before softmax, becoming $0$ after it.

import math
import torch

n = seq_len                     # seq_len, Q, K, d_k defined elsewhere
# -inf above the diagonal: position i cannot see any j > i
mask = torch.triu(torch.full((n, n), float('-inf')), diagonal=1)
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
attn   = torch.softmax(scores + mask, dim=-1)   # masked entries become 0

At inference: predict the next token, append it to the sequence, run forward again. Autoregressive — one token at a time, each conditioned on everything before.
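
A sketch of that loop, assuming a model that maps a (1, n) tensor of token ids to (1, n, vocab) logits; the interface here is hypothetical:

import torch

def generate(model, ids, n_new, temperature=1.0):
    for _ in range(n_new):
        logits = model(ids)[:, -1, :] / temperature        # last position only
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # sample one token
        ids = torch.cat([ids, next_id], dim=1)             # append, run again
    return ids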

GPT-2 Small (124M parameters):

  • 12 transformer blocks, 12 attention heads, $d_k = 64$
  • $d = 768$ embedding dim, $4d = 3{,}072$ FFN width
  • 50,257-token vocabulary (byte-pair encoding)

Scale to 96 blocks with $d = 12{,}288$ and you have GPT-3 at 175B parameters.
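
A back-of-the-envelope check of the 124M figure. It assumes GPT-2’s 1,024-token context window and tied input/output embeddings, neither of which is stated above:

V, n_ctx, d, L = 50_257, 1_024, 768, 12

embed = V * d + n_ctx * d              # token + learned position embeddings
attn  = 4 * d * d + 4 * d              # W_Q, W_K, W_V, W_O plus biases
ffn   = 2 * 4 * d * d + 4 * d + d      # two linears: d -> 4d -> d, with biases
norms = 2 * 2 * d                      # two LayerNorms (gain + bias each)
total = embed + L * (attn + ffn + norms) + 2 * d   # plus a final LayerNorm
print(f"{total:,}")                    # 124,439,808, i.e. ~124M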


Demo: Attention Heatmap #

Type a sentence (max 12 words) and press Enter, or pick a preset. The grid shows attention weights — row $i$, column $j$ is how much token $i$ attends to token $j$. Click any row label or cell to highlight that token’s attention distribution.

Toggle Causal to see GPT-style masking — each token can only look left.

(Weights are computed from fixed random projections on character-hash embeddings — not a trained model. The patterns illustrate the mechanics, not linguistics.)


Try “the dog that bit the man fled” — a sentence with long-range dependencies. Notice how the attention distributes across non-adjacent tokens. Then toggle Causal and watch the upper-right triangle go dark.


What Trained Attention Actually Learns #

These random-projection weights don’t mean anything linguistically — but in trained models, interpretability research has found real structure:

  • Positional heads: attend to immediately adjacent tokens — essentially a learned local window
  • Syntactic heads: track subject–verb–object arcs across arbitrary distances
  • Coreference heads: “it”, “they”, “she” → the noun phrase they refer to
  • Copy heads: attend to identical or semantically similar tokens elsewhere in the sequence

None of this is engineered in. The model discovered that tracking these relationships is useful for predicting the next token.


Before You Go — Try These #

  1. The attention score matrix $\mathbf{Q}\mathbf{K}^\top$ has shape $n \times n$. For $n = 4096$ tokens and $d_k = 64$: how many floating-point multiplications are needed to compute this matrix? How does this change if you double the sequence length to 8192?

  2. If $\mathbf{q}$ and $\mathbf{k}$ both have $d_k = 64$ components drawn i.i.d. from $\mathcal{N}(0,1)$, what is the expected value and variance of $\mathbf{q} \cdot \mathbf{k}$? What does the softmax output look like when one score is 10× larger than the others? Why is the $\sqrt{d_k}$ fix necessary?

  3. A multi-head attention layer has $h$ heads, each with $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{R}^{d \times d_k}$ and a combined output projection $\mathbf{W}_O \in \mathbb{R}^{hd_k \times d}$. For GPT-2 Small ($d = 768$, $h = 12$, $d_k = 64$): how many total parameters are in one multi-head attention layer?

  4. Causal masking sets future positions to $-\infty$ before softmax, which produces $0$ attention weights after softmax. Do gradients flow through those zero-attention entries during backprop? What does $\frac{\partial}{\partial s_{ij}} \text{softmax}(s)_i$ equal when $s_{ij} = -\infty$?

  5. Layer norm normalises across the embedding dimension: $\hat{x} = (x - \mu) / \sigma$ then scales by learned $\gamma, \beta$. Batch norm normalises across the batch dimension. Write out both normalisation formulas. Why would batch norm fail with batch size 1 at inference, while layer norm works fine?


Next up → Lesson 13: Build a Mini-GPT — a small character-level transformer that generates text, built in PyTorch from the pieces you now understand.