13 — Build a Mini-GPT

Every piece from lessons 1–12 assembled into a working character-level GPT in PyTorch — tokenisation, the full transformer stack, training loop, and text generation from scratch.

You’ve seen every ingredient. Backprop. Gradient descent. ReLUs. Residual connections. Layer norm. Self-attention. Causal masking. Positional encodings. Multi-head attention. Transformer blocks.

This lesson assembles them. By the end you’ll have a working character-level GPT in pure PyTorch — one that reads text, trains on it, and generates new text in the same style. Nothing hidden, nothing magic.


The Plan #

We’re building a character-level language model — it reads text one character at a time, learns the patterns, and generates new text character by character.

Why character-level? Every piece is visible. No tokeniser black box. The vocabulary is just the ~65 distinct characters in your training text. You can inspect everything.

The architecture: a stack of transformer decoder blocks with causal masking. Exactly GPT, just smaller.


Step 1: Data and Tokenisation #

# Download a small corpus — Shakespeare's works (~1MB)
import requests
text = requests.get('https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt').text

# Build character vocabulary
chars  = sorted(set(text))
vocab_size = len(chars)          # ~65

# Encoder / decoder
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for i, c in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join(itos[i] for i in l)

# Train / validation split
import torch
data = torch.tensor(encode(text), dtype=torch.long)
n    = int(0.9 * len(data))
train_data = data[:n]
val_data   = data[n:]

# Batch sampler — random windows of length block_size
def get_batch(split, block_size=64, batch_size=32):
    d    = train_data if split == 'train' else val_data
    ix   = torch.randint(len(d) - block_size, (batch_size,))
    x    = torch.stack([d[i   : i + block_size    ] for i in ix])
    y    = torch.stack([d[i+1 : i + block_size + 1] for i in ix])
    return x, y

Every training example is a window of block_size characters. The target y is the same window shifted one position right — position $i$ predicts position $i+1$.
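
One window therefore yields block_size training examples, one per prefix length. A quick check against the data loaded above makes the shift concrete:

xb, yb = get_batch('train', block_size=8, batch_size=1)
print(decode(xb[0].tolist()))   # an 8-character window from the corpus
print(decode(yb[0].tolist()))   # the same window, shifted right by one character
for t in range(8):
    context = decode(xb[0, :t + 1].tolist())
    target  = itos[yb[0, t].item()]
    print(f"given {context!r} -> predict {target!r}")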


Step 2: The Config #

Keep all hyperparameters in one place:

import torch.nn as nn
from dataclasses import dataclass

@dataclass
class GPTConfig:
    vocab_size  : int   = 65
    block_size  : int   = 64      # context window (characters)
    n_layer     : int   = 4       # transformer blocks
    n_head      : int   = 4       # attention heads
    n_embd      : int   = 128     # embedding dimension
    dropout     : float = 0.1

This is an ~800K-parameter model. Tiny by modern standards — it’ll train on a laptop in a few minutes.
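
Here's the back-of-the-envelope arithmetic behind that number, matching the modules built in the next steps (worth checking by hand once so the count isn't a mystery):

# Rough parameter count for GPTConfig(); the tied output head adds nothing extra
C = 128                                           # n_embd
attn  = 3 * C * C + (C * C + C)                   # Q/K/V across 4 heads (no bias) + output projection
ffn   = (C * 4 * C + 4 * C) + (4 * C * C + C)     # 128 → 512 → 128, with biases
norms = 2 * 2 * C                                 # two LayerNorms per block
per_block = attn + ffn + norms                    # 197,888
total = 4 * per_block + 65 * C + 64 * C + 2 * C   # blocks + token emb + position emb + final LayerNorm
print(f"{total:,}")                               # 808,320, i.e. roughly 0.81M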


Step 3: One Attention Head #

class Head(nn.Module):
    def __init__(self, cfg, head_size):
        super().__init__()
        self.key   = nn.Linear(cfg.n_embd, head_size, bias=False)
        self.query = nn.Linear(cfg.n_embd, head_size, bias=False)
        self.value = nn.Linear(cfg.n_embd, head_size, bias=False)
        self.drop  = nn.Dropout(cfg.dropout)
        # Causal mask — lower-triangular, registered as buffer (not a param)
        self.register_buffer('tril', torch.tril(torch.ones(cfg.block_size, cfg.block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        v = self.value(x)  # (B, T, head_size)

        # Scaled attention scores
        scale = k.shape[-1] ** -0.5
        wei   = q @ k.transpose(-2, -1) * scale  # (B, T, T)

        # Causal mask: future positions → -inf → 0 after softmax
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = torch.softmax(wei, dim=-1)
        wei = self.drop(wei)

        return wei @ v  # (B, T, head_size)
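
A throwaway check (not part of the model) makes the masking visible: after the fill-and-softmax, each row of the attention matrix puts weight only on the current and earlier positions.

T    = 5
wei  = torch.randn(T, T)
tril = torch.tril(torch.ones(T, T))
wei  = wei.masked_fill(tril == 0, float('-inf'))
print(torch.softmax(wei, dim=-1))   # lower-triangular rows, each summing to 1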

Step 4: Multi-Head, Feed-Forward, Block #

class MultiHeadAttention(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        head_size   = cfg.n_embd // cfg.n_head
        self.heads  = nn.ModuleList([Head(cfg, head_size) for _ in range(cfg.n_head)])
        self.proj   = nn.Linear(cfg.n_embd, cfg.n_embd)
        self.drop   = nn.Dropout(cfg.dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.drop(self.proj(out))


class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cfg.n_embd, 4 * cfg.n_embd),
            nn.GELU(),
            nn.Linear(4 * cfg.n_embd, cfg.n_embd),
            nn.Dropout(cfg.dropout),
        )

    def forward(self, x):
        return self.net(x)


class Block(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.sa   = MultiHeadAttention(cfg)
        self.ff   = FeedForward(cfg)
        self.ln1  = nn.LayerNorm(cfg.n_embd)
        self.ln2  = nn.LayerNorm(cfg.n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))   # attend (pre-LN variant)
        x = x + self.ff(self.ln2(x))   # transform
        return x

Step 5: The Full Model #

class MiniGPT(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.n_embd)
        self.pos_emb = nn.Embedding(cfg.block_size, cfg.n_embd)
        self.drop    = nn.Dropout(cfg.dropout)
        self.blocks  = nn.Sequential(*[Block(cfg) for _ in range(cfg.n_layer)])
        self.ln_f    = nn.LayerNorm(cfg.n_embd)
        self.head    = nn.Linear(cfg.n_embd, cfg.vocab_size, bias=False)
        # Weight tying — same matrix for embedding and output projection
        self.tok_emb.weight = self.head.weight
        self.apply(self._init_weights)

    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            nn.init.normal_(m.weight, std=0.02)
            if m.bias is not None: nn.init.zeros_(m.bias)
        elif isinstance(m, nn.Embedding):
            nn.init.normal_(m.weight, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok = self.tok_emb(idx)                         # (B, T, C)
        pos = self.pos_emb(torch.arange(T, device=idx.device))  # (T, C)
        x   = self.drop(tok + pos)
        x   = self.blocks(x)
        x   = self.ln_f(x)
        logits = self.head(x)                           # (B, T, vocab_size)

        loss = None
        if targets is not None:
            loss = nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1)
            )
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -self.cfg.block_size:]         # trim to context
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :] / temperature          # last position only
            probs  = torch.softmax(logits, dim=-1)
            next_idx = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, next_idx), dim=1)
        return idx

Weight tying: the token embedding matrix and the output projection share the same weights. This saves ~65×128 = 8,320 parameters and improves quality — the model learns one consistent representation for each token.
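
You can confirm the tie directly once the model is constructed; there is genuinely one tensor, so parameters() counts it once:

m = MiniGPT(GPTConfig())
print(m.tok_emb.weight is m.head.weight)   # True: gradients from both roles flow into the same matrix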


Step 6: Training #

device = 'cuda' if torch.cuda.is_available() else 'cpu'
cfg    = GPTConfig()
model  = MiniGPT(cfg).to(device)

print(f"Parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.2f}M")
# → Parameters: 0.81M

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(5000):
    xb, yb = get_batch('train')
    xb, yb = xb.to(device), yb.to(device)

    logits, loss = model(xb, yb)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()

    if step % 500 == 0:
        model.eval()
        with torch.no_grad():
            _, val_loss = model(*[t.to(device) for t in get_batch('val')])
        model.train()
        print(f"step {step:4d}  train {loss.item():.3f}  val {val_loss.item():.3f}")

After 5,000 steps (~2 min on CPU), validation loss drops from ~4.2 (random) to ~1.7. After 10,000 steps the model writes recognisably Shakespearean text — fake words, plausible meter, the occasional coherent sentence.
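
One caveat about those numbers: the validation loss printed in the loop comes from a single random batch, so it jumps around. A small helper (call it estimate_loss; a sketch, not part of the code above) averages over several batches for a steadier read:

@torch.no_grad()
def estimate_loss(split, n_batches=20):
    # Average the loss over several random batches instead of trusting one
    model.eval()
    losses = []
    for _ in range(n_batches):
        xb, yb = get_batch(split)
        xb, yb = xb.to(device), yb.to(device)
        _, loss = model(xb, yb)
        losses.append(loss.item())
    model.train()
    return sum(losses) / len(losses)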


Step 7: Generating Text #

model.eval()
context = torch.zeros((1, 1), dtype=torch.long, device=device)  # start token: index 0
generated = model.generate(context, max_new_tokens=500, temperature=0.8)
print(decode(generated[0].tolist()))

After ~5,000 steps you might see something like:

ROMEO:
What light is thouver yeed with the king?

JULIET:
I have so be, that I have been the sea
Of the world, and the world is but the world—

Not coherent, but the structure is there. Capital names, line breaks in the right places, iambic rhythm. The model has learned the surface statistics of Shakespearean text.

Temperature controls randomness. $T = 1.0$: sample from the model’s distribution. $T < 1.0$: sharper, more predictable. $T > 1.0$: creative chaos.
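
You can see the effect on a toy distribution (three made-up logits, nothing to do with the trained model):

logits = torch.tensor([2.0, 1.0, 0.5])
for T in (0.3, 1.0, 3.0):
    print(T, torch.softmax(logits / T, dim=-1))
# T=0.3 piles nearly all the mass on the top token; T=3.0 flattens it toward uniform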


Demo: Bigram Warm-Up #

Before the transformer, the simplest possible language model: a bigram model. It only looks at the last character to predict the next one. The learned parameters are just a table of character pair frequencies.

This demo trains a bigram model on a short Shakespeare excerpt right in your browser, then generates text from it. Watch the probability bars change as the generation moves through different characters.

[Interactive demo: a bigram character model trained in-browser on Shakespeare, showing next-char probabilities (top 14).]

Hit ▶ Auto and watch it generate. Try temperature 0.3 (repetitive, confident) vs 1.5 (chaotic). This is the simplest language model — it only remembers one character. Your transformer remembers the entire context window.
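
If you'd rather poke at the same idea in code, a count-based bigram model takes only a few lines. Here's a sketch reusing the Shakespeare data and the encode/decode helpers from Step 1:

# A bigram table: P(next char | current char) estimated from pair counts
counts = torch.zeros(vocab_size, vocab_size)
for a, b in zip(data[:-1].tolist(), data[1:].tolist()):
    counts[a, b] += 1
probs = (counts + 1) / (counts + 1).sum(dim=1, keepdim=True)   # add-one smoothing

# Generation conditions on exactly one previous character
idx = encode('T')[0]
out = [idx]
for _ in range(200):
    idx = torch.multinomial(probs[idx], num_samples=1).item()
    out.append(idx)
print(decode(out))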


What Makes It Work #

Three things bridged the gap from “doesn’t train” to “writes Shakespeare”:

Scale — the full Shakespeare dataset is 1M characters. The model sees every pattern millions of times.

Context — a bigram sees 1 character of context. The transformer sees 64. GPT-2 sees 1,024. GPT-4 sees ~128,000. More context = more coherent output.

Depth — 4 transformer blocks, each one building on the last’s features. Layer 1 sees character n-grams. Layer 2 sees word-like chunks. Layer 3 sees phrases. Layer 4 sees narrative structure. This hierarchy emerges automatically.

The math is identical to what you derived in lessons 1–12. There’s no new principle here. Just more of the same, stacked.


Before You Go — Try These #

  1. The model uses weight tying: tok_emb.weight = head.weight. Why does sharing these matrices make sense? What’s the geometric interpretation of the output projection being the transpose of the embedding lookup?

  2. After training, the loss is ~1.7 nats. What does this mean in terms of perplexity? If a uniform distribution over 65 characters gives loss $\ln(65) \approx 4.17$ nats, how much better is the model doing?

  3. The generate function trims the context to block_size before each forward pass. What would happen if you fed the full generated sequence instead? Is there a principled way to handle sequences longer than block_size?

  4. Temperature $T$ rescales logits before softmax: divide by $T$, then softmax. Show algebraically that $T \to 0$ makes the distribution collapse to argmax, and $T \to \infty$ makes it uniform. What does $T = 1$ leave unchanged?

  5. The FFN in each block has width $4 \times$ n_embd. For n_embd=128, that’s 512 hidden units. How many parameters does one FFN layer have? What fraction of the total model parameters does the FFN stack account for?


From here: run the training code, read the Attention Is All You Need paper, and go build something.