You now have all the pieces.
Forward pass: push input through layers, get a prediction. Loss: measure how wrong it is. Backprop: compute which weights caused the wrongness. Gradient descent: nudge those weights in the right direction.
Now we put them in a loop and repeat until the network stops being wrong.
That’s it. That’s training. Let’s make it concrete.
The Loop #
for epoch in range(num_epochs):
    for x_batch, y_batch in dataloader:
        optimizer.zero_grad()             # clear old gradients
        y_hat = model(x_batch)            # forward pass
        loss = loss_fn(y_hat, y_batch)    # compute loss
        loss.backward()                   # backprop
        optimizer.step()                  # update weights
Seven lines. Every neural network training run in history is this loop, scaled up.
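The loop assumes a few objects already exist: a `model`, a `loss_fn`, an `optimizer`, and a `dataloader`. As a rough sketch, the setup might look like this (the architecture, loss, optimizer, and synthetic data below are illustrative choices, not part of the loop itself):

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Illustrative setup only: any model / loss / optimizer plugs into the same loop
X = torch.randn(1000, 2)
y = (X[:, 0] * X[:, 1] > 0).float().unsqueeze(1)   # a made-up binary target

dataloader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)
model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
num_epochs = 50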
The vocabulary:
- Epoch — one full pass through the entire dataset
- Batch — a small chunk of examples processed together (typically 32–256)
- Step — one gradient update (one batch)
- Iteration — same as step
A dataset of 1,000 examples with batch size 32 works out to roughly 31 steps per epoch (31 full batches, plus one partial batch of 8 if you keep it).
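A throwaway snippet, not part of the lesson's code, just to check that arithmetic:

import math
dataset_size, batch_size = 1000, 32
print(dataset_size // batch_size)             # 31 full batches
print(math.ceil(dataset_size / batch_size))   # 32 steps if the final partial batch is kept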
Why Batches? #
You could compute gradients on the entire dataset at once (full-batch gradient descent). Or on one example at a time (stochastic gradient descent, SGD). Batches are the compromise.
Full-batch: smooth, accurate gradients. But one update requires a full forward+backward pass over all data. Slow, and uses enormous memory.
One example (SGD): fast updates, cheap memory. But the gradient from a single example is noisy — it might not point in the true direction of steepest descent.
Mini-batch: best of both. Enough examples to get a decent gradient estimate. Small enough to run fast and fit in GPU memory. The noise actually helps — it prevents the optimizer from getting stuck in sharp local minima.
Standard practice: batch size 32 or 64 for starters.
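To see why batch size matters for gradient quality, here is a small NumPy sketch (not from the lesson; the linear-regression setup and the numbers are made up) that estimates the same gradient from batches of different sizes and measures how much the estimate jumps around:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=10_000)
w = 0.0  # a single weight; the loss is the mean squared error of w*x against y

def grad(xb, yb):
    # dL/dw for L = mean((w*x - y)^2)
    return np.mean(2 * (w * xb[:, 0] - yb) * xb[:, 0])

full = grad(X, y)  # the "true" full-batch gradient
for bs in (1, 32, 1024):
    estimates = []
    for _ in range(200):
        idx = rng.choice(len(X), size=bs, replace=False)
        estimates.append(grad(X[idx], y[idx]))
    print(f"batch={bs:5d}  spread of estimate={np.std(estimates):.3f}  (full-batch={full:.3f})")

Smaller batches give noisier estimates, and the spread shrinks roughly with the square root of the batch size, which is why going from 32 to 1024 buys less than you might expect.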
Shuffle Every Epoch #
One critical detail: shuffle the data before each epoch.
If your data is ordered (all cats, then all dogs), the network will see an unbalanced gradient signal — first optimising hard for cats, then overwriting that for dogs. The loss oscillates wildly.
Shuffling ensures each batch is a random mix. The gradient estimate is much more representative of the full dataset.
# PyTorch handles this automatically with shuffle=True
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
When to Stop: Overfitting #
You split your data into train set and validation set (typically 80/20). Track loss on both.
Early in training: both losses fall. The network is learning the underlying pattern.
Later: training loss keeps falling. Validation loss stops falling — or starts rising. The network is memorising the training data instead of learning to generalise.
That’s overfitting. Stop training here (or before, using early stopping).
epoch   1: train_loss=0.68  val_loss=0.70
epoch  10: train_loss=0.35  val_loss=0.38
epoch  50: train_loss=0.12  val_loss=0.14  ← good
epoch 100: train_loss=0.04  val_loss=0.19  ← overfitting
epoch 200: train_loss=0.01  val_loss=0.31  ← badly overfit
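Acting on that signal is usually automated as early stopping: remember the best validation loss seen so far and stop once it hasn't improved for a set number of epochs. A minimal sketch of just the bookkeeping (the validation-loss sequence and the patience value are made up to mirror the trace above):

# made-up per-epoch validation losses, shaped like the trace above
val_losses = [0.70, 0.38, 0.20, 0.14, 0.15, 0.16, 0.19, 0.25, 0.31]

best_loss = float("inf")
best_epoch = 0
patience = 3          # stop after this many epochs without improvement

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_loss:
        best_loss, best_epoch = val_loss, epoch
        # in a real run you would also snapshot the model weights here
    elif epoch - best_epoch >= patience:
        print(f"stop at epoch {epoch}; best was epoch {best_epoch} (val_loss={best_loss:.2f})")
        break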
In Python: Full Training Loop (No Libraries) #
import numpy as np
def relu(z): return np.maximum(0, z)
def sigmoid(z): return 1 / (1 + np.exp(-z))
def bce(yhat, y): return -y*np.log(yhat+1e-7) - (1-y)*np.log(1-yhat+1e-7)
# Network: 2 -> 8 -> 1
np.random.seed(42)
W1 = np.random.randn(8, 2) * np.sqrt(2/2)
b1 = np.zeros(8)
W2 = np.random.randn(1, 8) * np.sqrt(2/8)
b2 = np.zeros(1)
# XOR dataset
X = np.array([[-1,-1],[-1,1],[1,-1],[1,1]] * 25, dtype=float)
X += np.random.randn(*X.shape) * 0.1
y = np.array([0,1,1,0] * 25, dtype=float)
lr = 0.05
for epoch in range(500):
    # Shuffle
    idx = np.random.permutation(len(X))
    X, y = X[idx], y[idx]
    for i in range(0, len(X), 32):  # mini-batches
        xb, yb = X[i:i+32], y[i:i+32]
        # Forward
        z1 = xb @ W1.T + b1             # (batch, 8)
        a1 = relu(z1)
        z2 = a1 @ W2.T + b2             # (batch, 1)
        yhat = sigmoid(z2).flatten()
        # Backward
        dz2 = (yhat - yb) / len(xb)     # (batch,)
        dW2 = dz2[:, None].T @ a1       # (1, 8)
        db2 = dz2.sum()
        da1 = dz2[:, None] @ W2         # (batch, 8)
        dz1 = da1 * (z1 > 0)
        dW1 = dz1.T @ xb                # (8, 2)
        db1 = dz1.sum(axis=0)
        # Update
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    if epoch % 100 == 0:
        yhat_all = sigmoid(relu(X @ W1.T + b1) @ W2.T + b2).flatten()
        print(f"epoch {epoch} loss={bce(yhat_all, y).mean():.4f}")
Demo: Watch It Learn #
Hit Train and watch the network figure out the pattern in real time. The background colour shows the network’s confidence at every point — blue for class 1, red for class 0. The inset chart tracks the loss curve.
Try all three datasets. XOR is easy — solved in seconds. Circle takes a bit longer. Spiral is genuinely hard — watch the boundary slowly unwind and separate the two arms.
Notice: the Spiral dataset often needs a lower learning rate (try 0.02) and more steps — it’s genuinely hard to separate. A linear model can’t do it at all. The curved boundary the network slowly learns is possible only because of the nonlinear activations we covered in Lesson 03.
What’s Actually Happening #
Every frame of that animation is many gradient descent steps. Each step:
- Pick a shuffled batch from the data
- Run forward pass, get predictions
- Compute BCE loss
- Run backprop, get $\frac{\partial L}{\partial W}$ for every weight
- Nudge every weight by $-\eta \cdot \frac{\partial L}{\partial W}$
The background heatmap is redrawn by running the forward pass on a grid of points — showing what the network currently “thinks” about every location in 2D space. As the weights change, the decision boundary reshapes.
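In code, that heatmap is just the forward pass evaluated on a mesh of 2D points. A rough sketch using the NumPy network above (the grid range and resolution are arbitrary choices):

xs = np.linspace(-2, 2, 100)
ys = np.linspace(-2, 2, 100)
gx, gy = np.meshgrid(xs, ys)
grid = np.stack([gx.ravel(), gy.ravel()], axis=1)      # (10000, 2) query points
probs = sigmoid(relu(grid @ W1.T + b1) @ W2.T + b2)    # network confidence at each point
heatmap = probs.reshape(100, 100)                      # one value per pixel of the background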
This is how every neural network learns. The same loop, just with more layers, bigger matrices, and millions more steps.
Before You Go — Try These #
- In the demo, train on XOR until it converges. Then hit Reset and train again. Does it converge to the same boundary? Why might it look different each time?
- Crank the learning rate to 0.3 on XOR and watch what happens. What does the loss curve look like? Why?
- In the Python code, the gradient for the output layer is `dz2 = yhat - y`. This is suspiciously simple — shouldn't there be more chain rule terms? Expand it: what's $\frac{\partial L_{BCE}}{\partial z^{(2)}}$ using the chain rule, and why does it simplify so cleanly?
- What would happen if you removed the shuffle from the training loop and the data was ordered (all class 0 first, then all class 1)?
- Train on Spiral with `lr=0.05`. Now train again with `lr=0.01`. Which converges to a better solution? What does this say about the relationship between learning rate and solution quality?
Next up → Lesson 07: The Vanishing Act — why deep networks stopped training in the 1990s, and the tricks that brought them back.