You now have all the pieces.
Forward pass: push input through layers, get a prediction. Loss: measure how wrong it is. Backprop: compute which weights caused the wrongness. Gradient descent: nudge those weights in the right direction.
Now we put them in a loop and repeat until the network stops being wrong.
That’s it. That’s training. Let’s make it concrete.
The Loop #
for epoch in range(num_epochs):
    for x_batch, y_batch in dataloader:
        optimizer.zero_grad()             # clear old gradients
        y_hat = model(x_batch)            # forward pass
        loss = loss_fn(y_hat, y_batch)    # compute loss
        loss.backward()                   # backprop
        optimizer.step()                  # update weights
Seven lines. Every neural network training run in history is this loop, scaled up.
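The loop assumes a few objects already exist: a `model`, a `loss_fn`, an `optimizer`, and a `dataloader`. As a rough sketch, the setup might look like this (the architecture, loss, optimizer, and synthetic data below are illustrative choices, not part of the loop itself):

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Illustrative setup only: any model / loss / optimizer plugs into the same loop
X = torch.randn(1000, 2)
y = (X[:, 0] * X[:, 1] > 0).float().unsqueeze(1)   # a made-up binary target

dataloader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)
model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
num_epochs = 50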
The vocabulary:
- Epoch — one full pass through the entire dataset
- Batch — a small chunk of examples processed together (typically 32–256)
- Step — one gradient update (one batch)
- Iteration — same as step
A dataset of 1,000 examples with batch size 32 works out to roughly 31 steps per epoch (31 full batches, plus one partial batch of 8 if you keep it).
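A throwaway snippet, not part of the lesson's code, just to check that arithmetic:

import math
dataset_size, batch_size = 1000, 32
print(dataset_size // batch_size)             # 31 full batches
print(math.ceil(dataset_size / batch_size))   # 32 steps if the final partial batch is kept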
Why Batches? #
You could compute gradients on the entire dataset at once (full-batch gradient descent). Or on one example at a time (stochastic gradient descent, SGD). Batches are the compromise.
Full-batch: smooth, accurate gradients. But one update requires a full forward+backward pass over all data. Slow, and uses enormous memory.
One example (SGD): fast updates, cheap memory. But the gradient from a single example is noisy — it might not point in the true direction of steepest descent.
Mini-batch: best of both. Enough examples to get a decent gradient estimate. Small enough to run fast and fit in GPU memory. The noise actually helps — it prevents the optimizer from getting stuck in sharp local minima.
Standard practice: batch size 32 or 64 for starters.
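To see why batch size matters for gradient quality, here is a small NumPy sketch (not from the lesson; the linear-regression setup and the numbers are made up) that estimates the same gradient from batches of different sizes and measures how much the estimate jumps around:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=10_000)
w = 0.0  # a single weight; the loss is the mean squared error of w*x against y

def grad(xb, yb):
    # dL/dw for L = mean((w*x - y)^2)
    return np.mean(2 * (w * xb[:, 0] - yb) * xb[:, 0])

full = grad(X, y)  # the "true" full-batch gradient
for bs in (1, 32, 1024):
    estimates = []
    for _ in range(200):
        idx = rng.choice(len(X), size=bs, replace=False)
        estimates.append(grad(X[idx], y[idx]))
    print(f"batch={bs:5d}  spread of estimate={np.std(estimates):.3f}  (full-batch={full:.3f})")

Smaller batches give noisier estimates, and the spread shrinks roughly with the square root of the batch size, which is why going from 32 to 1024 buys less than you might expect.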
Shuffle Every Epoch #
One critical detail: shuffle the data before each epoch.
If your data is ordered (all cats, then all dogs), the network will see an unbalanced gradient signal — first optimising hard for cats, then overwriting that for dogs. The loss oscillates wildly.
Shuffling ensures each batch is a random mix. The gradient estimate is much more representative of the full dataset.
# PyTorch handles this automatically with shuffle=True
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
When to Stop: Overfitting #
You split your data into train set and validation set (typically 80/20). Track loss on both.
Early in training: both losses fall. The network is learning the underlying pattern.
Later: training loss keeps falling. Validation loss stops falling — or starts rising. The network is memorising the training data instead of learning to generalise.
That’s overfitting. Stop training here (or before, using early stopping).
epoch   1: train_loss=0.68  val_loss=0.70
epoch  10: train_loss=0.35  val_loss=0.38
epoch  50: train_loss=0.12  val_loss=0.14  ← good
epoch 100: train_loss=0.04  val_loss=0.19  ← overfitting
epoch 200: train_loss=0.01  val_loss=0.31  ← badly overfit
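Acting on that signal is usually automated as early stopping: remember the best validation loss seen so far and stop once it hasn't improved for a set number of epochs. A minimal sketch of just the bookkeeping (the validation-loss sequence and the patience value are made up to mirror the trace above):

# made-up per-epoch validation losses, shaped like the trace above
val_losses = [0.70, 0.38, 0.20, 0.14, 0.15, 0.16, 0.19, 0.25, 0.31]

best_loss = float("inf")
best_epoch = 0
patience = 3          # stop after this many epochs without improvement

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_loss:
        best_loss, best_epoch = val_loss, epoch
        # in a real run you would also snapshot the model weights here
    elif epoch - best_epoch >= patience:
        print(f"stop at epoch {epoch}; best was epoch {best_epoch} (val_loss={best_loss:.2f})")
        break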
In Python: Full Training Loop (No Libraries) #
import numpy as np
def relu(z): return np.maximum(0, z)
def sigmoid(z): return 1 / (1 + np.exp(-z))
def bce(yhat, y): return -y*np.log(yhat+1e-7) - (1-y)*np.log(1-yhat+1e-7)
# Network: 2 -> 8 -> 1
np.random.seed(42)
W1 = np.random.randn(8, 2) * np.sqrt(2/2)
b1 = np.zeros(8)
W2 = np.random.randn(1, 8) * np.sqrt(2/8)
b2 = np.zeros(1)
# XOR dataset
X = np.array([[-1,-1],[-1,1],[1,-1],[1,1]] * 25, dtype=float)
X += np.random.randn(*X.shape) * 0.1
y = np.array([0,1,1,0] * 25, dtype=float)
lr = 0.05
for epoch in range(500):
    # Shuffle
    idx = np.random.permutation(len(X))
    X, y = X[idx], y[idx]
    for i in range(0, len(X), 32):  # mini-batches
        xb, yb = X[i:i+32], y[i:i+32]
        # Forward
        z1 = xb @ W1.T + b1             # (batch, 8)
        a1 = relu(z1)
        z2 = a1 @ W2.T + b2             # (batch, 1)
        yhat = sigmoid(z2).flatten()
        # Backward
        dz2 = (yhat - yb) / len(xb)     # (batch,)
        dW2 = dz2[:, None].T @ a1       # (1, 8)
        db2 = dz2.sum()
        da1 = dz2[:, None] @ W2         # (batch, 8)
        dz1 = da1 * (z1 > 0)
        dW1 = dz1.T @ xb                # (8, 2)
        db1 = dz1.sum(axis=0)
        # Update
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    if epoch % 100 == 0:
        yhat_all = sigmoid(relu(X @ W1.T + b1) @ W2.T + b2).flatten()
        print(f"epoch {epoch} loss={bce(yhat_all, y).mean():.4f}")
Demo: Watch It Learn #
Hit Train and watch the network figure out the pattern in real time. The background colour shows the network’s confidence at every point — blue for class 1, red for class 0. The inset chart tracks the loss curve.
Try all three datasets. XOR is easy — solved in seconds. Circle takes a bit longer. Spiral is genuinely hard — watch the boundary slowly unwind and separate the two arms.
Notice: the Spiral dataset often needs a lower learning rate (try 0.02) and more steps — it’s genuinely hard to separate. A linear model can’t do it at all. The curved boundary the network slowly learns is possible only because of the nonlinear activations we covered in Lesson 03.
What’s Actually Happening #
Every frame of that animation is many gradient descent steps. Each step:
- Pick a shuffled batch from the data
- Run forward pass, get predictions
- Compute BCE loss
- Run backprop, get $\frac{\partial L}{\partial W}$ for every weight
- Nudge every weight by $-\eta \cdot \frac{\partial L}{\partial W}$
The background heatmap is redrawn by running the forward pass on a grid of points — showing what the network currently “thinks” about every location in 2D space. As the weights change, the decision boundary reshapes.
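In code, that heatmap is just the forward pass evaluated on a mesh of 2D points. A rough sketch using the NumPy network above (the grid range and resolution are arbitrary choices):

xs = np.linspace(-2, 2, 100)
ys = np.linspace(-2, 2, 100)
gx, gy = np.meshgrid(xs, ys)
grid = np.stack([gx.ravel(), gy.ravel()], axis=1)      # (10000, 2) query points
probs = sigmoid(relu(grid @ W1.T + b1) @ W2.T + b2)    # network confidence at each point
heatmap = probs.reshape(100, 100)                      # one value per pixel of the background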
This is how every neural network learns. The same loop, just with more layers, bigger matrices, and millions more steps.
Before You Go — Try These #
- In the demo, train on XOR until it converges. Then hit Reset and train again. Does it converge to the same boundary? Why might it look different each time?
- Crank the learning rate to 0.3 on XOR and watch what happens. What does the loss curve look like? Why?
- In the Python code, the gradient for the output layer is `dz2 = yhat - y`. This is suspiciously simple — shouldn't there be more chain rule terms? Expand it: what's $\frac{\partial L_{BCE}}{\partial z^{(2)}}$ using the chain rule, and why does it simplify so cleanly?
- What would happen if you removed the shuffle from the training loop and the data was ordered (all class 0 first, then all class 1)?
- Train on Spiral with `lr=0.05`. Now train again with `lr=0.01`. Which converges to a better solution? What does this say about the relationship between learning rate and solution quality?
Next up → Lesson 07: The Vanishing Act — why deep networks stopped training in the 1990s, and the tricks that brought them back.