Your network got 99% accuracy. Congratulations.
On the training data.
On new data it had never seen: 61%. You have a very expensive memorisation machine, not a learning machine.
This is overfitting. It’s the most common failure mode in machine learning, and understanding it is what separates people who can actually deploy models from people who just celebrate training accuracy.
What’s Actually Happening #
A neural network has enough capacity to memorise its training data. Given enough parameters, it can fit any finite set of points perfectly — including the noise in those points.
The problem: noise is random, so it doesn’t carry over to new data. A model that memorises noise has learned the wrong thing.
Think about fitting a curve through 8 noisy data points sampled from a sine wave. A straight line (underfitting) misses the pattern. A cubic captures the true shape. A degree-10 polynomial threads through every single point — including the noisy ones — and goes haywire everywhere else.
Neural networks have this same problem, scaled to millions of parameters.
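You can reproduce the polynomial version in a few lines. Here is a minimal NumPy sketch (the noise level and seed are illustrative choices; degree 7 already threads exactly through 8 points, so it stands in for the degree-10 case):

```python
import numpy as np

rng = np.random.default_rng(0)

# 8 noisy samples from a sine wave, plus a clean held-out test set
x_train = np.sort(rng.uniform(0, 2 * np.pi, 8))
y_train = np.sin(x_train) + rng.normal(0, 0.2, 8)
x_test = np.linspace(0.5, 2 * np.pi - 0.5, 50)
y_test = np.sin(x_test)

for degree in (1, 3, 7):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train={train_mse:.4f}  test={test_mse:.4f}")
```

Train error falls as the degree climbs; test error is U-shaped, dropping at first and then blowing up once the fit starts chasing noise.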
The Diagnostic: Learning Curves #
Always track two numbers during training: training loss and validation loss.
Split your data: 80% for training, 20% held out for validation. The model never trains on validation data — it’s a clean test of generalisation.
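For a plain array dataset the split is a couple of lines. A sketch, assuming `X` and `y` are NumPy arrays of inputs and labels:

```python
import numpy as np

rng = np.random.default_rng(0)

# shuffle once, then carve off the last 20% as the validation set
idx = rng.permutation(len(X))
split = int(0.8 * len(X))
X_train, y_train = X[idx[:split]], y[idx[:split]]
X_val, y_val = X[idx[split:]], y[idx[split:]]
```

Then watch the two losses as training runs: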
epoch   1:  train=0.85  val=0.87  ← both high, normal
epoch  20:  train=0.42  val=0.44  ← both dropping, great
epoch  50:  train=0.21  val=0.22  ← still good
epoch 100:  train=0.09  val=0.18  ← gap opening, watch out
epoch 200:  train=0.02  val=0.31  ← overfitting, stop here
The moment validation loss stops improving while training loss keeps falling — that’s overfitting. The model is learning the training set, not the underlying pattern.
Fix 1: More Data #
The best fix. More training examples means less room for memorisation — the model is forced to find real patterns that generalise.
If you’re overfitting and can get more data, get more data. Everything else on this list is a workaround for when you can’t.
Fix 2: L2 Regularisation (Weight Decay) #
Add a penalty to the loss for having large weights:
$$L_{total} = L_{data} + \lambda \sum_{i,j} W_{ij}^2$$
The gradient becomes:
$$\frac{\partial L_{total}}{\partial W} = \frac{\partial L_{data}}{\partial W} + 2\lambda W$$
The update rule now pushes weights toward zero at every step:
$$W \leftarrow W - \eta \left(\frac{\partial L_{data}}{\partial W} + 2\lambda W\right) = W(1 - 2\eta\lambda) - \eta\frac{\partial L_{data}}{\partial W}$$
The factor $(1 - 2\eta\lambda)$ is why it’s called “weight decay” — weights decay a little at every step unless the loss gradient pushes back.
Small weights → smoother functions → less overfitting. $\lambda$ is a hyperparameter you tune. Too large and you underfit; too small and it does nothing.
# PyTorch: just pass weight_decay to the optimiser
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
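If you want to see the decay factor from the update rule in isolation, here is a hedged NumPy sketch of a single step (the function name is mine, not a library API):

```python
import numpy as np

def sgd_step_with_weight_decay(W, grad_data, lr=0.001, lam=1e-4):
    # exactly the rearranged update above: shrink the weights by
    # (1 - 2*lr*lam), then take the usual step on the data-loss gradient
    return W * (1 - 2 * lr * lam) - lr * grad_data
```

Two footnotes worth knowing: frameworks typically absorb the factor of 2 into $\lambda$, so PyTorch's update uses $\lambda W$ rather than $2\lambda W$; and Adam's `weight_decay` adds the penalty to the gradient before the adaptive scaling, while `AdamW` applies the decay directly to the weights, which is often what you actually want with Adam.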
Fix 3: Dropout #
Introduced by Srivastava et al. in 2014. During training, randomly zero out each neuron with probability $p$ (typically 0.2–0.5). Scale the survivors by $\frac{1}{1-p}$ to keep the expected value the same — this scaled variant is known as inverted dropout.
import numpy as np

def dropout(x, p, training=True):
    if not training:
        return x  # no dropout at inference
    # keep each unit with probability (1-p); scale survivors so the
    # expected activation matches the no-dropout case
    mask = (np.random.rand(*x.shape) > p) / (1 - p)
    return x * mask
Why does randomly breaking your network help? A few reasons:
Ensemble effect. Each training step uses a different randomly-sampled sub-network. At inference you use the full network — effectively averaging many sub-networks. Ensembles generalise better.
Co-adaptation prevention. Without dropout, neurons can learn to depend on each other. Neuron A learns to fix neuron B’s mistakes, which means they’re not independently useful. Dropout forces each neuron to be robust on its own.
class DropoutLayer:
    def __init__(self, p=0.5):
        self.p = p
        self.mask = None

    def forward(self, x, training=True):
        if not training:
            return x
        self.mask = (np.random.rand(*x.shape) > self.p) / (1 - self.p)
        return x * self.mask

    def backward(self, grad):
        return grad * self.mask  # same mask, same neurons zeroed
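A quick sanity check of the layer, reusing `np` and `DropoutLayer` from above (the shapes are illustrative):

```python
layer = DropoutLayer(p=0.5)
x = np.ones((1000, 64))

out = layer.forward(x, training=True)
print((out == 0).mean())  # ≈ 0.5: about half the activations are zeroed
print(out.mean())         # ≈ 1.0: the 1/(1-p) scaling preserves the mean

print(layer.forward(x, training=False).mean())  # exactly 1.0: identity at inference
```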
Fix 4: Early Stopping #
The simplest technique. Monitor validation loss. When it hasn’t improved for $k$ epochs (the “patience”), stop training and revert to the best checkpoint.
patience = 10  # epochs to wait without improvement before stopping
best_val_loss = float('inf')
patience_counter = 0

for epoch in range(max_epochs):
    train_loss = train_one_epoch(model, train_data)
    val_loss = evaluate(model, val_data)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        save_checkpoint(model)  # save the best weights
        patience_counter = 0
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"Early stopping at epoch {epoch}")
            load_checkpoint(model)  # restore best weights
            break
Fix 5: Data Augmentation #
If you can’t get more real data, make more from what you have. For images: flip, rotate, crop, change brightness. The model sees more variation, generalises better.
# torchvision transforms
from torchvision import transforms

transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.2),
    transforms.ToTensor(),
])
The label stays the same — a flipped cat is still a cat.
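To actually use the pipeline, pass it to a dataset so each image gets a fresh random transform every epoch. A sketch (CIFAR-10 is just an illustrative choice):

```python
import torchvision

train_set = torchvision.datasets.CIFAR10(
    root="data", train=True, download=True, transform=transform
)
```

Augmentation applies during training only; the validation set should use a plain `ToTensor()` so you're measuring generalisation on unmodified images.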
Demo: The Overfitting Machine #
Fit a polynomial to noisy data. Drag the degree slider up and watch it start memorising noise instead of learning the pattern. Then turn on regularisation and see what happens.
The dashed line is the true underlying function, the filled dots are training data, and the hollow circles are test data the model never saw. Test MSE tells you how well the fit actually generalises.
Try: degree 1 (straight line, underfitting) → degree 4 (good fit) → degree 10 (memorising noise). Then crank degree to 12 and slowly increase λ — watch the curve smooth out as regularisation forces the weights smaller.
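If you'd rather poke at this in code than in the widget, a plain ridge-regression solve captures the same λ slider. A sketch (data and λ values are illustrative; x stays in $[-1, 1]$ so high powers remain numerically tame):

```python
import numpy as np

rng = np.random.default_rng(0)

def poly_ridge(x, y, degree, lam):
    # least-squares polynomial fit with an L2 penalty on the coefficients;
    # x is assumed scaled to [-1, 1]
    X = np.vander(x, degree + 1)  # columns x^degree ... x^0, as polyval expects
    return np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)

x_train = np.sort(rng.uniform(-1, 1, 15))
y_train = np.sin(np.pi * x_train) + rng.normal(0, 0.2, 15)
x_test = np.linspace(-1, 1, 200)
y_test = np.sin(np.pi * x_test)

for lam in (0.0, 1e-4, 1e-2, 1.0):
    w = poly_ridge(x_train, y_train, degree=12, lam=lam)
    test_mse = np.mean((np.polyval(w, x_test) - y_test) ** 2)
    print(f"lambda={lam:g}: test MSE={test_mse:.4f}")
```

As λ grows, the penalty drags the coefficients toward zero and the wiggles flatten out; push it too far and the curve goes flat, and you're back to underfitting.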
The Bias-Variance Tradeoff #
There’s a fundamental tension:
- High bias (underfitting): the model is too simple and misses the pattern. Both train and test error are high.
- High variance (overfitting): the model is too complex and fits the noise. Train error is low, test error is high.
You can’t reduce both to zero simultaneously. The best model is somewhere in the middle — complex enough to capture the real pattern, but not so complex it captures the noise too.
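For squared-error loss this tension is an actual theorem. Assuming $y = f(x) + \varepsilon$ with noise variance $\sigma^2$, expected test error decomposes as

$$\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \text{Bias}\left[\hat{f}(x)\right]^2 + \text{Var}\left[\hat{f}(x)\right] + \sigma^2$$

Simpler models shrink the variance term but inflate the bias term, and vice versa; the $\sigma^2$ floor is irreducible no matter what you do.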
More data shifts this tradeoff in your favour. More data means you can afford a more complex model before overfitting kicks in.
Before You Go — Try These #
- In the demo, fit degree 1 to the data. Now fit degree 2. The train MSE drops significantly. Is this overfitting or a genuine improvement? How do you tell the difference?
- In the dropout forward pass, why do we divide by $(1-p)$ during training? What would happen at inference if we didn’t?
- L2 regularisation adds $\lambda \sum_{i,j} W_{ij}^2$ to the loss. What does the regularisation gradient look like for a single weight $w_{ij} = 5.0$ with $\lambda = 0.01$? How does this compare to a weight $w_{ij} = 0.1$?
- If you apply dropout with $p=0.5$ to a hidden layer of 128 neurons, how many neurons are active on average per forward pass during training? What does the effective architecture look like?
- Early stopping monitors validation loss. What’s the risk of checking validation loss too frequently (every batch instead of every epoch)? What’s the risk of checking too infrequently?
Next up → Lesson 10: Seeing with Filters — convolutional neural networks, how they see images, and why your laptop can now recognise cats.