05 — Blame It on the Weights

Loss functions, gradient descent, and backpropagation — how a neural network looks at its mistakes and figures out exactly who to blame.

The network made a prediction. It was wrong. Now what?

Someone has to be blamed. Specifically, the weights that caused the wrong answer need to be identified and nudged in a better direction. Do this enough times, and the network learns.

That process — tracing the error backward through every layer to every weight — is backpropagation. It’s the algorithm that made deep learning possible. And it’s just the chain rule from calculus, applied very carefully.


First: How Wrong Are We? #

Before we can fix anything, we need to measure the wrongness. That’s the loss function (also called cost function).

For regression — predicting a number — the standard choice is Mean Squared Error:

$$L = \frac{1}{n} \sum_{i=1}^n (\hat{y}_i - y_i)^2$$

Predicted minus actual, squared, averaged. Squaring does two things: makes negatives positive, and punishes big errors much harder than small ones.

For classification — picking a category — cross-entropy loss is the standard:

$$L = -\sum_{i} y_i \log(\hat{y}_i)$$

This punishes confident wrong answers brutally. If the true label is “cat” and you assign cat a probability of 0.001%, the loss is enormous; as the predicted probability approaches zero, the loss climbs toward infinity. The network learns to not be confidently wrong.
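
Here is a minimal sketch of both losses in NumPy. The predictions are made up for illustration, and the cross-entropy shown is the binary variant of the formula above:

import numpy as np

y_true = np.array([1.0, 0.0, 1.0])           # true labels
y_pred = np.array([0.9, 0.2, 0.8])           # model outputs (illustrative)

# Mean Squared Error: average of squared differences
mse = np.mean((y_pred - y_true) ** 2)        # 0.03

# Binary cross-entropy, with a small epsilon to avoid log(0)
eps = 1e-12
ce = -np.mean(y_true * np.log(y_pred + eps)
              + (1 - y_true) * np.log(1 - y_pred + eps))   # ~0.184

# The "confidently wrong" penalty: true label 1, predicted 0.001
print(-np.log(0.001))                        # ~6.9, versus ~0.105 for predicting 0.9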


Gradient Descent: Rolling Downhill #

Imagine the loss as a hilly landscape. Every point in that landscape is a different set of weights. We’re standing somewhere on the hill — we want to find the lowest valley.

The gradient $\nabla L$ points uphill (in the direction of steepest increase). So we go the opposite way:

$$w \leftarrow w - \eta \cdot \frac{\partial L}{\partial w}$$

This is the gradient descent update rule. $\eta$ (eta) is the learning rate — how big a step to take.

Too large: you overshoot the valley and bounce around forever. Too small: you’ll get there eventually, but you’ll die of old age first.
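
You can watch both failure modes on the simplest possible landscape. A sketch using the one-dimensional bowl $L(w) = w^2$, whose gradient is $2w$ (the learning rates are just illustrative):

def descend(lr, w=5.0, steps=20):
    for _ in range(steps):
        w = w - lr * 2 * w        # gradient descent update: gradient of w^2 is 2w
    return w

print(descend(lr=0.01))   # ~3.34  -- too small: after 20 steps, barely moved
print(descend(lr=0.4))    # ~0.00  -- converges quickly to the minimum
print(descend(lr=1.1))    # ~192   -- too large: each step overshoots, diverging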

We apply this to every single weight in the network after every batch. That’s gradient descent. The hard part is computing $\frac{\partial L}{\partial w}$ for every weight — especially weights buried deep in early layers, far from the output.

That’s what backprop solves.


The Chain Rule: The Only Tool You Need #

If $L$ depends on $a$, which depends on $z$, which depends on $w$:

$$\frac{dL}{dw} = \frac{dL}{da} \cdot \frac{da}{dz} \cdot \frac{dz}{dw}$$

That’s the chain rule. Applied to a neural network with many layers, it becomes:

$$\frac{\partial L}{\partial w^{(1)}} = \frac{\partial L}{\partial a^{(L)}} \cdot \frac{\partial a^{(L)}}{\partial z^{(L)}} \cdot \frac{\partial z^{(L)}}{\partial a^{(L-1)}} \cdots \frac{\partial a^{(1)}}{\partial z^{(1)}} \cdot \frac{\partial z^{(1)}}{\partial w^{(1)}}$$

A long chain of derivatives, multiplied together. Each term is easy to compute on its own. Chained together, they give you the gradient of the loss with respect to any weight, no matter how deep.

Backprop is just this chain rule applied layer by layer, starting from the output and working backwards.
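
It is easy to verify this numerically. A sketch with a single weight feeding a sigmoid neuron and a squared-error loss (all values invented):

import numpy as np

w, x, y = 0.7, 2.0, 1.0

def forward(w):
    z = w * x                         # z depends on w
    a = 1 / (1 + np.exp(-z))          # a depends on z
    return (a - y) ** 2               # L depends on a

# Chain rule: dL/dw = dL/da * da/dz * dz/dw
z = w * x
a = 1 / (1 + np.exp(-z))
analytic = 2 * (a - y) * a * (1 - a) * x

# Finite-difference check
h = 1e-6
numeric = (forward(w + h) - forward(w - h)) / (2 * h)
print(analytic, numeric)              # the two agree to ~6 decimal places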


Backprop, Step by Step #

Take our tiny network: $\mathbf{x} \to \mathbf{z}^{(1)} \to \mathbf{a}^{(1)} \to z^{(2)} \to \hat{y} \to L$

Forward pass (computed and stored):

$$\mathbf{z}^{(1)} = \mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)}, \quad \mathbf{a}^{(1)} = \text{ReLU}(\mathbf{z}^{(1)})$$

$$z^{(2)} = \mathbf{W}^{(2)}\mathbf{a}^{(1)} + b^{(2)}, \quad \hat{y} = \sigma(z^{(2)})$$

Backward pass (chain rule, right to left):

Start at the loss. For MSE with one example: $L = (\hat{y} - y)^2$

$$\frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y)$$

Through sigmoid output:

$$\frac{\partial L}{\partial z^{(2)}} = \frac{\partial L}{\partial \hat{y}} \cdot \sigma'(z^{(2)}) = 2(\hat{y} - y) \cdot \hat{y}(1-\hat{y})$$

Gradient w.r.t. output layer weights:

$$\frac{\partial L}{\partial \mathbf{W}^{(2)}} = \frac{\partial L}{\partial z^{(2)}} \cdot \mathbf{a}^{(1)\top}$$

Now propagate the error back through to the hidden layer:

$$\frac{\partial L}{\partial \mathbf{a}^{(1)}} = \mathbf{W}^{(2)\top} \cdot \frac{\partial L}{\partial z^{(2)}}$$

Through ReLU (derivative is 1 if $z > 0$, else 0):

$$\frac{\partial L}{\partial \mathbf{z}^{(1)}} = \frac{\partial L}{\partial \mathbf{a}^{(1)}} \odot \mathbf{1}[\mathbf{z}^{(1)} > 0]$$

Gradient w.r.t. first layer weights:

$$\frac{\partial L}{\partial \mathbf{W}^{(1)}} = \frac{\partial L}{\partial \mathbf{z}^{(1)}} \cdot \mathbf{x}^{\top}$$

Now we have gradients for every weight. Apply gradient descent. Repeat. That’s training.


The Key Insight #

Notice what backprop needs: the forward pass values, stored in memory. Every $z^{(l)}$, every $a^{(l)}$ — you need them during the backward pass to compute the derivatives.

This is why training a large network uses so much more memory than inference. The forward pass caches everything. The backward pass reads it all back.

Also notice: dead ReLU neurons (output zero during forward pass) have zero gradient during the backward pass. The weight update is zero. They’re truly frozen — gradient can’t reach them.
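
A quick numeric illustration of that last point (the numbers are invented): whatever gradient arrives at a dead ReLU unit gets zeroed by the mask.

import numpy as np

z1 = np.array([1.3, -0.5])            # neuron 2 was negative in the forward pass
upstream = np.array([0.42, 0.42])     # gradient flowing back from the next layer
mask = (z1 > 0).astype(float)         # ReLU derivative: 1 where z > 0, else 0
print(upstream * mask)                # [0.42 0.  ] -- no update reaches neuron 2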


In Python #

import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))
def relu(z): return np.maximum(0, z)

# Forward pass — store intermediate values
x = np.array([0.5, -1.0])
y = 1.0  # true label

W1 = np.array([[0.8, 0.4], [-0.5, 0.9]])
b1 = np.array([0.1, -0.2])
W2 = np.array([[1.2, -0.7]])
b2 = np.array([0.3])

z1 = W1 @ x + b1
a1 = relu(z1)
z2 = W2 @ a1 + b2
y_hat = sigmoid(z2)[0]

loss = (y_hat - y) ** 2

# Backward pass — chain rule
dL_dyhat = 2 * (y_hat - y)
dL_dz2 = dL_dyhat * y_hat * (1 - y_hat)          # through sigmoid
dL_dW2 = dL_dz2 * a1                              # gradient for W2
dL_da1 = W2[0] * dL_dz2                           # error to hidden layer
dL_dz1 = dL_da1 * (z1 > 0).astype(float)         # through ReLU
dL_dW1 = np.outer(dL_dz1, x)                      # gradient for W1

# Update weights
lr = 0.01
W1 -= lr * dL_dW1
W2 -= lr * dL_dW2
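
The walkthrough above skips the bias gradients for brevity. Since $\partial z / \partial b = 1$, each bias gradient is just the upstream gradient at that layer, so the continuation would look like:

# Bias gradients: dz/db = 1, so they equal the upstream gradients
dL_db2 = dL_dz2
dL_db1 = dL_dz1

b1 -= lr * dL_db1
b2 -= lr * dL_db2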

No magic. Just derivatives multiplied together, working backwards.


Demo: Gradient Descent on a Loss Surface #

This is what gradient descent is actually doing in weight space. Pick a surface, drop a ball, watch it roll. Adjust the learning rate and see what happens.

Click anywhere on the surface to drop a ball and watch gradient descent run. Try cranking the learning rate up — watch it overshoot and oscillate. Switch to Elongated and notice how much faster it moves in one direction than the other. That’s why things like momentum and the Adam optimiser exist.


Why Does Backprop Work So Fast? #

Computing all gradients naively with finite differences would mean at least one extra forward pass per weight. A network with 1 million weights: 1 million forward passes per update. Useless.

Backprop computes all gradients in one backward pass, at roughly the cost of a forward pass (within a small constant factor). That’s the magic. It’s not a heuristic or approximation. It’s exact. Just the chain rule, applied cleverly so intermediate results are reused.
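
You can check the claim of exactness against the code above. A sketch of a finite-difference comparison for $\mathbf{W}^{(1)}$ (run it before the weight-update step, so dL_dW1 still corresponds to the current weights):

def forward_loss(W1, b1, W2, b2, x, y):
    a1 = relu(W1 @ x + b1)
    y_hat = sigmoid(W2 @ a1 + b2)[0]
    return (y_hat - y) ** 2

h = 1e-6
numeric_dW1 = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        W_plus, W_minus = W1.copy(), W1.copy()
        W_plus[i, j] += h                     # two forward passes per weight...
        W_minus[i, j] -= h
        numeric_dW1[i, j] = (forward_loss(W_plus, b1, W2, b2, x, y)
                             - forward_loss(W_minus, b1, W2, b2, x, y)) / (2 * h)

print(np.allclose(numeric_dW1, dL_dW1))       # True -- matches to numerical precision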

This is why the 1986 backprop paper by Rumelhart, Hinton, and Williams changed everything. The algorithm was efficient enough to actually train networks. Without it: no deep learning.


Before You Go — Try These #

  1. In the demo, switch to Elongated and try learning rate 0.05. Then try 0.3. What happens? Why do long thin valleys cause problems for vanilla gradient descent?

  2. Work through the backward pass for the by-hand example in Lesson 04. We computed $\hat{y} \approx 0.603$ with true label $y = 1$. Compute $\frac{\partial L}{\partial z^{(2)}}$ using MSE loss.

  3. What is the gradient of ReLU at $z = 0$? Why is this technically undefined, and why does it not matter in practice?

  4. A weight deep in layer 1 gets gradient $\approx 0.001$ while a weight in the output layer gets $0.3$. What causes this difference, and what problem does it hint at?

  5. In the Python code above, change the true label from y = 1.0 to y = 0.0 and re-trace the backward pass mentally. Does the sign of dL_dW2 flip? Why?


Next up → Lesson 06: The Learning Loop — we put forward pass + backprop + gradient descent into a training loop and watch a network actually learn something.