07 — The Vanishing Act

Why deep networks went dark in the 90s — vanishing gradients, exploding gradients, and the tricks that finally made depth work: weight init, batch norm, and residual connections.

In the late 1980s, people tried making neural networks deeper. More layers, more power, right?

It didn’t work. Adding layers made things worse. Training stalled. The networks learned nothing. Many researchers gave up on deep architectures, and neural network research slid into what is often called the second AI winter.

The culprit: gradients were disappearing before they reached the early layers.


The Vanishing Gradient Problem #

Backprop multiplies derivatives together as it flows backward through each layer. With sigmoid activations, the maximum derivative is $\sigma'(z) = 0.25$ (at $z = 0$). In practice it’s often much smaller.

Chain 10 layers together:

$$\frac{\partial L}{\partial \mathbf{W}^{(1)}} = \frac{\partial L}{\partial \mathbf{a}^{(10)}} \cdot \prod_{l=1}^{10} \sigma'(z^{(l)}) \cdot \ldots$$

If each sigmoid derivative is $0.2$:

$$0.2^{10} = 0.0000001024$$

That’s the gradient magnitude reaching layer 1. Essentially zero. The first few layers get no useful signal and learn nothing. You can stack as many layers as you like — the early ones are frozen by simple arithmetic.

This isn’t a bug. It’s the inevitable result of multiplying many small numbers together. The deeper the network, the worse it gets.
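You can check the arithmetic directly. Here is a minimal NumPy sketch, assuming pre-activations drawn from a zero-mean Gaussian (the $0.2$ figure above is a typical value for the average derivative):

import numpy as np

# Average sigmoid derivative over random pre-activations, then the product over depth
rng = np.random.default_rng(0)
z = rng.normal(0, 2, size=10_000)
s = 1 / (1 + np.exp(-z))
mean_deriv = np.mean(s * (1 - s))     # sigma'(z) = sigma(z)(1 - sigma(z)), well below 0.25
for depth in (5, 10, 20):
    print(depth, mean_deriv ** depth) # shrinks towards zero as depth grows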


The Exploding Gradient Problem #

The opposite can also happen. If the weights are large, the factors in that product are greater than one, and it blows up instead of shrinking:

$$2.0^{10} = 1024$$

Gradients explode, weights jump wildly, loss oscillates or goes to NaN. Common in RNNs processing long sequences.

The fix for exploding gradients is simple and brutal: gradient clipping — cap the gradient norm at a threshold.

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
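In a training loop the call sits between backward() and the optimiser step. A sketch, assuming model, loss_fn, optimizer and a batch (x, y) already exist:

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()                       # gradients now exist
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # cap their norm
optimizer.step()                      # apply the clipped gradients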

Vanishing gradients are trickier. They need architectural solutions.


Fix 1: Use ReLU #

The simplest fix. ReLU’s gradient is exactly 1 for positive inputs — it doesn’t shrink. Backprop passes through without attenuation.

The product of 10 ReLU derivatives (all 1.0): still 1.0.

This is the biggest reason ReLU replaced sigmoid in hidden layers. It’s not just about speed — it’s about keeping gradients alive through depth.


Fix 2: Weight Initialisation #

How you initialise weights matters enormously. Random initialisation sounds fine — but if weights start too large, activations saturate immediately (sigmoid/tanh go flat). If they start too small, activations collapse to zero.

Xavier / Glorot initialisation (for tanh):

$$W \sim \mathcal{U}\left[-\frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}},\ \frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}\right]$$

Scales weights by the number of inputs and outputs, keeping the variance of activations roughly constant across layers.

He initialisation (for ReLU):

$$W \sim \mathcal{N}\left(0,\ \frac{2}{n_{in}}\right)$$

Accounts for the fact that ReLU kills half the neurons (the negative side), so variance needs to be doubled to compensate.

# PyTorch's nn.Linear uses Kaiming-uniform weights by default (a He variant)
layer = nn.Linear(in_features, out_features)

# Xavier for tanh/sigmoid networks
nn.init.xavier_uniform_(layer.weight)

The difference between bad and good initialisation can be the difference between a network that trains and one that doesn’t — before a single gradient step.
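To see what the scaling buys you, here is a rough NumPy sketch (assumed setup: 30 ReLU layers of width 256, standard-normal input) comparing a naive small-random init with He init. The first collapses the activations; the second keeps their spread roughly constant:

import numpy as np

rng = np.random.default_rng(0)
width, depth = 256, 30
x = rng.normal(size=(1000, width))

for name, std in [("naive std=0.01", 0.01), ("He std=sqrt(2/n_in)", np.sqrt(2 / width))]:
    a = x
    for _ in range(depth):
        W = rng.normal(0, std, size=(width, width))
        a = np.maximum(0, a @ W)   # linear layer + ReLU
    print(name, "-> activation std after 30 layers:", float(a.std()))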


Fix 3: Batch Normalisation #

Introduced in 2015. The idea: normalise the inputs to each layer so they always have mean 0 and variance 1. Do this during training, for each mini-batch.

For a batch of pre-activations $\{z^{(i)}\}$:

$$\mu_B = \frac{1}{m}\sum_i z^{(i)}, \quad \sigma^2_B = \frac{1}{m}\sum_i (z^{(i)} - \mu_B)^2$$

$$\hat{z}^{(i)} = \frac{z^{(i)} - \mu_B}{\sqrt{\sigma^2_B + \varepsilon}}$$

Then scale and shift by learned parameters $\gamma$ and $\beta$:

$$z^{(i)}_{out} = \gamma \hat{z}^{(i)} + \beta$$

The network learns what mean and variance it wants. The normalisation keeps activations in the well-behaved region — preventing saturation, keeping gradients healthy.

Side effect: acts as mild regularisation. Reduces need for dropout.

At inference (no batch): use running statistics computed during training.
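The training-time computation is only a few lines. A minimal NumPy sketch (gamma, beta and the running statistics would normally live on the layer; eps is the $\varepsilon$ above):

import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-5):
    # z has shape [batch, features]; statistics are computed per feature
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    z_hat = (z - mu) / np.sqrt(var + eps)   # mean 0, variance 1
    return gamma * z_hat + beta             # learned scale and shift

# At inference, mu and var are replaced by running averages tracked during training.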


Fix 4: Residual Connections #

The nuclear option. In 2015, ResNet introduced skip connections — a direct path that bypasses one or more layers:

$$\mathbf{a}^{(l+2)} = f(\mathbf{W}^{(l+2)} \mathbf{a}^{(l+1)} + \mathbf{b}^{(l+2)}) + \mathbf{a}^{(l)}$$

Instead of learning the full transformation, the layers learn the residual — what to add to the identity. If the best answer is “do nothing”, the layers can learn weights near zero, leaving the skip connection to carry the signal.

The gradient implication: the skip connection gives gradients a highway to the early layers, bypassing potential vanishing. Derivative of the skip path: $1.0$, always. No matter how many layers are stacked.

This is why ResNet-152 (152 layers!) trained successfully in 2015, while 20-layer networks had failed a decade earlier.

# Residual block in PyTorch (self.layers must preserve the input's shape)
class ResBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.layers = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    def forward(self, x):
        return self.layers(x) + x  # skip connection
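A quick way to see the highway in action, using the sketch above (the 50-block depth and width of 64 are arbitrary choices): backprop a dummy loss and look at the gradient reaching the very first block.

import torch
import torch.nn as nn

blocks = nn.Sequential(*[ResBlock(64) for _ in range(50)])
loss = blocks(torch.randn(8, 64)).pow(2).mean()
loss.backward()
# Gradient at the first block's first weight matrix: non-vanishing despite 50 blocks
print(blocks[0].layers[0].weight.grad.abs().mean())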

Demo: Watch Gradients Vanish #

The interactive demo plots the gradient magnitude at each layer during backprop. Drag the depth slider and switch activations to see the problem and the fix in real time.

Try this: set Sigmoid, depth 16. The bars at the left (early layers) are nearly invisible — the gradient has vanished. Now switch to ReLU. The bars stay tall all the way through. That’s what made deep learning possible.
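If you would rather reproduce the experiment in code, here is a rough PyTorch sketch of the same idea (depth 16, width 64, random data, and each activation given a sensible init so that only the activation differs):

import torch
import torch.nn as nn

def layer_grad_mags(act_cls, depth=16, width=64):
    layers = []
    for _ in range(depth):
        lin = nn.Linear(width, width)
        if act_cls is nn.ReLU:
            nn.init.kaiming_normal_(lin.weight, nonlinearity="relu")  # He init
        else:
            nn.init.xavier_normal_(lin.weight)                        # Xavier init
        layers += [lin, act_cls()]
    net = nn.Sequential(*layers)
    net(torch.randn(32, width)).pow(2).mean().backward()
    # mean |gradient| of each Linear layer's weights, from layer 1 onwards
    return [m.weight.grad.abs().mean().item() for m in net if isinstance(m, nn.Linear)]

print(layer_grad_mags(nn.Sigmoid)[:3])  # earliest layers: vanishingly small
print(layer_grad_mags(nn.ReLU)[:3])     # earliest layers: orders of magnitude larger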


Putting It Together #

Modern deep networks use all of these fixes together:

  • ReLU (or GELU/SiLU) instead of sigmoid in hidden layers
  • He initialisation for ReLU networks
  • Batch normalisation after linear layers (before or after activation, still debated)
  • Residual connections for very deep architectures (ResNets, Transformers)

With these, you can train networks with hundreds of layers. Without them, 10 layers was already a struggle.
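As a rough sketch of how these pieces fit together in code (this uses one common arrangement; the exact ordering of normalisation, activation, and linear layers varies between architectures):

class ModernBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.BatchNorm1d(dim)                 # batch normalisation
        self.linear = nn.Linear(dim, dim)
        nn.init.kaiming_normal_(self.linear.weight, nonlinearity="relu")  # He init
        self.act = nn.ReLU()                            # gradient-preserving activation
    def forward(self, x):
        return x + self.linear(self.act(self.norm(x)))  # residual connection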


Before You Go — Try These #

  1. In the demo, with Sigmoid and depth 10, roughly what is the gradient magnitude at layer 1? Now compute it manually: if each sigmoid derivative is $0.2$ and each weight contribution is $0.9$, what’s $(0.2 \times 0.9)^{10}$?

  2. Why does He initialisation use $\sqrt{2/n_{in}}$ while Xavier uses $\sqrt{2/(n_{in}+n_{out})}$? What assumption about ReLU drives the factor of 2?

  3. Batch norm has learnable parameters $\gamma$ and $\beta$. If the network learns $\gamma = 1$ and $\beta = 0$ everywhere, what has batch norm effectively done? Is this a problem?

  4. Draw a 4-layer residual network on paper. Trace the gradient flow backward. How many paths exist from the output to layer 1, and which ones bypass the weight matrices entirely?

  5. Exploding gradients are common in RNNs but rare in feedforward networks. Why might sequential models with shared weights be especially prone to exploding gradients?


Next up → Lesson 08: Build One From Scratch — pure Python and NumPy, no frameworks, building a full neural network that classifies handwritten digits.