Here’s a dirty secret about neural networks: without activation functions, the whole thing is a scam.
Stack 10 layers. Stack 100 layers. Without activations, they all collapse into a single matrix multiply. You’d have a very expensive, very elaborate linear regression. All that “depth” — meaningless.
Activation functions are the thing that makes depth real.
The Linearity Trap #
A layer computes $\mathbf{W}\mathbf{x} + \mathbf{b}$. Linear. Another layer: $\mathbf{W}_2(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$. Still linear. Always linear. No matter how deep you go, you can always squish all those matrices into one.
$$\mathbf{W}_2\mathbf{W}_1\mathbf{x} + \mathbf{W}_2\mathbf{b}_1 + \mathbf{b}_2 = \mathbf{W}'\mathbf{x} + \mathbf{b}'$$
One layer. Done. Your 100-layer network was lying to you.
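You can watch the collapse happen. A quick NumPy sketch (random weights, all names mine, not from any framework): two stacked linear layers produce exactly the same output as one collapsed layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers", no activation in between.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x + b1) + b2

# The collapsed single layer: W' = W2 W1,  b' = W2 b1 + b2
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
one_layer = W_prime @ x + b_prime

print(np.allclose(two_layers, one_layer))  # True: the "depth" did nothing
```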
The fix is one weird trick: between every layer, apply a nonlinear function to every single number. Now the layers can’t collapse. Each layer genuinely changes the shape of the data, not just the scale.
That’s an activation function. Let’s meet them.
The Step Function: The OG (Retired) #
The perceptron used this:
$$f(z) = \begin{cases} 1 & z \geq 0 \\ 0 & z < 0 \end{cases}$$
Great for on/off decisions. Terrible for learning. The derivative is zero everywhere — gradient descent has nothing to grab onto. It’s like trying to find the bottom of a valley that’s completely flat. You wander forever.
Retired. We need slopes.
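If you want to see the flatness for yourself, here's a throwaway NumPy check (a sketch, not the demo's code): the numerical derivative away from the jump is exactly zero.

```python
import numpy as np

def step(z):
    return (z >= 0).astype(float)

z = np.array([-3.0, -1.0, -0.1, 0.1, 1.0, 3.0])
print(step(z))  # [0. 0. 0. 1. 1. 1.]

# Numerical derivative away from the jump: exactly zero. Nothing to grab onto.
eps = 1e-4
print((step(z + eps) - step(z - eps)) / (2 * eps))  # [0. 0. 0. 0. 0. 0.]
```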
Sigmoid: The Smooth Operator #
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Takes any number, outputs something between 0 and 1. Smooth curve everywhere. Has a derivative everywhere. Gradient descent is happy.
The derivative has a gorgeous trick — you compute it for free from the output:
$$\sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr)$$
Output 0.7? Gradient is $0.7 \times 0.3 = 0.21$. No extra work.
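A quick NumPy sanity check of that trick (the input is just a value I picked so the output lands near 0.7):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.8473                        # chosen so that sigmoid(z) is roughly 0.7
s = sigmoid(z)
grad_from_output = s * (1 - s)    # the free-lunch derivative

# Cross-check against a numerical derivative.
eps = 1e-6
grad_numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(round(grad_from_output, 3), round(grad_numeric, 3))  # 0.21 0.21
```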
But. Zoom out to large $z$ values and the curve goes pancake-flat. Gradient ≈ 0. Multiply a bunch of near-zero gradients together through 10 layers and you get a number so tiny it might as well be zero. Your early layers stop learning. This is the vanishing gradient problem and it haunted neural nets for decades.
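For a rough sense of scale: the sigmoid derivative peaks at $0.25$ (at $z = 0$), so even in the best possible case, ten layers multiply out to

$$0.25^{10} \approx 9.5 \times 10^{-7}$$

and real gradients sit well below that peak.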
Also: sigmoid outputs are always positive. That means all the weight gradients feeding a given neuron share the same sign, so each update pushes those weights all up together or all down together. Think of trying to drive somewhere south-east when you can only steer north-east or south-west: you have to zig-zag.
Tanh: Sigmoid’s Cooler Cousin #
$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$
Same S-curve shape, but outputs $(-1, 1)$ — centred at zero. Positive inputs → positive output. Negative inputs → negative output. Gradient zig-zagging fixed.
$$\tanh'(z) = 1 - \tanh^2(z)$$
Still saturates. Still has vanishing gradients deep down. But better than sigmoid for hidden layers, especially in RNNs where zero-centring really helps.
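A quick numerical look (another throwaway NumPy sketch): the outputs are symmetric around zero, and the derivative still crushes to almost nothing at the edges.

```python
import numpy as np

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
t = np.tanh(z)

print(np.round(t, 3))         # [-1.    -0.762  0.     0.762  1.   ]  zero-centred
print(np.round(1 - t**2, 3))  # [ 0.     0.42   1.     0.42   0.   ]  flat at the edges
```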
ReLU: The One That Won #
$$\text{ReLU}(z) = \max(0, z)$$
That’s it. You’re staring at the activation function that powers most of modern deep learning.
If $z > 0$: pass it through unchanged. Gradient = 1. Perfect. If $z \leq 0$: output 0. Neuron is silent.
Why it won:
- No saturation on the positive side → gradients flow freely
- Computationally trivial → one comparison, billions per second on a GPU
- Sparsity → at any moment roughly half the neurons output exactly zero, which helps generalisation
The one flaw: the “dying ReLU.” A neuron that keeps receiving negative inputs gets stuck at zero, gradient zero, never recovers. It’s dead. The fix is Leaky ReLU — let a tiny trickle through when negative:
$$\text{LeakyReLU}(z) = \max(0.01z,\ z)$$
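Both fit in a couple of lines. A minimal NumPy sketch (the $0.01$ slope matches the formula above; real libraries make it configurable):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.maximum(alpha * z, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))        # [0.  0.  0.  0.5 2. ]          negatives silenced
print(leaky_relu(z))  # [-0.02  -0.005  0.  0.5  2. ]  a trickle survives

# Gradients: 1 on the positive side; 0 (ReLU) or alpha (Leaky) on the negative side.
print(np.where(z > 0, 1.0, 0.0))   # ReLU
print(np.where(z > 0, 1.0, 0.01))  # Leaky ReLU
```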
Modern networks also use smoother ReLU variants: GELU (used in GPT) and SiLU/Swish (used in LLaMA). But vanilla ReLU is still the default starting point.
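In case you're curious what those look like (definitions only, not derived here): both are "self-gated", multiplying the input by a squashed version of itself,

$$\text{SiLU}(z) = z \cdot \sigma(z), \qquad \text{GELU}(z) = z \cdot \Phi(z),$$

where $\Phi$ is the CDF of the standard normal distribution.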
Softmax: Turning Scores Into a Horse Race #
The final layer of a classifier outputs raw scores — called logits. You need to turn them into probabilities. Enter softmax:
$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
Exponentiate everything (makes big scores dominate), then normalise so they sum to 1. Scores $[2.0, 1.0, 0.1]$ become $[0.66, 0.24, 0.10]$. The model is 66% confident about class 1.
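A minimal sketch of softmax in NumPy (my own helper, not a library call), using the standard subtract-the-max trick so big logits don't overflow:

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)   # subtracting the max: same result, no overflow
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

print(np.round(softmax(np.array([2.0, 1.0, 0.1])), 2))  # [0.66 0.24 0.1 ]
```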
Always paired with cross-entropy loss. They’re the peanut butter and jelly of classification.
Cheat Sheet #
| Function | Range | Zero-centred | Saturates | Use it for |
|---|---|---|---|---|
| Step | $\{0,1\}$ | No | Flat everywhere | Historical curiosity |
| Sigmoid | $(0,1)$ | No | Yes | Binary output layer |
| Tanh | $(-1,1)$ | Yes | Yes | RNNs, hidden layers |
| ReLU | $[0,\infty)$ | No | Negative side only | Hidden layers (default) |
| Leaky ReLU | $\mathbb{R}$ | No | No | When neurons keep dying |
| Softmax | $(0,1)$, sums to 1 | — | — | Multi-class output |
Demo: Activation Explorer #
Drag the slider to feel how $f(z)$ and $f'(z)$ change. Watch the gradient collapse to nearly zero at the edges for sigmoid and tanh — that’s vanishing gradients, live.
Before You Go — Try These #
- Open the Activation Explorer, select Sigmoid, and drag $z$ to $\pm 5$. What’s $f'(z)$ at those extremes? Now try the same with ReLU. Why does ReLU not have this problem on the positive side?
- Compute $\sigma(0)$, $\sigma(1)$, $\sigma(-1)$ by hand using $\sigma(z) = \frac{1}{1+e^{-z}}$. Verify with the demo.
- Prove that $\tanh(z) = 2\sigma(2z) - 1$. Hint: write out $\sigma(2z)$ and simplify.
- A network has 3 hidden layers, all using sigmoid. The gradient of the loss w.r.t. the first layer involves multiplying three sigmoid derivatives together. If each is about $0.2$, what’s the combined gradient? What does this tell you about training deep sigmoid networks?
- In the Art Studio, hit 🎲 Randomise a few times while on Step. Why does it always look like a hard-edged split rather than smooth blobs? What does that tell you about gradients?
Next up → Lesson 04: Dominos All the Way Down — we wire neurons into full layers and trace a number from input to output, step by step.