04 — Dominos All the Way Down

The forward pass: how a number enters one end of a neural network and a prediction falls out the other, layer by layer.

A neural network is just a function. Feed it numbers. Get numbers back. Everything between input and output is called the forward pass — and once you understand it, the whole architecture stops being magic.

Think of it like dominos. Push the first one (your input). It knocks over the next row (layer 1). That row knocks over the next (layer 2). On and on until the last tile falls — your prediction.

Let’s follow a number all the way through.


One Neuron, Revisited #

Before tackling a full network, a single neuron:

$$a = f\!\left(\sum_i w_i x_i + b\right) = f(\mathbf{w}^\top \mathbf{x} + b)$$

Two steps, always:

  1. Weighted sum — dot product of weights and inputs, plus bias. This is $z$, the “pre-activation.”
  2. Activation — squash $z$ through a nonlinear function to get $a$, the output.

$z$ is the raw score. $a$ is what the neuron actually says.
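
Those two steps as a minimal NumPy sketch — the weights, bias, and input values below are made up for illustration, and tanh is just one possible choice of $f$:

import numpy as np

x = np.array([0.5, -1.0, 2.0])   # inputs (illustrative values)
w = np.array([0.3, -0.8, 0.1])   # one weight per input
b = 0.05                         # bias

z = w @ x + b                    # step 1: weighted sum -> pre-activation
a = np.tanh(z)                   # step 2: nonlinear squash -> the neuron's output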


One Layer #

A layer is just many neurons running in parallel — each one looking at the same inputs but with different weights. In matrix form:

$$\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$$ $$\mathbf{a} = f(\mathbf{z})$$

Where $\mathbf{W}$ is the weight matrix (rows = neurons, columns = inputs), $\mathbf{b}$ is a bias vector, and $f$ is applied element-wise.

A layer with 4 neurons and 3 inputs: $\mathbf{W}$ is $4 \times 3$. Takes a 3D input, spits out a 4D output. The dimensions flow forward through the network, shaped entirely by the weight matrices.
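
A quick NumPy sketch of that 4-neuron, 3-input layer. The weights are random placeholders and the input is arbitrary; only the shapes matter here:

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))      # rows = 4 neurons, columns = 3 inputs
b = np.zeros(4)                  # one bias per neuron
x = np.array([1.0, -2.0, 0.5])   # a 3-dimensional input

z = W @ x + b                    # pre-activations, shape (4,)
a = np.maximum(0, z)             # element-wise ReLU, shape (4,)
print(a.shape)                   # (4,)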


A Full Network, by Hand #

Let’s build a tiny network: 2 inputs → 2 hidden neurons (ReLU) → 1 output (sigmoid)

Say the input is $\mathbf{x} = [0.5,\ -1.0]^\top$.

Layer 1 weights and biases:

$$\mathbf{W}^{(1)} = \begin{bmatrix} 0.8 & 0.4 \\ -0.5 & 0.9 \end{bmatrix}, \quad \mathbf{b}^{(1)} = \begin{bmatrix} 0.1 \\ -0.2 \end{bmatrix}$$

Pre-activations:

$$\mathbf{z}^{(1)} = \mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)}$$

$$z^{(1)}_1 = 0.8(0.5) + 0.4(-1.0) + 0.1 = 0.4 - 0.4 + 0.1 = 0.1$$ $$z^{(1)}_2 = -0.5(0.5) + 0.9(-1.0) + (-0.2) = -0.25 - 0.9 - 0.2 = -1.35$$

Activations (ReLU):

$$\mathbf{a}^{(1)} = \text{ReLU}(\mathbf{z}^{(1)}) = \begin{bmatrix} 0.1 \\ 0 \end{bmatrix}$$

Neuron 2 got a negative pre-activation and went silent. Domino didn’t fall.

Layer 2 weights and biases:

$$\mathbf{W}^{(2)} = \begin{bmatrix} 1.2 & -0.7 \end{bmatrix}, \quad b^{(2)} = 0.3$$

Output pre-activation:

$$z^{(2)} = 1.2(0.1) + (-0.7)(0) + 0.3 = 0.12 + 0 + 0.3 = 0.42$$

Output (sigmoid — probability):

$$\hat{y} = \sigma(0.42) = \frac{1}{1 + e^{-0.42}} \approx 0.603$$

The network says: 60.3% chance this is a positive example. The whole thing — no mystery, just dot products and nonlinear squashing, layer after layer.


Why Shape Matters #

The dimensions have to line up. If layer 1 has 64 neurons and layer 2 has 32 neurons, then:

  • $\mathbf{W}^{(1)}$ is $64 \times \text{input\_size}$
  • $\mathbf{W}^{(2)}$ is $32 \times 64$

Output of layer 1 is 64-dimensional → exactly what layer 2 expects as input. Get the shapes wrong and everything crashes. Neural network bugs are almost always shape mismatches.

A good habit: before writing any code, sketch the shapes. $[\text{batch} \times 784] \to [\text{batch} \times 128] \to [\text{batch} \times 64] \to [\text{batch} \times 10]$. If the dimensions flow cleanly, you’re good.
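
You can do that sketch in code, too. A small sanity check assuming those layer sizes, with random placeholder weights so only the shapes are meaningful:

import numpy as np

rng = np.random.default_rng(0)
batch = 32
sizes = [784, 128, 64, 10]              # input -> hidden -> hidden -> output

X = rng.normal(size=(batch, 784))       # stand-in batch of flattened images
for n_in, n_out in zip(sizes[:-1], sizes[1:]):
    W = rng.normal(size=(n_out, n_in))  # shape (neurons, inputs), as above
    b = np.zeros(n_out)
    X = np.maximum(0, X @ W.T + b)      # ReLU everywhere, just to keep the loop simple
    print(X.shape)                      # (32, 128), then (32, 64), then (32, 10)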


In Python (NumPy) #

The entire forward pass of that tiny network:

import numpy as np

def relu(z): return np.maximum(0, z)
def sigmoid(z): return 1 / (1 + np.exp(-z))

x = np.array([0.5, -1.0])

W1 = np.array([[0.8, 0.4], [-0.5, 0.9]])
b1 = np.array([0.1, -0.2])
W2 = np.array([[1.2, -0.7]])
b2 = np.array([0.3])

z1 = W1 @ x + b1       # pre-activation, layer 1
a1 = relu(z1)           # activation, layer 1

z2 = W2 @ a1 + b2      # pre-activation, layer 2
y_hat = sigmoid(z2)    # output probability

print(y_hat)  # ~0.603

Four lines of actual math. The rest is bookkeeping. That’s it. Every modern neural network is this, just with more layers and way bigger matrices.
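
And the “more layers” part is just a loop. A minimal sketch of a generic forward pass, reusing relu, sigmoid, and the weights defined above (the forward name and the list-of-(W, b) layout are just one way to organize it):

def forward(x, layers, hidden_fn=relu, output_fn=sigmoid):
    # layers is a list of (W, b) pairs: apply hidden_fn after every
    # hidden layer and output_fn after the last one.
    *hidden, last = layers
    a = x
    for W, b in hidden:
        a = hidden_fn(W @ a + b)
    W, b = last
    return output_fn(W @ a + b)

# The tiny network from above, expressed as a list of layers:
print(forward(x, [(W1, b1), (W2, b2)]))  # same ~0.603 as before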


Demo: Watch the Signal Flow #

Drag the inputs and watch the numbers ripple through the network in real time. Every node shows its $z$ (pre-activation) and $a$ (output). Thick edges carry more signal.

Notice: some hidden neurons go dark (ReLU killed them). Flip $x_1$ or $x_2$ to negative and watch which dominos stop falling. Hit New weights and the whole network rewires — same structure, completely different behaviour.


Before You Go — Try These #

  1. In the Signal Flow demo, drag $x_1$ and $x_2$ until at least two hidden neurons go dark (output 0). What does that mean for the information reaching the output neuron?

  2. Work through the hand example again but change $x = [0.5, -1.0]$ to $x = [-0.5, 1.0]$. Recompute $\mathbf{z}^{(1)}$, $\mathbf{a}^{(1)}$, $z^{(2)}$, and $\hat{y}$. Does the output flip sides of 0.5?

  3. A network takes a $28 \times 28$ image (flattened to 784 inputs) and has layers of size 256, 128, 64, 10. Write out the shape of every weight matrix $\mathbf{W}^{(l)}$.

  4. In the Classifier demo, toggle between ReLU and Tanh. Which one produces straight-edged boundaries? Which one curves? Can you explain why from what you know about the two functions?

  5. Without changing the architecture, hit New network 5 times. Each time a completely different boundary appears. What’s the only thing changing? What does that tell you about what training is actually doing?


Next up → Lesson 05: Blame It on the Weights — the forward pass tells you what the network predicts. Backprop tells you whose fault it is.