A 256×256 colour image has 196,608 numbers in it. If you flatten that and feed it into a fully connected layer with 1000 neurons, the first layer alone needs 196 million weights.
That’s not just expensive — it’s structurally dumb. A fully connected layer treats pixel (0,0) and pixel (255,255) as equally related. But they’re not. Pixels that are close to each other share structure. Edges, curves, textures — they’re local patterns, not global ones.
Convolutional networks are built around that insight. A small filter slides across the image, looking for one local pattern everywhere. The operation is translation-equivariant: a vertical edge in the top-left is detected by the same weights as a vertical edge in the bottom-right, and the response simply moves with the edge.
The Convolution Operation #
Take a small matrix called a kernel (or filter) and slide it across the image. At each position, multiply element-wise and sum. That sum becomes one pixel in the feature map.
A 3×3 kernel applied to a 5×5 image:
$$\begin{bmatrix} 1&0&-1 \\ 1&0&-1 \\ 1&0&-1 \end{bmatrix} \circledast \begin{bmatrix} 2&1&3&0&1 \\ 0&2&1&3&2 \\ 1&0&2&1&0 \\ 3&1&0&2&1 \\ 2&3&1&0&2 \end{bmatrix}$$
At position (1,1) — centre of the kernel placed at row 1, col 1:
$$2(1)+1(0)+3(-1) + 0(1)+2(0)+1(-1) + 1(1)+0(0)+2(-1)$$ $$= 2+0-3+0+0-1+1+0-2 = -3$$
Slide to every valid position and you get the feature map. That 3×3 vertical edge detector produces high positive values at left edges, high negative at right edges, near zero on flat regions.
The kernel has 9 weights. That’s it. Whether the image is 28×28 or 1024×1024, still 9 weights. This is parameter sharing — the same filter applied everywhere.
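If you want to check the arithmetic, a few lines of PyTorch reproduce the worked example. Note that F.conv2d, like the hand computation above, slides the kernel without flipping it (strictly speaking a cross-correlation):

import torch
import torch.nn.functional as F

# Vertical-edge kernel and 5×5 image from the example above.
kernel = torch.tensor([[1., 0., -1.],
                       [1., 0., -1.],
                       [1., 0., -1.]])
image = torch.tensor([[2., 1., 3., 0., 1.],
                      [0., 2., 1., 3., 2.],
                      [1., 0., 2., 1., 0.],
                      [3., 1., 0., 2., 1.],
                      [2., 3., 1., 0., 2.]])

# conv2d expects (batch, channels, H, W), so add two singleton dims.
feature_map = F.conv2d(image[None, None], kernel[None, None])
print(feature_map.shape)        # torch.Size([1, 1, 3, 3]) -- one value per valid position
print(feature_map[0, 0, 0, 0])  # tensor(-3.) -- the value computed by hand above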
Multiple Filters, Multiple Feature Maps #
One filter detects one type of pattern. Real CNNs use many filters in parallel. A layer with 32 filters produces 32 feature maps from one input image — each looking for a different pattern.
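In PyTorch this is just the out_channels argument. A quick shape check, assuming a 3-channel 28×28 input purely for illustration:

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1)
x = torch.randn(1, 3, 28, 28)    # one RGB image, batch size 1
print(conv(x).shape)             # torch.Size([1, 32, 28, 28]) -- 32 feature maps
print(conv.weight.shape)         # torch.Size([32, 3, 3, 3]) -- one 3×3×3 kernel per filter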
Layer 1 might learn: horizontal edges, vertical edges, diagonal edges, colour gradients. Layer 2 sees those feature maps and combines them: corners, circles, textures. Layer 3 combines those: eyes, wheels, doorknobs.
Deeper = more abstract. This hierarchy emerges automatically from training — nobody told the network to look for eyes. It figured out that “eye-detectors” are useful intermediate features for classifying faces.
Padding and Stride #
Padding — add zeros around the image border so the output feature map is the same size as the input. “Same” padding with a 3×3 kernel adds 1 pixel of zeros on each side.
Without padding: a 5×5 image with a 3×3 kernel gives a 3×3 output (shrinks each time). Stack enough layers and your image disappears.
Stride — how many pixels the kernel moves per step. Stride 1: move one pixel at a time (standard). Stride 2: skip every other position, halving the spatial dimensions. An alternative to pooling.
$$\text{output size} = \left\lfloor \frac{\text{input size} - \text{kernel size} + 2 \times \text{padding}}{\text{stride}} \right\rfloor + 1$$
For input 28, kernel 3, padding 1, stride 1: $\lfloor(28-3+2)/1\rfloor + 1 = 28$. Same size.
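The same formula as a small helper, checked against a few common cases:

def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Spatial size of a conv (or pool) output along one dimension."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

print(conv_output_size(28, 3, padding=1, stride=1))  # 28 -- "same" padding
print(conv_output_size(28, 3, padding=0, stride=1))  # 26 -- no padding, shrinks by 2
print(conv_output_size(28, 3, padding=1, stride=2))  # 14 -- stride 2 halves it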
Pooling: Shrinking Deliberately #
After a conv layer, you usually want to reduce the spatial size. Max pooling takes the maximum value in each window.
A 2×2 max pool with stride 2 on a 4×4 feature map:
$$\begin{bmatrix}1&3&2&0\\4&2&1&3\\0&1&4&2\\3&2&1&0\end{bmatrix} \xrightarrow{\text{maxpool 2×2}} \begin{bmatrix}4&3\\3&4\end{bmatrix}$$
It halves the spatial size. It's also slightly translation-invariant: a feature shifted by one pixel often still produces the same max, as long as the maximum stays inside the same pooling window.
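The same 4×4 example, checked in PyTorch:

import torch
import torch.nn.functional as F

fmap = torch.tensor([[1., 3., 2., 0.],
                     [4., 2., 1., 3.],
                     [0., 1., 4., 2.],
                     [3., 2., 1., 0.]])
pooled = F.max_pool2d(fmap[None, None], kernel_size=2, stride=2)
print(pooled[0, 0])
# tensor([[4., 3.],
#         [3., 4.]])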
The CNN Architecture #
A typical CNN stacks:
Input image
→ Conv(32 filters, 3×3) + ReLU # detect low-level features
→ Conv(32 filters, 3×3) + ReLU
→ MaxPool(2×2) # halve spatial size
→ Conv(64 filters, 3×3) + ReLU # more abstract features
→ Conv(64 filters, 3×3) + ReLU
→ MaxPool(2×2)
→ Flatten # spatial → vector
→ Linear(128) + ReLU # combine features
→ Linear(10) + Softmax # classify
The spatial size shrinks as you go deeper. The number of channels (feature maps) grows. By the time you flatten, the network has compressed the spatial structure into a rich feature vector that the final linear layers classify.
In PyTorch:
import torch.nn as nn

# Assumes 28×28 single-channel input (e.g. MNIST); the 64 * 7 * 7 below depends on that.
model = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),   # 1 → 32 channels, stays 28×28
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                                         # 28×28 → 14×14
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),  # 32 → 64 channels
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                                         # 14×14 → 7×7
    nn.Flatten(),                                            # 64×7×7 → 3136
    nn.Linear(64 * 7 * 7, 128), nn.ReLU(),
    nn.Linear(128, 10)                                       # logits; softmax lives in the loss
)
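A quick sanity check on a dummy batch of 28×28 inputs, plus a parameter count:

import torch

x = torch.randn(8, 1, 28, 28)                        # batch of 8 single-channel 28×28 images
print(model(x).shape)                                 # torch.Size([8, 10]) -- one logit per class
print(sum(p.numel() for p in model.parameters()))     # 467,818 -- most of them in the first Linear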
Parameter Count vs Fully Connected #
For a 28×28 image and a layer producing 32 feature maps:
- Fully connected (784 flattened pixels → 32 output neurons): $28 \times 28 \times 32 = 25{,}088$ weights
- Conv (3×3 kernel, 1 input channel → 32 filters): $3 \times 3 \times 32 = 288$ weights
87× fewer parameters. And the conv layer is structurally smarter — it enforces locality and translation invariance.
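Counting it directly (weights only, biases excluded, to match the numbers above):

import torch.nn as nn

fc   = nn.Linear(28 * 28, 32)             # 784 flattened pixels -> 32 output neurons
conv = nn.Conv2d(1, 32, kernel_size=3)    # 1 input channel -> 32 feature maps
print(fc.weight.numel())                  # 25088
print(conv.weight.numel())                # 288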
This efficiency is why CNNs can handle high-resolution images. ResNet-50 classifies 224×224 images with ~25M parameters. A fully connected equivalent would need billions.
Demo: Filter Playground #
Draw anything on the grid. Watch the convolution filters react to it live — each output shows what a different kernel “sees”. Try drawing edges, diagonals, circles. The filters respond differently to each.
Notice:
- Horizontal filter responds strongly to horizontal lines — bright where a dark-to-light transition happens top-to-bottom
- Vertical filter does the same but for vertical lines
- Edges (Laplacian) lights up wherever there’s any edge — the boundary between your drawing and the background
- Sharpen amplifies the edges and centres — makes the drawing “pop”
Hit Fill to draw a cross + diagonals to see all four filters react to different edge directions at once.
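The demo doesn't print its kernels, but the classic textbook versions of the four filters it names look roughly like this (the exact values and sign conventions in the demo may differ):

# Classic 3×3 versions of the filters named above -- the demo's exact values may differ.
horizontal = [[ 1,  1,  1],
              [ 0,  0,  0],
              [-1, -1, -1]]    # bright-above / dark-below transitions

vertical   = [[ 1,  0, -1],
              [ 1,  0, -1],
              [ 1,  0, -1]]    # bright-left / dark-right transitions

laplacian  = [[ 0,  1,  0],
              [ 1, -4,  1],
              [ 0,  1,  0]]    # responds to edges in any direction

sharpen    = [[ 0, -1,  0],
              [-1,  5, -1],
              [ 0, -1,  0]]    # identity plus edges: makes the drawing "pop"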
What Filters Actually Learn #
In an untrained CNN, filters are random noise. After training on images:
- Layer 1 filters: oriented edges (Sobel-like), colour gradients, blobs
- Layer 2 filters: combinations — corners, T-junctions, simple textures
- Layer 3+ filters: complex textures, object parts
- Deep layers: semantic concepts — eyes, wheels, text
This was verified empirically by visualising what input images maximise each filter’s activation. The hierarchy is real — and it emerged from gradient descent on labels, not from anyone designing it.
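A minimal sketch of that visualisation idea, assuming a trained model like the Sequential above (the function name and hyperparameters here are illustrative, not from any particular library):

import torch

def visualise_filter(model, layer_idx=0, filter_idx=0, steps=100, lr=0.1):
    """Gradient ascent on the input: find an image that strongly excites one filter."""
    x = torch.randn(1, 1, 28, 28, requires_grad=True)    # start from random noise
    for _ in range(steps):
        activation = model[: layer_idx + 1](x)           # run the layers up to the one we care about
        loss = -activation[0, filter_idx].mean()         # maximise that filter's mean response
        loss.backward()
        with torch.no_grad():
            x -= lr * x.grad                              # step the *input*, not the weights
            x.grad.zero_()
    return x.detach()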
Before You Go — Try These #
- In the demo, draw a single horizontal line in the middle. Which filter responds most strongly? Now draw a vertical line. Which filter responds now? Is there any response in the “wrong” filter?
- A conv layer takes input of shape $(H, W, C_{in})$ and has $C_{out}$ filters each of size $k \times k$. How many learnable parameters does this layer have (including biases)?
- Why does max pooling give some translation invariance? Draw a 2×2 example where a feature shifts one pixel but the max-pooled output stays the same.
- A fully connected layer with 1000 input neurons and 512 output neurons has how many parameters? A conv layer with 32 input channels, 64 output channels, and 3×3 kernels has how many? Which would you use for processing a 64×64 image, and why?
- CNNs are translation-invariant but not rotation-invariant by default. Why? What architectural change would give some rotation invariance?
Next up → Lesson 11: Memory in the Loop — recurrent neural networks, sequences, and why order matters.