A 256×256 colour image has 196,608 numbers in it. If you flatten that and feed it into a fully connected layer with 1000 neurons, the first layer alone needs 196 million weights.
That’s not just expensive — it’s structurally dumb. A fully connected layer treats pixel (0,0) and pixel (255,255) as equally related. But they’re not. Pixels that are close to each other share structure. Edges, curves, textures — they’re local patterns, not global ones.
Convolutional networks are built around that insight. A small filter slides across the image, looking for one local pattern everywhere. The operation is translation-equivariant: a vertical edge in the top-left is detected by the same weights as a vertical edge in the bottom-right, and the response simply moves with the edge.
The Convolution Operation #
Take a small matrix called a kernel (or filter) and slide it across the image. At each position, multiply element-wise and sum. That sum becomes one pixel in the feature map.
A 3×3 kernel applied to a 5×5 image:
$$\begin{bmatrix} 1&0&-1 \\ 1&0&-1 \\ 1&0&-1 \end{bmatrix} \circledast \begin{bmatrix} 2&1&3&0&1 \\ 0&2&1&3&2 \\ 1&0&2&1&0 \\ 3&1&0&2&1 \\ 2&3&1&0&2 \end{bmatrix}$$
At position (1,1) — centre of the kernel placed at row 1, col 1:
$$2(1)+1(0)+3(-1) + 0(1)+2(0)+1(-1) + 1(1)+0(0)+2(-1)$$ $$= 2+0-3+0+0-1+1+0-2 = -3$$
Slide to every valid position and you get the feature map. That 3×3 vertical edge detector produces high positive values at left edges, high negative at right edges, near zero on flat regions.
The kernel has 9 weights. That’s it. Whether the image is 28×28 or 1024×1024, still 9 weights. This is parameter sharing — the same filter applied everywhere.
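If you want to check the arithmetic, a few lines of PyTorch reproduce the worked example. Note that F.conv2d, like the hand computation above, slides the kernel without flipping it (strictly speaking a cross-correlation):

import torch
import torch.nn.functional as F

# Vertical-edge kernel and 5×5 image from the example above.
kernel = torch.tensor([[1., 0., -1.],
                       [1., 0., -1.],
                       [1., 0., -1.]])
image = torch.tensor([[2., 1., 3., 0., 1.],
                      [0., 2., 1., 3., 2.],
                      [1., 0., 2., 1., 0.],
                      [3., 1., 0., 2., 1.],
                      [2., 3., 1., 0., 2.]])

# conv2d expects (batch, channels, H, W), so add two singleton dims.
feature_map = F.conv2d(image[None, None], kernel[None, None])
print(feature_map.shape)        # torch.Size([1, 1, 3, 3]) -- one value per valid position
print(feature_map[0, 0, 0, 0])  # tensor(-3.) -- the value computed by hand above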
Multiple Filters, Multiple Feature Maps #
One filter detects one type of pattern. Real CNNs use many filters in parallel. A layer with 32 filters produces 32 feature maps from one input image — each looking for a different pattern.
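In PyTorch this is just the out_channels argument. A quick shape check, assuming a 3-channel 28×28 input purely for illustration:

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1)
x = torch.randn(1, 3, 28, 28)    # one RGB image, batch size 1
print(conv(x).shape)             # torch.Size([1, 32, 28, 28]) -- 32 feature maps
print(conv.weight.shape)         # torch.Size([32, 3, 3, 3]) -- one 3×3×3 kernel per filter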
Layer 1 might learn: horizontal edges, vertical edges, diagonal edges, colour gradients. Layer 2 sees those feature maps and combines them: corners, circles, textures. Layer 3 combines those: eyes, wheels, doorknobs.
Deeper = more abstract. This hierarchy emerges automatically from training — nobody told the network to look for eyes. It figured out that “eye-detectors” are useful intermediate features for classifying faces.
Padding and Stride #
Padding — add zeros around the image border so the output feature map is the same size as the input. “Same” padding with a 3×3 kernel adds 1 pixel of zeros on each side.
Without padding: a 5×5 image with a 3×3 kernel gives a 3×3 output (shrinks each time). Stack enough layers and your image disappears.
Stride — how many pixels the kernel moves per step. Stride 1: move one pixel at a time (standard). Stride 2: skip every other position, halving the spatial dimensions. An alternative to pooling.
$$\text{output size} = \left\lfloor \frac{\text{input size} - \text{kernel size} + 2 \times \text{padding}}{\text{stride}} \right\rfloor + 1$$
For input 28, kernel 3, padding 1, stride 1: $\lfloor(28-3+2)/1\rfloor + 1 = 28$. Same size.
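The same formula as a small helper, checked against a few common cases:

def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Spatial size of a conv (or pool) output along one dimension."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

print(conv_output_size(28, 3, padding=1, stride=1))  # 28 -- "same" padding
print(conv_output_size(28, 3, padding=0, stride=1))  # 26 -- no padding, shrinks by 2
print(conv_output_size(28, 3, padding=1, stride=2))  # 14 -- stride 2 halves it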
Pooling: Shrinking Deliberately #
After a conv layer, you usually want to reduce the spatial size. Max pooling takes the maximum value in each window.
A 2×2 max pool with stride 2 on a 4×4 feature map:
$$\begin{bmatrix}1&3&2&0\\4&2&1&3\\0&1&4&2\\3&2&1&0\end{bmatrix} \xrightarrow{\text{maxpool 2×2}} \begin{bmatrix}4&3\\3&4\end{bmatrix}$$
It halves the spatial size. It's also slightly translation-invariant: a feature shifted by one pixel often still produces the same max, as long as the maximum stays inside the same pooling window.
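The same 4×4 example, checked in PyTorch:

import torch
import torch.nn.functional as F

fmap = torch.tensor([[1., 3., 2., 0.],
                     [4., 2., 1., 3.],
                     [0., 1., 4., 2.],
                     [3., 2., 1., 0.]])
pooled = F.max_pool2d(fmap[None, None], kernel_size=2, stride=2)
print(pooled[0, 0])
# tensor([[4., 3.],
#         [3., 4.]])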
The CNN Architecture #
A typical CNN stacks:
Input image
→ Conv(32 filters, 3×3) + ReLU # detect low-level features
→ Conv(32 filters, 3×3) + ReLU
→ MaxPool(2×2) # halve spatial size
→ Conv(64 filters, 3×3) + ReLU # more abstract features
→ Conv(64 filters, 3×3) + ReLU
→ MaxPool(2×2)
→ Flatten # spatial → vector
→ Linear(128) + ReLU # combine features
→ Linear(10) + Softmax # classify
The spatial size shrinks as you go deeper. The number of channels (feature maps) grows. By the time you flatten, the network has compressed the spatial structure into a rich feature vector that the final linear layers classify.
In PyTorch:
import torch.nn as nn

# Assumes 28×28 single-channel input (e.g. MNIST); the 64 * 7 * 7 below depends on that.
model = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),   # 1 → 32 channels, stays 28×28
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                                         # 28×28 → 14×14
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),  # 32 → 64 channels
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                                         # 14×14 → 7×7
    nn.Flatten(),                                            # 64×7×7 → 3136
    nn.Linear(64 * 7 * 7, 128), nn.ReLU(),
    nn.Linear(128, 10)                                       # logits; softmax lives in the loss
)
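A quick sanity check on a dummy batch of 28×28 inputs, plus a parameter count:

import torch

x = torch.randn(8, 1, 28, 28)                        # batch of 8 single-channel 28×28 images
print(model(x).shape)                                 # torch.Size([8, 10]) -- one logit per class
print(sum(p.numel() for p in model.parameters()))     # 467,818 -- most of them in the first Linear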
Parameter Count vs Fully Connected #
For a 28×28 image and a layer producing 32 feature maps:
- Fully connected (784 flattened pixels → 32 output neurons): $28 \times 28 \times 32 = 25{,}088$ weights
- Conv (3×3 kernel, 1 input channel → 32 filters): $3 \times 3 \times 32 = 288$ weights
87× fewer parameters. And the conv layer is structurally smarter — it enforces locality and translation invariance.
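Counting it directly (weights only, biases excluded, to match the numbers above):

import torch.nn as nn

fc   = nn.Linear(28 * 28, 32)             # 784 flattened pixels -> 32 output neurons
conv = nn.Conv2d(1, 32, kernel_size=3)    # 1 input channel -> 32 feature maps
print(fc.weight.numel())                  # 25088
print(conv.weight.numel())                # 288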
This efficiency is why CNNs can handle high-resolution images. ResNet-50 classifies 224×224 images with ~25M parameters. A fully connected equivalent would need billions.
Demo: Filter Playground #
Draw anything on the grid. Watch the convolution filters react to it live — each output shows what a different kernel “sees”. Try drawing edges, diagonals, circles. The filters respond differently to each.
Notice:
- Horizontal filter responds strongly to horizontal lines — bright where a dark-to-light transition happens top-to-bottom
- Vertical filter does the same but for vertical lines
- Edges (Laplacian) lights up wherever there’s any edge — the boundary between your drawing and the background
- Sharpen amplifies the edges and centres — makes the drawing “pop”
Hit Fill to draw a cross + diagonals to see all four filters react to different edge directions at once.
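The demo doesn't print its kernels, but the classic textbook versions of the four filters it names look roughly like this (the exact values and sign conventions in the demo may differ):

# Classic 3×3 versions of the filters named above -- the demo's exact values may differ.
horizontal = [[ 1,  1,  1],
              [ 0,  0,  0],
              [-1, -1, -1]]    # bright-above / dark-below transitions

vertical   = [[ 1,  0, -1],
              [ 1,  0, -1],
              [ 1,  0, -1]]    # bright-left / dark-right transitions

laplacian  = [[ 0,  1,  0],
              [ 1, -4,  1],
              [ 0,  1,  0]]    # responds to edges in any direction

sharpen    = [[ 0, -1,  0],
              [-1,  5, -1],
              [ 0, -1,  0]]    # identity plus edges: makes the drawing "pop"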
What Filters Actually Learn #
In an untrained CNN, filters are random noise. After training on images:
- Layer 1 filters: oriented edges (Sobel-like), colour gradients, blobs
- Layer 2 filters: combinations — corners, T-junctions, simple textures
- Layer 3+ filters: complex textures, object parts
- Deep layers: semantic concepts — eyes, wheels, text
This was verified empirically by visualising what input images maximise each filter’s activation. The hierarchy is real — and it emerged from gradient descent on labels, not from anyone designing it.
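A minimal sketch of that visualisation idea, assuming a trained model like the Sequential above (the function name and hyperparameters here are illustrative, not from any particular library):

import torch

def visualise_filter(model, layer_idx=0, filter_idx=0, steps=100, lr=0.1):
    """Gradient ascent on the input: find an image that strongly excites one filter."""
    x = torch.randn(1, 1, 28, 28, requires_grad=True)    # start from random noise
    for _ in range(steps):
        activation = model[: layer_idx + 1](x)           # run the layers up to the one we care about
        loss = -activation[0, filter_idx].mean()         # maximise that filter's mean response
        loss.backward()
        with torch.no_grad():
            x -= lr * x.grad                              # step the *input*, not the weights
            x.grad.zero_()
    return x.detach()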
Before You Go — Try These #
- In the demo, draw a single horizontal line in the middle. Which filter responds most strongly? Now draw a vertical line. Which filter responds now? Is there any response in the “wrong” filter?
- A conv layer takes input of shape $(H, W, C_{in})$ and has $C_{out}$ filters each of size $k \times k$. How many learnable parameters does this layer have (including biases)?
- Why does max pooling give some translation invariance? Draw a 2×2 example where a feature shifts one pixel but the max-pooled output stays the same.
- A fully connected layer with 1000 input neurons and 512 output neurons has how many parameters? A conv layer with 32 input channels, 64 output channels, and 3×3 kernels has how many? Which would you use for processing a 64×64 image, and why?
- CNNs are translation-invariant but not rotation-invariant by default. Why? What architectural change would give some rotation invariance?
Next up → Lesson 11: Memory in the Loop — recurrent neural networks, sequences, and why order matters.