08 — Just NumPy, No Magic

Stop reading about the concepts and start writing the code. A full neural network in pure Python and NumPy — the same thing PyTorch does internally, just slower.

You’ve been told how it works. Now let’s write it.

No PyTorch. No TensorFlow. Just NumPy and the stuff we’ve built up over seven lessons. By the end of this, PyTorch won’t feel like magic — it’ll feel like a faster version of something you already understand.


The Layer #

Every layer does the same two things: a forward pass and a backward pass. Wrap that in a class.

import numpy as np

class Layer:
    def __init__(self, n_in, n_out, activation='relu'):
        # He initialisation for ReLU
        self.W = np.random.randn(n_out, n_in) * np.sqrt(2 / n_in)
        self.b = np.zeros(n_out)
        self.activation = activation
        self.x = self.z = self.a = None  # cache for backward pass

    def forward(self, x):
        self.x = x                        # cache input
        self.z = self.W @ x + self.b      # pre-activation
        if self.activation == 'relu':
            self.a = np.maximum(0, self.z)
        elif self.activation == 'softmax':
            e = np.exp(self.z - self.z.max())  # stable softmax
            self.a = e / e.sum()
        return self.a

    def backward(self, grad, lr):
        # grad = dL/da (coming from next layer)
        if self.activation == 'relu':
            dz = grad * (self.z > 0)       # ReLU gradient
        else:
            dz = grad                       # softmax grad handled upstream

        dW = np.outer(dz, self.x)          # gradient for weights
        db = dz                             # gradient for biases
        dx = self.W.T @ dz                 # gradient to pass back

        self.W -= lr * dW
        self.b -= lr * db
        return dx

The forward pass stores everything it needs for the backward pass. That’s the cache.
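
Before wiring layers together, it helps to confirm the shapes line up. A quick throwaway check (the numbers here are arbitrary, not from the lesson's dataset):

layer = Layer(n_in=4, n_out=3)
x = np.random.randn(4)

a = layer.forward(x)                        # activations, shape (3,)
dx = layer.backward(np.ones(3), lr=0.01)    # gradient w.r.t. input, shape (4,)
print(a.shape, dx.shape)                    # (3,) (4,)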


The Network #

A network is a stack of layers. Forward pass goes left to right. Backward pass goes right to left.

class NeuralNet:
    def __init__(self, sizes):
        # e.g. sizes = [36, 16, 3]
        self.layers = []
        for i in range(len(sizes) - 1):
            act = 'relu' if i < len(sizes) - 2 else 'softmax'
            self.layers.append(Layer(sizes[i], sizes[i+1], act))

    def predict(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x

    def train_step(self, x, y, lr=0.05):
        # Forward pass
        yhat = self.predict(x)

        # Cross-entropy + softmax gradient (beautifully simplified)
        # dL/dz_output = yhat - one_hot(y)
        grad = yhat.copy()
        grad[y] -= 1.0

        # Backward pass — chain rule, reversed
        for layer in reversed(self.layers):
            grad = layer.backward(grad, lr)

        return -np.log(yhat[y] + 1e-7)  # loss for logging

That’s the entire training machinery. train_step does four things:

  1. Forward pass through all layers
  2. Compute the gradient at the output (the softmax+CE combined trick; worked example after this list)
  3. Pass it backward through each layer, collecting dx at each step
  4. Each layer updates its own weights internally
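
To make step 2 concrete, here is the output gradient for some made-up numbers:

yhat = np.array([0.10, 0.70, 0.20])    # softmax output over 3 classes
y = 1                                   # correct class index

grad = yhat.copy()
grad[y] -= 1.0                          # dL/dz = yhat - one_hot(y)
# grad is now [ 0.10, -0.30,  0.20]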

Training Loop #

net = NeuralNet([36, 16, 3])

for epoch in range(500):
    np.random.shuffle(data)
    total_loss = 0
    for x, y in data:
        total_loss += net.train_step(x, y, lr=0.05)

    if epoch % 100 == 0:
        # Accuracy check
        correct = sum(net.predict(x).argmax() == y for x, y in data)
        print(f"epoch {epoch}  loss={total_loss/len(data):.4f}  acc={correct/len(data):.2%}")

Output:

epoch 0    loss=1.1832  acc=34.17%
epoch 100  loss=0.4521  acc=82.50%
epoch 200  loss=0.2103  acc=93.33%
epoch 300  loss=0.1347  acc=97.50%
epoch 400  loss=0.0981  acc=98.33%
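
One thing the loop glosses over: data is assumed to be a list of (x, y) pairs, where x is a flattened 6×6 grid (36 values) and y is a class index. Here is one hypothetical way to build such a set for the three pattern classes used in the demo further down — make_example is a made-up helper, not part of the lesson's code:

def make_example(cls):
    grid = np.zeros((6, 6))
    if cls == 0:                                  # horizontal line at a random row
        grid[np.random.randint(6), :] = 1.0
    elif cls == 1:                                # vertical line at a random column
        grid[:, np.random.randint(6)] = 1.0
    else:                                         # X: both diagonals
        np.fill_diagonal(grid, 1.0)
        np.fill_diagonal(np.fliplr(grid), 1.0)
    return grid.flatten(), cls

data = [make_example(c) for c in range(3) for _ in range(120)]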

This Is What PyTorch Does #

Seriously. Under the hood, PyTorch’s nn.Linear, nn.ReLU, loss.backward(), optimizer.step() — it’s this exact pattern. What PyTorch adds:

  • Autograd — computes gradients automatically for any computation graph, so there’s no manual backprop (tiny example after this list)
  • GPU support — same math on CUDA cores, 100× faster
  • Optimisers — Adam, momentum, etc. instead of plain SGD
  • Batching — handles matrices of examples instead of one at a time
  • Ecosystem — pretrained models, dataloaders, etc.
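
A two-line taste of autograd, separate from the lesson’s code (assuming PyTorch is installed):

import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2        # builds the computation graph
y.backward()      # autograd fills in dy/dx
print(x.grad)     # tensor(6.)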

The math is identical. The abstraction is identical. Once you can write this by hand, you’ll never be confused by PyTorch again.


The PyTorch Equivalent #

import torch
import torch.nn as nn

# Identical architecture. No explicit softmax layer:
# nn.CrossEntropyLoss applies log-softmax internally.
model = nn.Sequential(
    nn.Linear(36, 16),
    nn.ReLU(),
    nn.Linear(16, 3)
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(500):
    for x, y in dataloader:
        optimizer.zero_grad()            # gradients accumulate by default; clear them
        loss = loss_fn(model(x), y)      # forward pass + loss
        loss.backward()                  # autograd: the backward pass
        optimizer.step()                 # SGD update, like our manual W -= lr * dW

Same structure. Same math. More automatic.
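
The only name the snippet leaves undefined is dataloader. One way to build it from the same NumPy data list used earlier (a sketch):

import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

X = torch.tensor(np.stack([x for x, _ in data]), dtype=torch.float32)
Y = torch.tensor([y for _, y in data], dtype=torch.long)
dataloader = DataLoader(TensorDataset(X, Y), batch_size=32, shuffle=True)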


Demo: Pixel Pattern Classifier #

This is the network running live in your browser. Draw on the 6×6 grid — click or drag to toggle pixels. The network (trained on 120 examples of each class right when the page loaded) predicts what you drew.

Draw a horizontal line, a vertical line, or an X. See if it gets it.


The network you’re drawing against was trained entirely in JavaScript when this page loaded — same backprop, same gradient descent, same weight updates we’ve written. 120 training examples, 600 epochs, a few hundred milliseconds.


What We Haven’t Added (Yet) #

This implementation is missing a few things real networks use:

  • Batching — we train on one example at a time. Real code batches 32–256 examples for GPU efficiency (see the sketch after this list).
  • Momentum / Adam — we use plain SGD. Real optimisers accumulate gradient history.
  • Regularisation — dropout, weight decay. Without it, the network can memorise instead of generalise.
  • Autograd — we hand-computed every gradient. PyTorch builds a computation graph automatically.
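
Batching is mostly a reshaping exercise: with a batch X of shape (batch, n_in), the forward pass of a ReLU layer becomes a matrix-matrix product. A sketch, not wired into the Layer class above (forward_batch is a hypothetical helper):

def forward_batch(layer, X):
    # X: (batch, n_in)  ->  Z: (batch, n_out)
    Z = X @ layer.W.T + layer.b
    return np.maximum(0, Z)

The backward pass changes in the same spirit, with the weight and bias gradients summed (or averaged) over the batch dimension.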

Lesson 09 covers training dynamics — overfitting, dropout, batch normalisation in practice — and Lesson 13 migrates everything to PyTorch so you can see exactly what each hand-written piece maps to.


Before You Go — Try These #

  1. In the backward method, the line dW = np.outer(dz, self.x) computes the weight gradient. What are the dimensions of dz and self.x for a layer with 16 inputs and 8 outputs? What shape does dW have, and why must it match self.W?

  2. The softmax+CE gradient simplifies to yhat - one_hot(y). Derive this yourself: start with $L = -\log(\hat{y}_c)$ where $c$ is the correct class, apply chain rule through the softmax, and show the cancellation.

  3. In the pixel demo, train on mostly horizontal lines with a few diagonals. Does the network become more confident on horizontals? What does this tell you about how training data distribution affects the model?

  4. The Layer.backward method both computes gradients and updates weights in the same call. What problem does this cause if you wanted to implement momentum or Adam? How would you restructure it?

  5. What would happen if you removed the self.z.max() subtraction from the softmax implementation? Try it mentally with $z = [1000, 1001, 1002]$.


Next up → Lesson 09: Too Good to Be True — overfitting, regularisation, dropout, and batch norm in practice.