You’ve been told how it works. Now let’s write it.
No PyTorch. No TensorFlow. Just NumPy and the stuff we’ve built up over seven lessons. By the end of this, PyTorch won’t feel like magic — it’ll feel like a faster version of something you already understand.
The Layer #
Every layer does the same two things: a forward pass and a backward pass. Wrap that in a class.
import numpy as np

class Layer:
    def __init__(self, n_in, n_out, activation='relu'):
        # He initialisation for ReLU
        self.W = np.random.randn(n_out, n_in) * np.sqrt(2 / n_in)
        self.b = np.zeros(n_out)
        self.activation = activation
        self.x = self.z = self.a = None  # cache for backward pass

    def forward(self, x):
        self.x = x                    # cache input
        self.z = self.W @ x + self.b  # pre-activation
        if self.activation == 'relu':
            self.a = np.maximum(0, self.z)
        elif self.activation == 'softmax':
            e = np.exp(self.z - self.z.max())  # stable softmax
            self.a = e / e.sum()
        return self.a

    def backward(self, grad, lr):
        # grad = dL/da (coming from next layer)
        if self.activation == 'relu':
            dz = grad * (self.z > 0)   # ReLU gradient
        else:
            dz = grad                  # softmax grad handled upstream
        dW = np.outer(dz, self.x)      # gradient for weights
        db = dz                        # gradient for biases
        dx = self.W.T @ dz             # gradient to pass back
        self.W -= lr * dW
        self.b -= lr * db
        return dx
The forward pass stores everything it needs for the backward pass. That’s the cache.
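Here's a quick sanity check you can run with the Layer class above to see the cache and the shapes in action (not part of the network yet, just a standalone check):

layer = Layer(36, 16)                       # 36 inputs, 16 outputs
x = np.random.randn(36)
a = layer.forward(x)                        # caches x, z, a on the layer
grad_from_next = np.random.randn(16)        # stand-in for dL/da from the next layer
dx = layer.backward(grad_from_next, lr=0.05)
print(a.shape, dx.shape)                    # (16,) (36,): dx feeds the previous layer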
The Network #
A network is a stack of layers. Forward pass goes left to right. Backward pass goes right to left.
class NeuralNet:
    def __init__(self, sizes):
        # e.g. sizes = [36, 16, 3]
        self.layers = []
        for i in range(len(sizes) - 1):
            act = 'relu' if i < len(sizes) - 2 else 'softmax'
            self.layers.append(Layer(sizes[i], sizes[i+1], act))

    def predict(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x

    def train_step(self, x, y, lr=0.05):
        # Forward pass
        yhat = self.predict(x)
        # Cross-entropy + softmax gradient (beautifully simplified)
        # dL/dz_output = yhat - one_hot(y)
        grad = yhat.copy()
        grad[y] -= 1.0
        # Backward pass — chain rule, reversed
        for layer in reversed(self.layers):
            grad = layer.backward(grad, lr)
        return -np.log(yhat[y] + 1e-7)  # loss for logging
That’s the entire training machinery. Each `train_step` does four things:
- Forward pass through all layers
- Compute gradient at the output (softmax+CE combined trick)
- Pass it backward through each layer, collecting `dx` at each step
- Each layer updates its own weights internally (a numerical check of these gradients is sketched just below)
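If you want to convince yourself the hand-written gradients are right, a finite-difference check works well. This is a sketch, not part of the lesson's code: it nudges one weight of a single softmax layer and compares the measured change in loss against the analytic `dW = np.outer(dz, x)`.

np.random.seed(0)
layer = Layer(4, 3, activation='softmax')
x = np.random.randn(4)
y = 2

def loss(layer, x, y):
    return -np.log(layer.forward(x)[y] + 1e-7)

# analytic gradient for W[0, 0]: dL/dz = yhat - one_hot(y), dW = outer(dz, x)
yhat = layer.forward(x)
dz = yhat.copy()
dz[y] -= 1.0
analytic = np.outer(dz, x)[0, 0]

# numerical gradient for W[0, 0]: central difference
eps = 1e-5
layer.W[0, 0] += eps;      hi = loss(layer, x, y)
layer.W[0, 0] -= 2 * eps;  lo = loss(layer, x, y)
layer.W[0, 0] += eps       # restore the weight
numerical = (hi - lo) / (2 * eps)

print(analytic, numerical)  # should agree to several decimal places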
Training Loop #
net = NeuralNet([36, 16, 3])

for epoch in range(500):
    np.random.shuffle(data)
    total_loss = 0
    for x, y in data:
        total_loss += net.train_step(x, y, lr=0.05)
    if epoch % 100 == 0:
        # Accuracy check
        correct = sum(net.predict(x).argmax() == y for x, y in data)
        print(f"epoch {epoch} loss={total_loss/len(data):.4f} acc={correct/len(data):.2%}")
Output:
epoch 0 loss=1.1832 acc=34.17%
epoch 100 loss=0.4521 acc=82.50%
epoch 200 loss=0.2103 acc=93.33%
epoch 300 loss=0.1347 acc=97.50%
epoch 400 loss=0.0981 acc=98.33%
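The loop assumes `data` is a list of `(x, y)` pairs, where `x` is a flattened 6×6 grid and `y` a class index. Here's one hypothetical way to build such a set, loosely modelled on the browser demo's three classes (the `make_dataset` helper and its noise level are illustrative, not part of the lesson):

def make_dataset(n_per_class=120, noise=0.05):
    data = []
    for _ in range(n_per_class):
        # class 0: horizontal line on a random row
        grid = np.zeros((6, 6)); grid[np.random.randint(6), :] = 1
        data.append((grid.flatten(), 0))
        # class 1: vertical line in a random column
        grid = np.zeros((6, 6)); grid[:, np.random.randint(6)] = 1
        data.append((grid.flatten(), 1))
        # class 2: an X across the grid
        grid = np.clip(np.eye(6) + np.fliplr(np.eye(6)), 0, 1)
        data.append((grid.flatten(), 2))
    # flip a few random pixels so examples within a class aren't identical
    for x, _ in data:
        flips = np.random.rand(36) < noise
        x[flips] = 1 - x[flips]
    return data

data = make_dataset()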
This Is What PyTorch Does #
Seriously. Under the hood, PyTorch’s nn.Linear, nn.ReLU, loss.backward(), optimizer.step() — it’s this exact pattern. What PyTorch adds:
- Autograd — computes gradients automatically for any computation graph (no manual backprop)
- GPU support — same math on CUDA cores, 100× faster
- Optimisers — Adam, momentum, etc. instead of plain SGD
- Batching — handles matrices of examples instead of one at a time
- Ecosystem — pretrained models, dataloaders, etc.
The math is identical. The abstraction is identical. Once you can write this by hand, you’ll never be confused by PyTorch again.
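To make the autograd point concrete, here is a toy example (two inputs, one ReLU unit, not the lesson's network): write only the forward pass, and PyTorch recovers the same gradients we derived by hand.

import torch

x = torch.tensor([1.0, -2.0], requires_grad=True)
W = torch.tensor([[0.5, -0.3]], requires_grad=True)

a = torch.relu(W @ x)   # forward pass only; no backward code written
a.sum().backward()      # autograd walks the computation graph for us

print(W.grad)           # matches our hand-written np.outer(dz, x)
print(x.grad)           # matches our hand-written W.T @ dz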
The PyTorch Equivalent #
import torch
import torch.nn as nn

# Identical architecture
model = nn.Sequential(
    nn.Linear(36, 16),
    nn.ReLU(),
    nn.Linear(16, 3)
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(500):
    for x, y in dataloader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
Same structure. Same math. More automatic.
Demo: Pixel Pattern Classifier #
This is the network running live in your browser. Draw on the 6×6 grid — click or drag to toggle pixels. The network (trained on 120 examples of each class right when the page loaded) predicts what you drew.
Draw a horizontal line, a vertical line, or an X. See if it gets it.
The network you’re drawing against was trained entirely in JavaScript when this page loaded — same backprop, same gradient descent, same weight updates we’ve written. 120 training examples, 600 epochs, a few hundred milliseconds.
What We Haven’t Added (Yet) #
This implementation is missing a few things real networks use:
- Batching — we train on one example at a time. Real code batches 32–256 for GPU efficiency.
- Momentum / Adam — we use plain SGD. Real optimisers accumulate gradient history (a minimal momentum sketch follows this list).
- Regularisation — dropout, weight decay. Without it, the network can memorise instead of generalise.
- Autograd — we hand-computed every gradient. PyTorch builds a computation graph automatically.
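As a taste of the momentum idea, here is a minimal sketch of SGD with momentum. It assumes `backward` has been restructured to return `(dW, db)` per layer instead of applying the update itself, which is exactly what one of the exercises below asks you to think about; the `SGDMomentum` class and its 0.9 decay factor are illustrative choices, not code from this lesson.

class SGDMomentum:
    def __init__(self, layers, lr=0.05, beta=0.9):
        self.layers, self.lr, self.beta = layers, lr, beta
        self.vW = [np.zeros_like(l.W) for l in layers]   # velocity per layer
        self.vb = [np.zeros_like(l.b) for l in layers]

    def step(self, grads):
        # grads: list of (dW, db) per layer, in forward order
        for i, (dW, db) in enumerate(grads):
            self.vW[i] = self.beta * self.vW[i] + dW     # accumulate gradient history
            self.vb[i] = self.beta * self.vb[i] + db
            self.layers[i].W -= self.lr * self.vW[i]
            self.layers[i].b -= self.lr * self.vb[i]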
Lesson 09 covers training dynamics — overfitting, dropout, batch normalisation in practice — and lesson 13 migrates everything to PyTorch so you can see exactly where each hand-written piece maps.
Before You Go — Try These #
- In the `backward` method, the line `dW = np.outer(dz, self.x)` computes the weight gradient. What are the dimensions of `dz` and `self.x` for a layer with 16 inputs and 8 outputs? What shape does `dW` have, and why must it match `self.W`?
- The softmax+CE gradient simplifies to `yhat - one_hot(y)`. Derive this yourself: start with $L = -\log(\hat{y}_c)$ where $c$ is the correct class, apply the chain rule through the softmax, and show the cancellation.
- In the pixel demo, train on mostly horizontal lines with a few diagonals. Does the network become more confident on horizontals? What does this tell you about how training data distribution affects the model?
- The `Layer.backward` method both computes gradients and updates weights in the same call. What problem does this cause if you wanted to implement momentum or Adam? How would you restructure it?
- What would happen if you removed the `self.z.max()` subtraction from the softmax implementation? Try it mentally with $z = [1000, 1001, 1002]$.
Next up → Lesson 09: Too Good to Be True — overfitting, regularisation, dropout, and batch norm in practice.