Look, I’ll be honest — most people skip this part. They jump straight into a framework, call model.fit(), and wonder why nothing makes sense when something breaks.
We’re not doing that.
Everything a neural network does — every prediction, every weight update, every gradient — is built on three things: linear algebra, calculus, and the chain rule. Once these click, the rest of this series will feel inevitable rather than magical.
Let’s go.
Scalars, Vectors, Matrices, Tensors #
These are just ways of organizing numbers. That’s it. The fancy names make it sound harder than it is.
Scalar #
A single number. Temperature, price, error value — one number.
$$x = 5.3$$
Vector #
An ordered list of numbers. Think of it as a point in space, or a list of features.
$$\mathbf{v} = \begin{bmatrix} 1 \\ 4 \\ 2 \end{bmatrix}$$
A vector with $n$ numbers lives in $n$-dimensional space. That’s it. A user’s ratings for 5 movies? A 5-dimensional vector. The pixel intensities of a 28×28 image flattened out? A 784-dimensional vector.
Notation: bold lowercase $\mathbf{v}$ means vector. $v_i$ means the $i$-th element.
Matrix #
A 2D grid of numbers. Rows and columns.
$$A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}$$
This is a $2 \times 3$ matrix (2 rows, 3 columns). We write $A \in \mathbb{R}^{2 \times 3}$.
$A_{ij}$ means row $i$, column $j$. So $A_{12} = 2$.
Notation: uppercase $A$ means matrix.
Tensor #
A generalization to $N$ dimensions. A matrix is a 2D tensor. A color image (height × width × 3 channels) is a 3D tensor. A batch of 32 images is a 4D tensor.
Scalar → 0D tensor → just a number
Vector → 1D tensor → [1, 2, 3]
Matrix → 2D tensor → [[1,2],[3,4]]
Image → 3D tensor → shape (H, W, C)
Batch → 4D tensor → shape (N, H, W, C)
In Python with NumPy:
import numpy as np
scalar = np.array(5.3) # shape: ()
vector = np.array([1, 4, 2]) # shape: (3,)
matrix = np.array([[1,2,3],[4,5,6]]) # shape: (2, 3)
tensor = np.zeros((32, 28, 28, 3)) # shape: (32, 28, 28, 3)
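One gotcha before moving on: the math notation is 1-indexed, but NumPy is 0-indexed, so $A_{12}$ from above is matrix[0, 1] in code. Continuing with the arrays just defined:
print(matrix[0, 1])   # 2  (A_12 in math notation; NumPy counts from 0)
print(matrix.shape)   # (2, 3)  -- 2 rows, 3 columns
print(vector.shape)   # (3,)    -- a vector in 3-dimensional space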
Vector Operations #
Addition #
Add element-by-element. Vectors must be the same size.
$$\begin{bmatrix} 1 \\ 3 \end{bmatrix} + \begin{bmatrix} 2 \\ 1 \end{bmatrix} = \begin{bmatrix} 3 \\ 4 \end{bmatrix}$$
Scalar Multiplication #
Multiply every element by a number.
$$3 \cdot \begin{bmatrix} 1 \\ 2 \end{bmatrix} = \begin{bmatrix} 3 \\ 6 \end{bmatrix}$$
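Both operations are element-wise, and NumPy mirrors the math directly. A quick check of the two examples above:
import numpy as np
print(np.array([1, 3]) + np.array([2, 1]))   # [3 4]  element-wise addition
print(3 * np.array([1, 2]))                  # [3 6]  scalar multiplication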
The Dot Product — This Is the Big One #
$$\mathbf{a} \cdot \mathbf{b} = \sum_{i} a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$$
Example:
$$\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \cdot \begin{bmatrix} 4 \\ 5 \\ 6 \end{bmatrix} = (1)(4) + (2)(5) + (3)(6) = 4 + 10 + 18 = 32$$
Why it matters: a dot product is a weighted sum. And a weighted sum is exactly what a neuron computes. The weights $\mathbf{w}$ and the inputs $\mathbf{x}$, dotted together. This is the core operation in every neural network, repeated millions of times.
Geometrically: $\mathbf{a} \cdot \mathbf{b} = |\mathbf{a}||\mathbf{b}|\cos\theta$. When two vectors point in the same direction ($\theta = 0$), the dot product is maximized. When they’re perpendicular, it’s zero. This is why dot products measure similarity.
[Interactive demo: Dot Product as Similarity — Rotate & See. Drag the slider to rotate vector b and watch how the dot product changes with the angle.]
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
dot = np.dot(a, b) # 32
# or: a @ b # same thing
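One more thing worth checking numerically: divide the dot product by the two vector lengths and you recover $\cos\theta$. That ratio is the cosine similarity you'll meet all over ML. A standalone check with the same a and b:
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
cos_theta = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cos_theta, 3))                        # 0.975 -- nearly parallel
print(round(np.degrees(np.arccos(cos_theta)), 1)) # 12.9 degrees apart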
Matrix Multiplication #
This is where single neurons become entire layers.
For matrices $A$ (shape $m \times n$) and $B$ (shape $n \times p$):
$$C = AB \quad \text{where} \quad C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$
Each element $C_{ij}$ is the dot product of row $i$ of $A$ with column $j$ of $B$.
The inner dimensions must match: $(m \times \mathbf{n}) \times (\mathbf{n} \times p) = (m \times p)$.
$$\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 1(5)+2(7) & 1(6)+2(8) \\ 3(5)+4(7) & 3(6)+4(8) \end{bmatrix} = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}$$
Why it matters: an entire neural network layer is one matrix multiply. Input vector $\mathbf{x}$ goes in, weight matrix $W$ transforms it, output vector comes out. That’s it.
$$\text{layer output} = W\mathbf{x} + \mathbf{b}$$
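Here are both claims checked in NumPy: the worked 2×2 example, and a "layer" as a single matrix multiply. The layer sizes (4 inputs, 3 outputs) and random values are made up purely for illustration:
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(A @ B)                 # [[19 22]
                             #  [43 50]] -- matches the hand calculation

W = np.random.randn(3, 4)    # weight matrix: 4 inputs -> 3 outputs
b = np.random.randn(3)       # bias vector
x = np.random.randn(4)       # input vector
print((W @ x + b).shape)     # (3,) -- inner dims matched: (3x4)(4x1) -> (3x1)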
Play with it below — hover over any result cell to see which row × column produced it.
[Interactive demo: Matrix Multiplication — Row × Column, showing Matrix A, Matrix B, and Result C.]
Derivatives — How Things Change #
A derivative tells you how fast a function is changing at a specific point. That’s the entire idea.
Formally:
$$f'(x) = \frac{df}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
This is the slope of the tangent line at point $x$. If $f'(x) > 0$, the function is going up. If $f'(x) < 0$, going down. If $f'(x) = 0$, you’re at a flat spot — either a peak, a valley, or an inflection point.
The Rules You Actually Need #
Power rule: $$\frac{d}{dx}[x^n] = nx^{n-1}$$
So $\frac{d}{dx}[x^2] = 2x$, and $\frac{d}{dx}[x^3] = 3x^2$.
Sum rule: $$\frac{d}{dx}[f(x) + g(x)] = f'(x) + g'(x)$$
Constant multiple rule: $$\frac{d}{dx}[c \cdot f(x)] = c \cdot f'(x)$$
So for $f(x) = 3x^4 - 2x^2 + 7$:
$$f'(x) = 12x^3 - 4x$$
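If you ever doubt a derivative, you can check it against the limit definition directly: pick a tiny h and compare. Here's that check on the example above (x = 1.5 and h = 1e-6 are arbitrary choices):
def f(x):
    return 3 * x**4 - 2 * x**2 + 7

def f_prime(x):
    return 12 * x**3 - 4 * x           # the analytic answer from above

x, h = 1.5, 1e-6
print(f_prime(x))                      # 34.5
print((f(x + h) - f(x)) / h)           # ~34.5 (approaches 34.5 as h shrinks)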
Why this matters for neural nets: training a network means minimizing a loss function. To do that, we need to know which direction to adjust each weight — that’s exactly the derivative. Move in the direction that reduces the loss.
Move the slider below to see the tangent line (= the derivative) at any point:
[Interactive demo: Derivative — The Slope of the Tangent Line.]
Partial Derivatives & The Gradient #
Most functions in neural networks have many inputs — one per weight. So we need derivatives that handle multiple variables.
A partial derivative $\frac{\partial f}{\partial x_i}$ asks: if I nudge only $x_i$, holding everything else fixed, how does $f$ change?
Example: $f(x, y) = x^2 + 3xy$
$$\frac{\partial f}{\partial x} = 2x + 3y \qquad \frac{\partial f}{\partial y} = 3x$$
The gradient stacks all partial derivatives into a vector:
$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$
The gradient points in the direction of steepest increase. To minimize a loss function, we go the opposite direction — that’s gradient descent, and it’s the engine behind all of training.
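The same finite-difference check works per variable: nudge only x, hold y fixed. And gradient descent itself is just a loop that steps opposite the gradient. Below is a minimal sketch; the bowl-shaped g, the starting point, and the 0.1 step size are all made up for illustration:
import numpy as np

def f(x, y):
    return x**2 + 3 * x * y

h = 1e-6
x, y = 2.0, -1.0
print((f(x + h, y) - f(x, y)) / h)     # ~1.0, matches 2x + 3y = 1
print((f(x, y + h) - f(x, y)) / h)     # ~6.0, matches 3x = 6

# Gradient descent on a bowl-shaped g(x, y) = (x - 1)^2 + (y + 2)^2,
# whose gradient is [2(x - 1), 2(y + 2)] and whose minimum sits at (1, -2)
p = np.array([5.0, 5.0])
for _ in range(100):
    grad = np.array([2 * (p[0] - 1), 2 * (p[1] + 2)])
    p = p - 0.1 * grad                 # step opposite the gradient
print(p)                               # ~[ 1. -2.]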
The Chain Rule — The Most Important Thing Here #
If you remember one thing from this lesson, make it this.
When you compose two functions — $y = f(u)$ and $u = g(x)$ — the derivative of the whole thing is:
$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$
You multiply the derivatives at each step. That’s it.
Example: $y = (x^2 + 1)^3$
Let $u = x^2 + 1$, so $y = u^3$.
$$\frac{dy}{du} = 3u^2 = 3(x^2+1)^2$$ $$\frac{du}{dx} = 2x$$ $$\frac{dy}{dx} = 3(x^2+1)^2 \cdot 2x = 6x(x^2+1)^2$$
Another example: $y = \sin(x^2)$
$$\frac{dy}{dx} = \cos(x^2) \cdot 2x$$
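And once more, the finite-difference check, this time on the first chain-rule example (x = 0.7 is an arbitrary test point):
def y(x):
    return (x**2 + 1)**3

def dy_dx(x):
    return 6 * x * (x**2 + 1)**2       # the chain-rule answer from above

x, h = 0.7, 1e-6
print(dy_dx(x))                        # 9.3244...
print((y(x + h) - y(x)) / h)           # ~9.3244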
Why this owns neural networks: a neural network is a chain of composed functions. Input → layer 1 → layer 2 → … → loss. Backpropagation is just the chain rule applied backwards through this whole chain, computing how each weight contributed to the final error.
We’ll unpack this completely in Lesson 06. For now, just know: chain rule = backprop.
Putting It All Together — A Single Neuron #
Here’s the punchline. A single artificial neuron computes:
$$\text{output} = \sigma\left(\sum_{i} w_i x_i + b\right) = \sigma(\mathbf{w} \cdot \mathbf{x} + b)$$
Where:
- $\mathbf{x}$ — input vector (your data)
- $\mathbf{w}$ — weight vector (learned parameters)
- $b$ — bias (a scalar, learned too)
- $\sigma$ — an activation function (covered in Lesson 03)
That $\mathbf{w} \cdot \mathbf{x}$ is a dot product. The whole network is stacked matrix multiplications with activation functions in between. Training it means using partial derivatives and the chain rule to compute how each weight affects the loss, then nudging them in the right direction.
You now have everything you need to understand that completely.
import numpy as np
def neuron(x, w, b):
    z = np.dot(w, x) + b       # weighted sum (dot product)
    return sigmoid(z)          # activation (lesson 03)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))
x = np.array([0.5, 1.2, -0.8])
w = np.array([0.3, -0.5, 0.9])
b = 0.1
print(neuron(x, w, b)) # try it
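Stacking layers is just more of the same, with each layer's output feeding the next. A minimal two-layer sketch; the sizes (3 inputs, 4 hidden units, 1 output) and the random weights are made up for illustration:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

W1, b1 = np.random.randn(4, 3), np.random.randn(4)   # layer 1: 3 -> 4
W2, b2 = np.random.randn(1, 4), np.random.randn(1)   # layer 2: 4 -> 1

x = np.array([0.5, 1.2, -0.8])
h = sigmoid(W1 @ x + b1)     # first matrix multiply + activation
out = sigmoid(W2 @ h + b2)   # second matrix multiply + activation
print(out)                   # one number between 0 and 1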
Before You Go — Try These #
- Compute $\begin{bmatrix}2\\-1\\3\end{bmatrix} \cdot \begin{bmatrix}4\\5\\-2\end{bmatrix}$ by hand, then verify with the demo above.
- What’s the shape of $AB$ if $A \in \mathbb{R}^{4 \times 3}$ and $B \in \mathbb{R}^{3 \times 5}$?
- Find $\frac{d}{dx}[5x^3 - 2x + 9]$.
- Apply the chain rule to $f(x) = \cos(x^3)$.
- For $f(x, y) = x^2 y + y^3$, find $\frac{\partial f}{\partial x}$ and $\frac{\partial f}{\partial y}$.
Next up → Lesson 02: Meet the World’s Dumbest Brain Cell — we build a perceptron from scratch and watch it learn.