Learning from mistakes
A network starts its life with random weights, producing garbage when asked to recognize a pattern like a cat. To "learn," the network needs a way to measure how wrong it is and a strategy to improve - a process called optimization that is the heart of AI. We do this by defining a loss function, which acts like a meter telling us how far off our signal is from the intended answer. A popular choice is mean squared error (MSE), where we square the difference between the model's guess ($\hat{y}$) and the correct answer ($y$):

$$L = (y - \hat{y})^2$$

Where:
- $L$: The Loss (error score). Lower is better.
- $y$: The correct answer (the target).
- $\hat{y}$: The model's prediction (the guess).
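As a minimal sketch (plain NumPy, with made-up numbers), here is what that error score looks like for a single prediction:

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error: square the differences so big misses hurt more."""
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical targets and guesses for a 3-class "cat / dog / bird" output.
y_true = np.array([1.0, 0.0, 0.0])   # the image really is a cat
y_pred = np.array([0.3, 0.6, 0.1])   # an untrained network's noisy guess

print(mse_loss(y_true, y_pred))      # ≈ 0.29 - a large error score
```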
Squaring the difference ensures the result is always positive and disproportionately punishes large errors, forcing the network to prioritize fixing its biggest mistakes first. Minimizing this loss is like hiking down a mountain at night; you can't see the valley floor - the set of optimal weights - but you can feel the slope, or gradient, under your feet. By repeatedly calculating this gradient and taking small steps downhill, the network eventually finds its way to the bottom. The update rule shifts the weights ($w$) based on the derivative ($\frac{\partial L}{\partial w}$), which measures the sensitivity of the error to that specific weight, scaled by a learning rate that determines our step size. If the learning rate is too small, training takes forever; too large, and we might jump right over the valley.

$$w_{new} = w_{old} - \eta \frac{\partial L}{\partial w}$$

Where:
- $w_{new}$: The updated weight value.
- $w_{old}$: The current weight value.
- $\eta$: The step size (how fast we learn).
- $\frac{\partial L}{\partial w}$: The gradient (the direction of steepest ascent). We subtract it to go downhill.
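As a rough sketch (not from the original text), here is that update rule applied to a single weight on a simple bowl-shaped loss, $L(w) = (w - 3)^2$, whose gradient is $2(w - 3)$:

```python
# Gradient descent on L(w) = (w - 3)^2; the "valley floor" is at w = 3.
w = -5.0              # start from a random spot on the mountain
learning_rate = 0.1   # step size (eta)

for step in range(50):
    gradient = 2 * (w - 3)            # dL/dw, the slope under our feet
    w = w - learning_rate * gradient  # take a small step downhill

print(w)  # ≈ 3.0 - we found the bottom of the valley
```

Setting `learning_rate` to 1.5 instead makes every step overshoot the valley and the weight diverges - the "too large" failure mode described above.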
Click Train Step to calculate the local gradient and move the ball toward the nearest valley.
Backpropagation: the chain rule in action
Gradient descent tells us how to change the weights, but calculating the gradient for a deep network requires knowing the error caused by a weight buried deep inside a hidden layer. This is where backpropagation comes in, breaking the problem down into local steps using the chain rule. To find $\frac{\partial L}{\partial w_1}$ - how much changing an input weight affects the total loss - we calculate the sensitivity step-by-step. First, we look at the loss sensitivity with respect to the prediction, $\frac{\partial L}{\partial \hat{y}}$, which tells us the direction to push the output. Then we move backward to the output weight ($w_2$); since $\hat{y} = w_2 \cdot h$, its rate of change is simply the hidden node's value $h$. To pass the error deeper, we find the hidden node's sensitivity is $\frac{\partial \hat{y}}{\partial h} = w_2$, meaning stronger weights transmit more error. Finally, we reach the input weight ($w_1$), where the derivative is just the input $x$ because $h = w_1 \cdot x$. By multiplying these local instructions, we get the precise update for $w_1$:

$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial w_1} = 2(\hat{y} - y) \cdot w_2 \cdot x$$

Where:
- $\frac{\partial L}{\partial w_1}$: The final calculated change we need to make to the input weight.
- $\frac{\partial L}{\partial \hat{y}}$: The error from the output layer ($2(\hat{y} - y)$).
- $\frac{\partial \hat{y}}{\partial h}$: How much the next layer's weight ($w_2$) transmits that error.
- $\frac{\partial h}{\partial w_1}$: The original input signal ($x$) that triggered this chain.
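To make the chain concrete, here is a small sketch (with invented numbers for $x$, $w_1$, $w_2$, and $y$) that multiplies the three local sensitivities together:

```python
# Tiny linear chain: x -> h -> y_hat, with L = (y - y_hat)^2
x, w1, w2 = 2.0, 0.5, -1.5   # hypothetical input and weights
y = 1.0                      # target

# Forward pass
h = w1 * x                   # hidden node value
y_hat = w2 * h               # prediction

# Backward pass: multiply the local sensitivities
dL_dyhat = 2 * (y_hat - y)   # loss sensitivity to the prediction
dyhat_dh = w2                # the output weight transmits the error
dh_dw1 = x                   # the input triggered this chain

dL_dw1 = dL_dyhat * dyhat_dh * dh_dw1
print(dL_dw1)                # 15.0: a positive slope, so gradient descent pushes w1 down
```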
This chain logic works for any depth, allowing us to calculate gradients for weights buried deep in the network simply by passing error backward one step at a time. The "sensitivity" is often just the weight itself for linear parts of the network, but activation functions like ReLU introduce a gating behavior. Think of ReLU, $\text{ReLU}(z) = \max(0, z)$, as a switch: if the neuron was OFF (negative input), it contributed nothing and its derivative is 0, stopping the error signal dead. If it was ON, it passed the signal through directly with a derivative of 1. This ensures the network only updates the specific pathways that were actually active during the task.
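A quick sketch of that gate in code (plain Python, illustrative values):

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # The gate: 1 if the neuron fired during the forward pass, 0 if it slept.
    return 1.0 if z > 0 else 0.0

upstream_error = 0.8                      # error arriving from the layer above
print(upstream_error * relu_grad(2.3))    # 0.8 - the neuron was ON, error passes through
print(upstream_error * relu_grad(-1.7))   # 0.0 - the neuron was OFF, error is blocked
```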
In a real network, a single hidden neuron connects to multiple output categories - like Cat, Dog, and Bird - creating a web of signals. To calculate its total blame, we follow the forward pass logic in reverse: if the neuron influenced multiple paths, its total gradient is the sum of the gradients from each path. This operation of multiplying pairs and summing them up is exactly the dot product we learned earlier. When performed for every neuron at once, it becomes a single matrix multiplication, allowing us to synthesize millions of conflicting error signals into clean update steps. With these fundamentals - tensors for structure, layers for complexity, and backpropagation for training - you have everything needed to build a functional neural network.

$$\frac{\partial L}{\partial h_j} = \sum_{k} \delta_k \, w_{jk}$$

Where:
- $\frac{\partial L}{\partial h_j}$: The total error gradient for hidden neuron $j$.
- $\delta_k$: The error signal coming from each output class $k$.
- $w_{jk}$: The weight connecting neuron $j$ to that output class.
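A small NumPy sketch (hypothetical shapes and values) of that "sum over paths" becoming a matrix multiplication:

```python
import numpy as np

# 4 hidden neurons connected to 3 output classes (Cat, Dog, Bird).
W = np.random.randn(4, 3)           # W[j, k]: weight from hidden neuron j to class k
delta = np.array([0.7, -0.2, 0.1])  # error signal for each output class

# Total blame for each hidden neuron: sum over classes of delta_k * W[j, k].
grad_h_loop = np.array([sum(delta[k] * W[j, k] for k in range(3)) for j in range(4)])

# The same operation as one matrix multiplication (a dot product per neuron).
grad_h_matmul = W @ delta

print(np.allclose(grad_h_loop, grad_h_matmul))  # True
```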
| Step | Forward Pass (The Prediction) | Backward Pass (The Gradient) | Intuition |
|---|---|---|---|
| Weighted Sum | $z = w \cdot x + b$ | $\frac{\partial z}{\partial w} = x$ | The stronger the input $x$, the more the weight $w$ is to blame. |
| Bias | $z = w \cdot x + b$ | $\frac{\partial z}{\partial b} = 1$ | The bias gets the full downstream error since it's added directly. |
| Activation (ReLU) | $a = \max(0, z)$ | $\frac{\partial a}{\partial z} = 1$ if $z > 0$, else $0$ | The Gatekeeper: If the neuron fired (1), pass the blame back. If it slept (0), block it. |
| Loss (MSE) | $L = (y - a)^2$ | $\frac{\partial L}{\partial a} = 2(a - y)$ | The Source: The initial error signal comes from how far off we were. |
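Putting the whole table together, here is a sketch of one forward and backward pass through a single ReLU neuron (invented values; variable names mirror the table):

```python
# One neuron: z = w*x + b, a = ReLU(z), L = (y - a)^2
x, w, b, y = 1.5, 0.4, 0.1, 1.0   # hypothetical input, parameters, target

# Forward pass (the prediction)
z = w * x + b                     # weighted sum: 0.7
a = max(0.0, z)                   # ReLU activation: 0.7
L = (y - a) ** 2                  # loss: 0.09

# Backward pass (the gradient), reading the table bottom-up
dL_da = 2 * (a - y)               # loss row: -0.6
da_dz = 1.0 if z > 0 else 0.0     # ReLU row: the gate is open, so 1
dL_dz = dL_da * da_dz

dL_dw = dL_dz * x                 # weighted-sum row: blame scaled by the input
dL_db = dL_dz * 1.0               # bias row: gets the full downstream error

print(L, dL_dw, dL_db)            # ≈ 0.09, -0.9, -0.6
```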
Deep Dive: The Derivation
If you are curious where these formulas come from, here is the full calculus breakdown:
1. The Loss (MSE)

$$L = (y - a)^2 \quad\Rightarrow\quad \frac{\partial L}{\partial a} = -2(y - a) = 2(a - y)$$

2. The Activation (ReLU)

$$a = \max(0, z)$$

If $z > 0$, $\frac{\partial a}{\partial z} = 1$. If $z \le 0$, $\frac{\partial a}{\partial z} = 0$.

3. The Weights

Using the chain rule:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w} = 2(a - y) \cdot \text{ReLU}'(z) \cdot x$$

4. The Bias

Using the chain rule:

$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial b} = 2(a - y) \cdot \text{ReLU}'(z) \cdot 1$$
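One way to sanity-check a derivation like this is to compare the analytic gradient against a finite-difference estimate. Here is a sketch using the same single-neuron setup as above:

```python
def forward(w, b, x, y):
    """Forward pass for one ReLU neuron with MSE loss."""
    z = w * x + b
    a = max(0.0, z)
    return (y - a) ** 2

x, w, b, y = 1.5, 0.4, 0.1, 1.0
eps = 1e-6

# Analytic gradient from the derivation
z = w * x + b
a = max(0.0, z)
dL_dw = 2 * (a - y) * (1.0 if z > 0 else 0.0) * x

# Numerical gradient: nudge w slightly and watch how the loss changes
numeric = (forward(w + eps, b, x, y) - forward(w - eps, b, x, y)) / (2 * eps)

print(dL_dw, numeric)   # both ≈ -0.9, so the calculus checks out
```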
Designing the architecture
Before counting parameters, it is worth understanding why we structure networks as we do. For a task like recognizing 10 digits, we could output a single number from 0 to 9, but the network would wrongly assume that 6 and 7 are "closer" than 0 and 9. Instead, we use one-hot encoding, where ten separate output neurons each answer a yes/no question. Each output neuron receives every pixel as input, effectively performing a dot product between the image and its learned "template." While a single layer can only learn linear boundaries, hidden layers with nonlinear activations let the network bend the space to enclose complex data.
Linear layer: impossible to separate. A straight line cannot separate the center cluster from the surrounding corners.
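Returning to the output encoding described above, here is a quick sketch (NumPy, digits 0-9) of how one-hot encoding turns a class label into ten independent yes/no targets:

```python
import numpy as np

def one_hot(label, num_classes=10):
    """Turn a digit label into a vector of ten yes/no answers."""
    target = np.zeros(num_classes)
    target[label] = 1.0
    return target

print(one_hot(3))  # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
# No ordering is implied: the encodings for 6 and 7 are no "closer"
# to each other than the encodings for 0 and 9.
```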
Each hidden layer builds abstractions: the first layer might detect raw edges, the second combines them into shapes, and the output layer finally identifies the digit. This allows the network to think in steps. The size of these layers, such as choosing 256 neurons, is often a result of trial and error and GPU optimization; too narrow a layer loses information (underfitting), while too wide a layer wastes parameters (overfitting). For a simple MNIST model with no hidden layers, we connect 784 pixels to 10 outputs for a total of 7,850 parameters.
$$\text{Parameters} = (784 \times 10) + 10 = 7{,}850$$

Where:
- $784 \times 10$: The connections between every input pixel and every output neuron.
- $10$: The 10 threshold values (one for each output neuron).
Adding two hidden layers ($784 \to 256 \to 128 \to 10$) increases this capacity to over 235,000 parameters.

| Layer | Calculation | Parameters |
|-------|-------------|------------|
| 1 | $784 \times 256 + 256$ | 200,960 |
| 2 | $256 \times 128 + 128$ | 32,896 |
| 3 | $128 \times 10 + 10$ | 1,290 |
| Total | | 235,146 |
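A small sketch that reproduces both counts (layer sizes as given above):

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix + bias vector
    return total

print(count_parameters([784, 10]))            # 7850
print(count_parameters([784, 256, 128, 10]))  # 235146
```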
The principle scales to the massive models of today. GPT-3 has 96 layers and 175 billion parameters, all starting as random numbers that eventually arranged themselves into a structure that understands language through trillions of training loops. Large Language Models like GPT-4 use this exact same machinery; their inputs are text turned into numbers, their hidden layers extract meanings instead of shapes, and their output is a probability distribution over the next possible word. The only remaining piece of the puzzle is how we feed human text into a machine that only eats numbers - the subject of our next chapter.