
Transformers from first principles

The weighted sum

The perceptron's anatomy

We begin our journey at the atomic level. Before we can understand the massive orchestras of matrices that power GPT-4, we must understand the single instrument: the artificial neuron (or perceptron).

Mathematically, it is a simple machine that takes inputs, weighs their importance, adds a bias, and decides whether to "fire" (output a signal).

$$y = f(\mathbf{w} \cdot \mathbf{x} + b)$$

Where:

  • $\mathbf{x}$ are the inputs (the data).
  • $\mathbf{w}$ are the weights (the importance).
  • $b$ is the bias (the threshold).
  • $f$ is the activation function (see the short code sketch after this list).
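
To make this concrete, here is a minimal sketch of that computation in Python (assuming NumPy; ReLU, which we meet properly later in this chapter, stands in for $f$, and the numbers match the explorer further below):

```python
import numpy as np

def relu(z):
    # Pass positive signals through unchanged; silence negative ones.
    return np.maximum(0.0, z)

def neuron(x, w, b, f=relu):
    # The weighted sum w . x + b, pushed through the activation f.
    return f(np.dot(w, x) + b)

x = np.array([1.0, 0.5, -0.2])  # inputs
w = np.array([0.5, 0.8, 1.2])   # weights
b = 0.1                         # bias

print(neuron(x, w, b))          # ≈ 0.76
```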

A note on mental models

At this stage, resist the urge to assign complex real-world meaning to $x_1, x_2, x_3$. Don't think of them as "age," "height," or "income" yet.

Instead, visualize the neuron as a pattern matcher.

  • Imagine $x_1, x_2, x_3$ are just the brightness values of three pixels in an image.
  • The weights $w$ represent a "template" the neuron is looking for.
  • If the input pixels match the template (high input $\times$ high weight), the sum is large, and the neuron fires.
  • If they mismatch (high input $\times$ negative weight), the sum is low, and the neuron stays silent.

This is a mathematical device for measuring similarity, nothing more.
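
As a tiny illustration of this pattern-matching view (a sketch with made-up pixel values, assuming NumPy): the same template gives a large sum for an input that resembles it and a negative sum for one that doesn't.

```python
import numpy as np

# A made-up "template" the neuron is looking for: bright, bright, dark.
w = np.array([1.0, 1.0, -1.0])

matching_pixels    = np.array([0.9, 0.8, 0.1])  # resembles the template
mismatching_pixels = np.array([0.1, 0.2, 0.9])  # the opposite pattern

print(np.dot(w, matching_pixels))     # ≈  1.6 -> large sum, the neuron fires
print(np.dot(w, mismatching_pixels))  # ≈ -0.6 -> low sum, the neuron stays silent
```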

Explore the flow below. Notice how the weights scale the input, and the bias shifts the result before it hits the activation function.

Neuron Anatomy Explorer (interactive)

With weights $w_1 = 0.5$, $w_2 = 0.8$, $w_3 = 1.2$, inputs $x_1 = 1.0$, $x_2 = 0.5$, $x_3 = -0.2$, and bias $b = 0.1$, the explorer computes:

$$y = \mathrm{ReLU}\big((0.5 \cdot 1.0 + 0.8 \cdot 0.5 + 1.2 \cdot (-0.2)) + 0.1\big) = 0.76$$
The decision boundary

What does this math actually look like?

If we have two inputs, $x_1$ and $x_2$, the equation $\mathbf{w} \cdot \mathbf{x} + b = 0$ defines a straight line in 2D space. Everything on one side of the line gets classified as positive, and everything on the other side as negative.
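
In code, classifying a point is just checking which side of that line it lands on. Here is a sketch using the example boundary $0.5x + 0.5y - 5.0 = 0$ from the playground below (which class counts as "blue" is an arbitrary choice):

```python
import numpy as np

w = np.array([0.5, 0.5])
b = -5.0

def classify(point):
    # Positive side of w . x + b = 0 -> "blue", otherwise "red".
    return "blue" if np.dot(w, point) + b > 0 else "red"

print(classify(np.array([8.0, 7.0])))  # blue  (0.5*8 + 0.5*7 - 5 =  2.5)
print(classify(np.array([2.0, 3.0])))  # red   (0.5*2 + 0.5*3 - 5 = -2.5)
```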

Your goal:

  1. Easy mode: Adjust the weights ($w_1, w_2$) and bias ($b$) to separate the blue dots from the red dots. You should be able to get 100% accuracy.
  2. Hard mode: Switch the toggle. Now try to separate the red center from the blue corners.

Decision boundary playground (interactive). The widget reports the current accuracy and boundary equation, e.g. 88% accuracy for $0.5x + 0.5y - 5.0 = 0$.

Did you fail on hard mode? Good. That was the point.

Notice that no matter what you do, you can only draw a straight line. You cannot circle the red dots in the middle. This is the fundamental limitation of a single neuron: it is a linear classifier. It can solve "A vs B" if they are side-by-side, but it cannot solve "A inside B" or other complex "enclosed" patterns.

The need for non-linearity

If neurons could only draw straight lines, connecting a billion of them would still only draw a straight line: stacking or combining linear functions just produces another linear function.
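
You can check that claim numerically: two weighted sums stacked with no activation in between collapse into a single weighted sum (a sketch with random weights, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" that are pure weighted sums, with no activation in between.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

stacked = W2 @ (W1 @ x + b1) + b2

# One linear layer with merged weights and bias gives the same answer.
W, b = W2 @ W1, W2 @ b1 + b2
single = W @ x + b

print(np.allclose(stacked, single))  # True
```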

To describe the real world, which is full of curves, irregularities, and complex patterns, we need to introduce non-linearity. We need to "bend" the line.

Solving the impossible

So, how do we solve the problem above? We have two options:

  1. Use a deep network: Combine multiple neurons (which we'll do in Chapter 3).
  2. Feature engineering: Manually change the inputs to make the problem solvable.

In the example above, we failed because we were looking at $x$ and $y$ directly. But what if we transformed the inputs? What if we squared them?

If our neuron sees $x^2$ and $y^2$ instead of $x$ and $y$, the equation becomes $w_1 x^2 + w_2 y^2 + b = 0$. In geometry, that's the equation of an ellipse (or a circle if the weights are equal).

We haven't changed the neuron (it's still doing a linear sum), but we warped the input space so that a straight line in the "squared space" looks like a circle in the original space.
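
Here is that idea as a sketch (the weights and the radius-5 circle are made up for illustration): the neuron still computes a plain linear sum, but because it is fed $x^2$ and $y^2$, its straight-line boundary traces a circle back in $(x, y)$ space.

```python
def inside_boundary(x, y, w1=1.0, w2=1.0, b=-25.0):
    # Still a linear sum, just over the squared features x^2 and y^2.
    # The negative side of the boundary is the inside of the circle x^2 + y^2 = 25.
    return w1 * x**2 + w2 * y**2 + b < 0

print(inside_boundary(1.0, 2.0))  # True  -> a point near the centre is trapped
print(inside_boundary(6.0, 6.0))  # False -> a corner point stays outside
```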

Try it below. Adjust the weights to create a circular boundary that successfully traps the red dots.

Squared-feature playground (interactive): "What if we bend the input space?" The boundary is $w_1 x^2 + w_2 y^2 + b = 0$, and the widget reports the current accuracy (e.g. 90%).

This technique, manually creating new features like $x^2$ to solve a problem, is powerful, but it requires us to know the solution is a circle. We want AI to figure that out for us.

Activation functions

This is where the activation function comes in.

If feature engineering is us manually warping the input, the activation function is a small "gate" or "modulator" attached to every neuron that allows the network to learn these warps automatically (when we stack many neurons together).

For a single neuron, think of the activation function purely as a volume knob or dimmer switch.

  • It takes the raw sum ($\mathbf{w} \cdot \mathbf{x} + b$) which could be any number from $-\infty$ to $+\infty$.
  • It squashes or gates it into a useful output range.

Don't visualize it bending lines just yet. Instead, think of it as a rule for signal propagation:

  • Sigmoid: "Squash everything between 0% and 100%." (Useful for probabilities).
  • ReLU (the gatekeeper): "If the signal is positive, let it through unchanged. If it's negative, silence it completely."

Experiment below to see how these functions modulate the input signal.

Activation explorer (interactive). For example, with the sigmoid selected, an input of $x = 0.00$ produces an output of $y = 0.50$.
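
If you would rather poke at these two functions in code, here is a minimal sketch (assuming NumPy):

```python
import numpy as np

def sigmoid(z):
    # Squash any real number into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Let positive signals through unchanged; silence negative ones.
    return np.maximum(0.0, z)

for z in (-2.0, 0.0, 2.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.3f}  relu={relu(z):.1f}")
# z=-2.0  sigmoid=0.119  relu=0.0
# z=+0.0  sigmoid=0.500  relu=0.0
# z=+2.0  sigmoid=0.881  relu=2.0
```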