Build intuition for large language models through mathematics and interactive visualizations. No black boxes: every component is explained from the ground up.
Chapters
The weighted sum
The atomic unit of intelligence. Scalar operations, weights, biases, and activation functions like ReLU and Sigmoid.
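A minimal NumPy sketch of that atomic unit: one weighted sum plus a bias, passed through ReLU or sigmoid (the input and weight values are illustrative):

```python
import numpy as np

def neuron(x, w, b):
    """Weighted sum of inputs plus a bias: the core scalar operation."""
    return np.dot(w, x) + b

def relu(z):
    return np.maximum(0.0, z)        # pass positives, zero out negatives

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # squash to (0, 1)

x = np.array([0.5, -1.0, 2.0])  # inputs
w = np.array([0.8, 0.2, -0.4])  # learned weights
b = 0.1                         # learned bias
z = neuron(x, w, b)
print(z, relu(z), sigmoid(z))
```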
The shape of data
Moving from scalars to matrices. Understanding shapes, broadcasting, and why GPUs love linear algebra.
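In code, the whole chapter condenses to one line of shapes. A small NumPy sketch (batch and feature sizes are illustrative): the bias vector broadcasts across all 32 rows without any explicit loop, which is exactly the pattern GPUs execute in parallel.

```python
import numpy as np

X = np.random.randn(32, 4)  # a batch of 32 samples, 4 features each
W = np.random.randn(4, 8)   # weight matrix: 4 inputs -> 8 outputs
b = np.zeros(8)             # one bias per output

Y = X @ W + b               # (32,4) @ (4,8) -> (32,8); b broadcasts over rows
print(Y.shape)              # (32, 8)
```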
Layers of abstraction
Connecting neurons into layers. The forward pass as a series of matrix transformations: the multilayer perceptron (MLP).
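The forward pass really is just a loop of matrix multiplies with a non-linearity in between. A minimal sketch in NumPy (layer widths are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, layers):
    """Run x through a stack of (W, b) layers with ReLU in between."""
    for W, b in layers[:-1]:
        x = relu(x @ W + b)
    W, b = layers[-1]
    return x @ W + b         # no activation on the final layer

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 16)), np.zeros(16)),
          (rng.normal(size=(16, 3)), np.zeros(3))]
print(forward(rng.normal(size=(2, 4)), layers).shape)  # (2, 3)
```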
Learning from mistakes
How machines learn. Loss functions, the chain rule, and visualizing gradient descent in 3D.
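Gradient descent fits in a few lines once the chain rule gives you the derivative of the loss. A minimal sketch, fitting a single weight to noisy data with mean squared error (the data and learning rate are illustrative):

```python
import numpy as np

# Fit y = w * x to noisy data by gradient descent on the mean squared error.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)  # true slope is 3

w, lr = 0.0, 0.1
for step in range(50):
    y_hat = w * x
    loss = np.mean((y_hat - y) ** 2)
    grad = np.mean(2 * (y_hat - y) * x)  # chain rule: dL/dw
    w -= lr * grad                       # step downhill
print(w)  # close to 3.0
```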
From scratch to 97%
Build and train a neural network from scratch using WebGPU compute shaders, entirely in your browser.
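The chapter itself runs on WebGPU compute shaders; as a stand-in, here is the same forward/backward/update cycle sketched in NumPy on XOR, a toy task (network size, learning rate, and step count are illustrative):

```python
import numpy as np

# NumPy stand-in for the WebGPU training loop: a 2-layer net learning XOR.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

for _ in range(2000):
    H = np.maximum(0, X @ W1 + b1)         # forward: hidden layer (ReLU)
    P = 1 / (1 + np.exp(-(H @ W2 + b2)))   # forward: sigmoid output
    dP = (P - Y) / len(X)                  # backward: grad of mean cross-entropy
    dW2, db2 = H.T @ dP, dP.sum(0)
    dH = dP @ W2.T * (H > 0)               # backprop through ReLU
    dW1, db1 = X.T @ dH, dH.sum(0)
    for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        p -= 0.5 * g                       # SGD update
print(P.round(2).ravel())  # should be close to [0, 1, 1, 0]
```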
The token
Turning text into numbers. Vocabulary, the byte-pair encoding (BPE) algorithm, and the discrete nature of language.
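The core of BPE is one repeated step: count adjacent symbol pairs, merge the most frequent one into a new symbol. A minimal sketch on a classic toy corpus (words with frequencies, symbols space-separated):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus."""
    pairs = Counter()
    for word, freq in words.items():
        syms = word.split()
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge(words, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    out = {}
    for word, freq in words.items():
        syms, merged, i = word.split(), [], 0
        while i < len(syms):
            if i + 1 < len(syms) and (syms[i], syms[i + 1]) == pair:
                merged.append(syms[i] + syms[i + 1])
                i += 2
            else:
                merged.append(syms[i])
                i += 1
        out[" ".join(merged)] = freq
    return out

words = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(3):
    pair = most_frequent_pair(words)
    words = merge(words, pair)
    print(pair, "->", list(words))
```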
The embedding
From integers to meaning. The lookup table, vector semantics, and the continuous representation space.
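The entire mechanism is a table indexed by token id: each integer selects one row of a learned matrix. A minimal sketch (the vocabulary size, model width, and token ids are illustrative):

```python
import numpy as np

vocab_size, d_model = 50_000, 768
rng = np.random.default_rng(0)
E = rng.normal(scale=0.02, size=(vocab_size, d_model))  # the lookup table

token_ids = np.array([15496, 11, 995])  # token ids (illustrative)
vectors = E[token_ids]                  # lookup: each id selects one row
print(vectors.shape)                    # (3, 768)
```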
The attention
The mechanism that changed everything. Query, Key, Value matrices, causal masking, and the scaled dot-product.
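All three ingredients fit in one function: project the input into Q, K, V, scale the dot products by the square root of the key dimension, and mask out future positions before the softmax. A minimal single-head sketch (sequence length and width are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(X, Wq, Wk, Wv):
    """Scaled dot-product attention with a causal mask."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (T, T) similarity matrix
    T = scores.shape[0]
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf            # hide future positions
    return softmax(scores) @ V        # weighted sum of values

rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(causal_attention(X, Wq, Wk, Wv).shape)  # (5, 8)
```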
The multi-head attention
Why one perspective isn't enough. Splitting the stream, specialization of heads, and the output projection.
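Splitting the stream is a reshape, not a new mechanism: each head attends over its own slice, and the output projection mixes the concatenated results back together. A minimal sketch (the causal mask is omitted here for brevity; see the previous sketch):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    T, d = X.shape
    d_head = d // n_heads
    # Project, then split the d-dimensional stream into n_heads slices.
    def split(M):
        return (X @ M).reshape(T, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(Wq), split(Wk), split(Wv)        # (heads, T, d_head)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    out = softmax(scores) @ V                        # each head attends alone
    out = out.transpose(1, 0, 2).reshape(T, d)       # concatenate heads back
    return out @ Wo                                  # output projection

rng = np.random.default_rng(0)
T, d, h = 5, 16, 4
X = rng.normal(size=(T, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, h).shape)  # (5, 16)
```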
The residual stream
The superhighway of gradients. Solving the vanishing gradient problem with skip connections.
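A skip connection is a single addition: the sublayer writes an update into the stream rather than replacing it, so the identity path carries gradients straight through. A minimal sketch (the sublayer here is a stand-in linear map):

```python
import numpy as np

def block(x, sublayer):
    """Residual connection: x plus the sublayer's update, never a replacement."""
    return x + sublayer(x)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))
W = rng.normal(scale=0.02, size=(16, 16))
y = block(x, lambda h: h @ W)     # identity path + learned update
print(np.allclose(y - x, x @ W))  # True: the stream carried x through unchanged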
The normalization
Keeping the signal clean. LayerNorm, internal covariate shift, and the importance of statistics.
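LayerNorm is just per-vector statistics: subtract the mean, divide by the standard deviation, then rescale with learned gain and shift. A minimal sketch (the drifting input statistics are illustrative):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each vector to zero mean, unit variance, then rescale."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(4, 16))  # drifting statistics
y = layer_norm(x, gamma=np.ones(16), beta=np.zeros(16))
print(y.mean(axis=-1).round(6), y.std(axis=-1).round(3))  # ~0 and ~1 per row
```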
The feed-forward network
The independent thinker. Expanding dimensions, non-linear activation (GELU), and processing facts.
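The feed-forward network is two linear maps around a GELU: expand the width (4x is the GPT convention), apply the non-linearity, project back. A minimal sketch using the tanh approximation of GELU:

```python
import numpy as np

def gelu(x):
    """tanh approximation of GELU, as used in GPT-2."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    """Expand to 4x the width, apply the non-linearity, project back."""
    return gelu(x @ W1 + b1) @ W2 + b2

d = 16
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d, 4 * d)), np.zeros(4 * d)
W2, b2 = rng.normal(size=(4 * d, d)), np.zeros(d)
print(ffn(rng.normal(size=(5, d)), W1, b1, W2, b2).shape)  # (5, 16)
```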
The model
Putting it all together. Positional encodings, stacking blocks, and the full GPT architecture.
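The full architecture is the data flow that ties every chapter together: embed tokens, add positional encodings, run the stream through a stack of blocks, and unembed into vocabulary scores. A minimal skeleton of that flow (the block internals are stubbed with small random writes; the real attention and FFN appear in the sketches above, and all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d, T, n_layers = 100, 16, 5, 2

E   = rng.normal(scale=0.02, size=(vocab, d))  # token embeddings
pos = rng.normal(scale=0.02, size=(T, d))      # learned positional encodings

def block(x):
    # Stand-in for attention + FFN, each writing into the residual stream.
    x = x + 0.02 * (x @ rng.normal(size=(d, d)))  # "attention" write
    x = x + 0.02 * (x @ rng.normal(size=(d, d)))  # "FFN" write
    return x

ids = rng.integers(0, vocab, size=T)
x = E[ids] + pos            # tokens + positions enter the stream
for _ in range(n_layers):   # stacked transformer blocks
    x = block(x)
logits = x @ E.T            # weight-tied unembedding: scores over the vocab
print(logits.shape)         # (5, 100)
```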