The attention mechanism
The sequential bottleneck
Before attention, the dominant approach to language was the Recurrent Neural Network (RNN). To understand why attention changed everything, we first need to understand the problem it solved.
RNNs process text like a human reading a book: one word at a time, strictly in order. You read "The," update your mental model. Read "cat," update your model. Read "sat," update again.
Step 1: Read "The" → Update hidden_state
Step 2: Read "cat" + hidden_state → Overwrite hidden_state
Step 3: Read "sat" + hidden_state → Overwrite hidden_state
The critical flaw is the overwrite. There is only one "hidden state" - one scratchpad - that must carry the entire history of the document. By the time you reach the 100th word, the information from the first word has been overwritten 99 times. It is a game of "telephone" played by the network against itself.
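To make the overwrite concrete, here is a minimal NumPy sketch of a toy recurrent update. The weights and sizes are arbitrary stand-ins, not a real trained RNN:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy recurrent cell: one shared hidden state, overwritten at every step.
hidden_size, embed_size = 8, 8
W_h = rng.normal(size=(hidden_size, hidden_size)) * 0.5
W_x = rng.normal(size=(hidden_size, embed_size)) * 0.5

tokens = [rng.normal(size=embed_size) for _ in range(100)]  # stand-ins for word embeddings

hidden_state = np.zeros(hidden_size)
for x in tokens:
    # The entire history is squeezed into (and overwritten in) this one vector.
    hidden_state = np.tanh(W_h @ hidden_state + W_x @ x)

# Step 100 cannot be computed before step 99: the loop is inherently serial.
```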
The RNN Bottleneck
This creates two massive problems:
- Amnesia: Information vanishes over long distances. The network forgets the subject of a sentence by the time it reaches the verb.
- Serial dependency: You cannot calculate step 100 until you have calculated step 99. This makes training incredibly slow because you cannot use the massive parallel power of GPUs.
The architectural shift
Attention proposes a radical alternative: parallel processing.
Instead of reading one word at a time, the model takes in the entire sequence at once. "The cat sat" is not processed as three steps, but as a single event.
Every token can look at every other token simultaneously. There is no bottleneck. The scratchpad is gone; instead, every word has a direct line of communication to every other word.
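To make the contrast concrete, here is a sketch (with random stand-in embeddings) showing that every pairwise token interaction can be computed in a single matrix multiplication, with no step-by-step loop:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model = 100, 8
X = rng.normal(size=(seq_len, d_model))   # the whole sequence, taken in at once

# All pairwise interactions in one matrix multiply:
# scores[i, j] is the raw interaction between token i and token j.
scores = X @ X.T                          # shape (100, 100), no sequential loop
```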
This shift from "sequential processing" to "relational processing" is what allows modern LLMs to understand complex context and train on internet-scale data.
The intuition: Selective focus
Consider the sentence:
"The animal didn't cross the street because it was too tired."
When you read the word "it," how do you know what it refers to?
- Is it the street? No, streets don't get tired.
- Is it the animal? Yes.
You solved this by attending to specific past words ("animal") while ignoring others ("street"). You essentially performed a search query: "I am 'it', I am looking for a noun that can get tired."
This is the core intuition of attention. It is a mechanism that allows a word to "search" the rest of the sentence to find the information it needs to resolve its meaning.
The metaphor: Query, Key, Value
To implement this "search," the Transformer uses a concept borrowed from database retrieval: Query, Key, and Value.
Imagine you are in a library.
- Query: The topic you are researching (e.g., "Physics").
- Key: The label on the book spine (e.g., "Physics", "Cooking", "History").
- Value: The content inside the book.
You walk down the aisle comparing your Query to every book's Key. When they match, you take the book and read its Value.
In the Transformer:
- Query ($Q$): What this token is looking for. ("I am 'it', looking for a noun.")
- Key ($K$): What this token identifies as. ("I am 'animal', I am a noun.")
- Value ($V$): The information this token holds. ("I am an animate object.")
When the Query of "it" matches the Key of "animal," the network moves the Value of "animal" into the representation of "it."
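As a toy illustration of that matching step, the sketch below uses hand-picked 4-dimensional vectors (purely hypothetical numbers, not learned values) to show the Query of "it" scoring higher against the Key of "animal" than against the Key of "street":

```python
import numpy as np

# Hypothetical vectors, hand-picked for illustration only.
q_it     = np.array([1.0, 0.0, 1.0, 0.0])   # "I'm looking for an animate noun"
k_animal = np.array([0.9, 0.1, 0.8, 0.0])   # "I am an animate noun"
k_street = np.array([0.0, 1.0, 0.0, 0.9])   # "I am an inanimate place"
v_animal = np.array([0.7, 0.0, 0.6, 0.1])   # information carried by "animal"
v_street = np.array([0.1, 0.8, 0.0, 0.7])   # information carried by "street"

scores  = np.array([q_it @ k_animal, q_it @ k_street])  # [1.7, 0.0]
weights = np.exp(scores) / np.exp(scores).sum()         # softmax: ~[0.85, 0.15]

# "it" is enriched mostly with the Value of "animal".
new_it = weights[0] * v_animal + weights[1] * v_street
```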
The determinism misconception
A quick side note: You might think that because LLMs give different answers every time, this process is random. It is not.
The attention mechanism is 100% deterministic. If you feed the exact same input into the model, it calculates the exact same attention scores and the exact same output probabilities every single time.
The randomness you see in apps like ChatGPT comes from the sampling step after the model has finished running - where we essentially roll a die to pick the next word from the probabilities the model generated. The model itself is a fixed mathematical function.
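A small sketch of that distinction, using made-up probabilities: the distribution itself is fixed, and only the final pick varies:

```python
import numpy as np

rng = np.random.default_rng()

# For a given input, the model's output is a fixed probability distribution.
vocab = ["mat", "chair", "moon"]
probs = np.array([0.7, 0.2, 0.1])   # hypothetical next-word probabilities

deterministic_pick = vocab[int(np.argmax(probs))]   # always "mat"
sampled_pick = rng.choice(vocab, p=probs)           # varies from run to run

# The variety comes from this dice roll, not from the attention computation.
```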
What attention is NOT
It is easy to look at an attention matrix and think: "Ah, the model is looking at 'animal' because it's about to predict 'animal'."
This is incorrect.
Attention does not predict the next word. It is a mechanism for gathering context.
When "it" attends to "animal", the model is effectively saying: "To understand my own meaning right now, I need to pull in information from the word 'animal'." It is enriching the representation of "it" with the properties of "animal" (e.g., that it is alive, mobile, capable of being tired).
The actual prediction of the next word happens much later, at the very end of the network, after many layers of this context-gathering have built a rich, unambiguous representation of the entire situation. Attention is the research phase; prediction is the final report.
Seeing it in action
Below is a live visualization of attention. Type a sentence and click on words to see who they are "looking at."
Notice how "it" focuses on "animal". Notice how verbs focus on their subjects. These relationships are not hard-coded; the model learned them purely by reading billions of pages of text.
The language of similarity
In the previous chapter, we said attention allows a Query to "find" a matching Key. But vectors don't have eyes. How does one list of numbers "find" another?
The answer is the dot product.
Geometrically, the dot product of two vectors tells us how much they point in the same direction.
- If they point in the same direction, the product is a large positive number.
- If they are perpendicular (unrelated), the product is zero.
- If they point in opposite directions, the product is a large negative number.
In high-dimensional space, "pointing in the same direction" means "semantic similarity."
This is the engine of attention. We compute the dot product between the Query of the current token and the Key of every other token. The result is a score: how relevant is that token to me?
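A tiny NumPy example of the three cases, with hand-picked 2-dimensional vectors:

```python
import numpy as np

a = np.array([1.0, 2.0])

same     = np.dot(a, np.array([ 2.0,  4.0]))   #  10.0 -> same direction
perpend  = np.dot(a, np.array([-2.0,  1.0]))   #   0.0 -> unrelated
opposite = np.dot(a, np.array([-1.0, -2.0]))   #  -5.0 -> opposite direction
```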
The formula
The entire attention mechanism can be summarized in one elegant equation:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Where:
- $Q, K, V$: The Query, Key, and Value matrices.
- $^T$: Matrix transpose (swapping rows and columns).
- $d_k$: The dimension of the Key vectors (e.g., 64). Used for scaling.
Let's break it down piece by piece.
- $QK^T$ (The Search): We multiply the Query matrix by the transposed Key matrix. This computes the dot product for every pair of words at once. The result is a grid of raw similarity scores.
- $\sqrt{d_k}$ (The Scaling): We divide by the square root of the dimension size. This prevents the scores from getting too huge (which would kill the gradients during training).
- Softmax (The Selection): We convert the raw scores into probabilities. This forces the model to prioritize: it can't pay 100% attention to everything. It has to choose what matters most.
- $\times V$ (The Retrieval): Finally, we use these probabilities to take a weighted average of the Value vectors (a code sketch of the full computation follows below).
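Putting the four pieces together, here is a minimal NumPy sketch of scaled dot-product attention under those definitions - a toy implementation for illustration, not production code:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # the search: all pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)      # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # the selection: each row sums to 1
    return weights @ V, weights                       # the retrieval: weighted average of Values

# Toy usage with random matrices: 3 tokens, d_k = 4.
rng = np.random.default_rng(0)
Q, K, V = [rng.normal(size=(3, 4)) for _ in range(3)]
output, attn_matrix = attention(Q, K, V)   # attn_matrix is the 3x3 attention matrix
```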
Step-by-step walkthrough
Let's trace the numbers. Imagine we have a tiny embedding dimension of 4. Watch how the Query for "sat" finds the Key for "cat."
1. Project Q, K, V
Each token is projected into Query, Key, and Value vectors.
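A minimal sketch of this projection step, assuming a 3-token sequence, an embedding dimension of 4, and random stand-in weight matrices (in a real model, $W_Q$, $W_K$, $W_V$ are learned during training):

```python
import numpy as np

rng = np.random.default_rng(42)

d_model, d_k = 4, 4
X = rng.normal(size=(3, d_model))     # embeddings for "The", "cat", "sat" (illustrative)

# Projection matrices - random here, learned in a real model.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v   # each token gets its own Query, Key, Value

# e.g. the Query for "sat" can now be compared against the Key for "cat":
score_sat_cat = Q[2] @ K[1]
```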
The attention matrix
When we compute this for every word against every other word, we get an attention matrix. This is the "brain scan" of the model. It shows us exactly how information flows between words.
The softmax bottleneck
The softmax function is crucial. It converts raw scores (which could be anything, like 12.5 or -9.1) into a clean distribution that sums to 1.0.
$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$
Where:
- $x_i$: The raw score (logit) for the current option we are evaluating.
- $\sum_j e^{x_j}$: The sum of the exponentials of all possible options (the normalization term).
Why does this matter? It creates competition.
If token A really wants to attend to token B, it must steal attention mass from tokens C and D. This forces the model to make hard decisions about relevance. It filters out the noise (low scores become near-zero) and amplifies the signal (high scores become near-one).
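A quick sketch of that competition, using made-up raw scores:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract the max for numerical stability
    return e / e.sum()

raw_scores = np.array([12.5, 2.0, -9.1])      # raw relevance of tokens B, C, D
print(softmax(raw_scores))                     # ~[0.99997, 0.00003, 0.0]

# Raising one score pulls probability mass away from the others:
print(softmax(np.array([12.5, 11.0, -9.1])))   # ~[0.82, 0.18, 0.0]
```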
The causal mask
There is one final detail. If we are training the model to predict the next word, we must stop it from "cheating."
When the model processes the word "The," it shouldn't be able to see "cat," because "cat" hasn't happened yet. But the attention mechanism naturally connects everything to everything.
To fix this, we apply a causal mask. We manually set the attention scores for all future tokens to negative infinity ($-\infty$).
$$M_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}$$
Where:
- $M_{ij}$: The value added to the attention score between token $i$ (current) and token $j$ (other).
- $0$: "Go ahead, attend." Adding zero changes nothing.
- $-\infty$: "Forbidden." Adding negative infinity makes the softmax probability zero.
When softmax sees $-\infty$, it turns it into exactly 0. The future becomes invisible.
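A minimal sketch of building and applying such a mask with NumPy, for a 4-token toy example:

```python
import numpy as np

seq_len = 4
# Upper-triangular mask: 0 where attention is allowed, -inf for future tokens.
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

scores = np.ones((seq_len, seq_len))   # pretend raw attention scores
masked = scores + mask

# Row-wise softmax: future positions get exactly zero probability.
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))
# Row 0 attends only to token 0; row 3 attends to tokens 0-3.
```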
This mask is what makes the Transformer "autoregressive" - capable of generating text one word at a time, just like a human speaking.