The token
The fundamental problem
Neural networks are function approximators built on linear algebra. They take numbers as input, perform matrix multiplications, and produce numbers as output. Text, however, is not a number. The word "hello" is a sequence of symbols with no inherent numerical value.
We need a reversible mapping that converts text into numbers in a way that preserves enough information for the network to learn patterns. This mapping starts with tokenization - breaking text into atomic units.
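To make "reversible mapping" concrete, here is the crudest possible version in Python: use each character's Unicode code point as its number. Real tokenizers do something smarter, but the round trip is the same idea.

```python
def encode(text: str) -> list[int]:
    # Map each character to its Unicode code point.
    return [ord(ch) for ch in text]

def decode(ids: list[int]) -> str:
    # Map each code point back to its character.
    return "".join(chr(i) for i in ids)

ids = encode("hello")          # [104, 101, 108, 108, 111]
assert decode(ids) == "hello"  # nothing is lost in the round trip
```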
The vocabulary
Before we can convert text to numbers, we need to decide what our atomic units are. Should we split text into characters? Words? Something in between?
Character-level tokenization treats each character as a token. The word "hello" becomes ['h', 'e', 'l', 'l', 'o']. This gives us a tiny vocabulary (26 letters + punctuation), but creates very long sequences. The model has to learn that ['c', 'a', 't'] means something distinct from ['d', 'o', 'g'] purely from raw patterns.
Word-level tokenization splits on spaces. "Hello world" becomes ['Hello', 'world']. This preserves meaning, but the vocabulary explodes. English has hundreds of thousands of words. If you encounter a word not in your dictionary (like "unfriendliness"), you are stuck.
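A quick comparison in plain Python (built-ins only, no tokenizer library) shows the trade-off: character-level yields long sequences from a tiny alphabet, while word-level yields short sequences but has no answer for words outside its dictionary.

```python
sentence = "Hello world"

# Character-level: tiny vocabulary, long sequences.
char_tokens = list(sentence)
# ['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']

# Word-level: short sequences, but the vocabulary must contain every word.
word_tokens = sentence.split()
# ['Hello', 'world']

# With a fixed word list, anything unseen is out-of-vocabulary.
dictionary = {"Hello", "world"}
unknown = [w for w in "such unfriendliness".split() if w not in dictionary]
# ['such', 'unfriendliness']
```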
Subword tokenization is the compromise used by almost all modern LLMs. It breaks words into meaningful chunks. "Unfriendliness" might become ['Un', 'friend', 'li', 'ness'].
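To see real subword splits, you can run a string through an off-the-shelf BPE tokenizer. The sketch below assumes the tiktoken package is installed and uses its GPT-2 encoding; the exact splits depend on that vocabulary, so treat the output as illustrative.

```python
import tiktoken  # assumed installed: pip install tiktoken

enc = tiktoken.get_encoding("gpt2")

ids = enc.encode("Unfriendliness")
pieces = [enc.decode([i]) for i in ids]
print(ids)     # a handful of integer token IDs
print(pieces)  # the subword chunks those IDs correspond to

# The mapping is reversible: decoding the IDs reproduces the string.
assert enc.decode(ids) == "Unfriendliness"
```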
Try it yourself below. Switch between modes to see how the sequence length and token IDs change.
Notice how "subword" mode strikes a balance. Common words like "The" stay whole, while longer or rarer words get split.
The BPE algorithm
The most popular subword method is Byte Pair Encoding (BPE). It builds a vocabulary not by linguistic rules, but by statistics.
It starts with characters and iteratively merges the most frequent adjacent pairs.
- Start: The vocabulary is just the individual characters: a, b, c, ...
- Count: Look at your massive training dataset. Which pair appears most often? Maybe it is "e" followed by "s".
- Merge: Create a new token "es" and add it to the vocabulary.
- Repeat: Now count again. Maybe "t" + "h" is next. Merge to "th".
Eventually, you might merge "th" + "e" to get "the". The algorithm stops when you reach a target vocabulary size (e.g., 50,000 tokens).
BPE Algorithm Step-by-Step
This ensures that the most common words in your language become single tokens (efficient), while rare words are built from smaller pieces (robust).
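To make the loop concrete, here is a minimal sketch of the merge-learning step on a toy corpus. It illustrates the idea rather than a production tokenizer: real implementations work on bytes, apply pre-tokenization rules, and are heavily optimized.

```python
from collections import Counter

def learn_bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a list of words (toy version)."""
    # Every word starts out as a sequence of single characters.
    words = [list(word) for word in corpus]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols across the corpus.
        pairs = Counter()
        for symbols in words:
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the pair with the new merged symbol.
        for i, symbols in enumerate(words):
            out, j = [], 0
            while j < len(symbols):
                if j + 1 < len(symbols) and (symbols[j], symbols[j + 1]) == best:
                    out.append(merged)
                    j += 2
                else:
                    out.append(symbols[j])
                    j += 1
            words[i] = out
    return merges

# Tiny corpus: "th" and then "the" emerge as the first merges.
corpus = ["the", "then", "they", "there", "these", "cat"]
print(learn_bpe_merges(corpus, num_merges=3))
# [('t', 'h'), ('th', 'e'), ...] -- later merges depend on the tie-breaking
```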
Special tokens
Vocabularies include special tokens that serve structural purposes:
- <BOS> (Beginning Of Sequence): Marks the start.
- <EOS> (End Of Sequence): Signals the model to stop generating.
- <PAD> (Padding): Used to make batches of sequences the same length.
- <UNK> (Unknown): A fallback for characters never seen during training (rare in modern BPE).
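Here is a minimal sketch of how these markers get used in practice (the ID values are invented; real vocabularies reserve their own): wrap each sequence in <BOS>/<EOS>, then pad the shorter sequences in a batch so every row has the same length.

```python
# Hypothetical IDs for the special tokens -- real tokenizers reserve their own.
BOS, EOS, PAD = 1, 2, 0

def wrap(ids: list[int]) -> list[int]:
    """Mark the start and end of a sequence."""
    return [BOS] + ids + [EOS]

def pad_batch(batch: list[list[int]]) -> list[list[int]]:
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [PAD] * (max_len - len(seq)) for seq in batch]

batch = pad_batch([wrap([104, 2599, 88]), wrap([57, 3001])])
# [[1, 104, 2599, 88, 2], [1, 57, 3001, 2, 0]]
```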
Once we have our list of token IDs (e.g., [104, 2599, 88]), we are halfway there. We have integers. But integers are discrete - token 100 isn't "closer" to token 101 than to token 999.
To do math, we need to move from discrete integers to continuous vectors. That is the subject of the next chapter.