The token
The fundamental problem
Neural networks are function approximators built on linear algebra. They take numbers as input, perform matrix multiplications, and produce numbers as output. Text, however, is not a number. The word "hello" is a sequence of symbols with no inherent numerical value.
We need a reversible mapping that converts text into numbers in a way that preserves enough information for the network to learn patterns. This mapping starts with tokenization - breaking text into atomic units.
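To make "reversible mapping" concrete, here is the crudest possible version in Python: use each character's Unicode code point as its number. Real tokenizers do something smarter, but the round trip is the same idea.

```python
def encode(text: str) -> list[int]:
    # Map each character to its Unicode code point.
    return [ord(ch) for ch in text]

def decode(ids: list[int]) -> str:
    # Map each code point back to its character.
    return "".join(chr(i) for i in ids)

ids = encode("hello")          # [104, 101, 108, 108, 111]
assert decode(ids) == "hello"  # nothing is lost in the round trip
```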
The vocabulary
Before we can convert text to numbers, we need to decide what our atomic units are. Should we split text into characters? Words? Something in between?
Character-level tokenization treats each character as a token. The word "hello" becomes ['h', 'e', 'l', 'l', 'o']. This gives us a tiny vocabulary (26 letters + punctuation), but creates very long sequences. The model has to learn that ['c', 'a', 't'] means something distinct from ['d', 'o', 'g'] purely from raw patterns.
Word-level tokenization splits on spaces. "Hello world" becomes ['Hello', 'world']. This preserves meaning, but the vocabulary explodes. English has hundreds of thousands of words. If you encounter a word not in your dictionary (like "unfriendliness"), you are stuck.
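A quick comparison in plain Python (built-ins only, no tokenizer library) shows the trade-off: character-level yields long sequences from a tiny alphabet, while word-level yields short sequences but has no answer for words outside its dictionary.

```python
sentence = "Hello world"

# Character-level: tiny vocabulary, long sequences.
char_tokens = list(sentence)
# ['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']

# Word-level: short sequences, but the vocabulary must contain every word.
word_tokens = sentence.split()
# ['Hello', 'world']

# With a fixed word list, anything unseen is out-of-vocabulary.
dictionary = {"Hello", "world"}
unknown = [w for w in "such unfriendliness".split() if w not in dictionary]
# ['such', 'unfriendliness']
```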
Subword tokenization is the compromise used by almost all modern LLMs. It breaks words into meaningful chunks. "Unfriendliness" might become ['Un', 'friend', 'li', 'ness'].
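To see real subword splits, you can run a string through an off-the-shelf BPE tokenizer. The sketch below assumes the tiktoken package is installed and uses its GPT-2 encoding; the exact splits depend on that vocabulary, so treat the output as illustrative.

```python
import tiktoken  # assumed installed: pip install tiktoken

enc = tiktoken.get_encoding("gpt2")

ids = enc.encode("Unfriendliness")
pieces = [enc.decode([i]) for i in ids]
print(ids)     # a handful of integer token IDs
print(pieces)  # the subword chunks those IDs correspond to

# The mapping is reversible: decoding the IDs reproduces the string.
assert enc.decode(ids) == "Unfriendliness"
```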
Try it yourself below. Switch between modes to see how the sequence length and token IDs change.
Notice how "subword" mode strikes a balance. Common words like "The" stay whole, while longer or rarer words get split.
The BPE algorithm
The most popular subword method is Byte Pair Encoding (BPE). It builds a vocabulary not by linguistic rules, but by statistics.
It starts with characters and iteratively merges the most frequent adjacent pairs.
- Start: The vocabulary is just the individual characters: a, b, c, ...
- Count: Look at your massive training dataset. Which pair appears most often? Maybe it is "e" followed by "s".
- Merge: Create a new token "es" and add it to the vocabulary.
- Repeat: Now count again. Maybe "t" + "h" is next. Merge to "th".
Eventually, you might merge "th" + "e" to get "the". The algorithm stops when you reach a target vocabulary size (e.g., 50,000 tokens).
BPE Algorithm Step-by-Step
This ensures that the most common words in your language become single tokens (efficient), while rare words are built from smaller pieces (robust).
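To make the loop concrete, here is a minimal sketch of the merge-learning step on a toy corpus. It illustrates the idea rather than a production tokenizer: real implementations work on bytes, apply pre-tokenization rules, and are heavily optimized.

```python
from collections import Counter

def learn_bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a list of words (toy version)."""
    # Every word starts out as a sequence of single characters.
    words = [list(word) for word in corpus]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols across the corpus.
        pairs = Counter()
        for symbols in words:
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the pair with the new merged symbol.
        for i, symbols in enumerate(words):
            out, j = [], 0
            while j < len(symbols):
                if j + 1 < len(symbols) and (symbols[j], symbols[j + 1]) == best:
                    out.append(merged)
                    j += 2
                else:
                    out.append(symbols[j])
                    j += 1
            words[i] = out
    return merges

# Tiny corpus: "th" and then "the" emerge as the first merges.
corpus = ["the", "then", "they", "there", "these", "cat"]
print(learn_bpe_merges(corpus, num_merges=3))
# [('t', 'h'), ('th', 'e'), ...] -- later merges depend on the tie-breaking
```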
Special tokens
Vocabularies include special tokens that serve structural purposes:
- <BOS> (Beginning Of Sequence): Marks the start.
- <EOS> (End Of Sequence): Signals the model to stop generating.
- <PAD> (Padding): Used to make batches of sequences the same length.
- <UNK> (Unknown): A fallback for characters never seen during training (rare in modern BPE).
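Here is a minimal sketch of how these markers get used in practice (the ID values are invented; real vocabularies reserve their own): wrap each sequence in <BOS>/<EOS>, then pad the shorter sequences in a batch so every row has the same length.

```python
# Hypothetical IDs for the special tokens -- real tokenizers reserve their own.
BOS, EOS, PAD = 1, 2, 0

def wrap(ids: list[int]) -> list[int]:
    """Mark the start and end of a sequence."""
    return [BOS] + ids + [EOS]

def pad_batch(batch: list[list[int]]) -> list[list[int]]:
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [PAD] * (max_len - len(seq)) for seq in batch]

batch = pad_batch([wrap([104, 2599, 88]), wrap([57, 3001])])
# [[1, 104, 2599, 88, 2], [1, 57, 3001, 2, 0]]
```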
Once we have our list of token IDs (e.g., [104, 2599, 88]), we are halfway there. We have integers. But integers are discrete - token 100 isn't "closer" to token 101 than to token 999.
To do math, we need to move from discrete integers to continuous vectors. That is the subject of the next chapter.