02 Tokenization
Lecture 01 of CS336 Language Modeling from Scratch at Stanford.
Before a language model can process text, the text must be converted into numbers. This conversion is handled by tokenization: the process of splitting raw text into discrete units called tokens and mapping each token to an integer ID. This note covers three questions:
- Why tokenization is necessary,
- What design choices are available, and
- Why Byte Pair Encoding (BPE) has become the dominant approach in modern LLMs.
Why We Need Tokenization
Language models are built on mathematical operations: matrix multiplications, dot products, and nonlinear activation functions, all of which operate on numerical data. Human language, however, usually arrives as text: sequences of characters or words that carry meaning but are not inherently numeric. To make language usable for these models, the first step in any pipeline is to convert text into a numerical representation.
The question then becomes: how should we convert raw text into numbers?
The most straightforward idea is to assign a unique number to each word. But this approach runs into several fundamental issues:
- Vocabulary explosion. Natural language contains an enormous number of distinct word forms. Even in English, accounting for proper nouns, domain-specific terms, and morphological variants (e.g., run, runs, running, runner), a word-level vocabulary can easily reach hundreds of thousands of entries, or more.
- Out-of-vocabulary (OOV) words. Any word not seen when the vocabulary was built, such as a new word, a rare word, or even a simple typo, has no assigned ID and cannot be handled at inference time.
- Loss of morphological structure. Words like happy and unhappy share a root and are semantically related, but a word-level scheme treats them as completely unrelated entries.
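The out-of-vocabulary and morphology problems can be seen in a minimal sketch of a word-level scheme. The training sentence, the `encode_words` helper, and the choice of `None` as a marker for unknown words are all assumptions made for illustration, not part of the lecture.

```python
# Toy word-level scheme: every distinct whitespace-separated word gets its own ID.
train_text = "the runner runs and the runner is happy"

# Build the vocabulary from the training text only.
vocab = {}
for word in train_text.split():
    if word not in vocab:
        vocab[word] = len(vocab)


def encode_words(text, vocab):
    """Map each word to its ID; words missing from the vocabulary map to None."""
    return [vocab.get(word) for word in text.split()]


seen = encode_words("the runner is happy", vocab)
unseen = encode_words("the runner is unhappy", vocab)

print(vocab)   # six entries, one per distinct training word
print(seen)    # every word has an ID
print(unseen)  # "unhappy" has no ID, even though "happy" does
```

Note that `runner` and `runs` receive unrelated IDs, and `unhappy` cannot be encoded at all despite sharing a root with `happy`.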
These issues suggest that representing language at the word level is too rigid and brittle. Language is fluid—new words emerge, meanings shift, and structure lies beneath the surface. What we need is a representation that can adapt, one that can handle unseen words, capture shared structure, and remain computationally efficient.
So instead of simply assigning integers to whole words, we step back and ask a more fundamental question: what should the basic units of text be?
That question leads us to the idea of a token. A token is not limited to being a full word. It can be a word, a fragment of a word, or even a single character. It’s simply a chunk of text chosen to serve as the atomic unit for the model.
Tokenization is then the process of splitting raw text into tokens and mapping each token to a unique integer ID. Because tokens are flexible units rather than fixed words, tokenization lets language models cover a large range of word forms with a bounded vocabulary, generalize to unseen words, and capture meaningful structure in language.
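As a rough illustration of how the choice of unit changes the token sequence, the sketch below splits one string at several granularities. The sample text and the subword split are hand-picked assumptions; a real subword tokenizer learns its splits from data.

```python
text = "unhappy runner"

# Word-level: split on whitespace.
word_tokens = text.split()

# Character-level: one token per Unicode character.
char_tokens = list(text)

# Byte-level: one token per UTF-8 byte (integers in the range 0..255).
byte_tokens = list(text.encode("utf-8"))

# Subword-level: a hand-chosen split, for illustration only.
subword_tokens = ["un", "happy", " run", "ner"]

for name, tokens in [
    ("word", word_tokens),
    ("subword", subword_tokens),
    ("char", char_tokens),
    ("byte", byte_tokens),
]:
    print(f"{name:8s} {len(tokens):3d} tokens: {tokens}")
```

Coarser units give shorter sequences but need larger vocabularies; finer units do the reverse.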
To make it more concrete, consider a simple example. The raw text is represented as a Unicode string:
string = "Hello, 🌍! 你好!"
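A language model, however, operates over sequences of token IDs. The IDs below are illustrative; the actual sequence for a given string depends on the tokenizer and its vocabulary: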
indices = [15496, 11, 995, 0]
A tokenizer is the component that bridges these two representations. It provides:
- an encoding process (text $\rightarrow$ token IDs), and
- a decoding process (token IDs $\rightarrow$ text).
The total number of possible token IDs is called the vocabulary size, which determines how many distinct tokens the model can represent.
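The encode/decode interface and the notion of vocabulary size can be sketched with a toy character-level tokenizer. The `CharTokenizer` class and its attribute names are hypothetical, chosen only for this example; the tokenizers used in practice are more elaborate.

```python
class CharTokenizer:
    """Toy tokenizer: each distinct character in the corpus is one token."""

    def __init__(self, corpus: str):
        # Sort the characters so the ID assignment is deterministic.
        chars = sorted(set(corpus))
        self.token_to_id = {ch: i for i, ch in enumerate(chars)}
        self.id_to_token = {i: ch for ch, i in self.token_to_id.items()}

    @property
    def vocab_size(self) -> int:
        return len(self.token_to_id)

    def encode(self, text: str) -> list[int]:
        # Raises KeyError for characters that were not in the corpus.
        return [self.token_to_id[ch] for ch in text]

    def decode(self, indices: list[int]) -> str:
        return "".join(self.id_to_token[i] for i in indices)


string = "Hello, 🌍! 你好!"
tokenizer = CharTokenizer(string)

indices = tokenizer.encode(string)
reconstructed = tokenizer.decode(indices)

print(indices)
print(reconstructed == string)   # encoding then decoding recovers the text
print(tokenizer.vocab_size)      # number of distinct characters in the corpus
```

A useful sanity check for any tokenizer is this round trip: decoding the encoded IDs should return the original string.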
Next, we’ll explore different tokenization strategies and how they define tokens.
(TBC)