Tokenization & Embeddings

by dokuDoku Transformers karpathy/minbpe

🔍

## What are Tokenization and Word Embeddings?

For a Large Language Model (LLM), whenever we input a piece of text, before we put it into the model, we have to first translate the raw human input text into embedding vectors that the algorithm can understand.

This is a two step process:
1. Tokenize text input
2. Convert tokens to vector embeddings

At the end, the transformer produces hidden states. A linear output projection maps each hidden state to logits over the vocabulary. A decoding rule then selects the token IDs, and the tokenizer decodes those token IDs back into text. The output layer is often called an unembedding matrix, and in some models its weights are tied to the input embedding matrix.

## Tokenization

Tokenization is not merely just preprocessing, it defines the atoms of what the LLM actually sees. Text is not directly fed into the transformer, it's first converted into integer token ID's which then are later looked up in an embeddings table. Many strange LLM behaviors, arithmetic, whitespace, non-English text, code indentation and odd tokens are attributable from tokenizer design, rather than the neural network itself.

The following algorithms are used for tokenization, with **BPE** (Byte Pair Encoding) being the most popular: 
- **BPE: Byte Pair Encoding**
    - Frequency based merging of most common adjacent pairs 
    - Used in: GPT-2/3/4/4o/5, RoBERTa, Llama 3/4, Gemma 2, Qwen2, most newer models
    - [Tiktoken Library](https://github.com/openai/tiktoken) is OpenAI's refined BPE implementation
- WordPiece
    - Similar to BPE, but chooses merges based on likelihood instead of raw frequency 
    - During inference, WordPiece tokenization performed greedily choosing the longest vocabulary item matching current prefix
    - Used in: BERT, DistilBERT, Electra
    - Original algorithm used in BERT paper, somewhat common although legacy
- Unigram (Language Model) 
    - Probabilistic approach, starts with large vocabulary & prunes it by removing tokens that hurt overall likelihood the least 
    - Used in many SentencePiece models: Llama 2, Mistral 1, T5, ALBERT, and some Gemma variants
    - More statistically optimal than BPE, but slightly more complex

Of note, [SentencePiece](https://github.com/google/sentencepiece) is a popular open-source library/framework that was created by Google. It supports BPE and unigram language model tokenization, and is notable for being trainable directlyfrom raw text. Note that SentencePiece + Unigram switched to BPE + tiktoken-style tokenizer (in Llama 3) for better compatibility & performance.

Note that Byte-level BPE, is the dominant algorithm for new English-centeric or code heavy models because 
- Guaranteed 100% coverage of any text 
- Efficient 
- Plays nicely with regex pre-tokenization tricks

### BPE (Byte Pair Encoding) Algorithm

The BPE algorithm is a subword tokenization algorithm that automatically builds a vocabulary by repeatedly merging the most frequent pairs of characters (or bytes).

The core idea is to start with the smallest possible atomic units, which are individual bytes (256 possible UTF-8 bytes). Then we repeatedly find the most common pair of adjacent tokens and merge them into a single new token. We repeat merging until a desired vocabulary size is reached such as 50k tokens. The result is that common words or word pieces become single tokens while rare words are still broken into smaller known pieces. In byte-level BPE, there are fundamentally no unknown text strings because every UTF-8 string can be represented as a sequence of bytes, whether or not the word appeared in the tokenizer training corpus.

This algorithm has been found to give a good balance between short sequences and full coverage of any text.

The pseudo-code describing this algorithm is:

$\begin{array}{rl}
1. & \text{Initialize vocabulary}\ V \leftarrow \{\text{all 256 UTF-8 bytes}\} \\
2. & \text{Convert entire training corpus to list of token IDs (initially bytes)} \\
3. & \text{While}\ |V| < \text{desired vocabulary size}: \\
   & \quad a.\ \text{Count frequency of every adjacent pair}\ (t_1, t_2)\ \text{in the corpus} \\
   & \quad b.\ \text{Find the most frequent pair:}\ (t_1^*, t_2^*) = \arg\max\ \text{frequency} \\
   & \quad c.\ \text{Create new token}\ t_{\text{new}} = t_1^* + t_2^* \\
   & \quad d.\ \text{Add}\ t_{\text{new}}\ \text{to}\ V \\
   & \quad e.\ \text{Record the merge rule:}\ (t_1^*, t_2^*) \to t_{\text{new}} \\
   & \quad f.\ \text{Replace every occurrence of}\ (t_1^*, t_2^*)\ \text{with}\ t_{\text{new}}\ \text{in the corpus}
\end{array}$

To see a code implementation of this see the repository [minbpe](https://github.com/karpathy/minbpe) by Andrej Karpathy, which is quite good.

### Tokenization Remarks

Note that often users will misspell or just type things incorrectly into the LLM input. Thus, before tokenization we typically do the following: 
1. Normalization 
2. Pre-tokenization

**Normalization** standardizes the raw text so that equivalent strings are represented identically. This removes tiny inconsistencies that would create extra tokens or break decoding. Some typical operations include unicode normalization, whitespace normalization, lowercasing, accent stripping and other rules. The goal simply is to make the text clean and consistent without losing meaning.

**Pre-tokenization** splits the text into chunks (pre-tokens) using simple rules or a regular expression. The actual BPE or Unigram algorithm then runs only within these chunks and not across chunk boundaries. This is important because without pre-tokenization, BPE would merge across word boundaries, punctuation, or numbers. The most common way that this is implemented is through a **regex pattern** that separates 
- words/letters
- numbers 
- punctuation & symbols
- whitespace
- contractions

Note that GPT-2/4 use a very specific long regex (as can be seen in the [minbpe](https://github.com/karpathy/minbpe) repository). Llama 3 & newer models use similar, but refined regex patterns.

Many tokenizers use regex- or rule-based pre-tokenization to restrict where subword merges can occur. However, this is a design choice rather than a universal requirement; SentencePiece, for example, can be trained directly from raw text.

## Embedding Tokens

After tokenization, we have a sequence of integer token IDs, one for each token. We then feed these tokens into an embedding matrix $E \in \mathbb{R}^{V \times d}$ where $V$ is the Vocabulary size and $d$ is the output dimension size. Note that the convention here is that $E$ is a **row major** matrix. To get a sense of these sizes, for GPT-2 $V = 50,257$ and $d = 768$. Each **row** of $E$ is the vector representation of that specific token.

$$ \text{token ID } i \quad \xrightarrow{\text{lookup}} \quad \mathbf{e}_i = E[i,:] \quad \in \mathbb{R}^{d} $$

Or in matrix form

$$\text{embeddings} = E[\text{token\_ids}] \quad \text{(PyTorch/TensorFlow indexing)}$$

where if you think of each token ID as a one-hot vector $x_i \in \mathbb{R}^V$, we have 
$$\mathbf{e}_i = \mathbf{x}_i^\top E$$

### How the Embedding Matrix is Trained

The embedding matrix $E$ isn't pretrained separately. It's a fully learnable parameter of the larger LLM model and is trained **end-to-end** together with every other weight in the transformer. Typically, the training objective for next token prediction is the *Cross Entropy Loss*:

$$ L = -\frac{1}{N} \sum_{t=1}^N log\ P(token_{t+1} | tokens_{1...t}) $$

During backpropagation, the loss gradient flows all the way back to the embedding matrix. Note that $\frac{\partial L}{\partial E[i,:]}$ is non-zero only for the token IDs that actually appeared in the batch if the input embedding matrix is not tied to the output head. In this case every time a particular token appears in the training data, its row in $E$ gets a tiny update that makes the model's embeddings better. However, if the embedding matrix is tied to the output projection, or when considering the output softmax layer, gradients can involve many vocabulary rows.

After trillions of tokens, the rows of $E$ move around in the $d$ dimensional space until the entire model (embeddings + attention + feed forward layers) become good at next-token prediction.

### What do these vectors look like?

To investigate what these vectors look like, we are going to discuss the paper [word2vec](https://arxiv.org/abs/1301.3781) & its followup [Distributed Representations of Words and Phrases
and their Compositionality](https://arxiv.org/pdf/1310.4546). Note that [GloVe](https://nlp.stanford.edu/pubs/glove.pdf) is also highly related too.

**NOTE**: Word2vec embeddings are static word-level embeddings. Transformer input embeddings are also static lookup vectors, but the representations used by the model after attention layers are contextual and depend on the entire sequence.

Word2vec learns a dense vector for each word by solving a self-supervised prediction task over raw text. Instead of manually labeling data, the model creates training examples from local windows of text. The two scenarios are either you predict the middle word from its surrounding words or you predict neighboring words from the middle word.

The key assumption is that **a word's meaning can be learned from the contexts it appears in**. The model isn't trained to know that *king* and *queen* are related, it learns vectors that make these relationships emerge since these words appear in structured, overlapping contexts. The paper famously show that we can use vector arithmetic to capture analogies such as: 
$$\text{king - man + woman = queen}$$

and country capital relations.

### Skip-Gram Architecture

The main architecture for the word2vec paper that we're going to study is the *Skip-Gram Architecture*. Note that although word2vec only trained the "embedding" layer, we can still gain insights from this paper into how embeddings typically work in larger LLM's.

The Skip-Gram architecture is a shallow neural network that takes a *center word* as input, and predicts its *nearby context words* as output as seen in **Figure 1**.

<img src="/media/images/20260512_195455_b0bdcc04-2e12-44b6-8cf9-e2e1f085c5a3.png" alt="b0bdcc04 2e12 44b6 8cf9 e2e1f085c5a3" style="max-width: min(700px,100%); height: auto;">
<center>
$\textbf{Figure 1}$: Skip-Gram General Architecture
</center>

As we can see in the **interactive pyapplet**, this produces results where given a vector, we can do vector arithmetic and measure the angles between vectors for similarity. For this reason, typically vector embeddings are measured using the dot product to measure how similar two vectors are in magnitude and their orientation angle. Note that if we want to compare directions instead of scale we often use cosine similarity or normalize the vectors to have the same magnitude.

Recall the definition of dot product measures the angle $\theta$ between two vectors $\mathbf{a},\mathbf{b}$:
$$\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \|\mathbf{b}\| \cos \theta$$