The Inner Mechanism of Byte Pair Encoding
Juan Vera
January 2025
Abstract
Byte Pair Encoding (BPE) is a subword tokenization technique that splits words into smaller, more frequent subunits, reducing vocabulary size while still being able to represent a wide variety of words. BPE was used for tokenization in models such as GPT-1 and GPT-2.
Mechanism
We start with a set of characters (or bytes): every character is initially treated as a token,
where each letter in "Hello" is converted to a single numerical index, its "token", as in the sketch below.
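As a minimal sketch (the index values here are arbitrary, not the byte values a real byte-level implementation such as GPT-2's would start from), the initial tokenization of "Hello" might look like this:

# Minimal sketch: map each character of a word to an initial token id.
# The ids are arbitrary; real byte-level BPE would start from raw byte values.
word = "Hello"
vocab = {ch: i for i, ch in enumerate(sorted(set(word)))}
tokens = [vocab[ch] for ch in word]
print(tokens)  # [0, 1, 2, 2, 3] -- one id per character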
Assume we have another word, "Helios", tokenized the same way.
For "Hello", we count how often pairs of symbols appear in the text.
For the word Hello and Helios, "HE" appears twice and "HEL" appears twice.
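A minimal sketch of that counting step, assuming the toy corpus is just these two words:

from collections import Counter

# Count adjacent symbol pairs over a toy two-word corpus.
corpus = [list("Hello"), list("Helios")]
pair_counts = Counter()
for word in corpus:
    for left, right in zip(word, word[1:]):
        pair_counts[(left, right)] += 1

print(pair_counts.most_common(3))
# ('H', 'e') and ('e', 'l') each appear twice; every other pair appears once.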
We find the most frequent pair of characters, or subword, and merge it into a new token. In this case, since both appear twice, we break the tie with a pseudorandom choice.
Assume we choose "Hel": we merge the individual tokens into a single new token,
and we reuse that token in the original tokenized forms of the words "Hello" and "Helios".
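A sketch of applying merges across the toy corpus; here the pair ("H", "e") is merged first, and "Hel" is reached by a second merge of the new "He" token with "l" (the merge_pair helper is illustrative, not part of any particular library):

def merge_pair(word, pair):
    # Replace every adjacent occurrence of `pair` in `word` with one merged symbol.
    merged, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            merged.append(word[i] + word[i + 1])
            i += 2
        else:
            merged.append(word[i])
            i += 1
    return merged

corpus = [list("Hello"), list("Helios")]
corpus = [merge_pair(w, ("H", "e")) for w in corpus]   # [['He', 'l', 'l', 'o'], ['He', 'l', 'i', 'o', 's']]
corpus = [merge_pair(w, ("He", "l")) for w in corpus]  # [['Hel', 'l', 'o'], ['Hel', 'i', 'o', 's']]
print(corpus)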
This process is repeated, in practice over a much larger set of words, until we reach a maximum number of merges or a desired vocabulary size (predefined as a hyperparameter).
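Putting the steps together, here is a compact end-to-end sketch of that training loop, assuming the same toy corpus and a number of merges fixed in advance (a real implementation would also handle word frequencies, explicit tie-breaking rules, and a byte-level base vocabulary):

from collections import Counter

def merge_pair(word, pair):
    # Replace every adjacent occurrence of `pair` in `word` with one merged symbol.
    merged, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            merged.append(word[i] + word[i + 1])
            i += 2
        else:
            merged.append(word[i])
            i += 1
    return merged

def bpe_train(words, num_merges):
    # Learn `num_merges` merges from a list of words (toy sketch, ties broken arbitrarily).
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter(
            (left, right) for word in corpus for left, right in zip(word, word[1:])
        )
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]
        merges.append(best)
        corpus = [merge_pair(w, best) for w in corpus]
    return merges, corpus

merges, corpus = bpe_train(["Hello", "Helios"], num_merges=2)
print(merges)  # [('H', 'e'), ('He', 'l')]
print(corpus)  # [['Hel', 'l', 'o'], ['Hel', 'i', 'o', 's']]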
That's it!