The Inner Mechanism of Byte Pair Encoding

Juan Vera

January 2025

Abstract

BPE is a subword tokenization technique that splits words into smaller, more frequent subunits, reducing vocabulary size while still being able to represent a large variety of words. BPE was used for tokenization in models such as GPT-1 and GPT-2.

Mechanism

We start with a set of characters (or bytes). Every character is initially treated as its own token:

$$
s_1 = \text{HELLO} \\[3mm]
\mathcal{W}(s_1) = \left[\text{H, E, L, L, O}\right] \\[3mm]
\mathcal{T}(\mathcal{W}(s_1)) = \left[ 1, 2, 3, 3, 5 \right]
$$

where each letter in $\mathcal{W}$ is converted to a single numerical index, i.e. its "token".
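As a rough sketch of this initial step, the snippet below (the helper name `build_initial_vocab` is purely illustrative) assigns one integer id per distinct character in a tiny corpus. The ids it produces differ from the ones above, which assume a larger pre-existing alphabet, and real byte-level BPE would instead start from all 256 byte values.

```python
def build_initial_vocab(words):
    """Assign one integer id to every distinct character seen in the corpus."""
    vocab = {}
    for word in words:
        for ch in word:
            if ch not in vocab:
                vocab[ch] = len(vocab) + 1  # ids start at 1, as in the example above
    return vocab

words = ["HELLO", "HELIOS"]
vocab = build_initial_vocab(words)
tokenized = {w: [vocab[ch] for ch in w] for w in words}
print(vocab)      # {'H': 1, 'E': 2, 'L': 3, 'O': 4, 'I': 5, 'S': 6}
print(tokenized)  # {'HELLO': [1, 2, 3, 3, 4], 'HELIOS': [1, 2, 3, 5, 4, 6]}
```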

Assume we have another word, "Helios":

$$
s_2 = \text{HELIOS} \\[3mm]
\mathcal{W}(s_2) = \left[\text{H, E, L, I, O, S}\right] \\[3mm]
\mathcal{T}(\mathcal{W}(s_2)) = \left[ 1, 2, 3, 7, 5, 8 \right]
$$

For "Hello", we count how often pairs of symbols appear in the text.

Across the words HELLO and HELIOS, "HE" appears twice and "HEL" appears twice. (Strictly, standard BPE only counts adjacent pairs of current symbols, so "HEL" only becomes a merge candidate after "H" and "E" have already been merged into "HE".)
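A minimal sketch of the pair-counting step, using Python's `collections.Counter` (the function name `count_pairs` is illustrative):

```python
from collections import Counter

def count_pairs(tokenized_words):
    """Count every adjacent pair of symbols across all tokenized words."""
    pairs = Counter()
    for symbols in tokenized_words:
        pairs.update(zip(symbols, symbols[1:]))  # all adjacent pairs in this word
    return pairs

corpus = [["H", "E", "L", "L", "O"], ["H", "E", "L", "I", "O", "S"]]
print(count_pairs(corpus).most_common(2))
# [(('H', 'E'), 2), (('E', 'L'), 2)]
```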

We find the most frequent pair of characters (or subwords) and merge it into a new token. In this case both candidates appear twice, so we can break the tie with a pseudorandom choice $\mathcal{P}(\cdot)$ (in practice, implementations usually break ties deterministically, e.g. by first-seen order).

Assume we choose "HEL", we merge individual tokens into a single token, "H", "E", "L"\text{"H", "E", "L"} "HEL"\rightarrow \text{"HEL"}

We reuse that token in the original tokenized forms of the words $\text{HELLO}$ and $\text{HELIOS}$:

$$
s_1 \rightarrow \left[\text{HEL, L, O}\right] \\[3mm]
s_2 \rightarrow \left[\text{HEL, I, O, S}\right]
$$
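A sketch of that substitution, assuming the two pairwise merges noted earlier ("H","E" → "HE", then "HE","L" → "HEL"); the helper name `merge_pair` is hypothetical:

```python
def merge_pair(symbols, pair, new_symbol):
    """Replace every occurrence of the adjacent `pair` with `new_symbol`."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

s1 = ["H", "E", "L", "L", "O"]
s2 = ["H", "E", "L", "I", "O", "S"]
for pair, new in [(("H", "E"), "HE"), (("HE", "L"), "HEL")]:
    s1, s2 = merge_pair(s1, pair, new), merge_pair(s2, pair, new)
print(s1)  # ['HEL', 'L', 'O']
print(s2)  # ['HEL', 'I', 'O', 'S']
```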

This process is repeated, of course over a much larger set of words, until we reach a maximum number of merges or a desired vocabulary size (both predefined as hyperparameters).
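Putting the steps together, here is a toy end-to-end training loop under the same assumptions (the names `merge_pair` and `train_bpe` are illustrative, not from any particular library); production implementations additionally work at the byte level and weight pair counts by word frequency.

```python
from collections import Counter

def merge_pair(symbols, pair, new_symbol):
    """Replace every occurrence of the adjacent `pair` with `new_symbol`."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_bpe(words, num_merges):
    """Toy BPE training: repeatedly merge the most frequent adjacent pair."""
    corpus = [list(w) for w in words]          # every character starts as its own token
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break                              # nothing left to merge
        best = max(pairs, key=pairs.get)       # most frequent pair; ties broken arbitrarily
        merges.append(best)
        corpus = [merge_pair(s, best, best[0] + best[1]) for s in corpus]
    return merges, corpus

merges, corpus = train_bpe(["HELLO", "HELIOS"], num_merges=2)
print(merges)  # [('H', 'E'), ('HE', 'L')]
print(corpus)  # [['HEL', 'L', 'O'], ['HEL', 'I', 'O', 'S']]
```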

That's it!