How Tokens Really Emerge in Large Language Models

Published: 2026-04-08

Tokenization sits at the center of how large language models work. It is the layer that turns human language into something a model can process, and it influences far more than text parsing alone. Context window limits are measured in tokens, API usage is billed in tokens, and prompt-writing strategies often succeed or fail because of how text is split into tokens.

Yet the basic question is still easy to miss: when you type an ordinary sentence, how does it become a sequence of tokens in the first place? And why don’t models simply use familiar units like full words or individual characters?

The answer leads from text to subwords, and from subwords all the way down to bytes.

A quick intuition: what does a token actually look like?

Before getting into algorithms, it helps to look at tokenization directly. If you try a tokenizer tool such as OpenAI’s official tokenizer and paste in a few different kinds of text, a pattern starts to emerge.

For an English sentence like:

I'm a front-end programmer exploring AI.

you will usually see subword-style splits. Something like front-end may be broken into front, -, and end, while programmer may remain a single token.

For a Chinese sentence like:

八百标兵奔北坡，炮兵并排跑

splitting does not simply happen one character at a time. Frequent combinations such as 标兵 or 炮兵 may be grouped into standalone tokens, while rarer combinations tend to stay in smaller pieces.

For mixed text like:

GPT-4o能处理代码print(1+1)和😊表情

code fragments and emoji are also tokenized within the same overall system. The mechanism does not switch to a completely different logic just because the input mixes natural language, symbols, and program syntax.

Two important traits become obvious from this kind of inspection:

A token is neither just a single character nor always a whole word. It is usually a subword unit chosen to balance meaning and efficiency.
Different languages and text types can still be processed under one unified tokenization framework.

That framework is not accidental. It is the result of a practical compromise.

Why not just use words or characters?

Large models rely on subwords because subwords strike a workable balance between vocabulary size and semantic coverage. Using whole words or single characters sounds simpler, but both approaches break down in different ways.

If the model uses full words

Treating every complete word as a basic unit creates two major problems.

First, the vocabulary becomes unmanageably large. Languages produce endless variations: apple, apples, applepie; or in Chinese, combinations and derivations built around the same base expression. Even common English vocabulary exceeds one hundred thousand words, while practical Chinese word inventories can easily grow far larger. A huge vocabulary raises storage demands and increases training complexity.

Second, models run into the classic out-of-vocabulary problem. Any word missing from the training vocabulary becomes difficult or impossible to represent cleanly. New slang, niche terminology, and emerging product names are exactly the kinds of expressions models need to handle, yet a strict word-level system can fail on them. Older systems often had to fall back to [UNK], an unknown token that effectively throws away the word’s identity.
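The failure mode is easy to reproduce. The following is a minimal sketch, assuming a toy whitespace tokenizer and a made-up five-entry vocabulary; neither comes from a real model, and `encode_words` is an illustrative name.

```python
# Toy word-level vocabulary; the words and IDs are invented for illustration.
UNK = "[UNK]"
vocab = {UNK: 0, "i": 1, "like": 2, "apple": 3, "pie": 4}

def encode_words(text: str) -> list[int]:
    """Map each whitespace-separated word to its ID, or to [UNK] if unseen."""
    return [vocab.get(word, vocab[UNK]) for word in text.lower().split()]

print(encode_words("I like apple pie"))  # every word is in the vocabulary
print(encode_words("I like applepie"))   # "applepie" collapses to [UNK]
```

Once a word has been mapped to [UNK], nothing downstream can tell it apart from any other unknown word.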

If the model uses single characters

Moving to character-level units avoids vocabulary explosion, but creates a different set of costs.

The biggest is sequence length. A Chinese sentence that would be compact as words or subwords becomes much longer if split into individual characters; English text split into letters becomes even more inflated. Longer sequences consume context windows faster and make computation less efficient.

Meaning also becomes fragmented. A single character carries limited semantic information. For example, the meaning of a character such as 炮 depends heavily on the surrounding combination: 炮兵 and 火炮 are not understood the same way. Character-level input makes it harder for models to learn higher-level structure and stable semantic patterns.
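The sequence-length cost can be seen with a rough comparison. This sketch assumes a crude whitespace split as a stand-in for word-level tokenization, which is not how a production tokenizer works.

```python
sentence = "I'm a front-end programmer exploring AI."

word_units = sentence.split()  # crude word-level split on whitespace
char_units = list(sentence)    # character-level split, spaces included

print(len(word_units), "word-level units")
print(len(char_units), "character-level units")
```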

Subword tokenization emerged as the practical middle path, and the most influential algorithm behind it is BPE.

The core idea behind BPE

BPE, short for Byte Pair Encoding, is a frequency-driven algorithm that builds subword units step by step. Its basic logic is simple: find the most frequent adjacent pair of symbols in a corpus, merge it into a larger unit, and repeat.

This process does not depend on grammar rules or a hand-written dictionary. It is driven by statistics.

How BPE works

The usual workflow can be summarized in three stages:

Initialization: split text into the smallest starting symbols, such as characters or letters, with each symbol treated as its own token.
Count frequencies: scan the corpus and record how often each adjacent symbol pair appears.
Merge iteratively: take the most frequent adjacent pair, merge it into a new token, and repeat until a target vocabulary size or merge threshold is reached.

What begins as tiny units gradually becomes a layered vocabulary containing larger and more meaningful subword pieces.
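The three stages above can be sketched in a few lines of Python. This is an illustrative version under stated assumptions, not a production tokenizer: it works on one flat symbol sequence rather than a word-frequency table, it breaks ties in favor of the pair seen first, and the names `count_pairs`, `merge_pair`, and `train_bpe` are hypothetical.

```python
from collections import Counter

def count_pairs(symbols):
    """Stage 2: count how often each adjacent pair of symbols occurs."""
    return Counter(zip(symbols, symbols[1:]))

def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair`, scanning left to right."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def train_bpe(symbols, max_merges, min_frequency=2):
    """Stage 3: merge the top pair until a merge limit or frequency floor is hit."""
    merges = []
    for _ in range(max_merges):
        pairs = count_pairs(symbols)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        if pairs[best] < min_frequency:
            break
        symbols = merge_pair(symbols, best)
        merges.append(best)
    return merges, symbols

# Stage 1: start from single characters of a toy string.
merges, symbols = train_bpe(list("aaabdaaabac"), max_merges=10)
print(merges)   # [('a', 'a'), ('aa', 'a'), ('aaa', 'b')]
print(symbols)  # ['aaab', 'd', 'aaab', 'a', 'c']
```

Training stops after three merges here because every remaining pair occurs only once, which falls below the `min_frequency` floor.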

A concrete BPE example in Chinese

Consider the tongue twister:

八百标兵奔北坡，北坡炮兵并排跑，炮兵怕把标兵碰，标兵怕碰炮兵炮

To illustrate the algorithm, start by splitting it into the smallest units, here using individual Chinese characters purely for demonstration:

['八', '百', '标', '兵', '奔', '北', '坡', '，', '北', '坡', '炮', '兵', '并', '排', '跑', '，', '炮', '兵', '怕', '把', '标', '兵', '碰', '，', '标', '兵', '怕', '碰', '炮', '兵', '炮']

First merge: `标-兵`

Counting adjacent pairs across the sequence shows that ('标', '兵') and ('炮', '兵') each appear 3 times, more than any other pair. Breaking the tie in favor of the pair that occurs first in the text, merge ('标', '兵') into a new token, 标兵:
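The pair counts for the character sequence can be checked directly. This is a minimal sketch using only the standard library; `most_common(4)` is just a convenient way to list the leading pairs.

```python
from collections import Counter

text = "八百标兵奔北坡，北坡炮兵并排跑，炮兵怕把标兵碰，标兵怕碰炮兵炮"
chars = list(text)

# Pair each character with its right-hand neighbour and tally the pairs.
pair_counts = Counter(zip(chars, chars[1:]))

for pair, count in pair_counts.most_common(4):
    print(pair, count)
```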

['八', '百', '标兵', '奔', '北', '坡', '，', '北', '坡', '炮', '兵', '并', '排', '跑', '，', '炮', '兵', '怕', '把', '标兵', '碰', '，', '标兵', '怕', '碰', '炮', '兵', '炮']

Second merge: `炮-兵`

Now ('炮', '兵') becomes the top pair with a frequency of 3, so it is merged into 炮兵:

['八', '百', '标兵', '奔', '北', '坡', '，', '北', '坡', '炮兵', '并', '排', '跑', '，', '炮兵', '怕', '把', '标兵', '碰', '，', '标兵', '怕', '碰', '炮兵', '炮']

Third merge: `北-坡`

At this point ('北', '坡') appears 2 times, so it is merged into 北坡:

['八', '百', '标兵', '奔', '北坡', '，', '北坡', '炮兵', '并', '排', '跑', '，', '炮兵', '怕', '把', '标兵', '碰', '，', '标兵', '怕', '碰', '炮兵', '炮']

When does BPE stop?

The merging process ends once remaining pair frequencies fall below a chosen threshold, or once the vocabulary reaches its intended size. The result is a set of compact, high-frequency semantic units such as 标兵, 炮兵, and 北坡.

This reduces sequence length while preserving useful meaning.
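Once merges have been learned, tokenizing new text amounts to replaying them in the order they were learned. The sketch below assumes the three merges from the walkthrough are the entire merge table; `apply_merges` is an illustrative helper name.

```python
def apply_merges(symbols, merges):
    """Apply learned merges in order, each as a left-to-right pass."""
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# The three merges from the walkthrough, in the order they were learned.
merges = [("标", "兵"), ("炮", "兵"), ("北", "坡")]

tokens = apply_merges(list("八百标兵奔北坡，炮兵并排跑"), merges)
print(tokens)
# ['八', '百', '标兵', '奔', '北坡', '，', '炮兵', '并', '排', '跑']
```

The 13 input characters come out as 10 tokens, one fewer for each merge that fires.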

Where plain BPE falls short

BPE is powerful, but starting from characters still leaves some major issues.

The initial vocabulary is already large. Chinese alone requires thousands of common characters, and covering rarer characters, other scripts, punctuation, and symbols pushes the starting set into the tens of thousands.
Cross-language coverage is awkward. Character systems vary dramatically across languages, so a single character-based starting vocabulary is hard to standardize globally.
Out-of-vocabulary risk is not fully gone. If a symbol lies outside the initial character set, such as a rare character or a new emoji, the tokenizer can still fail to represent it directly.

To solve these problems, modern systems pushed the same merging logic one level deeper—from characters down to bytes.

Why modern LLMs favor byte-level BPE

Byte-level BPE, or BBPE, is the version of BPE that has become standard in many top-tier models, including systems such as GPT-4, Claude, and Llama 3 (earlier LLaMA releases used SentencePiece BPE with a byte fallback, a close relative). Instead of beginning with characters, it begins with bytes.

That one design choice changes everything.

The background you need: Unicode and UTF-8

To understand byte-level tokenization, it helps to separate two concepts.

Unicode gives every character a unique code point. For example, 八 is assigned U+516B, a is U+0061, and 😊 is U+1F60A. Unicode solves the identity problem: each character has a globally defined representation.

UTF-8 is the encoding that turns those Unicode characters into byte sequences computers can store and process. It uses variable-length encoding:

ASCII letters and digits always take 1 byte. For example, a becomes 0x61.
Common Chinese characters take 3 bytes. For example, 八 becomes 0xE5 0x85 0xAB.
Emoji and certain special symbols often take 4 bytes. For example, 😊 becomes 0xF0 0x9F 0x98 0x8A.

A useful way to think about it is this: Unicode is the character’s identity card, and UTF-8 is the machine-readable byte form of that identity.
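Both views of a character are available from Python's built-ins, so the examples above can be verified with a short script. This sketch relies on `bytes.hex` accepting a separator, which needs Python 3.8 or newer.

```python
for ch in ["a", "八", "😊"]:
    code_point = f"U+{ord(ch):04X}"  # Unicode identity
    utf8 = ch.encode("utf-8")        # byte form used for storage
    print(ch, code_point, utf8.hex(" ").upper(), f"{len(utf8)} byte(s)")
```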

BBPE operates on that byte form.

How BBPE works

At its core, BBPE builds a multilingual subword system by starting from raw byte sequences and then merging them statistically.

Step 1: convert text into bytes

No matter what the input looks like—Chinese, English, code, emoji, or a mixture of all of them—the text is first converted into UTF-8 bytes.

Examples:

Chinese 八 → 0xE5 0x85 0xAB
English apple → 0x61 0x70 0x70 0x6C 0x65
Mixed text GPT-4o😊 → 0x47 0x50 0x54 0x2D 0x34 0x6F 0xF0 0x9F 0x98 0x8A

Step 2: perform BPE merges on bytes

The initial vocabulary in BBPE contains only the 256 possible byte values from 0x00 to 0xFF. That is the entire starting alphabet.

Then the same iterative merge logic applies:

Count frequent adjacent byte pairs.
Merge common pairs into larger units.
Continue merging as larger patterns repeatedly co-occur.

For a Chinese character such as 八, the pair (0xE5, 0x85) may become a merged unit, and then that unit may merge with 0xAB, effectively reconstructing the character. At higher levels, frequently co-occurring character sequences may merge into larger subwords such as 八百.
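The reconstruction of 八 from its bytes can be reproduced on a tiny input. This is a sketch under simplifying assumptions: the "corpus" is just 八 repeated four times, ties go to the pair seen first, and only two merges are run. `merge_pair` is an illustrative helper, not part of any library.

```python
from collections import Counter

def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with the joined bytes."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

text = "八八八八"
# Step 1: every token starts as a single byte (one of 256 possible values).
tokens = [bytes([b]) for b in text.encode("utf-8")]

# Step 2: repeatedly merge the most frequent adjacent pair of tokens.
merges = []
for _ in range(2):
    pairs = Counter(zip(tokens, tokens[1:]))
    best = max(pairs, key=pairs.get)  # ties go to the pair seen first
    tokens = merge_pair(tokens, best)
    merges.append(best)
    print(best, "->", (best[0] + best[1]).decode("utf-8", errors="replace"))

print(tokens)  # four copies of the three-byte token for 八
```

The first merge yields a two-byte fragment that is not valid UTF-8 on its own, so it prints as a replacement character; the second merge completes 八.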

Step 3: produce hierarchical tokens

After enough merging rounds, the byte stream is no longer treated as isolated bytes. It becomes a layered system of tokens:

low-level byte units,
mid-level character or subword units,
high-level frequent expressions.

This gives the tokenizer a single mechanism that can preserve semantics while remaining language-agnostic.

The practical advantages of BBPE

Compared with character-based BPE, byte-level BPE has three major strengths.

1. It eliminates true OOV failures

Because every UTF-8 string can always be expressed as bytes, the tokenizer can always fall back to the 256-byte base vocabulary. Even rare characters, unfamiliar terms, or new emoji can still be represented. That means there is no hard [UNK] barrier in the old sense.
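The fallback guarantee can be illustrated without any trained merges at all. In this sketch the worst case is assumed: every byte becomes its own token ID. The function names are hypothetical.

```python
def to_byte_ids(text: str) -> list[int]:
    """Worst-case encoding: one ID per UTF-8 byte, no merges applied."""
    return list(text.encode("utf-8"))

def from_byte_ids(ids: list[int]) -> str:
    """Invert to_byte_ids by reassembling and decoding the bytes."""
    return bytes(ids).decode("utf-8")

text = "rare: 𠮷, new: 🫠"
ids = to_byte_ids(text)

print(ids)
print(all(0 <= i <= 255 for i in ids))  # every ID is in the 256-entry base vocabulary
print(from_byte_ids(ids) == text)       # the round trip is lossless
```

A trained tokenizer would usually produce far fewer tokens than this, but it never needs an [UNK] entry to cover the input.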

2. It works across languages and text types

Chinese, English, source code, punctuation, and emoji all reduce to byte sequences at the lowest level. BBPE can therefore use one consistent logic across all of them. That is a major reason modern language models can handle heterogeneous input so naturally.

3. It keeps the vocabulary manageable

The base vocabulary is fixed at 256 bytes, and the final token vocabulary grows through merges to a chosen size: roughly 256 plus the number of merges plus any special tokens. This gives developers direct control over the trade-off between model capacity and computational cost. A model such as GPT-4 can use a vocabulary of roughly one hundred thousand tokens without requiring an enormous hand-built starting lexicon.

Other tokenization methods in large models

BPE and BBPE are not the only approaches. Two other common subword tokenization strategies also appear in major model families.

<table>
<thead>
<tr> <th>Algorithm</th> <th>Core idea</th> <th>Strengths</th> <th>Limitations</th> <th>Typical model families</th> </tr>
</thead>
<tbody>
<tr> <td>BPE/BBPE</td> <td>Iteratively merges frequent byte or character pairs</td> <td>Controllable vocabulary, multilingual compatibility, no hard OOV</td> <td>Rare words may be split too finely</td> <td>GPT series, LLaMA, Qwen</td> </tr>
<tr> <td>WordPiece</td> <td>Builds subwords by merging the pairs that most increase the likelihood of the training data</td> <td>Often produces semantically coherent subwords</td> <td>Depends heavily on probability distributions from training data</td> <td>BERT, DistilBERT, ELECTRA</td> </tr>
<tr> <td>Unigram</td> <td>Selects segmentations through probabilistic subword modeling and search</td> <td>Supports multiple granularities of subword choice</td> <td>Higher training cost and greater computational complexity</td> <td>T5, XLNet</td> </tr>
</tbody>
</table>

The differences matter, but they all reflect the same larger goal: represent language in units that are compact, learnable, and flexible.

Why tokenization matters in practice

Tokenization is not just a preprocessing detail. It changes how you use models in day-to-day work.

Inspecting tokens with `tiktoken`

OpenAI’s tiktoken library makes it easy to see how GPT-family tokenizers behave on real input. The following code shows how to tokenize several kinds of text:

import tiktoken

# Load the tokenizer used by gpt-4o
enc = tiktoken.encoding_for_model("gpt-4o")

# Sample inputs: Chinese, English, and mixed code plus emoji
texts = [
    "八百标兵奔北坡",
    "I'm a front-end programmer",
    "print(1+1) and 😊"
]

for text in texts:
    tokens = enc.encode(text)
    token_count = len(tokens)
    # A token may hold only part of a multi-byte character, so decode
    # with errors="replace" instead of raising on incomplete UTF-8.
    token_strs = [enc.decode_single_token_bytes(token).decode("utf-8", errors="replace") for token in tokens]
    print(f"Text: {text}")
    print(f"Token count: {token_count}")
    print(f"Token sequence: {token_strs}\n")

Running this makes token counts and token boundaries visible in a very concrete way. That is often the first step in understanding why one prompt is cheap, another is expensive, and a third unexpectedly exceeds a context limit.

Context window management

A model’s context window is defined in tokens, not pages or paragraphs. If a model supports something like 128K tokens, that limit covers the prompt, the conversation history, retrieved material, and the model’s own output. Estimating token counts before sending requests is essential if you want to avoid truncation or failure.
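A budget check of this kind can be written as plain arithmetic over token counts. The sketch below assumes the counts have already been obtained from a tokenizer, uses the 128K figure from the text as a default, and adopts one simple trimming policy (drop the oldest messages first). Both function names are hypothetical.

```python
def fits_in_context(prompt_tokens: int, max_output_tokens: int,
                    context_window: int = 128_000) -> bool:
    """True if the prompt plus the reserved output budget fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window

def trim_history(message_token_counts: list[int], budget: int) -> list[int]:
    """Keep the most recent messages whose token counts fit within `budget`."""
    kept, used = [], 0
    for count in reversed(message_token_counts):
        if used + count > budget:
            break
        kept.append(count)
        used += count
    return list(reversed(kept))

print(fits_in_context(prompt_tokens=120_000, max_output_tokens=4_000))  # True
print(fits_in_context(prompt_tokens=126_000, max_output_tokens=4_000))  # False
print(trim_history([500, 1_200, 300, 800], budget=1_500))               # [300, 800]
```

Reserving room for the output up front matters because the window is shared: a prompt that fits on its own can still leave too little space for the reply.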

API cost control

Most LLM APIs are priced by token usage. That means tokenization has a direct financial impact. Long passages, redundant phrasing, and poorly structured prompts all increase cost. Compact wording and better segmentation can reduce usage substantially.
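The cost arithmetic is simple enough to sketch. The per-million-token prices below are hypothetical, chosen only to keep the numbers easy to follow. Billing input and output at separate rates is a common arrangement, but the exact scheme varies by provider.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_million: float,
                  output_price_per_million: float) -> float:
    """Estimated request cost, in the same currency as the prices."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000

# Hypothetical prices, not taken from any real price list.
cost = estimate_cost(input_tokens=2_000, output_tokens=500,
                     input_price_per_million=1.0,
                     output_price_per_million=4.0)
print(f"{cost:.4f}")  # 0.0040
```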

Prompt engineering

Models often respond better when prompts use coherent semantic units rather than fragmented forms. High-frequency subwords appear far more often in training data than scattered character-level fragments, so the model has had more opportunity to learn them. In practical terms, writing prompts with complete and stable expressions tends to improve accuracy.

From bytes to meaning

The technical essence of tokenization in large models is a hierarchical extraction process that moves from bytes to subwords.

It abandons the naive assumption that language should be handled as either isolated characters or complete dictionary words. Instead, it builds a middle layer that preserves useful semantics without letting the vocabulary become unbounded.

BPE established the fundamental merging logic. BBPE pushed that logic down to the byte level, which solved cross-language compatibility and removed the hard out-of-vocabulary bottleneck. Together, these methods form one of the hidden foundations of modern multilingual, multi-domain language models.

What looks like a low-level implementation detail is actually a crucial part of how large models cross language boundaries and generalize across text types. Once you understand how tokens are formed, prompt engineering stops being guesswork and starts to connect with the model’s actual internal interface.