Tokenization is a crucial text-processing step for large language models. It splits text into smaller units called tokens that can be fed into the language model.
Many tokenization algorithms are available; choosing the correct algorithm leads to better model performance and lower memory requirements. How? Let's see together.
This blog post dives into why tokenization is needed and surveys the popular tokenization techniques used by large language models.
Let's get started!
Why Tokenization Is Needed
Language models process text in discrete units called tokens. Tokens can represent words, subwords, or even characters, depending on the tokenization method used. Tokenization provides a structured way to break down text into manageable pieces for the model to process.
Now, we all know large language models are pre-trained on a huge corpus of text (think of all Wikipedia pages). Tokenization takes the corpus and produces a vocabulary of tokens that occur in it. The simplest way is to split the corpus on whitespace and output the words as the vocabulary. That would give a very large vocabulary: English Wikipedia alone contains on the order of billions of running words, and even the number of distinct word forms (the resulting vocabulary) runs into the millions.

The other extreme is a tokenizer that splits the corpus into individual characters and outputs the set of characters as the vocabulary. For English text, this vocabulary is tiny, roughly 100 symbols! Which tokenization algorithm is better? We will see soon.
But we see that in either case, tokenization allows models to work with a finite vocabulary size.
Tokenization helps manage the vocabulary by mapping text elements to predefined tokens in the model's vocabulary. This is essential for controlling memory and computational requirements.
Tokenization does a lot more than the above. For example, it handles out-of-vocabulary (OOV) words. How? — words that are not present in the model's vocabulary are represented as combinations of known subword units in the vocabulary.
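To make this concrete, here is a minimal sketch (not any particular library's algorithm) of how an out-of-vocabulary word can be covered by known subword units. The toy vocabulary and the greedy longest-match rule are assumptions for illustration:

```python
# Toy subword vocabulary; real vocabularies have tens of thousands of entries.
vocab = {"token", "ization", "un", "happi", "ness"}

def segment(word, vocab):
    """Greedy longest-match segmentation: always take the longest known prefix."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):   # try the longest candidate first
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:                                      # nothing matched; fall back to one character
            pieces.append(word[start])
            start += 1
    return pieces

# Neither word is in the vocabulary as a whole, but both are covered by known subwords.
print(segment("tokenization", vocab))   # ['token', 'ization']
print(segment("unhappiness", vocab))    # ['un', 'happi', 'ness']
```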
Natural language has a vast number of words and subword units. There are word tokenizers, subword tokenizers, and character-level tokenizers. Let's briefly look into each one and see how they compare.
1) Character-Level Tokenization
Character-level tokenization involves breaking text down into individual characters.
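As a quick illustration, a character-level tokenizer is essentially trivial to write in Python:

```python
text = "Tokenization matters!"
char_tokens = list(text)   # every character, including spaces and punctuation, becomes a token
print(char_tokens[:6])     # ['T', 'o', 'k', 'e', 'n', 'i']
print(len(char_tokens))    # 21 tokens for a 21-character string
```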

Character-level tokenization has the finest granularity as it captures each character in the text. It is one of the simplest tokenization algorithms available. Here are the pros and cons of this method:
Pros:
- Since characters are the smallest units, there is no vocabulary to learn beyond the character set itself, making it suitable for handling any language.
- Since all words are composed of characters, this method naturally handles rare or out-of-vocabulary words.
Cons:
- Character-level tokenization typically results in longer sequences, which can be computationally expensive to process.
- It cannot capture word-level semantics.
- Models may struggle to generalize effectively due to the absence of higher-level linguistic units like words and subwords.
2) Word-Level Tokenization
Word-level tokenization involves splitting text into individual words.
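A minimal word-level tokenizer can be as simple as a whitespace split; real implementations also peel punctuation off the words, for example with a regular expression:

```python
import re

text = "Tokenization matters, doesn't it?"

print(text.split())
# ['Tokenization', 'matters,', "doesn't", 'it?']   <- punctuation stays attached to words

print(re.findall(r"\w+|[^\w\s]", text))
# ['Tokenization', 'matters', ',', 'doesn', "'", 't', 'it', '?']   <- punctuation split off
```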

This method is at the other extreme of tokenization granularity; it aligns with how humans process language and is thus suitable for a wide range of language understanding tasks. The pros and cons of this method are as follows:
Pros:
- Word-level tokenization captures word semantics and relationships between words, which is crucial for many NLP tasks.
- Word-level tokenization generally results in shorter sequences compared to character-level tokenization; hence, it reduces computational complexity.
Cons:
- Word-level tokenization relies on a predefined vocabulary, which can lead to issues with out-of-vocabulary words or domain-specific terminology.
- Some words have multiple meanings, and word-level tokenization might not capture the correct sense without additional context.
3) Subword Tokenization
Subword-level tokenization involves breaking text into smaller units, such as subwords. Common methods for subword tokenization are Byte Pair Encoding (BPE), WordPiece, and SentencePiece.

Subword tokenization strikes a balance between character and word level. It captures meaningful subword units while maintaining a manageable vocabulary size. Here are the pros and cons:
Pros:
- It can represent out-of-vocabulary words by composing them from subword units, enhancing robustness.
- Subword tokens result in sequences of moderate length, allowing models to process text efficiently while retaining semantic information.
Cons:
- Subword tokenization methods like BPE involve additional complexity in vocabulary construction and tokenization compared to word level.
Putting it all together: in practice, the choice of tokenization level depends on the specific NLP task, the dataset, and the trade-offs between granularity, computational efficiency, and the ability to handle different languages and linguistic phenomena. Many state-of-the-art NLP models, like Transformers, use subword-level tokenization (e.g., BPE or WordPiece) because it offers a good compromise between the advantages of character-level and word-level tokenization.
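To make the trade-off in sequence length concrete, the small comparison below tokenizes one sentence at all three granularities. The subword count comes from a pretrained GPT-2 tokenizer and assumes the Hugging Face transformers library is installed, so treat that number as illustrative:

```python
from transformers import AutoTokenizer   # assumes `pip install transformers`

text = "Tokenization strikes a balance between characters and words."

char_tokens = list(text)                        # character level
word_tokens = text.split()                      # naive word level
gpt2 = AutoTokenizer.from_pretrained("gpt2")    # byte-level BPE (subword level)
subword_tokens = gpt2.tokenize(text)

print(len(char_tokens), "character tokens")     # 60
print(len(word_tokens), "word tokens")          # 8
print(len(subword_tokens), "subword tokens")    # somewhere in between, close to the word count here
```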

State-of-the-Art Tokenizers Are Subword Tokenizers
Byte-pair encoding (BPE) [1] is a subword tokenizer commonly used in state-of-the-art language models. Some notable examples of LLMs that use either BPE or one of its variants are as follows (a short demo of two of these tokenizers appears after the list):
- GPT-2 and GPT-3: OpenAI's GPT models use a byte-level variant of BPE for tokenization. They have been game-changers in various NLP tasks, including text generation and language understanding.
- BERT: BERT (Bidirectional Encoder Representations from Transformers) uses WordPiece, a variant of BPE, for tokenization. It has achieved state-of-the-art results on various NLP benchmarks.
- RoBERTa: RoBERTa, a robustly optimized variant of BERT, switches from BERT's WordPiece to a byte-level BPE tokenizer, the same scheme GPT-2 uses.
- XLNet: A transformer model that avoids BERT's independence assumption by predicting tokens over permuted factorization orders. It uses SentencePiece for tokenization.
- T5: T5, a transformer model pre-trained with a span-corruption (denoising) objective, uses SentencePiece for tokenization.
- ALBERT: A memory-efficient BERT variant with factorized embedding layers; it tokenizes text with SentencePiece, as in XLNet.
- ELECTRA: A transformer model pre-trained with a replaced-token-detection task; it reuses BERT's WordPiece tokenization.
- BART: A sequence-to-sequence model pre-trained as a denoising autoencoder for text generation; it uses the same byte-level BPE as GPT-2.
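As a quick side-by-side look at two of these tokenizers (again assuming the Hugging Face transformers library is installed and the pretrained tokenizers can be downloaded):

```python
from transformers import AutoTokenizer   # assumes `pip install transformers`

gpt2 = AutoTokenizer.from_pretrained("gpt2")               # byte-level BPE (GPT-2/GPT-3 style)
bert = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece (BERT style)

word = "pseudoscientific"   # a rare word, unlikely to be a single vocabulary entry
print(gpt2.tokenize(word))  # BPE pieces; a leading 'Ġ' would mark a token that starts with a space
print(bert.tokenize(word))  # WordPiece pieces; '##' marks a continuation of the previous piece
```

The exact splits depend on each tokenizer's learned vocabulary, but both cover the rare word with a handful of known subwords.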
Subword Tokenization Algorithms
There are a few common algorithms for subword tokenization:
- Byte-Pair Encoding (BPE) [1]
- Unigram language modeling (ULM) [3]
- WordPiece [2]
- SentencePiece [5]
WordPiece and SentencePiece are variants of BPE. All of the above algorithms have two main parts; a minimal BPE sketch of both parts follows the list:
1- A token learner: this takes a corpus of text and creates a vocabulary containing tokens.

2- A token segmenter: this takes a piece of text, such as a sentence, and segments it into tokens.
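Here is a minimal, self-contained sketch of both parts for classic BPE, in the spirit of [1]. The toy corpus, the number of merges, and the "</w>" end-of-word marker are choices made for illustration, not a production implementation:

```python
from collections import Counter

# ---------- Part 1: token learner (learn merge rules from a toy corpus) ----------
corpus = "low low low low low lower lower newest newest newest widest widest"

# Start from character-level symbols; "</w>" marks the end of a word.
word_freqs = Counter(tuple(word) + ("</w>",) for word in corpus.split())

def count_pairs(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

merges = []                                   # learned merge rules, in order
for _ in range(8):                            # number of merges sets the vocabulary budget
    pairs = count_pairs(word_freqs)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)          # most frequent adjacent pair
    merges.append(best)
    word_freqs = merge_pair(best, word_freqs)

print(merges)  # e.g. [('l', 'o'), ('lo', 'w'), ...] -- order depends on the corpus

# ---------- Part 2: token segmenter (apply the learned merges to new text) ----------
def segment_word(word, merges):
    symbols = list(word) + ["</w>"]
    for pair in merges:                       # apply merges in the order they were learned
        i = 0
        while i + 1 < len(symbols):
            if (symbols[i], symbols[i + 1]) == pair:
                symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
            else:
                i += 1
    return symbols

# "lowest" never appears in the corpus, yet it segments into known pieces.
print(segment_word("lowest", merges))  # e.g. ['low', 'est</w>']
```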

Key Takeaways
Takeaway 1: We prefer tokenizers with a smaller vocabulary. Why? Because:
- A smaller vocabulary means fewer embedding parameters to learn: the token embedding matrix is much smaller, and the softmax classifier layer over the vocabulary is simpler and cheaper to compute. A smaller vocabulary also requires less memory to store the embeddings (see the quick calculation after this list).
- A smaller vocabulary usually generalizes better to unseen data. First, with fewer token types, the model is less prone to overfitting the training data. Second, subword vocabularies can represent unseen words by combining known subwords, in contrast to vocabularies that treat each whole word as a single token.
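As a rough back-of-the-envelope illustration of the first point (the hidden size and vocabulary sizes below are example numbers, not tied to any particular model):

```python
def embedding_params(vocab_size, d_model):
    """Parameters in the token embedding matrix (often shared with the output softmax)."""
    return vocab_size * d_model

d_model = 768  # example hidden size
for vocab_size in (100, 30_000, 250_000):
    print(f"vocab={vocab_size:>7,}  embedding params={embedding_params(vocab_size, d_model):>12,}")
# vocab=    100  embedding params=      76,800
# vocab= 30,000  embedding params=  23,040,000
# vocab=250,000  embedding params= 192,000,000
```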
The downside of a smaller vocabulary is that an input document is tokenized into a longer sequence of tokens, which may no longer fit within the context window of the language model we are using.
Takeaway 2: Most subword tokenizers run inside space-separated tokens. That is, the document is first split into words with a whitespace pre-tokenizer, and the subword tokenizer is then run on the resulting list of words. This ensures that every token the subword tokenizer finally outputs is a word or a subword.
This is needed because, otherwise, words that frequently occur together in text, such as "there is," could end up as a single token! Running the whitespace pre-tokenizer first splits the phrase into two words, "there" and "is," and the subword tokenizer can then further split each word into subwords.
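A minimal sketch of this two-stage pipeline, with a hard-coded toy merge list (the merges here are made up for illustration; a real tokenizer learns them from a corpus, as in the BPE sketch above):

```python
# Toy merge rules, made up for illustration.
merges = [("t", "h"), ("th", "e"), ("r", "e"), ("i", "s")]

def segment_word(word, merges):
    symbols = list(word) + ["</w>"]
    for pair in merges:
        i = 0
        while i + 1 < len(symbols):
            if (symbols[i], symbols[i + 1]) == pair:
                symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
            else:
                i += 1
    return symbols

text = "there is"
words = text.split()                                            # stage 1: whitespace pre-tokenization
tokens = [t for w in words for t in segment_word(w, merges)]    # stage 2: subword segmentation per word
print(tokens)  # ['the', 're', '</w>', 'is', '</w>'] -- "there is" can never become one token
```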
If you have any questions or suggestions, feel free to reach out to me: Email: mina.ghashami@gmail.com LinkedIn: https://www.linkedin.com/in/minaghashami/