A summary of tokenization methods used by pre-trained models.

# Preliminaries

The table below summarizes the differences between various tokenization approaches.

| Tokenization | Methods | Pros | Cons |
| --- | --- | --- | --- |
| Word-based | 1. Space and punctuation tokenization; 2. Rule-based tokenization. | Easy to use. | 1. Very large vocabularies. 2. OOV problem. 3. Loss of meaning across very similar words. |
| Char-based | Splitting into chars. | 1. Slimmer vocabularies. 2. Mostly open-vocabulary: fewer OOV words. | 1. Very long sequences. 2. Less meaningful individual tokens. |
| Subword-based | WordPiece, BPE, Unigram | 1. Good balance between vocabulary size and sequence length. 2. Helps identify similar syntactic or semantic structures in texts. 3. Can mark start-of-word (or continuation) tokens. | |

Why subword?

• Subword-based tokenization lies between character-based and word-based tokenization, and arises from the idea that:

Frequently used words should not be split into smaller subwords;
Rare words should be decomposed into meaningful subwords.

• Subwords help identify similar syntactic or semantic structures in texts, such as words sharing the same prefix or suffix.
• Subword tokenization can mark whether a token starts or continues a word, e.g. the "##" prefix in WordPiece (BERT) marks a token that continues a word rather than starting one.

# Summary

It can be seen from the table that:

• OpenAI and Facebook favor BPE tokenization, whereas Google prefers its own WordPiece and Unigram methods. ;)
| Model | Tokenization | Corpus | Authors |
| --- | --- | --- | --- |
| GPT | BPE (40,478) [Spacy/ftfy pre-tokenizer] | BooksCorpus | OpenAI |
| GPT-2 | BBPE (50,257) | WebText (40GB) | OpenAI |
| GPT-3 | BBPE | Common Crawl, WebText2, Books1/2, Wikipedia | OpenAI |
| XLNet | Unigram (SentencePiece) | BooksCorpus, enwiki, Giga5, ClueWeb 2012-B, Common Crawl | Google/CMU |
| ELECTRA | WordPiece (30k) | base: same as BERT; large: same as XLNet | Google/Stanford |

The following sections compare the common subword methods used in pre-trained models. Refer to the links in [13] for the detailed tokenization process.

TL;DR

• WordPiece $\Uparrow$ (probability-based) merges tokens based on bigram likelihood. It uses a language model to evaluate the likelihood of each candidate subword pair merge at every iteration, incrementally merging neighboring unit pairs.
• Byte Pair Encoding (BPE) $\Uparrow$ (frequency-based) merges tokens based on bigram frequency. It uses subword pair co-occurrence counts to greedily merge neighboring pairs, which effectively balances the vocabulary size and the sequence length. It is a greedy, deterministic symbol-replacement procedure, so it cannot generate multiple segmentations with probabilities.
• Unigram Language Model $\Downarrow$ (subword regularization) prunes tokens based on unigram LM perplexity. It can be viewed as a probabilistic mixture of characters, subwords, and word segmentations, where the mixture probability is computed with the EM algorithm. It shrinks the subword vocabulary with a unigram LM, removing the units whose removal reduces the likelihood the least.
| Tokenization | #Vocab | Update method | New symbols |
| --- | --- | --- | --- |
| WordPiece | grows ($\Uparrow$) | Bottom-up merge | "##" continuation prefix |
| BPE | grows ($\Uparrow$) | Bottom-up merge | |
| Unigram | shrinks ($\Downarrow$) | Prune | |

## Byte-Pair Encoding (BPE)

Byte-Pair Encoding (BPE)[8] first applies a pre-tokenizer to split the text sequence into words, then builds a base vocabulary consisting of all character symbols appearing in the training data, on which frequency-based merges are performed.

Pre-tokenization  The pre-tokenization can be:

• Space tokenization, e.g. GPT-2, RoBERTa.
• Rule-based tokenization (Moses), e.g. XLM.
• Spacy and ftfy: GPT.

Frequency-based Merge  Starting with the base vocabulary, BPE counts the frequency of every neighboring symbol pair and adds the most frequent pair to the vocabulary as a new merged unit. It then repeats the process, searching for the next most frequent pair, until the target number of merges is reached. A minimal sketch of this loop is shown below.
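The following is a minimal, illustrative sketch of frequency-based merging, adapted from the algorithm in [8]; the toy corpus, the end-of-word marker "</w>", and the number of merges are placeholders for demonstration only.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count neighboring symbol pairs over a (word -> frequency) vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the chosen pair with its merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words split into characters, with an end-of-word marker "</w>".
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(10):  # the number of merges is a hyperparameter
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(best)  # the learned merge rule at this iteration
```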

### Byte-level BPE (BBPE)

#### Background

Unicode vs Byte:

• Unicode: Unicode is a character encoding standard able to represent characters from different languages. Each character is represented by a Unicode code point; Unicode defines a total of 137,929 characters.
• Byte: a byte is 8 bits, so a one-byte character set can represent 256 distinct values.
| BPE Tokenization | #Code points | Seq length | OOV |
| --- | --- | --- | --- |
| Unicode-level | 130k+ | L | possible (unseen characters) |
| Byte-level | 256 | ≤ 4L | none |

A Unicode code-point vocabulary needs 130k+ entries to cover the full space of textual characters, which would blow up the base vocabulary of BPE. Applying BPE to the byte sequence of text, proposed in GPT-2[14], reduces the base vocabulary size instead. However, directly applying byte-level BPE can be suboptimal, because the greedy frequency-based heuristic in BPE tends to merge common words with their neighbors, generating overfit sub-tokens such as "-ing.", "-ing!", "-ing?".

To avoid this, GPT-2[14] prevents BPE from merging across character categories for any byte sequence, with an exception for spaces. With byte-level subwords, BBPE can represent any text with a moderate vocabulary size and no out-of-vocabulary problem. The cost is that the byte sequence can grow up to 4x longer. The character-category restriction is implemented as a regex pre-tokenization pattern, sketched below.
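For illustration, the pre-tokenization pattern below is quoted from memory of GPT-2's encoder.py (please verify against the released source); it splits text into letter runs, number runs, punctuation runs, and common contractions before any BPE merges are applied:

```python
import regex as re  # the third-party `regex` module, needed for \p{...} classes

# GPT-2's byte-level BPE pre-tokenization pattern (quoted from memory of
# openai/gpt-2 encoder.py; verify against the original source).
pat = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

print(re.findall(pat, "Doing things, isn't it?"))
# e.g. ['Doing', ' things', ',', ' isn', "'t", ' it', '?']
```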

#### BBPE

The base vocabulary contains all base characters appearing in the training data, which can become large if all Unicode characters must be included. GPT-2[14] therefore uses Byte-level BPE (BBPE), building the base vocabulary from the byte sequence of the text instead of Unicode character strings. BBPE is also adopted by RoBERTa, BART, and GPT-3.

Vocabulary Size  The final vocabulary size is the size of the base vocabulary plus the number of merges, where the number of merges is a hyperparameter. For instance,

• GPT (character-level BPE) has a vocabulary of 40,478 tokens: 478 base characters + 40,000 merges.
• GPT-2 (byte-level BPE) has a vocabulary of 50,257 tokens: 256 base bytes + 1 end-of-text token (<|endoftext|>) + 50,000 merges.

See the implementations of the GPT-2 & RoBERTa tokenizers.
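As a quick sanity check, the byte-level BPE vocabulary size can be inspected directly; this is a sketch assuming the Hugging Face transformers package is installed (it is not part of the original GPT-2 codebase):

```python
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
print(tok.vocab_size)                       # 50257 = 256 bytes + 50,000 merges + <|endoftext|>
print(tok.tokenize("Tokenization works!"))  # byte-level BPE pieces, e.g. ['Token', 'ization', 'Ġworks', '!']
```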

## WordPiece

WordPiece[9][10] can be viewed as a language-model-based variant of BPE. Its training procedure is similar to BPE's but uses a different merge rule: rather than choosing the most frequent pair, WordPiece selects the unit pair that maximizes the likelihood of the training data, i.e. the subword pair with the maximum mutual information.
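Concretely, a common way to express this merge criterion (e.g., in the Hugging Face course's description of WordPiece training) is to score each candidate pair by

$$\mathrm{score}(x, y) = \frac{\mathrm{count}(xy)}{\mathrm{count}(x)\cdot\mathrm{count}(y)},$$

and merge the pair with the highest score, so that pairs whose parts are also individually frequent are penalized relative to plain frequency-based BPE.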

WordPiece scores the likelihood of candidate pairs using an n-gram LM. [9] mentions that training an LM for every possible merge is prohibitively expensive, so the authors used aggressive heuristics to reduce the cost. However, a public training implementation is unavailable.

The BERT tokenization applies two tokenizers one after another:

• BasicTokenizer:
1. Convert text to unicode.
2. Clean text: invalid character removal and whitespace cleanup.
3. Use whitespace to separate Chinese characters.
4. Whitespace tokenization.
5. Lowercase & strip accents.
6. Split on punctuation.
• WordpieceTokenizer:
1. Convert texts to unicode.
2. Apply WordPiece, a greedy longest-match-first algorithm, to tokenize each word given the vocabulary (sketched below).
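Here is a minimal sketch of the greedy longest-match-first step; the toy vocabulary is illustrative, and BERT's actual WordpieceTokenizer additionally handles a maximum input length per word.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur_piece = None
        # Try the longest remaining substring first, shrinking until a match is found.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces are prefixed with "##"
            if piece in vocab:
                cur_piece = piece
                break
            end -= 1
        if cur_piece is None:
            return [unk_token]  # no segmentation possible for this word
        pieces.append(cur_piece)
        start = end
    return pieces

vocab = {"un", "##aff", "##able", "aff", "##ord"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
```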

## Unigram Language Model

Unigram Language Model[11] initializes its vocabulary with a large number of candidate units and gradually removes a portion (e.g., 20%) of them according to how the training-data likelihood changes. It uses a unigram LM to evaluate the likelihood loss incurred by removing each subword, where the probability of each unit is estimated with the EM algorithm, and it drops the units whose removal hurts the likelihood the least. The pruning process stops when the pre-defined vocabulary size is reached.

Since Unigram is not based on merge rules (in contrast to BPE and WordPiece), there are several ways of tokenizing new text after training. Therefore, Unigram also saves the probability of each token in the training corpus on top of the vocabulary, so that the probability of every possible tokenization can be computed after training. In practice it simply picks the most likely tokenization, but it also offers the possibility of sampling a tokenization according to these probabilities.

Assuming that the set of all possible tokenizations of a word $x_i$ is $S(x_i)$ and $P(x)$ is the unigram probability of a tokenization $x$, the overall loss over a corpus of $N$ words is defined as

$$\mathcal{L} = -\sum_{i=1}^{N} \log\Big(\sum_{x \in S(x_i)} P(x)\Big).$$

Refer to [18] for details.

## SentencePiece Library

SentencePiece[12][17] treats the input as a raw stream, includes the whitespace in the base vocabulary, and then uses the BPE or Unigram algorithm to build the vocabulary. XLNet, T5, and ALBERT use SentencePiece for subword tokenization; Unigram is the default. A minimal usage sketch follows the pros list below.

Pros:

1. Its C++ implementation makes tokenization blazingly fast.
2. It is whitespace-agnostic, so languages that are not whitespace-delimited, such as Chinese and Japanese, can be trained with the same ease as English or French.[18]
3. It works at the byte level.
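A minimal training-and-encoding sketch with the sentencepiece Python package; the corpus file corpus.txt and the vocabulary size are placeholders:

```python
import sentencepiece as spm

# Train a unigram model (the default); model_type can also be "bpe", "char", or "word".
spm.SentencePieceTrainer.train(
    input="corpus.txt",   # placeholder: one sentence per line
    model_prefix="toy",   # writes toy.model and toy.vocab
    vocab_size=8000,
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="toy.model")
print(sp.encode("Hello world.", out_type=str))  # pieces, e.g. ['▁Hello', '▁world', '.']
print(sp.encode("Hello world.", out_type=int))  # corresponding ids
```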

### Unigram: sampling and nbest segmentation for subword regularization

When --model_type=unigram (the default) is used, we can perform sampling and n-best segmentation for data augmentation; see the subword regularization paper[11] for details. nbest_size controls the number of top segmentation candidates to sample from at each step, where -1 means sampling over all possible segmentations (the full lattice).
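For example, with the toy processor sp from the sketch above, sampled segmentations can be drawn as follows (enable_sampling, alpha, and nbest_size are the encode parameters of the sentencepiece Python API; the output varies per call):

```python
# Draw several different segmentations of the same sentence for data augmentation.
for _ in range(3):
    print(sp.encode("Hello world.", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
```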

### BPE model

Sentencepiece also supports BPE (byte pair encoding) model by setting --model_type=bpe. The BPE model does not support sampling and n-best segmentation.

### Character and word model

Sentencepiece supports character and word segmentation with the --model_type=char and --model_type=word flags.
In word segmentation, sentencepiece simply splits tokens on whitespace, so the input text must be pre-tokenized. This lets us switch between segmentation algorithms transparently without changing the pre/post-processors.

### Text normalization

Sentencepiece provides the following general pre-defined normalization rules. The normalizer can be changed with the --normalization_rule_name=<NAME> flag.

• nmt_nfkc: NFKC normalization with some additional normalization around spaces. (default)
• nfkc: original NFKC normalization.
• nmt_nfkc_cf: nmt_nfkc + Unicode case folding (mostly lower casing)
• nfkc_cf: nfkc + Unicode case folding.
• identity: no normalization

Custom normalization rules can also be defined in a TSV file and passed with the --normalization_rule_tsv=<FILE> flag.
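For instance, a sketch of selecting a normalization rule at training time; this assumes the Python trainer accepts the same flags as keyword arguments, and corpus.txt is a placeholder:

```python
import sentencepiece as spm

# Train with case folding on top of NFKC normalization.
spm.SentencePieceTrainer.train(
    input="corpus.txt",                     # placeholder corpus
    model_prefix="toy_cf",
    vocab_size=8000,
    normalization_rule_name="nmt_nfkc_cf",  # one of the pre-defined rules above
)
```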

### Vocabulary restriction

We can encode text using only the tokens specified with the set_vocabulary method.
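A brief sketch, continuing with the sp processor from the earlier example; the piece list here is purely illustrative:

```python
# Restrict segmentation to an explicit subset of pieces.
sp.set_vocabulary(["▁Hello", "▁world", "."])  # pieces outside this set are avoided where possible
print(sp.encode("Hello world.", out_type=str))

sp.reset_vocabulary()  # remove the restriction again
```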

### Extracting pieces that cross word boundaries

Sentencepiece does not extract pieces that cross multiple words (here a word means a space-delimited token), so a piece never contains the whitespace marker (▁) in the middle.

--split_by_whitespace=false disables this restriction and allows pieces that cross multiple words to be extracted. In CJK (Chinese/Japanese/Korean), this flag does not affect the final segmentation much, since CJK words are not delimited by whitespace.

### Getting byte offsets of tokens

Sentencepiece keeps track of byte offset (span) of each token, which is useful for highlighting the token on top of unnormalized text.

We first need to install the protobuf module and sentencepiece_pb2.py, as the byte offsets and all other segmentation metadata are encoded in a protocol buffer. The encode_as_serialized_proto method returns a serialized SentencePieceText proto, which can be deserialized by calling its ParseFromString method.
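A sketch of reading the offsets, continuing with the sp processor from earlier and assuming the protobuf package and the generated sentencepiece_pb2 module are available as described above:

```python
import sentencepiece_pb2  # the generated sentencepiece_pb2.py mentioned above

# Serialize the segmentation, then parse it back into a SentencePieceText message.
spt = sentencepiece_pb2.SentencePieceText()
spt.ParseFromString(sp.encode_as_serialized_proto("Hello world."))

for piece in spt.pieces:
    # begin/end are byte offsets of the piece in the unnormalized input text.
    print(piece.piece, piece.begin, piece.end)
```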

To expand a pre-trained SentencePiece model with new special tokens, such as [MASK0-99], [DOMAIN0-99], and so on, refer to [19] and [20].

## Empirical Analysis

### Unigram vs Char-based BPE

Unigram aligns with morphology better than character-based BPE. [15] argued that Unigram LM tokenization recovers subword units that align with morphology much better than BPE does, using the SentencePiece[12] implementation on English and Japanese Wikipedia.

As shown in the figure below, Unigram tends to produce longer subword units than BPE on average and has more tokens of moderate frequency.

As shown in the table, BPE tokenization tends to merge common tokens, such as English inflectional suffixes and Japanese particles, into their neighbors even when the resulting units are not semantically meaningful. This is likely due to the greedy construction of BPE tokenization.

[15] found that segmentations produced by Unigram LM align more closely to the morphological references in both English and Japanese.

Models using Unigram outperform BPE-based counterparts on downstream fine-tuning tasks. [15] reported that fine-tuning models pre-trained with Unigram LM tokenization yields better performance than with BPE tokenization on the tasks they experimented with.