Summary of word tokenization, as well as coping with OOV words. (Expanded from the MT course taught by Dr. Rico Sennrich at Edinburgh Informatics in 2018.)

# Background

## How to Represent Text?

• One-hot encoding
  • lookup of word embedding for input
  • probability distribution over vocabulary for output
• Large vocabulary
  • increases network size
  • decreases training and decoding speed
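The two roles of the vocabulary can be sketched in a few lines of NumPy (toy vocabulary and dimensions; all names and values here are illustrative, not from the original notes):

```python
import numpy as np

# Toy vocabulary; real systems use tens of thousands of word types.
vocab = {"the": 0, "cat": 1, "sat": 2}
V, d = len(vocab), 4              # vocabulary size, embedding dimension

def one_hot(word):
    """Length-V indicator vector for a word."""
    v = np.zeros(V)
    v[vocab[word]] = 1.0
    return v

# Input side: multiplying a one-hot vector by the embedding matrix
# is just a row lookup.
E = np.random.default_rng(0).normal(size=(V, d))
assert np.allclose(one_hot("cat") @ E, E[vocab["cat"]])

# Output side: a softmax over the whole vocabulary, one logit per word;
# this is the part whose cost grows with V.
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

hidden = np.random.default_rng(1).normal(size=d)
probs = softmax(E @ hidden)       # probability distribution over all V words
```

Both the embedding matrix and the softmax scale linearly with V, which is why a large vocabulary increases network size and slows training and decoding.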

## Problems

Open-vocabulary problems:

• Many training corpora contain millions of word types
• Productive word formation processes (compounding; derivation) allow formation and understanding of unseen words
• Names, numbers are morphologically simple, but open word classes

# Word-based Tokenization

Limits:

• Very similar words get entirely different representations, e.g. dog vs. dogs.
• The vocabulary can end up very large.
• A large vocabulary results in enormous embedding matrices at the input and output layers.

Common methods include:

• Space and punctuation tokenization;
• Rule-based tokenization.

## Non-solution: Ignoring Rare Words

• Replace OOV words with UNK
• A vocabulary of 50,000 words covers 95% of text: 95% is not enough!

OOV problem:

• If two very different words are both OOV, they will get the same id ([UNK]).
• Large vocabulary will increase the embedding layer’s parameters.

## Approximative Softmax

Compute softmax over an “active” subset of the vocabulary $\rightarrow$ smaller weight matrix, faster softmax [1]

• At training time: vocabulary based on words occurring in training set partition
• At test time: determine likely target words based on source text (using cheap method like translation dictionary)

Limitations:

• Allows a larger vocabulary, but still not open
• Networks may not learn good representations of rare words

## Back-off Models

• Replace rare words with UNK at training time [2]
• When the system produces UNK, align the UNK to a source word, and translate it with a back-off method

Limitations:

• Compounds: hard to model 1-to-many relationships
• Morphology: hard to predict inflection with back-off dictionary
• Names: if alphabets differ, we need transliteration
• Alignment: attention model unreliable

# Character-based Tokenization

Character tokens solve the OOV problem, with the following benefits:

• Vocabularies are slimmer;
• Mostly open-vocabulary: fewer OOV words;
• No heuristic or language-specific segmentation;
• Neural networks can conceivably learn from raw character sequences.

Drawbacks:
Representing the input as a sequence of characters has the following problems:

• Character tokens increase sequence length, which slows down training and decoding (a 2–4x increase in training time) and makes it difficult to learn the relationships between characters that form meaningful words.
• It is harder for the model to learn meaningful input representations. E.g., learning a context-independent representation for the single character “N” is much harder than learning one for the word “NLP”.
• Naive char-level encoder-decoders are currently resource-limited.

OPEN QUESTIONS:

• On which level do we represent meaning?
• On which level does attention operate?

## Hierarchical Model: Backoff

• Word-level model produces UNKs [4]
• For each UNK, char-level model predicts word based on word hidden state

Pros:

• Prediction is more flexible than dictionary look-up
• More efficient than pure char-level translation

Cons:

• Independence assumptions between main model and backoff model

## Char-level Output

• No word segmentation on target side [5]
• Encoder is BPE-level

## Char-level Input

Hierarchical representation: RNN states represent words, but their representation is computed from char-level LSTM [6]

## Fully Char-level

• Goal: get rid of word boundaries [7]
• Target side: char-level RNNs
• Source side: convolution and max-pooling layers

# Subword-based Tokenization

The idea behind subword tokenization is that frequently occurring words should be in the vocabulary, whereas rare words should be split into frequent subwords.

Subword-based tokenization lies between character and word-based tokenization.

• Frequently used words should not be split into smaller subwords;
• Rare words should be decomposed into meaningful subwords;
• Subwords help identify similar syntactic or semantic situations in texts;
• Subword tokenization can mark word-internal tokens, such as the “##” prefix in WordPiece for subwords that do not start a word.

## Byte-Pair Encoding

Byte-Pair Encoding (BPE) relies on a pre-tokenizer that splits the training data into words.

Why BPE? [13]

• Open-vocabulary: operations learned on the training set can be applied to unseen words
• Compression of frequent character sequences improves efficiency $\rightarrow$ trade-off between text length and vocabulary size.

### Pre-tokenization

Pre-tokenization splits the text into words. For example:

• Space tokenization, e.g. GPT-2, RoBERTa;
• Rule-based tokenization (using Moses), e.g., XLM, FlauBERT;
• Space and ftfy: GPT.

### BPE Training

Given pre-tokenized tokens, we can train a BPE tokenizer as follows:

Bottom-up character merging: [3][13]

• Starting point: char-level representation $\rightarrow$ computationally expensive.
• Compress representation based on information theory $\rightarrow$ byte-pair encoding.
• Repeatedly replace most frequent symbol pair ('A', 'B') with 'AB'.
• Hyperparameter: when to stop $\rightarrow$ controls vocabulary size.

Step by step:

1. Cut the pre-tokenized corpus into its smallest units, usually characters, as the base vocabulary.
2. Append </w> to the end of each original token.
3. Count the neighboring unit pairs, and merge the pair that occurs most frequently.
4. Go to 3 until the maximum vocabulary size is reached.
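The steps above can be sketched in Python; the word frequencies below are a toy example in the spirit of the BPE paper, and `num_merges` stands in for the vocabulary-size hyperparameter:

```python
import re
from collections import Counter

def get_stats(vocab):
    """Count neighboring symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq
            for word, freq in vocab.items()}

# Words are pre-split into characters, with </w> marking word ends.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

num_merges = 10   # hyperparameter: controls the final vocabulary size
merges = []
for _ in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)   # most frequent neighboring pair
    vocab = merge_vocab(best, vocab)
    merges.append(best)
```

The first merge is ('e', 's'), occurring 6 + 3 = 9 times in "newest" and "widest"; after ten merges the vocabulary contains whole units such as "newest</w>".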

### BPE Encoding

BPE encodes input tokens one by one, applying the merges in the order they were learned during training (from most to least frequent). In other words, merge according to the merge order bpe_codes in merge.txt (with decreasing frequency).
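Encoding can be sketched as follows, using a hypothetical learned merge list whose ranks play the role of the bpe_codes file:

```python
def bpe_encode(word, merges):
    """Apply learned merges to a word, earliest-learned (most frequent) first."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word) + ["</w>"]
    while len(symbols) > 1:
        pairs = {(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)}
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:          # no applicable merge left
            break
        merged, i = [], 0
        while i < len(symbols):        # merge all occurrences of the pair
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge list, in the order the merges were learned.
merges = [("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w")]
```

For example, `bpe_encode("lowest", merges)` yields `["low", "est</w>"]`: the unseen word is covered by merges learned from other words, which is what makes the vocabulary open.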

## WordPiece

WordPiece was first proposed in Google’s Japanese and Korean voice search system [8], and was later used in Google’s machine translation system [9]. It deals with an effectively infinite vocabulary from large amounts of text automatically and incrementally by running a greedy algorithm. This provides a user-specified number of word units, chosen greedily (without focusing on semantics) to maximize the likelihood of the language model (LM) on the training data, which is incidentally the same metric used during decoding.

Example:

• Word: Jet makers feud over seat width with big orders at stake
• Wordpieces: _J et _makers _fe ud _over _seat _width _with _big _orders _at _stake

Procedure:

1. Initialize the word unit inventory with basic Unicode characters.
2. Build a language model on the training data using the inventory from 1.
3. Generate a new word unit by combining two units from the current inventory, incrementing the inventory by one. “Choose the new word unit out of all possible ones that increases the likelihood on the training data the most when added to the model.”
4. Go to 2 until reaching the predefined limit of word units, or until the likelihood increase falls below a certain threshold.

How to choose which pair to merge?
Suppose we get $z$ after merging the neighboring subwords $x$ and $y$; the difference in LM log-likelihood per occurrence is:

$$\log p(z) - \log p(x) - \log p(y) = \log \frac{p(z)}{p(x)\,p(y)}$$

This is exactly the (pointwise) mutual information (MI) between the subword pair $(x, y)$ under the trained LM. Therefore, WordPiece chooses the subword pair with the maximum MI value.
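A minimal sketch of this criterion on a toy, already-segmented corpus (it ignores WordPiece details such as “##” continuation units; the shared normalization constants cancel in the argmax, so raw counts suffice):

```python
from collections import Counter

# Toy corpus, already split into character units (illustrative only).
corpus = [["u", "n", "a", "b", "l", "e"],
          ["u", "n", "i", "t"],
          ["a", "b", "l", "e"]]

unit_counts = Counter(u for seq in corpus for u in seq)
pair_counts = Counter((seq[i], seq[i + 1])
                      for seq in corpus for i in range(len(seq) - 1))

def score(pair):
    """Proportional to exp(MI) = p(x,y) / (p(x) p(y)); the corpus-size
    constants are the same for every pair, so counts alone rank them."""
    x, y = pair
    return pair_counts[pair] / (unit_counts[x] * unit_counts[y])

best = max(pair_counts, key=score)
# best is ("i", "t"): it occurs only once, but "i" and "t" never occur
# apart, so the pair has maximal mutual information -- unlike BPE, which
# would merge one of the more frequent pairs instead.
```

The example highlights the difference from BPE: WordPiece can prefer a rare but exclusive pair over a frequent but loosely associated one.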

## Unigram Language Model

Motivation:

• Even with the same BPE vocabulary, a sentence can be segmented into multiple different subword sequences. Subword regularization exploits this ambiguity as a regularization method for open-vocabulary NMT.

Subword regularization based on a unigram language model [10] assumes that each subword occurs independently; consequently, the probability of a subword sequence $\mathbf{x} = (x_1, \cdots, x_M)$ is formulated as the product of the subword occurrence probabilities $p(x_i)$:

$$P(\mathbf{x}) = \prod_{i=1}^{M} p(x_i), \qquad \forall i \; x_i \in \mathcal{V}, \quad \sum_{x \in \mathcal{V}} p(x) = 1,$$

where $\mathcal{V}$ is a pre-determined vocabulary. The most probable segmentation $\mathbf{x}^*$ for the input $X$ is given by:

$$\mathbf{x}^* = \arg\max_{\mathbf{x} \in \mathcal{S}(X)} P(\mathbf{x}),$$

where $\mathcal{S}(X)$ is the set of segmentation candidates built from the input sentence $X$. $\mathbf{x}^*$ is obtained with the Viterbi algorithm.
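The Viterbi search can be sketched as dynamic programming over prefixes of the input; the vocabulary and its probabilities below are toy values, not learned ones:

```python
import math

# Toy unigram probabilities over a pre-determined vocabulary (assumed values).
log_p = {k: math.log(v) for k, v in {
    "un": 0.1, "able": 0.1, "unable": 0.005,
    "u": 0.05, "n": 0.05, "a": 0.05, "b": 0.05, "l": 0.05, "e": 0.05}.items()}

def viterbi_segment(text):
    """Most probable segmentation under the independent-subword model."""
    n = len(text)
    best = [0.0] + [-math.inf] * n    # best[i] = best log-prob of text[:i]
    back = [0] * (n + 1)              # back[i] = start of the last piece
    for i in range(1, n + 1):
        for j in range(i):
            piece = text[j:i]
            if piece in log_p and best[j] + log_p[piece] > best[i]:
                best[i] = best[j] + log_p[piece]
                back[i] = j
    pieces, i = [], n                 # trace back the winning segmentation
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]
```

With these numbers, `viterbi_segment("unable")` returns `["un", "able"]`: two mid-probability pieces beat both the single rare piece "unable" and the character-by-character segmentation.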

Procedure:

1. Heuristically make a reasonably big seed vocabulary from the training corpus. Possible choices are:
• All characters plus the most frequent substrings;
• BPE with a sufficient number of merges.
2. Fix the vocabulary and optimize $p(x)$ with the EM algorithm.
3. Compute the loss for each subword $x_i$ under the unigram language model: how much the likelihood $\mathcal{L}$ decreases when $x_i$ is removed from the current vocabulary.
4. Sort the subwords by loss and keep the top $\eta\%$ (e.g. $\eta = 80$). (Always keep the single characters to avoid OOV.)
5. Go to 2.

Interpretation

• Subword regularization with a unigram language model can be seen as a probabilistic mixture of character, subword, and word segmentations.
• I regard it as a post-regularization approach to subword techniques such as BPE; this means we can use BPE word pieces as the initialization.

## Comparison

| Subword tokenization | Merge rules | Trim rules | Frequency-based | Probability-based |
| --- | --- | --- | --- | --- |
| BPE | ✓ | | ✓ | |
| WordPiece | ✓ | | | ✓ |
| Unigram | | ✓ | | ✓ |

Comparison between subword methods:

1. BPE $\Uparrow$: starts from character sets and incrementally merges according to the co-occurrence frequency of (neighboring) subword pairs.
2. WordPiece $\Uparrow$: starts from character sets and incrementally merges according to the likelihood increase (mutual information) of the merge.
3. Unigram Language Model (subword regularization) $\Downarrow$: starts from a large subword vocabulary and trims subwords according to the likelihood reduction their removal causes under a unigram LM.