
How to Handle Out-Of-Vocabulary Words?

Tips for handling OOV words.

Background

How do we represent text?

  • 1-hot encoding
    • input: lookup of the word's embedding (see the sketch below)
    • output: probability distribution over the whole vocabulary
  • Large vocabulary
    • increases network size
    • decreases training and decoding speed
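
As a concrete illustration (a minimal numpy sketch with made-up sizes, not part of the original), the 1-hot input is equivalent to selecting one row of the embedding matrix, and the output layer has to score every word in the vocabulary, so both matrices and the softmax cost grow with the vocabulary size V:

import numpy as np

V, d = 50_000, 512             # assumed vocabulary size and embedding dimension
rng = np.random.default_rng(0)

E = rng.normal(size=(V, d))    # input embedding matrix: one row per word
W = rng.normal(size=(d, V))    # output projection: one column per word

word_id = 42                   # a 1-hot input vector is equivalent to this index
x = E[word_id]                 # embedding lookup for the input word

h = np.tanh(x)                 # stand-in for a decoder hidden state
logits = h @ W                 # one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()           # softmax: cost is linear in V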

Problems

Open-vocabulary problems:

  • Many training corpora contain millions of word types
  • Productive word-formation processes (compounding, derivation) allow speakers to form and understand unseen words
  • Names and numbers are morphologically simple, but they are open word classes

Non-solution: ignore rare words

  • Replace OOV words with UNK
  • A vocabulary of 50,000 words covers 95% of text, but 95% is not enough! (see the coverage sketch below)
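
To make the coverage claim concrete, coverage can be measured as the fraction of running tokens whose type falls inside the top-k most frequent types; the sketch below uses a toy corpus (in practice the token list would come from the full training data):

from collections import Counter

def coverage(tokens, vocab_size):
    """Fraction of running tokens covered by the vocab_size most frequent types."""
    counts = Counter(tokens)
    kept = counts.most_common(vocab_size)
    return sum(freq for _, freq in kept) / len(tokens)

# toy corpus; a real measurement would use millions of training tokens
tokens = "the cat sat on the mat while the dog sat on the new mat".split()
print(coverage(tokens, vocab_size=4))   # 10/14 ≈ 0.71 for this toy corpus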

Approximative softmax

Compute the softmax over an “active” subset of the vocabulary $\rightarrow$ smaller weight matrix, faster softmax [1]

  • At training time: restrict the vocabulary to the words occurring in the current training-set partition
  • At test time: determine likely target words from the source text (using a cheap method such as a translation dictionary), as sketched below
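
A hedged sketch of the test-time idea (names and sizes are illustrative, not taken from [1]): the output weight matrix is sliced down to a candidate list built from frequent target words plus dictionary translations of the source words, and the softmax is computed only over that subset:

import numpy as np

def candidate_softmax(h, W_out, candidate_ids):
    """Softmax over a restricted candidate set instead of the full vocabulary."""
    W_sub = W_out[:, candidate_ids]           # keep only the "active" columns
    logits = h @ W_sub
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()                # distribution over the candidates

rng = np.random.default_rng(0)
h = rng.normal(size=4)                        # toy decoder state (d = 4)
W_out = rng.normal(size=(4, 10))              # toy output matrix (V = 10)

# candidate list: a few frequent words plus dictionary translations of the source
candidate_ids = np.array([0, 1, 2, 7, 9])
print(candidate_softmax(h, W_out, candidate_ids))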

Limitations:

  • Allows a larger vocabulary, but the vocabulary is still not open
  • Networks may not learn good representations of rare words

Back-off models

  • Replace rare words with UNK at training time [2]
  • When the system produces UNK, align the UNK to a source word and translate that word with a back-off method (see the sketch below)
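
The following sketch (plain Python, with a made-up toy dictionary and attention matrix) shows both steps: rare words are mapped to UNK during preprocessing, and at decoding time each produced UNK is aligned to the source position with the highest attention weight and translated with a dictionary, copying the source word when no entry exists:

def replace_rare(tokens, vocab):
    """Training-time preprocessing: map words outside the vocabulary to <unk>."""
    return [t if t in vocab else "<unk>" for t in tokens]

def backoff_translate(output, attention, source, dictionary):
    """For each produced <unk>, follow the attention weights back to the source
    word it aligns to and translate that word with a back-off dictionary."""
    result = []
    for token, weights in zip(output, attention):
        if token == "<unk>":
            j = max(range(len(source)), key=lambda k: weights[k])  # argmax alignment
            result.append(dictionary.get(source[j], source[j]))    # copy if no entry
        else:
            result.append(token)
    return result

# toy example with invented attention weights and dictionary entries
print(replace_rare("the Abwasserrohr leaks".split(), {"the", "leaks"}))  # ['the', '<unk>', 'leaks']
source = ["das", "Abwasserrohr"]
output = ["the", "<unk>"]
attention = [[0.9, 0.1], [0.2, 0.8]]          # one row of weights per output word
dictionary = {"Abwasserrohr": "sewage pipe"}
print(backoff_translate(output, attention, source, dictionary))  # ['the', 'sewage pipe']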

Limitations:

  • Compounds: hard to model 1-to-many relationships
  • Morphology: hard to predict inflection with back-off dictionary
  • Names: if alphabets differ, we need transliteration
  • Alignment: attention model unreliable

Byte-pair encoding

Bottom-up character merging: [3]

  • Starting point: char-level representation $\rightarrow$ computationally expensive
  • Compress representation based on information theory $\rightarrow$ byte-pair encoding
  • Repeatedly replace most frequent symbol pair (‘A’, ‘B’) with ‘AB’
  • Hyper-parameter: when to stop $\rightarrow$ controls vocabulary size
import re, collections

def get_stats(vocab):
    pairs = collections.defaultdict(int)                 # pair counts default to 0
    for word, freq in vocab.items():
        symbols = word.split()                           # split word into symbols
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq    # count symbol bigram
    return pairs


def merge_vocab(pair, v_in):
    v_out = {}
    bigram = re.escape(" ".join(pair))                   # escape the pair, e.g. 'e s'
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')      # match only whole, space-separated symbols
    for word in v_in:
        w_out = p.sub("".join(pair), word)               # merge the pair by removing the space
        v_out[w_out] = v_in[word]                        # keep the word's frequency
    return v_out


if __name__ == '__main__':
    vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
    num_merges = 10
    for i in range(num_merges):
        pairs = get_stats(vocab)
        best = max(pairs, key=pairs.get)                 # most frequent pair
        vocab = merge_vocab(best, vocab)
        print(best)

Why BPE?

  • Open-vocabulary: merge operations learned on the training set can be applied to unknown words (see the sketch after this list)
  • Compression of frequent character sequences improves efficiency $\rightarrow$ trade-off between text length and vocabulary size
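
To make the open-vocabulary point concrete, the learned merges can be replayed, in order, on a word that never occurred in training; the sketch below uses a merge list collected by hand from the `best` pairs printed by the script above (treat the exact list as illustrative):

def apply_bpe(word, merges):
    """Segment a new word by replaying the learned merge operations in order."""
    symbols = list(word) + ["</w>"]               # start from single characters
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]        # apply the merge in place
            else:
                i += 1
    return symbols

# merges printed by the toy run above, in the order they were learned
merges = [("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w"),
          ("n", "e"), ("ne", "w"), ("new", "est</w>"), ("low", "</w>"), ("w", "i")]
print(apply_bpe("lowest", merges))                # ['low', 'est</w>']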

Char-level Models

Advantages:

  • (mostly) open-vocabulary
  • No heuristic or language-specific segmentation
  • NN can conceivably learn from raw char sequences

Drawbacks:

  • Increased sequence length slows training/decoding (2x–4x increase in training time)
  • Naive char-level encoder-decoders are currently resource-limited

Open questions:

  • On which level should we represent meaning?
  • On which level should attention operate?

Hierarchical model: backoff

  • Word-level model produces UNKs [4]
  • For each UNK, a char-level model predicts a word based on the word-level hidden state (see the sketch below)
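
A minimal sketch of the two-stage control flow (the char-level model here is a stand-in lambda; a real one would generate characters autoregressively from the word-level hidden state):

def backoff_decode(word_outputs, hidden_states, char_model):
    """Keep ordinary words from the word-level model; for every <unk> it emitted,
    let a char-level model spell out a word from the corresponding hidden state."""
    tokens = []
    for word, state in zip(word_outputs, hidden_states):
        tokens.append(char_model(state) if word == "<unk>" else word)
    return tokens

# stand-in char model and dummy hidden states, for illustration only
toy_char_model = lambda state: "<spelled-by-char-model>"
print(backoff_decode(["the", "<unk>", "arrived"], [None, None, None], toy_char_model))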

Pros:

  • prediction is more flexible than dictionary look-up
  • more efficient than pure char-level translation

Cons:

  • independence assumptions between main model and backoff model

Char-level output

  • No word segmentation on target side [5]
  • Encoder is BPE-level

Char-level input

Hierarchical representation: RNN states represent words, but their representations are computed by a char-level LSTM [6] (see the sketch below)
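
A sketch of this composition step in PyTorch (layer sizes are assumed, not the exact architecture of [6]): each word's characters are run through a bidirectional char-level LSTM, and the final states are concatenated into one vector per word, which the word-level RNN would then consume:

import torch
import torch.nn as nn

class CharToWord(nn.Module):
    """Compose a word vector from its characters with a bidirectional LSTM."""
    def __init__(self, n_chars=100, char_dim=32, word_dim=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_lstm = nn.LSTM(char_dim, word_dim // 2,
                                 batch_first=True, bidirectional=True)

    def forward(self, char_ids):                 # char_ids: (n_words, max_word_len)
        embedded = self.char_emb(char_ids)       # (n_words, max_word_len, char_dim)
        _, (h, _) = self.char_lstm(embedded)     # h: (2, n_words, word_dim // 2)
        return torch.cat([h[0], h[1]], dim=-1)   # one vector per word

words = torch.randint(0, 100, (3, 6))            # 3 words, padded to 6 char ids each
print(CharToWord()(words).shape)                 # torch.Size([3, 128])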


Fully char-level

  • Goal: get rid of word boundaries [7]
  • Target side: char-level RNNs
  • Source side: convolution and max-pooling layers (sketched below)
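
A PyTorch sketch of the source-side idea (kernel size, pooling width, and dimensions are assumed, not the exact architecture of [7]): character embeddings are convolved and max-pooled over time, shortening the sequence before any recurrent layers:

import torch
import torch.nn as nn

class CharConvEncoder(nn.Module):
    """Embed characters, convolve, then max-pool over time to shorten the sequence."""
    def __init__(self, n_chars=100, char_dim=64, hidden=128, pool_width=5):
        super().__init__()
        self.emb = nn.Embedding(n_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, hidden, kernel_size=3, padding=1)
        self.pool = nn.MaxPool1d(pool_width)

    def forward(self, char_ids):                  # char_ids: (batch, seq_len)
        x = self.emb(char_ids).transpose(1, 2)    # (batch, char_dim, seq_len)
        x = torch.relu(self.conv(x))              # (batch, hidden, seq_len)
        x = self.pool(x)                          # (batch, hidden, seq_len // 5)
        return x.transpose(1, 2)                  # (batch, shorter_seq, hidden)

chars = torch.randint(0, 100, (1, 40))            # one source sentence, 40 characters
print(CharConvEncoder()(chars).shape)             # torch.Size([1, 8, 128])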


References
