Masked modeling plays a crucial role in pre-training language models. This note provides a short summary.

BERT[1] applies masked language modeling (MLM) to sequences of text segments. Specifically, BERT uses a uniform masking rate of 15% after WordPiece tokenization, and replaces each token selected for masking with
1) the [MASK] token 80% of the time,
2) a random token 10% of the time, and
3) the original token the remaining 10% of the time, to bias the representation towards the actual observed word.

Since random replacement only occurs for 1.5% of all tokens (i.e., 10% of the 15% selected for masking), it does not seem to harm the model's language understanding capacity.
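A minimal sketch of the 80/10/10 rule on a list of token ids is shown below; the `MASK_TOKEN_ID` value, the `-100` ignore-index convention, and the function name are illustrative assumptions rather than BERT's actual implementation.

```python
import random

MASK_TOKEN_ID = 103  # hypothetical [MASK] id; the real value depends on the vocabulary


def bert_mlm_corrupt(token_ids, vocab_size, mask_rate=0.15):
    """Apply BERT-style MLM corruption with the 80/10/10 replacement rule."""
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)          # -100 = position ignored by the loss
    for i, tok in enumerate(token_ids):
        if random.random() >= mask_rate:      # ~85% of positions are left untouched
            continue
        labels[i] = tok                       # the model must recover the original token
        r = random.random()
        if r < 0.8:
            inputs[i] = MASK_TOKEN_ID         # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the observed token unchanged
    return inputs, labels
```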

BERT applies static masking: the masks are generated ahead of time (the training data is duplicated with several distinct maskings) and remain unchanged throughout training, while RoBERTa adopts dynamic masking, sampling a new masking pattern on the fly every time a sequence is fed to the model.
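The difference can be sketched as below, reusing the `bert_mlm_corrupt` sketch above; `corpus`, `vocab_size`, `num_epochs`, and `train_step` are hypothetical stand-ins for the actual training loop.

```python
def train_with_static_masking(corpus, vocab_size, num_epochs, train_step):
    # Masks are generated once before training and reused every epoch.
    cached = [bert_mlm_corrupt(seq, vocab_size) for seq in corpus]
    for _ in range(num_epochs):
        for inputs, labels in cached:          # identical masking pattern each epoch
            train_step(inputs, labels)


def train_with_dynamic_masking(corpus, vocab_size, num_epochs, train_step):
    # RoBERTa-style: a fresh masking pattern is sampled every time a sequence is seen.
    for _ in range(num_epochs):
        for seq in corpus:
            train_step(*bert_mlm_corrupt(seq, vocab_size))
```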

### SpanBERT Implementation

1. ERNIE[6] applies knowledge masking to the input sequence, including entity-level and phrase-level masking, to integrate knowledge into the language representation.
2. SpanBERT[2] employs random span masking, with span lengths drawn from a clipped geometric distribution (detailed below).
3. BERT-WWM[7] uses whole word masking (for Chinese BERT) rather than masking subword pieces at random, so that the whole meaning of a word is retained (a short grouping sketch follows this list).
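For English WordPiece vocabularies, whole word masking can be implemented by grouping continuation pieces (those prefixed with "##") back into words and masking each group jointly; Chinese BERT-WWM instead relies on a word segmenter to define the groups. A rough sketch of the grouping step, with a hypothetical function name:

```python
def whole_word_spans(wordpiece_tokens):
    """Group WordPiece indices into whole-word spans ('##' marks a continuation piece)."""
    spans, current = [], []
    for i, tok in enumerate(wordpiece_tokens):
        if tok.startswith("##") and current:
            current.append(i)        # continuation piece extends the current word
        else:
            if current:
                spans.append(current)
            current = [i]            # a new word starts here
    if current:
        spans.append(current)
    return spans


# Example: ["the", "un", "##believ", "##able", "story"] -> [[0], [1, 2, 3], [4]]
# Masking then selects whole spans, so "un ##believ ##able" is masked together.
```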

SpanBERT[2] iteratively samples each span's length from a (clipped) geometric distribution $\ell \sim \textrm{Geo}(p)$, i.e.,

$$P(\ell = k) \propto (1-p)^{k-1}\, p, \quad k = 1, 2, \dots,$$

which is skewed towards shorter spans ($p=0.2$). It also clips $\ell$ at $\ell_{\max}=10$, i.e. $\ell = \min(\ell, 10)$, yielding a mean span length of $\bar{\ell}=3.8$. SpanBERT measures span length in complete words, not subword tokens, so the masked spans are even longer in tokens.

The masking budget is the same as in BERT: 15% of tokens in total, with 80% of the masked tokens replaced by [MASK], 10% by random tokens, and 10% left unchanged, except that the replacement is applied at the span level.
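A minimal sketch of span selection under these settings follows; it samples from the geometric distribution truncated at $\ell_{\max}$ (one common way to realize the clipping), while the real SpanBERT implementation additionally forces spans to align with word boundaries and trims spans against the remaining budget.

```python
import random


def sample_span_length(p=0.2, max_len=10):
    """Span length from a geometric distribution truncated at max_len.

    P(l = k) is proportional to (1 - p)**(k - 1) * p for k = 1..max_len,
    which for p = 0.2 and max_len = 10 gives a mean of roughly 3.8 words.
    """
    lengths = range(1, max_len + 1)
    weights = [(1 - p) ** (k - 1) * p for k in lengths]
    return random.choices(lengths, weights=weights)[0]


def choose_spans(num_words, mask_ratio=0.15, p=0.2, max_len=10):
    """Pick non-overlapping word spans until about mask_ratio of the words are covered."""
    budget = round(num_words * mask_ratio)
    covered, spans = set(), []
    while len(covered) < budget:
        length = sample_span_length(p, max_len)
        start = random.randrange(num_words)
        span = set(range(start, min(start + length, num_words)))
        if span & covered:
            continue                            # resample on overlap (simplification)
        covered |= span
        spans.append((min(span), len(span)))    # (start word index, span length)
    return spans
```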

SpanBERT's ablation results show that, with the exception of coreference resolution, masking random spans is preferable to the other strategies. Although linguistically informed masking schemes (named entities and noun phrases) are often competitive with random spans, their performance is not consistent. For coreference resolution, masking random subword tokens is preferable to any form of span masking.

MASS[3] masks a contiguous fragment of the input sentence; the encoder replaces each masked token with a special [MASK] token, so the overall length is unchanged, and the decoder then predicts the masked tokens autoregressively.

BART[4] replaces corrupted contiguous spans of the encoder input with a single [MASK] each, and trains the decoder autoregressively, using a Transformer encoder-decoder architecture.

BART allows any type of document corruption, including:

• Token Deletion: random tokens are deleted from the input.
• Text Infilling: a number of text spans are corrupted, with span lengths drawn from a Poisson distribution ($\lambda=3$). Each span is replaced with a single [MASK] token; a 0-length span corresponds to inserting a [MASK] token (a sketch follows this list).
• Sentence Permutation: the document is divided into sentences based on full stops, which are then shuffled into a random order.
• Document Rotation: a token is chosen uniformly at random, and the document is rotated so that it begins with that token.
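Text infilling can be sketched roughly as below; the mask symbol, the corruption budget, and the way span starts are chosen are simplifications for illustration, not BART's actual implementation.

```python
import math
import random

MASK = "[MASK]"  # placeholder symbol; the real BART vocabulary defines its own mask token


def sample_poisson(lam):
    """Knuth's method for drawing a Poisson-distributed integer."""
    threshold, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= random.random()
        if prod <= threshold:
            return k
        k += 1


def text_infilling(tokens, mask_ratio=0.3, lam=3.0):
    """Replace Poisson-length spans with a single mask token each (BART-style sketch)."""
    out = list(tokens)
    budget = int(len(tokens) * mask_ratio)    # approximate number of tokens to corrupt
    corrupted = 0
    while corrupted < budget and out:
        length = sample_poisson(lam)
        # Length 0 inserts a mask between tokens; length > 0 collapses a span to one mask.
        start = random.randrange(len(out) + 1) if length == 0 else random.randrange(len(out))
        out[start:start + length] = [MASK]    # spans may touch earlier masks; acceptable for a sketch
        corrupted += max(length, 1)
    return out
```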