Mask modeling plays a crucial role in pre-training language models. This note provides a short summary.

BERT[1] applies masked language modeling (MLM) to sequences of text segments. Specifically, BERT uses a uniform masking rate of 15% after WordPiece tokenization, and replaces each masked token with
1) the [MASK] token 80% of the time,
2) a random word 10% of the time, and
3) the unchanged token the remaining 10% of the time, to bias the representation towards the actual observed word.

The random replacement only occurs for 1.5% of all tokens (i.e., 10% of 15%), so it does not seem to harm the model’s language understanding capacity.
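To make the scheme concrete, here is a minimal Python sketch of the 80/10/10 corruption over a list of token ids. The `mask_tokens` helper, the `[MASK]` id, and the vocabulary size are illustrative assumptions, not BERT's actual preprocessing code.

```python
import random

MASK_ID = 103        # id of [MASK] in the bert-base-uncased WordPiece vocab
VOCAB_SIZE = 30522   # bert-base-uncased vocabulary size (assumed for illustration)

def mask_tokens(token_ids, mask_rate=0.15):
    """Apply BERT's 80/10/10 masking; returns (corrupted input, MLM labels)."""
    inputs = list(token_ids)
    labels = [-100] * len(inputs)        # -100 marks positions that are not predicted
    for i in range(len(inputs)):
        if random.random() < mask_rate:  # select ~15% of positions
            labels[i] = inputs[i]        # the model must recover the original token
            r = random.random()
            if r < 0.8:                  # 80%: replace with [MASK]
                inputs[i] = MASK_ID
            elif r < 0.9:                # 10%: replace with a random token
                inputs[i] = random.randrange(VOCAB_SIZE)
            # remaining 10%: keep the observed token unchanged
    return inputs, labels
```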

BERT applies static masking: masks are generated once during preprocessing (the data is duplicated so that each sequence receives a few fixed maskings) and kept unchanged throughout training, while RoBERTa adopts dynamic masking, generating a new mask on the fly every time a sequence is fed to the model.

### Span Masking

Span masking comes in several variants, including random span masking, knowledge/entity masking, and whole word masking:

1. ERNIE[6] applies knowledge masking to the input sequence, including entity-level and phrase-level masking, to integrate phrase and entity knowledge into the representation.
2. SpanBERT[2] employs random span masking under a clamped geometric distribution.
3. BERT-WWM[7] uses whole word masking (for Chinese BERT) rather than masking individual subword pieces, so that the meaning of the whole word is retained (see the sketch after this list).
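Below is a minimal sketch of whole word masking applied to WordPiece output; the `##` continuation prefix follows BERT's convention, while the grouping logic is an illustrative assumption rather than the released BERT-WWM code.

```python
import random

def whole_word_mask(tokens, mask_rate=0.15):
    """If any WordPiece of a word is selected, mask all pieces of that word."""
    # Group subword indices into words: pieces starting with "##" continue the previous word.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    for word in words:
        if random.random() < mask_rate:
            for i in word:
                masked[i] = "[MASK]"
    return masked

print(whole_word_mask(["the", "weather", "is", "unpredict", "##able", "today"], mask_rate=0.5))
```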

SpanBERT[2] iteratively samples each span’s length from a (clamped) geometric distribution $\ell \sim \textrm{Geo}(p)$, i.e.,

$$
P(\ell = k) = (1-p)^{k-1}\, p, \quad k = 1, 2, \dots,
$$

which is skewed towards shorter spans ($p=0.2$). It also clips $\ell$ with $\ell = \min(\ell, 10)$, yielding a mean span length of $\bar{\ell}=3.8$. SpanBERT measures span length in complete words, not subword tokens, making the masked spans even longer.
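One way to realize this clipped distribution is to renormalize the Geo(p) probabilities over lengths 1 through 10 and sample from them, which reproduces the reported mean of roughly 3.8. The numpy sketch below is an illustrative assumption, not SpanBERT's released implementation.

```python
import numpy as np

p, max_len = 0.2, 10
lengths = np.arange(1, max_len + 1)
probs = p * (1 - p) ** (lengths - 1)   # Geo(p) pmf over lengths 1..10
probs /= probs.sum()                   # renormalize after clipping the tail
print((lengths * probs).sum())         # mean span length ~= 3.8
print(np.random.choice(lengths, size=5, p=probs))  # sample a few span lengths
```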

The masking budget is the same as in BERT: 15% of tokens are masked in total, of which 80% are replaced with [MASK], 10% with random tokens, and 10% are left unchanged.

It can be seen from the table that, with the exception of coreference resolution, masking random spans is preferable to the other strategies. Although linguistic masking schemes (named entities and noun phrases) are often competitive with random spans, their performance is not consistent. For coreference resolution, masking random subword tokens is preferable to any form of span masking.

MASS[3] masks a contiguous fragment of the encoder input, replacing each masked token with a special [MASK] token so that the input length is unchanged. The decoder then predicts the masked tokens autoregressively.
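A rough sketch of MASS-style corruption over a token list (the 50% fragment length is MASS's default setting; the helper name and the [MASK] string are illustrative):

```python
import random

def mass_corrupt(tokens, frac=0.5):
    """Mask one contiguous fragment; the encoder input keeps its original length."""
    k = max(1, int(len(tokens) * frac))        # fragment length (50% in MASS)
    start = random.randrange(len(tokens) - k + 1)
    enc_input = tokens[:start] + ["[MASK]"] * k + tokens[start + k:]
    dec_target = tokens[start:start + k]       # predicted autoregressively by the decoder
    return enc_input, dec_target

print(mass_corrupt(["the", "cat", "sat", "on", "the", "mat"]))
```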

BART[4] replaces corrupted contiguous spans of the encoder input with a single [MASK] each, and trains the decoder autoregressively using a Transformer encoder-decoder architecture.

BART allows any type of document corruption, including:

• Token Deletion: random tokens are deleted from the input.
• Text Infilling: a number of text spans are corrupted, with span lengths drawn from a Poisson distribution ($\lambda=3$). Each span is replaced with a single [MASK] token; 0-length spans correspond to the insertion of [MASK] tokens (see the sketch after this list).
• Sentence Permutation: the document is split into sentences based on full stops, which are then shuffled in random order.
• Document Rotation: a token is chosen uniformly at random, and the document is rotated so that it begins with that token.
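As a concrete illustration of text infilling, here is a small sketch over a plain token list; the number of spans per example and the helper name are assumptions rather than BART's exact preprocessing.

```python
import random
import numpy as np

def text_infill(tokens, num_spans=2, lam=3.0):
    """Replace spans with lengths drawn from Poisson(lam) by a single [MASK] each."""
    tokens = list(tokens)
    for _ in range(num_spans):
        length = min(np.random.poisson(lam), len(tokens))  # clamp so the span fits
        start = random.randrange(len(tokens) - length + 1)
        # A 0-length span inserts a [MASK]; a longer span collapses into one [MASK].
        tokens[start:start + length] = ["[MASK]"]
    return tokens

print(text_infill("the quick brown fox jumps over the lazy dog".split()))
```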

## T5 Span Mask

T5[5] replaces each corrupted span in the input sequence with a unique sentinel token, and predicts the concatenation of the corrupted spans, each prefixed by the sentinel token used in the input. Specifically, T5 first replaces the entirety of each consecutive span of corrupted tokens with a unique sentinel token; the target sequence is then the concatenation of the corrupted spans, each prefixed by the sentinel used to replace it in the input.
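A minimal sketch of this input/target construction, assuming sentinel tokens named `<extra_id_0>`, `<extra_id_1>`, ... (T5's vocabulary convention) and pre-chosen corrupted spans; T5's actual pipeline also samples the spans, which is omitted here, and closes the target with a final sentinel.

```python
def t5_span_corrupt(tokens, spans):
    """Replace each corrupted span with a unique sentinel; the target concatenates
    the spans, each prefixed by its sentinel, followed by a final sentinel."""
    inputs, targets = [], []
    prev_end = 0
    for sid, (start, end) in enumerate(spans):   # spans: sorted, non-overlapping (start, end)
        sentinel = f"<extra_id_{sid}>"
        inputs += tokens[prev_end:start] + [sentinel]
        targets += [sentinel] + tokens[start:end]
        prev_end = end
    inputs += tokens[prev_end:]
    targets.append(f"<extra_id_{len(spans)}>")   # final sentinel ends the target
    return inputs, targets

tokens = "thank you for inviting me to your party last week".split()
print(t5_span_corrupt(tokens, [(2, 4), (8, 9)]))
# (['thank', 'you', '<extra_id_0>', 'me', 'to', 'your', 'party', '<extra_id_1>', 'week'],
#  ['<extra_id_0>', 'for', 'inviting', '<extra_id_1>', 'last', '<extra_id_2>'])
```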

As shown in the table, the BERT-style objective corrupts 15% of the input tokens and reconstructs the original uncorrupted sequence; the MASS-style objective is identical except that the random token-swapping step is omitted.

The first two rows (i.e., the BERT-style and MASS-style objectives) predict the entire uncorrupted text sequence, which requires self-attention over long sequences in the decoder. To avoid this, T5 applies the strategies in the last two rows: replacing corrupted spans with sentinel tokens, and dropping corrupted tokens. The last row (i.e., drop corrupted tokens) simply drops the corrupted tokens from the input sequence completely and tasks the model with reconstructing the dropped tokens in order.

It can be seen from the table that "dropping corrupted spans" completely produced a small improvement in the GLUE score thanks to the significatly higher score on CoLA.
The first two rows (i.e., BERT-style and MASS-style objectives) predict the entire uncorrupted text span which requires self-attention over long sequences in the decoder. To avoid this, T5 applies the strategies in the last two rows. The last row(i.e., Drop corrupted tokens) simply drops the corrupted tokens from the input sequence completely and task the model with reconstructing the dropped tokens in order. (60.45 vs avg. baseline 53.84). However, dropping tokens completely performed worse than replacing with sentinel tokens on SuperGLUE. The last two rows’ variants make the target sequence shorter and consequently make training faster.
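For comparison, the "drop corrupted tokens" variant can be sketched with the same hypothetical span format as the earlier T5 example: the corrupted tokens are simply omitted from the input, and the target is their in-order concatenation (no sentinels are needed).

```python
def drop_corrupted(tokens, spans):
    """Drop the corrupted spans from the input; the target lists the dropped tokens in order."""
    inputs, targets = [], []
    prev_end = 0
    for start, end in spans:                 # spans: sorted, non-overlapping (start, end)
        inputs += tokens[prev_end:start]
        targets += tokens[start:end]
        prev_end = end
    inputs += tokens[prev_end:]
    return inputs, targets

tokens = "thank you for inviting me to your party last week".split()
print(drop_corrupted(tokens, [(2, 4), (8, 9)]))
# (['thank', 'you', 'me', 'to', 'your', 'party', 'week'], ['for', 'inviting', 'last'])
```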

For attribution in academic contexts, please cite this work as: