A summary of Large-scale Pre-trained Models (PTMs).
Summary
As shown in the following table, I summarize the mainstream large-scale PTMs in NLP. The size of PTMs has grown rapidly in recent years, from 2.6 billion parameters up to 530 billion for the largest dense model listed here (and over a trillion for the sparse Switch Transformers). Although their training methods differ, these models all use the Transformer as the standard backbone, owing to the efficient parallel computation of the self-attention mechanism. Since training large-scale models requires massive amounts of unsupervised data, research on scaling up PTMs has focused on high-resource languages such as English and Chinese.
Model | #Params | Masked LM | Causal LM | Prefix LM | Seq2Seq LM | Language | Pre-Training Data | Training Parallelism | Official Impl. |
---|---|---|---|---|---|---|---|---|---|
T5 | 11B | ✘ | ✘ | ✘ | ✔ | English | C4 Corpus (~750GB) | model / data parallelism | TensorFlow |
mT5 | 13B | ✘ | ✘ | ✘ | ✔ | 101 languages | mC4 Corpus (6.3T tokens) | - | TensorFlow |
Switch Transformers | 1571B | ✘ | ✘ | ✘ | ✔ | English | C4 Corpus (~750GB) | Mixture of Experts (MoE) | TensorFlow |
CPM-2 | 11B | ✘ | ✘ | ✘ | ✔ | Chinese, English | WuDao Corpus (2.3TB Chinese + 300GB English) | - | PyTorch |
CPM-2-MoE | 198B | ✘ | ✘ | ✘ | ✔ | Chinese, English | WuDao Corpus (2.3TB Chinese + 300GB English) | MoE | PyTorch |
Turing-NLG | 17B | ✘ | ✔ | ✘ | ✘ | English | English data | DeepSpeed, ZeRO | - |
GPT-3 | 175B | ✘ | ✔ | ✘ | ✘ | English | cleaned CommonCrawl, WebText | model parallelism | - |
CPM | 2.6B | ✘ | ✔ | ✘ | ✘ | Chinese | Chinese corpus (100GB) | - | PyTorch |
HyperCLOVA | 204B | ✘ | ✔ | ✘ | ✘ | Korean | Korean data | - | - |
PanGu-$\alpha$ | 200B | ✘ | ✔ | ✘ | ✘ | Chinese | Chinese data (1.1TB, 250B tokens) | MindSpore Auto-parallel | MindSpore |
DeBERTa1.5B | 1.5B | ✔ | ✘ | ✘ | ✘ | English | English corpus | - | PyTorch |
ERNIE 3.0 | 10B | ✔ | ✔ | ✘ | ✘ | Chinese, English | Chinese data (4TB); English | - | PaddlePaddle |
Yuan 1.0 | 245B | ✘ | ✔ | ✔ | ✘ | Chinese | Chinese data (5TB) | - | - |
Megatron-Turing NLG | 530B | ✘ | ✔ | ✘ | ✘ | English | English data | Megatron-LM, DeepSpeed | - |
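The objective columns in the table mostly come down to which self-attention mask is applied to the stacked Transformer blocks: a causal LM masks all future positions, a prefix LM attends bidirectionally over a prefix and causally over the rest, a seq2seq LM pairs a bidirectional encoder with a causal decoder, and a masked LM instead corrupts input tokens and predicts them with fully bidirectional attention. Below is a minimal, illustrative PyTorch sketch of the causal and prefix-LM masks; the function names and the "True means visible" convention are my own and are not taken from any of the cited codebases.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True = the query may attend to this position; lower-triangular => causal LM.
    return torch.tril(torch.ones(seq_len, seq_len)).bool()

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    # Start from the causal mask, then make the whole prefix visible to every
    # position: bidirectional over the prefix, causal over the remainder.
    mask = causal_mask(seq_len)
    mask[:, :prefix_len] = True
    return mask

print(causal_mask(4).int())        # lower-triangular pattern
print(prefix_lm_mask(4, 2).int())  # first 2 columns fully visible
```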
According to the design of their pre-training architectures, large-scale PTMs can be broadly classified into three categories: encoder only, decoder only, and encoder-decoder. The majority of large PTMs adopt the decoder-only or encoder-decoder architecture, whereas few adopt the encoder-only design. This is because encoder-only architectures, such as BERT and DeBERTa, stack Transformer encoders to attend to bidirectional context, and this bidirectional nature prevents them from being directly applied to NLG tasks. In contrast, decoder-only models are naturally suited to NLG and can also perform NLU tasks via prompt-based methods; examples include the GPT series and Turing-NLG.
- Encoder only, i.e., pre-training on stacked Transformer encoders. Example: DeBERTa1.5B [10].
- Decoder only. This line of large PTMs pretrains Transformer decoders, applying auto-regressive masks so that the current token cannot attend to future ones (see the sketch after this list). Examples: Turing-NLG [5], GPT-3 [6], CPM [7], HyperCLOVA [8], PanGu-$\alpha$ [9], Yuan 1.0 [12], Megatron-Turing NLG [13].
- Encoder-decoder, including (1) the conventional sequence-to-sequence encoder-decoder, such as T5 [1], mT5 [2], CPM-2 [3], and Switch Transformers [4]; and (2) the unified encoder-decoder, such as ERNIE 3.0 [11].
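To make the three classes concrete, here is a toy sketch using PyTorch's built-in Transformer modules. The layer sizes are arbitrary and this only illustrates the masking and architecture differences, not the configuration of any model in the table; note that PyTorch's additive mask convention uses -inf to block a position.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers, seq_len = 256, 4, 2, 8
x = torch.randn(seq_len, 1, d_model)   # (sequence, batch, hidden)

# Additive attention mask: -inf strictly above the diagonal blocks future positions.
causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

# 1) Encoder only (BERT/DeBERTa style): bidirectional self-attention, no mask.
enc_stack = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, n_heads), n_layers)
h_bidir = enc_stack(x)

# 2) Decoder only (GPT style): the same stacked self-attention blocks, but each
#    forward pass applies the auto-regressive mask so position i never sees i+1, i+2, ...
h_causal = enc_stack(x, mask=causal)

# 3) Encoder-decoder (T5/mT5/CPM-2 style): bidirectional encoder plus a causal
#    decoder that also cross-attends to the encoder outputs.
seq2seq = nn.Transformer(d_model=d_model, nhead=n_heads,
                         num_encoder_layers=n_layers, num_decoder_layers=n_layers)
tgt = torch.randn(seq_len, 1, d_model)
h_s2s = seq2seq(x, tgt, tgt_mask=causal)

print(h_bidir.shape, h_causal.shape, h_s2s.shape)
```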
For attribution in academic contexts, please cite this work as:

```
@misc{chai2021scaling-ptms-summary,
  author       = {Chai, Yekun},
  title        = {{Scaling Up Pre-Training Models: A Summary}},
  year         = {2021},
  howpublished = {\url{https://cyk1337.github.io/notes/2021/10/09/PTMs/Scaling-Up-Pre-trained-Models/}},
}
```
References
- 1. (T5) Raffel, Colin, et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." JMLR (2020).
- 2. (mT5) Xue, Linting, et al. "mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer." arXiv preprint arXiv:2010.11934 (2020).
- 3. (CPM-2) Zhang, Zhengyan, et al. "CPM-2: Large-scale Cost-effective Pre-trained Language Models." arXiv preprint arXiv:2106.10715 (2021).
- 4. (Switch Transformers) Fedus, William, Barret Zoph, and Noam Shazeer. "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." arXiv preprint arXiv:2101.03961 (2021).
- 5. (Turing-NLG) "Turing-NLG: A 17-billion-parameter language model by Microsoft." Microsoft Research Blog, February 13, 2020.
- 6. (GPT-3) Brown, Tom B., et al. "Language Models are Few-Shot Learners." arXiv preprint arXiv:2005.14165 (2020).
- 7. (CPM) Zhang, Zhengyan, et al. "CPM: A Large-scale Generative Chinese Pre-trained Language Model." AI Open 2 (2021): 93-99.
- 8. (HyperCLOVA) Kim, Boseop, et al. "What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers." arXiv preprint arXiv:2109.04650 (2021).
- 9. (PanGu-$\alpha$) Zeng, Wei, et al. "PanGu-$\alpha$: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation." arXiv preprint arXiv:2104.12369 (2021).
- 10. (DeBERTa) He, Pengcheng, et al. "DeBERTa: Decoding-enhanced BERT with Disentangled Attention." arXiv preprint arXiv:2006.03654 (2020).
- 11. (ERNIE 3.0) Sun, Yu, et al. "ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation." arXiv preprint arXiv:2107.02137 (2021).
- 12. (Yuan 1.0) Wu, Shaohua, et al. "Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning." arXiv preprint arXiv:2110.04725 (2021).
- 13. (Megatron-Turing NLG) Paresh Kharya and Ali Alvi. "Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World's Largest and Most Powerful Generative Language Model." Blog post, October 11, 2021.