Yekun's ML Notes

Some machine learning notes and writeup.

Scaling Up Pre-trained Models: A Summary

A summary of Large-scale Pre-trained Models (PTMs).


As shown in the following table, I summarize the mainstream large-scale PTMs in NLP. PTMs have grown steadily larger in recent years: the models listed here range from 1.5 billion parameters to 530 billion for dense models, and sparse Mixture-of-Experts models exceed a trillion. Although the training methods differ among these models, they all use the Transformer as the backbone, because self-attention lends itself to efficient parallel computation. Since training large-scale models requires massive amounts of unlabeled data, research on scaling up PTMs focuses on high-resource languages such as English and Chinese.

| Model | #Params | Masked LM | Causal LM | Prefix LM | Seq2Seq LM | Language | Pre-Training Data | Training Parallelism | Official Impl. |
|---|---|---|---|---|---|---|---|---|---|
| T5 | 11B | | | | | English | C4 Corpus (~750GB) | model / data parallelism | TensorFlow |
| mT5 | 13B | | | | | 101 languages | mC4 Corpus (6.3T tokens) | - | TensorFlow |
| Switch Transformers | 1571B | | | | | English | C4 Corpus (~750GB) | Mixture of Experts (MoE) | TensorFlow |
| CPM-2 | 11B | | | | | Chinese, English | WuDao Corpus (2.3TB Chinese + 300GB English) | - | PyTorch |
| CPM-2-MoE | 198B | | | | | Chinese, English | WuDao Corpus (2.3TB Chinese + 300GB English) | MoE | PyTorch |
| Turing-NLG | 17B | | | | | English | English data | DeepSpeed; ZeRO | - |
| GPT-3 | 175B | | | | | English | cleaned CommonCrawl, WebText | model parallelism | - |
| CPM | 2.6B | | | | | Chinese | Chinese corpus (100GB) | - | PyTorch |
| HyperCLOVA | 204B | | | | | Korean | Korean data | - | - |
| PanGu-$\alpha$ | 200B | | | | | Chinese | Chinese data (1.1TB, 250B tokens) | MindSpore Auto-parallel | MindSpore |
| DeBERTa1.5B | 1.5B | | | | | English | English corpus | - | PyTorch |
| ERNIE 3.0 | 10B | | | | | Chinese, English | Chinese data (4TB); English | - | PaddlePaddle |
| Yuan 1.0 | 245B | | | | | Chinese | Chinese data (5TB) | - | - |
| Megatron-Turing NLG | 530B | | | | | English | English data | Megatron-LM, DeepSpeed | - |
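
To give a rough sense of why many entries in the table rely on some form of training parallelism, the sketch below estimates the memory needed just to store the weights of a few of the listed models. It assumes 2 bytes per parameter (fp16 storage), which is an illustrative assumption and not something the table states, and the helper name `weight_memory_gb` is hypothetical. Gradients, optimizer states, and activations are ignored, so actual training footprints are considerably larger.

```python
# Parameter counts (in billions) taken from the table above.
PARAMS_IN_BILLIONS = {
    "CPM": 2.6,
    "T5": 11,
    "GPT-3": 175,
    "Megatron-Turing NLG": 530,
}


def weight_memory_gb(params_in_billions, bytes_per_param=2):
    """Memory in GB (1 GB = 1e9 bytes) needed to hold the weights alone.

    bytes_per_param=2 assumes fp16 storage; this is an illustrative
    assumption, not a figure reported for any of the models.
    """
    return params_in_billions * 1e9 * bytes_per_param / 1e9


for name, size in PARAMS_IN_BILLIONS.items():
    print(f"{name}: ~{weight_memory_gb(size):.1f} GB of fp16 weights")
```

Under these assumptions, the weights of a 175B-parameter model alone occupy about 350 GB, far more than a single accelerator typically holds, so the parameters have to be split across devices.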

According to the design of the pre-training architecture, large-scale PTMs can generally be classified into three classes: encoder only, decoder only, and encoder-decoder. The majority of large PTMs use the decoder-only or encoder-decoder architecture, whereas few adopt the encoder-only design. This is because encoder-only architectures, such as BERT and DeBERTa, stack Transformer encoders that attend to bidirectional context, and this bidirectional nature prevents them from being applied directly to NLG tasks. In contrast, decoder-only models are naturally suited to NLG tasks and can perform NLU tasks via prompt-based methods. Examples include the GPT series and Turing-NLG.

  • Encoder only, i.e., pre-training on stacked Transformer encoders. Examples: DeBERTa1.5B [10].
  • Decoder only. This line of large PTMs pre-trains Transformer decoders, applying auto-regressive masks to prevent the current token from attending to future ones. Examples: Turing-NLG [5], GPT-3 [6], CPM [7], HyperCLOVA [8], PanGu-$\alpha$ [9], Yuan 1.0 [12], Megatron-Turing NLG [13].
  • Encoder-decoder, including (1) conventional sequence-to-sequence encoder-decoder models, such as T5 [1], mT5 [2], and CPM-2 [3]; and (2) unified encoder-decoder models, such as ERNIE 3.0 [11].
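
The difference between these classes largely comes down to which positions each token is allowed to attend to. The sketch below builds the three attention-mask patterns in plain Python: a bidirectional mask (encoder only), a causal mask (decoder only), and a prefix mask, matching the Prefix LM column in the table. The function name `attention_mask` and its arguments are hypothetical, and this illustrates the masking pattern only, not the implementation used by any of the models above.

```python
def attention_mask(seq_len, kind, prefix_len=0):
    """Build a seq_len x seq_len mask as nested lists.

    mask[i][j] == 1 means position i may attend to position j.
    kind is one of "bidirectional", "causal", or "prefix".
    """
    if kind not in {"bidirectional", "causal", "prefix"}:
        raise ValueError(f"unknown mask kind: {kind}")
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            if kind == "bidirectional":
                # Encoder style: every token sees the whole sequence.
                visible = True
            elif kind == "causal":
                # Decoder style: a token sees itself and earlier tokens only.
                visible = j <= i
            else:
                # Prefix LM: the first prefix_len tokens are visible to all
                # positions; the remaining tokens are masked causally.
                visible = j <= i or j < prefix_len
            row.append(1 if visible else 0)
        mask.append(row)
    return mask


for kind in ("bidirectional", "causal", "prefix"):
    print(kind)
    for row in attention_mask(4, kind, prefix_len=2):
        print(" ", row)
```

The causal mask is lower triangular, which is what lets a decoder-only model generate text one token at a time, while the all-ones bidirectional mask gives no such left-to-right ordering.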

For attribution in academic contexts, please cite this work as:

author = {Chai, Yekun},
title = {{Scaling Up Pre-Training Models: A Summary}},
year = {2021},
howpublished = {\url{}},


  1. (T5) Raffel, Colin, et al. "Exploring the limits of transfer learning with a unified text-to-text transformer." JMLR (2020).
  2. (mT5) Xue, Linting, et al. "mT5: A massively multilingual pre-trained text-to-text transformer." arXiv preprint arXiv:2010.11934 (2020).
  3. Zhang, Zhengyan, et al. "CPM-2: Large-scale Cost-effective Pre-trained Language Models." arXiv preprint arXiv:2106.10715 (2021).
  4. Fedus, William, Barret Zoph, and Noam Shazeer. "Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity." arXiv preprint arXiv:2101.03961 (2021).
  5. Turing-NLG: A 17-billion-parameter language model by Microsoft. February 13, 2020.
  6. Brown, Tom B., et al. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020).
  7. Zhang, Zhengyan, et al. "CPM: A large-scale generative Chinese pre-trained language model." AI Open 2 (2021): 93-99.
  8. Kim, Boseop, et al. "What changes can large-scale language models bring? Intensive study on HyperCLOVA: Billions-scale Korean generative pretrained transformers." arXiv preprint arXiv:2109.04650 (2021).
  9. Zeng, Wei, et al. "PanGu-$\alpha$: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation." arXiv preprint arXiv:2104.12369 (2021).
  10. He, Pengcheng, et al. "DeBERTa: Decoding-enhanced BERT with disentangled attention." arXiv preprint arXiv:2006.03654 (2020).
  11. Sun, Yu, et al. "ERNIE 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation." arXiv preprint arXiv:2107.02137 (2021).
  12. Wu, Shaohua, et al. "Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning." arXiv preprint arXiv:2110.04725 (2021).
  13. Paresh Kharya and Ali Alvi. "Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World’s Largest and Most Powerful Generative Language Model." Oct 11, 2021.