Yekun's ML Notes

Some machine learning notes and writeup.


Scaling Up Large Language Models: A Summary

A summary of Large-scale Pre-trained Models (PTMs).


As shown in the following table, I summarize the mainstream large-scale PTMs in NLP. It is clear that PTMs have grown larger and larger in recent years, ranging from a few billion to several hundred billion parameters. Although the training methods differ among these models, they all use the Transformer as the standard backbone, owing to the efficient parallel computation of the self-attention mechanism. Since training large-scale models requires massive unsupervised corpora, research on scaling up PTMs has focused on high-resource languages such as English and Chinese.
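To make the parallelism point concrete, here is a minimal NumPy sketch (not from any of the cited papers) of single-head scaled dot-product self-attention. All pairwise token interactions are computed in one `(T, T)` matrix product, so every position is processed at once, unlike an RNN, which must step through the `T` positions sequentially.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (T, d).

    One (T, T) matmul scores every token pair simultaneously -- the source
    of the Transformer's parallel-computing advantage over recurrent models.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # project tokens: (T, d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # (T, T) affinities in one matmul
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ V                             # (T, d_k) contextualized tokens

# Toy usage: 5 tokens, model width 8 (sizes are illustrative only).
rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8)
```

Real models add multiple heads, residual connections, and layer normalization around this core, but the single `(T, T)` score matrix is where the parallelism lives.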

| Model | #Params | #Training Tokens | Masked LM | Causal LM | Prefix LM | Seq2Seq LM | Pre-Training Data |
|---|---|---|---|---|---|---|---|
| T5 | 11B | - | | | | ✓ | C4 corpus (~750GB) |
| mT5 | 13B | - | | | | ✓ | mC4 corpus (6.3T tokens) |
| Switch Transformers | 1571B | - | | | | ✓ | C4 corpus (~750GB) |
| CPM-2 | 11B | - | | | | ✓ | WuDao corpus (2.3TB Chinese + 300GB English) |
| CPM-2-MoE | 198B | - | | | | ✓ | WuDao corpus (2.3TB Chinese + 300GB English) |
| Turing-NLG | 17B | - | | ✓ | | | English data |
| GPT-3 | 175B | 300B | | ✓ | | | Cleaned CommonCrawl, WebText |
| CPM | 2.6B | - | | ✓ | | | Chinese corpus (100GB) |
| HyperCLOVA | 204B | - | | ✓ | | | Korean data |
| PanGu-$\alpha$ | 200B | - | | ✓ | | | Chinese data (1.1TB, 250B tokens) |
| DeBERTa1.5B | 1.5B | - | ✓ | | | | English corpus |
| ERNIE 3.0 | 10B | - | ✓ | ✓ | | | Chinese data (4TB); English |
| Yuan 1.0 | 245B | - | | ✓ | | | Chinese data (5TB) |
| Megatron-Turing NLG | 530B | 270B | | ✓ | | | The Pile, Common Crawl, RealNews, CC-Stories |
| OPT | 175B | 300B | | ✓ | | | RoBERTa data (BookCorpus, Stories, CC-News), The Pile, PushShift.io Reddit |
| Gopher | 280B | 300B | | ✓ | | | MassiveText (10.5TB) including webpages, books, news articles, code |
| Jurassic-1 | 178B | 300B | | ✓ | | | GPT-3 data |
| Chinchilla | 70B | 1.4T | | ✓ | | | Same as Gopher |
| Sparrow | 70B | - | | ✓ | | | - |
| LaMDA | 137B | 168B | | ✓ | | | 2.97B documents, 1.12B dialogs, and 13.39B dialog utterances, totaling 1.56T words |
| PaLM | 540B | 780B | | ✓ | | | Dataset from LaMDA, GLaM, and code |
| BLOOM | 176B | - | | ✓ | | | ROOTS dataset of 498 Hugging Face datasets; 46 natural languages, 13 programming languages |
| GLM-130B | 130B | 400B | | | ✓ | | English: The Pile (1.2T); Chinese: WuDao corpora (1T) plus 250GB crawled from online forums, encyclopedias, QA |
| ChatGLM-6B | 6B | 1T | | | ✓ | | Chinese-English bilingual data |
| LLaMA | 65B | 1.4T | | ✓ | | | CommonCrawl, C4, GitHub, Wikipedia, Books, ArXiv, StackExchange |
| Alpaca | 7B | - | | ✓ | | | 52K instruction-following examples |
| Vicuna | 13B | - | | ✓ | | | Fine-tuned on 70K user-shared ChatGPT conversations |
| ChatRWKV (100% RNN) | 14B | - | | ✓ | | | - |
| Galactica | 120B | 450B | | ✓ | | | 106B tokens from papers, reference material, encyclopedias, and other scientific sources |
| Codex | 12B | 100B | | ✓ | | | 159GB Python code |
| AlphaCode | 41B/9B | 967B | | | | ✓ | 715GB code from GitHub |
| Flamingo | 80B | - | | ✓ | | | M3W (43M webpages); image/video-text pairs: ALIGN, LTIP/VTP |
| BEiT-3 | 1.9B | - | ✓ | | | | 21M image-text pairs, 14M images, 160GB documents |
| Kosmos-1 | 1.6B | 360B | | ✓ | | | (1) Text: The Pile and Common Crawl (excluding GitHub/arXiv/Stack Exchange/PubMed Central), plus CC-Stories and RealNews; (2) image-caption pairs: LAION-2B/400M, COYO-700M, Conceptual Captions; (3) interleaved image-text data from Common Crawl |
| GPT-4 | - | - | | ✓ | | | Publicly available data and third-party licensed data |

According to the design of the pre-training architecture, large-scale PTMs can be broadly classified into three classes: encoder-only, decoder-only, and encoder-decoder. The majority of large PTMs adopt the decoder-only or encoder-decoder architecture, whereas few adopt the encoder-only design. This is because encoder-only architectures, such as BERT and DeBERTa, employ stacked Transformer encoders to attend to bidirectional context in language, and this bidirectional nature prevents them from being applied directly to NLG tasks. In contrast, decoder-only models are good at NLG tasks by nature and can also perform NLU tasks via prompt-based methods; examples include the GPT series and Turing-NLG.

  • Encoder-only, i.e., pre-training on stacked Transformer encoders. Example: DeBERTa1.5B [10].
  • Decoder-only. This line of large PTMs pre-trains Transformer decoders, applying auto-regressive masks to prevent the current token from attending to future ones. Examples: Turing-NLG [5], GPT-3 [6], CPM [7], HyperCLOVA [8], PanGu-$\alpha$ [9], Yuan 1.0 [12], Megatron-Turing NLG [13].
  • Encoder-decoder, including (1) conventional sequence-to-sequence encoder-decoder models, such as T5 [1], mT5 [2], and CPM-2 [3]; and (2) unified encoder-decoder models, such as ERNIE 3.0 [11].
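The three classes differ mainly in which positions each token is allowed to attend to. As an illustrative sketch (the function name and interface are mine, not from any cited paper), the visibility patterns can be written as boolean masks:

```python
import numpy as np

def attention_mask(T, kind, prefix_len=0):
    """Boolean (T, T) mask: entry [i, j] is True iff token i may attend to token j.

    kind='bidirectional' -> encoder-only (BERT/DeBERTa-style): full visibility.
    kind='causal'        -> decoder-only (GPT-style): no attending to the future.
    kind='prefix'        -> prefix LM: bidirectional over the first `prefix_len`
                            tokens, auto-regressive over the rest.
    """
    if kind == "bidirectional":
        return np.ones((T, T), dtype=bool)
    causal = np.tril(np.ones((T, T), dtype=bool))  # lower-triangular: j <= i
    if kind == "causal":
        return causal
    if kind == "prefix":
        mask = causal.copy()
        mask[:, :prefix_len] = True  # every token sees the whole prefix
        return mask
    raise ValueError(f"unknown mask kind: {kind}")

# Decoder-only pattern: token i sees only tokens 0..i (lower-triangular).
print(attention_mask(4, "causal").astype(int))
```

In practice the mask is added to the attention scores as `-inf` at disallowed positions before the softmax; an encoder-decoder model uses the bidirectional pattern in the encoder and the causal pattern (plus cross-attention) in the decoder.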

For attribution in academic contexts, please cite this work as:

author = {Chai, Yekun},
title = {{Scaling Up Pre-Training Models: A Summary}},
year = {2021},
howpublished = {\url{}},


  1. (T5) Raffel, Colin, et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." JMLR (2020).
  2. (mT5) Xue, Linting, et al. "mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer." arXiv preprint arXiv:2010.11934 (2020).
  3. (CPM-2) Zhang, Zhengyan, et al. "CPM-2: Large-scale Cost-effective Pre-trained Language Models." arXiv preprint arXiv:2106.10715 (2021).
  4. (Switch Transformers) Fedus, William, Barret Zoph, and Noam Shazeer. "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." arXiv preprint arXiv:2101.03961 (2021).
  5. (Turing-NLG) "Turing-NLG: A 17-billion-parameter language model by Microsoft." February 13, 2020.
  6. (GPT-3) Brown, Tom B., et al. "Language Models are Few-Shot Learners." arXiv preprint arXiv:2005.14165 (2020).
  7. (CPM) Zhang, Zhengyan, et al. "CPM: A Large-scale Generative Chinese Pre-trained Language Model." AI Open 2 (2021): 93-99.
  8. (HyperCLOVA) Kim, Boseop, et al. "What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers." arXiv preprint arXiv:2109.04650 (2021).
  9. (PanGu-$\alpha$) Zeng, Wei, et al. "PanGu-$\alpha$: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation." arXiv preprint arXiv:2104.12369 (2021).
  10. (DeBERTa) He, Pengcheng, et al. "DeBERTa: Decoding-enhanced BERT with Disentangled Attention." arXiv preprint arXiv:2006.03654 (2020).
  11. (ERNIE 3.0) Sun, Yu, et al. "ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation." arXiv preprint arXiv:2107.02137 (2021).
  12. (Yuan 1.0) Wu, Shaohua, et al. "Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning." arXiv preprint arXiv:2110.04725 (2021).
  13. (Megatron-Turing NLG) Paresh Kharya and Ali Alvi. "Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World's Largest and Most Powerful Generative Language Model." October 11, 2021.