
Counting the Number of Parameters in Deep Learning

Calculate the # of trainable parameters by hand.

Feed-forward NN



  • $\pmb{i}$ input size
  • $\pmb{o}$ output size

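Each FFNN layer has $\pmb{i} \times \pmb{o}$ weights and $\pmb{o}$ biases, i.e. $\pmb{i} \times \pmb{o} + \pmb{o}$ trainable parameters.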
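As a quick check of the per-layer count ($\pmb{i} \times \pmb{o}$ weights plus $\pmb{o}$ biases), here is a minimal sketch in plain Python. The function name `ffnn_params` and the layer sizes are illustrative assumptions, not part of any library.

```python
def ffnn_params(i: int, o: int) -> int:
    """Trainable parameters of one fully connected layer: i*o weights plus o biases."""
    return i * o + o


# Hypothetical network chosen for illustration: 3 inputs -> 5 hidden units -> 2 outputs.
total = ffnn_params(3, 5) + ffnn_params(5, 2)
print(total)  # (3*5 + 5) + (5*2 + 2) = 32
```

Stacking layers simply adds their individual counts, which is why the total above is a plain sum.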



RNN, GRU, and LSTM

  • $\pmb{n}$ # of FFNNs in each unit
    • RNN: 1
    • GRU: 3
    • LSTM: 4
  • $\pmb{i}$ input size
  • $\pmb{h}$ hidden size

For each FFNN, the input and the previous hidden state are concatenated, so each FFNN has $\pmb{(h+i) \times h} + \pmb{h}$ parameters.

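The total # of params is $\pmb{n} \times \left[ \pmb{(h+i) \times h} + \pmb{h} \right]$:

  • LSTM: $4 \times \left[ \pmb{(h+i) \times h} + \pmb{h} \right]$
  • GRU: $3 \times \left[ \pmb{(h+i) \times h} + \pmb{h} \right]$
  • RNN: $\pmb{(h+i) \times h} + \pmb{h}$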
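The recurrent counts follow from multiplying the per-FFNN count $\pmb{(h+i) \times h} + \pmb{h}$ by the number of FFNNs $\pmb{n}$ in the unit. The sketch below does that in plain Python; `recurrent_params` and `FFNNS_PER_UNIT` are hypothetical names, and the sizes are chosen only so the arithmetic is easy to verify by hand.

```python
# Number of FFNNs per unit, as listed in the text (RNN: 1, GRU: 3, LSTM: 4).
FFNNS_PER_UNIT = {"rnn": 1, "gru": 3, "lstm": 4}


def recurrent_params(kind: str, i: int, h: int) -> int:
    """n * [(h + i) * h + h], with n looked up from the unit type."""
    n = FFNNS_PER_UNIT[kind]
    return n * ((h + i) * h + h)


# Illustrative sizes: input size 3, hidden size 2, so one FFNN has (2+3)*2 + 2 = 12 params.
for kind in ("lstm", "gru", "rnn"):
    print(kind, recurrent_params(kind, i=3, h=2))
```

Some frameworks keep separate bias vectors for the input and hidden projections, so the totals they report can be slightly larger than this hand count.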


Image source: [2]



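CNN

  • $\pmb{i}$ # of input channels
  • $\pmb{f}$ filter size
  • $\pmb{o}$ # of output channels (i.e. # of filters)

Assuming square $\pmb{f} \times \pmb{f}$ filters and one bias per filter, each convolutional layer has $(\pmb{i} \times \pmb{f} \times \pmb{f}) \times \pmb{o} + \pmb{o}$ parameters.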
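For a convolutional layer, the three quantities above combine into a count once the filter shape is fixed. The sketch below assumes a square two-dimensional $\pmb{f} \times \pmb{f}$ filter and one bias per output channel, which gives $(\pmb{i} \times \pmb{f} \times \pmb{f}) \times \pmb{o} + \pmb{o}$; both the assumption and the name `conv_params` are ours.

```python
def conv_params(i: int, f: int, o: int) -> int:
    """Parameters of one 2D conv layer, ASSUMING square f x f filters and one bias per filter.

    Each of the o filters spans all i input channels, so it holds i*f*f weights.
    """
    return (i * f * f) * o + o


# Illustrative layer: 3 input channels (e.g. RGB), 3x3 filters, 64 output channels.
print(conv_params(i=3, f=3, o=64))  # (3*3*3)*64 + 64 = 1792
```

Note that the count does not depend on the spatial size of the input, because the same filters are reused at every position.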

Image source: [2]



  • $\pmb{x}$ denotes the embedding dimension, which equals the model dimension and the output dimension


Image source: [3]
  • Scaled dot product
  • Multi-head dot product attention (MHDPA)

  • Overall, MHDPA has 4 linear connections (i.e., K, V, Q, and the output projection after concatenation). Each maps $\pmb{x}$ to $\pmb{x}$ and carries a bias, so there are $4 \times \left[ (\pmb{x} \times \pmb{x}) + \pmb{x} \right]$ trainable parameters.
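The attention count is four copies of an $\pmb{x} \to \pmb{x}$ linear layer with bias. A minimal sketch, with the hypothetical name `mhdpa_params` and a small $\pmb{x}$ chosen for hand verification:

```python
def mhdpa_params(x: int) -> int:
    """4 * (x*x + x): the Q, K, V and output projections, each an x -> x layer with bias."""
    return 4 * (x * x + x)


# Illustrative model dimension x = 4: each projection has 4*4 + 4 = 20 parameters.
print(mhdpa_params(4))  # 4 * 20 = 80
```

The number of heads does not appear in the formula: splitting the $\pmb{x}$ dimensions across heads changes how the projections are sliced, not how many weights they hold. Implementations that omit the projection biases would report $4 \times \pmb{x} \times \pmb{x}$ instead.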

Transformer Encoder

Image source: [3]


  • $\pmb{m}$ is # of encoder stacks

  • Layer normalization

    The number of parameters of a single layer norm is the sum of its weights $\gamma$ and biases $\beta$: $\pmb{x}+\pmb{x}$

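  • FFNN: a single layer mapping $\pmb{x}$ to $\pmb{x}$ has $\pmb{x} \times \pmb{x} + \pmb{x}$ parameters. (In [3] the position-wise FFNN stacks two linear layers with a larger inner dimension; the count here treats it as one layer.)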

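Thus the total number of parameters of the transformer encoder is the sum of 1 MHDPA, 2 layer norms, and 1 FFNN, times the number of stacks $\pmb{m}$:

$\pmb{m} \times \left[ 4 \times \left( \pmb{x} \times \pmb{x} + \pmb{x} \right) + 2 \times \left( \pmb{x} + \pmb{x} \right) + \left( \pmb{x} \times \pmb{x} + \pmb{x} \right) \right]$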
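The encoder sum can be written as a short function. This is a sketch under the text's own simplification (the FFNN counted as a single $\pmb{x} \to \pmb{x}$ layer, embeddings not counted); `encoder_params` and its helper names are hypothetical.

```python
def encoder_params(x: int, m: int) -> int:
    """m * (1 MHDPA + 2 layer norms + 1 FFNN), following the counts given in the text."""
    mhdpa = 4 * (x * x + x)   # Q, K, V and output projections, each with bias
    layer_norm = x + x        # weights gamma plus biases beta
    ffnn = x * x + x          # single x -> x layer, as counted in the text
    return m * (mhdpa + 2 * layer_norm + ffnn)


# Illustrative sizes: x = 4, m = 2.
# Per stack: 80 (MHDPA) + 2 * 8 (layer norms) + 20 (FFNN) = 116.
print(encoder_params(x=4, m=2))  # 2 * 116 = 232
```

A model whose FFNN really has two layers with a wider inner dimension would need the `ffnn` term replaced accordingly.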

Transformer Decoder


  • $\pmb{n}$ is # of decoder stacks

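The total number of parameters of the transformer decoder is the sum of 2 MHDPA, 3 layer norms, and 1 FFNN, times the number of stacks $\pmb{n}$:

$\pmb{n} \times \left[ 2 \times 4 \times \left( \pmb{x} \times \pmb{x} + \pmb{x} \right) + 3 \times \left( \pmb{x} + \pmb{x} \right) + \left( \pmb{x} \times \pmb{x} + \pmb{x} \right) \right]$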
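The decoder differs from the encoder only in the multiplicities: two attention blocks and three layer norms per stack. A sketch under the same simplification as the text (single-layer FFNN, embeddings and the final output projection not counted); `decoder_params` is a hypothetical name.

```python
def decoder_params(x: int, n: int) -> int:
    """n * (2 MHDPA + 3 layer norms + 1 FFNN), following the counts given in the text."""
    mhdpa = 4 * (x * x + x)   # one attention block: Q, K, V and output projections
    layer_norm = x + x        # weights gamma plus biases beta
    ffnn = x * x + x          # single x -> x layer, as counted in the text
    return n * (2 * mhdpa + 3 * layer_norm + ffnn)


# Illustrative sizes: x = 4, n = 2.
# Per stack: 2 * 80 (MHDPA) + 3 * 8 (layer norms) + 20 (FFNN) = 204.
print(decoder_params(x=4, n=2))  # 2 * 204 = 408
```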


  1. Towards Data Science: Counting No. of Parameters in Deep Learning Models by Hand
  2. Towards Data Science: Animated RNN, LSTM and GRU
  3. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).