Yekun's Note

Machine learning notes and writeup.


An Introduction to Activation Functions

Activation functions introduce non-linearity into neural networks. The most common ones include sigmoid, tanh, and ReLU.

Commonly-used Activations


Sigmoid

The sigmoid function takes a real-valued number and ‘squashes’ it into the range (0,1):

$\sigma(x) = \frac{1}{1+e^{-x}}$


  1. Sigmoids saturate and kill gradients. When the output sits in either tail, near 0 or 1, the gradient is almost zero. Weight initialization needs care: if the initial weights are too large, most neurons saturate early and the network barely learns.
  2. Outputs are not zero-centered.
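
To make the saturation point concrete, here is a minimal sketch in plain Python. The helper names `sigmoid` and `sigmoid_grad` are illustrative and not taken from any library. It relies on the identity $\sigma'(x) = \sigma(x)(1-\sigma(x))$, so the gradient is largest at $x=0$ and shrinks toward zero in both tails.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient is largest at 0 and fades quickly as |x| grows.
for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:+6.1f}  sigmoid={sigmoid(x):.5f}  grad={sigmoid_grad(x):.5f}")
```

A pre-activation of magnitude 10 already leaves almost no gradient, which is why overly large initial weights stall learning.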



Tanh

Tanh squashes a real-valued input into the range (-1,1).

  • Like sigmoid, its activations saturate;
  • but the output of tanh is zero-centered. Therefore, in practice, the tanh non-linearity is preferred to the sigmoid non-linearity.
  • $\tanh$ is simply a scaled sigmoid neuron: $\tanh(x) = 2\sigma(2x) - 1$.
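
The scaled-sigmoid relation can be checked numerically. The sketch below assumes the standard identity $\tanh(x) = 2\sigma(2x) - 1$ and uses only the standard library; `tanh_via_sigmoid` is a name made up for this note.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    """tanh written as a rescaled, shifted sigmoid: 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

# Compare against the library tanh at a few points.
for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(x, tanh_via_sigmoid(x), math.tanh(x))
```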


ReLU (Rectified Linear Unit)

ReLU is simply thresholded at zero:

$f(x) = \max(0, x)$


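  • Accelerates the convergence of SGD by a reported factor of 6 compared with tanh.[2]
  • No expensive operations (e.g. exponentials).
  • Fragile during training: ReLU units can irreversibly “die”. A large gradient update can push the weights to a point where the unit outputs zero for every input, so its gradient is zero from then on and it never recovers.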
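
The “dying” behaviour is easy to reproduce on a toy example. The sketch below is an assumption-laden illustration, not a training recipe: a single unit with scalar weight `w` and bias `b`, a hand-picked input list, and a made-up helper `weight_grads`. Once every pre-activation is negative, both gradients are exactly zero, so SGD has nothing to update.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    """Subgradient of ReLU; 0 is used at x == 0 by convention."""
    return 1.0 if x > 0 else 0.0

def weight_grads(w, b, inputs):
    """Gradient of sum_i relu(w * x_i + b) with respect to (w, b)."""
    dw = sum(relu_grad(w * x + b) * x for x in inputs)
    db = sum(relu_grad(w * x + b) for x in inputs)
    return dw, db

inputs = [0.5, 1.0, 1.5, 2.0]
# Healthy unit: all pre-activations are positive, gradients flow.
print(weight_grads(w=1.0, b=0.0, inputs=inputs))
# Dead unit: the bias is so negative that every pre-activation is below zero.
print(weight_grads(w=1.0, b=-10.0, inputs=inputs))
```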


Leaky ReLU

Leaky ReLU attempts to fix the dying ReLU problem by using a small non-zero slope $\alpha$ (e.g. 0.01) when $x<0$. However, the consistency of the benefits across tasks is unclear.

$f(x) = \begin{cases} x & x \geq 0 \\ \alpha x & x < 0 \end{cases}$
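
A minimal sketch of Leaky ReLU and its gradient follows. The default slope of 0.01 and the parameter name `negative_slope` are assumptions chosen for illustration. The point is that the gradient for negative inputs is small but never zero, so a unit cannot get stuck the way a plain ReLU can.

```python
def leaky_relu(x, negative_slope=0.01):
    """Identity for x >= 0, a small linear leak for x < 0."""
    return x if x >= 0 else negative_slope * x

def leaky_relu_grad(x, negative_slope=0.01):
    """Gradient is 1 on the positive side and negative_slope on the negative side."""
    return 1.0 if x >= 0 else negative_slope

for x in (-2.0, -0.1, 0.0, 3.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```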


Parametric ReLU

$f(x) = \max(0, x) + \alpha \min(0, x)$
where $\alpha$ is learnable.
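
To show what “learnable” means here, the sketch below assumes the usual form $f(x) = \max(0,x) + \alpha\min(0,x)$ and fits $\alpha$ by gradient descent on a single made-up data point. The target value, learning rate, and helper names are all hypothetical. Only negative inputs contribute to the gradient with respect to $\alpha$.

```python
def prelu(x, alpha):
    return max(0.0, x) + alpha * min(0.0, x)

def prelu_grad_alpha(x):
    """d prelu / d alpha = min(0, x): positive inputs give no signal for alpha."""
    return min(0.0, x)

# Fit alpha so that prelu(-2.0, alpha) matches a target of -0.5 (optimum: alpha = 0.25).
alpha, lr = 0.0, 0.1
x, target = -2.0, -0.5
for _ in range(50):
    error = prelu(x, alpha) - target
    # Gradient of the squared error (error ** 2) with respect to alpha.
    alpha -= lr * 2.0 * error * prelu_grad_alpha(x)

print(alpha)
```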

ELU (Exponential Linear Units)

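ELU[4] keeps the identity for positive inputs and replaces the negative part with a smooth exponential that saturates at $-\alpha$:

$f(x) = \begin{cases} x & x > 0 \\ \alpha (e^{x} - 1) & x \leq 0 \end{cases}$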
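
A short sketch of ELU, assuming the definition from Clevert et al.[4] with $f(x) = x$ for $x > 0$ and $\alpha(e^{x}-1)$ otherwise. The default `alpha=1.0` is an assumption. `math.expm1` is used because it is more accurate than `exp(x) - 1` near zero.

```python
import math

def elu(x, alpha=1.0):
    """Identity for x > 0, alpha * (exp(x) - 1) for x <= 0."""
    return x if x > 0 else alpha * math.expm1(x)

# Negative inputs approach -alpha instead of growing without bound.
for x in (-10.0, -1.0, 0.0, 2.0):
    print(x, elu(x))
```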


Maxout

See Maxout Networks (Goodfellow et al., 2013)[3]:

$f(x) = \max_{j \in [1,k]} \left( w_j^{\top} x + b_j \right)$
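
As a rough sketch, a maxout unit takes the maximum over $k$ affine functions of its input. The code below assumes that form, $\max_j (w_j^{\top} x + b_j)$, and uses plain lists instead of tensors; the function name and argument layout are made up for this note. With two pieces it can reproduce ReLU or the absolute value.

```python
def maxout(x, weights, biases):
    """Maximum over k affine pieces w_j . x + b_j.

    x: list of floats; weights: list of k weight vectors; biases: list of k floats.
    """
    return max(
        sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for w, b in zip(weights, biases)
    )

# Two pieces (x and 0) give ReLU; two pieces (x and -x) give |x|.
relu_like = lambda v: maxout([v], [[1.0], [0.0]], [0.0, 0.0])
abs_like = lambda v: maxout([v], [[1.0], [-1.0]], [0.0, 0.0])
print(relu_like(-3.0), relu_like(3.0), abs_like(-3.0))
```

The price is that each unit carries $k$ sets of parameters instead of one.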

GELU (Gaussian Error Linear Units)


  • Combines the properties of dropout, zoneout, and ReLUs.[5]
  • Multiplies the input by zero or one, where the value of this zero-one mask is determined stochastically but also depends on the input.

Specifically, multiply the neuron input $x$ by $m \sim \text{Bernoulli}(\Phi(x))$, where $\Phi(x)=P(X \leq x)$ and $X \sim \mathcal{N}(0,1)$.

The non-linearity is the expected transformation of the stochastic regularizer on an input $x$:
$\Phi(x) \times I x + (1-\Phi(x)) \times 0x = x \Phi(x)$
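
The expectation can be sanity-checked by simulation. The sketch below samples the mask $m \sim \text{Bernoulli}(\Phi(x))$ many times and compares the average of $m \cdot x$ with $x\Phi(x)$. The sample count, seed, and helper names are arbitrary choices for illustration, and $\Phi$ is computed from `math.erf`.

```python
import math
import random

def std_normal_cdf(x):
    """Phi(x) = P(X <= x) for X ~ N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def stochastic_gate(x, rng):
    """Multiply x by a mask m ~ Bernoulli(Phi(x))."""
    m = 1.0 if rng.random() < std_normal_cdf(x) else 0.0
    return m * x

def monte_carlo_mean(x, n_samples=100_000, seed=0):
    rng = random.Random(seed)
    return sum(stochastic_gate(x, rng) for _ in range(n_samples)) / n_samples

x = 1.0
# The sample mean should be close to the deterministic value x * Phi(x).
print(monte_carlo_mean(x), x * std_normal_cdf(x))
```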

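Then define the Gaussian Error Linear Unit (GELU) as:

$\text{GELU}(x) = x P(X \leq x) = x \Phi(x) \approx 0.5 x \left(1 + \tanh\left[\sqrt{2/\pi}\,\left(x + 0.044715 x^{3}\right)\right]\right)$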

BERT implementation:

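import numpy as np
import tensorflow as tf

def gelu(x):
    """Gaussian Error Linear Unit.

    This is a smoother version of the ReLU.
    Original paper: Hendrycks & Gimpel (2016) [5].
    Args:
        x: float Tensor to perform activation.
    Returns:
        `x` with the GELU activation applied.
    """
    # tanh approximation of the standard normal CDF Phi(x)
    cdf = 0.5 * (1.0 + tf.tanh(
        (np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))
    return x * cdf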
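
The tanh expression above is an approximation of $x\Phi(x)$. As a quick check, the sketch below compares it with the exact form computed via `math.erf`, in plain Python so that it runs without TensorFlow. The function names and the scan range of [-5, 5] are assumptions made for this note.

```python
import math

def gelu_exact(x):
    """x * Phi(x), with Phi written in terms of the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """tanh approximation, same constants as the snippet above."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Largest absolute gap between the two on a grid over [-5, 5].
max_gap = max(abs(gelu_exact(i / 100) - gelu_tanh(i / 100)) for i in range(-500, 501))
print(max_gap)
```

The gap stays small across the range, which is why the cheaper tanh form is a common stand-in.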



Swish

Swish has the properties of one-sided boundedness at zero, smoothness, and non-monotonicity. It is shown to outperform ReLU on many tasks.[6]

$f(x) = x \cdot \sigma(x)$


In my view, this idea is borrowed from the Gated Linear Unit (GLU) of Dauphin et al. (2017)[7] at FAIR, which is used in gated CNNs to capture sequential information after temporal convolutions:

$h(\pmb{X}) = (\pmb{X} * \pmb{W} + \pmb{b}) \otimes \sigma(\pmb{X} * \pmb{V} + \pmb{c})$

Image source: Dauphin et al. (2017)[7]

ReLU can be seen as a simplification of GLU, where the activation of the gate depends on the sign of the input:

$\text{ReLU}(\pmb{X}) = \pmb{X} \otimes (\pmb{X} > 0)$

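The gradient of the LSTM-style gating in the Gated Tanh Unit (GTU) gradually vanishes because of the downscaling factors $\color{salmon}{\tanh'(\pmb{X})}$ and $\color{salmon}{\sigma'(\pmb{X})}$:

$\nabla [\tanh(\pmb{X}) \otimes \sigma(\pmb{X})] = \color{salmon}{\tanh'(\pmb{X})} \nabla \pmb{X} \otimes \sigma(\pmb{X}) + \color{salmon}{\sigma'(\pmb{X})} \nabla \pmb{X} \otimes \tanh(\pmb{X})$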

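The GLU gradient is $\nabla [\pmb{X} \otimes \sigma(\pmb{X})] = \color{green}{\nabla \pmb{X} \otimes \sigma(\pmb{X})} + \pmb{X} \otimes \sigma'(\pmb{X}) \nabla \pmb{X}$. It has the path $\color{green}{\nabla \pmb{X} \otimes \sigma(\pmb{X})}$, which does not downscale the activated gating unit. This can be thought of as a multiplicative skip connection.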
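
The difference between the two gradient paths can be illustrated with scalars. The sketch below is a simplification: it treats $\pmb{X}$ as a single number feeding both the linear path and the gate, and the helper names are made up. It splits each derivative into its two product-rule terms, so the GTU terms both carry a derivative factor while the first GLU term is just $\sigma(x)$.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

def gtu_grad_terms(x):
    """Terms of d/dx [tanh(x) * sigmoid(x)]; both are scaled by a derivative factor."""
    return tanh_grad(x) * sigmoid(x), sigmoid_grad(x) * math.tanh(x)

def glu_grad_terms(x):
    """Terms of d/dx [x * sigmoid(x)]; the first one is sigmoid(x) with no downscaling."""
    return sigmoid(x), x * sigmoid_grad(x)

# As x grows the GTU gradient fades, while the un-downscaled GLU path stays close to 1.
for x in (0.5, 2.0, 4.0):
    print(x, sum(gtu_grad_terms(x)), glu_grad_terms(x)[0])
```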

import torch

def swish(x):
    # Swish: x * sigmoid(x); torch.sigmoid replaces the deprecated F.sigmoid
    return x * torch.sigmoid(x)


Mish

Mish is a smooth, non-monotonic, self-gated and self-regularized activation function. It is shown to outperform Swish and ReLU on various tasks.[8]

$f(x) = x \cdot \tanh(\text{softplus}(x)) = x \cdot \tanh(\ln(1 + e^{x}))$
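
A minimal sketch of Mish in plain Python, assuming the definition $f(x) = x \tanh(\text{softplus}(x))$ from Misra[8]. The `softplus` helper is written in a numerically stable form, which is an implementation choice of this note rather than something the paper prescribes.

```python
import math

def softplus(x):
    """ln(1 + exp(x)), rearranged so that large |x| does not overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x):
    return x * math.tanh(softplus(x))

# Non-monotonic: the output dips below zero for negative x, then climbs back toward zero.
for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, mish(x))
```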



References

  1.
  2. Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10) (pp. 807-814).
  3. Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., & Bengio, Y. (2013). Maxout networks. arXiv preprint arXiv:1302.4389.
  4. Clevert, D. A., Unterthiner, T., & Hochreiter, S. (2015). Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289.
  5. Hendrycks, D., & Gimpel, K. (2016). Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415.
  6. Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Swish: a self-gated activation function. arXiv preprint arXiv:1710.05941.
  7. Dauphin, Y. N., Fan, A., Auli, M., & Grangier, D. (2017, August). Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70 (pp. 933-941). JMLR.org.
  8. Misra, D. (2019). Mish: A Self Regularized Non-Monotonic Neural Activation Function. arXiv preprint arXiv:1908.08681.