# Yekun's Note

Machine learning notes and write-ups.

Activation functions introduce non-linearity into neural networks. The most common types are Sigmoid, Tanh, ReLU, etc.

# Commonly-used Activations

## Sigmoid

The sigmoid function takes a real-valued number and ‘squashes’ it into the range (0, 1):
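$$\sigma(x) = \frac{1}{1 + e^{-x}}$$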

Drawbacks:

1. Sigmoids saturate and kill gradients. Near either tail (outputs close to 0 or 1), the gradient is almost zero. Weight initialization therefore matters: if the initial weights are too large, most neurons saturate quickly and the network barely learns.
2. Outputs are not zero-centered.

## Tanh

Tanh squashes a real-valued input into the range [-1, 1].

- Like the sigmoid, its activations saturate;
- but the output of tanh is zero-centered. Therefore, the tanh non-linearity is always preferred to the sigmoid non-linearity.
- $\tanh$ is simply a scaled sigmoid neuron:
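$$\tanh(x) = 2\sigma(2x) - 1$$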

## ReLU (Rectified Linear Unit)

ReLU is simply thresholded at zero:
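$$f(x) = \max(0, x)$$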

Pros:

- Greatly accelerates the convergence of SGD (about 6x compared to the tanh function) [2];
- no expensive operations (e.g., exponentials).

Problems:

- Fragile during training: ReLU units can irreversibly "die" during training (the output gets stuck at zero and no gradient flows back through the unit).

## Leaky ReLU

Leaky ReLU attempts to fix the dying-ReLU problem by using a small negative slope when $x < 0$ (see below). However, the consistency of the benefit across tasks is unclear.
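With a small fixed slope $\alpha$ (e.g., $\alpha = 0.01$):

$$f(x) = \begin{cases} x & x > 0 \\ \alpha x & x \leq 0 \end{cases}$$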

## PReLU

Parametric ReLU (PReLU) generalizes Leaky ReLU:
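$$f(x) = \max(0, x) + \alpha \min(0, x)$$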

where $\alpha$ is learnable.

## ELU

The Exponential Linear Unit (ELU) [4] keeps the identity for positive inputs and saturates smoothly to $-\alpha$ for negative inputs:
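$$f(x) = \begin{cases} x & x > 0 \\ \alpha \left(e^{x} - 1\right) & x \leq 0 \end{cases}$$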

## Maxout

See Maxout Networks (Goodfellow et al., 2013) [3]. A maxout unit outputs the maximum over $k$ affine transformations of its input:
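$$f(\pmb{x}) = \max_{j \in [1, k]} \left( \pmb{w}_j^{\top}\pmb{x} + b_j \right)$$

ReLU and Leaky ReLU are special cases of maxout (with $k = 2$).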

## GELU (Gaussian Error Linear Units)

Motivation:

- Combine the properties of dropout, zoneout, and ReLU [5];
- as in dropout, multiply the input by a zero-or-one mask, but determine the mask values stochastically while also letting them depend on the input.

Specifically, multiply the neuron input $x$ by $m \sim \text{Bernoulli}(\Phi(x))$, where $\Phi(x) = P(X \leq x)$ with $X \sim \mathcal{N}(0, 1)$ is the standard Gaussian CDF.

The non-linearity is the expected transformation of the stochastic regularizer on an input $x$:
$$\Phi(x) \times Ix + (1 - \Phi(x)) \times 0x = x\Phi(x)$$

Then define the Gaussian Error Linear Unit (GELU) as:
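$$\text{GELU}(x) = x\,\Phi(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{2/\pi}\left(x + 0.044715x^{3}\right)\right]\right)$$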

BERT implementation:
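A minimal NumPy sketch of the tanh approximation above (the BERT codebase implements the same formula in TensorFlow):

```python
import numpy as np

def gelu(x):
    """GELU via the tanh approximation of the Gaussian CDF."""
    cdf = 0.5 * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * np.power(x, 3))))
    return x * cdf

# Element-wise application to a small example array.
print(gelu(np.array([-2.0, -1.0, 0.0, 1.0, 2.0])))
```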

## Swish

Swish has the properties of one-sided boundedness at zero, smoothness, and non-monotonicity. Swish is shown to outperform ReLU on many tasks [6].
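It gates the input with a sigmoid of itself, where $\beta$ is either fixed to 1 or learnable:

$$f(x) = x \cdot \sigma(\beta x)$$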

Personally, I think this idea is borrowed from the Gated Linear Unit (GLU) of (Dauphin et al., 2017) [7] at FAIR, used in gated CNNs to capture sequential information after temporal convolutions:
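$$\text{GLU}(\pmb{X}) = (\pmb{X} * \pmb{W} + \pmb{b}) \otimes \sigma(\pmb{X} * \pmb{V} + \pmb{c})$$

where $*$ denotes (temporal) convolution and $\otimes$ element-wise multiplication.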

ReLU can be seen as a simplification of GLU, where the activation of the gate depends on the sign of the input:
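$$\text{ReLU}(\pmb{X}) = \pmb{X} \otimes (\pmb{X} > \pmb{0})$$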

The gradient of the LSTM-style gating of the Gated Tanh Unit (GTU) gradually vanishes because of the downscaling factors $\color{salmon}{\tanh'(\pmb{X})}$ and $\color{salmon}{\sigma'(\pmb{X})}$:
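$$\nabla\left[\tanh(\pmb{X}) \otimes \sigma(\pmb{X})\right] = \color{salmon}{\tanh'(\pmb{X})}\nabla\pmb{X} \otimes \sigma(\pmb{X}) + \color{salmon}{\sigma'(\pmb{X})}\nabla\pmb{X} \otimes \tanh(\pmb{X})$$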

GLU has the path $\color{green}{\nabla \pmb{X} \otimes \sigma(\pmb{X})}$, which does not downscale the activated gating unit. This can be thought of as a multiplicative skip connection.
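The full gradient is:

$$\nabla\left[\pmb{X} \otimes \sigma(\pmb{X})\right] = \color{green}{\nabla\pmb{X} \otimes \sigma(\pmb{X})} + \pmb{X} \otimes \sigma'(\pmb{X})\nabla\pmb{X}$$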

## Mish

Mish is a non-monotonic, self-gated/regularized, smooth activation function. It is shown to outperform Swish and ReLU on various tasks [8].
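It is defined as:

$$f(x) = x \tanh\left(\text{softplus}(x)\right) = x \tanh\left(\ln(1 + e^{x})\right)$$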

# References

1.
2. Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10) (pp. 807-814).
3. Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., & Bengio, Y. (2013). Maxout networks. arXiv preprint arXiv:1302.4389.
4. Clevert, D. A., Unterthiner, T., & Hochreiter, S. (2015). Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289.
5. Hendrycks, D., & Gimpel, K. (2016). Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415.
6. Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Swish: A self-gated activation function. arXiv preprint arXiv:1710.05941.
7. Dauphin, Y. N., Fan, A., Auli, M., & Grangier, D. (2017). Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning (Vol. 70, pp. 933-941). JMLR.org.
8. Misra, D. (2019). Mish: A self regularized non-monotonic neural activation function. arXiv preprint arXiv:1908.08681.