Let $d_i(t) = \frac{\partial E}{\partial w_i(t)}$ be the gradient of the error function $E$ w.r.t. the weight $w_i$ at update step $t$.

• “Vanilla” gradient descent updates the weight along the negative gradient direction:

$w_i(t+1) = w_i(t) - \eta\, d_i(t)$

where $\eta$ denotes the learning rate.
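As a minimal sketch of this update (a toy quadratic error $E(w) = \sum_i w_i^2$ stands in for a real loss; all names and values are illustrative):

```python
import numpy as np

def gradient_descent_step(w, d, eta=0.1):
    """One vanilla step: move against the gradient, scaled by eta."""
    return w - eta * d

# Toy quadratic error E(w) = sum(w**2), so the gradient is d = 2w.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = gradient_descent_step(w, 2 * w)
# w has converged close to the minimum at the origin
```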

## How to set the learning rate?

Initialize $\eta$, then update it as training progresses.

• Learning rate schedules: typically larger initial steps followed by smaller steps for fine-tuning; this results in faster convergence and better solutions.

Learning Rate Schedules:

1. Time-dependent schedules:
• Piecewise constant: pre-determined $\eta$ for each epoch
• Exponential:
$\eta(t) = \eta(0) \exp(-t/r)$, where $r \sim \text{training set size}$
• Reciprocal:
$\eta(t) = \eta(0)(1+ \frac{t}{r})^{-c}$, where $c \sim 1$.
2. Performance-dependent $\eta$: fixed $\eta$ until the validation set stops improving, then halve $\eta$ each epoch (i.e. constant, then exponential)
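The time-dependent schedules above can be sketched as plain functions of the update step $t$ (the boundary and rate values below are placeholder choices, not recommendations):

```python
import math

def piecewise_constant(t, boundaries=(10, 20), rates=(0.1, 0.01, 0.001)):
    """Pre-determined eta for each epoch range."""
    for boundary, rate in zip(boundaries, rates):
        if t < boundary:
            return rate
    return rates[-1]

def exponential(t, eta0=0.1, r=50_000):
    """eta(t) = eta(0) * exp(-t / r), with r ~ training set size."""
    return eta0 * math.exp(-t / r)

def reciprocal(t, eta0=0.1, r=50_000, c=1.0):
    """eta(t) = eta(0) * (1 + t / r) ** (-c), with c ~ 1."""
    return eta0 * (1 + t / r) ** (-c)
```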

# Momentum

• Momentum hyperparameter: $\alpha \sim 0.9$
• The momentum term encourages the weight change to continue in the previous direction:

$\Delta w_i(t) = -\eta\, d_i(t) + \alpha\, \Delta w_i(t-1)$

• Problem: tuning the learning rate and momentum parameters can be expensive.
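A minimal sketch of SGD with momentum, using the standard update $\Delta w = -\eta\, d + \alpha\, \Delta w_{\text{prev}}$ on the same toy quadratic error (illustrative values):

```python
import numpy as np

def momentum_step(w, d, delta_w, eta=0.01, alpha=0.9):
    """One momentum step: keep a fraction alpha of the previous weight change."""
    delta_w = alpha * delta_w - eta * d
    return w + delta_w, delta_w

# Toy quadratic error E(w) = sum(w**2): gradient d = 2w
w = np.array([1.0, -2.0])
delta_w = np.zeros_like(w)
for _ in range(500):
    w, delta_w = momentum_step(w, 2 * w, delta_w)
```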

## AdaGrad

• Tuning learning rate parameters is expensive (grid search)
• AdaGrad[2] adapts a separate learning rate for each parameter; it forces the effective learning rate to always decrease, a constraint later relaxed by RMSProp

• Separate, normalized update for each weight
• Normalized by the sum of squared gradients $S$:

$S_i(t) = \sum_{\tau=1}^{t} d_i(\tau)^2, \qquad w_i(t+1) = w_i(t) - \frac{\eta}{\sqrt{S_i(t)} + \epsilon}\, d_i(t)$

where $\epsilon \sim 10^{-8}$ is a small constant to prevent division-by-zero errors.
• The update step for $w_i$ is normalized by the (square root of) the sum squared gradients for that parameter
• Weights with larger gradient magnitudes will have smaller effective learning rates
• $S_i$ cannot get smaller, so the effective learning rates monotonically decrease
• AdaGrad can decrease the effective learning rate too aggressively in NNs.
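A sketch of the per-weight normalized update on the toy quadratic error (the step size and iteration count are illustrative):

```python
import numpy as np

def adagrad_step(w, d, S, eta=0.5, eps=1e-8):
    """AdaGrad: normalize each weight's step by its accumulated squared gradient."""
    S = S + d ** 2                           # S_i can only grow ...
    w = w - eta * d / (np.sqrt(S) + eps)     # ... so effective rates can only shrink
    return w, S

# Toy quadratic error E(w) = sum(w**2): gradient d = 2w
w = np.array([1.0, -2.0])
S = np.zeros_like(w)
for _ in range(2000):
    w, S = adagrad_step(w, 2 * w, S)
```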

## RMSProp

• Background: RProp[3] is a batch gradient descent method with an adaptive learning rate for each parameter; it uses only the sign of the gradient (equivalent to normalizing by the gradient magnitude)
• RMSProp can be viewed as a stochastic gradient descent version of RProp, normalized by a moving average of the squared gradient[4]. It is similar to AdaGrad, but replaces the sum with a moving average for $S$:

$S_i(t) = \beta\, S_i(t-1) + (1-\beta)\, d_i(t)^2$

where $\beta \sim 0.9$ is the decay rate.
• Effective learning rates no longer guaranteed to decrease.
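The same toy setup as above, with AdaGrad's running sum replaced by a decayed moving average (hyperparameter values are illustrative):

```python
import numpy as np

def rmsprop_step(w, d, S, eta=0.01, beta=0.9, eps=1e-8):
    """RMSProp: replace AdaGrad's sum by a decayed moving average of d**2."""
    S = beta * S + (1 - beta) * d ** 2
    w = w - eta * d / (np.sqrt(S) + eps)
    return w, S

# Toy quadratic error E(w) = sum(w**2): gradient d = 2w
w = np.array([1.0, -2.0])
S = np.zeros_like(w)
for _ in range(2000):
    w, S = rmsprop_step(w, 2 * w, S)
# w ends up oscillating within a few eta of the minimum instead of settling exactly
```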

## AdaDelta

Background: AdaDelta addresses two drawbacks of AdaGrad:

1. the continual decay of learning rates throughout training;
2. the need for a manually selected global learning rate.

Let $\beta$ denote the decay rate.

The numerator of the update term is not a manually set $\eta$, so the method is insensitive to the pre-defined learning rate.

Pros:

• requires no manual setting of a learning rate;
• insensitive to hyperparameters;
• minimal computation over gradient descent.

The update is

$\Delta w_i(t) = -\frac{\text{RMS}[\Delta w_i]_{t-1}}{\text{RMS}[d_i]_t}\, d_i(t)$

where $\text{RMS}[x]_t = \sqrt{E[x^2]_t + \epsilon}$ and $E[x^2]_t = \beta\, E[x^2]_{t-1} + (1-\beta)\, x_t^2$.
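A sketch of this AdaDelta-style update, in which the RMS of past updates replaces the global $\eta$ (the decay and $\epsilon$ values are illustrative; steps start tiny, set by $\epsilon$, and grow as the update accumulator fills):

```python
import numpy as np

def adadelta_step(w, d, E_g2, E_dw2, beta=0.95, eps=1e-6):
    """AdaDelta: the RMS of previous updates replaces a manual learning rate."""
    E_g2 = beta * E_g2 + (1 - beta) * d ** 2
    dw = -np.sqrt(E_dw2 + eps) / np.sqrt(E_g2 + eps) * d
    E_dw2 = beta * E_dw2 + (1 - beta) * dw ** 2
    return w + dw, E_g2, E_dw2

# Toy quadratic error E(w) = sum(w**2): gradient d = 2w
w = np.array([1.0, -2.0])
E_g2 = np.zeros_like(w)
E_dw2 = np.zeros_like(w)
for _ in range(500):
    w, E_g2, E_dw2 = adadelta_step(w, 2 * w, E_g2, E_dw2)
```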

## Adam

• Adam[5] can be viewed as a variant of RMSProp with momentum:

$m_i(t) = \alpha\, m_i(t-1) + (1-\alpha)\, d_i(t), \qquad S_i(t) = \beta\, S_i(t-1) + (1-\beta)\, d_i(t)^2$

$\hat m_i(t) = \frac{m_i(t)}{1-\alpha^t}, \qquad \hat S_i(t) = \frac{S_i(t)}{1-\beta^t}, \qquad w_i(t+1) = w_i(t) - \frac{\eta}{\sqrt{\hat S_i(t)} + \epsilon}\, \hat m_i(t)$

where $\alpha \sim 0.9$, $\beta \sim 0.999$, and $\epsilon \sim 10^{-8}$.
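A sketch combining the two ingredients, momentum-smoothed gradient plus RMSProp-style normalization, with bias correction for the zero-initialized accumulators (hyperparameter values are illustrative):

```python
import numpy as np

def adam_step(w, d, m, S, t, eta=0.01, alpha=0.9, beta=0.999, eps=1e-8):
    """Adam: momentum + RMSProp normalization + bias correction."""
    m = alpha * m + (1 - alpha) * d          # first moment (momentum)
    S = beta * S + (1 - beta) * d ** 2       # second moment (RMSProp)
    m_hat = m / (1 - alpha ** t)             # correct the zero-initialization bias
    S_hat = S / (1 - beta ** t)
    w = w - eta * m_hat / (np.sqrt(S_hat) + eps)
    return w, m, S

# Toy quadratic error E(w) = sum(w**2): gradient d = 2w
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
S = np.zeros_like(w)
for t in range(1, 2001):
    w, m, S = adam_step(w, 2 * w, m, S, t)
```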

• AMSGrad[6] maintains the maximum of all $S_i(t)$ up to the present time step, and uses it instead of $S_i(t)$ itself to normalize the running average of the gradient in Adam.
• AMSGrad results in a non-increasing step size and avoids pitfalls of Adam and RMSProp.

• AdaBound[7] applies clipping to the per-parameter learning rates. It behaves like Adam at the beginning, when the bounds have little impact on the learning rates, and gradually transforms into SGD.

## RAdam

### Problems

• The adaptive learning rate has an undesirably large variance in the early stage of training, due to the limited number of samples seen so far, which can lead to bad local optima.

### Previous solution

• Warmup heuristics: using a small learning rate in the first few epochs of training
• E.g. linear warmup: set $\alpha_t = \frac{\min(t, T_w)}{T_w}\, \alpha_0$, where $T_w$ is the warmup length.

Rectified Adam (RAdam) introduces a rectification term $r_t$ to mitigate the variance of the adaptive learning rate, inspired by the analysis of the Exponential Moving Average (EMA)[10].

Let $\rho_t$ denote the length of the approximated Simple Moving Average (SMA), and $\rho_\infty = \frac{2}{1-\beta_2} - 1$ be the maximum length of the approximated SMA.

if $\rho_t > 4$, i.e., the variance is tractable, the rectification term is applied:

$r_t = \sqrt{\frac{(\rho_t - 4)(\rho_t - 2)\,\rho_\infty}{(\rho_\infty - 4)(\rho_\infty - 2)\,\rho_t}}, \qquad w(t+1) = w(t) - \eta\, r_t\, \frac{\hat m(t)}{\sqrt{\hat S(t)} + \epsilon}$

else, the update falls back to un-rectified momentum:

$w(t+1) = w(t) - \eta\, \hat m(t)$

where $\hat m(t)$ and $\hat S(t)$ are the bias-corrected first and second moments from Adam.

• The heuristic linear warmup can be viewed as setting $r_t = \frac{\min(t, T_w)}{T_w}$
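A sketch of the rectification schedule (the $\rho_t$ formula, $\rho_t = \rho_\infty - 2t\beta_2^t/(1-\beta_2^t)$, is taken from the RAdam paper[10]):

```python
import math

def radam_rectifier(t, beta2=0.999):
    """Return (rho_t, r_t); r_t is None when rho_t <= 4 (variance intractable)."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)
    if rho_t <= 4.0:
        return rho_t, None   # fall back to the un-rectified momentum update
    r_t = math.sqrt((rho_t - 4.0) * (rho_t - 2.0) * rho_inf
                    / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t))
    return rho_t, r_t
```

Early steps get no adaptive update at all; as $t$ grows, $\rho_t \to \rho_\infty$ and $r_t \to 1$, recovering plain Adam.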

## Lookahead

• Intuition: choose a search direction by looking ahead at a sequence of “fast weights” generated by another optimizer.

• Process: Lookahead first updates the “fast weights” $k$ times using any standard optimizer in the inner loop, before updating the “slow weights” once, in the direction of the final fast weights.

• It is empirically robust to suboptimal hyperparameters: changes in the inner-loop optimizer, the number of fast-weight updates, and the slow-weights learning rate.

Lookahead maintains a set of slow weights $\phi$ and fast weights $\theta$; the slow weights are synchronized with the fast weights every $k$ updates.

• Init parameters $\phi_0$, loss function $L$
• sync period $k$, slow weights step size $\alpha$, optimizer $A$
• for t = 1,2,…:
• sync updated slow weights to the inner loop fast weights: $\theta_{t,0} \leftarrow \phi_{t-1}$
• for i in 1,2,…,k:
• sample mini-batch of data $d \sim \mathcal{D}$
• update fast weights with standard optimizer $\theta_{t,i} \leftarrow \theta_{t, i-1} + A(L, \theta_{t,i-1},d)$
• update slow weights with interpolation: $\phi_t \leftarrow \phi_{t-1} + \alpha (\underbrace{\theta_{t,k} - \phi_{t-1}}_{\text{linear interpolation}})$

After $k$ inner updates using $A$, the slow weights are updated towards the fast weights by linear interpolation in weight space along $\theta - \phi$. After each slow-weights update, the fast weights are reset to the current slow weights.
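The loop above can be sketched with plain SGD standing in for the inner optimizer $A$ and a deterministic toy gradient in place of mini-batch sampling (both are illustrative assumptions, not the authors' code):

```python
import numpy as np

def lookahead(grad_fn, phi, k=5, alpha=0.5, eta=0.1, outer_steps=200):
    """Lookahead: k fast-weight steps with inner SGD, then one slow interpolation."""
    phi = phi.copy()
    for _ in range(outer_steps):
        theta = phi.copy()                      # sync fast weights to slow weights
        for _ in range(k):                      # inner loop: k fast updates
            theta = theta - eta * grad_fn(theta)
        phi = phi + alpha * (theta - phi)       # slow weights: linear interpolation
    return phi

# Toy quadratic error E(w) = sum(w**2): gradient 2w
phi = lookahead(lambda w: 2 * w, np.array([1.0, -2.0]))
```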

• Benefit: Lookahead benefits from a large learning rate in the inner loop. The fast-weight updates make rapid progress along low-curvature directions, while the slow weights smooth out the oscillations through parameter interpolation.

Slow weights trajectory

• an exponential moving average (EMA) of the final fast weights of each inner loop, i.e. after every $k$ inner-loop steps:

$\phi_t = (1-\alpha)\,\phi_{t-1} + \alpha\,\theta_{t,k} = \alpha \sum_{i=1}^{t} (1-\alpha)^{t-i}\,\theta_{i,k} + (1-\alpha)^{t}\,\phi_0$

• Intuition: the slow weights mostly follow recent fast weights, but maintain some effect from all previous fast weights.

Fast weights trajectory

• updated with the inner optimizer: $\theta_{t,i} = \theta_{t,i-1} + A(L, \theta_{t,i-1}, d)$
• reset to the current slow weights after each slow-weights update

## Ranger

Ranger[12] applies RAdam as the optimizer $A$ in the Lookahead algorithm (source code: [13]).

# References

1. 
2. Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2121–2159.
3. Riedmiller, M., & Braun, H. (1993). A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In Proceedings of the IEEE International Conference on Neural Networks (pp. 586–591).
4. 
5. Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
6. Reddi, S. J., Kale, S., & Kumar, S. (2019). On the convergence of Adam and beyond. arXiv preprint arXiv:1904.09237.
7. Luo, L., Xiong, Y., Liu, Y., & Sun, X. (2019). Adaptive gradient methods with dynamic bound of learning rate. arXiv preprint arXiv:1902.09843.
8. Loshchilov, I., & Hutter, F. (2017). Fixing weight decay regularization in Adam. arXiv preprint arXiv:1711.05101.