Let $d_i(t) = \frac{\partial E}{\partial w_i(t)}$ be the gradient of the error function $E$ w.r.t. the weight $w_i$ at update step $t$.

• “Vanilla” gradient descent updates the weight along the negative gradient direction:

$w_i(t+1) = w_i(t) - \eta\, d_i(t)$

where $\eta$ denotes the learning rate.
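As a minimal sketch of this update (a toy quadratic error $E(w) = \sum_i w_i^2$ stands in for a real loss; all names and values are illustrative):

```python
import numpy as np

def gradient_descent_step(w, d, eta=0.1):
    """One vanilla step: move against the gradient, scaled by eta."""
    return w - eta * d

# Toy quadratic error E(w) = sum(w**2), so the gradient is d = 2w.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = gradient_descent_step(w, 2 * w)
# w has converged close to the minimum at the origin
```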

## How to set the learning rate?

Initialize $\eta$, then update it as training progresses.

• Learning rate schedules: typically larger initial steps followed by smaller steps for fine-tuning; this results in faster convergence and better solutions.

Learning Rate Schedules:

1. Time-dependent schedules:
• Piecewise constant: pre-determined $\eta$ for each epoch
• Exponential:
$\eta(t) = \eta(0) \exp(-t/r)$, where $r \sim \text{training set size}$
• Reciprocal:
$\eta(t) = \eta(0)(1+ \frac{t}{r})^{-c}$, where $c \sim 1$.
2. Performance-dependent $\eta$: fixed $\eta$ until the validation set stops improving, then halve $\eta$ each epoch (i.e. constant, then exponential)
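The time-dependent schedules above can be sketched as plain functions of the update step $t$ (the boundary and rate values below are placeholder choices, not recommendations):

```python
import math

def piecewise_constant(t, boundaries=(10, 20), rates=(0.1, 0.01, 0.001)):
    """Pre-determined eta for each epoch range."""
    for boundary, rate in zip(boundaries, rates):
        if t < boundary:
            return rate
    return rates[-1]

def exponential(t, eta0=0.1, r=50_000):
    """eta(t) = eta(0) * exp(-t / r), with r ~ training set size."""
    return eta0 * math.exp(-t / r)

def reciprocal(t, eta0=0.1, r=50_000, c=1.0):
    """eta(t) = eta(0) * (1 + t / r) ** (-c), with c ~ 1."""
    return eta0 * (1 + t / r) ** (-c)
```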

# Momentum

• Momentum hyperparameter: $\alpha \sim 0.9$
• The momentum term encourages the weight change to continue in the previous direction:

$\Delta w_i(t) = -\eta\, d_i(t) + \alpha\, \Delta w_i(t-1)$

• Problem: tuning the learning rate and momentum parameters can be expensive.
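A minimal sketch of SGD with momentum, using the standard update $\Delta w = -\eta\, d + \alpha\, \Delta w_{\text{prev}}$ on the same toy quadratic error (illustrative values):

```python
import numpy as np

def momentum_step(w, d, delta_w, eta=0.01, alpha=0.9):
    """One momentum step: keep a fraction alpha of the previous weight change."""
    delta_w = alpha * delta_w - eta * d
    return w + delta_w, delta_w

# Toy quadratic error E(w) = sum(w**2): gradient d = 2w
w = np.array([1.0, -2.0])
delta_w = np.zeros_like(w)
for _ in range(500):
    w, delta_w = momentum_step(w, 2 * w, delta_w)
```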

## AdaGrad

• Tuning learning rate parameters is expensive (grid search)
• AdaGrad[2] adapts a separate learning rate for each parameter; it forces the effective learning rate to always decrease, a constraint later relaxed by RMSProp

• Separate, normalized update for each weight
• Normalized by the sum of squared gradients $S$:

$S_i(t) = \sum_{\tau=1}^{t} d_i(\tau)^2, \qquad w_i(t+1) = w_i(t) - \frac{\eta}{\sqrt{S_i(t)} + \epsilon}\, d_i(t)$

where $\epsilon \sim 10^{-8}$ is a small constant to prevent division-by-zero errors.
• The update step for $w_i$ is normalized by the (square root of) the sum squared gradients for that parameter
• Weights with larger gradient magnitudes will have smaller effective learning rates
• $S_i$ cannot get smaller, so the effective learning rates monotonically decrease
• AdaGrad can decrease the effective learning rate too aggressively in NNs.
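A sketch of the per-weight normalized update on the toy quadratic error (the step size and iteration count are illustrative):

```python
import numpy as np

def adagrad_step(w, d, S, eta=0.5, eps=1e-8):
    """AdaGrad: normalize each weight's step by its accumulated squared gradient."""
    S = S + d ** 2                           # S_i can only grow ...
    w = w - eta * d / (np.sqrt(S) + eps)     # ... so effective rates can only shrink
    return w, S

# Toy quadratic error E(w) = sum(w**2): gradient d = 2w
w = np.array([1.0, -2.0])
S = np.zeros_like(w)
for _ in range(2000):
    w, S = adagrad_step(w, 2 * w, S)
```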

## RMSProp

• Background: RProp[3] is a batch gradient descent method with an adaptive learning rate for each parameter; it uses only the sign of the gradient (equivalent to normalizing by the gradient magnitude)
• RMSProp can be viewed as a stochastic gradient descent version of RProp, normalized by a moving average of the squared gradient[4]. It is similar to AdaGrad, but replaces the sum with a moving average for $S$:

$S_i(t) = \beta\, S_i(t-1) + (1-\beta)\, d_i(t)^2$

where $\beta \sim 0.9$ is the decay rate.
• Effective learning rates no longer guaranteed to decrease.
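The same toy setup as above, with AdaGrad's running sum replaced by a decayed moving average (hyperparameter values are illustrative):

```python
import numpy as np

def rmsprop_step(w, d, S, eta=0.01, beta=0.9, eps=1e-8):
    """RMSProp: replace AdaGrad's sum by a decayed moving average of d**2."""
    S = beta * S + (1 - beta) * d ** 2
    w = w - eta * d / (np.sqrt(S) + eps)
    return w, S

# Toy quadratic error E(w) = sum(w**2): gradient d = 2w
w = np.array([1.0, -2.0])
S = np.zeros_like(w)
for _ in range(2000):
    w, S = rmsprop_step(w, 2 * w, S)
# w ends up oscillating within a few eta of the minimum instead of settling exactly
```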

## AdaDelta

Background: AdaDelta addresses two drawbacks of AdaGrad:

1. the continual decay of learning rates throughout training;
2. the need for a manually selected global learning rate.

Let $\beta$ denote the decay rate.

The numerator of the update term is not a manually set $\eta$, so the method is insensitive to the pre-defined learning rate.

Pros:

• requires no manual setting of a learning rate;
• insensitive to hyperparameters;
• minimal computation over gradient descent.

The update is

$\Delta w_i(t) = -\frac{\text{RMS}[\Delta w_i]_{t-1}}{\text{RMS}[d_i]_t}\, d_i(t)$

where $\text{RMS}[x]_t = \sqrt{E[x^2]_t + \epsilon}$ and $E[x^2]_t = \beta\, E[x^2]_{t-1} + (1-\beta)\, x_t^2$.
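A sketch of this AdaDelta-style update, in which the RMS of past updates replaces the global $\eta$ (the decay and $\epsilon$ values are illustrative; steps start tiny, set by $\epsilon$, and grow as the update accumulator fills):

```python
import numpy as np

def adadelta_step(w, d, E_g2, E_dw2, beta=0.95, eps=1e-6):
    """AdaDelta: the RMS of previous updates replaces a manual learning rate."""
    E_g2 = beta * E_g2 + (1 - beta) * d ** 2
    dw = -np.sqrt(E_dw2 + eps) / np.sqrt(E_g2 + eps) * d
    E_dw2 = beta * E_dw2 + (1 - beta) * dw ** 2
    return w + dw, E_g2, E_dw2

# Toy quadratic error E(w) = sum(w**2): gradient d = 2w
w = np.array([1.0, -2.0])
E_g2 = np.zeros_like(w)
E_dw2 = np.zeros_like(w)
for _ in range(500):
    w, E_g2, E_dw2 = adadelta_step(w, 2 * w, E_g2, E_dw2)
```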

## Adam

• Adam[5] can be viewed as a variant of RMSProp with momentum:

$m_i(t) = \alpha\, m_i(t-1) + (1-\alpha)\, d_i(t), \qquad S_i(t) = \beta\, S_i(t-1) + (1-\beta)\, d_i(t)^2$

$\hat m_i(t) = \frac{m_i(t)}{1-\alpha^t}, \qquad \hat S_i(t) = \frac{S_i(t)}{1-\beta^t}, \qquad w_i(t+1) = w_i(t) - \frac{\eta}{\sqrt{\hat S_i(t)} + \epsilon}\, \hat m_i(t)$

where $\alpha \sim 0.9$, $\beta \sim 0.999$, and $\epsilon \sim 10^{-8}$.
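A sketch combining the two ingredients, momentum-smoothed gradient plus RMSProp-style normalization, with bias correction for the zero-initialized accumulators (hyperparameter values are illustrative):

```python
import numpy as np

def adam_step(w, d, m, S, t, eta=0.01, alpha=0.9, beta=0.999, eps=1e-8):
    """Adam: momentum + RMSProp normalization + bias correction."""
    m = alpha * m + (1 - alpha) * d          # first moment (momentum)
    S = beta * S + (1 - beta) * d ** 2       # second moment (RMSProp)
    m_hat = m / (1 - alpha ** t)             # correct the zero-initialization bias
    S_hat = S / (1 - beta ** t)
    w = w - eta * m_hat / (np.sqrt(S_hat) + eps)
    return w, m, S

# Toy quadratic error E(w) = sum(w**2): gradient d = 2w
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
S = np.zeros_like(w)
for t in range(1, 2001):
    w, m, S = adam_step(w, 2 * w, m, S, t)
```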

• AMSGrad[6] maintains the maximum of all $S_i(t)$ up to the present time step, and uses it instead of $S_i(t)$ itself to normalize the running average of the gradient in Adam.
• AMSGrad results in a non-increasing step size and avoids pitfalls of Adam and RMSProp.

• AdaBound[7] applies clipping to the per-parameter learning rates. It behaves like Adam at the beginning, when the bounds have little impact on the learning rates, and gradually transforms into SGD.

## RAdam

### Problems

• The adaptive learning rate has an undesirably large variance in the early stage of training, due to the limited number of samples seen so far, which can lead to bad local optima.

### Previous solution

• Warmup heuristics: using a small learning rate in the first few epochs of training
• E.g. linear warmup: set $\alpha_t = \frac{\min(t, T_w)}{T_w}\, \alpha_0$, where $T_w$ is the warmup length.

Rectified Adam (RAdam) introduces a rectification term $r_t$ to mitigate the variance of the adaptive learning rate, inspired by the analysis of the Exponential Moving Average (EMA)[10].

Let $\rho_t$ denote the length of the approximated Simple Moving Average (SMA), and $\rho_\infty = \frac{2}{1-\beta_2} - 1$ be the maximum length of the approximated SMA.

if $\rho_t > 4$, i.e., the variance is tractable, the rectification term is applied:

$r_t = \sqrt{\frac{(\rho_t - 4)(\rho_t - 2)\,\rho_\infty}{(\rho_\infty - 4)(\rho_\infty - 2)\,\rho_t}}, \qquad w(t+1) = w(t) - \eta\, r_t\, \frac{\hat m(t)}{\sqrt{\hat S(t)} + \epsilon}$

else, the update falls back to un-rectified momentum:

$w(t+1) = w(t) - \eta\, \hat m(t)$

where $\hat m(t)$ and $\hat S(t)$ are the bias-corrected first and second moments from Adam.

• The heuristic linear warmup can be viewed as setting $r_t = \frac{\min(t, T_w)}{T_w}$
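A sketch of the rectification schedule (the $\rho_t$ formula, $\rho_t = \rho_\infty - 2t\beta_2^t/(1-\beta_2^t)$, is taken from the RAdam paper[10]):

```python
import math

def radam_rectifier(t, beta2=0.999):
    """Return (rho_t, r_t); r_t is None when rho_t <= 4 (variance intractable)."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)
    if rho_t <= 4.0:
        return rho_t, None   # fall back to the un-rectified momentum update
    r_t = math.sqrt((rho_t - 4.0) * (rho_t - 2.0) * rho_inf
                    / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t))
    return rho_t, r_t
```

Early steps get no adaptive update at all; as $t$ grows, $\rho_t \to \rho_\infty$ and $r_t \to 1$, recovering plain Adam.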

## Lookahead

• Intuition: choose a search direction by looking ahead at a sequence of “fast weights” generated by another optimizer.

• Process: Lookahead first updates the “fast weights” $k$ times using any standard optimizer in the inner loop, before updating the “slow weights” once, in the direction of the final fast weights.

• It is empirically robust to suboptimal hyperparameters: changes in the inner-loop optimizer, the number of fast-weight updates, and the slow-weights learning rate.

Lookahead maintains a set of slow weights $\phi$ and fast weights $\theta$; the slow weights are synchronized with the fast weights every $k$ updates.

• Init parameters $\phi_0$, loss function $L$
• sync period $k$, slow weights step size $\alpha$, optimizer $A$
• for t = 1,2,…:
• sync updated slow weights to the inner loop fast weights: $\theta_{t,0} \leftarrow \phi_{t-1}$
• for i in 1,2,…,k:
• sample mini-batch of data $d \sim \mathcal{D}$
• update fast weights with standard optimizer $\theta_{t,i} \leftarrow \theta_{t, i-1} + A(L, \theta_{t,i-1},d)$
• update slow weights with interpolation: $\phi_t \leftarrow \phi_{t-1} + \alpha (\underbrace{\theta_{t,k} - \phi_{t-1}}_{\text{linear interpolation}})$

After $k$ inner updates using $A$, the slow weights are updated towards the fast weights by linear interpolation in weight space along $\theta - \phi$. After each slow-weights update, the fast weights are reset to the current slow weights.
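The loop above can be sketched with plain SGD standing in for the inner optimizer $A$ and a deterministic toy gradient in place of mini-batch sampling (both are illustrative assumptions, not the authors' code):

```python
import numpy as np

def lookahead(grad_fn, phi, k=5, alpha=0.5, eta=0.1, outer_steps=200):
    """Lookahead: k fast-weight steps with inner SGD, then one slow interpolation."""
    phi = phi.copy()
    for _ in range(outer_steps):
        theta = phi.copy()                      # sync fast weights to slow weights
        for _ in range(k):                      # inner loop: k fast updates
            theta = theta - eta * grad_fn(theta)
        phi = phi + alpha * (theta - phi)       # slow weights: linear interpolation
    return phi

# Toy quadratic error E(w) = sum(w**2): gradient 2w
phi = lookahead(lambda w: 2 * w, np.array([1.0, -2.0]))
```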

• Benefit: Lookahead benefits from a large learning rate in the inner loop. The fast-weight updates make rapid progress along low-curvature directions, while the slow weights smooth out the oscillations through parameter interpolation.

Slow weights trajectory

• an exponential moving average (EMA) of the final fast weights of each inner loop, i.e. after every $k$ inner-loop steps:

$\phi_t = (1-\alpha)\,\phi_{t-1} + \alpha\,\theta_{t,k} = \alpha \sum_{i=1}^{t} (1-\alpha)^{t-i}\,\theta_{i,k} + (1-\alpha)^{t}\,\phi_0$

• Intuition: the slow weights mostly follow recent fast weights, but maintain some effect from all previous fast weights.

Fast weights trajectory

• updated with the inner optimizer: $\theta_{t,i} = \theta_{t,i-1} + A(L, \theta_{t,i-1}, d)$
• reset to the current slow weights after each slow-weights update

## Ranger

Ranger[12] applies RAdam as the optimizer $A$ in the Lookahead algorithm (source code: [13]).

# References

1. 
2. Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2121–2159.
3. Riedmiller, M., & Braun, H. (1993). A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In Proceedings of the IEEE International Conference on Neural Networks (pp. 586–591).
4. 
5. Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
6. Reddi, S. J., Kale, S., & Kumar, S. (2019). On the convergence of Adam and beyond. arXiv preprint arXiv:1904.09237.
7. Luo, L., Xiong, Y., Liu, Y., & Sun, X. (2019). Adaptive gradient methods with dynamic bound of learning rate. arXiv preprint arXiv:1902.09843.
8. Loshchilov, I., & Hutter, F. (2017). Fixing weight decay regularization in Adam. arXiv preprint arXiv:1711.05101.