Yekun's Note

Machine learning notes and writeup.

Fork me on GitHub

Optimization Methods in Deep Learning


Vanilla gradient descent

Let be the gradient of the error function $E$ w.r.t a weight at update time $t$.

  • “Vanilla” gradient descent updates the weight along the nagative gradient direction:where $\eta$ denotes the learning rate.

How to set the learning rate?

Initialize the $\eta$, and update as the training processes

  • Learning rate schedules: typically initial larger steps followed by smaller steps for fine tuning: results in faster convergence and better solutions.

Learning Rate Schedules:

  1. Time-dependent schedules:
    • Piecewise constant: pre-determined $\eta$ for each epoch
    • Exponential:
      , where $r \sim \text{training set size}$
    • Reciprocal:
      , where $c \sim 1$.
  2. Performance-dependent $\eta$: fixd $\eta$ until validation set stops improving, then halve $\eta$ each epoch (i.e. constant, then exponential)


  • Momentum hyperparameter: $\alpha \sim 0.9$
  • Momentum term encourages the weight change to go in the previous direction

  • Problems: tuning learning rate and momentum parameters can be expensive.

Adaptive learning rates

  • Tuning learning rate parameters is expensive (grid search)
    • AdaGrad: normalize the update for each weight
    • RMSProp: AdaGrad forces the learning rate to always decrease, this constraint is relexed with RMSProp
    • Adam: “RMSProp with momentum”


  • Separate, nomalized update for each weight
  • Normalized by the sum squared gradient $S$where $\epsilon \sim 10^{-8}$ is a small constant to prevent division by zero errors.
  • The update step for is normalized by the (square root of) the sum squared gradients for that parameter
    • Weights with larger gradient magnitudes will have smaller effective learning rates
    • cannot get smaller, so the effective learningr rates monotonically decrease
  • AdaGrad can decrease the effective learning rate too aggressively in NNs.


  • Backgroud: RProp[3] is for batch gradient descent with an adaptive learning rate for each parameter, and uses only the sign of the gradient (equivalent to normalizing by the gradient)
  • RMSProp can be viewed as a stochastic gradient descent version of RProp normalized by a moving average of the squared gradient[4]. It is similar to AdaGrad, but replacing the sum by a moving average for $S$:where $\beta \sim 0.9$ is the decay rate.
  • Effective learning rates no longer guaranteed to decrease.


AdaGrad has two main drawbacks:

  1. the continual decay of learning rates throughout training;
  2. the need for a manually selected global learning rate.

Let $\beta$ denote the decay rate.

The numerator of update term is not manually $\eta$, resulting in insensitivity of pre-defined learning rate.


  • Requires no manually setting of a learning rate;
  • insensitive to hyperparameters;
  • minimal computation over gradient descent.

upload successful
where $\text{RMS}[x]_t = \sqrt{E[x^2]_t + \epsilon}$

Adam (ICLR 2015)

  • Adam can be viewed as a variant of RMSProp with momentum:where $\alpha \sim 0.9$, $\beta \sim 0.999$, $\epsilon \sim 1e-8$

upload successful

upload successful

AdamW (ICLR 2018)

upload successful


  • AMSGrad[6] maintains the maximum of all unitl the present time step to normalizing the running average of the gradient instead of directly using in Adam.
  • AMSGrad results in a non-increasing step size and avoids pitfalls of Adam and RMSProp.


  • AdaBound[7] applies a gradient clipping on the learning rate. It behaves like Adam at the beginning as the bounds have little impact on learning rates, and it gradually transforms to SGD.

Rectified Adam (RAdam)


  • Adaptive learning rate has an undesirably large variance due to the lack of samples in the early stage, leading to suspicious local optima.

Previous solution

  • Warmup heuistics: using a small learning rate in the first few epochs of training
  • E.g. Linear warmup, set , when .


    Rectified Adam (RAdam) induced a rectification term to mitigate the variance of adaptive learning rate, inspired by Exponential Moving Average (EMA)[10].

Let denote the length of Simple Moving Average (SMA), be the maximum length of the approximated SMA.

if $\rho_t > 4$ i.e., the variance is tractable:


upload successful

  • The heuristic linear warmup can be viewed as setting


Lookahead [11] is orthogonal to 1) adaptive learning rate schemes, such as AdaGrad and Adam, and 2) accelerated schemes, e.g. heavy-ball and Nesterov momentum. It iteratively updates two sets of weights.

  • Intuition: choose a search direction by loking ahead at the sequence of “fast weights”, generated by another optimizer.

  • Process: Lookahead first updates the “fast weights” $k$ times using any standard optimizer in the inner loop, before updating the “slow weights” once, in the direction of the final fast weights.

  • It is empirically robust to the suboptimal hyperparameters, changes in the inner loop optimizer, # of fast weight updates and the slow weights learning rate.

Lookahead optimizer

Lookahead maintains a set of slow weights $\phi$ and fast weights $\theta$,which synced with fast weights every $k$ updates.

  • Init parameters , loss function $L$
  • sync period $k$, slow weights step size $\alpha$, optimizer $A$
  • for t = 1,2,…:
    • sync updated slow weights to the inner loop fast weights:
    • for i in 1,2,…,k:
      • sample mini-batch of data $d \sim \mathcal{D}$
      • update fast weights with standard optimizer
    • update slow weights with interpolation

After $k$ inner updates using $A$, the slow weights are updated towards the fast weights by linearly interpolating in weight space $\theta - \phi$. Then after each slow weights update, the fast weights are reset to the current slow weights.

upload successful

  • Benefit: Loopahead benefits from large learning rate in the inner loop. The fast weights updates make rapid progress along the low curvature direction, whilst the slow weights help smooth out the oscillation through the parameter interpolation.

Slow weights trajectory

  • an exponential moving average (EMA) of the final fast weights within each inner loop after $k$ innerloop steps:

  • Intuition: slow weights use recent fast weights but matain the effect from previous fast weights.

Fast weights trajectory

  • maintaining
  • interpolating update:
  • resetting to current slow weights


Ranger[12] applies RAdam as the optimizer $A$ in Lookahead algorithms (source code: [13]).


  1. 1.cs231n neural-networks-3
  2. 2.Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2121-2159.
  3. 3.Riedmiller, M., & Braun, H. (1993, March). A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In Proceedings of the IEEE international conference on neural networks (Vol. 1993, pp. 586-591).
  4. 4.csc321 Neural Networks for Machine Learning Lecure6
  5. 5.Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  6. 6.Reddi, S. J., Kale, S., & Kumar, S. (2019). On the convergence of adam and beyond. arXiv preprint arXiv:1904.09237.
  7. 7.Luo, L., Xiong, Y., Liu, Y., & Sun, X. (2019). Adaptive gradient methods with dynamic bound of learning rate. arXiv preprint arXiv:1902.09843.
  8. 8.Loshchilov, I., & Hutter, F. (2017). Fixing weight decay regularization in adam. arXiv preprint arXiv:1711.05101.
  9. 9.Zeiler, M. D. (2012). ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
  10. 10.Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., & Han, J. (2019). On the Variance of the Adaptive Learning Rate and Beyond. ArXiv, abs/1908.03265.
  11. 11.Zhang, M.R., Lucas, J., Hinton, G.E., & Ba, J. (2019). Lookahead Optimizer: k steps forward, 1 step back. ArXiv, abs/1907.08610.
  12. 12.
  13. 13.