Stochastic gradient descent optimisers

A cheat sheet of stochastic gradient descent optimisers

Last updated: 22 Sep 2019

Categories

Stochastic gradient descent (SGD) optimisers can be broadly categorised into (1) adaptive learning rate schemes (e.g. Adam and RMSProp) and (2) accelerated schemes (e.g. Nesterov momentum).

Notation

Throughout, $w_t$ denotes the weights at time step $t$, $g_t$ the gradient of the loss with respect to $w_t$, $\alpha$ the learning rate, $\beta$ (or $\beta_1$, $\beta_2$) an exponential decay rate, $\epsilon$ a small constant for numerical stability, and $\lambda$ the weight decay rate. $m_t$ and $v_t$ are running estimates of the first and second moments of the gradient, respectively.

Vanilla SGD

$$ w_{t+1}=w_{t}-\alpha g_t $$
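
A minimal NumPy sketch of one update step (the function names, state handling and default hyperparameters here and in the sketches below are my own choices, not taken from any particular library):

```python
def sgd_step(w, g, alpha=0.01):
    # w_{t+1} = w_t - alpha * g_t
    return w - alpha * g
```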

Momentum

$$ w_{t+1} =w_{t}-\alpha m_t $$

where

$$ m_t =\beta m_{t-1}+(1-\beta) g_t $$
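
A sketch of one momentum step, following the $(1-\beta)$-scaled form above (note that some libraries instead use $m_t = \beta m_{t-1} + g_t$):

```python
def momentum_step(w, g, m, alpha=0.01, beta=0.9):
    # m_t = beta * m_{t-1} + (1 - beta) * g_t
    m = beta * m + (1 - beta) * g
    # w_{t+1} = w_t - alpha * m_t
    return w - alpha * m, m
```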

AdaGrad (adaptive gradient)

$$ w_{t+1} = w_{t}-\frac{\alpha}{\sqrt{v_t+\epsilon}} \cdot g_t $$

where

$$ v_t =v_{t-1}+\left[g_t\right]^{2} $$
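
A sketch of one AdaGrad step; $v$ accumulates squared gradients elementwise and is assumed to start at zero:

```python
import numpy as np

def adagrad_step(w, g, v, alpha=0.01, eps=1e-8):
    # v_t = v_{t-1} + g_t^2 (elementwise accumulation, never decays)
    v = v + g**2
    # w_{t+1} = w_t - alpha / sqrt(v_t + eps) * g_t
    return w - alpha / np.sqrt(v + eps) * g, v
```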

Adadelta

$$ w_{t+1} =w_{t}-\frac{\sqrt{D_{t-1}+\epsilon}}{\sqrt{v_t+\epsilon}} \cdot g_t $$

where

$$ \begin{aligned} D_{t} &=\beta D_{t-1}+(1-\beta)\left[\Delta w_{t}\right]^{2} \\ v_t &=\beta v_{t-1}+(1-\beta)\left[g_t\right]^{2} \end{aligned} $$
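
A sketch of one Adadelta step, carrying both running averages as state:

```python
import numpy as np

def adadelta_step(w, g, D, v, beta=0.95, eps=1e-6):
    # v_t = beta * v_{t-1} + (1 - beta) * g_t^2
    v = beta * v + (1 - beta) * g**2
    # delta_w = -sqrt(D_{t-1} + eps) / sqrt(v_t + eps) * g_t
    delta_w = -np.sqrt(D + eps) / np.sqrt(v + eps) * g
    # D_t = beta * D_{t-1} + (1 - beta) * delta_w^2
    D = beta * D + (1 - beta) * delta_w**2
    return w + delta_w, D, v
```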

RMSProp

$$ w_{t+1} =w_{t}-\frac{\alpha}{\sqrt{v_t+\epsilon}} \cdot g_t $$

where

$$ v_t =\beta v_{t-1}+(1-\beta)\left[g_t\right]^{2} $$
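
A sketch of one RMSProp step:

```python
import numpy as np

def rmsprop_step(w, g, v, alpha=0.001, beta=0.9, eps=1e-8):
    # v_t = beta * v_{t-1} + (1 - beta) * g_t^2
    v = beta * v + (1 - beta) * g**2
    # w_{t+1} = w_t - alpha / sqrt(v_t + eps) * g_t
    return w - alpha / np.sqrt(v + eps) * g, v
```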

Nesterov-accelerated gradient

$$ w_{t+1} =w_{t}-\alpha m_t $$

where

$$ m_t =\beta m_{t-1}+(1-\beta) g_* $$

Note that $g_*$ is the gradient evaluated at the look-ahead point $w_*$, where

$$ w_* =w_{t}-\alpha m_{t-1} $$
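
A sketch of one Nesterov step; it takes a callable `grad_fn` (an assumption of this sketch) because the gradient is evaluated at the look-ahead point rather than at $w_t$:

```python
def nesterov_step(w, m, grad_fn, alpha=0.01, beta=0.9):
    # Look-ahead point: w_* = w_t - alpha * m_{t-1}
    w_lookahead = w - alpha * m
    # Gradient evaluated at the look-ahead point
    g_star = grad_fn(w_lookahead)
    # m_t = beta * m_{t-1} + (1 - beta) * g_*
    m = beta * m + (1 - beta) * g_star
    # w_{t+1} = w_t - alpha * m_t
    return w - alpha * m, m
```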

Adam (adaptive moment estimation)

$$ w_{t+1} =w_{t}-\frac{\alpha}{\sqrt{\hat{v}_{t}}+\epsilon} \cdot \hat{m}_{t} $$

where

$$ \begin{aligned} \hat{m}_{t} &=\frac{m_t}{1-\beta_{1}^{t}} \\ \hat{v}_{t} &=\frac{v_t}{1-\beta_{2}^{t}} \end{aligned} $$

and

$$ \begin{aligned} m_t &=\beta_{1} m_{t-1}+\left(1-\beta_{1}\right) g_t \\ v_t &=\beta_{2} v_{t-1}+\left(1-\beta_{2}\right)\left[g_t\right]^{2} \end{aligned} $$
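
A sketch of one Adam step; `t` is the 1-based step count used for bias correction:

```python
import numpy as np

def adam_step(w, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First and second moment estimates
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    # Bias correction (t starts at 1)
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # w_{t+1} = w_t - alpha / (sqrt(v_hat) + eps) * m_hat
    return w - alpha / (np.sqrt(v_hat) + eps) * m_hat, m, v
```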

AdaMax

$$ w_{t+1} =w_{t}-\alpha \cdot \frac{\hat{m}_{t}}{s_t} $$

where

$$ \hat{m}_{t} =\frac{m_t}{1-\beta_{1}^{t}} $$

and

$$ \begin{aligned} m_t &=\beta_{1} m_{t-1}+\left(1-\beta_{1}\right) g_t \\ s_t &=\max \left(\beta_{2} s_{t-1},\left|g_t\right|\right) \end{aligned} $$
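
A sketch of one AdaMax step; note that, as in the formula above, only the first moment is bias-corrected:

```python
import numpy as np

def adamax_step(w, g, m, s, t, alpha=0.002, beta1=0.9, beta2=0.999):
    # First moment and infinity-norm second moment
    m = beta1 * m + (1 - beta1) * g
    s = np.maximum(beta2 * s, np.abs(g))
    # Bias-correct the first moment only
    m_hat = m / (1 - beta1**t)
    # w_{t+1} = w_t - alpha * m_hat / s_t
    # (assumes s > 0 elementwise; the formula above has no epsilon term)
    return w - alpha * m_hat / s, m, s
```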

Nadam (Nesterov-accelerated Adam)

$$ w_{t+1} =w_{t}-\frac{\alpha}{\sqrt{\hat{v}_{t}}+\epsilon} \cdot \left( \beta_{1} \hat{m}_{t} + \frac{1-\beta_{1}}{1-\beta_{1}^{t}} \cdot g_t \right) $$

where

$$ \begin{aligned} \hat{m}_{t} &=\frac{m_t}{1-\beta_{1}^{t}} \\ \hat{v}_{t} &=\frac{v_t}{1-\beta_{2}^{t}} \end{aligned} $$

and

$$ \begin{aligned} m_t &=\beta_{1} m_{t-1}+\left(1-\beta_{1}\right) g_t \\ v_t &=\beta_{2} v_{t-1}+\left(1-\beta_{2}\right)\left[g_t\right]^{2} \end{aligned} $$
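
A sketch of one Nadam step, with the Nesterov-style mix of the bias-corrected momentum and the current gradient:

```python
import numpy as np

def nadam_step(w, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Nesterov-style correction: mix bias-corrected momentum with the current gradient
    update = beta1 * m_hat + (1 - beta1) / (1 - beta1**t) * g
    return w - alpha / (np.sqrt(v_hat) + eps) * update, m, v
```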

SGDW (SGD with decoupled weight decay)

Note: SGDW is equivalent to SGD with momentum on an L2-regularised loss, since weight decay and L2 regularisation coincide for SGD (see the appendix). The schedule multiplier is set to 1 for readability.

$$ w_{t+1} = w_{t}-\alpha m_t - \lambda w_t $$

where

$$ m_t =\beta m_{t-1}+(1-\beta) g_t $$
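
A sketch of one SGDW step; the decay term $\lambda w_t$ is applied directly to the weights rather than folded into the gradient:

```python
def sgdw_step(w, g, m, alpha=0.01, beta=0.9, lam=1e-4):
    # m_t = beta * m_{t-1} + (1 - beta) * g_t
    m = beta * m + (1 - beta) * g
    # Decoupled weight decay: lam * w_t is not part of the gradient
    return w - alpha * m - lam * w, m
```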

AdamW (Adam with decoupled weight decay)

Note: As above, the schedule multiplier is set to 1 for readability.

$$ w_{t+1} =w_{t}-\frac{\alpha}{\sqrt{\hat{v}_{t}}+\epsilon} \cdot \hat{m}_{t} - \lambda w_t$$

where

$$ \begin{aligned} \hat{m}_{t} &=\frac{m_t}{1-\beta_{1}^{t}} \\ \hat{v}_{t} &=\frac{v_t}{1-\beta_{2}^{t}} \end{aligned} $$

and

$$ \begin{aligned} m_t &=\beta_{1} m_{t-1}+\left(1-\beta_{1}\right) g_t \\ v_t &=\beta_{2} v_{t-1}+\left(1-\beta_{2}\right)\left[g_t\right]^{2} \end{aligned} $$
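
A sketch of one AdamW step, i.e. an Adam step followed by decoupled weight decay:

```python
import numpy as np

def adamw_step(w, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, lam=1e-2):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Adam step plus decoupled weight decay
    return w - alpha / (np.sqrt(v_hat) + eps) * m_hat - lam * w, m, v
```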

SWA (stochastic weight averaging)

Apart from

$$ w_{t+1}=w_{t}-\alpha g_t $$

the optimiser maintains another set of weights $m$ which will be used in the final model. The update of $m$ is:

$$ m_{t+1}=\left\{\begin{array}{ll} \beta_t m_t + (1-\beta_t) w_t & \text{if } t \bmod c = 0 \\ m_t & \text{otherwise} \end{array}\right. $$

where $m$ is updated with the running average every $c$ time steps and

$$ \beta_t = \frac{t}{t+c} $$

The equations look different from the paper but are essentially identical due to my reparametrisation (see appendix).
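
A sketch of one SWA step; `t` counts steps from 0, and the paper's learning-rate schedule and cycle details are omitted:

```python
def swa_step(w, m, g, t, c, alpha=0.01):
    # Every c steps, fold the current weights w_t into the running average m
    if t % c == 0:
        beta = t / (t + c)
        m = beta * m + (1 - beta) * w
    # Ordinary SGD step on the working weights: w_{t+1} = w_t - alpha * g_t
    return w - alpha * g, m
```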

AMSGrad

$$ w_{t+1} = w_{t}-\frac{\alpha}{\sqrt{\hat{v}_{t}}+\epsilon} \cdot m_t $$

where

$$ \hat{v}_{t} =\max \left(\hat{v}_{t-1}, v_t\right) $$

and

$$ \begin{aligned} m_t &=\beta_{1} m_{t-1}+\left(1-\beta_{1}\right) g_t \\ v_t &=\beta_{2} v_{t-1}+\left(1-\beta_{2}\right)\left[g_t\right]^{2} \end{aligned}$$
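
A sketch of one AMSGrad step; `v_hat` keeps the elementwise maximum of all second-moment estimates seen so far:

```python
import numpy as np

def amsgrad_step(w, g, m, v, v_hat, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    # Keep the elementwise maximum of all second-moment estimates so far
    v_hat = np.maximum(v_hat, v)
    return w - alpha / (np.sqrt(v_hat) + eps) * m, m, v, v_hat
```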

NovoGrad

Note that the weight update is done per layer, as indicated by the superscript $^{(l)}$.

$$ w_{t+1}^{(l)} =w_{t}^{(l)} - \alpha m_t^{(l)} $$

where

$$ m_t^{(l)} = \beta_{1} m_{t-1}^{(l)} + \left(1-\beta_{1}\right) \left( \frac{1}{\sqrt{v_t^{(l)}} + \epsilon} \cdot g_t^{(l)} + \lambda w_t^{(l)} \right) $$

and

$$ v_t^{(l)} = \beta_{2} v_{t-1}^{(l)} + \left(1-\beta_{2}\right) \left\| g_t^{(l)}\right\|_2^2 $$
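
A sketch of one NovoGrad step for a single layer; `v` is a scalar per layer built from the squared gradient norm, and `w`, `g`, `m` are that layer's parameters:

```python
import numpy as np

def novograd_step(w, g, m, v, alpha=0.01, beta1=0.95, beta2=0.98,
                  lam=1e-3, eps=1e-8):
    # Layer-wise second moment: a single scalar from the squared gradient norm
    v = beta2 * v + (1 - beta2) * np.sum(g**2)
    # Normalised gradient plus decoupled weight decay, folded into the momentum
    m = beta1 * m + (1 - beta1) * (g / (np.sqrt(v) + eps) + lam * w)
    # w_{t+1} = w_t - alpha * m_t
    return w - alpha * m, m, v
```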

RAdam (Rectified Adam)

$$ w_{t+1}=w_{t}-\frac{\alpha}{\sqrt{\hat{v}_{t}}+\epsilon} \cdot r_{t} \hat{m}_{t} $$

where

$$ \begin{aligned} \hat{m}_{t} &=\frac{m_t}{1-\beta_{1}^{t}} \\ \hat{v}_{t} &=\left\{\begin{array}{cc}{\frac{v_t}{1-\beta_{2}^{t}}} & {\text { if } p_{t}>4} \\ {1} & {\text { otherwise }}\end{array}\right.\\ r_{t} &=\left\{\begin{array}{cc}{\sqrt{\frac{\left(p_{t}-4\right)\left(p_{t}-2\right) p_{\infty}}{\left(p_{\infty}-4\right)\left(p_{\infty}-2\right) p_{t}}}} & {\text { if } p_{t}>4} \\ {1} & {\text { otherwise }}\end{array}\right.\end{aligned} $$

and

$$ \begin{aligned} m_t &=\beta_{1} m_{t-1}+\left(1-\beta_{1}\right) g_t \\ v_t &=\beta_{2} v_{t-1}+\left(1-\beta_{2}\right)\left[g_t\right]^{2} \\ p_{t} &=p_{\infty}-\frac{\beta_{2}^{t}}{1-\beta_{2}^{t}} \cdot 2 t \\ p_{\infty} &=\frac{2}{1-\beta_{2}}-1 \end{aligned} $$
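
A sketch of one RAdam step, including the fallback to an un-adapted momentum step when $p_t \le 4$:

```python
import numpy as np

def radam_step(w, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    # Approximated length of the SMA and its asymptotic value
    p_inf = 2 / (1 - beta2) - 1
    p_t = p_inf - 2 * t * beta2**t / (1 - beta2**t)
    if p_t > 4:
        v_hat = v / (1 - beta2**t)
        r_t = np.sqrt((p_t - 4) * (p_t - 2) * p_inf /
                      ((p_inf - 4) * (p_inf - 2) * p_t))
    else:
        # Variance of the adaptive term is intractable: un-adapted momentum step
        v_hat, r_t = 1.0, 1.0
    return w - alpha / (np.sqrt(v_hat) + eps) * r_t * m_hat, m, v
```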

Updates and corrections

If you see mistakes or want to suggest changes, kindly email me at raimi.bkarim@gmail.com.

Appendix

Weight decay

$$ w_{t+1} = (1-\lambda)w_t - \alpha g_t$$

Weight decay and L2

SGD with L2 regularisation is the same as SGD with weight decay, as the following derivation shows.

For a loss function with L2 regularisation and regularisation parameter $\lambda^{\prime}$, we have:

$$ f(w) = L(w) + \frac{1}{2}\lambda^{\prime}||w||_2^2 $$

and its corresponding gradient descent weight update is:

$$ \begin{aligned} w_{t+1} &= w_t - \alpha \frac{\partial f}{\partial w} \\ &= w_t - \alpha \frac{\partial L}{\partial w} - \alpha \lambda^{\prime} w_t \end{aligned}$$

For SGD with weight decay $\lambda$ but no L2 regularisation, the weight update is:

$$ \begin{aligned} w_{t+1} &= (1-\lambda)w_t - \alpha \frac{\partial L}{\partial w} \\ &= w_t - \alpha \frac{\partial L}{\partial w} - \lambda w_t \end{aligned} $$

Comparing the two updates, we see that weight decay is just a reparametrisation of L2 regularisation with $\lambda = \alpha \lambda^{\prime}$.
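
A quick numerical check of this equivalence on a toy quadratic loss (the loss and the numbers are illustrative only):

```python
import numpy as np

# Toy quadratic loss L(w) = 0.5 * ||w||^2, so dL/dw = w
alpha, lam_prime = 0.1, 0.01
w = np.array([1.0, -2.0, 3.0])
grad_L = w

w_l2 = w - alpha * grad_L - alpha * lam_prime * w    # SGD on the L2-regularised loss
w_wd = (1 - alpha * lam_prime) * w - alpha * grad_L  # weight decay with lambda = alpha * lambda'

print(np.allclose(w_l2, w_wd))  # True
```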

SWA reparametrisation

From the pseudocode in the paper, and writing $n_{\text{models}}=n$, the update is

$$ \begin{aligned} w_{\mathrm{SWA}} &= \frac{w_{\mathrm{SWA}} \cdot n+w}{n+1} \\ &= \frac{n}{n+1} w_{\mathrm{SWA}} + \frac{1}{n+1} w \\ &= \frac{n}{n+1} w_{\mathrm{SWA}} + \frac{(n+1)-n}{n+1} w \\ &= \frac{n}{n+1} w_{\mathrm{SWA}} + (1-\frac{n}{n+1}) w \\ &= \beta w_{\mathrm{SWA}} + (1-\beta) w \end{aligned}$$

where $\beta = \frac{n}{n+1}$. Since $n=n_{\text{models}}=\frac{t}{c}$,

$$ \begin{aligned} \beta = \frac{t/c}{t/c+1} = \frac{t/c}{(t+c)/c} = \frac{t}{t+c} \end{aligned}$$
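
A quick numerical check that the running-average form with $\beta = \frac{n}{n+1}$ reproduces the plain arithmetic mean (random toy checkpoints are assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
checkpoints = rng.normal(size=(10, 3))   # toy "weights collected every c steps"

w_swa = checkpoints[0]
for n, w in enumerate(checkpoints[1:], start=1):
    beta = n / (n + 1)
    w_swa = beta * w_swa + (1 - beta) * w

print(np.allclose(w_swa, checkpoints.mean(axis=0)))  # True
```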

Related

https://towardsdatascience.com/10-gradient-descent-optimisation-algorithms-86989510b5e9

References

http://ruder.io/optimizing-gradient-descent/index.html

https://keras.io/optimizers/