Stochastic gradient descent optimisers

A cheat sheet of stochastic gradient descent optimisers

Last updated: 22 Sep 2019

Categories

Stochastic gradient descent (SGD) optimisers can be broadly categorised into (1) adaptive learning rate schemes (e.g. Adam and RMSProp) and (2) accelerated schemes (e.g. Nesterov momentum).

Notation

Throughout, $w_t$ denotes the weights at time step $t$, $g_t$ the gradient of the loss with respect to $w_t$, $\alpha$ the learning rate, $\beta$ (or $\beta_1$, $\beta_2$) an exponential decay rate, $\epsilon$ a small constant for numerical stability, and $\lambda$ the weight decay rate. $m_t$ and $v_t$ are running estimates of the first and second moments of the gradient, respectively.

Vanilla SGD

$$ w_{t+1}=w_{t}-\alpha g_t $$
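
A minimal NumPy sketch of one update step (the function names, state handling and default hyperparameters here and in the sketches below are my own choices, not taken from any particular library):

```python
def sgd_step(w, g, alpha=0.01):
    # w_{t+1} = w_t - alpha * g_t
    return w - alpha * g
```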

Momentum

$$ w_{t+1} =w_{t}-\alpha m_t $$

where

$$ m_t =\beta m_{t-1}+(1-\beta) g_t $$
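
A sketch of one momentum step, following the $(1-\beta)$-scaled form above (note that some libraries instead use $m_t = \beta m_{t-1} + g_t$):

```python
def momentum_step(w, g, m, alpha=0.01, beta=0.9):
    # m_t = beta * m_{t-1} + (1 - beta) * g_t
    m = beta * m + (1 - beta) * g
    # w_{t+1} = w_t - alpha * m_t
    return w - alpha * m, m
```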

AdaGrad (adaptive gradient)

$$ w_{t+1} = w_{t}-\frac{\alpha}{\sqrt{v_t+\epsilon}} \cdot g_t $$

where

$$ v_t =v_{t-1}+\left[g_t\right]^{2} $$
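
A sketch of one AdaGrad step; $v$ accumulates squared gradients elementwise and is assumed to start at zero:

```python
import numpy as np

def adagrad_step(w, g, v, alpha=0.01, eps=1e-8):
    # v_t = v_{t-1} + g_t^2 (elementwise accumulation, never decays)
    v = v + g**2
    # w_{t+1} = w_t - alpha / sqrt(v_t + eps) * g_t
    return w - alpha / np.sqrt(v + eps) * g, v
```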

Adadelta

$$ w_{t+1} =w_{t}-\frac{\sqrt{D_{t-1}+\epsilon}}{\sqrt{v_t+\epsilon}} \cdot g_t $$

where

$$ \begin{aligned} D_{t} &=\beta D_{t-1}+(1-\beta)\left[\Delta w_{t}\right]^{2} \\ v_t &=\beta v_{t-1}+(1-\beta)\left[g_t\right]^{2} \end{aligned} $$
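
A sketch of one Adadelta step, carrying both running averages as state:

```python
import numpy as np

def adadelta_step(w, g, D, v, beta=0.95, eps=1e-6):
    # v_t = beta * v_{t-1} + (1 - beta) * g_t^2
    v = beta * v + (1 - beta) * g**2
    # delta_w = -sqrt(D_{t-1} + eps) / sqrt(v_t + eps) * g_t
    delta_w = -np.sqrt(D + eps) / np.sqrt(v + eps) * g
    # D_t = beta * D_{t-1} + (1 - beta) * delta_w^2
    D = beta * D + (1 - beta) * delta_w**2
    return w + delta_w, D, v
```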

RMSProp

$$ w_{t+1} =w_{t}-\frac{\alpha}{\sqrt{v_t+\epsilon}} \cdot g_t $$

where

$$ v_t =\beta v_{t-1}+(1-\beta)\left[g_t\right]^{2} $$
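
A sketch of one RMSProp step:

```python
import numpy as np

def rmsprop_step(w, g, v, alpha=0.001, beta=0.9, eps=1e-8):
    # v_t = beta * v_{t-1} + (1 - beta) * g_t^2
    v = beta * v + (1 - beta) * g**2
    # w_{t+1} = w_t - alpha / sqrt(v_t + eps) * g_t
    return w - alpha / np.sqrt(v + eps) * g, v
```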

Nesterov-accelerated gradient

$$ w_{t+1} =w_{t}-\alpha m_t $$

where

$$ m_t =\beta m_{t-1}+(1-\beta) g_* $$

Note that $g_*$ is the gradient evaluated at the look-ahead point $w_*$, where

$$ w_* =w_{t}-\alpha m_{t-1} $$
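
A sketch of one Nesterov step; it takes a callable `grad_fn` (an assumption of this sketch) because the gradient is evaluated at the look-ahead point rather than at $w_t$:

```python
def nesterov_step(w, m, grad_fn, alpha=0.01, beta=0.9):
    # Look-ahead point: w_* = w_t - alpha * m_{t-1}
    w_lookahead = w - alpha * m
    # Gradient evaluated at the look-ahead point
    g_star = grad_fn(w_lookahead)
    # m_t = beta * m_{t-1} + (1 - beta) * g_*
    m = beta * m + (1 - beta) * g_star
    # w_{t+1} = w_t - alpha * m_t
    return w - alpha * m, m
```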

Adam (adaptive moment estimation)

$$ w_{t+1} =w_{t}-\frac{\alpha}{\sqrt{\hat{v}_{t}}+\epsilon} \cdot \hat{m}_{t} $$

where

$$ \begin{aligned} \hat{m}_{t} &=\frac{m_t}{1-\beta_{1}^{t}} \\ \hat{v}_{t} &=\frac{v_t}{1-\beta_{2}^{t}} \end{aligned} $$

and

$$ \begin{aligned} m_t &=\beta_{1} m_{t-1}+\left(1-\beta_{1}\right) g_t \\ v_t &=\beta_{2} v_{t-1}+\left(1-\beta_{2}\right)\left[g_t\right]^{2} \end{aligned} $$
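
A sketch of one Adam step; `t` is the 1-based step count used for bias correction:

```python
import numpy as np

def adam_step(w, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First and second moment estimates
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    # Bias correction (t starts at 1)
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # w_{t+1} = w_t - alpha / (sqrt(v_hat) + eps) * m_hat
    return w - alpha / (np.sqrt(v_hat) + eps) * m_hat, m, v
```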

AdaMax

$$ w_{t+1} =w_{t}-\alpha \cdot \frac{\hat{m}_{t}}{s_t} $$

where

$$ \hat{m}_{t} =\frac{m_t}{1-\beta_{1}^{t}} $$

and

$$ \begin{aligned} m_t &=\beta_{1} m_{t-1}+\left(1-\beta_{1}\right) g_t \\ s_t &=\max \left(\beta_{2} s_{t-1},\left|g_t\right|\right) \end{aligned} $$
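
A sketch of one AdaMax step; note that, as in the formula above, only the first moment is bias-corrected:

```python
import numpy as np

def adamax_step(w, g, m, s, t, alpha=0.002, beta1=0.9, beta2=0.999):
    # First moment and infinity-norm second moment
    m = beta1 * m + (1 - beta1) * g
    s = np.maximum(beta2 * s, np.abs(g))
    # Bias-correct the first moment only
    m_hat = m / (1 - beta1**t)
    # w_{t+1} = w_t - alpha * m_hat / s_t
    # (assumes s > 0 elementwise; the formula above has no epsilon term)
    return w - alpha * m_hat / s, m, s
```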

Nadam (Nesterov-accelerated Adam)

$$ w_{t+1} =w_{t}-\frac{\alpha}{\sqrt{\hat{v}_{t}}+\epsilon} \cdot \left( \beta_{1} \hat{m}_{t} + \frac{1-\beta_{1}}{1-\beta_{1}^{t}} \cdot g_t \right) $$

where

$$ \begin{aligned} \hat{m}_{t} &=\frac{m_t}{1-\beta_{1}^{t}} \\ \hat{v}_{t} &=\frac{v_t}{1-\beta_{2}^{t}} \end{aligned} $$

and

$$ \begin{aligned} m_t &=\beta_{1} m_{t-1}+\left(1-\beta_{1}\right) g_t \\ v_t &=\beta_{2} v_{t-1}+\left(1-\beta_{2}\right)\left[g_t\right]^{2} \end{aligned} $$
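
A sketch of one Nadam step, with the Nesterov-style mix of the bias-corrected momentum and the current gradient:

```python
import numpy as np

def nadam_step(w, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Nesterov-style correction: mix bias-corrected momentum with the current gradient
    update = beta1 * m_hat + (1 - beta1) / (1 - beta1**t) * g
    return w - alpha / (np.sqrt(v_hat) + eps) * update, m, v
```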

SGDW (SGD with decoupled weight decay)

Note: SGDW is equivalent to SGD with momentum on an L2-regularised loss, since weight decay and L2 regularisation coincide for SGD (see the appendix). The schedule multiplier is set to 1 for readability.

$$ w_{t+1} = w_{t}-\alpha m_t - \lambda w_t $$

where

$$ m_t =\beta m_{t-1}+(1-\beta) g_t $$
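
A sketch of one SGDW step; the decay term $\lambda w_t$ is applied directly to the weights rather than folded into the gradient:

```python
def sgdw_step(w, g, m, alpha=0.01, beta=0.9, lam=1e-4):
    # m_t = beta * m_{t-1} + (1 - beta) * g_t
    m = beta * m + (1 - beta) * g
    # Decoupled weight decay: lam * w_t is not part of the gradient
    return w - alpha * m - lam * w, m
```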

AdamW (Adam with decoupled weight decay)

Note: As above, the schedule multiplier is set to 1 for readability.

$$ w_{t+1} =w_{t}-\frac{\alpha}{\sqrt{\hat{v}_{t}}+\epsilon} \cdot \hat{m}_{t} - \lambda w_t$$

where

$$ \begin{aligned} \hat{m}_{t} &=\frac{m_t}{1-\beta_{1}^{t}} \\ \hat{v}_{t} &=\frac{v_t}{1-\beta_{2}^{t}} \end{aligned} $$

and

$$ \begin{aligned} m_t &=\beta_{1} m_{t-1}+\left(1-\beta_{1}\right) g_t \\ v_t &=\beta_{2} v_{t-1}+\left(1-\beta_{2}\right)\left[g_t\right]^{2} \end{aligned} $$
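
A sketch of one AdamW step, i.e. an Adam step followed by decoupled weight decay:

```python
import numpy as np

def adamw_step(w, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, lam=1e-2):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Adam step plus decoupled weight decay
    return w - alpha / (np.sqrt(v_hat) + eps) * m_hat - lam * w, m, v
```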

SWA (stochastic weight averaging)

Apart from

$$ w_{t+1}=w_{t}-\alpha g_t $$

the optimiser maintains another set of weights $m$ which will be used in the final model. The update of $m$ is:

$$ m_{t+1}=\left\{\begin{array}{ll} \beta_t m_t + (1-\beta_t) w_t & \text{if } t \bmod c = 0 \\ m_t & \text{otherwise} \end{array}\right. $$

where $m$ is updated with the running average every $c$ time steps and

$$ \beta_t = \frac{t}{t+c} $$

The equations look different from the paper but are essentially identical due to my reparametrisation (see appendix).
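
A sketch of one SWA step; `t` counts steps from 0, and the paper's learning-rate schedule and cycle details are omitted:

```python
def swa_step(w, m, g, t, c, alpha=0.01):
    # Every c steps, fold the current weights w_t into the running average m
    if t % c == 0:
        beta = t / (t + c)
        m = beta * m + (1 - beta) * w
    # Ordinary SGD step on the working weights: w_{t+1} = w_t - alpha * g_t
    return w - alpha * g, m
```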

AMSGrad

$$ w_{t+1} = w_{t}-\frac{\alpha}{\sqrt{\hat{v}_{t}}+\epsilon} \cdot m_t $$

where

$$ \hat{v}_{t} =\max \left(\hat{v}_{t-1}, v_t\right) $$

and

$$ \begin{aligned} m_t &=\beta_{1} m_{t-1}+\left(1-\beta_{1}\right) g_t \\ v_t &=\beta_{2} v_{t-1}+\left(1-\beta_{2}\right)\left[g_t\right]^{2} \end{aligned}$$
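
A sketch of one AMSGrad step; `v_hat` keeps the elementwise maximum of all second-moment estimates seen so far:

```python
import numpy as np

def amsgrad_step(w, g, m, v, v_hat, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    # Keep the elementwise maximum of all second-moment estimates so far
    v_hat = np.maximum(v_hat, v)
    return w - alpha / (np.sqrt(v_hat) + eps) * m, m, v, v_hat
```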

NovoGrad

Note that the weight update is done per layer, as indicated by the superscript $^{(l)}$.

$$ w_{t+1}^{(l)} =w_{t}^{(l)} - \alpha m_t^{(l)} $$

where

$$ m_t^{(l)} = \beta_{1} m_{t-1}^{(l)} + \left(1-\beta_{1}\right) \left( \frac{1}{\sqrt{v_t^{(l)}} + \epsilon} \cdot g_t^{(l)} + \lambda w_t^{(l)} \right) $$

and

$$ v_t^{(l)} = \beta_{2} v_{t-1}^{(l)} + \left(1-\beta_{2}\right) \left\| g_t^{(l)}\right\|_2^2 $$
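
A sketch of one NovoGrad step for a single layer; `v` is a scalar per layer built from the squared gradient norm, and `w`, `g`, `m` are that layer's parameters:

```python
import numpy as np

def novograd_step(w, g, m, v, alpha=0.01, beta1=0.95, beta2=0.98,
                  lam=1e-3, eps=1e-8):
    # Layer-wise second moment: a single scalar from the squared gradient norm
    v = beta2 * v + (1 - beta2) * np.sum(g**2)
    # Normalised gradient plus decoupled weight decay, folded into the momentum
    m = beta1 * m + (1 - beta1) * (g / (np.sqrt(v) + eps) + lam * w)
    # w_{t+1} = w_t - alpha * m_t
    return w - alpha * m, m, v
```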

RAdam (Rectified Adam)

$$ w_{t+1}=w_{t}-\frac{\alpha}{\sqrt{\hat{v}_{t}}+\epsilon} \cdot r_{t} \hat{m}_{t} $$

where

$$ \begin{aligned} \hat{m}_{t} &=\frac{m_t}{1-\beta_{1}^{t}} \\ \hat{v}_{t} &=\left\{\begin{array}{cc}{\frac{v_t}{1-\beta_{2}^{t}}} & {\text { if } p_{t}>4} \\ {1} & {\text { otherwise }}\end{array}\right.\\ r_{t} &=\left\{\begin{array}{cc}{\sqrt{\frac{\left(p_{t}-4\right)\left(p_{t}-2\right) p_{\infty}}{\left(p_{\infty}-4\right)\left(p_{\infty}-2\right) p_{t}}}} & {\text { if } p_{t}>4} \\ {1} & {\text { otherwise }}\end{array}\right.\end{aligned} $$

and

$$ \begin{aligned} m_t &=\beta_{1} m_{t-1}+\left(1-\beta_{1}\right) g_t \\ v_t &=\beta_{2} v_{t-1}+\left(1-\beta_{2}\right)\left[g_t\right]^{2} \\ p_{t} &=p_{\infty}-\frac{\beta_{2}^{t}}{1-\beta_{2}^{t}} \cdot 2 t \\ p_{\infty} &=\frac{2}{1-\beta_{2}}-1 \end{aligned} $$
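
A sketch of one RAdam step, including the fallback to an un-adapted momentum step when $p_t \le 4$:

```python
import numpy as np

def radam_step(w, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    # Approximated length of the SMA and its asymptotic value
    p_inf = 2 / (1 - beta2) - 1
    p_t = p_inf - 2 * t * beta2**t / (1 - beta2**t)
    if p_t > 4:
        v_hat = v / (1 - beta2**t)
        r_t = np.sqrt((p_t - 4) * (p_t - 2) * p_inf /
                      ((p_inf - 4) * (p_inf - 2) * p_t))
    else:
        # Variance of the adaptive term is intractable: un-adapted momentum step
        v_hat, r_t = 1.0, 1.0
    return w - alpha / (np.sqrt(v_hat) + eps) * r_t * m_hat, m, v
```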

Updates and corrections

If you see mistakes or want to suggest changes, kindly email me at raimi.bkarim@gmail.com.

Appendix

Weight decay

$$ w_{t+1} = (1-\lambda)w_t - \alpha g_t$$

Weight decay and L2

SGD with L2 regularisation is the same as SGD with weight decay, as the following derivation shows.

For a loss function with L2 regularisation and regularisation parameter $\lambda^{\prime}$, we have:

$$ f(w) = L(w) + \frac{1}{2}\lambda^{\prime}||w||_2^2 $$

and its corresponding gradient descent weight update is:

$$ \begin{aligned} w_{t+1} &= w_t - \alpha \frac{\partial f}{\partial w} \\ &= w_t - \alpha \frac{\partial L}{\partial w} - \alpha \lambda^{\prime} w_t \end{aligned}$$

For SGD with weight decay $\lambda$ but no L2 regularisation, the weight update is:

$$ \begin{aligned} w_{t+1} &= (1-\lambda)w_t - \alpha \frac{\partial L}{\partial w} \\ &= w_t - \alpha \frac{\partial L}{\partial w} - \lambda w_t \end{aligned} $$

Comparing the two updates, we see that weight decay is just a reparametrisation of L2 regularisation with $\lambda = \alpha \lambda^{\prime}$.
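
A quick numerical check of this equivalence on a toy quadratic loss (the loss and the numbers are illustrative only):

```python
import numpy as np

# Toy quadratic loss L(w) = 0.5 * ||w||^2, so dL/dw = w
alpha, lam_prime = 0.1, 0.01
w = np.array([1.0, -2.0, 3.0])
grad_L = w

w_l2 = w - alpha * grad_L - alpha * lam_prime * w    # SGD on the L2-regularised loss
w_wd = (1 - alpha * lam_prime) * w - alpha * grad_L  # weight decay with lambda = alpha * lambda'

print(np.allclose(w_l2, w_wd))  # True
```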

SWA reparametrisation

From the pseudocode in the paper, and writing $n_{\text{models}}=n$, the update is

$$ \begin{aligned} w_{\mathrm{SWA}} &= \frac{w_{\mathrm{SWA}} \cdot n+w}{n+1} \\ &= \frac{n}{n+1} w_{\mathrm{SWA}} + \frac{1}{n+1} w \\ &= \frac{n}{n+1} w_{\mathrm{SWA}} + \frac{(n+1)-n}{n+1} w \\ &= \frac{n}{n+1} w_{\mathrm{SWA}} + (1-\frac{n}{n+1}) w \\ &= \beta w_{\mathrm{SWA}} + (1-\beta) w \end{aligned}$$

where $\beta = \frac{n}{n+1}$. Since $n=n_{\text{models}}=\frac{t}{c}$,

$$ \begin{aligned} \beta = \frac{t/c}{t/c+1} = \frac{t/c}{(t+c)/c} = \frac{t}{t+c} \end{aligned}$$
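
A quick numerical check that the running-average form with $\beta = \frac{n}{n+1}$ reproduces the plain arithmetic mean (random toy checkpoints are assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
checkpoints = rng.normal(size=(10, 3))   # toy "weights collected every c steps"

w_swa = checkpoints[0]
for n, w in enumerate(checkpoints[1:], start=1):
    beta = n / (n + 1)
    w_swa = beta * w_swa + (1 - beta) * w

print(np.allclose(w_swa, checkpoints.mean(axis=0)))  # True
```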

Related

https://towardsdatascience.com/10-gradient-descent-optimisation-algorithms-86989510b5e9

References

http://ruder.io/optimizing-gradient-descent/index.html

https://keras.io/optimizers/