A cheat sheet of stochastic gradient descent optimisers
Last updated: 22 Sep 2019
Stochastic gradient descent (SGD) optimisers can be broadly categorised into (1) adaptive learning rate schemes (e.g. Adam and RMSProp) and (2) accelerated schemes (e.g. Nesterov momentum).
$$ w_{t+1}=w_{t}-\alpha g_t $$
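This is the plain (vanilla) SGD update. As a concrete reference, here is a minimal NumPy sketch; the function name, step size and toy quadratic objective are illustrative choices of mine, not from the original article.

```python
import numpy as np

def sgd_step(w, g, lr=0.1):
    """Plain SGD: w_{t+1} = w_t - alpha * g_t."""
    return w - lr * g

# Toy usage: minimise f(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = sgd_step(w, 2 * w)
print(w)  # close to [0, 0]
```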
$$ w_{t+1} =w_{t}-\alpha m_t $$
where
$$ m_t =\beta m_{t-1}+(1-\beta) g_t $$
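The two equations above are SGD with momentum, in the $(1-\beta)$-scaled form written here. A hedged sketch, with default hyperparameters of my own choosing:

```python
import numpy as np

def momentum_step(w, m, g, lr=0.1, beta=0.9):
    m = beta * m + (1 - beta) * g   # m_t = beta * m_{t-1} + (1 - beta) * g_t
    w = w - lr * m                  # w_{t+1} = w_t - alpha * m_t
    return w, m
```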
$$ w_{t+1} = w_{t}-\frac{\alpha}{\sqrt{v_t+\epsilon}} \cdot g_t $$
where
$$ v_t =v_{t-1}+\left[g_t\right]^{2} $$
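These are the AdaGrad equations: the accumulator $v$ only ever grows. A minimal sketch, assuming $v$ is initialised to zeros and using defaults I picked for illustration:

```python
import numpy as np

def adagrad_step(w, v, g, lr=0.01, eps=1e-8):
    v = v + g ** 2                        # v_t = v_{t-1} + g_t^2 (accumulates forever)
    w = w - lr / np.sqrt(v + eps) * g
    return w, v
```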
$$ w_{t+1} =w_{t}-\frac{\sqrt{D_{t-1}+\epsilon}}{\sqrt{v_t+\epsilon}} \cdot g_t $$
where
$$ \begin{aligned} D_{t} &=\beta D_{t-1}+(1-\beta)\left[\Delta w_{t}\right]^{2} \\ v_t &=\beta v_{t-1}+(1-\beta)\left[g_t\right]^{2} \end{aligned} $$
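These are the AdaDelta equations (note there is no learning rate $\alpha$). The order of operations matters: compute $v_t$, take the step using the previous $D_{t-1}$, then refresh $D_t$ from the step just taken. A sketch under those assumptions:

```python
import numpy as np

def adadelta_step(w, v, D, g, beta=0.9, eps=1e-8):
    v = beta * v + (1 - beta) * g ** 2                  # v_t
    step = np.sqrt(D + eps) / np.sqrt(v + eps) * g      # uses D_{t-1}
    w = w - step                                        # w_{t+1} = w_t - step
    D = beta * D + (1 - beta) * step ** 2               # D_t from the update just taken
    return w, v, D
```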
$$ w_{t+1} =w_{t}-\frac{\alpha}{\sqrt{v_t+\epsilon}} \cdot g_t $$
where
$$ v_t =\beta v_{t-1}+(1-\beta)\left[g_t\right]^{2} $$
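This is the RMSProp update: the same shape as AdaGrad, but with a leaky (exponential) average of the squared gradients. A minimal sketch with illustrative defaults:

```python
import numpy as np

def rmsprop_step(w, v, g, lr=1e-3, beta=0.9, eps=1e-8):
    v = beta * v + (1 - beta) * g ** 2     # leaky average of squared gradients
    w = w - lr / np.sqrt(v + eps) * g
    return w, v
```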
$$ w_{t+1} =w_{t}-\alpha m_t $$
where
$$ m_t =\beta m_{t-1}+(1-\beta) g_* $$
Note that $g_*$ is the gradient evaluated at the look-ahead point $w_*$, where
$$ w_* =w_{t}-\alpha m_{t-1} $$
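This is a Nesterov accelerated gradient (NAG) style update, in the $(1-\beta)$-scaled form used above. A sketch that assumes a user-supplied `grad_fn` callable returning the gradient of the loss at a given point:

```python
import numpy as np

def nag_step(w, m, grad_fn, lr=0.1, beta=0.9):
    w_star = w - lr * m                  # look-ahead point w_*
    g_star = grad_fn(w_star)             # g_*: gradient evaluated at w_*
    m = beta * m + (1 - beta) * g_star   # m_t
    return w - lr * m, m                 # w_{t+1} = w_t - alpha * m_t
```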
$$ w_{t+1} =w_{t}-\frac{\alpha}{\sqrt{\hat{v}_{t}}+\epsilon} \cdot \hat{m}_{t} $$
where
$$ \begin{aligned} \hat{m}_{t} &=\frac{m_t}{1-\beta_{1}^{t}} \\ \hat{v}_{t} &=\frac{v_t}{1-\beta_{2}^{t}} \end{aligned} $$
and
$$ \begin{aligned} m_t &=\beta_{1} m_{t-1}+\left(1-\beta_{1}\right) g_t \\ v_t &=\beta_{2} v_{t-1}+\left(1-\beta_{2}\right)\left[g_t\right]^{2} \end{aligned} $$
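These are the classic Adam equations. A minimal sketch, assuming the usual defaults and that the step counter $t$ starts at 1 (needed for the bias correction); the toy usage is illustrative only.

```python
import numpy as np

def adam_step(w, m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g            # first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2       # second-moment estimate
    m_hat = m / (1 - beta1 ** t)               # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr / (np.sqrt(v_hat) + eps) * m_hat
    return w, m, v

# Toy usage on f(w) = ||w||^2:
w, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 1001):
    w, m, v = adam_step(w, m, v, 2 * w, t, lr=0.1)
```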
$$ w_{t+1} =w_{t}-\alpha \cdot \frac{\hat{m}_{t}}{s_t} $$
where
$$ \hat{m}_{t} =\frac{m_t}{1-\beta_{1}^{t}} $$
and
$$ \begin{aligned} m_t &=\beta_{1} m_{t-1}+\left(1-\beta_{1}\right) g_t \\ s_t &=\max \left(\beta_{2} s_{t-1},\left|g_t\right|\right) \end{aligned} $$
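This is the AdaMax variant of Adam, which tracks an infinity-norm style accumulator $s$ instead of $v$. A sketch with illustrative defaults; note the formula above has no $\epsilon$, so in practice a tiny constant is often added to $s$ to guard against division by zero.

```python
import numpy as np

def adamax_step(w, m, s, g, t, lr=2e-3, beta1=0.9, beta2=0.999):
    m = beta1 * m + (1 - beta1) * g
    s = np.maximum(beta2 * s, np.abs(g))   # s_t = max(beta2 * s_{t-1}, |g_t|), elementwise
    m_hat = m / (1 - beta1 ** t)
    w = w - lr * m_hat / s                 # a tiny eps is often added to s in practice
    return w, m, s
```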
$$ w_{t+1} =w_{t}-\frac{\alpha}{\sqrt{\hat{v}_{t}}+\epsilon} \cdot \left( \beta_{1} \hat{m}_{t} + \frac{1-\beta_{1}}{1-\beta_{1}^{t}} \cdot g_t \right) $$
where
$$ \begin{aligned} \hat{m}_{t} &=\frac{m_t}{1-\beta_{1}^{t}} \\ \hat{v}_{t} &=\frac{v_t}{1-\beta_{2}^{t}} \end{aligned} $$
and
$$ \begin{aligned} m_t &=\beta_{1} m_{t-1}+\left(1-\beta_{1}\right) g_t \\ v_t &=\beta_{2} v_{t-1}+\left(1-\beta_{2}\right)\left[g_t\right]^{2} \end{aligned} $$
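This is the Nadam update (Adam with a Nesterov-style look-ahead folded into the numerator). A sketch that follows the bracketed form above, with defaults of my own choosing:

```python
import numpy as np

def nadam_step(w, m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    nesterov_term = beta1 * m_hat + (1 - beta1) / (1 - beta1 ** t) * g
    w = w - lr / (np.sqrt(v_hat) + eps) * nesterov_term
    return w, m, v
```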
Note: SGDW (SGD with momentum and decoupled weight decay) is equivalent to optimising a loss function with L2 regularisation using SGD with momentum, up to a reparametrisation of $\lambda$ (see appendix). The schedule multiplier is set to 1 for readability.
$$ w_{t+1} = w_{t}-\alpha m_t - \lambda w_t $$
where
$$ m_t =\beta m_{t-1}+(1-\beta) g_t $$
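A minimal sketch of this SGDW update, with the weight-decay term applied directly to the weights rather than folded into the gradient; the default $\lambda$ is arbitrary.

```python
import numpy as np

def sgdw_step(w, m, g, lr=0.1, beta=0.9, wd=1e-4):
    m = beta * m + (1 - beta) * g
    w = w - lr * m - wd * w     # weight decay applied directly to the weights
    return w, m
```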
Note: Here, too, the schedule multiplier is set to 1 for readability.
$$ w_{t+1} =w_{t}-\frac{\alpha}{\sqrt{\hat{v}_{t}}+\epsilon} \cdot \hat{m}_{t} - \lambda w_t$$
where
$$ \begin{aligned} \hat{m}_{t} &=\frac{m_t}{1-\beta_{1}^{t}} \\ \hat{v}_{t} &=\frac{v_t}{1-\beta_{2}^{t}} \end{aligned} $$
and
$$ \begin{aligned} m_t &=\beta_{1} m_{t-1}+\left(1-\beta_{1}\right) g_t \\ v_t &=\beta_{2} v_{t-1}+\left(1-\beta_{2}\right)\left[g_t\right]^{2} \end{aligned} $$
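This is an AdamW-style update: Adam with the weight decay decoupled from the adaptive step. A sketch, assuming $t$ starts at 1 and using defaults I picked for illustration:

```python
import numpy as np

def adamw_step(w, m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=1e-2):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr / (np.sqrt(v_hat) + eps) * m_hat - wd * w   # decay decoupled from the adaptive step
    return w, m, v
```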
In addition to the usual SGD update
$$ w_{t+1}=w_{t}-\alpha g_t $$
the optimiser maintains a second set of weights $m$, which is used as the final model. The update of $m$ is:
$$ m_{t+1}=\left\{\begin{array}{ll} \beta_t m_t + (1-\beta_t) w_t & \text{if } t \bmod c = 0 \\ m_t & \text{otherwise} \end{array}\right. $$
where $m$ is updated with the running average every $c$ time steps and
$$ \beta_t = \frac{t}{t+c} $$
The equations look different from those in the paper, but they are equivalent under my reparametrisation (see appendix).
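I read this scheme as stochastic weight averaging (SWA). A sketch of the averaging step alongside plain SGD on a toy quadratic; the cycle length $c=10$ and step size are arbitrary choices of mine.

```python
import numpy as np

def swa_update(m, w, t, c):
    """Refresh the averaged weights m every c steps; t is the SGD step counter."""
    if t % c == 0:
        beta_t = t / (t + c)
        m = beta_t * m + (1 - beta_t) * w
    return m

# Toy usage on f(w) = ||w||^2:
w = np.array([1.0, -2.0])
m = w.copy()
for t in range(200):
    m = swa_update(m, w, t, c=10)   # average uses w_t before the SGD step
    w = w - 0.1 * (2 * w)
```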
$$ w_{t+1} = w_{t}-\frac{\alpha}{\sqrt{\hat{v}_{t}}+\epsilon} \cdot m_t $$
where
$$ \hat{v}_{t} =\max \left(\hat{v}_{t-1}, v_t\right) $$
and
$$ \begin{aligned} m_t &=\beta_{1} m_{t-1}+\left(1-\beta_{1}\right) g_t \\ v_t &=\beta_{2} v_{t-1}+\left(1-\beta_{2}\right)\left[g_t\right]^{2} \end{aligned}$$
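This is the AMSGrad variant of Adam: the denominator uses the largest second-moment estimate seen so far, and (as written above) no bias correction is applied to $m_t$. A minimal sketch:

```python
import numpy as np

def amsgrad_step(w, m, v, v_hat, g, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    v_hat = np.maximum(v_hat, v)           # keep the largest v seen so far
    w = w - lr / (np.sqrt(v_hat) + eps) * m
    return w, m, v, v_hat
```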
Note that the weight update is done per layer, as indicated by the superscript $^{(l)}$.
$$ w_{t+1}^{(l)} =w_{t}^{(l)} - \alpha m_t^{(l)} $$
where
$$ m_t^{(l)} = \beta_{1} m_{t-1}^{(l)} + \left(1-\beta_{1}\right) \left( \frac{1}{\sqrt{v_t^{(l)}} + \epsilon} \cdot g_t^{(l)} + \lambda w_t^{(l)} \right) $$
and
$$ v_t^{(l)} = \beta_{2} v_{t-1}^{(l)} + \left(1-\beta_{2}\right) \left\| g_t^{(l)}\right\| _2^2 $$
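I read these equations as a NovoGrad-style per-layer scheme, where $v^{(l)}$ is a single scalar per layer. A sketch of one layer's update under that assumption, with function name and defaults of my own choosing:

```python
import numpy as np

def layer_step(w, m, v, g, lr=0.01, beta1=0.95, beta2=0.98, wd=1e-3, eps=1e-8):
    """One update for a single layer: w, m, g are arrays; v is a per-layer scalar."""
    v = beta2 * v + (1 - beta2) * np.linalg.norm(g) ** 2
    m = beta1 * m + (1 - beta1) * (g / (np.sqrt(v) + eps) + wd * w)
    return w - lr * m, m, v
```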
$$ w_{t+1}=w_{t}-\frac{\alpha}{\sqrt{\hat{v}_{t}}+\epsilon} \cdot r_{t} \hat{m}_{t} $$
where
$$ \begin{aligned} \hat{m}_{t} &=\frac{m_t}{1-\beta_{1}^{t}} \\ \hat{v}_{t} &=\left\{\begin{array}{cc}{\frac{v_t}{1-\beta_{2}^{t}}} & {\text { if } p_{t}>4} \\ {1} & {\text { otherwise }}\end{array}\right.\\ r_{t} &=\left\{\begin{array}{cc}{\sqrt{\frac{\left(p_{t}-4\right)\left(p_{t}-2\right) p_{\infty}}{\left(p_{\infty}-4\right)\left(p_{\infty}-2\right) p_{t}}}} & {\text { if } p_{t}>4} \\ {1} & {\text { otherwise }}\end{array}\right.\end{aligned} $$
and
$$ \begin{aligned} m_t &=\beta_{1} m_{t-1}+\left(1-\beta_{1}\right) g_t \\ v_t &=\beta_{2} v_{t-1}+\left(1-\beta_{2}\right)\left[g_t\right]^{2} \\ p_{t} &=p_{\infty}-\frac{\beta_{2}^{t}}{1-\beta_{2}^{t}} \cdot 2 t \\ p_{\infty} &=\frac{2}{1-\beta_{2}}-1 \end{aligned} $$
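This is a rectified-Adam (RAdam) style update. A sketch that follows the piecewise definitions above; the fallback branch simply takes an un-adapted momentum step during the early iterations.

```python
import numpy as np

def radam_step(w, m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    p_inf = 2 / (1 - beta2) - 1
    p_t = p_inf - 2 * t * beta2 ** t / (1 - beta2 ** t)
    if p_t > 4:                                  # variance is tractable: rectified adaptive step
        v_hat = v / (1 - beta2 ** t)
        r_t = np.sqrt((p_t - 4) * (p_t - 2) * p_inf /
                      ((p_inf - 4) * (p_inf - 2) * p_t))
    else:                                        # early steps: fall back to an un-adapted step
        v_hat, r_t = 1.0, 1.0
    w = w - lr / (np.sqrt(v_hat) + eps) * r_t * m_hat
    return w, m, v
```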
If you see mistakes or want to suggest changes, kindly email me at raimi.bkarim@gmail.com
$$ w_{t+1} = (1-\lambda)w_t - \alpha g_t$$
SGD with L2 regularisation is equivalent to SGD with weight decay, as shown below.
For a loss function with L2 regularisation and regularisation parameter $\lambda^{\prime}$, we have:
$$ f(w) = L(w) + \frac{1}{2}\lambda^{\prime}||w||_2^2 $$
and its corresponding gradient descent weight update is:
$$ \begin{aligned} w_{t+1} &= w_t - \alpha \frac{\partial f}{\partial w} \\ &= w_t - \alpha \frac{\partial L}{\partial w} - \alpha \lambda^{\prime} w_t \end{aligned}$$
For SGD with weight decay $\lambda$ but no regularisation, the weight update is
$$ \begin{aligned} w_{t+1} &= (1-\lambda)w_t - \alpha \frac{\partial L}{\partial w} \\ &= w_t - \alpha \frac{\partial L}{\partial w} - \lambda w_t \end{aligned} $$
Comparing the two updates, we see that weight decay is just a reparametrisation of the L2 coefficient: $\lambda = \alpha \lambda^{\prime}$.
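A quick numerical sanity check of this equivalence, using an arbitrary quadratic loss $L(w)=\frac{1}{2}\|w\|_2^2$ and hyperparameter values of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)
alpha, lam_prime = 0.1, 0.01
lam = alpha * lam_prime                       # lambda = alpha * lambda'

grad_L = w                                    # dL/dw for L(w) = 0.5 * ||w||^2
w_l2 = w - alpha * (grad_L + lam_prime * w)   # SGD on the L2-regularised loss
w_wd = (1 - lam) * w - alpha * grad_L         # SGD with weight decay lambda

print(np.allclose(w_l2, w_wd))                # True
```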
From the pseudocode in the paper, and writing $n_{\text{models}}=n$, the update is
$$ \begin{aligned} w_{\mathrm{SWA}} &= \frac{w_{\mathrm{SWA}} \cdot n+w}{n+1} \\ &= \frac{n}{n+1} w_{\mathrm{SWA}} + \frac{1}{n+1} w \\ &= \frac{n}{n+1} w_{\mathrm{SWA}} + \frac{(n+1)-n}{n+1} w \\ &= \frac{n}{n+1} w_{\mathrm{SWA}} + (1-\frac{n}{n+1}) w \\ &= \beta w_{\mathrm{SWA}} + (1-\beta) w \end{aligned}$$
where $\beta = \frac{n}{n+1}$. Since $n=n_{\text{models}}=\frac{t}{c}$,
$$ \begin{aligned} \beta = \frac{t/c}{t/c+1} = \frac{t/c}{(t+c)/c} = \frac{t}{t+c} \end{aligned}$$
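A small numerical check (with made-up checkpoint weights) that this running-average form reproduces the plain mean of the collected weights:

```python
import numpy as np

rng = np.random.default_rng(0)
checkpoints = [rng.normal(size=3) for _ in range(5)]       # stand-ins for w at every c-th step

w_swa = checkpoints[0]
for n, w in enumerate(checkpoints[1:], start=1):           # n = number of models averaged so far
    beta = n / (n + 1)
    w_swa = beta * w_swa + (1 - beta) * w

print(np.allclose(w_swa, np.mean(checkpoints, axis=0)))    # True
```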
Further reading: https://towardsdatascience.com/10-gradient-descent-optimisation-algorithms-86989510b5e9