Notes on GMM and EM
The probability density function of a normal distribution is defined as
$$ f(x|\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp(-\frac{(x-\mu)^2}{2\sigma^2}) $$
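As a quick sanity check, here is a minimal NumPy transcription of this density (the name `normal_pdf` is my own, not a library function):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    # f(x | mu, sigma) exactly as above; sigma is the standard deviation
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

print(normal_pdf(0.0, mu=0.0, sigma=1.0))  # ≈ 0.3989, i.e. 1/sqrt(2*pi)
```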
In the univariate setting, the probability density function of a GMM is
$$ \begin{aligned} p(x) &= \sum_{k=1}^{K} \pi_k \cdot \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp(-\frac{(x-\mu_k)^2}{2\sigma_k^2}) \\ &= \sum_{k=1}^{K} \pi_k \cdot f(x|\mu_k,\sigma_k) \end{aligned} $$
where $K$ is the number of 'clusters' or components, $\pi_k$ is the mixing coefficient of component $k$ with $ \sum_{k=1}^{K} \pi_k = 1 $, and the $\pi$ inside $f$ is the mathematical constant. Read on to see how these parameters are estimated.
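To make the formula concrete, here is a small sketch that evaluates the mixture density with `scipy.stats.norm`; the argument names `pis`, `mus`, `sigmas` are just illustrative:

```python
from scipy.stats import norm

def gmm_pdf(x, pis, mus, sigmas):
    # p(x) = sum_k pi_k * f(x | mu_k, sigma_k)
    return sum(pi * norm.pdf(x, loc=mu, scale=s) for pi, mu, s in zip(pis, mus, sigmas))

# a two-component example
print(gmm_pdf(0.5, pis=[0.3, 0.7], mus=[0.0, 2.0], sigmas=[1.0, 0.5]))
```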
Let's find the posterior $ p(z_k|x) $ by using Bayes' theorem, where $\mathbf{z}$ is the latent indicator variable telling us which component generated $x$ ($z_k = 1$ if it was component $k$). This result will be useful for the expectation-maximisation (EM) algorithm in the next section.
$$ \begin{aligned} p(z_k|x) &= \frac{p(x|z_k)p(z_k)}{p(x)} \\ &= \frac{\pi_k f(x|\mu_k, \sigma_k)}{\sum_{j=1}^K \pi_j f(x|\mu_j, \sigma_j)} \\ \end{aligned} $$
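Transcribed directly, the posterior for a single data point looks something like this (a sketch; the function name `posterior` is my own):

```python
import numpy as np
from scipy.stats import norm

def posterior(x, pis, mus, sigmas):
    # numerator per component: pi_k * f(x | mu_k, sigma_k); denominator: p(x)
    weighted = np.array([pi * norm.pdf(x, loc=mu, scale=s)
                         for pi, mu, s in zip(pis, mus, sigmas)])
    return weighted / weighted.sum()

print(posterior(0.5, pis=[0.3, 0.7], mus=[0.0, 2.0], sigmas=[1.0, 0.5]))  # entries sum to 1
```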
The EM algorithm aims to maximise the likelihood function, i.e. to perform maximum likelihood estimation (MLE).
$$ \begin{aligned} & \text{argmax}_\theta \prod_{i=1}^N p(x_i) \\ &= \text{argmax}_\theta \log \prod_{i=1}^N p(x_i) \\ &= \text{argmax}_\theta \sum_{i=1}^N \log p(x_i) \\ &= \text{argmax}_\theta \sum_{i=1}^N \log \sum_{j=1}^K \pi_j f(x_i|\mu_j, \sigma_j) \\ \end{aligned} $$
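This objective (the log-likelihood) is easy to evaluate even though it's hard to maximise analytically; a minimal sketch, assuming the same illustrative parameter names as above:

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(xs, pis, mus, sigmas):
    xs = np.asarray(xs)[:, None]                                                # shape (N, 1)
    p_x = (np.asarray(pis) * norm.pdf(xs, loc=mus, scale=sigmas)).sum(axis=1)   # p(x_i)
    return np.log(p_x).sum()

print(log_likelihood([0.1, 1.9, 2.3], pis=[0.3, 0.7], mus=[0.0, 2.0], sigmas=[1.0, 0.5]))
```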
We then set each partial derivative to zero and try to solve for $\mu_k$, $\sigma_k$ and $\pi_k$:
$$ \begin{aligned} \frac{\partial }{\partial \mu_{k}} \sum_{i=1}^N \log \sum_{j=1}^{K} \pi_{j} f\left(x_{i} | \mu_{j}, \sigma_{j}\right) &= 0 \\ \frac{\partial }{\partial \sigma_{k}} \sum_{i=1}^N \log \sum_{j=1}^{K} \pi_{j} f\left(x_{i} | \mu_{j}, \sigma_{j}\right) &= 0 \\ \frac{\partial }{\partial \pi_{k}} \sum_{i=1}^N \log \sum_{j=1}^{K} \pi_{j} f\left(x_{i} | \mu_{j}, \sigma_{j}\right) &= 0 \end{aligned} $$ (For $\pi_k$ we additionally need the constraint $\sum_{k=1}^{K} \pi_k = 1$; we'll come back to that.) Let's start by solving for $\mu_k$.
For readability, let's expand $\text{argmax}_\theta \sum_{i=1}^{N} \log \sum_{j=1}^{K} \pi_j f(x_i|\mu_j, \sigma_j)$: $$ \begin{aligned} \text{argmax}_\theta ( \\ & \log \sum_{j=1}^{K} \pi_{j} f\left(x_{1} | \mu_{j}, \sigma_{j}\right) \\ &+ \log \sum_{j=1}^{K} \pi_{j} f\left(x_{2} | \mu_{j}, \sigma_{j}\right) \\ &+ \dots \\ &+ \log \sum_{j=1}^{K} \pi_{j} f\left(x_{N} | \mu_{j}, \sigma_{j}\right) \\ ) \end{aligned} $$
For a particular $x_i$, say $x_2$ (note that only the $j = k$ term of the inner sum depends on $\mu_k$):
$$ \begin{aligned} & \frac{\partial }{\partial \mu_{k}}\log \sum_{j=1}^{K} \pi_{j} f\left(x_{2} | \mu_{j}, \sigma_{j}\right) \\ &= \frac{1}{\sum_{j=1}^{K} \pi_{j} f\left(x_{2} | \mu_{j}, \sigma_{j}\right)} \cdot \frac{\partial}{\partial \mu_{k}} \pi_{k} f\left(x_{2} | \mu_{k}, \sigma_{k}\right) \\ &= \frac{1}{\sum_{j=1}^{K} \pi_{j} f\left(x_{2} | \mu_{j}, \sigma_{j}\right)} \cdot \frac{\partial}{\partial \mu_k} \pi_{k} \frac{1}{\sqrt{2 \pi \sigma_{k}^{2}}} \exp \left(-\frac{\left(x_{2}-\mu_{k}\right)^{2}}{2 \sigma_{k}^{2}}\right) \\ &= \frac{1}{\sum_{j=1}^{K} \pi_{j} f\left(x_{2} | \mu_{j}, \sigma_{j}\right)} \cdot \frac{\left(x_{2}-\mu_{k}\right)}{\sigma_{k}^{2}} \cdot \pi_{k} \frac{1}{\sqrt{2 \pi \sigma_{k}^{2}}} \exp \left(-\frac{\left(x_{2}-\mu_{k}\right)^{2}}{2 \sigma_{k}^{2}}\right) \\ &= \frac{1}{\sum_{j=1}^{K} \pi_{j} f\left(x_{2} | \mu_{j}, \sigma_{j}\right)} \cdot \frac{\left(x_{2}-\mu_{k}\right)}{\sigma_{k}^{2}} \cdot \pi_{k} f\left(x_{2} | \mu_{k}, \sigma_{k}\right) \\ &= \frac{\pi_{k} f\left(x_{2} | \mu_{k}, \sigma_{k}\right)}{\sum_{j=1}^{K} \pi_{j} f\left(x_{2} | \mu_{j}, \sigma_{j}\right)} \cdot \frac{\left(x_{2}-\mu_{k}\right)}{\sigma_{k}^{2}} \\ \end{aligned}$$
The derivative of the sum over all data points $x_i$ is thus
$$ \begin{aligned} & \frac{\partial }{\partial \mu_{k}}\sum_{i=1}^{N} \log \sum_{j=1}^{K} \pi_{j} f\left(x_{i} | \mu_{j}, \sigma_{j}\right) \\ &= \sum_{i=1}^{N} \frac{\left(x_{i}-\mu_{k}\right)}{\sigma_{k}^{2}} \cdot \frac{\pi_{k} f\left(x_{i} | \mu_{k}, \sigma_{k}\right)}{\sum_{j=1}^{K} \pi_{j} f\left(x_{i} | \mu_{j}, \sigma_{j}\right)} \\ \end{aligned}$$
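This expression can be checked numerically. Below is a small finite-difference comparison (my own sanity check under made-up parameter values, not part of the derivation):

```python
import numpy as np
from scipy.stats import norm

xs = np.array([0.1, 1.9, 2.3])
pis, mus, sigmas = np.array([0.3, 0.7]), np.array([0.0, 2.0]), np.array([1.0, 0.5])
k, eps = 0, 1e-6

def loglik(mus_):
    return np.log((pis * norm.pdf(xs[:, None], loc=mus_, scale=sigmas)).sum(axis=1)).sum()

# analytical derivative, straight from the formula above
weighted = pis * norm.pdf(xs[:, None], loc=mus, scale=sigmas)
analytic = ((xs - mus[k]) / sigmas[k] ** 2 * weighted[:, k] / weighted.sum(axis=1)).sum()

# finite-difference approximation of d(log-likelihood)/d(mu_k)
mus_plus = mus.copy()
mus_plus[k] += eps
numeric = (loglik(mus_plus) - loglik(mus)) / eps

print(analytic, numeric)  # the two should agree closely
```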
So, back to trying to solve for $\mu_k$, $$\begin{aligned} \frac{\partial }{\partial \mu_{k}} \sum_{i=1}^N \log \sum_{j=1}^{K} \pi_{j} f\left(x_{i} | \mu_{j}, \sigma_{j}\right) &= 0 \\ \sum_{i=1}^{N} \frac{\left(x_{i}-\mu_{k}\right)}{\sigma_{k}^{2}} \cdot \frac{\pi_{k} f\left(x_{i} | \mu_{k}, \sigma_{k}\right)}{\sum_{j=1}^{K} \pi_{j} f\left(x_{i} | \mu_{j}, \sigma_{j}\right)} &= 0 \end{aligned}$$ And we realise there's no analytical solution: the fraction itself still depends on $\mu_k$ (and on all the other parameters), so we can't isolate $\mu_k$ 😱!
Enter the EM algorithm.
In this step, we replace the troublesome fraction $\frac{\pi_k f(x_i|\mu_k, \sigma_k)}{\sum_{j=1}^K \pi_j f(x_i|\mu_j, \sigma_j)}$ with a fixed number $\gamma_k(x_i)$, evaluated at the current parameter estimates. Why does this work?
This tractability comes at the cost of having to repeat the procedure iteratively until convergence.
This step is called the 'expectation' step because the expectation of an indicator variable (which is what our latent variable is) is simply the posterior probability of that indicator being 1:
$$ \begin{aligned} E[z_k|X=x] &= 0 \cdot p(z_{\text{not }k}|X=x) + 1 \cdot p(z_k|X=x) \\ &= p(z_k|x) \\ &= \frac{p(x|z_k)p(z_k)}{p(x)} \\ &= \frac{\pi_k f(x|\mu_k, \sigma_k)}{\sum_{j=1}^K \pi_j f(x|\mu_j, \sigma_j)} \\ &= \gamma_k(x) \end{aligned} $$
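In code, the E-step just evaluates $\gamma_k(x_i)$ for every data point and component. A vectorised sketch (the shapes and names are my own choices):

```python
import numpy as np
from scipy.stats import norm

def e_step(xs, pis, mus, sigmas):
    # gamma[i, k] = pi_k f(x_i | mu_k, sigma_k) / sum_j pi_j f(x_i | mu_j, sigma_j)
    weighted = np.asarray(pis) * norm.pdf(np.asarray(xs)[:, None], loc=mus, scale=sigmas)
    return weighted / weighted.sum(axis=1, keepdims=True)  # shape (N, K), each row sums to 1
```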
At every iteration, the expectation is calculated using the current values of $\pi$, $\mu$ and $\sigma$. In particular, $\pi_k$ is re-estimated as
$$ \pi_k = \frac{N_k}{N} $$
where $N_k = \sum_{i=1}^{N} \gamma_k(x_i)$ is the effective number of points assigned to component $k$.
If we're only just starting the optimisation, there won't be any existing values yet, so it's our responsibility to initialise them.
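One simple (and entirely illustrative) initialisation: equal mixing weights, means drawn from random data points, and unit standard deviations. K-means initialisation is another common choice.

```python
import numpy as np

def init_params(xs, K, seed=0):
    rng = np.random.default_rng(seed)
    pis = np.full(K, 1.0 / K)                                             # equal mixing weights
    mus = rng.choice(np.asarray(xs, dtype=float), size=K, replace=False)  # K random data points
    sigmas = np.ones(K)                                                   # unit standard deviations
    return pis, mus, sigmas
```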
$$ \begin{aligned} & \frac{\partial }{\partial \mu_{k}}\sum_{i=1}^{N} \log \sum_{j=1}^{K} \pi_{j} f\left(x_{i} | \mu_{j}, \sigma_{j}\right) \\ &= \sum_{i=1}^{N} \frac{\left(x_{i}-\mu_{k}\right)}{\sigma_{k}^{2}} \cdot \frac{\pi_{k} f\left(x_{i} | \mu_{k}, \sigma_{k}\right)}{\sum_{j=1}^{K} \pi_{j} f\left(x_{i} | \mu_{j}, \sigma_{j}\right)} \\ &= \sum_{i=1}^{N} \frac{\left(x_{i}-\mu_{k}\right)}{\sigma_{k}^{2}} \gamma_k(x_i) \end{aligned}$$
Note that $\gamma_k(x_i)$ is treated as a constant during this maximisation step.
As we would for an ordinary MLE, we set the derivative to zero to find the estimate of $\mu_k$:
$$ \begin{aligned} \frac{\partial }{\partial \mu_{k}}\sum_{i=1}^{N} \log \sum_{j=1}^{K} \pi_{j} f\left(x_{i} | \mu_{j}, \sigma_{j}\right) &= 0 \\ \sum_{i=1}^{N} \frac{\left(x_{i}-\mu_{k}\right)}{\sigma_{k}^{2}} \gamma_k(x_i) &= 0 \\ \mu_k \sum_{i=1}^{N} \gamma_k(x_i) &= \sum_{i=1}^{N} x_i\gamma_k(x_i) \\ \mu_k &= \frac{\sum_{i=1}^{N} x_i\gamma_k(x_i)}{\sum_{i=1}^{N} \gamma_k(x_i)} \end{aligned} $$
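The resulting update is just a responsibility-weighted mean. A sketch, assuming `gamma` is the (N, K) responsibility matrix from the E-step above:

```python
import numpy as np

def update_mus(xs, gamma):
    # mu_k = sum_i x_i gamma_k(x_i) / sum_i gamma_k(x_i)
    xs = np.asarray(xs)
    return (gamma * xs[:, None]).sum(axis=0) / gamma.sum(axis=0)
```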
Doing the same for the MLE of $\sigma_k$ (differentiate with respect to $\sigma_k$, set to zero and solve) gives a responsibility-weighted variance:
$$ \sigma_k^2 = \frac{\sum_{i=1}^{N} \gamma_k(x_i)\left(x_i-\mu_k\right)^2}{\sum_{i=1}^{N} \gamma_k(x_i)} $$
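Again as a sketch (same assumed `gamma` and `mus` as above), returning the standard deviation:

```python
import numpy as np

def update_sigmas(xs, gamma, mus):
    # sigma_k^2 = sum_i gamma_k(x_i) (x_i - mu_k)^2 / sum_i gamma_k(x_i)
    xs = np.asarray(xs)
    var = (gamma * (xs[:, None] - mus) ** 2).sum(axis=0) / gamma.sum(axis=0)
    return np.sqrt(var)
```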
The MLE of $\pi_k$ works out to the average responsibility (it can be derived by maximising subject to the constraint $\sum_{k=1}^{K} \pi_k = 1$ with a Lagrange multiplier):
$$ \pi_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_k(x_i) $$
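Putting the E-step and the three M-step updates together gives one possible EM loop for a univariate GMM. This is only a sketch under the naming assumptions used throughout (fixed iteration count, simple random initialisation, no convergence check or safeguards against collapsing components):

```python
import numpy as np
from scipy.stats import norm

def em_gmm(xs, K, n_iter=100, seed=0):
    xs = np.asarray(xs, dtype=float)
    rng = np.random.default_rng(seed)
    # initialisation: equal weights, means from random data points, unit std devs
    pis = np.full(K, 1.0 / K)
    mus = rng.choice(xs, size=K, replace=False)
    sigmas = np.ones(K)
    for _ in range(n_iter):
        # E-step: responsibilities gamma_k(x_i), shape (N, K)
        weighted = pis * norm.pdf(xs[:, None], loc=mus, scale=sigmas)
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: closed-form updates derived above
        Nk = gamma.sum(axis=0)
        mus = (gamma * xs[:, None]).sum(axis=0) / Nk
        sigmas = np.sqrt((gamma * (xs[:, None] - mus) ** 2).sum(axis=0) / Nk)
        pis = Nk / len(xs)
    return pis, mus, sigmas

# toy usage: data drawn from two well-separated normals
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 0.5, 200)])
print(em_gmm(data, K=2))
```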