[WIP] Gaussian mixture model and the EM algorithm

Notes on GMM and EM

Motivation: clustering

Defining Gaussian mixture model

Probability distribution of each observation

The probability density function of a normal distribution is defined as

$$ f(x|\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp(-\frac{(x-\mu)^2}{2\sigma^2}) $$
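As a quick sanity check, here is a minimal Python sketch (the function name is my own) that evaluates this density and compares it against `scipy.stats.norm`:

```python
import numpy as np
from scipy.stats import norm

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2), written out exactly as in the formula above."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

x = np.linspace(-3, 3, 7)
print(normal_pdf(x, mu=0.0, sigma=1.0))
print(norm.pdf(x, loc=0.0, scale=1.0))  # should match the line above
```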

Marginal distribution

In the univariate setting, the probability density function of a GMM is

$$ \begin{aligned} p(x) &= \sum_{k=1}^{K} \pi_k \cdot \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp(-\frac{(x-\mu_k)^2}{2\sigma_k^2}) \\ &= \sum_{k=1}^{K} \pi_k \cdot f(x|\mu_k,\sigma_k) \end{aligned} $$

where $K$ is the number of 'clusters' or components, $k$ indexes them, and $\pi_k$ is the mixing coefficient of component $k$ with $ \sum_{k=1}^{K} \pi_k = 1 $ (not to be confused with the mathematical constant $\pi$ inside the density). Read on to learn how to derive this.
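To make the definition concrete, here is a minimal Python sketch (the two-component parameters are made up for illustration, and the function name is my own) that evaluates $p(x)$:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-component mixture; the mixing coefficients must sum to 1.
pis    = np.array([0.3, 0.7])
mus    = np.array([-2.0, 1.0])
sigmas = np.array([0.5, 1.5])

def gmm_pdf(x, pis, mus, sigmas):
    """p(x) = sum_k pi_k * f(x | mu_k, sigma_k)."""
    x = np.atleast_1d(x)
    # shape (N, K): density of every point under every component
    comp = norm.pdf(x[:, None], loc=mus[None, :], scale=sigmas[None, :])
    return comp @ pis

print(gmm_pdf(np.array([-2.0, 0.0, 1.0]), pis, mus, sigmas))
```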

Posterior distribution

Here $\mathbf{z}$ is the latent indicator of cluster membership: $z_k = 1$ if the observation was generated by component $k$ and $0$ otherwise, so $p(z_k = 1) = \pi_k$ and $p(x|z_k = 1) = f(x|\mu_k, \sigma_k)$. Let's find $p(z_k = 1|x)$ by using Bayes' theorem. This result will be useful for the expectation-maximisation (EM) algorithm in the next section.

$$ \begin{aligned} p(z_k = 1|x) &= \frac{p(x|z_k = 1)\,p(z_k = 1)}{p(x)} \\ &= \frac{\pi_k f(x|\mu_k, \sigma_k)}{\sum_{j=1}^K \pi_j f(x|\mu_j, \sigma_j)} \\ \end{aligned} $$

MLE & EM

The EM algorithm aims to maximise the likelihood function, i.e. to perform maximum likelihood estimation (MLE). Since the log is monotonic, maximising the log-likelihood gives the same argmax.

$$ \begin{aligned} & \text{argmax}_\theta \prod_{i=1}^N p(x_i) \\ &= \text{argmax}_\theta \log \prod_{i=1}^N p(x_i) \\ &= \text{argmax}_\theta \sum_{i=1}^N \log p(x_i) \\ &= \text{argmax}_\theta \sum_{i=1}^N \log \sum_{j=1}^K \pi_j f(x_i|\mu_j, \sigma_j) \\ \end{aligned} $$
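This objective can be coded up directly. A minimal sketch (with made-up data and parameters; the function name is my own) of the log-likelihood $\sum_i \log \sum_j \pi_j f(x_i|\mu_j,\sigma_j)$:

```python
import numpy as np
from scipy.stats import norm

def gmm_log_likelihood(x, pis, mus, sigmas):
    """sum_i log sum_j pi_j * f(x_i | mu_j, sigma_j)."""
    comp = norm.pdf(x[:, None], loc=mus[None, :], scale=sigmas[None, :])  # (N, K)
    return np.sum(np.log(comp @ pis))

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 100), rng.normal(1.0, 1.5, 200)])
print(gmm_log_likelihood(x, np.array([0.3, 0.7]),
                         np.array([-2.0, 1.0]), np.array([0.5, 1.5])))
```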

We then set each partial derivative to zero and try to solve for $\mu_k$, $\sigma_k$ and $\pi_k$:

$$ \begin{aligned} \frac{\partial }{\partial \mu_{k}} \sum_{i=1}^N \log \sum_{j=1}^{K} \pi_{j} f\left(x_{i} | \mu_{j}, \sigma_{j}\right) &= 0 \\ \frac{\partial }{\partial \sigma_{k}} \sum_{i=1}^N \log \sum_{j=1}^{K} \pi_{j} f\left(x_{i} | \mu_{j}, \sigma_{j}\right) &= 0 \\ \frac{\partial }{\partial \pi_{k}} \sum_{i=1}^N \log \sum_{j=1}^{K} \pi_{j} f\left(x_{i} | \mu_{j}, \sigma_{j}\right) &= 0 \end{aligned} $$

Let's start by solving for $\mu_k$.

For readability, let's expand $\text{argmax}_\theta \sum_{i} \log \sum_{k} \pi_k f(x_i|\mu_k, \sigma_k)$: $$ \begin{aligned} \text{argmax}_\theta \Big( & \log \sum_{j=1}^{K} \pi_{j} f\left(x_{1} | \mu_{j}, \sigma_{j}\right) \\ &+ \log \sum_{j=1}^{K} \pi_{j} f\left(x_{2} | \mu_{j}, \sigma_{j}\right) \\ &+ \dots \\ &+ \log \sum_{j=1}^{K} \pi_{j} f\left(x_{N} | \mu_{j}, \sigma_{j}\right) \Big) \end{aligned} $$

For a particular $x_i$, say $x_2$ (note that only the $j = k$ term of the inner sum depends on $\mu_k$):

$$ \begin{aligned} & \frac{\partial }{\partial \mu_{k}}\log \sum_{j=1}^{K} \pi_{j} f\left(x_{2} | \mu_{j}, \sigma_{j}\right) \\ &= \frac{1}{\sum_{j=1}^{K} \pi_{j} f\left(x_{2} | \mu_{j}, \sigma_{j}\right)} \cdot \frac{\partial}{\partial \mu_{k}} \pi_{k} f\left(x_{2} | \mu_{k}, \sigma_{k}\right) \\ &= \frac{1}{\sum_{j=1}^{K} \pi_{j} f\left(x_{2} | \mu_{j}, \sigma_{j}\right)} \cdot \frac{\partial}{\partial \mu_k} \pi_{k} \frac{1}{\sqrt{2 \pi \sigma_{k}^{2}}} \exp \left(-\frac{\left(x_{2}-\mu_{k}\right)^{2}}{2 \sigma_{k}^{2}}\right) \\ &= \frac{1}{\sum_{j=1}^{K} \pi_{j} f\left(x_{2} | \mu_{j}, \sigma_{j}\right)} \cdot \frac{\left(x_{2}-\mu_{k}\right)}{\sigma_{k}^{2}} \cdot \pi_{k} \frac{1}{\sqrt{2 \pi \sigma_{k}^{2}}} \exp \left(-\frac{\left(x_{2}-\mu_{k}\right)^{2}}{2 \sigma_{k}^{2}}\right) \\ &= \frac{1}{\sum_{j=1}^{K} \pi_{j} f\left(x_{2} | \mu_{j}, \sigma_{j}\right)} \cdot \frac{\left(x_{2}-\mu_{k}\right)}{\sigma_{k}^{2}} \cdot \pi_{k} f\left(x_{2} | \mu_{k}, \sigma_{k}\right) \\ &= \frac{\pi_{k} f\left(x_{2} | \mu_{k}, \sigma_{k}\right)}{\sum_{j=1}^{K} \pi_{j} f\left(x_{2} | \mu_{j}, \sigma_{j}\right)} \cdot \frac{\left(x_{2}-\mu_{k}\right)}{\sigma_{k}^{2}} \\ \end{aligned}$$

Summing over all the $x_i$'s, the derivative is thus

$$ \begin{aligned} & \frac{\partial }{\partial \mu_{k}}\sum_{i=1}^{N} \log \sum_{j=1}^{K} \pi_{j} f\left(x_{i} | \mu_{j}, \sigma_{j}\right) \\ &= \sum_{i=1}^{N} \frac{\left(x_{i}-\mu_{k}\right)}{\sigma_{k}^{2}} \cdot \frac{\pi_{k} f\left(x_{i} | \mu_{k}, \sigma_{k}\right)}{\sum_{j=1}^{K} \pi_{j} f\left(x_{i} | \mu_{j}, \sigma_{j}\right)} \\ \end{aligned}$$
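As a sanity check on this formula, here is a small sketch (test parameters are arbitrary; function names are my own) that compares the analytic derivative against a finite-difference approximation:

```python
import numpy as np
from scipy.stats import norm

def log_lik(x, pis, mus, sigmas):
    """Log-likelihood of a univariate GMM."""
    comp = norm.pdf(x[:, None], loc=mus[None, :], scale=sigmas[None, :])  # (N, K)
    return np.sum(np.log(comp @ pis))

def dlog_lik_dmu(x, pis, mus, sigmas, k):
    """Analytic derivative with respect to mu_k, as derived above."""
    comp = norm.pdf(x[:, None], loc=mus[None, :], scale=sigmas[None, :])
    weight = pis[k] * comp[:, k] / (comp @ pis)  # pi_k f(x_i|mu_k,sigma_k) / sum_j pi_j f(x_i|mu_j,sigma_j)
    return np.sum((x - mus[k]) / sigmas[k] ** 2 * weight)

rng = np.random.default_rng(1)
x = rng.normal(0.0, 2.0, size=50)   # arbitrary test data
pis = np.array([0.4, 0.6])
mus = np.array([-1.0, 2.0])
sigmas = np.array([1.0, 1.5])

eps = 1e-6
mus_plus = mus.copy()
mus_plus[0] += eps
numeric = (log_lik(x, pis, mus_plus, sigmas) - log_lik(x, pis, mus, sigmas)) / eps
print(dlog_lik_dmu(x, pis, mus, sigmas, k=0), numeric)  # the two values should agree closely
```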

So, back to trying to solve for $\mu_k$, $$\begin{aligned} \frac{\partial }{\partial \mu_{k}} \sum_{i=1}^N \log \sum_{j=1}^{K} \pi_{j} f\left(x_{i} | \mu_{j}, \sigma_{j}\right) &= 0 \\ \sum_{i=1}^{N} \frac{\left(x_{i}-\mu_{k}\right)}{\sigma_{k}^{2}} \cdot \frac{\pi_{k} f\left(x_{i} | \mu_{k}, \sigma_{k}\right)}{\sum_{j=1}^{K} \pi_{j} f\left(x_{i} | \mu_{j}, \sigma_{j}\right)} &= 0 \end{aligned}$$ And we realise that there's no analytical solution for this 😱! The ratio $\frac{\pi_{k} f\left(x_{i} | \mu_{k}, \sigma_{k}\right)}{\sum_{j=1}^{K} \pi_{j} f\left(x_{i} | \mu_{j}, \sigma_{j}\right)}$ itself depends on $\mu_k$, so $\mu_k$ cannot be isolated on one side.

Enter the EM algorithm.

Step 1: Expectation

In this step, we'll be replacing the parameter-dependent ratio $\frac{\pi_k f(x_i|\mu_k,\sigma_k)}{\sum_{j=1}^K \pi_j f(x_i|\mu_j,\sigma_j)}$ with the responsibility $\gamma_k(x_i)$, computed from the current parameter estimates and treated as a fixed constant. Why does this work?

  1. The Gaussian's posterior distribution is tractable.
  2. It has an intuitive, empirical meaning.
  3. It is the expectation of the latent variable: it gives the probability of a data point $x$ belonging to cluster $k$.

This tractability comes at the expense of having to iterate until convergence.

This step is called the 'expectation' step because the expectation of an indicator variable (which is what our latent variable is) is simply the posterior probability of that indicator being 1.

$$ \begin{aligned} E[z_k|X=x] &= 0 \cdot p(z_k = 0|X=x) + 1 \cdot p(z_k = 1|X=x) \\ &= p(z_k = 1|x) \\ &= \frac{p(x|z_k = 1)\,p(z_k = 1)}{p(x)} \\ &= \frac{\pi_k f(x|\mu_k, \sigma_k)}{\sum_{j=1}^K \pi_j f(x|\mu_j, \sigma_j)} \\ &= \gamma_k(x) \end{aligned} $$
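A minimal sketch of this E-step computation, vectorised over all data points (the parameter values are made up and the function name is my own):

```python
import numpy as np
from scipy.stats import norm

def e_step(x, pis, mus, sigmas):
    """gamma[i, k] = pi_k f(x_i|mu_k, sigma_k) / sum_j pi_j f(x_i|mu_j, sigma_j)."""
    weighted = pis[None, :] * norm.pdf(x[:, None], loc=mus[None, :], scale=sigmas[None, :])
    return weighted / weighted.sum(axis=1, keepdims=True)

# made-up parameters and data points
gamma = e_step(np.array([-2.1, 0.3, 1.4]),
               np.array([0.3, 0.7]), np.array([-2.0, 1.0]), np.array([0.5, 1.5]))
print(gamma)              # responsibility of each cluster for each point
print(gamma.sum(axis=1))  # each row sums to 1
```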

Calculating the responsibility

At every iteration, the responsibility is calculated based on the existing values of $\mu_k$, $\sigma_k$ and $\pi_k$, where $\pi_k$ is calculated as

$$ \pi_k = \frac{N_k}{N} $$

with $N_k = \sum_{i=1}^{N} \gamma_k(x_i)$, the effective number of points assigned to cluster $k$.

If we're only just starting the optimisation, there won't be any existing values, so it's our responsibility to initialise them.
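One simple way to do this initialisation (just a sketch under my own naming; k-means-based initialisation is also commonly used in practice):

```python
import numpy as np

def init_params(x, K, seed=0):
    """Naive initialisation: K random data points as means, a shared spread, uniform weights."""
    rng = np.random.default_rng(seed)
    mus = rng.choice(x, size=K, replace=False)  # pick K distinct observations as initial means
    sigmas = np.full(K, x.std())                # one global standard deviation to start with
    pis = np.full(K, 1.0 / K)                   # uniform mixing coefficients
    return pis, mus, sigmas

x = np.random.default_rng(0).normal(size=300)
print(init_params(x, K=2))
```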

Substituting the responsibility $\gamma_k(x_i)$ into the derivative from before:

$$ \begin{aligned} & \frac{\partial }{\partial \mu_{k}}\sum_{i=1}^{N} \log \sum_{j=1}^{K} \pi_{j} f\left(x_{i} | \mu_{j}, \sigma_{j}\right) \\ &= \sum_{i=1}^{N} \frac{\left(x_{i}-\mu_{k}\right)}{\sigma_{k}^{2}} \cdot \frac{\pi_{k} f\left(x_{i} | \mu_{k}, \sigma_{k}\right)}{\sum_{j=1}^{K} \pi_{j} f\left(x_{i} | \mu_{j}, \sigma_{j}\right)} \\ &= \sum_{i=1}^{N} \frac{\left(x_{i}-\mu_{k}\right)}{\sigma_{k}^{2}} \gamma_k(x_i) \end{aligned}$$

Note that $\gamma_k(x_i)$ is treated as a constant here, since it was computed in the E-step from the current parameter values.

Step 2: Maximisation

As we would for an ordinary MLE, we set the derivative to zero to find the estimate of $\mu_k$:

$$ \begin{aligned} \frac{\partial }{\partial \mu_{k}}\sum_{i=1}^{N} \log \sum_{j=1}^{K} \pi_{j} f\left(x_{i} | \mu_{j}, \sigma_{j}\right) &= 0 \\ \sum_{i=1}^{N} \frac{\left(x_{i}-\mu_{k}\right)}{\sigma_{k}^{2}} \gamma_k(x_i) &= 0 \\ \mu_k \sum_{i=1}^{N} \gamma_k(x_i) &= \sum_{i=1}^{N} x_i\gamma_k(x_i) \\ \mu_k &= \frac{\sum_{i=1}^{N} x_i\gamma_k(x_i)}{\sum_{i=1}^{N} \gamma_k(x_i)} \end{aligned} $$
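A quick numerical check (with arbitrary, made-up responsibilities held fixed) that this update does zero the derivative:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=20)           # arbitrary data
gamma_k = rng.random(20)          # made-up responsibilities for one cluster k, held fixed
sigma_k = 1.3

mu_k = np.sum(gamma_k * x) / np.sum(gamma_k)              # the closed-form update above
derivative = np.sum((x - mu_k) / sigma_k ** 2 * gamma_k)  # should be ~0
print(mu_k, derivative)
```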

Doing the same for the MLE of $\sigma_k$ gives

$$ \sigma_k^2 = \frac{\sum_{i=1}^{N} \gamma_k(x_i)\left(x_{i}-\mu_{k}\right)^2}{\sum_{i=1}^{N} \gamma_k(x_i)} $$

The MLE of $\pi_k$ is estimated empirically, as the average responsibility (since the latent variable follows a categorical distribution):

$$ \pi_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_k(x_i) $$
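Putting the two steps together, here is a compact sketch of the whole EM loop for a univariate GMM (all names and the stopping rule are my own choices; it iterates until the log-likelihood stops improving):

```python
import numpy as np
from scipy.stats import norm

def em_gmm(x, K, n_iter=200, tol=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    # initialise as discussed above
    mus = rng.choice(x, size=K, replace=False)
    sigmas = np.full(K, x.std())
    pis = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, k]
        weighted = pis * norm.pdf(x[:, None], loc=mus, scale=sigmas)  # (N, K)
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: closed-form updates derived above
        Nk = gamma.sum(axis=0)
        mus = (gamma * x[:, None]).sum(axis=0) / Nk
        sigmas = np.sqrt((gamma * (x[:, None] - mus) ** 2).sum(axis=0) / Nk)
        pis = Nk / len(x)
        # stop when the log-likelihood stops improving
        ll = np.sum(np.log(weighted.sum(axis=1)))
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pis, mus, sigmas

rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(1.0, 1.5, 700)])
print(em_gmm(x, K=2))
```

Running it on the synthetic data above should recover parameters close to the ones used to generate it.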
