Mixture models

Notes on mixture models

Mixture model

Language use: "We model the price of a book with a mixture model."

A mixture model is a model that assumes each observation $X_i$ comes from one of $K$ mixture components, i.e., each random variable $X_i$ is associated with a label $Z_i \in \{1,2,...,K\}$ indicating which component generated it.

It helps to look at it from the perspective of generating the observations from the mixture model. TLDR:

  1. Sample (or "choose") a component from the $K$ components.
  2. Based on this component, sample a value from that component's distribution (see the code sketch after this list).
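
Here is a minimal sketch of that two-step process in Python with NumPy. The three components, their weights, and the choice of Gaussian distributions are made-up placeholders for illustration, not anything prescribed by the model:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical mixture with K = 3 Gaussian components.
weights = [0.5, 0.3, 0.2]    # mixture proportions, must sum to 1
means   = [0.0, 5.0, 10.0]   # assumed per-component means
sds     = [1.0, 2.0, 0.5]    # assumed per-component standard deviations

def sample_one():
    # Step 1: sample a component label (0-indexed here, 1..K in the text).
    z = rng.choice(len(weights), p=weights)
    # Step 2: sample a value from that component's distribution.
    x = rng.normal(means[z], sds[z])
    return z + 1, x

z, x = sample_one()
print(f"component Z = {z}, observation X = {x:.2f}")
```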

Here are the concepts.

Component / Mixture component

A cluster entity, labelled as one of $\{1,2,...,K\}$. Each component has its own probability distribution.

Latent variable

The random variable (or "label" as mentioned previously) $Z_i \in \{1,2,...,K\}$, where each of the $\{1,2,..,K\}$ represents a component. Often we don't observe this variable (hence "latent").

Mixture proportions / mixture weights

The random variable $Z_i$ follows a probability distribution: $$ P(Z_i=z)=\left\{\begin{array}{ll} {\pi_{1}} & {\text { if } z=1} \\ {\pi_{2}} & {\text { if } z=2} \\ ... \\ {\pi_{K}} & {\text { if } z=K} \end{array}\right. $$ where the probabilities $\pi_1, \pi_2, ..., \pi_K$ are the mixture proportions (or mixture weights), which sum to 1. This is a categorical distribution.

Probability distribution of an observation

Given a component $k$, the probability distribution within this component is

$$ P(X_i=x|Z_i=k) $$

which is a conditional distribution.

Marginal distribution of mixture model

The probability of observing $x$ is

$\begin{aligned} P(\text{observing } x) &= P(\text{observing } x \text{ and } x \text{ came from component } 1) \\ &+ P(\text{observing } x \text{ and } x \text{ came from component } 2) \\ &+ ... \\ &+ P(\text{observing } x \text{ and } x \text{ came from component } K) \end{aligned}$

More formally (the index $i$ has been omitted for readability),

$\begin{aligned} P(X = x) =& P(X = x, Z = 1) + \\ & P(X = x, Z = 2) + \\ & ... +\\ & P(X = x, Z = K) \\ =& P(X = x | Z = 1) P(Z = 1) + \\ & P(X = x | Z = 2) P(Z = 2) + \\ & ... + \\ & P(X = x | Z = K) P(Z = K) \\ =& P(X = x | Z = 1) \pi_1 + \\ & P(X = x | Z = 2) \pi_2 + \\ & ... + \\ & P(X = x | Z = K) \pi_K \end{aligned}$

which can be re-written as

$$ P(X = x) = \sum_{k=1}^{K} \pi_k P(X=x|Z=k) $$

where $\pi_k = P(Z = k)$ is the mixture proportion of component $k$.
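
As a quick sketch of this formula in code, assuming Gaussian components so that SciPy's `norm.pdf` plays the role of $P(X=x|Z=k)$ (the weights and parameters below are made up):

```python
from scipy.stats import norm

def mixture_pdf(x, weights, means, sds):
    # Marginal density: sum over k of pi_k * P(X = x | Z = k).
    return sum(pi * norm.pdf(x, loc=mu, scale=sd)
               for pi, mu, sd in zip(weights, means, sds))

# Hypothetical two-component mixture.
print(mixture_pdf(1.0, weights=[0.6, 0.4], means=[0.0, 3.0], sds=[1.0, 1.0]))
```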

Example

We're in the business of generating zombies. We model the height of a zombie as a mixture of 6 components, where each component is modeled as a Gaussian distribution.

Here are the predetermined mixture proportions; the component is chosen by throwing a (biased) six-sided die: $$ P(Z_i=z)=\left\{\begin{array}{ll} {0.20} & {\text { if } z=1} \\ {0.10} & {\text { if } z=2} \\ {0.10} & {\text { if } z=3} \\ {0.15} & {\text { if } z=4} \\ {0.25} & {\text { if } z=5} \\ {0.20} & {\text { if } z=6} \end{array}\right. $$

And here is the distribution associated with each component: $$ P(X_i=x|Z_i=k) = f(x; \mu_k, \sigma_k) $$ where $f$ is the probability density of a Gaussian distribution with mean $\mu_k$ and standard deviation $\sigma_k$.

To generate a datapoint:

  1. Throw the die above. The face shows $2$.
  2. Based on this number 2, we generate a sample from the predefined density $f(x; \mu_2=171,\sigma_2=4)$, giving us $174$. So we produce a $174$ cm-tall zombie.

Let's generate another datapoint:

  1. Throw the die above. The face shows $5$.
  2. Based on this number 5, we generate a sample from the predefined density $f(x; \mu_5=176,\sigma_5=2)$, giving us $175$. So we produce a $175$ cm-tall zombie.

From an outsider's point of view, the only data observed (the zombies' heights) are $\{174, 175\}$. What the outsider doesn't observe are the components $\{2,5\}$ from the die throws.
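
The whole zombie generator fits in a few lines. In this sketch, the die probabilities, $\mu_2=171$, $\sigma_2=4$, $\mu_5=176$, and $\sigma_5=2$ come from the example above; the parameters of the other four components are made up:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

weights = [0.20, 0.10, 0.10, 0.15, 0.25, 0.20]  # the die above
means   = [160, 171, 168, 180, 176, 190]        # only mu_2, mu_5 from the text
sds     = [3, 4, 5, 3, 2, 6]                    # only sigma_2, sigma_5 from the text

def generate_zombie_height():
    z = rng.choice(6, p=weights)                # throw the die (0-indexed)
    return z + 1, rng.normal(means[z], sds[z])  # (latent component, height)

for _ in range(3):
    z, height = generate_zombie_height()
    print(f"die shows {z}, zombie is {height:.0f} cm tall")
```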

This idea of latent variables brings us to a clustering method called Gaussian mixture models: each datapoint is generated from the distribution of a cluster that was (latently) chosen at random. Read more in the Gaussian Mixture Model post under Related below.

Shorthands

In the first section, we saw that a categorical distribution is used to describe the latent variable, i.e. $Z_i$ takes values in $\{1,2,...,K\}$.

$$ P(Z_i=z)=\left\{\begin{array}{ll} {\pi_{1}} & {\text { if } z=1} \\ {\pi_{2}} & {\text { if } z=2} \\ ... \\ {\pi_{K}} & {\text { if } z=K} \end{array}\right. $$

While it's more readable that way, the convention is to use indicator variables instead to show whether an observation belongs to a component. This is effectively a multinomial distribution with $n=1$ trial.

$$ P(Z_i = z_k) $$

which means the probability that $Z_i$ comes from cluster $k$. Also note that $z_k$ is a realised indicator value where

$$ z_k=\left\{\begin{array}{ll} {1} & {\text { if the observation came from cluster $k$}} \\ {0} & {\text { otherwise }} \\ \end{array}\right. $$

So the following expression

$$ P(X_i=x|Z_i = z_k) $$

means "the probability of observing $X_i$ given that it came from cluster $k$."

Final notes

Note that a mixture model's density can be multimodal or even unimodal.

If a density is multimodal, a single standard unimodal distribution cannot capture it, so it is naturally modelled as a mixture.

If a density is unimodal, it is inconclusive whether it came from a mixture: two components can blend into a single mode, as the check below shows.
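
A quick numerical check of both claims, using an equal-weight mixture of two Gaussians: with well-separated means the density has two modes, while with close means the mixture density is unimodal (so unimodality alone cannot rule out a mixture). The parameters are illustrative:

```python
import numpy as np
from scipy.stats import norm

def count_modes(means, sds, weights=(0.5, 0.5)):
    xs = np.linspace(-10, 10, 2001)
    pdf = sum(w * norm.pdf(xs, m, s) for w, m, s in zip(weights, means, sds))
    # Count interior local maxima of the density on the grid.
    return int(np.sum((pdf[1:-1] > pdf[:-2]) & (pdf[1:-1] > pdf[2:])))

print(count_modes(means=(-3, 3), sds=(1, 1)))      # 2: bimodal mixture
print(count_modes(means=(-0.5, 0.5), sds=(1, 1)))  # 1: unimodal mixture
```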

Related

Gaussian Mixture Model

Resources

Introduction to Mixture Models