Mixture model

Language use: "We model the price of a book as a mixture model."

A mixture model assumes that each observation $X_i$ is drawn from one of $K$ mixture components, i.e., each random variable $X_i$ is associated with a label $Z_i \in \{1,2,...,K\}$ that records which component it came from.

It helps to think about how an observation is generated from the mixture model. TL;DR:

  1. Sample (or "choose") a component from the $K$ components.
  2. Based on this component, sample a value.
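The two steps above can be sketched in Python. This is only a minimal illustration: the function name `sample_mixture`, the two example components, and their weights are all made up for the example.

```python
import random


def sample_mixture(weights, component_samplers, rng):
    """Draw one observation from a mixture.

    weights: mixture proportions, one per component (should sum to 1).
    component_samplers: one zero-argument callable per component.
    Returns (z, x), where z is the 1-based component label.
    """
    # Step 1: choose a component according to the mixture weights.
    labels = range(1, len(weights) + 1)
    z = rng.choices(labels, weights=weights, k=1)[0]
    # Step 2: sample a value from the chosen component's distribution.
    x = component_samplers[z - 1]()
    return z, x


# Hypothetical two-component mixture (weights and Gaussians are assumptions).
rng = random.Random(0)
samplers = [lambda: rng.gauss(0.0, 1.0), lambda: rng.gauss(10.0, 1.0)]
z, x = sample_mixture([0.3, 0.7], samplers, rng)
```

Here `z` plays the role of the label and `x` is the value we would actually get to see.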

Here are the concepts.

Component / Mixture component

A cluster entity, labelled as one of $\{1,2,...,K\}$. Each component has its own probability distribution.

Latent variable

The random variable (or "label", as mentioned previously) $Z_i \in \{1,2,...,K\}$, where each value in $\{1,2,...,K\}$ represents a component. We usually don't observe this variable (hence "latent").

Mixture proportions / mixture weights

The random variable $Z_i$ follows a probability distribution: $$ P(Z_i=z)=\left\{\begin{array}{ll} {\pi_{1}} & {\text { if } z=1} \\ {\pi_{2}} & {\text { if } z=2} \\ ... \\ {\pi_{K}} & {\text { if } z=K} \end{array}\right. $$ where the probabilities $\pi_1, ..., \pi_K$ are the mixture proportions or mixture weights. They are non-negative and add up to 1. This distribution is a categorical distribution.

Probability distribution of an observation

Given a component $k$, the probability distribution within this component is

$$ P(X_i=x|Z_i=k) $$

which is a conditional distribution.

Marginal distribution of mixture model

The probability of observing $x$ is

$\begin{aligned} P(\text{observing } x) &= P(\text{observing } x \text{ and } x \text{ came from component } 1) \\ &+ P(\text{observing } x \text{ and } x \text{ came from component } 2) \\ &+ ... \\ &+ P(\text{observing } x \text{ and } x \text{ came from component } K) \end{aligned}$

More formally (the index $i$ has been omitted for readability),

$\begin{aligned} P(X = x) =& P(X = x, Z = 1) + \\ & P(X = x, Z = 2) + \\ & ... +\\ & P(X = x, Z = K) \\ =& P(X = x | Z = 1) P(Z = 1) + \\ & P(X = x | Z = 2) P(Z = 2) + \\ & ... + \\ & P(X = x | Z = K) P(Z = K) \\ =& P(X = x | Z = 1) \pi_1 + \\ & P(X = x | Z = 2) \pi_2 + \\ & ... + \\ & P(X = x | Z = K) \pi_K \end{aligned}$

which can be re-written as

$$ P(X = x) = \sum_{k=1}^{K} \pi_k P(X=x|Z=k) $$
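The weighted sum above translates almost directly into code. The sketch below assumes Gaussian components, so each $P(X=x|Z=k)$ is a density rather than a probability. The weights and the two components are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist


def mixture_pdf(x, weights, components):
    """Marginal density: sum over k of pi_k * p(x | Z = k)."""
    return sum(pi_k * comp.pdf(x) for pi_k, comp in zip(weights, components))


# Hypothetical mixture: these weights and Gaussians are assumptions.
weights = [0.4, 0.6]
components = [NormalDist(mu=0, sigma=1), NormalDist(mu=5, sigma=2)]

p = mixture_pdf(1.0, weights, components)
```

Because the weights sum to 1 and each component density integrates to 1, the mixture density also integrates to 1.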



We're in the business of generating zombies. We model the height of a zombie as a mixture of 6 components, where each component is modeled as a Gaussian distribution.

Here are the predetermined mixture proportions, which we can think of as the face probabilities of a loaded six-sided die: $$ P(Z_i=z)=\left\{\begin{array}{ll} {0.20} & {\text { if } z=1} \\ {0.10} & {\text { if } z=2} \\ {0.10} & {\text { if } z=3} \\ {0.15} & {\text { if } z=4} \\ {0.25} & {\text { if } z=5} \\ {0.20} & {\text { if } z=6} \end{array}\right. $$

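And here is the distribution associated with each component: $$ P(X_i=x|Z_i=k) = f(x; \mu_k, \sigma_k) $$ where $f$ is the probability density of a Gaussian distribution with mean $\mu_k$ and standard deviation $\sigma_k$. Note that the parameters are indexed by the component $k$, not by the observation $i$.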

To generate a datapoint:

  1. Throw the die above. The face shows $2$.
  2. Based on this number 2, we generate a sample from a predefined density $f(\mu_2=171,\sigma_2=4)$, giving us $174$. So we will produce a $174$ cm-tall zombie.

Let's generate another datapoint:

  1. Throw the die above. The face shows $5$.
  2. Based on this number 5, we generate a sample from a predefined density $f(\mu_5=176,\sigma_5=2)$, giving us $175$. So we will produce a $175$ cm-tall zombie.

From an outsider's point of view, the only data (zombies' heights) they observe are $\{174, 175\}$. What they don't observe are the components $\{2,5\}$ from the die throws.
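The zombie factory can be sketched in a few lines of Python. The weights come from the die above, and the parameters for components 2 and 5 are the ones used in the two examples. The parameters for components 1, 3, 4 and 6 are not given in the text, so the values below are placeholders invented purely to make the sketch runnable.

```python
import random

# Mixture proportions of the loaded die, for components 1..6.
WEIGHTS = [0.20, 0.10, 0.10, 0.15, 0.25, 0.20]

# (mean, standard deviation) in cm for each component.
# Only components 2 and 5 are from the text; the rest are made-up placeholders.
PARAMS = {
    1: (160, 5),  # placeholder
    2: (171, 4),
    3: (165, 3),  # placeholder
    4: (180, 6),  # placeholder
    5: (176, 2),
    6: (185, 3),  # placeholder
}


def generate_zombie(rng):
    """Return (component, height): throw the die, then sample a height."""
    z = rng.choices(list(PARAMS), weights=WEIGHTS, k=1)[0]
    mu, sigma = PARAMS[z]
    return z, rng.gauss(mu, sigma)


rng = random.Random(42)
draws = [generate_zombie(rng) for _ in range(5)]

latent_components = [z for z, _ in draws]       # hidden from the outsider
observed_heights = [round(h) for _, h in draws]  # what the outsider sees
```

An outsider would only be handed `observed_heights`. Recovering something like `latent_components` from the heights alone is the clustering problem.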

This idea of having latent variables brings us to a clustering method called Gaussian mixture models. The notion is that each datapoint is generated from the distribution of a cluster that was chosen at random, with the choice itself hidden from us (latent). Read more in the post.


In the first section, we saw that a categorical distribution describes the latent variable, i.e. $Z_i$ takes values in $\{1,2,...,K\}$.

$$ P(Z_i=z)=\left\{\begin{array}{ll} {\pi_{1}} & {\text { if } z=1} \\ {\pi_{2}} & {\text { if } z=2} \\ ... \\ {\pi_{K}} & {\text { if } z=K} \end{array}\right. $$

While it's more readable that way, the convention is to use indicator variables instead to show whether an observation belongs to a component. This is effectively a multinomial distribution with $n=1$ trials.

$$ P(Z_i = z_k) $$

which means the probability that observation $i$ comes from cluster $k$. Also note that $z_k$ is a realised value where

$$ z_k=\left\{\begin{array}{ll} {1} & {\text { if cluster $k$}} \\ {0} & {\text { otherwise }} \\ \end{array}\right. $$

So the following expression

$$ P(X_i=x|Z_i = z_k) $$

means "the probability of observing $X_i = x$ given that it came from cluster $k$."
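The indicator encoding can be made concrete with a short sketch. The helper names `one_hot` and `indicator_prob` are hypothetical. The product form $\prod_k \pi_k^{z_k}$ used in `indicator_prob` is a common way of writing the categorical probability with indicators, but it is an addition here and does not appear in the text above.

```python
import math


def one_hot(k, K):
    """Indicator vector for cluster k (1-based) out of K clusters."""
    if not 1 <= k <= K:
        raise ValueError("k must be between 1 and K")
    return [1 if j == k else 0 for j in range(1, K + 1)]


def indicator_prob(z, weights):
    """P(Z = z) for an indicator vector z, written as prod_k pi_k ** z_k."""
    return math.prod(pi_k ** z_k for pi_k, z_k in zip(weights, z))


z = one_hot(2, 6)  # [0, 1, 0, 0, 0, 0]

# With the die weights from the zombie example, this picks out pi_2.
weights = [0.20, 0.10, 0.10, 0.15, 0.25, 0.20]
p = indicator_prob(z, weights)
```

Only the entry with $z_k = 1$ contributes its weight to the product. Every other factor is $\pi_k^0 = 1$.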

Final notes

Note that mixture models can have a multimodal or even unimodal probability density.

If a distribution is multimodal, that is a strong hint that a mixture model is a sensible description of it.

If a distribution is unimodal, it is not conclusive whether it is a mixture model.
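To see why a unimodal density is inconclusive, the sketch below counts the local maxima of two equal-weight Gaussian mixtures on a grid. The component means and the helper `count_modes` are made up for this illustration. When the two means are close the bumps merge into one mode, and when they are far apart two modes appear.

```python
from statistics import NormalDist


def count_modes(weights, components, lo, hi, steps=2000):
    """Count strict local maxima of the mixture density on an even grid."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    ys = [sum(w * c.pdf(x) for w, c in zip(weights, components)) for x in xs]
    return sum(1 for i in range(1, steps) if ys[i - 1] < ys[i] > ys[i + 1])


weights = [0.5, 0.5]

# Hypothetical components: means 1 apart vs. 6 apart, both with sigma = 1.
close = [NormalDist(-0.5, 1), NormalDist(0.5, 1)]
far = [NormalDist(-3, 1), NormalDist(3, 1)]

modes_close = count_modes(weights, close, -10, 10)
modes_far = count_modes(weights, far, -10, 10)
```

Both densities are two-component mixtures, yet the first should show a single mode and the second two. The number of modes alone therefore cannot tell us how many components generated the data.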


