Notes on mixture models
Language use: "We model the price of a book as a mixture model."
A mixture model is a model which assumes that each observation $X_i$ comes from one of the $K$ mixture components, i.e., each random variable $X_i$ is associated with a label $Z_i \in \{1,2,...,K\}$.
It helps to look at it from the perspective of generating observations from the mixture model. TLDR: pick a component at random according to the mixture weights, then draw the observation from that component's distribution.
Here are the key concepts.
Mixture component: a cluster entity, labelled as one of $\{1,2,...,K\}$. Each component has its own probability distribution.
Latent variable: the random variable (or "label", as mentioned previously) $Z_i \in \{1,2,...,K\}$, where each of $\{1,2,...,K\}$ represents a component. We often don't observe this variable (hence "latent").
The random variable $Z$ follows a probability distribution: $$ P(Z_i=z)=\left\{\begin{array}{ll} {\pi_{1}} & {\text { if } z=1} \\ {\pi_{2}} & {\text { if } z=2} \\ ... \\ {\pi_{K}} & {\text { if } z=K} \end{array}\right. $$ where the probabilities $\pi_1, \ldots, \pi_K$ are the mixture proportions (or mixture weights), which add up to 1. This distribution is a categorical distribution.
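To make the categorical distribution concrete, here is a minimal sketch in Python of drawing labels $Z_i$; the mixture weights are made-up values for illustration, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mixture weights (must sum to 1); K = 3 components here.
pi = np.array([0.5, 0.3, 0.2])
K = len(pi)

# Draw 10 labels; each Z_i takes the value k with probability pi_k.
Z = rng.choice(np.arange(1, K + 1), size=10, p=pi)
print(Z)  # e.g. [1 3 2 1 1 ...]
```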
Given a component $k$, the probability distribution within this component is
$$ P(X_i=x|Z_i=k) $$
which is a conditional distribution.
The probability of observing $x$ is
$\begin{aligned} P(\text{observing } x) &= P(\text{observing } x \text{ and } x \text{ came from component } 1) \\ &+ P(\text{observing } x \text{ and } x \text{ came from component } 2) \\ &+ ... \\ &+ P(\text{observing } x \text{ and } x \text{ came from component } K) \end{aligned}$
More formally (the index $i$ has been omitted for readability),
$\begin{aligned} P(X = x) =& P(X = x, Z = 1) + \\ & P(X = x, Z = 2) + \\ & ... +\\ & P(X = x, Z = K) \\ =& P(X = x | Z = 1) P(Z = 1) + \\ & P(X = x | Z = 2) P(Z = 2) + \\ & ... + \\ & P(X = x | Z = K) P(Z = K) \\ =& P(X = x | Z = 1) \pi_1 + \\ & P(X = x | Z = 2) \pi_2 + \\ & ... + \\ & P(X = x | Z = K) \pi_K \end{aligned}$
which can be re-written as
$$ P(X = x) = \sum_{k=1}^{K} \pi_k P(X=x|Z=k) $$
where $\pi_k = P(Z = k)$ is the mixture weight of component $k$ and $P(X=x|Z=k)$ is the distribution of component $k$.
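As a sanity check of the formula above, here is a small sketch (assuming Gaussian components with made-up weights, means, and standard deviations) that evaluates the mixture density as the weighted sum of component densities:

```python
import numpy as np
from scipy.stats import norm

# Illustrative parameters for a 2-component Gaussian mixture.
pi = np.array([0.6, 0.4])       # mixture weights
mu = np.array([0.0, 5.0])       # component means
sigma = np.array([1.0, 2.0])    # component standard deviations

def mixture_density(x):
    # P(X = x) = sum_k pi_k * P(X = x | Z = k)
    return sum(p * norm.pdf(x, loc=m, scale=s)
               for p, m, s in zip(pi, mu, sigma))

print(mixture_density(1.0))
```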
Now for a toy example: suppose we're in the business of generating zombies. We model the height of a zombie as a mixture of 6 components, where each component is modeled as a Gaussian distribution.
Here are the predetermined mixture weights; which component a zombie's height comes from is decided by throwing a (biased) six-sided die: $$ P(Z_i=z)=\left\{\begin{array}{ll} {0.20} & {\text { if } z=1} \\ {0.10} & {\text { if } z=2} \\ {0.10} & {\text { if } z=3} \\ {0.15} & {\text { if } z=4} \\ {0.25} & {\text { if } z=5} \\ {0.20} & {\text { if } z=6} \end{array}\right. $$
And here is the distribution associated with each component $k$: $$ P(X_i=x|Z_i=k) = f(x; \mu_k, \sigma_k) $$ where $f$ is the probability density of a Gaussian distribution with mean $\mu_k$ and standard deviation $\sigma_k$.
To generate a datapoint: first, throw the die to pick a component, say it lands on 2; then, draw a height from component 2's Gaussian, say 174.
Let's generate another datapoint: this time the die lands on 5, and drawing from component 5's Gaussian gives 175.
From an outsider's point of view, the only data (zombies' heights) observed are $\{174, 175\}$. What is not observed are the components $\{2,5\}$ from the die throws.
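Here is a minimal sketch of this generative process. The mixture weights are the die probabilities above; the component means and standard deviations (in cm) are assumptions, since the text does not specify them.

```python
import numpy as np

rng = np.random.default_rng(42)

# Die probabilities (mixture weights) from the text.
pi = np.array([0.20, 0.10, 0.10, 0.15, 0.25, 0.20])

# Hypothetical Gaussian parameters for the 6 components (heights in cm).
mu = np.array([150.0, 160.0, 170.0, 175.0, 185.0, 195.0])
sigma = np.array([5.0, 5.0, 5.0, 5.0, 5.0, 5.0])

def generate_zombie_height():
    # Step 1: throw the die, i.e. draw the latent component Z (0-indexed).
    z = rng.choice(6, p=pi)
    # Step 2: draw a height from that component's Gaussian.
    x = rng.normal(loc=mu[z], scale=sigma[z])
    return z + 1, x  # report the component as 1..6

samples = [generate_zombie_height() for _ in range(2)]
print(samples)  # outsiders would see only the heights, not the components
```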
This idea of having latent variables brings us to a clustering method called Gaussian mixture models. The notion is that each datapoint is generated from the distribution of a cluster that was randomly chosen (latently). Read more in the post on Gaussian mixture models.
In the first section, we saw that the categorical distribution is used to describe the latent distribution, i.e. $Z_i$ takes values in $\{1,2,...,K\}$.
$$ P(Z_i=z)=\left\{\begin{array}{ll} {\pi_{1}} & {\text { if } z=1} \\ {\pi_{2}} & {\text { if } z=2} \\ ... \\ {\pi_{K}} & {\text { if } z=K} \end{array}\right. $$
While it's more readable that way, the convention is to use indicator variables instead, which show whether an observation belongs to each component. This is effectively a multinomial distribution with $n=1$ trial.
$$ P(Z_i = z_k) $$
which means the probability that observation $i$ came from cluster $k$. Also note that $z_k$ is a realised value, where
$$ z_k=\left\{\begin{array}{ll} {1} & {\text { if cluster $k$}} \\ {0} & {\text { otherwise }} \\ \end{array}\right. $$
So the following expression
$$ P(X_i=x|Z_i = z_k) $$
means "the probability of observing $X_i$ given that it came from cluster $k$."
Note that a mixture model can have a multimodal or even a unimodal probability density.
A multimodal density is a strong hint that the data are well described by a mixture of several components.
A unimodal density, however, is not conclusive: the components may overlap so heavily that their mixture has only a single mode.
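The sketch below (with made-up parameters) illustrates this: the same two-component Gaussian mixture is multimodal when the means are well separated and unimodal when they are close.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-6, 6, 1001)

def two_component_density(x, mu1, mu2, sigma=1.0):
    # Equal-weight mixture of two Gaussians.
    return 0.5 * norm.pdf(x, mu1, sigma) + 0.5 * norm.pdf(x, mu2, sigma)

separated = two_component_density(x, -3.0, 3.0)    # well-separated means
overlapping = two_component_density(x, -0.5, 0.5)  # nearby means

def count_modes(density):
    # Count local maxima of the density on the grid.
    return int(np.sum((density[1:-1] > density[:-2]) &
                      (density[1:-1] > density[2:])))

print(count_modes(separated))    # 2 -> multimodal
print(count_modes(overlapping))  # 1 -> unimodal
```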
Reference: Introduction to Mixture Models