Mixture distributions

Generative models

A generative model is a statistical model for the process that could have generated your data. Generative models offer many advantages compared to combinatorial algorithms that treat data as a collection of objects. Most importantly, working with a generative model forces you to be explicit about your assumptions. Likewise, a generative model allows you to encode, and be explicit about, prior (biological) knowledge you may have about the data generating process.

Gaussian mixture models

What kind of process could generate data with separated groups or clusters of observations?


Let’s assume there is an unmeasured (or hidden) random variable $Z$ that determines to which group an observation $X$ belongs. For simplicity, assume that $Z$ can only take two values, 0 or 1, and that the measurement $X$ is one-dimensional and normally distributed in each group.

Consider the following process:

  1. Randomly sample cluster label $Z=1$ with probability $\pi$ and $Z=0$ with probability $1-\pi$.

  2. Sample features

$$ X \sim \begin{cases} \mathcal{N}(\mu_0,\sigma_0^2) & \text{ if } Z=0 \\ \mathcal{N}(\mu_1,\sigma_1^2) & \text{ if } Z=1 \end{cases} $$

where $\mu_k$ and $\sigma_k^2$ are the means and variances of two normal distributions.

What data would this process generate?

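As a quick illustration, here is a minimal simulation of this two-step process in Python. The parameter values ($\pi=0.3$, $\mu_0=0$, $\mu_1=4$, $\sigma_0=\sigma_1=1$) and the sample size are arbitrary choices for this sketch, not part of the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary example parameters (assumptions for this sketch)
pi = 0.3                      # P(Z = 1)
mu = np.array([0.0, 4.0])     # mu_0, mu_1
sigma = np.array([1.0, 1.0])  # sigma_0, sigma_1
N = 1000

# Step 1: sample the hidden cluster labels Z
Z = rng.binomial(1, pi, size=N)

# Step 2: sample the features X from the component selected by Z
X = rng.normal(loc=mu[Z], scale=sigma[Z])
```

A histogram of the simulated `X` shows two groups of observations, one for each value of the hidden label `Z`.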

Hence the model generates cluster labels $Z$ and real numbers $x\in\mathbb{R}$ from the model

$$ \begin{aligned} Z &\longrightarrow X\\ p(Z,x) &= P(Z) \times p(x\mid Z) \end{aligned} $$

where we use lower-case “p” for probability density functions ($X$ is continuous) and upper-case “P” for discrete probabilities, and

$$ \begin{aligned} P(Z=1) &= \pi = 1 - P(Z=0) \\ p(x\mid Z=k) &= \mathcal{N}(\mu_k,\sigma_k^2) \end{aligned} $$

[Figure: distribution generated by the model]

[Figure: observed distribution]

Some EM language

The joint distribution is the probability distribution of a cluster label $Z$ and feature value $x$ both being produced by the model:

$$ p(Z,x) = p(x\mid Z)\; P(Z) $$

The marginal distribution is the probability distribution of a feature value $x$ being produced by the model, regardless of its cluster label:

$$ p(x) = \sum_{k=0,1} p(x\mid Z=k)\; P(Z=k) $$

The responsibility of $Z$ for feature value $x$, also called the recognition distribution, is obtained using Bayes’ theorem

$$ P(Z=k\mid x) = \frac{p(x\mid Z=k) \; P(Z=k)}{p(x)} $$

This value can be used as a soft cluster assignment: with probability $P(Z=k \mid x)$, an observed value $x$ belongs to cluster $k$.
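
In code, the joint, marginal, and responsibility for a single observed value follow directly from these definitions; a minimal sketch, using the same illustrative parameter values as above:

```python
import numpy as np
from scipy.stats import norm

# Illustrative parameter values (assumptions for this sketch)
P_Z = np.array([0.7, 0.3])    # P(Z=0), P(Z=1), i.e. (1 - pi, pi)
mu = np.array([0.0, 4.0])     # mu_0, mu_1
sigma = np.array([1.0, 1.0])  # sigma_0, sigma_1

x = 1.5  # an observed feature value

joint = norm.pdf(x, loc=mu, scale=sigma) * P_Z  # p(Z=k, x) for k = 0, 1
marginal = joint.sum()                          # p(x)
responsibility = joint / marginal               # P(Z=k | x), sums to 1

print(responsibility)  # soft cluster assignment of x to clusters 0 and 1
```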

Note that the expected value of $Z$ given a data point $x$ is:

$$ \begin{aligned} \mathbb{E}\left(Z\mid x\right) &= 1 \cdot P(Z=1 \mid x) + 0\cdot P(Z=0 \mid x) = P(Z=1 \mid x) \\ &= \frac{\pi\, p(x\mid \mu_1, \sigma_1^2)}{\pi\, p(x \mid \mu_1, \sigma_1^2) + (1-\pi)\, p(x \mid \mu_0, \sigma_0^2)} \end{aligned} $$

Maximum-likelihood estimation

To fit the model to the data, we can only use the observed data $x$, which follows the Gaussian mixture distribution

$$ p(x) = \sum_{k=0,1} p(x\mid Z=k)\; P(Z=k) $$


The log-likelihood of observing $N$ independent samples $(x_1,\dots,x_N)$ is

$$ \mathcal{L}= \log\left(\prod_{i=1}^N p(x_{i}) \right) = \sum_{i=1}^N \log p(x_{i}) $$
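
For given parameter values, this quantity is straightforward to evaluate; a minimal sketch (the data and parameter values are illustrative only):

```python
import numpy as np
from scipy.stats import norm

def gmm_log_likelihood(x, P_Z, mu, sigma):
    """Sum over samples of log sum_k P(Z=k) * N(x_i | mu_k, sigma_k^2)."""
    dens = norm.pdf(x[:, None], loc=mu, scale=sigma) * P_Z  # shape (N, 2)
    return np.sum(np.log(dens.sum(axis=1)))

# Illustrative data and parameter values (assumptions for this sketch)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 700), rng.normal(4.0, 1.0, 300)])
print(gmm_log_likelihood(x, P_Z=np.array([0.7, 0.3]),
                         mu=np.array([0.0, 4.0]),
                         sigma=np.array([1.0, 1.0])))
```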

We want to find the best-fitting model by maximizing the log-likelihood.

Directly maximizing the log-likelihood with respect to the parameters $\pi$, $\mu_k$, and $\sigma_k^2$ is difficult, because:

  • Only the feature values $x$ are observed.
  • The cluster labels $Z$ are hidden; they are latent variables.
  • The log-likelihood is expressed purely in terms of the observable distribution and involves logarithms of sums over the hidden labels, which makes direct maximization intractable.

If we knew the cluster label $Z_i$ of each sample, we could easily fit the parameters $(\pi,\mu_k,\sigma_k^2)$ from the data in each cluster.

If we knew the parameters $(\pi, \mu_k,\sigma_k^2)$, we could easily compute the probability that each data point belongs to each cluster and assign cluster labels.

To get around this catch-22, we replace the unobserved cluster labels by their expected values given the current parameter estimates, and then iterate the two steps above until convergence. This is the Expectation-Maximization (EM) algorithm.

The EM algorithm

  1. Take initial guesses $\hat\pi$, $\hat\mu_k$, $\hat\sigma_k^2$ for the model parameters.
  2. Expectation step: Compute the responsibilities $P(Z_i=k\mid x_i)$ for each data point $x_i$.
  3. Maximization step: Update $\hat\pi$, $\hat\mu_k$, $\hat\sigma_k^2$ by maximizing the log-likelihood using the soft cluster assignments $P(Z_i=k\mid x_i)$.
  4. Iterate steps 2 and 3 until convergence.
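
Below is a minimal sketch of this loop for the two-component, one-dimensional model. The initialization, iteration limit, and convergence tolerance are arbitrary choices; the E- and M-step formulas are the ones derived in the next subsection.

```python
import numpy as np
from scipy.stats import norm

def em_two_gaussians(x, n_iter=200, tol=1e-8, seed=0):
    """EM for a two-component, one-dimensional Gaussian mixture (a teaching sketch)."""
    rng = np.random.default_rng(seed)
    N = len(x)

    # Step 1: initial guesses for the parameters
    pi = 0.5
    mu = rng.choice(x, size=2, replace=False)  # two random data points as initial means
    sigma2 = np.full(2, np.var(x))             # overall variance as initial variances

    log_lik_old = -np.inf
    for _ in range(n_iter):
        # Step 2 (E-step): responsibilities gamma[i, k] = P(Z_i = k | x_i)
        weights = np.array([1.0 - pi, pi])     # P(Z=0), P(Z=1)
        dens = norm.pdf(x[:, None], loc=mu, scale=np.sqrt(sigma2)) * weights
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # Step 3 (M-step): soft (responsibility-weighted) parameter updates
        Nk = gamma.sum(axis=0)                 # effective cluster sizes
        pi = Nk[1] / N
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        sigma2 = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

        # Step 4: stop when the log-likelihood no longer improves
        log_lik = np.sum(np.log(dens.sum(axis=1)))
        if log_lik - log_lik_old < tol:
            break
        log_lik_old = log_lik

    return pi, mu, sigma2, gamma
```

On data simulated as in the sketch above, the recovered parameters should be close to the values used for simulation, up to swapping of the two cluster labels.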

What are “soft cluster assignments”?

Consider $N$ samples $(x_1,\dots,x_N)$ from a normal distribution $p(x\mid \mu,\sigma^2)$. The log-likelihood is

$$ \begin{aligned} \mathcal{L}= \sum_{i=1}^N \log \left(\frac1{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(x_i-\mu)^2}\right) = -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^N (x_i-\mu)^2 \end{aligned} $$

$\mathcal{L}$ is maximized for

$$ \begin{aligned} \hat\mu &= \frac{1}{N} \sum_{i=1}^N x_i\\ \hat\sigma^2 &= \frac{1}{N} \sum_{i=1}^N (x_i-\hat\mu)^2 \end{aligned} $$
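
These expressions follow by setting the partial derivatives of $\mathcal{L}$ with respect to $\mu$ and $\sigma^2$ to zero:

$$ \begin{aligned} \frac{\partial \mathcal{L}}{\partial \mu} &= \frac{1}{\sigma^2}\sum_{i=1}^N (x_i-\mu) = 0 \quad\Longrightarrow\quad \hat\mu = \frac{1}{N} \sum_{i=1}^N x_i\\ \frac{\partial \mathcal{L}}{\partial \sigma^2} &= -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^N (x_i-\mu)^2 = 0 \quad\Longrightarrow\quad \hat\sigma^2 = \frac{1}{N} \sum_{i=1}^N (x_i-\hat\mu)^2 \end{aligned} $$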

Now consider $N$ samples $(Z_1,\dots,Z_N)$ and $(x_1,\dots,x_N)$ from the generative model where the cluster labels are also observed. The log-likelihood is

$$ \begin{aligned} \mathcal{L}&= \sum_{i=1}^N \log p(Z_i,x_i)\\ &= \sum_{i=1}^N \Bigl(Z_i \log p(Z_i=1,x_i) + (1-Z_i) \log p(Z_i=0,x_i)\Bigr) \\ &= \sum_{i=1}^N \Bigl(Z_i \log p(x_i\mid \mu_1,\sigma_1^2) + (1-Z_i) \log p(x_i\mid \mu_0,\sigma_0^2)\Bigr) + \sum_{i=1}^N \Bigl(Z_i\log\pi + (1-Z_i) \log(1-\pi)\Bigr) \end{aligned} $$

$\mathcal{L}$ is maximized for

$$ \begin{aligned} \hat\pi &= \frac{N_1}{N}\\ \hat\mu_k &= \frac{1}{N_k} \sum_{i:\,Z_i=k} x_i\\ \hat\sigma_k^2 &= \frac{1}{N_k} \sum_{i:\,Z_i=k} (x_i-\hat\mu_k)^2 \end{aligned} $$

where $N_k$ is the number of samples with $Z_i=k$.
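
As a small illustration, with observed labels these estimates reduce to simple per-cluster averages (the simulated data below reuse the arbitrary parameter values from the first sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate fully observed data (labels Z and features x); values are illustrative only
mu_true, sigma_true = np.array([0.0, 4.0]), np.array([1.0, 1.0])
Z = rng.binomial(1, 0.3, size=1000)
x = rng.normal(loc=mu_true[Z], scale=sigma_true[Z])

# Maximum-likelihood estimates when the cluster labels are observed
pi_hat = Z.mean()                                                  # N_1 / N
mu_hat = np.array([x[Z == k].mean() for k in (0, 1)])              # per-cluster means
sigma2_hat = np.array([((x[Z == k] - mu_hat[k]) ** 2).mean() for k in (0, 1)])
```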

Since the cluster labels are not observed, we don’t know the values of the $Z_i$. The “trick” is to replace the indicator that $Z_i=k$ with its expected value given the data, the responsibility $P(Z_i=k\mid x_i)$, computed using the current estimates for $\hat\pi$, $\hat\mu_k$, $\hat\sigma_k^2$.

This leads to updated estimates

$$ \begin{aligned} \hat N_k &= \sum_{i=1}^N P(Z_i=k\mid x_i)\\ \hat\pi^{\text{(new)}} &= \frac{\hat N_1}{N}\\ \hat\mu_k^{\text{(new)}} &= \frac{1}{\hat N_k} \sum_{i=1}^N P(Z_i=k\mid x_i)\; x_i \\ (\hat\sigma_k^2)^{\text{(new)}} &= \frac{1}{\hat N_k} \sum_{i=1}^N P(Z_i=k\mid x_i)\; \bigl(x_i-\hat\mu_k^{\text{(new)}}\bigr)^2 \end{aligned} $$

where $\hat N_k$ is the effective (expected) number of samples in cluster $k$.

Hence, instead of a “hard” assignment of each sample $x_i$ to one cluster $k$ when the $Z_i$ are observed, each sample now contributes with a “soft assignment” weight $P(Z_i=k\mid x_i)$ to the parameters of each cluster.

After convergence, the final $P(Z_i=k\mid x_i)$ can be used to assign data points to clusters, for instance to the cluster $k$ with highest responsibility for $x_i$.

Generalizations

We have so far considered the case where the data are one-dimensional (real numbers) and the number of clusters is fixed in advance. Important generalizations are:

  1. The data can be of any dimension $D$. In higher dimensions the components of the mixture model are multivariate normal distributions. The mean parameters $\mu_k$ simply become $D$-dimensional vectors. The variance parameters $\sigma_k^2$, however, become $D\times D$ covariance matrices. For simplicity and to reduce the number of parameters, it is often assumed that the covariance matrices are diagonal, such that the number of variance parameters again scales linearly in $D$ (see the code sketch after this list). However, when features are correlated, this assumption will be a poor representation of the underlying data.

  2. Instead of treating the cluster weights, means, and variances as fixed parameters that need to be estimated, they can be seen as random variables themselves with a distribution (uncertainty) that can be learned from the data. For more information, see this tutorial.

  3. In an infinite mixture model, the number of clusters need not be fixed in advance, but can be learned from the data. For more information, see this tutorial.
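
For the multi-dimensional case (generalization 1), a practical starting point is scikit-learn’s `GaussianMixture`, which fits such models with the EM algorithm. A minimal sketch, assuming two clusters and diagonal covariance matrices (the simulated data are for illustration only):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Simulate two-dimensional data from two clusters (illustrative values only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(300, 2)),
               rng.normal([4.0, 3.0], 1.0, size=(200, 2))])

# Fit a two-component mixture with diagonal covariance matrices
gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0).fit(X)

resp = gmm.predict_proba(X)   # soft assignments P(Z_i = k | x_i)
labels = gmm.predict(X)       # hard assignment to the cluster with highest responsibility
```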

Assignment
