
Cluster analysis

Cancer subtypes. Combinatorial clustering. Mixture distributions.

1 - Cancer subtype classification

Gene expression profiling of tumour samples.

Gene expression profiling predicts clinical outcome of breast cancer

The study by van ’t Veer et al. was one of the first to use microarrays, a brand-new technology at the time, to profile gene expression on a genome-wide scale from surgically removed tumour samples - breast tumours in this case. Another paper from around the same time is Perou et al., Molecular portraits of human breast tumours. The credit for being the first to use cluster analysis on gene expression data (from yeast) probably goes to Eisen et al., Cluster analysis and display of genome-wide expression patterns.

Van ’t Veer et al. clustered data from 98 tumours based on their similarities across approximately 5,000 differentially expressed genes - genes that showed more variation than expected by chance in the dataset. The most striking finding is in their Figure 1: the tumours segregated into two distinct groups that correlated strongly with clinical features.

Overall, tumours in the bottom group of the figure were clearly associated with measures that predict better patient outcome.

Following the discovery that unsupervised clustering of gene expression profiles identifies good and poor prognosis groups, the authors tried to derive a minimal prognostic signature from their data. This resulted in an optimal set of 70 marker genes, which they validated in an independent set of tumour samples.

Clearly, with such a strong signature, the race to bring it to the clinic was on. That this is far from trivial can be seen by tracing the follow-up studies and clinical trials.

They got there eventually, and the gene expression signature is now commercially available under the name MammaPrint.

The Cancer Genome Atlas

Although the results by van ’t Veer et al. were obtained from a small (by current standards!) sample size, they have been reproduced consistently in larger studies (see the assignment in the next cluster analysis lecture) and arguably spawned a search for similar signatures in other cancer types through large-scale projects, such as The Cancer Genome Atlas (TCGA) Program.

The amount of data and the number of publications produced by TCGA are far too large to survey here.

For the purposes of illustration, have a look at the Pan-Cancer Atlas, and then do the following assignment.

Assignment

2 - Combinatorial clustering

Combinatorial clustering algorithms, K-means.

Introduction

The goal of cluster analysis is to group or segment a collection of objects into subsets or “clusters”, such that objects within a cluster are more closely related than objects in different clusters.

Many datasets exhibit a hierarchical structure, with no clear demarcation of clusters, and clusters can themselves be grouped successively such that clusters within one group are more similar than those in different groups. Deciding where to make the “cut” is usually done by setting parameters of a clustering algorithm, and almost always involves an arbitrary choice by the user.

Central to all clustering methods is the notion of the degree of similarity (or dissimilarity) between the objects being clustered. Sometimes data are presented directly in terms of proximity/similarity between objects. More often we have measurements (e.g. gene expression) on objects (e.g. genes or samples) that we want to cluster, and a (dis)similarity matrix must be constructed first.

Dissimilarities based on measurements

Assume we have measurements $x_{ij}$ for objects $i=1,2,\dots,N$ on variables (or attributes) $j=1,2,\dots,p$. Usually, we first define a dissimilarity function $d_j(x_{ij},x_{i’j})$ between values of the $j$th attribute, and then define the dissimilarity between objects as

$$ D(x_i,x_{i’})=\sum_{j=1}^p w_j \cdot d_j(x_{ij},x_{i’j}) $$

with $w_j$ a weight assigned to the $j$th attribute that determines its relative influence in the overall dissimilarity between objects. By convention we scale the weights such that $\sum_j w_j=1$.
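In code, this weighted dissimilarity is a short function. A minimal Julia sketch (the function and variable names are illustrative, and squared differences are used as the default per-attribute dissimilarity $d_j$):

```julia
# Weighted dissimilarity D(x_i, x_i2) between rows i and i2 of a data matrix X
# (objects in rows, attributes in columns); w is the vector of attribute weights.
function dissimilarity(X, i, i2, w; dj = (a, b) -> (a - b)^2)
    return sum(w[j] * dj(X[i, j], X[i2, j]) for j in 1:size(X, 2))
end

X = randn(5, 3)                  # 5 objects measured on 3 attributes
w = fill(1/3, 3)                 # equal weights, scaled to sum to one
D = [dissimilarity(X, i, i2, w) for i in 1:5, i2 in 1:5]   # 5×5 dissimilarity matrix
```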

Setting all weights equal ($w_j=1/p$ for all $j$) does not necessarily give all attributes equal influence! To see this, we compute the average object dissimilarity over all pairs of objects as

$$ \bar D = \frac1{N^2}\sum_{i=1}^N \sum_{i’=1}^N D(x_i,x_{i’}) = \sum_{j=1}^p w_j \cdot \bar{d}_j, \qquad \bar{d}_j = \frac1{N^2}\sum_{i=1}^N \sum_{i’=1}^N d_j(x_{ij},x_{i’j}), $$ where $\bar{d}_j$ is the average dissimilarity on the $j$th attribute. Hence the relative influence of the $j$th attribute is $w_j \cdot \bar{d}_j$.

The most common choice of dissimilarity function is squared-error distance:

$$ d_j(x_{ij},x_{i’j}) = (x_{ij}-x_{i’j})^2 $$

Define the mean and variance of each attribute over all objects as

$$ \begin{aligned} \mu_j &= \frac1N \sum_{i=1}^N x_{ij}\\ \sigma_j^2 &= \frac1N \sum_{i=1}^N (x_{ij}-\mu_j)^2 \end{aligned} $$

Then $$ \begin{aligned} \bar{d}_j &= \frac1{N^2}\sum_{i=1}^N \sum_{i’=1}^N (x_{ij}-x_{i’j})^2\\ &= \frac{1}{N^2}\sum_{i=1}^N \sum_{i’=1}^N \left((x_{ij}-\mu_j) - (x_{i’j}-\mu_j)\right)^2\\ &= \frac1N \sum_{i=1}^N (x_{ij}-\mu_j)^2 + \frac1N \sum_{i’=1}^N (x_{i’j}-\mu_j)^2 = 2 \sigma_j^2, \end{aligned} $$ where the cross term $-\frac{2}{N^2}\sum_{i}\sum_{i’}(x_{ij}-\mu_j)(x_{i’j}-\mu_j)$ vanishes because $\sum_{i=1}^N (x_{ij}-\mu_j)=0$.

It is often recommended to standardize data before clustering:

$$ x_{ij} \to y_{ij}=\frac{x_{ij}-\mu_j}{\sigma_j} $$

With squared-error loss, this is equivalent to setting weights $w_j \sim 1/\sigma_j^2 \sim 1/\bar{d}_j$, that is, to giving all attributes equal influence on the average object dissimilarity.
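The equivalence between standardizing and weighting by $1/\bar{d}_j = 1/(2\sigma_j^2)$ can be checked numerically. A minimal Julia sketch (using the Statistics standard library; the data are simulated for illustration only):

```julia
using Statistics

X = randn(100, 4) .* [1.0 5.0 0.1 2.0]           # four attributes on very different scales
N, p = size(X)

mu = mean(X, dims = 1)
sigma2 = var(X, dims = 1, corrected = false)     # population variance per attribute

# Average per-attribute dissimilarity; for squared-error distance it equals 2σ_j²
dbar = [sum((X[i, j] - X[i2, j])^2 for i in 1:N, i2 in 1:N) / N^2 for j in 1:p]
@assert all(isapprox.(dbar, vec(2 .* sigma2)))

# Standardizing gives every attribute mean 0 and variance 1, hence equal average dissimilarity
Y = (X .- mu) ./ sqrt.(sigma2)
```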

Beware that sometimes some attributes exhibit more grouping tendency than others, which may be obscured by standardizing.

Combinatorial clustering

Combinatorial clustering algorithms assign each object to a cluster without regard to a probability model describing the data. Understanding combinatorial clustering is a necessary basis for understanding probabilistic methods.

In combinatorial clustering, a prespecified number of clusters $K<N$ is postulated ($N$ being the number of objects). An assignment of objects $i\in\{1,\dots,N\}$ to clusters $k\in\{1,\dots,K\}$ is characterized by a many-to-one mapping or encoder $k=C(i)$.

$C$ is obtained by minimizing the “within cluster” point scatter:

$$ W(C) = \frac12 \sum_{k=1}^K \sum_{C(i)=k} \sum_{C(i’)=k} d(x_i,x_{i’}) $$

$W(C)$ characterizes the extent to which objects assigned to the same cluster tend to be close to one another. Notice that:

$$ \begin{aligned} T &= \frac12 \sum_{i=1}^N \sum_{i’=1}^N d_{ii’} \\ &= \frac12 \sum_{k=1}^K \sum_{C(i)=k} \left(\sum_{C(i’)=k} d_{ii’} + \sum_{C(i’)\neq k} d_{ii’}\right) \\ &= W(C) + B(C) \end{aligned} $$ where $d_{ii’} = d(x_i,x_{i’})$ and

$$ B(C) = \frac12 \sum_{k=1}^K \sum_{C(i)=k} \sum_{C(i’)\neq k} d_{ii’} $$
is the “between cluster” point scatter.

$B(C)$ characterizes the extent to which objects assigned to different clusters tend to be far apart.

Since $T$ is constant given the data, minimizing $W(C)$ is equivalent to maximizing $B(C)$.
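The identity $T = W(C) + B(C)$ is easy to verify numerically for any encoder. A minimal Julia sketch (data, distances and cluster assignment are arbitrary and for illustration only):

```julia
N, K = 20, 3
x = randn(N, 2)                                              # 20 objects, 2 attributes
d = [sum(abs2, x[i, :] .- x[j, :]) for i in 1:N, j in 1:N]   # squared Euclidean distances
C = rand(1:K, N)                                             # some encoder C(i) ∈ {1,…,K}

W = 0.5 * sum(d[i, j] for i in 1:N, j in 1:N if C[i] == C[j])   # within-cluster scatter
B = 0.5 * sum(d[i, j] for i in 1:N, j in 1:N if C[i] != C[j])   # between-cluster scatter
T = 0.5 * sum(d)                                                # total point scatter
@assert W + B ≈ T
```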

$K$-means clustering

The $K$-means algorithm uses the squared Euclidean distance $$ d(x_i,x_{i’}) = \sum_{j=1}^p (x_{ij}-x_{i’j})^2 = \| x_i - x_{i’}\|^2 $$ and an iterative greedy descent algorithm to minimize $W(C)$.

Using the Euclidean distance expression, $W(C)$ can be written as $$ W(C) = \sum_{k=1}^K N_k \sum_{C(i)=k} \| x_i - \overline{x_k}\|^2 $$ where $N_k$ is the number of objects assigned to cluster $k$, and $\overline{x_k}=(\overline{x_{1k}},\dots,\overline{x_{pk}})$ is the mean vector associated with cluster $k$.

$W(C)$ is minimized if within each cluster, the average dissimilarity of the objects from the cluster mean, as defined by the points in that cluster, is minimized.

Note that for any set of objects $S$, $$ \overline{x_S} = \frac{1}{|S|} \sum_{i\in S} x_i = \operatorname*{argmin}_m \sum_{i\in S}\|x_i-m\|^2 $$

Hence $$ \begin{aligned} \min_C W(C) &= \min_C \sum_{k=1}^K N_k \sum_{C(i)=k} \| x_i - \overline{x_k}\|^2 \\ & = \min_{C} \min_{\{m_k\}}\sum_{k=1}^K N_k \sum_{C(i)=k} \| x_i - m_k\|^2 \end{aligned} $$

This result is used in a greedy descent algorithm where alternatingly the mean vectors are updated for the current cluster assignments, and object assignments are updated by assigning objects to the nearest current mean vector.
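A minimal Julia implementation of this alternating scheme could look as follows (a sketch only: random initialization, no restarts, and empty clusters simply keep their previous mean; names are illustrative):

```julia
using Statistics

# Simple K-means by alternating minimization of W(C).
# X: N×p data matrix (objects in rows); K: number of clusters.
function kmeans_simple(X, K; maxiter = 100)
    N, p = size(X)
    C = rand(1:K, N)                       # random initial assignment
    m = zeros(K, p)                        # cluster mean vectors
    for _ in 1:maxiter
        # Step 1: update each cluster mean given the current assignments
        for k in 1:K
            members = findall(==(k), C)
            isempty(members) || (m[k, :] = vec(mean(X[members, :], dims = 1)))
        end
        # Step 2: assign each object to the nearest current mean
        Cnew = [argmin([sum(abs2, X[i, :] .- m[k, :]) for k in 1:K]) for i in 1:N]
        Cnew == C && break                 # assignments stable: converged
        C = Cnew
    end
    return C, m
end

C, m = kmeans_simple(randn(100, 2), 3)     # toy usage: 100 objects, 2 attributes, K = 3
```

Because the objective is non-convex and the descent is greedy, the algorithm is usually run from several random initializations, keeping the solution with the smallest $W(C)$.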

How to choose the number of clusters $K$?

Look for a “kink” in the within-cluster dissimilarity as a function of $K$: read Elements of Statistical Learning, Sections 14.3.8 and 14.3.11.
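For example, one can run the clustering for a range of $K$ values and plot the (unweighted) within-cluster sum of squares against $K$, looking for the point where the decrease levels off. A sketch reusing the hypothetical `kmeans_simple` from above:

```julia
X = randn(200, 2) .+ 3 .* rand(0:1, 200, 2)        # toy data with some grouping structure

# Within-cluster sum of squared distances to the cluster means
wss(X, C, m) = sum(sum(abs2, X[i, :] .- m[C[i], :]) for i in 1:size(X, 1))

WK = [wss(X, kmeans_simple(X, K)...) for K in 1:8]
# Plot WK against K = 1,…,8 and look for the K where the curve "kinks".
```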

Assignment

3 - Mixture distributions

Mixture distributions

Generative models

A generative model is a statistical model for the process that could have generated your data. Generative models offer many advantages compared to combinatorial algorithms that treat data as a collection of objects. Most importantly, working with a generative model forces you to be explicit about your assumptions. Likewise, a generative model allows you to encode, and be explicit about, prior (biological) knowledge you may have about the data generating process.

Gaussian mixture models

What kind of process could generate data with separated groups or clusters of observations?


Let’s assume there is an unmeasured (or hidden) random variable $Z$ that determines to which group an observation $X$ belongs. For simplicity, assume that $Z$ can only take two values, 0 or 1, and that the measurement $X$ is one-dimensional and normally distributed in each group.

Consider the following process:

  1. Randomly sample cluster label $Z=1$ with probability $\pi$ and $Z=0$ with probability $1-\pi$.

  2. Sample features

$$ X \sim \begin{cases} \mathcal{N}(\mu_0,\sigma_0^2) & \text{if } Z=0\\ \mathcal{N}(\mu_1,\sigma_1^2) & \text{if } Z=1 \end{cases} $$

where $\mu_k$ and $\sigma_k^2$ are the means and variances of two normal distributions.
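This two-step process is straightforward to simulate. A minimal Julia sketch (parameter values are arbitrary; `prob` plays the role of $\pi$, and the tuples hold $(\mu_0,\mu_1)$ and $(\sigma_0,\sigma_1)$):

```julia
# Sample N pairs (Z, X) from the two-component Gaussian mixture.
function sample_mixture(N; prob = 0.4, mu = (0.0, 3.0), sigma = (1.0, 0.7))
    Z = [rand() < prob ? 1 : 0 for _ in 1:N]              # step 1: sample the cluster label
    X = [mu[z + 1] + sigma[z + 1] * randn() for z in Z]   # step 2: sample X given Z
    return Z, X
end

Z, X = sample_mixture(1000)    # Z holds the hidden labels, X the observed feature values
```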

What data would this process generate?


Hence the model generates cluster labels $Z$ and real numbers $x\in\mathbb{R}$ from the model

$$ Z \longrightarrow X, \qquad p(Z,x) = P(Z)\, p(x\mid Z) $$

where we use lower-case “p” for probability density functions ($X$ is continuous) and upper-case “P” for discrete probabilities, and

$$ \begin{aligned} P(Z=1) &= \pi = 1 - P(Z=0) \\ p(x\mid Z=k) &= \mathcal{N}(\mu_k,\sigma_k^2) \end{aligned} $$

Figure: the distribution generated by the model, compared with the observed distribution.

Some EM language

The joint distribution is the probability distribution of a cluster label $Z$ and feature value $x$ both being produced by the model:

$$ p(Z,x) = p(x\mid Z)\; P(Z) $$

The marginal distribution is the probability distribution that the model produces a feature value $x$:

$$ p(x) = \sum_{k=0,1} p(x\mid Z=k)\; P(Z=k) $$

The responsibility of $Z$ for feature value $x$, also called the recognition distribution, is obtained using Bayes’ theorem

$$ P(Z=k\mid x) = \frac{p(x\mid Z=k) \; P(Z=k)}{p(x)} $$

This value can be used as a soft cluster assignment: with probability $P(Z=k \mid x)$, an observed value $x$ belongs to cluster $k$.

Note that the expected value of $Z$ given a data point $x$ is:

$$ \mathbb{E}\left(Z\mid x\right) = 1 \cdot P(Z=1 \mid x) + 0\cdot P(Z=0 \mid x) = P(Z=1 \mid x) $$
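In code, the responsibility is a direct application of Bayes’ theorem to the two Gaussian densities. A minimal Julia sketch (`normpdf` is a hand-rolled helper, not a library function; parameter names are illustrative):

```julia
# Normal density with mean mu and variance s2 (helper, not a library function)
normpdf(x, mu, s2) = exp(-(x - mu)^2 / (2 * s2)) / sqrt(2π * s2)

# Responsibility P(Z = 1 | x) = E(Z | x) for the two-component mixture
function responsibility(x; prob, mu0, s20, mu1, s21)
    num = prob * normpdf(x, mu1, s21)
    return num / (num + (1 - prob) * normpdf(x, mu0, s20))
end

responsibility(1.5; prob = 0.4, mu0 = 0.0, s20 = 1.0, mu1 = 3.0, s21 = 0.5)
```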

Maximum-likelihood estimation

To fit the model to the data, we can only use the observed data $x$, which follows the Gaussian mixture distribution

$$ p(x) = \sum_{k=0,1} p(x\mid Z=k)\; P(Z=k) $$


The log-likelihood of observing $N$ independent samples $(x_1,\dots,x_N)$ is

$$ \mathcal{L}= \log\left(\prod_{i=1}^N p(x_{i}) \right) = \sum_{i=1}^N \log p(x_{i}) $$

We want to find the best-fitting model by maximizing the log-likelihood.

Directly maximizing the log-likelihood with respect to the parameters $\pi$, $\mu_k$, and $\sigma_k^2$ is difficult, because:

  • Only the feature values $x$ are observed.
  • The cluster labels $Z$ are hidden; they are latent variables.
  • The log-likelihood is expressed purely in terms of the observable distribution, involves logarithms of sums, and cannot be maximized in closed form.

If we knew the cluster labels $k$ for each sample, we could easily fit the parameters $(\pi,\mu_k,\sigma_k^2)$ from the data for each cluster.

If we knew the parameters $(\pi, \mu_k,\sigma_k^2)$, we could easily determine the probability for each data point to belong to each cluster and determine cluster labels.

To get around this catch-22, we replace the unobserved cluster labels by their expected values given the current parameter estimates, and then iterate the above two steps until convergence - this is the Expectation-Maximization (EM) algorithm.

The EM algorithm

  1. Take initial guesses $\hat\pi$, $\hat\mu_k$, $\hat\sigma_k^2$ for the model parameters.
  2. Expectation step: Compute the responsibilities $P(Z_i=k\mid x_i)$ for each data point $x_i$.
  3. Maximization step: Update $\hat\pi$, $\hat\mu_k$, $\hat\sigma_k^2$ by maximizing the log-likelihood using the soft cluster assignments $P(Z_i=k\mid x_i)$.
  4. Iterate steps 2 and 3 until convergence.

What are “soft cluster assignments”?

Consider $N$ samples $(x_1,\dots,x_N)$ from a normal distribution $p(x\mid \mu,\sigma^2)$. The log-likelihood is

$$ \mathcal{L}= \sum_{i=1}^N \log \left(\frac1{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(x_i-\mu)^2}\right) = -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^N (x_i-\mu)^2 $$

$\mathcal{L}$ is maximized for

$$ \begin{aligned} \hat\mu &= \frac{1}{N} \sum_{i=1}^N x_i\\ \hat\sigma^2 &= \frac{1}{N} \sum_{i=1}^N (x_i-\hat\mu)^2 \end{aligned} $$

Now consider $N$ samples $(Z_1,\dots,Z_N)$ and $(x_1,\dots,x_N)$ from the generative model where the cluster labels are also observed. The log-likelihood is

$$ \begin{aligned} \mathcal{L}&= \sum_{i=1}^N \log p(Z_i,x_i)\\ &= \sum_{i=1}^N \Bigl(Z_i \log p(Z_i=1,x_i) + (1-Z_i) \log p(Z_i=0,x_i)\Bigr) \\ &= \sum_{i=1}^N \Bigl(Z_i \log p(x_i\mid \mu_1,\sigma_1^2) + (1-Z_i) \log p(x_i\mid \mu_0,\sigma_0^2)\Bigr) + \sum_{i=1}^N \Bigl(Z_i\log\pi + (1-Z_i) \log(1-\pi)\Bigr) \end{aligned} $$

$\mathcal{L}$ is maximized for

$$ \begin{aligned} \hat\pi &= \frac{N_1}{N}\\ \hat\mu_k &= \frac{1}{N_k} \sum_{Z_i=k} x_i\\ \hat\sigma_k^2 &= \frac{1}{N_k} \sum_{Z_i=k} (x_i-\hat\mu_k)^2 \end{aligned} $$ where $N_k$ is the number of samples with $Z_i=k$.

Since the cluster labels are not observed, we don’t know the values of the $Z_i$. The “trick” of the EM algorithm is to replace them by their expected values given the data, $\mathbb{E}(Z_i\mid x_i) = P(Z_i=1\mid x_i)$ (and $1-Z_i$ by $P(Z_i=0\mid x_i)$), computed using the current estimates for $\hat\pi$, $\hat\mu_k$, $\hat\sigma_k^2$.

This leads to updated estimates

$$ \begin{aligned} \hat N_k &= \sum_{i=1}^N P(Z_i=k\mid x_i)\\ \hat\pi^{\text{(new)}} &= \frac{\hat N_1}{N}\\ \hat\mu_k^{\text{(new)}} &= \frac{1}{\hat N_k} \sum_{i=1}^N P(Z_i=k\mid x_i)\; x_i \\ (\hat\sigma_k^2)^{\text{(new)}} &= \frac{1}{\hat N_k} \sum_{i=1}^N P(Z_i=k\mid x_i)\; (x_i-\hat\mu_k^{\text{(new)}})^2 \end{aligned} $$ where $\hat N_k$ is the effective number of samples assigned to cluster $k$.

Hence, instead of a “hard” assignment of each sample $x_i$ to one cluster $k$ when the $Z_i$ are observed, each sample now contributes with a “soft assignment” weight $P(Z_i=k\mid x_i)$ to the parameters of each cluster.

After convergence, the final $P(Z_i=k\mid x_i)$ can be used to assign data points to clusters, for instance to the cluster $k$ with highest responsibility for $x_i$.
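Putting the E and M steps together for the one-dimensional two-component model gives the following minimal Julia sketch (illustrative only: the same hand-rolled `normpdf` helper as above, crude initialization, and a simplistic convergence check):

```julia
using Statistics

# Normal density with mean mu and variance s2 (helper, not a library function)
normpdf(x, mu, s2) = exp(-(x - mu)^2 / (2 * s2)) / sqrt(2π * s2)

# EM for a two-component 1D Gaussian mixture. Returns the estimates of
# (π, (μ₀,μ₁), (σ₀²,σ₁²)) and the responsibilities gamma[i] = P(Z_i = 1 | x_i).
function em_gmm(x; maxiter = 500, tol = 1e-8)
    N = length(x)
    prob, mu, s2 = 0.5, (minimum(x), maximum(x)), (var(x), var(x))   # crude initial guesses
    gamma = zeros(N)
    for _ in 1:maxiter
        # E step: responsibilities P(Z_i = 1 | x_i)
        for i in 1:N
            a = prob * normpdf(x[i], mu[2], s2[2])
            b = (1 - prob) * normpdf(x[i], mu[1], s2[1])
            gamma[i] = a / (a + b)
        end
        # M step: soft-assignment versions of the complete-data estimates
        N1 = sum(gamma); N0 = N - N1
        mu_new = (sum((1 .- gamma) .* x) / N0, sum(gamma .* x) / N1)
        s2_new = (sum((1 .- gamma) .* (x .- mu_new[1]).^2) / N0,
                  sum(gamma .* (x .- mu_new[2]).^2) / N1)
        prob_new = N1 / N
        converged = abs(prob_new - prob) < tol
        prob, mu, s2 = prob_new, mu_new, s2_new
        converged && break
    end
    return prob, mu, s2, gamma
end

x = [rand() < 0.4 ? 3.0 + 0.7 * randn() : randn() for _ in 1:2000]  # toy mixture data
prob, mu, s2, gamma = em_gmm(x)
labels = gamma .> 0.5      # hard assignment: cluster with the highest responsibility
```

The M step here is exactly the soft-weighted form of the complete-data estimates derived above, and the last line implements the hard assignment of each point to the cluster with the highest responsibility.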

Generalizations

We have so far considered the case where the data are one-dimensional (real numbers) and the number of clusters is pre-fixed. Important generalizations are:

  1. The data can be of any dimension $D$. In higher dimensions the components of the mixture model are multivariate normal distributions. The mean parameters $\mu_k$ simply become $D$-dimensional vectors. The variance parameters $\sigma_k^2$ however become $D\times D$ covariance matrices. For simplicity and to reduce the number of parameters, it is often assumed that the covariance matrices are diagonal, such that the number of variance parameters again scales linearly in $D$. However, when features are correlated, this assumption will be a poor representation of the underlying data.

  2. Instead of treating the cluster weights, means, and variances as fixed parameters that need to be estimated, they can be seen as random variables themselves with a distribution (uncertainty) that can be learned from the data. For more information, see this tutorial.

  3. In an infinite mixture model, the number of clusters need not be fixed in advance, but can be learned from the data. For more information, see this tutorial.

Assignment

4 - Combinatorial clustering tutorial

Illustrating data exploration and K-means clustering using TCGA breast cancer data.

Tutorials are available as Pluto notebooks. To run the reactive notebooks, you need to first follow the instructions in the “Code” section of the repository’s readme file, then follow the instructions at the bottom of the Pluto homepage.