Cluster analysis
1 - Cancer subtype classification
Gene expression profiling predicts clinical outcome of breast cancer
Classic paper
van ’t Veer L et al. Gene expression profiling predicts clinical outcome of breast cancer . Nature 415:530 (2002).
See also the comment: The molecular outlook
The study by van ’t Veer et al. was one of the first to use microarrays, a brand-new technology at the time, to profile gene expression on a genome-wide scale from surgically removed tumour samples - breast tumours in this case. Another paper from around the same time is: Perou et al. Molecular portraits of human breast tumours. The credit for the first use of cluster analysis on gene expression data (from yeast) probably goes to Eisen et al. Cluster analysis and display of genome-wide expression patterns.
Van ’t Veer et al. clustered data from 98 tumours based on their similarities across approximately 5,000 differentially expressed genes - genes that showed more variation than expected by chance in the dataset. The most striking finding is in their Figure 1: the tumours segregated into two distinct groups that correlated strongly with clinical features, namely:
- BRCA1 germline mutation: harmful variants in the BRCA1 or BRCA2 genes that markedly increase risk for developing breast cancer.
- Estrogen receptor (ER) status: breast tumour cells that express ER on their surface need estrogen to grow, and are therefore more susceptible to hormone therapy.
- Tumour grade: a measure of degree of abnormality of cancer cells.
- Lymphocyte infiltration: an indication of whether the cancer has spread to the lymph nodes.
- Angioinvasion: an indication of whether the cancer has spread to the blood vessels.
- Metastatic status: an indication of whether the cancer has spread to other organs.
Overall, tumours in the bottom group of the figure were clearly associated with measures that predict better patient outcome.
Cluster analysis of breast tumours
a, Two-dimensional presentation of transcript ratios for 98 breast tumours across 4,968 significant genes. Each row represents a tumour and each column a single gene. As shown in the colour bar, red indicates upregulation, green downregulation, black no change, and grey no data available. The yellow line marks the subdivision into two dominant tumour clusters. b, Clinical data for the 98 patients. White indicates positive, black negative and grey denotes tumours derived from BRCA1 germline carriers who were excluded from the metastasis evaluation. c, Enlarged portion from a containing a group of genes that co-regulate with the ER-α gene (ESR1). Each gene is labelled by its gene name or accession number from GenBank. Contig ESTs ending with RC are reverse-complementary of the named contig EST. d, Enlarged portion from a containing a group of co-regulated genes that are the molecular reflection of extensive lymphocytic infiltrate, and comprise a set of genes expressed in T and B cells. (Gene annotation as in c.)
Figure obtained from full text on EuropePMC.
Following the discovery that unsupervised clustering of gene expression profiles identifies good and poor prognosis groups, the authors sought a minimal prognostic signature in their data, arriving at an optimal set of 70 marker genes that they validated in an independent set of tumour samples.
Prognostic signature of breast tumours
a, Use of prognostic reporter genes to identify optimally two types of disease outcome from 78 sporadic breast tumours into a poor prognosis and good prognosis group. b, Expression data matrix of 70 prognostic marker genes. Each row represents a tumour and each column a gene, whose name is labelled between b and c. Genes are ordered according to their correlation coefficient with the two prognostic groups. Tumours are ordered by the correlation to the average profile of the good prognosis group (middle panel). Solid line, prognostic classifier with optimal accuracy; dashed line, with optimized sensitivity. Above the dashed line patients have a good prognosis signature, below the dashed line the prognosis signature is poor. The metastasis status for each patient is shown in the right panel: white indicates patients who developed distant metastases within 5 years after the primary diagnosis; black indicates patients who continued to be disease-free for at least 5 years. c, Same as for b, but the expression data matrix is for tumours of 19 additional breast cancer patients using the same 70 optimal prognostic marker genes. Thresholds in the classifier (solid and dashed line) are the same as b. (See Fig. 1 for colour scheme.)
Figure obtained from full text on EuropePMC.
Clearly, with such a strong signature, the race to bring it to the clinic is on. That this is far from trivial can be seen by tracing the follow-up studies and clinical trials:
- Van De Vijver MJ et al. A gene-expression signature as a predictor of survival in breast cancer. NEJM 347:1999 (2002).
- Buyse M et al. Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer. J Natl Cancer Inst 98:1183 (2006).
- Mook S et al. Individualization of therapy using MammaPrint: From development to the MINDACT Trial. Cancer Genomics & Proteomics 4:147 (2007).
- Cardoso F et al. 70-gene signature as an aid to treatment decisions in early-stage breast cancer. NEJM 375:717 (2016).
- Brandão M, Pondé N, Piccart-Gebhart M. MammaPrint: a comprehensive review. Future Oncol 15:207 (2019).
They got there eventually, and the gene expression signature is now commercially available under the name MammaPrint.
The Cancer Genome Atlas
Although the results by Van ’t Veer et al. were obtained from a small (by current standards!) sample size, they have been reproduced consistently in larger studies (see the assignment in the next cluster analysis lecture) and arguably spawned a search for similar signatures in other cancer types through large-scale projects, such as The Cancer Genome Atlas (TCGA) Program.
The amount of data and number of publications produced by TCGA is too enormous to survey here.
For the purposes of illustration, have a look at the Pan-Cancer Atlas, and then do the following assignment.
Assignment
Reading assignment
Read the Pan-Cancer Atlas flagship paper, Cell-of-Origin Patterns Dominate the Molecular Classification of 10,000 Tumors from 33 Types of Cancer.
Answer the following questions:
- Which data did the study analyze? Where do the different data types map on the genotype to phenotype axis?
- Why are data from all cancer types analyzed together? What is the underlying hypothesis or motivation for the study? Did the study achieve its aim?
- What is observed when data types are clustered independently? Do the same clusters reappear in multiple data types? Do clusters overlap with cancer types?
- What is observed when data types are clustered together? Which data types are included in the joint analysis and why?
- What is the main difference between the COCA and iCluster methods? What does the TumorMap show?
- The final number of clusters (28) is close to the number of cancer types (33). What do you think this means?
- What do you think is the main challenge when jointly clustering multiple data types and how would you address it?
2 - Combinatorial clustering
Reference
The material in this section is mostly borrowed from:
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning (second edition) (2009).
https://hastie.su.domains/ElemStatLearn/
https://link.springer.com/book/10.1007%2F978-0-387-84858-7
Section 14.2
Introduction
The goal of cluster analysis is to group or segment a collection of objects into subsets or “clusters”, such that objects within a cluster are more closely related than objects in different clusters.
Many datasets exhibit a hierarchical structure, with no clear demarcation of clusters, and clusters can themselves be grouped successively such that clusters within one group are more similar than those in different groups. Deciding where to make the “cut” is usually done by setting parameters of a clustering algorithm, and almost always involves an arbitrary choice by the user.
Central to all clustering methods is the notion of the degree of similarity (or dissimilarity) between the objects being clustered. Sometimes data are presented directly in terms of proximity/similarity between objects. More often we have measurements (e.g. gene expression) on objects (e.g. genes or samples) that we want to cluster, and a (dis)similarity matrix must be constructed first.
Dissimilarities based on measurements
Assume we have measurements $x_{ij}$ for objects $i=1,2,\dots,N$ on variables (or attributes) $j=1,2,\dots,p$. Usually, we first define a dissimilarity function $d_j(x_{ij},x_{i'j})$ between values of the $j$th attribute, and then define the dissimilarity between objects as
$$ D(x_i,x_{i'})=\sum_{j=1}^p w_j \cdot d_j(x_{ij},x_{i'j}) $$
with $w_j$ a weight assigned to the $j$th attribute that determines its relative influence in the overall dissimilarity between objects. By convention we scale the weights such that $\sum_j w_j=1$.
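To make the definition concrete, here is a minimal sketch (the function name and the choice of squared-error attribute dissimilarity $d_j$ are ours, for illustration only):

```python
import numpy as np

def object_dissimilarity(xi, xj, w):
    """D(x_i, x_i') = sum_j w_j * d_j(x_ij, x_i'j), using squared-error
    attribute dissimilarity d_j(a, b) = (a - b)^2.  The weights are
    rescaled so that sum_j w_j = 1, following the convention above."""
    xi, xj, w = (np.asarray(a, dtype=float) for a in (xi, xj, w))
    w = w / w.sum()
    return float(np.sum(w * (xi - xj) ** 2))

# Two objects with p = 3 attributes and equal weights:
D = object_dissimilarity([1.0, 2.0, 3.0], [2.0, 2.0, 5.0], [1.0, 1.0, 1.0])
# D = (1 + 0 + 4) / 3 = 5/3
```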
Equal attribute influence
To give all attributes equal influence in the object dissimilarity, we must set $w_j\sim 1/\overline{d_j}$, with
$$ \overline{d_j}=\frac{1}{N^2} \sum_{i=1}^N \sum_{i'=1}^N d_j(x_{ij},x_{i'j}) $$
Setting $w_j=1$ for all $j$ does not necessarily give all attributes equal influence! To see this, we compute the average object dissimilarity over all pairs of objects as
$$ \begin{aligned} \bar D &= \frac1{N^2}\sum_{i=1}^N \sum_{i'=1}^N D(x_i,x_{i'}) = \sum_{j=1}^p w_j \cdot \bar{d}_j, \end{aligned} $$ with $\bar{d}_j$ defined above. Hence the relative influence of the $j$th attribute is $w_j \cdot \bar{d}_j$.
Squared error distance and standardization
The most common choice of dissimilarity function is squared-error distance, for which the average object dissimilarity on the $j$th attribute turns out to be proportional to its variance:
$$ d_j(x_{ij},x_{i’j}) = (x_{ij}-x_{i’j})^2 $$
Define the mean and variance of each attribute over all objects as
$$ \begin{aligned} \mu_j &= \frac1N \sum_{i=1}^N x_{ij}\\ \sigma_j^2 &= \frac1N \sum_{i=1}^N (x_{ij}-\mu_j)^2 \end{aligned} $$
Then $$ \begin{aligned} \overline{d_j} &= \frac1{N^2}\sum_{i=1}^N \sum_{i'=1}^N (x_{ij}-x_{i'j})^2\\ &= \frac{1}{N^2}\sum_{i=1}^N \sum_{i'=1}^N \left((x_{ij}-\mu_j) - (x_{i'j}-\mu_j)\right)^2\\ &= \frac1N \sum_{i=1}^N (x_{ij}-\mu_j)^2 + \frac1N \sum_{i'=1}^N (x_{i'j}-\mu_j)^2 = 2 \sigma_j^2 \end{aligned} $$
where the cross term vanishes because $\sum_i (x_{ij}-\mu_j)=0$.
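A quick numerical sanity check of this identity, using arbitrary random data for a single attribute:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)   # one attribute measured on N = 50 objects

# Average dissimilarity over all ordered pairs: (1/N^2) sum_{i,i'} (x_i - x_i')^2
d_bar = np.mean((x[:, None] - x[None, :]) ** 2)

# Variance with the 1/N convention used in the text
var = np.mean((x - x.mean()) ** 2)

assert np.isclose(d_bar, 2 * var)   # d_bar_j = 2 * sigma_j^2
```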
It is often recommended to standardize data before clustering:
$$ x_{ij} \to y_{ij}=\frac{x_{ij}-\mu_j}{\sigma_j} $$
With squared-error loss, this is equivalent to setting weights $w_j \sim 1/\sigma_j^2 \sim 1/\bar{d}_j$, that is, to give all attributes equal influence on the average object dissimilarity.
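A minimal sketch of this standardization (the function name is ours; note the use of the $1/N$-normalized standard deviation, matching the definitions above):

```python
import numpy as np

def standardize(x):
    """Map x_ij -> (x_ij - mu_j) / sigma_j column-wise.
    x has shape (N objects, p attributes)."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

# After standardizing, every attribute has mean 0 and variance 1, so with
# squared-error distance every attribute has the same average dissimilarity.
y = standardize(np.random.default_rng(1).normal(loc=3.0, scale=2.0, size=(100, 4)))
```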
Beware that some attributes exhibit more grouping tendency than others, and this structure can be obscured by standardizing.
To standardize or not?
Randomly sampled data from a mixture of two multivariate distributions with means differing only in the first dimension, showing the raw (left) and standardized (right) data colored according to K-means cluster label. Standardization has obscured the two well-separated groups. Note that each plot uses the same units on the horizontal and vertical axes.
Filter attributes by their variance before standardizing…
Combinatorial clustering
Combinatorial clustering algorithms assign each object to a cluster without regard to a probability model describing the data. Understanding combinatorial clustering is a necessary basis for understanding probabilistic methods.
In combinatorial clustering, a prespecified number of clusters $K<N$ is postulated ($N$ the number of objects). An assignment of objects $i\in\{1,\dots,N\}$ to clusters $k\in\{1,\dots,K\}$ is characterized by a many-to-one mapping or encoder $k=C(i)$.
$C$ is obtained by minimizing the “within cluster” point scatter:
$$ W(C) = \frac12 \sum_{k=1}^K \sum_{C(i)=k} \sum_{C(i')=k} d(x_i,x_{i'}) $$
$W(C)$ characterizes the extent to which objects assigned to the same cluster tend to be close to one another. Notice that:
$$ \begin{aligned} T &= \frac12 \sum_{i=1}^N \sum_{i'=1}^N d_{ii'} \\ &= \frac12 \sum_{k=1}^K \sum_{C(i)=k} \left(\sum_{C(i')=k} d_{ii'} + \sum_{C(i')\neq k} d_{ii'}\right) \\ &= W(C) + B(C) \end{aligned} $$ where $d_{ii'} = d(x_i,x_{i'})$ and
$$
B(C) = \frac12 \sum_{k=1}^K \sum_{C(i)=k} \sum_{C(i')\neq k} d_{ii'}
$$
is the “between cluster” point scatter.
$B(C)$ characterizes the extent to which objects assigned to different clusters tend to be far apart.
Since $T$ is constant given the data, minimizing $W(C)$ is equivalent to maximizing $B(C)$.
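The decomposition $T = W(C) + B(C)$ holds for any assignment $C$, which a short numerical check makes concrete (arbitrary random data and cluster labels, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(30, 2))       # N = 30 objects with p = 2 attributes
C = rng.integers(0, 3, size=30)    # an arbitrary assignment to K = 3 clusters

# Pairwise squared Euclidean distances d_{ii'}
d = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)

same = C[:, None] == C[None, :]    # mask of pairs in the same cluster
T = 0.5 * d.sum()                  # total point scatter
W = 0.5 * d[same].sum()            # within-cluster point scatter
B = 0.5 * d[~same].sum()           # between-cluster point scatter

assert np.isclose(T, W + B)        # T = W(C) + B(C) for any assignment C
```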
$K$-means clustering
The $K$-means algorithm uses the squared Euclidean distance $$ d(x_i,x_{i'}) = \sum_{j=1}^p (x_{ij}-x_{i'j})^2 = \| x_i - x_{i'}\|^2 $$ and an iterative greedy descent algorithm to minimize $W(C)$.
Using the Euclidean distance expression, $W(C)$ can be written as $$ W(C) = \sum_{k=1}^K N_k \sum_{C(i)=k} \| x_i - \overline{x_k}\|^2 $$ where $N_k$ is the number of objects assigned to cluster $k$, and $\overline{x_k}=(\overline{x_{1k}},\dots,\overline{x_{pk}})$ is the mean vector associated with cluster $k$.
$W(C)$ is minimized if within each cluster, the average dissimilarity of the objects from the cluster mean, as defined by the points in that cluster, is minimized.
Note that for any set of objects $S$, $$ \overline{x_S} = \frac{1}{|S|} \sum_{i\in S} x_i = \argmin_m \sum_{i\in S}\|x_i-m\|^2 $$
Hence $$ \begin{aligned} \min_C W(C) &= \min_C \sum_{k=1}^K N_k \sum_{C(i)=k} \| x_i - \overline{x_k}\|^2 \\ & = \min_{C} \min_{\{m_k\}}\sum_{k=1}^K N_k \sum_{C(i)=k} \| x_i - m_k\|^2 \end{aligned} $$
This result is used in a greedy descent algorithm where alternatingly the mean vectors are updated for the current cluster assignments, and object assignments are updated by assigning objects to the nearest current mean vector.
K-means algorithm
- For a given cluster assignment $C$, the total cluster variance $W(C)$ is minimized with respect to $\{m_1,\dots, m_K\}$ yielding the means of the currently assigned clusters. That is
$$ m_k = \frac1{N_k} \sum_{C(i)=k} x_i $$
- Given a current set of means $\{m_1,\dots, m_K\}$, $W(C)$ is minimized by assigning each observation to the closest (current) cluster mean. That is,
$$ C(i) = \argmin_{1\leq k\leq K} \| x_i - m_k \| $$
- These steps are iterated until the assignments do not change.
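The steps above can be sketched as follows; this is a minimal illustration (names and initialization scheme are ours), not an optimized implementation:

```python
import numpy as np

def kmeans(x, K, n_iter=100, seed=0):
    """Greedy descent for K-means: alternate mean updates and
    nearest-mean assignments until the assignments stop changing.
    x has shape (N, p)."""
    rng = np.random.default_rng(seed)
    m = x[rng.choice(len(x), size=K, replace=False)].copy()  # initial means
    C = np.full(len(x), -1)
    for _ in range(n_iter):
        # Assignment step: each object goes to the closest current mean
        dist = np.sum((x[:, None, :] - m[None, :, :]) ** 2, axis=-1)
        C_new = np.argmin(dist, axis=1)
        if np.array_equal(C_new, C):
            break                      # converged: assignments unchanged
        C = C_new
        # Update step: each mean becomes the average of its assigned objects
        for k in range(K):
            if np.any(C == k):         # keep the old mean if a cluster empties
                m[k] = x[C == k].mean(axis=0)
    return C, m
```

Since the greedy descent only finds a local minimum of $W(C)$, practical implementations (e.g. in scipy or scikit-learn) are typically run with multiple random restarts.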
How to choose the number of clusters $K$?
Find the “kink” in the within-cluster dissimilarity as a function of $K$: see Elements of Statistical Learning, Sections 14.3.8 and 14.3.11.
Assignment
TCGA BRCA data analysis
We will analyze expression data from the TCGA paper:
Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70 (2012).
Gene expression data are available from:
https://gdc.cancer.gov/about-data/publications/brca_2012
Download the expression data file “BRCA.exp.348.med.txt” and the paper Supplementary Tables spreadsheet.
Filter for the most variable genes, and create data structures for the expression data, estrogen receptor (ER) status, and the AJCC cancer stage of each individual.
Apply K-means clustering to the expression data. Does your K-means implementation standardize data by default or not? Choose an appropriate value of $K$ in K-means and justify your choice (cf. ESL Sections 14.3.8 and 14.3.11). Compare standardized vs non-standardized data.
Do the individuals cluster by ER status? By cancer stage?
3 - Mixture distributions
Reference
The material in this section is mostly borrowed from:
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning (second edition) (2009).
https://hastie.su.domains/ElemStatLearn/
https://link.springer.com/book/10.1007%2F978-0-387-84858-7
Section
Generative models
A generative model is a statistical model for the process that could have generated your data. Generative models offer many advantages compared to combinatorial algorithms that treat data as a collection of objects. Most importantly, working with a generative model forces you to be explicit about your assumptions. Likewise, a generative model allows you to encode, and be explicit about, prior (biological) knowledge you may have about the data generating process.
Gaussian mixture models
What kind of process could generate data with separated groups or clusters of observations?
Let’s assume there is an unmeasured (or hidden) random variable $Z$ that determines to which group an observation $X$ belongs. For simplicity, assume that $Z$ can only take two values, 0 or 1, and that the measurement $X$ is one-dimensional and normally distributed in each group.
Consider the following process:
Randomly sample cluster label $Z=1$ with probability $\pi$ and $Z=0$ with probability $1-\pi$.
Sample features
$$ X \sim \begin{cases} \mathcal{N}(\mu_0,\sigma_0^2) & \text{if } Z=0\\ \mathcal{N}(\mu_1,\sigma_1^2) & \text{if } Z=1 \end{cases} $$
where $\mu_k$ and $\sigma_k^2$ are the means and variances of two normal distributions.
What data would this process generate?
Hence the model generates cluster labels $Z$ and real numbers $x\in\mathbb{R}$ from the model
$$ \begin{aligned} Z &\longrightarrow X\\ p(Z,x) &= P(Z)\, p(x\mid Z) \end{aligned} $$
where we use lower-case “p” for probability density functions ($X$ is continuous) and upper-case “P” for discrete probabilities, and
$$ \begin{aligned} P(Z=1) &= \pi = 1 - P(Z=0) \\ p(x\mid Z=k) &= \mathcal{N}(\mu_k,\sigma_k^2) \end{aligned} $$
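The two-step sampling process can be sketched directly (the function name and parameter layout are illustrative):

```python
import numpy as np

def sample_mixture(N, pi, mu, sigma, seed=0):
    """Draw N samples (Z_i, x_i) from the generative model:
    Z = 1 with probability pi, then x | Z=k ~ N(mu[k], sigma[k]^2)."""
    rng = np.random.default_rng(seed)
    Z = (rng.random(N) < pi).astype(int)                  # hidden cluster labels
    x = rng.normal(np.asarray(mu)[Z], np.asarray(sigma)[Z])
    return Z, x

Z, x = sample_mixture(10_000, pi=0.3, mu=[0.0, 5.0], sigma=[1.0, 0.5])
```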
Distribution generated by the model:
Observed distribution:
Some EM language
The joint distribution is the probability distribution of a cluster label $Z$ and feature value $x$ both being produced by the model:
$$ p(Z,x) = p(x\mid Z)\; P(Z) $$
The marginal distribution is the probability distribution that the model produces a feature value $x$:
$$ p(x) = \sum_{k=0,1} p(x\mid Z=k)\; P(Z=k) $$
The responsibility of $Z$ for feature value $x$, also called the recognition distribution, is obtained using Bayes’ theorem
$$ P(Z=k\mid x) = \frac{p(x\mid Z=k) \; P(Z=k)}{p(x)} $$
This value can be used as a soft cluster assignment: with probability $P(Z=k \mid x)$, an observed value $x$ belongs to cluster k.
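Bayes’ theorem translates directly into code. A minimal sketch (names are ours; the normal density is hand-rolled to stay self-contained):

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def responsibility(x, pi, mu, sigma2):
    """P(Z=1 | x) via Bayes' theorem for the two-component model."""
    x = np.asarray(x, dtype=float)
    joint0 = (1 - pi) * normal_pdf(x, mu[0], sigma2[0])   # p(x | Z=0) P(Z=0)
    joint1 = pi * normal_pdf(x, mu[1], sigma2[1])         # p(x | Z=1) P(Z=1)
    return joint1 / (joint0 + joint1)                     # divide by marginal p(x)

# A point midway between two equal-weight, equal-variance components
# gets responsibility 1/2:
r = responsibility(2.0, pi=0.5, mu=[0.0, 4.0], sigma2=[1.0, 1.0])  # r = 0.5
```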
Note that the expected value of $Z$ given a data point $x$ is:
$$ \mathbb{E}\left(Z\mid x\right) = 1 \cdot P(Z=1 \mid x) + 0\cdot P(Z=0 \mid x) = P(Z=1 \mid x) $$
Maximum-likelihood estimation
To fit the model to the data, we can only use the observed data $x$, which follows the Gaussian mixture distribution
$$ p(x) = \sum_{k=0,1} p(x\mid Z=k)\; P(Z=k) $$
The log-likelihood of observing $N$ independent samples $(x_1,\dots,x_N)$ is
$$ \mathcal{L}= \log\left(\prod_{i=1}^N p(x_{i}) \right) = \sum_{i=1}^N \log p(x_{i}) $$
We want to find the best-fitting model by maximizing the log-likelihood.
Directly maximizing the log-likelihood with respect to the parameters $\pi$, $\mu_k$, and $\sigma_k^2$ is difficult, because:
- Only the feature values $x$ are observed.
- The cluster labels $Z$ are hidden, they are latent variables.
- The log-likelihood is expressed purely in terms of the observable distribution; it involves logarithms of sums and has no tractable closed-form maximum.
If we knew the cluster labels $k$ for each sample, we could easily fit the parameters $(\pi,\mu_k,\sigma_k^2)$ from the data for each cluster.
If we knew the parameters $(\pi, \mu_k,\sigma_k^2)$, we could easily determine the probability for each data point to belong to each cluster and determine cluster labels.
To get around this catch-22, we replace actual cluster labels by their current expected values given current values for the parameters, and then iterate the above two steps until convergence - this is the Expectation-Maximization (EM) algorithm.
The EM algorithm
- Take initial guesses $\hat\pi$, $\hat\mu_k$, $\hat\sigma_k^2$ for the model parameters.
- Expectation step: Compute the responsibilities $P(Z_i=k\mid x_i)$ for each data point $x_i$.
- Maximization step: Update $\hat\pi$, $\hat\mu_k$, $\hat\sigma_k^2$ by maximizing the log-likelihood using the soft cluster assignments $P(Z_i=k\mid x_i)$.
- Iterate steps 2 and 3 until convergence.
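A minimal sketch of the full algorithm for the two-component model (initialization heuristics and names are ours; the soft-assignment parameter updates are derived below):

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def em_gmm(x, n_iter=200):
    """EM for a two-component 1-D Gaussian mixture.  Initial guesses are
    simple heuristics (means at the data extremes, variances at the overall
    variance); returns parameter estimates and the final responsibilities
    gamma_i = P(Z_i = 1 | x_i)."""
    pi = 0.5
    mu = np.array([x.min(), x.max()])
    sigma2 = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E step: responsibilities via Bayes' theorem
        p0 = (1 - pi) * normal_pdf(x, mu[0], sigma2[0])
        p1 = pi * normal_pdf(x, mu[1], sigma2[1])
        gamma = p1 / (p0 + p1)
        # M step: responsibility-weighted (soft-assignment) updates
        w = np.stack([1.0 - gamma, gamma])     # per-component weights, shape (2, N)
        Nk = w.sum(axis=1)                     # effective cluster sizes
        pi = Nk[1] / len(x)
        mu = (w @ x) / Nk
        sigma2 = np.array([np.sum(w[k] * (x - mu[k]) ** 2) for k in (0, 1)]) / Nk
    return pi, mu, sigma2, gamma
```

Note that the component labels are not identifiable (swapping the two components leaves the likelihood unchanged), so estimates should be compared to the true parameters up to relabelling.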
What are “soft cluster assignments”?
Consider $N$ samples $(x_1,\dots,x_N)$ from a normal distribution $p(x\mid \mu,\sigma^2)$. The log-likelihood is
$$ \begin{aligned} \mathcal{L}= \sum_{i=1}^N \log \left(\frac1{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(x_i-\mu)^2}\right) = -\frac{N}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^N (x_i-\mu)^2 \end{aligned} $$
$\mathcal{L}$ is maximized for
$$ \begin{aligned} \hat\mu &= \frac{1}{N} \sum_{i=1}^N x_i\\ \hat\sigma^2 &= \frac{1}{N} \sum_{i=1}^N (x_i-\hat\mu)^2 \end{aligned} $$
Now consider $N$ samples $(Z_1,\dots,Z_N)$ and $(x_1,\dots,x_N)$ from the generative model where the cluster labels are also observed. The log-likelihood is
$$ \begin{aligned} \mathcal{L}&= \sum_{i=1}^N \log p(Z_i,x_i)\\ &= \sum_{i=1}^N \Bigl(Z_i \log p(Z_i=1,x_i) + (1-Z_i) \log p(Z_i=0,x_i)\Bigr) \\ &= \sum_{i=1}^N \Bigl(Z_i \log p(x_i\mid \mu_1,\sigma_1^2) + (1-Z_i) \log p(x_i\mid \mu_0,\sigma_0^2)\Bigr) + \sum_{i=1}^N \Bigl(Z_i\log\pi + (1-Z_i) \log(1-\pi)\Bigr) \end{aligned} $$
$\mathcal{L}$ is maximized for
$$ \begin{aligned} \hat\pi &= \frac{N_1}{N}\\ \hat\mu_k &= \frac{1}{N_k} \sum_{Z_i=k} x_i\\ \hat\sigma_k^2 &= \frac{1}{N_k} \sum_{Z_i=k} (x_i-\hat\mu_k)^2 \end{aligned} $$ where $N_k$ is the number of samples with $Z_i=k$.
Since the cluster labels are not observed, we don’t know the values of the $Z_i$. The “trick” is to replace them with their expected values $\mathbb{E}(Z_i\mid x_i) = P(Z_i=1\mid x_i)$ in the EM algorithm, computed using the current estimates for $\hat\pi$, $\hat\mu_k$, $\hat\sigma_k^2$.
This leads to updated estimates
$$ \begin{aligned} \hat\pi^{\text{(new)}} &= \frac{1}{N}\sum_{i=1}^N P(Z_i=1\mid x_i)\\ \hat\mu_k^{\text{(new)}} &= \frac{\sum_{i=1}^N P(Z_i=k\mid x_i)\; x_i}{\sum_{i=1}^N P(Z_i=k\mid x_i)} \\ (\hat\sigma_k^2)^{\text{(new)}} &= \frac{\sum_{i=1}^N P(Z_i=k\mid x_i)\; (x_i-\hat\mu_k^{\text{(new)}})^2}{\sum_{i=1}^N P(Z_i=k\mid x_i)} \end{aligned} $$
Hence, instead of a “hard” assignment of each sample $x_i$ to one cluster $k$ when the $Z_i$ are observed, each sample now contributes with a “soft assignment” weight $P(Z_i=k\mid x_i)$ to the parameters of each cluster.
After convergence, the final $P(Z_i=k\mid x_i)$ can be used to assign data points to clusters, for instance to the cluster $k$ with highest responsibility for $x_i$.
Generalizations
We have so far considered the case where the data are one-dimensional (real numbers) and the number of clusters is pre-fixed. Important generalizations are:
The data can be of any dimension $D$. In higher dimensions the components of the mixture model are multivariate normal distributions. The mean parameters $\mu_k$ simply become $D$-dimensional vectors. The variance parameters $\sigma_k^2$ however become $D\times D$ covariance matrices. For simplicity and to reduce the number of parameters, it is often assumed that the covariance matrices are diagonal, such that the number of variance parameters again scales linearly in $D$. However, when features are correlated, this assumption will be a poor representation of the underlying data.
Instead of treating the cluster weights, means, and variances as fixed parameters that need to be estimated, they can be seen as random variables themselves with a distribution (uncertainty) that can be learned from the data. For more information, see this tutorial.
In an infinite mixture model, the number of clusters need not be fixed in advance, but can be learned from the data. For more information, see this tutorial.
Assignment
EM implementation
- Implement an algorithm to generate random samples from a one-dimensional Gaussian mixture distribution with two components. Your algorithm should have as parameters $N$, the number of samples to generate, and $(\pi,\mu_k,\sigma_k)$, $k=0,1$.
- Implement the EM algorithm for maximum-likelihood estimation of the parameters of the previous model.
- Test your EM algorithm by applying it on the simulated data from step 1 and evaluating how close your parameter estimates are to their true values. Draw a histogram of the sampled data colored by cluster label and overlay intermediate and final responsibility values for each data point (cf. [ESL] Fig. 8.5)
4 - Combinatorial clustering tutorial
Tutorials are available as Pluto notebooks. To run the reactive notebooks, you need to first follow the instructions in the “Code” section of the repository’s readme file, then follow the instruction at the bottom of the Pluto homepage.
Data preprocessing: static html file or reactive notebook.
Cluster analysis: static html file or reactive notebook.