This is the multi-page printable view of this section. Click here to print.
Dimensionality reduction
1 - Single-cell genomics
Reference
A Wagner, A Regev & N Yosef. Revealing the vectors of cellular identity with Single-cell genomics. Nature Biotechnology 34:1145 (2016).
See also:
Kobak & Berens. The art of using t-SNE for single-cell transcriptomics. Nat. Comm. 10:5416 (2019)
R Jiang, et al. Statistics or biology: the zero-inflation controversy about scRNA-seq data. Genome Biology 23: 31 (2022)
Understanding cells
To understand a cell, we must determine the factors that shape its identity:
- its position in a taxonomy of cell types,
- the progress of multiple time-dependent processes happening simultaneously,
- the cell’s response to signals from its local environment,
- the location and neighbourhood in which it resides.
Technological advances have enabled genome-wide profiling of RNA, DNA, protein, epigenetic modifications, chromatin accessibility, and other molecular events in single cells.
Cellular identity
Diverse factors combine to create a cell’s unique identity.
Figure obtained from full text on EuropePMC.
Sources of variation in single-cell genomics
Sources of variation in single-cell genomics
Biological and technical factors combine to determine the measured genomic profiles of single cells.
Figure obtained from full text on EuropePMC.
Technical variation:
- Due to technical factors in the data generation process.
- Main axes of variation in scRNA-seq often dominated by technical factors.
- False negatives (expressed but undetected) and false positives (overamplification) are due to minute quantities of starting RNA material and corresponding need for massive amplification.
Allele-intrinsic variation:
- Stochastic factors intrinsic to the molecular mechanisms that control gene expression.
- Does not correlate between two alleles of the same gene.
Allele-extrinsic variation
- Factors extrinsic to the process of transcription.
- Contribute to establishing differences between cell types or states, either stably or transiently.
- Correlated between two alleles of the same gene.
Most studies aim to understand allele-extrinsic variation and its function, while technical and allele-intrinsic variations are major confounders.
Technical confounders of single-cell RNA-seq
Technical confounders of single-cell RNA-seq and computational methods to handle them.
Figure obtained from full text on EuropePMC.
Addressing overamplification
False-positive detections due to amplification occur at early PCR amplification cycle.
Can be addressed using random molecular tags (RMTs), also called unique molecular identifiers (UMIs):
- Short barcode attached to 3’ or 5’ end of cDNA molecules before amplification.
- After amplification, the number of unique barcodes associated with reads aligned to a gene/transcript, rather than number of aligned reads serves as the gene/transcript abundance.
Beware:
- Sequencing errors will cause RMT sequences that are close to but not identical, and need to be collapsed.
- “Barcode collision” – two or more copies of a transcript may receive the same barcode, making them appear as a single copy.
Addressing false negatives
Limited efficiency of RNA capture and conversion to cDNA leads to dropout events: transcripts that are expressed in the cell but undetected in its mRNA profile.
Zero-inflated models:
Gene expression is modelled as a mixture of two distributions:
- one in which it is successfully amplified and detected at a level that correlates with its true abundance,
- one in which it is undetected because of technical effects.
(Hidden) component labels can be inferred using EM.
Only cells belonging to the “success” component are used for downstream analysis, e.g. differential expression analysis between subpopulations of cells.
Revealing the vectors of cellular identity
Single-cell genomics is a means to identify and understand the factors that jointly define a cell’s identity:
- Discrete cell types and subtypes
- Continuous phenotypic spectra
- Dynamic processes
- Spatial location
Distinguishing discrete cell types and subtypes
Single-cell analysis can detect cellular subtypes that cannot be defined by a handful of markers, or for which markers are not yet known.
Classification of cells in discrete subtypes is a problem in unsupervised clusterring.
Key challenges:
- Exponential increasing scale of single-cell data.
- Ensuring reproducible classification.
- Finding the proper granularity and detecting hierarchies of cell types and subtypes, especially when cell type frequency varies by multiple orders of magnitude.
- Distilling molecular markers and signatures to characterize cell types.
- Matching the resulting classes to known cell types.
- Visualizing, sharing, and comparing classifications.
Continuous phenotypic spectra
- Sometimes more appropriate to speak of a continuous phenotypic spectrum within a type than of a discrete cell type. e.g. a cell’s ability to trigger an immune response.
- Continuous facets can be characterized by combining dimensionality reduction (e.g. PCA) with enrichment for functional annotation.
- Prior knowledge can be used to interpret the biological relevance of the data’s main axes of variation.
Mapping dynamic processes
- Cells undergo dynamic transitions, incl. short-term responses to environmental signals, cell differentiation, oscillations, etc.
- Single-cell genomics provides a snapshot of the entire dynamic process: the set of single cells captured at any time point will Stochastically contain cells positioned in different instantaneous time points along the temporal trajectory.
- Temporal ordering can be recovered by creating a graph that connects cells by their profile similarity and finding an optimal path on this graph.
- This introduces the notion of “pseudo-time”.
Inference of spatial location
- Most high-throughput single-cell approaches require dissociation of tissues and therefore lose spatial information.
- Spatial information can be obtained for a limited number of “landmark” genes.
- Computational methods can combine limited spatial cues for landmark genes with single-cell RNA-seq from the same type of tissue.
- Limited number of landmarks expressed at any spatial position and dropouts in single-cell RNA-seq can severely compromise the mapping.
- Requires tissues with “canonical” structure (e.g. developing embryo, most normal tissues), which is faithfully reproduced in replicate samples.
Assignment
Assignment
Kobak & Berens. The art of using t-SNE for single-cell transcriptomics. Nat. Comm. 10:5416 (2019) https://doi.org/10.1038/s41467-019-13056-x
2 - Principal component analysis
Reference
Trevor Hastie, Robert Tibshirani, and Jerome Friedman.The Elements of Statistical Learning (second edition) (2009).
https://hastie.su.domains/ElemStatLearn/
https://link.springer.com/book/10.1007%2F978-0-387-84858-7
Section 14.5.1
Tipping, Michael E., and Christopher M. Bishop. “Probabilistic principal component analysis.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61.3 (1999): 611-622.
Dimensionality reduction
Assume we have a set of $N$ observations (e.g. cells) of $p$ variables (e.g. genes).
Dimensionality reduction consists of representing the observations in a lower-dimensional space with dimension $q\ll p$ while preserving some characteristics (e.g. relative distance) of the original $p$-dimensional representation.
For visualizing relationships among observations, we typically use $q=2$.
Two-dimensional visualization of single-cell data.
Two-dimensional visualization of single-cell data.
Figure obtained from full text on EuropePMC.
Principal component analysis (PCA)
Assume we have a set of $N$ observations $x_1,\dots,x_N\in \mathbb{R}^p$ of $p$ variables, collected in the $N\times p$ matrix $\mathbf{X}$.
The principal components of $\mathbf{X}$ provide a sequence of best linear approximations to $\mathbf{X}$, representing $\mathbf{X}$ by a line ($q=1$), plane ($q=2$), etc. in $p$-dimensional space.
Rank-1 and rank-2 approximations of a set of data.
The best rank-1 linear approximation of a set of data.
The best rank-2 linear approximation of a set of data.
Figures from the Elements of Statistical Learning.
A rank-$q$ linear model in $p$-dimensional space takes the form
$$ f(y) = \mu + y_1 v_1 + \dots + y_q v_q = \mu + \mathbf{V}_q z, $$
where $\mu\in\mathbb{R}^p$ is a location vector, $z\in\mathbb{R}^q$ is a (low-dimensional) vector, and $\mathbf{V}_q=(v_1,\dots,v_q)$ is a $p\times q$ matrix of $q$ orthogonal unit vectors,
$$ \begin{aligned} v_k v_{k’}^T = \sum_{j=1}^p v_{kj}v_{k’j}= \delta_{kk’} \end{aligned} $$
Fitting this model to the data using least squares amounts to minimizing the reconstruction error:
$$ \min_{\mu,\{y_{i}\},\mathbf{V}_q} \sum_{i=1}^N \Bigl\| x_i - \mu - \mathbf{V}_q z_i \Bigr\|^2 $$
Partial optimization for $\mu$ and the $y_i$ gives
$$ \begin{aligned} \hat{\mu} &= \bar x = \frac1{N}\sum_{i=1}^N x_i\\ \hat{z}_i &= \mathbf{V}_q^T(x_i-\bar{x}) \end{aligned} $$
Plugging this into () leaves us to find the orthogonal matrix $\mathbf{V}_q$:
$$ \begin{aligned} \min_{\mathbf{V}_q} \sum_{i=1}^{N} \Bigl\| (x_i - \bar{x}) - \mathbf{V}_q\mathbf{V}_q^T(x_i-\bar{x}) \Bigr\|^2 \end{aligned} $$
It can be shown that the optimal solution is found when $\mathbf{V}_q=(v_1,\dots,v_q)$ contains the $q$ eigenvectors with largest eigenvalues of the $p\times p$ matrix $\mathbf{X}^T\mathbf{X}$ (if the data is centred such that $\bar x=0$).
The solution can be expressed using the singular value decomposition of the obsered data matrix $\mathbf{X}$:
$$ \mathbf{X} = \mathbf{U} \Lambda \mathbf{V}^T $$
where $\mathbf{U}$ is an $N\times p$ orthogonal matrix ($\mathbf{U}^T\mathbf{U}=\mathbf{I}_p$) whose columns are called the left singular vectors, $\mathbf{V}$ is a $p\times p$ orthogonal matrix ($\mathbf{V}^T\mathbf{V}=\mathbf{I}_p$) whose columns are called the right singular vectors, and $\Lambda$ is a $p\times p$ diagonal matrix with diagonal elements $\lambda_1\geq \lambda_2 \geq \dots \geq \lambda_p\geq 0$ called the singular values.
The solution $\mathbf{V}_q$ above consists of the first $q$ columns of $\mathbf{V}$.
The columns of $\mathbf{U}\Lambda$ are called the principal components of $\mathbf{X}$. The first $q$ principal components are $\mathbf{U}_q\Lambda_q$. Each principal component is a vector in $\R^N$, that is, it takes a value for each observation. The solution $\hat{z}_i$ above consists of the rows of $\mathbf{U}_q\Lambda_q$, that is, for each observation $i$, $\hat{z}_i$ consists of the values of the first $q$ principal components for observation $i$.
Probabilistic PCA
Probabilistic PCA is a continuous generalization of the Gaussian mixture model where we again assume that an observation $X=(x_1,\dots,x_p)$ of $p$ variables is generated by a latent variable $Z$:
In the Gaussian mixture model, $Z$ is assumed to take discrete values leading to clustered observations. Here we assume that $Z=(z_1,\dots,z_q)$ is a $q$-dimensional continous variable with distribution
$$ Z\sim \mathcal{N}(0,\mathbf{I}) $$
that is, we assume each $z_i$ is drawn from an independent normal distribution with mean zero and standard deviation 1. The conditional distribution of $X$ given $Z$ is just like in the discrete mixture distribution assumed to be a multivariate normal with mean dependent on the value of $Z$; the covariance matrix is assumed to be a multiple of the identity matrix, meaning that conditional on the latent factors $Z$, the $p$ observed features $X$ are independent and have equal error distribution:
$$ p(x \mid z) = \mathcal{N}(\mathbf{\mu} + \mathbf{V}z, \sigma^2 \mathbf{I}) $$
where $\mu\in\R^p$ is a mean offset vector, $\mathbf{V}\in\R^{p\times q}$ is a linear maps, and $\sigma^2>0$ a common variance parameter for all $p$ dimensions. Note the similarity between this model and the model above.
Following the Gaussian mixture model example, we want to maximize the likelihood of the observed data, which follows the marginal distribution
$$ p(x) = \int dz\; p(x\mid z) p(z) $$
which can be interpreted as a “continuous” Gaussian mixture distribution. Unlike in the finite mixture model, this integral can in fact be solved using properties of the multivariate normal distribution:
$$ p(x) = \mathcal{N}(\mu,\mathbf{C}) $$
with covariance matrix $\mathbf{C}=\mathbf{V}\mathbf{V}^T+\sigma^2\mathbf{I}$. The corresponding log-likelihood for $N$ observations $x_1.\dots,x_N\in\R^p$ is
$$ \mathcal{L} = -\frac{N}{2}\Bigl\{p \ln(2\pi) + \ln(\det(\mathbf{C})) + \mathrm{tr}(\mathbf{C}^{-1}\mathbf{S}) \Bigr\} $$
where
$$ \mathbf{S}= \frac{1}{N}\sum_{i=1}^N (x_i-\mu)(x_i-\mu) $$
The maximum-likelihood for $\mu$ is the sample mean, $\hat{\mu}=\overline{x}=\frac{1}{N}\sum_{i=1}^N x_i$, such that $\mathbf{S}$ is the sample covariance matrix. The log-likelihood function measures how much $\mathbf{C}$ differs from the sample covariance and is maximized by making $\mathbf{C}$ as “similar” as possible to it. The solution turns out to be the standard PCA solution where the columns of $\mathbf{V}$ are the $q$ eigenvectors with largest eigenvalues of $\mathbf{S}$. This conclusion is valid only if we assume the independent error model, where features are independent given latent factors, that is, the latent factors explain all the covariance between the features. In summary:
Probabilistic PCA interpretation
If we assume that our high-dimensional observations are generated by a linear combination of low-dimensional latent (=hidden) factors, and the observed features are independent given the latent factors, then the maximum-likelihood solution for the map between latent and observed space is PCA.Note the important benefit of the probabilistic approach. In standard PCA, we simply fit a linear function to a set of observation, without giving any thought to the deviations (errors) between the (noisy) observations and the fitted model. In probabilistic PCA, as in all generative models, we must be explicit about our assumptions, and standard PCA is only recovered if we assume that features are independent given the latent factors.
3 - t-SNE and UMAP
Reference
[TSNE] L.J.P. van der Maaten and G.E. Hinton. Visualizing Data Using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605, 2008.
https://lvdmaaten.github.io/tsne/
[UMAP] McInnes, L, Healy, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv e-prints 1802.03426, 2018
t-Stochastic Neighbor Embedding (t-SNE)
Linear dimensionality reduction methods (like PCA) may not able to retain both the local and global structure of the data in a single map.
t-SNE is a non-linear method that aims to overcome this limitation.
t-SNE is particularly successful at visualizing high-dimensional data (reducing it to two dimensions).
Care must be taken when applying t-SNE due to random initialization of the algorithm, which may lead to solutions that do not represent the global structure of the data well.
(Symmetric) Stochastic Neighbor Embedding (SNE)
As before, assume we have a set of $N$ observations $x_1,\dots,x_N\in \mathbb{R}^p$ of $p$ variables, collected in the $N\times p$ matrix $\mathbf{X}$.
SNE introduces a directional similarity of $x_j$ to $x_i$,
$$ \begin{aligned} p_{j\mid i} = \frac{\exp\bigl(-\frac{1}{2\sigma_i^2}|x_i-x_j|^2\bigr)}{\sum_{k\neq i}\exp\bigl(-\frac{1}{2\sigma_i^2}|x_i-x_k|^2\bigr)} \end{aligned} $$
The variances $\sigma_i$ are chosen such that the perplexities
$$ \begin{aligned} \mathcal{P_i} = \exp\Bigl(-\sum_{j\neq i} p_{j\mid i}\log p_{j\mid i}\Bigr) \end{aligned} $$
take some prespecified value.
Symmetric, undirectional similarities are defined as
$$ \begin{aligned} p_{ij} = \frac{p_{j\mid i}+p_{i\mid j}}{2n} \end{aligned} $$
such that $\sum_{ij}p_{ij}=1$
A map is a two or three-dimensional representation ${\cal Y}={y_1,\dots,y_N}$ of the high-dimensional data ${\cal X}={x_1,\dots,x_N}$.
In the low-dimensional representation, we define the probability of picking $y_i$ and $y_j$ as neighbors by
$$ \begin{aligned} q_{ij} = \frac{\exp\bigl(-|y_i-y_j|^2\bigr)}{\sum_{k\neq l}\exp\bigl(-|y_k-y_l|^2\bigr)} \end{aligned} $$
(Symmetric) Stochastic Neighbor Embedding (SNE) finds the map ${\cal Y}$ that minimizes the mismatch between $p_{ij}$ and $q_{ij}$, across all pairs $i$ and $j$.
The mismatch cost function is given by the Kullback-Leibler divergence:
$$ \begin{aligned} C = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}\geq 0 \end{aligned} $$
with $C=0$ if, and only if, $q_{ij}=p_{ij}$ for all $i$ and $j$.
SNE suffers from a “crowding problem”
The volume of a sphere with radius $r$ in $p$ dimensions scales as $r^p$.
Consider a cluster of datapoints that are approximately uniformly distributed in a sphere of radius $r$.
If $p\gg 2$, then the available 2-dimensional area to model the distances in this cluster accurately will be too small compared to the total area available, forcing clusters of nearby points to crush together.
Heavy-tails in the low-dimensional space resolve the crowding problem
A heavy-tail distribution in the low-dimensional space allows a moderate distance in the high-dimensional space to be faithfully modelled by a much larger distance in the map.
In t-SNE a Student t-distribution with 1 d.o.f. (a Cauchy distribution) is used as the heavy-tailed distribution in the low-dimensional map:
$$ \begin{aligned} q_{ij} = \frac{\left(1+|y_i-y_j|^2\right)^{-1}}{\sum_{k\neq l}\left(1+|y_k-y_l|^2\right)^{-1}} \end{aligned} $$
The Cauchy distribution offers additional advantages in the numerical optimization of the cost function.
Gaussian vs. Cauchy distribution
Gaussian ($\sigma=1/\sqrt{2}$) vs Cauchy p.d.f. for $r\geq 0$.
Summary
t-SNE:
- puts emphasis on modelling dissimilar datapoints by means of large pairwise distances, and similar datapoints by small pairwise distances,
- offers dramatic improvement in finding and preserving local structure in the data, compared to, for instance, PCA.
Optimization of the t-SNE cost function is easier than optimizing earlier SNE versions, but:
- the cost function is non-convex,
- several optimization parameters need to be chosen,
- the constructed solutions depend on these choices and may be different each time t-SNE is run from an initial random configuration.
Formal theory supporting t-SNE is lacking:
- similarity measures and cost function are based on heuristics, appealing to “intuitive” justifications,
- there is no formal generative model connecting high- and low-dimensional representations,
- the probability semantics used to describe t-SNE is descriptive, rather than foundational.
Uniform Manifold Approximation and Projection (UMAP)
In the words of the authors:
- UMAPs design decisions were all grounded in a solid theoretic foundation and not derived through experimentation with any particular task focused objective function.
- The theoretical foundations for UMAP are largely based in manifold theory and topological data analysis.
- A purely computational view [of UMAP] fails to shed any light upon the reasoning that underlies the algorithmic decisions made in UMAP.
A computational view of UMAP
From a practical computational perspective, UMAP can be described in terms of, construction of, and operations on, weighted graphs:
Graph construction:
- Construct a weighted k-neighbour graph
- Apply some transform on the edges to represent local distance.
- Deal with the inherent asymmetry of the k-neighbour graph.\
Graph layout:
- Define a cost function that preserves desired characteristics of this k-neighbour graph.
- Find a low dimensional representation which optimizes this cost function.
t-SNE and UMAP can both be cast in this form, but with some (subtle) differences in the edge weighting (dissimilarity measure) and graph layout (cost function).
A comparison of dimension reduction algorithms
A comparison of dimension reduction algorithms.
A comparison of dimension reduction algorithms.
Figure from the UMAP paper.
Practical recommendations
Claims of UMAP or t-SNE (or related methods) being superior are usually overrated.
Both algorithms suffer from a need for careful parameter tuning and initialization choices.
Using and understanding one algorithm well is more important flipping between algorithms and never changing default parameters.
An excellent reference for using t-SNE for single-cell data:
Kobak & Berens. The art of using t-SNE for single-cell transcriptomics. Nat. Comm. 10:5416 (2019)
See also: https://github.com/berenslab/rna-seq-tsne
Assignment
Assignment
In this assignment you will go through a typical single-cell RNA-seq analysis, trying to reproduce some published results from:
Kobak & Berens. The art of using t-SNE for single-cell transcriptomics. Nat. Comm. 10:5416 (2019)
Prerequisites
The Github repository accompanying the paper: https://github.com/berenslab/rna-seq-tsne.
Mouse V1 (primary visual cortex) and ALM (anterior lateral motor cortex) SMART-seq data: https://portal.brain-map.org/atlases-and-data/rnaseq/mouse-v1-and-alm-smart-seq. This is the data that Kobak & Berens refer to as the “Tasic et al. data”:
- Download the gene-level read count zip files.
- Read the readme files (of course!)
- We will work with the exon matrix files only.
Cell cluster annotation file: https://raw.githubusercontent.com/berenslab/rna-seq-tsne/master/data/tasic-sample_heatmap_plot_data.csv.
Tasks
Write a script to read the SMART-seq data files and merge the data in a single data structure. Your structure will need:
- A gene x cell (or cell x gene) read count matrix containing all cells (V1 + ALM). You should have data for 45,768 genes in 25,481 cells.
- The unique identifiers of all cells.
Read the cell cluster annotation file and remove all cells that don’t have an annotation from your data. You should now have 23,822 cells remaining.
Remove all cells and all genes that have all near-zero counts. What threshold (number of reads) did Kobak & Berens use to define “near-zero” expression?
Implement the “Feature selection” steps described in the paper Methods:
- Compute the fraction of near-zero counts for each gene (using the same threshold as in the previous step) (eq 8 in the paper).
- Compute the mean log2 expression over the non-zero values for each gene (eq. 9 in the paper).
- Reproduce Supplementary Figure 4 (without the gene names).
- Select the relevant genes using eq. 10. You can use the parameter values from Supplementary Figure 4 and don’t have to implement the parameter fitting procedure. How many genes do you retain?
Implement the “Sequencing depth normalization” described in the paper Methods:
- Note that sequencing depth normalization is applied by computing the library sizes (total number of reads) for each cell using all genes.
- Keep the normalized data only for the genes from the “feature selection” step.
Perform PCA and t-SNE dimensionality reduction:
- You may work with the full data if you have sufficient compute resources, otherwise create a small toy dataset where you keep every 25th cell (953 cells in total).
- Standardize (mean zero, standard deviation one) the data. Recall the lecture on unsupervised clustering: if we want all features (genes) to have equal influence on the clustering/dimensionality reduction, should we standardize over the gene or cell dimension?
- Do PCA and retain the first two principal components.
- Apply t-SNE using your choice of software implementation. Generate multiple solutions (easy if you use the toy dataset), similar to Kobak & Berens, Figure 2 (c-f) (you don’t need to use identical settings, but explore some of the input settings for t-SNE).
Generate scatter plot figures:
- Make figures for PCA and the various t-SNE results you generated.
- Use the colors from the cell cluster annotation file to confirm that your figures look similar to Kobak & Berens, Figure 2. The figures will of course not be identical, especially if you use a toy dataset, but should display similar patterns.
4 - Tutorial
Tutorials are available as Pluto notebooks. To run the reactive notebooks, you need to first follow the instructions in the “Code” section of the repository’s readme file, then follow the instruction at the bottom of the Pluto homepage.
Data preprocessing: static html file or reactive notebook.
Cluster analysis: static html file or reactive notebook.