Dimensionality reduction

Single-cell genomics. Probabilistic PCA. t-SNE and UMAP.

1 - Single-cell genomics

Single-cell genomics.

Understanding cells

To understand a cell, we must determine the factors that shape its identity:

  • its position in a taxonomy of cell types,
  • the progress of multiple time-dependent processes happening simultaneously,
  • the cell’s response to signals from its local environment,
  • the location and neighbourhood in which it resides.

Technological advances have enabled genome-wide profiling of RNA, DNA, protein, epigenetic modifications, chromatin accessibility, and other molecular events in single cells.

Sources of variation in single-cell genomics

Technical variation:

  • Due to technical factors in the data generation process.
  • Main axes of variation in scRNA-seq often dominated by technical factors.
  • False negatives (expressed but undetected) and false positives (overamplification) are due to minute quantities of starting RNA material and corresponding need for massive amplification.

Allele-intrinsic variation:

  • Stochastic factors intrinsic to the molecular mechanisms that control gene expression.
  • Does not correlate between two alleles of the same gene.

Allele-extrinsic variation

  • Factors extrinsic to the process of transcription.
  • Contribute to establishing differences between cell types or states, either stably or transiently.
  • Correlated between two alleles of the same gene.

Most studies aim to understand allele-extrinsic variation and its function, while technical and allele-intrinsic variations are major confounders.

Addressing overamplification

False-positive detections due to amplification arise in the early PCR amplification cycles.

Can be addressed using random molecular tags (RMTs), also called unique molecular identifiers (UMIs):

  • Short barcode attached to 3’ or 5’ end of cDNA molecules before amplification.
  • After amplification, the number of unique barcodes associated with reads aligned to a gene/transcript, rather than the number of aligned reads, serves as the estimate of gene/transcript abundance.

Beware:

  • Sequencing errors produce RMT sequences that are close to, but not identical to, one another; these need to be collapsed (see the counting sketch below).
  • “Barcode collision” – two or more copies of a transcript may receive the same barcode, making them appear as a single copy.
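
To make the counting concrete, here is a minimal sketch in Julia (the language of the Pluto notebooks used in the tutorials). It counts the UMIs/RMTs observed for one gene, greedily collapsing barcodes within Hamming distance 1 of an already-kept barcode. The function names and the greedy strategy are illustrative only; real tools (e.g. UMI-tools) use more careful network-based collapsing.

```julia
# Hamming distance between two equal-length barcodes
hamming(a::String, b::String) = count(t -> t[1] != t[2], zip(a, b))

# Count distinct UMIs among the reads aligned to one gene, collapsing
# barcodes that differ by at most one mismatch (greedy, illustrative only).
function count_umis(umis::Vector{String})
    kept = String[]
    for u in sort(umis)                       # deterministic order for the sketch
        if all(k -> hamming(u, k) > 1, kept)  # not within 1 mismatch of a kept UMI
            push!(kept, u)
        end
    end
    return length(kept)                       # estimated number of transcript copies
end

# Example: three reads, two of which share a UMI up to one sequencing error
count_umis(["ACGT", "ACGA", "TTGC"])          # → 2
```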

Addressing false negatives

Limited efficiency of RNA capture and conversion to cDNA leads to dropout events: transcripts that are expressed in the cell but undetected in its mRNA profile.

Zero-inflated models:

Gene expression is modelled as a mixture of two distributions:

  • one in which it is successfully amplified and detected at a level that correlates with its true abundance,
  • one in which it is undetected because of technical effects.

(Hidden) component labels can be inferred using EM.

Only cells belonging to the “success” component are used for downstream analysis, e.g. differential expression analysis between subpopulations of cells.
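
As a concrete (and deliberately simplified) illustration of the EM idea, the Julia sketch below fits a zero-inflated Poisson model to the counts of one gene across cells: with probability p0 a measurement is a technical zero (dropout), otherwise it is Poisson with mean lam. The model choice and the function name are assumptions for illustration; published methods use more flexible distributions.

```julia
using Statistics

# EM for a zero-inflated Poisson: x[i] is the observed count of one gene in
# cell i. Returns the dropout probability, the Poisson mean of the "success"
# component, and the per-cell dropout responsibilities.
function zip_em(x::AbstractVector{<:Integer}; iters::Int=100)
    p0, lam = 0.5, max(mean(x), 1e-6)     # initial guesses
    γ = zeros(length(x))                  # posterior dropout probabilities
    for _ in 1:iters
        # E-step: only exact zeros can be technical dropouts
        for (i, xi) in enumerate(x)
            γ[i] = xi == 0 ? p0 / (p0 + (1 - p0) * exp(-lam)) : 0.0
        end
        # M-step: re-estimate the mixture weight and the Poisson mean
        p0 = mean(γ)
        lam = sum((1 .- γ) .* x) / sum(1 .- γ)
    end
    return p0, lam, γ
end
```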

Revealing the vectors of cellular identity

Single-cell genomics is a means to identify and understand the factors that jointly define a cell’s identity:

  • Discrete cell types and subtypes
  • Continuous phenotypic spectra
  • Dynamic processes
  • Spatial location

Distinguishing discrete cell types and subtypes

Single-cell analysis can detect cellular subtypes that cannot be defined by a handful of markers, or for which markers are not yet known.

Classification of cells into discrete subtypes is a problem in unsupervised clustering (a minimal sketch follows after the list of key challenges below).

Key challenges:

  • Exponentially increasing scale of single-cell data.
  • Ensuring reproducible classification.
  • Finding the proper granularity and detecting hierarchies of cell types and subtypes, especially when cell type frequency varies by multiple orders of magnitude.
  • Distilling molecular markers and signatures to characterize cell types.
  • Matching the resulting classes to known cell types.
  • Visualizing, sharing, and comparing classifications.
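
As a minimal illustration of the unsupervised-clustering step mentioned above, here is a plain k-means sketch in Julia for grouping cells from an $N\times p$ expression matrix. It is only a didactic stand-in: real single-cell pipelines typically apply graph-based clustering (e.g. Louvain/Leiden) on a reduced representation, and the function name and parameters here are illustrative.

```julia
using Statistics, Random

# k-means on an N×p matrix of cells × genes; returns a cluster label per cell.
function kmeans_cells(X::AbstractMatrix, K::Int; iters::Int=100)
    N = size(X, 1)
    centroids = float.(X[randperm(N)[1:K], :])   # random initial centroids
    assign = zeros(Int, N)
    for _ in 1:iters
        for i in 1:N                              # assignment step
            assign[i] = argmin([sum(abs2, X[i, :] - centroids[c, :]) for c in 1:K])
        end
        for c in 1:K                              # update step
            members = findall(==(c), assign)
            isempty(members) || (centroids[c, :] = vec(mean(X[members, :], dims=1)))
        end
    end
    return assign
end
```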

Continuous phenotypic spectra

  • Sometimes more appropriate to speak of a continuous phenotypic spectrum within a type than of a discrete cell type. e.g. a cell’s ability to trigger an immune response.
  • Continuous facets can be characterized by combining dimensionality reduction (e.g. PCA) with enrichment for functional annotation.
  • Prior knowledge can be used to interpret the biological relevance of the data’s main axes of variation.

Mapping dynamic processes

  • Cells undergo dynamic transitions, incl. short-term responses to environmental signals, cell differentiation, oscillations, etc.
  • Single-cell genomics provides a snapshot of the entire dynamic process: the set of single cells captured at any time point will stochastically contain cells positioned at different instantaneous time points along the temporal trajectory.
  • Temporal ordering can be recovered by creating a graph that connects cells by their profile similarity and finding an optimal path on this graph (see the sketch below).
  • This introduces the notion of “pseudo-time”.
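
The sketch below illustrates one (much simplified) way to obtain such an ordering in Julia: connect each cell to its $k$ nearest neighbours and rank all cells by their shortest-path (geodesic) distance from a user-chosen root cell. The neighbourhood size, the choice of root, and the function name are assumptions for illustration; published pseudotime methods are considerably more sophisticated.

```julia
using LinearAlgebra

# Order cells along a pseudo-time by geodesic distance from a root cell on a
# k-nearest-neighbour graph (O(N^2) Dijkstra, fine for small examples).
function pseudotime(X::AbstractMatrix, root::Int; k::Int=10)
    N = size(X, 1)
    D = [norm(X[i, :] - X[j, :]) for i in 1:N, j in 1:N]
    A = fill(Inf, N, N)                          # kNN graph edge lengths
    for i in 1:N
        for j in partialsortperm(D[i, :], 1:k+1) # k neighbours (plus i itself)
            A[i, j] = D[i, j]; A[j, i] = D[i, j]
        end
    end
    dist = fill(Inf, N); dist[root] = 0.0        # Dijkstra from the root cell
    visited = falses(N)
    for _ in 1:N
        u = argmin([visited[v] ? Inf : dist[v] for v in 1:N])
        visited[u] = true
        for v in 1:N
            if isfinite(A[u, v]) && dist[u] + A[u, v] < dist[v]
                dist[v] = dist[u] + A[u, v]
            end
        end
    end
    return sortperm(dist)                        # cells ordered along pseudo-time
end
```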

Inference of spatial location

  • Most high-throughput single-cell approaches require dissociation of tissues and therefore lose spatial information.
  • Spatial information can be obtained for a limited number of “landmark” genes.
  • Computational methods can combine limited spatial cues for landmark genes with single-cell RNA-seq from the same type of tissue.
  • Limited number of landmarks expressed at any spatial position and dropouts in single-cell RNA-seq can severely compromise the mapping.
  • Requires tissues with “canonical” structure (e.g. developing embryo, most normal tissues), which is faithfully reproduced in replicate samples.

Assignment

2 - Principal component analysis

PCA. Probabilistic PCA.

Dimensionality reduction

Assume we have a set of $N$ observations (e.g. cells) of $p$ variables (e.g. genes).

Dimensionality reduction consists of representing the observations in a lower-dimensional space with dimension $q\ll p$ while preserving some characteristics (e.g. relative distance) of the original $p$-dimensional representation.

For visualizing relationships among observations, we typically use $q=2$.

Principal component analysis (PCA)

Assume we have a set of $N$ observations $x_1,\dots,x_N\in \mathbb{R}^p$ of $p$ variables, collected in the $N\times p$ matrix $\mathbf{X}$.

The principal components of $\mathbf{X}$ provide a sequence of best linear approximations to $\mathbf{X}$, representing $\mathbf{X}$ by a line ($q=1$), plane ($q=2$), etc. in $p$-dimensional space.

A rank-$q$ linear model in $p$-dimensional space takes the form

$$ f(z) = \mu + z_1 v_1 + \dots + z_q v_q = \mu + \mathbf{V}_q z, $$

where $\mu\in\mathbb{R}^p$ is a location vector, $z\in\mathbb{R}^q$ is a (low-dimensional) vector, and $\mathbf{V}_q=(v_1,\dots,v_q)$ is a $p\times q$ matrix of $q$ orthogonal unit vectors,

$$ \begin{aligned} v_k^T v_{k'} = \sum_{j=1}^p v_{kj}v_{k'j}= \delta_{kk'} \end{aligned} $$

Fitting this model to the data using least squares amounts to minimizing the reconstruction error:

$$ \min_{\mu,\{z_{i}\},\mathbf{V}_q} \sum_{i=1}^N \Bigl\| x_i - \mu - \mathbf{V}_q z_i \Bigr\|^2 $$

Partial optimization for $\mu$ and the $z_i$ gives

$$ \begin{aligned} \hat{\mu} &= \bar x = \frac1{N}\sum_{i=1}^N x_i\\ \hat{z}_i &= \mathbf{V}_q^T(x_i-\bar{x}) \end{aligned} $$

Plugging this into the objective above leaves us to find the orthogonal matrix $\mathbf{V}_q$:

$$ \begin{aligned} \min_{\mathbf{V}_q} \sum_{i=1}^{N} \Bigl\| (x_i - \bar{x}) - \mathbf{V}_q\mathbf{V}_q^T(x_i-\bar{x}) \Bigr\|^2 \end{aligned} $$

It can be shown that the optimal solution is found when $\mathbf{V}_q=(v_1,\dots,v_q)$ contains the $q$ eigenvectors with largest eigenvalues of the $p\times p$ matrix $\mathbf{X}^T\mathbf{X}$ (if the data is centred such that $\bar x=0$).

The solution can be expressed using the singular value decomposition of the observed data matrix $\mathbf{X}$:

$$ \mathbf{X} = \mathbf{U} \Lambda \mathbf{V}^T $$

where $\mathbf{U}$ is an $N\times p$ orthogonal matrix ($\mathbf{U}^T\mathbf{U}=\mathbf{I}_p$) whose columns are called the left singular vectors, $\mathbf{V}$ is a $p\times p$ orthogonal matrix ($\mathbf{V}^T\mathbf{V}=\mathbf{I}_p$) whose columns are called the right singular vectors, and $\Lambda$ is a $p\times p$ diagonal matrix with diagonal elements $\lambda_1\geq \lambda_2 \geq \dots \geq \lambda_p\geq 0$ called the singular values.

The solution $\mathbf{V}_q$ above consists of the first $q$ columns of $\mathbf{V}$.

The columns of $\mathbf{U}\Lambda$ are called the principal components of $\mathbf{X}$. The first $q$ principal components are $\mathbf{U}_q\Lambda_q$. Each principal component is a vector in $\mathbb{R}^N$, that is, it takes a value for each observation. The solution $\hat{z}_i$ above consists of the rows of $\mathbf{U}_q\Lambda_q$, that is, for each observation $i$, $\hat{z}_i$ consists of the values of the first $q$ principal components for observation $i$.
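
The computation translates directly into a few lines of Julia. The sketch below centres the data, takes the SVD, and returns the first $q$ principal components $\mathbf{U}_q\Lambda_q$ together with the loadings $\mathbf{V}_q$; the function name is illustrative.

```julia
using LinearAlgebra, Statistics

# PCA via the SVD of the centred N×p data matrix (cells × genes).
function pca(X::AbstractMatrix, q::Int)
    Xc = X .- mean(X, dims=1)           # centre each variable
    F = svd(Xc)                          # Xc = U * Diagonal(S) * V'
    scores = F.U[:, 1:q] .* F.S[1:q]'    # N×q principal components (U_q Λ_q)
    loadings = F.V[:, 1:q]               # p×q matrix V_q
    return scores, loadings
end

# Example usage on random data
Z, Vq = pca(randn(100, 50), 2)
```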

Probabilistic PCA

Probabilistic PCA is a continuous generalization of the Gaussian mixture model where we again assume that an observation $X=(x_1,\dots,x_p)$ of $p$ variables is generated by a latent variable $Z$:

```mermaid
flowchart LR
    Z --> X
```

In the Gaussian mixture model, $Z$ is assumed to take discrete values leading to clustered observations. Here we assume that $Z=(z_1,\dots,z_q)$ is a $q$-dimensional continuous variable with distribution

$$ Z\sim \mathcal{N}(0,\mathbf{I}) $$

that is, we assume each $z_i$ is drawn independently from a normal distribution with mean zero and standard deviation one. The conditional distribution of $X$ given $Z$ is, just as in the discrete mixture model, assumed to be multivariate normal with mean depending on the value of $Z$; the covariance matrix is assumed to be a multiple of the identity matrix, meaning that, conditional on the latent factors $Z$, the $p$ observed features of $X$ are independent and have equal error distributions:

$$ p(x \mid z) = \mathcal{N}(\mathbf{\mu} + \mathbf{V}z, \sigma^2 \mathbf{I}) $$

where $\mu\in\mathbb{R}^p$ is a mean offset vector, $\mathbf{V}\in\mathbb{R}^{p\times q}$ is a linear map, and $\sigma^2>0$ a common variance parameter for all $p$ dimensions. Note the similarity between this model and the linear model above.

Following the Gaussian mixture model example, we want to maximize the likelihood of the observed data, which follows the marginal distribution

$$ p(x) = \int dz\; p(x\mid z) p(z) $$

which can be interpreted as a “continuous” Gaussian mixture distribution. Unlike in the finite mixture model, this integral can in fact be solved using properties of the multivariate normal distribution:

$$ p(x) = \mathcal{N}(\mu,\mathbf{C}) $$

with covariance matrix $\mathbf{C}=\mathbf{V}\mathbf{V}^T+\sigma^2\mathbf{I}$. The corresponding log-likelihood for $N$ observations $x_1,\dots,x_N\in\mathbb{R}^p$ is

$$ \mathcal{L} = -\frac{N}{2}\Bigl\{p \ln(2\pi) + \ln(\det(\mathbf{C})) + \mathrm{tr}(\mathbf{C}^{-1}\mathbf{S}) \Bigr\} $$

where

$$ \mathbf{S}= \frac{1}{N}\sum_{i=1}^N (x_i-\mu)(x_i-\mu)^T $$

The maximum-likelihood estimate of $\mu$ is the sample mean, $\hat{\mu}=\overline{x}=\frac{1}{N}\sum_{i=1}^N x_i$, such that $\mathbf{S}$ is the sample covariance matrix. The log-likelihood function measures how much $\mathbf{C}$ differs from the sample covariance and is maximized by making $\mathbf{C}$ as “similar” as possible to it. The solution turns out to be the standard PCA solution, where the columns of $\mathbf{V}$ are the $q$ eigenvectors with largest eigenvalues of $\mathbf{S}$. This conclusion is valid only if we assume the independent error model, where features are independent given the latent factors, that is, the latent factors explain all the covariance between the features.

Note the important benefit of the probabilistic approach. In standard PCA, we simply fit a linear function to a set of observations, without giving any thought to the deviations (errors) between the (noisy) observations and the fitted model. In probabilistic PCA, as in all generative models, we must be explicit about our assumptions, and standard PCA is only recovered if we assume that features are independent given the latent factors.
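
A minimal Julia sketch of the closed-form maximum-likelihood fit under the assumptions above (independent errors with common variance, $q < p$): eigendecompose the sample covariance, estimate $\sigma^2$ from the discarded eigenvalues, and, following Tipping & Bishop's closed-form solution, scale each leading eigenvector by $\sqrt{\lambda_k-\sigma^2}$ to obtain $\mathbf{V}$. The arbitrary rotational freedom of the solution is ignored (the rotation is taken to be the identity), and the function name is illustrative.

```julia
using LinearAlgebra, Statistics

# Closed-form maximum-likelihood probabilistic PCA from an N×p data matrix.
function ppca(X::AbstractMatrix, q::Int)
    N, p = size(X)
    μ = vec(mean(X, dims=1))
    Xc = X .- μ'
    S = (Xc' * Xc) / N                       # sample covariance (p×p)
    E = eigen(Symmetric(S))                  # eigenvalues in ascending order
    λ = reverse(E.values)                    # sort descending
    W = reverse(E.vectors, dims=2)           # matching eigenvectors
    σ2 = mean(λ[q+1:end])                    # ML estimate of the residual variance
    V = W[:, 1:q] * Diagonal(sqrt.(max.(λ[1:q] .- σ2, 0)))
    return μ, V, σ2
end
```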

3 - t-SNE and UMAP

t-SNE, UMAP.

t-distributed Stochastic Neighbor Embedding (t-SNE)

Linear dimensionality reduction methods (like PCA) may not be able to retain both the local and global structure of the data in a single map.

t-SNE is a non-linear method that aims to overcome this limitation.

t-SNE is particularly successful at visualizing high-dimensional data (reducing it to two dimensions).

Care must be taken when applying t-SNE due to random initialization of the algorithm, which may lead to solutions that do not represent the global structure of the data well.

(Symmetric) Stochastic Neighbor Embedding (SNE)

As before, assume we have a set of $N$ observations $x_1,\dots,x_N\in \mathbb{R}^p$ of $p$ variables, collected in the $N\times p$ matrix $\mathbf{X}$.

SNE introduces a directional similarity of $x_j$ to $x_i$,

$$ \begin{aligned} p_{j\mid i} = \frac{\exp\bigl(-\frac{1}{2\sigma_i^2}|x_i-x_j|^2\bigr)}{\sum_{k\neq i}\exp\bigl(-\frac{1}{2\sigma_i^2}|x_i-x_k|^2\bigr)} \end{aligned} $$

The variances $\sigma_i$ are chosen such that the perplexities

$$ \begin{aligned} \mathcal{P_i} = \exp\Bigl(-\sum_{j\neq i} p_{j\mid i}\log p_{j\mid i}\Bigr) \end{aligned} $$

take some prespecified value.

Symmetric, undirected similarities are defined as

$$ \begin{aligned} p_{ij} = \frac{p_{j\mid i}+p_{i\mid j}}{2N} \end{aligned} $$

such that $\sum_{ij}p_{ij}=1$.
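
In Julia, these high-dimensional similarities can be computed as in the sketch below: for each observation, $\sigma_i$ (parameterized here by the precision $\beta_i=1/2\sigma_i^2$) is found by bisection so that the perplexity of $p_{\cdot\mid i}$ matches the target, and the conditional similarities are then symmetrized. The function name, iteration count, and default perplexity are illustrative choices.

```julia
# Symmetric SNE/t-SNE input similarities p_ij from an N×p data matrix.
function sne_similarities(X::AbstractMatrix; perplexity::Real=30.0)
    N = size(X, 1)
    D2 = [sum(abs2, X[i, :] - X[j, :]) for i in 1:N, j in 1:N]
    P = zeros(N, N)
    for i in 1:N
        βlo, βhi, β = 0.0, Inf, 1.0          # β = 1 / (2σ_i²)
        p = zeros(N)
        for _ in 1:50                        # bisection on β
            p .= exp.(-β .* D2[i, :]); p[i] = 0.0
            p ./= sum(p)
            H = -sum(pj -> pj > 0 ? pj * log(pj) : 0.0, p)
            if exp(H) > perplexity           # too spread out → increase β
                βlo = β
                β = isinf(βhi) ? 2β : (βlo + βhi) / 2
            else                             # too concentrated → decrease β
                βhi = β
                β = (βlo + βhi) / 2
            end
        end
        P[i, :] = p
    end
    return (P + P') / (2N)                   # symmetric p_ij, summing to one
end
```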

A map is a two- or three-dimensional representation ${\cal Y}=\{y_1,\dots,y_N\}$ of the high-dimensional data ${\cal X}=\{x_1,\dots,x_N\}$.

In the low-dimensional representation, we define the probability of picking $y_i$ and $y_j$ as neighbors by

$$ \begin{aligned} q_{ij} = \frac{\exp\bigl(-|y_i-y_j|^2\bigr)}{\sum_{k\neq l}\exp\bigl(-|y_k-y_l|^2\bigr)} \end{aligned} $$

(Symmetric) Stochastic Neighbor Embedding (SNE) finds the map ${\cal Y}$ that minimizes the mismatch between $p_{ij}$ and $q_{ij}$, across all pairs $i$ and $j$.

The mismatch cost function is given by the Kullback-Leibler divergence:

$$ \begin{aligned} C = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}\geq 0 \end{aligned} $$

with $C=0$ if, and only if, $q_{ij}=p_{ij}$ for all $i$ and $j$.

SNE suffers from a “crowding problem”

The volume of a sphere with radius $r$ in $p$ dimensions scales as $r^p$.

Consider a cluster of datapoints that are approximately uniformly distributed in a sphere of radius $r$.

If $p\gg 2$, the area available in a two-dimensional map to model the mutual distances within this cluster accurately is far too small relative to what would be needed, forcing points that are only moderately far apart in the high-dimensional space to crush together in the map.

Heavy-tails in the low-dimensional space resolve the crowding problem

A heavy-tail distribution in the low-dimensional space allows a moderate distance in the high-dimensional space to be faithfully modelled by a much larger distance in the map.

In t-SNE a Student t-distribution with 1 d.o.f. (a Cauchy distribution) is used as the heavy-tailed distribution in the low-dimensional map:

$$ \begin{aligned} q_{ij} = \frac{\left(1+|y_i-y_j|^2\right)^{-1}}{\sum_{k\neq l}\left(1+|y_k-y_l|^2\right)^{-1}} \end{aligned} $$

The Cauchy distribution offers additional advantages in the numerical optimization of the cost function.
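
Combining the heavy-tailed $q_{ij}$ with the Kullback-Leibler cost, a bare-bones gradient-descent layout looks as follows in Julia. It uses the standard gradient $\partial C/\partial y_i = 4\sum_j (p_{ij}-q_{ij})(y_i-y_j)\bigl(1+|y_i-y_j|^2\bigr)^{-1}$; momentum, early exaggeration, and informed initialization, which practical implementations rely on, are omitted, and the function name and learning rate are illustrative.

```julia
using Random

# Optimize a 2-D map Y for given symmetric similarities P by gradient descent
# on the KL divergence with the Student-t (Cauchy) kernel.
function tsne_layout(P::AbstractMatrix; iters::Int=500, η::Real=100.0)
    N = size(P, 1)
    Y = 0.0001 .* randn(N, 2)                             # small random initial map
    for _ in 1:iters
        W = [1 / (1 + sum(abs2, Y[i, :] - Y[j, :])) for i in 1:N, j in 1:N]
        for i in 1:N; W[i, i] = 0.0; end
        Q = W ./ sum(W)                                   # heavy-tailed q_ij
        G = zeros(N, 2)
        for i in 1:N, j in 1:N                            # KL-divergence gradient
            G[i, :] .+= 4 * (P[i, j] - Q[i, j]) * W[i, j] .* (Y[i, :] - Y[j, :])
        end
        Y .-= η .* G
    end
    return Y
end

# Example: Y = tsne_layout(sne_similarities(X))
```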

Summary

t-SNE:

  • puts emphasis on modelling dissimilar datapoints by means of large pairwise distances, and similar datapoints by small pairwise distances,
  • offers dramatic improvement in finding and preserving local structure in the data, compared to, for instance, PCA.

Optimization of the t-SNE cost function is easier than optimizing earlier SNE versions, but:

  • the cost function is non-convex,
  • several optimization parameters need to be chosen,
  • the constructed solutions depend on these choices and may be different each time t-SNE is run from an initial random configuration.

Formal theory supporting t-SNE is lacking:

  • similarity measures and cost function are based on heuristics, appealing to “intuitive” justifications,
  • there is no formal generative model connecting high- and low-dimensional representations,
  • the probability semantics used to describe t-SNE is descriptive, rather than foundational.

Uniform Manifold Approximation and Projection (UMAP)

In the words of the authors:

  • UMAP's design decisions were all grounded in a solid theoretic foundation and not derived through experimentation with any particular task focused objective function.
  • The theoretical foundations for UMAP are largely based in manifold theory and topological data analysis.
  • A purely computational view [of UMAP] fails to shed any light upon the reasoning that underlies the algorithmic decisions made in UMAP.

A computational view of UMAP

From a practical computational perspective, UMAP can be described in terms of the construction of, and operations on, weighted graphs:

  • Graph construction:

    1. Construct a weighted k-neighbour graph
    2. Apply some transform on the edges to represent local distance.
    3. Deal with the inherent asymmetry of the k-neighbour graph.
  • Graph layout:

    1. Define a cost function that preserves desired characteristics of this k-neighbour graph.
    2. Find a low-dimensional representation that optimizes this cost function.

t-SNE and UMAP can both be cast in this form, but with some (subtle) differences in the edge weighting (dissimilarity measure) and graph layout (cost function).
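
As a rough illustration of the graph-construction stage only, the Julia sketch below builds a k-neighbour graph, rescales each cell's edge weights relative to its nearest-neighbour distance and a crude local bandwidth, and symmetrizes with the fuzzy-union rule $w_{ij} = w_{i\to j} + w_{j\to i} - w_{i\to j} w_{j\to i}$. The bandwidth calibration used by the real UMAP algorithm and the entire layout stage are omitted; names and parameters are illustrative.

```julia
using LinearAlgebra, Statistics

# Build a symmetrized, locally rescaled k-neighbour graph (weights in [0, 1]).
function umap_graph(X::AbstractMatrix; k::Int=15)
    N = size(X, 1)
    D = [norm(X[i, :] - X[j, :]) for i in 1:N, j in 1:N]
    W = zeros(N, N)
    for i in 1:N
        nn = partialsortperm(D[i, :], 2:k+1)       # k nearest neighbours (skip self)
        ρ = D[i, nn[1]]                            # distance to the nearest neighbour
        σ = max(mean(D[i, nn]) - ρ, eps())         # crude local bandwidth (assumption)
        for j in nn
            W[i, j] = exp(-max(D[i, j] - ρ, 0) / σ)
        end
    end
    return W .+ W' .- W .* W'                      # fuzzy-union symmetrization
end
```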

A comparison of dimension reduction algorithms

Practical recommendations

  • Claims of UMAP or t-SNE (or related methods) being superior are usually overrated.

  • Both algorithms suffer from a need for careful parameter tuning and initialization choices.

  • Using and understanding one algorithm well is more important than flipping between algorithms and never changing default parameters.

An excellent reference for using t-SNE for single-cell data:

Kobak & Berens. The art of using t-SNE for single-cell transcriptomics. Nat. Comm. 10:5416 (2019)

See also: https://github.com/berenslab/rna-seq-tsne

Assignment

4 - Tutorial

Tutorial on dimensionality reduction of single-cell RNA-seq data using t-SNE and PCA.

Tutorials are available as Pluto notebooks. To run the reactive notebooks, you need to first follow the instructions in the “Code” section of the repository’s readme file, then follow the instructions at the bottom of the Pluto homepage.