These lecture notes focus on probabilistic machine learning methods for computational biology, where the experimental data are viewed as random samples from an underlying data-generating process.
The “probabilistic modelling” in the title refers to the use of abstract data-generating processes that are not based on any specific biological mechanism, but derived from generic models and methods. A typical example is clustering using Gaussian mixture models.
To speak of “causal modelling” will require something more, namely that the data-generating process is based on some qualitative prior knowledge or understanding of the true underlying biological process. A typical example is path analysis.
The notes are divided into chapters, each focusing on a specific class of methods:
Clustering
Regularized regression
Dimensionality reduction
Causal inference
Graphical models
Spatio-temporal models
Each chapter follows the same structure:
A “classic” biological or biomedical research paper is studied where the algorithm (or class of algorithms) of interest was first used. A more recent follow-up or related paper is given as a reading assignment.
The method used in the classic paper is presented in detail, along with additional methods to solve the same type of problem. The methods are put in practice in a programming assignment. Where possible, original data from the papers studied in the first part is used.
Four appendices contain the minimum required background knowledge on gene regulation, probability theory, linear algebra, and optimization.
The theoretical sections contain the basic information to understand a method. For more background, try the following textbooks (with free pdfs!), all used in preparation of this course:
The use of classic or path-breaking papers is motivated by Back to the future: education for systems-level biologists. Since the field of genome-scale data analysis is still relatively young, the choice of papers for study is still a bit open and likely to evolve as the course matures.
One of the objectives of the course is to learn to read and understand scientific papers. Figures from papers selected for discussion are reproduced in these notes. Attribution to the original authors is always given. Where possible open access papers are used, but some “classic” papers date from before the birth of open access. If the full text version of the paper is available on EuropePMC, through Unpaywall or otherwise, I’ve reused figures without seeking any further reprint permission. If anyone feels their copyright is violated, please let me know.
1 - Cluster analysis
Cancer subtypes. Combinatorial clustering. Mixture distributions.
1.1 - Cancer subtype classification
Gene expression profiling of tumour samples.
Gene expression profiling predicts clinical outcome of breast cancer
Van ’t Veer et al. clustered data from 98 tumours based on their similarities across approximately 5,000 differentially expressed genes - genes that showed more variation than expected by chance in the dataset. The most striking finding is in their Figure 1: the tumours segregated into two distinct groups that correlated strongly with clinical features, namely:
BRCA1 germline mutation: harmful variants in the BRCA1 or BRCA2 genes that markedly increase risk for developing breast cancer.
Estrogen receptor (ER) status: breast tumour cells that express ER on their surface need estrogen to grow, and are therefore more susceptible to hormone therapy.
Tumour grade: a measure of degree of abnormality of cancer cells.
Angioinvasion: an indication whether the cancer has spread to the blood vessels.
Metastatic status: an indication whether the cancer has spread to other organs.
Overall, tumours in the bottom group of the figure were clearly associated with measures that predict better patient outcome.
Cluster analysis of breast tumours
a, Two-dimensional presentation of transcript ratios for 98 breast tumours across 4,968 significant genes. Each row represents a tumour and each column a single gene. As shown in the colour bar, red indicates upregulation, green downregulation, black no change, and grey no data available. The yellow line marks the subdivision into two dominant tumour clusters. b, Clinical data for the 98 patients. White indicates positive, black negative and grey denotes tumours derived from BRCA1 germline carriers who were excluded from the metastasis evaluation. c, Enlarged portion from a containing a group of genes that co-regulate with the ER-α gene (ESR1). Each gene is labelled by its gene name or accession number from GenBank. Contig ESTs ending with RC are reverse-complementary of the named contig EST. d, Enlarged portion from a containing a group of co-regulated genes that are the molecular reflection of extensive lymphocytic infiltrate, and comprise a set of genes expressed in T and B cells. (Gene annotation as in c.)
Following the discovery that unsupervised clustering of gene expression profiles identifies good and poor prognosis groups, the authors tried to identify a minimal prognostic signature from their data, which resulted in an optimal set of 70 marker genes, which they could validate in an independent set of tumour samples.
Prognostic signature of breast tumours
a, Use of prognostic reporter genes to identify optimally two types of disease outcome from 78 sporadic breast tumours into a poor prognosis and good prognosis group. b, Expression data matrix of 70 prognostic marker genes. Each row represents a tumour and each column a gene, whose name is labelled between b and c. Genes are ordered according to their correlation coefficient with the two prognostic groups. Tumours are ordered by the correlation to the average profile of the good prognosis group (middle panel). Solid line, prognostic classifier with optimal accuracy; dashed line, with optimized sensitivity. Above the dashed line patients have a good prognosis signature, below the dashed line the prognosis signature is poor. The metastasis status for each patient is shown in the right panel: white indicates patients who developed distant metastases within 5 years after the primary diagnosis; black indicates patients who continued to be disease-free for at least 5 years. c, Same as for b, but the expression data matrix is for tumours of 19 additional breast cancer patients using the same 70 optimal prognostic marker genes. Thresholds in the classifier (solid and dashed line) are the same as b. (See Fig. 1 for colour scheme.)
Clearly, with such a strong signature, the race to bring it to the clinic is on. That this is far from trivial can be seen by tracing the follow-up studies and clinical trials:
They got there eventually, and the gene expression signature is now commercially available under the name MammaPrint.
The Cancer Genome Atlas
Although the results by Van ’t Veer et al. were obtained from a small (by current standards!) sample size, they have been reproduced consistently in larger studies (see the assignment in the next cluster analysis lecture) and arguably spawned a search for similar signatures in other cancer types through large-scale projects, such as The Cancer Genome Atlas (TCGA) Program.
The amount of data and the number of publications produced by TCGA are too enormous to survey here.
For the purposes of illustration, have a look at the Pan-Cancer Atlas, and then do the following assignment.
Which data did the study analyze? Where do the different data types map on the genotype to phenotype axis?
Why are data from all cancer types analyzed together? What is the underlying hypothesis or motivation for the study? Did the study achieve its aim?
What is observed when data types are clustered independently? Do the same clusters reappear in multiple data types? Do clusters overlap with cancer types?
What is observed when data types are clustered together? Which data types are included in the joint analysis and why?
What is the main difference between the COCA and iCluster methods? What does the TumorMap show?
The final number of clusters (28) is close to the number of cancer types (33); what do you think this means?
What do you think is the main challenge when jointly clustering multiple data types and how would you address it?
1.2 - Combinatorial clustering
Combinatorial clustering algorithms, K-means.
Reference
The material in this section is mostly borrowed from:
The goal of cluster analysis is to group or segment a collection of objects into subsets or “clusters”, such that objects within a cluster are more closely related than objects in different clusters.
Many datasets exhibit a hierarchical structure, with no clear demarcation of clusters, and clusters can themselves be grouped successively such that clusters within one group are more similar than those in different groups. Deciding where to make the “cut” is usually done by setting parameters of a clustering algorithm, and almost always involves an arbitrary choice by the user.
Central to all clustering methods is the notion of the degree of similarity (or dissimilarity) between the objects being clustered. Sometimes data are presented directly in terms of proximity/similarity between objects. More often we have measurements (e.g. gene expression) on objects (e.g. genes or samples) that we want to cluster, and a (dis)similarity matrix must be constructed first.
Dissimilarities based on measurements
Assume we have measurements $x_{ij}$ for objects $i=1,2,\dots,N$ on variables (or attributes) $j=1,2,\dots,p$. Usually, we first define a dissimilarity function $d_j(x_{ij},x_{i'j})$ between values of the $j$th attribute, and then define the dissimilarity between objects as
$$
D(x_i,x_{i'}) = \sum_{j=1}^p w_j\, d_j(x_{ij},x_{i'j}),
$$
with $w_j$ a weight assigned to the $j$th attribute that determines its relative influence in the overall dissimilarity between objects. By convention we scale the weights such that $\sum_j w_j=1$.
Equal attribute influence
To give all attributes equal influence in the object dissimilarity, we must set $w_j\sim 1/\bar{d}_j$, with
$$
\bar{d}_j = \frac{1}{N^2}\sum_{i=1}^N \sum_{i'=1}^N d_j(x_{ij},x_{i'j})
$$
the average dissimilarity on the $j$th attribute.
Setting $w_j=1$ for all $j$ does not necessarily give all attributes equal influence! To see this, we compute the average object dissimilarity over all pairs of objects as
$$
\begin{aligned}
\bar D &= \frac1{N^2}\sum_{i=1}^N \sum_{i'=1}^N D(x_i,x_{i'}) = \sum_{j=1}^p w_j \cdot \bar{d}_j,
\end{aligned}
$$
with $\bar{d}_j$ defined above. Hence the relative influence of the $j$th attribute is $w_j \cdot \bar{d}_j$.
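As a quick numerical check (a sketch on synthetic, hypothetical data), the following computes the average per-attribute dissimilarities $\bar{d}_j$ with the squared-error distance, and verifies that weights $w_j\sim 1/\bar{d}_j$ give every attribute the same relative influence $w_j\cdot\bar{d}_j$:

```python
import numpy as np

# Hypothetical example: two attributes measured on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 100, size=50),   # attribute 1: large scale
                     rng.normal(0, 1, size=50)])    # attribute 2: small scale

# Average squared-error dissimilarity of each attribute over all object pairs:
# d_bar_j = (1/N^2) sum_{i,i'} (x_ij - x_i'j)^2
diffs = X[:, None, :] - X[None, :, :]               # shape (N, N, p)
d_bar = (diffs ** 2).mean(axis=(0, 1))              # shape (p,)

# With w_j = 1 the large-scale attribute dominates; with w_j ~ 1/d_bar_j
# every attribute gets the same relative influence w_j * d_bar_j.
w = 1.0 / d_bar
w /= w.sum()                                        # convention: weights sum to 1
influence = w * d_bar
```

With `w` set to uniform weights instead, `influence` would be dominated by the first (large-scale) attribute.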
Squared error distance and standardization
With squared-error distance, the average object dissimilarity on the $j$th attribute is proportional to its variance.
The most common choice of dissimilarity function is squared-error distance:
$$
d_j(x_{ij},x_{i’j}) = (x_{ij}-x_{i’j})^2
$$
Define the mean and variance of each attribute over all objects as
$$
\mu_j = \frac1N\sum_{i=1}^N x_{ij}, \qquad \sigma_j^2 = \frac1N\sum_{i=1}^N (x_{ij}-\mu_j)^2.
$$
With squared-error distance, $\bar{d}_j = 2\sigma_j^2$, so standardizing the data, $x_{ij}\to (x_{ij}-\mu_j)/\sigma_j$, is a common preprocessing step. With squared-error loss, this is equivalent to setting weights $w_j \sim 1/\sigma_j^2 \sim 1/\bar{d}_j$, that is, to give all attributes equal influence on the average object dissimilarity.
Beware that sometimes some attributes exhibit more grouping tendency than others, which may be obscured by standardizing.
To standardize or not?
Randomly sampled data from a mixture of two multivariate distributions with means differing only in the first dimension, showing the raw (left) and standardized (right) data colored according to K-means cluster label. Standardization has obscured the two well-separated groups. Note that each plot uses the same units in the horizontal and vertical axes.
Filter attributes by their variance before standardizing…
Combinatorial clustering
Combinatorial clustering algorithms assign each object to a cluster without regard to a probability model describing the data. Understanding combinatorial clustering is a necessary basis for understanding probabilistic methods.
In combinatorial clustering, a prespecified number of clusters $K<N$ is postulated ($N$ the number of objects). An assignment of objects $i\in\{1,\dots,N\}$ to clusters $k\in\{1,\dots,K\}$ is characterized by a many-to-one mapping or encoder $k=C(i)$.
$C$ is obtained by minimizing the “within cluster” point scatter:
$$
W(C) = \frac12 \sum_{k=1}^K \sum_{C(i)=k} \sum_{C(i')=k} d_{ii'}
$$
The complementary quantity
$$
B(C) = \frac12 \sum_{k=1}^K \sum_{C(i)=k} \sum_{C(i')\neq k} d_{ii'}
$$
is the “between cluster” point scatter.
$B(C)$ characterizes the extent to which objects assigned to different clusters tend to be far apart.
Since the total point scatter $T = \frac12\sum_{i=1}^N\sum_{i'=1}^N d_{ii'} = W(C) + B(C)$ is constant given the data, minimizing $W(C)$ is equivalent to maximizing $B(C)$.
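This decomposition is easy to verify numerically. The sketch below (synthetic data, an arbitrary cluster assignment) computes the within-cluster, between-cluster, and total point scatter with the squared Euclidean dissimilarity:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
C = rng.integers(0, 3, size=20)                 # arbitrary assignment into K=3 clusters

# Pairwise squared Euclidean dissimilarities d_ii'
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)

same = C[:, None] == C[None, :]                 # True where i and i' share a cluster
W = 0.5 * D[same].sum()                         # within-cluster point scatter
B = 0.5 * D[~same].sum()                        # between-cluster point scatter
T = 0.5 * D.sum()                               # total point scatter (constant in C)
```

For any assignment `C`, `W + B` equals `T` up to floating-point error.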
$K$-means clustering
The $K$-means algorithm uses the squared Euclidean distance
$$
d(x_i,x_{i'}) = \sum_{j=1}^p (x_{ij}-x_{i'j})^2 = \| x_i - x_{i'}\|^2
$$
and an iterative greedy descent algorithm to minimize $W(C)$.
Using the Euclidean distance expression, $W(C)$ can be written as
$$
W(C) = \sum_{k=1}^K N_k \sum_{C(i)=k} \| x_i - \overline{x_k}\|^2
$$
where $N_k$ is the number of objects assigned to cluster $k$, and $\overline{x_k}=(\overline{x_{1k}},\dots,\overline{x_{pk}})$ is the mean vector associated to cluster $k$.
$W(C)$ is minimized if within each cluster, the average dissimilarity of the objects from the cluster mean, as defined by the points in that cluster, is minimized.
Note that for any set of objects $S$,
$$
\overline{x_S} = \frac{1}{|S|} \sum_{i\in S} x_i = \argmin_m \sum_{i\in S}\|x_i-m\|^2
$$
This result is used in a greedy descent algorithm where alternatingly the mean vectors are updated for the current cluster assignments, and object assignments are updated by assigning objects to the nearest current mean vector.
K-means algorithm
For a given cluster assignment $C$, the total cluster variance $W(C)$ is minimized with respect to $\{m_1,\dots, m_K\}$ yielding the means of the currently assigned clusters. That is
$$
m_k = \frac1{N_k} \sum_{C(i)=k} x_i
$$
Given a current set of means $\{m_1,\dots, m_K\}$, $W(C)$ is minimized by assigning each observation to the closest (current) cluster mean. That is,
$$
C(i) = \argmin_{1\leq k\leq K} \| x_i - m_k\|^2.
$$
These two steps are iterated until the assignments no longer change.
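The two alternating steps can be sketched in a few lines of numpy (a minimal implementation on synthetic data; the initial means are passed in explicitly here, whereas real implementations typically use multiple random restarts):

```python
import numpy as np

def kmeans(X, K, init, n_iter=100):
    """Plain K-means: alternate the two steps above until assignments stop changing.
    `init` gives the K starting mean vectors (a minimal sketch)."""
    m = np.array(init, dtype=float)
    C = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assignment step: each object goes to the closest current mean.
        d = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)
        C_new = d.argmin(axis=1)
        if np.array_equal(C_new, C):
            break
        C = C_new
        # Update step: each mean becomes the average of its assigned objects.
        for k in range(K):
            if (C == k).any():
                m[k] = X[C == k].mean(axis=0)
    return C, m

# Two well-separated Gaussian blobs; initialize with one point from each.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(8, 1, size=(50, 2))])
C, m = kmeans(X, K=2, init=[X[0], X[-1]])
```

Because each step can only decrease $W(C)$ and there are finitely many assignments, the algorithm always converges, though possibly to a local minimum; hence the usual advice of multiple restarts.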
Download the expression data file “BRCA.exp.348.med.txt” and the paper Supplementary Tables spreadsheet.
Filter for the most variable genes, and create data structures for the expression data, estrogen receptor (ER) status, and the AJCC cancer stage of each individual.
Apply K-means clustering to the expression data. Does your K-means implementation standardize data by default or not? Choose an appropriate value of $K$ in K-means and justify your choice (cf. ESL Sections 14.3.8 and 14.3.11). Compare standardized vs non-standardized data.
Do the individuals cluster by ER status? By cancer stage?
1.3 - Mixture distributions
Mixture distributions
Reference
The material in this section is mostly borrowed from:
A generative model is a statistical model for the process that could have generated your data. Generative models offer many advantages compared to combinatorial algorithms that treat data as a collection of objects. Most importantly, working with a generative model forces you to be explicit about your assumptions. Likewise, a generative model allows you to encode, and be explicit about, prior (biological) knowledge you may have about the data generating process.
Gaussian mixture models
What kind of process could generate data with separated groups or clusters of observations?
Let’s assume there is an unmeasured (or hidden) random variable $Z$ that determines to which group an observation $X$ belongs. For simplicity, assume that $Z$ can only take two values, 0 or 1, and that the measurement $X$ is one-dimensional and normally distributed in each group.
Consider the following process:
Randomly sample cluster label $Z=1$ with probability $\pi$ and $Z=0$ with probability $1-\pi$.
Sample features
$$
\begin{aligned}
X \sim \begin{cases}
\mathcal{N}(\mu_0,\sigma_0^2) & \text{ if } Z=0\\
\mathcal{N}(\mu_1,\sigma_1^2) & \text{ if } Z=1
\end{cases}
\end{aligned}
$$
where $\mu_k$ and $\sigma_k^2$ are the means and variances of two normal distributions.
What data would this process generate?
Hence the model generates cluster labels $Z$ and real numbers $x\in\mathbb{R}$ from the model
$$
p(x) = (1-\pi)\,\mathcal{N}(x\mid \mu_0,\sigma_0^2) + \pi\,\mathcal{N}(x\mid \mu_1,\sigma_1^2).
$$
We want to find the best-fitting model by maximizing the log-likelihood.
Directly maximizing the log-likelihood with respect to the parameters $\pi$, $\mu_k$, and $\sigma_k^2$ is difficult, because:
Only the feature values $x$ are observed.
The cluster labels $Z$ are hidden: they are latent variables.
The log-likelihood of the observed data is expressed in terms of the marginal (mixture) distribution, involves logarithms of sums, and is difficult to maximize directly.
If we knew the cluster labels $k$ for each sample, we could easily fit the parameters $(\pi,\mu_k,\sigma_k^2)$ from the data for each cluster.
If we knew the parameters $(\pi, \mu_k,\sigma_k^2)$, we could easily determine the probability for each data point to belong to each cluster and determine cluster labels.
To get around this catch-22, we replace actual cluster labels by their current expected values given current values for the parameters, and then iterate the above two steps until convergence - this is the Expectation-Maximization (EM) algorithm.
The EM algorithm
Take initial guesses $\hat\pi$, $\hat\mu_k$, $\hat\sigma_k^2$ for the model parameters.
Expectation step: Compute the responsibilities $P(Z_i=k\mid x_i)$ for each data point $x_i$.
Maximization step: Update $\hat\pi$, $\hat\mu_k$, $\hat\sigma_k^2$ by maximizing the log-likelihood using the soft cluster assignments $P(Z_i=k\mid x_i)$.
Iterate steps 2 and 3 until convergence.
What are “soft cluster assignments”?
Consider $N$ samples $(x_1,\dots,x_N)$ from a normal distribution $p(x\mid \mu,\sigma^2)$. The log-likelihood is
$$
\ell(\mu,\sigma^2) = \sum_{i=1}^N \log p(x_i\mid\mu,\sigma^2) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^N (x_i-\mu)^2.
$$
Now consider $N$ samples $(Z_1,\dots,Z_N)$ and $(x_1,\dots,x_N)$ from the generative model where the cluster labels are also observed. The log-likelihood is
$$
\ell = \sum_{i=1}^N \Bigl[ (1-Z_i)\log\bigl((1-\pi)\,p(x_i\mid\mu_0,\sigma_0^2)\bigr) + Z_i \log\bigl(\pi\, p(x_i\mid\mu_1,\sigma_1^2)\bigr)\Bigr].
$$
Since the cluster labels are not observed, we don’t know the value of the $Z_i$. The “trick” is to replace them with their expected values $\mathbb{E}(Z_i\mid x_i) = P(Z_i=1\mid x_i)$ in the EM algorithm, computed using the current estimates for $\hat\pi$, $\hat\mu_k$, $\hat\sigma_k^2$.
Hence, instead of a “hard” assignment of each sample $x_i$ to one cluster $k$ when the $Z_i$ are observed, each sample now contributes with a “soft assignment” weight $P(Z_i=k\mid x_i)$ to the parameters of each cluster.
After convergence, the final $P(Z_i=k\mid x_i)$ can be used to assign data points to clusters, for instance to the cluster $k$ with highest responsibility for $x_i$.
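The whole procedure can be sketched in a few lines of numpy. This is a minimal implementation on simulated data; the initial guesses from the data quantiles and the fixed iteration count are arbitrary choices, and a real implementation would monitor the log-likelihood for convergence:

```python
import numpy as np

def norm_pdf(x, mu, s2):
    """Density of a normal distribution with mean mu and variance s2."""
    return np.exp(-(x - mu) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

def em_gmm2(x, n_iter=200):
    """EM for a two-component 1D Gaussian mixture (a minimal sketch)."""
    # Step 1: initial guesses for pi, mu_k, sigma_k^2.
    pi = 0.5
    mu0, mu1 = np.quantile(x, [0.25, 0.75])
    s0 = s1 = x.var()
    for _ in range(n_iter):
        # Step 2 (E): responsibilities g_i = P(Z_i = 1 | x_i).
        a0 = (1 - pi) * norm_pdf(x, mu0, s0)
        a1 = pi * norm_pdf(x, mu1, s1)
        g = a1 / (a0 + a1)
        # Step 3 (M): soft-assignment weighted parameter updates.
        pi = g.mean()
        mu0 = ((1 - g) * x).sum() / (1 - g).sum()
        mu1 = (g * x).sum() / g.sum()
        s0 = ((1 - g) * (x - mu0) ** 2).sum() / (1 - g).sum()
        s1 = ((g) * (x - mu1) ** 2).sum() / g.sum()
    return pi, mu0, s0, mu1, s1, g

# Simulate from the generative model (pi=0.3, mu0=0, mu1=5, sigma=1), then fit.
rng = np.random.default_rng(3)
z = rng.random(2000) < 0.3
x = np.where(z, rng.normal(5, 1, 2000), rng.normal(0, 1, 2000))
pi, mu0, s0, mu1, s1, g = em_gmm2(x)
```

The final responsibilities `g` can then be thresholded (e.g. at 0.5) to obtain hard cluster labels.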
Generalizations
We have so far considered the case where the data are one-dimensional (real numbers) and the number of clusters is pre-fixed. Important generalizations are:
The data can be of any dimension $D$. In higher dimensions the components of the mixture model are multivariate normal distributions. The mean parameters $\mu_k$ simply become $D$-dimensional vectors. The variance parameters $\sigma_k^2$ however become $D\times D$ covariance matrices. For simplicity and to reduce the number of parameters, it is often assumed that the covariance matrices are diagonal, such that the number of variance parameters again scales linearly in $D$. However, when features are correlated, this assumption will be a poor representation of the underlying data.
Instead of treating the cluster weights, means, and variances as fixed parameters that need to be estimated, they can be seen as random variables themselves with a distribution (uncertainty) that can be learned from the data. For more information, see this tutorial.
In an infinite mixture model, the number of clusters need not be fixed in advance, but can be learned from the data. For more information, see this tutorial.
Assignment
EM implementation
Implement an algorithm to generate random samples from a one-dimensional Gaussian mixture distribution with two components. Your algorithm should have as parameters $N$, the number of samples to generate, and $(\pi,\mu_k,\sigma_k)$, $k=0,1$.
Implement the EM algorithm for maximum-likelihood estimation of the parameters of the previous model.
Test your EM algorithm by applying it on the simulated data from step 1 and evaluating how close your parameter estimates are to their true values. Draw a histogram of the sampled data colored by cluster label and overlay intermediate and final responsibility values for each data point (cf. [ESL] Fig. 8.5)
1.4 - Combinatorial clustering tutorial
Illustrating data exploration and K-means clustering using TCGA breast cancer data.
When we analyze genome-wide data, we are usually interested in identifying statistically significant features or patterns, that is, patterns we would not expect to see purely by chance. For example:
Genes differentially expressed between healthy and disease samples, treated and untreated samples, etc.
Genes coexpressed across tumour samples
…
We will use differential expression between two conditions, treated and untreated, as a running example. Of course this applies equally to other types of comparisons, e.g. ER positive vs ER negative breast tumours.
Statistical significance is expressed in terms of comparing a null hypothesis $H_0$ to an alternative hypothesis $H_1$. For instance
$$
\begin{aligned}
H_0 &= \text{treatment has no effect on gene expression}\\
H_1 &= \text{treatment does have an effect on gene expression}
\end{aligned}
$$
To compare gene expression between the two groups, we can for instance do a two-sample $t$-test: compute for each gene
$$
t = \frac{\overline{x_1} - \overline{x_2}}{\text{se}}
$$
where $\overline{x_1}$ and $\overline{x_2}$ are the average expression levels of the gene in each group, and $\text{se}$ is the pooled within-group standard error:
$$
\text{se} = \hat\sigma\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}, \qquad \hat\sigma^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2},
$$
with $n_1,n_2$ the group sizes and $s_1^2,s_2^2$ the within-group sample variances.
The $t$-statistic measures the difference in average expression between the two groups, relative to the natural variation within each group. The greater $|t|$, the more likely there is a true difference between the groups. Nevertheless, even if there is no true difference, some variation between the groups will be observed in a finite number of samples due to random sampling noise. The distribution of $t$-values observed between two groups when in fact there is no true difference between them is called the null distribution. Statistical significance is usually expressed by the $p$-value, which, for a given value of $t$, expresses the probability to observe a $t$-statistic greater than $t$ (in absolute value) under the null hypothesis:
$$
\begin{aligned}
p = \text{Pr}(|T|\geq |t| \mid H_0)
\end{aligned}
$$
For some test statistics, the true null distribution to compute the $p$-value is known. Otherwise random permutations can be used: randomly permute the group labels of the samples a large number of times, compute the test statistic for each permutation, and use these values to construct an empirical null distribution.
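A minimal sketch of the permutation approach for a single gene (synthetic, hypothetical data; the helper `tstat` uses the pooled-variance form of the two-sample statistic):

```python
import numpy as np

def tstat(x1, x2):
    """Two-sample t-statistic with pooled within-group standard error."""
    n1, n2 = len(x1), len(x2)
    sp2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (x1.mean() - x2.mean()) / se

# Hypothetical data: one gene, 10 treated vs 10 untreated samples.
rng = np.random.default_rng(4)
treated = rng.normal(1.0, 1.0, 10)              # true shift of 1
untreated = rng.normal(0.0, 1.0, 10)
t_obs = tstat(treated, untreated)

# Empirical null: permute the group labels and recompute the statistic.
pooled = np.concatenate([treated, untreated])
t_null = []
for _ in range(2000):
    perm = rng.permutation(pooled)
    t_null.append(tstat(perm[:10], perm[10:]))
t_null = np.asarray(t_null)

# Permutation p-value: fraction of null statistics at least as extreme.
p = (np.abs(t_null) >= np.abs(t_obs)).mean()
```

In a genome-wide analysis the same permutations would be applied to all genes at once, giving an empirical null distribution shared across features.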
An important property of $p$-values is the following: under the null hypothesis, $p$-values are uniformly distributed between 0 and 1:
$$
\text{Pr}(P\leq p \mid H_0) = p
$$
Conventionally, a $p$-value threshold of 0.05 is often used. This means that the probability that we declare a gene as being affected by the treatment, even though in reality it is not affected, is less than 5%.
In genome-wide studies, we typically measure around 20,000 genes (in humans). From the uniformity of the $p$-value distribution under the null hypothesis, it follows that even if none of the genes are affected by treatment, we would still expect 5%, or 1,000 genes, to have a $p$-value less than 0.05. Clearly it would be wrong to call any of these genes differentially expressed!
False discovery rate
The $p$-value is often wrongly interpreted as saying that, if $p<0.05$, the probability that our gene is truly affected by treatment is greater than 95%. The latter is the probability that the alternative hypothesis is true given that the $p$-value is below a certain threshold, $\text{Pr}(H_1 \mid P\leq p)$. More commonly, we express this through the false discovery rate (FDR), that is, the probability that the null hypothesis is true given that the $p$-value is below a certain threshold, $\text{Pr}(H_0 \mid P\leq p)=1-\text{Pr}(H_1 \mid P\leq p)$.
To get a feel for what such probabilities mean, consider the following table:
| | Significant | Not significant | Total |
|---|---|---|---|
| $H_0$ true | $F$ | $M_0-F$ | $M_0$ |
| $H_1$ true | $T$ | $M_1-T$ | $M_1$ |
| Total | $S$ | $M-S$ | $M$ |
Here $F$ stands for the number of false positives (genes not affected by treatment, but still called significant) and $T$ for the number of true positives. The second column contains respectively the true negative and false negative counts.
Obviously, we never know the true value of the numbers in the table, and hence FDR must be estimated. There exist several approaches for doing this. We consider two popular ones: the plug-in estimator and a Bayesian approach.
Plug-in estimator of the FDR
Consider again the differential expression problem. Compute $t$-statistics for each gene and consider a range of thresholds $C$, either on the absolute $t$-values or on the corresponding $p$-values. For each value of $C$, we know $S_{obs}(C)$, the observed number of significant genes at threshold $C$.
Now generate $K$ random permutations of the group labels, and compute for each permutation $k$, $F_{obs}(C,k)$, the number of significant genes at threshold $C$ in permutation $k$, which by virtue of the randomization must all correspond to cases where $H_0$ is true. Compute the average over all permutations,
$$
\hat F(C) = \frac1K \sum_{k=1}^K F_{obs}(C,k),
$$
and estimate the FDR at threshold $C$ as
$$
\widehat{\text{FDR}}(C) = \frac{\hat F(C)}{S_{obs}(C)}.
$$
We can now vary $C$ until we reach a desired FDR value, e.g. 10%, such that we expect no more than 10% of the genes we call differentially expressed to be false positives.
Note that in practice, the number of permutations $K$ need not be very large to reach a stable value for the average. This is because in each permutation, we obtain random $p$-values for a large number of features (genes).
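A sketch of the plug-in estimator on simulated data (numpy; for brevity the "permutation" statistics are drawn directly from the null distribution here, instead of being computed from actual label permutations):

```python
import numpy as np

def plugin_fdr(t_obs, t_perm, C):
    """Plug-in FDR estimate at threshold C on |t|.

    t_obs:  observed statistics, one per gene.
    t_perm: (K, n_genes) statistics from K label permutations (all null).
    """
    S = (np.abs(t_obs) >= C).sum()                  # observed significant genes
    F = (np.abs(t_perm) >= C).sum(axis=1).mean()    # average null count per permutation
    return F / max(S, 1)

# Hypothetical mixture: 900 null genes and 100 genes with a real shift.
rng = np.random.default_rng(5)
t_obs = np.concatenate([rng.normal(0, 1, 900), rng.normal(4, 1, 100)])
t_perm = rng.normal(0, 1, size=(20, 1000))          # stand-in for permutation nulls
fdr = plugin_fdr(t_obs, t_perm, C=3.0)
```

Sweeping `C` over a grid and picking the smallest threshold with `fdr` below the target (e.g. 0.1) gives the gene list at that FDR.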
Bayesian estimate of the FDR
When we perform a genome-wide test for differential expression, we obtain on the order of $10^4$ test statistic values, with corresponding $p$-values under the null hypothesis. The distribution of these $p$-values, visualized using a histogram, typically looks like this:
We know that under the null hypothesis, $p$-values are uniformly distributed. We recognize the uniform distribution in the flat profile of the histogram for non-significant ($p\to 1$) $p$-values, which with very high probability come from genes unaffected by the treatment. At very small $p$-values, we see a deviation from the uniform distribution, which indicates the presence of truly affected genes, which obviously don’t follow the null distribution.
Hence it seems like we can model the observed $p$-value distribution as a mixture distribution with one component corresponding to the null distribution and one component corresponding to the alternative distribution:
$$
f(p) = \pi_0 f_0(p) + (1-\pi_0) f_1(p)
$$
where $f(p)$ is the probability density function (p.d.f.) of the observed $p$-value distribution, $f_0$ and $f_1$ are the p.d.f. of the $p$-value distribution under the null and alternative distribution, respectively, and $\pi_0 = \text{Pr}(H_0)$ is the prior probability of a gene being unaffected by treatment (prior as in: before observing any data).
As before, we can imagine that our $p$-values are generated by a process that first samples a hidden variable $Z=0$ with probability $\pi_0$ and $Z=1$ with probability $1-\pi_0$, and then samples a $p$-value (or $t$-statistic) from the null or alternative distribution depending on the value of $Z$. We know the null distribution is the uniform distribution. If we knew a parametric form for the alternative distribution, we could apply EM to estimate $\pi_0$ and the parameters of the alternative distribution, and then compute for each gene the recognition distribution or local false discovery rate (lfdr)
$$
\text{lfdr}(p) = \text{Pr}(H_0 \mid p) = \frac{\pi_0 f_0(p)}{f(p)}.
$$
The word “local” refers to the fact that the above expression gives the expected rate of false positives among all genes with $p$-value equal to $p$, as opposed to the “tail” FDR estimated in the previous section, which gives the expected rate of false positives among all genes with $p$-value less than or equal to $p$.
In reality, EM is rarely used in this context, because we rarely have a good enough idea about a parametric form for the alternative distribution. Instead a non-parametric approach is used that exploits the knowledge that:
$p$-values are uniformly distributed under the null hypothesis, that is
$$
f_0(p) = 1
$$
$p$-values close to 1 almost surely come from the null distribution, that is,
$$
f(p) \approx \pi_0 f_0(p) = \pi_0 \qquad \text{for } p \to 1.
$$
Hence in the observed $p$-value histogram, $\pi_0$ can be estimated from the height of the “flat” region near $p\approx 1$.
We obtain
$$
\text{lfdr}(p) = \frac{\hat{\pi}_0}{f(p)}
$$
It remains to estimate the p.d.f. of the real $p$-value distribution $f(p)$. In the popular q-value package, this is done by kernel density estimation followed by some smoothing.
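A crude non-parametric sketch of these two steps on simulated $p$-values (numpy; a histogram-based density estimate stands in for the smoothed estimate used by the q-value package, and the Beta distribution for the alternative is an arbitrary choice):

```python
import numpy as np

# Hypothetical p-values: 80% null (uniform) and 20% alternative (near 0).
rng = np.random.default_rng(6)
p = np.concatenate([rng.random(8000), rng.beta(0.5, 20, 2000)])

# Estimate pi0 from the height of the histogram near p = 1,
# where (almost) only null p-values remain.
pi0_hat = (p > 0.8).mean() / 0.2                # density estimate on (0.8, 1]

# Crude histogram estimate of f(p), then lfdr(p) = pi0_hat * f0(p) / f(p)
# with f0 = 1; one lfdr value per histogram bin, clipped at 1.
bins = np.linspace(0, 1, 21)
f_hat, _ = np.histogram(p, bins=bins, density=True)
lfdr = np.minimum(pi0_hat / f_hat, 1.0)
```

As expected, the lfdr is small in the leftmost bins (enriched for alternatives) and close to 1 near $p=1$.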
The local false discovery rate provides a useful measure for the importance of a feature with $p$-value $p$. However, $p$-values themselves represent tail probabilities, $p=\text{Pr}(P\leq p \mid H_0)$, see above. Similarly, we can compute the FDR among all features with $p$-value less than some threshold $p$:
$$
\text{FDR}(p) = \text{Pr}(H_0\mid P\leq p).
$$
This tells us the estimated false discovery rate among all features that are equally or more significant than a feature with $p$-value $p$.
Writing $F(p)$, $F_0(p)$ and $F_1(p)$ for the cumulative distribution functions of the observed, null, and alternative $p$-value distributions, the reasoning above can be repeated to obtain
$$
\text{FDR}(p) = \frac{\pi_0 F_0(p)}{F(p)} = \frac{\pi_0\, p}{F(p)}.
$$
The $q$-value of a feature with $p$-value $p$ is defined as the minimum FDR that can be attained when calling that feature significant, that is, by choosing a $p$-value threshold $r\geq p$. Hence
$$
q(p) = \min_{r\geq p} \text{FDR}(r).
$$
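A sketch of the computation (numpy; $\text{FDR}(p_{(i)})$ is estimated in the Benjamini-Hochberg form $\hat\pi_0\, m\, p_{(i)}/i$ over the sorted $p$-values, with $\hat\pi_0=1$ as a conservative default):

```python
import numpy as np

def qvalues(p, pi0=1.0):
    """q-values as the running minimum of tail-FDR estimates (a sketch)."""
    p = np.asarray(p)
    m = len(p)
    order = np.argsort(p)
    # Estimated FDR at each sorted p-value threshold: pi0 * m * p_(i) / i.
    fdr = pi0 * m * p[order] / np.arange(1, m + 1)
    # q(p) = min over thresholds r >= p of FDR(r): running minimum from the right.
    q_sorted = np.minimum.accumulate(fdr[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(q_sorted, 1.0)
    return q

q = qvalues([0.01, 0.02, 0.03, 0.5])
```

Plugging in the $\hat\pi_0$ estimated from the flat part of the $p$-value histogram makes the estimate less conservative.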
In a linear model, a (continuous) output \(Y\) is predicted from a vector of inputs \(X^T=(X_1,X_2,\dots,X_p)\) via
$$\begin{aligned}
\hat Y = \hat\beta_0 + \sum_{j=1}^p \hat{\beta}_j X_j
\end{aligned}$$
\(\hat Y\) is the predicted value of \(Y\),
\(\hat\beta_0\) is the intercept (in statistics) or bias (in machine learning),
\((\hat\beta_1,\hat\beta_2,\dots,\hat\beta_p)^T\) is the vector of (regression) coefficients.
For convenience, we often include a constant variable 1 in $X$, include $\hat \beta_0$ in the vector of coefficients, and write the linear model in vector form:
$$\begin{aligned}
\hat Y = X^T\hat\beta
\end{aligned}$$
Least squares is the most popular method to fit the linear model to a set of training data $(x_1,y_1)$ $\dots$ $(x_N,y_N)$: we pick the coefficients $\beta$ to minimize the residual sum of squares
$$
RSS(\beta) = \sum_{i=1}^N (y_i - x_i^T\beta)^2.
$$
Collecting the data for the inputs and output in an $N\times p$ matrix $\mathbf{X}$ and an $N$-vector $\mathbf{y}$, respectively,
$$\begin{aligned}
\mathbf{X}= \begin{pmatrix}
x_{11} & x_{12} & \dots & x_{1p}\\
x_{21} & x_{22} & \dots & x_{2p}\\
\vdots & \vdots & & \vdots\\
x_{N1} & x_{N2} & \dots & x_{Np}
\end{pmatrix} &&
\mathbf{y}= \begin{pmatrix}
y_1\\
y_2\\
\vdots\\
y_N
\end{pmatrix}
\end{aligned}$$
we can write
$$\begin{aligned}
RSS(\beta) &= (\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X}\beta)
\end{aligned}$$
If the $p\times p$ matrix $\mathbf{X}^T\mathbf{X}$ is non-singular, then the unique minimizer of $RSS(\beta)$ is given by
$$\tag{1}
\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{y}$$
Exercise
Prove eq. (1) by differentiating $RSS(\beta)$ w.r.t. $\beta_i$.
The fitted value at the $i$th input $x_i$ is $\hat y_i=x_i^T\hat\beta$. In matrix-vector notation, we can write
$$\begin{aligned}
\hat{\mathbf{y}} = \mathbf{X}\hat\beta = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}
\end{aligned}$$
Write
$$\begin{aligned}
\mathbf{H}= \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T
\end{aligned}$$
Exercise
Show that $\mathbf{H}$ is a projection matrix, $\mathbf{H}^2=\mathbf{H}$. $\mathbf{H}$ projects on the linear subspace of $\mathbb{R}^N$ spanned by the columns of $\mathbf{X}$.
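Both results are easy to check numerically on simulated data (a sketch; `solve` is used rather than forming the inverse explicitly, which is the numerically preferred route):

```python
import numpy as np

# Simulate a small regression problem and fit by least squares.
rng = np.random.default_rng(7)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # include constant 1
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(0, 0.1, N)

# beta_hat = (X^T X)^{-1} X^T y  (eq. (1))
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The hat matrix H projects y onto the column space of X, so H @ H == H.
H = X @ np.linalg.solve(X.T @ X, X.T)
y_fit = H @ y
```

With small noise, `beta_hat` is close to `beta_true`, and `H` satisfies the projection property up to floating-point error.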
Limitations of least-squares linear regression
Least-squares linear regression involves the matrix inverse
$(\mathbf{X}^T\mathbf{X})^{-1}$:
If the columns of $\mathbf{X}$ are not linearly independent ($\mathbf{X}$ is not full rank), then $\mathbf{X}^T\mathbf{X}$ is singular and $\hat\beta$ is not uniquely defined, although the fitted values $\hat{\mathbf{y}}=\mathbf{X}\hat\beta$ are still the projection of $\mathbf{y}$ on the column space of $\mathbf{X}$. This is usually resolved automatically by stats software packages.
If $\mathbf{X}$ is full rank, but some predictors are highly correlated, $\det(\mathbf{X}^T\mathbf{X})$ will be close to 0. This leads to numerical instability and high variance in $\hat\beta$ (small changes in the training data lead to large changes in $\hat\beta$).
If $p>N$ (but $\operatorname{rank}(\mathbf{X})=N$), then the columns of $\mathbf{X}$ span the entire space $\mathbb{R}^N$, that is, $\mathbf{H}=\mathbb{1}$ and $\hat{\mathbf{y}}=\mathbf{y}$: overfitting the training data.
High variance and overfitting result in poor prediction accuracy (generalization to unseen data).
Prediction accuracy can be improved by shrinking regression coefficients towards zero (imposing a penalty on their size).
Ridge regression
Regularization by shrinkage: Ridge regression
The ridge regression coefficients minimize a penalized residual sum of squares:
$$\tag{2}
\beta^{\text{ridge}} = \argmin_\beta \left\{ \sum_{i=1}^N \Bigl(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Bigr)^2 + \textcolor{red}{\lambda\sum_{j=1}^p\beta_j^2} \right\}
$$
$\lambda\geq 0$ is a hyperparameter that controls the amount of shrinkage.
The inputs $X_j$ (columns of $\mathbf{X}$) are normally standardized before solving eq. (2).
The intercept $\beta_0$ is not penalized.
With centered inputs, $\hat\beta_0=\bar y=\frac1N\sum_i y_i$, and remaining $\beta_j$ can be estimated by a ridge regression without intercept.
The criterion in eq. (2) can be written in matrix form:
$$\begin{aligned}
RSS(\beta,\lambda) = (\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X}\beta) + \lambda \beta^T\beta
\end{aligned}$$
with unique minimizer
$$\tag{3}
\beta^{\text{ridge}}= (\mathbf{X}^T\mathbf{X}+\lambda\mathbb{1})^{-1}\mathbf{X}^T\mathbf{y}$$
Note that $\mathbf{X}^T\mathbf{X}+\lambda\mathbb{1}$ is non-singular for any $\lambda>0$.
Exercise
Prove eq. (3) by differentiating $RSS(\beta,\lambda)$ w.r.t. $\beta_i$.
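Eq. (3) can be illustrated directly: with two nearly collinear predictors, the ridge solution exists for every $\lambda>0$ and its norm shrinks as $\lambda$ grows (numpy sketch, simulated data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two highly correlated predictors: exactly the regime where OLS is unstable.
N = 100
x1 = rng.normal(size=N)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=N)])
y = X @ np.array([1.0, 1.0]) + 0.1 * rng.normal(size=N)

def ridge(X, y, lam):
    """Ridge solution, eq. (3): (X^T X + lam I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# The coefficient norm shrinks monotonically as lambda grows.
norms = [np.linalg.norm(ridge(X, y, lam)) for lam in (0.0, 1.0, 10.0, 100.0)]
assert all(a >= b for a, b in zip(norms, norms[1:]))
```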
Lasso regression
Lasso regression: regularization by shrinkage and subset selection
The lasso regression coefficients minimize a penalized residual sum of squares:
$$\tag{4}
\beta^{\text{lasso}} = \argmin_\beta \left\{ \sum_{i=1}^N \Bigl(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Bigr)^2 + \textcolor{red}{\lambda\sum_{j=1}^p|\beta_j|} \right\}$$
Lasso regression has no closed-form solution; the solutions are non-linear in the $y_i$.
If $\lambda$ is large enough, some coefficients will be exactly zero.
Exercise
Consider the case with one predictor ($p=1$) and training data
$$\begin{aligned}
\mathbf{x}= \begin{pmatrix}
x_1\\
x_2\\
\vdots\\
x_N
\end{pmatrix} &&
\mathbf{y}= \begin{pmatrix}
y_1\\
y_2\\
\vdots\\
y_N
\end{pmatrix}
\end{aligned}$$
Assume that the data are standardized (mean zero and standard deviation one), and that $\mathbf{x}^T\mathbf{y}>0$ (positive correlation between input and output).
Write down the loss functions and find analytic solutions for the ordinary least squares, ridge regression, and lasso regression coefficient.
Draw schematically how $\beta^{\text{ridge}}$ and $\beta^{\text{lasso}}$ vary as a function of $\beta^{\text{OLS}}$.
Explain what shrinkage and variable selection mean in this simple example.
How can the exact lasso solution with one predictor be used to construct
an algorithm to solve the general case?
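One possible answer to the last question is coordinate descent: cycle through the predictors, applying the exact one-predictor (soft-thresholding) solution to each coordinate's partial residual. A minimal numpy sketch, assuming standardized predictors and the objective $\frac{1}{2N}\mathrm{RSS}+\lambda\lVert\beta\rVert_1$; this is an illustration, not the glmnet implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

def soft_threshold(z, lam):
    # Exact single-predictor lasso solution: shrink towards zero, clip at zero.
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for standardized predictors (columns with mean 0,
    variance 1): apply the one-predictor solution to each partial residual."""
    N, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]  # residual excluding x_j
            beta[j] = soft_threshold(X[:, j] @ r / N, lam)
    return beta

# Sparse ground truth: only 2 of 10 coefficients are non-zero.
N, p = 200, 10
X = rng.normal(size=(N, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.1 * rng.normal(size=N)
beta = lasso_cd(X, y, lam=0.2)
```

At convergence the null coefficients are exactly zero (variable selection), while the non-zero coefficients are shrunk by roughly $\lambda$ relative to OLS.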
Elastic net regression
Elastic net regression combines ridge and lasso regularization.
In genomic applications, there are often strong correlations among predictor variables (genes operate in molecular pathways).
The lasso penalty is somewhat indifferent to the choice among a set of strong but correlated variables.
The ridge penalty tends to shrink the coefficients of correlated variables towards each other.
The elastic net penalty is a compromise, and has the form
$$\begin{aligned}
\lambda\sum_{j=1}^p \Bigl(\alpha|\beta_j|+\tfrac12(1-\alpha)\beta_j^2\Bigr).
\end{aligned}$$
The second term encourages highly correlated features to be averaged, while the first term encourages a sparse solution in the coefficients of these averaged features.
Generalized linear models
Ridge, lasso, and elastic net regression can also be used to fit generalized linear models when the number of predictors is high.
The most commonly used model is logistic regression, where a binary output $Y$ is predicted from a vector of inputs $X^T=(X_1,\dots,X_p)$ via
$$\begin{aligned}
\log\frac{P(Y=1\mid X)}{P(Y=0\mid X)} &= \beta_0 + \sum_{j=1}^p \beta_j X_j\\
P(Y=1\mid X) = 1 - P(Y=0\mid X) &= \frac1{1+e^{-\beta_0 - \sum_{j=1}^p \beta_j X_j}}
\end{aligned}$$
With training data $(\mathbf{X},\mathbf{y})$, the parameters are estimated by maximizing the penalized log-likelihood:
$$\begin{aligned}
\mathcal{L}(\beta) &= \log \prod_{i=1}^N P(y_i\mid x_{i1},\dots,x_{ip}) - \lambda \sum_{j=1}^p \left(\alpha|\beta_j|+\tfrac12(1-\alpha)\beta_j^2\right)\\
&= \sum_{i=1}^N \log P(y_i\mid x_{i1},\dots,x_{ip}) - \lambda \sum_{j=1}^p \left(\alpha|\beta_j|+\tfrac12(1-\alpha)\beta_j^2\right)
\end{aligned}$$
Generalized linear models can be fitted using an iterative least squares approximation: a quadratic approximation to the log-likelihood is made around the current estimates for $\beta$, and this quadratic approximation is minimized to find the next estimates using the standard linear elastic net solution.
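For intuition, the unpenalized case can be sketched in a few lines of numpy: each Newton / iteratively reweighted least squares (IRLS) step solves a weighted least-squares problem with a "working response" $z$; glmnet replaces this inner solve with the penalized elastic net solve. Simulated data; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated logistic data (no penalty, for clarity of the IRLS step).
N, p = 500, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])  # intercept + inputs
beta_true = np.array([-0.5, 1.0, -2.0])
prob = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = rng.binomial(1, prob)

beta = np.zeros(p)
for _ in range(25):
    mu = 1.0 / (1.0 + np.exp(-X @ beta))          # current fitted probabilities
    W = np.clip(mu * (1.0 - mu), 1e-9, None)      # weights of the quadratic approx.
    z = X @ beta + (y - mu) / W                   # working response
    # Weighted least-squares solve: the "standard linear" inner problem.
    beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
```

Each iteration is an ordinary weighted regression, which is why efficient linear-model code can be reused for the generalized case.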
Exercise
Find an expression for $\mathcal{L}(\beta)$ in the case of logistic regression.
Bayesian regularized regression
flowchart TB
beta -->Y_i
subgraph i=1,...,N
X_i-->Y_i
end
Assignment
CCLE data analysis
We will use data from the original Cancer Cell Line Encyclopedia.
Expression data from GSE36139; use the Series Matrix File GSE36139-GPL15308_series_matrix.txt
Drug sensitivities: Supplementary Table 11 from the original publication; use the “activity area” (actarea) as a response variable.
Clean data:
Remove genes with low standard deviation
Remove compounds with more than 10% zero values
Align gene expression and sensitivity data to a common set of samples
Find, understand, and implement the elastic net regression fitting procedure employed by the authors in the paper’s supplementary methods. Efficient implementations for fitting ridge, lasso, and elastic net regularized regression models are available in most programming languages (R, python, julia, matlab); search for glmnet in your language of choice!
Reproduce Fig 2c and Fig 4a. Do you find the same selected expression features? Note that the paper used both gene expression and mutational features and you used only gene expression features. How could this difference affect the results?
To understand a cell, we must determine the factors that shape its identity:
its position in a taxonomy of cell types,
the progress of multiple time-dependent processes happening simultaneously,
the cell’s response to signals from its local environment,
the location and neighbourhood in which it resides.
Technological advances have enabled genome-wide profiling of RNA, DNA, protein, epigenetic modifications, chromatin accessibility, and other molecular events in single cells.
Cellular identity
Diverse factors combine to create a cell’s unique identity.
Technical variation:
Due to technical factors in the data-generating process.
Main axes of variation in scRNA-seq often dominated by technical factors.
False negatives (expressed but undetected) and false positives (overamplification) are due to minute quantities of starting RNA material and corresponding need for massive amplification.
Allele-intrinsic variation:
Stochastic factors intrinsic to the molecular mechanisms that control gene expression.
Does not correlate between two alleles of the same gene.
Allele-extrinsic variation:
Factors extrinsic to the process of transcription.
Contribute to establishing differences between cell types or states, either stably or transiently.
Correlated between two alleles of the same gene.
Most studies aim to understand allele-extrinsic variation and its function, while technical and allele-intrinsic variations are major confounders.
Technical confounders of single-cell RNA-seq
Technical confounders of single-cell RNA-seq and computational methods to handle them.
False-positive detections due to amplification occur at early PCR amplification cycles.
Can be addressed using random molecular tags (RMTs), also called unique molecular identifiers (UMIs):
Short barcode attached to 3’ or 5’ end of cDNA molecules before amplification.
After amplification, the number of unique barcodes associated with reads aligned to a gene/transcript, rather than the number of aligned reads, serves as the gene/transcript abundance.
Beware:
Sequencing errors will create RMT sequences that are close to, but not identical to, the original barcode; these need to be collapsed.
“Barcode collision” – two or more copies of a transcript may receive the same barcode, making them appear as a single copy.
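The collapsing step can be sketched with a toy greedy rule: merge each barcode into a more abundant barcode at Hamming distance 1 (a pure-python illustration; production tools such as UMI-tools use more careful directional network methods):

```python
from collections import Counter

def hamming(a, b):
    # Number of mismatching positions between two equal-length barcodes.
    return sum(x != y for x, y in zip(a, b))

def collapse_umis(umi_counts):
    """Greedily merge each UMI into a more abundant UMI at Hamming
    distance 1 (a likely sequencing error), most abundant first."""
    kept = {}
    for umi, count in sorted(umi_counts.items(), key=lambda kv: -kv[1]):
        parent = next((k for k in kept if hamming(k, umi) == 1), None)
        if parent is not None:
            kept[parent] += count  # absorb the presumed error barcode
        else:
            kept[umi] = count
    return kept

# 'AACG' is one mismatch away from abundant 'AACT': treated as its error.
counts = Counter({"AACT": 90, "AACG": 2, "GGTC": 50})
collapsed = collapse_umis(counts)
# Two distinct molecules are reported instead of three raw barcodes.
```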
Addressing false negatives
Limited efficiency of RNA capture and conversion to cDNA leads to dropout events: transcripts that are expressed in the cell but undetected in its mRNA profile.
Zero-inflated models:
Gene expression is modelled as a mixture of two distributions:
one in which it is successfully amplified and detected at a level that correlates with its true abundance,
one in which it is undetected because of technical effects.
(Hidden) component labels can be inferred using EM.
Only cells belonging to the “success” component are used for downstream analysis, e.g. differential expression analysis between subpopulations of cells.
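As an illustration, EM for a zero-inflated Poisson mixture (a deliberately simple stand-in: models used in practice for scRNA-seq are typically zero-inflated negative binomial or log-normal):

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulate a zero-inflated Poisson: dropout with prob 0.3, else Poisson(5).
N, pi_true, lam_true = 2000, 0.3, 5.0
dropout = rng.random(N) < pi_true
y = np.where(dropout, 0, rng.poisson(lam_true, size=N))

pi, lam = 0.5, 1.0  # crude initial guesses
for _ in range(100):
    # E-step: posterior probability that each zero is a technical dropout
    # (non-zero counts cannot be dropouts, so their label is 0).
    r = np.zeros(N)
    zero = y == 0
    r[zero] = pi / (pi + (1 - pi) * np.exp(-lam))
    # M-step: update mixture weight and Poisson mean of the "success" part.
    pi = r.mean()
    lam = ((1 - r) * y).sum() / (1 - r).sum()
```

The inferred responsibilities `r` are the hidden component labels: cells with low `r` belong to the "success" component and can be kept for downstream analysis.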
Revealing the vectors of cellular identity
Single-cell genomics is a means to identify and understand the factors that jointly define a cell’s identity:
Discrete cell types and subtypes
Continuous phenotypic spectra
Dynamic processes
Spatial location
Distinguishing discrete cell types and subtypes
Single-cell analysis can detect cellular subtypes that cannot be defined by a handful of markers, or for which markers are not yet known.
Classification of cells into discrete subtypes is a problem in unsupervised clustering.
Key challenges:
Exponentially increasing scale of single-cell data.
Ensuring reproducible classification.
Finding the proper granularity and detecting hierarchies of cell types and subtypes, especially when cell type frequency varies by multiple orders of magnitude.
Distilling molecular markers and signatures to characterize cell types.
Matching the resulting classes to known cell types.
Visualizing, sharing, and comparing classifications.
Continuous phenotypic spectra
Sometimes more appropriate to speak of a continuous phenotypic spectrum within a type than of a discrete cell type. e.g. a cell’s ability to trigger an immune response.
Continuous facets can be characterized by combining dimensionality reduction (e.g. PCA) with enrichment for functional annotation.
Prior knowledge can be used to interpret the biological relevance of the data’s main axes of variation.
Mapping dynamic processes
Cells undergo dynamic transitions, incl. short-term responses to environmental signals, cell differentiation, oscillations, etc.
Single-cell genomics provides a snapshot of the entire dynamic process: the set of single cells captured at any time point will stochastically contain cells positioned at different instantaneous time points along the temporal trajectory.
Temporal ordering can be recovered by creating a graph that connects cells by their profile similarity and finding an optimal path on this graph.
This introduces the notion of “pseudo-time”.
Inference of spatial location
Most high-throughput single-cell approaches require dissociation of tissues and therefore lose spatial information.
Spatial information can be obtained for a limited number of “landmark” genes.
Computational methods can combine limited spatial cues for landmark genes with single-cell RNA-seq from the same type of tissue.
Limited number of landmarks expressed at any spatial position and dropouts in single-cell RNA-seq can severely compromise the mapping.
Requires tissues with “canonical” structure (e.g. developing embryo, most normal tissues), which is faithfully reproduced in replicate samples.
Tipping, Michael E., and Christopher M. Bishop. “Probabilistic principal component analysis.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61.3 (1999): 611-622.
Dimensionality reduction
Assume we have a set of $N$ observations (e.g. cells) of $p$ variables (e.g. genes).
Dimensionality reduction consists of representing the observations in a lower-dimensional space with dimension $q\ll p$ while preserving some characteristics (e.g. relative distance) of the original $p$-dimensional representation.
For visualizing relationships among observations, we typically use $q=2$.
Two-dimensional visualization of single-cell data.
Assume we have a set of $N$ observations $x_1,\dots,x_N\in \mathbb{R}^p$ of $p$ variables, collected in the $N\times p$ matrix $\mathbf{X}$.
The principal components of $\mathbf{X}$ provide a sequence of best linear approximations to $\mathbf{X}$, representing $\mathbf{X}$ by a line ($q=1$), plane ($q=2$), etc. in $p$-dimensional space.
Rank-1 and rank-2 approximations of a set of data.
The best rank-1 linear approximation of a set of data.
The best rank-2 linear approximation of a set of data.
The rank-$q$ linear model is
$$f(z) = \mu + \mathbf{V}_q z$$
where $\mu\in\mathbb{R}^p$ is a location vector, $z\in\mathbb{R}^q$ is a (low-dimensional) vector, and $\mathbf{V}_q=(v_1,\dots,v_q)$ is a $p\times q$ matrix of $q$ orthogonal unit vectors.
It can be shown that the optimal solution is found when $\mathbf{V}_q=(v_1,\dots,v_q)$ contains the $q$ eigenvectors with largest eigenvalues of the $p\times p$ matrix $\mathbf{X}^T\mathbf{X}$ (if the data is centred such that $\bar x=0$).
The solution is conveniently expressed using the singular value decomposition of $\mathbf{X}$,
$$\mathbf{X} = \mathbf{U}\Lambda\mathbf{V}^T$$
where $\mathbf{U}$ is an $N\times p$ orthogonal matrix ($\mathbf{U}^T\mathbf{U}=\mathbf{I}_p$) whose columns are called the left singular vectors, $\mathbf{V}$ is a $p\times p$ orthogonal matrix ($\mathbf{V}^T\mathbf{V}=\mathbf{I}_p$) whose columns are called the right singular vectors, and $\Lambda$ is a $p\times p$ diagonal matrix with diagonal elements $\lambda_1\geq \lambda_2 \geq \dots \geq \lambda_p\geq 0$ called the singular values.
The solution $\mathbf{V}_q$ above consists of the first $q$ columns of $\mathbf{V}$.
The columns of $\mathbf{U}\Lambda$ are called the principal components of $\mathbf{X}$. The first $q$ principal components are $\mathbf{U}_q\Lambda_q$. Each principal component is a vector in $\mathbb{R}^N$, that is, it takes a value for each observation. The solution $\hat{z}_i$ above consists of the rows of $\mathbf{U}_q\Lambda_q$, that is, for each observation $i$, $\hat{z}_i$ consists of the values of the first $q$ principal components for observation $i$.
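These relations are easy to verify numerically (an illustrative numpy sketch on simulated data):

```python
import numpy as np

rng = np.random.default_rng(5)

# Centred data matrix: N = 100 observations of p = 5 variables.
N, p = 100, 5
X = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))  # correlated columns
X = X - X.mean(axis=0)

# SVD: X = U Lambda V^T.
U, lam, Vt = np.linalg.svd(X, full_matrices=False)

# Right singular vectors are eigenvectors of X^T X (squared singular
# values are the eigenvalues).
eigval, eigvec = np.linalg.eigh(X.T @ X)
assert np.allclose(np.sort(lam**2), np.sort(eigval))

# Principal components: columns of U Lambda; the first q columns give
# the optimal rank-q scores z_i.
q = 2
scores = U[:, :q] * lam[:q]
# Projecting X onto V_q yields the same scores: X V_q = U_q Lambda_q.
assert np.allclose(scores, X @ Vt[:q].T)
```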
Probabilistic PCA
Probabilistic PCA is a continuous generalization of the Gaussian mixture model where we again assume that an observation $X=(x_1,\dots,x_p)$ of $p$ variables is generated by a latent variable $Z$:
flowchart LR
Z --> X
In the Gaussian mixture model, $Z$ is assumed to take discrete values leading to clustered observations. Here we assume that $Z=(z_1,\dots,z_q)$ is a $q$-dimensional continuous variable with distribution
$$
Z\sim \mathcal{N}(0,\mathbf{I})
$$
that is, we assume each $z_i$ is drawn from an independent normal distribution with mean zero and standard deviation 1. The conditional distribution of $X$ given $Z$ is, just like in the discrete mixture model, assumed to be a multivariate normal with mean dependent on the value of $Z$; the covariance matrix is assumed to be a multiple of the identity matrix, meaning that, conditional on the latent factors $Z$, the $p$ observed features $X$ are independent with equal error distributions:
$$X \mid Z=z \;\sim\; \mathcal{N}\bigl(\mu + \mathbf{V}z,\; \sigma^2\mathbf{I}\bigr)$$
where $\mu\in\mathbb{R}^p$ is a mean offset vector, $\mathbf{V}\in\mathbb{R}^{p\times q}$ is a linear map, and $\sigma^2>0$ a common variance parameter for all $p$ dimensions. Note the similarity between this model and the rank-$q$ linear model of standard PCA above.
Following the Gaussian mixture model example, we want to maximize the likelihood of the observed data, which follows the marginal distribution
$$
p(x) = \int dz\; p(x\mid z) p(z)
$$
which can be interpreted as a “continuous” Gaussian mixture distribution. Unlike in the finite mixture model, this integral can in fact be solved using properties of the multivariate normal distribution:
$$
p(x) = \mathcal{N}(\mu,\mathbf{C})
$$
with covariance matrix $\mathbf{C}=\mathbf{V}\mathbf{V}^T+\sigma^2\mathbf{I}$. The corresponding log-likelihood for $N$ observations $x_1,\dots,x_N\in\mathbb{R}^p$ is
$$\mathcal{L} = -\frac{N}{2}\Bigl(p\log(2\pi) + \log\det\mathbf{C} + \operatorname{tr}\bigl(\mathbf{C}^{-1}\mathbf{S}\bigr)\Bigr), \qquad \mathbf{S} = \frac1N\sum_{i=1}^N (x_i-\mu)(x_i-\mu)^T.$$
The maximum-likelihood for $\mu$ is the sample mean, $\hat{\mu}=\overline{x}=\frac{1}{N}\sum_{i=1}^N x_i$, such that $\mathbf{S}$ is the sample covariance matrix. The log-likelihood function measures how much $\mathbf{C}$ differs from the sample covariance and is maximized by making $\mathbf{C}$ as “similar” as possible to it. The solution turns out to be the standard PCA solution where the columns of $\mathbf{V}$ are the $q$ eigenvectors with largest eigenvalues of $\mathbf{S}$. This conclusion is valid only if we assume the independent error model, where features are independent given latent factors, that is, the latent factors explain all the covariance between the features. In summary:
Probabilistic PCA interpretation
If we assume that our high-dimensional observations are generated by a linear combination of low-dimensional latent (=hidden) factors, and the observed features are independent given the latent factors, then the maximum-likelihood solution for the map between latent and observed space is PCA.
Note the important benefit of the probabilistic approach. In standard PCA, we simply fit a linear function to a set of observations, without giving any thought to the deviations (errors) between the (noisy) observations and the fitted model. In probabilistic PCA, as in all generative models, we must be explicit about our assumptions, and standard PCA is only recovered if we assume that features are independent given the latent factors.
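The closed-form maximum-likelihood solution of Tipping & Bishop can be checked numerically: $\sigma^2$ is the average of the discarded eigenvalues of $\mathbf{S}$, and $\mathbf{V}$ is built from the leading eigenvectors (a sketch; the arbitrary rotation in the general solution is taken to be the identity):

```python
import numpy as np

rng = np.random.default_rng(6)

# Generate data from the PPCA model: X = V Z + noise (mu = 0).
N, p, q = 1000, 6, 2
V_true = rng.normal(size=(p, q))
Z = rng.normal(size=(N, q))
X = Z @ V_true.T + 0.5 * rng.normal(size=(N, p))  # true sigma = 0.5

# Sample covariance and its eigendecomposition (descending order).
S = np.cov(X, rowvar=False, bias=True)
eigval, eigvec = np.linalg.eigh(S)
eigval, eigvec = eigval[::-1], eigvec[:, ::-1]

# Closed-form ML solution: sigma^2 is the average discarded eigenvalue;
# V spans the q leading eigenvectors, rescaled by (eigval - sigma^2)^{1/2}.
sigma2 = eigval[q:].mean()
V_ml = eigvec[:, :q] * np.sqrt(eigval[:q] - sigma2)

# The implied model covariance C = V V^T + sigma^2 I approximates S.
C = V_ml @ V_ml.T + sigma2 * np.eye(p)
```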
4.3 - t-SNE and UMAP
T-SNE, UMAP.
Reference
[TSNE] L.J.P. van der Maaten and G.E. Hinton. Visualizing Data Using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605, 2008.
Linear dimensionality reduction methods (like PCA) may not be able to retain both the local and global structure of the data in a single map.
t-SNE is a non-linear method that aims to overcome this limitation.
t-SNE is particularly successful at visualizing high-dimensional data (reducing it to two dimensions).
Care must be taken when applying t-SNE due to random initialization of the algorithm, which may lead to solutions that do not represent the global structure of the data well.
(Symmetric) Stochastic Neighbor Embedding (SNE)
As before, assume we have a set of $N$ observations $x_1,\dots,x_N\in \mathbb{R}^p$ of $p$ variables, collected in the $N\times p$ matrix $\mathbf{X}$.
SNE introduces a directional similarity of $x_j$ to $x_i$,
$$p_{j\mid i} = \frac{\exp\bigl(-\lVert x_i-x_j\rVert^2/2\sigma_i^2\bigr)}{\sum_{k\neq i}\exp\bigl(-\lVert x_i-x_k\rVert^2/2\sigma_i^2\bigr)},$$
symmetrized across pairs as $p_{ij} = (p_{j\mid i}+p_{i\mid j})/2N$.
(Symmetric) Stochastic Neighbor Embedding (SNE) finds the map ${\cal Y}$ that minimizes the mismatch between $p_{ij}$ and $q_{ij}$, across all pairs $i$ and $j$.
The mismatch cost function is given by the Kullback-Leibler divergence
$$C = \operatorname{KL}(P\,\Vert\, Q) = \sum_{i\neq j} p_{ij}\log\frac{p_{ij}}{q_{ij}},$$
with $C=0$ if, and only if, $q_{ij}=p_{ij}$ for all $i$ and $j$.
SNE suffers from a “crowding problem”
The volume of a sphere with radius $r$ in $p$ dimensions scales as $r^p$.
Consider a cluster of datapoints that are approximately uniformly distributed in a sphere of radius $r$.
If $p\gg 2$, then the available 2-dimensional area to model the distances in this cluster accurately will be too small compared to the total area available, forcing clusters of nearby points to crush together.
Heavy-tails in the low-dimensional space resolve the crowding problem
A heavy-tail distribution in the low-dimensional space allows a moderate distance in the high-dimensional space to be faithfully modelled by a much larger distance in the map.
In t-SNE a Student t-distribution with 1 d.o.f. (a Cauchy distribution) is used as the heavy-tailed distribution in the low-dimensional map:
$$q_{ij} = \frac{\bigl(1+\lVert y_i-y_j\rVert^2\bigr)^{-1}}{\sum_{k\neq l}\bigl(1+\lVert y_k-y_l\rVert^2\bigr)^{-1}}.$$
The Cauchy distribution offers additional advantages in the numerical optimization of the cost function.
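A minimal numpy sketch of the two similarity kernels and the KL cost (a fixed bandwidth $\sigma$ is assumed for all points; real t-SNE tunes $\sigma_i$ per point via the perplexity):

```python
import numpy as np

rng = np.random.default_rng(7)

def gaussian_p(X, sigma=1.0):
    """Input similarities with a fixed bandwidth (already symmetric here;
    t-SNE tunes a per-point sigma_i and symmetrizes explicitly)."""
    D = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    P = np.exp(-D / (2 * sigma**2))
    np.fill_diagonal(P, 0.0)
    return P / P.sum()

def cauchy_q(Y):
    """t-SNE map similarities: normalized Student-t (Cauchy) kernel."""
    D = np.sum((Y[:, None] - Y[None, :]) ** 2, axis=-1)
    Q = 1.0 / (1.0 + D)
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl_cost(P, Q):
    # Kullback-Leibler mismatch between input and map similarities.
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

X = rng.normal(size=(20, 10))   # high-dimensional points
Y = rng.normal(size=(20, 2))    # a random 2-D map
P, Q = gaussian_p(X), cauchy_q(Y)
assert kl_cost(P, Q) > 0           # random map: positive mismatch
assert abs(kl_cost(P, P)) < 1e-12  # perfect match: zero cost
```

t-SNE itself minimizes this cost over the map coordinates `Y` by gradient descent.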
Gaussian vs. Cauchy distribution
Gaussian ($\sigma=1/\sqrt{2}$) vs Cauchy p.d.f. for $r\geq 0$.
Summary
t-SNE:
puts emphasis on modelling dissimilar datapoints by means of large pairwise distances, and similar datapoints by small pairwise distances,
offers dramatic improvement in finding and preserving local structure in the data, compared to, for instance, PCA.
Optimization of the t-SNE cost function is easier than optimizing
earlier SNE versions, but:
the cost function is non-convex,
several optimization parameters need to be chosen,
the constructed solutions depend on these choices and may be different each time t-SNE is run from an initial random configuration.
Formal theory supporting t-SNE is lacking:
similarity measures and cost function are based on heuristics, appealing to “intuitive” justifications,
there is no formal generative model connecting high- and low-dimensional representations,
the probability semantics used to describe t-SNE is descriptive, rather than foundational.
Uniform Manifold Approximation and Projection (UMAP)
In the words of the authors:
UMAP’s design decisions were all grounded in a solid theoretic foundation and not derived through experimentation with any particular task focused objective function.
The theoretical foundations for UMAP are largely based in manifold theory and topological data analysis.
A purely computational view [of UMAP] fails to shed any light upon the reasoning that underlies the algorithmic decisions made in UMAP.
A computational view of UMAP
From a practical computational perspective, UMAP can be described in terms of construction of, and operations on, weighted graphs:
Graph construction:
Construct a weighted k-neighbour graph
Apply some transform on the edges to represent local distance.
Deal with the inherent asymmetry of the k-neighbour graph.
Graph layout:
Define a cost function that preserves desired characteristics of this k-neighbour graph.
Find a low dimensional representation which optimizes this cost function.
t-SNE and UMAP can both be cast in this form, but with some (subtle)
differences in the edge weighting (dissimilarity measure) and graph
layout (cost function).
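A simplified numpy sketch of this view (the local distance transform and the fuzzy-union symmetrization are stand-ins for UMAP's actual smooth-kNN construction):

```python
import numpy as np

rng = np.random.default_rng(8)

def knn_graph(X, k=5):
    """Step 1: weighted k-neighbour graph from pairwise distances."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(D, np.inf)  # exclude self-edges
    W = np.zeros_like(D)
    for i in range(len(X)):
        nn = np.argsort(D[i])[:k]
        # Step 2: a local distance transform (UMAP subtracts the distance
        # to the nearest neighbour and rescales; simplified here).
        rho = D[i, nn[0]]
        W[i, nn] = np.exp(-(D[i, nn] - rho))
    return W

X = rng.normal(size=(30, 4))
W = knn_graph(X)

# Step 3: symmetrize the directed graph. UMAP uses the probabilistic
# t-conorm (fuzzy union); t-SNE effectively averages the two directions.
W_umap = W + W.T - W * W.T
W_tsne = (W + W.T) / 2
assert np.allclose(W_umap, W_umap.T)
```

The graph-layout step then optimizes low-dimensional coordinates to preserve these edge weights, with the cost function being the main point of divergence between the two methods.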
Write a script to read the SMART-seq data files and merge the data in a single data structure. Your structure will need:
A gene x cell (or cell x gene) read count matrix containing all cells (V1 + ALM). You should have data for 45,768 genes in 25,481 cells.
The unique identifiers of all cells.
Read the cell cluster annotation file and remove all cells that don’t have an annotation from your data. You should now have 23,822 cells remaining.
Remove all cells and all genes whose counts are all near zero. What threshold (number of reads) did Kobak & Berens use to define “near-zero” expression?
Implement the “Feature selection” steps described in the paper Methods:
Compute the fraction of near-zero counts for each gene (using the same threshold as in the previous step) (eq 8 in the paper).
Compute the mean log2 expression over the non-zero values for each gene (eq. 9 in the paper).
Reproduce Supplementary Figure 4 (without the gene names).
Select the relevant genes using eq. 10. You can use the parameter values from Supplementary Figure 4 and don’t have to implement the parameter fitting procedure. How many genes do you retain?
Implement the “Sequencing depth normalization” described in the paper Methods:
Note that sequencing depth normalization is applied by computing the library sizes (total number of reads) for each cell using all genes.
Keep the normalized data only for the genes from the “feature selection” step.
Perform PCA and t-SNE dimensionality reduction:
You may work with the full data if you have sufficient compute resources, otherwise create a small toy dataset where you keep every 25th cell (953 cells in total).
Standardize (mean zero, standard deviation one) the data. Recall the lecture on unsupervised clustering: if we want all features (genes) to have equal influence on the clustering/dimensionality reduction, should we standardize over the gene or cell dimension?
Do PCA and retain the first two principal components.
Apply t-SNE using your choice of software implementation. Generate multiple solutions (easy if you use the toy dataset), similar to Kobak & Berens, Figure 2 (c-f) (you don’t need to use identical settings, but explore some of the input settings for t-SNE).
Generate scatter plot figures:
Make figures for PCA and the various t-SNE results you generated.
Use the colors from the cell cluster annotation file to confirm that your figures look similar to Kobak & Berens, Figure 2. The figures will of course not be identical, especially if you use a toy dataset, but should display similar patterns.
4.4 - Tutorial
Tutorial on dimensionality reduction of single-cell RNA-seq data using t-SNE and PCA.
The GTEx study identified trans-eQTLs that are also cis-eQTLs and asked if the cis-eGene could be the cause of the trans-eQTL association (see fig above), that is, if the following model is supported by the data:
flowchart LR
Z --> X --> Y
where $Z$ is a SNP that is a cis-eQTL for gene $X$ and trans-eQTL for gene $Y$. This model implies that $X$ blocks the path between $Z$ and $Y$, and hence that $Z$ and $Y$ are independent conditional on $X$, in mathematical notation
$$
Z \perp Y \mid X
$$
The principle for testing whether the model $Z\to X \to Y$ is true using the conditional independence criterion is illustrated in the figure below. Assuming linear relations between all variables, three conditions must be met:
The expression levels of $X$ differ significantly between the genotype groups of $Z$ (to confirm the $Z\to X$ association).
The expression levels of $Y$ differ significantly between the genotype groups of $Z$ (to confirm the $Z\to Y$ association).
The residuals of $Y$ after regression on $X$ do not differ significantly between the genotype groups of $Z$ (to confirm that the $Z\to Y$ association is mediated by $X$, and hence that $Z\to X \to Y$ is true).
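The three conditions can be illustrated on simulated data (a numpy sketch; simple correlation thresholds stand in for the formal significance tests used in practice):

```python
import numpy as np

rng = np.random.default_rng(9)

# Simulate Z -> X -> Y with linear effects and Gaussian noise.
N = 2000
Z = rng.binomial(2, 0.3, size=N).astype(float)  # genotype: 0, 1, or 2
X = 1.0 * Z + rng.normal(size=N)
Y = 0.8 * X + rng.normal(size=N)

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

# Conditions 1 and 2: Z associates with both X and Y.
assert abs(corr(Z, X)) > 0.3
assert abs(corr(Z, Y)) > 0.2

# Condition 3: residuals of Y after regression on X no longer associate
# with Z, consistent with full mediation (Z -> X -> Y).
b = np.cov(X, Y, bias=True)[0, 1] / np.var(X)
resid = Y - b * X
assert abs(corr(Z, resid)) < 0.1
```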
The mediation method fails if $X$ and $Y$ are affected by common cause $U$ (which may be an unknown or hidden variable):
flowchart LR
Z --> X --> Y
U --> X & Y
In this case, conditioning on $X$ opens the collider $Z \to X \leftarrow U$, inducing an association between $Z$ and $U$ (and hence, via $U\to Y$, between $Z$ and $Y$), such that the residuals of $Y$ will still show a difference between the genotype groups of $Z$, and the mediation method concludes (wrongly!) that the $Z\to Y$ association must be due to another factor than $X$ (no causal $X\to Y$ relation).
Instrumental variable analysis, known in genetics as Mendelian randomization (MR), is an alternative causal inference approach that is not affected by hidden confounders $U$, but has subtly different underlying assumptions.
Specifically, in MR we assume that the $Z\to Y$ association must be due to $X$ (for instance because $X$ is the only cis-eGene of $Z$, and trans-eQTL associations must be mediated by some initial cis effects), and we seek to estimate the magnitude of the causal effect of $X$ on $Y$.
$$
\begin{aligned}
X &= a Z + c_X U + E_X\\
Y &= b X + c_Y U + E_Y
\end{aligned}
$$
where $E_X$ and $E_Y$ are error terms, mutually independent and independent of $Z$ and $U$. Since $Z$ and $U$ are assumed to be independent (no arrows connecting them in the diagram), it follows that
$$\frac{\operatorname{Cov}(Z,Y)}{\operatorname{Cov}(Z,X)} = \frac{ab\operatorname{Var}(Z)}{a\operatorname{Var}(Z)} = b,$$
that is, the ratio of the two eQTL effects estimates the causal effect of $X$ on $Y$, unaffected by the hidden confounder $U$.
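A small simulation illustrates why the instrumental-variable ratio $\operatorname{Cov}(Z,Y)/\operatorname{Cov}(Z,X)$ is preferable to naively regressing $Y$ on $X$ when a hidden confounder $U$ is present (numpy sketch; all effect sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(10)

# Structural model with a hidden confounder U of X and Y.
N, a, b = 5000, 1.0, 0.5
Z = rng.binomial(2, 0.3, size=N).astype(float)  # genotype instrument
U = rng.normal(size=N)                          # hidden confounder
X = a * Z + 1.0 * U + rng.normal(size=N)
Y = b * X + 1.0 * U + rng.normal(size=N)

# Naive regression of Y on X is biased upwards by the confounder.
b_naive = np.cov(X, Y, bias=True)[0, 1] / np.var(X)

# Instrumental-variable (MR) ratio estimate: Cov(Z,Y) / Cov(Z,X) = b.
b_mr = np.cov(Z, Y, bias=True)[0, 1] / np.cov(Z, X, bias=True)[0, 1]

assert abs(b_mr - b) < abs(b_naive - b)  # MR is closer to the truth
```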
In this tutorial we consider two continuous and correlated variables, $X$ and $Y$, representing the expression levels of two genes. We also consider a discrete variable $Z$ representing a genotype for gene $X$. Typically, $Z$ will have been obtained by eQTL mapping for gene $X$. We wish to determine if variation in $X$ causes variation in $Y$.
The aim of causal model selection is: given observational data for $Z$, $X$, and $Y$ in a set of independent samples, which causal model (represented by a directed acyclic graph) explains the data best.
To restrict the space of possible models that need to be considered, a number of assumptions reflecting biological knowledge can be made:
Genetic variation influences variation in gene expression (and phenotypes more generally), but changing the value of an individual’s phenotype does not change their genome. Hence, in our models there can be no incoming arrows into $Z$.
We assume that the statistical association between $Z$ and $X$ is due to a direct effect, that is, all causal models we consider must contain the directed edge $Z\to X$. This assumption is justified if $Z$ is located within or near to $X$ (on the genome) and in a known regulatory region for $X$.
For $X$ and $Y$ to be correlated (non-independent), there must be a path in the graph between them; more precisely, $X$ and $Y$ must not be d-separated.
By the laws of Mendelian inheritance (particularly the random segregation of alleles), we may assume that $Z$ is independent of any unobserved confounding factors $U$ that cause $X$ and $Y$ to be correlated, and therefore there are no edges between $Z$ and any unobserved $U$.
If we assume that there are no unobserved factors, there are 5 possible models satisfying assumptions 1-4 (see figure below). If we allow for the presence of unobserved factors, we have the same 5 models with an additional unobserved $U$ having a causal effect on $X$ and $Y$, plus one more model without an edge between $X$ and $Y$ where all of their correlation is explained by $U$ (see figure).
Below each model in the figure, some of the key conditional independences implied by the model are shown, using the mathematical notation $⫫$ for “is independent of”, $∣$ for “conditional on”, $∧$ for “and”, and $¬$ for “it is not the case that”.
From this figure it is immediately clear that our aim of deciding whether $X$ causes $Y$ is unachievable. Even without any unobserved factors, models 4a and 5a are Markov equivalent: they entail the same conditional independences and are indistinguishable using observational data. In other words, there exists no condition that is both necessary and sufficient for $X$ to cause $Y$ given the above 4 assumptions.
There are two possible paths forward (apart from giving up): to use a sufficient condition or a necessary condition for causality. If the joint probability distribution of $Z$, $X$, and $Y$ passes a sufficient condition, then it is guaranteed to have been generated by a model where $X$ causes $Y$, but there may be distributions with $X\to Y$ that don’t pass the test (false negatives). Conversely, all joint distributions generated by models with $X\to Y$ will pass a necessary condition, but there may be distributions generated by models where $X$ does not cause $Y$ that also pass the test (false positives).
Because estimating the joint distribution of $Z$, $X$, and $Y$ and its conditional independences from a finite number of samples is itself an imperfect process that can only ever approximate the true distribution, having additional errors from using an imperfect causality test is not necessarily catastrophic, provided those errors are small enough.
A sufficient condition for $X\to Y$
Our sufficient condition for $X\to Y$ is based on model 1a. This model implies that $Z$ and $Y$ are not independent, but $Z$ is independent of $Y$ conditioned on $X$ ($X$ is a mediator for the causal path from $Z$ to $Y$). No other model satisfies those two relations:
In models 2a, 2b, and 6b, $Z$ and $Y$ are independent.
In all other models $Z$ is not independent of $Y$ conditioned on $X$, either because there is a direct path from $Z$ to $Y$ not passing through $X$, or because conditioning on $X$ opens a path from $Z$ to $Y$ via the confounder $U$ (due to the v-structure or collider at $X$).
Hence any joint distribution in which $Z$ and $Y$ are not independent, but $Z$ is independent of $Y$ conditioned on $X$, must have been generated by model 1a, that is, by a model where $X\to Y$. In mathematical notation:
$$
¬(Z⫫Y) ∧ (Z⫫Y∣X) ⇒ (X→Y)
$$
A necessary condition for $X\to Y$
Because all models contain the edge $Z→X$, it follows that if $X→Y$, then $Z$ and $Y$ cannot be independent, providing a simple necessary condition for $X→Y$. However, $Z$ and $Y$ are also not independent in models 3a-b and 5a-b, in which $Y→X$, because of the direct edge $Z→Y$. Of these, only 3a can be excluded, because in 3a, $X⫫Y∣Z$, a condition not satisfied in any model where $X→Y$. Combining these two results, we obtain
$$
(X→Y) ⇒ ¬(Z⫫Y) ∧ ¬(X⫫Y∣Z)
$$
Import packages
In this tutorial, we will first set up some code to generate and visualize simulated data for $Z$, $X$, and $Y$ given one of the models in the figure above. Then we will implement the sufficient and necessary conditions for $X→Y$ and assess how well they perform in a number of scenarios. The code is written in Julia; it should be fairly straightforward to translate into other languages.
using Random, Distributions, GLM, DataFrames
First we set a random seed and fix the number of samples:
Random.seed!(123)
N = 200;
In our simulations, as in real biology, a genotype value is sampled first. Then expression values for the two genes are sampled conditional on the genotype value.
Genotype data simulation
We simulate the genotype of a bi-allelic (2 values), diploid (2 copies) polymorphism. We encode the major and minor alleles by the values 0 and 1, respectively. The genotype is sampled by defining the minor allele frequency (MAF) as a parameter of the simulation. Two haploids are sampled using a Bernoulli distribution and summed to give the genotype, that is, the genotype is the number of copies of the minor allele in an individual.
Because the mean of a Bernoulli distribution with probability of success $p$ is $p$, the mean of $Z$ is 2 times the minor allele frequency, and we therefore subtract this value from the sampled genotypes to obtain samples from a random variable $Z$ with mean zero. Note that we cannot center the actual sampled values at this point, because we still need to generate the expression data conditional on $Z$, and the expression levels of an individual cannot depend on the sample mean of the genotypes in a population!
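The genotype simulation can be mirrored in Python (the tutorial's own code is in Julia); `sample_genotype` is a hypothetical helper name, and, as explained above, we subtract the theoretical mean $2\cdot\mathrm{MAF}$ rather than the sample mean:

```python
import numpy as np

rng = np.random.default_rng(123)

def sample_genotype(N, maf=0.3):
    """Sample N diploid genotypes as the sum of two Bernoulli(maf) haploids,
    centered at the theoretical mean 2 * maf (not the sample mean)."""
    haploids = rng.binomial(1, maf, size=(N, 2))
    genotype = haploids.sum(axis=1)   # 0, 1, or 2 copies of the minor allele
    return genotype - 2 * maf

Z = sample_genotype(200, maf=0.3)
```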
To simulate data for $X$ and $Y$, we must first set up the structural equations for the causal model we want to simulate. We will assume linear models with additive Gaussian noise throughout.
Let’s start with models 1a, 1b, and 6b. Their structural equations can be written as
$$
\begin{align}
X &= a Z + U_X\\
Y &= b X + U_Y
\end{align}
$$
where $U_X$ and $U_Y$ are normally distributed errors with joint distribution
$$
\begin{pmatrix}
U_X\\
U_Y
\end{pmatrix}
\sim \mathcal{N}\left(
\begin{pmatrix}
0\\
0
\end{pmatrix},
\begin{pmatrix}
1 & \rho\\
\rho & 1
\end{pmatrix}
\right)
$$
In model 1a, the errors are uncorrelated, $\rho=0$, and we arbitrarily set the errors to have unit variance. In model 1b, the unobserved confounder $U$ has the effect of correlating the errors, that is $0<\rho<1$. In model 6b, the errors are also correlated, $0<\rho<1$, but there is no direct effect of $X$ on $Y$, that is $b=0$.
The parameters $a$ and $b$ are the causal effect sizes, and their magnitudes should be interpreted relative to the unit standard deviation of the random errors. In other words, each additional copy of the minor allele shifts the mean of $X$ by $a$ standard deviations of the random errors.
Given a value $Z=z$, eqs. (1)–(2) can be rewritten in matrix-vector notation as
$$
\begin{pmatrix}
X\\
Y
\end{pmatrix}
=
\begin{pmatrix}
az \\
abz
\end{pmatrix} +
\begin{pmatrix}
1 & 0\\
b & 1
\end{pmatrix}
\begin{pmatrix}
U_X\\
U_Y
\end{pmatrix}
$$
Hence, given a sampled genotype value $Z$, we can sample values for $X$ and $Y$ by sampling from the multivariate normal distribution (3). In fact, Julia’s Distributions package knows how to apply affine transformations to distributions, so we only need to specify the distribution of the random errors and the affine matrix-vector transformation:
function sample_XcauseY(Z, a=1.0, b=0.8, rho=0.0)
    # number of samples
    N = length(Z)
    # covariance matrix for the unmodelled factors
    covU = [1.0 rho; rho 1.0]
    # distribution of the unmodelled factors
    distU = MvNormal(covU)
    # the covariance between X and Y due to direct and unmodelled effects
    B = [1. 0.; b 1.]
    distXY = B * distU
    # sample XY expression levels from model 1a, 1b, or 6b,
    # depending on the values of rho and b
    XY = [a*Z a*b*Z] + rand(distXY, N)'
    return XY
end;
Your turn: Can you derive and implement the structural equations for the other models?
The function to sample from models 3a or 3b is as follows:
function sample_GcauseXY(Z, a=1.0, b=0.8, rho=0.0)
    # number of samples
    N = length(Z)
    # covariance matrix for the unmodelled factors
    covU = [1.0 rho; rho 1.0]
    # distribution of the unmodelled factors
    distU = MvNormal(covU)
    # sample XY expression levels from model 3a or 3b, depending on the value of rho
    XY = [a*Z b*Z] + rand(distU, N)'
    return XY
end;
Now sample some data:
XY1a = sample_XcauseY(Z, 1., 0.8, 0.)    # sample from model 1a
XY1b = sample_XcauseY(Z, 1., 0.8, 0.4)   # sample from model 1b
XY6b = sample_XcauseY(Z, 1., 0., 0.4);   # sample from model 6b
XY3a = sample_GcauseXY(Z, 1., 0.8, 0.)   # sample from model 3a
XY3b = sample_GcauseXY(Z, 1., 0.8, 0.4); # sample from model 3b
For the sufficient condition we have to test whether $Z$ and $Y$ are dependent and whether this dependence disappears after conditioning on $X$. If we assume linear models, these tests can be performed by standard least squares regression and testing for non-zero coefficients.
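As an aside, the two regression tests just described can also be sketched in Python (the tutorial itself uses Julia). The helper names `coef_pvalue` and `sufficient_condition` are hypothetical, and the no-intercept formulas assume all variables are centered, as in the simulations above:

```python
import numpy as np
from scipy import stats

def coef_pvalue(y, z):
    """Two-sided p-value for the slope in a no-intercept regression of y on z."""
    beta = (z @ y) / (z @ z)
    resid = y - beta * z
    df = len(y) - 1                              # one parameter estimated
    se = np.sqrt((resid @ resid) / df / (z @ z))
    return 2 * stats.t.sf(abs(beta / se), df)

def sufficient_condition(Z, X, Y, alpha=0.05):
    """True if not (Z indep Y) and (Z indep Y given X), at level alpha."""
    dependent = coef_pvalue(Y, Z) < alpha        # test 1: Z and Y dependent
    resid = Y - (X @ Y) / (X @ X) * X            # regress X out of Y
    cond_indep = coef_pvalue(resid, Z) >= alpha  # test 2: residuals indep of Z
    return bool(dependent and cond_indep)
```

Note that the second test treats a non-significant $p$-value as acceptance of conditional independence, which is the pragmatic (if statistically loose) reading used throughout this tutorial.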
Testing data generated by Model 1a
First we test whether $Z$ and $Y$ are dependent by regressing $Y$ on $Z$. Note that by construction $Y$ and $Z$ are samples from random variables with zero mean and hence we don’t include an intercept in the regression:
yg = lm(@formula(Y1a ~ 0 + Z), data)
From the small $p$-value we can reject the null hypothesis of no dependence between $Y$ and $Z$, and hence we conclude that the first test is passed.
Next we test whether $Z$ and $Y$ become independent after conditioning on $X$, that is, after regressing out $X$ from $Y$:
yx = lm(@formula(Y1a ~ 0 + X1a), data)
The residuals of the regression of $Y$ on $X$ can now be tested for association with $Z$:
Clearly the null hypothesis of no association cannot be rejected, and we conclude that the second test has also been passed.
In conclusion, the data satisfies the sufficient condition for causality and we conclude (correctly) that it has been generated by a causal model with an edge $X\to Y$ (a true positive result).
Testing data generated by Model 1b
We repeat the same procedure for the data generated by model 1b:
yg = lm(@formula(Y1b ~ 0 + Z), data)
From the small $p$-value we can again reject the null hypothesis of no dependence between $Y$ and $Z$, and hence we conclude that the first test is passed.
Next we regress out $X$ from $Y$, and test the residuals for association with $Z$:
This time we have to reject the null hypothesis of no association because of the small $p$-value, and we conclude that the second test has not been passed.
In conclusion, the data does not satisfy the sufficient condition for causality and we conclude (wrongly) that it has not been generated by a causal model with an edge $X\to Y$ (a false negative result).
Testing data generated by Model 6b
We repeat the procedure one more time for the data generated by model 6b:
yg = lm(@formula(Y6b ~ 0 + Z), data)
This time we cannot reject the null hypothesis of no dependence between $Y$ and $Z$, and hence we conclude that the first test has not been passed.
We can conclude immediately that the data does not satisfy the sufficient condition for causality and we conclude (correctly) that it has not been generated by a causal model with an edge $X\to Y$ (a true negative result).
Your turn
Repeat the analysis of the sufficient condition for the data generated by models 3a and 3b. What do you conclude, and are those conclusions correct?
Testing the necessary condition for $X\to Y$
For the necessary condition we have to test whether $Z$ and $Y$ are dependent and whether $X$ and $Y$ become independent after conditioning on $Z$. Assuming linear models, these tests can again be performed by least squares regression and testing for non-zero coefficients.
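The necessary condition admits an analogous Python sketch (`necessary_condition` is a hypothetical helper, again assuming centered variables): regress both $X$ and $Y$ on $Z$, then test the residuals for association with each other.

```python
import numpy as np
from scipy import stats

def necessary_condition(Z, X, Y, alpha=0.05):
    """True if not (Z indep Y) and not (X indep Y given Z), at level alpha."""
    _, p1 = stats.pearsonr(Z, Y)       # test 1: Z and Y dependent
    rx = X - (X @ Z) / (Z @ Z) * Z     # residual of X after regressing on Z
    ry = Y - (Y @ Z) / (Z @ Z) * Z     # residual of Y after regressing on Z
    _, p2 = stats.pearsonr(rx, ry)     # test 2: residuals still associated
    return bool(p1 < alpha and p2 < alpha)
```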
Testing data generated by Model 1b
As before, we test whether $Z$ and $Y$ are dependent by regressing $Y$ on $Z$:
yg = lm(@formula(Y1b ~ 0 + Z), data)
From the small $p$-value we can again reject the null hypothesis of no dependence between $Y$ and $Z$, and hence we conclude that the first test is passed.
Next we also regress $X$ on $Z$, and test the residuals of both regressions for association with each other:
From the small $p$-value we can reject the null hypothesis of no dependence between the residuals, and hence we conclude that the second test has also been passed.
In conclusion, the data satisfies the necessary condition for causality and we conclude (correctly) that it has been generated by a causal model with an edge $X\to Y$ (a true positive result).
Testing data generated by Model 3a
We repeat the same procedure for the data generated by model 3a:
yg = lm(@formula(Y3a ~ 0 + Z), data)
From the small $p$-value we can again reject the null hypothesis of no dependence between $Y$ and $Z$, and hence we conclude that the first test is passed.
Next we again regress $X$ on $Z$, and test the residuals of both regressions for association with each other:
This time we cannot reject the null hypothesis of no dependence between the residuals, and hence we conclude that the second test has not been passed.
In conclusion, the data does not satisfy the necessary condition for causality and we conclude (correctly) that it has not been generated by a causal model with an edge $X\to Y$ (a true negative result).
Testing data generated by Model 3b
We repeat the same procedure for the data generated by model 3b:
yg = lm(@formula(Y3b ~ 0 + Z), data)
From the small $p$-value we can again reject the null hypothesis of no dependence between $Y$ and $Z$, and hence we conclude that the first test is passed.
Next we again regress $X$ on $Z$, and test the residuals of both regressions for association with each other:
Because of the small $p$-value we must reject the null hypothesis of no dependence between the residuals, and hence we conclude that the second test has also been passed.
In conclusion, the data satisfies the necessary condition for causality and we conclude (wrongly) that it has been generated by a causal model with an edge $X\to Y$ (a false positive result).
Your turn
Repeat the analysis of the necessary condition for the data generated by models 1a and 6b. What do you conclude, and are those conclusions correct?
Some comments and further reading
Which condition for causality to choose?
Despite their simplicity, these examples show the full spectrum of what to expect from these approaches to causal inference from molecular QTL data. It is important to stress that the false negative and false positive findings are not due to errors in the tests themselves, misspecification of the model (the models used to generate and fit the data are identical), or measurement noise (none was added in the simulations). Instead, it is a basic mathematical truth that a condition for causality that is sufficient but not necessary is prone to false negative predictions, whereas a condition that is necessary but not sufficient is prone to false positive predictions.
Whether to use the sufficient or the necessary condition in concrete applications depends on a number of considerations:
Is it more important to generate only high-confidence predictions before performing expensive experimental validation, or is it more important not to miss any real causal interactions before applying additional filtering steps?
Do you expect unknown confounders to play a major or minor role in your data? For instance, in gene regulatory networks (GRNs), feedforward loops (where a transcription factor (TF) and its target are co-regulated by a second TF) are common, and therefore the necessary condition systematically outperforms the sufficient condition for reconstructing GRNs; see for instance these papers:
How precise are your measurements? The sufficient condition is not only susceptible to hidden confounders, but also to measurement noise, which further increases its false negative rate (see the first paper above for details).
How to quantify statistical significance and uncertainty?
In the examples above, we basically eye-balled the $p$-values and gave a thumbs-up or thumbs-down for a causal relation. Not only is this practically infeasible when testing causality between tens of thousands of pairs in omics data, but we would also like to quantify the statistical significance and uncertainty of each finding. This turns out to be a non-trivial question, because:
Both the sufficient and necessary condition are combinations of statistical tests.
The necessary condition includes a test where accepting the null hypothesis is a positive finding, but a non-significant $p$-value (no evidence for rejecting the null hypothesis) does not imply significant evidence against the alternative hypothesis.
One strand of research has focused on summarizing the outcome of multiple hypothesis tests by a single $p$-value with the usual interpretation that a small $p$-value is evidence for rejecting the null hypothesis of no causal interaction, see these papers:
However, to overcome in particular the second problem above, these methods have to introduce pairwise model comparison statistics almost by stealth. A much better approach (in my not so objective opinion!) is to do this explicitly in a Bayesian framework.
In the papers below, each of the component tests (for instance, $¬(Z⫫Y)$ or $Z⫫Y\mid X$ in the sufficient condition) is expressed as a likelihood-ratio test between two nested, null and alternative, models. Using $q$-values, the likelihood-ratio test statistics are converted into probabilities of the null or alternative model being true. These probabilities can then be combined by the usual rules of probability theory (e.g., multiplied to express that two tests must be true).
In addition, the second paper provides a software implementation that is highly efficient, and can test both the sufficient and the necessary condition for causality (all other papers essentially only study the sufficient condition).
Your turn?
If you are convinced by the above arguments and have some molecular QTL data waiting to be analyzed by causal inference, try our Findr software.
Assignment
6 - Graphical models
Gene regulatory networks. Bayesian networks. Other network inference methods.
In Bayesian networks, the joint distribution over a set $\{X_1,\dots, X_p\}$ of random variables is represented by:
a directed acyclic graph (DAG) $\cal G$ with the variables as vertices,
a set of conditional probability distributions,
$$\begin{aligned}
P\bigl(X_i \mid \{X_j : j\in \mathrm{Pa}_i\}\bigr),
\end{aligned}$$
where $\mathrm{Pa}_i$ is the set of parents of vertex $i$ in $\mathcal{G}$,
such that
$$\begin{aligned}
P(X_1,\dots,X_p) = \prod_{i=1}^p P\bigl(X_i \mid \{X_j : j\in \mathrm{Pa}_i\}\bigr)
\end{aligned}$$
In GRN reconstruction:
$X_i$ represents the expression level of gene $i$,
$P(X_1,\dots,X_p)$ represents the joint distribution from which (independent) experimental samples are drawn,
$\mathcal{G}$ represents the unknown GRN.
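To make the factorization concrete, here is a minimal Python example for a hypothetical three-gene chain $Z \to X \to Y$ with binary expression states; all probability values are made up for illustration:

```python
import numpy as np

# Hypothetical conditional probability tables for the chain Z -> X -> Y
p_z = np.array([0.7, 0.3])               # P(Z)
p_x_given_z = np.array([[0.9, 0.1],      # row z: P(X | Z = z)
                        [0.2, 0.8]])
p_y_given_x = np.array([[0.8, 0.2],      # row x: P(Y | X = x)
                        [0.3, 0.7]])

def joint(z, x, y):
    """Factorized joint: P(Z, X, Y) = P(Z) P(X | Z) P(Y | X)."""
    return p_z[z] * p_x_given_z[z, x] * p_y_given_x[x, y]

# the factorization defines a valid distribution: it sums to one
total = sum(joint(z, x, y) for z in (0, 1) for x in (0, 1) for y in (0, 1))
```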
Assume we have a set of $N$ observations $x_1,\dots,x_N\in \mathbb{R}^p$ of $p$ variables, collected in the $N\times p$ matrix $\mathbf{X}$.
Assume we know $\mathcal{G}$ and the conditional distributions. Then the likelihood of observing the data is:
$$\begin{aligned}
P(\mathbf{X}\mid \mathcal{G}) = \prod_{k=1}^N \prod_{i=1}^p P\bigl(x_{ki} \mid \{x_{kj} : j\in \mathrm{Pa}_i\}\bigr)
\end{aligned}$$
We can now use an iterative algorithm to optimize $\mathcal{G}$ and the conditional distributions:
Start with a random graph $\mathcal{G}$.
Given $\mathcal{G}$, the likelihood decomposes into a product of independent likelihoods, one for each gene, and the conditional distributions can be optimized by standard regression analysis.
In subsequent iterations, randomly add, delete, or reverse edges in $\mathcal{G}$, keeping a change only if the likelihood improves.
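This greedy search can be sketched in Python as a score-based hill climber. Everything below (BIC scoring via per-node Gaussian regressions on centered data, toggling single edges; edge reversal is omitted for brevity) is a generic illustration, not the algorithm of any particular paper:

```python
import itertools
import numpy as np

def acyclic(dag):
    """Check that dag (dict: node -> set of parents) has no directed cycle."""
    in_stack, done = set(), set()
    def visit(v):
        if v in done:
            return True
        if v in in_stack:
            return False
        in_stack.add(v)
        ok = all(visit(u) for u in dag[v])
        done.add(v)
        return ok
    return all(visit(v) for v in dag)

def node_loglik(X, i, parents):
    """Gaussian log-likelihood of column i regressed on its parents
    (no intercept: columns are assumed centered)."""
    y = X[:, i]
    N = len(y)
    if parents:
        A = X[:, sorted(parents)]
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
    else:
        resid = y - y.mean()
    s2 = resid @ resid / N
    return -0.5 * N * (np.log(2 * np.pi * s2) + 1)

def bic_score(X, dag):
    """Decomposable BIC score: per-node log-likelihoods minus a complexity penalty."""
    N, p = X.shape
    k = sum(len(pa) + 1 for pa in dag.values())    # coefficients + variances
    return sum(node_loglik(X, i, dag[i]) for i in range(p)) - 0.5 * k * np.log(N)

def hill_climb(X, max_iter=50):
    """Greedy search: toggle single edges (add/delete) while the score improves."""
    p = X.shape[1]
    dag = {i: set() for i in range(p)}             # start from the empty graph
    score = bic_score(X, dag)
    for _ in range(max_iter):
        best, best_dag = score, None
        for u, v in itertools.permutations(range(p), 2):
            cand = {n: set(pa) for n, pa in dag.items()}
            if u in cand[v]:
                cand[v].discard(u)                 # delete edge u -> v
            else:
                cand[v].add(u)                     # add edge u -> v
            if not acyclic(cand):
                continue
            s = bic_score(X, cand)
            if s > best:
                best, best_dag = s, cand
        if best_dag is None:
            break                                  # local optimum reached
        dag, score = best_dag, best
    return dag
```

Note that with linear Gaussian conditionals, score-equivalent graphs (e.g. a chain and its reversal) cannot be distinguished by the data alone, which is one motivation for the genotype-informed priors discussed below.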
A more formal approach to optimizing $\mathcal{G}$ uses Bayes’ theorem:
$$\begin{aligned}
P(\mathcal{G}\mid \mathbf{X}) = \frac{P(\mathbf{X}\mid \mathcal{G}) P(\mathcal{G})}{P(\mathbf{X})}
\end{aligned}$$
$P(\mathcal{G})$ represents the prior distribution: even without seeing any data, not all graphs need to be equally likely a priori.
$P(\mathbf{X})$ represents the marginal distribution: $P(\mathbf{X}) = \sum_{\mathcal{G}'} P(\mathbf{X}\mid \mathcal{G}') P(\mathcal{G}')$. It is independent of $\mathcal{G}$ and can be ignored.
We can use $P(\mathcal{G})$ to encode evidence for causal interactions from integrating genomics and transcriptomics data. This is the main idea from Zhu et al. (2004).
Assignment
6.3 - Tutorial
Illustration of gene regulatory network reconstruction and evaluation using data from E. coli.
Spatially resolved, highly multiplexed RNA profiling in single cells
Figure 1
MERFISH = multiplexed error-robust FISH (fluorescence in situ hybridization), a single-molecule imaging approach that allows the copy numbers and spatial localizations of thousands of RNA species to be determined in single cells.
Each RNA is labelled with a set of encoding probes, which contain:
targeting sequences that bind the RNA
readout sequences that bind fluorescently labeled readout probes
Each RNA is assigned a binary word in a modified, error-robust Hamming code:
Code words have $N$ bits, where $N$ is the number of fluorescent labels.
Code words have exactly four 1’s. Using the same number of 1’s for every word guarantees the same overall error rate per RNA species even when the $1\to 0$ and $0\to 1$ error rates differ.
Hamming codes can perform 1-bit error correction (see the example on Wikipedia), giving a higher calling rate and a lower misidentification rate at the cost of encoding fewer RNA species with a given number of bits.
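The encoding idea can be illustrated with a toy constant-weight codebook in Python. This greedy construction and nearest-codeword decoder are a simplified stand-in for the actual error-robust code used in MERFISH, not a reimplementation of it:

```python
import itertools
import numpy as np

def build_codebook(n_bits=16, weight=4, min_dist=4):
    """Greedily collect binary words with exactly `weight` ones and pairwise
    Hamming distance >= min_dist (a toy stand-in for MERFISH's encoding)."""
    codebook = []
    for ones in itertools.combinations(range(n_bits), weight):
        word = np.zeros(n_bits, dtype=int)
        word[list(ones)] = 1
        if all(np.sum(word != c) >= min_dist for c in codebook):
            codebook.append(word)
    return np.array(codebook)

def decode(measured, codebook):
    """Assign a measured word to the nearest codeword, correcting single-bit
    errors; return None if no codeword lies within Hamming distance 1."""
    dists = np.sum(codebook != measured, axis=1)
    best = int(np.argmin(dists))
    return best if dists[best] <= 1 else None
```

Because codewords are at least distance 4 apart, a word with one flipped bit is still uniquely closest to its true codeword, while a word with two flipped bits is rejected rather than misassigned.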
MERFISH: a highly multiplexed smFISH approach enabled by combinatorial labeling and error-robust encoding
That is, in a fixed effect model the predictors affect the average or expected value of $Y$. We also assume that the data used to fit the parameters are obtained from independent samples of an underlying model.
If our data consist of non-independent samples (e.g. due to temporal, spatial, or familial relations among the observations), we can use a random effect model to explain these correlations as a function of covariates $X$. As before, we collect the data for the inputs and output in an $N\times p$ matrix $\mathbf{X}$ and an $N$-vector $\mathbf{y}$, respectively,
$$\begin{aligned}
\mathbf{X}= (x_1,x_2,\dots,x_p) = \begin{pmatrix}
x_{11} & x_{12} & \dots & x_{1p}\\
x_{21} & x_{22} & \dots & x_{2p}\\
\vdots & \vdots & & \vdots\\
x_{N1} & x_{N2} & \dots & x_{Np}
\end{pmatrix} &&
\mathbf{y}= \begin{pmatrix}
y_1\\
y_2\\
\vdots\\
y_N
\end{pmatrix}
\end{aligned}$$
The random effects are assumed to be independent of the errors. We also assume that the random effects are mutually independent, but this can easily be generalized.
and we see that $\mathbf{X}$, or more precisely the covariance matrix $\mathbf{X}\mathbf{X}^T$, indeed models correlations among the samples in $\mathbf{y}$.
Naturally, fixed and random effects can be combined into a so-called mixed model.
Models of this kind are often used in genetics, where the random effect covariates $\mathbf{X}$ are genetic markers and their covariance matrix $\mathbf{X}\mathbf{X}^T$ expresses the genetic similarity between individuals in the study.
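A minimal Python sketch of this covariance structure, with made-up marker data and variance components (the $1/p$ normalization of $\mathbf{X}\mathbf{X}^T$ is a common convention for genomic relationship matrices, not something mandated by the text above):

```python
import numpy as np

rng = np.random.default_rng(42)
N, p = 100, 500

# hypothetical genotype matrix (0/1/2 minor allele counts), standardized
X = rng.binomial(2, 0.3, size=(N, p)).astype(float)
X = (X - X.mean(axis=0)) / X.std(axis=0)   # assumes no monomorphic columns

sigma_a2, sigma_e2 = 0.6, 0.4              # variance components (made up)
# covariance of y: genetic similarity plus iid noise
K = sigma_a2 * (X @ X.T) / p + sigma_e2 * np.eye(N)
y = rng.multivariate_normal(np.zeros(N), K)
```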
So far we assumed that all random effects had the same variance $\sigma_a^2$. If we assume that each $\alpha_j$ has a different variance $\sigma_j^2$, we would obtain
is the variance in $\mathbf{y}$ explained by variance component $\mathbf{x}_j$.
Kernel-based variance component models
Since the marginal distribution $p(\mathbf{y}\mid \mathbf{X})$ only depends on the matrix $\mathbf{K}$, an immediate generalization is to define variance component models directly in terms of kernel matrices instead of starting from a linear random effect model, namely to define
$$
p(\mathbf{y}) = \mathcal{N}(0, \mathbf{K})
$$
For instance, if the elements (samples) $y_i$ of $\mathbf{y}$ are obtained at specific locations (e.g. pixels in an image) $\mathbf{x}_i\in \mathbb{R}^2$ (or $\mathbb{R}^p$ more generally), we can define
$$
K_{ij} = k(\mathbf{x}_i,\mathbf{x}_j)
$$
where the kernel function $k$ is a function that measures the similarity or closeness between samples.
Generalizing from before, the kernel matrix $\mathbf{K}$ can consist of multiple components,
In many applications, the “positions” $\mathbf{x}_i$ where samples are obtained are not fixed, but a finite subset of a possibly infinite, continuous range. For instance, when studying a dynamic process, we typically obtain noisy measurements $y_i$ at a finite number of time points $t_i$, and are interested in the entire underlying process $y(t)$ for all times $t$ in some time interval. Similarly, we may have measurements $y_i$ at a finite number of locations $\mathbf{x}_i$, and are interested in the entire function $y(\mathbf{x})$ for all positions $\mathbf{x}$ in some spatial region.
A Gaussian process is a stochastic process, to be understood as a probability distribution over functions $y(\mathbf{x})$ over some continuous domain $D$, such that the set of values of $y(\mathbf{x})$ evaluated at a finite set of points $\mathbf{x}_1,\dots,\mathbf{x}_N\in D$ are jointly normally distributed. A Gaussian process is specified by a kernel function $k(\mathbf{x},\mathbf{x}')$, for $\mathbf{x},\mathbf{x}' \in D$. Writing $\mathbf{y} = (y(\mathbf{x}_1), \dots, y(\mathbf{x}_N))$, the kernel defines the probability distribution
$$
p(\mathbf{y}) = \mathcal{N}(0, \mathbf{K})
$$
where
$$
K_{ij} = k(\mathbf{x}_i,\mathbf{x}_j)
$$
Hence, for the purposes of parameter estimation, the Gaussian process and variance component viewpoints are identical, which is why the terms are often used interchangeably. The main difference lies in the interpretation: the Gaussian process viewpoint sees the data as a finite set of samples of a function on a continuous domain. Hence, in the Gaussian process viewpoint we are particularly interested in interpolation: predicting expected values of the function $y(\mathbf{x})$ at locations $\mathbf{x}$ that were not part of the initial (training) data.
Keeping the notation $\mathbf{y}$ for the function values at the training positions $\mathbf{x}_i$, we know that the joint distribution of $\mathbf{y}$ and the value of $y(\mathbf{x})$ at a new position $\mathbf{x}$ is given by
Generalization to interpolation at multiple new points is immediate.
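Such interpolation can be sketched in Python with a squared-exponential kernel; `gp_interpolate` is a hypothetical helper computing the standard posterior mean $\mathbf{k}_*^\top(\mathbf{K}+\sigma^2 I)^{-1}\mathbf{y}$ of a zero-mean GP, with a made-up length scale and noise level:

```python
import numpy as np

def rbf(x1, x2, length=1.0):
    """Squared-exponential kernel matrix between two sets of 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_interpolate(x_train, y_train, x_new, noise=0.1):
    """Posterior mean of a zero-mean GP at new inputs:
    mu_* = K(x_new, x_train) [K(x_train, x_train) + noise * I]^{-1} y_train."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf(x_new, x_train)
    return Ks @ np.linalg.solve(K, y_train)

# noise-free samples of a smooth function at a few time points
t = np.linspace(0, 2 * np.pi, 20)
y = np.sin(t)
mu = gp_interpolate(t, y, np.array([np.pi / 2]))   # interpolate at t = pi/2
```

The noise term both models measurement error and regularizes the linear solve; with noise-free data the posterior mean passes close to, but not exactly through, the training values.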
Assignment
8 - Appendix
Gene regulation. Probability theory. Linear algebra. Optimization.
8.1 - Gene regulation
Gene regulation.
Reference
From genotype to phenotype
To understand how genetic, epigenetic, and gene expression variation cause variation in health and disease traits in humans, data-driven biology seeks to learn molecular mechanisms from interconnected omics data.
How does information flow from the genotype to the phenotype?
Gene expression determines cellular states
Genes are transcribed into mRNA and translated into protein.
Every step in this process can be regulated by genetic or environmental factors.
The repertoire and relative levels of the proteins expressed determine cellular identity, fate, cell-to-cell communication, etc.
Variation in cellular properties causes variation in phenotypes.
Genetic variation causes phenotype variation by affecting gene expression.
How does genetic variation affect gene expression?
Genetic variation influences gene regulation
Genetic variation directly affects DNA methylation, binding of epigenetic factors and transcription factors, and 3D chromatin organization, leading to altered transcription.
Genetic variation informs on local regulatory mechanisms
Coupling genetic variation data with information on TF binding events allows prediction of the functional effects of motif disruption, and of cooperative and collaborative TF binding mechanisms, in a locus-specific manner.
Genetic variation informs on distal regulatory mechanisms
Genotypes of regulatory variants are fixed from birth and act as experimental instruments to inhibit or promote expression of regulatory factors independent of environmental confounders.
Genes are organized in hierarchical, multi-tissue causal networks
Variation in expression of one gene has downstream consequences on expression of other genes.
Example: Introduction of just 4 TFs (“Yamanaka factors”) converts adult cells into stem cells.
Hundreds to thousands of genes are differentially expressed between cellular states (e.g. healthy vs. disease).
Gene expression in one tissue can affect gene expression in other tissues.
Phenotype variation also causes gene expression variation (“reverse causation”).
Reconstruction of causal pathways and gene networks is essential to understand how the genotype determines the phenotype.
Systems genetics in human requires large data and new analytical methods
Instead of direct experimental manipulation of DNA elements or regulatory factors, the relation between genetic variation, gene regulation, gene expression, and (disease) phenotypes in humans must be studied by:
Statistical sampling of the causes (genotypes) and consequences (gene expression) of gene regulation using genome-wide measurements in unrelated individuals.
Mapping associations between data measured at multiple scales.
Using statistical and machine learning approaches to reconstruct models of molecular mechanisms.