This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Probabilistic and causal modelling of genome-scale data

Machine learning plays an important role in computational biology. See the Machine Learning in Computational and Systems Biology community or the Machine Learning in Computational Biology conference series.

These lecture notes focus on probabilistic machine learning methods for computational biology, where the experimental data are viewed as random samples from an underlying data-generating process.

The “probabilistic modelling” in the title refers to the use of abstract data-generating processes, not based on any specific biological mechanisms, and derived from generic models and methods. A typical example will be clustering using Gaussian mixture models.

To speak of “causal modelling” will require something more, namely that the data-generating process is based on some qualitative prior knowledge or understanding of the true underlying biological process. A typical example will be path analysis.

The notes are divided in chapters, each focusing on a specific class of methods:

  • Clustering
  • Regularized regression
  • Dimensionality reduction
  • Causal inference
  • Graphical models
  • Spatio-Temporal models

Each chapter follows the same structure:

  • A “classic” biological or biomedical research paper is studied where the algorithm (or class of algorithms) of interest was first used. A more recent follow-up or related paper is given as a reading assignment.
  • The method used in the classic paper is presented in detail, along with additional methods to solve the same type of problem. The methods are put in practice in a programming assignment. Where possible, original data from the papers studied in the first part is used.

Four appendices contain the minimum required background knowledge on gene regulation, probability theory, linear algebra, and optimization.

The theoretical sections contain the basic information to understand a method. For more background, try the following textbooks (with free pdfs!), all used in preparation of this course:

The use of classic or path-breaking papers is motivated by Back to the future: education for systems-level biologists. Since the field of genome-scale data analysis is still relatively young, the choice of papers for study is still a bit open and likely to evolve as the course matures.

These lecture are taught as part of the master program in bioinformatics at UiB, making up about half of the BINF301 Genome-scale Algorithms course. As such, good background knowledge on basic bioinformatics and omics data is assumed.

1 - Cluster analysis

Cancer subtypes. Combinatorial clustering. Mixture distributions.

1.1 - Cancer subtype classification

Gene expression profling of tumour samples.

Gene expression profiling predicts clinical outcome of breast cancer

The study by van ’t Veer et al was one of the first to use to microarrays, a brand-new technology at the time, to profile gene expression on a genome-wide scale from surgically removed tumour samples - breast tumours in this case. Another paper from around the same time is: Perou et al. Molecular portraits of human breast tumours. The credit for being the first to using cluster analysis on gene expression data (from yeast) probably goes to Eisen et al. Cluster analysis and display of genome-wide expression patterns.

Van ’t Veer et al clustered data from 98 tumours based on their similarities across approximately 5,000 differentially expressed genes - genes that showed more variation than expected by chance in the dataset. The most striking finding is in their Figure 1: the tumours segregated in two distinct groups that correlated strongly with clinical features, namely:

Overall, tumours in the bottom group of the figure were clearly associated with measures that predict better patient outcome.

Following the discovery that unsupervised clustering of gene expression profiles identifies good and poor prognosis groups, the authors tried to identify a minimal prognostic signature from their data, which resulted in an optimal set of 70 marker genes, which they could validate in an independent set of tumour samples.

Clearly, with such a strong signature, the race to bring it to the clinic is on. That this is far from trivial can be seen by tracing the follow-up studies and clinical trials:

They got there eventually, and the gene expression signature is now commercially available under the name of Mammaprint.

The Cancer Genome Atlas

Although the results by Van ’t Veer et al. were obtained from a small (by current standards!) sample size, they have been reproduced consistenly in larger studies (see the assignment in the next cluster analysis lecture) and arguably spawned a search for similar signatures in other cancer types through large-scale projects, such as The Cancer Genome Atlas (TCGA) Program.

The amount of data and number of publications produced by TCGA is too enormous to survey here.

For the purposes of illustration, have a look at the Pan-Cancer Atlas, and then do the following assignment.

Assignment

1.2 - Combinatorial clustering

Combinatorial clustering algorithms, K-means.

Introduction

The goal of cluster analysis is to group or segment a collection of objects into subsets or “clusters”, such that objects within a cluster are more closely related than objects in different clusters.

Many datasets exhibit a hierarchical structure, with no clear demarcation of clusters, and clusters can themselves be grouped successively such that clusters within one group are more similar than those in different groups. Deciding where to make the “cut” is usually done by setting parameters of a clustering algorithm, and almost always involves an arbitrary choice by the user.

Central to all clustering methods is the notion of the degree of similarity (or dissimilarity) between the objects being clustered. Sometimes data are presented directly in terms of proximity/similarity between objects. More often we have measurements (e.g. gene expression) on objects (e.g. genes or samples) that we want to cluster, and a (dis)similariy matrix must be constructed first.

Dissimilarities based on measurements

Assume we have measurements $x_{ij}$ for objects $i=1,2,\dots,N$ on variables (or attributes) $j=1,2,\dots,p$. Usually, we first define a dissimilarity function $d_j(x_{ij},x_{i’j})$ between values of the $j$th attribute, and then define the dissimilarity between objects as

$$ D(x_i,x_{i’})=\sum_{j=1}^p w_j \cdot d_j(x_{ij},x_{i’j}) $$

with $w_j$ a weight assigned to the $j$th attribute that determines its relative influence in the overall dissimilarity between objects. By convention we scale the weights such that $\sum_j w_j=1$.

Setting $w_j=1$ for all $j$ does not necessarily give all attributes equal influence! To see this, we compute the average object dissimilarity over all pairs of objects as

$$ \begin{aligned} \bar D &= \frac1{N^2}\sum_{i=1}^N \sum_{i’=1}^N D(x_i,x_{i’}) = \sum_{j=1}^p w_j \cdot \bar{d}_j, \end{aligned} $$ with $\bar{d}_j$ defined above. Hence the relative influence of the $j$th attribute is $w_j \cdot \bar{d}_j$.

The most common choice of dissimilarity function is squared-error distance:

$$ d_j(x_{ij},x_{i’j}) = (x_{ij}-x_{i’j})^2 $$

Define the mean and variance of each attribute over all objects as

$$ \begin{aligned} \mu_j &= \frac1N \sum_{i=1}^N x_{ij}\\ \sigma_j^2 &= \frac1N \sum_{i=1}^N (x_{ij}-\mu_j)^2 \end{aligned} $$

Then $$ \begin{aligned} \overline{d_j} &= \frac1{N^2}\sum_{i=1}^N \sum_{i’=1}^N (x_{ij}-x_{i’j})^2\\ &= \frac{1}{N^2}\sum_{i=1}^N \sum_{i’=1}^N \left((x_{ij}-\mu_j) - (x_{i’j}-\mu_j)\right)^2\\ &= \frac1N \sum_{i=1}^N (x_{ij}-\mu_j)^2 + \frac1N \sum_{i=1}^N (x_{i’j}-\mu_j)^2 = 2 \sigma_j^2 \end{aligned} $$

It is often recommended to standardize data before clustering:

$$ x_{ij} \to y_{ij}=\frac{x_{ij}-\mu_j}{\sigma_j} $$

With squared-error loss, this is equivalent to setting weights $w_j \sim 1/\sigma_j^2 \sim 1/\bar{d}_j$, that is, to give all attributes equal influence on the average object dissimilarity.

Beware that sometimes some attributes exhibit more grouping tendency than others, which may be obscured by standardizing.

Combinatorial clustering

Combinatorial clustering algorithms assign each object to a cluster without regard to a probability model describing the data. Understanding combinatorial clustering is a necessary basis for understanding probabilistic methods.

In combinatorial clustering, a prespecified number of clusters $K<N$ is postulated ($N$ the number of objects). An assignment of objects $i\in{1,\dots,N}$ to clusters $k\in{1,\dots,K}$ is charcterized by a many-to-one mapping or encoder $k=C(i)$.

$C$ is obtained by minizing the “within cluster” point scatter:

$$ W(C) = \frac12 \sum_{k=1}^K \sum_{C(i)=k} \sum_{C(i’)=k} d(x_i,x_{i’}) $$

$W(C)$ characterizes the extent to which objects assigned to the same cluster tend to be close to one another. Notice that:

$$ \begin{aligned} T &= \frac12 \sum_{i=1}^N \sum_{i’=1}^N d_{ii’} \\ &= \frac12 \sum_{k=1}^K \sum_{C(i)=k} \left(\sum_{C(i’)=k} d_{ii’} + \sum_{C(i’)\neq k} d_{ii’}\right) \\ &= W(C) + B(C) \end{aligned} $$ where $d_{ii’} = d(x_i,x_{i’})$ and

$$ B(C) = \frac12 \sum_{k=1}^K \sum_{C(i)=k} \sum_{C(i’)\neq k} d_{ii’} $$
is the “between cluster” point scatter.

$B(C)$ characterizes the extent to which objects assigned to different clusters tend to be far apart.

Since $T$ is constant given the data, minimizing $W(C)$ is equivalent to maximizing $B(C)$.

$K$-means clustering

The $K$-means algorithm uses the squared Euclidean distance $$ d(x_i,x_{i’}) = \sum_{j=1}^p (x_{ij}-x_{i’j})^2 = \| x_i - x_{i’}\|^2 $$ and an iterative greedy descent algorithm to minimize $W(C)$.

Using the Euclidean distance expression, $W(C)$ can be written as $$ W(C) = \sum_{k=1}^K N_k \sum_{C(i)=k} \| x_i - \overline{x_k}\|^2 $$ where $N_k$ is the number of objects assigned to cluster $k$, and $\overline{x_k}=(\overline{x_{1k}},\dots,\overline{x_{pk}})$ is the mean vector associated to cluster $k$.

$W(C)$ is minimized if within each cluster, the average dissimilarity of the objects from the cluster mean, as defined by the points in that cluster, is minimized.

Note that for any set of objects $S$, $$ \overline{x_S} = \frac{1}{|S|} \sum_{i\in S} x_i = \argmin_m \sum_{i\in S}\|x_i-m\|^2 $$

Hence $$ \begin{aligned} \min_C W(C) &= \min_C \sum_{k=1}^K N_k \sum_{C(i)=k} \| x_i - \overline{x_k}\|^2 \\ & = \min_{C} \min_{\{m_k\}}\sum_{k=1}^K N_k \sum_{C(i)=k} \| x_i - m_k\|^2 \end{aligned} $$

This result is used in a greedy descent algorithm where alternatingly the mean vectors are updated for the current cluster assignments, and object assignments are updated by assigning objects to the nearest current mean vector.

How to choose the number of clusters $K$?

Find “kink” in the within-cluster-dissimilarity: Read Elements of Statistical Learning Section 14.3.8, 14.3.11

Assignment

1.3 - Mixture distributions

Mixture distributions

Generative models

A generative model is a statistical model for the process that could have generated your data. Generative models offer many advantages compared to combinatorial algorithms that treat data as a collection of objects. Most importantly, working with a generative model forces you to be explicit about your assumptions. Likewise, a generative model allows you to encode, and be explicit about, prior (biological) knowledge you may have about the data generating process.

Gaussian mixture models

What kind of process could generate data with separated groups or clusters of observations?

image

Let’s assume there is an unmeasured (or hidden) random variable $Z$ that determines to which group an observation $X$ belongs. For simplicity, assume that $Z$ can only take two values, 0 or 1, and that the measurement $X$ is one-dimensional and normally distributed in each group.

Consider the following process:

  1. Randomly sample cluster label $Z=1$ with probability $\pi$ and $Z=0$ with probability $1-\pi$.

  2. Sample features

$$ \begin{aligned} X \sim \begin{cases} \mathcal{N}(\mu_0,\sigma_0^2) & \text{ if } Z=0\ \mathcal{N}(\mu_1,\sigma_1^2) & \text{ if } Z=1 \end{cases} \end{aligned} $$

where $\mu_k$ and $\sigma_k^2$ are the means and variances of two normal distributions.

What data would this process generate?

image

Hence the model generates cluster labels $Z$ and real numbers $x\in\mathbb{R}$ from the model

$$ \begin{aligned} Z &\longrightarrow X\\ p(Z,x) = P(Z) &\times p(x\mid Z) \end{aligned} $$

where we use lower-case “p” for probability density functions ($X$ is continuous) and upper-case “P” for discrete probabilities, and

$$ \begin{aligned} P(Z=1) &= \pi = 1 - P(Z=0) \\ p(x\mid Z=k) &= \mathcal{N}(\mu_k,\sigma_k^2) \end{aligned} $$

Distribution generated by the model:

image

Observed distribution:

image

Some EM language

The joint distribution is the probability distribution of a cluster label $Z$ and feature value $x$ both being produced by the model:

$$ p(Z,x) = p(x\mid Z)\; P(Z) $$

The marginal distribution is th probability distribution that the model produces a feature value $x$:

$$ p(x) = \sum_{k=0,1} p(x\mid Z=k)\; P(Z=k) $$

The responsibility of $Z$ for feature value $x$, also called the recognition distribution, is obtained using Bayes’ theorem

$$ P(Z=k\mid x) = \frac{p(x\mid Z=k) \; P(Z=k)}{p(x)} $$

This value can be used as a soft cluster assignment: with probability $P(Z=k \mid x)$, an observed value $x$ belongs to cluster k.

Note that the expected value of $Z$ given a data point $x$ is:

$$ \begin{aligned} \mathbb{E}\left(Z\mid x\right) &= 1 \cdot P(Z=1 \mid x) + 0\cdot P(Z=0 \mid x) = P(Z=1 \mid x) % &= \frac{\pi p(x\mid \mu_1, \igma_1^2)}{\pi p(x \mid \mu_1, \sigma_1^2) + (1-\pi) p(x \mid \mu_0, \sigma_0^2)} \end{aligned} $$

Maximum-likelihood estimation

To fit the model to the data, we can only use the observed data $x$, which follows the Gaussian mixture distribution

$$ p(x) = \sum_{k=0,1} p(x\mid Z=k)\; P(Z=k) $$

image

The log-likelihood of observing $N$ independent samples $(x_1,\dots,x_N)$ is

$$ \mathcal{L}= \log\left(\prod_{i=1}^N p(x_{i}) \right) = \sum_{i=1}^N \log p(x_{i}) $$

We want to find the best-fitting model by maximizing the log-likelihood.

Directly maximizing the log-likelihood with respect to the paramaters $\pi$, $\mu_k$, and $\sigma_k^2$ is difficult, because:

  • Only the feature values $x$ are observed.
  • The cluster labels $Z$ are hidden, they are latent variables.
  • The log-likelihood is expressed purely in terms of the observable distribution, involves logarithms of sums, and is intractable.

If we knew the cluster labels $k$ for each sample, we could easily fit the parameters $(\pi,\mu_k,\sigma_k^2)$ from the data for each cluster.

If we knew the parameters $(\pi, \mu_k,\sigma_k^2)$, we could easily determine the probability for each data point to belong to each cluster and determine cluster labels.

To get around this catch-22, we replace actual cluster labels by their current expected values given current values for the parameters, and then iterate the above two steps until convergence - this is the Expectation-Maximization (EM) algorithm.

The EM algorithm

  1. Take initial guesses $\hat\pi$, $\hat\mu_k$, $\hat\sigma_k^2$ for the model parameters.
  2. Expectation step: Compute the responsibilities $P(Z_i=k\mid x_i)$ for each data point $x_i$.
  3. Maximization step: Update $\hat\pi$, $\hat\mu_k$, $\hat\sigma_k^2$ by maximizing the log-likelihood using the soft cluster assignments $P(Z_i=k\mid x_i)$.
  4. Iterate steps 2 and 3 until convergence.

What are “soft cluster assignments”?

Consider $N$ samples $(x_1,\dots,x_N)$ from a normal distribution $p(x\mid \mu,\sigma^2)$. The log-likelihood is

$$ \begin{aligned} \mathcal{L}= \sum_{i=1}^N \log \left(\frac1{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(x_i-\mu)^2}\right) = -\frac{N}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^N (x_i-\mu)^2 \end{aligned} $$

$\mathcal{L}$ is maximized for

$$ \begin{aligned} \hat\mu &= \frac{1}{N} \sum_{i=1}^N x_i\\ \hat\sigma^2 &= \frac{1}{N} \sum_{i=1}^N (x_i-\hat\mu)^2 \end{aligned} $$

Now consider $N$ samples $(Z_1,\dots,Z_N)$ and $(x_1,\dots,x_N)$ from the generative model where the cluster labels are also observed. The log-likelihood is

$$ \begin{aligned} \mathcal{L}&= \sum_{i=1}^N \log p(Z_i,x_i)\\ &= \sum_{i=1}^N \Bigl(Z_i \log p(Z_i=1,x_i) + (1-Z_i) \log p(Z_i=0,x_i)\Bigr) \\ &= \sum_{i=1}^N \Bigl(Z_i \log p(x_i\mid \mu_1,\sigma_1^2) + (1-Z_i) \log p(x_i\mid \mu_0,\sigma_0^2)\Bigr) + \sum_{i=1}^N \Bigl(Z_i\log\pi + (1-Z_i) \log(1-\pi)\Bigr) \end{aligned} $$

$\mathcal{L}$ is maximized for

$$ \begin{aligned} \hat\pi &= \frac{N_1}{N}\\ \hat\mu_k &= \frac{1}{N_k} \sum_{Z_i=k} x_i\\ \hat\sigma_k^2 &= \frac{1}{N_k} \sum_{Z_i=k}^N (x_i-\hat\mu_k)^2 \end{aligned} $$

Since the cluster labels are not observed, we don’t know the value of the $Z_i$. The “trick” is to replace them with their expectation values $\mathbb{E}(Z_i=k) = P(Z_i=k\mid x_i)$ in the EM algorithm, using the current estimates for $\hat\pi$, $\hat\mu_k$, $\hat\sigma_k^2$.

This leads to updated estimates

$$ \begin{aligned} \hat\pi^{\text{(new)}} &= \frac{\sum_{i=1}^N P(Z_i=k\mid x_i)}{N}\\ \hat\mu_k^{\text{(new)}} &= \frac{1}{N} \sum_{i=1}^N P(Z_i=k\mid x_i)\; x_i \\ (\hat\sigma_k^2)^{\text{(new)}} &= \frac{1}{N} \sum_{i=1}^N P(Z_i=k\mid x_i)\; (x_i-\hat\mu_k)^2 \end{aligned} $$

Hence, instead of a “hard” assignment of each sample $x_i$ to one cluster $k$ when the $Z_i$ are observed, each sample now contributes with a “soft assignment” weight $P(Z_i=k\mid x_i)$ to the parameters of each cluster.

After convergence, the final $P(Z_i=k\mid x_i)$ can be used to assign data points to clusters, for instance to the cluster $k$ with highest responsibility for $x_i$.

Generalizations

We have so far considered the case where the data are one-dimensional (real numbers) and the number of clusters is pre-fixed. Important generalizations are:

  1. The data can be of any dimension $D$. In higher dimensions the components of the mixture model are multivariate normal distributions. The mean parameters $\mu_k$ simply become $D$-dimensional vectors. The variance parameters $\sigma_k^2$ however become $D\times D$ covariance matrices. For simplicity and to reduce the number of paramaters, it is often assumed that the covariance matrices are diagonal, such that the number of variance parameters again scales linearly in $D$. However, when features are correlated, this assumption will be a poor representation of the underlying data.

  2. Instead of treating the cluster weights, means, and variances as fixed parameters that need to be estimated, they can be seen as random variables themselves with a distribution (uncertainty) that can be learned from the data. For more information, see this tutorial.

  3. In an infinite mixture model, the number of clusters need not be fixed in advance, but can be learned from the data. For more information, see this tutorial.

Assignment

1.4 - Combinatorial clustering tutorial

Illustrating data exploration and K-means clustering using TCGA breast cancer data.

Tutorials are available as Pluto notebooks. To run the reactive notebooks, you need to first follow the instructions in the “Code” section of the repository’s readme file, then follow the instruction at the bottom of the Pluto homepage.

2 - Statistical significance for genome-wide studies

Statistical significance for genome-wide studies. False discovery rate estimation.

2.1 - Statistical significance

Statistical significance for genome-wide studies.

We cover the following parts:

  • Abstract
  • Example 1, differentially expressed genes
  • Table 1
  • What is a p-value?
  • What is the FDR?
  • Derivation that p-values are uniformly distributed under the null hypothesis
  • Figure 1
  • FDR estimation
  • Figure 2
  • What is a q-value?

2.2 - False discovery rate estimation

False discovery rate estimation.

Statistical significance

When we analyze genome-wide data, we are usually interested in identifying statistically significant features or patterns, that is, patterns we would not expect to see purely by chance. For example:

  • Genes differentially expressed between healthy and disease samples, treated and untreated samples, etc.
  • Genes coexpressed across tumour samples

We will use differential expression between two conditions, treated and untreated, as a running example. Of course this applies equally to other types of comparisons, e.g. ER positive vs ER negative breast tumours.

Statistical significance is expressed in terms of comparing a null hypothesis $H_0$ to an alternative hypothesis $H_1$. For instance

$$ \begin{aligned} H_0 &= \text{treatment has no effect on gene expression}\\ H_1 &= \text{treatment does have an effect on gene expression} \end{aligned} $$

To compare gene expression between the two groups, we can for instance do a two-sample $t$-test: compute for each gene

$$ t = \frac{\overline{x_1} - \overline{x_2}}{\text{se}} $$

where $\overline{x_1}$ and $\overline{x_2}$ are the average expression levels of the gene in each group, and $\text{se}$ is the pooled within-group standard error:

$$ \text{se} = \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}} $$

where $s_i^2$ and $n_i$ are the variance and number of samples in group $i$, respectively

$$ \begin{aligned} \overline{x_i} &= \frac{1}{n_i}\sum_{j\in G_i} x_{j} \\ s_i^2 &= \frac{1}{(n_i-1)}\sum_{j\in G_i} (x_i - \overline{x_i})^2 \end{aligned} $$

The $t$-statistic measures the difference in average expression between the two groups, relative to the natural variation within each group. The greater $|t|$, the more likely there is a true difference between the groups. Nevertheless, even if there is no true difference, some variation between the groups will be observed in a finite number of samples due to random sampling noise. The distribution of $t$-values observed between two groups when in fact there is no true difference between them is called the null distribution. Statistical significance is usually expressed by the $p$-value, which, for a given value of $t$, expresses the probability to observe a $t$-statistic greater than $t$ (in absolute value) under the null hypothesis:

$$ \begin{aligned} p = \text{Pr}(|T|\geq |t| \mid H_0) \end{aligned} $$

For some test statistics, the true null distribution to compute the $p$-value is known. Otherwise random permutations can be used: randomly permute the group labels of the samples a large number of times, compute the test statistic for each permutation, and use these values to construct an empirical null distribution.

An important property of $p$-values is the following: under the null hypothesis, $p$-values are uniformly distributed between 0 and 1:

$$ \text{Pr}(P\leq p \mid H_0) = p $$

Conventionally, a $p$-value threshold of 0.05 is often used. This means that the probability that we declare a gene as being affected by the treatment, even though in reality it is not affected, is less than 5%.

In genome-wide studies, we typically measure around 20,000 genes (in humans). From the uniformity of the $p$-value distribution under the null hypothesis, it follows that even if none of the genes are affected by treatment, we would still expect 5% or 1,000 genes to have a $p$-value less than 0.05. Clearly it would be wrong to call any of these genes differentially expressed!

False discovery rate

The $p$-value is often wrongly interpreted as saying that the probability that our gene is truly affected by treatment is greater than 95% if $p<0.05$. The latter is the probability that the alternative hypothesis is true given that the $p$-value is below a certain threshold, or $\text{Pr}(H_1 \mid P\leq p)$. More commonly, we express this through the false discrovery rate (FDR), that is, the probability that the null hypothesis is true given that the $p$-value is below a certain threshold, $\text{Pr}(H_0 \mid P\leq p)=1-\text{Pr}(H_1 \mid P\leq p)$.

To get a feel for what such probabilities mean, consider the following table:

SignificantNot SignificantTotal
$H_0$ true$F$$M_0-F$$M_0$
$H_1$ true$T$$M_1-T$$M_1$
Total$S$$M-S$$M$

Here “F” stands for “false positive” (not affected by treatment, but still called significant), “T” for “true positive”. The second column contains respectively the “true negative” and “false negative” results.

Then the false discovery rate is

$$ \text{FDR} = \mathbb{E}\left(\frac{F}{S}\right) $$

Obviously, we never know the true value of the numbers in the table, and hence FDR must be estimated. There exist several approaches for doing this. We consider two popular ones: the plug-in estimator and a Bayesian approach.

Plug-in estimator of the FDR

Consider again the differential expression problem. Compute $t$-statistics for each gene and consider a range of thresholds $C$, either on the absolute $t$-values or on the corresponding $p$-values. For each value of $C$, we know $S_{obs}(C)$, the observed number of significant genes at threshold $C$.

Now generate $K$ random permutations of the group labels, and compute for each permutation $k$, $F_{obs}(C,k)$, the number of significant genes at threshold $C$ in permutation $k$, which by virtue of the randomization, must all correspond to cases where $H_0$ is true. Compute the average over all permutations:

$$ F_{obs}(C) = \frac{1}{K}\sum_{k=1}^K F_{obs}(C,k) $$

The plug-in estimate of the FDR at threshold $C$ is then defined as

$$ \widehat{\text{FDR}}(C) = \frac{F_{obs}(C)}{S_{obs}(C)} $$

We can now vary $C$ until we reached a desired FDR value, e.g. 10%, such that we expect no more than 10% of the genes we call differentially expressed to be false positives.

Note that in practice, the number of permutations $K$ need not be very large to reach a stable value for the average. This is because in each permutation, we obtain random $p$-values for a large number of features (genes).

Bayesian estimate of the FDR

When we perform a genome-wide test for differential expression, we obtain on the order of $10^4$ test statistic values, with corresponding $p$-values under the null hypothesis. The distribution of these $p$-values, visualized using a histogram, typically looks like this:

Figure

See also: How to interpret a $p$-value histogram

We know that under the null hypothesis, $p$-values are uniformly distributed. We recognize the uniform distribution in the flat profile of the histogram for unsignificant ($p\to 1$) $p$-values, which with very high probability come from genes unaffected by the treatment. At very small $p$-values, we see a deviation from the uniform distribution, which indicates the presence of truly affected genes, which obsviously don’t follow the null distribution.

Hence it seems like we can model the observed $p$-value distribution as a mixture distribution with one component corresponding to the null distribution and one component corresponding to the alternative distribution:

$$ f(p) = \pi_0 f_0(p) + (1-\pi_0) f_1(p) $$

where $f(p)$ is the probability density function (p.d.f.) of the observed $p$-value distribution, $f_0$ and $f_1$ are the p.d.f. of the $p$-value distribution under the null and alternative distribution, respectively, and $\pi_0 = \text{Pr}(H_0)$, the prior probability of a gene being unaffected by treatment (prior as in, before observing any data), .

As before, we can imagine that our $p$-values are generated by a process that first samples a hidden variable $Z=0$ with probability $\pi_0$ and $Z=1$ with probability $1-\pi_0$, and then samples a $p$-value (or $t$-statistic) from the null or alternative distribution depending on the value of $Z$. We know the null distribution is the uniform distribution. If we knew a parametric form for the alternative distribution, we could apply EM to estimate $\pi_0$ and the parameters of the alternative distribution, and then compute for each gene the recognition distribution or local false discovery rate (lfdr)

$$ \begin{aligned} \text{lfdr}(p) &= \text{Pr}(H_0 \mid p) = \text{Pr}(Z=0 \mid p) = \frac{\pi_0 f_0(p)}{f(p)} \end{aligned} $$

The word “local” refers to the fact that the above expression gives the expected rate of false positives among all genes with $p$-value equal to $p$, as opposed to the “tail” FDR estimated in the previous section giving the he expected rate of false positive among all genes with $p$-value less than or equal to $p$.

In reality, EM is rarely used in this context, because we rarely have a good enough idea about a parametric form for the alternative distribution. Instead a non-parametric approach is used that exploits the knowledge that:

  1. $p$-values are uniformly distributed under the null hypothesis, that is

    $$ f_0(p) = 1 $$

  2. $p$-values close to 1 almost surely come from the null distribution, that is,

    $$ \begin{aligned} \lim_{p\to 1} f_1(p) &= 0 \\ \lim_{p\to 1} f(p) &= \pi_0 \lim_{p\to 1} f_0(p) = \pi_0 \end{aligned} $$

Hence in the observed $p$-value histogram, $\pi_0$ can be estimated from the height of the “flat” region near $p\approx 1$.

We obtain

$$ \text{lfdr}(p) = \frac{\hat{\pi}_0}{f(p)} $$

It remains to estimate the p.d.f. of the real $p$-value distribution $f(p)$. In the popular q-value package, this is done by kernel density estimation followed by some smoothing.

The local false discovery rate provide a useful measure for the importance of a feature with $p$-value $p$. However, $p$-values themselves represent tail probabilities, $p=\text{Pr}(P\leq p \mid H_0)$, see above. Similarly, we can compute the FDR among all features with $p$-value less than some threshold $p$:

$$ \text{FDR}(p) = \text{Pr}(H_0\mid P\leq p). $$

This tells us the estimated false discovery rate among all features that are equally or more significant than a feature with $p$-value $p$.

Writing $F(p)$, $F_0(p)$ and $F_1(p)$ for the cumulative probability functions of the observed, null, and alternative $p$-value distributions, the reasoning above can be repeated to obtain

$$ F(p)=\pi_0 F_0(p) + (1-\pi_0) F_1(p) = \pi_0 p + (1-\pi_0) F_1(p) $$

where we used $F_0(p)=p$. It follows that

$$ \text{FDR}(p) = \text{Pr}(H_0\mid P\leq p) = \text{Pr}(Z=0\mid P\leq p) =\frac{\pi_0 p}{F(p)} $$

$Q$-values

The $q$-value of a feature with $p$-value $p$ is defined as the minimum FDR that can be attained when calling that feature significant, that is, by choosing a $p$-value threshold $r\geq p$. Hence

$$ q(p) = \min_{r\geq p} \text{FDR}(r) $$

Assignment

2.3 - False discovery rate estimation tutorial

Illustrating Storey’s FDR and q-value estimation method using TCGA breast cancer data.

Tutorials are available as Pluto notebooks. To run the reactive notebooks, you need to first follow the instructions in the “Code” section of the repository’s readme file, then follow the instruction at the bottom of the Pluto homepage.

3 - Regularized regression

Drug sensitivity prediction. Ridge, lasso, and elastic net regression.

3.1 - Drug sensitivity prediction

Drug sensitivity prediction.

Abstract - What is the study about and why was it performed?

What is a cancer cell line? See

Which genomic profiles were generated?

Are cell lines good models for real tumours? Figure 1.

What are pharmacological profiles, how is drug sensitivity measured?

What is predictive modelling, how were robust models derived, what is shown in Figure 2?

3.2 - Ridge, lasso, and elastic net regression

Ridge, lasso and elastic net regression.

Linear models and least squares

In a linear model, a (continuous) output \(Y\) is predicted from a vector of inputs \(X^T=(X_1,X_2,\dots,X_p)\) via $$\begin{aligned} \hat Y = \hat\beta_0 + \sum_{j=1}^p \hat{\beta}_j X_j \end{aligned}$$

  • \(\hat Y\) is the predicted value of \(Y\),
  • \(\hat\beta_0\) is the intercept (in statistics) or bias (in machinelearning),
  • \((\hat\beta_1,\hat\beta_2,\dots,\hat\beta_p)^T\) is the vector of (regression) coefficients,

For convenience, we often include a constant variable 1 in $X$, include $\hat \beta_0$ in the vector of coefficients, and write the linear model in vector form: $$\begin{aligned} \hat Y = X^T\hat\beta \end{aligned}$$

Least squares is the most popular method to fit the linear model to a set of training data $(x_1,y_1)$ $\dots$ $(x_N,y_N)$: we pick the coefficients $\beta$ to minimize the residual sum of squares

$$\begin{aligned} RSS(\beta) &= \sum_{i=1}^N (y_i - \hat y_i)^2 = \sum_{i=1}^N \left(y_i - x_i^T\beta\right)^2 = \sum_{i=1}^N \Bigl(y_i - \sum_{j=1}^p x_{ij}\beta_j\Bigr)^2 \end{aligned}$$

Collecting the data for the inputs and ouptput in a $N\times p$ matrix $\mathbf{X}$ and $N$-vector $\mathbf{y}$, respectively, $$\begin{aligned} \mathbf{X}= (x_1,x_2,\dots,x_p) = \begin{pmatrix} x_{11} & x_{12} & \dots & x_{1p}\\ x_{21} & x_{22} & \dots & x_{2p}\\ \vdots & \vdots & & \vdots\\ x_{N1} & x_{N2} & \dots & x_{Np} \end{pmatrix} && \mathbf{y}= \begin{pmatrix} y_1\\ y_2\\ \vdots\\ y_N \end{pmatrix} \end{aligned}$$ we can write $$\begin{aligned} RSS(\beta) &= (\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X}\beta) \end{aligned}$$ If the $p\times p$ matrix $\mathbf{X}^T\mathbf{X}$ is non-singular, then the unique minimizer of $RSS(\beta)$ is given by $$\tag{1} \hat\beta = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{y}$$

The fitted value at the $i$th input $x_i$ is $\hat y_i=x_i^T\hat\beta$. In matrix-vector notation, we can write $$\begin{aligned} \hat{\mathbf{y}} = \mathbf{X}\hat\beta = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \end{aligned}$$ Write $$\begin{aligned} \mathbf{H}= \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \end{aligned}$$

Limitations of least-squares linear regression

Least-squares linear regression involves the matrix inverse $(\mathbf{X}^T\mathbf{X})^{-1}$:

  • If the columns of $\mathbf{X}$ are not linearly independent ($\mathbf{X}$ is not full rank), then $\mathbf{X}^T\mathbf{X}$ is singular and $\hat\beta$ are not uniquely defined, although the fitted values $\hat{\mathbf{y}}=\mathbf{X}\hat\beta$ are still the projection of $\mathbf{y}$ on the column space of $\mathbf{X}$. This is usually resolved automatically by stats software packages.
  • If $\mathbf{X}$ is full rank, but some predictors are highly correlated, $\det(\mathbf{X})$ will be close to 0. This leads to numerical instability and high variance in $\hat\beta$ (small changes in the training data lead to large changes in $\hat\beta$).
  • If $p>N$ (but $rnk(\mathbf{X})=N$), then the columns of $\mathbf{X}$ span the entire space $\mathbb{R}^N$, that is, $\mathbf{H}=\mathbb{1}$ and $\hat{\mathbf{y}}=\mathbf{y}$: overfitting the training data.

High variance and overfitting result in poor prediction accuracy (generalization to unseen data).

Prediction accuracy can be improved by shrinking regression coefficients towards zero (imposing a penalty on their size).

Ridge regression

Regularization by shrinkage: Ridge regression

The ridge regression coefficients minimize a penalized residual sum of of squares: $$\tag{2} \beta^{\text{ridge}} = \argmin_\beta \left\{ \sum_{i=1}^N \Bigl(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Bigr)^2 + \textcolor{red}{\lambda\sum_{j=1}^p\beta_j^2} \right\} $$

  • $\lambda\geq 0$ is a hyperparameter that controls the amount of shrinkage.
  • The inputs $X_j$ (columns of $\mathbf{X}$) are normally standardized before solving eq. (2).
  • The intercept $\beta_0$ is not penalized.
  • With centered inputs, $\hat\beta_0=\bar y=\frac1N\sum_i y_i$, and remaining $\beta_j$ can be estimated by a ridge regression without intercept.

The criterion in eq. (2) can be written in matrix form: $$\begin{aligned} RSS(\beta,\lambda) = (\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X}\beta) + \lambda \beta^T\beta \end{aligned}$$ with unique minimizer $$\tag{3} \beta^{\text{ridge}}= (\mathbf{X}^T\mathbf{X}+\lambda\mathbb{1})^{-1}\mathbf{X}^T\mathbf{y}$$

Note that $\mathbf{X}^T\mathbf{X}+\lambda\mathbb{1}$ is always non-singular.

Lasso regression

Lasso regression: regularization by shrinkage and subset selection

The lasso regression coefficients minimize a penalized residual sum of squares: $$\tag{4} \beta^{\text{lasso}}= = \argmin_\beta \left\{ \sum_{i=1}^N \Bigl(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Bigr)^2 + \textcolor{red}{\lambda\sum_{j=1}^p|\beta_j|} \right\}$$

  • Lasso regression has no closed form solution, the solutions are non-linear in the $y_i$,
  • If $\lambda$ is large enough, some coefficients will be exactly zero.

Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the least squares error (red) and constraint (blue) functions.

Profiles of ridge (left) and lasso (right) coefficients as the penalty paramater $\lambda$ is varied.

Elastic net regression

Elastic net regression combines ridge and lasso regularization.

  • In genomic applications, there are often strong correlations among predictor variables (genes operate in molecular pathways).
  • The lasso penalty is somewhat indifferent to the choice among a set of strong but correlated variables.
  • The ridge penalty tends to shrink the coefficients of correlated variables towards each other.
  • The elastic net penalty is a compromise, and has the form $$\begin{aligned} \lambda\sum_{j=1}^p \Bigl(\alpha|\beta_j|+\tfrac12(1-\alpha)\beta_j^2\Bigr). \end{aligned}$$ The second term encourages highly correlated features to be averaged, while the first term encourages a sparse solution in the coefficients of these averaged features.

Generalized linear models

  • Ridge, lasso, and elastic net regression can also be used to fit generalized linear models when the number of predictors is high.
  • The most commonly used model is logistic regression, where a binary output $Y$ is predicted from a vector of inputs $X^T=(X_1,\dots,X_p)$ via $$\begin{aligned} \log\frac{P(Y=1\mid X)}{P(Y=0\mid X)} &= \beta_0 + \sum_{j=1}^p \beta_j X_j\\ P(Y=1\mid X) = 1 - P(Y=0\mid X) &= \frac1{1+e^{-\beta_0 - \sum_{j=1}^p \beta_j X_j}} \end{aligned}$$
  • With training data $(\mathbf{X},\mathbf{y})$, the parameters are estimated by maximizing the penalized log-likelihood: $$\begin{aligned} \mathcal{L}(\beta) &= \log \prod_{i=1}^N P(y_i\mid x_{i1},\dots,x_{ip}) - \sum_{j=1}^p \left(\alpha|\beta_j|+\tfrac12(1-\alpha)\beta_j^2\right)\\ &= \sum_{i=1}^N \log P(y_i\mid x_{i1},\dots,x_{ip}) - \lambda \sum_{j=1}^p \left(\alpha|\beta_j|+\tfrac12(1-\alpha)\beta_j^2\right) \end{aligned}$$
  • Generalized linear models can be fitted using a iterative least squares approximation: a quadratic approximation to the log-likelihood around the current estimates for $\beta$, and this quadratic approximation is used to find the next estimates using the standard linear elastic net solution.

Bayesian regularized regression

flowchart TB beta -->Y_i subgraph i=1,...,N X_i-->Y_i end

Assignment

3.3 - Elastic net regression tutorial

Elastic net regression tutorial using CCLE data.

Tutorials are available as Pluto notebooks. To run the reactive notebooks, you need to first follow the instructions in the “Code” section of the repository’s readme file, then follow the instruction at the bottom of the Pluto homepage.

4 - Dimensionality reduction

Single-cell genomics. Probabilistic PCA. T-SNE and UMAP.

4.1 - Single-cell genomics

Single-cell genomics.

Understanding cells

To understand a cell, we must determine the factors that shape its identity:

  • its position in a taxonomy of cell types,
  • the progress of multiple time-dependent processes happening simultaneously,
  • the cell’s response to signals from its local environment,
  • the location and neighbourhood in which it resides.

Technological advances have enabled genome-wide profiling of RNA, DNA, protein, epigenetic modifications, chromatin accessibility, and other molecular events in single cells.

Sources of variation in single-cell genomics

Technical variation:

  • Due to technical factors in the data generation process.
  • Main axes of variation in scRNA-seq often dominated by technical factors.
  • False negatives (expressed but undetected) and false positives (overamplification) are due to minute quantities of starting RNA material and corresponding need for massive amplification.

Allele-intrinsic variation:

  • Stochastic factors intrinsic to the molecular mechanisms that control gene expression.
  • Does not correlate between two alleles of the same gene.

Allele-extrinsic variation

  • Factors extrinsic to the process of transcription.
  • Contribute to establishing differences between cell types or states, either stably or transiently.
  • Correlated between two alleles of the same gene.

Most studies aim to understand allele-extrinsic variation and its function, while technical and allele-intrinsic variations are major confounders.

Addressing overamplification

False-positive detections due to amplification occur at early PCR amplification cycle.

Can be addressed using random molecular tags (RMTs), also called unique molecular identifiers (UMIs):

  • Short barcode attached to 3’ or 5’ end of cDNA molecules before amplification.
  • After amplification, the number of unique barcodes associated with reads aligned to a gene/transcript, rather than number of aligned reads serves as the gene/transcript abundance.

Beware:

  • Sequencing errors will cause RMT sequences that are close to but not identical, and need to be collapsed.
  • “Barcode collision” – two or more copies of a transcript may receive the same barcode, making them appear as a single copy.

Addressing false negatives

Limited efficiency of RNA capture and conversion to cDNA leads to dropout events: transcripts that are expressed in the cell but undetected in its mRNA profile.

Zero-inflated models:

Gene expression is modelled as a mixture of two distributions:

  • one in which it is successfully amplified and detected at a level that correlates with its true abundance,
  • one in which it is undetected because of technical effects.

(Hidden) component labels can be inferred using EM.

Only cells belonging to the “success” component are used for downstream analysis, e.g. differential expression analysis between subpopulations of cells.

Revealing the vectors of cellular identity

Single-cell genomics is a means to identify and understand the factors that jointly define a cell’s identity:

  • Discrete cell types and subtypes
  • Continuous phenotypic spectra
  • Dynamic processes
  • Spatial location

Distinguishing discrete cell types and subtypes

Single-cell analysis can detect cellular subtypes that cannot be defined by a handful of markers, or for which markers are not yet known.

Classification of cells in discrete subtypes is a problem in unsupervised clusterring.

Key challenges:

  • Exponential increasing scale of single-cell data.
  • Ensuring reproducible classification.
  • Finding the proper granularity and detecting hierarchies of cell types and subtypes, especially when cell type frequency varies by multiple orders of magnitude.
  • Distilling molecular markers and signatures to characterize cell types.
  • Matching the resulting classes to known cell types.
  • Visualizing, sharing, and comparing classifications.

Continuous phenotypic spectra

  • Sometimes more appropriate to speak of a continuous phenotypic spectrum within a type than of a discrete cell type. e.g. a cell’s ability to trigger an immune response.
  • Continuous facets can be characterized by combining dimensionality reduction (e.g. PCA) with enrichment for functional annotation.
  • Prior knowledge can be used to interpret the biological relevance of the data’s main axes of variation.

Mapping dynamic processes

  • Cells undergo dynamic transitions, incl. short-term responses to environmental signals, cell differentiation, oscillations, etc.
  • Single-cell genomics provides a snapshot of the entire dynamic process: the set of single cells captured at any time point will Stochastically contain cells positioned in different instantaneous time points along the temporal trajectory.
  • Temporal ordering can be recovered by creating a graph that connects cells by their profile similarity and finding an optimal path on this graph.
  • This introduces the notion of “pseudo-time”.

Inference of spatial location

  • Most high-throughput single-cell approaches require dissociation of tissues and therefore lose spatial information.
  • Spatial information can be obtained for a limited number of “landmark” genes.
  • Computational methods can combine limited spatial cues for landmark genes with single-cell RNA-seq from the same type of tissue.
  • Limited number of landmarks expressed at any spatial position and dropouts in single-cell RNA-seq can severely compromise the mapping.
  • Requires tissues with “canonical” structure (e.g. developing embryo, most normal tissues), which is faithfully reproduced in replicate samples.

Assignment

4.2 - Principal component analysis

PCA. Probabilistic PCA.

Dimensionality reduction

Assume we have a set of $N$ observations (e.g. cells) of $p$ variables (e.g. genes).

Dimensionality reduction consists of representing the observations in a lower-dimensional space with dimension $q\ll p$ while preserving some characteristics (e.g. relative distance) of the original $p$-dimensional representation.

For visualizing relationships among observations, we typically use $q=2$.

Principal component analysis (PCA)

Assume we have a set of $N$ observations $x_1,\dots,x_N\in \mathbb{R}^p$ of $p$ variables, collected in the $N\times p$ matrix $\mathbf{X}$.

The principal components of $\mathbf{X}$ provide a sequence of best linear approximations to $\mathbf{X}$, representing $\mathbf{X}$ by a line ($q=1$), plane ($q=2$), etc. in $p$-dimensional space.

A rank-$q$ linear model in $p$-dimensional space takes the form

$$ f(y) = \mu + y_1 v_1 + \dots + y_q v_q = \mu + \mathbf{V}_q z, $$

where $\mu\in\mathbb{R}^p$ is a location vector, $z\in\mathbb{R}^q$ is a (low-dimensional) vector, and $\mathbf{V}_q=(v_1,\dots,v_q)$ is a $p\times q$ matrix of $q$ orthogonal unit vectors,

$$ \begin{aligned} v_k v_{k’}^T = \sum_{j=1}^p v_{kj}v_{k’j}= \delta_{kk’} \end{aligned} $$

Fitting this model to the data using least squares amounts to minimizing the reconstruction error:

$$ \min_{\mu,\{y_{i}\},\mathbf{V}_q} \sum_{i=1}^N \Bigl\| x_i - \mu - \mathbf{V}_q z_i \Bigr\|^2 $$

Partial optimization for $\mu$ and the $y_i$ gives

$$ \begin{aligned} \hat{\mu} &= \bar x = \frac1{N}\sum_{i=1}^N x_i\\ \hat{z}_i &= \mathbf{V}_q^T(x_i-\bar{x}) \end{aligned} $$

Plugging this into () leaves us to find the orthogonal matrix $\mathbf{V}_q$:

$$ \begin{aligned} \min_{\mathbf{V}_q} \sum_{i=1}^{N} \Bigl\| (x_i - \bar{x}) - \mathbf{V}_q\mathbf{V}_q^T(x_i-\bar{x}) \Bigr\|^2 \end{aligned} $$

It can be shown that the optimal solution is found when $\mathbf{V}_q=(v_1,\dots,v_q)$ contains the $q$ eigenvectors with largest eigenvalues of the $p\times p$ matrix $\mathbf{X}^T\mathbf{X}$ (if the data is centred such that $\bar x=0$).

The solution can be expressed using the singular value decomposition of the obsered data matrix $\mathbf{X}$:

$$ \mathbf{X} = \mathbf{U} \Lambda \mathbf{V}^T $$

where $\mathbf{U}$ is an $N\times p$ orthogonal matrix ($\mathbf{U}^T\mathbf{U}=\mathbf{I}_p$) whose columns are called the left singular vectors, $\mathbf{V}$ is a $p\times p$ orthogonal matrix ($\mathbf{V}^T\mathbf{V}=\mathbf{I}_p$) whose columns are called the right singular vectors, and $\Lambda$ is a $p\times p$ diagonal matrix with diagonal elements $\lambda_1\geq \lambda_2 \geq \dots \geq \lambda_p\geq 0$ called the singular values.

The solution $\mathbf{V}_q$ above consists of the first $q$ columns of $\mathbf{V}$.

The columns of $\mathbf{U}\Lambda$ are called the principal components of $\mathbf{X}$. The first $q$ principal components are $\mathbf{U}_q\Lambda_q$. Each principal component is a vector in $\R^N$, that is, it takes a value for each observation. The solution $\hat{z}_i$ above consists of the rows of $\mathbf{U}_q\Lambda_q$, that is, for each observation $i$, $\hat{z}_i$ consists of the values of the first $q$ principal components for observation $i$.

Probabilistic PCA

Probabilistic PCA is a continuous generalization of the Gaussian mixture model where we again assume that an observation $X=(x_1,\dots,x_p)$ of $p$ variables is generated by a latent variable $Z$:

flowchart LR Z --> X

In the Gaussian mixture model, $Z$ is assumed to take discrete values leading to clustered observations. Here we assume that $Z=(z_1,\dots,z_q)$ is a $q$-dimensional continous variable with distribution

$$ Z\sim \mathcal{N}(0,\mathbf{I}) $$

that is, we assume each $z_i$ is drawn from an independent normal distribution with mean zero and standard deviation 1. The conditional distribution of $X$ given $Z$ is just like in the discrete mixture distribution assumed to be a multivariate normal with mean dependent on the value of $Z$; the covariance matrix is assumed to be a multiple of the identity matrix, meaning that conditional on the latent factors $Z$, the $p$ observed features $X$ are independent and have equal error distribution:

$$ p(x \mid z) = \mathcal{N}(\mathbf{\mu} + \mathbf{V}z, \sigma^2 \mathbf{I}) $$

where $\mu\in\R^p$ is a mean offset vector, $\mathbf{V}\in\R^{p\times q}$ is a linear maps, and $\sigma^2>0$ a common variance parameter for all $p$ dimensions. Note the similarity between this model and the model above.

Following the Gaussian mixture model example, we want to maximize the likelihood of the observed data, which follows the marginal distribution

$$ p(x) = \int dz\; p(x\mid z) p(z) $$

which can be interpreted as a “continuous” Gaussian mixture distribution. Unlike in the finite mixture model, this integral can in fact be solved using properties of the multivariate normal distribution:

$$ p(x) = \mathcal{N}(\mu,\mathbf{C}) $$

with covariance matrix $\mathbf{C}=\mathbf{V}\mathbf{V}^T+\sigma^2\mathbf{I}$. The corresponding log-likelihood for $N$ observations $x_1.\dots,x_N\in\R^p$ is

$$ \mathcal{L} = -\frac{N}{2}\Bigl\{p \ln(2\pi) + \ln(\det(\mathbf{C})) + \mathrm{tr}(\mathbf{C}^{-1}\mathbf{S}) \Bigr\} $$

where

$$ \mathbf{S}= \frac{1}{N}\sum_{i=1}^N (x_i-\mu)(x_i-\mu) $$

The maximum-likelihood for $\mu$ is the sample mean, $\hat{\mu}=\overline{x}=\frac{1}{N}\sum_{i=1}^N x_i$, such that $\mathbf{S}$ is the sample covariance matrix. The log-likelihood function measures how much $\mathbf{C}$ differs from the sample covariance and is maximized by making $\mathbf{C}$ as “similar” as possible to it. The solution turns out to be the standard PCA solution where the columns of $\mathbf{V}$ are the $q$ eigenvectors with largest eigenvalues of $\mathbf{S}$. This conclusion is valid only if we assume the independent error model, where features are independent given latent factors, that is, the latent factors explain all the covariance between the features. In summary:

Note the important benefit of the probabilistic approach. In standard PCA, we simply fit a linear function to a set of observation, without giving any thought to the deviations (errors) between the (noisy) observations and the fitted model. In probabilistic PCA, as in all generative models, we must be explicit about our assumptions, and standard PCA is only recovered if we assume that features are independent given the latent factors.

4.3 - t-SNE and UMAP

T-SNE, UMAP.

t-Stochastic Neighbor Embedding (t-SNE)

Linear dimensionality reduction methods (like PCA) may not able to retain both the local and global structure of the data in a single map.

t-SNE is a non-linear method that aims to overcome this limitation.

t-SNE is particularly successful at visualizing high-dimensional data (reducing it to two dimensions).

Care must be taken when applying t-SNE due to random initialization of the algorithm, which may lead to solutions that do not represent the global structure of the data well.

(Symmetric) Stochastic Neighbor Embedding (SNE)

As before, assume we have a set of $N$ observations $x_1,\dots,x_N\in \mathbb{R}^p$ of $p$ variables, collected in the $N\times p$ matrix $\mathbf{X}$.

SNE introduces a directional similarity of $x_j$ to $x_i$,

$$ \begin{aligned} p_{j\mid i} = \frac{\exp\bigl(-\frac{1}{2\sigma_i^2}|x_i-x_j|^2\bigr)}{\sum_{k\neq i}\exp\bigl(-\frac{1}{2\sigma_i^2}|x_i-x_k|^2\bigr)} \end{aligned} $$

The variances $\sigma_i$ are chosen such that the perplexities

$$ \begin{aligned} \mathcal{P_i} = \exp\Bigl(-\sum_{j\neq i} p_{j\mid i}\log p_{j\mid i}\Bigr) \end{aligned} $$

take some prespecified value.

Symmetric, undirectional similarities are defined as

$$ \begin{aligned} p_{ij} = \frac{p_{j\mid i}+p_{i\mid j}}{2n} \end{aligned} $$

such that $\sum_{ij}p_{ij}=1$

A map is a two or three-dimensional representation ${\cal Y}={y_1,\dots,y_N}$ of the high-dimensional data ${\cal X}={x_1,\dots,x_N}$.

In the low-dimensional representation, we define the probability of picking $y_i$ and $y_j$ as neighbors by

$$ \begin{aligned} q_{ij} = \frac{\exp\bigl(-|y_i-y_j|^2\bigr)}{\sum_{k\neq l}\exp\bigl(-|y_k-y_l|^2\bigr)} \end{aligned} $$

(Symmetric) Stochastic Neighbor Embedding (SNE) finds the map ${\cal Y}$ that minimizes the mismatch between $p_{ij}$ and $q_{ij}$, across all pairs $i$ and $j$.

The mismatch cost function is given by the Kullback-Leibler divergence:

$$ \begin{aligned} C = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}\geq 0 \end{aligned} $$

with $C=0$ if, and only if, $q_{ij}=p_{ij}$ for all $i$ and $j$.

SNE suffers from a “crowding problem”

The volume of a sphere with radius $r$ in $p$ dimensions scales as $r^p$.

Consider a cluster of datapoints that are approximately uniformly distributed in a sphere of radius $r$.

If $p\gg 2$, then the available 2-dimensional area to model the distances in this cluster accurately will be too small compared to the total area available, forcing clusters of nearby points to crush together.

Heavy-tails in the low-dimensional space resolve the crowding problem

A heavy-tail distribution in the low-dimensional space allows a moderate distance in the high-dimensional space to be faithfully modelled by a much larger distance in the map.

In t-SNE a Student t-distribution with 1 d.o.f. (a Cauchy distribution) is used as the heavy-tailed distribution in the low-dimensional map:

$$ \begin{aligned} q_{ij} = \frac{\left(1+|y_i-y_j|^2\right)^{-1}}{\sum_{k\neq l}\left(1+|y_k-y_l|^2\right)^{-1}} \end{aligned} $$

The Cauchy distribution offers additional advantages in the numerical optimization of the cost function.

Summary

t-SNE:

  • puts emphasis on modelling dissimilar datapoints by means of large pairwise distances, and similar datapoints by small pairwise distances,
  • offers dramatic improvement in finding and preserving local structure in the data, compared to, for instance, PCA.

Optimization of the t-SNE cost function is easier than optimizing earlier SNE versions, but:

  • the cost function is non-convex,
  • several optimization parameters need to be chosen,
  • the constructed solutions depend on these choices and may be different each time t-SNE is run from an initial random configuration.

Formal theory supporting t-SNE is lacking:

  • similarity measures and cost function are based on heuristics, appealing to “intuitive” justifications,
  • there is no formal generative model connecting high- and low-dimensional representations,
  • the probability semantics used to describe t-SNE is descriptive, rather than foundational.

Uniform Manifold Approximation and Projection (UMAP)

In the words of the authors:

  • UMAPs design decisions were all grounded in a solid theoretic foundation and not derived through experimentation with any particular task focused objective function.
  • The theoretical foundations for UMAP are largely based in manifold theory and topological data analysis.
  • A purely computational view [of UMAP] fails to shed any light upon the reasoning that underlies the algorithmic decisions made in UMAP.

A computational view of UMAP

From a practical computational perspective, UMAP can be described in terms of, construction of, and operations on, weighted graphs:

  • Graph construction:

    1. Construct a weighted k-neighbour graph
    2. Apply some transform on the edges to represent local distance.
    3. Deal with the inherent asymmetry of the k-neighbour graph.\
  • Graph layout:

    1. Define a cost function that preserves desired characteristics of this k-neighbour graph.
    2. Find a low dimensional representation which optimizes this cost function.

t-SNE and UMAP can both be cast in this form, but with some (subtle) differences in the edge weighting (dissimilarity measure) and graph layout (cost function).

A comparison of dimension reduction algorithms

Practical recommendations

  • Claims of UMAP or t-SNE (or related methods) being superior are usually overrated.

  • Both algorithms suffer from a need for careful parameter tuning and initialization choices.

  • Using and understanding one algorithm well is more important flipping between algorithms and never changing default parameters.

An excellent reference for using t-SNE for single-cell data:

Kobak & Berens. The art of using t-SNE for single-cell transcriptomics. Nat. Comm. 10:5416 (2019)

See also: https://github.com/berenslab/rna-seq-tsne

Assignment

4.4 - Tutorial

Tutorial on dimensionality reduction of single-cell RNA-seq data using t-SNE and PCA.

Tutorials are available as Pluto notebooks. To run the reactive notebooks, you need to first follow the instructions in the “Code” section of the repository’s readme file, then follow the instruction at the bottom of the Pluto homepage.

5 - Causal inference

Genetics of gene expression. The method of path coefficients. False discovery control.

5.1 - Genetics of gene expression

Genetics of gene expression

The GTEx Consortium atlas of genetic regulatory effects across human tissues

Figure 1

Concepts

  • Genome-wide association studies (GWAS)
  • Expression quantitative trait locus (eQTL)
  • cis vs trans eQTLs

Figure 2

Concepts

  • Linkage disequilibrium (LD)
  • Stepwise regression

Figure 3

Concepts

  • Statistical finemapping
  • ENCODE in GWAS

Figure 4

Concepts

  • Enrichment in functional annotations
  • Mediation analysis

Figure 5

Concepts

  • Mendelian randomization (MR)

Figure 6

Concepts

  • Logistic regression

Figure 7

5.2 - Causal model selection

Causal model selection. Mediation. Instrumental variables.

Mediation analysis

The GTEx study identified trans-eQTLs that are also cis-eQTLs and asked if the cis-eGene could be the cause of the trans-eQTL association (see fig above), that is, if the following model is supported by the data:

flowchart LR Z --> X --> Y

where $Z$ is a SNP that is a cis-eQTL for gene $X$ and trans-eQTL for gene $Y$. This model implies that $X$ blocks the path between $Z$ and $Y$, and hence that $Z$ and $Y$ are independent conditional on $X$, in mathematical notation

$$ Z \perp Y \mid X $$

The principle for testing whether the model $Z\to X \to Y$ is true using the conditional independence criterion is illustrated in the figure below. Assuming linear relations between all variables, three conditions must be met:

  1. The expression levels of $X$ differ significantly between the genotype groups of $Z$ (to confirm the $Z\to X$ association).
  2. The expression levels of $Y$ differ significantly between the genotype groups of $Z$ (to confirm the $Z\to Y$ association).
  3. The residuals of $Y$ after regression on $X$ do not differ differ significantly between the genotype groups of $Z$ (to confirm that the $Z\to Y$ association is mediated by $X$, and hence that $Z\to X \to Y$ is true).

Instrumental variable analysis / Mendelian randomization

The mediation method fails if $X$ and $Y$ are affected by common cause $U$ (which may be an unknown or hidden variable):

flowchart LR Z --> X --> Y U --> X & Y

In this case, conditioning on $X$ opens the collider $Z \to X \leftarrow U$, creating a path $Z\; — \; U \to Y$, such that the residuals of $Y$ will still show a difference between the genotype groups of $Z$, and the mediation method concludes (wrongly!) that the $Z\to Y$ association must be due to another factor than $X$ (no causal $X\to Y$ relation).

Instrumental variable, known as Mendelian randomization (MR), is an alternative causal inference approach that is not affected by hidden confounders $U$, but with subtly different underlying assumptions.

Specifically, in MR we assume that the $Z\to Y$ association must be due to $X$ (for instance because $X$ is the only cis-eGene of $Z$, and trans-eQTL associations must be mediated by some initial cis effects), and we seek to estimate the magnitude of the causal effect of $X$ on $Y$.

The diagram above can be written as a structural equation model

$$ \begin{aligned} X &= a Z + c_X U + E_X\ Y &= b X + c_Y U + E_Y \end{aligned} $$

where $E_X$ and $E_Y$ are error terms, mutually independent and independent of $Z$ and $U$. Since $Z$ and $U$ are assumed to be independent (no arrows in the diagram), it follows that

$$ \begin{aligned} \mathrm{cov}(Y,Z) &= b\; \mathrm{cov}(X,Z) \end{aligned} $$

and hence the causal effect of $X$ on $Y$ is estimated by the ratio of covariances:

$$ b = \frac{\mathrm{cov}(Y,Z)}{\mathrm{cov}(X,Z)} $$

Assignment

5.3 - Causal model selection tutorial

Causal model selection tutorial.

Background

In this tutorial we consider two continuous and correlated variables, $X$ and $Y$, representing the expression levels of two genes. We also consider a discrete variable $Z$ representing a genotype for gene $X$. Typically, $Z$ will have been obtained by eQTL mapping for gene $X$. We wish to determine if variation in $X$ causes variation in $Y$.

The aim of causal model selection is: given observational data for $Z$, $X$, and $Y$ in a set of independent samples, which causal model (represented by a directed acyclic graph) explains the data best.

To restrict the space of possible models that need to be considered, a number of assumptions reflecting biological knowledge can be made:

  1. Genetic variation influences variation in gene expression (and phenotypes more generally), but changing the value of an individual’s phenotype does not change their genome. Hence, in our models there can be no incoming arrows into $Z$.

  2. We assume that the statistical association between $Z$ and $X$ is due to a direct effect, that is, all causal models we consider must contain the directed edge $Z\to X$. This assumption is justified if $Z$ is located within or near to $X$ (on the genome) and in a known regulatory region for $X$.

  3. For $X$ and $Y$ to be correlated (non-independent), there must be a path in the graph between them, more precisely $X$ and $Y$ must not be d-separated.

  4. By the laws of Mendelian inheritance (particularly the random segregation of alleles), we may assume that $Z$ is independent of any unobserved confounding factors $U$ that cause $X$ and $Y$ to be correlated, and therefore there are no edges between $Z$ and any unobserved $U$.

If we assume that there are no unobserved factors, there are 5 possible models satisfying assumptions 1-4 (see figure below). If we allow for the presence of unobserved factors, we have the same 5 models with an additional unobserved $U$ having a causal effect on $X$ and $Y$, plus one more model without an edge between $X$ and $Y$ where all of their correlation is explained by $U$ (see figure).

Below each model in the figure, some of the key conditional independences implied by the model are shown, using the mathematical notation $⫫$ for “is independent of”, $∣$ for “conditional on”, $∧$ for “and”, and $¬$ for “it is not the case that”.

Causal models

From this figure it is immediately clear that our aim of deciding whether $X$ causes $Y$ is unachievable. Even without any unobserved factors, models 4a and 5a are Markov equivalent: they entail the same conditional independences and are indistinguishable using observational data. In other words, there exists no condition that is both necessary and sufficient for $X$ to cause $Y$ given the above 4 assumptions.

There are two possible paths forward (apart from giving up): to use a sufficient condition or a necessary condition for causality. If the joint probability distribution of $Z$, $X$, and $Y$ passes a sufficient condition, then it is guaranteed to have been generated by a model where $X$ causes $Y$, but there may be distributions with $X\to Y$ that don’t pass the test (false negatives). Conversely, all joint distributions generated by models with $X\to Y$ will pass a necessary condition, but there may be distributions generated by models where $X$ does not cause $Y$ that also pass the test (false positives).

Because estimating the joint distribution of $Z$, $X$, and $Y$ and its conditional independences from a finite number of samples is itself an imperfect process that can only ever approximate the true distribution, having additional errors from using an imperfect causality test is not necessarily catastrophic, provided those errors are small enough.

A sufficient condition for $X\to Y$

Our sufficient condition for $X\to Y$ is based on model 1a. This model implies that $Z$ and $Y$ are not independent, but $Z$ is independent of $Y$ conditioned on $X$ ($X$ is a mediator for the causal path from $Z$ to $Y$). No other model satisfies those two relations:

  • In models 2a, 2b, and 6b, $Z$ and $Y$ are independent.
  • In all other models $Z$ is not independent of $Y$ conditioned on $X$, either because there is a direct path from $Z$ to $Y$ not passing through $X$, or because conditioning on $X$ opens a path from $Z$ to $Y$ via the confounder $U$ (due to the v-structure or collider at $X$).

Hence any joint distribution in which $Z$ and $Y$ are not independent, but $Z$ is independent of $Y$ conditioned on $X$, must have been generated by model 1a, that is, by a model where $X\to Y$. In mathematical notation:

$$ ¬(Z⫫Y) ∧ (Z⫫Y∣X) ⇒ (X→Y) $$

A necessary condition for $X\to Y$

Because all models contain the edge $G→X$, it follows that if $X→Y$, then $Z$ and $Y$ cannot be independent, providing a simple necessary condition for $X→Y$. However, $Z$ and $Y$ are also not independent in models 3a-b and 5a-b, in which $Y→X$, because of the direct edge $G→Y$. Of these, only 3a can be excluded, because in 3a, $X⫫Y∣Z$, a condition not satisfied in any model where $X→Y$. Combining these two results, we obtain

$$ (X→Y) ⇒ ¬(Z⫫Y) ∧ ¬(X⫫Y∣Z) $$

Import packages

In this tutorial, we will first set up some code to generate and visualize simulated data for $Z$, $X$, and $Y$ given one of the models in the figure above. Then we will implement the sufficient and necessary conditions for $X→Y$ and assess how well they perform in a number of scenarios. The code is written in julia. It should be fairly straightforward to translate into other languages.

We start by importing some necessary packages:

using Random
using Distributions
using StatsBase
using DataFrames
using GLM
using LinearAlgebra
using CairoMakie
using LaTeXStrings

Data simulation

First we set a random seed and fix the number of samples:

Random.seed!(123)
N = 200;

In our simulations, as in real biology, a genotype value is sampled first. Then expression values for the two genes are sampled conditional on the genotype value.

Genotype data simulation

We simulate the genotype of a bi-allelic (2 values), diploid (2 copies) polymorphism. We encode the major and minor alleles by the values 0 and 1, respectively. The genotype is sampled by defining the minor allele frequency (MAF) as a parameter of the simulation. Two haploids are sampled using a Bernoulli distribution and summed to give the genotype, that is, the genotype is the number of copies of the minor allele in an individual.

Because the mean of a Bernoulli distribution with probability of success $p$ is $p$, the mean of $Z$ is 2 times the minor allele frequency, and we therefore subtract this value from the sampled genotypes to obtain samples from a random variable $Z$ with mean zero. Note that we cannot center the actual sampled values at this point, because we still need to generate the expression data conditional on $Z$, and the expression levels of an individual cannot depend on the sample mean of the genotypes in a population!

maf = 0.3
H1 = rand(Bernoulli(maf),N)
H2 = rand(Bernoulli(maf),N)
Z0 = H1 .+ H2 
Z = Z0 .- 2*maf;

Expression data simulation

To simulate data for $X$ and $Y$, we must first set up the structural equations for the causal model we want to simulate. We will assume linear models with additive Gaussian noise throughout.

Let’s start by models 1a, 1b, and 6b. Their structural equations can be written as

$$ \begin{align} X &= a Z + U_X\\ Y &= b X + U_Y \end{align} $$

where $U_X$ and $U_Y$ are normally distributed errors with joint distribution

$$ (U_X, U_Y)^T \sim {\cal N}(0,Σ_{U}) $$

with covariance matrix

$$ Σ_{U} = \begin{pmatrix} 1 & ρ\\ ρ & 1 \end{pmatrix} $$

In model 1a, the errors are uncorrelated, $\rho=0$, and we arbitrarily set the errors to have unit variance. In model 1b, the unobserved confounder $U$ has the effect of correlating the errors, that is $0<\rho<1$. In model 6b, the errors are also correlated, $0<\rho<1$, but there is no direct effect of $X$ on $Y$, that is $b=0$.

The parameters $a$ and $b$ are the causal effect sizes, and their magnitudes should be interpreted relative to the unit standard deviation of the random errors. In other words, each additional alternative allele shifts the mean of $X$ by $a$ standard deviations of the random errors.

Given a value $Z=z$, eqs. (1)–(2) can be rewritten in matrix-vector notation as

$$ \begin{pmatrix} X\\ Y \end{pmatrix}

\begin{pmatrix} az \\ abz \end{pmatrix} + \begin{pmatrix} 1 & 0\\ b & 1 \end{pmatrix} \begin{pmatrix} U_X\\ U_Y \end{pmatrix} $$

Using properties of the multivariate normal distribution, it follows that

$$ \begin{equation} \begin{pmatrix} X\\ Y \end{pmatrix} \mid Z=z ∼ {\cal N} \left( \mu_z, Σ_{XY} \right) \end{equation} $$

with

$$ \begin{align*} \mu_z &= \begin{pmatrix} az \\ abz \end{pmatrix} \\ Σ_{XY} &= \begin{pmatrix} 1 & 0\\ b & 1 \end{pmatrix} \begin{pmatrix} 1 & ρ\\ ρ & 1 \end{pmatrix} \begin{pmatrix} 1 & b \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & b+ρ\\ b+ρ & b^2 + 2b\rho + 1 \end{pmatrix} \end{align*} $$

Hence given a sampled genotype value $Z$, we can sample values for $X$ and $Y$ by sampling from the multivariate normal distribution (3). In fact, julia knows how to do calculus with distributions, and we only need to specify the distribution of the random errors and the affine matrix-vector transformation:

function sample_XcauseY(Z,a=1.0,b=0.8,rho=0.0)
    # number of samples
    N = length(Z)
    # Covariance matrix for the unmodelled factors
    covU = [1.0 rho
            rho 1.0]
    # distribution of the unmodelled factors
    distU = MvNormal(covU)
    # the covariance between X and Y due to direct and unmodelled effects
    B = [1. 0.
         b  1.]
    distXY = B * distU
    # sample XY expression levels from model 1a, 1b, or 6b, depending on the values of rho and b
    XY = [a*Z a*b*Z] + rand(distXY,N)'
    return XY
end;

Your turn: Can you derive and implement the structural equations for the other models?

The function to sample from models 3a or 3b is as follows:

function sample_GcauseXY(Z,a=1.0,b=0.8,rho=0.0)
    # number of samples
    N = length(Z)
    # Covariance matrix for the unmodelled factors
    covU = [1.0 rho
            rho 1.0]
    # distribution of unmodelled factors
    distU = MvNormal(covU)
    # sample XY expression levels from model 3a or 3b, depending on the value of rho
    XY = [a*Z b*Z] + rand(distU,N)'
    return XY
end;

Now sample some data:

XY1a = sample_XcauseY(Z,1.,0.8,0.) # sample from model 1a
XY1b = sample_XcauseY(Z,1.,0.8,0.4) # sample from model 1b
XY6b = sample_XcauseY(Z,1.,0.,0.4); # sample from model 6b
XY3a = sample_GcauseXY(Z,1.,0.8,0.) # sample from model 3a
XY3b = sample_GcauseXY(Z,1.,0.8,0.4); # sample from model 3b

Let’s collect all our data in a dataframe:

data = DataFrame(Z0=Z0, Z=Z, 
            X1a=XY1a[:,1], Y1a=XY1a[:,2],
            X1b=XY1b[:,1], Y1b=XY1b[:,2],
            X6b=XY6b[:,1], Y6b=XY6b[:,2],
            X3a=XY3a[:,1], Y3a=XY3a[:,2],
            X3b=XY3b[:,1], Y3b=XY3b[:,2]);

We can visualize the data using scatter plots of $X$ and $Y$ colored by genotype value:

f1 = Figure()
figttl = ["Model 1a", "Model 1b", "Model 6b", "Model 3a", "Model 3b"]
rowid = [1, 1, 1, 2, 2]
colid = [1, 2, 3, 1, 2]
for k=1:5
    ax = Axis(f1[rowid[k],colid[k]], xlabel="x", ylabel="y", 
        title=figttl[k],
        aspect=AxisAspect(1))
    hidespines!(ax)

    scatter!(data[data.Z0.==0,2*k+1],data[data.Z0.==0,2*k+2],
        label="z=0", marker=:circle, color=:black, 
        markersize=7)
    scatter!(data[data.Z0.==1,2*k+1],data[data.Z0.==1,2*k+2],
        label="z=1", marker=:xcross, color=:lightblue, 
        markersize=12)
    scatter!(data[data.Z0.==2,2*k+1],data[data.Z0.==2,2*k+2], 
        label="z=2", marker=:cross, color =:red, 
        markersize=12)

    f1[2,3] = Legend(f1, ax, "Genotype", 
        framevisible = false, 
        tellheight=false, tellwidth=false,
        halign=:left, valign=:bottom)
end   
f1

Causal models

Model selection

Testing the sufficient condition for $X\to Y$

For the sufficient condition we have to test whether $Z$ and $Y$ are dependent and whether this dependence disappears after conditioning on $X$. If we assume linear models, these tests can be performed by standard least squares regression and testing for non-zero coefficients.

Testing data generated by Model 1a

First we test whether $Z$ and $Y$ are dependent by regressing $Y$ on $Z$. Note that by construction $Y$ and $Z$ are samples from random variables with zero mean and hence we don’t include an intercept in the regression:

yg = lm(@formula(Y1a ~ 0 + Z),data)

From the small $p$-value we can reject the null hypothesis of no dependence between $Y$ and $Z$, and hence we conclude that the first test is passed.

Next we test whether $Z$ and $Y$ become independent after conditioning on $X$, that is, after regressing out $X$ from $Y$:

yx = lm(@formula(Y1a ~ 0 + X1a),data)

The residuals of the regression of $Y$ on $X$ can now be tested for association with $Z$:

data.residYX1a = residuals(yx)
ygx = lm(@formula(residYX1a ~ 0 + Z),data)

Clearly the null hypothesis of no association cannot be rejected, and we conclude that the second test has also been passed.

In conclusion, the data satisfies the sufficient condition for causality and we conclude (correctly) that it has been generated by a causal model with an edge $X\to Y$ (a true positive result).

Testing data generated by Model 1b

We repeat the same procedure for the data generated by model 1b:

yg = lm(@formula(Y1b ~ 0 + Z),data)

From the small $p$-value we can again reject the null hypothesis of no dependence between $Y$ and $Z$, and hence we conclude that the first test is passed.

Next we regress out $X$ from $Y$, and test the residuals for association with $Z$:

yx = lm(@formula(Y1b ~ 0 + X1b),data)
data.residYX1b = residuals(yx)
ygx = lm(@formula(residYX1b ~ 0 + Z),data)

This time we have to reject the null hypothesis of no association because of the small $p$-value, and we conclude that the second test has not been passed.

In conclusion, the data does not satisfy the sufficient condition for causality and we conclude (wrongly) that it has not been generated by a causal model with an edge $X\to Y$ (a false negative result).

Testing data generated by Model 6b

We repeat the procedure one more time for the data generated by model 6b:

yg = lm(@formula(Y6b ~ 0 + Z),data)

This time we cannot reject the null hypothesis of no dependence between $Y$ and $Z$, and hence we conclude that the first test has not passed.

We can conclude immediately that the data does not satisfy the sufficient condition for causality and we conclude (correctly) that it has not been generated by a causal model with an edge $X\to Y$ (a true negative result).

Your turn

Repeat the analysis of the sufficient condition for the data generated by model 3a and 3b. What do you conclude and are those conclusions correct or not?

Testing the necessary condition for $X\to Y$

For the necessary condition we have to test whether $Z$ and $Y$ are dependent and whether $X$ and $Y$ become independent after conditioning on $Z$. Assuming linear models, these tests can again be performed by least squares regression and testing for non-zero coefficients.

Testing data generated by Model 1b

As before, we test whether $Z$ and $Y$ are dependent by regressing $Y$ on $Z$:

yg = lm(@formula(Y1b ~ 0 + Z),data)

From the small $p$-value we can again reject the null hypothesis of no dependence between $Y$ and $Z$, and hence we conclude that the first test is passed.

Next we also regress $X$ on $Z$, and test the residuals of both regressions for association with each other:

xg = lm(@formula(X1b ~ 0 + Z),data)
data.residGY1b = residuals(yg)
data.residGX1b = residuals(xg)
ygx = lm(@formula(residGY1b ~ 0 + residGX1b),data)

From the small $p$-value we can reject the null hypothesis of no dependence between the residuals, and hence we conclude that the second test has also been passed.

In conclusion, the data satisfiess the necessary condition for causality and we conclude (correctly) that it has been generated by a causal model with an edge $X\to Y$ (a true positive result).

Testing data generated by Model 3a

We repeat the same procedure for the data generated by model 3a:

yg = lm(@formula(Y3a ~ 0 + Z),data)

From the small $p$-value we can again reject the null hypothesis of no dependence between $Y$ and $Z$, and hence we conclude that the first test is passed.

Next we again regress $X$ on $Z$, and test the residuals of both regressions for association with each other:

xg = lm(@formula(X3a ~ 0 + Z),data)
data.residGY3a = residuals(yg)
data.residGX3a = residuals(xg)
ygx = lm(@formula(residGY3a ~ 0 + residGX3a),data)

This time we cannot reject the null hypothesis of no dependence between the residuals, and hence we conclude that the second test has not been passed.

In conclusion, the data does not satisfy the necessary condition for causality and we conclude (correctly) that it has not been generated by a causal model with an edge $X\to Y$ (a true negative result).

Testing data generated by Model 3b

We repeat the same procedure for the data generated by model 3b:

yg = lm(@formula(Y3b ~ 0 + Z),data)

From the small $p$-value we can again reject the null hypothesis of no dependence between $Y$ and $Z$, and hence we conclude that the first test is passed.

Next we again regress $X$ on $Z$, and test the residuals of both regressions for association with each other:

xg = lm(@formula(X3b ~ 0 + Z),data)
data.residGY3b = residuals(yg)
data.residGX3b = residuals(xg)
ygx = lm(@formula(residGY3b ~ 0 + residGX3b),data)

Because of the small $p$-value we must reject the null hypothesis of no dependence between the residuals, and hence we conclude that the second test has also passed.

In conclusion, the data satisfies the necessary condition for causality and we conclude (wrongly) that it has been generated by a causal model with an edge $X\to Y$ (a false positive result).

Your turn

Repeat the analysis of the sufficient condition for the data generated by model 1a and 6b. What do you conclude and are those conclusions correct or not?

Some comments and further reading

Which condition for causality to choose?

Despite the simplicity of these examples they show the full spectrum of what to expect from these approaches to causal inference from molecular QTL data. It is important to stress that the false negative and false positive findings are not due to any errors in the tests themselves, misspecifications of the model (the models used to generate and fit the data are identical) or measurement noise (none was added in the simulations). Instead it is a basic mathematical truth that a condition for causality that is sufficient but not necessary will be prone to false negative predictions, whereas a condition that is necessary but not sufficient will be prone to false positive predictions.

Whether to use the sufficient or the necessary condition in concrete applications depends on a number of considerations:

How to quantify statistical significance and uncertainty?

In the examples above, we basically eye-balled the $p$-values and gave thumbs-up or thumbs-down for a causal relation. Not only is this practically infeasible when testing causality between ten-thousands of pairs in omics data, we would also like to quantify the statistical significance and uncertainty of each finding. This turns out to be a non-trivial question, because:

  • Both the sufficient and necessary condition are combinations of statistical tests.
  • The necessary condition includes a test where accepting the null hypothesis is a positive finding, but a non-significant $p$-value (no evidence for rejecting the null hypothesis) does not imply significant evidence against the alternative hypothesis.

One strand of research has focused on summarizing the outcome of multiple hypothesis tests by a single $p$-value with the usual interpretation that a small $p$-value is evidence for rejecting the null hypothesis of no causal interaction, see these papers:

However, to overcome in particular the second problem above, these methods have to introduce pairwise model comparison statistics almost by stealth. A much better approach (in my not so objective opinion!) is to do this explicitly in a Bayesian framework.

In the papers below, each of the component tests (for instance, $¬(Z⫫Y)$ or $Z⫫Y\mid X$ in the sufficient condition) is expressed as a likelihood-ratio test between two nested, null and alternative, models. Using $q$-values, the likelihood-ratio test statistics are converted into probabilities of the null or alternative model being true. These probabilities can then be combined by the usual rules of probability theory (e.g., multiplied to express that two tests must be true).

In addition, the second paper provides a software implementation that is highly efficient, and can test both the sufficient and the necessary condition for causality (all other papers essentially only study the sufficient condition).

Your turn?

If you are convinced by the above arguments and have some molecular QTL data waiting to be analyzed by causal inference, try our Findr software.

Assignment

6 - Graphical models

Gene regulatory networks. Bayesian networks. Other network inference methods.

6.1 - Gene regulatory networks

Gene regulatory networks.

An integrative genomics approach to the reconstruction of gene networks in segregating populations

Figure 1

Figure 3

Large-Scale Mapping and Validation of E. coli Transcriptional Regulation from a Compendium of Expression Profiles

Software

Figure 1

Figure 2

Figure 5

Inferring Regulatory Networks from Expression Data Using Tree-Based Methods

Software

Genie3 (Python, Matlab, R)

Figure 1

Figure 4

6.2 - Bayesian networks

Bayesian networks.

Inferring Cellular Networks Using Probabilistic Graphical Models

Figure 1

Figure 3

A crash course in Bayesian networks

In Bayesian networks, the joint distribution over a set ${X_1,\dots, X_p}$ of random variables is represented by:

  • a directed acyclic graph (DAG) $\cal G$ with the variables as vertices,
  • a set of conditional probability distributions, $$\begin{aligned} P\bigl(X_i \mid {X_j \mid j\in \mathrm{Pa}_i}\bigr), \end{aligned}$$ where $\mathrm{Pa}_i$ is the set of parents of vertex $i$ in $\mathcal{G}$,

such that $$\begin{aligned} P(X_1,\dots,X_p) = \prod_{i=1}^p P\bigl(X_i \mid {X_j \mid j\in \mathrm{Pa}_i}\bigr) \end{aligned}$$

In GRN reconstruction:

  • $X_i$ represents the expression level of gene $i$,
  • $P(X_1,\dots,X_p)$ represents the joint distribution from which (independent) experimental samples are drawn,
  • $\mathcal{G}$ represents the unknown GRN.

Assume we have a set of $N$ observations $x_1,\dots,x_N\in \mathbb{R}^p$ of $p$ variables, collected in the $N\times p$ matrix $\mathbf{X}$.

Assume we know $\mathcal{G}$ and the conditional distributions. Then the likelihood of observing the data is:

$$\begin{aligned} P(\mathbf{X}\mid \mathcal{G}) &= \prod_{k=1}^N P(X_{k1},\dots,X_{kp}\mid \mathcal{G})\\ &= \prod_{i=1}^p \prod_{k=1}^N P\bigl(X_{ki} \mid {X_{kj} \mid j\in \mathrm{Pa}_i}\bigr) \end{aligned}$$

We can now use an iterative algorithm to optimize $\mathcal{G}$ and the conditional distributions:

  • Start with a random graph $\mathcal{G}$.
  • Given $\mathcal{G}$, the likelihood decomposes in a product of independent likelihoods, one for each gene, and the conditional distributions can be optimized by standard regression analysis.
  • In the next iterations, randomly add, delete, or reverse edges in $\mathcal{G}$, as long as the likelihood improves.

A more formal approach to optimizing $\mathcal{G}$ uses Bayes’ theorem: $$\begin{aligned} P(\mathcal{G}\mid \mathbf{X}) = \frac{P(\mathbf{X}\mid \mathcal{G}) P(\mathcal{G})}{P(\mathbf{X})} \end{aligned}$$

  • $P(\mathcal{G})$ represents the prior distribution: even without seeing any data, not all graphs need to be equally likely a priori.
  • $P(\mathbf{X})$ represents the marginal distribution: $P(\mathbf{X}) = \sum_{\mathcal{G}’} P(\mathbf{X}\mid \mathcal{G}’) P(\mathcal{G}’)$. It is independent of $\mathcal{G}$ and can be ignored.

We can use $P(\mathcal{G})$ to encode evidence for causal interactions from integrating genomics and transcriptomics data. This is the main idea from Zhu et al. (2004).

Assignment

6.3 - Tutorial

Illustration of gene regulatory network reconstruction and evaluation using data from E. coli.

Tutorials are available as Pluto notebooks. To run the reactive notebooks, you need to first follow the instructions in the “Code” section of the repository’s readme file, then follow the instruction at the bottom of the Pluto homepage.

7 - Spatio-temporal models

Spatial and temporal gene expression. Gaussian processes.

7.1 - Spatial and temporal gene expression

Spatial and temporal gene expression.

Spatially resolved, highly multiplexed RNA profiling in single cells

Figure 1

MERFISH = multiplexed error-robust FISH (fluorescence in situ hybridization), a single-molecule imaging approach that allows the copy numbers and spatial localizations of thousands of RNA species to be determined in single cells.

Each RNA labelled with set of encoding probes, which contain:

  • targeting sequences that bind the RNA
  • readout sequences that bind fluorescently labeled readout probes

Each RNA assigned a binary word in a modified, error-robust Hamming code:

  • Sequences have $N$ bits, $N$ number of fluorescent labels
  • Sequences have exactly four 1’s. Same number of ones guarantees same overall error rate in situation where $1\to 0$ and $0\to 1$ error rates are different.
  • Sequences have a mutual Hamming distance of at least 4.

Hamming codes can do 1-bit error correction, see example on wikipedia. Higher calling rate and lower misidentification rate at the cost of encoding fewer RNA species with a given number of bits.

Figure 2

7.2 - Variance component models and Gaussian processes

Variance component models and Gaussian processes.

Linear regression, from fixed to random coefficients

In a standard linear regression model for an outcome $Y$ on one or more predictors $X_1,\dots,X_p$

$$ \begin{aligned} Y &= \sum_i\beta_i X_i + \epsilon = X^T\beta + \epsilon, \qquad \epsilon \sim \mathcal{N}(0,\sigma_e^2) \end{aligned} $$

the effects $\beta_i$ are treated as fixed, such that the distribution of $Y$ is:

$$ p(Y\mid X) = \mathcal{N}(X^T\beta,\sigma_e^2), $$

that is, in a fixed effect model, the predictors affect the average or expected value of $Y$. In a fixed effect model, we always assume that the data to fit the parameters are obtained from independent samples of an underlying model.

If our data consists of non-independent samples (e.g. due to temporal, spatial or familial relations among the observations), we can use a random effect model to explain these correlations as a function of covariates $X$. As before, we collect the data for the inputs and ouptput in a $N\times p$ matrix $\mathbf{X}$ and $N$-vector $\mathbf{y}$, respectively, $$\begin{aligned} \mathbf{X}= (x_1,x_2,\dots,x_p) = \begin{pmatrix} x_{11} & x_{12} & \dots & x_{1p}\\ x_{21} & x_{22} & \dots & x_{2p}\\ \vdots & \vdots & & \vdots\\ x_{N1} & x_{N2} & \dots & x_{Np} \end{pmatrix} && \mathbf{y}= \begin{pmatrix} y_1\\ y_2\\ \vdots\\ y_N \end{pmatrix} \end{aligned}$$

We model the data as

$$ \begin{aligned} \mathbf{y} &= \mathbf{X} \alpha + \epsilon\ \end{aligned} $$

where $\alpha = (\alpha_1,\dots,\alpha_p)^T$ are random effects with multi-variate normal distribution

$$ \alpha \sim \mathcal{N}(0,\sigma_a^2 I_p) $$

and $\epsilon = (\epsilon_1,\dots,\epsilon_N)$ are independent error terms

$$ \epsilon \sim \mathcal{N}(0,\sigma_e^2 I_N). $$

The random effects are assumed to be independent or the errors. We assumed the random effects are mutually independent, but this can easily be generalized.

Using properties of the multivariate normal distribution, the distribution of $\mathbf{y}$ can be seen to be:

$$ p(\mathbf{y}\mid \mathbf{X}) = \mathcal{N}(0, \sigma_a^2 \mathbf{X}\mathbf{X}^T + \sigma_e^2 I_N) $$

and we see that $\mathbf{X}$, or more precisely the covariance matrix $\mathbf{X}\mathbf{X}^T$ indeed models correlations among the samples in $\mathbf{y}$.

Naturally, fixed and random effects can be combined into a so-called mixed model.

Models of this kind are often used in genetics, where the random effect covariates $\mathbf{X}$ are genetic markers and their covariance matrix $\mathbf{X}\mathbf{X}^T$ expresses the genetic similarity between individuals in the study.

So far we assumed that all random effects had the same variance $\sigma_a^2$. If we assume that each $\alpha_j$ has a different variance $\sigma_j^2$, we would obtain

$$ p(\mathbf{y}\mid \mathbf{X}) = \mathcal{N}(0, \mathbf{K}) $$

where

$$ \mathbf{K} = \sum_{j=1}^p \sigma_j^2 \mathbf{x}_j \mathbf{x}_j^T + \sigma_e^2 I_N $$

that is, $\sigma_j^2$ measures the contribution of covariate $X_j$ to the correlations among samples of $Y$.

To estimate the variance parameters, maximum-likelihood is employed, that is, we find the values of the $\sigma_j^2$ and $\sigma_e^2$ which maximize

$$ \log p(\mathbf{y}) = -\frac12 \log \det (\mathbf{K} ) - \frac12 \mathbf{y}^T \mathbf{K}^{-1} \mathbf{y} - \frac{N}2 \log(2\pi) $$

This is a difficult, non-convex optimization problem which requires the use of numerical, gradient-based optimizers.

Note that the covariance matrix satisfies

$$ \begin{aligned} \mathbb{E}(y_i y_j) = K_{ij} \end{aligned} $$ and hence the total variance of $\mathbf{y}$ can be written as

$$ \begin{aligned} \mathbb{E}(\mathbf{y}^T\mathbf{y}) = \sum_i \mathbb{E}(y_i^2) = \sum_i K_{ii} = \mathrm{tr}(K) \end{aligned} $$

From the definition of $\mathbf{K}$, its trace can be written as

$$ \mathrm{tr}(K) = \sum_{j=1}^p \sigma_j^2 \mathbf{x}_j^T \mathbf{x}_j + N \sigma_e^2. $$

Therefore we say that

$$ \frac{\sigma_j^2 \mathbf{x}_j^T \mathbf{x}_j}{\mathrm{tr}(K) } $$

is the variance in $\mathbf{y}$ explained by variance component $\mathbf{x}_i$.

Kernel-based variance component models

Since the posterior distribution $p(\mathbf{y}\mid \mathbf{X})$ only depends on the matrix $\mathbf{K}$, an immediate generalization is to define variance component models directly in terms of kernel matrices instead of starting from a linear, random effect model, namely define

$$ p(\mathbf{y}) = \mathcal{N}(0, \mathbf{K}) $$

For instance, if the elements (samples) $y_i$ of $\mathbf{y}$ are obtained at specific locations (e.g. pixels in an image) $\mathbf{x_i}\in \mathbb{R}^2$ (or $\mathbb{R}^p$ more generally), we can define

$$ K_{ij} = k(\mathbf{x}_i,\mathbf{x}_j) $$

where the kernel function $k$ is a function that measures the similarity or closeness between samples.

Generalizing from before, the kernel matrix $\mathbf{K}$ can consist of multiple components,

$$ \mathbf{K} = \sum_{j=1}^p \sigma_j^2 \mathbf{K}_j + \sigma_e^2 I_N $$

and the variance explained by the $j^{\text{th}}$ component is

$$ \frac{\sigma_j^2\mathrm{tr}(\mathbf{K}_j)}{\mathrm{tr}(\mathbf{K})} $$

Gaussian processes

In many applications, the “positions” $\mathbf{x}_i$ where samples are obtained are not fixed, but a finite subset of a possibly infinite, continuous range. For instance, when studying a dynamic process, we typically obtain noisy measurements $y_i$ at a finite number of time points $t_i$, and are interested in the entire underlying process $y(t)$ for all times $t$ in some time interval. Similarly, we may have measurements $y_i$ at a finite number of locations $\mathbf{x}_i$, and are interested in the entire function $y(\mathbf{x})$ for all positions $\mathbf{x}$ in some spatial region.

A Gaussian process is a stochastic process, to be understood as a probability distribution over functions $y(\mathbf{x})$ over some continuous domain $D$, such that the set of values of $y(\mathbf{x})$ evaluated at a finite set of points $\mathbf{x}_1,\dots,\mathbf{x}_N\in D$ are jointly normally distributed. A Gaussian process is specified by a kernel function $k(\mathbf{x},\mathbf{x}’)$, for $\mathbf{x},\mathbf{x}’ \in D$. Writing $\mathbf{y} = (y(\mathbf{x}_1), \dots, y(\mathbf{x}_N))$, the kernel defines the probability distribution

$$ p(\mathbf{y}) = \mathcal{N}(0, \mathbf{K}) $$

where

$$ K_{ij} = k(\mathbf{x}_i,\mathbf{x}_j) $$

Hence, for the purposes of parameter estimation, the Gaussian process and variance component viewpoints are identical, which is why the term are often used interchangeably. The main difference lies in the interpretation, where Gaussian process see the data as a finite set of samples from a continuous space. Hence in the Gaussian process viewpoint, we would be particularly interested in interpolation, predicting expected values of the function $y(\mathbf{x})$ at locations $\mathbf{x}$ that were not part of the initial (training) data.

Keeping the notation $\mathbf{y}$ for the function values at the training positions $\mathbf{x}_i$, we know that the joint distribution of $\mathbf{y}$ and the value of $y(\mathbf{x})$ at a new position $\mathbf{x}$ is given by

$$ p\left( \begin{bmatrix} \mathbf{y}\\ y(\mathbf{x}) \end{bmatrix} \right) = \mathcal{N}(0, \mathbf{K}’) $$

where $\mathbf{K}’$ is the $(N+1)\times (N+1)$ matrix

$$ \mathbf{K}’ = \begin{bmatrix} \mathbf{K} & \mathbf{k}\\ \mathbf{k}^T & c \end{bmatrix} $$

where

$$ \begin{aligned} \mathbf{K} &= \left[ k(\mathbf{x}_i,\mathbf{x}_j) \right] \in \mathbb{R}^{N\times N}\\ \mathbf{k} &= k(\mathbf{x}_i,\mathbf{x}) \in \mathbb{R}^{N} \\ c &= k(\mathbf{x},\mathbf{x})\in \mathbb{R} \end{aligned} $$

Using properties of the multivariate normal distribution, the conditional distribution of the unseen value given the training data is:

$$ \begin{aligned} p\left( y(\mathbf{x}) \mid \mathbf{y} \right) &= \mathcal{N}(\mu,\sigma^2)\\ \mu &= \mathbf{k}^T \mathbf{K}^{-1}\mathbf{y}\\ \sigma^2 &= c - \mathbf{k}^T \mathbf{K}^{-1} \mathbf{k} \end{aligned} $$

Generalization to interpolation to multiple points is immediate.

Assignment

8 - Appendix

Gene regulation. Probability theory. Linear algebra. Optimization.

8.1 - Gene regulation

Gene regulation.

From genotype to phenotype

To understand how genetic, epigenetic and gene expression variation cause variation in health and disease traits in human, data-driven biology seeks to learn molecular mechanisms from interconnected omics data.

How does information flow from the genotype to the phenotype?

Gene expression determines cellular states

  • Genes are transcribed into mRNA and translated into protein.
  • Every step in this process can be regulated by genetic or environmental factors.
  • The repertoire and relative levels of proteins expressed determines cellular identity, fate, cell-to-cell communication, etc.
  • Variation in cellular properties causes variation in phenotypes.

Genetic variation causes phenotype variation by affecting gene expression.

How does genetic variation affect gene expression?

Genetic variation influences gene regulation

Genetic variation directly affects DNA methylation, binding of epigenetic and transcription factors and 3D chromatin organization, leading to altered transcription.

Genetic variation informs on local regulatory mechanisms

Coupling genetic variation data with information on TF binding events allows to predict functional effects of motif disruption and cooperative and collaborative TF binding mechanisms in a locus-specific manner.

Genetic variation informs on distal regulatory mechanisms

Genotypes of regulatory variants are fixed from birth and act as experimental instruments to inhibit or promote expression of regulatory factors independent of environmental confounders.

Genes are organized in hierarchical, multi-tissue causal networks

  • Variation in expression of one gene has downstream consequences on expression of other genes.
  • Example: Introduction of just 4 TFs (“Yamanaka factors”) converts adult cells into stem cells.
  • Hundreds to thousands of genes are differentially expressed between cellular states (e.g. healthy vs. disease).
  • Gene expression in one tissue can affect gene expression in other tissues.
  • Phenotype variation also causes gene expression variation (“reverse causation”).

Reconstruction of causal pathwas and gene networks is essential to understand how the genotype determines the phenotype.

Systems genetics in human requires large data and new analytical methods

Instead of direct experimental manipulation of DNA elements or regulatory factors, the relation between genetic variation, gene regulation, gene expression, and (disease) phenotypes in human must be studied by:

  • Statistical sampling of the causes (genotypes) and consequences (gene expression) of gene regulation using genome-wide measurements in unrelated individuals.
  • Mapping associations between data measured at multiple scales.
  • Using statistical and machine learning approaches to reconstruct models of molecular mechanisms.

8.2 - Probability theory

Probability theory.

8.3 - Linear algebra

Linear algebra.

8.4 - Optimization

Optimization.

9 - Contribution Guidelines

How to contribute to the docs

These basic sample guidelines assume that your Docsy site is deployed using Netlify and your files are stored in GitHub. You can use the guidelines “as is” or adapt them with your own instructions: for example, other deployment options, information about your doc project’s file structure, project-specific review guidelines, versioning guidelines, or any other information your users might find useful when updating your site. Kubeflow has a great example.

Don’t forget to link to your own doc repo rather than our example site! Also make sure users can find these guidelines from your doc repo README: either add them there and link to them from this page, add them here and link to them from the README, or include them in both locations.

We use Hugo to format and generate our website, the Docsy theme for styling and site structure, and Netlify to manage the deployment of the site. Hugo is an open-source static site generator that provides us with templates, content organisation in a standard directory structure, and a website generation engine. You write the pages in Markdown (or HTML if you want), and Hugo wraps them up into a website.

All submissions, including submissions by project members, require review. We use GitHub pull requests for this purpose. Consult GitHub Help for more information on using pull requests.

Quick start with Netlify

Here’s a quick guide to updating the docs. It assumes you’re familiar with the GitHub workflow and you’re happy to use the automated preview of your doc updates:

  1. Fork the Goldydocs repo on GitHub.
  2. Make your changes and send a pull request (PR).
  3. If you’re not yet ready for a review, add “WIP” to the PR name to indicate it’s a work in progress. (Don’t add the Hugo property “draft = true” to the page front matter, because that prevents the auto-deployment of the content preview described in the next point.)
  4. Wait for the automated PR workflow to do some checks. When it’s ready, you should see a comment like this: deploy/netlify — Deploy preview ready!
  5. Click Details to the right of “Deploy preview ready” to see a preview of your updates.
  6. Continue updating your doc and pushing your changes until you’re happy with the content.
  7. When you’re ready for a review, add a comment to the PR, and remove any “WIP” markers.

Updating a single page

If you’ve just spotted something you’d like to change while using the docs, Docsy has a shortcut for you:

  1. Click Edit this page in the top right hand corner of the page.
  2. If you don’t already have an up to date fork of the project repo, you are prompted to get one - click Fork this repository and propose changes or Update your Fork to get an up to date version of the project to edit. The appropriate page in your fork is displayed in edit mode.
  3. Follow the rest of the Quick start with Netlify process above to make, preview, and propose your changes.

Previewing your changes locally

If you want to run your own local Hugo server to preview your changes as you work:

  1. Follow the instructions in Getting started to install Hugo and any other tools you need. You’ll need at least Hugo version 0.45 (we recommend using the most recent available version), and it must be the extended version, which supports SCSS.

  2. Fork the Goldydocs repo repo into your own project, then create a local copy using git clone. Don’t forget to use --recurse-submodules or you won’t pull down some of the code you need to generate a working site.

    git clone --recurse-submodules --depth 1 https://github.com/google/docsy-example.git
    
  3. Run hugo server in the site root directory. By default your site will be available at http://localhost:1313/. Now that you’re serving your site locally, Hugo will watch for changes to the content and automatically refresh your site.

  4. Continue with the usual GitHub workflow to edit files, commit them, push the changes up to your fork, and create a pull request.

Creating an issue

If you’ve found a problem in the docs, but you’re not sure how to fix it yourself, please create an issue in the Goldydocs repo. You can also create an issue about a specific page by clicking the Create Issue button in the top right hand corner of the page.

Useful resources