Probabilistic and causal modelling of genome-scale data

Machine learning plays an important role in computational biology. See, for example, the Machine Learning in Computational and Systems Biology community or the Machine Learning in Computational Biology conference series.

These lecture notes focus on probabilistic machine learning methods for computational biology, where the experimental data are viewed as random samples from an underlying data-generating process.

The “probabilistic modelling” in the title refers to the use of abstract data-generating processes, not based on any specific biological mechanisms, and derived from generic models and methods. A typical example will be clustering using Gaussian mixture models.
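To make the idea of an abstract data-generating process concrete, here is a minimal sketch of clustering with a two-component, one-dimensional Gaussian mixture model fitted by expectation-maximization. It is not taken from the course material: the simulated data, the function names `normal_pdf` and `fit_gmm_1d`, the initialization, and the number of iterations are all assumptions chosen for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, n_iter=100):
    """Fit a two-component 1-D Gaussian mixture with a plain EM loop."""
    n = len(data)
    # Crude initialization (an assumption): means at the data extremes,
    # both variances equal to the overall variance, equal mixing weights.
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    means = [min(data), max(data)]
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to each component.
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means, and variances from the posteriors.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_k, 1e-6)  # guard against a collapsing component
    return weights, means, variances, resp


# Simulated data (hypothetical): two groups of samples with different means.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + \
       [rng.gauss(5.0, 1.0) for _ in range(200)]
true_labels = [0] * 200 + [1] * 200

weights, means, variances, resp = fit_gmm_1d(data)

# Hard cluster assignment: pick the component with the larger posterior.
labels = [0 if r[0] > r[1] else 1 for r in resp]
accuracy = sum(a == b for a, b in zip(labels, true_labels)) / len(data)
print("weights:", weights)
print("means:", means)
print("accuracy:", accuracy)
```

The model knows nothing about why the two groups differ. It only assumes that each sample was drawn from one of two Gaussian distributions, which is what makes it a generic rather than a mechanistic model.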

To speak of “causal modelling” will require something more, namely that the data-generating process is based on some qualitative prior knowledge or understanding of the true underlying biological process. A typical example will be path analysis.
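As a rough illustration of path analysis, the sketch below simulates a causal chain x → y → z with standardized variables and checks the tracing rule of the method of path coefficients: the correlation between x and z should be close to the product of the path coefficients along the chain. The chain structure, the effect sizes 0.8 and 0.5, the sample size, and the helper `corr` are assumptions made for this example, not part of the course material.

```python
import math
import random


def corr(a, b):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / n
    sd_a = math.sqrt(sum((ai - mean_a) ** 2 for ai in a) / n)
    sd_b = math.sqrt(sum((bi - mean_b) ** 2 for bi in b) / n)
    return cov / (sd_a * sd_b)


# Simulate the assumed causal chain x -> y -> z.
# Noise variances are chosen so that every variable has unit variance,
# which makes the regression coefficients directly comparable to correlations.
rng = random.Random(1)
n = 20000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [0.8 * xi + rng.gauss(0.0, 0.6) for xi in x]              # 0.64 + 0.36 = 1
z = [0.5 * yi + rng.gauss(0.0, math.sqrt(0.75)) for yi in y]  # 0.25 + 0.75 = 1

# In a chain where each variable has a single parent, the path coefficient
# equals the correlation between parent and child.
p_yx = corr(x, y)
p_zy = corr(y, z)

# Tracing rule: the model-implied correlation between x and z is the
# product of the path coefficients along the path connecting them.
r_xz = corr(x, z)
print("p_yx:", p_yx)
print("p_zy:", p_zy)
print("observed r_xz:", r_xz)
print("implied  r_xz:", p_yx * p_zy)
```

Unlike the mixture model, this calculation only makes sense once a causal diagram has been assumed. A different diagram over the same three variables would imply different relations among the same correlations.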

The notes are divided into chapters, each focusing on a specific class of methods:

  • Clustering
  • Regularized regression
  • Dimensionality reduction
  • Causal inference
  • Graphical models
  • Spatio-temporal models

Each chapter follows the same structure:

  • A “classic” biological or biomedical research paper is studied where the algorithm (or class of algorithms) of interest was first used. A more recent follow-up or related paper is given as a reading assignment.
  • The method used in the classic paper is presented in detail, along with additional methods to solve the same type of problem. The methods are put into practice in a programming assignment. Where possible, original data from the papers studied in the first part are used.

Four appendices contain the minimum required background knowledge on gene regulation, probability theory, linear algebra, and optimization.

The theoretical sections contain the basic information needed to understand a method. For more background, try the following textbooks (with free PDFs!), all used in the preparation of this course:

The use of classic or path-breaking papers is motivated by “Back to the future: education for systems-level biologists”. Since the field of genome-scale data analysis is relatively young, the choice of papers for study is still somewhat open and likely to evolve as the course matures.

These lectures are taught as part of the master's program in bioinformatics at UiB, making up about half of the BINF301 Genome-scale Algorithms course. As such, good background knowledge of basic bioinformatics and omics data is assumed.


Cluster analysis

Cancer subtypes. Combinatorial clustering. Mixture distributions.

Statistical significance for genome-wide studies

Statistical significance for genome-wide studies. False discovery rate estimation.

Regularized regression

Drug sensitivity prediction. Ridge, lasso, and elastic net regression.

Dimensionality reduction

Single-cell genomics. Probabilistic PCA. t-SNE and UMAP.

Causal inference

Genetics of gene expression. The method of path coefficients. False discovery control.

Graphical models

Gene regulatory networks. Bayesian networks. Other network inference methods.

Spatio-temporal models

Spatial and temporal gene expression. Gaussian processes.

Appendix

Gene regulation. Probability theory. Linear algebra. Optimization.

Contribution Guidelines

How to contribute to the docs

Last modified May 30, 2023: update Gaussian processes (ff6a6c2)