Machine learning for epigenetics: algorithms for next generation sequencing data
Date
02/07/2018
Item status
Restricted Access
Embargo end date
31/12/2100
Author
Mayo, Thomas Richard
Abstract
The advent of Next Generation Sequencing (NGS), a little over a decade ago, has
led to a vast and rapid increase in the generation of genomic data. The drastically
reduced cost has in turn enabled powerful modifications that can be used to investigate
not just genetic, but epigenetic, phenomena. Epigenetics refers to the study of
mechanisms affecting gene expression other than the genetic code itself and thus, at
the transcription level, incorporates DNA methylation, transcription factor binding
and histone modifications amongst others. This thesis outlines and tackles two major
challenges in the computational analysis of such data using techniques from machine
learning.
Firstly, I address the problem of testing for differential methylation between groups
of bisulfite sequencing data sets. DNA methylation plays an important role in genomic
imprinting, X-chromosome inactivation and the repression of repetitive elements,
as well as being implicated in numerous diseases, such as cancer. Bisulfite sequencing
provides single nucleotide resolution methylation data at the whole genome
scale, but a sensitive analysis of such data is difficult. I propose a solution that uses a
powerful kernel-based machine learning technique, the Maximum Mean Discrepancy,
to leverage well-characterised spatial correlations in DNA methylation, and adapt the
method for this particular use. I use this tailored method to analyse a novel data set
from a study of ageing in three different tissues in the mouse. This study motivates
further modifications to the method and highlights the utility of the underlying measure
as an exploratory tool for methylation analysis.
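
As a rough illustration of the underlying measure, the sketch below computes the standard biased estimate of the squared Maximum Mean Discrepancy between two groups of samples using a Gaussian kernel. It is not the tailored method developed in the thesis: the function names, the bandwidth value and the toy methylation proportions are all assumptions made for this example.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between samples xs and ys."""
    def mean_kernel(group_a, group_b):
        # Average kernel value over all pairs drawn from the two groups.
        total = sum(gaussian_kernel(u, v, bandwidth)
                    for u in group_a for v in group_b)
        return total / (len(group_a) * len(group_b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

# Hypothetical toy data: each vector holds methylation proportions
# at three neighbouring CpG sites in one sample.
young = [[0.90, 0.80, 0.85], [0.88, 0.82, 0.90]]
old = [[0.50, 0.40, 0.45], [0.55, 0.42, 0.50]]

same_group = mmd_squared(young, young)  # identical groups give zero
between_groups = mmd_squared(young, old)  # differing groups give a positive value
```

Treating each region as a vector of neighbouring sites is one simple way to let the kernel see spatially correlated methylation rather than single positions. In practice a significance threshold would typically come from permuting group labels, which is omitted here.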
Secondly, I address the problem of predictive and explanatory modelling of chromatin
immunoprecipitation sequencing data (ChIP-Seq). ChIP-Seq is typically used
to assay the binding of a protein of interest, such as a transcription factor or histone,
to the DNA, and as such is one of the most widely used sequencing assays. While
peak callers are a powerful tool in identifying binding sites of sparse and clean ChIP-Seq
profiles, broader signals defy analysis in this framework. Instead, generative
models that explain the data in terms of the underlying sequence can help uncover
mechanisms that predict binding or the lack thereof. I explore current problems
with ChIP-Seq analysis, such as zero-inflation and the use of the control experiment,
known as the input. I then devise a method for representing k-mers that enables
the use of longer DNA sub-sequences within a flexible model development framework,
such as generalised linear models, without heavy programming requirements.
Finally, I use these insights to develop an appropriate Bayesian generative model
that predicts ChIP-Seq count data in terms of the underlying DNA sequence, incorporating
DNA methylation information where available, fitting the model with the
Expectation-Maximization algorithm. The model is tested on simulated data and real
data pertaining to the histone mark H3K27me3.
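
To make the combination of zero-inflation and Expectation-Maximization concrete, the sketch below fits a plain zero-inflated Poisson mixture to a list of counts. This is a deliberately simplified stand-in and not the thesis model, which is Bayesian and conditions on the DNA sequence: the function name, the starting values and the toy counts are assumptions made for this example.

```python
import math

def fit_zip_em(counts, n_iter=200):
    """Fit a zero-inflated Poisson by EM.

    Returns (pi, lam): pi is the probability of a structural zero,
    lam is the Poisson rate of the count component.
    Assumes counts contains at least one non-zero value.
    """
    n = len(counts)
    total = sum(counts)
    pi = 0.5                      # assumed starting value
    lam = max(total / n, 1e-6)    # start from the overall mean
    for _ in range(n_iter):
        # E-step: posterior probability that an observed zero is structural.
        # Non-zero counts can only come from the Poisson component.
        resp_zero = pi / (pi + (1.0 - pi) * math.exp(-lam))
        n_zeros = sum(1 for y in counts if y == 0)
        expected_structural = resp_zero * n_zeros
        # M-step: update the mixing weight and the Poisson rate.
        pi = expected_structural / n
        lam = total / (n - expected_structural)
    return pi, lam

# Hypothetical toy counts per genomic bin: half the bins are empty.
toy_counts = [0, 0, 0, 0, 0, 0, 3, 4, 5, 2, 6, 4]
pi_hat, lam_hat = fit_zip_em(toy_counts)
```

Because some observed zeros are attributed to the Poisson component, the estimated `pi_hat` stays below the raw fraction of zeros and `lam_hat` stays above the raw mean of the counts.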
This thesis therefore straddles the fields of bioinformatics and machine learning.
Bioinformatics is both plagued and blessed by the plethora of different techniques
available for gathering data and their continual innovations. Each technique presents
a unique challenge, and hence out-of-the-box machine learning techniques have had
little success in solving biological problems. While I have focused on NGS data, the
methods developed in this thesis are likely to be applicable to future technologies,
such as Third Generation Sequencing methods, and the lessons learned in their adaptation
will be informative for the next wave of computational challenges.