Novel stochastic and entropy-based Expectation-Maximisation algorithm for transcription factor binding site motif discovery
View/ Open
Date
29/06/2015Author
Kilpatrick, Alastair Morris
Metadata
Abstract
The discovery of transcription factor binding site (TFBS) motifs remains an important
and challenging problem in computational biology. This thesis presents MITSU,
a novel algorithm for TFBS motif discovery which exploits stochastic methods as a
means of both overcoming optimality limitations in current algorithms and as a framework
for incorporating relevant prior knowledge in order to improve results.
The current state of the TFBS motif discovery field is surveyed, with a focus
on probabilistic algorithms that typically take the promoter regions of coregulated
genes as input. A case is made for an approach based on the stochastic Expectation-
Maximisation (sEM) algorithm; its position amongst existing probabilistic algorithms
for motif discovery is shown. The algorithm developed in this thesis is unique amongst
existing motif discovery algorithms in that it combines the sEM algorithm with a derived
data set which leads to an improved approximation to the likelihood function.
This likelihood function is unconstrained with regard to the distribution of motif occurrences
within the input dataset. MITSU also incorporates a novel heuristic to automatically
determine TFBS motif width. This heuristic, known as MCOIN, is shown to
outperform current methods for determining motif width. MITSU is implemented in
Java and an executable is available for download.
MITSU is evaluated quantitatively using realistic synthetic data and several collections
of previously characterised prokaryotic TFBS motifs. The evaluation demonstrates
that MITSU improves on a deterministic EM-based motif discovery algorithm
and an alternative sEM-based algorithm, in terms of previously established metrics.
The ability of the sEM algorithm to escape stable fixed points of the EM algorithm,
which trap deterministic motif discovery algorithms and the ability of MITSU to discover
multiple motif occurrences within a single input sequence are also demonstrated.
MITSU is validated using previously characterised Alphaproteobacterial motifs,
before being applied to motif discovery in uncharacterised Alphaproteobacterial data.
A number of novel results from this analysis are presented and motivate two extensions
of MITSU: a strategy for the discovery of multiple different motifs within a single
dataset and a higher order Markov background model. The effects of incorporating
these extensions within MITSU are evaluated quantitatively using previously characterised
prokaryotic TFBS motifs and demonstrated using Alphaproteobacterial motifs.
Finally, an information-theoretic measure of motif palindromicity is presented and its
advantages over existing approaches for discovering palindromic motifs discussed.