Improved Bayesian methods for detecting recombination and rate heterogeneity in DNA sequence alignments
Date
24/11/2011
Author
Mantzaris, Alexander Vassilios
Abstract
DNA sequence alignments are usually not homogeneous. Mosaic structures
may result as a consequence of recombination or rate heterogeneity. Interspecific
recombination, in which DNA subsequences are transferred between different
(typically viral or bacterial) strains, may result in a change in the topology of
the underlying phylogenetic tree. Rate heterogeneity corresponds to a change of
the nucleotide substitution rate. Various methods for simultaneously detecting
recombination and rate heterogeneity in DNA sequence alignments have recently
been proposed, based on complex probabilistic models that combine phylogenetic
trees with factorial hidden Markov models or multiple changepoint processes. The
objective of my thesis is to identify potential shortcomings of these models and
explore ways to improve them.
One shortcoming that I have identified is related to an approximation made in
various recently proposed Bayesian models. The Bayesian paradigm requires the
solution of an integral over the space of parameters. To render this integration
analytically tractable, these models assume that the vectors of branch lengths
of the phylogenetic tree are independent among sites. While this approximation
reduces the computational complexity considerably, I show that it leads to the
systematic prediction of spurious topology changes in the Felsenstein zone, that
is, the region of the branch-length configuration space where maximum parsimony
consistently infers the wrong topology due to long-branch attraction. I demonstrate
these failures by using two Bayesian hypothesis tests, based on an inter- and
an intra-model approach to estimating the marginal likelihood. I then propose a
revised model that addresses these shortcomings, and demonstrate its improved
performance on a set of synthetic DNA sequence alignments systematically generated
around the Felsenstein zone.
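
The long-branch attraction effect mentioned above can be illustrated with a small simulation. The following is only a minimal sketch, not the model or the test procedure used in the thesis: it assumes a four-taxon tree ((A,B),(C,D)) evolving under the Jukes-Cantor substitution model, with long branches leading to A and C and short branches elsewhere, and it counts the parsimony-informative site patterns that support each of the three possible unrooted topologies. All function names and branch lengths are illustrative assumptions.

```python
import math
import random

NUCLEOTIDES = "ACGT"


def jc69_evolve(base, branch_length, rng):
    """Evolve one nucleotide along a branch under the Jukes-Cantor model.

    branch_length is the expected number of substitutions per site.
    """
    p_change = 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))
    if rng.random() < p_change:
        return rng.choice([n for n in NUCLEOTIDES if n != base])
    return base


def simulate_site(long_len, short_len, rng):
    """Simulate one alignment column on the true tree ((A,B),(C,D)).

    A and C sit on long branches; B, D and the internal branch are short
    (an assumed Felsenstein-zone-like configuration when long_len >> short_len).
    """
    u = rng.choice(NUCLEOTIDES)           # internal node joining A and B
    a = jc69_evolve(u, long_len, rng)
    b = jc69_evolve(u, short_len, rng)
    v = jc69_evolve(u, short_len, rng)    # internal node joining C and D
    c = jc69_evolve(v, long_len, rng)
    d = jc69_evolve(v, short_len, rng)
    return a, b, c, d


def parsimony_support(n_sites, long_len, short_len, seed=0):
    """Count informative site patterns supporting each four-taxon topology."""
    rng = random.Random(seed)
    support = {"AB|CD": 0, "AC|BD": 0, "AD|BC": 0}
    for _ in range(n_sites):
        a, b, c, d = simulate_site(long_len, short_len, rng)
        if a == b and c == d and a != c:
            support["AB|CD"] += 1         # pattern xxyy: the true topology
        elif a == c and b == d and a != b:
            support["AC|BD"] += 1         # pattern xyxy: long branches grouped
        elif a == d and b == c and a != b:
            support["AD|BC"] += 1         # pattern xyyx
    return support


if __name__ == "__main__":
    print("long branches :", parsimony_support(2000, long_len=1.5, short_len=0.05))
    print("short branches:", parsimony_support(2000, long_len=0.05, short_len=0.05))
```

With the assumed long branches, most informative sites support the incorrect grouping of A with C, whereas with uniformly short branches the true grouping of A with B dominates. This shows only the parsimony artefact that defines the Felsenstein zone; it does not reproduce the Bayesian hypothesis tests described above.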
The core model explored in my thesis is a phylogenetic factorial hidden Markov
model (FHMM) for detecting two types of mosaic structures in DNA sequence
alignments, related to recombination and rate heterogeneity. The focus of my
work is on improving the modelling of the latter aspect. Earlier research efforts by
other authors have modelled different degrees of rate heterogeneity with separate
hidden states of the FHMM. Their work fails to appreciate the intrinsic difference
between two types of rate heterogeneity: long-range regional effects, which are
potentially related to differences in the selective pressure, and the short-range
periodic patterns within the codons, which merely capture the signature of the
genetic code.
I have improved these earlier phylogenetic FHMMs in two respects. Firstly,
by sampling the rate vector from the posterior distribution with reversible-jump
Markov chain Monte Carlo (RJMCMC), I
have made the modelling of regional rate heterogeneity more flexible, and I infer
the number of different degrees of divergence directly from the DNA sequence
alignment, thereby dispensing with the need to arbitrarily select this quantity
in advance. Secondly, I explicitly model within-codon rate heterogeneity via a
separate rate modification vector. In this way, the within-codon effect of rate
heterogeneity is imposed on the model a priori, which facilitates the learning of
the biologically more interesting effect of regional rate heterogeneity a posteriori.
I have carried out simulations on synthetic DNA sequence alignments, which have
borne out my conjecture. The existing model, which does not explicitly include
the within-codon rate variation, has to model both effects with the same modelling
mechanism. As expected, it fails to disentangle these two effects. In
contrast, I have found that my new model clearly separates within-codon rate
variation from regional rate heterogeneity, resulting in more accurate predictions.
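
The idea of separating the two effects can be sketched in a few lines. This is a hypothetical illustration rather than the thesis implementation: it assumes that the per-site substitution rate is the product of a regional rate and a period-3 factor that depends only on the codon position. The function name, the multiplicative form, the reading-frame offset, and all numeric values are assumptions made for the example.

```python
def effective_rates(regional_rates, codon_factors, frame_offset=0):
    """Combine per-site regional rates with a period-3 codon-position factor.

    regional_rates: one rate per alignment site (the regional effect).
    codon_factors:  three multipliers, one per codon position (assumed fixed).
    frame_offset:   codon position of the first site (assumed known).
    """
    if len(codon_factors) != 3:
        raise ValueError("codon_factors must contain exactly three values")
    return [
        rate * codon_factors[(site + frame_offset) % 3]
        for site, rate in enumerate(regional_rates)
    ]


# Hypothetical example: a slowly evolving region followed by a faster one,
# with the third codon position evolving fastest in both.
regional = [1.0] * 6 + [3.0] * 6
codon = (0.5, 0.25, 2.0)
rates = effective_rates(regional, codon)
```

Under this assumed form, the periodic pattern is fixed in advance by `codon_factors`, so only the regional component `regional_rates` would need to be learned from the alignment.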