Data-driven cross-validation to identify novel therapeutic targets in a diverse range of human diseases
Abstract
A common experimental output in biomedical science is a list of genes implicated in a given
biological process or disease. Gene lists resulting from a group of studies answering the
same, or similar, questions can be combined by ranking aggregation methods to find a consensus
or a more reliable answer. In this thesis, a new algorithm called Meta-Analysis by
Information Content (MAIC) was refined to deal with the ranking aggregation problem for
gene lists with various realistic data features, including a mix of ranked and unranked lists,
varying noise levels and heterogeneous quality. Then, to explore the mathematical meaning
of MAIC results and potential improvements, we compared an expectation-maximization (EM)
algorithm and a Blocked Gibbs method with MAIC. Since the properties of a dataset can
influence the performance of an algorithm, a ranking aggregation method should be evaluated on
the specific type of data it will be applied to before its results are relied upon. Such evaluation on
genomic data is usually based on simulated datasets because of the lack of a known truth
for real data. However, simulated datasets tend to be too small compared to experimental
data and neglect key features, including heterogeneity of quality, relevance and the inclusion
of unranked lists. In this thesis, two versions of stochastic generative models were proposed
to emulate real genomic data, with varying heterogeneity of quality and noise level, and a mix of
unranked and ranked lists. A group of existing methods and their variations that are suitable
for meta-analysis of gene lists were implemented based on existing code and compared using
simulated and real data. In addition to the evaluation with simulated data, a comparison using
real genomic data on the SARS-CoV-2 virus, cancer (NSCLC), and bacteria (macrophage
apoptosis) was performed. We summarise our evaluation results as a simple flowchart
for selecting a ranking aggregation method for genomic data. Some ranking aggregation methods
can accept a clustering of sources as additional input to account for biases shared by
specific types of sources. A clustering validation method is proposed in this thesis to help
assign categories to sources.
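
As an illustration of the kind of simulation described above, the following minimal sketch generates gene lists from sources of differing quality, half ranked and half unranked, and aggregates them with a plain vote count. It is not MAIC and it is not either of the generative models proposed in this thesis. The function names, the Gaussian noise model, and all parameter values are assumptions made only for this example.

```python
import random
from collections import defaultdict


def simulate_lists(n_genes=200, n_true=20, n_sources=6, list_len=30, seed=0):
    """Hypothetical generator (an assumption, not the thesis model).

    Each source sees a noisy score for every gene and reports its top
    `list_len` genes. Sources differ in noise level (quality), and every
    second source reports an unranked list.
    """
    rng = random.Random(seed)
    genes = [f"g{i}" for i in range(n_genes)]
    true_genes = set(genes[:n_true])  # known truth, available only in simulation
    sources = []
    for s in range(n_sources):
        noise = rng.uniform(0.5, 3.0)  # larger value = lower-quality source
        ranked = s % 2 == 0            # alternate ranked and unranked sources
        scores = {
            g: (1.0 if g in true_genes else 0.0) + rng.gauss(0.0, noise)
            for g in genes
        }
        top = sorted(genes, key=scores.get, reverse=True)[:list_len]
        if not ranked:
            rng.shuffle(top)           # discard the order information
        sources.append({"ranked": ranked, "noise": noise, "genes": top})
    return true_genes, sources


def vote_count(sources):
    """Baseline aggregation: order genes by the number of lists containing them.

    Rank positions are ignored, so ranked and unranked lists are treated alike.
    Ties are broken by gene name to keep the output deterministic.
    """
    counts = defaultdict(int)
    for src in sources:
        for g in src["genes"]:
            counts[g] += 1
    return sorted(counts, key=lambda g: (-counts[g], g))


if __name__ == "__main__":
    truth, sources = simulate_lists()
    consensus = vote_count(sources)
    recovered = len(truth & set(consensus[:20]))
    print(f"true genes among the top 20 of the consensus: {recovered}")
```

Because the truth set is known in the simulation, the recovered fraction can be compared across aggregation methods and across settings of noise and list length, which is not possible with real data.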