Data-driven cross-validation to identify novel therapeutic targets in a diverse range of human diseases
A common experimental output in biomedical science is a list of genes implicated in a given biological process or disease. Gene lists from a group of studies addressing the same, or similar, questions can be combined by ranking aggregation methods to find a consensus or a more reliable answer. In this thesis, a new algorithm called Meta-Analysis by Information Content (MAIC) was refined to handle the ranking aggregation problem for gene lists with realistic data features, including a mix of ranked and unranked lists, varying noise levels and heterogeneity of quality. Then, to explore the mathematical meaning of MAIC results and potential improvements, we compared MAIC with an expectation-maximization (EM) algorithm and a blocked Gibbs method. Since the properties of a dataset can influence the performance of an algorithm, a ranking aggregation method should be evaluated on the specific type of data it will be applied to before it is used, so that its results can be relied upon. Such evaluation on genomic data is usually based on simulated data because no known truth is available for real data. However, simulated datasets tend to be too small compared to experimental data and neglect key features, including heterogeneity of quality, relevance and the inclusion of unranked lists. In this thesis, two versions of a stochastic generative model were proposed to emulate real genomic data, with varying heterogeneity of quality, noise levels, and a mix of unranked and ranked data. A group of existing methods suitable for meta-analysis of gene lists, together with their variations, were implemented based on existing code and compared using simulated and real data. In addition to the evaluation with simulated data, a comparison was performed using real genomic data on the SARS-CoV-2 virus, cancer (NSCLC), and bacteria (macrophage apoptosis). We summarise our evaluation results as a simple flowchart for selecting a ranking aggregation method for genomic data.
Some ranking aggregation methods can accept a clustering of sources as additional input to account for biases shared by specific types of sources. A clustering validation method is proposed in this thesis to help assign categories to sources.