dc.contributor.advisor: Baillie, John
dc.contributor.advisor: Gutmann, Michael
dc.contributor.advisor: Dockrell, David
dc.contributor.advisor: Fitzgerald, Jonathan
dc.contributor.author: Wang, Bo
dc.date.accessioned: 2023-05-15T10:36:55Z
dc.date.available: 2023-05-15T10:36:55Z
dc.date.issued: 2023-05-15
dc.identifier.uri: https://hdl.handle.net/1842/40562
dc.identifier.uri: http://dx.doi.org/10.7488/era/3327
dc.description.abstract: A common experimental output in biomedical science is a list of genes implicated in a given biological process or disease. Gene lists resulting from a group of studies answering the same, or similar, questions can be combined by ranking aggregation methods to find a consensus or a more reliable answer. In this thesis, a new algorithm called Meta-Analysis by Information Content (MAIC) was refined to deal with the ranking aggregation problem for gene lists with various realistic data features, including a mix of ranked and unranked lists, various noise levels and heterogeneity of quality. Then, to explore the mathematical meaning of MAIC results and potential improvements, we compared an expectation-maximization (EM) algorithm and a Blocked Gibbs method with MAIC. Since the properties of a dataset can influence the performance of an algorithm, a ranking aggregation method must be evaluated on a specific type of data before it is used, to establish its reliability on that data. Such evaluation on genomic data is usually based on simulated datasets because of the lack of a known truth for real data. However, simulated datasets tend to be too small compared to experimental data and neglect key features, including heterogeneity of quality, relevance and the inclusion of unranked lists. In this thesis, two versions of stochastic generative models were proposed to emulate real genomic data, with varying heterogeneity of quality and noise levels, and a mix of unranked and ranked data. A group of existing methods and their variations which are suitable for meta-analysis of gene lists were implemented based on existing code and compared using simulated and real data. In addition to the evaluation with simulated data, a comparison using real genomic data on the SARS-CoV-2 virus, cancer (NSCLC), and bacteria (macrophage apoptosis) was performed. We summarise our evaluation results in terms of a simple flowchart to select a ranking aggregation method for genomic data.
Some ranking aggregation methods can accept the clustering of sources as additional input to deal with the common bias of specific types of sources. A clustering validation method is proposed in this thesis to help assign categories to sources. [en]
dc.contributor.sponsor: MRC SHIELD consortium (DHD PI) [en]
dc.contributor.sponsor: Edinburgh Global Research Scholarship [en]
dc.language.iso: en [en]
dc.publisher: The University of Edinburgh [en]
dc.subject: Data-driven cross-validation [en]
dc.subject: Meta-Analysis by Information Content [en]
dc.subject: MAIC [en]
dc.subject: expectation-maximization (EM) algorithm [en]
dc.subject: Blocked Gibbs method [en]
dc.subject: SARS-CoV-2 virus [en]
dc.subject: macrophage apoptosis [en]
dc.title: Data-driven cross-validation to identify novel therapeutic targets in a diverse range of human diseases [en]
dc.type: Thesis or Dissertation [en]
dc.type.qualificationlevel: Doctoral [en]
dc.type.qualificationname: PhD Doctor of Philosophy [en]

