Machine learning-based approaches for functional variant classification across mammals
Authors
Zhao, Rongrong
Abstract
As a result of the continued growth of the world’s population, the demand for livestock products continues to grow. However, increasing livestock production results in more greenhouse gas emissions and greater pressure on scarce resources such as potable water and land. Therefore, it is of vital importance to improve the productivity of livestock, including through advanced genomics breeding approaches and genome editing, so that more can be produced without increasing animal numbers. The pivotal challenge of using advanced genomics breeding approaches is to identify the causal functional variants associated with the productivity traits of interest in livestock species. As in humans, genome-wide association studies (GWAS) have identified numerous genomic regions associated with diseases and traits in livestock, but it is difficult to determine the causal variants in these regions due to a range of factors such as linkage disequilibrium (LD). The overarching aim of my PhD was to utilize data-driven computational methods, such as machine learning, to improve the initial detection of novel functional variants in livestock species to ultimately enable the improvement of livestock breeding. This research focused on developing a reusable variant annotation pipeline for mammalian species with a broad range of features and demonstrating the utility of these features and machine learning approaches in predicting mammalian functional regulatory variants in both humans and cattle.
Datasets suitable for machine learning are largely lacking in livestock. To address this and facilitate a diverse range of downstream projects, I first developed a reusable variant annotation pipeline in Nextflow for use across platforms and species. The pipeline provides a wide range of annotations, including sequence conservation, gene annotations, sequence context, and predicted functional genomic data from other machine learning tools such as Enformer. These annotations can then be used in downstream variant analyses and employed as features in machine learning approaches for variant classification across species.
I then applied this pipeline to develop machine learning models for predicting whether functional human variants have direct orthologues in livestock species and may therefore be relevant to understanding livestock phenotypes. I demonstrate that it is possible to assign probabilities to whether a human variant will be found in other species from its annotations. Hundreds of human regulatory variants were identified with conserved functional impacts on gene expression in livestock species. This observation suggests it is possible to leverage information from well-annotated species, such as humans, to help with the prediction of regulatory variants and other functional variants in less well-annotated livestock species.
To explore the efficacy of using the annotation pipeline with machine learning approaches to predict functional variants, I applied them to the direct prediction of regulatory variants in humans and cattle. I compared the performance of various approaches to predicting cattle regulatory variants, both with and without annotations from humans. The models incorporating human annotations and those based only on cattle annotations performed comparably, with the model relying on cattle annotations performing slightly better.
Overall, the variant annotation pipeline and the machine learning models proposed in this thesis can be utilized to uncover the underlying characteristics of functional variants and prioritise functional variants related to important traits in livestock species for downstream genome editing or marker-assisted breeding.