Machine annotation of genome and transcriptome data
Item status: Restricted Access
Embargo end date: 31/12/2100
A key research topic in post-genomic studies is the annotation of genes with respect to their specific functions and biological processes, which helps us understand the precise role that a gene or a group of genes plays. In this thesis, I developed techniques to annotate genes automatically, both at the level of single genes and at the level of groups of genes. These techniques are shown to improve our understanding of biological systems and diseases, and will aid drug discovery. In the first project, I aimed for precise annotation of single genes. In the second and third projects, I annotated groups of genes using pathway knowledge, approaching the problem from a supervised and an unsupervised learning perspective, respectively. The main contributions of the work are as follows.
In the gene annotation project, I built an automated scheme to reconcile the differences in terms produced by different automated annotation services. The method leaves fewer than 20% of the annotations for manual work. It generalizes to other species at a similar level, again leaving fewer than 20% of the annotations for manual inspection. In addition, in the E. coli genome annotation task, fewer than 10% of the results differ in function from the EcoCyc results. Overall, the method can significantly reduce the human effort needed to resolve inconsistent gene annotations (6 months’ work by several biologists for a bacterial genome).
I then turned to the current limitations of pathway analysis and presented a novel approach to pathway discovery. Enrichment analysis is the most popular approach for mapping gene expression profiles from genes to biological pathways. It is a powerful tool for identifying pathways enriched in differentially expressed genes; however, it cannot discover active or inhibited pathways. In this study, I attempted to resolve this issue through integrative classification of KEGG and transcription factor (TF) gene sets.
I assumed that pathways with good classification performance should be considered active or inhibited pathways. Based on this hypothesis, I built a generic approach that incorporates two types of biological data for active pathway discovery. The experimental results show that integrating transcription factor data boosts classification performance. In addition, the method identified relevant biological pathways that are highly associated with tumorigenesis and tumour development, such as cancer, inflammation and metabolic pathways, but that are ignored by Gene Set Enrichment Analysis. Furthermore, the method achieves classification performance comparable to the best reported results.
Lastly, I performed a subtyping analysis of rheumatoid arthritis patients based on gene expression profiles. I revalidated the two patient clusters in two independent cohorts. The experimental results indicate that the subgroup structure does not correspond to drug response status. In addition, I developed a pathway-level subtyping approach, which yielded the same number of clusters as the gene-level clustering. The pathway clustering results show that one group of patients has high proliferation and a low inflammatory response, while the other group shows the reverse trend. This suggests that designing drugs with a better trade-off between anti-inflammatory and anti-proliferative effects for specific patient subgroups may achieve better clinical outcomes.
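The pathway-discovery hypothesis described above (gene sets on which samples can be classified well are treated as candidate active or inhibited pathways) can be illustrated with a minimal sketch. The abstract does not state which classifier or validation scheme the thesis uses, so the nearest-centroid classifier, the leave-one-out validation, the function names and the toy data below are all assumptions made for illustration only.

```python
from statistics import mean


def centroid(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [mean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def loo_accuracy(samples, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier.

    Assumption: this simple classifier stands in for whatever
    classifier the thesis actually uses.
    """
    correct = 0
    for i, (x, y) in enumerate(zip(samples, labels)):
        # Train on every sample except the held-out one.
        train = [(s, l) for j, (s, l) in enumerate(zip(samples, labels)) if j != i]
        centroids = {
            lab: centroid([s for s, l in train if l == lab])
            for lab in sorted({l for _, l in train})
        }
        pred = min(centroids, key=lambda lab: sq_dist(x, centroids[lab]))
        correct += pred == y
    return correct / len(samples)


def rank_gene_sets(expression, labels, gene_sets):
    """Rank gene sets by how well their genes separate the sample classes.

    expression: dict mapping gene name -> list of values, one per sample.
    gene_sets:  dict mapping set name -> list of gene names.
    """
    scores = {}
    for name, genes in gene_sets.items():
        present = [g for g in genes if g in expression]
        if not present:
            continue  # no measured genes for this set
        # Build one feature vector per sample from the set's genes.
        samples = [list(vals) for vals in zip(*(expression[g] for g in present))]
        scores[name] = loo_accuracy(samples, labels)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: 3 tumour and 3 normal samples, 4 genes.
labels = ["tumour"] * 3 + ["normal"] * 3
expression = {
    "g1": [5.0, 5.2, 4.9, 1.0, 1.1, 0.9],
    "g2": [4.0, 4.1, 3.9, 2.0, 2.1, 1.9],
    "g3": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0],
    "g4": [3.0, 1.0, 3.0, 1.0, 3.0, 1.0],
}
gene_sets = {"set_A": ["g1", "g2"], "set_B": ["g3", "g4"]}  # placeholder names

ranked = rank_gene_sets(expression, labels, gene_sets)
for name, acc in ranked:
    print(f"{name}: leave-one-out accuracy = {acc:.2f}")
```

In this toy example, the gene set whose genes differ between the two classes ranks above the set whose genes do not. Integrating a second data type, as the abstract describes for KEGG and TF gene sets, could be approximated here by passing both kinds of gene sets in the same `gene_sets` dictionary, though the thesis's actual integration scheme is not given in this abstract.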