On combining collaborative and automated curation for enzyme function prediction
De Ferrari, Luna Luciana
Data generation has vastly exceeded manual annotation in several areas of astronomy, biology, economics, geology, medicine and physics. At the same time, a public community of experts and hobbyists has developed around some of these disciplines thanks to open, editable web resources such as wikis and public annotation challenges. In this thesis I investigate under which conditions a combination of collaborative and automated curation could complete annotation tasks unattainable by human curators alone. My exemplar curation process is taken from the molecular biology domain: the association of all existing enzymes (proteins catalysing a chemical reaction) with their function. Assigning enzymatic function to the proteins in a genome is the first essential problem of metabolic reconstruction, important for biology, medicine, industrial production and environmental studies. In the protein database UniProt, only 3% of the records are currently manually curated and only 60% of the 17 million recorded proteins have some functional annotation, including enzymatic annotation. The proteins in UniProt represent only about 380,000 animal species (2,000 of which have completely sequenced genomes) out of the estimated millions of species existing on Earth. The enzyme annotation task already applies to millions of entries and this number is bound to increase rapidly as sequencing efforts intensify. To guide my analysis I first develop a basic model of collaborative curation and evaluate it against molecular biology knowledge bases. The analysis highlights a surprising similarity between open and closed annotation environments on metrics usually connected with “democracy” of content. I then develop and evaluate a method to enhance enzyme function annotation using machine learning, which demonstrates very high accuracy, recall and precision and the capacity to scale to millions of enzyme instances.
This method needs only a protein sequence as input and is thus widely applicable to genomic and metagenomic analysis. The last phase of the work uses active and guided learning to bring together collaborative and automatic curation. In active learning a machine learning algorithm suggests to the human curators which entry should be annotated next. This strategy has the potential to coordinate and reduce the amount of manual curation while improving classification performance and reducing the number of training instances needed. This work demonstrates the benefits of combining classic machine learning and guided learning to improve the quantity and quality of enzymatic knowledge and to bring us closer to the goal of annotating all existing enzymes.
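As an illustration of the active learning step described above, the following minimal sketch picks the next entry for a curator using uncertainty sampling, one common query strategy. It is not taken from the thesis: the choice of entropy as the uncertainty measure, the function names, the protein identifiers and the probabilities are all assumptions made for the example.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a predicted class distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def next_entry_to_annotate(predictions):
    """Return the ID of the unlabelled entry with the most uncertain prediction.

    predictions: dict mapping entry ID -> list of class probabilities
    produced by some classifier (hypothetical here).
    """
    return max(predictions, key=lambda entry_id: entropy(predictions[entry_id]))


# Made-up classifier output for three unlabelled protein entries;
# each list holds probabilities over three candidate enzyme classes.
predictions = {
    "protein_A": [0.98, 0.01, 0.01],  # confident prediction
    "protein_B": [0.40, 0.35, 0.25],  # classifier is unsure
    "protein_C": [0.70, 0.20, 0.10],
}

chosen = next_entry_to_annotate(predictions)
print(chosen)
```

Under these assumptions the curator is pointed to the entry on which the classifier is least certain, so each manual annotation is spent where it is expected to be most informative.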