On combining collaborative and automated curation for enzyme function prediction
Java code.zip (585.3Mb)
thesis files.zip (32.15Mb)
research data.zip (543.4Mb)
Date
29/11/2012
Author
De Ferrari, Luna Luciana
Abstract
Data generation has vastly exceeded manual annotation in several areas of astronomy, biology,
economics, geology, medicine and physics. At the same time, a public community of experts
and hobbyists has developed around some of these disciplines thanks to open, editable web resources
such as wikis and public annotation challenges. In this thesis I investigate under which
conditions a combination of collaborative and automated curation could complete annotation
tasks unattainable by human curators alone.
My exemplar curation process is taken from the molecular biology domain: the association of
all existing enzymes (proteins catalysing a chemical reaction) with their function. Assigning
enzymatic function to the proteins in a genome is the first essential problem of metabolic reconstruction,
important for biology, medicine, industrial production and environmental studies. In
the protein database UniProt, only 3% of the records are currently manually curated and only
60% of the 17 million recorded proteins have some functional annotation, including enzymatic
annotation. The proteins in UniProt represent only about 380,000 animal species (2,000 of
which have completely sequenced genomes) out of the estimated millions of species existing
on Earth. The enzyme annotation task already applies to millions of entries and this number is
bound to increase rapidly as sequencing efforts intensify.
To guide my analysis I first develop a basic model of collaborative curation and evaluate
it against molecular biology knowledge bases. The analysis highlights a surprising similarity
between open and closed annotation environments on metrics usually connected with “democracy”
of content.
I then develop and evaluate a method to enhance enzyme function annotation using machine
learning, which demonstrates very high accuracy, recall and precision and the capacity to scale
to millions of enzyme instances. This method needs only a protein sequence as input and is
thus widely applicable to genomic and metagenomic analysis.
The last phase of the work uses active and guided learning to bring together collaborative
and automatic curation. In active learning, a machine learning algorithm suggests to the human
curators which entry should be annotated next. This strategy has the potential to coordinate
and reduce the amount of manual curation while improving classification performance and
reducing the number of training instances needed. This work demonstrates the benefits of
combining classic machine learning and guided learning to improve the quantity and quality of
enzymatic knowledge and to bring us closer to the goal of annotating all existing enzymes.
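
The active learning step described above can be illustrated with a minimal sketch. This is not the method implemented in the thesis: it assumes a simple uncertainty-sampling strategy in which the entry with the highest predictive entropy is suggested to the curators next. The function names, entry IDs and probabilities are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a predicted class distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def next_entry_to_annotate(predictions):
    """Return the ID of the unlabelled entry with the most uncertain prediction.

    `predictions` maps a (hypothetical) entry ID to the classifier's predicted
    probabilities over candidate enzyme classes. Uncertainty sampling is an
    assumption made for this sketch, not a strategy confirmed by the abstract.
    """
    return max(predictions, key=lambda entry_id: entropy(predictions[entry_id]))


# Made-up predictions for three unlabelled proteins.
predictions = {
    "protein_A": [0.97, 0.02, 0.01],  # classifier is confident
    "protein_B": [0.40, 0.35, 0.25],  # classifier is unsure
    "protein_C": [0.70, 0.20, 0.10],
}

chosen = next_entry_to_annotate(predictions)
print(chosen)
```

Under this assumption, the curator would be asked to annotate the entry the classifier is least sure about; the new label is then added to the training set and the model is retrained before the next suggestion.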