dc.contributor.advisor: Goryanin, Igor [en]
dc.contributor.advisor: Aitken, Stuart [en]
dc.contributor.advisor: van Hemert, Jano [en]
dc.contributor.author: De Ferrari, Luna Luciana [en]
dc.date.accessioned: 2013-07-08T13:47:26Z
dc.date.available: 2013-07-08T13:47:26Z
dc.date.issued: 2012-11-29
dc.identifier.uri: http://hdl.handle.net/1842/7538
dc.description: Grant number BB/F529038/1 [en]
dc.description.abstract: Data generation has vastly exceeded manual annotation in several areas of astronomy, biology, economics, geology, medicine and physics. At the same time, a public community of experts and hobbyists has developed around some of these disciplines thanks to open, editable web resources such as wikis and public annotation challenges. In this thesis I investigate under which conditions a combination of collaborative and automated curation could complete annotation tasks unattainable by human curators alone. My exemplar curation process is taken from the molecular biology domain: the association of all existing enzymes (proteins catalysing a chemical reaction) with their function. Assigning enzymatic function to the proteins in a genome is the first essential problem of metabolic reconstruction, important for biology, medicine, industrial production and environmental studies. In the protein database UniProt, only 3% of the records are currently manually curated and only 60% of the 17 million recorded proteins have some functional annotation, including enzymatic annotation. The proteins in UniProt represent only about 380,000 animal species (2,000 of which have completely sequenced genomes) out of the estimated millions of species existing on Earth. The enzyme annotation task already applies to millions of entries, and this number is bound to increase rapidly as sequencing efforts intensify. To guide my analysis I first develop a basic model of collaborative curation and evaluate it against molecular biology knowledge bases. The analysis highlights a surprising similarity between open and closed annotation environments on metrics usually connected with “democracy” of content. I then develop and evaluate a method to enhance enzyme function annotation using machine learning, which demonstrates very high accuracy, recall and precision and the capacity to scale to millions of enzyme instances.
This method needs only a protein sequence as input and is thus widely applicable to genomic and metagenomic analysis. The last phase of the work uses active and guided learning to bring together collaborative and automatic curation. In active learning, a machine learning algorithm suggests to the human curators which entry should be annotated next. This strategy has the potential to coordinate and reduce the amount of manual curation while improving classification performance and reducing the number of training instances needed. This work demonstrates the benefits of combining classic machine learning and guided learning to improve the quantity and quality of enzymatic knowledge and to bring us closer to the goal of annotating all existing enzymes. [en]
dc.contributor.sponsor: Biotechnology and Biological Sciences Research Council (BBSRC) [en]
dc.language.iso: en
dc.publisher: The University of Edinburgh [en]
dc.relation.hasversion: De Ferrari, L.; Aitken, S.; van Hemert, J. and Goryanin, I. WikiSim: simulating knowledge collection and curation in structured wikis. WikiSym ’08: Proceedings of the 4th International Symposium on Wikis, ACM, 2008, 1-2 [en]
dc.relation.hasversion: De Ferrari, L.; Aitken, S.; van Hemert, J. and Goryanin, I. A model of social collaboration in Molecular Biology knowledge bases. In Edmonds, B. and Gilbert, N. (Eds.), Proceedings of the 6th Conference of the European Social Simulation Association (ESSA’09), European Social Simulation Association, 2009, 47 [en]
dc.relation.hasversion: De Ferrari, L.; Aitken, S.; van Hemert, J. and Goryanin, I. Multi-label prediction of enzyme classes using InterPro signatures. Machine Learning for Systems Biology Workshop (International Conference of Systems Biology), 2010 [en]
dc.relation.hasversion: De Ferrari, L.; Aitken, S.; van Hemert, J. and Goryanin, I. EnzML: Multi-label prediction of enzyme classes using InterPro signatures. BMC Bioinformatics, 13:61, 2012 [en]
dc.relation.hasversion: De Ferrari, L.; Aitken, S. and Mitchell, J. Active and guided learning for the prediction of enzyme function. 11th European Conference on Computational Biology, 2012 [en]
dc.subject: bioinformatics [en]
dc.subject: machine learning [en]
dc.subject: enzyme [en]
dc.subject: active learning [en]
dc.subject: wiki [en]
dc.subject: curation [en]
dc.title: On combining collaborative and automated curation for enzyme function prediction [en]
dc.type: Thesis or Dissertation [en]
dc.type.qualificationlevel: Doctoral [en]
dc.type.qualificationname: PhD Doctor of Philosophy [en]

