MilkMine: text-mining, milk proteins and hypothesis generation
The vast and increasing volume of biological data can make it a struggle for scientists to keep up to date with the latest research, and as a consequence they may miss significant biological links, particularly those that extend outwith their own area of expertise. MilkMine is an attempt to provide a single informatics resource to help milk protein scientists mine this information mountain more effectively, by integrating standard experimental data types with data generated by emerging text-mining techniques. A method was initially developed to identify milk-related terminology from peer-reviewed biological literature, and this was used to complement the Unified Medical Language System (UMLS), a large thesaurus of biological concepts, their variant names and their types. The resultant enriched ontology was then mapped to the free text of peer-reviewed biological literature using the MMTx program, producing a database of semantically enriched sentences. A co-occurrence relation extraction algorithm was written to identify relationships between milk proteins and peptides, and other biological concepts, such as diseases or biological processes. From these literature relation sets, new hypotheses can be generated using the basic principle that if “A is linked to B” and “B is linked to C”, then we can infer an association between A and C. Filtering and downstream processing of the many generated relationships brings the significant interactions to the fore. These literature relations and hypotheses are integrated with biological data into the MilkMine database. The MilkMine database is built upon a generic data warehousing system, InterMine. This tool enabled the integration of traditional data types, such as protein sequence or structural data, from a variety of sources (e.g. UniProt). The standard InterMine model was also extended by the author to include other data sources (e.g. the Protein Data Bank) and to incorporate the output of the text-mining algorithm.
This integration of otherwise disparate information allows more complex querying of the data, across many data types. For example, protein sequences are mapped to instances of the names, synonyms or symbols of the protein in text; a raw fragment of amino acid sequence (e.g. a particular binding region) can therefore be used to search the MilkMine database for literature information, as well as for the interactions and hypotheses of those proteins that contain the sequence. The MilkMine resource is accessible online (www.bioinformatics.ed.ac.uk/milkmine) through a professional-level query interface offering many features, such as an interactive query builder, standard ready-to-run queries, bulk downloads and the ability to store user preferences and query histories. Evaluation of MilkMine showed that the text-mining algorithm, as well as the data integration, could provide the user with interesting connections for further study.
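The co-occurrence extraction and the “A is linked to B, B is linked to C” inference described above can be illustrated with a minimal sketch. It is not the MilkMine implementation: the function names, the sentence representation (each sentence reduced to the list of concepts recognised in it) and the placeholder concept labels are all assumptions made for illustration, and the filtering and scoring steps mentioned in the abstract are reduced here to a simple minimum count.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_relations(sentences, min_count=1):
    """Count unordered concept pairs that appear in the same sentence.

    `sentences` is an iterable of concept lists (hypothetical format).
    Pairs seen fewer than `min_count` times are dropped.
    """
    counts = defaultdict(int)
    for concepts in sentences:
        # Deduplicate and sort so each pair has one canonical form.
        for a, b in combinations(sorted(set(concepts)), 2):
            counts[(a, b)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


def infer_hypotheses(relations, source):
    """Apply the A-B-C principle starting from concept `source` (A).

    Returns a dict mapping each candidate C to the set of intermediate
    concepts B that connect it to A. Concepts already linked directly
    to A are excluded, since they are not new hypotheses.
    """
    neighbours = defaultdict(set)
    for a, b in relations:
        neighbours[a].add(b)
        neighbours[b].add(a)

    direct = neighbours.get(source, set())
    hypotheses = defaultdict(set)
    for b in direct:
        for c in neighbours[b]:
            if c != source and c not in direct:
                hypotheses[c].add(b)
    return dict(hypotheses)


# Placeholder data only; these labels are not drawn from MilkMine.
sentences = [
    ["protein_A", "process_B"],
    ["process_B", "disease_C"],
    ["protein_A", "process_D"],
    ["protein_A", "process_B"],
]
relations = cooccurrence_relations(sentences)
hypotheses = infer_hypotheses(relations, "protein_A")
```

In this toy input, protein_A never shares a sentence with disease_C, but both co-occur with process_B, so disease_C is returned as a candidate association with process_B recorded as the connecting concept. A real system would rank or filter such candidates, since the number of indirect links grows quickly with the size of the literature.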