Systematising and scaling literature curation for genetically determined developmental disorders
Yates, Thabo Michael
The widespread availability of genomic sequencing has transformed the diagnosis of genetically-determined developmental disorders (GDD). However, this type of test often generates a number of genetic variants, which have to be reviewed and related back to the clinical features (phenotype) of the individual being tested. This frequently entails a time-consuming review of the peer-reviewed literature to look for case reports describing variants in the gene(s) of interest. This is particularly true for newly described and/or very rare disorders not covered in phenotype databases. Therefore, there is a need for scalable, automated literature curation to increase the efficiency of this process. This should lead to improvements in the speed in which diagnosis is made, and an increase in the number of individuals who are diagnosed through genomic testing. Phenotypic data in case reports/case series is not usually recorded in a standardised, computationally-tractable format. Plain text descriptions of similar clinical features may be recorded in several different ways. For example, a technical term such as ‘hypertelorism’, may be recorded as its synonym ‘widely spaced eyes’. In addition, case reports are found across a wide range of journals, with different structures and file formats for each publication. The Human Phenotype Ontology (HPO) was developed to store phenotypic data in a computationally-accessible format. Several initiatives have been developed to link diseases to phenotype data, in the form of HPO terms. However, these rely on manual expert curation and therefore are not inherently scalable, and cannot be updated automatically. Methods of extracting phenotype data from text at scale developed to date have relied on abstracts or open access papers. At the time of writing, Europe PubMed Central (EPMC, https://europepmc.org/) contained approximately 39.5 million articles, of which only 3.8 million were open access. Therefore, there is likely a significant volume of phenotypic data which has not been used previously at scale, due to difficulties accessing non-open access manuscripts. In this thesis, I present a method for literature curation which can utilise all relevant published full text through a newly developed package which can download almost all manuscripts licenced by a university or other institution. This is scalable to the full spectrum of GDD. Using manuscripts identified through manual literature review, I use a full text download pipeline and NLP (natural language processing) based methods to generate disease models. These are comprised of HPO terms weighted according to their frequency in the literature. I demonstrate iterative refinement of these models, and use a custom annotated corpus of 50 papers to show the text mining process has high precision and recall. I demonstrate that these models clinically reflect true disease expressivity, as defined by manual comparison with expert literature reviews, for three well-characterised GDD. I compare these disease models to those in the most commonly used genetic disease phenotype databases. I show that the automated disease models have increased depth of phenotyping, i.e. there are more terms than those which are manually-generated. I show that, in comparison to ‘real life’ prospectively gathered phenotypic data, automated disease models outperform existing phenotype databases in predicting diagnosis, as defined by increased area under the curve (by 0.05 and 0.08 using different similarity measures) on ROC curve plots. I present a method for automated PubMed search at scale, to use as input for disease model generation. I annotated a corpus of 6500 abstracts. Using this corpus I show a high precision (up to 0.80) and recall (up to 1.00) for machine learning classifiers used to identify manuscripts relevant to GDD. These use hand-picked domain-specific features, for example utilising specific MeSH terms. This method can be used to scale automated literature curation to the full spectrum of GDD. I also present an analysis of the phenotypic terms used in one year of GDD-relevant papers in a prominent journal. This shows that use of supplemental data and parsing clinical report sections from manuscripts is likely to result in more patient-specific phenotype extraction in future. In summary, I present a method for automated curation of full text from the peer-reviewed literature in the context of GDD. I demonstrate that this method is robust, reflects clinical disease expressivity, outperforms existing manual literature curation, and is scalable. Applying this process to clinical testing in future should improve the efficiency and accuracy of diagnosis.