Automated retrieval and analysis of published biomedical literature through natural language processing for clinical applications
Date
01/03/2023
Author
Campbell, Jamie
Abstract
The size of the existing academic literature corpus and the rapid rate of new publications
create both a need and an opportunity to harness computational approaches to data and
knowledge extraction across all research fields. Elements of this challenge can be met by
developments in automation for retrieval of electronic documents, document classification
and knowledge extraction. In this thesis, I detail studies of these processes in three related
chapters. Although the focus of each chapter is distinct, they contribute to my aim of
developing a generalisable pipeline for clinical applications of Natural Language Processing
to the academic literature. In chapter one, I describe the development of “Cadmus”, an
open-source system written in Python to generate corpora of biomedical text from the published
literature. Cadmus comprises three main steps: search query and metadata collection,
document retrieval, and parsing of the retrieved text. I present an example of full-text
retrieval for a corpus of over two hundred thousand articles using a gene-based search query
with quality control metrics for this retrieval process and a high-level illustration of the utility
of full text over metadata for each article. For a corpus of 204,043 articles, the retrieval rate
was 85.2% with institutional subscription access and 54.4% without. Chapter two details
the development of a custom-built Naïve Bayes supervised machine learning document classifier.
This binary classifier is based on calculating the relative enrichment of biomedical terms
between two classes of documents in a training set.
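
The idea of scoring documents by relative term enrichment between two classes can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function names `train` and `predict`, the smoothing constant `alpha`, and the toy documents are all assumptions made for the example.

```python
import math
from collections import Counter


def train(docs, labels, alpha=1.0):
    """Return per-term log-likelihood ratios and the log prior ratio.

    docs   : list of token lists
    labels : list of bool (True = positive class)
    alpha  : add-alpha (Laplace) smoothing constant, assumed here
    """
    counts = {True: Counter(), False: Counter()}
    n_docs = Counter(labels)
    for tokens, label in zip(docs, labels):
        counts[label].update(tokens)

    vocab = set(counts[True]) | set(counts[False])
    # Smoothed denominators so unseen terms never give log(0).
    totals = {c: sum(counts[c].values()) + alpha * len(vocab) for c in counts}

    # Positive values mean the term is enriched in the positive class.
    llr = {
        t: math.log((counts[True][t] + alpha) / totals[True])
        - math.log((counts[False][t] + alpha) / totals[False])
        for t in vocab
    }
    prior = math.log(n_docs[True] / n_docs[False])
    return llr, prior


def predict(tokens, llr, prior):
    """Classify as positive when the summed log-odds exceed zero."""
    score = prior + sum(llr.get(t, 0.0) for t in tokens)
    return score > 0


# Invented toy training set: True = contains a human phenotype description.
docs = [
    ["patient", "microcephaly", "seizures"],
    ["proband", "short", "stature", "seizures"],
    ["mouse", "knockout", "expression"],
    ["protein", "structure", "expression"],
]
labels = [True, True, False, False]
llr, prior = train(docs, labels)
```

Terms absent from the training vocabulary contribute nothing to the score in this sketch, so a document made up entirely of unseen terms is decided by the prior alone.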
The classifier is trained and tested on a manually classified set of over 8000 abstracts and
full-text articles to identify articles containing human phenotype descriptions. 10-fold
cross-validation of the model showed a recall of 85%, specificity of 99%, precision
of 76%, F1 score of 0.82 and accuracy of 90%. Chapter three illustrates the clinical
applications of automated retrieval, processing, and classification by considering the
published literature on Paediatric COVID-19. Case reports and similar articles were classified
into “severe” and “non-severe” classes, and term enrichment was evaluated to find
biomarkers associated with, or predictive of, severe paediatric COVID-19. Time series
analysis was employed to illustrate emerging disease entities like the Multisystem
Inflammatory Syndrome in Children (MIS-C) and consider unrecognised trends through
literature-based discovery.
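
As a rough illustration of the time series analysis mentioned above, the sketch below counts, per month, how many articles mention a given term. The function name `monthly_mentions`, the simple substring match, and the example articles are assumptions for illustration only and do not represent the data or method used in the thesis.

```python
from collections import Counter
from datetime import date


def monthly_mentions(articles, term):
    """Count articles per (year, month) whose text mentions `term`.

    articles : iterable of (publication_date, text) pairs
    term     : case-insensitive substring to look for
    """
    counts = Counter()
    needle = term.lower()
    for published, text in articles:
        if needle in text.lower():
            counts[(published.year, published.month)] += 1
    # Sort by (year, month) so the result reads as a time series.
    return dict(sorted(counts.items()))


# Invented example records, not real publications.
articles = [
    (date(2020, 3, 14), "Mild fever and cough in a child with COVID-19."),
    (date(2020, 5, 7), "Hyperinflammatory shock; MIS-C suspected."),
    (date(2020, 5, 21), "Case series of MIS-C with cardiac involvement."),
    (date(2020, 6, 2), "MIS-C following SARS-CoV-2 infection."),
]
series = monthly_mentions(articles, "MIS-C")
```

A term that first appears and then rises in such a series is one simple signal of an emerging entity; a real analysis would also need to normalise by the total number of articles published each month.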