Edinburgh Research Archive

Automated retrieval and analysis of published biomedical literature through natural language processing for clinical applications

dc.contributor.advisor
Simpson, Ian
dc.contributor.advisor
Macleod, Malcolm
dc.contributor.author
Campbell, Jamie
dc.contributor.sponsor
Medical Research Council (MRC)
dc.date.accessioned
2023-03-01T12:41:01Z
dc.date.available
2023-03-01T12:41:01Z
dc.date.issued
2023-03-01
dc.description.abstract
The size of the existing academic literature corpus and the rapid rate of new publications create both a need and an opportunity to harness computational approaches to data and knowledge extraction across all research fields. Elements of this challenge can be met by developments in automation for retrieval of electronic documents, document classification and knowledge extraction. In this thesis, I detail studies of these processes in three related chapters. Although the focus of each chapter is distinct, they all contribute to my aim of developing a generalisable pipeline for clinical applications of Natural Language Processing in the academic literature.
In chapter one, I describe the development of “Cadmus”, an open-source system written in Python to generate corpora of biomedical text from the published literature. Cadmus comprises three main steps: search query and metadata collection, document retrieval, and parsing of the retrieved text. I present an example of full-text retrieval for a corpus of over two hundred thousand articles using a gene-based search query, with quality control metrics for this retrieval process and a high-level illustration of the utility of full text over metadata for each article. For a corpus of 204,043 articles, the retrieval rate was 85.2% with institutional subscription access and 54.4% without.
Chapter two details the development of a custom-built Naïve Bayes supervised machine learning document classifier. This binary classifier is based on calculating the relative enrichment of biomedical terms between two classes of documents in a training set. The classifier was trained and tested on a manually classified set of over 8000 abstracts and full-text articles to identify articles containing human phenotype descriptions. 10-fold cross-validation of the model showed a recall of 85%, specificity of 99%, precision of 76%, F1 score of 0.82 and accuracy of 90%.
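The idea of a binary classifier built on relative term enrichment, and the metrics reported for it, can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names (`train_enrichment_classifier`, `predict_positive`, `confusion_metrics`), the add-one smoothing, and the simplification of scoring only terms that are present in a document are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_enrichment_classifier(docs, labels, alpha=1.0):
    """Learn per-term log enrichment ratios between two document classes.

    docs   : list of token lists (one list per document)
    labels : list of booleans, True for the positive class
    alpha  : smoothing constant (an assumption of this sketch)
    Assumes both classes occur at least once in the training set.
    """
    pos_counts, neg_counts = Counter(), Counter()
    n_pos = n_neg = 0
    for tokens, is_positive in zip(docs, labels):
        # Count each term once per document (document frequency).
        if is_positive:
            pos_counts.update(set(tokens))
            n_pos += 1
        else:
            neg_counts.update(set(tokens))
            n_neg += 1

    log_ratio = {}
    for term in set(pos_counts) | set(neg_counts):
        p_pos = (pos_counts[term] + alpha) / (n_pos + 2 * alpha)
        p_neg = (neg_counts[term] + alpha) / (n_neg + 2 * alpha)
        # Positive values mean the term is enriched in the positive class.
        log_ratio[term] = math.log(p_pos / p_neg)

    log_prior = math.log(n_pos / n_neg)
    return log_ratio, log_prior


def predict_positive(tokens, log_ratio, log_prior):
    """Sum the log enrichment of the terms present; unseen terms add nothing."""
    score = log_prior + sum(log_ratio.get(t, 0.0) for t in set(tokens))
    return score > 0


def confusion_metrics(tp, fp, tn, fn):
    """Standard binary metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "accuracy": accuracy,
    }
```

In a cross-validation setting, `confusion_metrics` would be applied to the counts from each held-out fold and the results averaged; the tokenisation and term vocabulary used in the thesis are not specified here and would need to be supplied.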
Chapter three illustrates the clinical applications of automated retrieval, processing, and classification by considering the published literature on paediatric COVID-19. Case reports and similar articles were classified into “severe” and “non-severe” classes, and term enrichment was evaluated to find biomarkers associated with, or predictive of, severe paediatric COVID-19. Time series analysis was employed to illustrate emerging disease entities such as Multisystem Inflammatory Syndrome in Children (MIS-C) and to consider unrecognised trends through literature-based discovery.
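One simple way to picture the time series analysis mentioned above is to track, month by month, the share of articles that mention a given term. The sketch below is an assumption-laden illustration, not the method used in the thesis: the function name `monthly_term_share`, the `(ISO date, text)` input format, and the plain substring match are all hypothetical choices.

```python
from collections import defaultdict


def monthly_term_share(articles, term):
    """Fraction of articles per month whose text mentions `term`.

    articles : iterable of (iso_date, text) pairs, iso_date like "2020-05-02"
    term     : case-insensitive substring to look for (a crude stand-in
               for proper biomedical term matching)
    Returns a dict mapping "YYYY-MM" to a fraction in [0, 1], in date order.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    needle = term.lower()
    for iso_date, text in articles:
        month = iso_date[:7]  # keep only the "YYYY-MM" prefix
        totals[month] += 1
        if needle in text.lower():
            hits[month] += 1
    return {month: hits[month] / totals[month] for month in sorted(totals)}
```

A rising share for a term that was previously absent is the kind of signal that could flag an emerging entity for closer review; any real analysis would need proper term normalisation and enough articles per month for the fractions to be meaningful.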
dc.identifier.uri
https://hdl.handle.net/1842/40377
dc.identifier.uri
http://dx.doi.org/10.7488/era/3145
dc.language.iso
en
dc.publisher
The University of Edinburgh
dc.subject
automated analysis
dc.subject
medical literature analysis
dc.subject
Cadmus
dc.subject
medical literature databases
dc.subject
search techniques
dc.subject
full-text articles
dc.title
Automated retrieval and analysis of published biomedical literature through natural language processing for clinical applications
dc.type
Thesis or Dissertation
dc.type.qualificationlevel
Doctoral
dc.type.qualificationname
MD Doctor of Medicine

Files

Original bundle

Name:
Campbell2023.pdf
Size:
9.73 MB
Format:
Adobe Portable Document Format