Automated retrieval and analysis of published biomedical literature through natural language processing for clinical applications
dc.contributor.advisor
Simpson, Ian
dc.contributor.advisor
Macleod, Malcolm
dc.contributor.author
Campbell, Jamie
dc.contributor.sponsor
Medical Research Council (MRC)
dc.date.accessioned
2023-03-01T12:41:01Z
dc.date.available
2023-03-01T12:41:01Z
dc.date.issued
2023-03-01
dc.description.abstract
The size of the existing academic literature corpus and the rapid rate of new publications
create both a great need and an opportunity to harness computational approaches to data and
knowledge extraction across all research fields. Elements of this challenge can be met by
developments in automation for retrieval of electronic documents, document classification
and knowledge extraction. In this thesis, I detail studies of these processes in three related
chapters. Although the focus of each chapter is distinct, they contribute to my aim of
developing a generalisable pipeline for clinical applications in Natural Language Processing
in the academic literature. In chapter one, I describe the development of “Cadmus”, an open-source system written in Python to generate corpora of biomedical text from the published
literature. Cadmus comprises three main steps: search query and metadata collection,
document retrieval, and parsing of the retrieved text. I present an example of full-text
retrieval for a corpus of over two hundred thousand articles using a gene-based search query
with quality control metrics for this retrieval process and a high-level illustration of the utility
of full text over metadata for each article. For a corpus of 204,043 articles, the retrieval rate
was 85.2% with institutional subscription access and 54.4% without. Chapter two details
the development of a custom-built Naïve Bayes supervised machine learning document classifier.
This binary classifier is based on calculating the relative enrichment of biomedical terms
between two classes of documents in a training set.
The classifier is trained and tested on a manually classified set of over 8,000 abstracts and
full-text articles to identify articles containing human phenotype descriptions. 10-fold cross-validation of the model showed a recall of 85%, specificity of 99%, precision
of 76%, F1 score of 0.82 and accuracy of 90%. Chapter three illustrates the clinical
applications of automated retrieval, processing, and classification by considering the
published literature on paediatric COVID-19. Case reports and similar articles were classified
into “severe” and “non-severe” classes, and term enrichment was evaluated to find
biomarkers associated with, or predictive of, severe paediatric COVID-19. Time series
analysis was employed to illustrate emerging disease entities like the Multisystem
Inflammatory Syndrome in Children (MIS-C) and consider unrecognised trends through
literature-based discovery.
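
The term-enrichment idea behind the chapter two classifier can be illustrated with a minimal sketch. This is not the thesis implementation: the whitespace tokeniser, the add-one smoothing parameter `alpha`, the function names `train` and `predict`, and the toy documents are all assumptions made for illustration. The sketch scores a document by summing, over its terms, the log ratio of smoothed term frequencies between the two classes, which is one common way to write a binary Naïve Bayes decision rule.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain lowercase whitespace tokens stand in for biomedical terms.
    return text.lower().split()


def train(docs, labels, alpha=1.0):
    """Return per-term log enrichment ratios and the log prior ratio."""
    counts = {True: Counter(), False: Counter()}
    n_docs = Counter(labels)
    for text, label in zip(docs, labels):
        counts[label].update(tokenize(text))
    vocab = set(counts[True]) | set(counts[False])
    # Smoothed denominators so unseen terms never give a zero probability.
    totals = {c: sum(counts[c].values()) + alpha * len(vocab) for c in (True, False)}
    log_ratio = {
        term: math.log((counts[True][term] + alpha) / totals[True])
        - math.log((counts[False][term] + alpha) / totals[False])
        for term in vocab
    }
    log_prior = math.log(n_docs[True] / n_docs[False])
    return log_ratio, log_prior


def predict(text, log_ratio, log_prior):
    # Terms outside the training vocabulary contribute nothing to the score.
    score = log_prior + sum(log_ratio.get(term, 0.0) for term in tokenize(text))
    return score > 0


# Hypothetical toy training set: True marks a human phenotype description.
docs = [
    "patient presented with seizures and hypotonia",
    "child with microcephaly and seizures",
    "protein expression in yeast cells",
    "gene knockout in mouse cells",
]
labels = [True, True, False, False]
log_ratio, log_prior = train(docs, labels)
```

Terms with a positive log ratio are enriched in the positive class and push a document towards it; terms with a negative ratio push it away. A real system would replace the tokeniser with biomedical term extraction and estimate performance by cross-validation, as the abstract describes.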
dc.identifier.uri
https://hdl.handle.net/1842/40377
dc.identifier.uri
http://dx.doi.org/10.7488/era/3145
dc.language.iso
en
dc.publisher
The University of Edinburgh
dc.subject
automated analysis
dc.subject
medical literature analysis
dc.subject
Cadmus
dc.subject
medical literature databases
dc.subject
search techniques
dc.subject
full-text articles
dc.title
Automated retrieval and analysis of published biomedical literature through natural language processing for clinical applications
dc.type
Thesis or Dissertation
dc.type.qualificationlevel
Doctoral
dc.type.qualificationname
MD Doctor of Medicine
Files
Original bundle
- Name: Campbell2023.pdf
- Size: 9.73 MB
- Format: Adobe Portable Document Format

