Automated retrieval and analysis of published biomedical literature through natural language processing for clinical applications

Campbell, Jamie

dc.contributor.advisor

Simpson, Ian

dc.contributor.advisor

Macleod, Malcolm

dc.contributor.author

Campbell, Jamie

dc.contributor.sponsor

Medical Research Council (MRC)

dc.date.accessioned

2023-03-01T12:41:01Z

dc.date.available

2023-03-01T12:41:01Z

dc.date.issued

2023-03-01

dc.description.abstract

The size of the existing academic literature corpus and the rapid rate of new publications create both a need and an opportunity to harness computational approaches to data and knowledge extraction across all research fields. Elements of this challenge can be met by developments in automation for the retrieval of electronic documents, document classification and knowledge extraction. In this thesis, I detail studies of these processes in three related chapters. Although the focus of each chapter is distinct, they contribute to my aim of developing a generalisable pipeline for clinical applications of Natural Language Processing in the academic literature. In chapter one, I describe the development of “Cadmus”, an open-source system developed in Python to generate corpora of biomedical text from the published literature. Cadmus comprises three main steps: search query and metadata collection, document retrieval, and parsing of the retrieved text. I present an example of full-text retrieval for a corpus of over two hundred thousand articles using a gene-based search query, with quality control metrics for this retrieval process and a high-level illustration of the utility of full text over metadata for each article. For a corpus of 204,043 articles, the retrieval rate was 85.2% with institutional subscription access and 54.4% without. Chapter two details the development of a custom-built Naïve Bayes supervised machine learning document classifier. This binary classifier is based on calculating the relative enrichment of biomedical terms between two classes of documents in a training set. The classifier was trained and tested on a manually classified set of over 8000 abstracts and full-text articles to identify articles containing human phenotype descriptions. 10-fold cross-validation of the model showed a recall of 85%, specificity of 99%, precision of 0.76, F1 score of 0.82 and accuracy of 90%.
Chapter three illustrates the clinical applications of automated retrieval, processing, and classification by considering the published literature on paediatric COVID-19. Case reports and similar articles were classified into “severe” and “non-severe” classes, and term enrichment was evaluated to find biomarkers associated with, or predictive of, severe paediatric COVID-19. Time series analysis was employed to illustrate emerging disease entities such as Multisystem Inflammatory Syndrome in Children (MIS-C) and to consider unrecognised trends through literature-based discovery.
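
The idea of a binary classifier built on the relative enrichment of terms between two classes can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names (`train`, `predict`, `evaluate`), the Laplace smoothing parameter `alpha`, the choice to score only terms present in a document, and the toy documents are all assumptions made for illustration. The sketch also shows how recall, specificity, precision, F1 and accuracy follow from the counts of true and false positives and negatives.

```python
import math
from collections import Counter


def train(docs, labels, alpha=1.0):
    """Learn per-term log enrichment between two classes.

    docs: list of token lists; labels: list of bool (True = positive class).
    alpha is a Laplace smoothing constant (an assumption of this sketch).
    """
    pos_df, neg_df = Counter(), Counter()
    n_pos = n_neg = 0
    for tokens, is_pos in zip(docs, labels):
        terms = set(tokens)  # document frequency: count each term once per doc
        if is_pos:
            pos_df.update(terms)
            n_pos += 1
        else:
            neg_df.update(terms)
            n_neg += 1
    if n_pos == 0 or n_neg == 0:
        raise ValueError("training set must contain both classes")
    weights = {}
    for term in set(pos_df) | set(neg_df):
        p_pos = (pos_df[term] + alpha) / (n_pos + 2 * alpha)
        p_neg = (neg_df[term] + alpha) / (n_neg + 2 * alpha)
        weights[term] = math.log(p_pos) - math.log(p_neg)  # log enrichment
    prior = math.log(n_pos / n_neg)  # log prior odds of the positive class
    return weights, prior


def predict(tokens, weights, prior):
    """Sum the log enrichments of the terms present; unseen terms add 0."""
    score = prior + sum(weights.get(t, 0.0) for t in set(tokens))
    return score > 0


def evaluate(y_true, y_pred):
    """Compute standard binary classification metrics from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)

    def ratio(num, den):
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f1": ratio(2 * precision * recall, precision + recall),
        "accuracy": ratio(tp + tn, len(pairs)),
    }


# Toy, hand-made documents (hypothetical; not data from the thesis).
train_docs = [
    ["seizure", "hypotonia", "patient"],
    ["microcephaly", "seizure", "child"],
    ["mouse", "knockout", "assay"],
    ["protein", "structure", "assay"],
]
train_labels = [True, True, False, False]
weights, prior = train(train_docs, train_labels)
```

A document is then scored with `predict(tokens, weights, prior)`, and a set of predictions can be compared against manual labels with `evaluate`. In a cross-validation setting these two steps would be repeated once per fold on held-out documents.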

dc.identifier.uri

https://hdl.handle.net/1842/40377

dc.identifier.uri

http://dx.doi.org/10.7488/era/3145

dc.language.iso

en

dc.publisher

The University of Edinburgh

dc.subject

automated analysis

dc.subject

medical literature analysis

dc.subject

Cadmus

dc.subject

medical literature databases

dc.subject

search techniques

dc.subject

full-text articles

dc.title

Automated retrieval and analysis of published biomedical literature through natural language processing for clinical applications

dc.type

Thesis or Dissertation

dc.type.qualificationlevel

Doctoral

dc.type.qualificationname

MD Doctor of Medicine

Files

Original bundle

Name: Campbell2023.pdf
Size: 9.73 MB
Format: Adobe Portable Document Format

This item appears in the following Collection(s)

Edinburgh Medical School thesis and dissertation collection