Addressing concept sparsity in medical text with medical ontologies

Falis, Matúš

dc.contributor.advisor

Alex, Bea

dc.contributor.advisor

Birch-Mayne, Alexandra

dc.contributor.advisor

Whiteley, William

dc.contributor.author

Falis, Matúš

dc.date.accessioned

2025-05-01T09:33:34Z

dc.date.available

2025-05-01T09:33:34Z

dc.date.issued

2025-05-01

dc.description.abstract

Clinical Document Coding is the task of summarising unstructured or semi-structured medical text by assigning to clinical documents labels (codes) from a structured knowledge base, e.g., the International Classification of Diseases (ICD), that correspond to medical concepts such as conditions or procedures. Clinical Document Coding is currently performed by humans. As the task requires time and effort that could be used elsewhere in healthcare, it is desirable to (at least partially) automate it. Hence, a variety of systems have been developed for this task, ranging from rule-based to deep-neural-network solutions. Neural solutions often ignore the rich concept representation within the ontology, i.e., the ontological structure or the code descriptions. Furthermore, while effective in a variety of tasks, neural networks are notorious for requiring large amounts of training data and struggling in low-resource scenarios, unless designed with a focus on low-resource performance. Concepts in clinical coding follow a big-head long-tail distribution - with a few frequent labels (big head) and a large number of infrequent labels (long tail) - reflecting how common different conditions and procedures are. A few concepts, such as Hypertension and Type 2 Diabetes, are very common, while many, such as Marburg Virus Disease, are rare. This type of distribution is observed in data, with many concepts being infrequent or absent within even the largest publicly available datasets. This data sparsity issue is made more pronounced by the large amounts of data required to train effective neural network models. This thesis strives to incorporate the rich ontological structure and background knowledge into the model development and evaluation process for coding discharge summaries with the ICD.
The thesis makes the following contributions: (1) a hierarchical evaluation metric; (2) a hierarchical error analysis tool; (3) rule-based data augmentation and synthesis through adjustments to existing texts; and (4) exploration of data augmentation through generating synthetic text via a Large Language Model (GPT-3.5). The thesis presents hierarchy-aware evaluation approaches. Firstly, Count-Preserving Hierarchical Evaluation (CoPHE) compares the number of gold standard labels against the number of predicted labels on different levels of the hierarchy. Beyond being able to assign partial credit to mispredictions based on their proximity to gold standard labels within the ontology, CoPHE is also capable of capturing over- or under-prediction within subgraphs of the hierarchy. Secondly, the popular confusion matrix visualisation and analysis approach commonly used in strongly-labelled scenarios was extended to the weakly-labelled scenario of ICD coding. This approach - Weak Hierarchical Confusion Matrices - makes it possible to understand whether errors in prediction commonly arise from assigning a different concept from the same family of concepts, or from over- or under-prediction. Ontology-guided data augmentation and synthesis were employed to address the data sparsity issue. The thesis explores the possibility of addressing this issue by enhancing concept variability through synonym replacement for relevant concepts identified with pre-existing named entity recognition and linking tools, and by introducing concepts previously unseen in the training data. Finally, the thesis explores the possibility of creating synthetic discharge summaries with the aid of Large Language Models for the purpose of data augmentation for few-shot labels (appearing rarely in the training data) and zero-shot labels (absent from the training data). GPT-3.5 was prompted to generate discharge summaries based on diagnosis and procedure codes coming from real patient records.
In models trained on augmented MIMIC-IV, concepts that appeared within the original training set, albeit with few instances, were found to have benefited from the additional generated data. The method requires further refinement to be reliable in the zero-shot scenario. Clinical staff evaluated the generated discharge summaries and compared them to real data with similar labels. Synthetic discharge summaries correctly list individual concepts, but fail to capture interactions among them. The resulting overall narrative is thus often flawed. The generated data may be useful for training neural network models, but would not be acceptable in a clinical setting. In summary, the thesis contributes methods of evaluating performance with regard to the structure of the ontology, and augmentation approaches to mitigating the effects of concept sparsity using both the ontological structure and the textual descriptions of concepts within the ontology.

en
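The count-preserving idea described in the abstract can be illustrated with a minimal sketch. It assumes that each assigned code adds one to its own count and to the count of every ancestor, and that gold and predicted counts are then compared node by node. The toy hierarchy PARENT and the function names propagate_counts and count_preserving_prf are hypothetical and are not taken from the thesis or its published code.

```python
from collections import Counter

# Hypothetical toy hierarchy: child code -> parent code (None marks a root).
PARENT = {
    "401.1": "401",
    "401.9": "401",
    "250.00": "250",
    "401": None,
    "250": None,
}


def propagate_counts(codes, parent=PARENT):
    """Count each assigned code once at its own node and at every ancestor."""
    counts = Counter()
    for code in set(codes):
        node = code
        while node is not None:
            counts[node] += 1
            node = parent[node]
    return counts


def count_preserving_prf(gold, pred, parent=PARENT):
    """Precision, recall and F1 over per-node counts (assumed formulation)."""
    g = propagate_counts(gold, parent)
    p = propagate_counts(pred, parent)
    nodes = set(g) | set(p)
    # Overlapping counts are true positives; surplus predicted counts are
    # false positives; surplus gold counts are false negatives.
    tp = sum(min(g[n], p[n]) for n in nodes)
    fp = sum(max(p[n] - g[n], 0) for n in nodes)
    fn = sum(max(g[n] - p[n], 0) for n in nodes)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# A sibling misprediction earns partial credit through the shared parent "401".
print(count_preserving_prf(["401.1"], ["401.9"]))
# Predicting two siblings where one is gold is penalised as over-prediction
# at the parent level as well as at the leaf level.
print(count_preserving_prf(["401.1"], ["401.1", "401.9"]))
```

In this sketch a flat, leaf-only comparison would give the sibling misprediction no credit at all, whereas the count comparison at the parent node rewards its proximity to the gold label and still registers the surplus prediction in the second call.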

dc.identifier.uri

https://hdl.handle.net/1842/43399

dc.identifier.uri

http://dx.doi.org/10.7488/era/5935

dc.language.iso

en

dc.publisher

The University of Edinburgh

en

dc.relation.hasversion

Automated Clinical Coding: What, Why, and Where We Are? Dong, H., Falis, M., Whiteley, W., Alex, B., Ji, S., Chen, J. & Wu, H., 21 Mar 2022, ArXiv, 8 p

en

dc.relation.hasversion

CoPHE: A Count-Preserving Hierarchical Evaluation Metric in Large-Scale Multi-Label Text Classification Falis, M., Dong, H., Birch, A. & Alex, B., 7 Nov 2021, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics (ACL), p. 907-912 6 p

en

dc.relation.hasversion

Falis, M., Dong, H., Birch, A., and Alex, B. (2022). Horses to zebras: ontology-guided data augmentation and synthesis for ICD-9 coding. In Proceedings of the 21st Workshop on Biomedical Language Processing. Association for Computational Linguistics

en

dc.relation.hasversion

Can GPT-3.5 generate and code discharge summaries? Falis, M., Gema, A. P., Dong, H., Daines, L., Basetti, S., Holder, M., Penfold, R. S., Birch, A. & Alex, B., 13 Sept 2024, In: Journal of the American Medical Informatics Association. 31, 10, p. 2284–2293 10 p

en

dc.relation.hasversion

Ontological attention ensembles for capturing semantic concepts in ICD code prediction from clinical text Falis, M., Pajak, M., Lisowska, A., Schrempf, P., Deckers, L., Mikhael, S., Tsaftaris, S. & O'Neil, A., 1 Nov 2019, Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019). Hong Kong: Association for Computational Linguistics, p. 168-177 10 p

en

dc.subject

ICD Coding

en

dc.subject

Clinical Natural Language Processing

en

dc.subject

Data Augmentation

en

dc.subject

Evaluation

en

dc.title

Addressing concept sparsity in medical text with medical ontologies

en

dc.type

Thesis or Dissertation

en

dc.type.qualificationlevel

Doctoral

en

dc.type.qualificationname

PhD Doctor of Philosophy

en

Files

Original bundle

Name: FalisM_2025.pdf
Size: 4.16 MB
Format: Adobe Portable Document Format

This item appears in the following Collection(s)

Informatics thesis and dissertation collection