Edinburgh Research Archive

Addressing concept sparsity in medical text with medical ontologies

dc.contributor.advisor
Alex, Bea
dc.contributor.advisor
Birch-Mayne, Alexandra
dc.contributor.advisor
Whiteley, William
dc.contributor.author
Falis, Matúš
dc.date.accessioned
2025-05-01T09:33:34Z
dc.date.available
2025-05-01T09:33:34Z
dc.date.issued
2025-05-01
dc.description.abstract
Clinical Document Coding is the task of summarising unstructured or semi-structured medical text by assigning labels (codes) from a structured knowledge base - e.g., the International Classification of Diseases (ICD) - corresponding to medical concepts, such as conditions or procedures, to clinical documents. Clinical Document Coding is currently performed by humans. As the task requires time and effort that could be used elsewhere in healthcare, it is desirable to (at least partially) automate it. Hence, a variety of systems have been developed for this task, ranging from rule-based to deep-neural-network solutions. Neural solutions often ignore the rich concept representation within the ontology, i.e., the ontological structure, or the code descriptions. Furthermore, while effective in a variety of tasks, neural networks are notorious for requiring large amounts of training data and struggling in low-resource scenarios, unless designed with a focus on low-resource performance. Concepts in clinical coding follow a big-head long-tail distribution - with a few frequent labels (big head) and a large number of infrequent labels (long tail) - reflecting how common different conditions and procedures are. A few concepts, such as Hypertension and Type 2 Diabetes, are very common, while many, such as the Marburg Virus Disease, are rare. This type of distribution is observed in data, with many concepts being infrequent or absent within even the largest publicly available datasets. This data sparsity issue is made even more pronounced by the large amounts of data required to train effective neural network models. This thesis strives to incorporate the rich ontological structure and background knowledge into the model development and evaluation process for coding discharge summaries with the ICD.
The thesis makes the following contributions: (1) a hierarchical evaluation metric; (2) a hierarchical error analysis tool; (3) rule-based data augmentation and synthesis through adjustments to existing texts; and (4) exploration of data augmentation through generating synthetic text via a Large Language Model (GPT-3.5). The thesis presents hierarchy-aware evaluation approaches. Firstly, Count-Preserving Hierarchical Evaluation (CoPHE) compares the number of gold standard labels against the number of predicted labels on different levels of the hierarchy. Beyond being able to assign partial credit to mispredictions based on their proximity to gold standard labels within the ontology, CoPHE is also capable of capturing over- or under-prediction within subgraphs of the hierarchy. Secondly, the popular confusion matrix visualisation and analysis approach commonly used in strongly-labelled scenarios was extended to the weakly-labelled scenario of ICD coding. This approach - Weak Hierarchical Confusion Matrices - makes it possible to understand whether errors in prediction commonly arise due to assigning a different concept from the same family of concepts, or due to over- or under-prediction. Ontology-guided data augmentation and synthesis were employed to address the data sparsity issue. The thesis explores the possibility of addressing this issue by enhancing concept variability through synonym replacement for relevant concepts identified with pre-existing named entity recognition and linking tools, and by introducing concepts previously unseen in the training data. Finally, the thesis explores the possibility of creating synthetic discharge summaries with the aid of Large Language Models for the purpose of data augmentation for few-shot labels (appearing rarely in the training data) and zero-shot labels (absent from the training data). GPT-3.5 was prompted to generate discharge summaries based on diagnosis and procedure codes coming from real patient records.
In models trained on augmented MIMIC-IV, concepts that appeared within the original training set, albeit with few instances, were found to have benefited from the further generated data. The method necessitates further refinement to be reliable in the zero-shot scenario. Clinical staff evaluated the generated discharge summaries and compared them to real data with similar labels. Synthetic discharge summaries correctly list individual concepts, but fail to note interactions among them. The resulting overall narrative is thus often flawed. The generated data may be useful for training neural network models, but would not be acceptable in a clinical setting. In summary, the thesis contributes methods of evaluating performance with regard to the structure of the ontology, and augmentation approaches to mitigating the effects of concept sparsity using both the ontological structure and the textual descriptions of concepts within the ontology.
en
dc.identifier.uri
https://hdl.handle.net/1842/43399
dc.identifier.uri
http://dx.doi.org/10.7488/era/5935
dc.language.iso
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Automated Clinical Coding: What, Why, and Where We Are? Dong, H., Falis, M., Whiteley, W., Alex, B., Ji, S., Chen, J. & Wu, H., 21 Mar 2022, ArXiv, 8 p
en
dc.relation.hasversion
CoPHE: A Count-Preserving Hierarchical Evaluation Metric in Large-Scale Multi-Label Text Classification Falis, M., Dong, H., Birch, A. & Alex, B., 7 Nov 2021, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics (ACL), p. 907-912 6 p
en
dc.relation.hasversion
Falis, M., Dong, H., Birch, A., and Alex, B. (2022). Horses to zebras: ontology-guided data augmentation and synthesis for ICD-9 coding. In Proceedings of the 21st Workshop on Biomedical Language Processing. Association for Computational Linguistics
en
dc.relation.hasversion
Can GPT-3.5 generate and code discharge summaries? Falis, M., Gema, A. P., Dong, H., Daines, L., Basetti, S., Holder, M., Penfold, R. S., Birch, A. & Alex, B., 13 Sept 2024, In: Journal of the American Medical Informatics Association. 31, 10, p. 2284–2293 10 p
en
dc.relation.hasversion
Ontological attention ensembles for capturing semantic concepts in ICD code prediction from clinical text Falis, M., Pajak, M., Lisowska, A., Schrempf, P., Deckers, L., Mikhael, S., Tsaftaris, S. & O'Neil, A., 1 Nov 2019, Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019). Hong Kong: Association for Computational Linguistics, p. 168-177 10 p
en
dc.subject
ICD Coding
en
dc.subject
Clinical Natural Language Processing
en
dc.subject
Data Augmentation
en
dc.subject
Evaluation
en
dc.title
Addressing concept sparsity in medical text with medical ontologies
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en

Files

Original bundle

Name:
FalisM_2025.pdf
Size:
4.16 MB
Format:
Adobe Portable Document Format