Addressing concept sparsity in medical text with medical ontologies

Falis, Matúš

Files

FalisM_2025.pdf (4.16 MB)

Date

2025-05-01

Abstract

Clinical Document Coding is the task of summarising unstructured or semi-structured medical text by assigning to clinical documents labels (codes) from a structured knowledge base, e.g., the International Classification of Diseases (ICD), that correspond to medical concepts such as conditions or procedures. Clinical Document Coding is currently performed by humans. As the task requires time and effort that could be used elsewhere in healthcare, it is desirable to automate it, at least partially. Hence, a variety of systems have been developed for this task, ranging from rule-based to deep-neural-network solutions. Neural solutions often ignore the rich concept representation within the ontology, i.e., the ontological structure or the code descriptions. Furthermore, while effective in a variety of tasks, neural networks are notorious for requiring large amounts of training data and for struggling in low-resource scenarios unless designed with a focus on low-resource performance. Concepts in clinical coding follow a big-head long-tail distribution, with a few frequent labels (the big head) and a large number of infrequent labels (the long tail), reflecting how common different conditions and procedures are. A few concepts, such as Hypertension and Type 2 Diabetes, are very common, while many, such as Marburg Virus Disease, are rare. This type of distribution is observed in data, with many concepts being infrequent or absent even in the largest publicly available datasets. The data sparsity issue is made more pronounced by the large amounts of data needed to train effective neural network models. This thesis strives to incorporate the rich ontological structure and background knowledge into the model development and evaluation process for coding discharge summaries with the ICD.
The thesis makes the following contributions: (1) a hierarchical evaluation metric; (2) a hierarchical error analysis tool; (3) rule-based data augmentation and synthesis through adjustments to existing texts; and (4) an exploration of data augmentation through generating synthetic text via a Large Language Model (GPT-3.5). The thesis presents hierarchy-aware evaluation approaches. Firstly, Count-Preserving Hierarchical Evaluation (CoPHE) compares the number of gold standard labels against the number of predicted labels on different levels of the hierarchy. Beyond assigning partial credit to mispredictions based on their proximity to gold standard labels within the ontology, CoPHE is also capable of capturing over- or under-prediction within subgraphs of the hierarchy. Secondly, the confusion matrix visualisation and analysis approach commonly used in strongly-labelled scenarios was extended to the weakly-labelled scenario of ICD coding. This approach, Weak Hierarchical Confusion Matrices, helps to show whether prediction errors commonly arise from assigning a different concept from the same family of concepts, or from over- or under-prediction. Ontology-guided data augmentation and synthesis were employed to address the data sparsity issue. The thesis explores addressing this issue by enhancing concept variability through synonym replacement for relevant concepts identified with pre-existing named entity recognition and linking tools, and by introducing concepts previously unseen in the training data. Finally, the thesis explores creating synthetic discharge summaries with the aid of Large Language Models for the purpose of data augmentation for few-shot labels (appearing rarely in the training data) and zero-shot labels (absent from the training data). GPT-3.5 was prompted to generate discharge summaries based on diagnosis and procedure codes coming from real patient records.
In models trained on augmented MIMIC-IV, concepts that appeared in the original training set, albeit with few instances, were found to benefit from the additional generated data. The method requires further refinement to be reliable in the zero-shot scenario. Clinical staff evaluated the generated discharge summaries and compared them to real data with similar labels. Synthetic discharge summaries correctly list individual concepts, but fail to note interactions among them. The resulting overall narrative is thus often flawed. The generated data may be useful for training neural network models, but would not be acceptable in a clinical setting. In summary, the thesis contributes methods of evaluating performance with regard to the structure of the ontology, and augmentation approaches to mitigating the effects of concept sparsity using both the ontological structure and the textual descriptions of concepts within the ontology.
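
As a rough illustration of the count-preserving idea described in the abstract, the sketch below compares how many gold and predicted codes fall under each parent category of a two-level hierarchy. It is a minimal sketch under stated assumptions, not the thesis's definition of CoPHE: the `PARENT` mapping, the example codes, and the function names are hypothetical, and the min/max counting rule is one plausible reading of "compares the number of gold standard labels against the number of predicted labels".

```python
from collections import Counter

# Hypothetical toy hierarchy: leaf code -> parent category (not taken from the thesis).
PARENT = {
    "250.00": "250",
    "250.01": "250",
    "401.1": "401",
    "401.9": "401",
}


def level_counts(codes, parent=PARENT):
    """Count how many of the given leaf codes fall under each parent category."""
    return Counter(parent[c] for c in codes)


def count_preserving_prf(gold, pred, parent=PARENT):
    """Precision, recall and F1 on the parent level, computed from label counts."""
    g, p = level_counts(gold, parent), level_counts(pred, parent)
    nodes = set(g) | set(p)
    # Assumed counting rule: matched counts are true positives,
    # surplus predictions are false positives, missing ones false negatives.
    tp = sum(min(g[n], p[n]) for n in nodes)
    fp = sum(max(p[n] - g[n], 0) for n in nodes)
    fn = sum(max(g[n] - p[n], 0) for n in nodes)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = ["250.00", "401.9"]
pred = ["250.01", "401.1", "401.9"]
print(count_preserving_prf(gold, pred))
```

In this toy case the prediction `250.01` is wrong at the leaf level but receives credit under the shared parent `250`, while the extra code under `401` is counted as one over-prediction in that subgraph.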

URI

https://hdl.handle.net/1842/43399
http://dx.doi.org/10.7488/era/5935

This item appears in the following Collection(s)

Informatics thesis and dissertation collection