Data-driven approach to identifying sub-types of adult asthma from primary care electronic health records
Asthma is increasingly recognised as an umbrella term for a number of distinct clinical presentations (phenotypes) with underlying physiological mechanisms (endotypes). Numerous phenotypes and endotypes of asthma have been proposed, using both hypothesis-driven and, more recently, data-driven techniques. However, inconsistent findings mean that scientific and clinical consensus has yet to be reached, and the pursuit of the true phenotypes and endotypes of asthma is ongoing. In the meantime, there is an unmet clinical need to identify practical subtypes of asthma (for the purposes of this thesis, the term subtype refers to any grouping of asthma based on patient characteristics). Such subtypes could facilitate a transition from the current one-size-fits-all approach towards personalised medicine. The aim of this thesis was to identify subtypes of asthma in adults from UK primary care electronic health records using data-driven methods, focusing specifically on an unsupervised machine learning technique called cluster analysis. To inform the application of cluster analysis in this thesis, I reviewed its application in 63 previous studies that derived asthma subtypes from multimodal clinical data. I found that the methods used were often poorly suited to mixed-type clinical data. In addition, I found that assessments of the stability and validity of findings were often inadequate or missing entirely. The clinical findings of studies in which such limitations are present should be interpreted with caution. The first step in the analysis reported in this thesis was the derivation of datasets from primary care electronic health records, to which data-driven methods could then be applied to identify subtypes. Two sources of primary care electronic health records were used: the Optimum Patient Care Research Database (OPCRD) and the Secure Anonymised Information Linkage (SAIL) Databank.
I aimed to be as transparent as possible when describing the dataset derivation process by reporting the results of exploratory data analysis and by making all analysis code publicly available upon completion. This was to facilitate critical appraisal of the process and replication and validation of the findings. To identify subtypes of asthma from the derived datasets, multiple correspondence analysis (MCA) and k-means cluster analysis were applied to a training set of data from 50,000 patients with asthma registered at a primary care practice in England in 2016 (sourced from OPCRD). The number of dimensions to retain from the MCA was selected using a novel framework based on how well a random forest model could replicate the outputs of the k-means clustering algorithm. A resampling framework was used to discard unstable cluster solutions, and the number of clusters was selected using average silhouette widths. These methods identified five subtypes of adult asthma that can be tentatively interpreted as follows: (1) low healthcare utilisation; (2) low-to-medium medication use; (3) metabolic comorbidity; (4) high medication use; (5) very high medication use. Finally, a random forest model was trained to replicate the cluster labels using the original features. This model achieved a balanced accuracy of 93% in an unseen dataset comprising 50,000 patients sampled from OPCRD at the same timepoint. In the internal validation analysis (unseen OPCRD data from 2017 and 2018), the random forest approximated cluster labels derived at two timepoints with balanced accuracies of 92-93%, and in the external validation analysis (unseen SAIL data from 2016, 2017 and 2018) the balanced accuracies were 74-79%. The asthma subtype characteristics across the unseen data (both the out-of-sample OPCRD data and the SAIL data) were consistent with those in the training data.
The investigation of data-driven methods for identifying asthma subtypes presented in this thesis builds on the current evidence in two key areas. First, limitations in the application of methods in previous studies were identified, and a novel framework that mitigates these limitations was proposed. This framework could be extended to other disease areas as a means of exploring patient subgroups and ultimately facilitating precision medicine. Second, this is the first study to derive data-driven subtypes of adult asthma directly from primary care electronic health record data. The resulting subtypes have the potential to be translated directly to clinical practice in a UK primary care setting. This could facilitate the stratification of patients with asthma and support the development of more personalised monitoring and treatment regimens.