Data-driven approaches for predicting asthma attacks in adults in primary care
Background Asthma attacks cause approximately 270 hospitalisations and four deaths per day in the United Kingdom (UK). Previous attempts to construct data-driven risk prediction models of asthma attacks have lacked clinical utility, either producing inaccurate predictions or requiring patient data that are not cost-effective to collect on a large scale (such as electronic monitoring device data). Electronic Health Record (EHR) use throughout the UK enables researchers to harness comprehensive patient data, but cleaning and pre-processing these data require sophisticated empirical experimentation and data analytics approaches. My objectives were to appraise the methods previously used in asthma attack risk prediction modelling for feature extraction, model development, and model selection, and to train and test a model in Scottish EHRs.

Methods In this thesis, I used a Scottish longitudinal primary care EHR dataset with linked secondary care records to investigate the optimisation of an asthma attack risk prediction model. To inform the model, I refined methods for estimating asthma medication adherence from EHRs, compared model training data enrichment procedures, and evaluated measures for validating model performance. After conducting a critical appraisal of the methods employed in the literature, I trained and tested four statistical learning algorithms (logistic regression, naïve Bayes classification, random forests, and extreme gradient boosting) to predict asthma attacks in the next four weeks, and validated model performance in an unseen hold-out dataset. Training data enrichment methods were compared across all algorithms to establish whether sensitivity could be improved when predicting relatively uncommon events, such as asthma attacks in the general asthma population. Secondary event horizons were also examined, such as prediction in the next six months.
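As an illustration of what training data enrichment can look like for a rare outcome, the following is a minimal sketch of random undersampling of the majority class. The abstract does not name the enrichment procedures that were compared, so the choice of undersampling, the function name `undersample_majority`, the `ratio` parameter, and the toy data are all assumptions made for this example only.

```python
import random


def undersample_majority(samples, labels, ratio=1.0, seed=0):
    """Keep every positive sample and a random subset of the negatives.

    `ratio` is the desired number of negatives per positive
    (a hypothetical parameter, not taken from the thesis).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    kept = sorted(pos + rng.sample(neg, n_keep))
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Toy data: 1,000 samples of which 1% are positive (attack in the window).
labels = [1 if i % 100 == 0 else 0 for i in range(1000)]
samples = list(range(1000))

X_bal, y_bal = undersample_majority(samples, labels, ratio=1.0)
```

Enrichment of this kind would be applied to the training partition only, so that the hold-out data keep the original event rate and performance estimates remain representative.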
Empirical experimentation established balanced accuracy as the most appropriate measure of prediction model performance, and the ability of the estimated risk to discriminate between samples with and without attacks was additionally assessed using the Area Under the Receiver Operating Characteristic Curve (AUC).

Results Data were available for over 670,000 individuals, followed for up to 17 years (177,306 person-years in total). Binary prediction of asthma attacks in the following four-week period resulted in 1,203,476 data samples, of which 1% contained one or more attacks (12,193 total attacks). In the preliminary model selection phase, the random forest algorithm provided the best balance between accuracy in those with asthma attacks (sensitivity) and in those predicted to have attacks (positive predictive value) in the following four weeks. In an unseen data partition, the final random forest model, with optimised hyper-parameters, achieved an AUC of 0.91 and a balanced accuracy of 73.6% after the application of an optimised decision threshold. Accurate predictions were made for a median of 99.6% of those who did not go on to have attacks (specificity). As expected with rare event predictions, the sensitivity was lower at 47.7%, but this was well balanced with the positive predictive value of 48.9%. Furthermore, several of the secondary models, including the model predicting asthma attacks in the following 12 weeks, achieved state-of-the-art performance and still had high potential clinical utility.

Conclusions I successfully developed an EHR-based model for predicting asthma attacks in the next four weeks. Accurately predicting asthma attack occurrence may facilitate closer monitoring to ensure that preventative therapy is adequately managing symptoms, reinforce the need to keep abreast of triggers, and allow rescue treatments to be administered quickly when necessary.
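The performance measures reported above are related by simple arithmetic, which the following sketch makes explicit. It uses the standard definitions of sensitivity, specificity, positive predictive value, and balanced accuracy; the function names and the confusion-matrix counts are hypothetical and are not taken from the thesis.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # accuracy in those who had attacks
    specificity = tn / (tn + fp)   # accuracy in those who did not
    ppv = tp / (tp + fp)           # accuracy in those predicted to have attacks
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


def balanced_accuracy(sensitivity, specificity):
    """Balanced accuracy is the mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2


# Hypothetical counts for a rare outcome (illustration only).
m = classification_metrics(tp=50, fp=50, tn=9850, fn=50)

# Check on the rounded figures reported in the abstract.
reported = balanced_accuracy(0.477, 0.996)
```

With the rounded values of 47.7% sensitivity and 99.6% specificity, the mean is about 0.7365, which agrees with the reported balanced accuracy of 73.6% to within rounding of the inputs.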