Scalable formulation of joint modelling for longitudinal and time to event data and its application on large electronic health record data of diabetes complications
Authors
Thoma, Ioanna
Abstract
INTRODUCTION:
Clinical decision-making in the management of diabetes and other chronic diseases depends upon individualised predictions of the risk of disease progression or complications. With sequential measurements of biomarkers, it should be possible to make dynamic predictions that are updated as new data arrive. Since the 1990s, methods have been developed to jointly model longitudinal measurements of biomarkers and time-to-event data, aiming to facilitate predictions in various fields.
These methods offer a comprehensive approach to analysing both the longitudinal changes in biomarkers and the occurrence of events, allowing for a more integrated understanding of the underlying processes and improved predictive capabilities. The aim of this thesis is to investigate whether established methods for joint modelling can scale to large electronic health record datasets with multiple biomarkers measured asynchronously, and to evaluate the performance of a novel approach that overcomes the limitations of existing methods.
METHODS:
The epidemiological study design utilised in this research is a retrospective observational study. The data used for these analyses were obtained from a registry encompassing all individuals with type 1 diabetes (T1D) in Scotland, which is delivered by the Scottish Care Information - Diabetes Collaboration platform. The two outcomes studied were time to cardiovascular disease (CVD) and time to end-stage renal disease (ESRD) from T1D diagnosis. The longitudinal biomarkers examined in the study were glycosylated haemoglobin (HbA1c) and estimated glomerular filtration rate (eGFR). These biomarkers and endpoints were selected based on their prevalence in the T1D population and the established association between these biomarkers and the outcomes.
As a state-of-the-art method for joint modelling, Brilleman’s stan_jm() function was evaluated. This is an implementation in Stan of a shared-parameter joint model for longitudinal and time-to-event data, contributed to the rstanarm package. It was compared with a novel approach based on sequential Bayesian updating of a continuous-time state-space model for the biomarkers, in which predictions generated by a Kalman filter algorithm using the ctsem package are fed into a Poisson time-splitting regression model for the events. In contrast to the standard joint modelling approach, which can only fit a linear mixed model to the biomarkers, the ctsem package is able to fit a broader family of models that include terms for autoregressive drift and diffusion. As a baseline for comparison, a last-observation-carried-forward model was evaluated to predict time-to-event.
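The time-splitting layout used by the baseline comparison can be illustrated with a minimal sketch. It is not the thesis pipeline: the function name, the one-year interval width, the field names and the example HbA1c values are all hypothetical, and the biomarker value attached to each interval is the last observation carried forward rather than a Kalman filter prediction.

```python
from bisect import bisect_right


def split_follow_up(exit_time, event, measurements, width=1.0):
    """Split one person's follow-up [0, exit_time) into intervals.

    measurements: list of (time, value) pairs sorted by time (hypothetical format).
    Each returned row holds the interval bounds, the person-time at risk,
    an event indicator (1 only in the final interval if the person had the
    event), and the last biomarker value observed at or before the interval
    start (last observation carried forward).
    """
    times = [t for t, _ in measurements]
    rows = []
    k = 0
    while k * width < exit_time:
        start = k * width
        stop = min(start + width, exit_time)
        # Index of the most recent measurement at or before the interval start.
        i = bisect_right(times, start) - 1
        locf = measurements[i][1] if i >= 0 else None
        rows.append({
            "start": start,
            "stop": stop,
            "person_years": stop - start,
            "event": int(bool(event) and stop == exit_time),
            "biomarker_locf": locf,
        })
        k += 1
    return rows


# Hypothetical person: event at 2.5 years, HbA1c measured at 0 and 1.2 years.
rows = split_follow_up(2.5, True, [(0.0, 60.0), (1.2, 72.0)])
for row in rows:
    print(row)
```

Rows in this form could then be passed to a Poisson regression with the event indicator as the response and the log of person_years as an offset; the model-fitting step is omitted here.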
RESULTS:
The analyses were conducted using renal replacement therapy outcome data for 29764 individuals and cardiovascular disease outcome data for 29479 individuals in Scotland (as per the 2019 national registry extract). The CVD dataset was reduced to 24779 individuals with both HbA1c and eGFR data measured on the same date, a limitation of the modelling function itself. The datasets include 799 events of renal replacement therapy (RRT) or death due to renal failure (6.71 years average follow-up) and 2274 CVD events (7.54 years average follow-up) respectively. The standard approach to joint modelling, which uses quadrature to integrate over the trajectories of the latent biomarker states and is implemented in rstanarm, was found to be too slow to use even with moderate-sized datasets: 17.5 hours for a subset of 2633 subjects, 35.9 hours for 5265 subjects, and more than 68 hours for 10532 subjects. The sequential Bayesian updating approach was much faster, analysing a dataset of 29121 individuals over 225598.3 person-years in 19 hours. Comparison of the fit of different longitudinal biomarker submodels showed that models that also included drift and diffusion terms fitted much better (AIC 51139 deviance units lower) than models that included only a linear mixed model slope term. Despite this, adding terms for diffusion and drift improved predictive performance only slightly for CVD (C-statistic 0.680 to 0.696 for 2112 individuals) and only moderately for end-stage renal disease (C-statistic 0.88 to 0.91 for 2000 individuals). The predictive performance of joint modelling in these datasets was only slightly better than using last-observation-carried-forward in the Poisson regression model (C-statistic 0.819 over 8625 person-years).
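For readers unfamiliar with the C-statistic reported above, the following minimal sketch shows one common definition for a binary outcome: the proportion of (event, non-event) pairs in which the individual with the event was assigned the higher predicted risk, with ties counted as one half. This is an assumption for illustration only; the exact definition used in the thesis (for example, a time-dependent or person-time version) is not stated in this abstract, and the function name and example values are hypothetical.

```python
def c_statistic(risks, events):
    """Pairwise concordance between predicted risks and binary outcomes.

    risks:  predicted risk score per individual (higher means more likely).
    events: 1/True if the individual had the event, 0/False otherwise.
    Returns a value between 0 and 1; 0.5 corresponds to no discrimination.
    """
    cases = [r for r, e in zip(risks, events) if e]
    controls = [r for r, e in zip(risks, events) if not e]
    if not cases or not controls:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for case_risk in cases:
        for control_risk in controls:
            if case_risk > control_risk:
                concordant += 1.0
            elif case_risk == control_risk:
                concordant += 0.5  # ties count as half-concordant
    return concordant / (len(cases) * len(controls))


# Hypothetical predicted risks and outcomes for four individuals.
print(c_statistic([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

In the example, three of the four (event, non-event) pairs are ordered correctly, giving 0.75.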
CONCLUSIONS:
I have demonstrated that, unlike the standard approach to joint modelling implemented in rstanarm, the time-splitting joint modelling approach based on sequential Bayesian updating can scale to a large dataset and allows biomarker trajectories to be modelled with a wider family of models that fit better than simple linear mixed models. However, in this application, where the only biomarkers were HbA1c and eGFR and the outcomes were time to CVD and time to end-stage renal disease, the increment in the predictive performance of joint modelling compared with last-observation-carried-forward was slight. For other outcomes, where the ability to predict time-to-event depends upon modelling latent biomarker trajectories rather than just using the last observation carried forward, the advantages of joint modelling may be greater.
This thesis proceeds as follows. The first two chapters serve as an introduction to the joint modelling of longitudinal and time-to-event data and its relation to other methods for clinical risk prediction. Briefly, this part explores the rationale for using such an approach to better manage chronic diseases such as T1D. The methodological chapters of this thesis describe the mathematical formulation of a multivariate shared-parameter joint model and introduce its application and performance on a subset of individuals with T1D and data pertaining to CVD and ESRD outcomes.
Additionally, the mathematical formulation of an alternative time-splitting approach is presented and compared with a conventional method for estimating longitudinal trajectories of clinical biomarkers used in risk prediction. The key features of the pipeline required to implement this approach are also outlined. The final chapters of the thesis present an applied example that demonstrates the estimation and evaluation of the alternative modelling approach and explores the types of inferences that can be obtained for a subset of individuals with T1D who might progress to ESRD. Finally, this thesis highlights the strengths and weaknesses of applying and scaling up more complex modelling approaches to facilitate dynamic risk prediction for precision medicine.