Joint modelling of longitudinal and survival data for dynamic prediction in credit-related applications
Item status: Restricted Access
Embargo end date: 09/12/2022
Medina-Olivares, Víctor H.
Lenders monitor their borrowers over time, allowing them to dynamically predict the probability of an event of interest, such as default. The widely used survival models focus on when the event happens and can handle time-varying covariates (TVCs) and censored observations. However, an issue little addressed in the literature is that the model specification and the predictive framework depend on the type of TVC included. TVCs can be either exogenous or endogenous to the survival time. Exogenous TVCs are those whose future paths are not affected by the event's occurrence, such as macroeconomic variables. Endogenous TVCs, by contrast, are those whose paths are influenced by the survival status. An example of the latter is the unpaid principal balance when the event is default. This thesis explores a class of mathematical models new to credit-related applications, known as joint models of longitudinal and survival data. Initially developed in medical research, these models, in their standard version, are formed by two sub-models, one for the survival process and the other for the endogenous TVC (also called the longitudinal outcome in this context). A latent structure links the sub-models, commonly in the form of random effects. Joint models have two advantages compared to survival models. First, they allow us to handle possible endogeneities in the TVCs. Second, by jointly modelling both processes, they offer us a dynamic prediction framework that incorporates their mutual evolution. We propose a series of innovations to make the approach appropriate to credit-related applications. These innovations relate to the nature of survival time, the specific evolution of the TVCs, ways to scale the technique to large datasets and how to leverage the available data in the modelling framework. Specifically, we adapt the formulation of the joint models and their performance metrics to the discrete nature of the loan data.
In addition, we include autoregressive terms in the TVC specification to address observed serial correlation and enhance predictive capability. Moreover, we can study more complex specifications with larger datasets by reformulating the approach within the integrated nested Laplace approximation (INLA) framework, a fast and accurate algorithm for Bayesian inference. Among these specifications are joint models with more than one TVC and a joint model that leverages geographical information to include spatial and spatio-temporal effects in the hazard function. We also introduce a more accurate way to estimate individual survival predictions using the Laplace method. Finally, to compare different models, we propose a computationally efficient implementation of the cross-entropy estimate of the posterior predictive conditional density that uses the estimates obtained in the inference step. We apply joint models to predict the time to credit events in the following three settings: default in US mortgages, full prepayment in a German consumer loan portfolio, and full prepayment in US mortgages. The main empirical results show that the autoregressive terms in the joint model improve discrimination performance, that predictive ability is significantly enhanced compared to survival models when more TVCs are considered, and that the inclusion of spatial effects consistently leads to better data representation.
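For orientation, the standard joint model referred to above is commonly written in the joint-modelling literature as a linear mixed sub-model for the longitudinal outcome and a proportional hazards sub-model that shares its random effects. The following is a minimal sketch in continuous time; the symbols are illustrative assumptions and need not match the notation, or the discrete-time and autoregressive adaptations, used in the thesis.

```latex
% Longitudinal sub-model for the endogenous TVC of borrower i
% (m_i(t) is the error-free trajectory, b_i the random effects)
\begin{align}
  y_i(t) &= m_i(t) + \varepsilon_i(t),
  & \varepsilon_i(t) &\sim \mathcal{N}(0, \sigma^2), \\
  m_i(t) &= x_i(t)^\top \beta + z_i(t)^\top b_i,
  & b_i &\sim \mathcal{N}(0, D).
\end{align}

% Survival sub-model: the hazard depends on baseline covariates w_i
% and on the current value of the trajectory through alpha
\begin{equation}
  h_i(t) = h_0(t)\, \exp\!\bigl( \gamma^\top w_i + \alpha\, m_i(t) \bigr).
\end{equation}

% Dynamic prediction: survival beyond u > t, given survival up to t
% and the longitudinal history Y_i(t) observed so far
\begin{equation}
  \pi_i(u \mid t) = \Pr\bigl( T_i > u \mid T_i > t,\; \mathcal{Y}_i(t) \bigr).
\end{equation}
```

Here the association parameter \alpha measures how strongly the current level of the TVC drives the hazard, and the shared random effects b_i are the latent structure that links the two sub-models.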