Statistical models in prognostic modelling with many skewed variables and missing data: a case study in breast cancer
Baneshi, Mohammad Reza
Prognostic models have clinical appeal to aid therapeutic decision making. In the UK, the Nottingham Prognostic Index (NPI) has been used, for over two decades, to inform patient management. However, it has been commented that NPI is not capable of identifying a subgroup of patients with a prognosis so good that adjuvant therapy with potential harmful side effects can be withheld safely. Tissue Microarray Analysis (TMA) now makes possible measurement of biological tissue microarray features of frozen biopsies from breast cancer tumours. These give an insight to the biology of tumour and hence could have the potential to enhance prognostic modelling. I therefore wished to investigate whether biomarkers can add value to clinical predictors to provide improved prognostic stratification in terms of Recurrence Free Survival (RFS). However, there are very many biomarkers that could be measured, they usually exhibit skewed distribution and missing values are common. The statistical issues raised are thus number of variables being tested, form of the association, imputation of missing data, and assessment of the stability and internal validity of the model. Therefore the specific aim of this study was to develop and to demonstrate performance of statistical modelling techniques that will be useful in circumstances where there is a surfeit of explanatory variables and missing data; in particular to achieve useful and parsimonious models while guarding against instability and overfitting. I also sought to identify a subgroup of patients with a prognosis so good that a decision can be made to avoid adjuvant therapy. I aimed to provide statistically robust answers to a set of clinical question and develop strategies to be used in such data sets that would be useful and acceptable to clinicians. A unique data set of 401 Estrogen Receptor positive (ER+) tamoxifen treated breast cancer patients with measurement for a large panel of biomarkers (72 in total) was available. Taking a statistical approach, I applied a multi-faceted screening process to select a limited set of potentially informative variables and to detect the appropriate form of the association, followed by multiple imputations of missing data and bootstrapping. In comparison with the NPI, the final joint model derived assigned patients into more appropriate risk groups (14% of recurred and 4% of non-recurred cases). The actuarial 7-year RFS rate for patients in the lowest risk quartile was 95% (95% C.I.: 89%, 100%). To evaluate an alternative approach, biological knowledge was incorporated into the process of model development. Model building began with the use of biological expertise to divide the variables into substantive biomarker sets on the basis of presumed role in the pathway to cancer progression. For each biomarker family, an informative and parsimonious index was generated by combining family variables, to be offered to the final model as intermediate predictor. In comparison with NPI, patients into more appropriate risk groups (21% of recurred and 11% of non-recurred patients). This model identified a low-risk group with 7-year RFS rate at 98% (95% C.I.: 96%, 100%).