Applying missing data methods to routine data using the example of a population-based register of patients with diabetes
Read, Stephanie Helen
Background: Routinely collected data offer great potential for epidemiological research and could be used to make randomised controlled trials (RCTs) more efficient. The use of routine data for research has been limited by concerns surrounding data quality, particularly data completeness. To fully exploit these information-rich data sources it is necessary to identify approaches capable of overcoming high proportions of missing data. Using a 2008 extract of the Scottish Care Information – Diabetes Collaboration (SCIDC) database, a population-based register of people with a diagnosis of diabetes in Scotland, I compared the findings obtained with several methods for handling missing data in a retrospective cohort study investigating the association between body mass index (BMI) and all-cause mortality in patients with type 2 diabetes. Methods: Discussions with clinicians and logistic regression analyses were used to determine the likely mechanisms of missingness and the relative appropriateness of a selection of missing data methods, such as multiple imputation. Sequentially more complicated imputation approaches were used to handle missing data. Cox proportional hazards model coefficients for the association between BMI and all-cause mortality were compared for each missing data method. Age-standardised mortality rates by categories of BMI at around the time of diagnosis were also presented. Results: There were 66,472 patients diagnosed with type 2 diabetes between 2004 and 2008. Of these, 21% did not have a recording of BMI at the time of diagnosis. Amongst patients with complete BMI data, there were 5,491 deaths during 296,584 person-years of follow-up. Amongst patients with incomplete data, there were 2,090 deaths during 79,067 person-years of follow-up. Analyses indicated that the primary mechanism of missingness was missing at random, conditional on patient year of diagnosis and vital status.
In particular, patients with missing data had considerably worse survival than patients without missing data. Regardless of the method for handling the missing data, a U-shaped relationship between BMI and mortality was observed. Compared to complete case analysis, the association between BMI and all-cause mortality was weaker using multiple imputation approaches, with estimates moving towards the null. Closest observation imputation had the smallest effect on estimates compared to complete case analysis. Risk of mortality was consistently highest in the BMI group below 25 kg/m². For example, estimates obtained using multiple imputation by chained equations indicated that patients with a BMI below 25 kg/m² had a 38% higher risk of mortality than patients in the 25 to less than 30 kg/m² BMI category. Conclusions: Alternative methods to complete case analysis can be computationally intensive and raise many important practical considerations. However, it remains valuable to explore the robustness of estimates to departures from the assumptions made by complete case analysis. The use of these methods can preserve the sample size and therefore may be useful in developing risk prediction scores. Mortality was lowest amongst overweight or obese patients, relative to those of normal weight. Further work is required to identify optimal approaches to weight management amongst patients with diabetes.
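As a rough illustration of the survival difference between patients with and without a BMI recording, the crude mortality rates implied by the deaths and person-years reported in the Results can be computed directly. This is a minimal sketch and not the analysis used in the study: the rates are unadjusted for age or any other covariate, and the function and variable names are hypothetical.

```python
# Crude (unadjusted) mortality rates from the counts reported in the Results.
# Function and variable names are illustrative only, not taken from the study.


def crude_rate_per_1000(deaths: int, person_years: float) -> float:
    """Return deaths per 1,000 person-years of follow-up."""
    if person_years <= 0:
        raise ValueError("person_years must be positive")
    return 1000.0 * deaths / person_years


# Patients with a BMI recorded at the time of diagnosis.
rate_complete = crude_rate_per_1000(5_491, 296_584)

# Patients without a BMI recorded at the time of diagnosis.
rate_missing = crude_rate_per_1000(2_090, 79_067)

# Ratio of the two crude rates (missing BMI relative to complete BMI).
rate_ratio = rate_missing / rate_complete

print(f"Complete BMI: {rate_complete:.1f} deaths per 1,000 person-years")
print(f"Missing BMI:  {rate_missing:.1f} deaths per 1,000 person-years")
print(f"Crude rate ratio: {rate_ratio:.2f}")
```

On these counts the crude rate is higher amongst patients with missing BMI, in line with the worse survival noted in the Results. Because no adjustment is made, these figures are not comparable with the age-standardised rates or the Cox model estimates.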