e-DataSHIELD: Running an analysis of combined data when the individual records cannot be combined
In social and epidemiological research, comparable data are often collected by agencies in different settings, for example in different countries or by different organisations. Disclosure concerns may prevent the agencies from releasing their data to outside users. Results can still be compared across agencies by running separate analyses in the safe haven provided by each agency and comparing the published reports, but this approach has several disadvantages. One can never be sure that data sets and variables that are nominally the same are really comparable. An analysis that adjusts for covariates within each agency separately will not give the same results as one fitted to the pooled raw data. Tests for agency-by-covariate interactions cannot readily be carried out from published reports.
A similar situation has arisen in the analysis of genomic data, where a pooled analysis of small individual studies is required for adequate inference, but the individual centres were unwilling to share their data. In response, the DataSHIELD system was developed (see www.datashield.org), in which a joint analysis is carried out iteratively by linking the computer in each centre to an analysis computer (AC). The AC holds no raw data; it receives summary statistics from each study, combines them, and passes the combined summaries back to the individual centres. Iterating this exchange of summary statistics allows generalised linear models to be fitted jointly to all the data. The interface between the AC and the centres prevents any raw data from being exchanged. When disclosure concerns do not allow the centre computers to be linked in this way, the procedure can be adapted by exchanging the summaries between agencies by email. Routines in R have been developed to allow such analyses to be carried out: the e-DataSHIELD protocol.
We will describe how this process works and present an example of some analyses that used data from the Scottish Longitudinal Study and the ONS Longitudinal Study (England and Wales) to compare mortality between individuals living in urban centres in the two countries.
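
The iterative exchange of summary statistics described above can be illustrated with a minimal sketch. It is not the e-DataSHIELD R code: it is written in Python, it assumes a logistic regression as the generalised linear model, and it assumes that the summaries each centre returns are the score vector and the information matrix evaluated at the current coefficient estimates. The function names (`centre_summaries`, `solve`, `fit_pooled`) and the toy data are hypothetical, and plain function calls stand in for the linked computers or the emailed summaries.

```python
import math


def centre_summaries(rows, beta):
    """Run inside one centre: return only aggregate score and information.

    rows is a list of (x, y) pairs, x a tuple of covariates (first entry 1.0
    for the intercept) and y a 0/1 outcome. No individual record is returned.
    """
    p = len(beta)
    score = [0.0] * p
    info = [[0.0] * p for _ in range(p)]
    for x, y in rows:
        eta = sum(b * xi for b, xi in zip(beta, x))
        mu = 1.0 / (1.0 + math.exp(-eta))  # fitted probability
        w = mu * (1.0 - mu)                # binomial variance weight
        for j in range(p):
            score[j] += x[j] * (y - mu)
            for k in range(p):
                info[j][k] += w * x[j] * x[k]
    return score, info


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_pooled(centres, p, tol=1e-10, max_iter=50):
    """Role of the analysis computer: it sees summaries only, never raw rows."""
    beta = [0.0] * p
    for _ in range(max_iter):
        # One round of exchange: send beta out, get summaries back.
        summaries = [centre_summaries(rows, beta) for rows in centres]
        score = [sum(s[j] for s, _ in summaries) for j in range(p)]
        info = [[sum(i[j][k] for _, i in summaries) for k in range(p)]
                for j in range(p)]
        step = solve(info, score)          # Fisher scoring update
        beta = [b + d for b, d in zip(beta, step)]
        if max(abs(d) for d in step) < tol:
            break
    return beta


# Hypothetical toy data: (intercept, binary covariate), binary outcome.
centre_a = [((1.0, 0.0), 0), ((1.0, 0.0), 0), ((1.0, 0.0), 1),
            ((1.0, 1.0), 0), ((1.0, 1.0), 1), ((1.0, 1.0), 1)]
centre_b = [((1.0, 0.0), 0), ((1.0, 0.0), 1), ((1.0, 0.0), 0),
            ((1.0, 1.0), 1), ((1.0, 1.0), 1), ((1.0, 1.0), 0)]

beta = fit_pooled([centre_a, centre_b], p=2)
beta_raw_pooled = fit_pooled([centre_a + centre_b], p=2)
print(beta, beta_raw_pooled)
```

Because the score and information are sums over records, adding the centre-level summaries gives the same update as a fit to the pooled raw data, so the two calls at the end should agree up to rounding. In this toy example the model is saturated, so the estimates can be checked by hand: the pooled event proportions are 1/3 and 2/3 in the two covariate groups, giving an intercept of -log 2 and a slope of 2 log 2.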