|dc.description.abstract||The analysis of large administrative data sets can provide researchers with answers to many research and policy questions. Scotland has a wide range of such data available, in many cases with more detail than is available in similar data from other parts of the UK or from other countries. Initiatives to widen access to these data are in place in Scotland, including the Administrative Data Research Centre - Scotland (ADRC-S) and the Longitudinal Study Centre Scotland (LSCS), both led by the University of Edinburgh. Researchers who want to use data from these sources must submit an application justifying their use to panels that balance the public benefit of the proposed project against its potential for disclosing confidential information. Once the project is approved, the researchers will usually have to visit a secure location (safe haven) where the data are made available to them under supervision. These procedures are necessary because it is widely understood that simply removing identifiers such as names and addresses does not prevent individuals from being identified.
These procedures put constraints on researchers who want to use administrative data. It is difficult for them to acquire the experience and skills required to handle these large and often messy data sources. Also, the need to visit a safe haven can restrict users to certain geographic locations. One way to lessen these limitations is to make synthetic versions of administrative data available to researchers. Synthetic data maintain the analytical properties of the original data but contain no real individuals. They can be made available to researchers to develop exploratory analyses and debug code before they visit the secure setting, so that safe haven visits are mainly used to run the final analyses on the real data. Another popular use of our synthetic data is to create realistic data sets for teaching researchers methods for analysing large administrative data sets.
The task of producing synthetic data that have the same properties as the original data, i.e. data for which the results of an analysis will be close to those from the original, is a challenging one. To facilitate it we have developed open-source software (the synthpop package for R), which we are now using both to make data available to researchers using the Scottish Longitudinal Study (SLS) and to create teaching data sets.
This presentation will give an overview of synthetic data, highlight some of the difficulties in producing them, and describe how we have overcome them. We will illustrate the use of synthpop to create a training data set based on an analysis of young people who are not in education, employment or training (NEETs).||en