Edinburgh Research Archive

A critical look at methods for protection of privacy in the 2015 Charter for Safe Havens in Scotland for handling unconsented data from NHS patient records to support research

Abstract

To allow e-health records to be exploited for research without individual-level consent, the 2015 Charter for Safe Havens in Scotland recommended that datasets should be made available only through “safe haven” data warehouses. To protect privacy, the Charter laid down that linked datasets should be kept only for the “minimal time necessary”, and that analytical outputs should be manually checked for “statistical disclosure” before being made available for the user to copy. These measures reduce research productivity and impede the detailed work required to construct cohorts for studying the outcomes of chronic disease, without necessarily protecting privacy against attacks that might realistically be mounted. This presentation examines how to protect against such attacks.

The risks to privacy in these deidentified datasets come either from reidentification attacks by a trusted user with access to individual-level data, or from attribute disclosure attacks that exploit aggregate data released for publication. Both kinds of attack depend critically on the availability of “side information” about the targeted individual. Basic principles of information theory can be applied to quantify privacy as the entropy of a probability distribution (measured in bits), and to protect privacy by limiting an attacker’s ability to use side information. “Differential privacy” techniques, which protect against arbitrary side information, are unnecessarily stringent in the context of e-health records, where only a few of the variables in the record are likely to be available to an attacker who does not already have access to it. The variables most likely to be used as side information - gender, geographical location (health board), and year of birth - contain about 12 bits of information, compared with an entropy of about 22 bits for the probability distribution over identities, so a successful reidentification attack would require additional side information. A single hospital admission date, which might be available to an attacker, contains about 10 bits of information. Adding noise to the dates of events, while preserving the sequence of those events and the intervals between them, would provide additional protection against such attacks.

For aggregate data, standard rules for statistical disclosure control, such as the rule of at least 5 individuals in every cell of a frequency table, do not distinguish between variables such as lab results, which are unlikely to be available to an attacker, and variables such as geographical location, which are. More fundamentally, the rules do not take account of how the information leak accumulates over multiple variables: in the most extreme case, where the attacker has access to an individual’s genotypes, a successful attribute disclosure attack could be mounted using data aggregated from thousands of individuals.

Risks to privacy are much higher when e-health records are linked to social data, as these linked datasets contain more variables that are likely to be available to an adversary and could be exploited as side information. The current “one-size-fits-all” approach, which does not distinguish between high-risk and low-risk linkages, imposes unnecessary constraints on research that uses only e-health records, where the risk to privacy is minimal.

A video of this presentation can be viewed at https://media.ed.ac.uk/media/0_taraqidu
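As a back-of-the-envelope check on the bit counts quoted in the abstract, the sketch below recomputes them under the simplifying assumption that each attribute is uniformly distributed and independent. The population size, number of health boards, range of birth years, and extract length are illustrative assumptions, not figures taken from the presentation.

```python
import math

def bits(n: int) -> float:
    """Entropy, in bits, of a uniform distribution over n equally likely outcomes."""
    return math.log2(n)

# Assumed magnitudes, for illustration only.
population = 5_400_000                  # approximate population of Scotland
print(f"identity entropy: {bits(population):.1f} bits")        # ~22.4 bits

sex = bits(2)                           # ~1.0 bit
health_board = bits(14)                 # 14 territorial NHS boards: ~3.8 bits
birth_year = bits(80)                   # ~80 plausible birth years: ~6.3 bits
print(f"basic side information: {sex + health_board + birth_year:.1f} bits")  # ~11 bits

admission_date = bits(3 * 365)          # one date within a 3-year extract: ~10.1 bits
print(f"one admission date: {admission_date:.1f} bits")
```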
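The presentation does not set out a differential-privacy mechanism; the standard Laplace mechanism below merely illustrates the guarantee being referred to, namely that the released value is almost equally likely whether or not any one individual is in the data, whatever side information the attacker holds. The function name dp_count is hypothetical.

```python
import numpy as np

def dp_count(true_count: int, epsilon: float, rng: np.random.Generator) -> float:
    """Release a count under epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one person changes
    the count by at most 1), so Laplace noise with scale 1/epsilon suffices.
    The guarantee holds against arbitrary side information, which is why the
    abstract describes it as unnecessarily stringent for e-health records.
    """
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(seed=0)
print(dp_count(42, epsilon=0.1, rng=rng))  # smaller epsilon means more noise
```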
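One common way to realise the date-noising idea, sketched here as an illustration rather than as a method specified in the Charter, is to shift every event in a patient's record by a single random per-patient offset: calendar dates change, but the order of events and the intervals between them are preserved exactly.

```python
import random
from datetime import date, timedelta

def shift_dates(events, max_shift_days=180, rng=None):
    """Shift all of one patient's event dates by the same random offset.

    Because a single offset is applied per patient, event ordering and
    inter-event intervals are preserved while the true dates are hidden.
    """
    rng = rng or random.Random()
    offset = timedelta(days=rng.randint(-max_shift_days, max_shift_days))
    return {name: d + offset for name, d in events.items()}

record = {
    "admission": date(2014, 3, 2),
    "discharge": date(2014, 3, 9),
    "readmission": date(2014, 6, 1),
}
print(shift_dates(record, rng=random.Random(42)))
```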
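Finally, the “rule of 5” amounts to suppressing small cells in a released frequency table. A minimal sketch, assuming pandas and a threshold of 5, is below; as the abstract argues, a rule of this kind treats every variable alike and does nothing about leakage that accumulates across many released tables.

```python
import pandas as pd

def suppress_small_cells(table: pd.DataFrame, threshold: int = 5) -> pd.DataFrame:
    """Blank out (set to NaN) any frequency-table cell with a count below the threshold."""
    return table.mask(table < threshold)

# Hypothetical frequency table of outcomes by health board.
counts = pd.DataFrame(
    {"board_A": [12, 3, 40], "board_B": [7, 5, 2]},
    index=["outcome_x", "outcome_y", "outcome_z"],
)
print(suppress_small_cells(counts))
```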
