Are anonymised datasets from clinical trials truly anonymous?

Rodriguez, Aryelly

Files

Rodriguez2025.pdf (28.48 MB)

Rodriguez2025_redacted.pdf (46.79 MB)

Date

2025-03-04

Authors

Rodriguez, Aryelly

Abstract

BACKGROUND: Funders, regulators and publishers are increasingly requesting that clinical trial researchers share their research data with others once the primary analysis has been completed. Existing clinical trial data could significantly contribute to expanding medical and scientific knowledge by investigating questions beyond the original study scope, facilitating individual participant data (IPD) meta-analysis, verifying results, and exploring novel methodologies for data analysis. Anonymisation of IPD before sharing can offer a way to safeguard participants' privacy. While several recommendations and guidance documents are available for attempting data anonymisation prior to sharing, completely anonymising data while keeping it usable remains challenging. Moreover, many anonymised datasets are already publicly available for secondary research. However, it remains unclear whether study participants could be at risk of re-identification, and under what circumstances re-identification is more likely to occur.

METHODS: In the first phase of this PhD research, a systematic scoping review was conducted to gather publications that reported recommendations on anonymisation for enabling data sharing from clinical trials, to understand what guidance was available to researchers and how publicly available anonymised datasets from clinical trials might have been compiled. Two reviewers (Aryelly Rodriguez with either Chris Tuck or Alastair Murray) independently assessed titles, abstracts, and full texts for eligibility. One reviewer extracted data from the selected papers using thematic synthesis, and a second reviewer then checked the extraction for accuracy. Results were summarised through narrative analysis.

In the second phase, I collected a broad selection of publicly available anonymised datasets that have been made available for research purposes extending beyond their original scope, to explore the characteristics of these anonymised datasets, assess the feasibility of applying re-identification risk scores to them, and determine how these scores could be useful. I estimated re-identification risk scores for each dataset using three equations designed to calculate such scores from the information in the entire dataset. These equations are commonly applied to routinely collected health records and only generate numerical values ranging from 0 (lowest risk) to 1 (maximum risk), without attempting to re-identify individuals within the datasets. This analysis explored the characteristics of the datasets associated with increased or decreased risk scores, and compared the risk scores to evaluate their practicality for implementation.

In the third and final phase of this PhD research, I used an online exploratory cross-sectional descriptive survey, consisting of both open-ended and closed questions, to gather UK researchers’ views on their experiences with de-identification, anonymisation, release methods and re-identification risk estimation for clinical trials datasets.

RESULTS: The systematic scoping review identified 59 eligible articles (from 43 studies) for inclusion. From these articles, three distinct themes emerged: anonymisation, de-identification and pseudonymisation. The articles also showed that the most commonly recommended anonymisation techniques are the removal of direct participant identifiers, and the careful evaluation and modification of indirect identifiers to minimise the risk of identification. Anonymisation of datasets in conjunction with controlled access was the most recommended method for data sharing.

For the second phase, I contacted data holders and followed their local procedures to access the anonymised datasets. I identified 86 potentially eligible datasets from 18 repositories and successfully secured 76 of them. After full evaluation, 70 datasets met the inclusion criteria and were included in the analysis, representing 14 of the 18 repositories. Thirty-one datasets were shared with minimal restrictions (open access), while 39 were shared with varying levels of restrictions before access was granted (controlled access). Datasets had, on average, four identifiers and mean risk scores ranging from 0.47 to 0.91. The most common pieces of information present in the datasets that, when combined, may indirectly identify a participant were sex (80%) and age (72.9%).

For the final phase, the exploratory survey received 38 responses to the invitation between June 2022 and October 2022. Thirty-five participants (92%) used internal documentation, institutional standard operating procedures and/or published guidance to de-identify/anonymise clinical trials datasets. The most common process for releasing the datasets, reported by 18 (47%) participants, was de-identification followed by anonymisation and then fulfilling data holders’ requirements before access was granted (controlled access). Eleven participants (29%) had previous knowledge of re-identification risk estimation but had not used it. Experiences of de-identifying/anonymising the datasets and maintaining such datasets were mostly negative; the main reported issues were lack of resources, guidance, and training.

CONCLUSIONS: There is no single standardised set of recommendations on how to anonymise clinical trial datasets for sharing. However, the systematic scoping review showed a developing consensus on techniques used to achieve anonymisation. Researchers in clinical trials still consider that anonymisation techniques by themselves are insufficient to protect participant privacy, and that they need to be paired with controlled access. The second phase of this research confirmed that clinical trial datasets are very rich in personal details and that using re-identification risk scores as a measure of this richness is feasible. These scores could inform the anonymisation process of clinical trials datasets to release them for secondary research. We proposed a strategy for incorporating these scores into the decision-making process for releasing clinical trials datasets. Finally, the majority of responders to the survey reported using documented processes for de-identification and anonymisation. However, our survey results clearly indicate that there are still gaps in the areas of guidance, resources and training to fulfil sharing requests of de-identified/anonymised datasets, and that re-identification risk estimation is an underdeveloped area. This work will be of interest to the clinical trials research community, funders and publishers seeking to improve the process of anonymisation and foster data sharing.

URI

https://hdl.handle.net/1842/43168
http://dx.doi.org/10.7488/era/5709

This item appears in the following Collection(s)

Edinburgh Medical School thesis and dissertation collection