Edinburgh Research Archive

Data privacy and valuation for trustworthy machine learning

Authors

Watson, Lauren

Abstract

Widespread data collection - and the subsequent use of this data beyond the control of the original contributors - has become a fact of modern life. The recent excitement generated by impressive applications of machine learning models is thus tempered by growing concern about their ethical impact. In this thesis, we examine two ethical aspects of the machine learning process: the privacy and value of an individual's data, and their subtle relationship to effectively training machine learning models.

Accurate valuation of data has several important applications, including fairly compensating individuals for their contributions and enabling resource-efficient machine learning. However, current state-of-the-art data valuation techniques are highly computationally expensive, even to approximate. We introduce techniques that exploit the theoretical properties of machine learning algorithms to efficiently evaluate the impact of individual datapoints on machine learning models - without sacrificing accuracy.

In data privacy, on the other hand, it has become apparent that machine learning models and statistical analyses put the privacy of their underlying datasets at risk, even when only the result of the analysis is released and the data itself is never made public. This has led to two key directions of work in the data privacy community: the development of attacks demonstrating the privacy risks of machine learning models, and the proposal of privacy protection approaches. On the attack side, we examine the severity of the privacy risks that machine learning models pose to individual data contributors, and demonstrate how to improve attacker efficacy by calibrating attacks to the typical model behaviour with respect to a given point. On the protection side, we propose techniques based on intuitive privacy approaches and on the properties of the stochastic gradient descent algorithm.
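The core idea behind the data-valuation problem discussed above can be illustrated with a leave-one-out sketch: a point's value is the change in held-out accuracy when it is removed from the training set, and computing this exactly for every point requires retraining once per point, which is what makes exact valuation expensive. This is a toy illustration only - the learner, dataset, and function names are hypothetical, and the thesis's own techniques are far more efficient than retraining.

```python
# Leave-one-out data valuation: a minimal, illustrative sketch.
# A datapoint's value = held-out accuracy with the point included
# minus held-out accuracy with the point removed.

def train_centroids(points):
    """Toy learner: nearest-centroid classifier (per-class mean)."""
    sums, counts = {}, {}
    for x, y in points:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def accuracy(centroids, test_points):
    correct = sum(
        min(centroids, key=lambda c: abs(x - centroids[c])) == y
        for x, y in test_points
    )
    return correct / len(test_points)

def loo_values(train, test):
    """Value of point i: retrain without i and measure the accuracy drop."""
    base = accuracy(train_centroids(train), test)
    return [base - accuracy(train_centroids(train[:i] + train[i + 1:]), test)
            for i in range(len(train))]

train = [(0.0, "a"), (0.2, "a"), (5.0, "b"), (9.9, "a")]  # last point mislabelled
test = [(0.1, "a"), (4.0, "b")]
values = loo_values(train, test)  # the mislabelled point receives negative value
```

Note that `loo_values` retrains the model once per training point; Shapley-style valuations are more expensive still, averaging such contributions over many subsets, which motivates the efficient approximations developed in the thesis.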
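The calibration idea used in the privacy attacks can also be sketched. In the spirit of likelihood-ratio-style membership inference, the attacker compares the target model's loss on a point against the distribution of losses that reference ("shadow") models not trained on that point typically achieve, instead of applying one global threshold. The Gaussian approximation and all names below are illustrative assumptions, not the thesis's exact attack.

```python
# Calibrated membership-inference score: a hedged, illustrative sketch.
# Assumption: reference losses are roughly Gaussian, so we score a point by
# how unusually low the target model's loss is relative to that distribution.
import statistics
from math import erf, sqrt

def calibrated_membership_score(target_loss, reference_losses):
    """Return P(reference loss > target loss) under a Gaussian fit.

    Scores near 1 mean the target model's loss is atypically low for this
    point, which suggests the point was in the training set.
    """
    mu = statistics.mean(reference_losses)
    sigma = statistics.stdev(reference_losses) or 1e-9  # guard degenerate case
    z = (target_loss - mu) / sigma
    return 0.5 * (1 - erf(z / sqrt(2)))  # Gaussian tail probability
```

For example, with reference losses around 2.0, a target loss of 0.1 yields a score close to 1 (likely member), while a target loss of 2.0 yields a score near 0.5 (uninformative) - the per-point calibration is what distinguishes "low loss for this model" from "low loss for this point in general".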
We provably provide differential privacy, the current gold-standard privacy guarantee, while improving the privacy-utility trade-off for machine learning and for statistical attack detection. We then investigate the inefficiencies of the differentially private stochastic gradient descent (DP-SGD) learning algorithm and propose pruning as a way to alleviate them.
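For context, the differentially private stochastic gradient descent algorithm referenced above combines two mechanisms: each example's gradient is clipped to bound any individual's influence on the update, and calibrated Gaussian noise is added to the averaged gradient. The sketch below shows one such step; the hyperparameter values and function name are illustrative assumptions, and this is not the thesis's specific variant.

```python
# One DP-SGD update step: a minimal sketch (illustrative defaults, not tuned).
import numpy as np

rng = np.random.default_rng(0)

def dp_sgd_step(params, per_example_grads, clip_norm=1.0,
                noise_multiplier=1.1, lr=0.1):
    """Clip each per-example gradient, average, add Gaussian noise, step."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale down any gradient whose norm exceeds clip_norm.
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    mean_grad = np.mean(clipped, axis=0)
    # Noise scale is proportional to the sensitivity clip_norm / batch size.
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(per_example_grads),
                       size=mean_grad.shape)
    return params - lr * (mean_grad + noise)
```

The per-example clipping is precisely where the inefficiency mentioned above arises: unlike ordinary SGD, the algorithm must materialise a gradient per example before averaging, and the added noise degrades utility - which motivates pruning as a way to shrink the gradients being clipped and noised.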
