Edinburgh Research Archive

Metric learning on high-dimensional data with optimal transport distances

Authors

Fulop, Patric-Manuel

Abstract

Optimal transport (OT) distances, also known as Kantorovich or Wasserstein distances, and their approximate variants such as Sinkhorn divergences, have been widely used in recent years in machine learning and its applications. From serving as loss functions in generative models such as GANs and in unsupervised domain adaptation, to the more recent cluster-assignment step in large self-supervised state-of-the-art models such as SwAV, they offer a principled way to compare probability distributions. OT is an automatic machinery that takes as input a ground metric on the data features and lifts it to distances between probability measures on that data space. A pitfall of other commonly used methods, such as the Kullback-Leibler divergence from the family of f-divergences, is the breakdown of Euclidean metrics in high-dimensional spaces, as well as infinite values when the supports of the distributions do not overlap.

In the first part of the thesis, we provide an introduction to optimal transport theory, followed by the relevant literature on metric learning and generative modelling that involves OT, including recent advances in approximating OT distances, namely Sinkhorn divergences, that make training generative models with the Wasserstein distance faster and more scalable. Two of the main challenges in using OT distances in practice are the computational cost when the data live in high dimensions and the choice of a suitable ground metric.

First, we examine the recent work of Paty and Cuturi (2019), which reduces the computational cost by computing OT on low-rank projections of the data, viewed as discrete measures. We extend this approach and show that one can approximate OT distances using more general families of maps, provided they are 1-Lipschitz; the best estimate is obtained by maximising OT over the given family. Because OT is computed after mapping the data to a lower-dimensional space, our method scales well with the original data dimension and is robust to noise. We demonstrate the idea with neural networks and provide insights into using these methods for training generative models. Second, we study learning the ground metric for OT distances in a supervised manner and compare this to traditional metric-learning methods, such as learning the parameters of a Mahalanobis distance, on MNIST. Finally, we consider potential avenues for future research in this area.
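
To make the entropic approximation mentioned in the abstract concrete, the following is a minimal Python sketch of an entropy-regularised OT cost between two discrete measures, computed with plain Sinkhorn iterations. The function name, fixed iteration count, and toy data are illustrative assumptions, not the thesis's implementation.

    import numpy as np

    def sinkhorn_cost(a, b, C, eps=0.1, n_iters=200):
        """Entropy-regularised OT cost <P, C> between two discrete measures.

        a, b : weight vectors of the two measures (each sums to 1)
        C    : ground-cost matrix, C[i, j] = cost of moving mass from i to j
        eps  : entropic regularisation strength
        """
        K = np.exp(-C / eps)              # Gibbs kernel
        u = np.ones_like(a)
        for _ in range(n_iters):          # alternating Sinkhorn scaling updates
            v = b / (K.T @ u)
            u = a / (K @ v)
        P = u[:, None] * K * v[None, :]   # approximate transport plan
        return np.sum(P * C)

    # Toy example: two point clouds in R^2 with uniform weights.
    rng = np.random.default_rng(0)
    x, y = rng.normal(size=(5, 2)), rng.normal(size=(7, 2))
    C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1) ** 2
    a, b = np.full(5, 1 / 5), np.full(7, 1 / 7)
    print(sinkhorn_cost(a, b, C))

The debiased Sinkhorn divergence discussed in the literature is obtained from this quantity as S(a, b) = OT_eps(a, b) - (OT_eps(a, a) + OT_eps(b, b)) / 2, which removes the entropic bias and vanishes when the two measures coincide.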
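The projection idea can be sketched as follows: a 1-Lipschitz map can only contract ground distances, so the OT cost between the pushforward measures lower-bounds the original OT cost, and maximising over the family yields the best such estimate. Below is a minimal sketch using randomly sampled linear maps of spectral norm 1, with random search standing in for a proper maximisation; it reuses the sinkhorn_cost helper from the previous sketch, and all dimensions and sample sizes are illustrative assumptions.

    import numpy as np

    def lipschitz_projection_bound(x, y, a, b, k=2, n_candidates=64, seed=0):
        """Lower-bound OT(mu, nu) by maximising the OT cost of k-dimensional
        pushforwards over randomly sampled 1-Lipschitz linear maps.
        Reuses sinkhorn_cost(...) defined in the previous sketch.
        """
        rng = np.random.default_rng(seed)
        best = -np.inf
        for _ in range(n_candidates):
            W = rng.normal(size=(x.shape[1], k))
            W /= np.linalg.norm(W, 2)     # spectral norm 1 => map is 1-Lipschitz
            # Squared Euclidean cost in the projected space; since the map
            # contracts distances, this OT cost never exceeds the original.
            Cp = np.linalg.norm((x @ W)[:, None, :] - (y @ W)[None, :, :],
                                axis=-1) ** 2
            best = max(best, sinkhorn_cost(a, b, Cp))
        return best

    # Toy usage: two high-dimensional point clouds, OT estimated in 2-D.
    rng = np.random.default_rng(1)
    x = rng.normal(size=(40, 100))
    y = rng.normal(size=(60, 100)) + 1.0
    a, b = np.full(40, 1 / 40), np.full(60, 1 / 60)
    print(lipschitz_projection_bound(x, y, a, b))

Because the Sinkhorn computation runs on k-dimensional projections rather than the 100-dimensional originals, the cost of each evaluation is independent of the ambient dimension, which is the scaling behaviour the abstract refers to.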
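For the second direction, the classical baseline mentioned in the abstract parameterises the ground metric as a Mahalanobis distance d_M(x, y)^2 = (x - y)^T M (x - y) with M positive semi-definite and learned from labels. The sketch below fits M by projected gradient descent on a simple contrastive objective; the objective, step size, and normalisation are illustrative placeholders, not the thesis's training setup.

    import numpy as np

    def learn_mahalanobis(x, labels, n_steps=100, lr=0.01):
        """Fit a PSD matrix M so that d_M(x_i, x_j)^2 shrinks for same-label
        pairs and grows for different-label pairs."""
        n, d = x.shape
        diff = x[:, None, :] - x[None, :, :]             # pairwise differences
        outer = np.einsum('ijk,ijl->ijkl', diff, diff)   # (x_i - x_j)(x_i - x_j)^T
        same = (labels[:, None] == labels[None, :]).astype(float)
        sign = 2 * same - 1                              # +1 pull together, -1 push apart
        # Gradient of the linear objective sum_ij sign_ij * d_M(i, j)^2.
        grad = np.einsum('ij,ijkl->kl', sign, outer) / n ** 2
        M = np.eye(d)
        for _ in range(n_steps):
            M = M - lr * grad                            # gradient step
            w, V = np.linalg.eigh(M)
            M = (V * np.clip(w, 0.0, None)) @ V.T        # project onto the PSD cone
            M *= d / max(np.trace(M), 1e-12)             # fix the scale of M
        return M

    # Toy usage: two labelled Gaussian blobs in 5 dimensions.
    rng = np.random.default_rng(0)
    x = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(3, 1, (20, 5))])
    labels = np.repeat([0, 1], 20)
    M = learn_mahalanobis(x, labels)

The learned M then defines the ground cost C[i, j] = (x_i - y_j)^T M (x_i - y_j) handed to the OT solver above, so that the transport geometry reflects the label structure rather than raw Euclidean distances.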
