Deep unsupervised machine learning in the presence of missing data

Authors

Šimkus, Vaidotas

Abstract

Advances in deep statistical models have reshaped modern data-driven applications, demonstrating remarkable empirical success across diverse domains. However, while some domains benefit from an abundance of clean and fully-observed data, enabling practitioners to reap the full benefits of deep models, other domains often grapple with incomplete data, which hinders the effective application of these powerful models. In this thesis, we investigate and address important challenges caused by missing data that hinder the use of deep models, focusing on two key statistical tasks: parameter estimation from incomplete training data sets and missing data imputation.

First, we explore the problem of missing data imputation using pre-trained models, focusing on deep statistical models in the class of variational autoencoders (VAEs). Our exploration reveals limitations of existing methods for conditional sampling of VAEs, identifying pitfalls related to commonly desired properties of learnt VAEs that hinder the methods' performance in certain scenarios. To mitigate these pitfalls, we propose two novel methods based on Markov chain Monte Carlo and importance sampling. Our evaluation shows that the proposed methods improve missing data imputation using pre-trained VAEs across diverse data sets.

Subsequently, we shift our attention to the estimation of VAEs from incomplete training data sets. While this area has received substantial attention in the literature, we report a previously unknown phenomenon caused by missing data that hinders the effective fitting of VAEs. To overcome these adverse effects and improve VAE estimation from incomplete data, we introduce two strategies based on variational mixture distributions that trade off computational efficiency, model accuracy, and learnt latent structure. We demonstrate that the proposed approaches improve VAE estimation from incomplete data compared to existing approaches that do not use variational mixtures.

Expanding our focus to the broader challenge of estimating general statistical models, we observe uneven progress across different classes of deep models. To advance the adoption of all deep statistical models, we introduce variational Gibbs inference (VGI), a general-purpose method for maximum-likelihood estimation of general statistical models with tractable likelihood functions. We show that the method is capable of accurate model estimation from incomplete data, including VAEs and normalising flows. Importantly, VGI is one of the few probabilistically-principled methods in the current literature for normalising flow estimation from incomplete data, and it achieves state-of-the-art performance. By providing a unified framework for handling missing data in model estimation, VGI paves the way for leveraging the full potential of deep statistical models across diverse domains grappling with missing data.
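To make the conditional-sampling setting concrete, the sketch below shows a generic pseudo-Gibbs imputation chain for a VAE, a well-known baseline whose shortcomings motivate more principled samplers; it is not the thesis' proposed MCMC or importance-sampling methods, and the toy encoder, decoder, dimensions, and iteration count are illustrative assumptions. The chain alternates between sampling a latent code from the encoder given the current imputation and redrawing the missing entries from the decoder, with the observed entries held fixed.

```python
# Illustrative pseudo-Gibbs imputation chain for a VAE (a sketch, not the
# thesis' proposed methods). The untrained linear encoder/decoder stand in
# for a pre-trained model; all sizes are hypothetical.
import torch

torch.manual_seed(0)
D, K = 8, 2  # data and latent dimensionality (assumed for illustration)

enc = torch.nn.Linear(D, 2 * K)  # q(z | x): outputs latent mean and log-variance
dec = torch.nn.Linear(K, D)      # p(x | z): outputs data mean (unit noise assumed)

x = torch.randn(D)               # a data point
observed = torch.rand(D) < 0.5   # True where the entry is observed

x_imp = x.clone()
x_imp[~observed] = 0.0           # crude initial fill for the missing entries

with torch.no_grad():
    for _ in range(50):          # pseudo-Gibbs iterations
        mean_z, logvar_z = enc(x_imp).chunk(2)
        z = mean_z + torch.randn(K) * (0.5 * logvar_z).exp()  # z ~ q(z | x_imp)
        x_new = dec(z) + torch.randn(D)                       # x ~ p(x | z)
        x_imp[~observed] = x_new[~observed]  # resample only the missing entries

print(x_imp)  # final imputation; observed entries are unchanged
```

A known pitfall of this baseline is that it relies on the amortised encoder q(z | x), so the chain's stationary distribution need not match the true conditional over the missing entries given the observed ones; correcting such mismatches is what principled MCMC and importance-sampling schemes for VAEs aim to do.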
