Data cleaning with variational autoencoders
Eduardo, Simão Fernandes Lopes Marques
A typical data science or machine learning pipeline starts with data exploration, proceeds to data engineering (wrangling, cleaning), then to modelling (model selection, learning, validation), and ends with model visualization or deployment. Most datasets used in industry are either structured or text-based. Two relevant instances of structured datasets are graph data (e.g. knowledge graphs) and tabular data (e.g. Excel spreadsheets, databases). However, image datasets are increasingly used in industry and have similar pipeline steps. This thesis explores the data cleaning problem, two of whose main steps are outlier detection and subsequent data repair. This work focuses on outliers that result from corruption processes applied to a subset of instances belonging to an original clean dataset. Instances that are unaffected by corruption, as well as corrupted instances as they were before corruption, are called inliers. The outlier detection step finds which data instances have been corrupted. The repair step either replaces the entire instance with a clean version, or imputes the values of specific features in that instance that are deemed corrupted. In both cases, an ideal repair process restores the underlying inlier instance as it was before being corrupted by errors. The main goal is to devise machine learning (ML) models that automate both outlier detection and data repair, with minimal supervision by the end-user. In particular, we focus on solutions based on variational autoencoders (VAEs), because these are flexible generative models capable of providing repairs as samples or reconstructions. Moreover, the reconstructions provided by VAEs also allow for the detection of corrupted feature values, unlike classic outlier detection methods. Since the training dataset is corrupted by outliers, the key to good performance in detection and repair is model robustness to data corruption, which prevents overfitting to errors.
If the model overfits to errors, it becomes difficult to distinguish inliers from outliers, which degrades performance. In this thesis, two novel generative models are proposed for this task, to be used in different contexts. The two most common types of errors are random and systematic. Random errors corrupt each instance independently using an unknown distribution, exhibiting no clear anomalous pattern across outlier instances. Systematic errors result from nearly deterministic transformations that occur repeatedly in the data, exhibiting a clear pattern across outliers. Overall, this means that high-capacity models like VAEs more easily overfit to systematic errors, which compromises outlier detection and repair performance. This thesis focuses on point outliers, as they are the type most commonly encountered by practitioners. Point outliers are those that can be identified by evaluating the instance individually, without the context of other instances (e.g. space, time, graphs). The first proposed model is a novel unsupervised VAE that is robust to random errors in mixed-type (e.g. categorical, continuous) tabular data. This first model is called the Robust Variational Autoencoder (RVAE). We introduce this robustness by designing a decoder architecture that downweights the contribution of corrupted feature values (cells) during training. Unlike traditional methods, the model identifies not only which instances are outliers but also which cells have been corrupted, improving model interpretability. It is shown experimentally that the model performs better than baselines in cell outlier detection and repair, and is robust to the initial choice of hyperparameters. The second proposed model focuses on detection and repair in datasets corrupted by systematic errors. This second model is called the Clean Subspace Variational Autoencoder (CLSVAE). The nature of systematic errors makes them easy to learn, and thus easy to overfit to.
This means that if systematic errors are numerous in a dataset, unsupervised methods will have difficulty distinguishing between inliers and outliers. A novel semi-supervised VAE is proposed that requires only a small labelled set of inliers and outliers, thus minimizing end-user intervention. The main idea is to learn separate latent representations for inliers and systematic errors, and to use only the inlier representation for data repair. The novel model is shown to be robust to systematic errors, and it achieves state-of-the-art repair on image datasets. Compared to the baselines, the model performs better in challenging scenarios, where the corruption level is higher or the labelled set is very small.
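
As an illustration only, the two error types and the detect-then-repair workflow described above can be sketched on synthetic tabular data. Everything in this sketch is an assumption made for the example: the function names, the Gaussian clean data, the noise scale, the fixed offset, and the threshold are hypothetical, and the detector is a simple median/MAD rule standing in for a learned model. It is not the RVAE or the CLSVAE, and it says nothing about their performance.

```python
import numpy as np


def corrupt_random(x, frac, rng, scale=10.0):
    """Random errors: each selected cell gets its own independent noise draw."""
    x = x.copy()
    mask = rng.random(x.shape) < frac
    x[mask] += rng.normal(0.0, scale, size=int(mask.sum()))
    return x, mask


def corrupt_systematic(x, frac, rng, offset=20.0):
    """Systematic errors: the same fixed shift hits column 0 of every selected row."""
    x = x.copy()
    rows = rng.random(x.shape[0]) < frac
    mask = np.zeros(x.shape, dtype=bool)
    mask[rows, 0] = True
    x[mask] += offset
    return x, mask


def detect_and_repair(x, threshold=6.0):
    """Stand-in detector (not a VAE): flag cells far from the column median,
    measured in MAD units, then impute flagged cells with that median."""
    med = np.median(x, axis=0)
    mad = np.median(np.abs(x - med), axis=0) + 1e-12  # avoid division by zero
    flagged = np.abs(x - med) / mad > threshold
    return np.where(flagged, med, x), flagged


rng = np.random.default_rng(0)
clean = rng.normal(size=(500, 3))  # hypothetical clean dataset (inliers)

noisy_rand, mask_rand = corrupt_random(clean, 0.1, rng)
noisy_sys, mask_sys = corrupt_systematic(clean, 0.1, rng)

repaired_rand, flagged_rand = detect_and_repair(noisy_rand)
repaired_sys, flagged_sys = detect_and_repair(noisy_sys)

# Repair quality: mean absolute distance to the underlying inlier values.
err_before = np.abs(noisy_rand - clean).mean()
err_after = np.abs(repaired_rand - clean).mean()
print("random errors, mean abs error before/after repair:", err_before, err_after)
print("systematic errors, corrupted cells flagged:", flagged_sys[mask_sys].mean())
```

The cell-level mask returned by each corruption function plays the role of the ground truth against which cell outlier detection would be evaluated, and the distance to `clean` plays the role of the repair metric. A fixed per-column rule like this one has no notion of dependencies between features, which is one reason a generative model is of interest for this task.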