Edinburgh Research Archive

Towards analytics over dirty databases

Item Status

RESTRICTED ACCESS

Embargo End Date

2026-09-12

Authors

Perini, Massimo

Abstract

In today's data-driven world, organizations increasingly rely on large and complex datasets to drive decision-making, build predictive models, and optimize operations. From e-commerce companies leveraging customer behavior data to improve marketing strategies, to financial institutions analyzing transactions for fraud detection, the demand for efficient, scalable, and seamless data processing is critical. However, real-world data is often incomplete, inconsistent, or erroneous, which undermines the accuracy and reliability of the insights drawn from it. Traditional workflows move data out of database systems into external analytical tools for machine learning, data cleaning, and query answering, introducing significant inefficiencies and bottlenecks. This thesis presents integrated solutions that perform these operations directly within the database, addressing three key problems in modern database systems: (1) inefficiency in executing machine learning tasks, (2) inability to handle missing data effectively, and (3) limitations in querying inconsistent databases.

First, we address the challenge of efficiently training machine learning models over relational data. Most data resides in databases, yet current machine learning systems typically require exporting it into external tools, leading to excessive data transfer and redundant computation. To overcome these inefficiencies, we propose an in-database machine learning library implemented on PostgreSQL and DuckDB. Our approach rewrites popular machine learning algorithms to run directly within the database, exploiting the relational structure of the data. By training models over aggregate values computed from the normalized tables, we eliminate the need for expensive joins and preprocessing, achieving 10- to 100-fold faster model training than state-of-the-art solutions such as MADlib. In addition, our library can construct multiple models from the same set of aggregate computations, further reducing the cost of learning.
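To make the aggregate-based idea concrete, here is a minimal sketch against a hypothetical single-table schema (an illustration only, not the thesis's library, which additionally pushes such aggregates through normalized tables so the join never has to be materialized). A linear model needs only the sufficient statistics X^T X and X^T y, and every entry of those matrices is a plain SQL SUM:

```python
# Minimal sketch: train a linear regression from in-database aggregates.
# Hypothetical schema; illustrates the idea, not the thesis's actual library.
import duckdb
import numpy as np

con = duckdb.connect()
con.execute("""
    CREATE TABLE samples AS
    SELECT x1, x2, 2*x1 - 3*x2 + 1 AS y
    FROM (SELECT random() AS x1, random() AS x2 FROM range(100000))
""")

# A single scan computes every degree-2 aggregate the model needs.
n, s1, s2, s11, s12, s22, sy, s1y, s2y = con.execute("""
    SELECT count(*)::DOUBLE, sum(x1), sum(x2),
           sum(x1*x1), sum(x1*x2), sum(x2*x2),
           sum(y), sum(x1*y), sum(x2*y)
    FROM samples
""").fetchone()

# Solve the normal equations (X^T X) w = X^T y outside the database;
# only these nine scalars ever leave it.
XtX = np.array([[n,  s1,  s2 ],
                [s1, s11, s12],
                [s2, s12, s22]])
Xty = np.array([sy, s1y, s2y])
print(np.linalg.solve(XtX, Xty))  # ~ [1.0, 2.0, -3.0]: intercept, w1, w2
```

Note that X^T X does not depend on the label, so models with different targets or feature subsets can be fit from one pass of aggregates, which is the kind of reuse that lets multiple models share the same aggregate computations.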
Second, missing data is a pervasive issue in real-world datasets, and gaps must typically be imputed before analysis or model training can proceed. External imputation tools, such as those implementing Multiple Imputation by Chained Equations (MICE), require exporting and preprocessing the data, adding complexity to the workflow. We introduce an in-database imputation framework that integrates MICE directly into database systems and lets it operate over normalized data. By re-engineering the MICE algorithm to share computations across iterations and to provide fast access to frequently used data, we significantly reduce runtime. Our solution, implemented in both PostgreSQL and DuckDB, outperforms traditional methods, offering a more efficient and scalable way to handle missing data without leaving the database environment.
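For intuition, the following is a minimal NumPy sketch of the chained-equations loop itself (the function name and the all-linear-models simplification are ours; the thesis's contribution is re-engineering this loop to run inside PostgreSQL and DuckDB over normalized data, sharing work that this sketch recomputes from scratch every iteration):

```python
# Minimal sketch of the MICE loop: impute each incomplete column by
# regressing it on the others, and cycle until the values settle.
import numpy as np

def mice_impute(X, n_iters=10):
    X = X.copy()
    missing = np.isnan(X)                      # remember the original gaps
    col_means = np.nanmean(X, axis=0)          # crude initial fill
    for j in range(X.shape[1]):
        X[missing[:, j], j] = col_means[j]
    for _ in range(n_iters):
        for j in range(X.shape[1]):            # one "chained equation" per column
            rows = missing[:, j]
            if not rows.any():
                continue
            others = np.delete(np.arange(X.shape[1]), j)
            # Fit column j on the other columns using the observed rows...
            A = np.c_[np.ones((~rows).sum()), X[~rows][:, others]]
            w, *_ = np.linalg.lstsq(A, X[~rows, j], rcond=None)
            # ...and refresh the missing entries with the model's predictions.
            B = np.c_[np.ones(rows.sum()), X[rows][:, others]]
            X[rows, j] = B @ w
    return X

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3))
data[:, 2] = data[:, 0] + 2 * data[:, 1]       # a learnable relationship
data[5, 2] = np.nan
print(mice_impute(data)[5, 2], "vs", data[5, 0] + 2 * data[5, 1])
```

Each regression inside the loop reduces to the same kind of aggregate computation as in the sketch above, which suggests how work can be shared across iterations when the loop runs inside the database.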
Third, we tackle the problem of query answering over inconsistent databases, a critical challenge for organizations that need reliable query results despite data inconsistencies. Conventional approaches clean the data to restore consistency, but this can be impractical in real-time environments or when altering the original data is not an option. To address this, we develop a method based on Consistent Query Answering (CQA), which evaluates queries directly over the inconsistent data. We model "minimal repairs", the smallest changes that restore consistency, as a logical formula and use model counting techniques to determine the number of possible repairs. By reducing the size of this formula, we achieve up to a 1000-fold reduction in computation. To efficiently compute the number of repairs supporting each query answer, we introduce two Monte Carlo approximation algorithms that leverage the compiled formula. These algorithms provide theoretical guarantees on approximation accuracy while remaining efficient in practice, enabling CQA over large datasets with multiple functional dependency violations.
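As a toy illustration of what is being counted, consider a single functional dependency key -> value over one relation. In this special case the repairs factorize by key group, so exact counting is trivial and a uniform repair sampler is one line; the general setting of the thesis (multiple dependencies and overlapping conflicts) is exactly where this shortcut fails and the compiled logical formula is needed:

```python
# Toy repair counting for one functional dependency: key -> value.
import random
from collections import defaultdict
from math import prod

tuples = [("a", 1), ("a", 2), ("b", 5), ("b", 5), ("c", 7), ("c", 8), ("c", 9)]

# Within a key group, a repair keeps exactly the tuples that agree on
# one of the observed values, so the choices are independent per group.
groups = defaultdict(set)
for k, v in tuples:
    groups[k].add(v)
print("repairs:", prod(len(vs) for vs in groups.values()))   # 2 * 1 * 3 = 6

# Monte Carlo estimate of how many repairs contain the tuple ("a", 1):
# sample repairs by picking one surviving value per key group.
hits, trials = 0, 100_000
for _ in range(trials):
    repair = {k: random.choice(sorted(vs)) for k, vs in groups.items()}
    hits += repair["a"] == 1
print("support of (a, 1):", hits / trials)                   # ~ 0.5
```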
In conclusion, this thesis presents a comprehensive set of in-database solutions that overcome inefficiencies in machine learning, data imputation, and query evaluation, particularly in the presence of incomplete or inconsistent data. By integrating these tasks directly into modern relational databases, our approach not only streamlines workflows but also significantly improves performance. Our contributions include a high-performance machine learning library, a scalable imputation technique for handling missing data, and a robust framework for consistent query answering over erroneous databases. Collectively, these innovations represent a significant advance toward more efficient, reliable, and scalable data management.

This item appears in the following Collection(s)