Edinburgh Research Archive

Towards analytics over dirty databases

dc.contributor.advisor
Nikolic, Milos
dc.contributor.advisor
Pieris, Andreas
dc.contributor.author
Perini, Massimo
dc.contributor.sponsor
Huawei
en
dc.date.accessioned
2025-09-12T13:43:24Z
dc.date.available
2025-09-12T13:43:24Z
dc.date.issued
2025-09-12
dc.description.abstract
In today's data-driven world, organizations increasingly rely on large and complex datasets to drive decision-making, build predictive models, and optimize operations. From e-commerce companies leveraging customer behavior data to improve marketing strategies, to financial institutions analyzing transactions for fraud detection, the demand for efficient, scalable, and seamless data processing is critical. However, real-world data is often incomplete, inconsistent, or erroneous, which undermines the accuracy and reliability of the insights drawn from it. Traditional workflows require moving data out of database systems into external analytical tools for machine learning, data cleaning, and query answering tasks, introducing significant inefficiencies and creating bottlenecks. This thesis presents integrated solutions that enable these operations to be performed directly within the database, addressing three key problems in modern database systems: (1) inefficiency in executing machine learning tasks, (2) inability to handle missing data effectively, and (3) limitations in querying inconsistent databases.

First, we address the challenge of efficiently training machine learning models over relational data. Most data resides in databases, yet current machine learning systems typically require exporting data into external tools, leading to excessive data transfer and redundant computations. To overcome these inefficiencies, we propose an in-database machine learning library implemented on PostgreSQL and DuckDB. Our approach rewrites popular machine learning algorithms to run directly within the database, leveraging the relational data structure. By training models over aggregate values computed from normalized tables, we eliminate the need for expensive joins and preprocessing, achieving 10- to 100-fold faster model training than state-of-the-art solutions such as MADlib.
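The core idea, training over aggregates rather than raw rows, can be illustrated with a minimal sketch: a one-feature linear regression is fit entirely from SQL aggregates computed in a single scan, so the tuples themselves never leave the database. This is not the thesis's library; sqlite3 stands in for PostgreSQL/DuckDB, and the table and column names are illustrative.

```python
# Sketch: fit y ~ x from SQL aggregates (sufficient statistics) alone.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (x REAL, y REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)])

# One scan computes every statistic the normal equations need.
n, sx, sy, sxx, sxy = con.execute(
    "SELECT COUNT(*), SUM(x), SUM(y), SUM(x*x), SUM(x*y) FROM sales"
).fetchone()

# Closed-form least-squares solution from the aggregates, no row export.
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
```

Because several models share the same sufficient statistics, the aggregates computed here could be reused to fit further models without rescanning the data, which mirrors the reuse the abstract describes.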
Additionally, our library allows multiple models to be constructed using the same set of aggregate computations, further optimizing the learning process.

Second, missing data is a pervasive issue in real-world datasets, often necessitating the use of imputation techniques to fill in gaps before analysis or model training can proceed. External tools for imputation, such as those implementing the Multiple Imputation by Chained Equations (MICE) method, typically require data export and preprocessing, adding complexity to the workflow. We introduce an in-database imputation framework that integrates MICE directly into database systems, allowing it to operate over normalized data. By re-engineering the MICE algorithm to share computations across iterations and to optimize for fast access to frequently used data, we significantly reduce runtime. Our solution, implemented in both PostgreSQL and DuckDB, outperforms traditional methods, providing a more efficient and scalable way to handle missing data without leaving the database environment.

Third, we tackle the problem of query answering over inconsistent databases, a critical challenge for organizations that rely on accurate query results despite data inconsistencies. Conventional methods often involve data cleaning to restore consistency, but this can be impractical in real-time environments or where altering the original data is not feasible. To address this, we develop a method based on Consistent Query Answering (CQA), which allows queries to be evaluated directly over inconsistent data. We model the concept of "minimal repairs", the smallest changes that restore consistency, as a logical formula and use model counting techniques to determine the number of possible repairs. Furthermore, by optimizing the size of the logical formula, we achieve up to a 1000-fold reduction in computational complexity.
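A minimal sketch of the chained-equations idea behind MICE (not the thesis's in-database implementation): each incomplete column is initialised with its mean and then repeatedly re-imputed from a regression on the other columns until the imputed values stabilise. The two-column data and the plain least-squares update are illustrative assumptions.

```python
# Sketch: round-robin regression imputation in the spirit of MICE.
rows = [[1.0, 2.0], [2.0, None], [3.0, 6.1], [4.0, None], [5.0, 10.2]]

def mean(vals):
    return sum(vals) / len(vals)

# Record the missing cells, then initialise them with the column mean.
miss = [(i, j) for i, r in enumerate(rows) for j, v in enumerate(r) if v is None]
for j in range(2):
    m = mean([r[j] for r in rows if r[j] is not None])
    for r in rows:
        if r[j] is None:
            r[j] = m

# Chained equations: re-impute each missing cell from a linear regression
# of its column on the other column, iterating until values settle.
for _ in range(20):
    for i, j in miss:
        k = 1 - j  # the predictor column
        obs = [r for r2, r in enumerate(rows) if (r2, j) not in miss]
        xs, ys = [r[k] for r in obs], [r[j] for r in obs]
        n = len(obs)
        sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
        sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
        slope = sxy / sxx
        intercept = mean(ys) - slope * mean(xs)
        rows[i][j] = intercept + slope * rows[i][k]
```

In the in-database setting described above, each regression step would itself be driven by SQL aggregates over the normalized tables, which is what makes sharing computation across MICE iterations possible.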
To efficiently compute the number of repairs supporting each query answer, we introduce two Monte Carlo approximation algorithms that leverage the compiled logical formula. These algorithms provide theoretical guarantees for approximation accuracy while maintaining practical efficiency, enabling the execution of CQA over large datasets with multiple functional dependency violations.

In conclusion, this thesis presents a comprehensive set of in-database solutions designed to overcome inefficiencies in machine learning, data imputation, and query evaluation processes, particularly in the presence of incomplete or inconsistent data. By integrating these tasks directly into modern relational databases, our approach not only streamlines workflows but also significantly improves performance. Our contributions include a high-performance machine learning library, a scalable imputation technique for handling missing data, and a robust framework for consistent query answering over erroneous databases. Collectively, these innovations represent a significant advancement toward more efficient, reliable, and scalable data management solutions.
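The Monte Carlo flavour of repair sampling can be sketched for the simplest case, key violations (a hedged toy, not the thesis's compiled-formula algorithms): tuples sharing a key form a conflict group, a repair keeps exactly one tuple per group, and the support of a candidate answer is estimated as the fraction of sampled repairs in which it holds.

```python
# Sketch: estimate how many minimal repairs support a candidate answer.
import random
from collections import defaultdict

# (id, city): id is a key, so tuples sharing an id are in conflict.
tuples = [(1, "Paris"), (1, "London"), (2, "Rome"), (2, "Paris")]

groups = defaultdict(list)
for t in tuples:
    groups[t[0]].append(t)

def holds(repair):
    # Candidate answer for a query over city: does 'Paris' appear?
    return any(city == "Paris" for _, city in repair)

random.seed(0)
trials = 10_000
# Each sample picks one tuple per conflict group uniformly at random,
# i.e. draws a minimal repair uniformly from the repair space.
hits = sum(
    holds([random.choice(g) for g in groups.values()])
    for _ in range(trials)
)
support = hits / trials  # true value here is 3/4 of all repairs
```

With two two-way conflicts there are four repairs, and "Paris" survives in three of them, so the estimate concentrates around 0.75; a consistent (certain) answer would have support 1 in every sampled repair.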
en
dc.identifier.uri
https://hdl.handle.net/1842/43965
dc.identifier.uri
http://dx.doi.org/10.7488/era/6495
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
In-Database Data Imputation Perini, M. & Nikolic, M., 26 Mar 2024, In: Proceedings of the ACM on Management of Data. 2, 1, p. 1-27 27 p., 70.
en
dc.rights.embargodate
2026-09-12
en
dc.subject
In-database solutions
en
dc.subject
Machine learning
en
dc.subject
Data imputation
en
dc.subject
Query evaluation
en
dc.subject
Dirty databases
en
dc.title
Towards analytics over dirty databases
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en
dcterms.accessRights
RESTRICTED ACCESS
en

Files

Original bundle

Name:
PeriniM_2025.pdf
Size:
2.12 MB
Format:
Adobe Portable Document Format
Description:
Cover sheet
