On the volume and veracity of big and shared data
Item status: Restricted Access
Embargo end date: 22/06/2023
Big data is not new to us. Much effort has been devoted to efficient and parallel query processing of big data. Nevertheless, there are still some missing pieces in the technical stack. Existing systems typically assume that there are sufficient resources for query answering, hosted either by clusters deployed on premises or by data warehouses in the cloud. They further assume that data are always correct and ready to be queried. However, none of these assumptions always holds in practice. This motivates my PhD study, which tackles some technical issues regarding the volume and the veracity of querying big data. The first contribution is a resource bounded approximation scheme (RBAS). We develop a new approximate query processing framework that provides both deterministic and probabilistic accuracy guarantees for generic queries, with the cost of query evaluation in RBAS bounded by a dynamic resource ratio that the user issues online. Using real-life and synthetic datasets, we show that with only 2% of resources, RBAS improves accuracy over the state of the art by up to 5 times, and is 3 orders of magnitude more efficient than exact query answering. The second contribution is a framework and techniques for querying distributed shared data in a heterogeneous security setting, in which each pair of data owners decides its own protocol for sharing data, with diverse levels of trust. We define query plans by incorporating toll functions determined by data sharing agreements, and formalize query answering as a bi-criteria optimization problem that minimizes both the data sharing toll and the parallel query evaluation cost. We give both a complexity analysis and a set of approximation algorithms to generate efficient distributed query plans.
Experimental studies show that our techniques generate efficient query plans under security heterogeneity and outperform their competitors by 10.27 times in efficiency, and that our proposed optimization techniques speed up parallel query evaluation by 3.26 times. Finally, the third contribution is coping with inconsistencies in data. We adopt a class of rules for entity enhancing (REEs) that embed machine learning predicates, unify entity resolution and conflict resolution, and are collectively defined across multiple relations. We study two related problems using REEs: discrepancy detection and entity enhancing (discrepancy fixing). Although both problems are intractable, we develop parallel algorithms and parallel incremental algorithms for them, both parallelly scalable. The experimental studies show that REEs improve the accuracy of discrepancy detection and entity enhancing by up to 40% and 57%, respectively, over their competitors, and that our proposed algorithms scale well on a multi-site cluster.