dc.contributor.advisor | Fan, Wenfei | |
dc.contributor.advisor | Libkin, Leonid | |
dc.contributor.author | Wang, Yanghao | |
dc.date.accessioned | 2022-06-22T14:05:36Z | |
dc.date.available | 2022-06-22T14:05:36Z | |
dc.date.issued | 2022-06-22 | |
dc.identifier.uri | https://hdl.handle.net/1842/39176 | |
dc.identifier.uri | http://dx.doi.org/10.7488/era/2427 | |
dc.description.abstract | Big data is not new to us. Many efforts are devoted to efficient and parallel query
processing of big data. Nevertheless there are still some missing pieces in the technical
stack. Existing systems typically assume that there are sufficient resources for query
answering, hosted by either on-premise deployed clusters or data warehouses on the
cloud. They further assume that data are always correct and ready to be queried. However, none
of these assumptions always holds in practice. This motivates my PhD study to
tackle some technical issues with regard to the volume and the veracity of querying big
data.
The first contribution is a resource bounded approximation scheme (RBAS). We
develop a new approximate query processing framework that provides both deterministic and probabilistic accuracy guarantees for generic queries, with the cost of query
evaluation in RBAS bounded by a dynamic resource ratio that the user issues online. Using real-life and synthetic datasets, we show that with only 2% of resources, RBAS
improves accuracy over the state of the art by up to 5 times, and is 3 orders of magnitude more efficient than exact query answering.
The second contribution is the framework and techniques for querying distributed
shared data in a heterogeneous security setting, under which each pair of data owners
decide their own protocol to share data with diverse levels of trust. We define query
plans by incorporating toll functions determined by data sharing agreements, and formalize query answering as a bi-criteria optimization problem, to minimize both data
sharing toll and parallel query evaluation cost. We give both a complexity analysis and
a set of approximation algorithms to generate efficient distributed query plans.
Experimental studies show that our techniques generate efficient query plans under security heterogeneity and outperform their competitors by 10.27 times in efficiency, and that our
proposed optimization techniques speed up parallel query evaluation by 3.26 times.
Finally, the third contribution is coping with inconsistencies in data. We adopt
a class of rules for entity enhancing (REE) that embed machine learning predicates,
unify entity resolution and conflict resolution, and are collectively defined across multiple relations. We study two related problems using REE: discrepancy detection and
entity enhancing (discrepancy fixing). Although both problems are intractable, we
develop parallel algorithms and parallel incremental algorithms, both with parallel
scalability. The experimental studies show that REEs improve the accuracy of discrepancy detection and entity enhancing by up to 40% and 57%, respectively, over their competitors, and that
our proposed algorithms scale well on a multi-site cluster. | en |
dc.contributor.sponsor | Engineering and Physical Sciences Research Council (EPSRC) | en |
dc.language.iso | en | en |
dc.publisher | The University of Edinburgh | en |
dc.relation.hasversion | [CFWY20] Yang Cao, Wenfei Fan, Yanghao Wang, and Ke Yi. Querying shared data with security heterogeneity. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pages 575–585, 202 | en |
dc.relation.hasversion | [FTWY21] Wenfei Fan, Chao Tian, Yanghao Wang, and Qiang Yin. Parallel discrepancy detection and incremental detection. Proc. VLDB Endow., 14(8):1351–1364, 2021. | en |
dc.relation.hasversion | Yang Cao, Wenfei Fan, Yanghao Wang, Tengfei Yuan, Yanchao Li, and Laura Yu Chen. BEAS: Bounded evaluation of SQL queries. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1667–1670. ACM, 2017. | en |
dc.subject | big data | en |
dc.subject | querying big data | en |
dc.subject | optimization techniques | en |
dc.subject | resource bounded approximation scheme | en |
dc.subject | toll functions | en |
dc.subject | erroneous data | en |
dc.subject | big data analytics | en |
dc.subject | RBAS | en |
dc.subject | rules for entity enhancing | en |
dc.subject | machine learning | en |
dc.subject | discrepancy detection | en |
dc.subject | entity enhancing | en |
dc.subject | scalable algorithms | en |
dc.title | On the volume and veracity of big and shared data | en |
dc.type | Thesis or Dissertation | en |
dc.type.qualificationlevel | Doctoral | en |
dc.type.qualificationname | PhD Doctor of Philosophy | en |
dc.rights.embargodate | 2023-06-22 | en |
dcterms.accessRights | Restricted Access | en |