Show simple item record

dc.contributor.advisor: Fan, Wenfei
dc.contributor.advisor: Libkin, Leonid
dc.contributor.author: Wang, Yanghao
dc.date.accessioned: 2022-06-22T14:05:36Z
dc.date.available: 2022-06-22T14:05:36Z
dc.date.issued: 2022-06-22
dc.identifier.uri: https://hdl.handle.net/1842/39176
dc.identifier.uri: http://dx.doi.org/10.7488/era/2427
dc.description.abstract: Big data is not new to us. Many efforts are devoted to efficient and parallel query processing of big data. Nevertheless, there are still some missing pieces in the technical stack. Existing systems typically assume that there are sufficient resources for query answering, hosted either on on-premise clusters or in data warehouses on the cloud. They further assume that data are always correct and ready to be queried. However, none of these assumptions always holds in practice. This motivates my PhD study to tackle some technical issues with regard to the volume and the veracity of querying big data. The first contribution is a resource bounded approximation scheme (RBAS). We develop a new approximate query processing framework that provides both deterministic and probabilistic accuracy guarantees for generic queries, with the cost of query evaluation in RBAS bounded by a dynamic resource ratio that the user issues online. Using real-life and synthetic datasets, we show that using only 2% of resources, RBAS improves the accuracy over the state of the art by up to 5 times, and is 3 orders of magnitude more efficient than exact query answering. The second contribution is a framework and techniques for querying distributed shared data in a heterogeneous security setting, under which each pair of data owners decides its own protocol to share data with diverse levels of trust. We define query plans by incorporating toll functions determined by data sharing agreements, and formalize query answering as a bi-criteria optimization problem, to minimize both data sharing toll and parallel query evaluation cost. We give both a complexity analysis and a set of approximation algorithms to generate efficient distributed query plans. Experimental studies show that our techniques generate efficient query plans under security heterogeneity and outperform their competitors by 10.27 times in efficiency, and that our proposed optimization techniques speed up parallel query evaluation by 3.26 times. Finally, the third contribution is coping with inconsistencies in data. We adopt a class of rules for entity enhancing (REEs) that embed machine learning predicates, unify entity resolution and conflict resolution, and are collectively defined across multiple relations. We study two related problems using REEs: discrepancy detection and entity enhancing (discrepancy fixing). Although both problems are intractable, we develop parallel algorithms and parallel incremental algorithms, both with parallel scalability. The experimental studies show that REEs improve the accuracy of discrepancy detection and entity enhancing by up to 40% and 57%, respectively, over their competitors, and that our proposed algorithms scale well on a multi-site cluster. [en]
dc.contributor.sponsor: Engineering and Physical Sciences Research Council (EPSRC) [en]
dc.language.iso: en [en]
dc.publisher: The University of Edinburgh [en]
dc.relation.hasversion: [CFWY20] Yang Cao, Wenfei Fan, Yanghao Wang, and Ke Yi. Querying shared data with security heterogeneity. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pages 575–585, 202 [en]
dc.relation.hasversion: [FTWY21] Wenfei Fan, Chao Tian, Yanghao Wang, and Qiang Yin. Parallel discrepancy detection and incremental detection. Proc. VLDB Endow., 14(8):1351–1364, 2021. [en]
dc.relation.hasversion: Yang Cao, Wenfei Fan, Yanghao Wang, Tengfei Yuan, Yanchao Li, and Laura Yu Chen. BEAS: bounded evaluation of SQL queries. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1667–1670. ACM, 2017. [en]
dc.subject: big data [en]
dc.subject: querying big data [en]
dc.subject: optimization techniques [en]
dc.subject: resource bounded approximation scheme [en]
dc.subject: toll functions [en]
dc.subject: erroneous data [en]
dc.subject: big data analytics [en]
dc.subject: RBAS [en]
dc.subject: rules for entity enhancing [en]
dc.subject: machine learning [en]
dc.subject: discrepancy detection [en]
dc.subject: entity enhancing [en]
dc.subject: scalable algorithms [en]
dc.title: On the volume and veracity of big and shared data [en]
dc.type: Thesis or Dissertation [en]
dc.type.qualificationlevel: Doctoral [en]
dc.type.qualificationname: PhD Doctor of Philosophy [en]
dc.rights.embargodate: 2023-06-22 [en]
dcterms.accessRights: Restricted Access [en]

