Edinburgh Research Archive


On the volume and veracity of big and shared data

View/Open
Wang2022.pdf (2.766Mb)
Date
22/06/2022
Item status
Restricted Access
Embargo end date
22/06/2023
Author
Wang, Yanghao
Abstract
Big data is not new to us. Many efforts have been devoted to efficient and parallel query processing of big data. Nevertheless, there are still some missing pieces in the technical stack. Existing systems typically assume that there are sufficient resources for query answering, hosted either by on-premise clusters or by data warehouses on the cloud. They further assume that data are always correct and ready to be queried. However, neither assumption always holds in practice. This motivates my PhD study to tackle some technical issues with regard to the volume and the veracity of querying big data. The first contribution is a resource bounded approximation scheme (RBAS). We develop a new approximate query processing framework that provides both deterministic and probabilistic accuracy guarantees for generic queries, with the cost of query evaluation in RBAS bounded by a resource ratio that the user specifies dynamically online. Using real-life and synthetic datasets, we show that using only 2% of resources, RBAS improves accuracy over the state of the art by up to 5 times, and is 3 orders of magnitude more efficient than exact query answering. The second contribution is a framework and a set of techniques for querying distributed shared data in a heterogeneous security setting, under which each pair of data owners decides its own protocol for sharing data, with diverse levels of trust. We define query plans by incorporating toll functions determined by data sharing agreements, and formalize query answering as a bi-criteria optimization problem that minimizes both the data sharing toll and the parallel query evaluation cost. We give a complexity analysis as well as a set of approximation algorithms to generate efficient distributed query plans. Experimental studies show that our techniques generate efficient query plans under security heterogeneity and outperform their competitors by 10.27 times in efficiency, and that our proposed optimization techniques speed up parallel query evaluation by 3.26 times. Finally, the third contribution is coping with inconsistencies in data. We adopt a class of rules for entity enhancing (REEs) that embed machine learning predicates, unify entity resolution and conflict resolution, and are collectively defined across multiple relations. We study two related problems using REEs: discrepancy detection and entity enhancing (discrepancy fixing). Although both problems are intractable, we develop parallel algorithms and parallel incremental algorithms, both with parallel scalability. The experimental studies show that REEs improve the accuracy of discrepancy detection and entity enhancing by up to 40% and 57% over their competitors, and that our proposed algorithms scale well on a multi-site cluster.
URI
https://hdl.handle.net/1842/39176

http://dx.doi.org/10.7488/era/2427
Collections
  • Informatics thesis and dissertation collection

Library & University Collections Home | University of Edinburgh Information Services Home
Privacy & Cookies | Takedown Policy | Accessibility | Contact