Tackling the veracity and variety of big data
dc.contributor.advisor
Fan, Wenfei
dc.contributor.advisor
Libkin, Leonid
dc.contributor.author
Jin, Ruochun
dc.date.accessioned
2022-06-21T15:46:12Z
dc.date.available
2022-06-21T15:46:12Z
dc.date.issued
2022-06-21
dc.description.abstract
This thesis tackles the veracity and variety challenges of big data, especially focusing
on graphs and relational data. We start with proposing a class of graph association
rules (GARs) to specify regularities between entities in graphs, which capture both
missing links and inconsistencies. A GAR is a combination of a graph pattern and a
dependency; it may take as predicates machine learning classifiers for link prediction.
We formalize association deduction with GARs in terms of the chase, and prove its
Church-Rosser property. We show that the satisfiability, implication and association
deduction problems for GARs are coNP-complete, NP-complete and NP-complete, respectively.
The incremental deduction problem is DP-complete for GARs. In addition,
we provide parallel algorithms for association deduction and incremental deduction.
We next develop a parallel algorithm to discover GARs, which applies an applicationdriven
strategy to cut back rules and data that are irrelevant to users’ interest, by training
a machine learning model to identify data pertaining to a given application. Moreover,
we introduce a sampling method to reduce a big graph G to a set H of small
sample graphs. Given expected support and recall bounds, this method is able to deduce
samples in H and mine rules from H to satisfy the bounds in the entire G.
Then we propose a class of temporal association rules (TACOs) for event prediction
in temporal graphs. TACOs are defined on temporal graphs in terms of change patterns
and (temporal) conditions, and may carry machine learning predicates for temporal
event prediction. We settle the complexity of reasoning about TACOs, including their
satisfiability, implication and prediction problems. We develop a system that discovers
TACOs by iteratively training a rule creator based on generative models in a creatorcritic
framework, and predicts events by applying the discovered TACOs in parallel.
Finally, we propose an approach to querying relations D and graphs G taken together
in SQL. The key idea is that if a tuple t in D and a vertex v in G are determined
to refer to the same real-world entity, then we join t and v, correlate their information
and complement tuple t with additional attributes of v from graphs. We show how to
do this in SQL extended with only syntactic sugar, for both static joins when t is a tuple
in D and dynamic joins when t comes from intermediate results of sub-queries on D.
To support the semantic joins, we propose an attribute extraction scheme based on Kmeans
clustering, to identify and fetch graph properties that are linked to v via paths.
Moreover, we develop a scheme to extract a relation schema for entities in graphs, and
a heuristic join method based on the schema to strike a balance between the complexity
and accuracy of dynamic joins.
en
dc.identifier.uri
https://hdl.handle.net/1842/39161
dc.identifier.uri
http://dx.doi.org/10.7488/era/2412
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.title
Tackling the veracity and variety of big data
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en
Files
Original bundle
1 - 1 of 1
- Name:
- JinR_2022.pdf
- Size:
- 2.1 MB
- Format:
- Adobe Portable Document Format
- Description:
This item appears in the following Collection(s)

