Tackling the veracity and variety of big data
Abstract
This thesis tackles the veracity and variety challenges of big data, especially focusing
on graphs and relational data. We start with proposing a class of graph association
rules (GARs) to specify regularities between entities in graphs, which capture both
missing links and inconsistencies. A GAR is a combination of a graph pattern and a
dependency; it may take as predicates machine learning classifiers for link prediction.
We formalize association deduction with GARs in terms of the chase, and prove its
Church-Rosser property. We show that the satisfiability, implication and association
deduction problems for GARs are coNP-complete, NP-complete and NP-complete, respectively.
The incremental deduction problem is DP-complete for GARs. In addition,
we provide parallel algorithms for association deduction and incremental deduction.
We next develop a parallel algorithm to discover GARs, which applies an applicationdriven
strategy to cut back rules and data that are irrelevant to users’ interest, by training
a machine learning model to identify data pertaining to a given application. Moreover,
we introduce a sampling method to reduce a big graph G to a set H of small
sample graphs. Given expected support and recall bounds, this method is able to deduce
samples in H and mine rules from H to satisfy the bounds in the entire G.
Then we propose a class of temporal association rules (TACOs) for event prediction
in temporal graphs. TACOs are defined on temporal graphs in terms of change patterns
and (temporal) conditions, and may carry machine learning predicates for temporal
event prediction. We settle the complexity of reasoning about TACOs, including their
satisfiability, implication and prediction problems. We develop a system that discovers
TACOs by iteratively training a rule creator based on generative models in a creatorcritic
framework, and predicts events by applying the discovered TACOs in parallel.
Finally, we propose an approach to querying relations D and graphs G taken together
in SQL. The key idea is that if a tuple t in D and a vertex v in G are determined
to refer to the same real-world entity, then we join t and v, correlate their information
and complement tuple t with additional attributes of v from graphs. We show how to
do this in SQL extended with only syntactic sugar, for both static joins when t is a tuple
in D and dynamic joins when t comes from intermediate results of sub-queries on D.
To support the semantic joins, we propose an attribute extraction scheme based on Kmeans
clustering, to identify and fetch graph properties that are linked to v via paths.
Moreover, we develop a scheme to extract a relation schema for entities in graphs, and
a heuristic join method based on the schema to strike a balance between the complexity
and accuracy of dynamic joins.