Learning, deducing and linking entities
Authors
Tugay, Resul
Abstract
Improving the quality of data is a critical issue in data management and machine learning, and finding the most representative and concise way to achieve this is a key challenge. Learning how to represent entities accurately is essential for various tasks in data
science, such as generating better recommendations and more accurate question answering. Thus, the amount and quality of information available on an entity can greatly
impact the quality of results of downstream tasks. This thesis focuses on two specific
areas to improve data quality: (i) learning and deducing entities for data currency (i.e.,
how up-to-date information is), and (ii) linking entities across different data sources.
The first technical contribution is GATE (Get the lATEst), a framework that combines deep learning and rule-based methods to find up-to-date information about an entity.
GATE learns and deduces temporal orders on attribute values in a set of tuples that
pertain to the same entity. It is based on a creator-critic framework: the creator trains
a neural ranking model to learn temporal orders and rank attribute values based on
correlations among the attributes. The critic then validates the temporal orders learned
and deduces more ranked pairs by chasing the data with currency constraints; it also
provides augmented training data as feedback for the creator to improve the ranking
in the next round. The process proceeds until the temporal order obtained becomes
stable.
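The creator-critic loop described above can be illustrated with a small sketch. It is only a toy under stated assumptions, not GATE itself: the neural ranking model is replaced by a fixed dictionary of scores, currency constraints are reduced to a set of known (older, newer) pairs, and all function names (`creator`, `critic`, `gate_sketch`, `latest_values`) are hypothetical.

```python
from itertools import permutations


def transitive_closure(pairs):
    """Deduce every (older, newer) pair implied by transitivity."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and a != d and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def creator(values, scores, known):
    """Toy stand-in for the neural ranker (assumption): keep the pairs fed
    back by the critic and propose the rest by comparing scores."""
    proposed = set(known)
    for a, b in permutations(values, 2):
        if (a, b) in known or (b, a) in known:
            continue  # already decided by earlier feedback
        if scores[a] < scores[b]:
            proposed.add((a, b))  # a is ranked as older than b
    return proposed


def critic(proposed, constraints):
    """Validate proposals against the constraints and deduce more pairs.
    A proposal is rejected when the order deduced so far already states
    the opposite direction."""
    order = transitive_closure(constraints)
    for a, b in sorted(proposed):
        if (b, a) not in order:
            order = transitive_closure(order | {(a, b)})
    return order


def gate_sketch(values, scores, constraints, max_rounds=10):
    """Alternate creator and critic until the temporal order is stable."""
    known = set()
    for _ in range(max_rounds):
        validated = critic(creator(values, scores, known), constraints)
        if validated == known:
            break  # no change since the previous round
        known = validated
    return known


def latest_values(values, order):
    """Values that are never the older element of any pair."""
    return [v for v in values if not any((v, w) in order for w in values)]


# Hypothetical data: marital-status values of one entity.
values = ["single", "married", "divorced"]
scores = {"single": 0.1, "married": 0.7, "divorced": 0.4}  # made-up ranker output
constraints = {("married", "divorced")}  # assumed rule: married precedes divorced
order = gate_sketch(values, scores, constraints)
```

In this toy run the scores alone would rank "married" as the latest value; the critic overrides that pair using the constraint, and the loop stops once a further round yields no new pairs.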
The second technical contribution is HER (Heterogeneous Entity Resolution), a
framework that consists of a set of methods to link entities across relations and graphs.
We propose a new notion, parametric simulation, to link entities across a relational
database D and a graph G. Taking functions and thresholds for measuring vertex
closeness, path associations and important properties as parameters, parametric simulation identifies tuples t in D and vertices v in G that refer to the same real-world entity,
based on topological and semantic matching. We develop machine learning methods
to learn the parameter functions and thresholds.
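To make the idea of matching a tuple against a vertex more concrete, the sketch below shows a heavily simplified, one-hop check. It is not the thesis's definition of parametric simulation: the closeness function (string similarity from Python's `difflib`), the thresholds `sigma` and `delta`, the `name` key attribute, and the data layout are all assumptions chosen for illustration, whereas HER learns its parameter functions and thresholds and also takes path associations into account.

```python
from difflib import SequenceMatcher


def closeness(a, b):
    """Assumed closeness function: case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def matches(row, vertex, graph, labels, key_attr="name", sigma=0.8, delta=0.5):
    """One-hop simplification: tuple `row` matches `vertex` if
    (1) the key attribute is close enough to the vertex label, and
    (2) at least a fraction `delta` of the remaining attributes is
        supported by an edge to a neighbor with a close label.
    """
    if closeness(row[key_attr], labels[vertex]) < sigma:
        return False
    attrs = [a for a in row if a != key_attr]
    if not attrs:
        return True
    neighbors = graph.get(vertex, [])
    supported = 0
    for attr in attrs:
        # An attribute is supported when some edge label is close to the
        # attribute name and the neighbor's label is close to its value.
        if any(
            closeness(attr, edge) >= sigma and closeness(row[attr], labels[n]) >= sigma
            for edge, n in neighbors
        ):
            supported += 1
    return supported / len(attrs) >= delta


# Hypothetical tuple t from a relational database D.
t = {"name": "Ada Lovelace", "city": "London"}

# Hypothetical graph G: vertex labels and labeled edges.
labels = {"v1": "Ada Lovelace", "v2": "London", "v3": "Alan Turing", "v4": "Ada Lovelace"}
graph = {
    "v1": [("city", "v2")],
    "v3": [("city", "v2")],
    "v4": [],  # same name, but no supporting neighborhood
}

linked = [v for v in labels if matches(t, v, graph, labels)]
```

Here only `v1` is linked to the tuple: `v3` fails the label check, and `v4` has a matching label but no neighbor supporting the `city` attribute, which shows why topology is considered alongside label similarity.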
Rather than treating rule-based methods and machine learning algorithms separately when enhancing data quality, we focused on combining both approaches
to address the challenges of data currency and entity linking. We combined rule-based
methods with state-of-the-art machine learning methods to represent entities, then used
the representations of these entities for further tasks. These enhanced models, which combine
machine learning and logic rules, helped us represent entities better, so as (i)
to find the most up-to-date attribute values and (ii) to link entities across relations and
graphs.