Towards generic relation extraction
Abstract
A vast amount of usable electronic data is in the form of unstructured text. The relation
extraction task aims to identify useful information in text (e.g., PersonW works
for OrganisationX, GeneY encodes ProteinZ) and recode it in a format such as a relational
database that can be more effectively used for querying and automated reasoning.
However, adapting conventional relation extraction systems to new domains
or tasks requires significant effort from annotators and developers. Furthermore, previous
adaptation approaches based on bootstrapping start from example instances of
the target relations, thus requiring that the correct relation type schema be known in
advance. Generic relation extraction (GRE) addresses the adaptation problem by applying
generic techniques that achieve comparable accuracy when transferred, without
modification of model parameters, across domains and tasks.
Previous work on GRE has relied extensively on various lexical and shallow syntactic
indicators. I present new state-of-the-art models for GRE that incorporate governor-dependency
information. I also introduce a dimensionality reduction step into the GRE
relation characterisation sub-task, which serves to capture latent semantic information
and leads to significant improvements over an unreduced model. Comparison of dimensionality
reduction techniques suggests that latent Dirichlet allocation (LDA) – a
probabilistic generative approach – successfully incorporates a larger and more interdependent
feature set than a model based on singular value decomposition (SVD) and
performs as well as or better than SVD in all experimental settings. Finally, I
introduce multi-document summarisation as an extrinsic test bed for GRE and present
results which demonstrate that the relative performance of GRE models is consistent
across tasks and that the GRE-based representation leads to significant improvements
over a standard baseline from the literature.
Taken together, the experimental results 1) show that GRE can be improved using
dependency parsing and dimensionality reduction, 2) demonstrate the utility of GRE
for the content selection step of extractive summarisation and 3) validate the GRE
claim of modification-free adaptation for the first time with respect to both domain and
task. This thesis also introduces data sets derived from publicly available corpora for
the purpose of rigorous intrinsic evaluation in the news and biomedical domains.