Contextual citation recommendation using scientific discourse annotation schemes
Abstract
All researchers have experienced the problem of fishing out the most relevant scientific
papers from an ocean of publications, and some may have wished that their text editor
suggested these papers automatically. This thesis is built around one task: recommending
contextually relevant citations to the author of a scientific paper, which we
call Contextual Citation Recommendation (CCR). Like others before us, we frame CCR
as an Information Retrieval task and we evaluate our approach using existing publications.
That is, an existing in-text citation to one or more documents in a corpus
is replaced with a placeholder and the task is to retrieve the cited documents automatically.
We carry out a cross-domain study and evaluate our approaches using two
separate document collections in two different domains: computational linguistics and
biomedical science.
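
The evaluation setup described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the placeholder string, the function names, the toy sentence and the choice of reciprocal rank as a score are all assumptions made for illustration.

```python
PLACEHOLDER = "__CITATION__"  # hypothetical marker; the abstract does not name one


def mask_citation(text, start, end):
    """Replace the in-text citation at text[start:end] with the placeholder."""
    return text[:start] + PLACEHOLDER + text[end:]


def reciprocal_rank(ranked_ids, cited_ids):
    """Return 1/rank of the first retrieved document that was really cited, else 0.0."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in cited_ids:
            return 1.0 / rank
    return 0.0


# Toy citing sentence with an invented citation string.
sentence = "We follow the scheme of (Author, 2000) in our experiments."
citation = "(Author, 2000)"
start = sentence.index(citation)
masked = mask_citation(sentence, start, start + len(citation))

# Toy ranked list from some retrieval system; "d1" stands for the cited document.
score = reciprocal_rank(["d3", "d1", "d7"], {"d1"})
```

Here the masked sentence is what a retrieval system would see, and the score rewards placing the originally cited document near the top of the ranking.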
This thesis comprises three parts, which build cumulatively. Part I establishes
a framework for the task using a standard Information Retrieval setup and explores
different parameters for indexing documents and for extracting the evaluation queries
in order to establish solid baselines for our two corpora in our two domains. We experiment
with symmetric windows of words and sentences for both query extraction
and for integrating the anchor text, that is, the text surrounding a citation, which is an
important source of data for building document representations. We show for the first
time that the contribution of anchor text is highly domain-dependent.
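
As a rough sketch of what a symmetric window of words around a citation might look like, consider the following. The function name, the placeholder token and the whitespace tokenisation are assumptions for illustration, not details taken from the thesis.

```python
def symmetric_word_window(tokens, citation_index, width):
    """Return up to `width` tokens on each side of the citation placeholder.

    The placeholder token itself is excluded from the result.
    """
    left = tokens[max(0, citation_index - width):citation_index]
    right = tokens[citation_index + 1:citation_index + 1 + width]
    return left + right


# Toy citing sentence; "__CITATION__" is a hypothetical placeholder token.
tokens = "we adopt the parser of __CITATION__ for all experiments".split()
query_terms = symmetric_word_window(tokens, tokens.index("__CITATION__"), width=2)
```

The same idea carries over to windows of sentences by passing a list of sentences instead of a list of words.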
Part II investigates a number of scientific discourse annotation schemes for academic
articles. It has often been suggested that annotating discourse structure could
support Information Retrieval scenarios such as this one, and this is a key hypothesis
of this thesis. We focus on two of these schemes: Argumentative Zoning (AZ, for the domain
of computational linguistics) and Core Scientific Concepts (CoreSC, for the domain of
biomedical sciences). Both are sentence-based scientific discourse annotation schemes
that define classes such as Hypothesis, Method and Result for CoreSC and
Background/Own/Contrast for AZ. By annotating each sentence in every document with
AZ/CoreSC and indexing them separately by sentence class, we discover that consistent
citing patterns exist in each domain. For example, sentences of type Conclusion
in cited papers are consistently cited by sentences of type Conclusion or Background
in citing biomedical articles.
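
One simple way to picture such citing patterns is to tally, for each class of citing sentence, how often it links to each class of cited sentence. The sketch below does only that; it is a simplified illustration with invented toy data and hypothetical names, not the indexing and retrieval procedure used in the thesis.

```python
from collections import Counter


def citing_patterns(links):
    """Map (citing_class, cited_class) to its share of links from that citing class."""
    pair_counts = Counter(links)
    citing_totals = Counter(citing for citing, _ in links)
    return {
        pair: count / citing_totals[pair[0]]
        for pair, count in pair_counts.items()
    }


# Toy links: (class of the citing sentence, class of the matched cited sentence).
links = [
    ("Background", "Conclusion"),
    ("Background", "Conclusion"),
    ("Background", "Method"),
    ("Conclusion", "Conclusion"),
]
patterns = citing_patterns(links)
```

A pair with a consistently high share across many documents would correspond to the kind of regularity described above.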
Finally, Part III moves away from simple windows over terms or over sentences for
extracting the query from a citation’s context, and investigates methods for supervised
query extraction using linguistic information. As part of this, we first explore how
to automatically generate training data in the form of citation contexts, each paired with
an optimal query. Second, we train supervised machine learning models for
automatically extracting these queries with limited prior knowledge of the document
collection and show substantial improvements over our baselines in the domain of computational
linguistics. We also investigate the contribution of stopwords to each corpus
and we explore the performance of human annotators at this task.