Contextual citation recommendation using scientific discourse annotation schemes
Duma, Daniel Cristian
All researchers have experienced the problem of fishing out the most relevant scientific papers from an ocean of publications, and some may have wished that their text editor suggested these papers automatically. This thesis is built around this task: recommending contextually relevant citations to the author of a scientific paper, which we call Contextual Citation Recommendation (CCR). Like others before us, we frame CCR as an Information Retrieval task and evaluate our approach using existing publications. That is, an existing in-text citation to one or more documents in a corpus is replaced with a placeholder, and the task is to retrieve the cited documents automatically. We carry out a cross-domain study and evaluate our approaches on two separate document collections in two different domains: computational linguistics and biomedical science. This thesis consists of three parts, which build cumulatively. Part I establishes a framework for the task using a standard Information Retrieval setup and explores different parameters for indexing documents and for extracting the evaluation queries, in order to establish solid baselines for our two corpora. We experiment with symmetric windows of words and sentences, both for query extraction and for integrating the anchor text, that is, the text surrounding a citation, which is an important source of data for building document representations. We show for the first time that the contribution of anchor text is highly domain-dependent. Part II investigates a number of scientific discourse annotation schemes for academic articles. It has often been suggested that annotating discourse structure could support Information Retrieval scenarios such as this one, and this is a key hypothesis of this thesis.
We focus on two of these: Argumentative Zoning (AZ, for the domain of computational linguistics) and Core Scientific Concepts (CoreSC, for the domain of biomedical sciences). Both are sentence-based scientific discourse annotation schemes that define classes such as Hypothesis, Method and Result for CoreSC and Background/Own/Contrast for AZ. By annotating each sentence in every document with AZ/CoreSC and indexing the sentences separately by class, we discover that consistent citing patterns exist in each domain; for example, sentences of type Conclusion in cited papers are consistently cited by sentences of type Conclusion or Background in citing biomedical articles. Finally, Part III moves away from simple windows over terms or over sentences for extracting the query from a citation's context, and investigates methods for supervised query extraction using linguistic information. As part of this, we first explore how to automatically generate training data in the form of citation contexts paired with an optimal query. Second, we train supervised machine learning models for automatically extracting these queries with limited prior knowledge of the document collection, and show substantial improvements over our baselines in the domain of computational linguistics. We also investigate the contribution of stopwords in each corpus and explore the performance of human annotators at this task.
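The window-based evaluation setup summarised above can be illustrated with a minimal sketch. It is not the implementation used in the thesis: the placeholder token `__CIT__`, the function names `extract_query` and `rank_documents`, and the term-overlap scoring are all assumptions made for illustration, standing in for a real Information Retrieval engine.

```python
import re

# Hypothetical marker standing in for the removed in-text citation.
PLACEHOLDER = "__CIT__"


def extract_query(text, window=5, placeholder=PLACEHOLDER):
    """Return up to `window` words on each side of the first placeholder.

    This is a symmetric window over words; the placeholder itself is excluded.
    """
    tokens = re.findall(r"\w+", text.lower())
    marker = placeholder.lower()
    if marker not in tokens:
        raise ValueError("no citation placeholder in text")
    pos = tokens.index(marker)
    left = tokens[max(0, pos - window):pos]
    right = tokens[pos + 1:pos + 1 + window]
    return left + right


def rank_documents(query, documents):
    """Rank document ids by the number of distinct query terms they contain.

    Assumption: plain term overlap replaces a proper retrieval model.
    Ties are broken alphabetically by document id.
    """
    query_terms = set(query)
    scores = {
        doc_id: len(query_terms & set(re.findall(r"\w+", body.lower())))
        for doc_id, body in documents.items()
    }
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    context = ("Statistical parsers are usually trained on treebanks "
               "__CIT__ and then evaluated on held out data")
    corpus = {
        "d1": "Building large treebanks for parsers trained on newswire",
        "d2": "Protein folding and gene expression",
    }
    query = extract_query(context, window=3)
    print(query)
    print(rank_documents(query, corpus))
```

In an evaluation run of this kind, the rank of the originally cited document in the returned list would be recorded for each citation context, and the window size would be one of the parameters to vary.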