Data and Models for Statistical Parsing with Combinatory Categorial Grammar

Hockenmaier, Julia

Data and Models for Statistical Parsing with Combinatory Categorial Grammar

Simple item page

dc.contributor.advisor

Steedman, Mark

en

dc.contributor.advisor

Thompson, Henry

en

dc.contributor.author

Hockenmaier, Julia

en

dc.contributor.sponsor

Engineering and Physical Sciences Research Council (EPSRC)

en

dc.date.accessioned

2004-01-27T12:18:14Z

dc.date.available

2004-01-27T12:18:14Z

dc.date.issued

2003-07

dc.description

Institute for Communicating and Collaborative Systems

en

dc.description.abstract

This dissertation is concerned with the creation of training data and the development of probability models for statistical parsing of English with Combinatory Categorial Grammar (CCG). Parsing, or syntactic analysis, is a prerequisite for semantic interpretation, and forms therefore an integral part of any system which requires natural language understanding. Since almost all naturally occurring sentences are ambiguous, it is not sufficient (and often impossible) to generate all possible syntactic analyses. Instead, the parser needs to rank competing analyses and select only the most likely ones. A statistical parser uses a probability model to perform this task. I propose a number of ways in which such probability models can be defined for CCG. The kinds of models developed in this dissertation, generative models over normal-form derivation trees, are particularly simple, and have the further property of restricting the set of syntactic analyses to those corresponding to a canonical derivation structure. This is important to guarantee that parsing can be done efficiently. In order to achieve high parsing accuracy, a large corpus of annotated data is required to estimate the parameters of the probability models. Most existing wide-coverage statistical parsers use models of phrase-structure trees estimated from the Penn Treebank, a 1-million-word corpus of manually annotated sentences from theWall Street Journal. This dissertation presents an algorithm which translates the phrase-structure analyses of the Penn Treebank to CCG derivations. The resulting corpus, CCGbank, is used to train and test the models proposed in this dissertation. Experimental results indicate that parsing accuracy (when evaluated according to a comparable metric, the recovery of unlabelled word-word dependency relations), is as high as that of standard Penn Treebank parsers which use similar modelling techniques. Most existing wide-coverage statistical parsers use simple phrase-structure grammars whose syntactic analyses fail to capture long-range dependencies, and therefore do not correspond to directly interpretable semantic representations. By contrast, CCG is a grammar formalism in which semantic representations that include long-range dependencies can be built directly during the derivation of syntactic structure. These dependencies define the predicate-argument structure of a sentence, and are used for two purposes in this dissertation: First, the performance of the parser can be evaluated according to how well it recovers these dependencies. In contrast to purely syntactic evaluations, this yields a direct measure of how accurate the semantic interpretations returned by the parser are. Second, I propose a generative model that captures the local and non-local dependencies in the predicate-argument structure, and investigate the impact of modelling non-local in addition to local dependencies.

en

dc.format.extent

998223 bytes

en

dc.format.mimetype

application/pdf

en

dc.identifier.uri

http://hdl.handle.net/1842/320

dc.language.iso

en

dc.publisher

University of Edinburgh. College of Science and Engineering. School of Informatics.

en

dc.title

Data and Models for Statistical Parsing with Combinatory Categorial Grammar

en

dc.type

Thesis or Dissertation

en

dc.type.qualificationlevel

Doctoral

en

dc.type.qualificationname

PhD Doctor of Philosophy

en

Files

Original bundle

Now showing 1 - 1 of 1

Name:: IP030013.pdf
Size:: 974.83 KB
Format:: Adobe Portable Document Format

Download

This item appears in the following Collection(s)

Informatics thesis and dissertation collection