Wide-coverage parsing for Turkish

Çakici, Ruket

dc.contributor.advisor

Steedman, Mark

en

dc.contributor.advisor

Osborne, Miles

en

dc.contributor.author

Çakici, Ruket

en

dc.date.accessioned

2010-10-04T09:41:04Z

dc.date.available

2010-10-04T09:41:04Z

dc.date.issued

2009

dc.description.abstract

Wide-coverage parsing is an area that attracts much attention in natural language processing research. This is because it is the first step to many other applications in natural language understanding, such as question answering. Supervised learning using human-labelled data is currently the best-performing method. Therefore, there is great demand for annotated data. However, human annotation is very expensive, and the amount of annotated data is always much less than is needed to train well-performing parsers. This is the motivation behind making the best use of the data available. Turkish presents a challenge both because syntactically annotated Turkish data is relatively scarce and because Turkish is highly agglutinative, hence unusually sparse at the whole-word level. The METU-Sabancı Treebank is a dependency treebank of 5620 sentences with surface dependency relations and morphological analyses for words. We show that including even the crudest forms of morphological information extracted from the data boosts the performance of both generative and discriminative parsers, contrary to received opinion concerning English. We induce word-based and morpheme-based CCG grammars from the Turkish dependency treebank. We use these grammars to train a state-of-the-art CCG parser that predicts long-distance dependencies in addition to the ones that other parsers are capable of predicting. We also use the correct CCG categories as simple features in a graph-based dependency parser and show that this improves the parsing results. We show that a morpheme-based CCG lexicon for Turkish is able to address many problems, such as resolving conflicts of semantic scope, recovering long-range dependencies, and obtaining smoother statistics from the models. CCG handles linguistic phenomena such as local and long-range dependencies more naturally and effectively than other linguistic theories while potentially supporting semantic interpretation in parallel.
Using morphological information and a morpheme-cluster-based lexicon improves performance both quantitatively and qualitatively for Turkish. We also provide an improved version of the treebank, which will be released by kind permission of METU and Sabancı.

en

dc.identifier.uri

http://hdl.handle.net/1842/3807

dc.language.iso

en

dc.publisher

The University of Edinburgh

en

dc.subject

combinatory categorial grammar

en

dc.subject

CCG

en

dc.subject

parsing

en

dc.subject

natural language processing

en

dc.subject

morphology

en

dc.subject

syntax

en

dc.title

Wide-coverage parsing for Turkish

en

dc.type

Thesis or Dissertation

en

dc.type.qualificationlevel

Doctoral

en

dc.type.qualificationname

PhD Doctor of Philosophy

en

Files

Original bundle

Name: Cakici2008.pdf
Size: 2.14 MB
Format: Adobe Portable Document Format

This item appears in the following Collection(s)

Informatics thesis and dissertation collection