Scalable semi-supervised grammar induction using cross-linguistically parameterized syntactic prototypes
Abstract
This thesis is about the task of unsupervised parser induction: automatically learning grammars and parsing models from raw text. We endeavor to induce such parsers by observing only sequences of terminal symbols. We focus on overcoming the problem of frequent collocation, a major source of error in grammar induction. For example, since a verb and a determiner tend to co-occur in a verb phrase, the probability of attaching the determiner to the verb is sometimes higher than that of attaching the core noun to the verb, resulting in the erroneous attachment *((Verb Det) Noun) instead of (Verb (Det Noun)). Although frequent collocation lies at the heart of grammar induction, it can badly distort the induced grammar distribution. Natural language grammars follow a Zipfian (power-law) distribution, in which the frequency of a grammar rule is inversely proportional to its rank in the frequency table. We believe that covering the most frequent grammar rules will therefore have a strong impact on accuracy.
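To make this point concrete, here is a minimal Python sketch (not from the thesis; the 10,000-rule vocabulary and the exponent s = 1 are arbitrary illustrative assumptions) of how much of the total rule mass the top-ranked rules cover under an idealized Zipfian distribution:

    # Minimal illustration (assumed numbers, not from the thesis): under
    # frequency(rank) ~ 1 / rank**s, a handful of top-ranked rules covers
    # a disproportionate share of all rule occurrences.

    def zipf_coverage(num_rules: int, top_k: int, s: float = 1.0) -> float:
        """Fraction of total rule occurrences covered by the top_k ranks."""
        weights = [1.0 / r ** s for r in range(1, num_rules + 1)]
        return sum(weights[:top_k]) / sum(weights)

    # With 10,000 rule types, the 33 most frequent ranks already cover
    # about 42% of all rule tokens under s = 1:
    print(f"{zipf_coverage(10_000, 33):.1%}")

Under these assumptions, a few dozen rules account for a large share of the distribution, which is why pinning down the most frequent rules by hand can pay off.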
We propose an efficient approach to grammar induction guided by cross-linguistic language parameters. Our language parameters comprise 33 parameters covering frequent basic word orders, which can easily be elicited from grammar compendiums or short interviews with naïve language informants. These parameters are designed to capture the frequent word orders in the Zipfian distribution of natural language grammars, while the rest of the grammar, including exceptions, is automatically induced from unlabeled data. The language parameters shrink the search space of the grammar induction problem by exploiting both word-order information and predefined attachment directions.
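As an illustration (a hedged sketch only: the parameter names, POS tags, and pruning function below are hypothetical and do not reproduce the thesis's actual 33-parameter inventory or induction algorithm), such parameters can be thought of as preferred head directions for frequent head-dependent pairs, used to prune the space of candidate attachments:

    # Hypothetical encoding (illustrative only): each parameter fixes the
    # head direction for one frequent head-dependent construction.
    # "head-initial" means the head precedes its dependent.
    WORD_ORDER_PARAMS_EN = {
        ("VERB", "NOUN"): "head-initial",  # VO order: objects follow the verb
        ("ADP",  "NOUN"): "head-initial",  # prepositions, not postpositions
        ("NOUN", "DET"):  "head-final",    # determiners precede the noun
        ("NOUN", "ADJ"):  "head-final",    # adjectives precede the noun
        # ... remaining parameters, elicited from a grammar compendium
    }

    def attachment_allowed(head_pos: str, dep_pos: str,
                           head_idx: int, dep_idx: int,
                           params: dict) -> bool:
        """Prune candidate dependency arcs whose direction contradicts
        a parameter; uncovered pairs are left for the learner to induce."""
        direction = params.get((head_pos, dep_pos))
        if direction is None:
            return True  # no parameter applies: defer to the data
        return (head_idx < dep_idx) == (direction == "head-initial")

Leaving uncovered pairs unconstrained mirrors the division of labor described above: frequent word orders come from the parameters, while exceptions and rarer constructions are induced from unlabeled data.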
The contribution of this thesis is threefold. (1) We show that the language parameters generalize well cross-linguistically: our grammar induction experiments are carried out on 14 languages on top of a simple unsupervised grammar induction system. (2) Our specification of language parameters improves the accuracy of unsupervised parsing even when the parser is exposed to the much less frequent linguistic phenomena of longer sentences, where accuracy decreases by less than 10%. (3) We investigate the prevalent sources of error in grammar induction, which point to room for further accuracy improvement.
The proposed language parameters efficiently cope with the most frequent grammar rules in natural languages. With only 10 man-hours for preparing syntactic prototypes, they improve the accuracy of directed dependency recovery over Gillenwater et al.'s (2010) state-of-the-art completely unsupervised parser in: (1) Chinese by 30.32%, (2) Swedish by 28.96%, (3) Portuguese by 37.64%, (4) Dutch by 15.17%, (5) German by 14.21%, (6) Spanish by 13.53%, (7) Japanese by 13.13%, (8) English by 12.41%, (9) Czech by 9.16%, (10) Slovene by 7.24%, (11) Turkish by 6.72%, and (12) Bulgarian by 5.96%. Note that although the directed dependency accuracies of some languages fall below 60%, their TEDEVAL scores remain satisfactory (approximately 80%). This suggests that our parsed trees are in fact close to the gold-standard trees despite discrepancies between annotation schemes.
We perform an error analysis of over- and under-generation. We find three prevalent problems that cause errors in the experiments: (1) PP attachment, (2) discrepancies between dependency annotation schemes, and (3) rich morphology.
The methods presented in this thesis were originally presented in Boonkwan and Steedman (2011). The thesis provides a great deal more detail on the design of the cross-linguistic language parameters, the algorithm for lexicon inventory construction, the experimental results, and the error analysis.