Scalable semi-supervised grammar induction using cross-linguistically parameterized syntactic prototypes
This thesis is about the task of unsupervised parser induction: automatically learning grammars and parsing models from raw text. We endeavor to induce such parsers by observing sequences of terminal symbols. We focus on overcoming the problem of frequent collocation that is a major source of error in grammar induction. For example, since a verb and a determiner tend to co-occur in a verb phrase, the probability of attaching the determiner to the verb is sometimes higher than that of attaching the core noun to the verb, resulting in erroneous attachment *((Verb Det) Noun) instead of (Verb (Det Noun)). Although frequent collocation is the heart of grammar induction, it is precariously capable of distorting the grammar distribution. Natural language grammars follow a Zipfian (power law) distribution, where the frequency of any grammar rule is inversely proportional to its rank in the frequency table. We believe that covering the most frequent grammar rules in grammar induction will have a strong impact on accuracy. We propose an efficient approach to grammar induction guided by cross-linguistic language parameters. Our language parameters consist of 33 parameters of frequent basic word orders, which are easy to be elicited from grammar compendiums or short interviews with naïve language informants. These parameters are designed to capture frequent word orders in the Zipfian distribution of natural language grammars, while the rest of the grammar including exceptions can be automatically induced from unlabeled data. The language parameters shrink the search space of the grammar induction problem by exploiting both word order information and predefined attachment directions. The contribution of this thesis is three-fold. (1) We show that the language parameters are adequately generalizable cross-linguistically, as our grammar induction experiments will be carried out on 14 languages on top of a simple unsupervised grammar induction system. (2) Our specification of language parameters improves the accuracy of unsupervised parsing even when the parser is exposed to much less frequent linguistic phenomena in longer sentences when the accuracy decreases within 10%. (3) We investigate the prevalent factors of errors in grammar induction which will provide room for accuracy improvement. The proposed language parameters efficiently cope with the most frequent grammar rules in natural languages. With only 10 man-hours for preparing syntactic prototypes, it improves the accuracy of directed dependency recovery over the state-ofthe- art Gillenwater et al.’s (2010) completely unsupervised parser in: (1) Chinese by 30.32% (2) Swedish by 28.96% (3) Portuguese by 37.64% (4) Dutch by 15.17% (5) German by 14.21% (6) Spanish by 13.53% (7) Japanese by 13.13% (8) English by 12.41% (9) Czech by 9.16% (10) Slovene by 7.24% (11) Turkish by 6.72% and (12) Bulgarian by 5.96%. It is noted that although the directed dependency accuracies of some languages are below 60%, their TEDEVAL scores are still satisfactory (approximately 80%). This suggests us that our parsed trees are, in fact, closely related to the gold-standard trees despite the discrepancy of annotation schemes. We perform an error analysis of over- and under-generation analysis. We found three prevalent problems that cause errors in the experiments: (1) PP attachment (2) discrepancies of dependency annotation schemes and (3) rich morphology. The methods presented in this thesis were originally presented in Boonkwan and Steedman (2011). The thesis presents a great deal more detail in the design of crosslinguistic language parameters, the algorithm of lexicon inventory construction, experiment results, and error analysis.