Probabilistic grammar induction from sentences and structured meanings
Kwiatkowski, Thomas Mieczyslaw
The meanings of natural language sentences may be represented as compositional logical-forms. Each word or lexicalised multiword-element has an associated logicalform representing its meaning. Full sentential logical-forms are then composed from these word logical-forms via a syntactic parse of the sentence. This thesis develops two computational systems that learn both the word-meanings and parsing model required to map sentences onto logical-forms from an example corpus of (sentence, logical-form) pairs. One of these systems is designed to provide a general purpose method of inducing semantic parsers for multiple languages and logical meaning representations. Semantic parsers map sentences onto logical representations of their meanings and may form an important part of any computational task that needs to interpret the meanings of sentences. The other system is designed to model the way in which a child learns the semantics and syntax of their first language. Here, logical-forms are used to represent the potentially ambiguous context in which childdirected utterances are spoken and a psycholinguistically plausible training algorithm learns a probabilistic grammar that describes the target language. This computational modelling task is important as it can provide evidence for or against competing theories of how children learn their first language. Both of the systems presented here are based upon two working hypotheses. First, that the correct parse of any sentence in any language is contained in a set of possible parses defined in terms of the sentence itself, the sentence’s logical-form and a small set of combinatory rule schemata. The second working hypothesis is that, given a corpus of (sentence, logical-form) pairs that each support a large number of possible parses according to the schemata mentioned above, it is possible to learn a probabilistic parsing model that accurately describes the target language. The algorithm for semantic parser induction learns Combinatory Categorial Grammar (CCG) lexicons and discriminative probabilistic parsing models from corpora of (sentence, logical-form) pairs. This system is shown to achieve at or near state of the art performance across multiple languages, logical meaning representations and domains. As the approach is not tied to any single natural or logical language, this system represents an important step towards widely applicable black-box methods for semantic parser induction. This thesis also develops an efficient representation of the CCG lexicon that separately stores language specific syntactic regularities and domain specific semantic knowledge. This factorised lexical representation improves the performance of CCG based semantic parsers in sparse domains and also provides a potential basis for lexical expansion and domain adaptation for semantic parsers. The algorithm for modelling child language acquisition learns a generative probabilistic model of CCG parses from sentences paired with a context set of potential logical-forms containing one correct entry and a number of distractors. The online learning algorithm used is intended to be psycholinguistically plausible and to assume as little information specific to the task of language learning as is possible. It is shown that this algorithm learns an accurate parsing model despite making very few initial assumptions. It is also shown that the manner in which both word-meanings and syntactic rules are learnt is in accordance with observations of both of these learning tasks in children, supporting a theory of language acquisition that builds upon the two working hypotheses stated above.