Bayesian models of syntactic category acquisition
Frank, Stella Christina
Discovering a word’s part of speech is an essential step in acquiring the grammar of a language. In this thesis we examine a variety of computational Bayesian models that use linguistic input available to children, in the form of transcribed child directed speech, to learn part of speech categories. Part of speech categories are characterised by contextual (distributional/syntactic) and word-internal (morphological) similarity. In this thesis, we assume language learners will be aware of these types of cues, and investigate exactly how they can make use of them. Firstly, we enrich the context of a standard model (the Bayesian Hidden Markov Model) by adding sentence type to the wider distributional context.We show that children are exposed to a much more diverse set of sentence types than evident in standard corpora used for NLP tasks, and previous work suggests that they are aware of the differences between sentence type as signalled by prosody and pragmatics. Sentence type affects local context distributions, and as such can be informative when relying on local context for categorisation. Adding sentence types to the model improves performance, depending on how it is integrated into our models. We discuss how to incorporate novel features into the model structure we use in a flexible manner, and present a second model type that learns to use sentence type as a distinguishing cue only when it is informative. Secondly, we add a model of morphological segmentation to the part of speech categorisation model, in order to model joint learning of syntactic categories and morphology. These two tasks are closely linked: categorising words into syntactic categories is aided by morphological information, and finding morphological patterns in words is aided by knowing the syntactic categories of those words. In our joint model, we find improved performance vis-a-vis single-task baselines, but the nature of the improvement depends on the morphological typology of the language being modelled. This is the first token-based joint model of unsupervised morphology and part of speech category learning of which we are aware.