Methods for morphology learning in low(er)-resource scenarios
A core issue that hampers the development and use of language technology for under-resourced and morphologically rich languages is data sparsity. In this work, we consider unsupervised morphological analysis and lemmatization, two linguistically motivated ways to combat problems with sparse data. Morphological analysis aims to represent words in terms of morphemes, the smallest meaningful units of language (e.g., acid +ify +ed), while lemmatization concerns relationships among individual words (e.g., walks, walking, and walked are all forms of the lexeme walk). In this thesis, we focus on morphology learning in low-resource scenarios: we propose algorithms and methods that learn unsupervised morphological analysis and lemmatization with higher accuracy than previous work while keeping training data requirements affordable. Our unsupervised morphological analyzers have similar or better underlying morpheme accuracy than three strong baselines while inducing, on average, a 12.8% more compact representation of the data than the next-best system. Our lemmatizers reduce the training data requirements to raw character representations of wordforms in their immediate context, yet yield improvements (especially on unseen and ambiguous words) over systems that learn from complete morphologically annotated sentences.