Edinburgh Research Archive

Understanding and modeling code-switching: metrics, triggers, and applications in multilingual NLP

Authors

Chi, Jie

Abstract

Code-switching, the phenomenon of alternating between two or more languages within a single conversation or discourse, is commonly observed across a growing number of multilingual communities. Decades of research across various disciplines have focused on understanding its underlying principles and modeling its patterns. This doctoral thesis contributes to this ongoing research from both theoretical and applied perspectives. First, existing popular metrics for measuring code-switching richness rely on counts of tokens, switching points, or the distribution of language spans, without considering differences in morpho-syntax and orthographic conventions across languages. Consequently, metrics calculated for different language pairs are not comparable. This thesis proposes a framework that leverages linguistic findings as makeshift ground truths to assess the quality and sufficiency of existing metrics after normalizing them to factor out token differences. Additionally, it introduces the T-index, which uses machine translation systems to capture properties of code-switched words in relation to the participating language pairs. Second, building on the existing hypothesis that part-of-speech (POS) categories facilitate code-switching, this thesis extends prior work by incorporating the impact of word positions and robustly confirms a statistically significant connection between POS and code-switching. The findings suggest that more diverse syntactic structures lead to less flexibility in code-switching. By categorizing code-switched words and investigating neighboring POS tags, we observe that this relationship is strongest in close proximity to switched instances and gradually diminishes as words move farther from code-switching points. Finally, this thesis investigates two approaches to code-switched text generation and their applications in improving automatic speech recognition systems. Using parallel data from two languages, equivalence constraint theory can determine which segments can be replaced to produce code-switched sentences. Alternatively, a multilingual machine translation system can achieve similar results by using shared representations between languages to produce lexical replacements. In conclusion, this research enhances the understanding of code-switching phenomena and their applications in natural language processing and speech recognition technologies. By bridging linguistic theory and computational methods, this thesis aims to offer valuable insights and practical solutions for handling code-switching in multilingual environments.
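
As a rough illustration of the kind of count-based metric the abstract refers to, the sketch below computes two measures over a sequence of per-token language tags: a multilingual index based on the distribution of tokens per language, and a switch-point fraction. The choice of these two measures (often called the M-index and I-index in the code-switching literature), the function names, and the toy tag sequence are assumptions made for illustration; the abstract does not name the metrics the thesis evaluates, and this is not the thesis's implementation. The sketch also takes the number of languages to be the number observed in the input, which is a simplifying assumption.

```python
from collections import Counter


def m_index(tags):
    """Multilingual index: 0.0 for monolingual input, 1.0 for a perfectly balanced mix.

    Assumption: the number of languages k is the number of distinct tags observed.
    """
    counts = Counter(tags)
    k = len(counts)
    if k < 2:
        return 0.0
    total = sum(counts.values())
    # Sum of squared per-language token proportions.
    sum_sq = sum((c / total) ** 2 for c in counts.values())
    return (1 - sum_sq) / ((k - 1) * sum_sq)


def i_index(tags):
    """Fraction of adjacent token pairs whose language tags differ."""
    if len(tags) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(tags, tags[1:]))
    return switches / (len(tags) - 1)


# Hypothetical language tags for a six-token utterance.
tags = ["en", "en", "es", "es", "en", "es"]
print(m_index(tags), i_index(tags))
```

Both values depend directly on how the text is split into tokens, so the same utterance tokenized under different orthographic or morphological conventions can yield different scores. This is the comparability problem across language pairs that the abstract raises.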
