Edinburgh Research Archive

Understanding and modeling code-switching: metrics, triggers, and applications in multilingual NLP

dc.contributor.advisor
Bell, Peter
dc.contributor.advisor
Lai, Catherine
dc.contributor.author
Chi, Jie
dc.date.accessioned
2025-07-16T10:56:47Z
dc.date.available
2025-07-16T10:56:47Z
dc.date.issued
2025-07-16
dc.description.abstract
Code-switching, the phenomenon of alternating between two or more languages within a single conversation or discourse, has been commonly observed in the growing context of multilingual communities. Decades of research across various disciplines have focused on understanding its underlying principles and modeling its patterns. This doctoral thesis contributes to this ongoing research from both theoretical and applied perspectives. Firstly, existing popular metrics for measuring code-switching richness rely on counts of tokens, switching points, or the distribution of language spans without considering differences in morpho-syntax and orthographic conventions across languages. Consequently, metrics calculated for different language pairs are not comparable. This thesis proposes a framework that leverages linguistic findings as makeshift ground truths to assess the quality and sufficiency of existing metrics after normalizing them to factor out token differences. Additionally, it introduces the T-index, which utilizes machine translation systems to capture properties of code-switched words in relation to the participating language pairs. Building on the existing hypothesis that part-of-speech (POS) facilitates code-switching occurrence, this thesis extends prior work by incorporating the impact of word positions and robustly confirms a statistically significant connection between POS and code-switching. The findings suggest that more diverse syntactical structures lead to less flexibility in code-switching. By categorizing code-switched words and investigating neighboring POS, we observe that this relationship is strongest in close proximity to switched instances, gradually diminishing as words move farther from code-switching points. Furthermore, this thesis investigates two approaches to code-switched text generation and their applications in improving automatic speech recognition systems. 
Using parallel data from two languages, equivalence constraint theory can determine which segments can be replaced to produce code-switched sentences. Alternatively, a multilingual machine translation system can achieve similar results by using shared representations between languages to produce lexical replacements. In conclusion, this research enhances the understanding of code-switching phenomena and their applications in natural language processing and speech recognition technologies. By bridging linguistic theory and computational methods, this thesis aims to offer valuable insights and practical solutions for handling code-switching in multilingual environments.
en
dc.identifier.uri
https://hdl.handle.net/1842/43683
dc.identifier.uri
http://dx.doi.org/10.7488/era/6215
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Bhattacharya, D., Chi, J., Hirschberg, J., and Bell, P. (2023). Capturing formality in speech across domains and languages. In Interspeech 2023, pages 1030–1034.
en
dc.relation.hasversion
Chi, J. and Bell, P. (2022). Improving code-switched ASR with linguistic information. In Proceedings of COLING, pages 7171–7176.
en
dc.relation.hasversion
Chi, J. and Bell, P. (2024). Analyzing the role of part-of-speech in code-switching: A corpus-based study. In Graham, Y. and Purver, M., editors, Findings of the Association for Computational Linguistics: EACL 2024, pages 1712–1721, St. Julian’s, Malta. Association for Computational Linguistics.
en
dc.relation.hasversion
Chi, J., Lu, B., Eisner, J., Bell, P., Jyothi, P., and Ali, A. M. (2023). Unsupervised Code-switched Text Generation from Parallel Text. In Proc. INTERSPEECH 2023, pages 1419–1423.
en
dc.relation.hasversion
Chi, J., Wallington, E., and Bell, P. (2024). Characterizing code-switching: Applying Linguistic Principles for Metric Assessment and Development. In Proc. INTERSPEECH 2024.
en
dc.subject
code-switching
en
dc.subject
ASR
en
dc.subject
multilingual
en
dc.subject
Automatic Speech Recognition
en
dc.title
Understanding and modeling code-switching: metrics, triggers, and applications in multilingual NLP
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en

Files

Original bundle

Name:
Chi2025.pdf
Size:
24.25 MB
Format:
Adobe Portable Document Format