Edinburgh Research Archive

Discovering and analysing lexical variation in social media text

dc.contributor.advisor
Goldwater, Sharon
en
dc.contributor.advisor
Kirby, James
en
dc.contributor.author
Shoemark, Philippa Jane
en
dc.contributor.sponsor
Engineering and Physical Sciences Research Council (EPSRC)
en
dc.date.accessioned
2020-06-08T12:49:56Z
dc.date.available
2020-06-08T12:49:56Z
dc.date.issued
2020-06-25
dc.description.abstract
For many speakers of non-standard or minority language varieties, social media provides an unprecedented opportunity to write in a way which reflects their everyday speech, without censorship or castigation. Social media also functions as a platform for the construction, communication, and consolidation of personal and group identities, and sociolinguistic variation is an important resource that can be put to work in these processes. The ease and efficiency with which vast social media datasets can be collected make them fertile ground for large-scale quantitative sociolinguistic analyses, and this is a growing research area. However, the limited meta-data associated with social media posts often makes it difficult to control for potential confounding factors and to assess the generalisability of results. The aims of this thesis are to advance methodologies for discovering and analysing patterns of sociolinguistic variation in social media text, and to apply them in order to answer questions about social factors that condition the use of Scots and Scottish English on Twitter. The Anglic language varieties spoken in Scotland are often conceptualised as a continuum extending from Scots at one end to Standard English at the other, with Scottish English in between. There is a large degree of overlap in grammar and vocabulary across the whole continuum, and people fluidly shift up and down it depending on the social context. It can therefore be difficult to classify a short utterance as unequivocally Scots or English. For this reason we focus on the lexical level, using a data-driven method to identify words which are distinctive to tweets from Scotland. These include both centuries-old Scots words attested in dictionaries, and newer forms not yet recorded in dictionaries, including innovative variant spellings, contractions, and acronyms for common Scottish turns of phrase. 
We first investigate a hypothesised relationship between support for Scottish independence and distinctively Scottish vocabulary use, revealing that Twitter users who favoured hashtags associated with support for Scottish independence in the lead-up to the 2014 Scottish Independence Referendum used distinctively Scottish lexical variants at higher rates than those who favoured anti-independence hashtags. We also test the hypothesis that when specifically discussing the referendum, people might increase their Scots usage in order to project a stronger Scottish identity or to emphasise Scottish cultural distinctiveness, but find no evidence to suggest this is a widespread phenomenon on Twitter. In fact, our results indicate that people are significantly more likely to use distinctively Scottish vocabulary in everyday chitchat on Twitter than when discussing Scottish independence. We build on the methodologies of previous large-scale studies of style-shifting and lexical variation on social media, taking greater care to avoid confounding form and meaning, to distinguish effects of audience and topic, and to assess whether our findings generalise across different groups of users. Finally, we develop a system to identify pairs of lexical variants which refer to the same concepts and occur in the same syntactic contexts, but differ in form and signal different things about the speaker or situational context. Our aim is to facilitate the process of curating sociolinguistic variables by providing researchers with a ranked list of candidate variant pairs, which they only have to accept or reject. Data-driven identification of lexical variables is particularly important when studying language varieties which do not have a written standard, and when using social media data where linguistic creativity and innovation are rife, as the most distinctive variables will not necessarily be the same as those that are attested in speech or other written domains.
Our proposed system takes as input an unlabelled text corpus containing a mixture of language varieties, and generates pairs of lexical variants which have the same denotation but differential associations with two language varieties of interest. This can considerably speed up the process of identifying pairs of lexical variants with different sociocultural associations, and may reveal pertinent variables that a researcher might not have otherwise considered.
en
dc.identifier.uri
https://hdl.handle.net/1842/37116
dc.identifier.uri
http://dx.doi.org/10.7488/era/417
dc.language.iso
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Shoemark, P., Sur, D., Shrimpton, L., Murray, I., & Goldwater, S. (2017, April). Aye or naw, whit dae ye hink? Scottish independence and linguistic identity on social media. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers (pp. 1239-1248).
en
dc.relation.hasversion
Shoemark, P., Kirby, J., & Goldwater, S. (2017, September). Topic and audience effects on distinctively Scottish vocabulary usage in Twitter data. In Proceedings of the Workshop on Stylistic Variation (pp. 59-68).
en
dc.relation.hasversion
Shoemark, P., Kirby, J., & Goldwater, S. (2018, November). Inducing a lexicon of sociolinguistic variables from code-mixed text. In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text (pp. 1-6).
en
dc.relation.hasversion
Shoemark, P., Ferdousi Liza, F., Nguyen, D., Hale, S. A., & McGillivray, B. (2019). Room to glo: A systematic comparison of semantic change detection approaches with word embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.
en
dc.subject
computational sociolinguistics
en
dc.subject
social media
en
dc.subject
computational linguistics
en
dc.subject
sociolinguistics
en
dc.subject
Scots
en
dc.subject
minority languages
en
dc.subject
Natural Language Processing
en
dc.title
Discovering and analysing lexical variation in social media text
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en

Files

Original bundle

Name:
Shoemark2020.pdf
Size:
1.78 MB
Format:
Adobe Portable Document Format