Discovering and analysing lexical variation in social media text
Shoemark, Philippa Jane
For many speakers of non-standard or minority language varieties, social media provides an unprecedented opportunity to write in a way which reflects their everyday speech, without censorship or castigation. Social media also functions as a platform for the construction, communication, and consolidation of personal and group identities, and sociolinguistic variation is an important resource that can be put to work in these processes. The ease and efficiency with which vast social media datasets can be collected make them fertile ground for large-scale quantitative sociolinguistic analyses, and this is a growing research area. However, the limited meta-data associated with social media posts often makes it difficult to control for potential confounding factors and to assess the generalisability of results. The aims of this thesis are to advance methodologies for discovering and analysing patterns of sociolinguistic variation in social media text, and to apply them in order to answer questions about social factors that condition the use of Scots and Scottish English on Twitter. The Anglic language varieties spoken in Scotland are often conceptualised as a continuum extending from Scots at one end to Standard English at the other, with Scottish English in between. There is a large degree of overlap in grammar and vocabulary across the whole continuum, and people fluidly shift up and down it depending on the social context. It can therefore be difficult to classify a short utterance as unequivocally Scots or English. For this reason we focus on the lexical level, using a data-driven method to identify words which are distinctive to tweets from Scotland. These include both centuries-old Scots words attested in dictionaries, and newer forms not yet recorded in dictionaries, including innovative variant spellings, contractions, and acronyms for common Scottish turns of phrase. 
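The abstract does not say which statistic is used to find words distinctive to tweets from Scotland. As a rough illustration only, the sketch below ranks words by a smoothed log-odds ratio between a target set of texts and a reference set, which is one common choice for this kind of comparison. The function names, the smoothing constant, the count threshold, and the toy sentences are all assumptions made for the example and are not taken from the thesis.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract word tokens, keeping internal apostrophes."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def distinctive_words(target_docs, reference_docs, smoothing=0.5, min_count=2):
    """Rank words by smoothed log-odds of occurring in target vs. reference texts.

    Illustrative stand-in only: the thesis may use a different statistic.
    """
    target = Counter(tok for doc in target_docs for tok in tokenize(doc))
    reference = Counter(tok for doc in reference_docs for tok in tokenize(doc))
    n_target = sum(target.values())
    n_reference = sum(reference.values())
    vocab = set(target) | set(reference)

    scores = {}
    for word in vocab:
        # Skip very rare words, whose ratios are unreliable.
        if target[word] + reference[word] < min_count:
            continue
        # Additive smoothing keeps both probabilities strictly between 0 and 1.
        p_t = (target[word] + smoothing) / (n_target + smoothing * len(vocab))
        p_r = (reference[word] + smoothing) / (n_reference + smoothing * len(vocab))
        scores[word] = math.log(p_t / (1 - p_t)) - math.log(p_r / (1 - p_r))

    # Highest scores first: words most associated with the target texts.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Invented toy sentences, standing in for geotagged tweets.
scotland = ["am gonnae go oot the night", "cannae wait tae go oot", "gonnae no dae that"]
elsewhere = ["am going to go out tonight", "cannot wait to go out", "going to not do that"]

for word, score in distinctive_words(scotland, elsewhere)[:3]:
    print(f"{word}\t{score:.2f}")
```

On real data the two sets would be far larger, and a ranked list like this would still need manual inspection, since a frequency-based score also surfaces place names and topical words rather than only dialect forms.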
We first investigate a hypothesised relationship between support for Scottish independence and distinctively Scottish vocabulary use, revealing that Twitter users who favoured hashtags associated with support for Scottish independence in the lead-up to the 2014 Scottish Independence Referendum used distinctively Scottish lexical variants at higher rates than those who favoured anti-independence hashtags. We also test the hypothesis that, when specifically discussing the referendum, people might increase their Scots usage in order to project a stronger Scottish identity or to emphasise Scottish cultural distinctiveness, but find no evidence to suggest this is a widespread phenomenon on Twitter. In fact, our results indicate that people are significantly more likely to use distinctively Scottish vocabulary in everyday chitchat on Twitter than when discussing Scottish independence. We build on the methodologies of previous large-scale studies of style-shifting and lexical variation on social media, taking greater care to avoid confounding form and meaning, to distinguish effects of audience and topic, and to assess whether our findings generalise across different groups of users. Finally, we develop a system to identify pairs of lexical variants which refer to the same concepts and occur in the same syntactic contexts, but differ in form and signal different things about the speaker or situational context. Our aim is to facilitate the process of curating sociolinguistic variables by providing researchers with a ranked list of candidate variant pairs, which they only have to accept or reject. Data-driven identification of lexical variables is particularly important when studying language varieties which do not have a written standard, and when using social media data where linguistic creativity and innovation are rife, as the most distinctive variables will not necessarily be the same as those that are attested in speech or other written domains.
Our proposed system takes as input an unlabelled text corpus containing a mixture of language varieties, and generates pairs of lexical variants which have the same denotation but differential associations with two language varieties of interest. This can considerably speed up the process of identifying pairs of lexical variants with different sociocultural associations, and may reveal pertinent variables that a researcher might not have otherwise considered.