Effects of prosody on natural language processing

Nielsen, Elizabeth

Files

Nielsen2024.pdf (820.29 KB)

Date

2024-05-15

Authors

Nielsen, Elizabeth

Abstract

Prosody -- or the systematic variation in the energy, pitch, timing, and voice quality of speech -- plays an important role in speech communication. For example, pitch is the primary way an English speaker can distinguish between certain kinds of questions and statements (e.g., 'That's today?' vs. 'That's today.'). Although prosody can convey a range of linguistic features, NLP systems that deal with speech inputs rarely consider prosodic features. Many systems such as dialog agents start with an automatic speech recognition (ASR) step, which converts the audio signal into text, after which all prosodic information is discarded. Previous research has established that prosody can be helpful -- it has been shown to aid in tasks such as syntactic parsing (Tran et al. 2018) -- but the amount of benefit shown for many tasks is modest enough that including prosodic inputs remains a niche approach in NLP. The goal of this thesis is to revisit the question of how prosodic features can benefit a range of NLP tasks. First, Chapter 3 considers the question of what modeling choices are best for incorporating prosodic inputs into NLP tasks. These experiments show that a wide input context is helpful in detecting prosodic information, but even so, text features alone are able to predict a relatively large portion of prosodic activity. Second, Chapter 4 showcases an example where prosody has no observed effect. Even though there is good linguistic justification for expecting that prosody should help in better conveying information status in speech translation, this effect is not seen because the biases of the speech translation model itself make any effect unmeasurable, underscoring the importance of task and model selection. Third, Chapter 5 shows that prosody does help with syntactic parsing in the more realistic setting where the input is not pre-segmented into sentences.
In fact, prosody helps more with segmenting the speech into sentences than with parsing itself, but both tasks benefit. These experiments show that the realistic task of parsing plus segmentation benefits in more ways from including prosody than does parsing alone. Finally, Chapter 6 considers what happens in the sentence segmentation task when an ASR transcript is used as the lexical input, and acoustic noise is introduced to the audio signal. As more sources of noise are added, prosody becomes progressively more important for the model's performance. This suggests that the information in the prosodic and lexical channels is somewhat redundant, with the prosodic channel acting more as a 'back-up' for the lexical channel than as a channel for novel information. Together, these results suggest that prosody has the potential to be helpful in many NLP tasks, but that these benefits are more marked in cases that better approximate real-world language usage, where there are obstacles to clear communication. Because the information in the prosodic and lexical channels overlaps so much, adding prosodic information does not boost performance as much when both channels are clear and unobstructed. However, when obstacles to clear perception (such as lacking sentence boundaries, using an ASR transcript, or acoustic noise) are present, prosody becomes more important. This suggests that in future work, it will be important to move towards modelling assumptions that better approximate the non-idealized conditions of real-world language use in order to fully understand the value of prosody for NLP tasks.

URI

https://hdl.handle.net/1842/41785
http://dx.doi.org/10.7488/era/4508

This item appears in the following Collection(s)

Informatics thesis and dissertation collection