Low-resource speech translation

Bansal, Sameer

dc.contributor.advisor

Goldwater, Sharon

en

dc.contributor.advisor

Lopez, Adam

en

dc.contributor.author

Bansal, Sameer

en

dc.date.accessioned

2020-02-18T15:56:55Z

dc.date.available

2020-02-18T15:56:55Z

dc.date.issued

2019-12-17

dc.description.abstract

We explore the task of speech-to-text translation (ST), where speech in one language (source) is converted to text in a different one (target). Traditional ST systems go through an intermediate step: source language speech is first converted to source language text using an automatic speech recognition (ASR) system, and that text is then converted to target language text using a machine translation (MT) system. However, this pipeline-based approach is impractical for unwritten languages spoken by millions of people around the world, leaving their speakers without access to free and automated translation services such as Google Translate. The lack of such translation services can have important real-world consequences. For example, in the aftermath of a disaster, easily available translation services can help better co-ordinate relief efforts. How can we expand the coverage of automated ST systems to include scenarios which lack source language text? In this thesis we investigate one possible solution: we build ST systems that directly translate source language speech into target language text, thereby removing the dependency on source language text. To build such a system, we use only speech data paired with text translations as training data. We also specifically focus on low-resource settings, where we expect at most tens of hours of training data to be available for unwritten or endangered languages. Our work can be broadly divided into three parts. First, we explore how we can leverage prior work to build ST systems. We find that neural sequence-to-sequence models are an effective and convenient method for ST, but produce poor quality translations when trained in low-resource settings. In the second part of this thesis, we explore methods that improve the translation performance of our neural ST systems without requiring the labeling of additional speech data in the low-resource language, a potentially tedious and expensive process. Instead, we exploit labeled speech data for high-resource languages, which is widely available and relatively easy to obtain. We show that pretraining a neural model with ASR data from a high-resource language, different from both the source and target ST languages, improves ST performance. In the final part of this thesis, we study whether ST systems can be used to build applications which have traditionally relied on the availability of ASR systems, such as information retrieval, clustering audio documents, or question answering. We build proof-of-concept systems for two downstream applications: topic prediction for speech and cross-lingual keyword spotting. Our results indicate that low-resource ST systems can still outperform simple baselines for these tasks, leaving the door open for further exploratory work. This thesis provides, for the first time, an in-depth study of neural models for the task of direct ST across a range of training data settings on a realistic multi-speaker speech corpus. Our contributions include a set of open-source tools to encourage further research.

en

dc.identifier.uri

https://hdl.handle.net/1842/36781

dc.identifier.uri

http://dx.doi.org/10.7488/era/86

dc.language.iso

en

dc.publisher

The University of Edinburgh

en

dc.relation.hasversion

Anastasopoulos, A., Bansal, S., Chiang, D., Goldwater, S., and Lopez, A. (2017). Spoken term discovery for language documentation using translations. In Proc. EMNLP Workshop SCNLP.

en

dc.relation.hasversion

Bansal, S., Kamper, H., Livescu, K., Lopez, A., and Goldwater, S. (2018). Low-resource speech-to-text translation. In Proc. Interspeech.

en

dc.relation.hasversion

Bansal, S., Kamper, H., Livescu, K., Lopez, A., and Goldwater, S. (2019). Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. In Proc. NAACL.

en

dc.relation.hasversion

Bansal, S., Kamper, H., Lopez, A., and Goldwater, S. (2017). Towards speech-to-text translation without speech recognition. In Proc. EACL.

en

dc.relation.hasversion

Stoian, M. C., Bansal, S., and Goldwater, S. (2019). Analyzing ASR pretraining for low-resource speech-to-text translation. arXiv preprint arXiv:1910.10762.

en

dc.subject

automated translation systems

en

dc.subject

speech translation systems

en

dc.subject

unwritten languages

en

dc.subject

neural models

en

dc.subject

automatic speech recognition system

en

dc.subject

endangered languages

en

dc.subject

sequence-to-sequence models

en

dc.title

Low-resource speech translation

en

dc.type

Thesis or Dissertation

en

dc.type.qualificationlevel

Doctoral

en

dc.type.qualificationname

PhD Doctor of Philosophy

en

Files

Original bundle

Name: Bansal2019.pdf
Size: 3.93 MB
Format: Adobe Portable Document Format

This item appears in the following Collection(s)

Informatics thesis and dissertation collection