
Pixels to pitch: extending few-shot learning to audio tasks

Authors

Heggan, Calum

Abstract

Few-shot learning has gained significant attention in recent years as a possible tool for tasks with too little data for traditional machine learning pipelines, such as user-adaptable AI systems or rare-event detection. At the start of this study, research on few-shot learning was heavily focused on the imagery domain, with only a small handful of works considering other settings. This bias toward the imagery domain, and moreover toward a few very popular sub-tasks, has the potential to hinder the development of well-generalised few-shot learning capable of performing well across a variety of domains and problem settings. Works available at the time that did branch out from few-shot imagery largely concerned few-shot audio classification and event detection; these works suffered from a shared lack of reproducibility and, as such, did not effectively build on one another. To branch out more effectively from few-shot imagery, and simultaneously advance both the momentum and the state of the art in few-shot audio classification, this thesis investigates several important threads of research, from benchmark creation and the development of general-purpose self-supervision approaches to transfer learning and performance prediction.

The first part of this thesis focuses on the creation of MetaAudio, a large-scale, fully reproducible and extendable few-shot audio classification benchmark containing 10 evaluation datasets and 4 experimental tracks. We provide a detailed description of the benchmark's construction and setup, as well as a comprehensive suite of experimental results using popular meta-learning approaches. Alongside this, our MetaAudio results highlight key differences between few-shot learning in the imagery and audio domains.

In the second part, we propose MT-SLVR, a novel general-purpose and domain-agnostic self-supervised algorithm capable of learning both an augmentation-invariant and an augmentation-sensitive feature space. Once trained, MT-SLVR models achieved state-of-the-art performance across almost all benchmarked few-shot audio tasks.

The third part investigates the role of transfer learning in few-shot classification. In particular, we evaluate the effectiveness of large-scale self-supervised speech models and determine how closely few-shot audio classification relates to existing audio benchmarking tasks. Additionally, we investigate how effectively more popular and readily available image-based models can be leveraged.

The final part of the thesis looks at downstream tasks and investigates how dataset-dissimilarity measures can be used for few-shot performance prediction. We found that either transductive or inductive class-separability measures can be used effectively to predict both task hardness and final performance.

By providing a fully reproducible and stable benchmark, transfer learning evaluations, and state-of-the-art approaches, this thesis contributes to the advancement of both few-shot and self-supervised learning research. Our work will aid the further development of these fields for audio-related tasks and serve as an exemplar for sequential data in general.
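To make the class-separability idea concrete, the following is a minimal, hypothetical sketch rather than the thesis's actual method: it assumes scikit-learn is available and that support-set embeddings have already been computed by some pre-trained encoder, and it uses the silhouette score (one standard inductive separability measure) as a proxy for few-shot task hardness. The function name task_hardness and the mapping from separability to hardness are illustrative assumptions.

    # Hypothetical sketch: inductive class-separability as a task-hardness proxy.
    # Assumes embeddings come from some pre-trained encoder (not defined here).
    import numpy as np
    from sklearn.metrics import silhouette_score

    def task_hardness(embeddings: np.ndarray, labels: np.ndarray) -> float:
        """Return a hardness proxy in [0, 1] for one N-way K-shot episode.

        embeddings: (N * K, D) array of support-set features.
        labels:     (N * K,) array of class indices.
        Silhouette lies in [-1, 1]; higher separability suggests an easier
        task, so it is mapped to a score where 1.0 is hardest.
        """
        separability = silhouette_score(embeddings, labels)
        return (1.0 - separability) / 2.0

    # Toy 5-way 5-shot episode with random 64-dimensional features.
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(25, 64))
    labs = np.repeat(np.arange(5), 5)
    print(f"estimated hardness: {task_hardness(feats, labs):.3f}")

Under this kind of scheme, a hardness estimate could be computed per episode before any classifier is fit, which is what makes separability measures attractive for predicting final performance.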
