Towards efficient and robust action recognition
dc.contributor.advisor
Fisher, Bob
dc.contributor.advisor
Mac Aodha, Oisin
dc.contributor.author
Kim, Kiyoon
dc.contributor.sponsor
School of Informatics at The University of Edinburgh
en
dc.contributor.sponsor
Engineering and Physical Sciences Research Council (EPSRC)
en
dc.contributor.sponsor
Alan Turing Institute
en
dc.date.accessioned
2023-11-29T10:54:27Z
dc.date.available
2023-11-29T10:54:27Z
dc.date.issued
2023-11-29
dc.description.abstract
Human action recognition is a crucial computer vision task that plays an important role in a wide range of video understanding applications. With the growing volume of video content on the internet, better action recognition methods are needed to reduce the excessive human labor required to analyze videos. This thesis identifies three problems related to efficiency and robustness that keep action recognition technology from being accessible to more people, and proposes a solution to each.
Firstly, the problem of capturing temporal information for video classification in 2D networks, without increasing their computational cost, is addressed. Existing approaches focus on modifying the architecture of 2D networks (e.g., by adding filters in the temporal dimension to turn them into 3D networks, or by using optical flow), which increases computational cost. Instead, we propose a novel sampling strategy in which we re-order the channels of the input video to capture short-term frame-to-frame changes. We observe that, even without architectural extensions, the proposed sampling strategy improves performance on multiple architectures (e.g., TSN, TRN, TSM, and MVFNet) and datasets (CATER, Something-Something-V1 and V2), by up to 24% over the baseline of using the standard video input. In addition, our sampling strategies do not require training from scratch and do not increase the computational cost of training and testing. Given the generality of the results and the flexibility of the approach, we hope they will be widely useful to the video understanding community.
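To make the channel re-ordering idea concrete, the following is a minimal sketch of one plausible strategy: stacking grayscale versions of three consecutive frames into the R, G, and B channels of a single pseudo-frame, so that a 2D network sees short-term motion without any architectural change. The function name, the grayscale weighting, and the grouping of exactly three frames are illustrative assumptions, not necessarily the exact method in the thesis.

```python
import numpy as np

def grayscale_channel_stack(video):
    """Stack grayscale versions of three consecutive frames into the
    R, G, B channels of one pseudo-frame, so a 2D network can pick up
    frame-to-frame changes from a single input image.

    video: float array of shape (T, H, W, 3), with T divisible by 3.
    Returns an array of shape (T // 3, H, W, 3).
    """
    # Luminance-weighted grayscale conversion per frame -> (T, H, W).
    gray = video @ np.array([0.299, 0.587, 0.114])
    t, h, w = gray.shape
    # Group consecutive triples of grayscale frames as the 3 channels.
    return gray.reshape(t // 3, 3, h, w).transpose(0, 2, 3, 1)

video = np.random.rand(12, 8, 8, 3)  # 12 frames of 8x8 RGB
out = grayscale_channel_stack(video)
print(out.shape)  # (4, 8, 8, 3)
```

Because the output keeps the standard (frames, height, width, 3) layout, it can be fed to an unmodified 2D backbone, which is why such sampling adds no training or inference cost.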
Next, precisely naming the action depicted in a video can be a challenging and often ambiguous task. In contrast to object instances represented as nouns (e.g., dog, cat, chair), human annotators typically lack consensus as to what constitutes a specific action (e.g., jogging versus running). In practice, a given video can contain multiple valid positive annotations for the same action. As a result, video datasets often contain significant levels of label noise and overlap between atomic action classes. In this work, we address the challenge of training multi-label action recognition models from only a single positive training label per video. We propose two approaches based on generating pseudo training examples sampled from similar instances within the training set. Unlike approaches that use model-derived pseudo-labels, our pseudo-labels come from human annotations and are selected based on feature similarity. To validate our approaches, we create a new evaluation benchmark by manually annotating a subset of the EPIC-Kitchens-100 validation set with multiple verb labels. We present results on this new test set, along with additional results on a new version of HMDB-51 called Confusing-HMDB-102, and outperform existing methods in both cases. Data and code are publicly available.
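The idea of deriving multi-label pseudo targets from the human annotations of feature-space neighbours can be sketched as follows. This is an illustrative simplification under stated assumptions (L2-normalised features, a hypothetical k-nearest-neighbour selection rule), not the exact procedure from the thesis.

```python
import numpy as np

def neighbor_pseudo_labels(features, single_labels, num_classes, k=1):
    """Augment each video's single positive label with the human labels
    of its k nearest neighbours in feature space, producing a binary
    multi-label pseudo target per video.

    features: (N, D) L2-normalised feature vectors.
    single_labels: (N,) integer class index per video.
    Returns an (N, num_classes) binary multi-label matrix.
    """
    sims = features @ features.T           # cosine similarity, (N, N)
    np.fill_diagonal(sims, -np.inf)        # exclude self-matches
    targets = np.zeros((len(features), num_classes))
    targets[np.arange(len(features)), single_labels] = 1.0
    neighbors = np.argsort(-sims, axis=1)[:, :k]
    for i, nbrs in enumerate(neighbors):
        # Neighbours' labels are human annotations, not model predictions.
        targets[i, single_labels[nbrs]] = 1.0
    return targets

feats = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
labels = np.array([0, 1, 2, 3])
print(neighbor_pseudo_labels(feats, labels, num_classes=4, k=1))
```

Note that every positive entry in the resulting targets originates from a human annotation on some training video; the model's own predictions are never used as labels, which distinguishes this from self-training-style pseudo-labelling.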
Finally, despite recent advances in video action recognition achieving strong performance on existing benchmarks, these models often lack robustness when faced with natural distribution shifts between training and test data. We propose two novel evaluation methods to assess model resilience to such shifts. The first uses two datasets collected from different sources, one for training and validation and the other for testing. More precisely, we created dataset splits that use HMDB-51 or UCF-101 for training and Kinetics-400 for testing, restricted to the classes that overlap between the train and test datasets. The second method extracts the mean feature of each class from the target evaluation dataset's training data (i.e., a class prototype), and classifies each test video by its cosine similarity to the class prototypes of all classes. This procedure does not alter model weights using the target dataset and does not require aligning overlapping classes of two different datasets, making it an efficient way to test model robustness to distribution shifts without prior knowledge of the target distribution. We address the robustness problem with adversarial augmentation training, which generates augmented views of videos that are "hard" for the classification model by applying gradient ascent on the augmentation parameters, combined with "curriculum" scheduling of the strength of the video augmentations. We experimentally demonstrate the superior performance of the proposed adversarial augmentation approach over baselines across three state-of-the-art action recognition models: TSM, Video Swin Transformer, and Uniformer. Curated datasets and code are publicly released. This work provides critical insight into model robustness to distribution shifts and presents effective techniques to enhance video action recognition performance in real-world deployments.
en
dc.identifier.uri
https://hdl.handle.net/1842/41237
dc.identifier.uri
http://dx.doi.org/10.7488/era/3973
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Kiyoon Kim, Shreyank N Gowda, Oisin Mac Aodha, and Laura Sevilla-Lara. Capturing Temporal Information in a Single Frame: Channel Sampling Strategies for Action Recognition. In 33rd British Machine Vision Conference, 2022.
en
dc.relation.hasversion
Kiyoon Kim, Davide Moltisanti, Oisin Mac Aodha, and Laura Sevilla-Lara. An Action Is Worth Multiple Words: Handling Ambiguity in Action Recognition. In 33rd British Machine Vision Conference, 2022.
en
dc.relation.hasversion
Kiyoon Kim, Shreyank N Gowda, Panagiotis Eustratiadis, Antreas Antoniou, and Robert B Fisher. Adversarial Augmentation Training Makes Action Recognition Model More Robust to Realistic Video Distribution Shift.
en
dc.subject
action recognition
en
dc.subject
Human action recognition
en
dc.subject
2D networks
en
dc.subject
3D networks
en
dc.subject
robustness
en
dc.subject
natural distribution shifts
en
dc.title
Towards efficient and robust action recognition
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en
Files
Original bundle
- Name: KimK_2023.pdf
- Size: 24.81 MB
- Format: Adobe Portable Document Format