Towards efficient and robust action recognition

Authors

Kim, Kiyoon

Abstract

Human action recognition is a crucial computer vision task that plays an important role in many video understanding applications. With the ever-increasing popularity of video content on the internet, better action recognition is vital to replace the excessive human labor currently required to analyze videos. This thesis identifies three problems, related to efficiency and robustness, that keep action recognition technology from being accessible to more people, and proposes a solution to each.

Firstly, we address the problem of capturing temporal information for video classification in 2D networks without increasing their computational cost. Existing approaches focus on modifying the architecture of 2D networks (e.g., by including filters in the temporal dimension to turn them into 3D networks, or by using optical flow), which increases the computational cost. Instead, we propose a novel sampling strategy in which the channels of the input video are re-ordered to capture short-term frame-to-frame changes. We observe that, even without architectural changes, the proposed sampling strategy improves performance on multiple architectures (e.g., TSN, TRN, TSM, and MVFNet) and datasets (CATER, Something-Something-V1 and V2) by up to 24% over the baseline that uses the standard video input. In addition, our sampling strategies do not require training from scratch and do not increase the computational cost of training or testing. Given the generality of the results and the flexibility of the approach, we hope this work will be widely useful to the video understanding community.
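
The exact re-ordering schemes are defined in the thesis itself; purely as an illustration, the sketch below implements one plausible variant in which each output frame draws its three color channels from three consecutive time steps, so that an unmodified 2D network sees short-term motion directly in its input, with no extra layers, parameters, or FLOPs. The function name and the boundary handling (clamping indices at the end of the clip) are our own assumptions, not the thesis's specification.

```python
import numpy as np

def temporal_channel_reorder(video: np.ndarray) -> np.ndarray:
    """Mix color channels across neighboring frames.

    video: (T, H, W, 3) RGB frames. Returns an array of the same shape in
    which frame t carries R from time t, G from t+1, and B from t+2, so a
    single "image" encodes short-term frame-to-frame change.
    """
    T = video.shape[0]
    out = np.empty_like(video)
    for t in range(T):
        # Clamp indices near the end of the clip so output length matches input.
        t1, t2 = min(t + 1, T - 1), min(t + 2, T - 1)
        out[t, ..., 0] = video[t, ..., 0]   # R from frame t
        out[t, ..., 1] = video[t1, ..., 1]  # G from frame t+1
        out[t, ..., 2] = video[t2, ..., 2]  # B from frame t+2
    return out
```

Because the transform only permutes existing data, it can sit in a dataloader in front of any 2D model, consistent with the claim that no training from scratch and no extra compute are needed.
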
Next, precisely naming the action depicted in a video can be a challenging and often ambiguous task. In contrast to object instances, which are represented as nouns (e.g., dog, cat, chair), human annotators typically lack consensus on what constitutes a specific action (e.g., jogging versus running). In practice, a given video can contain multiple valid positive annotations for the same action; as a result, video datasets often contain significant label noise and overlap between the atomic action classes. We address the challenge of training multi-label action recognition models from only single positive training labels, and propose two approaches based on generating pseudo training examples sampled from similar instances within the training set. Unlike other approaches that use model-derived pseudo-labels, our pseudo-labels come from human annotations and are selected based on feature similarity. To validate our approaches, we create a new evaluation benchmark by manually annotating a subset of the EPIC-Kitchens-100 validation set with multiple verb labels. We present results on this new test set, along with additional results on a new version of HMDB-51 called Confusing-HMDB-102, and outperform existing methods in both cases. Data and code are publicly available.

Finally, although recent video action recognition models achieve strong performance on existing benchmarks, they often lack robustness when faced with natural distribution shifts between training and test data. We propose two novel evaluation methods to assess model resilience to such distribution disparity. The first uses two datasets collected from different sources, one for training and validation and the other for testing: specifically, we create splits of HMDB-51 or UCF-101 for training and Kinetics-400 for testing, restricted to the subset of classes that overlap between the train and test datasets. The second extracts the mean feature of each class from the target evaluation dataset's training data (i.e., a class prototype) and predicts each test video from its cosine similarity to the class prototypes of all classes. This procedure neither alters the model weights using the target dataset nor requires aligning the overlapping classes of two different datasets, making it a very efficient way to test robustness to distribution shifts without prior knowledge of the target distribution.
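
The prototype-based protocol above amounts to a few lines of code. The following minimal PyTorch sketch uses our own illustrative names (build_prototypes and prototype_predict are not an API from the thesis) and assumes features have already been extracted with a frozen, pre-trained model:

```python
import torch
import torch.nn.functional as F

def build_prototypes(features: torch.Tensor, labels: torch.Tensor,
                     num_classes: int) -> torch.Tensor:
    """Mean feature per class over the target dataset's training split.

    features: (N, D) video features from a frozen model; labels: (N,) ints.
    Assumes every class occurs at least once in the training split.
    """
    protos = torch.zeros(num_classes, features.shape[1])
    for c in range(num_classes):
        protos[c] = features[labels == c].mean(dim=0)
    return protos

def prototype_predict(test_features: torch.Tensor,
                      protos: torch.Tensor) -> torch.Tensor:
    """Score each test video by cosine similarity to every class prototype."""
    sims = F.normalize(test_features, dim=1) @ F.normalize(protos, dim=1).T
    return sims.argmax(dim=1)  # (M,) predicted class per test video
```

Since evaluation reduces to feature extraction plus one matrix product, no model weights are updated and no class alignment between datasets is needed.
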
We address the robustness problem with adversarial augmentation training, which generates augmented views of videos that are "hard" for the classification model by applying gradient ascent to the augmentation parameters, combined with "curriculum" scheduling of the strength of the video augmentations. We experimentally demonstrate the superior performance of the proposed adversarial augmentation approach over baselines across three state-of-the-art action recognition models: TSM, Video Swin Transformer, and Uniformer. Curated datasets and code are publicly released.
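
As a hedged sketch of that idea, the snippet below takes a single gradient-ascent step on one differentiable augmentation parameter, a global brightness shift that stands in for the thesis's richer augmentation set; the function name and step rule are our assumptions. The returned "hard" view would be fed back as a training example, and a curriculum simply grows step_size over training.

```python
import torch
import torch.nn.functional as F

def adversarial_brightness_view(model, video, labels, step_size=0.1):
    """Generate an augmented view that is deliberately hard for the model.

    video: (B, C, T, H, W) clips in [0, 1]; labels: (B,) class indices.
    Gradient ascent is on the augmentation parameter, not the video pixels.
    """
    # One brightness offset per clip is the differentiable augmentation parameter.
    delta = torch.zeros(video.size(0), 1, 1, 1, 1, requires_grad=True)
    loss = F.cross_entropy(model(video + delta), labels)
    # Differentiate w.r.t. delta only; model weights receive no update here.
    (grad,) = torch.autograd.grad(loss, delta)
    hard_delta = (delta + step_size * grad.sign()).detach()  # ascend the loss
    return (video + hard_delta).clamp(0.0, 1.0)  # the "hard" view for training
```
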
The presented work provides critical insight into model robustness to distribution shifts and introduces effective techniques for enhancing video action recognition performance in real-world deployment.