Learning universal representations across tasks and domains
Abstract
A longstanding goal in computer vision research is to build general-purpose systems that work well on a wide range of vision problems and can learn new concepts from only a few labelled samples. In contrast, existing models are limited to specific tasks or domains (datasets), e.g., a semantic segmentation model for indoor images (Silberman et al., 2012). In addition, they are data-inefficient and require a large labelled dataset for each task or domain. While prior work has pursued domain/task-agnostic representations through loss-balancing strategies or architecture design, optimizing such a universal representation network remains a challenging problem. This thesis focuses on addressing the challenges of learning universal representations that generalize well over multiple tasks (e.g. segmentation, depth estimation) or various visual domains (e.g. image object classification, image action classification). In addition, the thesis shows that these representations can be learned from partial supervision and transferred and adapted to previously unseen tasks/domains in a data-efficient manner.
The first part of the dissertation focuses on learning universal representations, i.e. a single universal network for multi-task learning (e.g. learning a single network jointly for different dense prediction tasks such as segmentation and depth estimation) and multi-domain learning (e.g. image classification over various vision datasets, each collected for a different problem such as texture, flower or action classification). Learning such universal representations by jointly minimizing the sum of all task-specific losses is challenging because of interference between tasks, which leads to unbalanced results (i.e. some tasks dominate or interfere with others, and the universal network performs worse than task/domain-specific networks, each of which is trained independently for a single task/domain). Hence a new solution is proposed to regularize the optimization of the universal network by encouraging it to produce the same features as those of the task-specific networks. The experimental results demonstrate that the proposed method learns a single universal network that performs well on multiple tasks or various visual domains.
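The core of this regularizer can be summarized in a short sketch. The following is a minimal, illustrative PyTorch snippet, not the thesis implementation: the names universal, heads, teachers and lam are hypothetical, and a classification loss stands in for arbitrary task losses.

import torch
import torch.nn.functional as F

def universal_loss(universal, heads, teachers, x, targets, lam=1.0):
    # Minimal sketch: `universal` is the shared encoder being trained,
    # `heads` maps task names to prediction heads, and `teachers` maps
    # task names to frozen single-task encoders. `lam` weighs the
    # feature-matching regularizer; all names here are hypothetical.
    z = universal(x)                                      # universal features
    loss = torch.zeros((), device=z.device)
    for t, head in heads.items():
        loss = loss + F.cross_entropy(head(z), targets[t])  # task-specific loss
        with torch.no_grad():
            z_t = teachers[t](x)                          # frozen task-specific features
        loss = loss + lam * F.mse_loss(z, z_t)            # align universal with teacher
    return loss

Minimizing the extra feature-matching term discourages any single task from pulling the shared features away from what each independently trained network has found useful.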
Despite recent advances in multi-task learning of dense prediction problems, most methods rely on expensive fully labelled datasets. Relaxing this assumption gives rise to a new multi-task learning setting, called multi-task partially-supervised learning in this thesis, in which the goal is to jointly learn multiple dense prediction tasks from partially annotated data (i.e. not all task labels are available for each training image). A label-efficient approach is proposed that successfully leverages task relations to supervise multi-task learning when data is partially annotated. In particular, the proposed method learns to map each task pair to a joint pairwise task-space, which enables sharing information between the tasks in a computationally efficient way through another network conditioned on task pairs, and avoids learning trivial cross-task relations by retaining high-level information about the input image.
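A hedged sketch of this cross-task consistency idea follows. The mapping network, its conditioning scheme, and all names (PairwiseMapper, channel sizes) are illustrative assumptions rather than the thesis's exact design, and the image-feature conditioning that retains high-level input information is omitted for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PairwiseMapper(nn.Module):
    # A single network shared by all task pairs, conditioned on a learned
    # pair embedding; channel sizes and names are illustrative only.
    def __init__(self, n_tasks, in_ch=1, ch=32):
        super().__init__()
        self.n_tasks = n_tasks
        self.pair_emb = nn.Embedding(n_tasks * n_tasks, ch)
        self.conv = nn.Conv2d(in_ch + ch, ch, kernel_size=3, padding=1)

    def forward(self, pred, s, t):
        # pred: (B, in_ch, H, W) dense prediction for one task of the pair
        idx = torch.tensor([s * self.n_tasks + t], device=pred.device)
        e = self.pair_emb(idx).view(1, -1, 1, 1)
        e = e.expand(pred.size(0), -1, pred.size(2), pred.size(3))
        return self.conv(torch.cat([pred, e], dim=1))

def cross_task_consistency(mapper, pred_s, pred_t, s, t):
    # If only task s is labelled for an image, its supervised prediction can
    # still constrain task t through agreement in the joint pairwise space.
    return F.mse_loss(mapper(pred_s, s, t), mapper(pred_t, s, t))

Because one mapper serves every pair, the cost of modelling cross-task relations grows with a single conditioned network rather than with the number of task pairs.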
The final part of the dissertation studies the problem of adapting a model to previously unseen tasks (from seen or unseen domains) with very few labelled training samples of the new tasks, i.e. cross-domain few-shot learning. Recent methods have focused on using various adaptation strategies to align their visual representations with new domains, or on selecting the relevant representations from multiple domain-specific feature extractors. In this dissertation, new methods are formulated that learn a single task-agnostic network from multiple domains during meta-training and attach lightweight task-specific parameters, learned from the limited training samples, that adapt the task-agnostic network to previously unseen tasks. A systematic analysis is performed to study various task adaptation strategies for few-shot learning. Extensive experimental evidence demonstrates that the proposed methods, which learn a single set of task-agnostic representations and adapt them via residual adapters in matrix form attached to the task-agnostic model, significantly benefit cross-domain few-shot learning.
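The adapter mechanism can be sketched as below. This is a minimal illustration under the assumption that the matrix adapter acts as a 1x1 convolution added residually to a frozen backbone layer; the class and argument names are hypothetical.

import torch.nn as nn

class MatrixResidualAdapter(nn.Module):
    # Wraps one frozen layer of the task-agnostic backbone with a 1x1
    # convolution, i.e. a matrix acting on the channel dimension; only the
    # adapter is trained on the few labelled support samples of a new task.
    def __init__(self, frozen_layer, channels):
        super().__init__()
        self.layer = frozen_layer
        for p in self.layer.parameters():
            p.requires_grad = False          # task-agnostic weights stay fixed
        self.alpha = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        nn.init.zeros_(self.alpha.weight)    # adapter starts as the identity

    def forward(self, x):
        h = self.layer(x)
        return h + self.alpha(h)             # residual matrix adaptation

At test time only the alpha parameters would be passed to the optimizer and fitted on the support set, so each new task adds a small number of parameters while the task-agnostic backbone is reused unchanged.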