Video object segmentation and applications in temporal alignment and aspect learning
Computer vision has recently seen significant progress in learning visual concepts from examples. This progress has been fuelled by recent models of visual appearance as well as recently collected large-scale datasets of manually annotated still images. Video is a promising alternative, as it inherently contains much richer information than still images. For instance, in video we can observe an object move, which allows us to differentiate it from its surroundings, or we can observe a smooth transition between different viewpoints of the same object instance. This richness of information allows us to effectively tackle tasks that would be very difficult if we only considered still images, and even to address tasks that are video-specific. Our first contribution is a computationally efficient technique for video object segmentation. Our method relies solely on motion to rapidly create a rough initial estimate of the foreground object. This estimate is then refined through an energy formulation to be spatio-temporally smooth. The method is able to handle rapidly moving backgrounds and objects, as well as non-rigid deformations and articulations, without prior knowledge of the object's appearance, size or location. In addition to this class-agnostic method, we present a class-specific method that incorporates additional appearance cues when the class of the foreground object is known in advance (e.g. a video of a car). For our second contribution, we propose a novel model for temporal video alignment with regard to the viewpoint of the foreground object (i.e., a pair of aligned frames shows the same object viewpoint). Our work relies on our video object segmentation technique to automatically localise the foreground objects and extract appearance measurements solely from them instead of the background.
Our model is able to temporally align realistic videos, where events may occur in a different order, or occur in only one of the videos. This is in contrast to previous work, which typically assumes that the videos show a scripted sequence of events and can simply be aligned by stretching or compressing one of the videos. As a final contribution, we once again use our video object segmentation technique as a basis for automatic visual aspect discovery from videos of an object class. Compared to previous work, we use a broader definition of an aspect that considers four factors of variation: viewpoint, articulated pose, occlusions and cropping by the image border. We pose the aspect discovery task as a clustering problem and provide an extensive experimental exploration of the benefits of object segmentation for this task.