Video object segmentation and applications in temporal alignment and aspect learning
Date: 29/11/2016
Author: Papazoglou, Anestis
Abstract
Modern computer vision has recently seen significant progress in learning visual concepts
from examples. This progress has been fuelled by new models of visual appearance
as well as by recently collected large-scale datasets of manually annotated still
images. Video is a promising alternative, as it inherently contains much richer information
than still images. For instance, in video we can observe an object move,
which allows us to differentiate it from its surroundings, or we can observe a smooth
transition between different viewpoints of the same object instance. This richness of
information allows us to effectively tackle tasks that would otherwise be very difficult
if we only considered still images, or even to address tasks that are video-specific.
Our first contribution is a computationally efficient technique for video object segmentation.
Our method relies solely on motion to rapidly produce a rough initial
estimate of the foreground object. This rough initial estimate is then refined through
an energy formulation to be spatio-temporally smooth. The method can handle
rapidly moving backgrounds and objects, as well as non-rigid deformations and articulations,
without prior knowledge of the object's appearance, size or location.
In addition to this class-agnostic method, we present a class-specific method that incorporates
additional appearance cues when the class of the foreground
object is known in advance (e.g. a video of a car).
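To make the motion-only first stage concrete, here is a minimal sketch in Python, assuming dense optical flow from OpenCV's Farneback implementation; the gradient threshold, the morphological clean-up and all parameter values are illustrative assumptions rather than the thesis's actual pipeline, and the subsequent spatio-temporal energy refinement is not shown.

```python
import cv2
import numpy as np

def rough_foreground_estimate(prev_gray, curr_gray, boundary_thresh=1.0):
    """Rough, motion-only foreground estimate for one pair of frames.

    prev_gray, curr_gray: consecutive grayscale (uint8) frames.
    The threshold and morphological clean-up are illustrative; in the
    thesis the rough estimate is subsequently refined with a
    spatio-temporally smooth energy formulation (not shown here).
    """
    # Dense optical flow between the two frames (Farneback method).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

    # Motion boundaries: large spatial gradients of the flow field
    # tend to trace the outline of an independently moving object.
    du_dy, du_dx = np.gradient(flow[..., 0])
    dv_dy, dv_dx = np.gradient(flow[..., 1])
    boundary_strength = np.sqrt(du_dx**2 + du_dy**2 + dv_dx**2 + dv_dy**2)

    # Crude initial estimate: pixels on or near strong motion boundaries;
    # a full pipeline would also fill in the enclosed region.
    rough_mask = (boundary_strength > boundary_thresh).astype(np.uint8)
    rough_mask = cv2.morphologyEx(rough_mask, cv2.MORPH_CLOSE,
                                  np.ones((9, 9), np.uint8))
    return rough_mask
```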
For our second contribution, we propose a novel model for temporal video alignment
with regard to the viewpoint of the foreground object (i.e., a pair of aligned
frames shows the same object viewpoint). Our work relies on our video object segmentation
technique to automatically localise the foreground objects and extract appearance
measurements solely from them rather than from the background. Our model is able
to temporally align realistic videos, where events may occur in a different order, or
occur in only one of the videos. This is in contrast to previous works, which typically
assume that the videos show a scripted sequence of events and can simply be aligned
by stretching or compressing one of the videos.
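As a rough illustration of alignment without a monotonic time warp, the sketch below pairs frames of two videos by the similarity of foreground-only appearance descriptors; the descriptor representation, the nearest-neighbour matching and the distance threshold are simplifying assumptions for the example, not the model proposed in the thesis.

```python
import numpy as np

def align_by_viewpoint(desc_a, desc_b, max_dist=None):
    """Pair frames of two videos by foreground appearance similarity.

    desc_a: (Na, D) array, one descriptor per frame of video A,
            computed from the segmented foreground object only.
    desc_b: (Nb, D) array, the same for video B.

    Unlike dynamic time warping, no monotonic ordering is imposed,
    so events may appear in a different order or be missing entirely.
    Returns a list of (i, j) matches, or (i, None) when no frame of B
    is close enough (if max_dist is given).
    """
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        j = int(np.argmin(dists))
        if max_dist is not None and dists[j] > max_dist:
            matches.append((i, None))   # event present only in video A
        else:
            matches.append((i, j))
    return matches
```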
As a final contribution, we once again use our video object segmentation technique
as a basis for automatic visual aspect discovery from videos of an object class. Compared
to previous works, we use a broader definition of an aspect that considers four
factors of variation: viewpoint, articulated pose, occlusions and cropping by the image
border. We pose aspect discovery as a clustering problem and provide an
extensive experimental exploration of the benefits of object segmentation for this task.
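As an illustration of posing aspect discovery as clustering, the following sketch groups foreground-only frame descriptors with k-means; the choice of k-means, the number of clusters and the descriptor input are assumptions made for the example, not the thesis's exact formulation.

```python
import numpy as np
from sklearn.cluster import KMeans

def discover_aspects(frame_descriptors, n_aspects=10, seed=0):
    """Cluster foreground-only frame descriptors into visual aspects.

    frame_descriptors: (N, D) array, one descriptor per frame, computed
    on the segmented object so that background variation does not
    dominate the clustering. Each resulting cluster is treated as one
    aspect of the object class (e.g. a viewpoint/pose/truncation mode).
    """
    X = np.asarray(frame_descriptors, dtype=np.float32)
    km = KMeans(n_clusters=n_aspects, random_state=seed, n_init=10)
    labels = km.fit_predict(X)
    return labels, km.cluster_centers_
```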