Unsupervised category-level viewpoint estimation
Mariotti2023.pdf (8.410Mb)
Date
25/04/2023
Item status
Restricted Access
Embargo end date
25/04/2024
Author
Mariotti, Octave
Abstract
Recent progress in deep learning techniques has transformed the field of computer vision,
with tasks like object classification or segmentation now considered almost solved. This,
however, requires sufficiently many labeled samples to train the system, hence research
focus has shifted towards tasks where collecting such data is challenging. Recovering
camera poses is one such task, where labels are typically too costly for supervised approaches.
This work explores solutions to train camera pose estimation systems without
the need for external supervision.
Preliminary assessments show that it is possible to formulate this problem as a self-supervised
reconstruction task. By interpreting a network output as a 3D rotation, and
using this output to control a differentiable rendering operation, gradient descent can be
used to train the network to predict viewpoint information. However, multiple issues arise
when applying such a method naively on complex data. Confounding factors of particular
importance are symmetries, geometry-breaking rendering pipelines and background-induced
noise. This leads to a regime where purely self-supervised training breaks, although
semi-supervised approaches are still successful.
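To illustrate what "interpreting a network output as a 3D rotation" can look like, the sketch below maps six unconstrained numbers to a rotation matrix by Gram-Schmidt orthogonalization. The abstract does not say which rotation parameterization the thesis uses, so this 6D scheme and the helper name `rotation_from_6d` are assumptions chosen for illustration only.

```python
import math


def _normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    if n < 1e-12:
        raise ValueError("cannot normalize a near-zero vector")
    return [x / n for x in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def rotation_from_6d(raw):
    """Map 6 unconstrained numbers to a 3x3 rotation matrix.

    Hypothetical helper (not taken from the thesis): the first three
    numbers give one axis, the last three are orthogonalized against it,
    and the third axis is their cross product. Rows are the axes.
    """
    a, b = list(raw[:3]), list(raw[3:6])
    r1 = _normalize(a)
    proj = _dot(r1, b)
    # Remove the component of b that lies along r1, then normalize.
    r2 = _normalize([bi - proj * ri for bi, ri in zip(b, r1)])
    r3 = _cross(r1, r2)
    return [r1, r2, r3]
```

Every step here is differentiable away from degenerate inputs, which is the property needed for the reconstruction error to send gradients back to the pose prediction.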
Specific solutions to the aforementioned problems are therefore studied and evaluated.
For symmetries, multiple viewpoint predictions are made, and their distribution is further regulated. Two main rendering pipelines are also compared to improve over naive
convolution-based reconstruction: a voxel-based one, and a more recent implicit neural
representation. Experimental evidence shows that carefully crafting a system with these
improvements allows recovery of poses on many everyday objects, such as cars and chairs,
with performances reaching the level of supervised approaches on some categories.
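One common way to use multiple viewpoint predictions is a winner-takes-all objective: only the hypothesis with the lowest reconstruction error is penalized, so symmetric views can each be covered by a different hypothesis. The abstract does not specify how the predictions are combined or how their distribution is regulated, so the sketch below, including the entropy-based usage measure and both function names, is an assumption rather than the thesis's method.

```python
import math


def best_hypothesis_loss(errors):
    """Return (index, error) of the best of several viewpoint hypotheses.

    `errors` holds one reconstruction error per hypothesis for a single
    image (hypothetical input format).
    """
    idx = min(range(len(errors)), key=errors.__getitem__)
    return idx, errors[idx]


def usage_entropy(winners, num_hypotheses):
    """Entropy (in nats) of how often each hypothesis wins over a batch.

    A value near zero means a single hypothesis wins almost always; a
    regularizer could encourage or discourage that, depending on the goal.
    """
    counts = [0] * num_hypotheses
    for w in winners:
        counts[w] += 1
    total = len(winners)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)
```

For example, `best_hypothesis_loss([0.9, 0.2, 0.5])` selects the second hypothesis, and `usage_entropy` over the winning indices of a batch indicates whether the hypotheses are actually being used.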
In addition, this thesis underlines two potential problems in related approaches. First, the
pose retrieval method used in recent implicit representations is unstable and prohibitively
expensive. Second, unsupervised methods suffer from an insidious issue arising from a
combination of dataset biases and naive calibration. As this potentially leads to overestimated
performances, it calls for a more robust evaluation standard, as well as more careful data
gathering.