Vision as inverse graphics for detailed scene understanding
Moreno Comellas, Pol
An image of a scene can be described by the shape, pose and appearance of the objects within it, as well as the illumination and the camera that captured it. A fundamental goal in computer vision is to recover such descriptions from an image. These representations are useful for tasks such as autonomous robotic interaction with an environment, but obtaining them can be very challenging due to the large variability of objects present in natural scenes. A long-standing approach in computer vision is to use generative models of images in order to infer the descriptions that generated an image; such methods are referred to as “vision as inverse graphics” or simply “inverse graphics”. We apply this approach to scene understanding by using a generative model (GM) in the form of a graphics renderer. Since searching over scene factors for the best match to an image is very inefficient, we use convolutional neural networks, which we refer to as recognition models (RMs), trained on synthetic data to initialize the search.
First we address the effect that occlusions have on the performance of predictive models of images. We propose an inverse graphics approach to predicting the shape, pose, appearance and illumination of a foreground object with a GM that includes an outlier model to account for occlusions. We study how the inferences are affected by the degree of occlusion of the foreground object, and show that such a robust GM performs significantly better than a non-robust model. We then characterize the performance of the RM and the gains obtained by refining its predictions with the robust GM, using a new synthetic dataset that includes background clutter and occlusions. We find that pose and shape are predicted very well by the RM, but appearance and especially illumination less so; however, accuracy on these latter two factors can be clearly improved with the generative model.
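The robust GM described above can be sketched as a per-pixel mixture likelihood: each pixel is explained either by the rendered foreground object (a Gaussian inlier component) or by an occluder or background clutter (a uniform outlier component), so that mismatched occluded pixels incur only a bounded penalty. The function name and parameter values below are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def robust_nll(image, render, sigma=0.05, outlier_prob=0.1):
    """Robust negative log-likelihood of an image under a render.

    Each pixel (intensities in [0, 1]) is modelled as a mixture of:
      - an inlier Gaussian centred on the rendered value, and
      - a uniform outlier density accounting for occlusions.
    With outlier_prob = 0 this reduces to a non-robust Gaussian model.
    """
    inlier = ((1.0 - outlier_prob)
              / (np.sqrt(2.0 * np.pi) * sigma)
              * np.exp(-0.5 * ((image - render) / sigma) ** 2))
    outlier = outlier_prob * 1.0  # uniform density on [0, 1]
    return -np.log(inlier + outlier).sum()
```

Under this model an occluded region contributes at most roughly `-log(outlier_prob)` per pixel, so the search over scene factors is not dragged off the true object by pixels the render cannot explain.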
Next we apply our inverse graphics approach to scenes with multiple objects. We propose an efficient and differentiable method for modelling self-shadowing, which improves the realism of the GM renders, and a way of rendering object occlusion boundaries that yields more accurate gradients of the rendering function. We evaluate these improvements on a dataset with multiple objects and show that the GM refinement step clearly improves on the RM predictions for the latent variables of shape, pose, appearance and illumination. Finally we tackle the task of learning generative models of 3D objects from a collection of meshes. We present a latent variable architecture that learns to capture the underlying factors of shape and appearance separately. To do so we first transform the meshes of a given class into a data representation that sidesteps the need for landmark correspondences across meshes when learning the GM. The usefulness of the learned disentangled latent representation is demonstrated via an experiment in which the appearance of one object is transferred onto the shape of another.
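The appearance-transfer experiment can be sketched as follows: if the learned decoder maps a shape code and an appearance code to geometry and per-vertex colour respectively, then swapping in another object's appearance code changes the colours while leaving the geometry fixed. The linear decoder below is a stand-in for the learned networks in the thesis; all sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in decoder weights: separate heads for geometry and colour.
W_shape = rng.normal(size=(8, 30))  # shape code -> 10 vertices (x, y, z)
W_app = rng.normal(size=(8, 30))    # appearance code -> 10 RGB colours

def decode(z_shape, z_app):
    """Decode disentangled latents into (vertices, colours)."""
    vertices = z_shape @ W_shape  # depends only on the shape code
    colours = z_app @ W_app       # depends only on the appearance code
    return vertices, colours

# Appearance transfer: geometry of object A with the colours of object B.
z_shape_a, z_app_a = rng.normal(size=8), rng.normal(size=8)
z_shape_b, z_app_b = rng.normal(size=8), rng.normal(size=8)
verts, cols = decode(z_shape_a, z_app_b)
```

Because each latent factor feeds only its own decoder head, the transfer is exact by construction; in the thesis the separation is instead learned from the mesh collection.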