Edinburgh Research Archive

3D scene graph inference and refinement for vision-as-inverse-graphics

dc.contributor.advisor
Williams, Chris
en
dc.contributor.advisor
Winn, John
en
dc.contributor.advisor
Komura, Taku
en
dc.contributor.author
Romaszko, Lukasz
en
dc.contributor.sponsor
other
en
dc.date.accessioned
2020-04-29T13:56:23Z
dc.date.available
2020-04-29T13:56:23Z
dc.date.issued
2020-06-25
dc.description.abstract
The goal of scene understanding is to interpret images, so as to infer the objects present in a scene, their poses and fine-grained details. This thesis focuses on methods that can provide a much more detailed explanation of the scene than standard bounding boxes or pixel-level segmentation: we infer the underlying 3D scene given only its projection in the form of a single image. We employ the Vision-as-Inverse-Graphics (VIG) paradigm, which (a) infers the latent variables of a scene, such as the objects present and their properties as well as the lighting and the camera, and (b) renders these latent variables to reconstruct the input image. One highly attractive aspect of the VIG approach is that it produces a compact and interpretable representation of the 3D scene in terms of an arbitrary number of objects, called a 'scene graph'. This representation is of key importance, as it is useful, e.g., if we wish to edit, refine or interpret the scene, or interact with it. First, we investigate how recognition models can be used to infer the scene graph given only a single RGB image. These models are trained using realistic synthetic images and corresponding ground truth scene graphs, obtained from a rich stochastic scene generator. Once the objects have been detected, each object detection is further processed using neural networks to predict the object and global latent variables. This allows object poses and sizes to be computed in 3D scene coordinates, given the camera parameters. This inference of the latent variables in the form of a 3D scene graph acts like the encoder of an autoencoder, with graphics rendering as the decoder. One of the major challenges is the problem of placing the detected objects in 3D at a reasonable size and distance with respect to the single camera, the parameters of which are unknown. Previous VIG approaches for multiple objects usually only considered a fixed camera, while we allow for variable camera pose.
To infer the camera parameters given the votes cast by the detected objects, we introduce a Probabilistic HoughNets framework for combining probabilistic votes, robustified with an outlier model. Each detection provides one noisy low-dimensional manifold in the Hough space, and by intersecting them probabilistically we reduce the uncertainty on the camera parameters. Given an initialization of a scene graph, its refinement typically involves a computationally expensive and inefficient search through the latent space. Since optimization of the 3D scene corresponding to an image is a challenging task even for a few latent variables (LVs), previous work for multi-object scenes considered only refinement of the geometry, but not the appearance or illumination. To overcome this issue, we develop a framework called 'Learning Direct Optimization' (LiDO) for optimization of the latent variables of a multi-object scene. Instead of minimizing an error metric that compares the observed image with the render, this optimization is driven by neural networks that make use of auto-context, in the form of the current scene graph and its render, to predict the LV update. Our experiments show that the LiDO method converges rapidly as it does not need to perform a search on the error landscape, produces better solutions than error-based competitors, and is able to handle the mismatch between the data and the fitted scene model. We apply LiDO to a realistic synthetic dataset, and show that the method transfers to work well with real images. The advantages of LiDO mean that it could be a critical component in the development of future vision-as-inverse-graphics systems.
en
dc.identifier.uri
https://hdl.handle.net/1842/37006
dc.identifier.uri
http://dx.doi.org/10.7488/era/307
dc.language.iso
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Romaszko, L., Williams, C. K. I., Moreno, P., and Kohli, P. (2017). Vision-as-Inverse-Graphics: Obtaining a Rich 3D Explanation of a Scene from a Single Image. In ICCV 2017 Geometry Meets Deep Learning Workshop, pages 851–859.
en
dc.relation.hasversion
Romaszko, L., Williams, C. K. I., and Winn, J. (2020). Learning Direct Optimization for Scene Understanding. Pattern Recognition.
en
dc.subject
computer vision
en
dc.subject
scene understanding
en
dc.subject
3D reconstruction
en
dc.subject
inverse graphics
en
dc.subject
object recognition
en
dc.subject
scene graph
en
dc.subject
analysis-by-synthesis
en
dc.subject
graphics
en
dc.title
3D scene graph inference and refinement for vision-as-inverse-graphics
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en

Files

Original bundle

Name:
Romaszko2020_Redacted.pdf
Size:
21.31 MB
Format:
Adobe Portable Document Format
Name:
Romaszko2020.pdf
Size:
21.26 MB
Format:
Adobe Portable Document Format