3D scene graph inference and refinement for vision-as-inverse-graphics

Romaszko, Lukasz

dc.contributor.advisor

Williams, Chris

en

dc.contributor.advisor

Winn, John

en

dc.contributor.advisor

Komura, Taku

en

dc.contributor.author

Romaszko, Lukasz

en

dc.contributor.sponsor

other

en

dc.date.accessioned

2020-04-29T13:56:23Z

dc.date.available

2020-04-29T13:56:23Z

dc.date.issued

2020-06-25

dc.description.abstract

The goal of scene understanding is to interpret images, so as to infer the objects present in a scene, their poses and fine-grained details. This thesis focuses on methods that can provide a much more detailed explanation of the scene than standard bounding boxes or pixel-level segmentation: we infer the underlying 3D scene given only its projection in the form of a single image. We employ the Vision-as-Inverse-Graphics (VIG) paradigm, which (a) infers the latent variables of a scene such as the objects present and their properties as well as the lighting and the camera, and (b) renders these latent variables to reconstruct the input image. One highly attractive aspect of the VIG approach is that it produces a compact and interpretable representation of the 3D scene in terms of an arbitrary number of objects, called a 'scene graph'. This representation is of key importance, as it is useful if we wish, for example, to edit, refine or interpret the scene, or to interact with it. First, we investigate how recognition models can be used to infer the scene graph given only a single RGB image. These models are trained using realistic synthetic images and corresponding ground truth scene graphs, obtained from a rich stochastic scene generator. Once the objects have been detected, each object detection is further processed using neural networks to predict the object and global latent variables. This allows object poses and sizes to be computed in 3D scene coordinates, given the camera parameters. This inference of the latent variables in the form of a 3D scene graph acts like the encoder of an autoencoder, with graphics rendering as the decoder. One of the major challenges is the problem of placing the detected objects in 3D at a reasonable size and distance with respect to the single camera, the parameters of which are unknown. Previous VIG approaches for multiple objects usually only considered a fixed camera, while we allow for variable camera pose.
To infer the camera parameters given the votes cast by the detected objects, we introduce a Probabilistic HoughNets framework for combining probabilistic votes, robustified with an outlier model. Each detection provides one noisy low-dimensional manifold in the Hough space, and by intersecting them probabilistically we reduce the uncertainty on the camera parameters. Given an initialization of a scene graph, its refinement typically involves a computationally expensive and inefficient search through the latent space. Since optimization of the 3D scene corresponding to an image is a challenging task even for a few latent variables (LVs), previous work for multi-object scenes considered only refinement of the geometry, but not the appearance or illumination. To overcome this issue, we develop a framework called 'Learning Direct Optimization' (LiDO) for optimization of the latent variables of a multi-object scene. Instead of minimizing an error metric that compares the observed image with the render, this optimization is driven by neural networks that make use of the auto-context in the form of a current scene graph and its render to predict the LV update. Our experiments show that the LiDO method converges rapidly as it does not need to perform a search on the error landscape, produces better solutions than error-based competitors, and is able to handle the mismatch between the data and the fitted scene model. We apply LiDO to a realistic synthetic dataset, and show that the method transfers to work well with real images. The advantages of LiDO mean that it could be a critical component in the development of future vision-as-inverse-graphics systems.
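
The idea of combining probabilistic votes with an outlier model can be illustrated with a small toy sketch. This is not the thesis's Probabilistic HoughNets implementation: it assumes a single scalar camera parameter (a hypothetical elevation angle), Gaussian votes with made-up means and standard deviations, and a uniform outlier component, whereas the abstract describes each vote as a low-dimensional manifold in a higher-dimensional Hough space. The function name `combine_votes` and all numbers are illustrative assumptions.

```python
import numpy as np


def combine_votes(grid, vote_means, vote_stds, outlier_prob=0.1):
    """Combine per-detection votes over a 1D parameter grid (toy example).

    Each vote is modelled as a two-component mixture: a Gaussian around the
    detection's estimate (inlier) and a uniform density over the grid range
    (outlier). Votes are multiplied together, i.e. log densities are summed.
    """
    grid = np.asarray(grid, dtype=float)
    uniform = 1.0 / (grid[-1] - grid[0])  # density of the outlier component
    log_post = np.zeros_like(grid)
    for mu, sigma in zip(vote_means, vote_stds):
        gauss = np.exp(-0.5 * ((grid - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
        log_post += np.log((1.0 - outlier_prob) * gauss + outlier_prob * uniform)
    log_post -= log_post.max()            # subtract the max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()              # normalise to a discrete distribution


# Hypothetical votes for a camera elevation angle in degrees.
grid = np.linspace(0.0, 90.0, 901)
means = [30.0, 32.0, 29.0, 75.0]          # the last vote is a deliberate outlier
stds = [3.0, 3.0, 3.0, 3.0]

posterior = combine_votes(grid, means, stds)
estimate = grid[np.argmax(posterior)]
print(f"MAP estimate: {estimate:.1f} degrees")
```

In this toy setting the uniform component keeps the single outlying vote from dragging the estimate away from the three agreeing votes; a plain product of Gaussians would instead settle near the mean of all four.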

en

dc.identifier.uri

https://hdl.handle.net/1842/37006

dc.identifier.uri

http://dx.doi.org/10.7488/era/307

dc.language.iso

en

dc.publisher

The University of Edinburgh

en

dc.relation.hasversion

Romaszko, L., Williams, C. K. I., Moreno, P., and Kohli, P. (2017). Vision-as-Inverse-Graphics: Obtaining a Rich 3D Explanation of a Scene from a Single Image. In ICCV 2017 Geometry Meets Deep Learning Workshop, pages 851–859.

en

dc.relation.hasversion

Romaszko, L., Williams, C. K. I., and Winn, J. (2020). Learning Direct Optimization for Scene Understanding. Pattern Recognition.

en

dc.subject

computer vision

en

dc.subject

scene understanding

en

dc.subject

3D reconstruction

en

dc.subject

inverse graphics

en

dc.subject

object recognition

en

dc.subject

scene graph

en

dc.subject

analysis-by-synthesis

en

dc.subject

graphics

en

dc.title

3D scene graph inference and refinement for vision-as-inverse-graphics

en

dc.type

Thesis or Dissertation

en

dc.type.qualificationlevel

Doctoral

en

dc.type.qualificationname

PhD Doctor of Philosophy

en

Files

Original bundle

Name: Romaszko2020_Redacted.pdf
Size: 21.31 MB
Format: Adobe Portable Document Format

Name: Romaszko2020.pdf
Size: 21.26 MB
Format: Adobe Portable Document Format

This item appears in the following Collection(s)

Informatics thesis and dissertation collection