Learning to represent, model and generate the world
Authors
Anciukevičius, Titas
Abstract
Every agent acting in the world receives sensory signals from which it must extract the information necessary for intelligent action. While neural networks can be trained to interpret these signals with labelled supervision, this thesis presents unsupervised algorithms for learning to represent, infer, and generate the physical world around us. Our core idea is to develop methods for learning latent-variable generative models of images and videos that support generation and the inference of useful latent representations, even from observations substantially different from those encountered during learning. In the first part of this thesis, we introduce a novel generative model based on denoising diffusion probabilistic models that learns a prior over underlying 3D representations despite being trained only on 2D images. The core insight is that the model can be trained to denoise images by rendering a 3D scene. We demonstrate that the learned prior allows us to sample 3D representations from the true underlying distribution. In the second part, we introduce a new neural representation for unbounded scenes and extend the denoising-by-rendering framework to support the reconstruction of 3D representations from one or a few input images. We demonstrate that our model can sample a diverse set of 3D representations that explain a sparse visual conditioning signal. In the final part of the thesis, we explore how latent generative models can infer useful representations of the world from unfamiliar observations, i.e. generalise outside their training distribution. The core insight is to learn the underlying causal, compositional and generative mechanisms, including how images are formed, and then exploit them in out-of-distribution scenarios, such as understanding images taken from unfamiliar viewpoints. We demonstrate that, where classical machine learning models fail, our framework can infer useful representations even when the input image is produced by a different data-generation mechanism than the one used during training. Overall, our contributions advance the field by providing approaches to learning generative models that can reason about the complex physical world from raw sensory inputs, even in unfamiliar situations.