Neural distribution estimation as a two-part problem
Given a dataset of examples, distribution estimation is the task of approximating the underlying probability distribution from which those examples are assumed to have been drawn. Neural distribution estimation relies on the powerful function-approximation capabilities of deep neural networks to build models for this purpose, and excels when data are high-dimensional and exhibit complex, nonlinear dependencies. In this thesis, we explore several approaches to neural distribution estimation, and present a unified perspective on these methods based on a two-part design principle. In particular, we examine how many models first break down the task of distribution estimation into a series of tractable sub-tasks, and then fit a multi-step generative process which combines solutions to these sub-tasks in order to approximate the data distribution of interest. Framing distribution estimation as a two-part problem provides a shared language in which to compare and contrast prevalent models in the literature, and also allows for discussion of alternative approaches which do not follow this structure.

We first present the Autoregressive Energy Machine, an energy-based model trained by approximate maximum likelihood through an autoregressive decomposition. The method demonstrates the flexibility of an energy-based model over an explicitly normalized model, and its novel application of autoregressive importance sampling highlights the benefit of an autoregressive approach to distribution estimation, which recursively transforms the problem into a series of univariate tasks.

Next, we present Neural Spline Flows, a class of normalizing flow models based on monotonic spline transformations which admit both an explicit inverse and a tractable Jacobian determinant.
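The autoregressive reduction used by the Autoregressive Energy Machine can be sketched compactly. In the notation below (which is ours, not fixed by the text), a D-dimensional density factorizes into univariate conditionals, each conditional is an unnormalized energy model, and its normalizing constant is estimated by importance sampling under a proposal q:

```latex
p(\mathbf{x}) = \prod_{d=1}^{D} p(x_d \mid \mathbf{x}_{<d}),
\qquad
p(x_d \mid \mathbf{x}_{<d}) = \frac{\exp\{-E_d(x_d;\, \mathbf{x}_{<d})\}}{Z_d(\mathbf{x}_{<d})},
\qquad
Z_d(\mathbf{x}_{<d}) \approx \frac{1}{S} \sum_{s=1}^{S}
  \frac{\exp\{-E_d(x^{(s)};\, \mathbf{x}_{<d})\}}{q(x^{(s)} \mid \mathbf{x}_{<d})},
\quad x^{(s)} \sim q(\cdot \mid \mathbf{x}_{<d}).
```

Each sub-task is univariate, so the normalizing constant that is intractable for a joint energy-based model reduces to a series of one-dimensional integrals, each amenable to importance sampling.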
Normalizing flows tackle distribution estimation by searching for an invertible map between the data distribution and a more tractable base distribution, and this map is typically constructed as the composition of a series of invertible building blocks. We demonstrate that spline flows can be used to enhance density estimation of tabular data, variational inference in latent-variable models, and generative modeling of natural images.

The third chapter presents Maximum Likelihood Training of Score-Based Diffusion Models. Generative models based on estimating the gradient of the log probability density, known as the score function, have recently gained traction as a powerful modeling paradigm, in which the data distribution is gradually transformed toward a tractable base distribution by means of a stochastic process. The chapter illustrates how this class of models can be trained by maximum likelihood, resulting in a model which is functionally equivalent to a continuous normalizing flow, and which bridges a gap between two branches of the literature. We also discuss latent-variable generative models more broadly, of which diffusion models are a structured special case.

Finally, we present On Contrastive Learning for Likelihood-Free Inference, a unifying perspective on likelihood-free inference methods which perform Bayesian inference using either density estimation or density-ratio estimation. Likelihood-free inference targets stochastic simulator models in which the likelihood of parameters given observations is computationally intractable, so that traditional inference methods fall short. In addition to illustrating the power of normalizing flows as generic tools for density estimation, this chapter also gives us the opportunity to discuss likelihood-free models more broadly.
These so-called implicit generative models form a large part of the distribution estimation literature under the umbrella of generative adversarial networks, and are distinct in that they treat distribution estimation as a one-part problem.
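As a point of reference for the chapters summarized above, the two central objects can be written compactly (the notation here is ours, not fixed by the text): a normalizing flow with invertible map f and base density p_z defines a model density via the change-of-variables formula, while score-based diffusion models estimate the score function of the data density:

```latex
p_{\mathbf{x}}(\mathbf{x})
  = p_{\mathbf{z}}\!\big(f(\mathbf{x})\big)
    \left| \det \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} \right|,
\qquad
f = f_K \circ \cdots \circ f_1,
\qquad
s(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x}).
```

Each building block f_k must be invertible with a tractable Jacobian determinant (for instance, a monotonic spline transformation in Neural Spline Flows), so that both the inverse and the determinant of the composition remain computable.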