Programming language semantics as a foundation for Bayesian inference
Item Status
Embargo End Date
Date
Authors
Abstract
Bayesian modelling, in which our prior belief about the distribution on model parameters
is updated by observed data, is a popular approach to statistical data analysis.
However, writing specific inference algorithms for Bayesian models by hand is time-consuming
and requires significant machine learning expertise.
Probabilistic programming promises to make Bayesian modelling easier and more
accessible by letting the user express a generative model as a short computer program
(with random variables), leaving inference to the generic algorithm provided by the
compiler of the given language. However, it is not easy to design a probabilistic programming
language correctly and define the meaning of programs expressible in it.
Moreover, the inference algorithms used by probabilistic programming systems usually
lack formal correctness proofs and bugs have been found in some of them, which
limits the confidence one can have in the results they return.
In this work, we apply ideas from the areas of programming language theory and
statistics to show that probabilistic programming can be a reliable tool for Bayesian
inference. The first part of this dissertation concerns the design, semantics and type
system of a new, substantially enhanced version of the Tabular language. Tabular is a
schema-based probabilistic language, which means that instead of writing a full program,
the user only has to annotate the columns of a schema with expressions generating
corresponding values. By adopting this paradigm, Tabular aims to be user-friendly,
but this unusual design also makes it harder to define the syntax and semantics correctly
and reason about the language. We define the syntax of a version of Tabular extended
with user-defined functions and pseudo-deterministic queries, design a dependent type
system for this language and endow it with a precise semantics. We also extend Tabular
with a concise formula notation for hierarchical linear regressions, define the type
system of this extended language and show how to reduce it to pure Tabular.
In the second part of this dissertation, we present the first correctness proof for a
Metropolis-Hastings sampling algorithm for a higher-order probabilistic language. We
define a measure-theoretic semantics of the language by means of an operationally-defined
density function on program traces (sequences of random variables) and a map
from traces to program outputs. We then show that the distribution of samples returned
by our algorithm (a variant of “Trace MCMC” used by the Church language) matches
the program semantics in the limit.
This item appears in the following Collection(s)

