Abstract
Information maximization is a common framework for unsupervised learning,
which may be used to extract informative representations y of observed
patterns x. The key idea is to maximize mutual information (MI), a formal
measure of coding efficiency. Unfortunately, exact maximization of MI is
computationally tractable only in a few special cases; more generally, approximations
need to be considered. Here we describe a family of variational lower bounds
on mutual information which gives rise to a formal and theoretically rigorous
approach to information maximization in large-scale stochastic channels.
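To sketch the idea (in notation we introduce here purely for illustration): for a source x, a channel p(y|x), and an arbitrary variational decoder q(x|y), one simple bound of this kind can be written as

  I(x, y) = H(x) - H(x|y) \ge H(x) + \langle \ln q(x|y) \rangle_{p(x,y)},

which holds for any choice of q(x|y) by non-negativity of the Kullback-Leibler divergence, with equality when q(x|y) matches the exact posterior p(x|y).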
We hope the results presented in this work will be of interest for information
maximization from several perspectives. First of all, our method optimizes a
proper lower bound, rather than a surrogate objective or an approximation
of MI (which may be accurate only under specific asymptotic assumptions,
and weak or even undefined when those assumptions are violated). Secondly, the
flexibility in the choice of the variational distribution makes it possible to generalize
and improve simple bounds on MI. For example, we may introduce tractable
auxiliary variational bounds on MI, which may be used to improve on any simple
generic approach without altering the properties of the original channel. Thirdly, the
proposed variational framework is typically simpler than standard variational
approaches to maximizing the conditional likelihood in stochastic autoencoder
models, while it leads to the same fixed points in its simplest formulation; this
gives rise to more efficient optimization procedures. Finally, in some cases the
variational framework results in optimization procedures that require only local
computations, which may be particularly attractive from a neurobiological
perspective. Possibly the most important contribution of this work is a rigorous
and general framework for maximizing mutual information in intrinsically
intractable channels. We show that it gives rise to simple, stable, and easily generalizable optimization procedures, which outperform and supersede many of the
common approximate information-maximizing techniques. We demonstrate our
results by considering clustering, dimensionality reduction, and binary stochastic
coding problems, and discuss a link to approximate statistical inference.
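To make the alternating structure of such procedures concrete, the following is a minimal numerical sketch, not the implementation used in this work: it assumes a small discrete channel p(y|x) parameterized by softmax logits, in which the exact posterior is tractable and can serve as the variational decoder q(x|y); all names and sizes are illustrative.

import numpy as np

rng = np.random.default_rng(0)
Kx, Ky = 8, 3                                   # illustrative alphabet sizes for source x and code y
p_x = rng.dirichlet(np.ones(Kx))                # assumed-given source distribution p(x)
W = 0.01 * rng.standard_normal((Kx, Ky))        # channel logits: p(y|x) = softmax over y of W[x, :]

def channel(W):
    e = np.exp(W - W.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)     # rows are p(y|x)

for _ in range(200):
    p_y_x = channel(W)
    # decoder step: set q(x|y) to the exact posterior (tractable in this toy case)
    joint = p_x[:, None] * p_y_x                # p(x, y)
    q_x_y = joint / joint.sum(axis=0, keepdims=True)
    # channel step: gradient ascent on the bound H(x) + <ln q(x|y)> with q held fixed
    log_q = np.log(q_x_y + 1e-12)
    baseline = (p_y_x * log_q).sum(axis=1, keepdims=True)
    W += 1.0 * p_x[:, None] * p_y_x * (log_q - baseline)

p_y_x = channel(W)
joint = p_x[:, None] * p_y_x
q_x_y = joint / joint.sum(axis=0, keepdims=True)
H_x = -(p_x * np.log(p_x)).sum()
print("variational lower bound on I(x; y):", H_x + (joint * np.log(q_x_y + 1e-12)).sum())

For channels where the exact posterior is intractable, the decoder step would instead fit a tractable q(x|y); the same lower bound remains valid for any such choice.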