Stochastic dynamics and partitioned algorithms for model parameterization in deep learning
Item status: Restricted Access
Embargo end date: 16/06/2023
Vlaar, Tiffany Joyce
In this thesis, we study model parameterization for deep learning applications. Part of the mathematical foundation for this work lies in stochastic differential equations and their constrained counterparts. We study their role in deep learning, their properties, and their discretization. On the deep learning theory side, we discuss questions around generalization error, optimization, the structure of neural network loss landscapes, and existing metrics of neural network training. Rather than aiming to exceed state-of-the-art results on benchmark datasets, our work in this area aims to study and tease out underlying properties of neural network optimization, and to use those findings to obtain enhanced generalization performance. Our optimization schemes often draw inspiration from molecular dynamics and statistical physics, and pave the way towards training robust and generalizable neural networks on datasets that arise in the physical sciences.

The contributions of this thesis are as follows:

(1) We illustrate that embedding the loss gradient in a second-order Langevin dynamics framework and using low temperatures leads to more exploration and increased robustness, and, in combination with partitioned integrators, can lead to enhanced generalization performance of neural networks on certain classification tasks.

(2) We provide a general framework for using constrained stochastic differential equations to train deep neural networks. Constraints provide direct control of the parameter space, which allows us to study their effect on generalization directly. We provide a statistical guarantee on the convergence of training, along with detailed implementation schemes for specific constraints (magnitude-based constraints and orthogonality of the weight matrix) and extensive testing.

(3) We illustrate the presence of latent multiple time scales in deep learning applications and introduce the use of multirate techniques for neural network training.
We analyze the convergence properties of our multirate scheme and compare it with vanilla stochastic gradient descent. As the main application, we show that a multirate approach can train deep neural networks for transfer learning applications in half the time, without losing generalization performance.

(4) We re-evaluate existing deep learning metrics. In particular, we study the use of the loss along the linear path between the initial and final parameters of a network as a measure of the loss landscape. We show that caution is needed when using linear interpolation to make broader claims about the shape of the landscape and the success of optimization. We also find that certain neural network layers are more sensitive to the choice of initialization and optimizer hyperparameter settings, and we use these observations to design custom optimization schemes.
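To make the second-order Langevin framework of contribution (1) concrete, the following is a minimal sketch of one underdamped Langevin step in a "BAOAB"-style splitting, applied to a toy quadratic loss. All names, the splitting choice, and the parameter values (`h`, `gamma`, `tau`) are illustrative assumptions, not the thesis' exact scheme or settings.

```python
import numpy as np

# One BAOAB-style step of second-order (underdamped) Langevin dynamics for a
# parameter vector theta with momentum p. The O-step applies friction gamma
# exactly and injects Gaussian noise at temperature tau.
def langevin_step(theta, p, grad_loss, rng, h=0.01, gamma=1.0, tau=1e-4):
    p = p - 0.5 * h * grad_loss(theta)          # B: half kick from the loss gradient
    theta = theta + 0.5 * h * p                 # A: half drift
    c = np.exp(-gamma * h)                      # O: exact Ornstein-Uhlenbeck update
    p = c * p + np.sqrt(tau * (1.0 - c * c)) * rng.standard_normal(theta.shape)
    theta = theta + 0.5 * h * p                 # A: half drift
    p = p - 0.5 * h * grad_loss(theta)          # B: half kick
    return theta, p

# Toy quadratic "loss" L(theta) = 0.5 * ||theta||^2, so grad_loss(theta) = theta.
rng = np.random.default_rng(0)
theta, p = np.ones(3), np.zeros(3)
for _ in range(2000):
    theta, p = langevin_step(theta, p, lambda t: t, rng)
```

At low temperature `tau`, the iterates settle into a small neighborhood of the minimizer; a "partitioned" variant in the thesis' sense would apply different dynamics or parameters to different groups of network weights.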
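As a toy illustration of the orthogonality constraint mentioned in contribution (2), one simple (assumed, not the thesis' implementation) way to keep a weight matrix on the constraint set is to take a plain gradient step and then project back to the nearest orthogonal matrix via its polar factor:

```python
import numpy as np

# Illustrative constrained update: unconstrained gradient step followed by a
# projection onto the set of orthogonal matrices using the SVD-based polar
# factor U @ Vt, which is the closest orthogonal matrix in Frobenius norm.
def constrained_step(W, grad_W, lr=0.1):
    W = W - lr * grad_W                              # unconstrained gradient step
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt                                    # project back onto the constraint

rng = np.random.default_rng(1)
W = np.linalg.qr(rng.standard_normal((4, 4)))[0]     # start on the constraint set
W = constrained_step(W, rng.standard_normal((4, 4)))
```

After every step the iterate satisfies `W.T @ W = I` up to floating-point error, which is the kind of direct control of the parameter space the abstract refers to; the thesis' SDE-based schemes enforce constraints within the stochastic dynamics rather than by naive projection.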
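The multirate idea of contribution (3) can be caricatured as follows: split the parameters into a "fast" group updated every step and a "slow" group updated only every `k` steps with an averaged gradient. This is a hypothetical sketch of the general multirate principle, not the thesis' integrator or its convergence-analyzed scheme.

```python
import numpy as np

# Multirate SGD sketch: fast parameters see every gradient; slow parameters
# accumulate gradients and take one averaged step per k-step macro-cycle.
def multirate_sgd(grads, theta_fast, theta_slow, lr=0.1, k=5, steps=100):
    acc = np.zeros_like(theta_slow)
    for t in range(1, steps + 1):
        g_fast, g_slow = grads(theta_fast, theta_slow)
        theta_fast = theta_fast - lr * g_fast          # fast update, every step
        acc += g_slow                                  # accumulate slow gradient
        if t % k == 0:
            theta_slow = theta_slow - lr * acc / k     # slow update, every k steps
            acc = np.zeros_like(acc)
    return theta_fast, theta_slow

# Toy separable quadratic loss: each group's gradient is the group itself.
tf, ts = multirate_sgd(lambda f, s: (f, s), np.ones(3), np.ones(3))
```

In a transfer learning setting, the pretrained backbone is a natural "slow" group and the new head a natural "fast" group, which is one way such a scheme can cut training cost.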
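The linear-interpolation diagnostic re-evaluated in contribution (4) is simple to state: evaluate the loss along the straight line between the initial and final parameters. A minimal sketch, with a toy quadratic standing in for a network's loss function:

```python
import numpy as np

# Loss along the linear path theta(a) = (1 - a) * theta_init + a * theta_final
# for a in [0, 1], the quantity the thesis re-examines as a landscape measure.
def linear_path_losses(loss, theta_init, theta_final, n=11):
    alphas = np.linspace(0.0, 1.0, n)
    losses = np.array([loss((1.0 - a) * theta_init + a * theta_final)
                       for a in alphas])
    return alphas, losses

loss = lambda t: 0.5 * float(np.dot(t, t))
alphas, losses = linear_path_losses(loss, np.array([2.0, 0.0]), np.zeros(2))
```

For this convex toy the path loss decreases monotonically, but, as the abstract cautions, a well-behaved one-dimensional slice does not by itself license broader claims about the high-dimensional landscape or the success of optimization.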