Higher-order damping mechanisms with applications in optimisation and machine Learning
Item Status
Embargo End Date
Date
Authors
Karoni, Aikaterini
Abstract
The main focus of this thesis is the development of new optimisation methods. The design of our methods is guided by a dynamical systems perspective and utilises concepts commonly used in molecular dynamics such as thermal controls, which regulate the temperature (i.e. the mean kinetic energy) of the system. This idea, of introducing a feedback loop based on the integral of the squared speed, is rooted in the work of James Clerk Maxwell [84] on governors and control theory. In our work, the mechanism by which we influence the physical system is dissipation, controlled by an adaptive friction variable; this is tantamount to varying the so-called “momentum” hyperparameter in optimisation algorithms. Increasing the friction has the effect of cooling the system more rapidly.
Theoretical convergence guarantees are provided for both the continuous and discrete optimiser dynamics. We test our family of optimisers on a two-dimensional Rosenbrock function, which is a “stiff” landscape consisting of a high-gradient region and a narrow valley, making convergence of standard methods such as momentum gradient descent challenging. We then turn our attention to particle cluster optimisation problems arising in molecular dynamics where we are able to show that our methods outperform momentum gradient descent for most parameters. We show that adapting the friction parameter based on the integral of the kinetic energy is essentially equivalent to adding a cubic damping term to the standard equations of motion of linearly damped Hamiltonian dynamics.
This last observation further motivates the introduction of two new optimisers, presented in Chapter 4. We take the continuous dynamics of two popular optimisation methods, Adam [17] and mSGD [122] and augment them by adding a cubic damping term to the momentum equation for each optimiser. Momemtum Stochastic Gradient Descent (mSGD) is an alternative to Stochastic Gradient Descent which is adapted to a kinetic framework (based on a Hamiltonian system formulation with positions and momenta), and includes a linear dissipation mechanism in the momenta which forces convergence to a minimizer. Adam [61] uses the same dissipation mechanism as mSGD but further includes an adaptive step size mechanism designed to accelerate convergence in low curvature regions by adapting the step depending on a recent history of the squares of the gradient components. We provide theoretical convergence guarantees for our methods and show that introducing a cubic damping mechanism into the dynamics of Adam and mSGD can lead to improved performance on several standard machine learning benchmarks.
The final part of this thesis concerns itself with continual learning mechanisms. The ability of a model to learn in an incremental fashion is key for many real-world applications. Continual learning refers to a setting where a model is updated as new data becomes available, while preserving knowledge acquired from previous tasks. The challenge one faces in such a scenario is to maintain old knowledge, even if old data is no longer available, or retraining on it is not computationally viable. Elastic weight consolidation (EWC) [62] is a popular continual learning method that uses a regularisation approach in order to constrain the change of model parameters that are deemed important for preserving previous knowledge. EWC uses the Fisher information matrix (FIM) to quantify the importance of each of the model parameters and then uses the corresponding Fisher information matrix entry as a regularisation coefficient for each model parameter. The work presented here draws inspiration from EWC and the use of the Fisher information matrix to quantify parameter importance. However, instead of incrementing the loss function by a FIM-based regularisation penalty, the idea behind the method presented is to update each of the model parameters using a different learning rate, which is inversely proportional to the respective FIM value. Thus, parameters that correspond to a higher FIM value and are therefore more important, are updated at a slower rate than parameters with a lower Fisher value.
This item appears in the following Collection(s)

