Efficient neural networks
Date: 23/08/2022
Author: Turner, Jack
Abstract
Improving the efficiency of neural networks has great potential impact due to their wide range of possible use cases and their high levels of arithmetic intensity. As neural network designs evolve and hardware grows more complex, the goal of modern deep learning compilers will be to exploit opportunities for optimisation at all levels of the deployment stack: from high-level choices about neural architectures all the way down to low-level decisions on code generation.
This thesis decomposes neural network designs into three core components: skeletons, blocks, and operations. Each component is addressed individually, and the interactions between optimisations applied at different layers of the deployment stack are examined.
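As a rough illustration of this decomposition (the class names below are hypothetical stand-ins, not the thesis's own data structures), a network can be viewed as a skeleton made of blocks, each built from primitive operations:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical names only: a toy rendering of the skeleton/block/operation
# decomposition described in the abstract, not the thesis's data structures.

@dataclass
class Operation:
    name: str                     # e.g. "conv3x3", "relu", "depthwise_conv"

@dataclass
class Block:
    operations: List[Operation]   # a small graph of primitive operations

@dataclass
class Skeleton:
    blocks: List[Block]           # how many blocks appear, and in what order

# A toy residual-style network: four identical blocks hung off one skeleton.
net = Skeleton(blocks=[
    Block([Operation("conv3x3"), Operation("relu"), Operation("conv3x3")])
    for _ in range(4)
])
```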
First considered are the optimisation schemes for neural network skeletons, and it is shown that the commonplace prune-and-finetune pattern has a negative impact on throughput on both CPUs and GPUs. New schemes are developed for downscaling skeletons that preserve hardware performance, yield better accuracy, and avoid the expensive finetuning stage.
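For context, a structured way of downscaling a layer is to physically remove output channels so that the remaining computation stays dense and hardware-friendly, rather than masking weights with zeros. The sketch below is a minimal illustration of that idea using a simple L1-norm heuristic in PyTorch; it is not the thesis's scheme.

```python
import torch
import torch.nn as nn

# Illustrative only: structured channel pruning that shrinks a convolution
# in place of masking weights, so throughput benefits from the smaller shape.

def prune_channels(conv: nn.Conv2d, keep_ratio: float = 0.5) -> nn.Conv2d:
    # Rank output channels by the L1 norm of their filters (a common heuristic).
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    keep = torch.topk(scores, n_keep).indices.sort().values

    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    pruned.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep].clone()
    return pruned

conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)
small = prune_channels(conv, keep_ratio=0.25)   # 128 -> 32 output channels
print(small.weight.shape)                       # torch.Size([32, 64, 3, 3])
```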
Secondly, this thesis considers optimisation for neural network blocks. A wealth of research has been dedicated to designing drop-in replacements for neural network blocks that attempt to improve their efficiency. Based on a set of simple drop-ins, this thesis develops a new method for quickly deciding which replacements to put where in a network. It is shown that the algorithm developed can be used more generally to design such blocks from scratch. A core facet of the algorithm is a rejection filter which can be used to guide the kinds of networks proposed. This rejection filter can take the form of simple parameter counts, or more complex compilation metrics such as optimised inference time or levels of data reuse. This provides a potential handle for interaction between the network designer and the optimising compiler.
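To make the rejection-filter idea concrete, the sketch below screens candidate drop-in blocks against a parameter budget before any further evaluation. The candidate blocks and the budget are illustrative assumptions; as noted above, the filter could equally be driven by compiler-reported metrics such as optimised inference time or data reuse.

```python
import torch.nn as nn

# Illustrative sketch of a rejection filter: cheaply discard candidate blocks
# that violate a budget before any training or detailed evaluation.

def param_count(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

def standard_block(c: int) -> nn.Module:
    return nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU())

def cheap_block(c: int) -> nn.Module:
    # Depthwise + pointwise drop-in replacement for a dense convolution.
    return nn.Sequential(nn.Conv2d(c, c, 3, padding=1, groups=c),
                         nn.Conv2d(c, c, 1), nn.ReLU())

def rejection_filter(block: nn.Module, budget: int) -> bool:
    # The predicate here is a parameter count; a compiler-reported inference
    # time or data-reuse measure could be substituted without changing the flow.
    return param_count(block) <= budget

candidates = [standard_block(64), cheap_block(64)]
accepted = [b for b in candidates if rejection_filter(b, budget=10_000)]
print(len(accepted))   # only the depthwise/pointwise block fits the budget
```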
Finally, the thesis considers network operations. Ideas are unified from optimising compilers and network architecture search into a single framework that allows for the generation of new operations and the mutation of network architectures into highly optimised forms.
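One way to picture this framing, purely as an illustration rather than the framework developed in the thesis, is to treat an architectural change as a rewrite that is accepted or rejected by a cost model, much as an optimising compiler applies a transformation only when it improves a cost estimate. The mutation and cost model below are stand-ins.

```python
import torch.nn as nn

# Stand-in example: mutate a dense convolution into a grouped variant and
# keep whichever version a (toy) cost model prefers.

def mutate_to_grouped(conv: nn.Conv2d, groups: int) -> nn.Conv2d:
    return nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                     stride=conv.stride, padding=conv.padding,
                     groups=groups, bias=conv.bias is not None)

def cost(module: nn.Module) -> int:
    # Toy cost model: parameter count. In a compiler setting this could be
    # the measured runtime of the generated code for each candidate.
    return sum(p.numel() for p in module.parameters())

original = nn.Conv2d(64, 64, 3, padding=1)
mutant = mutate_to_grouped(original, groups=4)
best = mutant if cost(mutant) < cost(original) else original
```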