Efficient neural networks
Improving the efficiency of neural networks has great potential impact due to their wide range of possible use cases and their high levels of arithmetic intensity. As neural network designs evolve and hardware grows more complex, the goal of modern deep learning compilers will be to exploit opportunities for optimisation at all levels of the deployment stack, from high-level choices about neural architectures all the way down to low-level decisions on code generation. This thesis decomposes neural network designs into three core components: skeletons, blocks, and operations. Each component is addressed individually, and the interactions between optimisations applied at different layers of the deployment stack are examined. First considered are the optimisation schemes for neural network skeletons, and it is shown that the commonplace prune-and-finetune pattern has a negative impact on throughput on both CPUs and GPUs. New schemes are developed for downscaling skeletons that preserve hardware performance, yield better accuracy, and avoid the expensive finetuning stage. Secondly, this thesis considers optimisation for neural network blocks. A wealth of research has been dedicated to designing drop-in replacements for neural network blocks that attempt to improve their efficiency. Based on a set of simple drop-ins, this thesis develops a new method for quickly deciding which replacements to put where in a network. It is shown that the algorithm developed can be used more generally to design such blocks from scratch. A core facet of the algorithm is a rejection filter which can be used to guide the kinds of networks proposed. This rejection filter can take the form of simple parameter counts, or more complex compilation metrics such as optimised inference time or levels of data reuse. This provides a potential handle for interaction between the network designer and the optimising compiler. Finally, the thesis considers network operations.
Ideas from optimising compilers and network architecture search are unified into a single framework that allows for the generation of new operations, and for the mutation of network architectures into highly optimised forms.
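
As a rough illustration of the rejection filter in its simplest form, a parameter count, the sketch below samples random assignments of drop-in blocks to layers and discards any candidate that exceeds a parameter budget. It is not the algorithm developed in the thesis: the block types, the parameter formulas (3x3 convolutions with equal input and output channels, four groups for the grouped variant), and the names `param_count`, `total_params`, and `propose` are all assumptions made for this example.

```python
import random

# Hypothetical drop-in block types; the thesis's actual set is not given here.
KINDS = ("standard", "grouped", "depthwise_separable")


def param_count(kind, channels):
    """Weights in one 3x3 block with `channels` inputs and outputs (biases ignored)."""
    if kind == "standard":
        return 9 * channels * channels
    if kind == "grouped":
        # Assumes 4 groups, so a quarter of the standard weights.
        return 9 * channels * channels // 4
    if kind == "depthwise_separable":
        # 3x3 depthwise convolution followed by a 1x1 pointwise convolution.
        return 9 * channels + channels * channels
    raise ValueError(f"unknown block kind: {kind}")


def total_params(assignment, widths):
    """Total weights for one block kind per layer."""
    return sum(param_count(k, c) for k, c in zip(assignment, widths))


def propose(widths, budget, n_samples, seed=0):
    """Sample random block assignments, keeping those within the budget."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_samples):
        candidate = tuple(rng.choice(KINDS) for _ in widths)
        # Rejection filter: discard candidates over the parameter budget.
        if total_params(candidate, widths) <= budget:
            accepted.append(candidate)
    return accepted


if __name__ == "__main__":
    widths = [16, 32, 64]  # assumed channel widths of a three-layer skeleton
    for candidate in propose(widths, budget=20_000, n_samples=5):
        print(candidate, total_params(candidate, widths))
```

Replacing the predicate inside `propose` with a measured quantity, such as compiled inference time, would correspond to the more complex compilation metrics mentioned above, although how such metrics are obtained is outside the scope of this sketch.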