Efficient methods and architectures for deep neural network sequence models
Mbabazi, Emmanuel Kahembwe
The recent resurgence of neural networks, termed "Deep Learning", has led to a reinvigoration of the artificial intelligence research field and all related sub-fields, from robotics and vision to natural language processing and understanding. In the last decade, this field has seen incredible breakthroughs, primarily driven by improvements to computing capability that have allowed for ever larger neural network architectures. The key driving force behind this resurgence has been the graphics processing unit (GPU), and as deep neural networks (DNNs) grow ever larger, efficiency has become a bottleneck issue. Even with ample GPU resources and significant financial backing, state-of-the-art neural network models and methods are out of reach for most scientists. The significance of this challenge is brought to bear when attempting to use DNNs on video, the most consumed form of data and media. Modelling high-dimensional data such as video is computationally expensive and challenging even with small neural networks. With the 2020 Coronavirus pandemic, production and consumption of video have greatly increased as the global business population moves to working and interacting online. The low cost of video production and transmission is quickly making it the most common medium of digital communication for socially distanced humans. Video is also often the cheapest and most detailed source of information relied upon in fields such as robotics, for driverless cars, drones and teleoperated machines. As such, being able to efficiently model such data is of paramount importance to the field of AI. In this thesis, we tackle the issue of efficiently modelling complex high-dimensional sequential data such as video and language. We address this problem on two fronts: computational efficiency and algorithmic efficiency. On the computational front, we propose a design methodology that significantly lowers the cost of video modelling tasks while improving performance.
To enable this, we bring to bear the tools of Hessian analysis in the most comprehensive analysis of generative video models to date. We then go on to tackle sequential modelling from an algorithmic efficiency perspective. We propose methods that use the temporal dynamics of sequential data to improve modelling performance post-training. We highlight the new capabilities enabled when optimisation is not restricted to training scenarios and conjecture that intelligent systems should never stop training. In a collaborative effort, we propose similar approaches for natural language modelling. To conclude, we demonstrate that, with a single commodity GPU, our proposed methods and architectures realise state-of-the-art results, often surpassing the performance of models trained on hundreds of GPUs at significant financial cost.