Enabling high performance dynamic language programming for micro-core architectures
Micro-core architectures are intended to deliver high performance at low overall power consumption by combining many simple central processing unit (CPU) cores, together with a small amount of associated memory, on a single chip. This technology is of great interest not only for embedded, Edge and IoT applications but also for High-Performance Computing (HPC) accelerators. However, micro-core architectures are difficult to program and exploit, not only because each technology is different, with its own idiosyncrasies, but also because each presents a different low-level interface to the programmer. Furthermore, micro-cores have very constrained amounts of on-chip scratchpad memory (often around 32 KB), which further hampers programmer productivity by requiring the programmer to manually manage the regular loading and unloading of data between the host and the device during program execution. To help address these issues, dynamic languages such as Python have been ported to several micro-core architectures, but these are often delivered as interpreters, with the associated performance penalty relative to natively compiled languages such as C. The research questions for this thesis target four areas of concern for dynamic programming languages on micro-core architectures: (RQ1) how to manage the limited on-chip memory for data, (RQ2) how to manage the limited on-chip memory for code, (RQ3) how to address the low runtime performance of virtual machines and (RQ4) how to manage the idiosyncrasies of micro-core architectures. The focus of this work is to address these concerns whilst maintaining the programmer productivity benefits of dynamic programming languages, using ePython as the research vehicle.
Therefore, key areas of design (such as abstractions for offload) and implementation (novel compiler and runtime techniques for these technologies) are considered, resulting in a number of approaches that are applicable not only to the compilation of Python codes but also, more generally, to other dynamic languages on micro-core architectures. RQ1 was addressed by providing support for kernels with arbitrary data size through high-level programming abstractions that enable access to the memory hierarchies of micro-core devices, allowing the deployment of real-world applications, such as a machine learning code to detect cancer cells in full-sized scan images. A new abstract machine, Olympus, addressed RQ2 by supporting the compilation of dynamic languages, such as Python, to micro-core native code. Olympus enables ePython to close the kernel runtime performance gap with native C, matching C for the LINPACK and iterative Fibonacci benchmarks, and to provide, on average, around 75% of native C runtime performance across four benchmarks running on a set of eight CPU architectures. Olympus also addressed RQ3 by providing dynamic function loading, which supports kernel codes larger than the on-chip memory whilst retaining the runtime performance benefits of native code generation. Finally, RQ4 was addressed by the Eithne benchmarking framework, which not only enabled a single benchmarking code to be deployed, unchanged, across different CPU architectures but also provided the underlying communications framework for Olympus. The portability of end-user ePython codes and the underlying Olympus abstract machine was validated by running a set of four benchmarks on eight different CPU architectures from a single codebase.