Automatic performance optimisation of parallel programs for GPUs via rewrite rules
Graphics Processing Units (GPUs) are now commonplace in computing systems and are the most successful parallel accelerators. Their performance is orders of magnitude higher than traditional Central Processing Units (CPUs) making them attractive for many application domains with high computational demands. However, achieving their full performance potential is extremely hard, even for experienced programmers, as it requires specialised software tailored for specific devices written in low-level languages such as OpenCL. Differences in device characteristics between manufacturers and even hardware generations often lead to large performance variations when different optimisations are applied. This inevitably leads to code that is not performance portable across different hardware. This thesis demonstrates that achieving performance portability is possible using LIFT, a functional data-parallel language which allows programs to be expressed at a high-level in a hardware-agnostic way. The LIFT compiler is empowered to automatically explore the optimisation space using a set of well-defined rewrite rules to transform programs seamlessly between different high-level algorithmic forms before translating them to a low-level OpenCL-specific form. The first contribution of this thesis is the development of techniques to compile functional LIFT programs that have optimisations explicitly encoded into efficient imperative OpenCL code. Producing efficient code is non-trivial as many performance sensitive details such as memory allocation, array accesses or synchronisation are not explicitly represented in the functional LIFT language. The thesis shows that the newly developed techniques are essential for achieving performance on par with manually optimised code for GPU programs with the exact same complex optimisations applied. The second contribution of this thesis is the presentation of techniques that enable the LIFT compiler to perform complex optimisations that usually require from tens to hundreds of individual rule applications by grouping them as macro-rules that cut through the optimisation space. Using matrix multiplication as an example, starting from a single high-level program the compiler automatically generates highly optimised and specialised implementations for desktop and mobile GPUs with very different architectures achieving performance portability. The final contribution of this thesis is the demonstration of how low-level and GPU-specific features are extracted directly from the high-level functional LIFT program, enabling building a statistical performance model that makes accurate predictions about the performance of differently optimised program variants. This performance model is then used to drastically speed up the time taken by the optimisation space exploration by ranking the different variants based on their predicted performance. Overall, this thesis demonstrates that performance portability is achievable using LIFT.