Zero-cost abstractions for irregular data shapes in a high-performance parallel language
Item status: Restricted Access
Embargo end date: 15/02/2024
Modern parallel accelerators offer an unprecedented degree of performance and are used pervasively in important application domains such as High Performance Computing and Machine Learning. However, these accelerators are challenging to program. The primary option is to use platform-specific low-level languages (e.g. OpenCL, CUDA), but these require significant expertise to be used effectively. The other option is to rely on domain-specific libraries such as OpenBLAS and cuSPARSE; however, these cater to common use cases and are sub-optimal for less popular architectures or applications. Data-parallel functional languages such as Lift, Accelerate and Futhark facilitate the programming of parallel accelerators, offering a solution to these issues. They have extensive support for multi-dimensional array programming and excel at expressing parallel programs over regularly shaped data structures. Raising the level of abstraction allows users to express applications more naturally, while the constrained nature of array operators enables the generation of high-performance code. However, many applications require irregularly shaped data, such as sparse or otherwise non-rectangular matrices, and expressing these structures in terms of dense multi-dimensional arrays is challenging.

This thesis introduces the minimal set of constructs necessary to express such structures. Rather than adding a completely new programming paradigm, it extends the array programming model, allowing for maximal composability and uniformity between the new and existing abstractions. The central language extension is position-dependent arrays: heterogeneous multi-dimensional arrays in which the type of each element depends statically on its position within the data structure. The construct is then progressively generalised so that the dependence can be not only static but also dynamic, enabling programs over sparse matrices. These abstractions are zero-cost, which is key to producing fast code.
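The thesis's own language constructs are not reproduced in this abstract, but the kind of irregularity at issue can be made concrete with the standard CSR (compressed sparse row) format: a sparse matrix is flattened into a values array whose per-row segment lengths vary dynamically, which is exactly the shape a rectangular array type cannot express directly. A minimal Python sketch of CSR sparse matrix-vector multiplication (the names and layout here are the conventional CSR ones, not the thesis's API):

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Multiply a CSR-format sparse matrix by a dense vector x.

    values  - non-zero entries, concatenated row by row
    col_idx - column index of each non-zero entry
    row_ptr - offsets: row i's entries live in values[row_ptr[i]:row_ptr[i+1]]
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        # Each row owns a variable-length segment of the flat values array.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y


# The 2x3 matrix [[1, 0, 2], [0, 3, 0]] in CSR form:
y = csr_spmv([1.0, 2.0, 3.0], [0, 2, 1], [0, 2, 3], [1.0, 1.0, 1.0])
# y == [3.0, 3.0]
```

The `row_ptr` offsets are what make the structure position-dependent: the extent of row i is a dynamic function of i, not a static dimension of the array type.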
This is achieved by using dependent types to enforce complex invariants at compile time. As programs using sparse matrices often need a fast prefix sum, the thesis also includes a novel implementation of a work-efficient parallel scan for data-parallel languages. A comparison with state-of-the-art alternatives shows the effectiveness of these contributions: the compiler generates programs that are competitive with, or outright outperform, those found in libraries such as OpenBLAS, cuBLAS and cuSPARSE. Examples include an average 1.2x speedup for sparse matrix-vector multiplication over cuSPARSE's implementation on an NVIDIA GTX 1070 GPU, and a 1.5x speedup for prefix sum over a competitive handwritten implementation.
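For context on what "work-efficient" means here: the classic Blelloch scan performs O(n) operations (an up-sweep building a reduction tree, then a down-sweep distributing prefixes), versus O(n log n) for the naive parallel scan. The sketch below is a sequential Python rendering of that tree pattern, not the thesis's implementation, and assumes a power-of-two input length for simplicity:

```python
def exclusive_scan(xs, op=lambda a, b: a + b, identity=0):
    """Work-efficient (Blelloch) exclusive scan; O(n) applications of op."""
    n = len(xs)
    assert n > 0 and n & (n - 1) == 0, "power-of-two length assumed"
    a = list(xs)
    # Up-sweep: combine pairs at increasing strides, like a reduction tree.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            a[i] = op(a[i - d], a[i])
        d *= 2
    # Down-sweep: clear the root, then push prefixes back down the tree.
    a[n - 1] = identity
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            t = a[i - d]
            a[i - d] = a[i]
            a[i] = op(t, a[i])
        d //= 2
    return a


# exclusive_scan([1, 2, 3, 4, 5, 6, 7, 8]) -> [0, 1, 3, 6, 10, 15, 21, 28]
```

In a parallel setting, each inner loop becomes one parallel step, giving 2 log n steps over the same O(n) total work. Prefix sums matter for sparse code because they turn per-row non-zero counts into exactly the `row_ptr`-style offset arrays that index into flattened irregular data.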