
Evaluating and optimising the AArch64 ecosystem for HPC

Authors

Bastos Cordeiro de Jesus, Ricardo Jorge

Abstract

In recent years, Arm-based processors have become a viable alternative to Intel and AMD CPUs for HPC systems. To make this possible, hardware and software evolved greatly to meet the complex performance needs of HPC applications. In this thesis, we assess the readiness of the AArch64 ecosystem for mainstream HPC usage and provide improvements to the areas we found to be lacking. We start with a preliminary evaluation of the support for, and performance of, the AArch64 ecosystem as of 2020, when production-grade AArch64 HPC systems were emerging. This evaluation revealed new optimisation opportunities enabled by AArch64's features, such as its novel scalable vector extension (SVE); generally competitive hardware performance, although some classes of code patterns showed reduced performance on certain microarchitectures; and generally mature support from the surrounding software ecosystem, in particular compilers, despite a few instances of suboptimal code generation. These observations motivated the majority of the work in this thesis, summarised below, whose goal is to further improve the maturity of the AArch64 ecosystem.

In particular, we explore the use of SVE to vectorise number-theoretic transforms (NTTs). We show that SVE enables the efficient implementation of 64-bit modular arithmetic operations, including modular multiplication. This in turn enables the efficient vectorisation of NTT loops and other large-integer arithmetic codes, which was previously not possible with traditional single instruction, multiple data (SIMD) architectures, as these lack the instructions needed to implement 64-bit modular multiplication efficiently. We test and evaluate our SVE implementation on the Fujitsu A64FX processor in an HPE Apollo 80 system.

Furthermore, we implement a distributed NTT for the computation of large-scale exact integer convolutions. We evaluate this transform on Arm-based HPE Apollo 70, Cray XC50, and HPE Apollo 80 systems, where we demonstrate good scalability to thousands of cores. We then show how these methods can be used to count the number of Goldbach partitions of all even numbers up to large limits. Employing a total of 2048 Marvell ThunderX2 cores, we carry out this computation to the world-record limit of 2⁴⁰.

We also evaluate the performance of compare-and-swap (CAS) operations, implemented either via traditional load/store-exclusive (LL-SC) instruction pairs or via the newer CAS instructions introduced with the Armv8.1-A large system extension (LSE), on several high-performance Arm-based CPUs, including the A64FX, ThunderX2, and Graviton3. We observe that CAS and LL-SC instructions can lead to fundamentally different performance profiles, revealing shortcomings in some implementations. On the A64FX, for example, the newer CAS instructions, which compilers and libraries prefer over the older LL-SC pairs, can in some instances lead to a quadratic increase in the average time per successful CAS operation as the number of threads contending for the operation grows, whilst the traditional LL-SC approach shows the expected constant behaviour. For high thread counts, this difference translates into a speedup of more than 20× when using LL-SC instructions. We characterise the conditions under which the LL-SC or CAS approach is preferable on each CPU, and the speedup that can be realised by favouring one strategy over the other.
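To illustrate the SVE modular multiplication described above, the following C sketch uses the Arm C Language Extensions for SVE and relies on the 64-bit high-half multiply (MULH) that traditional SIMD ISAs lack. The constants m and minv (a modulus below 2⁶³ and its inverse modulo 2⁶⁴), the function name modmul, and the Montgomery-style reduction are illustrative assumptions, not necessarily the scheme used in the thesis.

    #include <arm_sve.h>
    #include <stdint.h>

    /* Lane-parallel Montgomery-style modular multiplication.  Assumes a
     * hypothetical modulus m < 2^63 and minv = m^-1 mod 2^64.  Inputs and
     * result are in Montgomery form: each lane yields a * b * 2^-64 mod m. */
    static inline svuint64_t modmul(svbool_t pg, svuint64_t a, svuint64_t b,
                                    uint64_t m, uint64_t minv)
    {
        svuint64_t lo = svmul_x(pg, a, b);     /* low 64 bits of a*b          */
        svuint64_t hi = svmulh_x(pg, a, b);    /* high 64 bits of a*b (MULH)  */
        svuint64_t q  = svmul_x(pg, lo, svdup_u64(minv)); /* q*m == lo mod 2^64 */
        svuint64_t qm = svmulh_x(pg, q, svdup_u64(m));    /* high half of q*m   */
        svuint64_t r  = svsub_x(pg, hi, qm);   /* r in (-m, m), wrapped       */
        svbool_t under = svcmplt(pg, hi, qm);  /* did the subtraction wrap?   */
        return svadd_m(under, r, svdup_u64(m));            /* fold into [0, m) */
    }

Since lo(q*m) equals lo by construction, the product minus q*m is an exact multiple of 2⁶⁴, so only the high halves need to be combined; this is where the 64-bit MULH instruction is essential.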
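The identity underlying the Goldbach computation above can be shown with a naive C sketch: if f[k] = 1 exactly when k is prime, the self-convolution (f * f)[2n] counts the ordered prime pairs summing to 2n. The thesis evaluates this convolution at scale with a distributed NTT; the toy bound LIMIT below is an arbitrary assumption.

    #include <stdio.h>
    #include <string.h>

    #define LIMIT 64   /* toy bound; the thesis computes up to 2^40 */

    int main(void)
    {
        unsigned char is_prime[LIMIT];
        memset(is_prime, 1, sizeof is_prime);
        is_prime[0] = is_prime[1] = 0;
        for (int i = 2; i * i < LIMIT; i++)      /* sieve of Eratosthenes */
            if (is_prime[i])
                for (int j = i * i; j < LIMIT; j += i)
                    is_prime[j] = 0;

        for (int n = 4; n < LIMIT; n += 2) {     /* even numbers only */
            int ordered = 0;
            for (int p = 2; p <= n - 2; p++)     /* (f * f)[n], ordered pairs */
                ordered += is_prime[p] && is_prime[n - p];
            /* convert ordered pairs to unordered Goldbach partitions */
            printf("g(%d) = %d\n", n, (ordered + (is_prime[n / 2] ? 1 : 0)) / 2);
        }
        return 0;
    }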
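Both instruction idioms compared above are typically generated from the same portable atomic operation, with the choice made by compiler flags rather than by the source code; a minimal C11 sketch (try_swap is an illustrative name):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Strong compare-and-swap with sequentially consistent ordering.
     * On AArch64 this compiles either to an LDAXR/STLXR (LL-SC) retry
     * loop or, with -march=armv8.1-a or later, to a single LSE CAS
     * instruction; -moutline-atomics lets GCC and Clang pick at run time. */
    static inline bool try_swap(_Atomic uint64_t *p,
                                uint64_t expected, uint64_t desired)
    {
        return atomic_compare_exchange_strong(p, &expected, desired);
    }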
Finally, we explore the performance of the main optimising compiler toolchains currently available for AArch64 processors on the recently released NVIDIA Grace CPU. We consider the Arm Compiler for Linux (ACFL), GCC, LLVM, and the NVIDIA HPC (NVHPC) compilers. We evaluate the quality of the code generated by these compilers using the RAJA Performance Suite (RAJAPerf) to understand where each compiler does best, and why. We find that the compilers mostly generate well-optimised code for baseline sequential runs, with the gap between the fastest and slowest compiler being 8% on average. Threaded parallel runs show greater variation, with this gap increasing to approximately 33% on average. We investigate in detail the kernels where LLVM performs worst relative to the other compilers and propose optimisations to improve code generation in those cases. We show scenarios where the default compiler behaviour produces suboptimal code and where adjusting compiler flags, such as those controlling loop unrolling or vectorisation decisions, can improve performance significantly. Where this is insufficient, we propose the changes at the compiler level necessary to enable improved code generation and unlock further optimisations. These improvements account for speedups of over 70% in some kernels. Overall, we conclude that the AArch64 ecosystem has reached maturity, and that the few corner cases that remain pose no real challenge to its widespread adoption in HPC.
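As an illustration of the kind of per-loop adjustment mentioned above, Clang/LLVM's loop pragmas (and the corresponding command-line flags) can override the compiler's default cost-model decisions. The kernel below is a hypothetical stand-in for a RAJAPerf-style loop, not one of the kernels studied, and the pragma values are arbitrary examples rather than recommended settings.

    /* daxpy-style loop; the pragma overrides LLVM's default vectorisation
     * and interleaving choices for this one loop.  GCC offers analogous
     * controls, e.g. "#pragma GCC unroll 4". */
    void daxpy(double *restrict y, const double *restrict x,
               double a, long n)
    {
    #pragma clang loop vectorize_width(2) interleave_count(4)
        for (long i = 0; i < n; ++i)
            y[i] += a * x[i];
    }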
