Evaluating and optimising the AArch64 ecosystem for HPC
Authors
Bastos Cordeiro de Jesus, Ricardo Jorge
Abstract
In recent years, Arm-based processors have become a viable alternative to Intel or
AMD CPUs for HPC systems. To make this possible, hardware and software evolved
greatly to meet the complex performance needs of HPC applications. In this thesis,
we assess the readiness of the AArch64 ecosystem for mainstream HPC usage, and
provide improvements to the areas we found to be lacking. We start by performing a
preliminary evaluation of the maturity of the support and performance of the AArch64
ecosystem as of 2020, when production-grade AArch64 HPC systems were emerging.
This led us to find: new optimisation opportunities enabled by AArch64’s features,
such as its novel scalable vector extension (SVE); generally competitive hardware
performance, although with some classes of code patterns showing reduced performance
on certain microarchitectures; and generally mature support from the surrounding
software ecosystem, in particular compilers, despite a few instances of suboptimal code
generation. These observations motivated the majority of the work of this thesis, as
summarised below, with the goal of improving the maturity of the AArch64 ecosystem
further.
In particular, we explore the use of SVE to vectorise number-theoretic transforms
(NTTs). We show that SVE enables the efficient implementation of 64-bit modular
arithmetic operations, including modular multiplication. This in turn enables the
efficient vectorisation of NTT loops and other large-integer arithmetic codes, which
was previously not possible with traditional single instruction multiple data (SIMD)
architectures, since they lack the instructions needed to implement 64-bit modular
multiplication efficiently. We test and evaluate our SVE
implementation on the Fujitsu A64FX processor in an HPE Apollo 80 system. Furthermore,
we implement a distributed NTT for the computation of large-scale exact integer
convolutions. We evaluate this transform on Arm-based HPE Apollo 70, Cray XC50,
and HPE Apollo 80 systems, where we demonstrate good scalability to thousands of
cores. Further, we demonstrate how these methods can be utilised to count the number
of Goldbach partitions of all even numbers to large limits. Employing a total of 2048
Marvell ThunderX2 cores, we carry out the computation to the world-record limit of
2⁴⁰.
Furthermore, we evaluate the performance of compare-and-swap (CAS) operations,
implemented either via traditional load-linked/store-conditional (LL-SC) instruction
pairs (load/store exclusives in Arm terminology) or via the newer CAS instructions
introduced with the Armv8.1-A Large System Extensions (LSE), on
several high-performance Arm-based CPUs such as the A64FX, ThunderX2, and
Graviton3. We observe that CAS and LL-SC instructions can lead to fundamentally
different performance profiles, revealing shortcomings in some implementations. On
the A64FX, for example, the newer CAS instructions, preferred by compilers and
libraries over the older LL-SC pairs, can in some instances lead to a quadratic increase
in average time per successful CAS operation as the number of threads contending
for the operation increases, whilst the traditional LL-SC approach shows the expected
constant behaviour. For high thread counts, this difference translates into a speedup of
more than 20× when using LL-SC instructions. We characterise the conditions under
which the LL-SC or CAS approaches are preferable on each CPU, and the speedup that
can be realised by favouring one strategy over the other.
Finally, we explore the performance of the main optimising compiler toolchains
currently available for AArch64 processors on the recently released NVIDIA Grace
CPU. We consider the Arm Compiler for Linux (ACFL), GCC, LLVM, and the NVIDIA
HPC (NVHPC) compilers. We evaluate the quality of the code generated by these
compilers using the RAJA Performance Suite (RAJAPerf) to understand the cases
where each compiler does best, and why. We find that compilers mostly generate
well optimised code on baseline sequential runs, with the gap between the fastest and
slowest being 8% on average. Threaded parallel runs show a larger variation, with
this gap increasing to approximately 33% on average. We investigate in detail the
kernels where LLVM performs worst relative to the other compilers and propose
optimisations to improve code generation in those cases. We show scenarios where the
default compiler behaviour produces suboptimal code and where adjusting compiler
flags, such as those controlling loop unrolling or vectorisation decisions, can improve
performance significantly. In cases where this is insufficient, we propose changes at
the compiler level necessary to enable improved code generation and unlock further
optimisations. These improvements account for speedups of over 70% in some kernels.
Overall, we conclude that the AArch64 ecosystem has reached maturity and the few
corner cases that remain pose no real challenge to its widespread adoption in HPC.