Optimising a fluid plasma turbulence simulation on modern high performance computers
Edwards, Thomas David
Nuclear fusion offers the potential of almost limitless energy from sea water and lithium without the dangers of carbon emissions or long-term radioactive waste. At the forefront of fusion technology are the tokamaks, toroidal magnetic confinement devices that contain miniature stars on Earth. Nuclei can only fuse by overcoming the strong electrostatic forces between them, which requires high temperatures and pressures. The temperatures in a tokamak are so great that the deuterium-tritium fusion fuel forms a plasma, which must be kept hot and under pressure to maintain the fusion reaction. Turbulence in the plasma disrupts confinement by transporting mass and energy away from the core, reducing the efficiency of the reaction. Understanding and controlling the mechanisms of plasma turbulence is key to building a fusion reactor capable of producing sustained output. The extreme temperatures make detailed empirical observations difficult to acquire, so numerical simulations are used as an additional method of investigation. One numerical model used to study turbulence and diffusion is CENTORI, a direct two-fluid magneto-hydrodynamic simulation of a tokamak plasma developed by the Culham Centre for Fusion Energy (CCFE, formerly UKAEA:Fusion). It simulates the entire tokamak plasma with realistic geometry, evolving bulk plasma quantities such as pressure, density and temperature through millions of timesteps. CENTORI must therefore run in parallel on a Massively Parallel Processing (MPP) supercomputer to produce results in an acceptable time. Any improvement in CENTORI’s performance increases the rate and/or total number of results that can be obtained from access to supercomputer resources. This thesis presents the substantial effort to optimise CENTORI on the current generation of academic supercomputers. 
It investigates and reviews the properties of contemporary computer architectures, then proposes, implements and executes a benchmark suite of CENTORI’s fundamental kernels. The suite is used to compare the performance of three competing memory layouts of the primary vector data structure using a selection of compilers on a variety of computer architectures. The results show that no single memory layout is optimal on all platforms, so a flexible strategy was adopted to pursue “portable” optimisation, i.e. optimisations that can easily be added to, adapted for or removed from future platforms depending on their performance. This required designing an interface of functions and datatypes that separates CENTORI’s fundamental algorithms from repetitive, low-level implementation details. This approach offered multiple benefits: CENTORI’s core equations are represented more clearly as mathematical expressions in Fortran source code, which allows rapid prototyping and development of new features; the total data volume is reduced by a factor of three, which cuts the amount of data transferred over the memory bus to almost a third; and the number of intensive floating-point kernels is reduced, which lowers the effort of optimising the application on new platforms. The project proceeds to rewrite CENTORI using the new Application Programming Interface (API) and evaluates two optimised implementations. The first is a traditional library implementation that uses hand-optimised subroutines to implement the library functions. The second uses a dynamic optimisation engine to perform automatic stripmining to make better use of the memory hierarchy. The automatic stripmining implementation uses lazy evaluation to delay calculations until absolutely necessary, allowing it to identify temporary data structures and minimise them for optimal cache use. 
This novel technique is combined with highly optimised implementations of the kernel operations and optimised parallel communication routines to produce a significant improvement in CENTORI’s performance. The maximum measured speedup of the optimised versions over the original code was 3.4 times on 128 processors on HPCx, 2.8 times on 1024 processors on HECToR and 2.3 times on 256 processors on HPC-FF.
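
The combination of lazy evaluation and stripmining described above can be illustrated with a minimal sketch. It is written in Python for brevity, whereas CENTORI itself is Fortran, and it is not the thesis's implementation or API: the class name `Lazy`, its methods and the `strip` parameter are all hypothetical. The sketch defers element-wise operations until `evaluate` is called, then computes the whole expression one strip at a time, so every temporary is only as long as a strip rather than as long as the full array.

```python
class Lazy:
    """Deferred element-wise expression over equal-length arrays.

    Hypothetical illustration only; not CENTORI's actual interface.
    """

    def __init__(self, length, strip_fn):
        self.length = length
        # strip_fn(lo, hi) returns this node's values for indices lo..hi-1.
        self._strip_fn = strip_fn

    @staticmethod
    def leaf(data):
        """Wrap existing data; nothing is computed or copied yet."""
        return Lazy(len(data), lambda lo, hi: data[lo:hi])

    def _combine(self, other, op):
        if self.length != other.length:
            raise ValueError("operands must have the same length")

        def strip_fn(lo, hi):
            # Temporaries a and b hold one strip, never the full array.
            a = self._strip_fn(lo, hi)
            b = other._strip_fn(lo, hi)
            return [op(x, y) for x, y in zip(a, b)]

        return Lazy(self.length, strip_fn)

    def __add__(self, other):
        return self._combine(other, lambda x, y: x + y)

    def __mul__(self, other):
        return self._combine(other, lambda x, y: x * y)

    def evaluate(self, strip=4):
        """Run the deferred expression strip by strip into one result."""
        if strip < 1:
            raise ValueError("strip must be at least 1")
        out = [0.0] * self.length
        for lo in range(0, self.length, strip):
            hi = min(lo + strip, self.length)
            out[lo:hi] = self._strip_fn(lo, hi)
        return out
```

For example, `(Lazy.leaf(a) * Lazy.leaf(b) + Lazy.leaf(c)).evaluate(strip=4)` builds the expression without doing any arithmetic and then evaluates `a*b + c` four elements at a time. In a compiled setting the strip length would be chosen so that the operands and temporaries of one strip fit in cache; that tuning is outside the scope of this sketch.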