Towards larger scale collective operations in the Message Passing Interface
Rüfenacht, Martin Peter Albert
Supercomputers continue to expand in both size and complexity as we reach the beginning of the exascale era. Networks have evolved from simple mechanisms that transport data into subsystems that carry out a significant fraction of the workload computers are tasked with. Inevitably with this change, assumptions made at the beginning of the last major shift in computing are becoming outdated. We introduce a new latency-bandwidth model which captures the characteristics of sending multiple small messages in quick succession on modern networks. Unlike other models representing the same effects, the pipelining latency-bandwidth model is simple and physically based. In addition, we develop a discrete-event simulation, Fennel, to capture non-analytical effects of communication within models. AllReduce operations with small messages are common throughout supercomputing, particularly for iterative methods. The performance of network operations is crucial to the overall time-to-solution of an application. The Message Passing Interface standard was introduced to abstract complex communications from application-level development. The underlying algorithms that implementations use to achieve the specified behaviour, such as the recursive doubling algorithm for AllReduce, have to evolve with the computers on which they are used. We introduce the recursive multiplying algorithm as a generalisation of recursive doubling. By utilising the pipelining nature of modern networks, we lower the latency of AllReduce operations and enable a greater choice of schedule. A heuristic based on the pipelining latency-bandwidth model quickly generates a near-optimal schedule. Alongside recursive multiplying, the endpoints of collective operations must be able to handle larger numbers of incoming messages. 
Typically this is done by duplicating receive queues for remote peers, but this requires memory that grows linearly with the size of the application. We introduce a single-consumer multiple-producer queue which is designed to be used with MPI as a protocol to insert messages remotely, with minimal contention for shared receive queues.
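
As an illustration of how recursive doubling might generalise, the following minimal sketch simulates a sum AllReduce in which each round uses a chosen group size. It is one plausible reading of the idea, not necessarily the thesis's exact definition: the function name, the `factors` parameter, and the mixed-radix peer selection are all assumptions made for this example, and no network timing is modelled.

```python
from math import prod


def recursive_multiplying_allreduce(values, factors):
    """Simulate a sum AllReduce over len(values) ranks (hypothetical sketch).

    factors holds the group size used in each round; their product must
    equal the number of ranks. factors = [2, 2, ...] gives recursive doubling.
    """
    p = len(values)
    if prod(factors) != p:
        raise ValueError("product of factors must equal the number of ranks")

    current = list(values)
    stride = 1
    for k in factors:
        nxt = []
        for rank in range(p):
            # Position of this rank inside its group of k for this round.
            digit = (rank // stride) % k
            base = rank - digit * stride
            # Combine own value with the values of the k - 1 peers.
            nxt.append(sum(current[base + j * stride] for j in range(k)))
        current = nxt
        stride *= k
    return current


if __name__ == "__main__":
    data = list(range(8))
    for schedule in ([2, 2, 2], [4, 2], [8]):
        print(schedule, recursive_multiplying_allreduce(data, schedule))
```

In this sketch a schedule with larger factors needs fewer rounds, but each rank exchanges with `k - 1` peers per round, which is the trade-off a schedule heuristic would have to weigh.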