Efficient, scalable, and fair read-modify-writes
Read-Modify-Write (RMW) operations, or atomics, have widespread application in (a) synchronization, where they are used as building blocks of various synchronization constructs like locks, barriers, and lock-free data structures (b) supervised memory systems, where every memory operation is effectively an RMW that reads and modifies metadata associated with memory addresses and (c) profiling, where RMW instructions are used to increment shared counters to convey meaningful statistics about a program. In each of these scenarios, the RMWs pose a bottleneck to performance and scalability. We observed that the cost of RMWs is dependent on two major factors – the memory ordering enforced by the RMW, and contention amongst processors performing RMWs to the same memory address. In the case of both synchronization and supervised memory systems, the RMWs are expensive due to the memory ordering enforced due to the atomic RMW operation. Performance overhead due to contention is more prevalent in parallel programs which frequently make use of RMWs to update concurrent data structures in a non-blocking manner. Such programs also suffer from a degradation in fairness amongst concurrent processors. In this thesis, we study the cost of RMWs in the above applications, and present solutions to obtain better performance and scalability from RMW operations. Firstly, this thesis tackles the large overhead of RMW instructions when used for synchronization in the widely used x86 processor architectures, like in Intel, AMD, and Sun processors. The x86 processor architecture implements a variation of the Total-Store-Order (TSO) memory consistency model. RMW instructions in existing TSO architectures (we call them type-1 RMW) are ordered like memory fences, which makes them expensive. The strong fence-like ordering of type-1 RMWs is unnecessary for the memory ordering required by synchronization. We propose weaker RMW instructions for TSO consistency; we consider two weaker definitions: type-2 and type-3, each causing subtle ordering differences. Type-2 and type-3 RMWs avoid the fence-like ordering of type-1 RMWs, thereby reducing their overhead. Recent work has shown that the new C/C++11 memory consistency model can be realized by generating type-1 RMWs for SC-atomic-writes and/or SC-atomic-reads. We formally prove that this is equally valid for the proposed type-2 RMWs, and partially for type-3 RMWs. We also propose efficient implementations for type-2 (type-3) RMWs. Simulation results show that our implementation reduces the cost of an RMW by up to 58.9% (64.3%), which translates into an overall performance improvement of up to 9.0% (9.2%) for the programs considered. Next, we argue the case for an efficient and correct supervised memory system for the TSO memory consistency model. Supervised memory systems make use of RMW-like supervised memory instructions (SMIs) to atomically update metadata associated with every memory address used by an application program. Such a system is used to help increase reliability, security and accuracy of parallel programs by offering debugging/monitoring features. Most existing supervised memory systems assume a sequentially consistent memory. For weaker consistency models, like TSO, correctness issues (like imprecise exceptions) arise if the ordering requirement of SMIs is neglected. In this thesis, we show that it is sufficient for supervised instructions to only read and process their metadata in order to ensure correctness. We propose SuperCoP, a supervised memory system for relaxed memory models in which SMIs read and process metadata before retirement, while allowing data and metadata writes to retire into the write-buffer. Our experimental results show that SuperCoP performs better than the existing state-of-the-art correct supervision system by 16.8%. Finally, we address the issue of contention and contention-based failure of RMWs in non-blocking synchronization mechanisms. We leverage the fact that most existing lock-free programs make use of compare-and-swap (CAS) loops to access the concurrent data structure. We propose DyFCoM (Dynamic Fairness and Contention Management), a holistic scheme which addresses both throughput and fairness under increased contention. DyFCoM monitors the number of successful and failed RMWs in each thread, and uses this information to implement a dynamic backoff scheme to optimize throughput. We also use this information to throttle faster threads and give slower threads a higher chance of performing their lock-free operations, to increase fairness among threads. Our experimental results show that our contention management scheme alone performs better than the existing state-of-the-art CAS contention management scheme by an average of 7.9%. When fairness management is included, our scheme provides an average of 3.4% performance improvement over the constant backoff scheme, while showing increased fairness values in all cases (up to 43.6%).
The following license files are associated with this item: