dc.contributor.advisor | Nagarajan, Vijayanand | |
dc.contributor.advisor | Fensch, Christian | |
dc.contributor.author | Rajaram, Bharghava | |
dc.date.accessioned | 2015-09-02T14:23:34Z | |
dc.date.available | 2015-09-02T14:23:34Z | |
dc.date.issued | 2015-06-29 | |
dc.identifier.uri | http://hdl.handle.net/1842/10530 | |
dc.description.abstract | Read-Modify-Write (RMW) operations, or atomics, have widespread application in
(a) synchronization, where they are used as building blocks of various synchronization
constructs like locks, barriers, and lock-free data structures (b) supervised memory systems,
where every memory operation is effectively an RMW that reads and modifies
metadata associated with memory addresses and (c) profiling, where RMW instructions
are used to increment shared counters to convey meaningful statistics about a
program. In each of these scenarios, the RMWs pose a bottleneck to performance and
scalability. We observed that the cost of RMWs is dependent on two major factors –
the memory ordering enforced by the RMW, and contention amongst processors performing
RMWs to the same memory address. In the case of both synchronization and
supervised memory systems, the RMWs are expensive due to the memory ordering
enforced due to the atomic RMW operation. Performance overhead due to contention
is more prevalent in parallel programs which frequently make use of RMWs to update
concurrent data structures in a non-blocking manner. Such programs also suffer from a
degradation in fairness amongst concurrent processors. In this thesis, we study the cost
of RMWs in the above applications, and present solutions to obtain better performance
and scalability from RMW operations.
Firstly, this thesis tackles the large overhead of RMW instructions when used for
synchronization in the widely used x86 processor architectures, like in Intel, AMD, and
Sun processors. The x86 processor architecture implements a variation of the Total-Store-Order (TSO) memory consistency model. RMW instructions in existing TSO architectures
(we call them type-1 RMW) are ordered like memory fences, which makes
them expensive. The strong fence-like ordering of type-1 RMWs is unnecessary for the
memory ordering required by synchronization. We propose weaker RMW instructions
for TSO consistency; we consider two weaker definitions: type-2 and type-3, each
causing subtle ordering differences. Type-2 and type-3 RMWs avoid the fence-like
ordering of type-1 RMWs, thereby reducing their overhead. Recent work has shown
that the new C/C++11 memory consistency model can be realized by generating type-1 RMWs for SC-atomic-writes and/or SC-atomic-reads. We formally prove that this
is equally valid for the proposed type-2 RMWs, and partially for type-3 RMWs. We
also propose efficient implementations for type-2 (type-3) RMWs. Simulation results
show that our implementation reduces the cost of an RMW by up to 58.9% (64.3%),
which translates into an overall performance improvement of up to 9.0% (9.2%) for
the programs considered.
Next, we argue the case for an efficient and correct supervised memory system
for the TSO memory consistency model. Supervised memory systems make use of
RMW-like supervised memory instructions (SMIs) to atomically update metadata associated
with every memory address used by an application program. Such a system is
used to help increase reliability, security and accuracy of parallel programs by offering
debugging/monitoring features. Most existing supervised memory systems assume a
sequentially consistent memory. For weaker consistency models, like TSO, correctness
issues (like imprecise exceptions) arise if the ordering requirement of SMIs is
neglected. In this thesis, we show that it is sufficient for supervised instructions to only
read and process their metadata in order to ensure correctness. We propose SuperCoP,
a supervised memory system for relaxed memory models in which SMIs read and process
metadata before retirement, while allowing data and metadata writes to retire into
the write-buffer. Our experimental results show that SuperCoP performs better than
the existing state-of-the-art correct supervision system by 16.8%.
Finally, we address the issue of contention and contention-based failure of RMWs
in non-blocking synchronization mechanisms. We leverage the fact that most existing
lock-free programs make use of compare-and-swap (CAS) loops to access the
concurrent data structure. We propose DyFCoM (Dynamic Fairness and Contention
Management), a holistic scheme which addresses both throughput and fairness under
increased contention. DyFCoM monitors the number of successful and failed RMWs
in each thread, and uses this information to implement a dynamic backoff scheme to
optimize throughput. We also use this information to throttle faster threads and give
slower threads a higher chance of performing their lock-free operations, to increase
fairness among threads. Our experimental results show that our contention management
scheme alone performs better than the existing state-of-the-art CAS contention
management scheme by an average of 7.9%. When fairness management is included,
our scheme provides an average of 3.4% performance improvement over the constant
backoff scheme, while showing increased fairness values in all cases (up to 43.6%). | en |
dc.language.iso | en | en |
dc.publisher | The University of Edinburgh | en |
dc.relation.hasversion | Bharghava Rajaram, Vijay Nagarajan, Andrew J. McPherson, and Marcelo Cintra. 2012. SuperCoP: a general, correct, and performance-efficient supervised memory system. In Proceedings of the 9th conference on Computing Frontiers (CF ’12). ACM, New York, NY, USA, 85-94. | en |
dc.relation.hasversion | Bharghava Rajaram, Vijay Nagarajan, Susmit Sarkar, and Marco Elver. 2013. Fast RMWs for TSO: semantics and implementation. In Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation (PLDI ’13). ACM, New York, NY, USA, 61-72. | en |
dc.relation.hasversion | Lin, C., Nagarajan, V., Gupta, R., and Rajaram, B. (2012). Efficient sequential consistency via conflict ordering. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVII, pages 273–286, New York, NY, USA. ACM. | en |
dc.rights | Attribution-NonCommercial-ShareAlike 4.0 International | * |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-sa/4.0/ | * |
dc.subject | parallel computer architecture | en |
dc.subject | atomic instructions | en |
dc.subject | efficient synchronization | en |
dc.title | Efficient, scalable, and fair read-modify-writes | en |
dc.type | Thesis or Dissertation | en |
dc.type.qualificationlevel | Doctoral | en |
dc.type.qualificationname | PhD Doctor of Philosophy | en |