Edinburgh Research Archive

Co-designing reliability and performance for datacenter memory

dc.contributor.advisor
Nagarajan, Vijay
dc.contributor.advisor
Cole, Murray
dc.contributor.author
Patil, Adarsh
dc.contributor.sponsor
ARM Center of Excellence
en
dc.contributor.sponsor
ACM travel grant
en
dc.date.accessioned
2023-09-25T14:16:30Z
dc.date.available
2023-09-25T14:16:30Z
dc.date.issued
2023-09-25
dc.description.abstract
Memory is one of the key components affecting the reliability and performance of datacenter servers. Memory in today’s servers is organized and shared in several ways to provide the most performant and efficient access to data: for example, cache hierarchies in multi-core chips reduce access latency, non-uniform memory access (NUMA) in multi-socket servers improves scalability, and disaggregation increases memory capacity. In all these organizations, hardware coherence protocols are used to maintain the consistency of the shared memory and to implicitly move data to the requesting cores. This thesis aims to provide fault tolerance against newer models of failure in the organization of memory in datacenter servers. While designing for improved reliability, this thesis explores solutions that can also enhance the performance of applications. The solutions build on modern coherence protocols to achieve these properties. First, we observe that DRAM memory system failure rates have increased, demanding stronger forms of memory reliability. To combat this, the thesis proposes Dvé, a hardware-driven replication mechanism in which data blocks are replicated across two different memory controllers in a cache-coherent NUMA system. Data blocks are accompanied by a code with strong error detection capabilities so that when an error is detected, correction is performed using the replica. Dvé’s organization offers two independent points of access to data, which enables: (a) strong error correction that can recover from a range of faults affecting any of the components in the memory and (b) higher performance by providing another nearer point of memory access. Dvé’s coherent replication keeps the replicas in sync for reliability and also provides coherent access to read replicas during fault-free operation for improved performance. Dvé can flexibly provide these benefits on-demand at runtime. Next, we observe that the coherence protocol itself needs to be hardened against failures.
Memory in datacenter servers is being disaggregated from the compute servers into dedicated memory servers, driven by standards like CXL. CXL specifies the coherence protocol semantics for compute servers to access and cache data from a shared region in the disaggregated memory. However, the CXL specification lacks the level of fault tolerance necessary to operate at an inter-server scale within the datacenter. Compute servers in the datacenter can fail or become unresponsive; it is therefore important that the coherence protocol remain available in the presence of such failures. The thesis proposes Āpta, a CXL-based, shared disaggregated memory system for keeping the cached data consistent without compromising availability in the face of compute server failures. Āpta architects a high-performance, fault-tolerant, object-granular memory server that significantly improves performance for stateless function-as-a-service (FaaS) datacenter applications.
en
dc.identifier.uri
https://hdl.handle.net/1842/40951
dc.identifier.uri
http://dx.doi.org/10.7488/era/3703
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Dvé: Improving DRAM reliability and performance on-demand via coherent replication Patil, A., Nagarajan, V., Balasubramonian, R. & Oswald, N., 4 Aug 2021, 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture ISCA 2021. IEEE, p. 526-539 14 p. (Proceedings - International Symposium on Computer Architecture; vol. 2021-June).
en
dc.relation.hasversion
Āpta: Fault-tolerant object-granular CXL disaggregated memory for accelerating FaaS Patil, A., Nagarajan, V., Nikoleris, N. & Oswald, N., 9 Aug 2023, 2023 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2023). IEEE, p. 201-215 15 p. (IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)).
en
dc.rights
Attribution 4.0 International (CC BY 4.0)
en
dc.rights.uri
https://creativecommons.org/licenses/by/4.0/
en
dc.subject
Memory
en
dc.subject
DRAM
en
dc.subject
Reliability
en
dc.subject
performance
en
dc.subject
Coherence Protocols
en
dc.subject
function-as-a-service
en
dc.subject
Disaggregation
en
dc.subject
Fault Tolerance
en
dc.title
Co-designing reliability and performance for datacenter memory
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en

Files

Original bundle

Name:
PatilA_2023.pdf
Size:
3.28 MB
Format:
Adobe Portable Document Format