Co-designing reliability and performance for datacenter memory

Patil, Adarsh

dc.contributor.advisor

Nagarajan, Vijay

dc.contributor.advisor

Cole, Murray

dc.contributor.author

Patil, Adarsh

dc.contributor.sponsor

ARM Center of Excellence

en

dc.contributor.sponsor

ACM travel grant

en

dc.date.accessioned

2023-09-25T14:16:30Z

dc.date.available

2023-09-25T14:16:30Z

dc.date.issued

2023-09-25

dc.description.abstract

Memory is one of the key components that affect the reliability and performance of datacenter servers. Memory in today’s servers is organized and shared in several ways to provide the most performant and efficient access to data. Examples include cache hierarchies in multi-core chips to reduce access latency, non-uniform memory access (NUMA) in multi-socket servers to improve scalability, and disaggregation to increase memory capacity. In all these organizations, hardware coherence protocols are used to maintain the consistency of this shared memory and to implicitly move data to the requesting cores. This thesis aims to provide fault tolerance against newer models of failure in the organization of memory in datacenter servers. While designing for improved reliability, this thesis explores solutions that can also enhance the performance of applications. The solutions build on modern coherence protocols to achieve these properties. First, we observe that DRAM memory system failure rates have increased, demanding stronger forms of memory reliability. To combat this, the thesis proposes Dvé, a hardware-driven replication mechanism in which data blocks are replicated across two different memory controllers in a cache-coherent NUMA system. Data blocks are accompanied by a code with strong error detection capabilities so that, when an error is detected, correction is performed using the replica. Dvé’s organization offers two independent points of access to data, which enables: (a) strong error correction that can recover from a range of faults affecting any of the components in the memory and (b) higher performance by providing another, nearer point of memory access. Dvé’s coherent replication keeps the replicas in sync for reliability and also provides coherent access to read replicas during fault-free operation for improved performance. Dvé can flexibly provide these benefits on demand at runtime. Next, we observe that the coherence protocol itself needs to be hardened against failures.
Memory in datacenter servers is being disaggregated from the compute servers into dedicated memory servers, driven by standards like CXL. CXL specifies the coherence protocol semantics for compute servers to access and cache data from a shared region in the disaggregated memory. However, the CXL specification lacks the level of fault tolerance necessary to operate at an inter-server scale within the datacenter. Compute servers in the datacenter can fail or become unresponsive, so it is important that the coherence protocol remains available in the presence of such failures. The thesis proposes Āpta, a CXL-based, shared disaggregated memory system that keeps cached data consistent without compromising availability in the face of compute server failures. Āpta architects a high-performance, fault-tolerant, object-granular memory server that significantly improves performance for stateless function-as-a-service (FaaS) datacenter applications.

en

dc.identifier.uri

https://hdl.handle.net/1842/40951

dc.identifier.uri

http://dx.doi.org/10.7488/era/3703

dc.language.iso

en

dc.publisher

The University of Edinburgh

en

dc.relation.hasversion

Dvé: Improving DRAM reliability and performance on-demand via coherent replication Patil, A., Nagarajan, V., Balasubramonian, R. & Oswald, N., 4 Aug 2021, 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture ISCA 2021. IEEE, p. 526-539 14 p. (Proceedings - International Symposium on Computer Architecture; vol. 2021-June).

en

dc.relation.hasversion

Apta: Fault-tolerant object-granular CXL disaggregated memory for accelerating FaaS Patil, A., Nagarajan, V., Nikoleris, N. & Oswald, N., 9 Aug 2023, 2023 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2023). IEEE, p. 201-215 15 p. (IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)).

en

dc.rights

Attribution 4.0 International (CC BY 4.0)

en

dc.rights.uri

https://creativecommons.org/licenses/by/4.0/

en

dc.subject

Memory

en

dc.subject

DRAM

en

dc.subject

Reliability

en

dc.subject

performance

en

dc.subject

Coherence Protocols

en

dc.subject

function-as-a-service

en

dc.subject

Disaggregation

en

dc.subject

Fault Tolerance

en

dc.title

Co-designing reliability and performance for datacenter memory

en

dc.type

Thesis or Dissertation

en

dc.type.qualificationlevel

Doctoral

en

dc.type.qualificationname

PhD Doctor of Philosophy

en

Files

Original bundle

Name: PatilA_2023.pdf
Size: 3.28 MB
Format: Adobe Portable Document Format

This item appears in the following Collection(s)

Informatics thesis and dissertation collection