Addressing variability in reuse prediction for last-level caches
Faldu, Priyank Popatlal
The Last-Level Cache (LLC) accounts for the bulk of a modern processor's transistor budget and is essential for application performance, as it enables fast access to data in contrast to much slower main memory. However, technology constraints make it infeasible to scale LLC capacity to match the ever-increasing working set sizes of applications. Thus, future processors will rely on effective cache management mechanisms and policies to get more performance out of the scarce LLC capacity. Applications with large working sets often exhibit streaming and/or thrashing access patterns at the LLC. As a result, a large fraction of the LLC capacity is occupied by dead blocks that will not be referenced again, leading to inefficient utilization of the LLC capacity. To improve cache efficiency, state-of-the-art cache management techniques employ prediction mechanisms that learn from past access patterns with the aim of accurately identifying as many dead blocks as possible. Once identified, dead blocks are evicted from the LLC to make space for cache blocks with potentially high reuse. In this thesis, we identify variability in the reuse behavior of cache blocks as the key limiting factor in maximizing cache efficiency for state-of-the-art predictive techniques. Variability in reuse prediction is inevitable due to numerous factors that are outside the control of the LLC. The sources of variability include control-flow variation, speculative execution, and contention from cores sharing the cache, among others. This variability makes it difficult for existing techniques to reliably identify the end of a block's useful lifetime, thus lowering prediction accuracy, coverage, or both. To address this challenge, this thesis aims to design robust cache management mechanisms and policies for the LLC that minimize cache misses in the face of variability in reuse prediction, while keeping the cost and complexity of the hardware implementation low.
To that end, we propose two cache management techniques, one domain-agnostic and one domain-specialized, to improve cache efficiency by addressing variability in reuse prediction. In the first part of the thesis, we consider domain-agnostic cache management, the conventional approach in which the LLC is managed fully in hardware and cache management is therefore transparent to the software. In this context, we propose Leeway, a novel domain-agnostic cache management technique. Leeway introduces a new metric, Live Distance, that captures the largest interval of temporal reuse for a cache block, providing a conservative estimate of the block's useful lifetime. Leeway implements a robust prediction mechanism that identifies dead blocks based on their past Live Distance values. Leeway monitors changes in Live Distance values at runtime and dynamically adapts its reuse-aware policies to maximize cache efficiency in the face of variability. In the second part of the thesis, we identify applications for which existing domain-agnostic cache management techniques struggle to exploit the available high reuse because of variability arising from fundamental application characteristics. Specifically, applications from the domain of graph analytics inherently exhibit high reuse when processing natural graphs. However, the reuse pattern is highly irregular and dependent on graph topology: a small fraction of vertices, the hot vertices, exhibit high reuse, whereas a large fraction of vertices exhibit low or no reuse. Moreover, the hot vertices are sparsely distributed in the memory space. Data-dependent irregular access patterns, combined with the sparse distribution of hot vertices, make it difficult for existing domain-agnostic predictive techniques to reliably identify and, in turn, retain hot vertices in the cache, causing severe underutilization of the LLC capacity.
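To make the Live Distance metric from the first part of the thesis more concrete, the following is a minimal sketch under stated assumptions, not Leeway's actual hardware mechanism. It assumes that the interval of temporal reuse is measured as a block's position in the LRU stack of a single cache set at the moment it is hit, and that a block is predicted dead once it moves past its previously observed Live Distance. The function names, the single-set model, and the choice to keep one maximum per block across the whole trace are illustrative assumptions.

```python
def live_distances(accesses, ways):
    """Simulate one LRU-managed cache set and return, for each block,
    the largest LRU stack position at which it was ever hit.

    Assumption: stack position 0 is the most recently used block, and
    a block that is never hit keeps a Live Distance of 0.
    """
    stack = []  # index 0 = MRU, last index = LRU
    live = {}
    for block in accesses:
        if block in stack:
            # Hit: record how deep in the stack the block was reused.
            pos = stack.index(block)
            live[block] = max(live.get(block, 0), pos)
            stack.remove(block)
        else:
            # Miss: evict the LRU block if the set is full.
            if len(stack) == ways:
                stack.pop()
            live.setdefault(block, 0)
        stack.insert(0, block)  # accessed block becomes MRU
    return live


def predicted_dead(stack_position, live_distance):
    """Hypothetical dead-block test: a block that has moved deeper in
    the stack than it was ever reused is assumed to be dead."""
    return stack_position > live_distance


trace = ["A", "B", "A", "C", "D", "B", "E"]
print(live_distances(trace, ways=4))
```

Taking the maximum over past reuse positions is what makes the estimate conservative: a block is only declared dead after it has gone unreferenced for longer than it ever did before a previous reuse.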
In this thesis, we observe that the software is aware of the application's reuse characteristics, which, if passed on to the hardware efficiently, can help the hardware reliably identify the most useful working set even amidst irregular access patterns. To that end, we propose a holistic software-hardware co-design approach to effectively manage the LLC for the domain of graph analytics. Our software component implements a novel lightweight technique, called Degree-Based Grouping (DBG), that applies a coarse-grain graph reordering to segregate hot vertices in a contiguous memory region and thereby improve spatial locality. Meanwhile, our hardware component implements a novel domain-specialized cache management technique, called Graph Specialized Cache Management (GRASP). GRASP augments existing cache policies to maximize reuse of hot vertices by protecting them against cache thrashing, while maintaining sufficient flexibility to capture the reuse of other vertices as needed. To reliably identify hot vertices amidst irregular access patterns, GRASP leverages the DBG-enabled contiguity of hot vertices. Our domain-specialized cache management not only outperforms state-of-the-art domain-agnostic predictive techniques, but also eliminates the need for any storage-intensive prediction mechanisms.
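The coarse-grain reordering idea behind DBG can be illustrated with a short sketch. This is a simplified illustration under assumptions, not the thesis's implementation: it assumes vertices are binned into a few groups by degree thresholds, that groups are laid out from hottest to coldest, and that the original relative order is kept within each group. The function name `dbg_reorder` and the threshold values are hypothetical.

```python
def dbg_reorder(degrees, thresholds):
    """Return new_id, where new_id[v] is the new ID of old vertex v.

    degrees:    degree of each vertex, indexed by old vertex ID.
    thresholds: descending lower bounds, one per group; the last must
                be 0 so that every vertex falls into some group.
    """
    if not thresholds or thresholds[-1] != 0:
        raise ValueError("last threshold must be 0")

    # Bin each vertex into the first (hottest) group whose bound it meets.
    groups = [[] for _ in thresholds]
    for v, degree in enumerate(degrees):
        for g, lower_bound in enumerate(thresholds):
            if degree >= lower_bound:
                groups[g].append(v)  # preserves original order in group
                break

    # Assign contiguous new IDs group by group, hottest group first.
    new_id = [0] * len(degrees)
    next_id = 0
    for group in groups:
        for v in group:
            new_id[v] = next_id
            next_id += 1
    return new_id


degrees = [1, 9, 2, 8, 0, 3]
print(dbg_reorder(degrees, thresholds=[8, 2, 0]))
```

In this example the two highest-degree vertices (old IDs 1 and 3) receive new IDs 0 and 1, so they occupy a contiguous prefix of the vertex array. A contiguous hot region of this kind is what would allow hardware to recognize hot vertices by address range rather than by a learned predictor.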