Improving address translation performance in virtualized multi-tenant systems
Authors
Margaritov, Artemiy
Abstract
With the explosive growth in dataset sizes, application memory footprints are commonly reaching hundreds of GBs. Such huge datasets pressure the TLBs, resulting
in frequent misses that must be resolved through a page walk – a long-latency pointer
chase through multiple levels of the in-memory radix-tree-based page table. Page walk
latency is particularly high under virtualization where address translation mandates traversing two radix-tree page tables in a process called a nested page walk, performing
up to 24 memory accesses. Page walk latency can also be amplified by the colocation of applications on the same server, a practice widely used to increase utilization. Under colocation, cache contention makes cache misses during a
nested page walk more frequent, further inflating page walk latency. Both virtualization and
colocation are widely adopted in cloud platforms, such as Amazon Web Services and
Google Compute Engine. As a result, in cloud environments, page walk latency can
reach hundreds of cycles, significantly reducing overall application performance.
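The "up to 24 memory accesses" figure follows from the two-dimensional structure of a nested walk: every guest-physical pointer produced by the guest walk must itself be translated by a full host walk. A minimal sketch of this arithmetic (the function name is illustrative, not from the thesis):

```python
def nested_walk_accesses(guest_levels=4, host_levels=4):
    """Count memory accesses in a 2D (nested) page walk.

    Each of the guest walk's pointers (guest_levels page-table
    entries plus the final guest-physical data address) is a
    guest-physical address that requires a full host walk of
    host_levels accesses before the entry itself can be read.
    The final data access is not counted as a translation access.
    """
    return (guest_levels + 1) * (host_levels + 1) - 1

# With the 4-level radix trees used by x86-64:
print(nested_walk_accesses())      # 24
# With 5-level paging in both guest and host:
print(nested_walk_accesses(5, 5))  # 35
```

The quadratic growth in this formula explains why deeper page tables make virtualized translation disproportionately more expensive than native translation.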
This thesis addresses the problem of high page walk latency by (1) identifying
the sources of high page walk latency under virtualization and/or colocation, and
(2) proposing hardware and software techniques that accelerate page walks by means
of new memory allocation strategies for the page table and data, which can be easily
adopted by existing systems.
Firstly, we quantify how dataset size growth, virtualization, and colocation affect page walk latency. We also study how high page walk latency affects performance. Due to the lack of dedicated tools for evaluating address translation overhead
on modern processors, we design a methodology to vary the page walk latency experienced by an application running on real hardware. To quantify the performance impact
of address translation, we measure the application’s execution time while varying the
page walk latency. We find that under virtualization, address translation considerably
limits performance: an application can waste up to 68% of execution time due to stalls
originating from page walks. In addition, we investigate which accesses from a nested
page walk are most significant for the overall page walk latency by examining from
where in the memory hierarchy these accesses are served. We find that accesses to the
deeper levels of the page table radix tree are responsible for most of the overall page
walk latency.
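The reason deeper levels dominate is that a radix-tree walk is a serialized pointer chase: the index into each level is known from the virtual address, but the *location* of each level's node is only revealed by the previous access. A toy model of a native 4-level walk (x86-64-style 9-bit indices; names and the dict-based tree are illustrative, not a real page-table API):

```python
def level_indices(vaddr):
    """Split a 48-bit virtual address into the four 9-bit
    radix-tree indices, highest level first."""
    return [(vaddr >> shift) & 0x1FF for shift in (39, 30, 21, 12)]

def walk(root, vaddr):
    """Sequential pointer chase: each level's lookup depends on
    the pointer read at the previous level, so the four memory
    accesses are fully serialized."""
    node = root
    for idx in level_indices(vaddr):
        node = node[idx]   # one dependent memory access per level
    return node            # physical frame for vaddr
```

Because each `node[idx]` depends on the previous one, a cache miss at any level stalls the entire chain, and misses at the deeper (leaf-side) levels are the most frequent in practice.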
Based on these observations, we introduce two address translation acceleration
techniques that can be applied to any ISA that employs radix-tree page tables and
nested page walks. The first of these techniques is Prefetched Address Translation
(ASAP), a new software-hardware approach for mitigating the high page walk latency
caused by virtualization and/or application colocation. At the heart of ASAP is a
lightweight technique for directly indexing individual levels of the page table radix
tree. Direct indexing enables ASAP to fetch nodes from deeper levels of the page
table without first accessing the preceding levels, thus lowering the page walk latency.
ASAP is fully compatible with the existing radix-tree-based page table and requires
only incremental and isolated changes to the memory subsystem.
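The key property that direct indexing exploits can be sketched as follows: if each page-table level is stored contiguously, the offset of the relevant entry within every level is a pure function of the virtual address, so all level fetches can be issued in parallel rather than waiting for each pointer. This is a sketch of the idea under that contiguity assumption, not the thesis's hardware design:

```python
LEVEL_SHIFTS = (39, 30, 21, 12)   # x86-64-style 9-bit levels

def direct_entry_indices(vaddr):
    """Index of the entry for vaddr within each level, assuming
    each level is one flat contiguous array: level k is indexed
    by the top 9*(k+1) bits of the virtual page number."""
    return [(vaddr >> shift) & ((1 << (9 * (k + 1))) - 1)
            for k, shift in enumerate(LEVEL_SHIFTS)]

# Given a (hypothetical) base address per level, the entry address
# at every level is base[k] + 8 * direct_entry_indices(vaddr)[k],
# computable without reading any pointer -- so a prefetcher can
# issue all four fetches concurrently.
```

Each direct index extends the previous one by nine bits, which is exactly the consistency a sequential walk would enforce one pointer at a time.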
The second technique is PTEMagnet, a new software-only approach for reducing
address translation latency under virtualization and application colocation. First,
we identify a new address translation bottleneck caused by memory fragmentation
stemming from the interaction of virtualization, application colocation, and the Linux
memory allocator. The fragmentation results in the effective cache footprint of the
host page table being larger than that of the guest page table. The bloated footprint
of the host page table leads to frequent cache misses during nested page walks, increasing page walk latency. In response to these observations, we propose PTEMagnet. PTEMagnet prevents memory fragmentation through fine-grained reservation-based
memory allocation in the guest OS. PTEMagnet is fully legacy-preserving, requiring
no modifications to either user code or the mechanisms for address translation and virtualization.
In summary, this thesis proposes non-disruptive upgrades to the virtual memory
subsystem for reducing page walk latency in virtualized deployments. In doing so,
this thesis evaluates the impact of page walk latency on application performance, identifies the bottlenecks in the existing address translation mechanism caused
by virtualization, application colocation, and the Linux memory allocator, and proposes software-hardware and software-only solutions for eliminating these bottlenecks.