Efficient cross-architecture simulation of multicore systems
Authors
Kristien, Martin
Abstract
Computer systems are continually becoming more complex and powerful in all areas of computing, from high-end servers to embedded devices. Machine virtualisation has become an instrumental technology for managing these vast hardware resources by separating applications running within virtual machines (guests) from the real hardware (hosts). This form of virtualisation is well supported by modern hardware as long as the guest and host machines' Instruction Set Architectures (ISAs) match. Cross-ISA virtualisation, where the architectures differ, is more challenging, yet it remains an important technology for hardware prototyping and software development.
In this context, Instruction Set Simulators (ISSs) are developed to provide functional cross-architecture simulation. A wide range of techniques can be utilised to achieve high simulation speeds for single-core guest applications. However, multicore support is limited: state-of-the-art tools often trade accuracy for speed. The lack of multicore support is exacerbated if the guest application requires full-system simulation and/or executes dynamically generated code.
This thesis presents three contributions to the accuracy, memory efficiency, and simulation speed of multicore cross-ISA simulation. Firstly, it presents a scalable and provably correct scheme for emulating atomic instructions. Most commonly in cross-ISA simulation, Reduced Instruction Set Computer (RISC) style guest atomics, Load-Linked/Store-Conditional (LL/SC), are emulated on Complex Instruction Set Computer (CISC) style host hardware, which provides a Compare-And-Swap (CAS) atomic instruction. Although the semantics of the RISC and CISC atomics differ, ISSs often emulate LL/SC using CAS instructions for improved performance. However, execution inside the simulator can then diverge from that of the real hardware. The scheme presented in this thesis faithfully emulates the LL/SC semantics while remaining scalable to multicore systems.
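The divergence mentioned above is essentially the classic ABA problem. The following minimal sketch (not the thesis's scheme; the monitor structure and function names are hypothetical) shows the common CAS-based LL/SC emulation and why it can succeed where real hardware would fail:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-core monitor state recorded by the last LL. */
typedef struct {
    _Atomic uint32_t *addr; /* address monitored by the last LL */
    uint32_t          val;  /* value observed by the last LL    */
} ll_monitor_t;

/* Load-Linked: read the value and remember address + value. */
static uint32_t emu_ll(ll_monitor_t *m, _Atomic uint32_t *addr) {
    m->addr = addr;
    m->val  = atomic_load(addr);
    return m->val;
}

/* Store-Conditional emulated with a host CAS: it succeeds whenever the
 * current value equals the value seen by LL, even if another core wrote
 * A -> B -> A in between (the ABA problem). Real LL/SC hardware would
 * fail the SC in that case, so this emulation can diverge. */
static bool emu_sc_cas(ll_monitor_t *m, _Atomic uint32_t *addr, uint32_t newv) {
    if (m->addr != addr) return false;
    uint32_t expected = m->val;
    return atomic_compare_exchange_strong(addr, &expected, newv);
}
```

A faithful scheme must instead fail the SC on any intervening write to the monitored location, which is what makes a provably correct, scalable design non-trivial.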
Efficient use of simulator memory is especially important for interpreter-based ISSs, which enable quick prototyping without extensive engineering effort and easy integration with instrumentation, profiling, and debugging tools. However, the computational overheads of the Fetch and Decode stages in instruction interpretation significantly increase the overall simulation time. This thesis proposes a number of memory-efficient caching strategies with a focus on memory sharing among multiple simulated cores. The novel schemes exhibit up to 1.57× speedup relative to a state-of-the-art baseline scheme, while requiring only 27% of its cache memory.
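The core idea of sharing decode-cache memory among simulated cores can be sketched as follows. This is an illustrative, single-threaded simplification (the entry layout, sizes, and names are assumptions, and the synchronisation a real multicore simulator needs is omitted): all cores consult one direct-mapped cache keyed by guest PC, so each static instruction is decoded once rather than once per core.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical decoded-instruction record. */
typedef struct {
    uint64_t pc;      /* guest PC acting as the tag        */
    uint32_t opcode;  /* raw instruction word              */
    int      handler; /* index of the interpreter handler  */
    int      valid;
} decoded_insn_t;

#define DCACHE_ENTRIES 4096 /* power of two for cheap indexing */

/* One decode cache shared by all simulated cores: cores running the
 * same guest code reuse each other's decode work, cutting memory
 * roughly by the core count compared to per-core caches. */
static decoded_insn_t shared_dcache[DCACHE_ENTRIES];

static decoded_insn_t *dcache_lookup(uint64_t pc) {
    decoded_insn_t *e = &shared_dcache[(pc >> 2) & (DCACHE_ENTRIES - 1)];
    return (e->valid && e->pc == pc) ? e : NULL;
}

static decoded_insn_t *dcache_fill(uint64_t pc, uint32_t opcode, int handler) {
    decoded_insn_t *e = &shared_dcache[(pc >> 2) & (DCACHE_ENTRIES - 1)];
    e->pc = pc; e->opcode = opcode; e->handler = handler; e->valid = 1;
    return e;
}
```

A miss falls back to the slow Fetch/Decode path and fills the entry; a hit dispatches directly to the cached handler.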
Dynamic Binary Translation typically translates and caches multiple guest instructions as a unit, resulting in faster simulation. However, code cache maintenance has to account for guest applications modifying their own instructions, necessitating invalidation of cached code. In most ISSs this maintenance mechanism is falsely triggered by applications dynamically generating code in proximity to previously cached code, resulting in needless code invalidation and poor performance. This thesis proposes an improved code tracking scheme that allows optimised guest code execution even when the guest interleaves data and code, achieving up to 1.42× speedup relative to a state-of-the-art page-granular code protection mechanism.
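One way to picture the difference from page-granular protection is sub-page code tracking. The sketch below is an illustrative simplification, not the thesis's scheme; the 64-byte region size and names are assumptions. A per-page bitmap marks which regions hold translated code, so a guest write to data on the same page no longer forces invalidation:

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SIZE   4096
#define REGION_SIZE 64 /* hypothetical sub-page tracking granularity */
#define REGIONS_PER_PAGE (PAGE_SIZE / REGION_SIZE)

/* Per-page bitmap: one bit per 64-byte region containing translated
 * guest code. Page-granular schemes invalidate on ANY write to the
 * page; here only writes to marked regions trigger invalidation. */
typedef struct {
    uint8_t code_bitmap[REGIONS_PER_PAGE / 8];
} page_code_map_t;

/* Record that a translation covers the region at this page offset. */
static void mark_code(page_code_map_t *p, unsigned offset) {
    unsigned r = offset / REGION_SIZE;
    p->code_bitmap[r / 8] |= (uint8_t)(1u << (r % 8));
}

/* Called on a guest write to a page that contains translations:
 * returns true only if the write lands in a region with cached code. */
static bool write_hits_code(const page_code_map_t *p, unsigned offset) {
    unsigned r = offset / REGION_SIZE;
    return (p->code_bitmap[r / 8] >> (r % 8)) & 1u;
}
```

With such tracking, dynamically generated data adjacent to hot code keeps its translations intact instead of being needlessly flushed.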