Lightweight speculative support for aggressive auto-parallelisation tools
Date
29/06/2015
Author
Powell, Daniel Christopher
Abstract
With the recent move to multi-core architectures it has become important to create the
means to exploit the performance made available to us by these architectures. Unfortunately,
parallel programming is often a difficult and time-intensive process, even
for expert programmers. Auto-parallelisation tools have aimed to fill the performance
gap this has created, but the static analysis commonly employed by such tools is unable
to provide the required performance improvements due to a lack of information at
compile time. More recent aggressive parallelisation tools use profiled execution to
discover new parallel opportunities, but these tools are inherently unsafe. They either
require manual confirmation that their changes are safe, which rules out fully automatic parallelisation,
or rely upon speculative execution such as software thread-level
speculation (SW-TLS) to confirm safe execution at runtime.
SW-TLS schemes are currently very heavyweight and often fail to provide speedups for
a program. Performance gains are dependent upon suitable parallel opportunities, correct
selection and configuration, and appropriate execution platforms. Little research
has been completed into the automated implementation of SW-TLS programs.
This thesis presents an automated, machine-learning based technique to select and configure
suitable speculation schemes when appropriate. This is performed by extracting
metrics from potential parallel opportunities and using them to determine if a loop is
suitable for speculative execution and if so, which speculation policy should be used.
An extensive evaluation of this technique is presented, verifying that SW-TLS configuration
can indeed be automated and provide reliable performance gains. This work
has shown that on an 8-core machine, speedups of up to 7.75X, with a geometric mean of 1.64X,
can be obtained through automatic configuration, providing on average 74%
of the speedup obtainable through manual configuration.
Beyond automated configuration, this thesis explores the idea that many SW-TLS
schemes focus too heavily on recovery after a dependence violation is detected. Doing
so often results in worse-than-sequential performance for many real-world applications.
This work therefore hypothesises that many highly likely parallel candidates,
discovered through aggressive parallelisation techniques, would benefit from a simple
dependence check without the ability to roll back. Dependence violations become
extremely expensive in this scenario, but they are expected to be extremely rare. A
thorough evaluation of the technique confirms the hypothesis, achieving
speedups of up to 22.53X and a geometric mean of 2.16X on a 32-core machine.
In a competitive scheduling scenario, performance loss can be limited so that execution is
no slower than sequential, even when a dependence has been detected.
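
As a rough illustration of the idea in the paragraph above, the following Python sketch pairs a check-only speculative loop (no rollback state is kept) with a sequential run of the same loop, and uses the speculative result only when no cross-iteration dependence is detected. It is not the scheme implemented in the thesis: the loop body, the function names (`sequential`, `speculative`, `run_competitively`) and the `index_map` input are all hypothetical, and Python threads are used only to show the control flow, not to obtain real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def sequential(data, index_map):
    """Reference loop: iteration i reads location i and writes location index_map[i]."""
    out = list(data)
    for i, j in enumerate(index_map):
        out[j] = out[i] + 1
    return out


def speculative(data, index_map):
    """Run every iteration as if it were independent, with a check but no rollback.

    Returns the result if no cross-iteration dependence is found, otherwise None.
    (Hypothetical helper for illustration only.)
    """
    out = list(data)
    writer = {}  # location -> iteration that wrote it
    for i, j in enumerate(index_map):
        if j in writer:
            return None  # two iterations write the same location
        writer[j] = i
        out[j] = data[i] + 1  # reads the original snapshot, as a parallel run would
    for i in range(len(index_map)):
        w = writer.get(i)
        if w is not None and w != i:
            return None  # iteration w wrote the location that iteration i reads
    return out


def run_competitively(data, index_map):
    """Start both versions; keep the speculative result only if its check passed."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        spec = pool.submit(speculative, data, index_map)
        seq = pool.submit(sequential, data, index_map)
        result = spec.result()
        return result if result is not None else seq.result()


if __name__ == "__main__":
    print(run_competitively([10, 20, 30, 40], [0, 1, 2, 3]))  # independent iterations
    print(run_competitively([10, 20, 30, 40], [1, 2, 3, 0]))  # dependence, sequential result used
```

Because the sequential version always runs alongside the speculative one, a detected dependence costs no more than falling back to its result, which mirrors the bound described above under these simplifying assumptions.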
As a means to lower costs further, this thesis explores other platforms to aid in the execution
of speculative error checking. It introduces the use of a GPU to offload some of
these costs during execution, and confirms that using an auxiliary device is a legitimate
means to obtain further speedup. Evaluation demonstrates that doing so can achieve
up to 14.74X and a geometric mean of 1.99X speedup on a 12-core hyperthreaded machine.
Compared to standard CPU-only techniques this performs slightly slower, with
a geometric mean of 0.96X speedup; however, this is likely to improve with upcoming
GPU designs.
With the knowledge that GPUs can be used to reduce speculation costs, this thesis
also investigates their use to speculatively improve execution times. Presented
is a novel SW-TLS scheme that targets GPU-based execution for use with aggressive
auto-parallelisers. This scheme is executed using a competitive scheduling model, ensuring
performance is no lower than sequential execution, whilst being able to provide
speedups of up to 99X and on average 3.2X over sequential. On average this technique
outperformed static analysis alone by a factor of 7X, achieved approximately 99%
of the speedup obtained from manual parallel implementations, and outperformed the
state-of-the-art in GPU SW-TLS by a factor of 1.45.