Programmer-transparent efficient parallelism with skeletons
Authors
Metzger, Paul
Abstract
Parallel and heterogeneous systems are ubiquitous. Unfortunately, both require significant complexity at the software level, to the detriment of programmer productivity. To
produce correct and efficient code, programmers not only have to manage synchronisation and communication but also be aware of low-level hardware details. It is foreseeable that the problem will worsen because systems are becoming increasingly parallel and
heterogeneous.
Building on earlier work, this thesis further investigates the contribution which
algorithmic skeletons can make towards solving this problem. Skeletons are high-level
abstractions for typical parallel computations. They hide low-level hardware details
from programmers and, in addition, encode information about the computations that
they implement, which runtime systems and library developers can use for automatic
optimisations. We present two novel case studies in this respect.
First, we provide scheduling flexibility on heterogeneous CPU + GPU systems in
a programmer-transparent way, similar to the freedom OS schedulers have on CPUs.
Thanks to the high-level nature of skeletons we automatically switch between CPU and
GPU implementations of kernels and use semantic information encoded in skeletons to
find execution time points at which switches can occur. In more detail, kernel iteration
spaces are processed in slices and migration is considered on a slice-by-slice basis. We
show that slice-size choices that introduce negligible overheads can be learned by predictive models. We also show that, in a simple deployment scenario, mid-kernel migration
achieves speedups of up to 1.30x, and 1.08x on average. Our mechanism introduces
a negligible overhead of 2.34% if a kernel does not actually migrate.
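
The slice-by-slice idea can be illustrated with a minimal sketch. It is not the thesis implementation: the function name sliced_map, the should_migrate callback, and the choice to start on the CPU are all assumptions made for illustration, and both "devices" are simulated by ordinary Python functions. It only shows why a map skeleton offers natural migration points: elements are independent, so every slice boundary is a safe place to switch.

```python
from typing import Callable, List, Sequence, Tuple


def sliced_map(
    kernel_cpu: Callable[[float], float],
    kernel_gpu: Callable[[float], float],
    data: Sequence[float],
    slice_size: int,
    should_migrate: Callable[[str, int], bool],
) -> Tuple[List[float], List[str]]:
    """Hypothetical map skeleton that processes its iteration space in slices.

    Before each slice, should_migrate(device, start_index) is consulted;
    if it returns True, the remaining slices run on the other device.
    Returns the results and a trace of which device ran each slice.
    """
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    device = "cpu"  # assumption: execution starts on the CPU
    out: List[float] = []
    trace: List[str] = []
    for start in range(0, len(data), slice_size):
        # Slice boundary: map elements are independent, so switching here
        # cannot change the result.
        if should_migrate(device, start):
            device = "gpu" if device == "cpu" else "cpu"
        kernel = kernel_cpu if device == "cpu" else kernel_gpu
        out.extend(kernel(x) for x in data[start:start + slice_size])
        trace.append(device)
    return out, trace
```

In this sketch, smaller slices give more frequent migration opportunities at the cost of more boundary checks, which is the trade-off that makes the slice size worth tuning.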
Second, we propose skeletons to simplify the programming of parallel hard real-time systems. We combine information encoded in task farms with analysis of real-time
user code to automatically choose thread counts and an optimisation parameter
related to farm-internal communication. Both parameters are chosen so that real-time
deadlines are met with minimum resource usage. We show that our approach achieves
1.22x speedup over unoptimised code, selects the best parameter settings in 83% of
cases, and never chooses parameters that cause deadline misses.
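
As a rough illustration of choosing farm parameters against a deadline, the sketch below searches for the smallest worker count that meets it. Everything in it is an assumption made for illustration: the abstract does not name the communication parameter, so a hypothetical batch size is used, and the timing bound in farm_bound is a deliberately simple stand-in model, not the analysis used in the thesis.

```python
import math
from typing import Optional, Sequence, Tuple


def farm_bound(n_tasks: int, wcet: float, workers: int,
               batch: int, dispatch_cost: float) -> float:
    """Hypothetical completion-time bound for a task farm.

    Tasks are handed to workers in batches; each batch costs a fixed
    dispatch overhead plus batch * wcet of computation, and each worker
    handles ceil(batches / workers) batches one after another.
    """
    batches = math.ceil(n_tasks / batch)
    rounds = math.ceil(batches / workers)
    return rounds * (dispatch_cost + batch * wcet)


def choose_parameters(n_tasks: int, wcet: float, deadline: float,
                      max_workers: int, batch_sizes: Sequence[int],
                      dispatch_cost: float) -> Optional[Tuple[int, int, float]]:
    """Return (workers, batch, bound) using the fewest workers that meet
    the deadline under the model above, or None if no setting does."""
    for workers in range(1, max_workers + 1):
        feasible = []
        for batch in batch_sizes:
            bound = farm_bound(n_tasks, wcet, workers, batch, dispatch_cost)
            if bound <= deadline:
                feasible.append((bound, batch))
        if feasible:
            # Among settings with this worker count, prefer the lowest bound.
            bound, batch = min(feasible)
            return workers, batch, bound
    return None
```

Because worker counts are tried in increasing order and only settings whose bound is within the deadline are accepted, the sketch returns a minimum-resource choice that never violates the deadline under its own model.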

