Analysis and parameter prediction of compiler transformation for graphics processors
Date: 27/06/2016
Author: Magni, Alberto
Abstract
In the last decade graphics processors (GPUs) have been used extensively to solve computationally
intensive problems. A variety of GPU architectures by different hardware manufacturers
have been shipped within the span of a few years. OpenCL has been introduced as the standard
cross-vendor programming framework for GPU computing. Writing and optimising OpenCL
applications is a challenging task: the programmer has to manage several low-level details.
The task becomes even harder when the goal is to improve performance on a wide range of
devices, because OpenCL does not guarantee performance portability.
In this thesis we focus on the analysis and the portability of compiler optimisations. We
describe the implementation of a portable compiler transformation: thread-coarsening. The
transformation increases the amount of work carried out by a single thread running on the GPU,
with the goal of reducing the number of redundant instructions executed by the parallel application.
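To make the effect of the transformation concrete, here is a minimal sketch of thread-coarsening applied to a toy OpenCL kernel. The kernel, the coarsening factor of 2, and the adjacent-thread merging are illustrative choices, not examples taken from the thesis.

    // Original kernel: one work-item computes one output element.
    __kernel void vec_add(__global const float *a,
                          __global const float *b,
                          __global float *c) {
        int i = get_global_id(0);
        c[i] = a[i] + b[i];
    }

    // After coarsening by a factor of 2, merging adjacent work-items:
    // each thread computes two elements, so the kernel is launched with
    // half as many work-items and per-thread overheads (index
    // computation, scheduling) are amortised over more useful work.
    __kernel void vec_add_x2(__global const float *a,
                             __global const float *b,
                             __global float *c) {
        int i = 2 * get_global_id(0);
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
    }

In the duplicated body, instructions that compute the same value for all the merged threads need not be replicated, which is where the reduction in redundant instructions comes from.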
The first contribution is a technique to analyse the performance improvements and degradations
caused by the transformation: we study how hardware performance counters change when
coarsening is applied. In this way we identify the root causes of the execution time variations
due to coarsening.
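As an illustration of the kind of comparison involved, the C sketch below contrasts counter readings between a baseline kernel and its coarsened version; the counter names and values are invented for the example, not measurements from the thesis.

    #include <stdio.h>

    /* Relative change of a hardware counter between a baseline kernel
     * and its coarsened version. A drop in an instruction counter
     * alongside a speedup matches the "fewer redundant instructions"
     * explanation; a rise in, say, a cache-miss counter alongside a
     * slowdown points at that counter as a likely root cause. */
    double relative_delta(double baseline, double coarsened) {
        return (coarsened - baseline) / baseline;
    }

    int main(void) {
        /* Hypothetical readings for one benchmark. */
        double insts_base = 4.0e9, insts_coarse = 2.6e9;
        double misses_base = 1.2e7, misses_coarse = 1.9e7;
        printf("instructions executed: %+.1f%%\n",
               100.0 * relative_delta(insts_base, insts_coarse));
        printf("cache misses:          %+.1f%%\n",
               100.0 * relative_delta(misses_base, misses_coarse));
        return 0;
    }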
The second contribution is a study of the relative performance of coarsening over multiple
input sizes. We show that the speedups given by coarsening are stable for problem sizes larger
than a threshold that we call the saturation point. We exploit the existence of the saturation
point to speed up iterative compilation.
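The sketch below shows one way a saturation point can cut the cost of iterative compilation: candidate coarsening factors are timed at the saturation input size rather than at the full problem size, relying on the stability property stated above. The helper names, the candidate factors, and the dummy timer are all hypothetical.

    #include <stdio.h>
    #include <stddef.h>

    /* timer_fn: hypothetical callback that compiles the kernel with the
     * given coarsening factor, runs it on `size` input elements, and
     * returns the runtime in milliseconds. */
    typedef double (*timer_fn)(int factor, size_t size);

    /* Time every candidate factor once, at the saturation input size
     * only; by the stability property, the winner is expected to stay
     * the winner for all larger problem sizes. */
    int best_factor(const int *factors, int n,
                    size_t saturation_size, timer_fn timer) {
        int best = factors[0];
        double best_time = timer(factors[0], saturation_size);
        for (int i = 1; i < n; i++) {
            double t = timer(factors[i], saturation_size);
            if (t < best_time) { best_time = t; best = factors[i]; }
        }
        return best;
    }

    /* Dummy timer standing in for a real compile-and-run cycle. */
    static double fake_timer(int factor, size_t size) {
        (void)size;
        return factor == 4 ? 1.0 : 2.0;  /* pretend factor 4 wins */
    }

    int main(void) {
        int factors[] = {1, 2, 4, 8, 16, 32};
        printf("best factor: %d\n",
               best_factor(factors, 6, 1 << 20, fake_timer));
        return 0;
    }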
The last contribution of the work is a machine learning technique that automatically
selects a coarsening configuration that improves performance. The technique is based on an
iterative model built using a neural network. The network is trained once per GPU model and
is then used across several programs. To demonstrate the flexibility of our techniques, all our
experiments have been run on multiple GPU models by different vendors.
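One plausible shape for such an iterative model is sketched below: a trained predictor is queried repeatedly on the features of the current kernel to decide whether one more coarsening step is worthwhile. The feature vector, the power-of-two schedule, and the callback interface are assumptions made for illustration, not the thesis's actual design.

    /* Hypothetical code features summarising the kernel; the feature
     * vector shown here is illustrative only. */
    typedef struct { double features[8]; } KernelFeatures;

    /* Stand-in for the trained neural network: given the features of
     * the current kernel, predict whether coarsening once more is
     * expected to pay off. */
    typedef int (*coarsen_predictor)(const KernelFeatures *f);

    /* Iterative selection: repeatedly ask the model whether to coarsen
     * again, re-deriving features after each coarsening step, until the
     * model says stop or a maximum factor is reached. */
    int select_factor(KernelFeatures f,
                      coarsen_predictor keep_coarsening,
                      void (*coarsen_and_refresh)(KernelFeatures *f),
                      int max_factor) {
        int factor = 1;
        while (factor * 2 <= max_factor && keep_coarsening(&f)) {
            coarsen_and_refresh(&f);  /* apply one coarsening step */
            factor *= 2;
        }
        return factor;
    }

Because the predictor only sees program features, not measured runtimes, a model of this shape can be trained once per GPU and then applied to unseen programs without further profiling.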