Reactive Scheduling of DAG Applications on Heterogeneous and Dynamic Distributed Computing Systems
Hernandez, Jesus Israel
Emerging technologies enable a set of distributed resources across a network to be linked together and used in a coordinated fashion to solve a particular parallel application at the same time. Such applications are often abstracted as directed acyclic graphs (DAGs), in which vertices represent application tasks and edges represent data dependencies between tasks. Effective scheduling mechanisms for DAG applications are essential to exploit the tremendous potential of computational resources. The core issues are that the availability and performance of resources, which are already by their nature heterogeneous, can be expected to vary dynamically, even during the course of an execution. In this thesis, we first consider the problem of scheduling DAG task graphs onto heterogeneous resources with changeable capabilities. We propose a list-scheduling heuristic approach, the Global Task Positioning (GTP) scheduling method, which addresses the problem by allowing rescheduling and migration of tasks in response to significant variations in resource characteristics. We observed from experiments with GTP that in an execution with relatively frequent migration, it may be that, over time, the results of some task have been copied to several other sites, and so a subsequent migrated task may have several possible sources for each of its inputs. Some of these copies may now be more quickly accessible than the original, due to dynamic variations in communication capabilities. To exploit this observation, we extended our model with a Copying Management(CM) function, resulting in a new version, the Global Task Positioning with copying facilities (GTP/c) system. The idea is to reuse such copies, in subsequent migration of placed tasks, in order to reduce the impact of migration cost on makespan. Finally, we believe that fault tolerance is an important issue in heterogeneous and dynamic computational environments as the availability of resources cannot be guaranteed. To address the problem of processor failure, we propose a rewinding mechanism which rewinds the progress of the application to a previous state, thereby preserving the execution in spite of the failed processor(s). We evaluate our mechanisms through simulation, since this allow us to generate repeatable patterns of resource performance variation. We use a standard benchmark set of DAGs, comparing performance against that of competing algorithms from the scheduling literature.