Characterizing and exploiting application behavior under data corruption
Date
26/11/2015
Author
Stefanakis, Georgios
Abstract
Shrinking semiconductor technologies come at the cost of higher susceptibility to hardware
faults, which render systems unreliable. Traditionally, reliability solutions aim to
protect all hardware parts of a system equally and exhaustively, in order to maintain
the illusion of correctly operating hardware. Because increasing error rates raise
reliability costs, this approach is no longer sustainable.
Hardware faults can be masked by fault-masking effects at various levels. Therefore,
not all hardware faults lead to the same outcome in an application’s execution.
Motivated by this fact, we propose a shift to vulnerability-driven unequal protection
of a given structure (or same-level structures), where the less-vulnerable parts of a
structure are protected less than their more-vulnerable counterparts.
To that end, in this thesis we quantitatively investigate how the effect of
hardware-induced data corruptions on application behavior varies. We develop a portable
software-implemented fault-injection (SWIFI) tool. In addition to performing single-bit
fault injections to capture their effects on application behavior, our tool is data-level
aware and tracks the corrupted data to obtain more of their characteristics. This
enables us to analyze the effects of single-bit data corruptions in relation to the
characteristics of the corrupted data and the executing workload. After a set of extensive
fault-injection experiments on programs from the NAS Parallel Benchmarks suite, we obtain
detailed insight into how vulnerability varies, among other factors, across application
data types and across bit locations within the data.
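As an illustration of why the bit location within a datum matters, the following is a minimal sketch of a single-bit flip applied to a 64-bit floating-point value. It is not the SWIFI tool described in the thesis; the function names (`flip_bit_float64`, `relative_error`, `sweep_bits`) and the use of relative error as an impact measure are assumptions made for this example only.

```python
import math
import struct


def flip_bit_float64(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 binary64 encoding flipped.

    Bit 0 is the least significant mantissa bit, bits 52-62 are the exponent,
    and bit 63 is the sign. (Hypothetical helper, for illustration only.)
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    # Reinterpret the double as an unsigned 64-bit integer.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    # Flip the selected bit and reinterpret the result as a double.
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return corrupted


def relative_error(golden: float, corrupted: float) -> float:
    """Relative deviation of the corrupted value from the fault-free one."""
    if math.isnan(corrupted) or math.isinf(corrupted):
        return math.inf
    if golden == 0.0:
        return abs(corrupted)
    return abs(corrupted - golden) / abs(golden)


def sweep_bits(value: float) -> list:
    """Relative error caused by flipping each of the 64 bit positions in turn."""
    return [relative_error(value, flip_bit_float64(value, b)) for b in range(64)]


if __name__ == "__main__":
    for bit, err in enumerate(sweep_bits(1.0)):
        print(f"bit {bit:2d}: relative error {err:.3e}")
```

In this sketch, flips in the low mantissa bits perturb the value only slightly, while flips in the exponent or sign bits change it drastically. Whether a given perturbation actually alters application output depends on the workload, which is what fault-injection experiments measure.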
The results show that we can characterize the vulnerability of data based on their
high-level characteristics (e.g. usage type, size, user and memory space location).
Moreover, we conclude that application data are vulnerable in parts. Together, these
findings show that there is potential in exploiting application behavior under data
corruption. Exhaustive equal protection can be avoided by safely shifting to
vulnerability-driven unequal protection within given structures. This can reduce
reliability overheads without a significant impact on fault coverage. To illustrate
this, we demonstrate the potential benefits of exploiting the varying vulnerability
characteristics of application data in the case of a data cache.
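The idea of unequal protection within a structure can be sketched as follows. This abstract does not specify the protection scheme used in the data cache study, so everything here is an assumption for illustration: a single parity bit that covers only the most significant bits of a 64-bit word, with hypothetical names `protected_mask`, `encode_parity`, and `detects_flip`.

```python
def parity(x: int) -> int:
    """Even-parity bit of a non-negative integer (1 if the count of set bits is odd)."""
    return bin(x).count("1") & 1


def protected_mask(protected_bits: int, width: int = 64) -> int:
    """Mask selecting the `protected_bits` most significant bits of a word."""
    if not 0 <= protected_bits <= width:
        raise ValueError("protected_bits must be in the range 0..width")
    return ((1 << protected_bits) - 1) << (width - protected_bits)


def encode_parity(word: int, protected_bits: int, width: int = 64) -> int:
    """Parity computed over the protected (assumed more vulnerable) bits only."""
    return parity(word & protected_mask(protected_bits, width))


def detects_flip(word: int, bit: int, protected_bits: int, width: int = 64) -> bool:
    """True if a single-bit flip at position `bit` is caught by the partial parity."""
    stored = encode_parity(word, protected_bits, width)
    corrupted = word ^ (1 << bit)
    return encode_parity(corrupted, protected_bits, width) != stored


if __name__ == "__main__":
    word = 0x0123456789ABCDEF
    for bit in (0, 31, 47, 48, 63):
        # Only the top 16 bits are covered in this hypothetical configuration.
        print(bit, detects_flip(word, bit, protected_bits=16))
```

With this hypothetical layout, single-bit flips in the covered high-order bits are detected and flips elsewhere are not. The trade-off between coverage and overhead would be set by choosing which bits to cover based on measured vulnerability.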