Characterizing and exploiting application behavior under data corruption
Shrinking semiconductor technologies come at the cost of higher susceptibility to hardware faults, which render systems unreliable. Traditionally, reliability solutions aim to protect all hardware parts of a system equally and exhaustively, in order to maintain the illusion of correctly operating hardware. As increasing error rates drive up reliability costs, this approach is no longer sustainable. Hardware faults can, however, be masked at various levels of the system, so not all hardware faults lead to the same outcome for an application’s execution. Motivated by this observation, we propose a shift to vulnerability-driven unequal protection of a given structure (or of same-level structures), where the less-vulnerable parts of a structure are protected less than their more-vulnerable counterparts. To that end, in this thesis we quantitatively investigate how the effect of hardware-induced data corruptions on application behavior varies. We develop a portable software-implemented fault-injection (SWIFI) tool. In addition to performing single-bit fault injections to capture their effects on application behavior, the tool is data-level aware and tracks the corrupted data to obtain more of their characteristics. This enables us to analyze the effects of single-bit data corruptions in relation to the characteristics of the corrupted data and the executing workload. After an extensive set of fault-injection experiments on programs from the NAS Parallel Benchmarks suite, we obtain detailed insight into how the vulnerability varies, among other factors, across application data types and across bit locations within the data. The results show that we can characterize the vulnerability of data based on their high-level characteristics (e.g. usage type, size, user and memory space location). Moreover, we conclude that application data are vulnerable in parts.
Together, these results show that there is potential in exploiting application behavior under data corruption. Exhaustive equal protection can be avoided by safely shifting to vulnerability-driven unequal protection within given structures, which can reduce reliability overheads without a significant impact on fault coverage. To that end, we demonstrate the potential benefits of exploiting the varying vulnerability characteristics of application data in the case of a data cache.
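As a small illustration of the single-bit data corruptions discussed above, the following sketch flips one bit in the IEEE 754 binary64 encoding of a floating-point value. It is not the SWIFI tool developed in this thesis; the function name `flip_bit` and the choice of a 64-bit double as the corrupted datum are assumptions made only for this example.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with a single bit of its binary64 encoding inverted.

    Hypothetical helper for illustration only. Bit 0 is the least
    significant mantissa bit, bits 52-62 hold the exponent, and bit 63
    is the sign.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range [0, 63]")
    # Reinterpret the double as an unsigned 64-bit integer.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    # Invert exactly one bit, then reinterpret the result as a double.
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return corrupted


if __name__ == "__main__":
    original = 1.0
    # Inject the same kind of fault at a few different bit locations.
    for bit in (0, 52, 62, 63):
        print(f"bit {bit:2d}: {original!r} -> {flip_bit(original, bit)!r}")
```

For the value 1.0, a flip in the lowest mantissa bit changes the result only by the smallest representable step, whereas a flip in the top exponent bit yields infinity. This shows, for one value only, how the location of a corrupted bit can change the magnitude of the resulting error.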