Design and implementation of a telemetry platform for high-performance computing environments
A new generation of high-performance and distributed computing applications and services relies on adaptive and dynamic architectures and execution strategies to run efficiently, resiliently, and at scale in today’s HPC environments. To make context-aware decisions, these architectures require insight into their execution behaviour and the state of their execution environment at various levels of detail. HPC telemetry provides this information: it is the continuous stream of time-series and event data generated on HPC systems by the hardware, operating systems, services, runtime systems, and applications. Current HPC ecosystems do not provide the conceptual models, infrastructure, and interfaces needed to collect, store, analyse, and integrate telemetry in a structured and efficient way. Consequently, applications and services largely depend on one-off solutions and custom-built technologies to achieve these goals, which introduces significant development overhead and inhibits portability and mobility. To facilitate a broader mix of applications, more efficient application development, and swift adoption of adaptive architectures in production, a comprehensive framework for telemetry management and analysis must be provided as part of future HPC ecosystem designs. This thesis provides the blueprint for such a framework by proposing a new approach to telemetry management in HPC: the Telemetry Platform concept. Starting from the observation that telemetry data and the corresponding analysis and integration patterns on modern multi-tenant HPC systems closely resemble the patterns observed in large-scale data analytics or “Big Data” platforms, the telemetry platform concept takes the data platform paradigm and architectural approach and applies them to HPC telemetry.
The result is the blueprint for a system that provides services for storing, searching, analysing, and integrating telemetry data in HPC applications and other HPC system services. It allows users to create and share insights driven by telemetry data, using everything from simple time-series analysis to complex statistical and machine learning models. At the same time, it hides many of the inherent complexities of data management, such as data transport, clean-up, storage, cataloguing, and access management, and provides appropriate and scalable analytics and integration capabilities. The main contributions of this research are (1) the application of the data platform concept to HPC telemetry data management and usage; (2) a graph-based, time-variant telemetry data model that captures structures and properties of platform and applications and in which telemetry data can be organised; (3) an architecture blueprint and prototype of a concrete implementation and integration architecture of the telemetry platform; and (4) a proposal for decoupled HPC application architectures that separate telemetry data management and feedback-control-loop logic from the core application code. Initial experimental results with the prototype implementation suggest that the telemetry platform paradigm can reduce overhead and redundancy in the development of telemetry-based application architectures and lower the barrier for HPC systems research and the provisioning of new, innovative HPC system services.
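To make the idea of a graph-based, time-variant telemetry data model (contribution 2) more concrete, the following is a minimal illustrative sketch in Python. It is not the data model defined in this thesis: the class names (`Entity`, `Relation`, `TelemetryGraph`), the relation label `runs_on`, the metric name `cpu_load`, and all values are hypothetical and chosen only for illustration. The sketch assumes that entities and relations each carry a validity interval, and that time-series samples are attached to the entity that produced them.

```python
from dataclasses import dataclass, field
from typing import Optional


def alive_at(valid_from: float, valid_to: Optional[float], t: float) -> bool:
    """True if t lies in the half-open interval [valid_from, valid_to)."""
    return valid_from <= t and (valid_to is None or t < valid_to)


@dataclass
class Entity:
    """A vertex: a platform or application element (hypothetical kinds)."""
    entity_id: str
    kind: str                          # e.g. "compute_node" or "job"
    valid_from: float
    valid_to: Optional[float] = None   # None means "still exists"
    # Time-series samples as (timestamp, metric, value) tuples.
    samples: list = field(default_factory=list)


@dataclass
class Relation:
    """A directed edge that only holds during its validity interval."""
    src: str
    dst: str
    label: str
    valid_from: float
    valid_to: Optional[float] = None


class TelemetryGraph:
    def __init__(self) -> None:
        self.entities: dict = {}
        self.relations: list = []

    def add_entity(self, entity: Entity) -> None:
        self.entities[entity.entity_id] = entity

    def add_relation(self, relation: Relation) -> None:
        self.relations.append(relation)

    def neighbours_at(self, entity_id: str, label: str, t: float) -> list:
        """Targets of `label` edges from `entity_id` that hold at time t."""
        return sorted(
            r.dst
            for r in self.relations
            if r.src == entity_id
            and r.label == label
            and alive_at(r.valid_from, r.valid_to, t)
            and alive_at(self.entities[r.dst].valid_from,
                         self.entities[r.dst].valid_to, t)
        )

    def samples_between(self, entity_id: str, metric: str,
                        start: float, end: float) -> list:
        """Samples of `metric` on one entity with start <= timestamp < end."""
        return [
            (ts, value)
            for ts, m, value in self.entities[entity_id].samples
            if m == metric and start <= ts < end
        ]


# Hypothetical example: one compute node and a job that runs on it
# between t=10 and t=20 (arbitrary time units).
graph = TelemetryGraph()
graph.add_entity(Entity("n01", "compute_node", valid_from=0.0))
graph.add_entity(Entity("job-1", "job", valid_from=10.0, valid_to=20.0))
graph.add_relation(Relation("job-1", "n01", "runs_on", 10.0, 20.0))
graph.entities["n01"].samples += [
    (5.0, "cpu_load", 0.1),
    (12.0, "cpu_load", 0.9),
    (18.0, "cpu_load", 0.8),
    (25.0, "cpu_load", 0.2),
]

# Combine structure and time: the load on whichever nodes job-1 occupied
# at t=15, restricted to the job's lifetime.
nodes_of_job = graph.neighbours_at("job-1", "runs_on", 15.0)
load_during_job = graph.samples_between(nodes_of_job[0], "cpu_load", 10.0, 20.0)
```

The point of the sketch is that the structural query (which node a job ran on) and the time-series query (the load on that node) share the same time axis, so an analysis can ask what the platform looked like at a given moment before selecting the relevant telemetry. A production system would presumably back this with a graph or time-series store rather than in-memory lists.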