Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale

Introduction

Log data is essential for understanding the behavior of high-performance computing (HPC) systems: it records their usage and supports troubleshooting of system faults. Today's HPC systems are heavily instrumented at every layer for health monitoring, collecting performance counters and resource usage data. Most components also report information about abnormal events, such as critical conditions, faults, errors, and failures. This system activity and event information is logged for monitoring and analysis. Large-scale HPC installations produce various types of log data. For example, job logs maintain a history of application runs, the allocated resources, their sizes, user information, and exit statuses, i.e., successful vs. failed. Reliability, availability, and serviceability (RAS) system logs derive data from various hardware and software sensors, such as temperature sensors, memory errors, and processor utilization. Network systems collect data about network link bandwidth, congestion, routing, and link faults. Input/output (I/O) and storage systems produce logs that record performance characteristics as well as data about detected degradations and errors.

HPC log data, when thoroughly investigated in both spatial and temporal dimensions, can be used to detect occurrences of failures and understand their root causes, identify persistent temporal and spatial patterns of failures, track error propagation, evaluate system reliability characteristics, and even analyze contention for shared resources in the system. However, HPC log data is derived from multiple monitoring frameworks and sensors and is inherently unstructured. Most log entries are not designed to be easily understood by humans: some entries consist of numeric values, while others include cryptic text, hexadecimal codes, or error codes. Analyzing this data and finding correlations faces two main difficulties: first, the volume of RAS logs makes manual inspection difficult; and second, the unstructured nature and idiosyncratic properties of the log data produced by each subsystem add another dimension of difficulty in identifying implicit correlations among the recorded events. Consequently, the usage of log data is, in practice, largely limited to detecting occurrences of text patterns already known to be associated with certain types of events.

As the scale and complexity of HPC systems continue to grow, the storage, retrieval, and comprehensive analysis of log data become a significant challenge. In future extreme-scale HPC systems, the massive volume of monitoring and log data will make manual inspection and analysis impractical, and therefore poses a data analytics challenge. Addressing this challenge requires scalable methods for processing log and monitoring data: the enormous data sets must be stored in flexible schemas on scalable and highly available database technologies that can support large-scale analytics with low latency, together with high-performance distributed data processing frameworks for batch, real-time, and advanced analytics on the system data.

In this paper, we introduce a scalable HPC system data analytics framework designed to provide system log data analysis capabilities to a wide range of researchers and engineers, including system administrators, system researchers, and end users. The framework leverages Cassandra, a distributed NoSQL database, to realize a scalable, fast-response backend for high-throughput read/write operations, and Apache Spark to support rapid analysis of the voluminous system data. The framework provides a web-based, graphical, interactive frontend that enables users to track system activity and performance and to visualize the data. Using the framework, users can navigate the spatio-temporal event space that overlaps with particular system events, faults, application runs, and resource usage to monitor the system, extract statistical features, and identify persistent behavioral patterns. End users can also visually inspect trends among system events and contention on shared resources that occur during the runs of their applications. Through such analysis, users may find sources of performance anomalies and gain deeper insight into the impact of various system behaviors on application performance.

The rest of the document is organized as follows: Section 5 presents the data model and the design considerations that influenced the architecture of our framework. Section 10 details the architecture of our framework and how it has been adapted to analyze data from the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). Section 9 surveys related work in HPC monitoring frameworks and the analysis of log data. Finally, Section 8 concludes the paper with a discussion of potential future directions.

Fig. 1: Schemas for event occurrences: event schema ordered by time of occurrence (top) and by location of occurrence (bottom)
Fig. 2: Schemas for application runs: application schema ordered by time of occurrence (top), by name of application (middle), and by users (bottom)

Conclusion

With the ever-growing scale and complexity of high performance computing (HPC) systems, characterizing system behavior has become a significant challenge. The systems produce and log vast amounts of unstructured, multi-dimensional data collected using a variety of monitoring tools. The tools and methods available today for processing this log data lack advanced data analytics capabilities, which makes it difficult to diagnose and fully understand the impact of system performance variations, faults, and errors on application performance. To handle the massive amounts of system log data from a diverse set of monitoring frameworks and rapidly identify problems and variations in system behavior, it is essential to have scalable tools to store and analyze the data.

In this paper, we introduced a scalable HPC log data analytics framework based on a distributed data and computation model. The framework defines a time-series oriented data model for HPC log data. We leverage big data frameworks, including Cassandra, a highly scalable, high-performance column-oriented NoSQL distributed database, and Apache Spark, a real-time distributed in-memory analytics engine. We presented a data model designed to facilitate log data analytics for system administrators and researchers as well as end users who are often oblivious to the impact of variations and fault events on their application jobs.

Our log analytics framework has been tested with the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). Although the framework is still evolving, with new analytics modules currently under development, our preliminary assessment shows that the framework can provide deeper insight into the root causes of system faults and abnormal behaviors of user applications. It also enables statistical analysis of event occurrences and their correlations on a spatial and temporal basis. These capabilities will be valuable when deploying a new HPC system in the pre-production phase, as well as during its operational lifetime for fine-tuning the system.

While our existing framework improves upon the state of the art in HPC log data processing, there is much room for improvement. As future work, we are planning several enhancements to the framework. First, new and composite event types will need to be defined to capture the complete status of the system; this will involve event mining techniques rather than text pattern matching. Second, the framework will need to develop application profiles in terms of the events that occur during application runs, which will help in understanding correlations between application runtime characteristics and variations observed in the system on account of faults and errors. Finally, the framework will need to incorporate advanced statistical techniques, machine learning algorithms, and graph analytics for a more comprehensive investigation of log and monitoring data.

Data Model

The monitoring infrastructure in supercomputing systems produces data streams from various sensors, which capture resource and capacity utilization, power consumption, cooling systems, application performance, as well as various types of faults, errors, and failures in the system. With the rapid increase in the complexity of supercomputing systems due to the use of millions of cores, complex memory hierarchies, and communication and file systems, massive amounts of monitoring data must be handled and analyzed in order to understand the characteristics of these systems and the correlations between the various system measurements. The analysis of system monitoring data requires capturing relevant sensor data and system events, storing them in databases, developing analytic models to understand the spatial and temporal features of the data and the correlations between the various data streams, and providing tools capable of visualizing these system characteristics, or even building predictive models to improve system operation. With the explosion in monitoring data, this rapidly becomes a big data challenge. Therefore, to handle the massive amounts of system monitoring data and to support more rigorous forms of user-defined analytics, we adopt storage solutions designed to handle large amounts of data and an in-memory data processing framework.

Design Considerations

An implementation of an HPC system log analytics framework should start with extracting, transforming, and loading (ETL) log data into a manageable database that can serve a flexible and rich set of queries over large amounts of data. Due to the variety and volume of the data, we considered flexibility and fast performance to be the two key design objectives of the framework. For the analytics framework to serve current and emerging system architectures, we placed emphasis on the following design considerations for the backend data model:

  • Scalability: The framework needs to store historical log data as well as future events from the monitoring frameworks. The data model should be scalable to accommodate an ever increasing volume of data.

  • Low latency: The framework also needs to serve interactive analytics that require near-real-time query responses for timely visual updates. The backend data model should operate with minimal latency.

  • Flexibility: A single data representation, or schema, for the various types of events from different system components is not feasible. The data model should offer a flexible mechanism to add new event types and to modify existing schemas to accommodate changes in system configuration, software updates, etc.

  • Time-series friendly: The most common log analytics of interest to HPC practitioners are expected to be based on time-series data, which provides insights into the system's behavior over a user-specified window of time.

We believe that these features will enable users to identify patterns among event occurrences over time and explain the abnormal behavior of systems and their impact on applications. Building the analytics framework on such a data model will allow a variety of statistical and data mining techniques, such as association rules, decision trees, cross-correlation, Bayesian networks, etc., to be applied to the system log data.

To support a broad range of analytics, retaining the raw data in a semi-structured format is greatly beneficial. However, we found that conventional relational databases (RDBMS) do not satisfy our requirements. First, the schema of a relational database, once created, is very difficult to modify, whereas the formats of HPC logs tend to change periodically. Second, due to their support for the atomicity, consistency, isolation, and durability (ACID) properties and two-phase commit protocols, relational databases do not scale well. After investigating various database technologies, we found Apache Cassandra to be the most suitable for building the backend data model of our log analytics framework.

Cassandra, based on concepts from Amazon's Dynamo and Google's BigTable, is a column-oriented distributed database offering highly available (HA) services with no single point of failure. As a hashing-based distributed database system, Cassandra stores data in tables. A data unit of a table, also known as a partition, is associated with a hash key and mapped to one or more nodes of a Cassandra cluster. A partition can be considered a data row that can contain multiple column families, where each family can have a different format. Cassandra's performance, resiliency, and scalability come from its masterless ring design, which, unlike a legacy master-slave architecture, assigns an identical role to each node. With a replication option implemented on commodity hardware, Cassandra offers a fault-tolerant data service. With its column-oriented features, Cassandra is also naturally suitable for handling data in sequence, regardless of data size. When data is written to Cassandra, each data record is sorted and written sequentially to disk. When the database is queried, data is retrieved by row key and by range within a row, which guarantees a fast and efficient search.
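The storage scheme described above, partitioning by a hash key, distributing partitions across the ring, and keeping rows within a partition sorted, can be illustrated with a small Python sketch. All names here are ours, not Cassandra's API; the hash-to-node mapping is a simplification of Cassandra's token ring.

```python
import bisect
import hashlib

NUM_NODES = 4  # size of the hypothetical Cassandra ring

def owner_node(partition_key):
    """Map a partition key to one of the ring's nodes via hashing
    (a simplification of Cassandra's token-based placement)."""
    digest = hashlib.md5(repr(partition_key).encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

class TimeSeriesTable:
    """Toy model of a Cassandra table partitioned by (hour, event_type),
    with rows inside each partition kept sorted by timestamp, mimicking
    a clustering column."""

    def __init__(self):
        # partition key -> list of (timestamp, payload), kept sorted
        self.partitions = {}

    def insert(self, hour, event_type, timestamp, payload):
        rows = self.partitions.setdefault((hour, event_type), [])
        bisect.insort(rows, (timestamp, payload))  # sorted write

    def range_query(self, hour, event_type, t_start, t_end):
        """Retrieve the rows of one partition within a timestamp range,
        analogous to querying by row key plus a range within the row."""
        rows = self.partitions.get((hour, event_type), [])
        return [row for row in rows if t_start <= row[0] <= t_end]
```

Because each partition is already sorted, a range query touches a single node and scans a contiguous slice, which is the property that makes this layout attractive for hour-long event time series.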

Data Model Design

Our data model was initially designed to study the operational behavior of the Titan supercomputer hosted by Oak Ridge National Laboratory. The framework is designed to study Titan's system logs collected from console, application, and network logs, which contain timestamped entries of critical system events. The data model captures various system events, including machine check exceptions, memory errors, GPU failures, GPU memory errors, Lustre file system errors, data virtualization service errors, network errors, application aborts, kernel panics, etc.

We have created a total of eight tables to model system information, the types of events we monitor, occurrences of events, and application runs. The partitions for events are designed to disperse both read and write overheads evenly across the cluster nodes. Fig 1 shows how a partition is mapped to one of the four nodes by its hash key, a combination of hour and event type.

  • nodeinfos

  • eventtypes

  • eventsynopsis

  • event_by_time

  • event_by_location

  • application_by_time

  • application_by_user

  • application_by_location

The nodeinfos table contains information about the system, including the position of a rack (or cabinet) in terms of row and column number, the position of a compute node in terms of rack, chassis, blade, and module number, network and routing information, etc. Each node in the Titan system consists of an AMD CPU and an NVIDIA GPU. Each CPU is a 16-core AMD Opteron 6274 processor with 32 GB of DDR3 memory, and each GPU is an NVIDIA K20X Kepler-architecture GPU with 6 GB of GDDR5 memory. The system uses Cray Gemini routers, which are shared between a pair of nodes. Each blade/slot of the Titan supercomputer contains four nodes. Each cage has eight such blades, and a cabinet contains three such cages. The complete system consists of 200 cabinets organized in a grid of 25 rows and 8 columns. The nodeinfos table enables spatial correlation and analysis of events in the system.
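The layout figures above (4 nodes per blade, 8 blades per cage, 3 cages per cabinet, 200 cabinets in a 25-by-8 grid) determine a simple arithmetic mapping from a flat node index to a physical position, which is the kind of information nodeinfos makes available for spatial analysis. The following sketch is our own illustration, not the framework's code; note that this grid capacity (19,200 slots) is an upper bound, since some slots in a production system serve roles other than compute.

```python
# Layout constants taken from the Titan description above.
NODES_PER_BLADE = 4
BLADES_PER_CAGE = 8
CAGES_PER_CABINET = 3
CABINET_ROWS, CABINET_COLS = 25, 8

NODES_PER_CABINET = CAGES_PER_CABINET * BLADES_PER_CAGE * NODES_PER_BLADE

def node_position(flat_index):
    """Derive (row, col, cage, blade, node) from a flat node index by
    successive integer division, innermost unit last."""
    cabinet, within = divmod(flat_index, NODES_PER_CABINET)
    row, col = divmod(cabinet, CABINET_COLS)
    cage, rest = divmod(within, BLADES_PER_CAGE * NODES_PER_BLADE)
    blade, node = divmod(rest, NODES_PER_BLADE)
    return row, col, cage, blade, node
```

With such a mapping, events keyed by node id can be aggregated per blade, cage, cabinet, or cabinet row, which is what enables the spatial correlation mentioned above.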

The two tables event_by_time and event_by_location store system event information from two perspectives, time and location, to facilitate spatio-temporal analysis. An event in our data model is defined as the occurrence(s) of a certain type reported at a particular timestamp. An event is also associated with the location (or the source component) where it is reported. These dual representations of an event are illustrated in Fig 1. The first table structure associates an event with its type and the hour of its occurrence; all events of a certain type generated in a certain hour are stored in the same partition. In contrast, the second table structure associates an event with hour and location; all events, regardless of their type, generated in a certain hour for the same component are stored in the same partition. Note that each partition stores events sorted by their timestamps, forming a one-hour-long time series representation of events. This facilitates spatio-temporal queries.
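The dual representation amounts to writing each event twice, once per view. The sketch below illustrates this denormalized write path with plain dictionaries; the table and field names mirror the schema informally and are not the framework's actual code.

```python
from collections import defaultdict

# event_by_time-style view:     partitioned by (hour, event_type)
# event_by_location-style view: partitioned by (hour, location)
event_by_time = defaultdict(list)
event_by_location = defaultdict(list)

def record_event(timestamp, event_type, location, detail):
    """Write one event into both views, keeping each one-hour
    partition ordered by timestamp."""
    hour = timestamp // 3600
    event = (timestamp, event_type, location, detail)
    event_by_time[(hour, event_type)].append(event)
    event_by_location[(hour, location)].append(event)
    event_by_time[(hour, event_type)].sort()
    event_by_location[(hour, location)].sort()
```

A temporal question ("all machine check exceptions in hour H") reads one partition of the first view, while a spatial question ("everything node X reported in hour H") reads one partition of the second, so neither query pattern pays for the other's layout.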

For data about user application runs, we added another dimension: users. More specifically, three tables represent user application runs from the perspectives of time, application, and user (see Fig 2). These can be regarded as a set of denormalized views on application runs. Note, however, that although all application runs in each partition type are depicted identically, each application run may in fact include columns unique to it. For example, a column named Other Info may include multiple sub-columns representing different information.
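The three application views follow the same denormalization pattern, with the added wrinkle that individual runs may carry run-specific columns. In this sketch (again our own illustration, with hypothetical field names), representing each run as a dictionary makes the flexible Other Info sub-columns straightforward.

```python
from collections import defaultdict

# Denormalized views on application runs, keyed by hour, by
# application name, and by user respectively.
app_by_time = defaultdict(list)
app_by_name = defaultdict(list)
app_by_user = defaultdict(list)

def record_run(start, app_name, user, nodes, exit_status, **other_info):
    """Write one application run into all three views; any extra
    keyword arguments become run-specific 'Other Info' sub-columns."""
    run = {"start": start, "app": app_name, "user": user,
           "nodes": nodes, "exit_status": exit_status}
    if other_info:
        run["other_info"] = other_info
    app_by_time[start // 3600].append(run)
    app_by_name[app_name].append(run)
    app_by_user[user].append(run)
```

Because all three views reference the same run object, a run recorded with extra fields exposes them regardless of which view a query arrives through.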