Hardware-Conscious Stream Processing: A Survey
Background
In this section, we introduce the common APIs and runtime architectures of modern DSPSs.
Common APIs
A DSPS needs to provide a set of APIs for users to express their stream applications. Most modern DSPSs such as Storm and Flink express a streaming application as a directed acyclic graph (DAG), where nodes in the graph represent operators, and edges represent the data dependencies between operators. Figure 1(a) illustrates word count (WC), an example application containing five operators. A detailed description of a few more stream applications can be found in .
Some earlier DSPSs (e.g., Storm ) require users to implement each operator manually. Recent efforts from Saber , Flink , Spark-Streaming , and Trident aim to provide declarative APIs (e.g., SQL) with rich built-in operations such as aggregation and join. Subsequently, many efforts have been devoted to improving the execution efficiency of the operations, especially by utilizing modern hardware (Section 16).
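To make the DAG abstraction concrete, the following is a minimal sketch of the WC application as a chain of operators (source, splitter, counter) expressed in plain Java. The class and method names are illustrative and do not correspond to any particular DSPS's API.

```java
import java.util.*;
import java.util.stream.*;

// A minimal sketch of the word-count (WC) DAG:
// source -> splitter -> counter, modeled as plain functions.
// Names are illustrative; no particular DSPS API is assumed.
public class WordCountDag {
    // Splitter operator: one line in, many words out.
    static Stream<String> split(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"));
    }

    // Counter operator: aggregates word frequencies.
    static Map<String, Long> count(Stream<String> words) {
        return words.collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> source = List.of("a b a", "b c");  // the source operator
        Map<String, Long> counts = count(source.stream().flatMap(WordCountDag::split));
        System.out.println(counts.get("a") + " " + counts.get("b") + " " + counts.get("c"));
        // prints "2 2 1"
    }
}
```

A declarative API, in contrast, would let the user write roughly the same pipeline as a SQL query over the stream, with the system choosing the physical execution of each operator.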
Common Runtime Architectures
Modern stream processing systems can generally be categorized by their processing model: the Continuous Operator (CO) model and the Bulk Synchronous Parallel (BSP) model .
Continuous Operator Model: Under the CO model, the execution runtime treats each operator (a vertex of the DAG) as a single execution unit (e.g., a Java thread), and operators communicate through message passing (along the edges of the DAG). For scalability, each operator can be executed independently by multiple threads, where each thread handles a substream of input events via stream partitioning . This execution model allows users to control the parallelism of each operator in a fine-grained manner . This design was adopted by many DSPSs such as Storm , Heron , Seep , and Flink due to its advantage of low processing latency. Recent hardware-conscious DSPSs that adopt the CO model include Trill , BriskStream , and TerseCades .
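The CO model can be sketched as follows: each operator runs in its own thread, and a DAG edge is a message-passing queue between them. The splitter/counter roles and the end-of-stream marker are illustrative assumptions, not any system's actual runtime.

```java
import java.util.concurrent.*;

// Sketch of the Continuous Operator model: each operator is a long-running
// thread (an execution unit), and each DAG edge is a message-passing queue.
public class ContinuousOperators {
    public static void main(String[] args) throws Exception {
        BlockingQueue<String> edge = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> out = new LinkedBlockingQueue<>();
        final String EOS = "<eos>";  // illustrative end-of-stream marker

        // Upstream operator: emits events one at a time (no batching).
        Thread splitter = new Thread(() -> {
            for (String w : "a b a".split(" ")) edge.add(w);
            edge.add(EOS);
        });
        // Downstream operator: consumes continuously, keeping running state.
        Thread counter = new Thread(() -> {
            int n = 0;
            try {
                for (String w = edge.take(); !w.equals(EOS); w = edge.take()) n++;
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            out.add(n);
        });
        splitter.start(); counter.start();
        splitter.join(); counter.join();
        System.out.println(out.take()); // prints "3"
    }
}
```

Because each tuple flows downstream as soon as it is produced, this design achieves the low per-tuple latency noted above; the cost is one queue operation per message, which Section 13 and the tuple-batching techniques discussed later aim to amortize.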
Bulk-Synchronous Parallel Model: Under the BSP model, the input stream is explicitly grouped into micro-batches by a central coordinator and then distributed to multiple workers (e.g., threads or machines). Each data item in a micro-batch is then processed by passing through the entire DAG, ideally by the same thread without any cross-operator communication. However, the DAG may contain synchronization barriers, where threads have to exchange their intermediate results (i.e., data shuffling). Taking WC as an example, the Splitter needs to ensure that the same word is always passed to the same thread of the Counter; hence, a data shuffling operation is required before the Counter. As a result, such synchronization barriers break the DAG into multiple stages under the BSP model, and the communication between stages is managed by the central coordinator. This kind of design was adopted by Spark-Streaming , Drizzle , and FlumeJava . Other recent hardware-conscious DSPSs that adopt the BSP model include Saber and StreamBox .
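The micro-batching and shuffle barrier described above can be sketched as follows. The batch contents, worker count, and hash-based routing are illustrative assumptions; real systems route tuples across threads or machines rather than in-process lists.

```java
import java.util.*;

// Sketch of the BSP model for WC: input arrives as micro-batches, the
// Splitter stage runs up to the shuffle barrier, then each word is
// re-partitioned by hash so the same word always reaches the same Counter.
public class BspShuffle {
    public static void main(String[] args) {
        List<List<String>> microBatches =
            List.of(List.of("a b", "a c"), List.of("b a"));  // from the coordinator
        int numWorkers = 2;

        // Stage 1 (Splitter) + shuffle barrier: route each word by hash.
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numWorkers; i++) partitions.add(new ArrayList<>());
        for (List<String> batch : microBatches)
            for (String line : batch)
                for (String w : line.split(" "))
                    partitions.get(Math.floorMod(w.hashCode(), numWorkers)).add(w);

        // Stage 2 (Counter): each partition is counted independently.
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> p : partitions)
            for (String w : p) counts.merge(w, 1, Integer::sum);

        System.out.println(counts); // prints "{a=3, b=2, c=1}"
    }
}
```

The shuffle between the two stages is exactly the synchronization barrier that breaks the DAG into stages: no Counter can finish until every Splitter has emitted its words for the batch.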
Although there have been prior efforts to compare the two models , it remains inconclusive which model is more suitable for utilizing modern hardware, as each comes with its own advantages and disadvantages. For example, the BSP model naturally minimizes communication among operators inside the same DAG, but its single centralized scheduler has been identified as a scalability limitation . Moreover, its unavoidable data shuffling brings significant communication overhead, as observed in recent research . In contrast, the CO model allows fine-grained optimization (i.e., each operator can be configured with a different parallelism and placement) but potentially incurs higher communication costs among operators. The limitations of both models can potentially be addressed with more advanced techniques. For example, cross-operator communication overhead (under both the CO and BSP models) can be reduced by exploiting tuple batching , high-bandwidth memory , data compression , InfiniBand (Section 13), and architecture-aware query deployment (Section 15).
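Of the mitigation techniques just listed, tuple batching is the simplest to illustrate: the producer accumulates tuples and ships them in fixed-size batches, amortizing per-message queue overhead across the batch. The batch size and tuple values below are illustrative.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of tuple batching on a CO-style edge: instead of one queue
// operation per tuple, the producer ships fixed-size batches.
public class TupleBatching {
    public static void main(String[] args) throws Exception {
        BlockingQueue<List<Integer>> edge = new LinkedBlockingQueue<>();
        int batchSize = 4, numTuples = 10, queueOps = 0;

        List<Integer> batch = new ArrayList<>(batchSize);
        for (int t = 0; t < numTuples; t++) {
            batch.add(t);
            if (batch.size() == batchSize) {            // ship a full batch
                edge.put(batch); queueOps++;
                batch = new ArrayList<>(batchSize);
            }
        }
        if (!batch.isEmpty()) { edge.put(batch); queueOps++; } // flush remainder

        int received = 0;
        while (!edge.isEmpty()) received += edge.take().size();
        System.out.println(queueOps + " ops, " + received + " tuples");
        // prints "3 ops, 10 tuples"
    }
}
```

The trade-off is the one at the heart of the CO-versus-BSP debate in miniature: larger batches amortize more overhead but delay the first tuple of each batch, increasing latency.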
Survey Outline
Hardware architectures are evolving fast and provide far higher processing capability than traditional DSPSs were originally designed for. For example, recent scale-up servers can accommodate hundreds of CPU cores and terabytes of memory , providing abundant computing resources. Emerging network technologies such as Remote Direct Memory Access (RDMA) and 10Gb Ethernet significantly improve system ingress rates, making I/O no longer the bottleneck in many practical scenarios . However, prior studies have shown that existing data stream processing systems (DSPSs) severely underutilize hardware resources because they are unaware of the underlying complex hardware architectures.
As summarized in Table [tbl:summary], we are witnessing a revolution in the design of DSPSs that exploit emerging hardware capability, particularly along with the following three dimensions:
1) Computation Optimization: In contrast to conventional DBMSs, two key features of DSPSs are fundamental to many stream applications yet computationally expensive: windowing operations (e.g., windowed stream joins) and out-of-order handling . The former deals with infinite streams, and the latter handles stream imperfections. Support for these expensive operations has become a major requirement for modern DSPSs and a key dimension along which they are differentiated. Prior approaches use multicores , heterogeneous architectures (e.g., GPUs and Cell processors) , and Field-Programmable Gate Arrays (FPGAs) to accelerate these operations.
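The two features can be sketched together: a tumbling count window over event time, where window membership is determined by the tuple's timestamp rather than its arrival order, so an out-of-order tuple still lands in its correct window. The window size and timestamps are illustrative, and real systems additionally bound how long late tuples are accepted (e.g., via watermarks).

```java
import java.util.*;

// Sketch of a tumbling event-time count window that tolerates
// out-of-order arrivals by assigning tuples to windows by timestamp.
public class TumblingWindow {
    public static void main(String[] args) {
        long windowSize = 10;
        // (timestamp, value) pairs in arrival order; the tuple at t=7
        // arrives out of order, after the tuple at t=12.
        long[][] events = {{1, 1}, {4, 1}, {12, 1}, {7, 1}, {15, 1}};

        Map<Long, Long> windowCounts = new TreeMap<>();
        for (long[] e : events) {
            long windowStart = (e[0] / windowSize) * windowSize;
            windowCounts.merge(windowStart, e[1], Long::sum); // late tuple still
        }                                                     // lands in window [0,10)
        System.out.println(windowCounts); // prints "{0=3, 10=2}"
    }
}
```

Keeping window state open for late tuples is exactly what makes these operations expensive: state grows with the window size and the allowed lateness, which motivates the hardware acceleration efforts cited above.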
2) Stream I/O Optimization: Cross-operator communication is often a major source of overhead in stream processing. Recent work has revealed that the overhead due to cross-operator communication is significant, even without the TCP/IP network stack . Subsequently, research has been conducted on improving the efficiency of data grouping (i.e., output stream shuffling among operators) using High Bandwidth Memory (HBM) , compressing data in transmission with hardware accelerators and applying computation directly over compressed data , and leveraging InfiniBand for faster data flow . Having said that, there are also cases where an application needs to temporarily store data for future use (i.e., state management ). Examples include stream processing with large window operations (i.e., a workload footprint larger than the memory capacity) and stateful stream processing with high availability (i.e., application states are kept persistently). To relieve the disk I/O overhead, recent work has investigated more efficient state management leveraging SSDs and non-volatile memory (NVM) .
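The idea of applying computation directly over compressed data can be sketched with a simple run-length encoding: a sum aggregate is evaluated on the (value, run-length) pairs without ever decompressing the stream. The encoding choice is illustrative; systems such as TerseCades use encodings tailored to their workloads.

```java
import java.util.*;

// Sketch of computing directly over compressed data: the stream is
// run-length encoded, and a sum aggregate runs on (value, run) pairs.
public class ComputeOverCompressed {
    // Run-length encode: [5,5,5,2,2,7] -> [(5,3),(2,2),(7,1)]
    static List<long[]> rle(int[] data) {
        List<long[]> runs = new ArrayList<>();
        for (int v : data) {
            if (!runs.isEmpty() && runs.get(runs.size() - 1)[0] == v)
                runs.get(runs.size() - 1)[1]++;
            else runs.add(new long[]{v, 1});
        }
        return runs;
    }

    public static void main(String[] args) {
        int[] stream = {5, 5, 5, 2, 2, 7};
        long sum = 0;
        for (long[] run : rle(stream)) sum += run[0] * run[1]; // no decompression
        System.out.println(sum); // prints "26"
    }
}
```

Beyond saving bandwidth on the wire, the aggregate itself touches fewer items (one per run instead of one per tuple), which is why computing over compressed data can reduce both I/O and computation.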
3) Query Deployment: From a higher-level point of view, researchers have studied how to deploy a whole stream application (i.e., a query) onto various hardware architectures. As in traditional database systems, the goals of query deployment in a DSPS include minimizing operator interference and cross-operator communication, balancing hardware resource utilization, and so on. The major difference from traditional database systems lies in the problem assumptions, and hence the system architectures (e.g., infinite input streams , processing latency constraints , and the unique cost functions of streaming operators ). To take advantage of modern hardware, prior works have exploited various hardware characteristics such as cache-conscious strategies , FPGAs , and GPUs . Recent works have also looked into supporting hybrid architectures and NUMA .
Conclusion
In this paper, we have discussed relevant literature from the field of hardware-conscious DSPSs, which aims to utilize modern hardware capabilities to accelerate stream processing. These works have significantly improved DSPSs in satisfying the design requirements raised by Stonebraker et al. . In the following, we highlight some promising directions for future research.
Scale-up and Scale-out Stream Processing. As emphasized by Gibbons , scaling both out and up is crucial to effectively improving system performance. In-situ analytics enables data processing at the point of data origin, reducing data movement across networks, while powerful hardware infrastructure provides an opportunity to improve processing performance within a single node. To this end, many recent works have explored the potential of high-performance stream processing on a single node . However, the important question of how best to use powerful local nodes within a large distributed computation setting remains open.
Stream Processing Processor. With the wide adoption of stream processing today, it may be a good time to revisit the design of a processor specialized for DSPSs. GPUs provide much higher bandwidth than CPUs, but at the cost of higher latency, as tuples must first be accumulated into batches to fully utilize the thousands of cores on a GPU; FPGAs have the advantage of low-latency, low-power computation, but their throughput is still much lower than that of GPUs. An ideal processor for stream processing would combine low latency, low power consumption, and high bandwidth. On the other hand, components such as complex control logic may be sacrificed, as stream processing logic is usually predefined and fixed. Further, due to the nature of continuous query processing, it is ideal to keep the entire instruction set close to the processor .
Acknowledgments.
The authors would like to thank the anonymous reviewer and the associate editor, Pınar Tözün, for their insightful comments on improving this manuscript. This work is supported by a MoE Tier 1 grant (T1 251RES1824) and a MoE Tier 2 grant (MOE2017-T2-1-122) in Singapore. Feng Zhang’s work was partially supported by the National Natural Science Foundation of China (Grant No. 61802412, 61732014).