Several high-throughput distributed data-processing applications require multi-hop processing of data streams. Examples include continual processing of data streams originating from a network of sensors and composing a multimedia stream by embedding several component streams that originate from different locations. These data-flow computing applications require multiple processing nodes, interconnected according to the data-flow topology of the application, for on-stream processing of the data. Since such applications usually run for long periods, it is important to map the component computations and communications optimally onto the nodes and links of the network, satisfying the capacity constraints and optimizing a quality metric such as end-to-end latency. The mapping problem is unfortunately NP-complete, and heuristics have previously been proposed to compute approximate solutions in a centralized way. However, because the network is dynamic, it is practically impossible to aggregate the correct state of the whole network at a single node. In this paper, we present a distributed algorithm for optimal mapping of the components of data-flow applications. We propose several heuristics to minimize the message complexity of the algorithm while maintaining the quality of the solution.
Deep Dive into Towards a decentralized algorithm for mapping network and computational resources for distributed data-flow computations.
Real-time processing of continuous data streams is becoming an important component of data-flow intensive distributed applications. In general, these applications consist of a few cascades of computational operations on several streams of data that originate from one or more sources, and they present a view of the processed data at one or more sink nodes. Applications such as continual query [4] on the stream of information sent by a network of sensors, composing a multimedia stream through several stages of encoding, decoding and embedding [3,9], and scientific workflow [6] belong to this category. These applications require several computational resources along the path the data streams travel from source to destination. In addition, each of these computations generates new data streams that are to be processed by other computations or delivered to the destination, so sufficient network link bandwidth must be provided to carry these data streams among the source, destination and computational nodes for the computations to proceed seamlessly. In this paper, we deal with the problem of optimally allocating computational and network resources for these distributed applications.
Usually the distributed computation operates for a long time after being set up with all the necessary resources, so it is important to acquire the resources optimally before the operation starts. When resources are requested for a distributed job, the topology that interconnects the component nodes of the flow, i.e. the data sources, the processing nodes and the destination, is known. In very general terms, the interconnection topology can be an acyclic graph. However, in most common cases the flow is a linear path, a tree or a series-parallel graph. We show in Section 2.3 that even for a linear path-like flow, finding a mapping that places computations on processing nodes and data transmissions on network paths, while satisfying the processing capacity and bandwidth constraints, is an NP-complete problem. In this paper, we develop a scheme to solve the problem of mapping a linear path-like computation onto an arbitrary resource network.
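To make the two kinds of constraints concrete, the following is a minimal sketch of a feasibility check for one candidate mapping of a linear flow. It is an illustration under assumed conventions, not the formal definition from Section 2: the function name `is_feasible`, the list and dictionary layouts, and the rule that loads on a shared node or link add up are all assumptions introduced here.

```python
from collections import defaultdict

def is_feasible(hosts, stage_demand, flow_bw, routes, node_cap, link_bw):
    """Check one candidate mapping of a linear flow (hypothetical model).

    hosts        : [source node, host of stage 1, ..., host of stage k, sink node]
    stage_demand : processing demand of each of the k stages
    flow_bw      : bandwidth of each of the k+1 streams between consecutive hosts
    routes       : one node list per stream, from hosts[i] to hosts[i+1]
    node_cap     : {node: processing capacity}
    link_bw      : {(u, v): link bandwidth}
    """
    # Processing constraint: stages placed on the same node share its capacity.
    node_load = defaultdict(float)
    for host, demand in zip(hosts[1:-1], stage_demand):
        node_load[host] += demand
    if any(load > node_cap.get(n, 0.0) for n, load in node_load.items()):
        return False

    # Bandwidth constraint: streams routed over the same link share its bandwidth.
    link_load = defaultdict(float)
    for i, route in enumerate(routes):
        if route[0] != hosts[i] or route[-1] != hosts[i + 1]:
            return False  # route does not connect the two consecutive hosts
        for u, v in zip(route, route[1:]):
            if (u, v) not in link_bw:
                return False  # route uses a link that does not exist
            link_load[(u, v)] += flow_bw[i]
    return all(load <= link_bw[e] for e, load in link_load.items())

# Toy network a -> b -> c; both stages on b would overload it.
node_cap = {"a": 0, "b": 5, "c": 5}
link_bw = {("a", "b"): 10, ("b", "c"): 10}
spread = is_feasible(["a", "b", "c", "c"], [3, 3], [4, 2, 1],
                     [["a", "b"], ["b", "c"], ["c"]], node_cap, link_bw)
packed = is_feasible(["a", "b", "b", "c"], [3, 3], [4, 2, 1],
                     [["a", "b"], ["b"], ["b", "c"]], node_cap, link_bw)
```

Checking a given mapping is cheap; the difficulty lies in searching the space of placements and routes, which is where the NP-completeness result applies.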
The problem of establishing a path between a source and a destination node in an arbitrary network, subject to end-to-end quality constraints, has been a topic of active research for a long time. If such a path must satisfy a single additive quality requirement such as delay or hop count, the problem can easily be solved by Dijkstra’s shortest path algorithm. Even if an end-to-end min-max (bottleneck) constraint such as bandwidth also needs to be satisfied, the problem can still be solved easily using Wang and Crowcroft’s shortest-widest path algorithm [10]. However, it is well known that establishing a path that satisfies more than one additive quality constraint is an NP-hard problem [1,8]. It is important to note that finding a mapping for a data-flow computation requires more than end-to-end constraints, because the computational capacity requirement at each of the nodes needs to be satisfied individually.
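The tractable case of one bottleneck metric plus one additive metric can be sketched as two Dijkstra-style passes: first find the largest achievable bottleneck bandwidth, then find the minimum-delay path that uses only links of at least that bandwidth. This is one common way to obtain a shortest-widest path and is not claimed to reproduce the exact procedure of [10]; the graph layout `{u: {v: (bandwidth, delay)}}` and all function names are assumptions made for this example.

```python
import heapq

def widest_bottleneck(graph, src, dst):
    """Largest min-link-bandwidth over all src->dst paths (modified Dijkstra)."""
    best = {src: float("inf")}
    heap = [(-float("inf"), src)]  # max-heap on width via negation
    while heap:
        neg_w, u = heapq.heappop(heap)
        width = -neg_w
        if u == dst:
            return width
        if width < best.get(u, 0.0):
            continue  # stale heap entry
        for v, (bw, _delay) in graph.get(u, {}).items():
            w = min(width, bw)
            if w > best.get(v, 0.0):
                best[v] = w
                heapq.heappush(heap, (-w, v))
    return 0.0  # dst unreachable

def shortest_path(graph, src, dst, min_bw=0.0):
    """Minimum-delay path using only links with bandwidth >= min_bw."""
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            path = [dst]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return d, path[::-1]
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, (bw, delay) in graph.get(u, {}).items():
            if bw < min_bw:
                continue  # link too narrow for the required width
            nd = d + delay
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    return None

def shortest_widest_path(graph, src, dst):
    width = widest_bottleneck(graph, src, dst)
    if width == 0.0:
        return None
    delay, path = shortest_path(graph, src, dst, min_bw=width)
    return width, delay, path

# Toy graph: the direct link s->t is fast but narrow.
graph = {
    "s": {"a": (10, 5), "b": (10, 1), "t": (3, 1)},
    "a": {"t": (10, 5)},
    "b": {"t": (10, 2)},
}
result = shortest_widest_path(graph, "s", "t")
```

Neither pass tracks per-node processing capacity, which is the additional requirement that the mapping problem imposes.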
Due to the inherent complexity of the optimization problem, several workable heuristic solutions have been proposed in different contexts. A recursive mapping on a hierarchy of node groups in the resource network is applied in [4]. In [9] and [3], mapping is performed after pruning the whole resource network down to a subset of compatible resources. The solution by Liang and Nahrstedt [5] is closest to ours. One of the assumptions made by Liang and Nahrstedt is that the optimization algorithm is executed at a single node and that the complete state of the resource network is available to that node before execution. In a large-scale dynamic network this assumption is hard to realize. If we assume instead that each node in the resource network is aware of the state of its immediate neighborhood only, the solution must be computed by a distributed algorithm. In this paper we present a distributed algorithm to solve the problem, which is a dynamic-programming-based extension of the distributed Bellman-Ford algorithm.
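The idea of extending Bellman-Ford with a dynamic-programming dimension can be illustrated with a small simulation. The sketch below is not the algorithm developed in Section 3, which is not reproduced in this excerpt; it is a hypothetical simplification in which every node keeps a table indexed by the number of completed stages and updates it in synchronous rounds from its in-neighbors' tables only. It deliberately checks each stage and each stream against the full capacity of a node or link, so it ignores capacity shared between several stages or streams and yields only a relaxed latency estimate. All names and data layouts are assumptions.

```python
INF = float("inf")

def stage_bellman_ford(links, node_cap, src, stage_demand, stage_time, flow_bw):
    """Simulated distance-vector rounds over (node, stages completed) states.

    links        : {(u, v): (bandwidth, delay)}
    node_cap     : {node: processing capacity}
    stage_demand : processing demand of each stage
    stage_time   : processing latency of each stage
    flow_bw      : flow_bw[k] is the stream rate after k stages are done
    Returns table[v][k]: relaxed minimum latency to have the stream at node v
    with the first k stages completed.
    """
    nodes = set(node_cap)
    k_max = len(stage_demand)
    table = {v: [INF] * (k_max + 1) for v in nodes}
    table[src][0] = 0.0
    in_links = {v: [] for v in nodes}
    for (u, v), (bw, delay) in links.items():
        in_links[v].append((u, bw, delay))

    # A state graph with |V| * (k_max + 1) states needs at most that many rounds.
    for _ in range(len(nodes) * (k_max + 1)):
        new_table = {v: row[:] for v, row in table.items()}
        for v in nodes:
            row = new_table[v]
            for k in range(k_max + 1):
                # Forwarding step: uses only the neighbor's previous-round table.
                for u, bw, delay in in_links[v]:
                    if bw >= flow_bw[k]:
                        row[k] = min(row[k], table[u][k] + delay)
                # Local processing step: run stage k at v if capacity allows.
                if k > 0 and node_cap[v] >= stage_demand[k - 1]:
                    row[k] = min(row[k], row[k - 1] + stage_time[k - 1])
        if new_table == table:
            break  # converged: no node changed its table
        table = new_table
    return table

# Toy network: only node b has enough capacity for stage 1.
node_cap = {"s": 0, "a": 4, "b": 10, "t": 0}
links = {("s", "a"): (10, 1), ("a", "t"): (10, 1),
         ("s", "b"): (10, 2), ("b", "t"): (10, 2)}
table = stage_bellman_ford(links, node_cap, "s",
                           stage_demand=[5, 3], stage_time=[3, 1],
                           flow_bw=[2, 1, 1])
```

In this toy run the entry `table["t"][2]` is the relaxed latency of delivering the fully processed stream to the sink. A real distributed version would exchange table rows as messages between neighbors and would have to account for capacity consumed by earlier stages on the same node or link.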
The rest of the paper is organized as follows. In Section 2 we formally define the resource allocation problem as a constrained graph mapping problem. The Bandwidth Constrained Path Mapping (BCPM) problem, which covers most practical applications, is then defined as a special case of the general graph mapping problem. We provide a formal proof of the NP-completeness of the BCPM problem in the same section. In Section 3, centralized and decentralized algorithms to solve the BCPM problem are developed. A guideline for designing cost-effective heuristics that obtain approximate solutions to the problem is provided at the end of that section. Section 4 summarizes the discussion and gives directions for possible future extensions.
In this section we formally define the problem of capacity constrained mapping of dataflow computations on arbitrary networks. Any distributed dataflow computation
…(Full text truncated)…