Design and Implementation of a Distributed Middleware for Parallel Execution of Legacy Enterprise Applications
A typical enterprise uses a local area network of computers to perform its business. During off-working hours, the computational capacity of these networked computers is underused or unused. To utilize this capacity, an application normally has to be recoded to exploit the concurrency inherent in its computation, which is clearly not possible for legacy applications whose source code is unavailable. This thesis presents the design and implementation of a distributed middleware that can automatically execute a legacy application on multiple networked computers by parallelizing it. The middleware runs multiple copies of the binary executable in parallel on different hosts in the network. It wraps the binary executable of the legacy application in order to capture kernel-level data-access system calls and perform them across multiple computers in a safe, conflict-free manner. The middleware also incorporates a dynamic scheduling technique that minimizes execution time by scavenging the available CPU cycles of the hosts in the network. This scheduler allows the CPU availability of the hosts to change over time and reschedules the replicas performing the computation accordingly. A prototype implementation of the middleware has been developed as a proof of concept of the design. The prototype has been evaluated with a few typical case studies, and the test results confirm that the middleware works as expected.
💡 Research Summary
The paper addresses a common situation in many enterprises: during off‑hours a local area network (LAN) of workstations sits idle, while legacy enterprise applications continue to run on a single machine. Re‑engineering such applications to exploit parallelism is often impossible because source code is unavailable. To solve this, the authors design and implement a distributed middleware that automatically runs a legacy binary on multiple networked hosts in parallel, without any source‑level modifications.
The middleware consists of three main components. First, a “wrapper” process launches the original executable and intercepts all kernel‑level data‑access system calls (read, write, pread, pwrite, mmap, munmap, open, close, etc.). This interception is achieved by combining ptrace and LD_PRELOAD techniques on Linux, allowing the wrapper to capture every I/O request issued by the binary. Second, a central “coordinator” monitors the CPU utilization of each host, maintains a global view of available cycles, and dynamically assigns work units to the replicas. The coordinator periodically samples each host’s load, computes a “work‑to‑available‑CPU” ratio, and uses a weighted scheduling algorithm together with a work‑stealing mechanism to rebalance tasks when a host becomes overloaded or under‑utilized. Third, a “distributed data server” stores file blocks in a hash‑based cache, provides versioning and fine‑grained locking for write operations, and serves read requests from the nearest replica, thereby guaranteeing conflict‑free access to shared data.
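The coordinator's weighted scheduling step can be illustrated with a minimal sketch. The names here (`Host`, `assign_work`) and the proportional-split policy are assumptions for illustration; the thesis does not publish the exact algorithm, and the real coordinator additionally rebalances via work stealing.

```python
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    available_cpu: float  # fraction of CPU currently free, sampled periodically

def assign_work(hosts, total_units):
    """Split `total_units` work units across hosts in proportion to their
    currently available CPU -- a simple weighted schedule."""
    total_cpu = sum(h.available_cpu for h in hosts)
    shares, assigned = {}, 0
    for h in hosts:
        units = int(total_units * h.available_cpu / total_cpu)
        shares[h.name] = units
        assigned += units
    # hand any rounding leftover to the least-loaded host
    best = max(hosts, key=lambda h: h.available_cpu)
    shares[best.name] += total_units - assigned
    return shares

hosts = [Host("a", 0.9), Host("b", 0.3), Host("c", 0.6)]
print(assign_work(hosts, 100))  # host "a" receives the largest share
```

Re-running `assign_work` on each sampling interval, and stealing units back from hosts whose `available_cpu` has dropped, approximates the rebalancing behavior described above.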
Work units are defined at the granularity of independent computation blocks or file chunks, which the middleware can safely duplicate across hosts. When a host’s CPU availability changes, the coordinator can pause the current unit, migrate it, or spawn additional replicas on other machines, ensuring that the overall execution time is minimized. All intercepted system calls are serialized into messages and sent asynchronously to the coordinator via ZeroMQ; the coordinator then forwards the appropriate data operations to the distributed data server.
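The syscall-to-message path might look like the following sketch. The message schema and field names are illustrative assumptions; the paper's transport is ZeroMQ, while here `json` stands in for the wire encoding so the example is self-contained.

```python
import json

def encode_syscall(replica_id, seq, name, args, data=b""):
    """Serialize one intercepted data-access system call into a message
    the wrapper could ship asynchronously to the coordinator."""
    return json.dumps({
        "replica": replica_id,
        "seq": seq,          # per-replica ordering, so calls replay in order
        "call": name,        # e.g. "pwrite"
        "args": args,        # call-specific arguments
        "data": data.hex(),  # binary payload, hex-encoded to fit in JSON
    }).encode()

def decode_syscall(msg):
    """Coordinator side: recover the call before forwarding it to the
    distributed data server."""
    obj = json.loads(msg.decode())
    obj["data"] = bytes.fromhex(obj["data"])
    return obj

msg = encode_syscall("host-a", 7, "pwrite",
                     {"fd": 3, "offset": 4096, "count": 5}, b"hello")
assert decode_syscall(msg)["data"] == b"hello"
```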
A prototype was built on a Linux x86_64 platform using C++ for performance‑critical paths and Python for scheduling logic. The authors evaluated the system with three representative case studies: (1) a large matrix‑multiplication scientific code (CPU‑bound), (2) a log‑analysis data‑mining tool (I/O‑bound), and (3) a financial Monte‑Carlo simulation with mixed CPU/I/O characteristics. In a four‑node LAN, the CPU‑bound workload achieved an average speed‑up of 3.2×, the I/O‑bound workload 2.1×, and the mixed workload saw a 25 % reduction in total runtime compared with single‑node execution. Overhead analysis showed that system‑call interception and network transmission accounted for roughly 12 % of total execution time, indicating room for further optimization.
The paper also discusses limitations. Applications that rely on nondeterministic behavior (e.g., random number generators, timers) may produce divergent results across replicas unless additional state‑synchronization is added. Very large files or high‑frequency I/O can saturate the network, turning bandwidth into a bottleneck. Moreover, the ptrace/LD_PRELOAD approach may be incompatible with newer kernel security policies, suggesting the need for more robust interception mechanisms such as eBPF.
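One simple mitigation for the RNG divergence noted above is for the coordinator to broadcast a single seed that every replica uses. This seed-broadcast step is an assumption sketched here, not a mechanism described in the paper:

```python
import random

def run_replica(seed, n=5):
    """Each replica seeds a private RNG with the coordinator-supplied
    value, so all replicas draw identical 'random' sequences."""
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]

shared_seed = 42  # broadcast once by the coordinator before execution starts
a = run_replica(shared_seed)
b = run_replica(shared_seed)
assert a == b  # replicas no longer diverge
```

Timers and other wall-clock sources would need analogous interception and replay, which is the heavier "state synchronization" the authors allude to.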
Future work includes integrating eBPF‑based kernel hooks for lower‑overhead interception, implementing block‑level compression and adaptive caching to alleviate network pressure, and adding TLS encryption for secure data transfer. The authors conclude that their middleware offers a practical pathway for enterprises to harness idle computational resources, delivering cost savings and performance gains without the prohibitive effort of rewriting legacy software.