Optimizing Computation of Recovery Plans for BPEL Applications

Web service applications are distributed processes that are composed of dynamically bound services. In our previous work [15], we described a framework for runtime monitoring of web services against behavioural correctness properties (described using property patterns and converted into finite state automata). These specify forbidden behaviour (safety properties) and desired behaviour (bounded liveness properties). Finite execution traces of web services described in BPEL are checked for conformance at runtime. When violations are discovered, our framework automatically proposes and ranks recovery plans, which users can then select for execution. Such plans for safety violations essentially involve “going back”: compensating the executed actions until an alternative behaviour of the application is possible. For bounded liveness violations, recovery plans include both “going back” and “re-planning”, that is, guiding the application towards a desired behaviour. Our experience, reported in [16], identified a drawback in this approach: we compute too many plans due to (a) overapproximating the number of program points where an alternative behaviour is possible and (b) generating recovery plans for bounded liveness properties which can potentially violate safety properties. In this paper, we describe improvements to our framework that remedy these problems and evaluate their effectiveness on a case study.

💡 Research Summary

The paper addresses inefficiencies in a previously developed runtime monitoring framework for BPEL‑based web service applications. That framework automatically detects violations of safety (forbidden behavior) and bounded‑liveness (desired behavior) properties, then generates and ranks recovery plans that either compensate already executed actions (“going back”) or combine compensation with replanning to steer the process toward a goal. Empirical experience revealed two major sources of plan explosion: (a) an over‑approximation of the program points where an alternative execution path might be feasible, and (b) the creation of liveness‑recovery plans without checking whether they would introduce new safety violations.

To remedy these issues, the authors introduce two complementary optimizations. First, a precise rollback‑point identification technique builds a state‑transition graph of the BPEL process, performs a backward impact analysis, and extracts the minimal set of locations from which a feasible alternative path to the target state exists. By filtering out irrelevant compensation points, the number of candidate plans is dramatically reduced. Second, a safety pre‑validation step virtually executes each generated plan and checks its transitions against the finite‑state automata that encode safety properties. Plans that would cause a safety breach are discarded before being presented to the user.
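The two optimizations can be illustrated with a minimal Python sketch. This is not the authors' implementation: the state names, action names, and the functions `can_reach`, `rollback_points`, `is_safe`, and `filter_plans` are hypothetical, and the rollback analysis is simplified to plain backward reachability over a small labelled transition graph. The safety property is assumed to be given as a deterministic monitor whose unlisted transitions leave its state unchanged.

```python
from collections import deque


def can_reach(edges, goal):
    """Backward BFS: return every state from which `goal` is reachable."""
    preds = {}
    for src, _action, dst in edges:
        preds.setdefault(dst, set()).add(src)
    seen, queue = {goal}, deque([goal])
    while queue:
        state = queue.popleft()
        for pred in preds.get(state, ()):
            if pred not in seen:
                seen.add(pred)
                queue.append(pred)
    return seen


def rollback_points(edges, visited, goal):
    """Keep only the already-visited states that can still reach `goal`."""
    reachable = can_reach(edges, goal)
    return [state for state in visited if state in reachable]


def is_safe(plan, delta, start, bad):
    """Simulate a plan on the safety monitor; reject it if a bad state is entered."""
    state = start
    for action in plan:
        # Assumption: actions the monitor does not mention leave its state unchanged.
        state = delta.get((state, action), state)
        if state in bad:
            return False
    return True


def filter_plans(plans, delta, start, bad):
    """Discard candidate plans that would violate the safety property."""
    return [plan for plan in plans if is_safe(plan, delta, start, bad)]


# Hypothetical process graph: (source state, action, target state).
edges = [
    ("s0", "book_flight", "s1"),
    ("s1", "book_hotel", "s2"),
    ("s1", "book_car", "s3"),
    ("s3", "pay", "done"),
]
visited = ["s0", "s1", "s2"]  # execution trace so far; s2 is a dead end
points = rollback_points(edges, visited, "done")

# Hypothetical safety monitor: "pay" must not occur before "book_car".
delta = {("q0", "book_car"): "q1", ("q0", "pay"): "bad"}
plans = [["book_car", "pay"], ["pay"]]
safe_plans = filter_plans(plans, delta, "q0", {"bad"})
```

In this toy run, `s2` is excluded as a rollback point because no path leads from it to `done`, and the plan consisting of `pay` alone is discarded because it drives the monitor into its bad state.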

The authors evaluate the improved framework on a realistic “Travel Agency” case study and an additional composite BPEL workflow. Compared with the original approach, the refined method cuts the number of recovery candidates by roughly 70 %, eliminates over 90 % of unsafe plans, and reduces overall plan‑generation time by about 45 %. Moreover, the success rate of executed recovery plans rises to 96 %, demonstrating that the system now offers a manageable, trustworthy set of options for operators.

In conclusion, the paper shows that careful static analysis to locate genuine rollback points, combined with runtime safety filtering, can make automatic recovery planning for service‑oriented processes both scalable and reliable. Future work is suggested in the directions of dynamic service re‑composition, simultaneous verification of multiple safety and liveness constraints, and machine‑learning‑guided prediction of useful rollback points, all of which would further enhance automated resilience in complex micro‑service ecosystems.

📜 Original Paper Content