The Cost of Troubleshooting Cost Clusters with Inside Information

Decision theoretical troubleshooting is about minimizing the expected cost of solving a certain problem like repairing a complicated man-made device. In this paper we consider situations where you have to take apart some of the device to get access to certain clusters and actions. Specifically, we investigate troubleshooting with independent actions in a tree of clusters where actions inside a cluster cannot be performed before the cluster is opened. The problem is non-trivial because there is a cost associated with opening and closing a cluster. Troubleshooting with independent actions and no clusters can be solved in O(n lg n) time (n being the number of actions) by the well-known “P-over-C” algorithm due to Kadane and Simon, but an efficient and optimal algorithm for a tree cluster model has not yet been found. In this paper we describe a “bottom-up P-over-C” O(n lg n) time algorithm and show that it is optimal when the clusters do not need to be closed to test whether the actions solved the problem.

💡 Research Summary

The paper tackles a classic decision‑theoretic troubleshooting problem, minimizing the expected cost of fixing a malfunction, under an added structural constraint: the actions are grouped inside a hierarchy of “clusters” that must be physically opened before any action inside can be performed. Each cluster incurs a fixed opening cost and, optionally, a closing cost if it must be sealed again before one can test whether the device works. Actions are assumed independent: each has a known probability of solving the problem and a known execution cost. In the traditional setting without clusters, the optimal order is given by the well‑known P‑over‑C rule: perform the actions in decreasing order of the ratio of success probability to cost, which takes O(n log n) time for n actions (Kadane & Simon, 1977). Introducing clusters makes the problem non‑trivial because the opening cost is paid once per cluster, so the ordering of actions inside a cluster interacts with the ordering of actions in sibling clusters.
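As a rough illustration of the cluster‑free P‑over‑C rule, the following Python sketch sorts actions by their probability‑to‑cost ratio and evaluates the expected cost of the resulting order. The cost formula is an assumption, since the summary does not state one: it treats the probabilities as belonging to mutually exclusive faults, so the chance that the device is still broken is one minus the sum of the probabilities already tried. The function names and the sample numbers are hypothetical.

```python
from itertools import permutations


def expected_cost(sequence):
    """Expected cost of trying (p, c) actions in order, stopping at the first success.

    Assumption (not stated in the summary): the p values belong to mutually
    exclusive faults, so P(still broken) = 1 - sum of the earlier p values.
    """
    total, p_unsolved = 0.0, 1.0
    for p, c in sequence:
        total += p_unsolved * c  # pay c only if every earlier action failed
        p_unsolved -= p
    return total


def p_over_c_order(actions):
    """Order actions by decreasing p/c ratio; the sort dominates at O(n log n)."""
    return sorted(actions, key=lambda a: a[0] / a[1], reverse=True)


# Hypothetical (probability, cost) pairs, chosen only for illustration.
actions = [(0.2, 1.0), (0.5, 4.0), (0.3, 1.0)]
best = p_over_c_order(actions)

# Brute force over all orderings, feasible only for tiny inputs,
# used here to sanity-check the sorted order.
brute_force_min = min(expected_cost(s) for s in permutations(actions))
```

On this small input the sorted order matches the brute‑force minimum, which is what the exchange argument behind the rule predicts: swapping two adjacent actions lowers the expected cost exactly when the later one has the higher ratio.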

The authors formalize the “tree‑cluster” model: the device is represented as a rooted tree where internal nodes are clusters and leaves are atomic actions. For a cluster \(v\) we denote its opening cost \(c^{open}_v\) and (if considered) closing cost \(c^{close}_v\). An action \(a_i\) has success probability \(p_i\) and execution cost \(c_i\). The goal is to choose a global sequence of actions (respecting the rule that an action can be taken only after its ancestor clusters have been opened) that minimizes the expected total cost.
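One way to make the tree‑cluster model concrete is the Python sketch below, which evaluates the expected cost of a given action sequence when closing costs are zero. It is an illustration under stated assumptions rather than the authors' formulation: the class and function names are hypothetical, a cluster's opening cost is charged together with the first action performed inside it (along with any ancestors not yet opened), and the probabilities are again assumed to belong to mutually exclusive faults.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cluster:
    name: str
    open_cost: float = 0.0
    parent: Optional["Cluster"] = None  # None marks the root


@dataclass
class Action:
    name: str
    p: float      # probability that this action solves the problem
    cost: float   # execution cost
    cluster: Cluster


def expected_cost(sequence):
    """Expected cost of an action sequence with zero closing costs.

    Assumptions: opened clusters stay open; opening costs are paid right
    before the first action that needs them; faults are mutually exclusive.
    """
    opened = set()
    total, p_unsolved = 0.0, 1.0
    for action in sequence:
        step_cost = action.cost
        node = action.cluster
        # Walk up the tree, opening every ancestor that is still closed.
        while node is not None and node.name not in opened:
            step_cost += node.open_cost
            opened.add(node.name)
            node = node.parent
        total += p_unsolved * step_cost
        p_unsolved -= action.p
    return total


# Hypothetical device: a free root and one cluster K costing 2.0 to open.
root = Cluster("root")
k = Cluster("K", open_cost=2.0, parent=root)
a = Action("a", 0.5, 1.0, root)
b = Action("b", 0.3, 1.0, k)
c = Action("c", 0.2, 1.0, k)

outside_first = expected_cost([a, b, c])
inside_first = expected_cost([b, c, a])
```

In this toy instance, trying the outside action first costs less in expectation (about 2.7 versus 4.2), because the opening cost of K is then paid only in the half of the cases where the first action fails.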

The main contribution is a “bottom‑up P‑over‑C” algorithm that solves the problem optimally when closing costs are zero (i.e., once a cluster is opened it can remain open for the rest of the troubleshooting session). The algorithm proceeds recursively:

Leaf processing – each leaf action is treated as a trivial sub‑problem with its own \((p_i, c_i)\).
Local ordering – for any cluster \(v\), after its children have been solved optimally, each child (either a leaf action or a previously computed sub‑cluster) is represented by an effective success probability \(P_j\) and an effective expected cost \(C_j\). The children are then sorted by the ratio \(P_j / C_j\) (the classic P‑over‑C rule) to obtain the optimal local order.
Cluster aggregation – the sorted children are combined to compute the cluster’s aggregate success probability and expected cost: \