Topological Value Iteration Algorithms
Value iteration is a powerful yet inefficient algorithm for Markov decision processes (MDPs) because it puts the majority of its effort into backing up the entire state space, which turns out to be unnecessary in many cases. To overcome this problem, many approaches have been proposed. Among them, ILAO* and variants of RTDP are state-of-the-art. These methods use reachability analysis and heuristic search to avoid some unnecessary backups. However, none of these approaches builds the graphical structure of the state transitions in a pre-processing step or uses that structural information to systematically decompose a problem, thereby generating an intelligent backup sequence for the state space. In this paper, we present two optimal MDP algorithms. The first algorithm, topological value iteration (TVI), detects the structure of MDPs and backs up states based on topological sequences. It (1) divides an MDP into strongly-connected components (SCCs), and (2) solves these components sequentially. TVI vastly outperforms VI and other state-of-the-art algorithms when an MDP has multiple, close-to-equal-sized SCCs. The second algorithm, focused topological value iteration (FTVI), is an extension of TVI. FTVI restricts its attention to connected components that are relevant for solving the MDP. Specifically, it uses a small amount of heuristic search to eliminate provably sub-optimal actions; this pruning allows FTVI to find smaller connected components, thus running faster. We demonstrate that FTVI outperforms TVI by an order of magnitude, averaged across several domains. Surprisingly, FTVI also significantly outperforms popular heuristically-informed MDP algorithms such as ILAO*, LRTDP, BRTDP and Bayesian-RTDP in many domains, sometimes by as much as two orders of magnitude. Finally, we characterize the type of domains where FTVI excels, suggesting a way to an informed choice of solver.
💡 Research Summary
The paper tackles the well‑known inefficiency of classical Value Iteration (VI) for solving Markov Decision Processes (MDPs). VI repeatedly backs up the entire state space at every iteration, which is wasteful when many states are irrelevant to the optimal solution. The authors propose two novel algorithms that exploit the graphical structure of the MDP: Topological Value Iteration (TVI) and its focused variant, Focused Topological Value Iteration (FTVI).
TVI begins by representing the MDP as a directed transition graph and decomposing this graph into strongly‑connected components (SCCs) using a linear‑time algorithm such as Tarjan’s. An SCC is a maximal set of states that are mutually reachable; edges between different SCCs are necessarily one‑way. By topologically sorting the SCC‑dependency DAG, TVI obtains an ordering in which each component is solved only after every component it depends on — its successors in the transition graph, since a Bellman backup of a state reads the values of that state’s successors — has already converged. Within a component, standard VI is applied until convergence, but because the values the component depends on are already fixed, no further backups of those states are required. Consequently, the overall number of backups is dramatically reduced, especially when the MDP contains many SCCs of comparable size. The cost is O(|S|+|E|) for the decomposition plus Σ_i O(k_i·|E_i|) for the internal VI loops, where k_i is the number of iterations needed for component i and |E_i| the number of transition edges originating in it.
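The two phases described above can be sketched in a few dozen lines of Python. The MDP interface here (successor lists of `(next_state, probability)` pairs, a `reward(s, a)` function) is illustrative, not the paper's notation. Conveniently, Tarjan's algorithm emits SCCs with sink components first, which is exactly the order TVI needs, since a state's value depends only on its successors:

```python
def tarjan_sccs(states, successors):
    """Tarjan's linear-time SCC algorithm (iterative, to avoid recursion limits).
    Emits components with sink SCCs first, i.e. in reverse topological order
    of the condensation DAG -- the order TVI solves them in."""
    index, low, on_stack = {}, {}, set()
    stack, sccs, counter = [], [], [0]

    for root in states:
        if root in index:
            continue
        index[root] = low[root] = counter[0]; counter[0] += 1
        stack.append(root); on_stack.add(root)
        work = [(root, iter(successors(root)))]
        while work:
            node, it = work[-1]
            advanced = False
            for w in it:
                if w not in index:            # tree edge: descend into w
                    index[w] = low[w] = counter[0]; counter[0] += 1
                    stack.append(w); on_stack.add(w)
                    work.append((w, iter(successors(w))))
                    advanced = True
                    break
                elif w in on_stack:           # edge back into the current stack
                    low[node] = min(low[node], index[w])
            if not advanced:                  # all neighbours of node processed
                work.pop()
                if work:
                    parent = work[-1][0]
                    low[parent] = min(low[parent], low[node])
                if low[node] == index[node]:  # node is the root of an SCC
                    comp = []
                    while True:
                        w = stack.pop(); on_stack.discard(w)
                        comp.append(w)
                        if w == node:
                            break
                    sccs.append(comp)
    return sccs


def tvi(states, actions, transitions, reward, gamma=0.95, eps=1e-6):
    """Topological VI sketch. transitions[s][a] is a list of
    (next_state, probability) pairs; these names are assumptions."""
    def succ(s):
        seen = set()
        for a in actions(s):
            for t, p in transitions[s][a]:
                if p > 0 and t not in seen:
                    seen.add(t)
                    yield t

    V = {s: 0.0 for s in states}
    for comp in tarjan_sccs(states, succ):    # sink components first
        while True:                           # plain VI confined to one SCC
            delta = 0.0
            for s in comp:
                best = max(reward(s, a) +
                           gamma * sum(p * V[t] for t, p in transitions[s][a])
                           for a in actions(s))
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < eps:
                break
    return V
```

Note that once a component's loop terminates, its states are never backed up again; all remaining work reads their values as constants.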
FTVI builds on TVI by adding a lightweight heuristic search phase that prunes provably sub‑optimal actions before the SCC decomposition is performed. The search maintains admissible lower and upper bounds on state values during a short burst of heuristic exploration from the start state, and uses these bounds to identify actions that cannot belong to any optimal policy. Removing these actions can split the original SCCs into smaller, more relevant sub‑components or eliminate entire components that are never visited by an optimal policy. The authors prove that this pruning does not affect optimality. The resulting focused SCCs are solved in the same topological ordering as TVI, but because the graph is now smaller, FTVI typically runs an order of magnitude faster than TVI.
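The elimination test itself is simple. A minimal reward‑maximization sketch follows; the bound dictionaries are an assumed interface, not the paper's API. An action is provably sub‑optimal at a state when even its optimistic (upper‑bound) Q‑value falls short of the best pessimistic (lower‑bound) Q‑value achievable at that state:

```python
def prune_actions(applicable, q_upper, q_lower):
    """Keep only actions that might still be optimal.
    q_upper[a] / q_lower[a] are admissible optimistic / pessimistic
    Q-value estimates for action a (illustrative names)."""
    # Some action is guaranteed to achieve at least this much:
    best_lower = max(q_lower[a] for a in applicable)
    # An action survives only if its optimistic estimate can reach that bar.
    return [a for a in applicable if q_upper[a] >= best_lower]
```

Because the bounds are admissible, a pruned action's true Q‑value is certain to be dominated, which is why the deletion preserves optimality; the payoff is that edges for pruned actions vanish from the transition graph before Tarjan's decomposition runs.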
Empirical evaluation spans several benchmark domains (grid worlds, RiverSwim, stochastic shortest‑path problems) and synthetic MDPs specifically engineered to contain many near‑equal‑sized SCCs. Performance metrics include total runtime and number of Bellman backups. Both TVI and FTVI dramatically outperform vanilla VI and also dominate state‑of‑the‑art heuristic‑informed solvers such as ILAO*, LRTDP, BRTDP, and Bayesian‑RTDP. In domains with a rich SCC structure, FTVI achieves speed‑ups of 10× to 100× over the best existing methods. In contrast, when the transition graph is almost fully connected (i.e., a single giant SCC), the advantage diminishes, and traditional VI or heuristic search methods remain competitive.
The paper also provides practical guidance for algorithm selection. If a problem exhibits a clear hierarchical or modular transition structure—evidenced by multiple SCCs, sparse inter‑component connections, or large regions that are unreachable from the optimal policy—TVI or, preferably, FTVI should be the first choice. Conversely, for dense graphs lacking meaningful SCC decomposition, the overhead of graph analysis may outweigh the benefits.
Finally, the authors outline future research directions: (1) developing online or incremental SCC detection to handle dynamic or partially known MDPs, (2) parallelizing the independent SCC solves to exploit multi‑core architectures, and (3) enhancing the pruning phase with more sophisticated sampling or learning‑based heuristics to further shrink the focused subgraph. By marrying graph‑theoretic decomposition with classic dynamic programming, the work offers a compelling new paradigm for scalable MDP solution methods.