Metrics for Finite Markov Decision Processes

We present metrics for measuring the similarity of states in a finite Markov decision process (MDP). The formulation of our metrics is based on the notion of bisimulation for MDPs, with an aim towards solving discounted infinite horizon reinforcement learning tasks. Such metrics can be used to aggregate states, as well as to better structure other value function approximators (e.g., memory-based or nearest-neighbor approximators). We provide bounds that relate our metric distances to the optimal values of states in the given MDP.


💡 Research Summary

The paper introduces a principled way to quantify similarity between states in a finite Markov decision process (MDP) by constructing a bisimulation‑based metric. Traditional bisimulation treats two states as equivalent only when they produce identical reward and transition distributions for every action. While this binary notion is useful for exact model reduction, it is too strict for reinforcement learning (RL) where approximate similarity is often sufficient and desirable. The authors therefore lift bisimulation to a quantitative setting: they define a distance function d(s, s′) over the state space that simultaneously bounds differences in immediate rewards and in the probability distributions over next states.
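The brittleness of the binary notion can be seen in a toy example. The sketch below (all numbers hypothetical, not from the paper) shows two states whose transition distributions are identical and whose rewards differ by a negligible ε: exact bisimulation still declares them inequivalent, whereas a quantitative metric would assign them a distance of order ε.

```python
# Toy illustration (hypothetical numbers): exact bisimulation is all-or-nothing.
import numpy as np

# Two states with identical transition distributions but rewards differing by eps.
eps = 1e-9
r_s, r_t = 1.0, 1.0 + eps
P_s = np.array([0.5, 0.5])   # next-state distribution from state s
P_t = np.array([0.5, 0.5])   # identical next-state distribution from state t

# Exact bisimulation demands strict equality, so s and t are NOT equivalent,
# even though their behavior is indistinguishable for any practical purpose.
exactly_bisimilar = (r_s == r_t) and np.array_equal(P_s, P_t)
print(exactly_bisimilar)  # False: the binary notion gives no partial credit
```

A metric such as the one defined below would instead report d(s, t) on the order of ε, which is the kind of graded information approximate state aggregation needs.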

Formally, for each action a the distance must satisfy
|r(s,a) − r(s′,a)| ≤ d(s,s′) and
W₁(P(·|s,a), P(·|s′,a)) ≤ d(s,s′),
where W₁ denotes the 1‑Wasserstein distance with the ground metric given by d itself. This self‑referential definition yields a contraction mapping F on the space of pseudo‑metrics:

F(d)(s,s′) = maxₐ{ |r(s,a) − r(s′,a)| + γ·W₁^{d}(P(·|s,a), P(·|s′,a)) }.

Because the discount factor γ ∈ [0, 1), F is a contraction on the space of pseudo‑metrics (under the sup norm), so by the Banach fixed‑point theorem it has a unique fixed point d*: the bisimulation metric. Iterating F from the zero pseudo‑metric converges to d* geometrically at rate γ.
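The fixed‑point computation can be sketched directly from the definition of F. The following is a minimal illustration on a randomly generated toy MDP (all names, sizes, and the choice of `scipy.optimize.linprog` to solve the Wasserstein transportation LP are this sketch's assumptions, not the paper's implementation):

```python
# Sketch: compute the bisimulation metric by iterating the operator F
# on a small random MDP. Toy setup; not the paper's implementation.
import numpy as np
from scipy.optimize import linprog

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
R = rng.random((n_states, n_actions))            # R[s, a]: immediate reward
P = rng.random((n_states, n_actions, n_states))  # P[s, a, s']: transition probs
P /= P.sum(axis=2, keepdims=True)                # normalize to distributions

def w1(p, q, d):
    """1-Wasserstein distance between distributions p and q with ground
    metric d, solved as a transportation linear program."""
    n = len(p)
    c = d.reshape(-1)                  # cost of moving mass from i to j
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0   # row marginals must equal p
        A_eq[n + i, i::n] = 1.0            # column marginals must equal q
    b_eq = np.concatenate([p, q])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun

def apply_F(d):
    """One application of the operator F from the text."""
    d_new = np.zeros_like(d)
    for s in range(n_states):
        for t in range(n_states):
            d_new[s, t] = max(
                abs(R[s, a] - R[t, a]) + gamma * w1(P[s, a], P[t, a], d)
                for a in range(n_actions))
    return d_new

# Iterate from the zero pseudo-metric until successive iterates agree.
d = np.zeros((n_states, n_states))
for _ in range(300):
    d_next = apply_F(d)
    if np.max(np.abs(d_next - d)) < 1e-6:
        d = d_next
        break
    d = d_next
print(np.round(d, 4))  # symmetric, zero diagonal, nonnegative
```

Note the self-reference: each sweep recomputes the Wasserstein couplings under the current estimate of d, which is exactly what makes F a fixed-point operator rather than a closed-form formula.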