Axioms for Rational Reinforcement Learning
We provide a formal, simple and intuitive theory of rational decision making including sequential decisions that affect the environment. The theory has a geometric flavor, which makes the arguments easy to visualize and understand. Our theory is for complete decision makers, which means that they have a complete set of preferences. Our main result shows that a complete rational decision maker implicitly has a probabilistic model of the environment. We have a countable version of this result that sheds light on the issue of countable vs. finite additivity by showing how it depends on the geometry of the space over which we have preferences. This is achieved through fruitfully connecting rationality with the Hahn-Banach Theorem. The theory presented here can be viewed as a formalization and extension of the betting odds approach to probability of Ramsey and De Finetti.
💡 Research Summary
The paper “Axioms for Rational Reinforcement Learning” develops a concise, geometric theory that links rational preferences over contracts (i.e., bets on outcomes) to the existence of an implicit probabilistic model of the environment. The authors work with complete decision makers, meaning that for every possible contract the agent can state whether it is acceptable, rejectable, or both.
Four basic rationality axioms are introduced: completeness, symmetry between acceptance and rejection, closure under non‑negative linear combinations, and a directionality condition stating that contracts whose payoffs are all non‑negative are always acceptable while contracts whose payoffs are all negative are always rejectable. These axioms are shown to be equivalent to the classic axioms of completeness, transitivity, independence, and continuity for preference relations.
From these axioms the authors prove Theorem 5: the set of acceptable contracts must be a closed half‑space in ℝ^m. Consequently there exists a non‑negative weight vector p with ∑p_i = 1 such that a contract x is accepted iff its weighted sum ∑p_i x_i (the expectation) is non‑negative. In other words, the decision maker’s preferences are exactly those induced by a linear functional, which can be interpreted as a probability distribution over the m possible outcomes. The separating hyperplane (the “accept/reject” boundary) is precisely the hyperplane defined by the expectation being zero.
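To make the half‑space characterization concrete, here is a minimal Python sketch (the function name `accept` and the particular vector p are ours, not the paper's): a decision maker with probability vector p accepts exactly those contracts whose expectation under p is non‑negative.

```python
import numpy as np

def accept(x: np.ndarray, p: np.ndarray) -> bool:
    """Accept the contract x iff its expected payoff under p is non-negative."""
    return float(np.dot(p, x)) >= 0.0

# Implicit probabilistic model: non-negative weights summing to 1.
p = np.array([0.5, 0.25, 0.25])

print(accept(np.array([1.0, -1.0, 0.0]), p))   # expectation  0.25 -> True
print(accept(np.array([-1.0, 1.0, 1.0]), p))   # expectation  0.0  -> True (on the boundary hyperplane)
print(accept(np.array([-2.0, 1.0, 1.0]), p))   # expectation -0.5  -> False
```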
The authors then extend the framework to multi‑dimensional events (pairs (i, j) etc.), showing that marginal and conditional probabilities arise naturally from the same linear functional. Conditioning on observed symbols corresponds to restricting contracts to those that give zero payoff for incompatible outcomes, leading to the familiar Bayes rule expressed as P(i|j) = P(i,j)/P(j).
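As a small illustration (with a made‑up joint distribution), conditioning on an observed symbol j amounts to keeping only the entries compatible with j and renormalizing, which is exactly the Bayes rule above:

```python
import numpy as np

# Hypothetical joint P(i, j): rows index outcomes i, columns index observed symbols j.
joint = np.array([[0.125, 0.25],
                  [0.375, 0.25]])

p_j = joint.sum(axis=0)       # marginal P(j) = [0.5, 0.5]
cond = joint / p_j            # columnwise Bayes rule: P(i | j) = P(i, j) / P(j)
print(cond[:, 0])             # P(i | j=0) = [0.25, 0.75]
```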
Learning is modeled as repeated conditioning: after observing a prefix of symbols, the decision maker updates the induced probability distribution by marginalising over the unseen suffixes. This yields a formal definition of an “informed” decision maker that uses the same expectation functional but with updated probabilities.
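A minimal sketch of this update, assuming a toy uniform prior over binary strings of length 3 (the prior and the function name `predict` are illustrative, not from the paper):

```python
from itertools import product

# Uniform prior over binary sequences of length 3.
prior = {seq: 1.0 / 8 for seq in product("01", repeat=3)}

def predict(prefix: tuple, symbol: str) -> float:
    """P(next symbol | prefix): condition on the prefix, marginalize the suffixes."""
    compatible = {s: w for s, w in prior.items() if s[:len(prefix)] == prefix}
    p_prefix = sum(compatible.values())                 # P(prefix)
    p_joint = sum(w for s, w in compatible.items()
                  if s[len(prefix)] == symbol)          # P(prefix, symbol)
    return p_joint / p_prefix

print(predict(("0",), "1"))   # 0.5 under the uniform prior
```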
Choosing between two actions reduces to asking whether the difference of their associated contracts is acceptable. The optimal action is therefore a* = arg max_k ∑_i p_i x_{k,i}, i.e., the action with the highest expected payoff under the implicit probabilities. This recovers the standard reinforcement‑learning principle of maximizing expected return, but now grounded in a set of elementary rationality axioms rather than an assumed probabilistic model.
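In code, with each action k identified with its payoff contract x_k (the payoff matrix below is a hypothetical example), the rule is a single argmax over expected payoffs:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])          # implicit probabilities over m = 3 outcomes
X = np.array([[ 2.0, -1.0, 0.0],         # contract of action 0
              [ 0.0,  1.0, 1.0],         # contract of action 1
              [-1.0,  4.0, 2.0]])        # contract of action 2

expected = X @ p                         # expected payoffs: [0.75, 0.5, 1.0]
best = int(np.argmax(expected))          # a* = argmax_k sum_i p_i x_{k,i}
print(best, expected)                    # -> 2 [0.75 0.5 1.0]
```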
A crucial contribution is the treatment of countably infinite outcome spaces. By viewing the space of contracts as a Banach sequence space (e.g., ℓ¹ or ℓ∞) and invoking the Hahn‑Banach theorem, the authors demonstrate that a continuous linear functional, and hence a probability, exists without the extra continuity or monotonicity assumptions that previous works required. The geometry of the contract space determines whether the induced probability is merely finitely additive or countably additive.
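The functional‑analytic facts behind this split can be summarized as follows (our gloss in terms of standard sequence‑space dualities; the paper phrases its results via Hahn‑Banach extensions rather than these exact isomorphisms):

```latex
\begin{aligned}
(c_0)^\ast &\cong \ell^1
  && \text{functionals on contracts with vanishing tails are countably additive,}\\
(\ell^\infty)^\ast &\cong \operatorname{ba}(\mathbb{N})
  && \text{functionals on all bounded contracts may be only finitely additive.}
\end{aligned}
```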
Overall, the paper provides a clean, geometric justification for why any rational reinforcement‑learning agent must implicitly possess a probability distribution over its environment. The approach unifies classic expected‑utility theory, the betting‑odds interpretations of Ramsey and de Finetti, and modern reinforcement learning, while shedding light on subtle issues such as finite versus countable additivity through functional‑analytic tools. The work offers both a theoretical foundation for designing rational agents and a bridge between decision‑theoretic axioms and the probabilistic machinery commonly used in AI.