Asymptotically Optimal Agents
Artificial general intelligence aims to create agents capable of learning to solve arbitrary interesting problems. We define two versions of asymptotic optimality and prove that no agent can satisfy the strong version, while in some cases, depending on the discount function, a non-computable weakly asymptotically optimal agent does exist.
Research Summary
The paper investigates the long-term performance guarantees that can be demanded of a general reinforcement-learning agent operating in an unknown environment. Two notions of asymptotic optimality are introduced. Strong asymptotic optimality requires that, for every environment µ in a given class M, the value of the agent's policy π converges to the value of the optimal policy π*_µ as time n → ∞. In other words, the agent must eventually stop exploring and act optimally at every step. Weak asymptotic optimality relaxes this requirement: the average difference between the agent's value and the optimal value must tend to zero, allowing the agent to keep exploring forever as long as the fraction of "serious mistakes" vanishes.
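In symbols, using one common formalization (the paper's exact notation may differ), write V^π_µ(h_{<n}) for the discounted value of policy π in environment µ after history h_{<n}. The two notions can then be stated as:

```latex
% Strong asymptotic optimality: pointwise convergence of values
\lim_{n\to\infty}\Bigl( V^{\pi^*_\mu}_\mu(h_{<n}) - V^{\pi}_\mu(h_{<n}) \Bigr) = 0
\quad \text{for all } \mu \in \mathcal{M}.

% Weak asymptotic optimality: convergence only on average (Ces\`aro sense)
\lim_{n\to\infty}\frac{1}{n}\sum_{t=1}^{n}
\Bigl( V^{\pi^*_\mu}_\mu(h_{<t}) - V^{\pi}_\mu(h_{<t}) \Bigr) = 0
\quad \text{for all } \mu \in \mathcal{M}.
```

The weak condition tolerates infinitely many sub-optimal steps, provided their density along the history goes to zero.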
The authors focus on the class M of all deterministic computable environments and on arbitrary (but regular) discount functions γ. Their main negative result (Theorem 8) has three parts:
- No strong asymptotically optimal policy exists when γ is computable. The proof constructs a simple two-action environment in which the optimal policy always chooses "up" and yields a constant reward of ½. Any strongly optimal policy would eventually have to avoid long runs of the sub-optimal "down" action. By embedding a hidden "trap" that can be unlocked only after a sufficiently long contiguous block of "down" actions, the authors create a second environment ν that is indistinguishable from the first under the candidate policy, yet in ν the policy's value stays below optimal, yielding a contradiction.
- No computable weakly asymptotically optimal policy exists for the same class of environments and discount functions. If a computable policy π were weakly optimal, one could again construct an environment that mimics the agent's behaviour but rewards the agent only when it follows a specific long-run pattern that π never produces, thereby violating the weak-optimality condition. Consequently, any weakly optimal policy must be non-computable.
- For certain discount functions, even non-computable weakly optimal policies cannot exist. The authors exhibit γ_k = 1/(k(k+1)), whose effective horizon grows linearly with time. With such a slowly decaying discount, an agent cannot afford the deep, contiguous exploration needed to learn the hidden trap without sacrificing too much discounted reward, so no weakly optimal policy (computable or not) can satisfy the definition.
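The indistinguishability argument behind the first impossibility result can be illustrated with a toy pair of environments. This is a minimal sketch with assumed payoffs and trap depth, not the paper's exact construction:

```python
def rewards(actions, trap_depth=None):
    """Toy deterministic environment over actions 'up'/'down'.

    With trap_depth=None (environment mu): 'up' always pays 0.5, 'down' pays 0.
    With trap_depth=m (environment nu): identical, except that a contiguous run
    of m 'down' actions unlocks a permanent reward of 1.  The depth m and the
    payoffs are illustrative assumptions.
    """
    out, run, unlocked = [], 0, False
    for a in actions:
        run = run + 1 if a == "down" else 0
        if trap_depth is not None and run >= trap_depth:
            unlocked = True
        out.append(1.0 if unlocked else (0.5 if a == "up" else 0.0))
    return out

# A policy that never emits trap_depth consecutive 'down' actions receives
# identical rewards in mu and nu, so it cannot tell them apart -- yet in nu
# its value stays bounded away from the optimum.
cautious = ["up", "down", "down", "up"] * 3   # longest 'down' run is 2
assert rewards(cautious) == rewards(cautious, trap_depth=3)
```

Only a sufficiently deep, contiguous block of "down" actions separates the two environments, which is exactly the exploration the discounting may make too expensive.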
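The claim that γ_k = 1/(k(k+1)) yields a linearly growing effective horizon can be checked numerically. Below, the effective horizon at time k is taken to be the smallest h such that γ_k, …, γ_{k+h-1} cover a (1 − ε) fraction of the remaining discount mass (which telescopes to 1/k); this particular definition of "effective horizon" is an assumption matching common usage, not a quote from the paper:

```python
def effective_horizon(k, eps=0.1):
    """Smallest h with sum(gamma_t for t in [k, k+h)) >= (1-eps) * tail mass,
    where gamma_t = 1/(t*(t+1)) and the tail sum_{t>=k} gamma_t telescopes
    to 1/k."""
    target = (1.0 - eps) / k   # (1-eps) fraction of the tail mass 1/k
    covered, t = 0.0, k
    while covered < target:
        covered += 1.0 / (t * (t + 1))
        t += 1
    return t - k

# The horizon roughly doubles whenever k doubles: it grows linearly in k.
print(effective_horizon(10), effective_horizon(20), effective_horizon(40))
```

By contrast, geometric discounting γ^k has a constant effective horizon, which is why the positive result below becomes possible.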
On the positive side, the paper shows that for "reasonable" discount functions, most notably geometric discounting γ_k = γ^k with 0 < γ < 1, there does exist a non-computable weakly asymptotically optimal agent. The construction is reminiscent of AIXI: the agent maintains a Bayesian mixture over all computable environments, but it augments the mixture with an explicit exploration component (similar to ε-greedy or UCB). The exploration schedule is designed so that (i) the agent explores sufficiently often and deeply to eventually identify the true environment, and (ii) the exploration probability decays fast enough (as dictated by the discount) that the average discounted loss caused by exploration vanishes. This yields a policy whose average value converges to the optimal value in every computable deterministic environment.
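The decaying-exploration idea can be sketched with an illustrative schedule: explore with probability p_t = 1/√t (this specific rate is an assumption chosen for illustration, not the paper's schedule). The time-averaged exploration rate then vanishes, which is the kind of behaviour the weak-optimality condition demands:

```python
def average_exploration_rate(n):
    """Mean of the illustrative exploration probability p_t = t**-0.5 over
    the first n steps; the average behaves like 2/sqrt(n), so it tends to 0
    even though the agent never stops exploring entirely."""
    return sum(t ** -0.5 for t in range(1, n + 1)) / n

for n in (100, 10_000, 1_000_000):
    print(n, round(average_exploration_rate(n), 4))
```

The balance in the actual construction is subtler: exploration must also be deep enough (long contiguous blocks) to unmask trap-style environments, not merely frequent enough.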
The authors argue that strong asymptotic optimality is too demanding for realistic agents; even weak asymptotic optimality can be impossible depending on how far the agent looks ahead (the shape of γ). This highlights the discount function as a fundamental design parameter for AGI: a rapidly decaying discount (short effective horizon) forces early exploitation and precludes the deep exploration needed for learning, while a slowly decaying discount (long horizon) may make learning possible but can also render weak optimality unattainable if the horizon grows too fast.
Finally, the paper situates its results within the broader literature. Existing PAC-MDP guarantees assume finite state and action spaces and often rely on ergodicity or mixing conditions; the present work removes those assumptions entirely, dealing with the most general class of deterministic computable environments. The impossibility results therefore expose a fundamental limitation of any learning theory that seeks universal guarantees without imposing structural constraints on the environment. The constructive weakly optimal policy suggests a new direction for practical algorithm design: incorporate principled, discount-aware exploration bonuses that balance deep learning against long-term reward, even if the resulting algorithm cannot be fully realized computationally. In sum, the paper delineates the precise boundaries between what is provably achievable and what is fundamentally impossible for universal reinforcement learners, emphasizing the central role of discounting in shaping those boundaries.