Asymptotically Optimal Agents
Artificial general intelligence aims to create agents capable of learning to solve arbitrary interesting problems. We define two versions of asymptotic optimality and prove that no agent can satisfy the strong version, while in some cases, depending on the discount function, a non-computable weakly asymptotically optimal agent does exist.
Research Summary
The paper investigates the long-term performance guarantees that can be demanded of a general reinforcement-learning agent operating in an unknown environment. Two notions of asymptotic optimality are introduced. Strong asymptotic optimality requires that, for every environment µ in a given class M, the value of the agent's policy π converges to the value of the optimal policy π*_µ as time n → ∞. In other words, the agent must eventually stop exploring and act optimally at every step. Weak asymptotic optimality relaxes this requirement: the average difference between the agent's value and the optimal value must tend to zero, allowing the agent to keep exploring forever as long as the fraction of "serious mistakes" vanishes.
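In symbols, using one common formalization (the paper's exact notation may differ), write V^π_µ(h_{<n}) for the discounted value of policy π in environment µ after history h_{<n}. The two notions can then be stated as:

```latex
% Strong asymptotic optimality: pointwise convergence of values
\lim_{n\to\infty}\Bigl( V^{\pi^*_\mu}_\mu(h_{<n}) - V^{\pi}_\mu(h_{<n}) \Bigr) = 0
\quad \text{for all } \mu \in \mathcal{M}.

% Weak asymptotic optimality: convergence only on average (Ces\`aro sense)
\lim_{n\to\infty}\frac{1}{n}\sum_{t=1}^{n}
\Bigl( V^{\pi^*_\mu}_\mu(h_{<t}) - V^{\pi}_\mu(h_{<t}) \Bigr) = 0
\quad \text{for all } \mu \in \mathcal{M}.
```

The weak condition tolerates infinitely many sub-optimal steps, provided their density along the history goes to zero.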
The authors focus on the class M of all deterministic computable environments and on arbitrary (but regular) discount functions γ. Their main negative result (Theorem 8) has three parts:
- No strong asymptotically optimal policy exists when γ is computable. The proof constructs a simple two-action environment in which the optimal policy always chooses "up" and yields a constant reward of ½. Any strongly optimal policy would eventually have to avoid long runs of the sub-optimal "down" action. By embedding a hidden "trap" that can be unlocked only after a sufficiently long contiguous block of "down" actions, the authors create a second environment ν that is indistinguishable from the first under the candidate policy, yet in ν the policy's value stays below optimal, yielding a contradiction.
- No computable weakly asymptotically optimal policy exists for the same class of environments and discount functions. If a computable policy π were weakly optimal, one could again construct an environment that mimics the agent's behaviour but rewards the agent only when it follows a specific long-run pattern that π never produces, thereby violating the weak-optimality condition. Consequently, any weakly optimal policy must be non-computable.
- For certain discount functions, even non-computable weakly optimal policies cannot exist. The authors exhibit γ_k = 1/(k(k+1)), whose effective horizon grows linearly with time. With such a slowly decaying discount, an agent cannot afford the deep, contiguous exploration needed to learn the hidden trap without sacrificing too much discounted reward, so no weakly optimal policy (computable or not) can satisfy the definition.
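The indistinguishability argument behind the first impossibility result can be illustrated with a toy pair of environments. This is a minimal sketch with assumed payoffs and trap depth, not the paper's exact construction:

```python
def rewards(actions, trap_depth=None):
    """Toy deterministic environment over actions 'up'/'down'.

    With trap_depth=None (environment mu): 'up' always pays 0.5, 'down' pays 0.
    With trap_depth=m (environment nu): identical, except that a contiguous run
    of m 'down' actions unlocks a permanent reward of 1.  The depth m and the
    payoffs are illustrative assumptions.
    """
    out, run, unlocked = [], 0, False
    for a in actions:
        run = run + 1 if a == "down" else 0
        if trap_depth is not None and run >= trap_depth:
            unlocked = True
        out.append(1.0 if unlocked else (0.5 if a == "up" else 0.0))
    return out

# A policy that never emits trap_depth consecutive 'down' actions receives
# identical rewards in mu and nu, so it cannot tell them apart -- yet in nu
# its value stays bounded away from the optimum.
cautious = ["up", "down", "down", "up"] * 3   # longest 'down' run is 2
assert rewards(cautious) == rewards(cautious, trap_depth=3)
```

Only a sufficiently deep, contiguous block of "down" actions separates the two environments, which is exactly the exploration the discounting may make too expensive.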
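The claim that γ_k = 1/(k(k+1)) yields a linearly growing effective horizon can be checked numerically. Below, the effective horizon at time k is taken to be the smallest h such that γ_k, …, γ_{k+h-1} cover a (1 − ε) fraction of the remaining discount mass (which telescopes to 1/k); this particular definition of "effective horizon" is an assumption matching common usage, not a quote from the paper:

```python
def effective_horizon(k, eps=0.1):
    """Smallest h with sum(gamma_t for t in [k, k+h)) >= (1-eps) * tail mass,
    where gamma_t = 1/(t*(t+1)) and the tail sum_{t>=k} gamma_t telescopes
    to 1/k."""
    target = (1.0 - eps) / k   # (1-eps) fraction of the tail mass 1/k
    covered, t = 0.0, k
    while covered < target:
        covered += 1.0 / (t * (t + 1))
        t += 1
    return t - k

# The horizon roughly doubles whenever k doubles: it grows linearly in k.
print(effective_horizon(10), effective_horizon(20), effective_horizon(40))
```

By contrast, geometric discounting γ^k has a constant effective horizon, which is why the positive result below becomes possible.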
On the positive side, the paper shows that for "reasonable" discount functions, most notably geometric discounting γ_k = γ^k with 0 < γ < 1, there does exist a non-computable weakly asymptotically optimal agent. The construction is reminiscent of AIXI: the agent maintains a Bayesian mixture over all computable environments, but it augments the mixture with an explicit exploration component (similar to ε-greedy or UCB). The exploration schedule is designed so that (i) the agent explores sufficiently often and deeply to eventually identify the true environment, and (ii) the exploration probability decays fast enough (as dictated by the discount) that the average discounted loss caused by exploration vanishes. This yields a policy whose average value converges to the optimal value in every computable deterministic environment.
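The decaying-exploration idea can be sketched with an illustrative schedule: explore with probability p_t = 1/√t (this specific rate is an assumption chosen for illustration, not the paper's schedule). The time-averaged exploration rate then vanishes, which is the kind of behaviour the weak-optimality condition demands:

```python
def average_exploration_rate(n):
    """Mean of the illustrative exploration probability p_t = t**-0.5 over
    the first n steps; the average behaves like 2/sqrt(n), so it tends to 0
    even though the agent never stops exploring entirely."""
    return sum(t ** -0.5 for t in range(1, n + 1)) / n

for n in (100, 10_000, 1_000_000):
    print(n, round(average_exploration_rate(n), 4))
```

The balance in the actual construction is subtler: exploration must also be deep enough (long contiguous blocks) to unmask trap-style environments, not merely frequent enough.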
The authors argue that strong asymptotic optimality is too demanding for realistic agents; even weak asymptotic optimality can be impossible depending on how far the agent looks ahead (the shape of γ). This highlights the discount function as a fundamental design parameter for AGI: a rapidly decaying discount (short effective horizon) forces early exploitation and precludes the deep exploration needed for learning, while a slowly decaying discount (long horizon) may make learning possible but can also render weak optimality unattainable if the horizon grows too fast.
Finally, the paper situates its results within the broader literature. Existing PAC-MDP guarantees assume finite state and action spaces and often rely on ergodicity or mixing conditions; the present work removes those assumptions entirely, dealing with the most general class of deterministic computable environments. The impossibility results therefore expose a fundamental limitation of any learning theory that seeks universal guarantees without imposing structural constraints on the environment. The constructive weakly optimal policy suggests a new direction for practical algorithm design: incorporate principled, discount-aware exploration bonuses that balance deep learning against long-term reward, even if the resulting algorithm cannot be fully realized computationally. In sum, the paper delineates the precise boundaries between what is provably achievable and what is fundamentally impossible for universal reinforcement learners, emphasizing the central role of discounting in shaping those boundaries.