An Optimal Policy for Learning Controllable Dynamics by Exploration

Reading time: 5 minutes
...

📝 Original Info

  • Title: An Optimal Policy for Learning Controllable Dynamics by Exploration
  • ArXiv ID: 2512.20053
  • Date: 2025-12-23
  • Authors: **Peter N. Loxley** (School of Science and Technology, University of New England, Australia)

📝 Abstract

Controllable Markov chains describe the dynamics of sequential decision making tasks and are the central component in optimal control and reinforcement learning. In this work, we give the general form of an optimal policy for learning controllable dynamics in an unknown environment by exploring over a limited time horizon. This policy is simple to implement and efficient to compute, and allows an agent to "learn by exploring" as it maximizes its information gain in a greedy fashion by selecting controls from a constraint set that changes over time during exploration. We give a simple parameterization for the set of controls, and present an algorithm for finding an optimal policy. The reason for this policy is the existence of certain types of states that restrict control of the dynamics, such as transient states, absorbing states, and non-backtracking states. We show why the occurrence of these states makes a non-stationary policy essential for achieving optimal exploration. Six interesting examples of controllable dynamics are treated in detail. Policy optimality is demonstrated using counting arguments, comparing with suboptimal policies, and by making use of a sequential improvement property from dynamic programming.


📄 Full Content

AN OPTIMAL POLICY FOR LEARNING CONTROLLABLE DYNAMICS BY EXPLORATION

A PREPRINT

Peter N. Loxley
School of Science and Technology, University of New England, Australia

ABSTRACT

Controllable Markov chains describe the dynamics of sequential decision making tasks and are the central component in optimal control and reinforcement learning. In this work, we give the general form of an optimal policy for learning controllable dynamics in an unknown environment by exploring over a limited time horizon. This policy is simple to implement and efficient to compute, and allows an agent to "learn by exploring" as it maximizes its information gain in a greedy fashion by selecting controls from a constraint set that changes over time during exploration. We give a simple parameterization for the set of controls, and present an algorithm for finding an optimal policy. The reason for this policy is the existence of certain types of states that restrict control of the dynamics, such as transient states, absorbing states, and non-backtracking states. We show why the occurrence of these states makes a non-stationary policy essential for achieving optimal exploration. Six interesting examples of controllable dynamics are treated in detail. Policy optimality is demonstrated using counting arguments, comparing with suboptimal policies, and by making use of a sequential improvement property from dynamic programming.

Keywords: Exploration · Information theory · Optimal control · Markov chains · Non-stationary policy

1 Introduction

Environments with controllable dynamics are interesting to understand from a fundamental viewpoint, as well as providing a potential source of useful applications. When an explicit model of the dynamics is not available, one possibility is to learn the dynamics by exploring the environment. This is observed in animal behavior, where an inquisitive animal will often explore a novel environment by actively seeking information about the environment, enabling it to better prepare for future events such as avoiding predators [Wood-Gush and Vestergaard, 1991, Pisula and Modlinska, 2023]. Active learning is concerned with a similar task, which involves selecting what data to gather next so as to learn as much as possible about an unknown quantity [MacKay, 1992].

In this work, we are interested in the structure of optimal policies for exploring unknown environments to learn controllable dynamics. This is a difficult problem in general, as illustrated by the following example. Consider the simple environment in Figure 1, given by a 5 × 5 maze with three trapping states (in Markov chains these are called absorbing states). It is certainly possible for an agent to explore this environment and learn the structure of the maze in the process of doing so. The dilemma the agent faces is where to explore. Exploring every part of the environment means the agent will eventually encounter a trapping state and become trapped, rendering it unable to continue exploring and most likely leading to an incomplete knowledge of the maze. Other types of states can also interfere with exploration, as we shall see. It is therefore desirable for an agent to avoid certain states at certain times during exploration. It is also desirable for an agent to explore new locations that have not yet been explored in order to learn new information. Ideally, a policy for exploring would aim to strike an optimal balance between these somewhat opposing objectives.

Figure 1: A simple environment given by a 5 × 5 maze with three trapping states (red squares). An agent can explore this environment to learn the maze but will likely encounter a trapping state and become trapped along the way. When this happens the agent can no longer continue to explore.
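The trapping-state dilemma can be made concrete with a small sketch. Below is a minimal Python construction of a controllable Markov chain for a gridworld of this kind, assuming deterministic moves and hypothetical trap locations; the exact maze layout of Figure 1 is not reproduced.

```python
# A minimal sketch of a controllable Markov chain for a 5x5 gridworld with
# absorbing ("trapping") states, in the spirit of Figure 1. The trap positions
# and deterministic moves are illustrative assumptions, not the paper's exact
# environment.
import numpy as np

N = 5                                          # grid is N x N
TRAPS = {6, 12, 18}                            # hypothetical trapping states (flat indices)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def transition_matrices(n=N, traps=TRAPS):
    """Return P with P[a, s, s'] = Pr(s' | state s, control a)."""
    S = n * n
    P = np.zeros((len(ACTIONS), S, S))
    for s in range(S):
        r, c = divmod(s, n)
        for a, (dr, dc) in enumerate(ACTIONS):
            if s in traps:
                P[a, s, s] = 1.0               # absorbing: every control stays in s
                continue
            rr, cc = r + dr, c + dc
            if 0 <= rr < n and 0 <= cc < n:
                P[a, s, rr * n + cc] = 1.0     # legal move is deterministic
            else:
                P[a, s, s] = 1.0               # bumping a wall leaves the state unchanged
    return P

P = transition_matrices()
# Once a trapping state is entered, no control can leave it:
assert all(P[a, 6, 6] == 1.0 for a in range(len(ACTIONS)))
```

The absorbing rows make the restriction explicit: every control available at a trap maps back to the trap, so an exploring agent that enters one can gather no further information.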
To address this challenge we apply the widely used framework of controllable Markov chains to describe the dynamics of an unknown environment. We then investigate exploration and learning within such an environment. This is done by making use of information theory and optimal control, building on the idea of optimal experimental design pioneered by Kirstine Smith [Smith, 1918] and further developed in works since [Lindley, 1956, Pfaffelhuber, 1972, MacKay, 1992, Oaksford and Chater, 1994, Storck et al., 1995, Little and Sommer, 2013, Loxley and Cheung, 2023]. More specifically, we seek the form of an optimal policy for an agent to learn the transition probabilities of a controllable Markov chain by exploring an unknown environment using a limited number of exploration steps (i.e., a limited time horizon). The existence of certain types of states that restrict control of the dynamics (including transient states, absorbing states, and non-backtracking states) makes planning essential for exploration to be optimal. Early related work in reinforcement learning includes the model-free
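The greedy information-gain idea can be illustrated with a short sketch. The snippet below assumes a Dirichlet belief over each row of the unknown transition probabilities and scores each admissible control by the expected KL divergence between posterior and prior; the constraint-set function and this particular information-gain measure are simplifying assumptions, not the paper's exact parameterization or algorithm.

```python
# A minimal sketch of one-step greedy exploration by expected information gain,
# assuming a Dirichlet-categorical model of each row of the transition matrix.
# The "admissible" control set and this gain measure are illustrative assumptions.
import numpy as np
from scipy.special import gammaln, digamma

def kl_dirichlet(alpha_new, alpha_old):
    """KL divergence KL( Dir(alpha_new) || Dir(alpha_old) )."""
    a0_new = alpha_new.sum()
    return (gammaln(a0_new) - gammaln(alpha_old.sum())
            - np.sum(gammaln(alpha_new) - gammaln(alpha_old))
            + np.sum((alpha_new - alpha_old) * (digamma(alpha_new) - digamma(a0_new))))

def expected_info_gain(alpha):
    """Expected KL(posterior || prior) from observing one more transition."""
    pred = alpha / alpha.sum()            # posterior predictive over next states
    gain = 0.0
    for j, p in enumerate(pred):
        alpha_new = alpha.copy()
        alpha_new[j] += 1.0               # posterior after observing a transition to j
        gain += p * kl_dirichlet(alpha_new, alpha)
    return gain

def greedy_control(state, alphas, admissible):
    """Pick the admissible control with the largest expected information gain.

    alphas[u][s] is the Dirichlet count vector for row (s, u) of the unknown
    transition probabilities; admissible(state) returns the controls allowed
    at the current step (a set that may change over time during exploration).
    """
    return max(admissible(state), key=lambda u: expected_info_gain(alphas[u][state]))
```

A time-varying admissible(state) plays the role of the changing constraint set: controls leading into known traps, for example, can be excluded at later steps, which is what makes a non-stationary rule natural in this setting.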

Reference

This content is AI-processed based on open access ArXiv data.
