📝 Original Info
- Title: An Optimal Policy for Learning Controllable Dynamics by Exploration
- ArXiv ID: 2512.20053
- Date: 2025-12-23
- Authors: Peter N. Loxley (School of Science and Technology, University of New England, Australia)
📝 Abstract
Controllable Markov chains describe the dynamics of sequential decision making tasks and are the central component in optimal control and reinforcement learning. In this work, we give the general form of an optimal policy for learning controllable dynamics in an unknown environment by exploring over a limited time horizon. This policy is simple to implement and efficient to compute, and allows an agent to "learn by exploring" as it maximizes its information gain in a greedy fashion by selecting controls from a constraint set that changes over time during exploration. We give a simple parameterization for the set of controls, and present an algorithm for finding an optimal policy. The reason for this policy is the existence of certain types of states that restrict control of the dynamics, such as transient states, absorbing states, and non-backtracking states. We show why the occurrence of these states makes a non-stationary policy essential for achieving optimal exploration. Six interesting examples of controllable dynamics are treated in detail. Policy optimality is demonstrated using counting arguments, comparing with suboptimal policies, and by making use of a sequential improvement property from dynamic programming.
📄 Full Content
AN OPTIMAL POLICY FOR LEARNING CONTROLLABLE DYNAMICS BY EXPLORATION
A PREPRINT
Peter N. Loxley
School of Science and Technology
University of New England
Australia
ABSTRACT
Controllable Markov chains describe the dynamics of sequential decision making tasks and are the central component
in optimal control and reinforcement learning. In this work, we give the general form of an optimal policy for learning
controllable dynamics in an unknown environment by exploring over a limited time horizon. This policy is simple to
implement and efficient to compute, and allows an agent to “learn by exploring” as it maximizes its information gain
in a greedy fashion by selecting controls from a constraint set that changes over time during exploration. We give a
simple parameterization for the set of controls, and present an algorithm for finding an optimal policy. The reason for
this policy is the existence of certain types of states that restrict control of the dynamics, such as transient states,
absorbing states, and non-backtracking states. We show why the occurrence of these states makes a non-stationary
policy essential for achieving optimal exploration. Six interesting examples of controllable dynamics are treated in
detail. Policy optimality is demonstrated using counting arguments, comparing with suboptimal policies, and by making
use of a sequential improvement property from dynamic programming.
Keywords Exploration · Information theory · Optimal control · Markov chains · Non-stationary policy
1 Introduction
Environments with controllable dynamics are interesting to understand from a fundamental viewpoint, as well as
providing a potential source of useful applications. When an explicit model of the dynamics is not available one
possibility is to learn the dynamics by exploring the environment. This is observed in animal behavior, where an
inquisitive animal will often explore a novel environment by actively seeking information about the environment,
enabling it to better prepare for future events such as avoiding predators [Wood-Gush and Vestergaard, 1991, Pisula
and Modlinska, 2023]. Active learning is concerned with a similar task, which involves selecting what data to gather
next so as to learn as much as possible about an unknown quantity [MacKay, 1992].
In this work, we are interested in the structure of optimal policies for exploring unknown environments to learn
controllable dynamics. This is a difficult problem in general, as illustrated by the following example. Consider
the simple environment in Figure 1 given by a 5 × 5 maze with three trapping states (in Markov chains these are
called absorbing states). It is certainly possible for an agent to explore this environment and learn the structure of
the maze in the process of doing so. The dilemma the agent faces is: where should it explore? Exploring every part of the
environment means the agent will eventually encounter a trapping state and become trapped, rendering it unable to
continue exploring and most likely leaving it with incomplete knowledge of the maze. Other types of states can also
interfere with exploration, as we shall see. It is therefore desirable for an agent to avoid certain states at certain times
during exploration. It is also desirable for an agent to visit locations that have not yet been explored in order to
learn new information. Ideally, a policy for exploring would aim to strike an optimal balance between these somewhat
opposing objectives.
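To make this dilemma concrete, the following minimal sketch (not taken from the paper) simulates a uniformly random explorer on a 5 × 5 grid with three absorbing cells; the trap locations, horizon, and move set are illustrative assumptions, and maze walls are omitted for brevity. Averaged over runs, an unplanned explorer typically covers only part of the environment before being absorbed.

```python
# Minimal sketch (illustrative, not from the paper): a 5x5 grid with three
# absorbing "trapping" cells, explored by a uniformly random policy.
# Trap locations and horizon are assumed; maze walls are omitted for brevity.
import random

SIZE, HORIZON = 5, 50
TRAPS = {(1, 1), (2, 3), (4, 0)}             # assumed trap positions
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def random_exploration(seed=0):
    """Return the set of cells visited before the agent is trapped."""
    rng = random.Random(seed)
    state, visited = (0, 0), {(0, 0)}
    for _ in range(HORIZON):
        if state in TRAPS:                   # absorbing state: exploration stops
            break
        dr, dc = rng.choice(MOVES)
        r = min(max(state[0] + dr, 0), SIZE - 1)
        c = min(max(state[1] + dc, 0), SIZE - 1)
        state = (r, c)
        visited.add(state)
    return visited

# A random explorer usually sees only part of the maze before absorption.
coverage = [len(random_exploration(s)) for s in range(100)]
print(sum(coverage) / len(coverage), "cells visited on average out of", SIZE * SIZE)
```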
Figure 1: A simple environment given by a 5 × 5 maze with three trapping states (red squares). An agent can explore
this environment to learn the maze but will likely encounter a trapping state and become trapped along the way. When
this happens the agent can no longer continue to explore.

To address this challenge we apply the widely used framework of controllable Markov chains to describe the dynamics
of an unknown environment. We then investigate exploration and learning within such an environment. This is done by
making use of information theory and optimal control, building on the idea of optimal experimental design pioneered
by Kirstine Smith [Smith, 1918], and further developed in works since [Lindley, 1956, Pfaffelhuber, 1972, MacKay,
1992, Oaksford and Chater, 1994, Storck et al., 1995, Little and Sommer, 2013, Loxley and Cheung, 2023]. More
specifically, we seek the form of an optimal policy for an agent to learn the transition probabilities of a controllable
Markov chain by exploring an unknown environment using a limited number of exploration steps (i.e., a limited time
horizon). The existence of certain types of states that restrict control of the dynamics (including transient states,
absorbing states, and non-backtracking states) makes planning essential for exploration to be optimal.
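As a rough illustration of the kind of greedy step described in the abstract, the sketch below scores each admissible control by the expected information gain of the next observed transition under an assumed Dirichlet model of the transition probabilities. This is a generic Bayesian sketch, not the paper's construction: the function names, the Dirichlet modelling choice, and the handling of the constraint set are all assumptions.

```python
# Illustrative sketch (not the paper's algorithm): greedy information-gain
# control selection with one Dirichlet posterior per (state, control) pair.
import numpy as np
from scipy.special import digamma

def expected_info_gain(counts):
    """Mutual information between the next observed successor state and the
    unknown transition distribution under a Dirichlet(counts) posterior."""
    alpha = np.asarray(counts, dtype=float)
    total = alpha.sum()
    p = alpha / total                        # posterior-predictive distribution
    predictive_entropy = -np.sum(p * np.log(p))
    expected_conditional_entropy = digamma(total + 1.0) - np.sum(p * digamma(alpha + 1.0))
    return predictive_entropy - expected_conditional_entropy

def greedy_control(state, counts, admissible):
    """Pick the admissible control with the largest expected information gain.
    counts[state][u] holds Dirichlet counts over successor states, and
    admissible is the (time-dependent) constraint set of allowed controls."""
    return max(admissible, key=lambda u: expected_info_gain(counts[state][u]))

# Example: in state 0, control 1 has been tried less often, so probing it
# is expected to be more informative than control 0.
counts = {0: {0: np.array([5.0, 5.0, 1.0]), 1: np.array([1.0, 1.0, 1.0])}}
print(greedy_control(0, counts, admissible=[0, 1]))   # -> 1
```

In the paper, the constraint set of admissible controls changes over time as certain states must be avoided at certain times, which is what makes the resulting policy non-stationary; the sketch above shows only the greedy scoring step.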
Early related work in reinforcement learning includes the model-free
This content is AI-processed based on open access ArXiv data.