Tournament selection in zeroth-level classifier systems based on average reward reinforcement learning


📝 Abstract

As a genetics-based machine learning technique, the zeroth-level classifier system (ZCS) is based on a discounted reward reinforcement learning algorithm, the bucket-brigade algorithm, which optimizes the discounted total reward received by an agent but is not suitable for all multi-step problems, especially large ones. Undiscounted reinforcement learning methods, such as R-learning, are available that optimize the average reward per time step instead. In this paper, R-learning replaces the discounted reward reinforcement learning approach employed by ZCS, and tournament selection replaces roulette wheel selection in ZCS. The modification results in classifier systems that can support long action chains, and are thus able to solve large multi-step problems.

📄 Content

Zang Zhaoxiang, Li Zhao, Wang Junying, Dan Zhiping zxzang@gmail.com; zangzx@hust.edu.cn (Hubei Key Laboratory of Intelligent Vision Based Monitoring for Hydroelectric Engineering, China Three Gorges University, Yichang Hubei, 443002, China; College of Computer and Information Technology, China Three Gorges University, Yichang Hubei, 443002, China)

Key words: average reward; reinforcement learning; R-learning; learning classifier systems (LCS); zeroth-level classifier system (ZCS); multi-step problems

1 Introduction

Learning Classifier Systems (LCSs) are rule-based adaptive systems that use a Genetic Algorithm (GA) together with machine learning methods to facilitate rule discovery and rule learning[1]. LCSs are competitive with other techniques on classification tasks, data mining[2, 3] and robot control applications[4, 5]. In general, an LCS is a model of an intelligent agent interacting with an environment. Its ability to choose the best policy for acting in the environment, i.e. its adaptability, improves with experience. The source of this improvement is learning from the reinforcement, i.e. payoff, provided by the environment. The aim of an LCS is to maximize the environmental payoff it receives. To do this, an LCS tries to evolve a population of compact and maximally general "condition-action-payoff" rules, called classifiers, which tell the system in each state (identified by the condition) the payoff expected for each available action. LCSs can therefore be seen as a special form of reinforcement learning that takes a different approach to generalization.
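The "condition-action-payoff" classifiers described above can be sketched minimally as follows. The ternary condition alphabet ('0', '1', '#' as a wildcard) is the standard encoding for binary-coded LCSs; the class and field names here are illustrative, not the paper's:

```python
# Minimal sketch of a ZCS-style classifier: a ternary condition,
# an action, and a scalar strength (the payoff estimate that ZCS
# uses as fitness). Names are illustrative, not from the paper.
class Classifier:
    def __init__(self, condition, action, strength):
        self.condition = condition  # e.g. "1#0": '#' matches either bit
        self.action = action
        self.strength = strength

    def matches(self, state):
        """True if every non-wildcard bit of the condition equals the state bit."""
        return all(c == '#' or c == s for c, s in zip(self.condition, state))

cl = Classifier("1#0", action=1, strength=20.0)
print(cl.matches("110"))  # True: the '#' covers the middle bit
print(cl.matches("010"))  # False: the first bit differs
```

In ZCS, all classifiers whose condition matches the current state form the match set, and strength drives both action selection and the GA.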
The original Learning Classifier System framework proposed by Holland is now referred to as the traditional framework. Later, Wilson proposed the strength-based Zeroth-level Classifier System (ZCS)[6] and the accuracy-based X Classifier System (XCS)[7]. XCS solved what was formerly the main shortcoming of LCSs, the problem of strong over-general rules, through its accuracy-based fitness approach. Bull and Hurst[8] have recently shown that, despite its relative simplicity, ZCS is able to perform optimally through its use of fitness sharing. That is, with appropriate parameters, ZCS was shown to perform as well as the more complex XCS on a number of tasks. Although current research has focused on the use of accuracy in rule predictions as the fitness measure, the present work departs from this popular approach and takes a step backward, aiming to uncover the potential of strength-based LCSs (and particularly ZCS) on sequential decision problems. In this direction, we discuss the use of average reward in ZCS and introduce an undiscounted reinforcement learning technique called R-learning[9, 10] so that ZCS optimizes the average reward, a metric different from the discounted reward optimized by the original ZCS. In particular, we apply R-learning-based ZCS to large multi-step problems and compare it with ZCS. Experimental results are encouraging, in that ZCS with R-learning performs optimally or near-optimally on these problems. In the following, we refer to our proposal as "ZCSAR", where "AR" stands for "average reward". The rest of the paper is structured as follows: Section 2 provides the necessary background on reinforcement learning, including Sarsa and R-learning. Section 3 gives a brief description of ZCS and maze environments.
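The tournament selection that the abstract proposes in place of roulette wheel selection can be sketched as follows. This follows the common fixed-fraction formulation (a fraction tau of the set competes and the strongest wins); tau and all names are illustrative rather than the paper's notation:

```python
import random

# Hedged sketch of tournament selection over classifier strengths;
# tau is the tournament-size fraction, all names are illustrative.
def tournament_select(population, strengths, tau=0.4, rng=random):
    """Return the strongest classifier among a random fraction tau of the population."""
    size = max(1, int(tau * len(population)))
    contestants = rng.sample(range(len(population)), size)
    return population[max(contestants, key=lambda i: strengths[i])]

pop = ["cl_a", "cl_b", "cl_c", "cl_d", "cl_e"]
strengths = [5.0, 1.0, 9.0, 3.0, 7.0]
# With tau=1.0 every classifier competes, so the strongest must win:
print(tournament_select(pop, strengths, tau=1.0))  # cl_c
```

Unlike roulette wheel selection, whose outcome depends on the absolute scale of the strengths, a tournament depends only on their ranking, and shrinking tau trades selection pressure for diversity.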
How ZCS can be modified to incorporate average reward reinforcement learning is described in Section 4, while Section 5 analyzes the difficulties introduced by our modification and presents a solution to them. Experiments with our proposal and related discussion are given in Section 6. Finally, Section 7 concludes the paper with our main conclusions and some directions for future research.

2 Reinforcement learning

Reinforcement learning is a formal framework in which an agent manipulates its environment through a series of actions and receives rewards as feedback to its actions, but is not t
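As a preview of the R-learning that Section 2 introduces, the standard tabular update (after Schwartz) can be sketched as follows: relative action values R and an average-reward estimate rho are adjusted each step, with rho updated only when a greedy action was taken. Parameter names alpha and beta and the dictionary layout are ours, not the paper's:

```python
# Tabular R-learning sketch (Schwartz's average-reward method):
#   R(s,a) <- R(s,a) + alpha * [r - rho + max_a' R(s',a') - R(s,a)]
#   rho    <- rho    + beta  * [r - rho + max_a' R(s',a') - max_a R(s,a)]
# where the rho update fires only if the chosen action was greedy.
from collections import defaultdict

def r_learning_step(R, rho, s, a, r, s_next, alpha=0.1, beta=0.01):
    """One R-learning update; mutates R in place and returns the new rho."""
    best_next = max(R[s_next].values()) if R[s_next] else 0.0
    best_here = max(R[s].values()) if R[s] else 0.0
    was_greedy = R[s].get(a, 0.0) >= best_here
    R[s][a] = R[s].get(a, 0.0) + alpha * (r - rho + best_next - R[s].get(a, 0.0))
    if was_greedy:
        rho += beta * (r - rho + best_next - best_here)
    return rho

R = defaultdict(dict)   # R[state][action] -> relative value
rho = 0.0               # average reward per time step
rho = r_learning_step(R, rho, s=0, a=1, r=1.0, s_next=1)
print(round(R[0][1], 3), round(rho, 3))  # 0.1 0.01
```

Because rho estimates the reward per time step rather than a discounted sum, the values R stay bounded regardless of chain length, which is the property the paper exploits for long action chains.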
