Regulating Reward Training by Means of Certainty Prediction in a Neural Network-Implemented Pong Game


📝 Abstract

We present the first reinforcement-learning model to self-improve its reward-modulated training through a continuously improving "intuition" neural network. An agent was trained to play the arcade video game Pong under two reward-based alternatives: one in which the paddle was placed randomly during training, and a second in which the agent was simultaneously trained on three additional neural networks so that it could develop a sense of "certainty" about how likely its own predicted paddle position is to return the ball. If the agent was less than 95% certain of returning the ball, the policy used an intuition neural network to place the paddle. We trained both architectures for an equivalent number of epochs and tested learning performance by letting the trained programs play against a near-perfect opponent. We found that the reinforcement-learning model that uses an intuition neural network to place the paddle during reward training quickly overtakes the simple architecture in its ability to outplay the near-perfect opponent, and goes on to outscore that opponent by an increasingly wide margin with additional epochs of training.

📄 Content

Regulating Reward Training by Means of Certainty Prediction in a Neural Network-Implemented Pong Game

Matt Oberdorfer, Matt Abuzalaf

Frost Data Labs, 31910 Del Obispo, 92675, California, U.S.A.

matt@frostdatacapital.com, matthew.abuzalaf@yale.edu

Keywords: reward-modulated learning, reinforcement learning, intrinsic plasticity, recurrent neural networks, self-organization

1. Introduction

In artificial intelligence, reinforcement learning is an important field of study as it pertains to the ongoing development of "smart technology." One method by which an agent can regulate its learning is reward modulation, wherein the agent receives rewards for desirable actions and is trained on them to form an optimal policy [1]. An extension of this approach is that the agent can also receive a "punishment" for incorrect actions, training it to avoid repeating the same response in future instances. Over successive iterations, the network develops a predictive insight that should ultimately allow it to determine quickly which actions it must take to earn a reward.
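The reward/punishment scheme described above can be sketched as a minimal preference update. This is an illustrative toy, not the paper's actual update rule: the learning rate and action set are assumed for the example.

```python
# Minimal sketch of reward-modulated learning (illustrative only):
# the agent keeps a preference value per action; rewards strengthen
# the chosen action's preference, punishments weaken it.

def update_preferences(prefs, action, reward, lr=0.1):
    """Move the chosen action's preference toward the received reward."""
    prefs[action] += lr * (reward - prefs[action])
    return prefs

prefs = {"up": 0.0, "stay": 0.0, "down": 0.0}
prefs = update_preferences(prefs, "up", +1.0)    # rewarded action
prefs = update_preferences(prefs, "down", -1.0)  # punished action
print(prefs)  # "up" preference rises, "down" preference falls
```

Repeated over many iterations, updates of this kind bias the agent toward rewarded responses, which is the intuition behind the reward-modulated training the paper builds on.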
We introduce a reward-modulated learning model that uses the inputs and outputs of a prediction neural network to develop an "intuition" policy based on the results of the network's actions. Ultimately, a well-trained agent can use this intuition to make predictions even in a complex environment, where simple pattern recognition may not capture every intricacy.
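The certainty-gated policy described in the abstract can be sketched as follows. The 95% threshold comes from the text; the function names and the stand-in intuition network are hypothetical placeholders, not the authors' implementation.

```python
# Hypothetical sketch of the certainty-gated policy: trust the
# prediction network only when the agent is at least 95% certain
# it will return the ball; otherwise defer to the intuition network.

CERTAINTY_THRESHOLD = 0.95  # threshold stated in the paper

def choose_paddle_position(predicted_pos, certainty, intuition_net):
    """Return the paddle position selected by the gated policy."""
    if certainty >= CERTAINTY_THRESHOLD:
        return predicted_pos
    return intuition_net(predicted_pos)

# Stand-in intuition network that nudges the paddle toward mid-field
# (purely illustrative; assumes a field of height 100):
intuition = lambda pos: 0.5 * (pos + 50)

print(choose_paddle_position(80, 0.97, intuition))  # → 80
print(choose_paddle_position(80, 0.60, intuition))  # → 65.0
```

The design point is that the gate keeps exploration cheap: confident predictions are used directly, while low-confidence states fall back to a policy that itself keeps improving as training proceeds.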
One key to the successful training of any reinforcement-learning model is the ability to make quick, accurate predictions about the actions it must take to be rewarded. If the agent can be trained to make good predictions in the context of its environment, it can then be developed further to accomplish particular goals in that setting. The main obstacle to implementing such an approach, however, is the difficulty of setting up a neural network that learns efficiently and accurately to predict successful behavior in a previously unknown environment.
At the most basic level, success here means choosing network inputs that capture the full scope of the problem and produce a prediction that reflects this complexity. Once this is in place, more features can be added: additional rewards weighted by their relative importance to the problem, switches that allow alternation between training styles, and even additional learning networks that run in parallel with the original.
To test how well our learning model captures and learns a complex problem, the arcade game Pong was selected as a good candidate. It has been shown [2] that arcade games both pose a challenging problem and provide a platform and methodology for evaluating the development of general, domain-independent AI technology.
As one of the earliest arcade games, Pong has an environment that is not particularly complicated: at its core, a ball is passed back and forth between paddles, as in a game of ping-pong (Figure 1). At the same time, the game exhibits a degree of complexity in the number of ways any particular return can play out. The ball can bounce off the top and bottom walls; it can reflect off the paddle at a variety of angles; and if the ball hits the corner of the paddle, the return is made at greater speed. These factors make for a game in which it is not immediately simple to predict where to place the paddle for a successful return, and as such, this compl
