Regulating Reward Training by Means of Certainty Prediction in a Neural Network-Implemented Pong Game
Matt Oberdorfer, Matt Abuzalaf
Frost Data Labs, 31910 Del Obispo, California 92675, U.S.A.
matt@frostdatacapital.com, matthew.abuzalaf@yale.edu
Abstract
We present the first reinforcement-learning model that self-improves its reward-modulated training through a continuously improving “intuition” neural network. An agent was trained to play the arcade video game Pong under two reward-based alternatives: in the first, the paddle was placed randomly during training; in the second, the agent was simultaneously trained on three additional neural networks so that it could develop a sense of “certainty” about how likely its own predicted paddle position is to return the ball. If the agent was less than 95% certain to return the ball, the policy used an intuition neural network to place the paddle. We trained both architectures for an equivalent number of epochs and tested learning performance by letting the trained programs play against a near-perfect opponent. We found that the reinforcement-learning model that uses an intuition neural network to place the paddle during reward training quickly overtakes the simple architecture in its ability to outplay the near-perfect opponent, and goes on to outscore that opponent by an increasingly wide margin with additional epochs of training.
Keywords: reward-modulated learning, reinforcement learning, intrinsic plasticity, recurrent neural networks, self-organization
1. Introduction
In artificial intelligence, reinforcement learning is an important field of study as it pertains to the ongoing development of “smart technology.” One method by which an agent can regulate its learning is reward modulation, wherein the agent is trained as it receives rewards for desirable actions in order to form an optimal policy [1]. An extension of this approach is that the agent can also receive a “punishment” for incorrect actions, training it to avoid repeating the same response in future instances. Over successive iterations, the network develops a predictive insight that should ultimately allow it to quickly determine the actions it needs to take in order to earn a reward.
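The reward/punishment scheme described above can be sketched as follows. This is our own minimal illustration, not the paper's implementation: a linear policy over three paddle actions whose action probabilities are nudged up by a reward of +1 and down by a punishment of -1, via a REINFORCE-style update.

```python
import numpy as np

rng = np.random.default_rng(0)
W = np.zeros((4, 3))  # 4 state features -> 3 paddle actions (up, stay, down)

def probs(state):
    """Softmax action probabilities of the linear policy."""
    logits = state @ W
    e = np.exp(logits - logits.max())
    return e / e.sum()

def update(state, action, reward, lr=0.1):
    """Reward-modulated step: scale the log-probability gradient by the
    reward, so rewarded actions (+1) become more likely and punished
    actions (-1) less likely."""
    global W
    grad = -probs(state)
    grad[action] += 1.0                  # d log p(action) / d logits
    W += lr * reward * np.outer(state, grad)

# Toy loop: for a fixed state, reward one "correct" action, punish the rest.
state = np.array([1.0, 0.0, 0.5, -0.5])
for _ in range(200):
    a = rng.choice(3, p=probs(state))
    r = 1.0 if a == 2 else -1.0          # pretend action 2 is the right move
    update(state, a, r)

print(probs(state))                      # probability mass shifts to action 2
```

Over successive iterations the policy concentrates on the rewarded action, which is the "predictive insight" the passage describes.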
We introduce a reward-modulated learning model that uses the inputs and outputs of a prediction neural network to develop an “intuition” policy based on the results of the network’s actions. Ultimately, a well-trained agent can use this intuition to make predictions even in a complex environment, where simple pattern recognition may not capture all possible intricacies.
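The certainty-gated placement rule from the abstract can be sketched as below. The network objects and their call signatures are hypothetical placeholders (the paper trains three auxiliary networks; here a single certainty estimate stands in for them); only the 95% threshold comes from the text.

```python
# Hypothetical sketch of certainty-gated paddle placement.
CERTAINTY_THRESHOLD = 0.95  # threshold stated in the abstract

def place_paddle(state, prediction_net, certainty_net, intuition_net):
    """Use the predicted paddle position only when the agent is at least
    95% certain it will return the ball; otherwise fall back to the
    intuition network for placement."""
    predicted_y = prediction_net(state)
    certainty = certainty_net(state, predicted_y)  # P(return | placement)
    if certainty >= CERTAINTY_THRESHOLD:
        return predicted_y
    return intuition_net(state)
```

The gate keeps reward training on-policy when the predictor is confident, and delegates to the intuition network exactly when the prediction is unreliable.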
One key to the successful training of any reinforcement-learning model is an ability to make quick and accurate predictions about the actions it needs to take to be rewarded. If the agent can be trained to make a good prediction in the context of its environment, then it can be further developed to accomplish particular goals in that setting. The obstacle to overcome in implementing such an approach, however, tends to be the difficulty of setting up a neural network that can learn, efficiently and accurately, to predict successful behavior in a previously unknown environment.
At the most basic level, success here is constituted by choosing network inputs that capture the fullness of the problem and produce a prediction that reflects this complexity. Once this is accounted for, more features can be added, such as additional rewards weighted by their relative importance to the problem, switches that allow for alternation between training styles, and even additional learning networks that run in parallel with the original.
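One of the extensions mentioned above, importance-weighted rewards, might look like the following sketch. The event names and weights are entirely our own illustration; the paper does not specify them.

```python
# Hypothetical importance weights for per-step reward events.
REWARD_WEIGHTS = {
    "returned_ball": 1.0,   # routine success
    "scored_point": 2.0,    # weighted higher: the actual goal
    "missed_ball": -1.5,    # punishment, weighted by severity
}

def combined_reward(events):
    """Sum the weights of the reward events that fired this step; the
    result modulates the training update as a single scalar."""
    return sum(REWARD_WEIGHTS[e] for e in events)
```

For example, a step that both returns the ball and scores a point would yield a combined reward of 3.0 under these illustrative weights.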
To test how well our learning model can capture and learn a complex problem, we selected the arcade game Pong as a good candidate. It has been shown [2] that arcade games can both represent a challenge problem and provide a platform and methodology for evaluating the development of general, domain-independent AI technology.
As one of the earliest arcade games, the program’s environment is not particularly complicated: at its basis, it involves passing a ball back and forth between paddles, as one would in a game of ping-pong (Figure 1). At the same time, however, the game reflects a degree of complexity in the number of ways any particular return can play out. The ball can bounce off the top and bottom walls. It can reflect off the paddle at a variety of angles. And if the ball hits the corner of the paddle, the return is made at a greater speed. These factors make for a game where it may not immediately be so simple to predict where to place the paddle for a successful return, and as such, this compl
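The three dynamics listed above (wall bounces, angle-dependent paddle reflection, and faster corner returns) can be captured in a toy update rule. All constants here are our own assumptions for illustration; the paper does not give its physics parameters.

```python
def step_ball(x, y, vx, vy, paddle_y, height=1.0, paddle_half=0.1):
    """Advance the ball one step in a toy Pong with the paddle at x = 0.
    Illustrative constants only (angle gain 0.05, corner speed-up 1.5x)."""
    x, y = x + vx, y + vy
    if y <= 0.0 or y >= height:        # bounce off top/bottom walls
        vy = -vy
        y = min(max(y, 0.0), height)
    if x <= 0.0:                        # ball reached the paddle's wall
        offset = (y - paddle_y) / paddle_half   # -1..1 across the paddle
        if abs(offset) <= 1.0:          # hit: reflect at an angle set by
            vx = -vx                    # where the ball met the paddle
            vy += 0.05 * offset
            if abs(offset) > 0.9:       # corner hit: faster return
                vx *= 1.5
                vy *= 1.5
        x = 0.0
    return x, y, vx, vy
```

Even in this stripped-down form, the return angle and speed depend on the exact contact point, which is what makes paddle placement nontrivial to predict.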