Dynamic Generalisation of Continuous Action Spaces in Reinforcement Learning: A Neurally Inspired Approach
Date: 07/2002
Author: Smith, Andrew James
Abstract
This thesis concerns the dynamic generalisation of continuous action spaces in
reinforcement learning problems.
The standard Reinforcement Learning (RL) account provides a principled and comprehensive
means of optimising a scalar reward signal in a Markov Decision Process.
However, the theory itself does not directly address the pressing issue of generalisation,
which arises naturally as a consequence of large or continuous state and action
spaces. A current thrust of research aims to fuse the generalisation capabilities
of supervised (and unsupervised) learning techniques with RL theory. An example
par excellence is Tesauro’s TD-Gammon.
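For concreteness, the quantity being optimised in this standard account is the expected discounted return. One common formulation (the notation here is assumed, not drawn from the thesis) is:

    V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \,\middle|\, s_0 = s \right], \qquad 0 \le \gamma < 1

where r_{t+1} is the scalar reward signal and \gamma the discount factor; the generalisation problem arises when V (or the corresponding action-value function) must be approximated over continuous states and actions rather than stored in a table.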
Although much effort has gone into researching ways to represent and generalise over
the input space, much less attention has been paid to the action space. This thesis
first considers the motivation for learning real-valued actions, and then proposes a
set of key properties desirable in any candidate algorithm addressing generalisation
of both input and action spaces. These properties include: provision of adaptive and
online generalisation, adherence to the standard theory with a central focus on estimating
expected reward, provision for real-valued states and actions, and full support
for a real-valued discounted reward signal. Of particular interest are issues pertaining
to robustness in non-stationary environments, scalability, and efficiency for real-time
learning in applications such as robotics. Since exploring the action space is discovered
to be a potentially costly process, the system should also be flexible enough to
enable maximum reuse of learned actions.
A new approach is proposed that, for the first time, addresses all of the
key issues identified. The algorithm, which is based on the ubiquitous self-organising
map, is analysed and compared with other techniques including those based on the
backpropagation algorithm. The investigation uncovers some important implications
of the differences between these two approaches with respect to RL. In particular,
the distributed representation of the multi-layer perceptron is judged to be
something of a double-edged sword: it offers more sophisticated and more scalable
generalising power, but can cause problems in dynamic or non-equiprobable
environments and in tasks involving a highly varying input-output mapping.
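To make the self-organising-map side of this comparison concrete, the following minimal Python sketch shows one standard way a SOM could represent a continuous action space, with each unit holding a real-valued action vector and a scalar value estimate. It is an illustrative sketch under assumed hyperparameters (class name, learning rate, neighbourhood width, and the reward-averaging rule are all hypothetical), not the algorithm developed in the thesis.

    import numpy as np

    class ActionSOM:
        """Minimal self-organising map over a continuous action space.
        Units sit on a 1-D lattice; each stores an action vector and a
        running estimate of the reward that action earns. Illustrative
        sketch only; all hyperparameters are assumptions."""

        def __init__(self, n_units=10, action_dim=2, lr=0.1, sigma=1.5, seed=0):
            rng = np.random.default_rng(seed)
            self.actions = rng.uniform(-1.0, 1.0, size=(n_units, action_dim))
            self.values = np.zeros(n_units)   # per-unit reward estimates
            self.lr, self.sigma, self.rng = lr, sigma, rng

        def select(self, explore_std=0.2):
            """Pick the highest-valued unit's action, perturbed for exploration."""
            winner = int(np.argmax(self.values))
            noise = self.rng.normal(0.0, explore_std, self.actions.shape[1])
            return self.actions[winner] + noise

        def update(self, action, reward):
            """Pull the winning unit and its lattice neighbours towards the
            tried action (Gaussian neighbourhood); update the value estimate."""
            winner = int(np.argmin(np.linalg.norm(self.actions - action, axis=1)))
            dist = np.abs(np.arange(len(self.actions)) - winner)  # lattice distance
            h = np.exp(-(dist ** 2) / (2 * self.sigma ** 2))      # neighbourhood weights
            self.actions += self.lr * h[:, None] * (action - self.actions)
            self.values[winner] += self.lr * (reward - self.values[winner])

    # Hypothetical usage in a bandit-style loop (toy reward favours the origin):
    # som = ActionSOM()
    # for _ in range(1000):
    #     a = som.select()
    #     som.update(a, reward=-float(np.sum(a ** 2)))

Because the map's units are local, each stores an explicit action that can be reused directly, and the lattice topology provides the dynamic, online generalisation over the action space that the identified properties call for.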
The thesis concludes that the self-organising map can be used in conjunction with current
RL theory to provide real-time dynamic representation and generalisation of continuous
action spaces. The proposed model is shown to be reliable in non-stationary,
unpredictable and noisy environments, and is judged unique in satisfying
a number of desirable properties identified as important for a large class of RL
problems.