Learning RoboCup-Keepaway with Kernels

We apply kernel-based methods to solve the difficult reinforcement learning problem of 3vs2 keepaway in RoboCup simulated soccer. Key challenges in keepaway are the high-dimensionality of the state space (rendering conventional discretization-based function approximation like tilecoding infeasible), the stochasticity due to noise and multiple learning agents needing to cooperate (meaning that the exact dynamics of the environment are unknown) and real-time learning (meaning that an efficient online implementation is required). We employ the general framework of approximate policy iteration with least-squares-based policy evaluation. As underlying function approximator we consider the family of regularization networks with subset of regressors approximation. The core of our proposed solution is an efficient recursive implementation with automatic supervised selection of relevant basis functions. Simulation results indicate that the behavior learned through our approach clearly outperforms the best results obtained earlier with tilecoding by Stone et al. (2005).

💡 Research Summary

The paper tackles the challenging reinforcement‑learning task of 3‑vs‑2 keepaway in the RoboCup simulated soccer environment by introducing a kernel‑based function approximator within an approximate policy‑iteration (API) framework. Keepaway presents three major difficulties: a high‑dimensional continuous state space (13 variables describing player and ball positions and velocities), stochastic dynamics caused by sensor noise and the interaction of multiple learning agents, and the need for real‑time learning where each new observation must be incorporated immediately. Traditional discretization methods such as tile‑coding become infeasible because the required number of tiles grows exponentially with dimensionality, leading to prohibitive memory and computational demands.

To address these issues, the authors employ regularization networks—kernel ridge‑regression models that implicitly map inputs into an infinite‑dimensional feature space—combined with a Subset‑of‑Regressors (SoR) approximation. SoR selects a small representative subset of basis points from the growing data set, thereby reducing the kernel matrix from O(M²) to O(m²) where m ≪ M. This yields a compact representation that can be updated online.

The learning algorithm consists of two intertwined components. First, Least‑Squares Policy Evaluation (LSPE) is used to approximate the action‑value function Qπ for the current policy π. The LSPE update is performed recursively: when a new transition (s, a, r, s′) arrives, the algorithm updates the weight vector using the Sherman‑Morrison formula (or an equivalent recursive least‑squares scheme) without revisiting past data. Second, a simple ε‑greedy improvement step generates a new policy π′ by selecting the greedy action with probability 1‑ε and a random action otherwise. The API loop repeats these evaluation‑improvement cycles until convergence.

A central novelty lies in the supervised, automatic selection of new basis functions. For each incoming sample the algorithm computes the prediction error of the current model and the contribution of the candidate kernel centered at the new state. If the error exceeds a predefined threshold or if the candidate adds sufficient linear independence to the existing basis set, the sample is admitted as a new basis point; otherwise it is discarded. This mechanism controls model complexity, prevents over‑fitting, and ensures that the number of bases grows only as needed. In the experiments the final basis set comprised roughly 5 % of all observed samples, dramatically reducing memory usage while preserving expressive power.

Empirical evaluation was conducted on the standard RoboCup 3‑vs‑2 keepaway benchmark. The agents were trained for 5,000 episodes, and performance was measured by the average time the ball remained in possession and the success rate of maintaining control. The kernel‑based approach consistently outperformed the best previously reported tile‑coding method (Stone et al., 2005) by 15–20 % in both metrics. Notably, learning curves showed rapid early improvement, indicating that the recursive LSPE combined with adaptive basis selection quickly captures the essential dynamics of the environment.

The paper contributes three key insights to the reinforcement‑learning literature. (1) Kernel methods can be made computationally tractable for online, high‑dimensional control problems through SoR and recursive updates. (2) Automatic basis selection based on supervised error criteria provides a principled way to balance model capacity against computational constraints in real‑time settings. (3) In a multi‑agent, stochastic domain like RoboCup keepaway, kernel‑based API yields superior policies compared to traditional discretization, demonstrating the practical viability of non‑parametric function approximation for robotic decision making.

Future directions suggested by the authors include exploring alternative kernels (e.g., Gaussian, polynomial), integrating more sophisticated exploration strategies (such as Upper‑Confidence Bound or Thompson Sampling), and transferring the methodology from simulation to physical robot platforms where latency and sensor noise are even more pronounced. Overall, the work establishes a solid foundation for applying kernel‑based reinforcement learning to complex, real‑time robotic tasks.