Move Evaluation in Go Using Deep Convolutional Neural Networks
The game of Go is more challenging than other board games, due to the difficulty of constructing a position or move evaluation function. In this paper we investigate whether deep convolutional networks can be used to directly represent and learn this knowledge. We train a large 12-layer convolutional neural network by supervised learning from a database of human professional games. The network correctly predicts the expert move in 55% of positions, equalling the accuracy of a 6 dan human player. When the trained convolutional network was used directly to play games of Go, without any search, it beat the traditional search program GnuGo in 97% of games, and matched the performance of a state-of-the-art Monte-Carlo tree search that simulates a million positions per move.
💡 Research Summary
The paper investigates whether deep convolutional neural networks (CNNs) can directly learn a strong move‑evaluation function for the game of Go, a domain traditionally considered too complex for handcrafted evaluation due to its enormous state space (≈10^170 positions) and sharp tactical non‑linearities. Using a massive dataset of 29.4 million (board state, next move) pairs extracted from 160,000 professional games on the KGS server, the authors encode each 19 × 19 board into 17 binary feature planes that capture raw game‑rule information: stone colour, liberties, liberties after the move, legality, turns since each move was played, capture size, ladder status, and a one‑hot encoding of the player’s rank (1–9 dan). Data augmentation is performed by randomly applying one of the eight board symmetries (rotations and reflections) during minibatch construction.
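The symmetry augmentation is mechanical enough to sketch directly. The helpers below (names are illustrative, not from the paper) generate the dihedral group of eight board transforms and apply one at random to every feature plane of a training sample, which is what per-minibatch augmentation amounts to:

```python
import random

SIZE = 19  # Go board is 19 x 19

def rotate90(plane):
    # Rotate a SIZE x SIZE plane 90 degrees clockwise.
    return [[plane[SIZE - 1 - c][r] for c in range(SIZE)] for r in range(SIZE)]

def reflect(plane):
    # Mirror the plane left-to-right.
    return [row[::-1] for row in plane]

def dihedral_transforms(plane):
    # All eight board symmetries: four rotations, each with and
    # without a reflection.
    out, p = [], plane
    for _ in range(4):
        out.append(p)
        out.append(reflect(p))
        p = rotate90(p)
    return out

def augment(planes, rng=random):
    # Pick one symmetry at random and apply the SAME transform to every
    # feature plane of the sample (the planes must stay aligned; the
    # move target would be transformed identically).
    k = rng.randrange(8)
    return [dihedral_transforms(p)[k] for p in planes]
```

Applying the same index `k` to all planes is the important detail: transforming planes independently would destroy the spatial correspondence between features.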
The core model is a 12‑layer deep CNN. The first hidden layer uses 5 × 5 filters; the remaining eleven layers use 3 × 3 filters, all with stride 1 and zero‑padding to preserve the 19 × 19 spatial resolution. Each layer contains between 64 and 192 filters, yielding about 2.3 million trainable parameters and roughly 630 million connections. Rectified linear units (ReLU) provide non‑linearity, and position‑dependent biases are added to each convolution to allow the network to learn location‑specific preferences. The output consists of two 19 × 19 softmax maps—one for black to move, one for white—so the network directly produces a probability distribution over all legal moves.
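The quoted parameter count can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes a uniform 128 filters per hidden layer and the 17 input planes cited above (the paper varies filter counts between 64 and 192, so this is only a rough consistency check, not the exact architecture), with position-dependent biases at every layer and a final convolution producing the two 19 × 19 output maps:

```python
SIZE = 19        # board width and height
IN_PLANES = 17   # input feature planes (figure from the summary above)
FILTERS = 128    # assumed uniform filter count per hidden layer
LAYERS = 12

def conv_weights(k, in_ch, out_ch):
    # Weights of a k x k convolution; biases are counted separately
    # below because the network uses position-dependent biases.
    return k * k * in_ch * out_ch

total = conv_weights(5, IN_PLANES, FILTERS)                # layer 1: 5x5
total += (LAYERS - 1) * conv_weights(3, FILTERS, FILTERS)  # layers 2-12: 3x3
total += LAYERS * SIZE * SIZE * FILTERS                    # per-position biases
total += conv_weights(3, FILTERS, 2) + 2 * SIZE * SIZE     # output: two 19x19 maps

print(total)  # prints 2233938, i.e. roughly 2.2 million
```

The result lands close to the "about 2.3 million" quoted above, which suggests the per-position biases are a substantial fraction (~25%) of the parameters.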
Training employs asynchronous stochastic gradient descent (ASGD) across 50 GPU workers, each using a fixed learning rate of 0.128 (a per‑example base rate scaled linearly by the batch size of 128) for 25 epochs, followed by a fine‑tuning phase on a single GPU for three epochs with a halved learning‑rate schedule. No momentum or weight decay is used. The model achieves 55% top‑1 move‑prediction accuracy on a held‑out test set of 2 million positions, essentially matching the performance of a 6‑dan human (≈52% ± 5%). A top‑n analysis shows that the correct move lies within the top 10 predictions 94% of the time, indicating a highly informative policy.
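Two of these ingredients can be illustrated in a toy form. The linear scaling rule recovers the quoted rate (0.001 per example × batch size 128 = 0.128; the per-example base rate is inferred from those two numbers, not stated here). The ASGD part is sketched as lock-free "hogwild"-style workers updating one shared parameter on a trivial quadratic loss; this is a cartoon of the training setup, not the paper's system:

```python
import threading

def scaled_learning_rate(base_rate_per_example, batch_size):
    # Linear scaling rule: summed minibatch gradients grow with batch
    # size, so the step size is scaled to match.
    return base_rate_per_example * batch_size

# Toy asynchronous SGD: four workers update a shared parameter without
# locks, each minimizing f(w) = (w - 3)^2, whose minimum is at w = 3.
w = [0.0]

def worker(steps, lr):
    for _ in range(steps):
        grad = 2.0 * (w[0] - 3.0)
        w[0] -= lr * grad  # lock-free update; occasional staleness is tolerated

threads = [threading.Thread(target=worker, args=(200, 0.01)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Despite the races, the contraction toward the minimum dominates, which is the intuition behind running ASGD without synchronization.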
To assess playing strength, the network is used in a “pure policy” mode: given a board, it plays the legal move with the highest predicted probability, with no search at all. Against GnuGo 3.8 at its strongest level, the 12‑layer CNN wins 97% of games, a dramatic improvement over shallow baselines (3‑layer networks win ≈3%). Depth and filter count correlate strongly with both prediction accuracy and win rate: a 3‑layer, 16‑filter model yields 37.5% accuracy and a 3.4% win rate, while the 12‑layer, 128‑filter model reaches 55.2% accuracy and a 97.2% win rate. Adding the rank one‑hot planes allows the same network to emulate different player strengths: when conditioned to act as a 1‑dan, 5‑dan, or 9‑dan player, it wins 49%, 60%, and 68% of games respectively against a fixed 10‑layer CNN.
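Pure-policy play reduces to a one-line argmax over legal moves. A minimal sketch (the function name and move encoding as `(row, col)` tuples are assumptions for illustration):

```python
def select_move(policy, legal_moves):
    # policy: mapping from move (row, col) to the network's predicted
    # probability; legal_moves: set of moves legal in this position.
    # Pure-policy play: no search, just the best legal move.
    return max((m for m in policy if m in legal_moves), key=lambda m: policy[m])

# The top-rated move may be illegal (e.g. suicide or ko); pure-policy
# play then falls through to the best legal alternative.
policy = {(3, 3): 0.40, (15, 3): 0.35, (3, 15): 0.15, (9, 9): 0.10}
legal = {(15, 3), (3, 15), (9, 9)}  # suppose (3, 3) is illegal here
print(select_move(policy, legal))   # prints (15, 3)
```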
Weight‑symmetry experiments (forcing filters to be invariant under the eight board symmetries) improve shallow networks modestly but have no effect on the deep 12‑layer model, suggesting that depth alone captures the necessary invariances. The authors also integrate the CNN with Monte‑Carlo Tree Search (MCTS) using a “delayed prior” scheme: the network is evaluated asynchronously on a GPU, and its policy vector is injected into the search once available. With 100,000 rollouts per move, the combined system defeats the raw CNN 87% of the time and matches the strength of state‑of‑the‑art programs such as MoGo (100k rollouts) and Pachi (10k rollouts, ≈2 million simulated positions). Against other strong programs (Fuego and Pachi at 100k rollouts), the CNN wins 10–23% of games, demonstrating competitive performance without any handcrafted search heuristics.
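The delayed-prior mechanism can be sketched with a minimal search node: children start with a uniform prior, which is overwritten in place once the asynchronous GPU evaluation arrives, and subsequent selections are biased by it. The `Node` class, the PUCT-style selection formula, and the exploration constant below are illustrative assumptions, not the paper's actual mixing rule:

```python
import math

class Node:
    # Minimal tree node for a prior-guided search. The prior is uniform
    # until the asynchronous CNN result for this position is delivered.
    def __init__(self, moves):
        self.moves = list(moves)
        self.prior = {m: 1.0 / len(self.moves) for m in self.moves}
        self.visits = {m: 0 for m in self.moves}
        self.value = {m: 0.0 for m in self.moves}
        self.total = 0

    def inject_prior(self, policy):
        # Called mid-search, whenever the GPU evaluation becomes
        # available; renormalize over this node's legal moves.
        s = sum(policy.get(m, 0.0) for m in self.moves) or 1.0
        self.prior = {m: policy.get(m, 0.0) / s for m in self.moves}

    def select(self, c_puct=1.0):
        # PUCT-style rule: exploit the mean rollout value, explore in
        # proportion to the (possibly updated) prior.
        def score(m):
            q = self.value[m] / self.visits[m] if self.visits[m] else 0.0
            u = c_puct * self.prior[m] * math.sqrt(self.total + 1) / (1 + self.visits[m])
            return q + u
        return max(self.moves, key=score)

    def update(self, move, outcome):
        # Back up one rollout result (outcome in [0, 1]).
        self.visits[move] += 1
        self.value[move] += outcome
        self.total += 1
```

Because `inject_prior` mutates the node rather than rebuilding the tree, rollouts completed before the GPU result is available are not wasted, which is the point of the asynchronous scheme.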
The paper’s contributions are threefold: (1) it shows that a sufficiently deep and wide CNN can learn a high‑quality move‑selection policy directly from expert data, achieving human‑level prediction accuracy; (2) it demonstrates that this policy alone, without any search, can defeat traditional search‑based programs and rival sophisticated MCTS systems; and (3) it provides a practical method for integrating deep policies with MCTS, paving the way for hybrid systems that combine the pattern‑recognition power of deep learning with the exhaustive exploration of tree search. The authors suggest future work on scaling the network further, incorporating additional tactical features, and employing reinforcement learning from self‑play to move beyond the under‑fitting regime observed in their experiments.