Learning to Win by Reading Manuals in a Monte-Carlo Framework

Domain knowledge is crucial for effective performance in autonomous control systems. Typically, human effort is required to encode this knowledge into a control algorithm. In this paper, we present an approach to language grounding which automatically interprets text in the context of a complex control application, such as a game, and uses domain knowledge extracted from the text to improve control performance. Both text analysis and control strategies are learned jointly using only a feedback signal inherent to the application. To effectively leverage textual information, our method automatically extracts the text segment most relevant to the current game state, and labels it with a task-centric predicate structure. This labeled text is then used to bias an action selection policy for the game, guiding it towards promising regions of the action space. We encode our model for text analysis and game playing in a multi-layer neural network, representing linguistic decisions via latent variables in the hidden layers, and game action quality via the output layer. Operating within the Monte-Carlo Search framework, we estimate model parameters using feedback from simulated games. We apply our approach to the complex strategy game Civilization II using the official game manual as the text guide. Our results show that a linguistically-informed game-playing agent significantly outperforms its language-unaware counterpart, yielding a 34% absolute improvement and winning over 65% of games when playing against the built-in AI of Civilization.


💡 Research Summary

The paper tackles the long‑standing problem of how to inject domain knowledge into autonomous control systems without hand‑crafted rules or large labeled datasets. The authors propose a unified framework that simultaneously learns to interpret natural‑language manuals and to select actions in a complex control environment, using only the intrinsic feedback signal (win/loss) provided by the environment.

Core Idea
Given a game (or any sequential decision‑making task) and its official manual, the system first identifies the segment of text most relevant to the current game state. This segment is then assigned a task‑centric predicate label (e.g., “resource acquisition”, “combat preparation”, “technology research”). Both the relevance‑selection and the predicate labeling are treated as latent variables inside a deep neural network. The network’s output layer estimates the quality of each possible action. By embedding the textual information into the action‑value function, the Monte‑Carlo Tree Search (MCTS) algorithm is biased toward regions of the action space that the manual suggests are promising.
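The predicate-driven bias can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the predicate names follow the examples above, but the action names, the `ALIGNED_ACTIONS` mapping, and the flat `bonus` value are assumptions made for the sketch (the paper learns the bias as a latent vector rather than a fixed lookup).

```python
# Illustrative sketch: a text-derived predicate raises the prior value
# of the actions it suggests. The action names, alignment table, and
# bonus magnitude are hypothetical, chosen only for illustration.

ALIGNED_ACTIONS = {
    "resource_acquisition": {"build_city", "build_worker"},
    "combat_preparation": {"build_soldier", "fortify"},
    "technology_research": {"research"},
}

def bias_action_values(base_values, predicate, bonus=0.5):
    """Add a bonus to actions aligned with the active predicate."""
    aligned = ALIGNED_ACTIONS.get(predicate, set())
    return {action: value + (bonus if action in aligned else 0.0)
            for action, value in base_values.items()}
```

With the predicate "resource_acquisition" active, `build_city` would be preferred over `research` even if its raw value estimate were lower, which is exactly the kind of steering the manual is meant to provide.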

Model Architecture
The architecture consists of three main components:

  1. State‑Text Relevance Module – Takes the current game state vector and the full manual (encoded as a sequence of word embeddings) and computes a relevance score for each sentence/paragraph. The highest‑scoring segment is selected for the current decision step.

  2. Predicate‑Labeling Module – Maps the selected text segment to one of a small set of predefined predicates. This mapping is not supervised; instead, the predicate is represented by a latent vector that influences downstream action evaluation.

  3. Action‑Value Module – Implements a standard MCTS rollout policy, but the leaf‑node value estimates are adjusted by a bias term derived from the predicate latent vector. The bias effectively raises the prior probability of actions that align with the predicate (e.g., building a city when the predicate is “resource acquisition”).
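The relevance module (component 1) can be sketched as a simple scoring loop. The dot-product scorer below is a simplifying assumption: in the paper, the scoring function is part of the network and is learned jointly with the other modules, rather than fixed in advance.

```python
# Sketch of the state-text relevance module: score every manual
# sentence against the current state vector and return the best match.
# The dot-product scorer is an illustrative stand-in for the learned
# scoring function described in the paper.

def dot(u, v):
    """Inner product of two equal-length embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

def most_relevant(state_vec, sentence_vecs):
    """Return the index of the highest-scoring manual sentence."""
    return max(range(len(sentence_vecs)),
               key=lambda i: dot(state_vec, sentence_vecs[i]))
```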

All three modules share parameters and are trained jointly. The only learning signal is the final game outcome (win/loss or score difference). Using a policy‑gradient‑like objective, the log‑likelihood of the selected actions is weighted by the observed reward, and gradients are back‑propagated through the relevance and predicate modules. This end‑to‑end training allows the system to discover which textual cues are truly useful for improving performance.
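A reward-weighted log-likelihood update of this kind reduces, in its simplest form, to a REINFORCE-style rule. The sketch below assumes a softmax policy over a small discrete action set with per-action preferences `theta`; the learning rate and the tabular parameterization are illustrative choices, not the paper's multi-layer network.

```python
import math

# Minimal REINFORCE-style sketch of a reward-weighted log-likelihood
# update. The tabular softmax policy and learning rate are assumptions
# for illustration; the paper backpropagates through a neural network.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_update(theta, episode, reward, baseline, lr=0.1):
    """theta: per-action preferences; episode: chosen action indices.

    The gradient of log softmax(theta)[a] w.r.t. theta[k] is
    (1 if k == a else 0) - p[k], weighted here by the advantage
    (reward minus baseline) for variance reduction.
    """
    advantage = reward - baseline
    for a in episode:
        p = softmax(theta)
        for k in range(len(theta)):
            grad = (1.0 if k == a else 0.0) - p[k]
            theta[k] += lr * advantage * grad
    return theta
```

A win (positive advantage) raises the preference for the actions taken during the episode; a loss lowers it, which is how the sparse outcome signal shapes both the action policy and, through shared parameters, the latent language decisions.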

Training Procedure
Training proceeds in episodes. For each episode, the agent plays a full game of Civilization II against the built‑in AI. After the game ends, the reward (1 for win, 0 for loss) is used to update the network parameters via stochastic gradient ascent. Because the reward is sparse, the authors rely on Monte‑Carlo estimates of action values, combined with a UCT‑style selection rule, to provide intermediate guidance, and they use variance reduction techniques (baseline subtraction) to stabilize learning.
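For reference, the standard UCT selection rule balances each action's estimated value against an exploration term. This is a generic sketch of UCT, not the paper's code; the exploration constant `C` is an illustrative choice.

```python
import math

# Generic UCT action selection: pick the child maximizing
#   Q(a) + C * sqrt(ln(N_total) / N(a)),
# trying unvisited actions first. The constant C is illustrative.

def uct_select(children, C=1.4):
    """children: list of (mean_value, visit_count); returns best index."""
    total = sum(n for _, n in children)

    def score(child):
        q, n = child
        if n == 0:
            return float("inf")  # unvisited actions are tried first
        return q + C * math.sqrt(math.log(total) / n)

    return max(range(len(children)), key=lambda i: score(children[i]))
```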

Experimental Setup
The authors evaluate their method on Civilization II, a turn‑based strategy game with a large decision space (city placement, technology research, unit movement, diplomacy, etc.). The official game manual (≈200 pages) is used as the sole textual source; no external annotations or preprocessing beyond tokenization are applied. Baselines include:

  • MCTS‑Only – Standard Monte‑Carlo Tree Search without any textual bias.
  • Random‑Text Bias – The same architecture but with randomly selected manual segments, to test whether the bias itself helps.

Each agent plays 500+ games against the built‑in AI at the same difficulty level. Performance metrics are win rate, average score difference, and the frequency of early‑game strategic decisions (e.g., choosing to settle a city vs. exploring).

Results
The text‑informed agent achieves a 34 percentage‑point absolute improvement in win rate over the MCTS‑Only baseline (from ~31% to ~65%). It also shows a higher average score and selects more “optimal” early‑game actions, indicating that the manual’s guidance is being correctly interpreted. The Random‑Text Bias variant yields only marginal gains, confirming that the learned relevance‑selection and predicate labeling are essential.

Analysis and Insights

  • Joint Learning Works – By training the language and control components together, the system discovers which sentences actually correlate with successful strategies, without any human‑provided supervision.
  • Sparse Reward is Sufficient – Even with only win/loss feedback, the Monte‑Carlo framework provides enough signal to shape the latent language representations.
  • Bias Improves Exploration Efficiency – The predicate‑driven bias narrows the search to promising branches, reducing the number of rollouts needed to achieve high performance.

Limitations

  • The relevance module relies on a simple similarity metric; more sophisticated context‑aware encoders (e.g., Transformers) could capture deeper linguistic nuances.
  • Predicate categories are predefined and limited; extending to open‑ended semantics would require a more flexible labeling scheme.
  • Experiments are confined to a single game; generalization to other domains remains to be demonstrated.

Future Directions

  • Integrate state‑of‑the‑art language models (BERT, GPT) to improve text understanding and enable zero‑shot transfer to new manuals.
  • Incorporate multimodal inputs (map images, unit icons) to enrich the state representation.
  • Explore collaborative learning with human players, where the agent can ask clarification questions about ambiguous manual passages.

Conclusion
The paper presents a novel, end‑to‑end method for grounding natural‑language manuals in a Monte‑Carlo search framework. By automatically extracting relevant text, assigning task‑centric predicates, and biasing action selection accordingly, the system dramatically improves performance in the complex strategy game Civilization II, achieving a 34% absolute win‑rate boost and winning more than two‑thirds of games against the built‑in AI. This work demonstrates that rich, unstructured textual resources can be turned into actionable domain knowledge for reinforcement‑learning agents without any hand‑crafted annotations, opening a promising avenue for leveraging manuals, documentation, and other textual corpora in autonomous decision‑making.