A Dataset for StarCraft AI & an Example of Armies Clustering

This paper advocates the exploration of the full state of recorded real-time strategy (RTS) games, by human or robotic players, to discover how to reason about tactics and strategy. We present a dataset of StarCraft games encompassing the most of the games’ state (not only player’s orders). We explain one of the possible usages of this dataset by clustering armies on their compositions. This reduction of armies compositions to mixtures of Gaussian allow for strategic reasoning at the level of the components. We evaluated this clustering method by predicting the outcomes of battles based on armies compositions’ mixtures components

💡 Research Summary

The paper introduces a comprehensive dataset of StarCraft real‑time strategy games that captures the full game state at every frame, rather than only player actions or final outcomes. Using the game engine’s API, the authors recorded unit positions, health, resources, building states, vision, and mini‑map information for over 10,000 matches involving both human players and bots. The data are stored in a machine‑learning‑friendly format (JSON/Parquet) and cover roughly 30 minutes of gameplay per match at 24 fps, providing a rich substrate for a variety of analytical tasks.

To demonstrate a concrete use case, the authors focus on clustering army compositions. Each army at a given time step is represented as a high‑dimensional vector of unit counts (one dimension per unit type). Instead of deterministic clustering such as K‑means, they apply a Gaussian Mixture Model (GMM) and fit it with the Expectation‑Maximization algorithm. The number of mixture components K is selected automatically via the Bayesian Information Criterion, yielding an optimal range of 13–16 components for their data. Each component corresponds to an interpretable tactical archetype—e.g., “air‑dominant”, “mechanized infantry”, “fast‑moving raiders”, or “defensive tanks”.

For battle outcome prediction, the mixture posterior probabilities of the two opposing armies are concatenated into a 2K‑dimensional feature vector. This representation is fed into classifiers such as logistic regression and Gradient Boosting Trees. Compared with baseline features that use raw unit counts or simple aggregates, the GMM‑based model improves prediction accuracy by 8–12 percentage points, especially in engagements where armies contain mixed tactical elements. The authors also argue that the GMM parameters themselves (means and covariances) serve as meta‑data describing typical unit ratios and variability, enabling longitudinal studies of balance patches, strategic evolution, or policy‑guided AI that selects the most suitable mixture component given the current game context.

All dataset files, preprocessing scripts, and clustering code are released publicly, encouraging the community to explore further applications such as reinforcement‑learning curricula, opponent modeling, or macro‑strategic analysis. In sum, the work validates that full‑state RTS data combined with probabilistic army clustering provides a powerful abstraction layer for strategic reasoning and can substantially enhance predictive and decision‑making capabilities in StarCraft AI research.