Automatable Evaluation Method Oriented toward Behaviour Believability for Video Games


Classic evaluation methods for believable agents are time-consuming because they involve many humans judging the agents. They are well suited to validating work on new believable-behaviour models. However, during implementation, numerous experiments can help improve agents' believability. We propose a method that aims at assessing how much an agent's behaviour resembles human behaviour. By representing behaviours as vectors, we can store data computed for humans and then evaluate as many agents as needed without further human involvement. We present a test experiment showing that even a simple evaluation following our method can reveal differences between fairly believable agents and humans. The method seems promising, although, as our experiment shows, analysing the results can be difficult.


💡 Research Summary

The paper addresses a long‑standing bottleneck in the evaluation of believable agents for video games: traditional methods rely heavily on human judges who must watch, rate, or otherwise assess the agents’ behavior. While such approaches are valuable for validating new models, they are impractical for the iterative development cycle where dozens or hundreds of experiments may be needed to fine‑tune an AI. To overcome this, the authors propose an automated evaluation framework that quantifies how “human‑like” an agent’s behavior is by representing both human and agent actions as vectors in a common feature space.

The methodology proceeds in three stages. First, a corpus of human gameplay is collected, and raw logs (positions, timestamps, actions such as attacks, item use, evasion, etc.) are parsed. Each observable action is mapped to a numerical feature; for example, the time spent in a particular zone, the frequency of a specific weapon’s use, or the number of dodges per combat encounter. By concatenating these features, a fixed‑dimensional vector is produced for each human trial. These vectors are stored in a database and serve as the reference “human behavior profile.”
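The feature-extraction stage might be sketched as follows. Note that the paper does not publish its exact feature set, so the event names (`attack`, `dodge`, `use_item`, `fight_start`), the zone label, and the four-dimensional schema below are illustrative assumptions only.

```python
from collections import Counter

def extract_vector(log, duration_s):
    """Map a raw gameplay log onto a fixed-dimensional behaviour vector.

    `log` is a list of events, each a dict with keys "action" (str),
    "dt" (seconds attributed to the event) and "zone" (str).
    The four features below are hypothetical, not the paper's actual set.
    """
    counts = Counter(evt["action"] for evt in log)
    fights = max(1, counts["fight_start"])          # avoid division by zero
    return [
        sum(e["dt"] for e in log if e.get("zone") == "A") / duration_s,  # share of time in zone A
        counts["attack"] * 60.0 / duration_s,       # attacks per minute
        counts["dodge"] / fights,                   # dodges per combat encounter
        counts["use_item"] * 60.0 / duration_s,     # item uses per minute
    ]
```

Because every trial, human or agent, passes through the same function, all resulting vectors are directly comparable.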

Second, any agent under test is run in the same game scenarios, and its log data are processed through the identical feature‑extraction pipeline, yielding an agent vector that lives in the same space as the human vectors.

Third, similarity is measured using distance metrics such as Euclidean distance, cosine similarity, or Mahalanobis distance. A smaller distance indicates a higher degree of believability. The authors also suggest examining per‑dimension contributions to understand which behavioral aspects drive the similarity or disparity.
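A minimal sketch of this distance stage, including the per-dimension contribution analysis the authors suggest. Euclidean distance and cosine similarity are implemented directly; Mahalanobis distance, which additionally requires the inverse covariance matrix of the human sample, is omitted for brevity.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two behaviour vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine similarity; 1.0 means identical direction in feature space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def per_dim_contribution(u, v):
    """Share of the squared Euclidean distance carried by each dimension,
    useful for asking which behavioural aspect drives the disparity."""
    sq = [(a - b) ** 2 for a, b in zip(u, v)]
    total = sum(sq) or 1.0   # identical vectors: report zero everywhere
    return [s / total for s in sq]
```

In practice the features would typically be normalised (e.g. z-scored against the human sample) before computing distances, so that no single dimension dominates purely because of its scale.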

To validate the approach, two agents are evaluated: a “high‑believability” model taken from prior literature that incorporates sophisticated decision‑making and situational awareness, and a baseline rule‑based agent that follows a simple scripted sequence. Both agents are run 30 times each in the same scenarios that were also played by 30 human participants. All resulting vectors are compared against the human reference set. The high‑believability agent achieves an average distance of roughly 0.42 to the human vectors, whereas the baseline agent’s average distance is about 0.78. Statistical testing (t‑tests, p < 0.01) confirms that the difference is significant, and analysis of individual dimensions reveals that the high‑believability agent more closely matches human patterns in movement trajectories and combat timing.
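The significance test above amounts to a standard two-sample comparison of the distance distributions. The sketch below implements Welch's t statistic from scratch; a full analysis would rather call `scipy.stats.ttest_ind(xs, ys, equal_var=False)` to also obtain the p-value. The sample values in the test are made up for illustration, not the study's data.

```python
import math

def welch_t(xs, ys):
    """Welch's two-sample t statistic (does not assume equal variances)."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)  # sample variance of xs
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)  # sample variance of ys
    return (mx - my) / math.sqrt(vx / nx + vy / ny)
```

A strongly negative t value here would indicate that the first agent's distances to the human reference set are systematically smaller, i.e. it is the more believable of the two.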

The paper does not shy away from limitations. The choice of features and the dimensionality of the vectors heavily influence the outcome; too many dimensions can lead to sparsity and the “curse of dimensionality,” while too few may omit critical nuances. Building a comprehensive human behavior database also requires a substantial upfront data‑collection effort, especially if one wishes to capture a wide variety of player skill levels and styles. Moreover, distance‑based metrics provide a scalar similarity score but do not explain why a particular agent diverges from human behavior, necessitating complementary qualitative analyses.

In conclusion, the authors demonstrate that a vector‑based, automated similarity measure can reliably differentiate between agents that are merely functional and those that genuinely emulate human play styles. This enables rapid, repeatable testing during development, reducing reliance on costly human studies. Future work is suggested in three directions: (1) applying dimensionality‑reduction techniques (PCA, t‑SNE) or learned embeddings to manage high‑dimensional feature spaces, (2) integrating machine‑learning models (e.g., Siamese networks) to learn more nuanced similarity functions, and (3) developing interpretability tools that map distance contributions back to concrete in‑game actions. Such extensions could broaden the impact of the method beyond games, to any domain where human‑like behavior is a design goal, such as robotics, virtual training environments, and interactive storytelling.
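As a rough illustration of direction (1), the sketch below keeps only the highest-variance dimensions of the behaviour vectors. This is a crude, dependency-free stand-in for PCA, and purely an assumption about how one might start: it mitigates sparsity by dropping near-constant features, but unlike PCA it does not decorrelate the remaining ones.

```python
def top_variance_dims(vectors, k):
    """Return the indices of the k highest-variance dimensions of a
    sample of behaviour vectors, plus the vectors reduced to them."""
    n, dims = len(vectors), len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dims)]
    variances = [sum((v[d] - means[d]) ** 2 for v in vectors) / n
                 for d in range(dims)]
    keep = sorted(range(dims), key=lambda d: variances[d], reverse=True)[:k]
    keep.sort()  # preserve the original dimension order
    return keep, [[v[d] for d in keep] for v in vectors]
```

A real implementation would more likely use `sklearn.decomposition.PCA` or a learned embedding, as the authors propose, but the idea is the same: compare behaviours in a compact space where distances remain meaningful.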

