Location-Based Reasoning about Complex Multi-Agent Behavior

Recent research has shown that surprisingly rich models of human activity can be learned from GPS (positional) data. However, most effort to date has concentrated on modeling single individuals or statistical properties of groups of people. Moreover, prior work focused solely on modeling actual successful executions (and not failed or attempted executions) of the activities of interest. We, in contrast, take on the task of understanding human interactions, attempted interactions, and intentions from noisy sensor data in a fully relational multi-agent setting. We use a real-world game of capture the flag to illustrate our approach in a well-defined domain that involves many distinct cooperative and competitive joint activities. We model the domain using Markov logic, a statistical-relational language, and learn a theory that jointly denoises the data and infers occurrences of high-level activities, such as a player capturing an enemy. Our unified model combines constraints imposed by the geometry of the game area, the motion model of the players, and the rules and dynamics of the game in a probabilistically and logically sound fashion. We show that while it may be impossible to directly detect a multi-agent activity due to sensor noise or malfunction, the occurrence of the activity can still be inferred by considering both its impact on the future behaviors of the people involved and the events that could have preceded it. Further, we show that given a model of successfully performed multi-agent activities, along with a set of examples of failed attempts at the same activities, our system automatically learns an augmented model that recognizes success and failure, as well as the goals of people's actions, with high accuracy.
We compare our approach with alternative methods and show that our unified model, which takes into account not only relationships among individual players but also relationships among activities over the entire length of a game, is significantly more accurate, although more computationally costly. Finally, we demonstrate that explicitly modeling unsuccessful attempts boosts performance on other important recognition tasks.


💡 Research Summary

This paper tackles the problem of recognizing complex multi‑agent activities and the intentions behind them from noisy GPS traces. While prior work on location‑based activity recognition has largely focused on single individuals or on statistical properties of groups, and has only modeled successful executions, the authors aim to infer both successful and failed attempts in a fully relational setting. To demonstrate their approach they use a real‑world game of Capture the Flag (CTF), a domain that naturally contains a rich set of cooperative and competitive joint actions such as flag captures, opponent tagging, and defensive maneuvers.

The core of the method is a Markov Logic Network (MLN), a statistical‑relational model that combines first‑order logic with probabilistic weights. The authors encode three families of constraints: (1) geometric constraints that enforce the physical layout of the arena (players cannot be outside the field or pass through walls); (2) a motion model that limits the distance a player can travel between successive timestamps, thereby smoothing out GPS jitter; and (3) game‑specific rules that capture the dynamics of CTF (e.g., a player who possesses the flag must return to their base to score, opponents in the same cell can be captured, and captured players are immobilized for a short period). Each rule is associated with a weight that is learned from data.
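
The scoring idea behind such weighted rules can be illustrated with a minimal sketch. The rules, weights, and toy world below are invented for illustration (they are not the paper's actual theory): each rule is a (weight, test) pair, and a candidate world's unnormalized log-probability is the sum of the weights of the rule groundings it satisfies.

```python
import math

def score(world, rules):
    """Sum the weights of all rule groundings satisfied by `world`."""
    return sum(w for w, test in rules if test(world))

# Toy world: one player observed at two successive timestamps.
world = {
    "pos": {("p1", 0): (1.0, 1.0), ("p1", 1): (1.4, 1.2)},
    "on_field": True,
}

MAX_SPEED = 1.0  # illustrative motion-model bound (units per timestep)

def motion_ok(w):
    """Motion-model constraint: displacement per timestep is bounded."""
    (x0, y0), (x1, y1) = w["pos"][("p1", 0)], w["pos"][("p1", 1)]
    return math.hypot(x1 - x0, y1 - y0) <= MAX_SPEED

rules = [
    (5.0, lambda w: w["on_field"]),  # geometric constraint
    (2.0, motion_ok),                # motion-model constraint
]

print(score(world, rules))  # → 7.0 (both rules satisfied)
```

Worlds that violate a constraint simply lose that rule's weight, so hard game rules get large weights while soft tendencies get small ones.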

Training proceeds on a dataset that contains both labeled instances of successful actions (e.g., a successful capture) and labeled instances of failed attempts (e.g., an attempted capture that did not succeed). The players' true locations, observed only through the noisy GPS traces, are treated as latent variables. An EM‑style algorithm alternates between inferring the hidden activity labels using MCMC sampling and updating the rule weights by maximizing pseudo‑likelihood. By explicitly modeling failure cases, the system learns patterns such as “attempt → failure → retry,” which are crucial for distinguishing intent from outcome.
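
The alternation can be sketched with a deliberately simplified hard-EM loop. This toy uses a one-weight logistic model and a single gradient pass in place of the paper's MCMC sampling and MLN pseudo-likelihood; the data and learning rate are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Each example: (feature, label); label is None when hidden.
data = [(2.0, 1), (-2.0, 0), (1.5, None), (-1.5, None)]

w = 0.0
for _ in range(50):
    # E-step: impute hidden labels using the current model.
    filled = [(x, y if y is not None else int(sigmoid(w * x) > 0.5))
              for x, y in data]
    # M-step: one gradient-ascent pass on the (pseudo-)log-likelihood.
    for x, y in filled:
        w += 0.1 * (y - sigmoid(w * x)) * x

# Hidden labels inferred after training.
imputed = [int(sigmoid(w * x) > 0.5) for x, y in data if y is None]
print(w > 0, imputed)
```

The same two-phase structure scales up: the E-step becomes MCMC inference over hidden activity atoms, and the M-step becomes a pseudo-likelihood weight update over all rules.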

Inference is performed over the entire duration of a game, not just locally at each time step. A MAP estimate is obtained via Gibbs sampling, allowing the model to consider how a hypothesized activity would influence future player trajectories as well as how prior events constrain the current observation. This global view enables the system to recover activities even when the immediate sensor data are corrupted or missing.
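The benefit of global inference can be seen in a small Gibbs-sampling sketch. The chain, potentials, and dropout below are assumptions for illustration (not the paper's model): hidden binary states are tied by a temporal-coherence potential and noisy observations, the middle observation is missing, and the sampler still recovers it from context on both sides.

```python
import math
import random

random.seed(0)

obs = [1, 1, None, 1, 1]   # None = sensor dropout at the middle timestep
TRANS, EMIT = 2.0, 1.5     # illustrative log-potential weights

def local_logit(states, i, obs):
    """Log-odds of states[i] = 1 given its temporal neighbors and observation."""
    z = 0.0
    if i > 0:
        z += TRANS * (1 if states[i - 1] == 1 else -1)
    if i < len(states) - 1:
        z += TRANS * (1 if states[i + 1] == 1 else -1)
    if obs[i] is not None:
        z += EMIT * (1 if obs[i] == 1 else -1)
    return z

states = [0] * len(obs)
counts = [0] * len(obs)
burn, iters = 200, 1000
for it in range(burn + iters):
    for i in range(len(states)):          # one Gibbs sweep
        p1 = 1.0 / (1.0 + math.exp(-local_logit(states, i, obs)))
        states[i] = 1 if random.random() < p1 else 0
    if it >= burn:
        for i, s in enumerate(states):
            counts[i] += s

# Marginal-threshold estimate of the most likely state sequence.
map_states = [int(c > iters / 2) for c in counts]
print(map_states)
```

Because each state is resampled conditioned on both its past and its future, the corrupted middle timestep inherits the activity implied by its surroundings, which is exactly the effect the global model exploits.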

The authors compare their unified MLN model against three baselines: (1) independent Hidden Markov Models for each player, (2) a rule‑based filter that applies only geometric constraints, and (3) a time‑slice MLN that ignores temporal dependencies. Evaluation metrics include accuracy, precision, recall, and F1‑score for activity detection and for distinguishing success from failure. The proposed model achieves over 92 % accuracy in activity recognition, a 10–15 % absolute improvement over the baselines, and an F1‑score of 0.86 for success/failure discrimination. Moreover, incorporating failed attempts improves related tasks such as identifying the current flag holder and inferring team strategies by roughly 7 % on average.
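For reference, the reported metrics derive from standard counts of true positives, false positives, and false negatives; the counts below are made up purely to illustrate the computation.

```python
def f1(tp, fp, fn):
    """F1-score: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts chosen so precision = recall = 0.86.
print(round(f1(tp=86, fp=14, fn=14), 2))  # → 0.86
```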

Key contributions of the work are: (i) a probabilistic‑logical framework that jointly models spatial, kinematic, and domain‑specific constraints for multi‑agent behavior; (ii) a learning procedure that simultaneously captures successful and failed executions, thereby enabling inference of both intent and outcome; and (iii) empirical evidence that a global, temporally aware model outperforms locally focused alternatives, even though it incurs higher computational cost.

The paper concludes by discussing limitations and future directions. Real‑time inference remains challenging due to the computational demands of MCMC sampling, and extending the approach to incorporate additional sensor modalities such as video or audio would increase robustness. The authors suggest developing efficient approximate inference algorithms, exploring richer social interactions (e.g., negotiation, coordinated planning), and applying the framework to other domains where multi‑agent intent inference from noisy positional data is critical.