Latent Perspective-Taking via a Schrödinger Bridge in Influence-Augmented Local Models

Operating in environments alongside humans requires robots to make decisions under uncertainty. In addition to exogenous dynamics, they must reason over others’ hidden mental-models and mental-states. While Interactive POMDPs and Bayesian Theory of Mind formulations are principled, exact nested-belief inference is intractable, and hand-specified models are brittle in open-world settings. We address both by learning structured mental-models and an estimator of others’ mental-states. Building on the Influence-Based Abstraction, we instantiate an Influence-Augmented Local Model to decompose socially-aware robot tasks into local dynamics, social influences, and exogenous factors. We propose (a) a neuro-symbolic world model instantiating a factored, discrete Dynamic Bayesian Network, and (b) a perspective-shift operator modeled as an amortized Schrödinger Bridge over the learned local dynamics that transports factored egocentric beliefs into other-centric beliefs. We show that this architecture enables agents to synthesize socially-aware policies in model-based reinforcement learning, via decision-time mental-state planning (a Schrödinger Bridge in belief space), with preliminary results in a MiniGrid social navigation task.


💡 Research Summary

The paper tackles the fundamental challenge of enabling robots to operate alongside humans by reasoning not only about uncertain environmental dynamics but also about the hidden mental models and mental states of other agents. Classical frameworks such as Interactive POMDPs (I‑POMDPs) and Bayesian Theory of Mind (BToM) provide a principled formulation, yet exact nested‑belief inference quickly becomes intractable (PSPACE/EXPSPACE) and hand‑crafted symbolic models break down in open‑world, stochastic settings. To overcome these limitations, the authors propose a data‑driven architecture that (i) learns structured, factored mental‑models and (ii) provides a tractable belief‑to‑belief transformation for perspective‑taking.

The core of the approach is an Influence‑Augmented Local Model (IALM), derived from the Influence‑Based Abstraction (IBA). IALM factorizes the global multi‑agent state into a local state X, a set of influence sources U, and non‑local states Y. By extracting a minimal d‑separating set Dₜ from the local action‑state history, the model can marginalize the influence distribution I(uₜ|Dₜ) and define an augmented local transition (\bar T) that incorporates external effects without enumerating the full global space. This yields a compact two‑stage Dynamic Bayesian Network (2‑DBN) that can be solved locally.
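The marginalization step above can be sketched in a few lines. This is a tabular illustration, not the paper's learned model: the state, action, influence, and d-set sizes are arbitrary assumptions, and the distributions are random stand-ins.

```python
import numpy as np

# Hypothetical sizes for illustration (not from the paper).
N_X, N_A, N_U, N_D = 4, 2, 3, 5

rng = np.random.default_rng(0)

def normalize(p, axis=-1):
    return p / p.sum(axis=axis, keepdims=True)

# Local transition T(x' | x, a, u): shape (X, A, U, X')
T = normalize(rng.random((N_X, N_A, N_U, N_X)))
# Influence distribution I(u | D): shape (D, U)
I = normalize(rng.random((N_D, N_U)))

def augmented_transition(x, a, d):
    """T-bar(x' | x, a, d) = sum_u I(u | d) * T(x' | x, a, u)."""
    return I[d] @ T[x, a]          # (U,) @ (U, X') -> (X',)

p = augmented_transition(x=0, a=1, d=2)
```

Because the augmented transition is a convex combination of valid conditional distributions, `p` is itself a valid distribution over next local states, and the non-local states Y never need to be enumerated.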

To learn the components of the IALM, the authors build a neuro‑symbolic world model. Raw observations are encoded via discrete variational autoencoders (dVAEs) with Gumbel‑Softmax into a set of categorical factors (C, F). These are mapped to N factored latent state variables each with K categories, forming the factored state xₜ. The transition model (\dot T) is “influence‑naïve” and uses α‑entmax cross‑attention to induce data‑driven sparsity, effectively learning parent‑child dependencies among latent factors. Observation and reward models are similarly factorized, preserving the explicit DBN structure while benefiting from end‑to‑end deep learning.
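The two mechanisms named above — Gumbel-Softmax relaxation for the categorical factors and α-entmax for sparse attention — can be sketched in NumPy. This is an illustrative sketch, not the paper's architecture: `sparsemax` is the closed-form α = 2 case of α-entmax, and all sizes and logits are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

def gumbel_softmax(logits, tau=1.0):
    """Relaxed categorical sample; smaller tau pushes closer to one-hot."""
    g = -np.log(-np.log(rng.random(logits.shape)))   # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = np.exp(y - y.max(axis=-1, keepdims=True))
    return y / y.sum(axis=-1, keepdims=True)

def sparsemax(z):
    """alpha-entmax at alpha=2: like softmax, but can assign exact zeros."""
    z_sorted = np.sort(z)[::-1]
    cssv = np.cumsum(z_sorted) - 1.0
    k = np.arange(1, z.size + 1)
    support = z_sorted - cssv / k > 0
    rho = k[support][-1]
    tau = cssv[support][-1] / rho
    return np.maximum(z - tau, 0.0)

# N factored latent variables, each with K categories (sizes illustrative).
N, K = 3, 4
logits = rng.normal(size=(N, K))
x_soft = gumbel_softmax(logits, tau=0.5)     # relaxed one-hot factors
x_hard = np.eye(K)[x_soft.argmax(-1)]        # straight-through discretization

attn = sparsemax(np.array([2.0, 1.0, -1.0, -2.0]))  # sparse attention weights
```

The exact zeros produced by `sparsemax` are what allow the attention pattern to be read off as a parent set for each latent factor, preserving the DBN interpretation.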

Perspective‑taking is cast as a belief‑transport problem solved by a Schrödinger Bridge (SB). An SB seeks a stochastic process whose path distribution is closest (in KL divergence) to a reference dynamics (\dot T) while matching prescribed start and end marginal distributions—here the robot’s egocentric belief b₀ and the desired other‑centric belief bₙ. The bridge is realized via a Doob h‑transform of (\dot T) with forward and backward potentials ϕₜ, ψₜ. The authors amortize these potentials using a bidirectional GRU (BiGRU) equipped with α‑entmax attention over the action‑local‑state history hₜ. The attention mechanism simultaneously computes the minimal d‑set Dₜ₊₁, ensuring that the bridge respects the influence‑augmented structure.
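In the discrete, tabular case the same object can be computed exactly rather than amortized: Sinkhorn scaling of the n-step reference kernel solves the static bridge, and a Doob h-transform turns the reference dynamics into per-step bridge kernels. The sketch below is a non-amortized stand-in for the paper's BiGRU potentials; the state count, horizon, and distributions are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
S, n = 5, 4                              # illustrative state count and horizon

def normalize(p, axis=-1):
    return p / p.sum(axis=axis, keepdims=True)

T = normalize(rng.random((S, S)))        # reference dynamics, rows: x -> x'
b0 = normalize(rng.random(S))            # egocentric belief (start marginal)
bn = normalize(rng.random(S))            # other-centric belief (end marginal)

# Static Schroedinger bridge: Sinkhorn scaling of the n-step reference
# kernel so the induced coupling has marginals b0 and bn.
Kn = np.linalg.matrix_power(T, n)
u, v = np.ones(S), np.ones(S)
for _ in range(500):
    u = b0 / (Kn @ v)
    v = bn / (Kn.T @ u)

# Doob h-transform: backward potentials h_t tilt the reference dynamics
# into bridge kernels that interpolate between the two beliefs.
h = [None] * (n + 1)
h[n] = v
for t in range(n - 1, -1, -1):
    h[t] = T @ h[t + 1]

def bridge_kernel(t):
    """T_h(x'|x) = T(x'|x) * h_{t+1}(x') / h_t(x); rows stay normalized."""
    return T * h[t + 1][None, :] / h[t][:, None]

b = b0.copy()
for t in range(n):                       # pushing b0 through the bridge
    b = b @ bridge_kernel(t)             # ...recovers bn at the horizon
```

Among all processes matching the endpoint beliefs, this tilted process is the one closest in KL divergence to the reference dynamics, which is exactly the property the amortized potentials are trained to reproduce.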

Training proceeds by sampling “epistemic counterfactuals” – synthetic other‑centric beliefs obtained from single‑agent episodes – and minimizing the KL loss between the bridge‑induced trajectory distribution and the reference dynamics, subject to the endpoint constraints. The overall loss combines reconstruction terms for the world model (L̄O), transition learning (L̇T), and the SB potentials (LΦ). Stabilization techniques such as τ‑annealing for Gumbel‑Softmax, knowledge‑distillation confidence gating, and socially‑weighted counterfactual objectives are employed.
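Two of the training ingredients above are easy to make concrete. The sketch below shows a plausible exponential τ-annealing schedule and the weighted combination of the three loss terms; the constants, weights, and function names are assumptions for illustration, not values reported in the paper.

```python
import math

def tau_schedule(step, tau0=1.0, tau_min=0.1, rate=1e-4):
    """Exponential temperature annealing for Gumbel-Softmax
    (illustrative constants; the paper's schedule may differ)."""
    return max(tau_min, tau0 * math.exp(-rate * step))

def total_loss(l_obs, l_trans, l_phi, w_obs=1.0, w_trans=1.0, w_phi=1.0):
    """Weighted sum of the reconstruction (L-bar_O), transition (L-dot_T),
    and SB-potential (L_Phi) objectives; the weights are assumptions."""
    return w_obs * l_obs + w_trans * l_trans + w_phi * l_phi
```

Annealing τ toward a floor keeps gradients usable early in training while letting the latent factors harden toward discrete one-hots later.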

Empirical evaluation uses a partially observable MiniGrid “person‑following” task. The robot must follow a fixed‑policy human avatar to an unknown target while respecting social norms: staying within view, avoiding collisions, and receiving a large terminal reward for reaching the goal. Policies are learned with a DQN that conditions on both the robot’s belief b₀ₜ and the estimated other‑centric belief (\hat b_{i,t}). Baselines include (i) perfect‑information (using the true belief of the other), (ii) no‑information (uniform belief), and (iii) a multi‑step rollout using only the learned local dynamics (\dot T). Results show that the SB‑based perspective‑shift learns significantly faster and achieves higher cumulative returns than all baselines. The advantage is especially pronounced early in training, where context‑aware mental‑state planning enables the robot to anticipate the human’s future actions more accurately than context‑free estimates.
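The belief-conditioned policy input can be sketched as follows. This is a minimal stand-in, not the paper's DQN: the featurization (flat concatenation), the linear Q-function, and all sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
N_ACTIONS, K = 4, 5                       # illustrative sizes

def belief_features(b_robot, b_other_hat):
    """Concatenate the robot's own belief with the bridge-estimated
    other-centric belief (assumed featurization)."""
    return np.concatenate([b_robot, b_other_hat])

W = rng.normal(size=(N_ACTIONS, 2 * K))   # stand-in for DQN weights

def act(b_robot, b_other_hat, eps=0.1):
    """Epsilon-greedy action selection over the joint belief features."""
    if rng.random() < eps:
        return int(rng.integers(N_ACTIONS))
    q = W @ belief_features(b_robot, b_other_hat)
    return int(q.argmax())
```

Swapping the second half of the feature vector between the true, uniform, and rollout-based other-centric beliefs is what distinguishes the baselines from the SB-based agent.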

The paper situates its contributions relative to prior work on meta‑learned agent embeddings, latent world‑model generation of other‑centric observations, and opponent modeling under partial observability. Unlike these, the current work explicitly treats perspective‑taking as optimal transport in belief space, leverages a learned IALM for structured dynamics, and employs an amortized Schrödinger Bridge for real‑time belief transformation. Limitations include the assumption of homogeneous perception and action models across agents; the authors propose extending the framework with heterogeneous epistemic counterfactuals. Additionally, the current discrete latent space may need to be integrated with continuous sensory modalities for broader applicability.

In summary, the authors present a unified framework that (1) decomposes socially‑aware tasks via influence‑augmented local models, (2) learns factored neuro‑symbolic mental‑models, and (3) enables decision‑time mental‑state planning through an amortized Schrödinger Bridge. This combination bridges the gap between symbolic interpretability and probabilistic scalability, offering a practical pathway for socially intelligent robots to perform perspective‑taking and collaborative decision‑making in uncertain, multi‑agent environments.
