Nested Training for Mutual Adaptation in Human-AI Teaming


Mutual adaptation is a central challenge in human–AI teaming, as humans naturally adjust their strategies in response to a robot’s policy. Existing approaches aim to improve the diversity of training partners to approximate human behavior, but these partners are static and fail to capture the adaptive behavior of humans. Exposing robots to adaptive behaviors is critical, yet when both agents learn simultaneously in a multi-agent setting, they often converge to opaque implicit coordination strategies that work only with the agents they were co-trained with; such agents fail to generalize when paired with new partners. To capture the adaptive behavior of humans, we model the human–robot teaming scenario as an Interactive Partially Observable Markov Decision Process (I-POMDP), explicitly modeling human adaptation as part of the state. We propose a nested training regime that approximately learns the solution to a finite-level I-POMDP. In this framework, agents at each level are trained against adaptive agents from the level below. This exposes the ego agent to adaptive behavior during training while avoiding the emergence of implicit coordination strategies, since the training partners are not themselves learning. We train our agent in a multi-episode, required-cooperation setup in the Overcooked domain, comparing it against several baseline agents designed for human–robot teaming. We evaluate its performance when paired with adaptive partners that were not seen during training. Our results demonstrate that our agent not only achieves higher task performance with these adaptive partners but also exhibits significantly greater adaptability during team interactions.


💡 Research Summary

The paper tackles the problem of mutual adaptation in human‑AI teaming, where both parties continuously adjust their behavior in response to each other. Existing methods either train robots against a diverse set of static partners or allow simultaneous learning of both agents. The former fails to capture human adaptation, while the latter often leads to the emergence of opaque, partner‑specific coordination conventions that do not generalize to new teammates.

To address these issues, the authors model the collaboration as a finitely nested Interactive Partially Observable Markov Decision Process (I‑POMDP). In this formulation, the state includes not only the physical environment but also a model of the human partner at a lower reasoning level. The hierarchy is defined as follows: level‑0 consists of fixed robot policies, level‑1 comprises human policies that adapt to level‑0 robots, and level‑2 is the robot policy that adapts to the set of level‑1 human policies. This nesting captures the belief that each agent holds about the other’s strategy.
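The level structure described above can be sketched as nested partner models, where each level-k agent holds models of its partner at level k−1. This is a minimal illustrative sketch, not the paper's implementation; all class and policy names are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class Agent:
    level: int
    policy: Any
    # A level-k agent reasons over level-(k-1) models of its partner;
    # level-0 agents hold no partner models (the recursion bottoms out).
    partner_models: List["Agent"] = field(default_factory=list)

# Level-0: fixed robot policies (no adaptation, no partner model).
level0 = [Agent(level=0, policy="fixed_robot_a"),
          Agent(level=0, policy="fixed_robot_b")]

# Level-1: human policies that adapt to the level-0 robot types.
level1 = [Agent(level=1, policy="adaptive_human", partner_models=level0)]

# Level-2: the ego robot, which adapts to the level-1 human pool.
level2 = Agent(level=2, policy="ego_robot", partner_models=level1)
```

Each agent's `partner_models` list plays the role of the interactive part of the I-POMDP state: alongside the physical environment state, the ego agent maintains beliefs over which lower-level model its partner is running.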

The core contribution is a “nested training” regime. First, a pool of level‑1 human policies is trained against a finite set of static level‑0 robot policies, ensuring that each human policy learns to infer and react to different robot types. Then, the level‑2 robot policy is trained against this pool of adaptive human policies, which remain fixed during robot training. Because the robot never co‑learns with its partners, it cannot converge to a brittle coordination convention; instead, it learns to handle a distribution of adaptive behaviors.
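The training order above can be illustrated with a toy coordination game: level-0 robots commit to a fixed dish, level-1 humans copy whichever dish the robot was last seen working on, and the level-2 robot is selected against the frozen level-1 pool. This is a deliberately simplified sketch (the "training" step collapses to picking the best fixed commitment); the function names and game are assumptions, not the paper's Overcooked setup.

```python
N_DISHES = 2

def level0_policy(dish):
    # Fixed robot "type": always commits to the same dish.
    return lambda step, partner_dish: dish

def level1_policy():
    # Adaptive human: waits, then copies whichever dish the robot
    # was last seen working on (a "wait until the robot reveals its
    # type" strategy, as in the paper's behavioral analysis).
    return lambda step, partner_dish: partner_dish

def play(robot, human, horizon=4):
    # One interaction: success iff both end up on the same dish.
    robot_dish = human_dish = None
    for step in range(horizon):
        robot_dish = robot(step, human_dish)
        human_dish = human(step, robot_dish)
    return 1.0 if robot_dish == human_dish else 0.0

def train_level2(level1_pool, horizon=4):
    # Stand-in for RL: pick the commitment with the best average
    # return against the FROZEN level-1 pool. The pool never learns
    # during this step, so no co-adapted convention can emerge.
    best = max(
        range(N_DISHES),
        key=lambda d: sum(play(level0_policy(d), h, horizon) for h in level1_pool),
    )
    return level0_policy(best)

level1_pool = [level1_policy() for _ in range(4)]
level2_robot = train_level2(level1_pool)
score = sum(play(level2_robot, h) for h in level1_pool) / len(level1_pool)
```

Because the level-1 partners are fixed during level-2 training, the robot's best response is to commit early and let the adaptive human follow, rather than to rely on a convention its partners would also have to learn.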

To make belief updates tractable, the authors learn a latent embedding of the interaction history (z_t = f_θ(h_t)) and condition the policy on this embedding: π_θ(a | o_t, z_t). This amortizes uncertainty over partner types and enables end‑to‑end reinforcement learning without explicit Bayesian inference.
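A minimal sketch of this amortized belief: a fixed exponential moving average over the partner's recent actions stands in for the learned encoder f_θ (which in the paper is trained end-to-end), and a greedy best-response stands in for π_θ. Both functions here are illustrative assumptions, not the paper's architecture.

```python
def encode_history(history, n_actions=2, decay=0.8):
    """z_t = f_theta(h_t): compress the interaction history into a
    fixed-size vector. Here: an exponential moving average over
    one-hot partner actions (a stand-in for a learned recurrent
    encoder)."""
    z = [0.0] * n_actions
    for a in history:
        z = [decay * zi for zi in z]   # discount older evidence
        z[a] += 1.0 - decay            # weight the latest action
    return z

def policy(obs, z):
    """pi(a | o_t, z_t): condition on the embedding instead of an
    explicit Bayesian belief; here, best-respond to the action the
    partner appears to favor."""
    return max(range(len(z)), key=lambda a: z[a])

z = encode_history([1, 1, 0, 1])
action = policy(obs=None, z=z)
```

The point of the embedding is that uncertainty over partner types never needs to be represented as an explicit posterior: the policy reads it off `z` directly, which keeps the whole pipeline trainable by standard reinforcement learning.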

Experiments are conducted in the Overcooked domain with a required‑cooperation setup. Eight unseen adaptive partners (the level‑1 human policies) are paired with the robot across ten rounds, each consisting of either five episodes (short evaluation) or twenty‑five episodes (extended evaluation). The proposed method achieves an average success rate of 0.90 in the short setting and 0.935 in the extended setting, substantially outperforming baselines such as LIAM, LILI, PA‑CE, and a Generalist policy (the best baseline reaches 0.575 and 0.65 respectively). Moreover, per‑partner success rates remain consistently high, whereas baselines exhibit large variance and often collapse when faced with new adaptive partners.

Behavioral analysis reveals that level‑1 humans display a “wait‑until‑the‑robot‑reveals‑its‑type” strategy, reflecting level‑1 reasoning. The level‑2 robot anticipates this waiting behavior and proactively commits first, demonstrating level‑2 reasoning. Baseline agents, by contrast, oscillate between recipe choices and fail to establish stable conventions. Statistical tests in the appendix confirm the significance of these patterns.

The paper concludes that explicitly modeling partner adaptation within a nested I‑POMDP framework prevents the formation of brittle conventions and yields policies that generalize to unseen adaptive teammates. Future work includes validation with real human participants, extending the approach to mixed‑motive scenarios where agents have divergent rewards, and exploring deeper nesting levels for richer theory‑of‑mind reasoning.

