Counterfactual Conditional Likelihood Rewards for Multiagent Exploration
Efficient exploration is critical for multiagent systems to discover coordinated strategies, particularly in open-ended domains such as search and rescue or planetary surveying. However, when exploration is encouraged only at the individual agent level, it often leads to redundancy, as agents act without awareness of how their teammates are exploring. In this work, we introduce Counterfactual Conditional Likelihood (CCL) rewards, which score each agent's exploration by isolating its unique contribution to team exploration. Unlike prior methods that reward agents solely for the novelty of their individual observations, CCL emphasizes observations that are informative with respect to the joint exploration of the team. Experiments in continuous multiagent domains show that CCL rewards accelerate learning in domains with sparse team rewards, where most joint actions yield no feedback, and are particularly effective in tasks that require tight coordination among agents.
💡 Research Summary
The paper tackles the long‑standing problem of redundant exploration in cooperative multi‑agent reinforcement learning, especially in domains where external rewards are sparse and only granted when a sufficient number of agents coordinate their actions. Existing intrinsic rewards such as novelty, curiosity, or local observation entropy (OEM) are applied at the individual level and therefore fail to capture the contribution of each agent to the joint state‑space coverage. To address this, the authors introduce Counterfactual Conditional Likelihood (CCL) rewards, a novel intrinsic signal that quantifies how much an individual agent’s current observation increases the likelihood of the team’s joint observation compared with a counterfactual scenario where the agent’s observation is held at its previous value.
The method works as follows. Each agent's raw observation o_i^t is passed through a fixed random encoder φ, producing a low-dimensional embedding z_i^t (four dimensions in the experiments). All agents' embeddings are concatenated to form a joint embedding z^t that represents the team's joint observation. A replay memory M stores these joint embeddings over an episode. For each agent i, a counterfactual joint embedding z̃^{(i),t} is created by replacing z_i^t with the embedding of the previous observation, φ(o_i^{t−1}). Using k-nearest-neighbor density estimation in the joint embedding space, the algorithm computes the distance to the k-th nearest neighbor for both the actual and counterfactual embeddings, denoted ε_act and ε_cfact. To ensure a fair comparison, a shared radius ε_shared = max(ε_act, ε_cfact) is defined, and the numbers of neighbors within this radius for the actual and counterfactual cases (n_act, n_cfact) are counted. The conditional log-likelihoods are then approximated via the digamma function ψ as log p(o_i^t | o_{−i}^t) ≈ ψ(n_act + 1) and log p(õ_i^t | o_{−i}^t) ≈ ψ(n_cfact + 1). The CCL reward for agent i is r_i^{CCL} = ψ(n_act + 1) − ψ(n_cfact + 1). This quantity is positive when the agent's current observation makes the joint observation more probable than the counterfactual, effectively rewarding unique informational contributions and penalizing redundant ones.
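The pipeline described above can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration, not the authors' implementation: the observation dimension, number of agents, memory size, and k are assumptions made for the sketch, while the 4-dimensional embedding matches the paper's stated setup.

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)

# Illustrative dimensions: EMB_DIM=4 follows the paper; the rest are assumptions.
N_AGENTS, OBS_DIM, EMB_DIM, K = 2, 16, 4, 3
PHI = rng.normal(size=(OBS_DIM, EMB_DIM))  # fixed random encoder phi (never trained)


def encode(obs):
    """Frozen random linear projection standing in for phi."""
    return obs @ PHI


def ccl_reward(agent_idx, obs_t, obs_prev, memory, k=K):
    """r_i^CCL = psi(n_act + 1) - psi(n_cfact + 1) for one agent.

    obs_t, obs_prev: lists of per-agent observations at steps t and t-1
    memory: (N, N_AGENTS * EMB_DIM) array of stored joint embeddings
    """
    z = [encode(o) for o in obs_t]
    joint_act = np.concatenate(z)                    # actual joint embedding z^t
    z_cf = list(z)
    z_cf[agent_idx] = encode(obs_prev[agent_idx])    # hold agent i at its t-1 obs
    joint_cf = np.concatenate(z_cf)                  # counterfactual z~^(i),t

    # Distances to every stored joint embedding, sorted ascending.
    d_act = np.sort(np.linalg.norm(memory - joint_act, axis=1))
    d_cf = np.sort(np.linalg.norm(memory - joint_cf, axis=1))

    # Shared radius: the larger of the two k-th nearest-neighbor distances.
    eps_shared = max(d_act[k - 1], d_cf[k - 1])
    n_act = int(np.sum(d_act <= eps_shared))         # neighbors of the actual point
    n_cf = int(np.sum(d_cf <= eps_shared))           # neighbors of the counterfactual

    return digamma(n_act + 1) - digamma(n_cf + 1)
```

Note that if an agent's observation is unchanged (obs_prev equals obs_t for that agent), the actual and counterfactual embeddings coincide and the reward is exactly zero, which matches the intuition that a static agent contributes nothing new to joint exploration.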
The authors integrate CCL into a standard centralized‑training‑decentralized‑execution (CTDE) framework using Multi‑agent Proximal Policy Optimization (MAPPO) with LSTM policies and a centralized critic. They evaluate the approach in continuous, sparse‑reward environments that require tight coordination (e.g., simultaneous target acquisition). Experiments compare four settings: (1) baseline MAPPO with only extrinsic rewards, (2) MAPPO + local OEM, (3) MAPPO + CCL, and (4) MAPPO + both OEM and CCL. Results show that CCL dramatically accelerates learning—agents converge 2–3× faster than the OEM‑only baseline—and achieves higher final returns (≈30 % improvement). Moreover, the combination of OEM and CCL yields the best performance, indicating that encouraging both individual diversity and coordinated joint coverage is synergistic. The benefit of CCL grows with the number of agents and with increasing reward sparsity, confirming its suitability for large‑scale cooperative tasks.
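The four compared settings amount to different reward-shaping choices on top of the sparse extrinsic signal. A minimal sketch of that shaping is below; the mixing weights are illustrative assumptions, not values reported in the paper.

```python
# Hypothetical intrinsic-reward weights; the paper's coefficients are not given here.
BETA_OEM, BETA_CCL = 0.1, 0.1


def shaped_reward(r_ext, r_oem, r_ccl, use_oem, use_ccl):
    """Per-agent training reward: sparse extrinsic team reward plus optional
    OEM and/or CCL intrinsic bonuses, covering the four ablation settings."""
    r = r_ext
    if use_oem:
        r += BETA_OEM * r_oem
    if use_ccl:
        r += BETA_CCL * r_ccl
    return r
```

Setting both flags to False recovers the extrinsic-only MAPPO baseline; enabling both corresponds to the best-performing OEM + CCL combination.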
Key contributions of the work are: (i) a principled intrinsic reward based on counterfactual conditional likelihood that directly measures each agent’s marginal contribution to joint exploration; (ii) a practical implementation that avoids training joint embeddings by using fixed random encoders and k‑NN density estimation, thereby sidestepping non‑stationarity issues; (iii) empirical evidence that CCL reduces redundant exploration and promotes the discovery of tightly coordinated behaviors; and (iv) demonstration that CCL can be seamlessly combined with existing intrinsic rewards to further boost performance.
Overall, the paper presents a compelling new direction for intrinsic motivation in multi‑agent systems, offering a statistically grounded, computationally lightweight, and empirically validated mechanism to foster coordinated exploration in environments where external feedback is scarce. This has immediate relevance for real‑world multi‑robot applications such as search‑and‑rescue, planetary surveying, and underwater exploration, where efficient joint coverage and rapid coordination are critical.