Learning to Communicate Across Modalities: Perceptual Heterogeneity in Multi-Agent Systems

Emergent communication offers insight into how agents develop shared structured representations, yet most research assumes homogeneous modalities or aligned representational spaces, overlooking the perceptual heterogeneity of real-world settings. We study a heterogeneous multi-step binary communication game where agents differ in modality and lack perceptual grounding. Despite perceptual misalignment, multimodal systems converge to class-consistent messages grounded in perceptual input. Unimodal systems communicate more efficiently, using fewer bits and achieving lower classification entropy, while multimodal agents require greater information exchange and exhibit higher uncertainty. Bit perturbation experiments provide strong evidence that meaning is encoded in a distributional rather than compositional manner, as each bit’s contribution depends on its surrounding pattern. Finally, interoperability analyses show that systems trained in different perceptual worlds fail to directly communicate, but limited fine-tuning enables successful cross-system communication. This work positions emergent communication as a framework for studying how agents adapt and transfer representations across heterogeneous modalities, opening new directions for both theory and experimentation.


💡 Research Summary

The paper investigates emergent communication between artificial agents that perceive the world through different sensory modalities. Building on a multi‑step referential game, the authors extend the original image‑text setting to an audio‑image scenario, where a Sender receives an audio clip of an object (e.g., a dog bark) and a Receiver observes a visual representation of the same object. Communication proceeds in discrete timesteps; at each step the Sender emits a D‑dimensional binary vector, the Receiver decides whether to stop and make a guess or to reply with its own binary vector, and the exchange continues until termination or a maximum number of steps.
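The multi-step exchange described above can be sketched as a simple episode loop. Everything here is illustrative: the `sender` and `receiver` policies are placeholders (the paper's agents are trained neural networks), and the message width and step cap are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8           # message width in bits (illustrative; the paper varies this)
MAX_STEPS = 5   # cap on dialogue length before a forced guess

def sender(obs, receiver_msg):
    # Placeholder policy: deterministically hash the inputs into D bits.
    h = hash((obs, receiver_msg.tobytes())) % (2 ** D)
    return np.array([(h >> i) & 1 for i in range(D)], dtype=np.int8)

def receiver(msg, candidates):
    # Placeholder policy: stop with probability 0.5, otherwise reply.
    if rng.random() < 0.5:
        return "STOP", int(rng.integers(len(candidates)))   # guess an index
    return "REPLY", rng.integers(0, 2, size=D).astype(np.int8)

def play_episode(obs, candidates):
    r_msg = np.zeros(D, dtype=np.int8)        # Receiver starts silent
    for step in range(MAX_STEPS):
        s_msg = sender(obs, r_msg)            # Sender emits a binary vector
        action, payload = receiver(s_msg, candidates)
        if action == "STOP":                  # Receiver guesses and ends
            return payload, step + 1
        r_msg = payload                       # reply feeds back to the Sender
    # Step budget exhausted: force a guess
    return int(rng.integers(len(candidates))), MAX_STEPS

guess, n_steps = play_episode(obs="dog_bark", candidates=["dog", "cat", "car"])
```

The key structural point is the feedback loop: the Sender conditions each new message on the Receiver's previous reply, which is what makes the protocol multi-step rather than one-shot.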

Two experimental conditions are compared: a unimodal baseline (audio‑audio) and a heterogeneous multimodal pair (audio‑image). Agents are simple neural networks: the Sender is a feed‑forward model that conditions on its private input and the latest Receiver message, while the Receiver is a recurrent network that integrates the Sender’s messages together with candidate embeddings. Training jointly optimises a loss that combines classification error, REINFORCE‑based reward, and an entropy regulariser. The authors evaluate on a synthetic “Shapes World” dataset for controlled analysis and on a realistic pairing of CIFAR‑100 images with UrbanSound8K/ESC‑50 audio clips.
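The three-part training objective can be sketched numerically as follows. The weighting coefficients, the reward definition, and the per-bit Bernoulli message policy are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def combined_loss(logits, target, log_prob_msgs, reward, msg_probs,
                  beta_rl=1.0, beta_ent=0.01):
    """Illustrative version of the three-part training signal.

    logits        -- Receiver's scores over the candidate set
    target        -- index of the correct candidate
    log_prob_msgs -- summed log-probability of the sampled binary messages
    reward        -- episode reward (e.g. +1 for a correct guess)
    msg_probs     -- per-bit Bernoulli probabilities of the message policy
    """
    p = softmax(logits)
    ce = -np.log(p[target] + 1e-12)              # classification error
    # REINFORCE term: raise log-probs of messages that led to reward
    rl = -beta_rl * reward * log_prob_msgs
    # Entropy regulariser keeps the message policy exploratory
    ent = -(msg_probs * np.log(msg_probs + 1e-12)
            + (1 - msg_probs) * np.log(1 - msg_probs + 1e-12)).sum()
    return float(ce + rl - beta_ent * ent)

loss = combined_loss(
    logits=np.array([2.0, 0.5, -1.0]), target=0,
    log_prob_msgs=-3.2, reward=1.0,
    msg_probs=np.array([0.9, 0.5, 0.1]),
)
```

Subtracting the entropy term (rather than adding it) rewards uncertain message policies early in training, which is the standard role of an entropy regulariser in REINFORCE-style objectives.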

Key Findings

  1. Communication Efficiency – When message length is reduced from 50 to 5 bits, unimodal agents retain high accuracy and low classification entropy, whereas multimodal agents suffer a sharp drop in accuracy and a rise in entropy. This demonstrates that perceptual misalignment introduces effective channel noise, requiring longer messages to preserve performance.
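The "classification entropy" used here is the Shannon entropy of the Receiver's output distribution over candidates; a minimal helper (assuming a normalised probability vector as input) makes the measurement concrete:

```python
import numpy as np

def classification_entropy(probs):
    """Shannon entropy (in bits) of the Receiver's class distribution.
    Low values indicate confident, near-deterministic guesses."""
    p = np.asarray(probs, dtype=float)
    p = p / p.sum()                               # guard against unnormalised input
    return float(-(p * np.log2(p + 1e-12)).sum())

confident = classification_entropy([0.97, 0.01, 0.01, 0.01])   # near 0 bits
uncertain = classification_entropy([0.25, 0.25, 0.25, 0.25])   # 2 bits (uniform)
```

Under this metric, the unimodal agents' low entropy at 5-bit messages means their guesses stay sharply peaked, while the multimodal agents' distributions flatten toward the uniform case.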

  2. Class Consistency – Sender messages exhibit strong within‑class cosine similarity, especially under tighter capacity constraints, indicating deterministic encodings. Receiver messages are less consistent because they also depend on the distractor set and confidence levels. Cross‑agent consistency is low, suggesting that each side develops its own protocol rather than sharing a common symbolic lexicon.
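The within-class versus cross-class comparison can be sketched with a small cosine-similarity routine. The toy messages and class labels below are invented for illustration; the O(n²) pairwise loop is fine at this scale but would be vectorised in practice.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def class_consistency(messages, labels):
    """Mean pairwise cosine similarity of messages within each class versus
    across classes. messages: [N, D] binary array; labels: length-N sequence."""
    messages = np.asarray(messages, dtype=float)
    labels = np.asarray(labels)
    within, across = [], []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine(messages[i], messages[j])
            (within if labels[i] == labels[j] else across).append(sim)
    return float(np.mean(within)), float(np.mean(across))

msgs = np.array([[1, 1, 0, 0], [1, 1, 0, 1],     # class "dog"
                 [0, 0, 1, 1], [0, 1, 1, 1]])    # class "cat"
w, a = class_consistency(msgs, ["dog", "dog", "cat", "cat"])
```

A deterministic, class-consistent protocol shows up as a within-class mean well above the cross-class mean, which is the pattern the paper reports for Sender messages.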

  3. Distributional Meaning Encoding – Bit‑perturbation experiments reveal that “constant” bits (almost always 0 or 1) carry most class‑discriminative information; flipping them drastically reduces accuracy. “Variable” bits have minimal impact. Moreover, the effect of flipping a constant bit depends on the surrounding bit pattern, indicating that meaning is distributed across the whole message rather than compositional at the level of individual bits.
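The perturbation procedure can be sketched in two steps: identify the near-constant bits of a class's messages, then flip them and re-score the Receiver. The 95% constancy threshold and the toy message set are assumptions; the paper's exact criterion may differ.

```python
import numpy as np

def identify_constant_bits(messages, threshold=0.95):
    """Indices of bits whose value is (nearly) fixed across a class's messages."""
    freq = np.asarray(messages, dtype=float).mean(axis=0)  # fraction of 1s per bit
    return np.where((freq >= threshold) | (freq <= 1 - threshold))[0]

def flip_bits(message, positions):
    out = np.array(message, dtype=np.int8)
    out[positions] ^= 1                    # XOR flips 0 <-> 1
    return out

# Toy protocol: bits 0-1 are constant for this class, bits 2-3 vary freely.
class_msgs = np.array([[1, 0, 0, 1],
                       [1, 0, 1, 0],
                       [1, 0, 1, 1],
                       [1, 0, 0, 0]])
const = identify_constant_bits(class_msgs)        # the meaning-bearing bits
perturbed = flip_bits(class_msgs[0], const)       # ablate them and re-evaluate
```

In the paper's experiments, feeding such perturbed messages to the Receiver collapses accuracy when constant bits are flipped, and the size of the drop varies with the remaining bit pattern, which is the evidence for distributional rather than per-bit compositional encoding.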

  4. Grounding in Perceptual Structure – t‑SNE visualisations of Sender embeddings under controlled variations of audio frequency and amplitude show systematic clustering by frequency, both in unimodal and multimodal settings. Even when the Receiver processes a different modality, the Sender’s messages retain traces of its own perceptual space, confirming that emergent communication can remain grounded in the sender’s sensory world despite modality gaps.

  5. Interoperability and Adaptation – Direct communication between a multimodal Sender (trained with an image Receiver) and a unimodal audio Receiver fails (≈random accuracy). However, a modest amount of fine‑tuning (as few as 2 epochs) dramatically improves performance, and after about 15 epochs both agents achieve high accuracy without severely degrading performance with their original partners. Accuracy spikes after the first message exchange, suggesting rapid partner identification and protocol adjustment.

Implications and Future Work

The study shows that perceptual heterogeneity forces agents to use longer, less certain messages and leads to a distributional encoding of meaning. Nevertheless, agents can quickly adapt their protocols with minimal additional training, highlighting the flexibility of emergent languages. These insights have relevance for multimodal robotics, human‑machine interaction, and theories of embodied cognition that emphasize grounding across divergent sensory worlds. Future directions include testing alternative architectures, embodied robot platforms, and more complex grounding mechanisms to broaden our understanding of heterogeneous emergent communication.
