NegoCollab: A Common Representation Negotiation Approach for Heterogeneous Collaborative Perception
Collaborative perception improves task performance by expanding the perception range through information sharing among agents. Immutable heterogeneity poses a significant challenge in collaborative perception, as participating agents may employ different and fixed perception models. This leads to domain gaps in the intermediate features shared among agents, consequently degrading collaborative performance. Aligning the features of all agents to a common representation can eliminate domain gaps at low training cost. However, existing methods designate the representation of a specific agent as the common representation, making it difficult for agents with significant domain discrepancies from that agent to achieve proper alignment. This paper proposes NegoCollab, a heterogeneous collaboration method based on a negotiated common representation. It introduces a negotiator during training to derive the common representation from the local representations of each modality’s agents, effectively reducing the inherent domain gap between the common representation and the various local representations. In NegoCollab, the mutual transformation of features between the local representation space and the common representation space is achieved by a pair of sender and receiver. To better align local representations to the common representation containing multimodal information, we introduce structural alignment loss and pragmatic alignment loss in addition to the distribution alignment loss to supervise training. This enables the knowledge in the common representation to be fully distilled into the sender.
💡 Research Summary
NegoCollab addresses the fundamental challenge of immutable heterogeneity in collaborative perception, where agents equipped with different sensors and fixed perception models generate domain gaps in the intermediate features they share. Existing solutions either retrain specialized collaborative modules, employ pairwise domain adapters, or designate the representation of a single agent as the common representation. The first is impractical for safety‑critical autonomous driving systems, the second incurs high training costs, and the third struggles when the designated agent’s representation is far from those of other agents.
The proposed framework introduces two novel components: a Negotiator and a plug‑and‑play Sender‑Receiver pair for each agent. The Negotiator takes the standardized local representations from all modalities and builds a multi‑level feature pyramid. At each pyramid level, an estimator predicts a contribution weight for each modality; the weighted representations are then averaged to form a level‑wise common feature. Concatenating and shrinking these level features yields a unified multimodal common representation P that is not tied to any single agent, thereby reducing the inherent domain gap between the common space and each local space.
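The per-level negotiation step described above can be sketched as follows. This is a minimal illustration, assuming same-resolution (B, C, H, W) feature maps at every pyramid level; the module and parameter names (`NegotiatorLevel`, `shrink`, etc.) are illustrative rather than taken from the paper:

```python
import torch
import torch.nn as nn


class NegotiatorLevel(nn.Module):
    """One pyramid level of the negotiator (illustrative sketch).

    A small estimator scores each modality's local representation;
    the softmax-normalized scores weight a per-modality average that
    forms the level-wise common feature.
    """

    def __init__(self, channels: int, num_modalities: int):
        super().__init__()
        # One lightweight estimator per modality, predicting a scalar weight.
        self.estimators = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
                nn.Linear(channels, 1),
            )
            for _ in range(num_modalities)
        )

    def forward(self, feats: list) -> torch.Tensor:
        # feats: list of (B, C, H, W) tensors, one per modality.
        scores = torch.cat([est(f) for est, f in zip(self.estimators, feats)], dim=1)
        weights = torch.softmax(scores, dim=1)         # (B, M) contribution weights
        stacked = torch.stack(feats, dim=1)            # (B, M, C, H, W)
        w = weights.view(*weights.shape, 1, 1, 1)      # broadcast over C, H, W
        return (w * stacked).sum(dim=1)                # weighted average -> (B, C, H, W)


class Negotiator(nn.Module):
    """Concatenate the level-wise common features and 'shrink' them into P."""

    def __init__(self, channels: int, num_modalities: int, num_levels: int):
        super().__init__()
        self.levels = nn.ModuleList(
            NegotiatorLevel(channels, num_modalities) for _ in range(num_levels)
        )
        # 1x1 conv fuses the concatenated level features back to `channels`.
        self.shrink = nn.Conv2d(channels * num_levels, channels, kernel_size=1)

    def forward(self, per_level_feats: list) -> torch.Tensor:
        # per_level_feats: list over levels, each a list over modalities of (B, C, H, W).
        level_outs = [lvl(f) for lvl, f in zip(self.levels, per_level_feats)]
        return self.shrink(torch.cat(level_outs, dim=1))
```

In practice the pyramid levels would carry different resolutions and require resampling before concatenation; that detail is elided here for clarity.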
The Sender transforms an agent’s local representation into the common space using a ConvNeXt‑based recombiner followed by a fused axial‑attention aligner. The Receiver performs the inverse transformation: a fused axial‑attention converter maps the received common features back to the local space, guided by the Sender’s recombiner output as a query, and a ConvNeXt recombiner refines the result. After receiving transformed features from collaborators, each agent fuses them with its own local features and passes the result to a task‑specific head (e.g., detection, segmentation).
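A simplified sketch of the sender/receiver pair follows, using a depthwise-conv block as a stand-in for the ConvNeXt recombiner and plain axial self-attention for the aligner/converter. The paper additionally guides the receiver's converter with a query derived from the sender's recombiner output; that cross-attention coupling is omitted here for brevity, so treat this as a structural outline rather than the actual architecture:

```python
import torch
import torch.nn as nn


class ConvNeXtBlock(nn.Module):
    """Simplified ConvNeXt-style block used as the 'recombiner'."""

    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)  # depthwise conv
        self.norm = nn.GroupNorm(1, dim)  # stand-in for channel-wise LayerNorm
        self.pwconv = nn.Sequential(
            nn.Conv2d(dim, 4 * dim, 1), nn.GELU(), nn.Conv2d(4 * dim, dim, 1)
        )

    def forward(self, x):
        return x + self.pwconv(self.norm(self.dwconv(x)))


class AxialAttention(nn.Module):
    """Self-attention along H, then along W (the 'fused axial-attention' part)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn_h = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_w = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        # Attend along height: one length-h sequence per (batch, column).
        t = x.permute(0, 3, 2, 1).reshape(b * w, h, c)
        t = t + self.attn_h(t, t, t)[0]
        x = t.reshape(b, w, h, c).permute(0, 3, 2, 1)
        # Attend along width: one length-w sequence per (batch, row).
        t = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
        t = t + self.attn_w(t, t, t)[0]
        return t.reshape(b, h, w, c).permute(0, 3, 1, 2)


class Sender(nn.Module):
    """Local space -> common space: recombiner, then aligner."""

    def __init__(self, dim: int):
        super().__init__()
        self.recombiner = ConvNeXtBlock(dim)
        self.aligner = AxialAttention(dim)

    def forward(self, local_feat):
        return self.aligner(self.recombiner(local_feat))


class Receiver(nn.Module):
    """Common space -> local space: converter, then a refining recombiner."""

    def __init__(self, dim: int):
        super().__init__()
        self.converter = AxialAttention(dim)
        self.recombiner = ConvNeXtBlock(dim)

    def forward(self, common_feat):
        return self.recombiner(self.converter(common_feat))
```

Both modules preserve the (B, C, H, W) shape, which is what makes them plug-and-play around an agent's frozen encoder and fusion head.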
Training proceeds in two stages. Stage 1 jointly optimizes the Negotiator and the Sender‑Receiver pair using a cyclic distribution consistency loss, which penalizes the L2 distance and standardized difference between the original local representation and the one reconstructed after a round‑trip through the common space. This loss minimizes information loss during bidirectional transformation and further narrows the domain gap. Stage 2 introduces a multi‑dimensional alignment loss composed of:
- Distribution Alignment Loss – enforces identical means and standard deviations between the Sender‑produced common representation and the Negotiator’s common representation.
- Structural Alignment Loss – aligns spatial and channel‑wise structures, typically via cosine similarity or attention map consistency, ensuring that the geometric layout of features is preserved across spaces.
- Pragmatic Alignment Loss – directly ties the alignment to downstream task performance, often implemented as a task‑specific loss (e.g., detection loss) computed on the fused features.
By simultaneously minimizing these three components, the multimodal knowledge embedded in the Negotiator’s common representation is fully distilled into the Sender, while the Receiver learns to reconstruct local features without residual domain discrepancies.
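Under one plausible reading of the summary above, the cyclic consistency term and the three alignment terms can be written as follows. The exact formulations in the paper may differ (for instance, the structural term here uses per-location cosine similarity, which is only one of the options mentioned):

```python
import torch
import torch.nn.functional as F


def cyclic_consistency(local: torch.Tensor, reconstructed: torch.Tensor) -> torch.Tensor:
    """Stage 1: L2 penalty on the local -> common -> local round trip."""
    return ((local - reconstructed) ** 2).mean()


def distribution_alignment(f_sender: torch.Tensor, f_nego: torch.Tensor) -> torch.Tensor:
    """Match per-channel mean and std of the sender's output to the
    negotiator's common representation (statistics over batch and space)."""
    mu_s, mu_n = f_sender.mean(dim=(0, 2, 3)), f_nego.mean(dim=(0, 2, 3))
    sd_s, sd_n = f_sender.std(dim=(0, 2, 3)), f_nego.std(dim=(0, 2, 3))
    return ((mu_s - mu_n) ** 2).mean() + ((sd_s - sd_n) ** 2).mean()


def structural_alignment(f_sender: torch.Tensor, f_nego: torch.Tensor) -> torch.Tensor:
    """Preserve spatial structure: cosine similarity between the feature
    vectors at each spatial location (one of the variants mentioned above)."""
    cos = F.cosine_similarity(f_sender, f_nego, dim=1)  # (B, H, W)
    return (1.0 - cos).mean()


def pragmatic_alignment(task_head, fused_feat, targets) -> torch.Tensor:
    """Tie alignment to the downstream task: the task-specific loss
    (e.g., detection loss) computed on the fused features."""
    return task_head(fused_feat, targets)
```

The three stage-2 terms would then be summed (typically with weighting coefficients, not shown) into the multi-dimensional alignment loss.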
Extensive experiments on benchmark collaborative perception datasets (such as OPV2V and V2X‑Set) demonstrate that NegoCollab outperforms prior common‑representation methods (e.g., Gao et al., 2025) by 3–5 % in mean Average Precision (mAP) and achieves comparable or superior performance to pairwise domain‑adapter approaches, while requiring substantially lower training resources (≈30 % of the cost). The gains are especially pronounced in scenarios with large modality gaps, such as LiDAR‑camera fusion, confirming the effectiveness of the negotiated common representation.
In summary, NegoCollab offers a low‑cost, scalable, and robust solution for heterogeneous collaborative perception. Its negotiator‑based common representation, bidirectional Sender‑Receiver architecture, and multi‑dimensional alignment loss together enable agents to retain their proprietary perception models yet share information seamlessly. This makes the framework highly suitable for real‑world multi‑agent systems, including autonomous vehicle fleets, drone swarms, and collaborative robotics, where safety, efficiency, and adaptability are paramount.