Saying the Unsaid: Revealing the Hidden Language of Multimodal Systems Through Telephone Games

Reading time: 6 minutes

📝 Abstract

Recent closed-source multimodal systems have made great advances, but their hidden language for understanding the world remains opaque because of their black-box architectures. In this paper, we use the systems’ preference bias to study their hidden language: During the process of compressing the input images (typically containing multiple concepts) into texts and then reconstructing them into images, the systems’ inherent preference bias introduces specific shifts in the outputs, disrupting the original input concept co-occurrence. We employ the multi-round “telephone game” to strategically leverage this bias. By observing the co-occurrence frequencies of concepts in telephone games, we quantitatively investigate the concept connection strength in the understanding of multimodal systems, i.e., “hidden language.” We also contribute Telescope, a dataset of 10,000+ concept pairs, as the database of our telephone game framework. Our telephone game is test-time scalable: By iteratively running telephone games, we can construct a global map of concept connections in multimodal systems’ understanding. Here we can identify preference bias inherited from training, assess generalization capability advancement, and discover more stable pathways for fragile concept connections. Furthermore, we use Reasoning-LLMs to uncover unexpected concept relationships that transcend textual and visual similarities, inferring how multimodal systems understand and simulate the world. This study offers a new perspective on the hidden language of multimodal systems and lays the foundation for future research on the interpretability and controllability of multimodal systems.

📄 Content

Recent multimodal systems, particularly closed-source ones (Hurst et al., 2024; StepFun, 2024; Bai et al., 2025), have made significant advances, e.g., the newest GPT-4o with Image Generation (OpenAI, 2025) (abbreviated as GPT-4o-IG(20250325)). However, because these systems have closed features, closed data, and even closed architectures, we cannot study their understanding of the world with training-based methods. Test-time methods are therefore urgently needed.

The hidden language reflects the connection strength between concepts within multimodal systems (Chefer et al., 2023), offering insight into how they understand the world. While prior training-based methods explored it via internal features (Chen et al., 2023; Chefer et al., 2023; Ghandeharioun et al., 2024), the rise of closed-source models renders such access impossible. Hence, we investigate the hidden language of multimodal systems at test time.

Figure 1: The "hidden language" here: Dog is more closely connected with Frame than with TV.

We innovatively propose to strategically leverage the multimodal systems' preference bias to study their hidden language at test time. Multimodal systems are trained to fit textual and visual representations of the same scenes, which typically involve multiple interrelated concepts. Sufficient training strengthens these concept connections in the systems' hidden understanding space (abbreviated as hidden space), while limited training weakens them. Imbalanced training data therefore yields different concept connection strengths, i.e., the hidden language. As illustrated in Figure 1, during image-to-text compression, the systems prefer to discard weakly connected concepts; during text-to-image reconstruction, they prefer strongly connected concepts (Zhao et al., 2024), even with the latest SOTA GPT-4o-IG(20250325). These preference biases change the input concepts, disrupting their co-occurrence in the output scene.

In this paper, we propose a test-time framework based on a multi-round telephone game to leverage this preference bias: a plug-and-play method involving multiple cycles of image reconstruction. As the telephone game progresses, fragile concept combinations gradually degrade, revealing their weak connection strength in the systems' understanding. We quantify the connection strength (i.e., the hidden language) by the concept co-occurrence frequency in the telephone game. As shown in Figure 2, a higher co-occurrence frequency indicates a stronger concept connection. This metric captures both training bias and generalization capability: stronger generalization enables consistent responses to similar patterns, corresponding to a uniform connection-strength distribution.
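
To make the probe concrete, here is a minimal Python sketch of the multi-round loop, assuming hypothetical `image_to_text`, `text_to_image`, and `detect_concepts` wrappers around the black-box system and a concept detector (none of these names come from the paper):

```python
from typing import Any, Callable, Set

def telephone_game(
    seed_image: Any,
    target_concepts: Set[str],
    image_to_text: Callable[[Any], str],         # black-box captioner (assumed wrapper)
    text_to_image: Callable[[str], Any],         # black-box generator (assumed wrapper)
    detect_concepts: Callable[[Any], Set[str]],  # concept detector on images (assumed)
    rounds: int = 10,
) -> float:
    """Play `rounds` compress-reconstruct cycles from `seed_image` and
    return how often all `target_concepts` co-occur in the outputs."""
    image = seed_image
    co_occurrences = 0
    for _ in range(rounds):
        caption = image_to_text(image)   # compression: weakly connected concepts tend to drop
        image = text_to_image(caption)   # reconstruction: strongly connected concepts favored
        if target_concepts <= detect_concepts(image):
            co_occurrences += 1          # all target concepts survived this round
    return co_occurrences / rounds       # co-occurrence frequency ~ connection strength
```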

We also contribute Telescope, a dataset consisting of 10,000+ concept pairs derived from 150 common visual concepts, primarily covering basic spatial relations (e.g., “A adjacent to B”) and some complex interactions (e.g., “A displayed on TV screen”). Leveraging the telephone game and Telescope, we propose a scalable test-time probing framework for the hidden language of multimodal systems: Each new telephone game iteration tends to reveal new concept connections, and as test-time compute scales up, we progressively build a detailed “world map” of the multimodal hidden language.
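
As a rough illustration of how such a probe set can be enumerated (the concept list and relation templates below are placeholders, not the released Telescope data), note that 150 concepts yield C(150, 2) = 11,175 unordered pairs, consistent with the reported 10,000+:

```python
from itertools import combinations

# Placeholder concepts; the released set uses 150 common visual concepts.
CONCEPTS = ["dog", "TV", "frame", "sofa", "lamp"]

# Placeholder relation templates in the spirit of the paper's examples.
TEMPLATES = [
    "a photo of a {a} adjacent to a {b}",  # basic spatial relation
    "a {a} displayed on a {b}",            # more complex interaction
]

def build_probes(concepts=CONCEPTS, templates=TEMPLATES):
    """Yield ((concept_a, concept_b), prompt) for every pair and template."""
    for a, b in combinations(concepts, 2):
        for template in templates:
            yield (a, b), template.format(a=a, b=b)

# e.g., next(build_probes()) -> (("dog", "TV"), "a photo of a dog adjacent to a TV")
```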

In this way: (1) We uncover key terms associated with a concept in multimodal systems' understanding, revealing the training bias (which combinations are well-trained and which are not) and the systems' generalization capability;

(2) By analyzing connection strengths across multiple pathways, we identify intermediate concepts that strengthen weak connections and promote the co-occurrence of discordant concepts (see the graph-search sketch after this list);

(3) Reasoning-LLMs help us understand how these connection strengths reflect physical-world laws, revealing unexpected relationships beyond textual and visual similarities.
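
As a rough illustration of item (2), here is a minimal sketch of pathway search over a concept graph, assuming pairwise connection strengths in (0, 1] measured by the telephone game; the function and data are illustrative, not the paper's implementation. Modeling path strength as the product of edge strengths, the strongest path minimizes the sum of negative log-strengths, so plain Dijkstra applies:

```python
import heapq
import math

def strongest_pathway(strength, source, target):
    """strength: {(u, v): s} with s in (0, 1] from telephone-game probing.
    Returns (path, product_strength), or (None, 0.0) if disconnected."""
    graph = {}
    for (u, v), s in strength.items():
        w = -math.log(s)                  # s in (0, 1] => w >= 0
        graph.setdefault(u, []).append((v, w))
        graph.setdefault(v, []).append((u, w))

    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, math.inf):
            continue                      # stale heap entry
        for v, w in graph.get(u, []):
            if d + w < dist.get(v, math.inf):
                dist[v] = d + w
                prev[v] = u
                heapq.heappush(heap, (d + w, v))

    if target not in dist:
        return None, 0.0
    path, node = [target], target
    while node != source:                 # walk predecessors back to the source
        node = prev[node]
        path.append(node)
    return path[::-1], math.exp(-dist[target])

# With strength = {("dog", "frame"): 0.9, ("frame", "TV"): 0.8, ("dog", "TV"): 0.3},
# strongest_pathway(strength, "dog", "TV") -> (["dog", "frame", "TV"], ~0.72):
# routing through "frame" beats the fragile direct dog-TV connection (0.3).
```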

Figure 2: The longevity of concept combinations in the telephone game (i.e., their co-occurrence frequency) quantitatively reflects the concept connections in multimodal systems' hidden space, termed the "hidden language." (Lighter color indicates a weaker connection.)

Here, we summarize our contributions:

• Test-time Telephone Game Framework: We propose to reveal the hidden language of multimodal systems at test time using the telephone game framework and the concept co-occurrence frequency;
• Telescope Dataset: We contribute Telescope, a database for systematically probing multimodal systems' hidden language with telephone games;
• Test-Time Scalable Framework: We keep building an increasingly comprehensive hidden-language world map of multimodal systems in a scalable way.

MultiModal Systems. Recent advances in multimodal intelligence systems (Lu et al., 2019; Baltrušaitis et al., 2018; Xie et al., 2024; Guo et al., 2019; Li et al., 2022, 2023; Tan and Bansal, 2019) have shown great ability in processing cross-modal information.

This content is AI-processed based on ArXiv data.
