Did somebody say "Gest-IT"? A pilot exploration of multimodal data management
The paper presents a pilot exploration of the construction, management, and analysis of a multimodal corpus. Through a three-layer annotation providing orthographic, prosodic, and gestural transcriptions, the Gest-IT resource makes it possible to investigate variation in gesture-making patterns in conversations between sighted people and people with visual impairment. After discussing the transcription methods and technical procedures employed in our study, we propose a unified CoNLL-U corpus and indicate our future steps.
💡 Research Summary
The paper presents the initial development of the Gest‑IT corpus, a multimodal resource designed to capture naturalistic interactions between sighted speakers and speakers with visual impairment. Recognizing that traditional language corpora focus almost exclusively on written or spoken text and therefore miss the rich semiotic resources of gesture, gaze, and prosody, the authors set out to create an “ecological” dataset that records audio‑visual data without imposing intrusive equipment on participants.
The study recruited 14 adult participants (six blind or partially sighted, eight sighted) balanced for age, gender, and education. Each pair engaged in a 30‑minute spontaneous conversation, resulting in 13 recorded sessions (approximately seven hours of footage). The experimental design systematically varied two factors: (1) visual condition – either both participants share the same visual status (S) or have different statuses (D), and (2) visual access – participants either faced each other (unmasked, U) or sat back‑to‑back (masked, M) so that sighted interlocutors could not see the gestures of their partners. This factorial arrangement allows the researchers to disentangle the influence of visual perception on gesture production.
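To make the factorial arrangement concrete, here is a minimal sketch that enumerates its four cells. The single-letter condition codes (S/D, U/M) come from the summary above; combining them into two-letter cell labels such as "SU" is our own shorthand, not necessarily the paper's notation.

```python
from itertools import product

# Factor levels as described in the study design (labels from the text).
visual_status = {"S": "same visual status", "D": "different visual status"}
visual_access = {"U": "unmasked (face-to-face)", "M": "masked (back-to-back)"}

# Cross the two factors to list the four cells of the 2x2 design:
# SU, SM, DU, DM.
for status, access in product(visual_status, visual_access):
    print(f"{status}{access}: {visual_status[status]} / {visual_access[access]}")
```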
A three‑layer annotation scheme is introduced. The first layer provides an orthographic transcription of the spoken content. The second layer adds a prosodic transcription using IPA‑based symbols to encode stress, intonation, and boundary information. The third layer is a novel gestural transcription that records the form of each gesture in an objective manner, specifying articulators (hand, arm, head, torso), movement direction, trajectory, and temporal dynamics. By aligning these three layers on a common time axis, the corpus enables integrated analyses of speech‑gesture synchrony, prosody‑gesture interaction, and the effect of visual feedback on gestural behavior.
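As a purely illustrative sketch of what a time-aligned record across the three layers could look like, consider the hypothetical structure below. The field names, the gesture-label format, and the sample values are our assumptions for illustration, not the actual Gest-IT schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AlignedUnit:
    """One hypothetical unit aligned on the corpus's shared time axis."""
    start: float            # onset on the shared timeline (seconds)
    end: float              # offset (seconds)
    orthographic: str       # layer 1: orthographic transcription
    prosodic: str           # layer 2: IPA-based prosodic transcription
    gesture: Optional[str]  # layer 3: form-based gesture label, if any

# Invented example: an Italian token co-occurring with a hand gesture.
unit = AlignedUnit(
    start=12.40,
    end=12.85,
    orthographic="ecco",
    prosodic="ˈɛkko",
    gesture="right_hand|palm_up|lateral_arc",  # articulator|shape|movement (invented)
)
print(unit)
```

Whatever the concrete schema, anchoring all three layers to the same start/end timestamps is what makes queries such as "gestures overlapping stressed syllables" straightforward.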
To ensure reproducibility and collaborative development, the corpus is managed in a central Git repository. Each participant and each conversation is described by a YAML metadata file, allowing continuous integration/continuous deployment (CI/CD) pipelines to automatically generate summary tables and status reports whenever new data are added. The annotated data are exported to a unified CoNLL‑U format, making them directly compatible with existing Universal Dependencies tools and facilitating downstream NLP tasks that incorporate multimodal information.
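As an illustration of the metadata layer described above, a per-session YAML file might look like the following; every field name and path here is a guess for illustration, not the project's actual schema.

```yaml
# Hypothetical session metadata (illustrative field names)
session_id: s07
condition:
  visual_status: D        # S = same, D = different
  visual_access: M        # U = unmasked, M = masked (back-to-back)
duration_minutes: 30
participants:
  - id: p03
    sighted: false
  - id: p11
    sighted: true
recordings:
  video: recordings/s07.mp4
  audio: recordings/s07.wav
```

For the CoNLL-U export, one plausible encoding (not confirmed by the paper) keeps the ten standard columns intact and carries the prosodic and gestural layers as key=value pairs in the MISC column, which the format permits:

```conllu
# sent_id = s07-0001
# text = ecco qua
1	ecco	ecco	INTJ	_	_	0	root	_	Prosody=ˈɛkko|GestureID=g042|Start=12.40|End=12.85
2	qua	qua	ADV	_	_	1	advmod	_	Prosody=kwa|Start=12.85|End=13.10
```

Staying within standard CoNLL-U in this way means existing Universal Dependencies tooling (validators, parsers) works unchanged, while the multimodal attributes remain recoverable from MISC.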
The authors discuss the broader challenges that motivated the project: the lack of standardized gesture transcription systems, the labor‑intensive nature of multimodal annotation, and the scarcity of ecological, publicly available multimodal corpora for Italian. They position Gest‑IT as a pilot that addresses these gaps and outline future work, which includes scaling up the corpus, developing semi‑automatic transcription and annotation tools (e.g., video‑speech alignment, deep‑learning‑based gesture detection), extending the resource to other languages and cultural contexts, and collaborating with the international community to converge on a common gestural annotation standard. The ultimate goal is to provide a robust, open‑access multimodal dataset that can be seamlessly integrated into computational linguistics, psycholinguistics, and human‑computer interaction research.