Emotion Recognition in Signers

Reading time: 5 minutes
...

📝 Abstract

Recognition of signers’ emotions suffers from one theoretical challenge and one practical challenge, namely, the overlap between grammatical and affective facial expressions and the scarcity of data for model training. This paper addresses these two challenges in a cross-lingual setting using our eJSL dataset, a new benchmark dataset for emotion recognition in Japanese Sign Language signers, and BOBSL, a large British Sign Language dataset with subtitles. In eJSL, two signers expressed 78 distinct utterances with each of seven different emotional states, resulting in 1,092 video clips. We empirically demonstrate that 1) textual emotion recognition in spoken language mitigates data scarcity in sign language, 2) temporal segment selection has a significant impact, and 3) incorporating hand motion enhances emotion recognition in signers. Finally, we establish a stronger baseline than spoken-language LLMs.


📄 Content

Emotion Recognition in Signers

Kotaro Funakoshi, FIRST, Institute of Integrated Research, Institute of Science Tokyo (funakoshi@first.iir.isct.ac.jp)
Yaoxiong Zhu, ICT, School of Engineering, Institute of Science Tokyo (zhuyaoxiong@lr.first.isct.ac.jp)

1 Introduction

Emotion recognition is a core topic not only in natural language processing (Yun et al., 2024) but also in affective computing and human-computer interaction (Zeng et al., 2009; El Ayadi et al., 2011), enabling more natural and empathetic systems. Such systems are equally or even more important for social minorities. Recently, more light has been shed on sign language (Long et al., 2024; Yin et al., 2024; Wang et al., 2025); however, automatic emotion recognition in signers has barely been explored. To the best of our knowledge, the only prior contribution in this direction is the EmoSign dataset for American Sign Language (ASL) (Chua et al., 2025).

In this paper, we introduce eJSL (available upon request to the authors), a new dataset for emotion recognition. This dataset differs from EmoSign in two respects. First, it is in Japanese Sign Language (JSL). Second, it is essentially para-linguistic: we asked two signers to express 78 distinct sentences with each of seven different emotional states, resulting in 1,092 video clips. Because human languages are highly context-dependent, virtually any linguistic expression can be produced with any emotion. The task posed by this dataset is thus emotion recognition in signers while they sign, rather than emotion recognition of the sign language content itself.

The challenge that arises here is that emotional expression in signers is further complicated because facial expressions convey both grammatical and affective information (Brentari, 1999; Wilbur, 2000). For example, eyebrow movement can signal a yes/no question (Pfau and Quer, 2010) or express surprise (Valli and Lucas, 2000), creating ambiguity for emotion recognition models trained on non-signers.

To address this challenge, we investigate three hypotheses: (1) caption-based weakly labeled data can support effective model fine-tuning, (2) selecting temporal segments less affected by grammatical expressions improves accuracy, and (3) hand gesture features enhance recognition beyond facial features alone. Experiments on multiple datasets validate these hypotheses and offer insights into emotional communication in signers.
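Hypotheses (2) and (3) can be pictured with a short sketch. The snippet below is a minimal illustration, not the authors' actual pipeline: the segment-selection heuristic, the feature dimensions, the linear classification heads, and the seven-way label set (basic emotions plus neutral) are all assumptions made for the example.

```python
# Minimal sketch of hypotheses (2) and (3): pick a temporal segment that is
# less likely to carry grammatical facial expressions, then fuse facial and
# hand-motion features for emotion classification.
# All feature shapes and the segment heuristic below are illustrative
# assumptions, not the paper's method.
import numpy as np

EMOTIONS = ["anger", "disgust", "fear", "happiness", "neutral", "sadness", "surprise"]

def select_segment(num_frames: int, keep_ratio: float = 0.5) -> slice:
    """Keep only the central portion of a clip, assuming sentence-level
    grammatical markers (e.g., question eyebrows) cluster near clip
    boundaries. This heuristic is an assumption for illustration only."""
    start = int(num_frames * (1 - keep_ratio) / 2)
    return slice(start, num_frames - start)

def classify_clip(face_feats: np.ndarray, hand_feats: np.ndarray,
                  w_face: np.ndarray, w_hand: np.ndarray) -> str:
    """Late fusion: average per-frame logits from a facial branch and a
    hand-motion branch over the selected segment, then take the argmax."""
    seg = select_segment(len(face_feats))
    face_logits = face_feats[seg] @ w_face   # (T', num_emotions)
    hand_logits = hand_feats[seg] @ w_hand   # (T', num_emotions)
    fused = (face_logits + hand_logits).mean(axis=0)
    return EMOTIONS[int(np.argmax(fused))]

# Toy usage with random features and untrained weights (illustration only).
rng = np.random.default_rng(0)
T, d_face, d_hand = 120, 512, 128
print(classify_clip(rng.normal(size=(T, d_face)), rng.normal(size=(T, d_hand)),
                    rng.normal(size=(d_face, len(EMOTIONS))),
                    rng.normal(size=(d_hand, len(EMOTIONS)))))
```

Late fusion by averaging the two branches' logits is only one of several plausible ways to combine facial and hand-motion cues; the point of the sketch is simply that the hand branch adds information the facial branch alone cannot provide.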
2 Emotion Recognition and Sign Language

As discussed, a unique challenge in emotion recognition in signers lies in the overlap between grammatical facial expressions (GFEs) and affective facial expressions (AFEs). Unlike spoken language, sign language uses non-manual markers such as facial movements and head gestures to encode syntax. These signals often occur simultaneously with AFEs, making their separation critical for accurate understanding of communicative information.

To this end, Silva et al. (2020) annotated their corpus with facial Action Units (AUs) to encode GFEs. However, this corpus is not annotated in terms of emotion. Although there are many other sign language datasets (see Table 2 of Albanie et al. (2021) for a non-exhaustive but rich list of 30 datasets), none of them comes with emotion annotations except for EmoSign and our eJSL. However, both of these are small-scale, benchmark-oriented datasets. Thus the scarcity of data available for supervised training is another challenge.

In human multimodal communication, verbal and non-verbal information can be independent, or even contradictory in sentiment. In such contradictory situations, facial information can be far more dominant than verbal information (Mehrabian, 1971). Nevertheless, the multimodal spoken-language emotion recognition literature repeatedly observes that, in typical situations, textual information is dominant (Li et al., 2023; Yun et al., 2024). Therefore, we can expect that, even in sign language, textual caption/subtitle data (i.e., translations in spoken language) are useful.
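Hypothesis (1) follows directly from this observation: subtitles aligned to signing video can be labeled with an off-the-shelf spoken-language text emotion classifier, and the resulting weak labels can supervise a video model. The sketch below assumes the Hugging Face transformers text-classification pipeline; the model identifier is a hypothetical placeholder rather than the model used in the paper, and any English emotion classifier with a compatible label set could be substituted.

```python
# Minimal sketch of caption-based weak labelling: run a spoken-language text
# emotion classifier over BOBSL-style subtitles and keep its top prediction
# as a weak label for the aligned video clip.
from transformers import pipeline

TEXT_EMOTION_MODEL = "some-org/english-emotion-classifier"  # hypothetical placeholder

def weak_labels(subtitles, min_conf=0.6):
    """Return (subtitle, emotion) pairs for confident predictions, None otherwise."""
    clf = pipeline("text-classification", model=TEXT_EMOTION_MODEL)
    out = []
    for sub, pred in zip(subtitles, clf(subtitles)):
        out.append((sub, pred["label"]) if pred["score"] >= min_conf else None)
    return out

# Example: confident predictions become weak training labels for the aligned
# clips; low-confidence subtitles are dropped rather than risk noisy supervision.
print(weak_labels(["I can't believe you did that!", "The meeting is at nine."]))
```

Thresholding on classifier confidence is one simple way to trade label noise against the amount of weakly labeled data; the exact threshold is an assumption here, not a value taken from the paper.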

This content is AI-processed based on ArXiv data.
