Domain-Agnostic Causal-Aware Audio Transformer for Infant Cry Classification

Reading time: 5 minutes
...

📝 Original Info

  • Title: Domain-Agnostic Causal-Aware Audio Transformer for Infant Cry Classification
  • ArXiv ID: 2512.16271
  • Date: 2025-12-18
  • Authors: Geofrey Owino, Bernard Shibwabo Kasamani, Ahmed M. Abdelmoniem, Edem Wornyo

📝 Abstract

Accurate and interpretable classification of infant cry paralinguistics is essential for early detection of neonatal distress and clinical decision support. However, many existing deep learning methods rely on correlation-driven acoustic representations, which makes them vulnerable to noise, spurious cues, and domain shifts across recording environments. We propose DACH-TIC, a Domain-Agnostic Causal-Aware Hierarchical Audio Transformer for robust infant cry classification. The model integrates causal attention, hierarchical representation learning, multi-task supervision, and adversarial domain generalization within a unified framework. DACH-TIC employs a structured transformer backbone with local token-level and global semantic encoders, augmented by causal attention masking and controlled perturbation training to approximate counterfactual acoustic variations. A domain-adversarial objective promotes environment-invariant representations, while multi-task learning jointly optimizes cry type recognition, distress intensity estimation, and causal relevance prediction. The model is evaluated on the Baby Chillanto and Donate-a-Cry datasets, with ESC-50 environmental noise overlays for domain augmentation. Experimental results show that DACH-TIC outperforms state-of-the-art baselines, including HTS-AT and SE-ResNet Transformer, achieving improvements of 2.6 percent in accuracy and 2.2 points in macro-F1 score, alongside enhanced causal fidelity. The model generalizes effectively to unseen acoustic environments, with a domain performance gap of only 2.4 percent, demonstrating its suitability for real-world neonatal acoustic monitoring systems.
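The abstract describes multi-task supervision (cry type recognition, distress intensity estimation, causal relevance prediction) combined with a domain-adversarial objective that pushes the shared features to be invariant to the recording environment. The paper's code is not included on this page, so the following is only a minimal PyTorch sketch of how such a combined objective is commonly wired up, using a gradient reversal layer for the adversarial branch; all module names, dimensions, and loss weights are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): multi-task heads plus a
# domain-adversarial head driven by gradient reversal. Dimensions,
# class counts, and loss weights are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; scales and flips gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class MultiTaskAdversarialHeads(nn.Module):
    """Task heads on top of a shared transformer embedding (sizes are assumptions)."""
    def __init__(self, feat_dim=512, n_cry_types=5, n_domains=3, lambd=0.1):
        super().__init__()
        self.lambd = lambd
        self.cry_head = nn.Linear(feat_dim, n_cry_types)   # cry type recognition
        self.intensity_head = nn.Linear(feat_dim, 1)        # distress intensity (regression)
        self.causal_head = nn.Linear(feat_dim, 1)           # causal relevance prediction
        self.domain_head = nn.Linear(feat_dim, n_domains)   # recording environment (adversarial)

    def forward(self, feats):
        reversed_feats = GradientReversal.apply(feats, self.lambd)
        return (self.cry_head(feats),
                self.intensity_head(feats).squeeze(-1),
                self.causal_head(feats).squeeze(-1),
                self.domain_head(reversed_feats))


def combined_loss(outputs, targets, weights=(1.0, 0.5, 0.5, 1.0)):
    """Weighted sum of the four objectives; the weights are placeholders."""
    cry_logits, intensity, causal, domain_logits = outputs
    y_cry, y_intensity, y_causal, y_domain = targets
    return (weights[0] * F.cross_entropy(cry_logits, y_cry)
            + weights[1] * F.mse_loss(intensity, y_intensity)
            + weights[2] * F.binary_cross_entropy_with_logits(causal, y_causal)
            + weights[3] * F.cross_entropy(domain_logits, y_domain))
```

In this pattern, the reversed gradients push the shared representation to carry as little information about the recording environment as possible, while the task heads keep it informative about the cry itself, which is the intuition behind the environment-invariant features claimed in the abstract.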

💡 Deep Analysis

Figure 1

📄 Full Content

Domain-Agnostic Causal-Aware Audio Transformer for Infant Cry Classification

Geofrey Owino, School of Computing and Engineering Sciences, Strathmore University, Nairobi, Kenya (geoffreyowino@strathmore.edu)
Bernard Shibwabo Kasamani, School of Computing and Engineering Sciences, Strathmore University, Nairobi, Kenya (ORCID 0000-0002-0827-9857)
Edem Wornyo, Google Research, Google LLC, New York, New York, USA (wornyo@google.com)
Ahmed M. Abdelmoniem, School of Electronic Engineering and Computer Science, Queen Mary University of London, London, UK (ahmed.sayed@qmul.ac.uk)

Abstract— Accurate and interpretable classification of infant cry paralinguistics is vital for early diagnosis of neonatal distress and clinical decision support. However, most existing deep learning approaches rely heavily on correlation-driven signal representations, making them vulnerable to acoustic perturbations and domain shifts across recording environments. In this work, we present DACH-TIC, a novel Domain-Agnostic Causal-Aware Hierarchical Audio Transformer that integrates causal reasoning, multi-task modeling, and adversarial domain generalization for robust infant cry classification. DACH-TIC introduces a structured transformer backbone composed of local token-level and global semantic encoders, augmented by causal attention masking and a controlled perturbation training approach that simulates counterfactual acoustic perturbations. A domain-adversarial head further enables invariance across recording environments, while multi-task supervision improves representation robustness by jointly optimizing cry type, distress intensity, and causal relevance. We evaluate DACH-TIC on cry recordings from the Baby Chillanto and Donate-a-Cry datasets, with ESC-50 noise overlays for domain augmentation. Compared to existing state-of-the-art models such as the Hierarchical Token-Semantic Audio Transformer (HTS-AT) and the Squeeze-and-Excitation Residual Network Transformer (SE-ResNet Transformer), DACH-TIC achieves significant improvements in accuracy (↑2.6%), macro-F1 (↑2.2), and causal fidelity metrics. It also generalizes well to unseen environments with minimal performance degradation (domain gap: 2.4%). These results establish DACH-TIC as a causally grounded, domain-resilient model for real-world deployment in neonatal acoustic monitoring systems.

Keywords— Domain-agnostic, Causal-aware, Transformer, Paralinguistics, Domain-adversarial, Pseudo-interventional.

I. INTRODUCTION

Crying remains the primary communication modality in infants, often signaling critical physiological or emotional states such as hunger, pain, fatigue, or discomfort [1]. Classifying these cries with high precision enables timely interventions, particularly in neonatal intensive care units (NICUs) and low-resource environments [2, 3]. Ji et al. have shown that early paralinguistic analysis of infant cries holds predictive power for neurological and developmental outcomes [4, 5]. Recent evidence also shows that neonatal cry acoustics are associated with later neurodevelopment [6]. A multicenter cohort of very preterm infants found cry features at NICU discharge linked to cognitive, language, and behavioral outcomes at 2 years [7]. Lawford et al., in their review, likewise support the view that cry acoustics are markers of early neurological dysfunction [6].

Although recent advances in deep learning have yielded notable gains in infant cry classification, most existing models, including those proposed by Teeravajanadet et al. and Maghfira et al., are based on convolutional or recurrent architectures [8, 9]. These models often learn spurious correlations in the input data, which degrades their generalization performance in real-world settings characterized by domain shifts, such as noise variation or microphone differences [10]. Transformer-based architectures like the Hierarchical Token-Semantic Audio Transformer (HTS-AT), proposed by Chen et al., have achieved strong performance by modeling temporal token dependencies through attention mechanisms [11]. However, these models remain vulnerable to overfitting on non-causal acoustic artifacts, especially when trained without explicit structural priors. Baevski et al. further demonstrated the limits of unsupervised speech transformers in generalizing across acoustic domains [12]. According to Jiao et al., domain-specific noise and environmental conditions can introduce confounding artifacts that affect the learned representation [13]. Lin et al. argued that Bayesian risk minimization is a powerful alternative to empirical risk minimization in such non-independent and identically distributed conditions [14]. Similarly, Schölkopf et al. emphasized that models aligned with causal principles are more likely to generalize under interventions and distribution shifts [15]. Existing infant-cry models and audio tra
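The evaluation described in the abstract overlays ESC-50 environmental noise on the cry recordings for domain augmentation. As a rough illustration of that step only (the authors' exact augmentation pipeline is not given on this page), the sketch below mixes an ESC-50 noise clip into a cry clip at a randomly drawn signal-to-noise ratio; the function names and the SNR range are assumptions.

```python
# Minimal sketch (not the authors' pipeline): additive noise overlay at a
# target SNR, as one way to realise ESC-50 domain augmentation.
import numpy as np

def overlay_noise(cry: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `cry` at the requested SNR; both mono, same sample rate."""
    # Loop or trim the noise so it covers the full cry clip.
    if len(noise) < len(cry):
        noise = np.tile(noise, int(np.ceil(len(cry) / len(noise))))
    noise = noise[: len(cry)]

    # Scale the noise so that 10 * log10(P_cry / P_noise) equals snr_db.
    cry_power = np.mean(cry ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(cry_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    mixed = cry + scale * noise

    # Renormalize only if the mixture would clip in [-1, 1] float audio.
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed

def random_overlay(cry, noise, snr_range=(5.0, 20.0), rng=None):
    """Apply the overlay at a random SNR in dB drawn from `snr_range` (a placeholder range)."""
    rng = rng or np.random.default_rng()
    return overlay_noise(cry, noise, rng.uniform(*snr_range))
```

Training on such mixtures exposes the model to environment shifts it would otherwise only encounter at deployment, which complements the domain-adversarial objective sketched earlier.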

📸 Image Gallery

cover.png

Reference

This content is AI-processed based on open access ArXiv data.
