Deception Detection in Dyadic Exchanges Using Multimodal Machine Learning: A Study on a Swedish Cohort
This study investigates the efficacy of multimodal machine learning techniques for detecting deception in dyadic interactions, focusing on the integration of data from both the deceiver and the deceived. We compare early and late fusion approaches, utilizing audio and video data (specifically, Action Units and gaze information) across all possible combinations of modalities and participants. Our dataset, newly collected from native Swedish speakers engaged in truth-or-lie scenarios on emotionally relevant topics, serves as the basis for our analysis. The results demonstrate that incorporating both speech and facial information yields superior performance compared to single-modality approaches. Moreover, including data from both participants significantly enhances deception detection accuracy, with the best performance (71%) achieved using a late fusion strategy applied to both modalities and participants. These findings align with psychological theories suggesting differential control of facial and vocal expressions during initial interactions. As the first study of its kind on a Scandinavian cohort, this research lays the groundwork for future investigations into dyadic interactions, particularly within psychotherapy settings.
💡 Research Summary
This paper investigates the use of multimodal machine learning to detect deception in dyadic (two‑person) exchanges, focusing on a newly collected Swedish cohort. The authors argue that most prior deception‑detection work has examined either a single speaker or a single modality (audio or video) in isolation, which limits performance and fails to capture the interactive nature of deception. Drawing on Interpersonal Deception Theory (IDT) and cognitive‑load theories, they hypothesize that incorporating data from both the deceiver and the deceived, and fusing audio and visual cues, will improve detection accuracy.
Data were gathered from 80 native‑Swedish speaker pairs (160 participants) who engaged in truth‑telling or lying tasks about emotionally salient topics. Each interaction was recorded with high‑resolution video and high‑quality audio. From the video stream, the authors extracted 17 Facial Action Units (AUs) and gaze coordinates using OpenFace. From the audio stream, they computed standard low‑level descriptors such as MFCCs, pitch, energy, and spectral centroid. All features were synchronized at the frame level.
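The frame-level synchronization step can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' pipeline: the frame rates (25 fps video, 100 fps audio descriptors), the 16-dimensional audio descriptor vector, and the use of block averaging to downsample audio features are all assumptions; only the 19 visual dimensions (17 AUs plus 2 gaze coordinates) come from the description above.

```python
import numpy as np

# Sketch: align audio low-level descriptors (assumed 100 frames/s) to the
# video frame rate (assumed 25 fps) by averaging each block of 4 audio
# frames, then concatenate with the per-frame visual features
# (17 AUs + 2 gaze coordinates = 19 dims). All values are synthetic.
rng = np.random.default_rng(0)

n_video_frames = 250                      # 10 s of video at 25 fps
audio_lld = rng.normal(size=(1000, 16))   # 10 s of audio LLDs at 100 fps
visual = rng.normal(size=(n_video_frames, 19))

ratio = audio_lld.shape[0] // n_video_frames     # 4 audio frames per video frame
audio_sync = audio_lld[: ratio * n_video_frames]
audio_sync = audio_sync.reshape(n_video_frames, ratio, -1).mean(axis=1)

frame_features = np.hstack([visual, audio_sync])  # one row per video frame
print(frame_features.shape)                       # (250, 35)
```

In practice the feature extractors (OpenFace for video, an audio toolkit for the LLDs) emit timestamped frames, so alignment would key on timestamps rather than assuming exact integer rate ratios.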
Three fusion strategies were evaluated:
- Early Fusion – After modality‑specific preprocessing (normalization, tokenization), all features were concatenated into a single vector and fed to traditional classifiers (SVM, Random Forest, XGBoost).
- Late Fusion – Separate classifiers were trained for each modality and each participant. Their predictions (probability scores) were then combined by a meta‑classifier (multiclass logistic regression) that learned how to weight each base model.
- Joint Fusion – Modality‑specific streams were processed by parallel CNN‑LSTM arms and merged in a deep neural network before the final prediction layer.
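The late-fusion strategy (the best performer below) can be sketched with scikit-learn. The base/meta split follows the description above: one probability-emitting classifier per modality-participant stream, with a logistic-regression meta-classifier weighting their scores. The data, feature dimensions, and the choice of SVM base learners are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)  # 0 = truthful, 1 = deceptive (synthetic)

# Four synthetic feature streams: {audio, video} x {deceiver, receiver},
# with a weak class-dependent shift so the toy problem is learnable.
streams = {name: rng.normal(size=(n, d)) + 0.4 * y[:, None]
           for name, d in [("audio_deceiver", 16), ("video_deceiver", 19),
                           ("audio_receiver", 16), ("video_receiver", 19)]}

idx_train, idx_test = train_test_split(np.arange(n), test_size=0.25,
                                       random_state=0, stratify=y)

# Step 1: one base classifier per modality/participant stream.
base_probs_train, base_probs_test = [], []
for X in streams.values():
    clf = SVC(probability=True, random_state=0).fit(X[idx_train], y[idx_train])
    base_probs_train.append(clf.predict_proba(X[idx_train])[:, 1])
    base_probs_test.append(clf.predict_proba(X[idx_test])[:, 1])

# Step 2: a logistic-regression meta-classifier weights the base scores.
meta = LogisticRegression().fit(np.column_stack(base_probs_train), y[idx_train])
acc = meta.score(np.column_stack(base_probs_test), y[idx_test])
print(f"late-fusion test accuracy: {acc:.2f}")
```

Note that this sketch fits the meta-classifier on in-sample base scores for brevity; a careful implementation would use out-of-fold base predictions (as in stacked generalization) to avoid leakage into the meta-learner.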
Performance was assessed using 5‑fold cross‑validation, reporting accuracy, precision, recall, and F1‑score. Single‑modality models achieved modest accuracies of 58–63 %. Combining audio and video with early fusion raised accuracy to about 66 %. The late‑fusion approach yielded the best results, reaching 71 % accuracy, a statistically significant improvement over all baselines. Crucially, models that incorporated data from both participants outperformed those that used only the deceiver’s data (71 % vs. 64 %). This empirical finding supports IDT’s claim that deceivers monitor and adapt to the receiver’s reactions, making the receiver’s cues valuable for detection.
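The evaluation protocol described above maps directly onto scikit-learn's `cross_validate`. The sketch below uses synthetic data and an assumed Random Forest classifier; only the 5-fold setup and the four reported metrics come from the summary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Sketch of the evaluation protocol: 5-fold cross-validation reporting
# accuracy, precision, recall, and F1. Data is synthetic; the study
# evaluated the fused multimodal features described above.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = rng.normal(size=(300, 70)) + 0.3 * y[:, None]

scores = cross_validate(RandomForestClassifier(random_state=0), X, y, cv=5,
                        scoring=("accuracy", "precision", "recall", "f1"))
for metric in ("accuracy", "precision", "recall", "f1"):
    print(f"{metric}: {scores['test_' + metric].mean():.2f}")
```

With dyadic data one would additionally group folds by pair, so that no participant pair appears in both a training and a test fold.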
Feature‑importance analysis revealed that specific facial AUs (e.g., AU17 – chin raiser, AU20 – lip stretcher) and gaze fixation duration were strong visual indicators of lying, while audio cues such as reduced pitch variability and slower speech rate were prominent vocal markers. The authors note that gaze metrics, often overlooked in prior work, contributed meaningfully to the classifier.
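A feature-importance analysis of this kind can be sketched with a Random Forest's impurity-based importances. The feature names below mirror the cues reported above, but the data is synthetic, with an artificial signal injected into the AU17 column purely to show how a strong cue surfaces in the ranking; this is not the authors' analysis code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
names = ["AU17_chin_raiser", "AU20_lip_stretcher", "gaze_fixation_dur",
         "pitch_variability", "speech_rate", "mfcc_1"]
y = rng.integers(0, 2, size=500)
X = rng.normal(size=(500, len(names)))
X[:, 0] += 1.0 * y  # inject a class signal into AU17 for illustration

forest = RandomForestClassifier(random_state=0).fit(X, y)
ranked = sorted(zip(names, forest.feature_importances_),
                key=lambda t: t[1], reverse=True)
print(ranked[0][0])  # the injected cue should rank first
```

Impurity-based importances are fast but can favor high-cardinality features; permutation importance on held-out data is a common, more robust alternative.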
Limitations include the relatively small sample size, laboratory‑controlled setting (which may not fully reflect real‑world psychotherapy or forensic contexts), and cultural specificity to Swedish speakers. The joint‑fusion deep network suffered from overfitting and higher computational cost, which limited its performance relative to the simpler late‑fusion pipeline.
Future directions suggested are: (1) employing more sophisticated temporal models (e.g., Transformers) to capture fine‑grained dynamics, (2) integrating physiological signals (heart rate, skin conductance) for richer multimodal representations, (3) developing real‑time deception‑detection tools for clinical or security applications, and (4) expanding the dataset across languages and cultures to test generalizability.
In summary, the study demonstrates that multimodal, dyadic‑aware machine‑learning models—particularly late‑fusion architectures that preserve modality‑specific strengths while learning to combine them—substantially improve deception detection over traditional single‑modality approaches. The work bridges psychological theory and computational methods, offering a promising foundation for applied deception‑detection systems in psychotherapy, law enforcement, and beyond.