Modeling Sarcastic Speech: Semantic and Prosodic Cues in a Speech Synthesis Framework
Sarcasm is a pragmatic phenomenon in which speakers convey meanings that diverge from literal content, relying on an interaction between semantics and prosodic expression. However, how these cues jointly contribute to the recognition of sarcasm remains poorly understood. We propose a computational framework that models sarcasm as the integration of semantic interpretation and prosodic realization. Semantic cues are derived from a LLaMA 3 model fine-tuned to capture discourse-level markers of sarcastic intent, while prosodic cues are extracted through semantically aligned utterances drawn from a database of sarcastic speech, providing prosodic exemplars of sarcastic delivery. In perceptual evaluations conducted on a speech synthesis testbed, both semantic and prosodic cues independently enhance listeners’ perception of sarcasm, with the strongest effects emerging when the two are combined. These findings highlight the complementary roles of semantics and prosody in pragmatic interpretation and illustrate how modeling can shed light on the mechanisms underlying sarcastic communication.
💡 Research Summary
The paper presents a novel speech‑synthesis framework that explicitly models sarcasm by integrating two complementary cues: semantic content and prosodic realization. To obtain sarcasm‑aware semantic embeddings, the authors fine‑tune a LLaMA 3‑8B language model with LoRA adapters on a large news‑headline sarcasm dataset (≈28 k items). This low‑rank adaptation injects discourse‑level sarcasm markers while keeping the bulk of the model frozen, yielding token‑level embeddings (Eₛ) that capture pragmatic incongruity.
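The low-rank adaptation described above can be illustrated with a minimal sketch. This is not the authors' implementation; it simply shows the core LoRA mechanics on a single linear layer: the pretrained weight W stays frozen while a rank-r update (alpha/r)·B·A is learned, and because B is initialized to zero the adapted layer starts out identical to the frozen one. All dimensions and hyperparameters here are illustrative.

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update (alpha/r) * B @ A.

    Illustrative sketch of the LoRA idea, not the paper's code: in practice
    the adapters would be attached to attention projections of LLaMA 3-8B
    via a library such as PEFT, and A/B trained by backprop.
    """

    def __init__(self, d_in, d_out, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen pretrained weight
        self.A = rng.standard_normal((r, d_in)) * 0.01      # trainable down-projection
        self.B = np.zeros((d_out, r))                       # trainable up-projection, zero-init
        self.scale = alpha / r

    def forward(self, x):
        # Because B starts at zero, the output equals the frozen layer's
        # output until training updates the adapter.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
```

Only A and B (a few thousand parameters here) would receive gradients, which is why this style of adaptation is cheap relative to full fine-tuning.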
For prosody, the system builds a retrieval‑augmented generation (RAG) module. A curated sarcastic speech database (MUStARD++) provides 1,202 utterances (half sarcastic). Each utterance is encoded with the same LLaMA 3‑LoRA encoder to obtain a semantic vector aᵢ. Given an input text, its semantic embedding Eₛ is used as a query; cosine similarity retrieves the top‑K (=3) semantically aligned sarcastic utterances. These retrieved audio clips are passed through WavLM, pooled, and transformed into fixed‑length prosody embeddings (E_wₖ). This automatic retrieval replaces manual reference selection and allows multiple prosodic exemplars to influence synthesis.
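The retrieval step reduces to cosine similarity between the query embedding and the precomputed database vectors. A minimal sketch, assuming embeddings are already available as NumPy arrays (the function name and shapes are illustrative, not from the paper):

```python
import numpy as np

def retrieve_top_k(query, database, k=3):
    """Return indices of the k database rows most cosine-similar to the query.

    query:    (d,) semantic embedding of the input text
    database: (N, d) matrix of semantic vectors, one per stored utterance
    """
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarity to every utterance
    return np.argsort(-sims)[:k]       # indices of the k best matches
```

The returned indices would then select the audio clips whose WavLM-derived prosody embeddings condition the synthesizer; for a database of ~1,200 items this brute-force scan is trivially fast and needs no approximate-nearest-neighbor index.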
The synthesis backbone is VITS. Phoneme embeddings (E_p) serve as queries in a cross‑attention layer whose keys and values are the semantic embeddings Eₛ. The attention output H therefore encodes phoneme‑aligned semantic information. A linear projection of the summed prosody embeddings is added to H, producing decoder states Z that are conditioned on phoneme, meaning, and prosody simultaneously.
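The conditioning path can be sketched as a single-head cross-attention followed by an additive prosody term. This is a simplified stand-in for the VITS-internal module (dimensions, single-head attention, and the projection matrix W_proj are all assumptions for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def condition_decoder(E_p, E_s, E_w_list, W_proj):
    """Cross-attend phonemes (queries) to semantics (keys/values), add projected prosody.

    E_p:      (T_phon, d)  phoneme embeddings, used as attention queries
    E_s:      (T_sem, d)   semantic embeddings, used as keys and values
    E_w_list: list of (d_w,) prosody embeddings from the K retrieved clips
    W_proj:   (d_w, d)     linear projection for the summed prosody embedding
    """
    d = E_p.shape[-1]
    attn = softmax(E_p @ E_s.T / np.sqrt(d))  # (T_phon, T_sem) attention weights
    H = attn @ E_s                            # phoneme-aligned semantic information
    E_w = np.sum(E_w_list, axis=0)            # sum the K retrieved prosody exemplars
    Z = H + E_w @ W_proj                      # decoder states: phoneme + meaning + prosody
    return Z
```

In the actual model the attention would be multi-head and learned end-to-end inside VITS; the sketch only makes explicit how Z ends up conditioned on all three information streams at once.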
Four experimental conditions are evaluated: (1) Baseline VITS (no sarcasm cues), (2) Semantic‑only (LLaMA 3‑LoRA embeddings), (3) Prosody‑only (RAG‑retrieved prosodic exemplars), and (4) Combined (both cues). Additionally, the Semantic‑only condition is broken down into three embedding sources (BERT, vanilla LLaMA 3, LoRA‑fine‑tuned LLaMA 3) to assess the impact of the adaptation. Speech samples (≈200 per condition) are synthesized and presented to 30 native English listeners, who rate perceived sarcasm on a 5‑point Likert scale.
Results show that both semantic and prosodic cues independently increase sarcasm perception relative to the baseline (p < 0.01). The combined condition yields the highest average rating (4.3/5), confirming a synergistic effect. Among semantic encoders, LoRA‑fine‑tuned LLaMA 3 outperforms BERT (3.7) and vanilla LLaMA 3 (3.8), demonstrating that low‑rank adaptation successfully captures sarcasm‑specific discourse cues. The prosody‑only condition benefits from using multiple retrieved exemplars; listeners report more natural and “sarcastic‑sounding” intonation compared with a single‑reference approach.
The study contributes (1) a concrete method for disentangling and recombining meaning and prosody in sarcastic speech, (2) evidence that low‑rank LLM adaptation is an efficient way to obtain pragmatic semantic features, (3) a scalable retrieval‑based prosody conditioning mechanism, and (4) perceptual validation that meaning and prosody jointly shape sarcasm perception. Limitations include reliance on English‑only corpora, a fixed K value for retrieval, and the subjective nature of listener ratings. Future work is outlined: extending the framework to multilingual settings, learning dynamic retrieval weights, incorporating visual cues (facial expressions), and integrating the model into interactive dialogue agents for real‑time sarcastic generation and feedback.