Arxiv 2511.10693

Reading time: 5 minutes

📝 Original Info

  • Title: Arxiv 2511.10693
  • ArXiv ID: 2511.10693
  • Date: Pending
  • Authors: The author information stated in the paper was not provided. (Note: please check the actual paper and add the author names.)

📝 Abstract

Voice-based artificial intelligence is increasingly expected to adhere to human social conventions, but can it learn implicit cues that are not explicitly programmed? This study investigates whether state-of-the-art text-to-speech systems have internalized the human tendency to reduce speech rate to convey politeness, a non-obvious prosodic marker. We prompted 22 synthetic voices from two leading AI platforms (AI Studio and OpenAI) to read a fixed script under both "polite and formal" and "casual and informal" conditions and measured the resulting speech duration. Across both AI platforms, the polite prompt produced slower speech than the casual prompt with very large effect sizes, an effect that was statistically significant for all of AI Studio's voices and for a large majority of OpenAI's voices. These results demonstrate that AI can implicitly learn and replicate psychological nuances of human communication, highlighting its emerging role as a social actor capable of reinforcing human social norms.
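
The comparison described in the abstract (durations of the same script under a polite and a casual prompt) can be illustrated with a minimal Python sketch. This is not the authors' analysis: the excerpt does not state which statistical test or effect-size measure the paper used, so Cohen's d with a pooled standard deviation and a Welch-style t statistic are assumptions here. The duration values are hypothetical placeholders for one voice and are not data from the paper.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(polite, casual):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    n1, n2 = len(polite), len(casual)
    s1, s2 = stdev(polite), stdev(casual)
    pooled = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (mean(polite) - mean(casual)) / pooled


def welch_t(polite, casual):
    """Welch's t statistic (does not assume equal variances in the two conditions)."""
    n1, n2 = len(polite), len(casual)
    v1, v2 = stdev(polite) ** 2, stdev(casual) ** 2
    return (mean(polite) - mean(casual)) / sqrt(v1 / n1 + v2 / n2)


# Hypothetical durations in seconds for repeated readings of one fixed script
# by a single synthetic voice. These numbers are illustrative only.
polite_durations = [31.0, 32.0, 33.0, 32.0, 32.0]
casual_durations = [27.0, 28.0, 29.0, 28.0, 28.0]

d = cohens_d(polite_durations, casual_durations)
t = welch_t(polite_durations, casual_durations)

# A positive d means the polite readings took longer, i.e. speech was slower.
print(f"Cohen's d = {d:.2f}, Welch t = {t:.2f}")
```

Because the script is fixed, a longer duration corresponds directly to a slower speech rate, which is why duration alone can serve as the outcome measure in a sketch like this.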

💡 Deep Analysis

📄 Full Content

Generative artificial intelligence (GenAI) systems now mediate millions of daily conversations, fundamentally reshaping communication by moving interactions with machines from purely functional exchanges to nuanced social encounters. As voice-based agents become deeply integrated into sensitive domains such as healthcare, education, and personal companionship, their ability to navigate complex human social conventions is no longer a mere technical feature, but a fundamental requirement for establishing trust, ensuring user acceptance, and guaranteeing efficacy. This new reality raises a critical question: Do these systems, which learn from vast corpora of human data, implicitly acquire the subtle, non-literal rules that govern social conduct? Can artificial intelligence (AI) learn not just what to say, but how to say it in a socially appropriate manner? This study addresses these questions by investigating a well-documented yet implicit human behavior, the tendency to reduce speech rate to convey politeness (Nussinson et al., 2025), as a test case to probe the depth of social learning in state-of-the-art voice AI.

Voice-based AI systems, which operate through spoken language, are becoming more common because speaking and listening are natural forms of interaction, even for users who may not be literate (Carolus et al., 2023). Advances in both language understanding and speech generation have fueled this growth. Large language models (LLMs) enable these systems to understand context and generate comprehensive responses. When combined with speech synthesis, they [...] phonological encoding, while a slower rate increases the interval between information units and may thus require more cognitive resources to maintain context. Both extremes can reduce processing fluency and increase listening effort (Colby & McMurray, 2021). A high cognitive load caused by inappropriate speech rates can impair understanding and memory retention, particularly with complex or unfamiliar topics. However, adaptive strategies, such as pausing at clause boundaries and using prosodic cues to mark important content, can mitigate these effects (Beier et al., 2025).

Speech rate not only influences cognitive processing but also serves as a subtle yet powerful cue in the management of social interactions. The most influential framework for understanding this process is Politeness Theory, developed by Brown and Levinson (1987).

The theory posits that speakers are motivated to protect their own and their interlocutor’s “face,” the public self-image that every person wants to claim. Many speech acts, such as making a request, constitute a Face-Threatening Act (FTA) because they impose on the hearer’s autonomy. To mitigate these threats, speakers employ politeness strategies. A slower speech rate can serve as a key component of such strategies, particularly “negative politeness.” By speaking more slowly, a speaker can signal deference, reduce the perceived imposition of the request, and convey that they are not rushing or pressuring the listener, thereby preserving social harmony (Yusupova, 2025). Furthermore, recent findings suggest that slow speed is associated with psychological distance and, more specifically, that slow-paced speech is associated with greater social distance between the speaker and the interlocutor (Nussinson et al., 2024). As politeness is known to both reflect and create social distance (Stephan et al., 2010), slow-paced speech may be associated with politeness precisely because politeness is a manifestation of social distance (Nussinson et al., 2025). This theoretical lens provides a direct rationale for the hypothesis that polite speech is systematically associated with a reduced tempo.

As AI agents become an integral part of social life, their ability to adhere to human norms of politeness is critical for fostering user trust, acceptance, and effective collaboration (Ribino, 2023). Early research in Human-Computer Interaction, particularly the “Computers as Social Actors” (CASA) paradigm, established the understanding that users naturally apply social rules to machines and respond to cues of politeness or impoliteness (Reeves & Nass, 1996).

Consequently, a significant portion of the work on politeness in AI has focused on implementing explicit politeness strategies, primarily through lexical and syntactic choices.

This includes programming agents to use words like “please” and “thank you,” often driven by concerns that command-based interactions with digital assistants could negatively affect social behavior, especially in children (Burton & Gaskin, 2019). However, focusing solely on lexical markers overlooks the primary channel through which social meaning is conveyed: prosody. Authentic social competence requires more than adherence to explicit rules; it involves mastering the subtle, non-verbal cues that often accompany and even override verbal content.

Prosody (the rhythm, pitch, and rate of speech) is a central channel [...]

Reference

This content is AI-processed based on open access ArXiv data.
