Voice-Based Chatbots for English Speaking Practice in Multilingual Low-Resource Indian Schools: A Multi-Stakeholder Study
Spoken English proficiency is a powerful driver of economic mobility for low-income Indian youth, yet opportunities for spoken practice remain scarce in schools. We investigate the deployment of a voice-based chatbot for English conversation practice across four low-resource schools in Delhi. Through a six-day field study combining observations and interviews, we captured the perspectives of students, teachers, and principals. Findings confirm high demand across all groups, with notable gains in student speaking confidence. Our multi-stakeholder analysis surfaced a tension in long-term adoption vision: students favored open-ended conversational practice, while administrators emphasized curriculum-aligned assessment. We offer design recommendations for voice-enabled chatbots in low-resource multilingual contexts, highlighting the need for more intelligible speech output for non-native learners, one-tap interactions with simplified interfaces, and actionable analytics for educators. Beyond language learning, our findings inform the co-design of future AI-based educational technologies that are socially sustainable within the complex ecosystem of low-resource schools.
💡 Research Summary
This paper presents a multi‑stakeholder field study of a voice‑based English‑speaking chatbot deployed in four low‑resource schools in Delhi, India. Spoken English proficiency is a key driver of economic mobility for low‑income Indian youth, yet opportunities for oral practice are scarce owing to large class sizes, limited exposure to proficient speakers, and infrastructural deficits (only about 57% of schools have functional computers and 54% have reliable internet). In response, the authors built a low‑cost prototype called ChatFriend.
System design: ChatFriend is a web application built with React, hosted as a static site on AWS S3 and delivered via CloudFront. Students interact on school‑provided Android tablets using a “hold‑to‑talk” button. Audio is streamed to OpenAI’s Whisper‑1 for real‑time transcription, and the resulting text is passed to GPT‑4o‑mini, which generates a response from a custom prompt containing the conversation topic and a brief learner profile. The response is filtered through OpenAI’s Moderation API, synthesized into speech with Google Text‑to‑Speech (TTS), and streamed back to the learner. A live transcript of both user and system turns is displayed on screen to aid comprehension. The prototype does not support full‑duplex interaction (simultaneous speaking and listening) and depends on internet connectivity, which was intermittent at the study sites.
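To make the turn‑taking pipeline concrete, the sketch below walks through one conversational turn using the stack the summary names (Whisper‑1, GPT‑4o‑mini, the Moderation API, and Google TTS). It is a simplified, hypothetical reconstruction rather than the authors' code: it assumes a Node.js backend with the official `openai` and `@google-cloud/text-to-speech` SDKs, treats each turn as a single request/response instead of a live stream, and the helper name, prompt wording, and voice settings are invented for illustration.

```typescript
// Hypothetical sketch of one ChatFriend-style turn: ASR -> LLM -> moderation -> TTS.
import fs from "node:fs";
import OpenAI from "openai";
import textToSpeech from "@google-cloud/text-to-speech";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const ttsClient = new textToSpeech.TextToSpeechClient();

async function handleTurn(audioPath: string, topic: string, learnerProfile: string) {
  // 1. Transcribe the learner's recorded audio with Whisper-1.
  const transcript = await openai.audio.transcriptions.create({
    file: fs.createReadStream(audioPath),
    model: "whisper-1",
  });

  // 2. Generate a tutor reply conditioned on the topic and a brief learner profile.
  const chat = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `You are a friendly English-practice partner. Topic: ${topic}. Learner: ${learnerProfile}. Keep replies short and simple.`,
      },
      { role: "user", content: transcript.text },
    ],
  });
  const reply = chat.choices[0].message.content ?? "";

  // 3. Screen the reply with the Moderation API before it reaches the student.
  const moderation = await openai.moderations.create({ input: reply });
  if (moderation.results[0].flagged) {
    return { transcript: transcript.text, reply: "Let's talk about something else!", audio: null };
  }

  // 4. Synthesize speech for playback on the tablet alongside the on-screen transcript.
  const [tts] = await ttsClient.synthesizeSpeech({
    input: { text: reply },
    voice: { languageCode: "en-IN", ssmlGender: "FEMALE" },
    audioConfig: { audioEncoding: "MP3" },
  });

  return { transcript: transcript.text, reply, audio: tts.audioContent };
}
```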
Methodology: The study followed an interpretivist, multiple‑case qualitative design over six days. Day 1 involved supervised, real‑time observation in which students were introduced to the chatbot and immediate feedback was collected. On days 2‑6, students used the chatbot independently while researchers observed usage patterns and conducted reflective interviews. Participants included 23 students (middle‑ and high‑school age), 6 teachers, and 5 principals, selected through purposive maximum‑variation sampling to capture diverse language backgrounds (Hindi, Punjabi, Urdu, etc.) and school types (government‑aided, low‑fee private). Data sources comprised video recordings, field notes, and semi‑structured interview transcripts, which were coded iteratively to identify themes related to experience, enabling and hindering factors, and design implications.
Findings:
- High demand and confidence gains – All stakeholder groups expressed strong desire for more spoken‑English practice. Students reported noticeable increases in speaking confidence after just a few interactions.
- Technical frictions – ASR struggled with child voices, non‑native accents, and code‑switching (Hinglish), leading to transcription errors and broken turn‑taking. TTS defaulted to a neutral American accent; students preferred a slower, locally familiar Indian accent. Intermittent connectivity caused latency spikes, making the “hold‑to‑talk” experience feel sluggish.
- Pedagogical tensions – Students favored open‑ended, free‑conversation practice that allowed them to explore topics of personal interest. Teachers and principals, however, emphasized alignment with curriculum objectives and the need for measurable assessment data. They requested analytics such as turn counts, error types, and progress dashboards to integrate the chatbot into formal grading.
- Design preferences – Participants wanted a one‑tap microphone activation, minimal on‑screen navigation, and optional L1 (Hindi) scaffolds that could appear on demand without disrupting L2 (English) flow. Visual transcripts were helpful for comprehension, especially when speech recognition faltered.
- Sustainability concerns – Principals highlighted hardware maintenance, electricity reliability, and recurring data costs as potential barriers to scaling. They suggested leveraging existing government ICT platforms (e.g., DIKSHA) to reduce duplication.
Design recommendations:
- Speech output: Use slower speech rates and Indian‑accented TTS, and allow users to toggle speed and accent (see the first sketch after this list).
- Interaction simplicity: Implement a single‑tap “record” button, icon‑based topic selection, and hide advanced settings.
- Teacher analytics: Provide a dashboard showing per‑student conversation length, pronunciation error categories, and alignment with lesson objectives.
- Offline resilience: Cache conversation scripts locally and enable limited offline interaction to mitigate network outages (see the second sketch after this list).
- Safety & privacy: Continue using moderation APIs and align data handling with Indian education data policies.
Contributions: The paper offers (1) an empirical, multi‑day, multi‑stakeholder evaluation of a voice‑first English‑practice chatbot in a low‑resource multilingual context; (2) nuanced insight into the trade‑offs between learner‑driven open conversation and curriculum‑driven assessment; (3) a practical checklist for deploying educational chatbots in resource‑constrained settings, including technical, pedagogical, and operational considerations; and (4) broader implications for co‑designing AI‑driven educational tools that are socially sustainable within complex school ecosystems.
In sum, the study demonstrates that voice‑based conversational agents can meaningfully increase spoken‑English confidence among low‑income Indian students, but long‑term adoption hinges on improving speech recognition for accented speech, aligning the tool with curricular goals, and ensuring affordable, reliable infrastructure. The findings provide a roadmap for researchers and practitioners aiming to scale AI‑enabled language learning in similarly constrained, multilingual environments.
Comments & Academic Discussion
Loading comments...
Leave a Comment