Kahaani: A Multimodal Co-Creative Storytelling System

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the original arXiv source.

This paper introduces Kahaani, a multimodal, co-creative storytelling system for children that leverages generative artificial intelligence to address the challenge of sustaining engagement in educational narrative experiences. Here, co-creative means a collaborative creative process in which both the child and Kahaani contribute to the generation of the story. The system combines Large Language Model (LLM), Text-to-Speech (TTS), Text-to-Music (TTM), and Text-to-Video (TTV) generation to produce a rich, immersive, and accessible storytelling experience, and grounds the co-creation process in two classical storytelling frameworks: Freytag’s Pyramid and Propp’s Narrative Functions. Kahaani has three main goals: (1) to help children improve their English skills, (2) to teach important life lessons through story morals, and (3) to help children understand how stories are structured, all in a fun and engaging way. We present evaluations of each AI component, along with a user study involving three parent-child pairs to assess the overall experience and educational value of the system.


💡 Research Summary

The paper presents “Kahaani,” a multimodal, co‑creative storytelling system designed for children that integrates several generative AI components: a large language model (LLM) for text generation, text‑to‑speech (TTS), text‑to‑music (TTM), and text‑to‑video (TTV). The system is grounded in two classic narrative frameworks—Freytag’s Pyramid (exposition, rising action, climax, falling action, resolution) and Propp’s 31 narrative functions—providing a structured scaffold for story creation. Children interact with the system by selecting “cards” that represent Propp functions for each phase of Freytag’s Pyramid and answering guided questions. Their inputs are fed to a “Writer” LLM, which produces a draft story. A “Reviewer” LLM then checks the draft for age‑appropriateness, making edits as needed. The finalized text is simultaneously processed by three downstream pipelines: (1) a TTS engine (XTTS‑v2 or StyleTTS2) creates natural‑sounding narration; (2) a “Film Director” LLM writes a detailed scene guide for each paragraph, while a “Music Director” LLM produces a music guide reflecting the emotional tone; the music guide is turned into background music by a TTM model; (3) an “Animator” LLM uses the scene guide and music to generate video via the CogVideoX‑5b model. The output is a cohesive package of text, audio narration, background music, and animated video.
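The staged pipeline described above — Writer, Reviewer, then the three parallel downstream branches — can be sketched as plain function composition. This is an illustrative outline of the control flow only; the function names and interfaces are hypothetical, not the paper's code, and each callable stands in for an LLM or generation model.

```python
# Illustrative sketch of Kahaani's generation pipeline as described in the
# summary. All stage interfaces here are assumptions for exposition.
from dataclasses import dataclass
from typing import Callable

@dataclass
class StoryPackage:
    text: str
    narration_audio: str = ""  # output of the TTS stage
    music: str = ""            # output of the TTM stage
    video: str = ""            # output of the TTV stage

def run_pipeline(child_inputs: dict,
                 writer: Callable[[dict], str],
                 reviewer: Callable[[str], str],
                 tts: Callable[[str], str],
                 scene_guide: Callable[[str], str],
                 music_guide: Callable[[str], str],
                 ttm: Callable[[str], str],
                 animator: Callable[[str, str], str]) -> StoryPackage:
    draft = writer(child_inputs)        # "Writer" LLM drafts from card choices
    story = reviewer(draft)             # "Reviewer" LLM enforces age-appropriateness
    pkg = StoryPackage(text=story)
    pkg.narration_audio = tts(story)    # branch 1: narration
    scenes = scene_guide(story)         # branch 2: "Film Director" LLM
    mood = music_guide(story)           #           "Music Director" LLM
    pkg.music = ttm(mood)               #           background music from the guide
    pkg.video = animator(scenes, pkg.music)  # branch 3: "Animator" via TTV model
    return pkg
```

In the system itself the three branches run on the finalized text simultaneously; the sequential calls here are only for readability.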

The authors review related work, noting that most prior story‑generation systems focus on text only and lack multimodal or co‑creative capabilities. They position Kahaani as filling this gap, especially for educational contexts where multimodal stimuli can support diverse learning styles, improve attention, and aid comprehension for children with learning disabilities.

Experimental evaluation proceeds in two stages. First, each AI component is assessed individually by six computer‑science students using a 0‑2 scoring rubric. For story generation, six LLMs (Gemma‑2‑9b, Gemma‑2‑27b, Llama‑3.1‑8b, Llama‑3.1‑70b, GPT‑4o, GPT‑4o‑mini) are compared on seven criteria: grammar, linguistic consistency, appropriate language, structural consistency, creativity, adherence to instructions, and naturalness. Pairwise win/tie/loss rates and a Bradley‑Terry ranking reveal that the smaller Gemma‑2‑9b consistently outperforms larger models, while GPT‑4o‑mini is the second‑best. Content moderation is evaluated with a curated set of 100 Gutenberg stories (50 appropriate, 50 inappropriate) plus 50 LLM‑generated stories, testing the reviewer LLM’s ability to filter out violent or explicit material.
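The Bradley-Terry ranking used to order the six LLMs can be fit from the judges' pairwise win counts with a standard iterative (minorization-maximization) update. The sketch below is a minimal, generic implementation of that model, not the paper's code; the item names and counts in any usage are placeholders.

```python
# Minimal Bradley-Terry fit from pairwise win counts (MM iteration).
# wins[(a, b)] = number of comparisons in which a beat b.
def bradley_terry(wins, items, iters=200):
    """Return a dict of positive strength scores; higher means stronger."""
    p = {i: 1.0 for i in items}
    for _ in range(iters):
        new = {}
        for i in items:
            num = sum(w for (a, _), w in wins.items() if a == i)  # total wins of i
            den = 0.0
            for j in items:
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)  # comparisons of i vs j
                if n_ij:
                    den += n_ij / (p[i] + p[j])
            new[i] = num / den if den else p[i]
        total = sum(new.values())
        p = {i: v * len(items) / total for i, v in new.items()}  # normalize scale
    return p
```

Sorting items by their fitted strength reproduces the kind of ranking reported in the paper (e.g. Gemma-2-9b first, GPT-4o-mini second), given the raw win/tie/loss tallies; ties would need a tie-aware extension such as splitting a tie into half a win for each side.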

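With the moderation test set labeled (50 appropriate, 50 inappropriate, plus 50 generated stories), the reviewer LLM's filtering can be scored as a binary classifier. This is a generic metrics sketch under that framing, not the paper's evaluation code; the label convention (True = flagged inappropriate) is an assumption.

```python
# Hedged sketch: score a moderation filter as a binary classifier.
# labels/preds are parallel lists of booleans; True = inappropriate/flagged.
def moderation_metrics(labels, preds):
    tp = sum(l and p for l, p in zip(labels, preds))          # correctly flagged
    fp = sum((not l) and p for l, p in zip(labels, preds))    # wrongly flagged
    fn = sum(l and (not p) for l, p in zip(labels, preds))    # missed unsafe story
    tn = sum((not l) and (not p) for l, p in zip(labels, preds))
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(labels),
    }
```

For a child-facing system, recall on the inappropriate class is the critical number: a false negative (unsafe story passed through) is far costlier than a false positive.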
For TTS, XTTS‑v2 and StyleTTS2 are compared on clarity, pause placement, emotion preservation, intonation, fluency, and pronunciation, using 50 random Gutenberg paragraphs and voice‑cloning tests against two reference speakers. For TTV, CogVideoX‑5b is tested in three visual styles (cartoon, anime, free‑form) on 50 paragraphs, evaluated for naturalness, temporal quality, fine‑grained alignment, overall alignment, and child‑friendliness.

A user study with three parent‑child pairs (children aged 6‑10) assesses the end‑to‑end system. Children answer Likert‑scale questions about story comprehension, recommendation likelihood, animation appeal, narration quality, story quality, overall experience, and learning outcomes. Parents evaluate content appropriateness, child engagement, recommendation likelihood, design satisfaction, overall satisfaction, child reaction, perceived learning, and potential language or creativity improvements. Results show high satisfaction on both sides, with children rating the experience around 4–5 out of 5 and parents noting the system’s educational value and suitability for their children’s age.

The paper discusses limitations: the user study’s small sample size, the need for longer‑term assessments of language skill development, and the current quality ceiling of generated video and music, which may not yet match professional standards. Safety concerns remain, as the reviewer LLM may not catch every inappropriate nuance.

In conclusion, Kahaani demonstrates a viable pipeline that combines narrative theory with state‑of‑the‑art generative models to deliver an engaging, multimodal storytelling experience for children. The systematic evaluation of each component, the comparative analysis of LLM sizes, and the co‑creative interface provide valuable insights for future research in AI‑enhanced educational content creation. Further work should expand user testing, refine multimodal generation quality, and strengthen content moderation to ensure safe, scalable deployment.

