Calliope: A TTS-based Narrated E-book Creator Ensuring Exact Synchronization, Privacy, and Layout Fidelity
A narrated e-book combines synchronized audio with digital text, highlighting the currently spoken word or sentence during playback. This format supports early literacy and assists individuals with reading challenges, while also allowing general readers to seamlessly switch between reading and listening. With the emergence of natural-sounding neural Text-to-Speech (TTS) technology, several commercial services have been developed to leverage this technology for converting standard text e-books into high-quality narrated e-books. However, no open-source solutions currently exist to perform this task. In this paper, we present Calliope, an open-source framework designed to fill this gap. Our method leverages state-of-the-art open-source TTS to convert a text e-book into a narrated e-book in the EPUB 3 Media Overlay format. The method introduces several innovations: audio timestamps are captured directly during TTS, ensuring exact synchronization between narration and text highlighting; the publisher’s original typography, styling, and embedded media are strictly preserved; and the entire pipeline operates offline. This offline capability eliminates recurring API costs, mitigates privacy concerns, and avoids copyright compliance issues associated with cloud-based services. The framework currently supports the state-of-the-art open-source TTS systems XTTS-v2 and Chatterbox. A potential alternative approach involves first generating narration via TTS and subsequently synchronizing it with the text using forced alignment. However, while our method ensures exact synchronization, our experiments show that forced alignment introduces drift between the audio and text highlighting that is significant enough to degrade the reading experience. Source code and usage instructions are available at https://github.com/hugohammer/TTS-Narrated-Ebook-Creator.git.
💡 Research Summary
Calliope is an open‑source framework that automatically converts a standard EPUB e‑book into a narrated EPUB 3 Media Overlay publication, leveraging state‑of‑the‑art neural text‑to‑speech (TTS) models while preserving the original layout, styling, and embedded media. The authors identify a gap in the current ecosystem: commercial services (e.g., ElevenReader, NaturalReader) provide high‑quality narrated books but require cloud APIs, incur recurring costs, and expose copyrighted or sensitive content to third‑party servers. Existing open‑source tools either generate only audiobooks (e.g., epub2tts), rely on forced‑alignment techniques that suffer from synchronization drift (e.g., Syncabook, Storyteller), or reconstruct the EPUB container in a way that strips CSS and layout information.
Calliope addresses these shortcomings through three tightly coupled phases.
Phase 1 – Text extraction and layout preservation: The system unpacks the EPUB (an Open Container Format zip), parses each XHTML content document, and traverses the DOM to locate block‑level elements (paragraphs, headings). Text is Unicode‑normalized, punctuation‑standardized, and segmented into sentences using a robust tokenizer. For each sentence a globally unique identifier is injected as a <span id="…"> element, preserving the parent element’s CSS classes and inline styles. This “surgical” modification ensures that the visual appearance of the original publisher’s design remains untouched while providing precise anchors for synchronization.
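The segmentation and anchor-injection step can be illustrated with a stdlib-only Python sketch. Note that this is a simplified stand-in: the regex splitter below approximates the paper's "robust tokenizer," and the `chapter-sNNNN` ID scheme is an assumed naming convention, not necessarily the one Calliope uses.

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Unicode-normalize and standardize punctuation and whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", text).strip()

def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after terminal punctuation followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def inject_spans(sentences: list[str], chapter_id: str) -> list[str]:
    """Wrap each sentence in a <span> carrying a globally unique id anchor."""
    return [
        f'<span id="{chapter_id}-s{i:04d}">{s}</span>'
        for i, s in enumerate(sentences, start=1)
    ]
```

In the real pipeline the spans are injected into the existing DOM nodes rather than built as strings, so the parent element's classes and inline styles are left untouched.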
Phase 2 – Neural TTS synthesis and deterministic timestamp generation: Calliope supports two open‑source neural TTS back‑ends – XTTS‑v2 (Coqui, an autoregressive GPT‑2‑style transformer with a HiFi‑GAN vocoder) and Chatterbox (Resemble AI, a non‑autoregressive flow‑matching model built on a 0.5 B‑parameter Llama backbone). Both models have strict context‑window limits and can raise token‑overflow errors for unusually long or dense inputs. To guarantee stable execution, the pipeline implements a two‑tier strategy: (i) a pre‑emptive heuristic that splits sentences exceeding a safety length (≈200 characters) at the nearest whitespace, and merges very short sentences (below ≈60 characters) to avoid under‑utilization; (ii) a reactive error handler that catches runtime token‑overflow exceptions, recursively splits the offending segment, and concatenates the resulting audio fragments.
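The two-tier strategy can be sketched as follows. This is a minimal illustration using the thresholds reported in the summary (≈200 and ≈60 characters); the actual exception type raised on token overflow is backend-specific, so `RuntimeError` here is an assumed stand-in, and `tts_fn` is a hypothetical synthesis callable.

```python
MAX_CHARS = 200   # pre-emptive safety length
MIN_CHARS = 60    # merge threshold for very short sentences

def split_long(sentence: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Tier 1a: recursively split an over-long sentence at the nearest whitespace."""
    if len(sentence) <= max_chars:
        return [sentence]
    cut = sentence.rfind(" ", 0, max_chars)
    if cut <= 0:                          # no whitespace found: hard split
        cut = max_chars
    return [sentence[:cut]] + split_long(sentence[cut:].lstrip(), max_chars)

def merge_short(segments: list[str], min_chars: int = MIN_CHARS) -> list[str]:
    """Tier 1b: fold the following segment into a too-short predecessor."""
    merged: list[str] = []
    for seg in segments:
        if merged and len(merged[-1]) < min_chars:
            merged[-1] = merged[-1] + " " + seg
        else:
            merged.append(seg)
    return merged

def synthesize_safe(tts_fn, segment: str) -> list:
    """Tier 2: on token overflow, bisect the segment near a word boundary
    and recurse, then concatenate the resulting audio fragments."""
    try:
        return [tts_fn(segment)]
    except RuntimeError:                  # assumed stand-in for token overflow
        if len(segment) < 2:
            raise
        mid = segment.rfind(" ", 0, len(segment) // 2 + 1)
        if mid <= 0:
            mid = len(segment) // 2
        left, right = segment[:mid], segment[mid:].lstrip()
        return synthesize_safe(tts_fn, left) + synthesize_safe(tts_fn, right)
```

The pre-emptive tier keeps most inputs inside the model's context window cheaply; the reactive tier guarantees termination even for pathological inputs the heuristic misses.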
Each generated waveform is post‑processed with a short (≈50 ms) fade‑out and a fixed silence padding (≈0.15 s) to eliminate click artifacts and to provide a natural pause between sentences. The overall chapter audio is formed by sequential concatenation of these processed segments. Because the synthesis is performed sentence‑by‑sentence, timestamps are computed deterministically: the start time of a sentence is the sum of durations of all preceding audio segments plus their padding, and the end time is extended to the start of the next sentence, yielding a “gapless” visual experience where the highlight never disappears during pauses. This deterministic approach eliminates the drift observed in forced‑alignment pipelines, achieving sub‑millisecond alignment accuracy.
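The deterministic timestamp arithmetic described above amounts to a cumulative sum over padded segment durations, with each clip's end extended to the next clip's start. A minimal sketch, assuming the ≈0.15 s inter-sentence padding reported in the summary:

```python
def compute_clips(durations_s: list[float], padding_s: float = 0.15) -> list[tuple[float, float]]:
    """Deterministic clip times: a sentence starts where the previous padded
    segment ended; its end is extended to the next start, so the text
    highlight never disappears during pauses (gapless)."""
    starts, t = [], 0.0
    for d in durations_s:
        starts.append(t)
        t += d + padding_s          # segment duration plus silence padding
    return [
        (round(starts[i], 3), round(starts[i + 1] if i + 1 < len(starts) else t, 3))
        for i in range(len(starts))
    ]
```

Because the clip boundaries are derived from the audio segments themselves rather than estimated afterwards, there is no alignment model in the loop that could accumulate drift.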
Phase 3 – Packaging, SMIL generation, and accessibility enhancements: For each XHTML chapter a corresponding SMIL file is generated, mapping each `<span id>` to its calculated `clipBegin`/`clipEnd` interval in the chapter's audio file, as required by the EPUB 3 Media Overlays specification. The package document is updated with the corresponding media-overlay references and duration metadata, and the result is repackaged as a valid EPUB 3 container.
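The shape of the generated SMIL can be sketched with the stdlib XML tools. The element structure (`seq` of `par` elements, each pairing a `text` reference with an `audio` clip) follows the EPUB 3 Media Overlays specification; the exact attributes and file layout emitted by Calliope may differ.

```python
import xml.etree.ElementTree as ET

SMIL_NS = "http://www.w3.org/ns/SMIL"
EPUB_NS = "http://www.idpf.org/2007/ops"

def build_smil(chapter_href: str, audio_href: str, clips: list[tuple[str, float, float]]) -> str:
    """clips: list of (span_id, begin_seconds, end_seconds)."""
    ET.register_namespace("", SMIL_NS)
    ET.register_namespace("epub", EPUB_NS)
    smil = ET.Element(f"{{{SMIL_NS}}}smil", {"version": "3.0"})
    body = ET.SubElement(smil, f"{{{SMIL_NS}}}body")
    seq = ET.SubElement(body, f"{{{SMIL_NS}}}seq", {f"{{{EPUB_NS}}}textref": chapter_href})
    for span_id, begin, end in clips:
        # one <par> per sentence: text anchor + audio clip with clock values
        par = ET.SubElement(seq, f"{{{SMIL_NS}}}par")
        ET.SubElement(par, f"{{{SMIL_NS}}}text", {"src": f"{chapter_href}#{span_id}"})
        ET.SubElement(par, f"{{{SMIL_NS}}}audio",
                      {"src": audio_href,
                       "clipBegin": f"{begin:.3f}s",
                       "clipEnd": f"{end:.3f}s"})
    return ET.tostring(smil, encoding="unicode")
```

A reading system plays the `audio` clip of each `par` while highlighting the element addressed by the sibling `text/@src` fragment, which is exactly the `<span id>` injected in Phase 1.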