Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications
Latency-critical speech applications (e.g., live transcription, voice commands, and real-time translation) demand low time-to-first-token (TTFT) and high transcription accuracy, particularly on resource-constrained edge devices. Full-attention Transformer encoders remain a strong accuracy baseline for automatic speech recognition (ASR) because every frame can directly attend to every other frame, which resolves otherwise locally ambiguous acoustics using distant lexical context. However, this global dependency incurs quadratic complexity in sequence length, inducing an inherent “encode-the-whole-utterance” latency profile. For streaming use cases, this causes TTFT to grow linearly with utterance length as the encoder must process the entire prefix before any decoder token can be emitted. To better meet the needs of on-device, streaming ASR use cases, we introduce Moonshine v2, an ergodic streaming-encoder ASR model that employs sliding-window self-attention to achieve bounded, low-latency inference while preserving strong local context. Our models achieve state-of-the-art word error rates across standard benchmarks, attaining accuracy on par with models 6x their size while running significantly faster. These results demonstrate that carefully designed local attention is competitive with the accuracy of full attention at a fraction of the size and latency cost, opening new possibilities for interactive speech interfaces on edge devices.
💡 Research Summary
Moonshine v2 addresses the fundamental latency bottleneck of streaming automatic speech recognition (ASR) by replacing the quadratic‑complexity full‑attention encoder with a linear‑complexity sliding‑window self‑attention mechanism. The authors observe that conventional high‑accuracy ASR models such as Whisper or NVIDIA Parakeet rely on global attention, which allows every acoustic frame to attend to every other frame. While this yields strong contextual disambiguation, it forces the encoder to wait for the entire audio prefix before any decoder token can be emitted, causing time‑to‑first‑token (TTFT) to grow linearly with utterance length.
To solve this, Moonshine v2 introduces a four‑stage architecture:

1. A lightweight audio front‑end that converts 16 kHz raw audio into a 50 Hz feature stream using 5 ms non‑overlapping windows, cepstral mean‑variance normalization, an asinh non‑linearity, and two causal stride‑2 convolutions.
2. An “ergodic” Transformer encoder that contains no absolute or relative positional embeddings, making its computation translation‑invariant in time. Each encoder layer applies sliding‑window attention with a left context of 16 frames (320 ms); the first two and last two layers additionally use a right context of 4 frames (80 ms). This limited look‑ahead bounds algorithmic latency while still providing a modest amount of future information for disambiguation.
3. An adapter that injects learned positional embeddings back into the encoder output and projects the dimensionality to match the decoder.
4. A standard causal decoder with rotary positional embeddings (RoPE) and SwiGLU feed‑forward blocks, which autoregressively generates text tokens and cross‑attends to the adapter features.
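The sliding-window attention pattern described above can be sketched as a boolean mask. This is an illustrative toy in plain Python, not the authors' implementation (the paper notes the actual model uses Flash-Attention's sliding-window backend); the window sizes come from the summary, while the helper name and sequence length are arbitrary:

```python
def sliding_window_mask(seq_len, left=16, right=4):
    """Boolean attention mask: frame i may attend to frame j iff
    i - left <= j <= i + right (True = attend)."""
    return [[-left <= j - i <= right for j in range(seq_len)]
            for i in range(seq_len)]

# The encoder uses left=16 frames (320 ms at the 50 Hz feature rate).
# The first two and last two layers add right=4 frames (80 ms of
# look-ahead); the remaining layers are purely causal (right=0).
mask_lookahead = sliding_window_mask(100, left=16, right=4)
mask_causal = sliding_window_mask(100, left=16, right=0)

# Each row attends to at most left + right + 1 keys, so per-frame
# attention cost is bounded and total cost is linear in seq_len.
assert max(sum(row) for row in mask_lookahead) == 16 + 4 + 1
assert max(sum(row) for row in mask_causal) == 16 + 0 + 1
```

Because the number of attended keys per frame is a constant, the encoder's attention cost grows linearly with utterance length rather than quadratically, which is what bounds TTFT.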
Three model sizes are trained—Tiny (≈22 M parameters), Small (≈69 M), and Medium (≈245 M). Training uses the same data pipeline as the original Moonshine work, augmented with an additional 100 k hours of internal speech data for a total of roughly 300 k hours. The Schedule‑Free optimizer with a 2 × 10⁻³ learning rate runs for 400 k steps on eight NVIDIA H100 GPUs.
Evaluation is performed on the Open ASR leaderboard and standard benchmarks. Results show that Moonshine v2 achieves word‑error rates comparable to Whisper Large v3 while using roughly one‑sixth of the parameters. More importantly, the sliding‑window encoder delivers TTFT under 250 ms on hardware capable of 0.1 TOPS (≈100 GOPS), a threshold often cited for acceptable interactive voice latency. By contrast, a full‑attention encoder on the same hardware exceeds 250 ms after only about 4 seconds of audio. Real‑world response latency—measured from voice‑activity detection to transcript output—averages 320 ms, making the system suitable for live captioning, voice command interfaces, and other latency‑critical applications. On an Apple M3 CPU, the full pipeline runs 2–3× faster than real time, confirming its suitability for on‑device deployment.
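The scaling behind these latency numbers can be illustrated with a back-of-envelope calculation. The sketch below counts only attention-score and attention-weighted-value operations on a 0.1 TOPS device; the frame rate and window size come from the summary, but `D_MODEL` and `N_LAYERS` are hypothetical placeholders (the paper's actual dimensions are not given here), so the absolute milliseconds are illustrative and should not be read as reproducing the paper's ~4-second crossover:

```python
FRAME_RATE = 50          # Hz, encoder feature rate from the paper
D_MODEL = 512            # hypothetical hidden size (placeholder)
N_LAYERS = 8             # hypothetical encoder depth (placeholder)
WINDOW = 16 + 4 + 1      # left 16 + right 4 + self, from the paper
DEVICE_OPS = 1e11        # 0.1 TOPS = 1e11 ops/s

def attn_ops_full(seconds):
    """Attention ops for a full-attention encoder: every frame
    attends to every frame, so cost grows with t**2."""
    t = seconds * FRAME_RATE
    return 4 * t * t * D_MODEL * N_LAYERS  # QK^T + weighted V

def attn_ops_windowed(seconds):
    """Attention ops with a fixed window: cost grows linearly."""
    t = seconds * FRAME_RATE
    return 4 * t * WINDOW * D_MODEL * N_LAYERS

# Doubling the audio quadruples full-attention work but only
# doubles sliding-window work.
assert attn_ops_full(4) / attn_ops_full(2) == 4.0
assert attn_ops_windowed(4) / attn_ops_windowed(2) == 2.0

for s in (1, 4, 16, 64):
    full_ms = 1e3 * attn_ops_full(s) / DEVICE_OPS
    win_ms = 1e3 * attn_ops_windowed(s) / DEVICE_OPS
    print(f"{s:>3}s audio: full ~{full_ms:9.2f} ms, windowed ~{win_ms:6.2f} ms")
```

Whatever the true model dimensions, the ratio argument holds: full-attention encode time grows quadratically with utterance length while the windowed encoder's cost per second of audio stays constant, which is why its TTFT remains bounded.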
The paper also discusses limitations and future directions. The decoder remains autoregressive, so generating a long transcript still incurs token‑by‑token latency. The authors suggest integrating streaming‑friendly loss functions such as CTC, RNN‑T, or Token‑and‑Duration Transducer (TDT) to enable a fully streaming encoder‑only model. Additionally, while the current implementation leverages Flash‑Attention’s sliding‑window backend, further gains could be realized with custom CUDA or CPU kernels optimized for edge devices.
In summary, Moonshine v2 demonstrates that a carefully designed local‑attention encoder, combined with a lightweight adapter and a standard decoder, can deliver high‑accuracy, low‑latency ASR on resource‑constrained hardware. By eliminating absolute positional embeddings and restricting attention to a bounded temporal window, the model achieves linear inference cost, predictable latency, and competitive word‑error rates, opening the door to practical, privacy‑preserving speech interfaces on smartphones, wearables, and other edge platforms. Future work will explore fully streaming decoder alternatives and broader multimodal integration.