Qwen3-ASR Technical Report

Notice: This research summary and analysis were generated automatically with AI. For full accuracy, please refer to the original arXiv paper.

In this report, we introduce the Qwen3-ASR family, which comprises two all-in-one speech recognition models and a novel non-autoregressive speech forced-alignment model. Qwen3-ASR-1.7B and Qwen3-ASR-0.6B are ASR models that support language identification and transcription for 52 languages and dialects. Both leverage large-scale speech training data and the strong audio-understanding ability of their foundation model, Qwen3-Omni. Because ASR models can differ little in open-source benchmark scores yet exhibit significant quality differences in real-world scenarios, we conduct comprehensive internal evaluations in addition to the open-source benchmarks. The experiments reveal that the 1.7B version achieves SOTA performance among open-source ASR models and is competitive with the strongest proprietary APIs, while the 0.6B version offers the best accuracy-efficiency trade-off: it achieves an average TTFT as low as 92 ms and transcribes 2,000 seconds of speech per second at a concurrency of 128. Qwen3-ForcedAligner-0.6B is an LLM-based NAR timestamp predictor that aligns text-speech pairs in 11 languages. Timestamp-accuracy experiments show that it outperforms the three strongest forced-alignment models while offering advantages in efficiency and versatility. To further accelerate community research on ASR and audio understanding, we release these models under the Apache 2.0 license.


💡 Research Summary

The Qwen3‑ASR Technical Report introduces a family of three models built on the Qwen3‑Omni foundation: two all‑in‑one automatic speech recognition (ASR) models (Qwen3‑ASR‑1.7B and Qwen3‑ASR‑0.6B) and a non‑autoregressive forced‑alignment model (Qwen3‑ForcedAligner‑0.6B). All three models are released under the Apache 2.0 license together with a full inference and fine‑tuning toolkit.

Model Architecture
Both ASR models share a common architecture: a pretrained AuT (Attention‑U‑Transformer) encoder derived from Qwen3‑Omni and a large language model (LLM) decoder. The AuT encoder applies 8× temporal down‑sampling to 128‑dimensional Fbank features, yielding a 12.5 Hz token stream. A dynamic flash‑attention window ranging from 1 s to 8 s enables both low‑latency streaming and long‑form offline inference. The 1.7B variant couples the Qwen3‑1.7B LLM (hidden size 1024) with a 300M‑parameter AuT encoder; the 0.6B variant uses the Qwen3‑0.6B LLM (hidden size 896) with a 180M‑parameter encoder, achieving a compact yet powerful design.
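The 12.5 Hz token rate follows directly from the feature and down‑sampling configuration above. A quick sketch of the arithmetic, assuming the standard 10 ms Fbank hop (100 frames per second), which the report's numbers imply but do not state explicitly:

```python
# Frame/token-rate arithmetic for the AuT encoder.
# Assumption: 10 ms Fbank hop (100 Hz frame rate); 8x down-sampling is from the report.
FBANK_HOP_MS = 10
DOWNSAMPLE = 8

frame_rate_hz = 1000 / FBANK_HOP_MS        # 100.0 frames per second
token_rate_hz = frame_rate_hz / DOWNSAMPLE  # 12.5 Hz token stream, matching the report

# The reported 1200 s single-pass maximum would then produce:
tokens = int(1200 * token_rate_hz)          # 15000 encoder tokens
print(token_rate_hz, tokens)
```

This also shows why aggressive down‑sampling matters for long‑form inference: the LLM decoder attends over 15,000 rather than 120,000 acoustic positions for a 20‑minute recording.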

Training Pipeline
Training proceeds in four stages:

  1. AuT pre‑training on ~40 million hours of pseudo‑labeled speech (mostly Chinese and English) under an AED framework, establishing robust acoustic representations.
  2. Omni pre‑training where the models are further trained on 3 trillion multimodal tokens (audio, vision, text) to inherit the broad understanding of Qwen3‑Omni.
  3. Supervised fine‑tuning (SFT) on a disjoint multilingual corpus that includes standard ASR data, streaming‑enhancement data, and context‑biasing examples. The models are trained as pure ASR systems without instruction‑following prompts to avoid injection errors.
  4. Reinforcement learning (RL) using Group Sequence Policy Optimization (GSPO) on ~50k utterances (balanced across Chinese, English, multilingual, and functional data). This stage improves noise robustness, transcription stability, and handling of difficult acoustic conditions.
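Stage 4's GSPO objective differs from token-level PPO-style methods in that importance ratios and clipping operate on whole sequences. A minimal sketch of the core computations as described in the GSPO paper; the length-normalization exponent, the clipping range, and the group-normalized advantages are properties of GSPO generally, not details confirmed by this report:

```python
import math

def gspo_sequence_ratio(logp_new, logp_old):
    """Length-normalized sequence-level importance ratio:
    s(y) = exp((log p_new(y) - log p_old(y)) / |y|)."""
    T = len(logp_new)
    return math.exp((sum(logp_new) - sum(logp_old)) / T)

def group_advantages(rewards):
    """Group-relative advantages: standardize rewards within one group
    of sampled responses for the same input (GRPO/GSPO style)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

def gspo_loss(ratios, advantages, eps=0.2):
    """Clipped surrogate averaged over the group, with clipping applied
    to the sequence-level ratio rather than per-token ratios."""
    terms = [
        min(s * A, max(min(s, 1 + eps), 1 - eps) * A)
        for s, A in zip(ratios, advantages)
    ]
    return -sum(terms) / len(terms)
```

For ASR, the per-sequence reward would plausibly be derived from transcription quality (e.g., negative WER against a reference), which is what makes sequence-level optimization a natural fit here.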

Key Features and Performance

  • Multilingual & Dialect Coverage: Support for 52 languages and dialects (30 languages + 22 Chinese dialects).
  • Robustness: Accurate transcription of noisy speech, singing voice, and music‑backed audio.
  • Long‑Form & Streaming: Handles up to 1,200 seconds of audio in a single inference and provides unified streaming/offline operation.
  • Inference Efficiency: Benchmarks using vLLM show the 0.6B model achieving an average time‑to‑first‑token (TTFT) of 92 ms, a real‑time factor (RTF) of 0.064, and a throughput of 2,000 seconds of audio per second at concurrency 128. The 1.7B model attains state‑of‑the‑art (SOTA) open‑source ASR accuracy and competes with leading commercial APIs.
  • Open‑Source Toolkit: Full code for inference (including multi‑granularity alignment, streaming, and multilingual processing) and reproducible fine‑tuning recipes are provided.
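The reported efficiency numbers are mutually consistent, which is worth checking: an RTF of 0.064 means one stream transcribes 1/0.064 ≈ 15.6 seconds of audio per wall‑clock second, and 128 concurrent streams scale that to 2,000. A small sketch of the arithmetic (idealized, ignoring scheduling overhead):

```python
def audio_seconds_per_wall_second(rtf, concurrency):
    """Aggregate throughput implied by a per-stream real-time factor.

    RTF = processing_time / audio_duration, so one stream handles
    1/rtf seconds of audio per wall-clock second; with `concurrency`
    independent streams the totals add (idealized: no contention).
    """
    return concurrency / rtf

# Reported numbers: RTF 0.064 at concurrency 128 -> 128 / 0.064 = 2000.0
print(audio_seconds_per_wall_second(0.064, 128))
```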

Forced Alignment Model (Qwen3‑ForcedAligner‑0.6B)
This is the first LLM‑based non‑autoregressive forced aligner. The approach reformulates alignment as a slot‑filling task: the transcript is augmented with special timestamp tokens, and the model fills these slots non‑autoregressively rather than decoding timestamps one by one.
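The slot‑filling formulation can be illustrated with a hypothetical sketch; the `<ts>` token name and the word‑level interleaving scheme below are assumptions for illustration, not the report's actual special tokens:

```python
def insert_timestamp_slots(words, slot_token="<ts>"):
    """Interleave hypothetical timestamp-slot tokens with transcript words.
    A NAR aligner can then fill every slot in one parallel pass, instead of
    emitting timestamps autoregressively one position at a time."""
    out = []
    for w in words:
        out += [slot_token, w]
    out.append(slot_token)  # trailing slot marks the final word's end time
    return out

print(insert_timestamp_slots(["hello", "world"]))
# ['<ts>', 'hello', '<ts>', 'world', '<ts>']
```

Filling all slots in parallel is what gives a NAR aligner its efficiency advantage over autoregressive timestamp decoding: latency is roughly one forward pass regardless of transcript length.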

