MORE: Multi-Objective Adversarial Attacks on Speech Recognition


📝 Abstract

The emergence of large-scale automatic speech recognition (ASR) models such as Whisper has greatly expanded their adoption across diverse real-world applications. Ensuring robustness against even minor input perturbations is therefore critical for maintaining reliable performance in real-time environments. While prior work has mainly examined accuracy degradation under adversarial attacks, robustness with respect to efficiency remains largely unexplored. This narrow focus provides only a partial understanding of ASR model vulnerabilities. To address this gap, we conduct a comprehensive study of ASR robustness under multiple attack scenarios. We introduce MORE, a multi-objective repetitive doubling encouragement attack, which jointly degrades recognition accuracy and inference efficiency through a hierarchical staged repulsion-anchoring mechanism. Specifically, we reformulate multi-objective adversarial optimization into a hierarchical framework that sequentially achieves the dual objectives. To further amplify effectiveness, we propose a novel repetitive encouragement doubling objective (REDO) that induces duplicative text generation by maintaining accuracy degradation and periodically doubling the predicted sequence length. Overall, MORE compels ASR models to produce incorrect transcriptions at a substantially higher computational cost, triggered by a single adversarial input. Experiments show that MORE consistently yields significantly longer transcriptions while maintaining high word error rates compared to existing baselines, underscoring its effectiveness in multi-objective adversarial attacks.



Automatic speech recognition (ASR) models, exemplified by the Whisper family [26], have become integral to a wide range of applications, including virtual assistants, real-time subtitling, clinical documentation, and spoken navigation [11]. Despite their success, the reliability of these systems in practical deployments remains fragile: even small adversarial perturbations can substantially degrade recognition accuracy or disrupt inference efficiency, for instance by causing misinterpretation of user commands or inducing denial-of-service behaviors. These vulnerabilities underscore the need for a systematic examination of ASR robustness across both accuracy and efficiency, which is essential for ensuring dependable performance in real-world, time-sensitive environments.

Most prior work has been dedicated to accuracy robustness under adversarial attacks [28,27,24,21,9,33,13]. While these efforts help in understanding the accuracy vulnerabilities of ASR models, their efficiency robustness, i.e., the ability to maintain real-time inference under adversarial conditions, remains largely unexplored. Such efficiency is critical, as adversaries can exploit it to degrade system responsiveness, e.g., causing systems to output unnaturally long transcripts, severely impacting usability and making the inference process excessively time-consuming. Therefore, enhancing and evaluating the efficiency robustness of ASR models is crucial to ensure their practicality in real-time, user-facing systems.

As efficiency robustness plays a pivotal role in the real-world applicability of deep learning models, there is a growing need to systematically assess it. Recent research has proposed adversarial attack methods to evaluate efficiency robustness in various domains, including computer vision [19,6], machine translation [5], natural language processing [7,18,10,17], and speech generation models [12]. However, research on the efficiency robustness of ASR models under attacks remains critically scarce, with SlothSpeech [15] standing as the only known effort. Yet, SlothSpeech does not consider the impact of efficiency attacks on accuracy and does not systematically explore adversarial output patterns. This leaves the efficiency dimension of ASR robustness insufficiently examined and calls for further investigation.

Nevertheless, the robustness of both accuracy and efficiency in ASR models still lags considerably behind human speech recognition performance [15,13]. This stark disparity underscores the need for a more holistic investigation into the vulnerabilities of these models. In this paper, we conduct a comprehensive study of the robustness of the Whisper family, a set of representative large-scale ASR models, with respect to both accuracy and efficiency.

To this end, we propose a novel Multi-Objective Repetitive Doubling Encouragement attack approach (MORE) that simultaneously targets both accuracy and efficiency vulnerabilities. Unlike prior attacks that optimize a single objective, MORE incorporates a multi-objective repulsion-anchoring optimization strategy that unifies accuracy-based and efficiency-based adversarial attacks within a single network. Motivated by natural human speech repetitions and the repetitive decoding loops observed in transformer-based models [35], we introduce a repetitive encouragement doubling objective (REDO) that periodically promotes duplicative text generation while maintaining accuracy degradation, producing elongated transcriptions. An asymmetric interleaving mechanism further reinforces periodic context doubling, while an EOS suppression objective discourages early termination.

The contributions of this paper include: (a) this paper presents the first unified attack approach that jointly targets both the accuracy and efficiency robustness of large-scale ASR models via a multi-objective repulsion-anchoring optimization strategy; (b) we propose REDO, which bridges efficiency with accuracy gradients by guiding the accuracy-modified gradients towards repetitive elongated semantic contexts, thereby inducing incorrect yet extended transcriptions; and (c) we provide a comprehensive comparative study of diverse attack methods with insightful findings on balancing accuracy and efficiency degradation. Extensive experiments demonstrate that the proposed MORE consistently outperforms all baselines in producing longer transcriptions while maintaining strong accuracy attack performance.

Adversarial Attacks on Speech Recognition. Automatic speech recognition has been extensively studied with respect to its vulnerability to attacks. These attacks primarily seek to compromise transcription accuracy by introducing typically subtle perturbations into speech inputs [15,23,29,32,36,14]. Notable examples include attacks in the MFCC feature domain [30,8], targeted attacks designed to trigger specific commands [3], and perturbations constrained to ultrasonic frequency bands (e.g., DolphinAttack [37]). Most prior works on attacking ASR have concentrated on traditional architectures, such as CNN or Kaldi-based systems [31], with limited exploration of modern large-scale transformer-based ASR models.

Recent advances in ASR have been driven by the emergence of large-scale models, notably OpenAI Whisper [26], a transformer-based encoder-decoder architecture trained on large-scale datasets (680K hours of data), demonstrating greater robustness and generalization across diverse speech scenarios. Consequently, there has been an increasing research interest in evaluating the adversarial robustness of Whisper, particularly focusing on accuracy-oriented attacks. Such efforts include universal attacks [28,27], targeted Carlini & Wagner (CW) attacks [24] and gradient-based methods, i.e., projected gradient descent (PGD) [21], momentum iterative fast gradient sign method (MI-FGSM) [9], variance-tuned momentum iterative fast gradient sign method (VMI-FGSM) [33], as well as speech-aware adversarial attacks [13]. However, most existing approaches focus only on accuracy robustness and overlook vulnerabilities in inference efficiency, which can be exploited through decoding manipulation. SlothSpeech [15] represents the only prior efficiency-focused attack in ASR, but it does not jointly consider accuracy degradation or structured repetition, limiting its ability to assess multi-dimensional robustness.

Motivations and Applications. Different from previous attacks, our proposed MORE systematically evaluates and undermines both accuracy and efficiency within a single adversarial network, offering a comprehensive understanding of large-scale ASR models’ vulnerabilities that previous single-objective methods cannot provide. The significance of studying the adversarial robustness of ASR models, particularly Whisper, is amplified by their potential deployment in hate speech moderation [20,34] and private speech data protection. Practically, our proposed MORE can be applied to distort the transcription of harmful or private speech, preventing ASR systems from reliably converting such content into readable text. By inducing incorrect and excessively long transcriptions, MORE exposes decoding weaknesses that are not revealed by accuracy-only attacks, offering a more comprehensive view of ASR vulnerability.

Victim model. We consider a raw speech input represented as a sequence X = [x_1, x_2, ..., x_T]. Its corresponding ground-truth transcription is a sequence of text tokens Y = [y_1, y_2, ..., y_L]. The target ASR model is denoted by a function f(•) that maps a speech sequence to a predicted transcription, i.e., f(X) = Ŷ. The model vocabulary is denoted by V, and EOS ∈ V is the end-of-sequence token. Our objective is to construct an adversarial perturbation δ such that the perturbed input X + δ triggers harmful behavior during decoding.

Attack objective. Most existing adversarial attacks on ASR aim solely to maximize transcription error. However, practically disruptive attacks must also degrade inference efficiency, especially in real-time ASR systems where excessive decoding time can break user interactions. We therefore formulate a dual-objective optimization targeting both transcription accuracy and computational efficiency:

    max_δ WER(f(X + δ), Y)   and   max_δ |f(X + δ)|,

where WER(•) denotes the word error rate and |f(•)| denotes the length of the predicted sequence. This formulation explicitly seeks perturbations that (i) increase transcription error relative to the ground truth and (ii) induce excessively long outputs, thereby amplifying computational overhead.

Perturbation constraint. We impose both energy- and peak-based constraints for imperceptibility. A standard measure is the signal-to-noise ratio (SNR), which compares the energy of the signal and the perturbation:

    SNR(X, δ) = 10 log_10 ( ‖X‖_2² / ‖δ‖_2² ).

While SNR constrains overall perturbation energy, it may still allow short, high-amplitude distortions. To avoid this, we additionally bound the perturbation’s peak amplitude using the ℓ∞ norm:

    ‖δ‖∞ ≤ ϵ.

This ℓ∞ constraint ensures that no single sample deviates excessively, which aligns with psychoacoustic masking principles. The adversarial example is thus defined as X_adv = X + δ with δ ∈ ∆.
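Both constraints are easy to make concrete. The following is a minimal numpy sketch, assuming 16 kHz audio as in the experiments and the ϵ = 0.002 budget from Sec. 4; the function names are ours, not the paper's:

```python
import numpy as np

def snr_db(x, delta):
    """Signal-to-noise ratio in dB: energy of the signal vs. the perturbation."""
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum(delta ** 2))

def project_linf(delta, eps):
    """Project the perturbation onto the l-infinity ball of radius eps,
    so that no single sample deviates excessively."""
    return np.clip(delta, -eps, eps)

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)                 # 1 s of toy audio at 16 kHz
delta = project_linf(0.01 * rng.standard_normal(16000), eps=0.002)
x_adv = x + delta                              # adversarial example X_adv = X + delta
print(f"SNR = {snr_db(x, delta):.1f} dB, peak = {np.abs(delta).max():.4f}")
```

Enforcing the ℓ∞ bound after every gradient step (rather than once at the end) is the usual PGD convention and keeps the perturbation feasible throughout the attack.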

The proposed MORE attack is motivated by the autoregressive nature of ASR models and the different optimization dynamics of our two goals: reducing transcription accuracy and prolonging decoding for efficiency degradation, as illustrated in Fig. 1. In autoregressive models, each predicted token influences all future predictions, with the end-of-sequence (EOS) token being particularly sensitive; small perturbations to its logits can drastically alter when decoding stops [28,24], yet it receives a sparse gradient signal compared with ordinary tokens. The accuracy attack objective, in contrast, distributes across many token positions, encouraging mis-transcriptions and resulting in a relatively large feasible adversarial set, as many incorrect transcripts are possible. The efficiency attack objective, however, mainly targets the non-stopping behavior associated with a single EOS token, where gradients are narrowly concentrated and typically smaller in magnitude compared with those of the broader accuracy objective. Because accuracy gradients are broad and efficiency gradients are sharp and concentrated, combining them in a single-step optimization often causes one objective to dominate. This makes direct multi-objective optimization unstable.

To address this, our proposed MORE uses a hierarchical two-stage strategy consisting of a repulsion stage for accuracy degradation and an anchoring stage for efficiency degradation. The repulsion stage forces the model away from the correct transcription. The anchoring stage then exploits remaining degrees of freedom to extend decoding.

Formally, we approximate the hierarchical formulation in Eq. 4 using a two-stage repulsion-anchoring method. In the repulsion stage, we maximize a differentiable proxy of WER. In the anchoring stage, we extend the decoded sequence length while preserving the high error rate obtained in the repulsion stage. The following sections describe the optimization procedure for each in detail.

This staged hierarchical design avoids forcing accuracy and efficiency gradients to compete simultaneously and provides a stable optimization path.
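The staged procedure can be sketched compactly. In the sketch below the two loss gradients are abstracted as callables (hypothetical hooks; the real attack backpropagates through the ASR model as in Algorithm 1), and the step size and step counts are illustrative:

```python
import numpy as np

def more_attack(x, y, grad_acc, grad_eff, eps=0.002, alpha=5e-4, K=60, K_a=50):
    """Hierarchical repulsion-anchoring sketch: K_a accuracy (repulsion) steps,
    then K - K_a efficiency (anchoring) steps, with an l_inf projection after
    every update. `grad_acc` / `grad_eff` stand in for the gradients of the
    accuracy and efficiency losses w.r.t. the perturbation."""
    delta = np.zeros_like(x)
    for step in range(K):
        if step < K_a:                 # Stage 1: repulsion (degrade accuracy)
            g = grad_acc(x + delta, y)
        else:                          # Stage 2: anchoring (extend decoding)
            g = grad_eff(x + delta)
        delta = np.clip(delta + alpha * np.sign(g), -eps, eps)  # PGD-style step
    return delta
```

The key design point is sequencing rather than summing: the two gradients never compete within a single update, which matches the stability argument above.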

The first pillar of our MORE attack is the repulsion stage, which focuses on degrading transcription accuracy. The repulsion stage applies a standard gradient-based accuracy-degradation attack using cross-entropy (CE) as a differentiable proxy for WER. Minimizing negative CE reduces the probability of ground-truth tokens, pushing the model toward incorrect outputs and increasing WER.

The accuracy attack loss is formulated as:

    L_acc = −CE(f(X + δ), Y) = Σ_{l=1}^{|Y|} log P(y_l | y_{<l}, X + δ).

Taking gradient steps w.r.t. this loss function encourages the ASR model to output incorrect tokens, which directly correlates with an increase in the ultimate WER. This serves as the initial repulsion stage of our attack, destabilizing the decoding trajectory and preparing the model for the subsequent efficiency attack.
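As a toy illustration of a repulsion step, gradient ascent on the cross-entropy of a single token position steadily drains probability mass from the ground-truth token. Here we perturb logits directly, standing in for gradients that, in the real attack, flow back to the audio perturbation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 0.5, 0.1])   # toy final logits; index 0 = ground truth
y_true = 0
p_start = softmax(logits)[y_true]
for _ in range(20):
    p = softmax(logits)
    grad_ce = p - np.eye(3)[y_true]  # d(CE)/d(logits) for the true token
    logits += 0.5 * grad_ce          # ascend the CE loss (descend negative CE)
p_end = softmax(logits)[y_true]
print(f"P(ground truth): {p_start:.3f} -> {p_end:.3f}")
```

The same ascent direction, projected through the model onto δ and sign-quantized, is what a PGD-style repulsion step applies to the waveform.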

Complementing the accuracy degradation, MORE’s efficiency attack targets the now-vulnerable model by anchoring it to generate excessively long and computationally expensive transcriptions. This anchoring stage is accomplished through the two components below.

EOS suppression. Decoding normally terminates when EOS is predicted; by penalizing the probability of this token, we can deceive the model into prolonging the decoding process indefinitely, often producing irrelevant or meaningless tokens. Penalizing only the EOS token, however, is insufficient, since its probability is typically dominant at the final decoding step. To enhance attack effectiveness, we therefore reduce the likelihood of the EOS token and, in addition, increase the probability of the competing token with the second-largest likelihood. Reinforcing this alternative token not only diminishes EOS dominance but also guides the model toward continued generation. The EOS-suppression loss is thus formulated as:

    L_EOS = P_L^EOS − P_L^z,

where P_L^EOS is the probability that the model emits EOS at the final position L, and z is the token with the second-largest probability at position L:

    z = argmax_{v ∈ V \ {v*}} P_L^v,   with v* = argmax_{v ∈ V} P_L^v.
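A minimal sketch of this adjustment as a scalar loss — the probability assigned to EOS minus that of the runner-up token z — on a toy final-position distribution (values are illustrative, not from the model):

```python
import numpy as np

def eos_suppression_loss(p_last, eos_id):
    """Sketch of the EOS-suppression loss: minimizing P[EOS] - P[z] lowers the
    EOS probability while raising the competing runner-up token z."""
    z = int(np.argsort(p_last)[-2])   # token with the second-largest probability
    return p_last[eos_id] - p_last[z]

p_last = np.array([0.70, 0.20, 0.06, 0.04])  # toy distribution; index 0 is EOS
loss = eos_suppression_loss(p_last, eos_id=0)
```

In the attack this scalar is differentiated w.r.t. the waveform perturbation, so the update simultaneously pushes probability away from EOS and toward the continuation token.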

We denote by L the output sequence length, and P_L^v is the model’s predicted probability of token v being produced at output position L. This dual adjustment discourages the model from selecting EOS while nudging it toward an alternative continuation: minimizing the loss reduces EOS dominance and favors continuation tokens, which in turn prolongs decoding.

Repetitive Encouragement Doubling Objective (REDO). While effective, simple EOS suppression can lead to unstable optimization or low-confidence, random outputs. To introduce a more structured and potent method for sequence elongation, we propose a novel repetitive encouragement doubling objective (REDO), inspired by repetition loops observed in transformer models [35] and natural speech disfluencies. Transformer models are known to enter self-reinforcing repetition loops: once a sentence with high generation probability is produced, the model tends to reproduce it in subsequent steps, as its presence in the context further boosts its likelihood of being selected again [35]. This recursive amplification leads to a self-sustaining loop of repetition, wherein repeated sentences reinforce their own future generation by dominating the context.

Our REDO leverages this mechanism to force long structured repetitions, thereby reliably increasing sequence length, as demonstrated in Figure 1. At each period, REDO constructs a duplicated version of the earlier decoded segment and uses CE to encourage the model to reproduce the extended sequence consistently. This produces stable semantic repetition and much longer sequences than EOS suppression alone.

Specifically, given an initial decoding output Ŷ, we construct a new target sequence Ȳ that contains a repeated segment. We then force the model to predict this new, longer sequence using a cross-entropy objective. The target sequence Ȳ_i for step i is constructed as:

    Ȳ_i = Ŷ^(⌊i/D⌋) ∥ Ŷ^(⌊i/D⌋),

where ∥ denotes sequence concatenation, Ŷ^(⌊i/D⌋) (of length L_{⌊i/D⌋}) is the decoded sequence at step ⌊i/D⌋, and D is the doubling period, controlling how frequently the sequence is duplicated. The floor function ⌊i/D⌋ ensures that the repeated segment remains fixed within each interval of D steps, only updating once every D steps. This periodic repetition creates stable semantic loops that encourage longer and more redundant model outputs. The doubling loss L_REDO is:

    L_REDO = CE(f(X + δ), Ȳ_i) = −Σ_{l=1}^{|Ȳ_i|} log P(ȳ_l | ȳ_{<l}, X + δ).

For a concrete example, if the initial decoded sequence is Ŷ = [y_1, y_2, y_3] and the doubling period is D = 10, then the target sequence for attacking steps 0 to 9 should be Ȳ = [y_1, y_2, y_3, y_1, y_2, y_3], doubling the regular tokens and strictly eliminating the EOS token. This loss explicitly guides the model to produce periodic, repeated segments, which rapidly and reliably inflate the output token count while maintaining a degree of linguistic structure, making the attack more potent. Finally, the overall efficiency attack loss is formulated as:

    L_eff = L_EOS + L_REDO.
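The doubled-target construction in the example above can be sketched as follows (token strings are placeholders; the real targets are vocabulary token ids):

```python
def redo_target(decoded, eos_id):
    """Strip the EOS token from the current decode, then concatenate the
    remaining segment with itself, yielding the doubled REDO target."""
    body = [t for t in decoded if t != eos_id]
    return body + body

decoded = ["y1", "y2", "y3", "<eos>"]          # current greedy decode
target = redo_target(decoded, eos_id="<eos>")  # held fixed for steps 0..D-1
```

Because the EOS token never appears in the target, the cross-entropy toward it simultaneously implements elongation and termination suppression.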

Asymmetric interleaving. Applying a single-stage long-repeated target can destabilize gradient optimization due to long-horizon optimization difficulties [2,1,22]. To mitigate this, REDO breaks the long-horizon repetition task into a sequence of easier subproblems, yielding smoother optimization than forcing a single-stage long-target objective. REDO is therefore formulated as a stepwise, curriculum-style attack that progressively optimizes for longer repeated outputs. Concretely, we interleave: for step s, maintain the repeated target fixed when s mod D ≠ 0 and extend it to the next longer form when s mod D = 0. This “periodic booster” concentrates high-variance, long-horizon REDO updates sparsely while using frequent short-horizon updates to stabilize learning, and is distinct from summing losses or applying dense REDO updates at every step.
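The interleaving schedule reduces to a simple rule; a sketch with illustrative event names ("extend" at block boundaries, "hold" in between):

```python
def redo_schedule(total_steps, D):
    """Curriculum schedule: rebuild and lengthen the doubled target only
    when s mod D == 0; otherwise keep the current target fixed."""
    return ["extend" if s % D == 0 else "hold" for s in range(total_steps)]

schedule = redo_schedule(total_steps=12, D=5)
# sparse long-horizon "extend" events at steps 0, 5, 10; "hold" elsewhere
```

The asymmetry is the point: expensive, high-variance target extensions are sparse (once per period D), while the cheap, stabilizing short-horizon updates run at every step.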

Algorithm Details. We integrate these components into a unified hierarchical procedure (Algorithm 1) for the dual-objective problem. We provide a detailed analysis in Appendix B. In particular, Appendix B characterizes how the computational cost of both the repulsion (accuracy-degradation) and anchoring (REDO-based repetition) stages scales with model depth, width, and the REDO-induced growth in output length. We additionally provide a FLOPs analysis in Appendix C.

Datasets. We utilize two widely used ASR datasets from HuggingFace, LibriSpeech [25] and LJ-Speech [16], and evaluate the first 500 utterances from each (LJ-Speech and the test-clean subset of LibriSpeech), with all audio resampled to 16,000 Hz.

Threat Model. We conduct white-box attacks with full access to the model on five Whisper-family models [26], including Whisper-tiny, Whisper-base, Whisper-small, Whisper-medium, and Whisper-large, all obtained from HuggingFace. To benchmark the proposed MORE approach, we compare it against five strong white-box attack baselines: PGD [24], MI-FGSM [9], VMI-FGSM [33], the speech-aware gradient optimization (SAGO) method [13], and SlothSpeech [15].

Experimental Setup. In MORE, the hyperparameters I and K_a are set to 10 and 50, respectively. To ensure imperceptibility to human listeners, we set the perturbation magnitudes ϵ to 0.002 and 0.0035, which correspond to average signal-to-noise ratios (SNRs) of 35 dB and 30 dB, respectively, both within the range generally considered inaudible to humans [13]. All experiments are conducted on an NVIDIA H100 GPU.

Evaluation Metrics. To evaluate ASR accuracy degradation, we adopt word error rate (WER) as the metric, which quantifies the proportion of insertions, substitutions, and deletions relative to the number of ground-truth words. Given that adversarial transcriptions may be excessively long, we truncate the predicted sequence to match the length of the reference text to better evaluate accuracy degradation in the initial portion of the output, where meaningful recognition should occur; WER values exceeding 100.00% are capped at 100.00% for normalization. Higher WER indicates lower ASR performance and thus a more effective accuracy attack. Efficiency attack performance is measured by the length of the predicted text tokens, where a greater length indicates a more effective efficiency attack.
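The truncate-then-cap WER protocol can be sketched with a standard word-level edit distance (helper names are ours):

```python
import numpy as np

def wer(reference, hypothesis):
    """Word error rate: edit distance (substitutions + insertions + deletions)
    normalized by the reference word count, as a percentage."""
    r, h = reference.split(), hypothesis.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + cost)
    return 100.0 * d[len(r), len(h)] / len(r)

def capped_truncated_wer(reference, hypothesis):
    """Evaluation protocol sketch: truncate the hypothesis to the reference
    length, then cap the WER at 100%."""
    h = " ".join(hypothesis.split()[: len(reference.split())])
    return min(wer(reference, h), 100.0)
```

Without the truncation, the insertions from an elongated adversarial transcript would inflate WER far past 100% and conflate the accuracy and efficiency effects the two metrics are meant to separate.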

We evaluate MORE against state-of-the-art baselines on both accuracy and efficiency across multiple ASR models, examine its robustness under different SNR levels, and conduct ablations to analyze component contributions. A case study with adversarial examples and decoded transcriptions further illustrates its effectiveness.

We compare MORE with SOTA baselines across multiple ASR models, with results on the two datasets at 30 dB and 35 dB shown in Tables 1 and 2. Notably, on the robust Whisper-large model, our MORE approach exhibits exceptional performance in both the accuracy and efficiency dimensions. It achieves higher accuracy degradation (a WER of 53.72, compared to about 30 for accuracy-oriented attack baselines) while producing average transcription lengths of 301.47, roughly 10 times longer than accuracy-oriented baselines and approximately 3.8 times longer than SlothSpeech (79.78). Additionally, SlothSpeech’s weaker accuracy degradation (WER of 46.80 vs. MORE’s 91.01) highlights the limitations of optimizing solely for efficiency and underscores the necessity of incorporating accuracy objectives for comprehensive attacks. These findings validate the robustness and utility of our proposed multi-objective optimization approach.

We further investigate the effect of varying SNR levels on attack effectiveness. By comparing results under 30 dB SNR (Table 2) and 35 dB SNR (Table 1), we observe that attacks under 30 dB SNR generally yield stronger performance, characterized by higher WER and significantly longer transcriptions (length). This confirms that lower SNR (i.e., noisier conditions) provides a more favorable environment for adversarial perturbations to succeed.

We conduct an ablation study to assess the contribution of each component in the proposed MORE approach.

To qualitatively demonstrate the effectiveness of MORE, we present decoded adversarial transcriptions in Table 4. Unlike the baselines, which produce either incorrect but short outputs or elongated but partially correct ones, MORE generates a fully incorrect transcription with a structured repetition of the sentence “and her voice” over 100 times, resulting in a length of 334 and a WER of 100.00. This showcases the strength of our proposed multi-objective optimization and the repetitive doubling encouragement objective in simultaneously disrupting transcription accuracy and inducing extreme inefficiency through systematic and semantically coherent redundancy. More case studies can be found in Appendix A.

We propose MORE, a novel adversarial attack approach that introduces multi-objective repulsion-anchoring optimization to hierarchically target recognition accuracy and inference efficiency in ASR models. MORE integrates a periodically updated repetitive encouragement doubling objective (REDO) with end-of-sentence suppression to induce structured repetition and generate substantially longer transcriptions while retaining effectiveness in accuracy attacks. Experimental results demonstrate that MORE outperforms existing baselines in efficiency attacks while maintaining comparable performance in accuracy degradation, effectively revealing dual vulnerabilities in ASR models. The code will be made publicly available upon acceptance.

All authors have read and agree to adhere to the ICLR Code of Ethics. We understand that the Code applies to all conference participation, including submission, reviewing, and discussion. This work does not involve human-subject studies, user experiments, or the collection of new personally identifiable information. All evaluations use publicly available research datasets under their respective licenses. No attempts were made to attack deployed systems, bypass access controls, or interact with real users. Our contribution, MORE, is an adversarial method that can degrade both the recognition accuracy and inference efficiency of ASR systems. While our goal is to advance robustness research and stress-test modern ASR models, the same techniques may be misused to (i) impair assistive technologies (e.g., captioning for accessibility), (ii) disrupt safety- or time-critical applications (e.g., clinical dictation, emergency call transcription, navigation), or (iii) increase computational costs for shared services via artificially elongated outputs. We do not provide instructions or artifacts intended to target any specific deployed product or service, and we caution that adversarial perturbations, especially those designed to be inconspicuous, present real risks if applied maliciously.

To reduce misuse risk and support defenders, we suggest that concrete defenses be integrated into ASR systems: decoding-time safeguards such as repetition/loop detectors; input-time defenses such as band-limiting; and training-time strategies such as adversarial training focused on EOS/repetition pathologies. We will explore targeted defense mechanisms against MORE in future work.

We take several steps to support independent reproduction of our results. Algorithmic details for MORE are provided in Sec. 3 and further clarified in Appendix B. Dataset choices and preprocessing (LibriSpeech and LJ-Speech, first 500 utterances per set, 16 kHz resampling) are specified in Sec. 4, while exact model variants (Whisper-tiny/base/small/medium/large from HuggingFace), hardware, perturbation budgets/SNRs, and all attack hyperparameters are detailed in Sec. 4. The evaluation protocol is defined in Sec. 4. We do not include the code archive in the submission due to proprietary requirements. Upon acceptance, we will release a public repository mirroring the anonymous package as soon as we obtain permission.

To showcase the effectiveness of MORE, we present more decoded adversarial transcriptions in Table 5 and Table 6.

Scope and assumptions. We analyze two costs: (i) attack-time compute to craft δ via Algorithm 1, and (ii) victim-time compute when the ASR model decodes on X + δ. Our derivation accounts for encoder self-attention, decoder self-/cross-attention, feed-forward layers, vocabulary projection/softmax, greedy decoding used to materialize doubled targets, and the backward/forward constant factor. We express totals in terms of model hyperparameters and the scheduling parameters (K, K_a, D) defined in Algorithm 1, and we refer to the objectives in Eqs. 5, 6, 8, and 9.

On the construction of doubled targets. Eq. 8 defines Ȳ_i via Ŷ^(⌊i/D⌋) but does not specify how the hypothesis Ŷ is obtained at the start of each block. To make Eq. 8 operational, we explicitly realize Ŷ with a greedy decode on the current perturbed input:

    Ŷ^(m) = GREEDYDECODE(f, X + δ),      (10)
    B_m = STRIPEOS(Ŷ^(m)),               (11)
    Ȳ^(m) = B_m ∥ B_m,                   (12)
    l_m = |Ȳ^(m)| = 2|B_m|.              (13)

Here, Eq. 10 decodes the perturbed input X + δ with the victim model f(•) using greedy decoding (selecting the most probable token at each step until termination). In Eq. 11, STRIPEOS(•) removes the terminal EOS token from the decoded hypothesis, leaving only the content-bearing tokens. The notation “∥” denotes sequence concatenation, so Eq. 12 constructs the doubled sequence by repeating B_m back-to-back. Finally, l_m, calculated by Eq. 13, is the length of this doubled target. This makes Eq. 8 explicit and adds a per-block greedy-decoding cost accounted for below.

Decoded adversarial transcription examples (excerpt; columns: method, decoded text, length, WER):

Clean: “in fourteen sixty-five sweynheim and pannartz began printing in the monastery of subiaco near rome” (length 22, WER 0.00)

PGD: “With a 14-a-tip to the dog, framed by its and clan herbs begin in renting a Ramona scurrying, home-or-dity-actyl-healed blowing.”

SlothSpeech: “In 1365. Twain, and patterns. We began printing in the monastery on Superacro. We are going to be here. We are going to be here. We are going to be here. We are going to be here. We are going to be here. We are going to be here. We are going to be here.”

Loss conditioning and reuse of passes. Eqs. 5 and 9 are cross-entropy (CE) objectives and must be computed under teacher forcing to provide stable gradients. We assume CE terms use teacher forcing throughout. For Eq. 6, we define P_L^v at the last teacher-forced position (so L = l_m in Stage 2). With Algorithm 1 summing L_EOS + L_REDO at each Stage 2 step, both losses share one forward/backward pass at length l_m; no additional pass is required for EOS.

Stage 1 (Accuracy; K_a steps). Each step backpropagates L_acc (Eq. 5) under teacher forcing on Y of length |Y|:

Optional early-stopping evaluations (e.g., greedy WER every E steps) add

where L_eval is the decoded length at evaluation.

Stage 2 (Efficiency; K − K_a steps). Stage 2 consists of M blocks of D steps. In each block m:

• Greedy anchor (once per block). Build Ȳ^(m) by decoding Ŷ^(m) and doubling (Eq. 8):

• PGD steps (every step in the block). Algorithm 1 uses the sum L_EOS + L_REDO at each step, with teacher-forced length l_m = 2L_m. Hence, per step:

and over D steps:

Summing over blocks,

    C_Stage2 = Σ_{m=1}^{M} ( C_greedy,m + C_PGD,m ).      (22)

Growth envelopes for L_m. The doubled-target curriculum encourages L_m to increase across blocks.

• Geometric (until cap). In this case, L_m grows by doubling until it reaches the cap L_max:

    L_m = min( 2^{m−1} L_1, L_max ).

Then, before the cap binds (over M⋆ uncapped blocks), the sums are

    Σ_{m=1}^{M⋆} L_m = L_1 (2^{M⋆} − 1),    Σ_{m=1}^{M⋆} L_m² = L_1² (4^{M⋆} − 1) / 3.

Plugging into Eq. 22, the self-attention term scales as Θ(D · 4^{M⋆}) before saturation at L_max.

• Linear (until cap). If L_m = L_0 + (m − 1)∆ (capped at L_max), then the uncapped sums are

    Σ_{m=1}^{M} L_m = M L_0 + ∆ M(M − 1)/2,    Σ_{m=1}^{M} L_m² = Θ(M³ ∆²) for ∆ > 0.

Here the dominant term in the self-attention sum is Θ(D M³ ∆²) when ∆ > 0.

Total attack-time cost. Combining stages,

    C_attack = C_Stage1 + C_Stage2,

with C_Stage2 calculated from Eq. 22. The PGD update is O(T) and negligible.

When decoding on X+δ without backprop, expected per-example compute is

where ℓ_adv is the induced output length under the chosen decoding policy (greedy by default). Focusing on decoder-dominant terms, the slowdown relative to the clean case (ℓ_clean) is

This bound separates: (i) repeated encoder passes, linear in K and F; (ii) decoder self-attention terms growing with Σ_m L_m² (the principal driver under elongation); and (iii) vocabulary and greedy-decoding overheads, linear in sequence length and V.

To complement our analysis based on output length, we provide a more explicit characterization of the computational overhead induced by MORE in terms of floating-point operations (FLOPs).

Following standard FLOPs estimates for Transformer models, i.e., approximately 2 · N FLOPs per generated token for a model with N parameters [4], we approximate the per-example inference cost (encoder + decoder) for Whisper. Since the encoder runs once per utterance while the decoder runs once per output token, the relative increase in FLOPs is dominated by the increase in output length caused by MORE. In typical (non-attack) conditions on LibriSpeech, Whisper produces transcriptions roughly the same length as the reference transcript, about 22 tokens on average per utterance. Based on Table 1, MORE can induce 10× to 14× longer transcripts compared to normal outputs across different Whisper model sizes. Using the parameter sizes of Whisper models and the average output lengths observed in our experiments, we obtain the following per-example FLOPs on the LibriSpeech dataset.

Here we approximate (1) FLOPs per token ≈ 2·N_params and (2) total FLOPs per example ≈ FLOPs_encoder + (#tokens)·2·N_params, using the empirical baseline vs. MORE token lengths from our experiments. These estimates show that, across all Whisper sizes, MORE increases per-example inference compute by roughly an order of magnitude (≈ 9-14×), purely by forcing the model to generate much longer, repetitive transcripts. This quantifies an efficiency vulnerability: MORE does not merely degrade accuracy; it also inflates the FLOPs required for inference, threatening the real-time and resource efficiency of ASR deployments.
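The decoder-dominant approximation above can be sketched as a back-of-envelope calculation. The parameter counts below are the published Whisper model sizes; the clean token length follows the ~22-token average reported above, while the attacked length (a 12× elongation, within the 10-14× range observed) is an illustrative stand-in:

```python
# Back-of-envelope FLOPs estimate under the ~2*N-FLOPs-per-token rule for an
# N-parameter model (encoder cost omitted, so the ratio reduces to the token
# ratio). Token lengths are illustrative stand-ins for the measured averages.

WHISPER_PARAMS_B = {  # parameters, in billions (published model sizes)
    "tiny": 0.039, "base": 0.074, "small": 0.244,
    "medium": 0.769, "large": 1.55,
}

def decode_flops(n_params_b, n_tokens):
    """Total decoding FLOPs ~= (#tokens) * 2 * N_params."""
    return n_tokens * 2 * n_params_b * 1e9

clean_tokens, attacked_tokens = 22, 264  # ~12x elongation (illustrative)
for name, n in WHISPER_PARAMS_B.items():
    ratio = decode_flops(n, attacked_tokens) / decode_flops(n, clean_tokens)
    print(f"{name}: {decode_flops(n, attacked_tokens):.2e} FLOPs, {ratio:.1f}x clean")
```

Because the decoder term is linear in token count, the slowdown here equals the elongation factor; adding the fixed per-utterance encoder cost pulls the end-to-end ratio slightly below it, consistent with the 9-14× range quoted above.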

In Fig. 2, we profile the inference time of the Whisper-Large model as a function of the number of output tokens. The inference time increases almost linearly with the output length. Moreover, Whisper-Large pads all inputs shorter than 30 seconds to a fixed 30-second window before processing, so utterances shorter than 30 seconds incur identical encoding time, making the output length the sole determinant of overall resource consumption. These observations support our motivation to maximize the output length in order to induce the greatest possible waste of computational resources.
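The fixed-window padding behind this observation can be illustrated with a minimal sketch (the sample rate and window length match Whisper's standard 16 kHz / 30 s preprocessing; the utterance durations are arbitrary):

```python
# Why output length dominates cost for short inputs: Whisper-style
# preprocessing pads every waveform to a fixed 30 s window (480000 samples
# at 16 kHz), so the encoder always sees the same input size and only the
# number of decoded tokens varies across utterances.

import numpy as np

SAMPLE_RATE, WINDOW_SECONDS = 16000, 30

def pad_to_window(wav):
    target = SAMPLE_RATE * WINDOW_SECONDS
    out = np.zeros(target, dtype=wav.dtype)
    out[: min(len(wav), target)] = wav[:target]  # copy, zero-pad the rest
    return out

short = np.ones(5 * SAMPLE_RATE, dtype=np.float32)    # 5 s utterance
longer = np.ones(20 * SAMPLE_RATE, dtype=np.float32)  # 20 s utterance
assert pad_to_window(short).shape == pad_to_window(longer).shape
print(pad_to_window(short).shape)  # (480000,) for both inputs
```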

We used a large language model (ChatGPT) solely as a writing-assistance tool for grammar checking, wording consistency, and style polishing of author-written text. All technical content, results, and conclusions originate from the authors. Suggested edits were reviewed by the authors for accuracy and appropriateness before inclusion. No confidential or proprietary data beyond the manuscript text was provided to the tool. This disclosure is made in accordance with the venue's policy on LLM usage.

1: Input: Original audio X, true transcript Y, ℓ_∞ radius ε, step size α, total steps K, accuracy steps K_a, doubling period D, ASR model f(·).
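The staged loop these inputs feed can be sketched as follows. This is a toy rendering, not the authors' implementation: placeholder gradient functions stand in for the gradients of the actual accuracy and REDO losses, while the two-stage switch at K_a, the periodic doubling of the target length every D steps, and the ℓ_∞ projection mirror the inputs listed above:

```python
# Toy sketch of the staged PGD loop behind MORE. `grad_accuracy` and
# `grad_redo` are hypothetical stand-ins for gradients of the real ASR
# losses; everything else follows the algorithm's stated inputs.

import numpy as np

def pgd_more(x, eps, alpha, K, Ka, D, L0, grad_accuracy, grad_redo):
    delta = np.zeros_like(x)
    L = L0                                   # current target length
    for k in range(K):
        if k < Ka:                           # Stage 1: accuracy degradation
            g = grad_accuracy(x + delta)
        else:                                # Stage 2: REDO objective
            if k > Ka and (k - Ka) % D == 0:
                L *= 2                       # periodically double target length
            g = grad_redo(x + delta, L)
        # Signed gradient ascent step, projected back into the l_inf ball.
        delta = np.clip(delta + alpha * np.sign(g), -eps, eps)
    return delta, L

# Exercise the loop with stand-in gradients on a 1 s, 16 kHz waveform.
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)
delta, L = pgd_more(x, eps=0.01, alpha=0.002, K=40, Ka=10, D=10, L0=32,
                    grad_accuracy=lambda z: z,      # stand-in gradient
                    grad_redo=lambda z, L: -z)      # stand-in gradient
print(float(np.abs(delta).max()) <= 0.01, L)
```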

Comparison of average recognition accuracy (WER%) and average transcribed text token length of various attack methods on the LibriSpeech and LJ-Speech datasets at an SNR of 35 dB. The reported accuracy and token length are averaged over 500 utterances for each dataset. ‘Clean’ denotes performance on the original, unperturbed speech. Note that higher WER and longer transcribed token length indicate a more successful attack.

generating significantly longer transcriptions while maintaining high WER for accuracy degradation across both SNR levels. Specifically, accuracy-oriented baselines (PGD, SAGO, MI-FGSM, and VMI-FGSM) yield substantially shorter transcriptions (e.g., 31.65 vs. our 300.13), highlighting the effectiveness of our novel repetitive encouragement doubling objective (REDO). Compared to SlothSpeech, a baseline specifically designed for efficiency attacks, MORE still achieves significantly longer outputs (e.g., 208.52 vs. 65.85), underscoring the efficacy of the doubling loss in REDO at inducing repetitive and extended transcriptions.

MORE: "her mind, her voice, her voice, and her voice, and her voice, and her voice, and her voice, and her voice, and her voice, and her voice, … (repeated ~100 times) … and her voice, and her voice, and her voice, and" (token length 334, WER 100.00)


MO with accuracy loss or EOS. Removing all efficiency-attack components (EOS and REDO) causes the efficiency attack to fail completely (length drops to 30.60), leaving only the accuracy attack (WER 93.63) effective. This confirms that every efficiency design component is essential for achieving successful efficiency degradation. Overall, all components are critical and contribute effectively to the success of the attacks against the victim ASR models.

Decoder attention cost scales as ℓ^2 (self-attention) and as ℓF (cross-attention). The vocabulary projection/softmax adds O(ℓV) per forward/backward; we keep it explicit when informative.
