Resource Consumption Threats in Large Language Models
Yuanhe Zhang 1, Xinyue Wang 1, Zhican Chen 1, Weiliu Wang 2, Zilu Zhang 1, Zhengshuo Gong 1, Zhenhong Zhou 3, Kun Wang 3, Li Sun 1, Yang Liu 3, Sen Su 4,1,†

1 Beijing University of Posts and Telecommunications  2 Hangzhou Dianzi University  3 Nanyang Technological University  4 Chongqing University of Posts and Telecommunications

{charmes-zhang, susen}@bupt.edu.cn

Abstract

Given limited and costly computational infrastructure, resource efficiency is a key requirement for large language models (LLMs). Efficient LLMs increase service capacity for providers and reduce latency and API costs for users. Recent resource consumption threats induce excessive generation, degrading model efficiency and harming both service availability and economic sustainability. This survey presents a systematic review of threats to resource consumption in LLMs. We further establish a unified view of this emerging area by clarifying its scope and examining the problem along the full pipeline from threat induction to mechanism understanding and mitigation. Our goal is to clarify the problem landscape for this emerging area, thereby providing a clearer foundation for characterization and mitigation.

1 Introduction

Large language models (Vaswani et al., 2017) operate under limited and costly computational infrastructure (Samsi et al., 2023; Miao et al., 2024), making resource efficiency a core requirement for practical deployment. Efficient resource usage improves service throughput for providers and reduces API costs for users. Consequently, improving computational efficiency has been studied as an engineering problem (Zhou et al., 2024; Fernandez et al., 2025) across LLMs (Vaswani et al., 2017), reasoning large language models (RLLMs) (Wei et al., 2022), multimodal large language models (MLLMs) (Alayrac et al.
, 2022), and LLM-based agentic environments (Agents) (Yao et al., 2022). However, optimization alone is insufficient to address this threat, as many adversarial resource consumption attacks still threaten the usability and sustainability of LLMs (Zhang et al., 2025e).

† indicates corresponding author.

[Figure 1: Overview of resource consumption threats in large language models. A benign query (e.g., "What is 12 × 13?") is answered concisely, whereas resource consumption attacks drive LLMs, RLLMs, MLLMs, and agents into overthinking or unbounded drift, inflating tokens, compute, and latency; consequences span infrastructure impact (GPU saturation, server overload, reduced concurrency), ecosystem impact (service throttling, capacity imbalance, resource competition), and user impact (higher latency, unstable availability, increased usage cost).]

Resource consumption attacks are designed to induce disproportionate computational overhead in LLMs (Shumailov et al., 2021). Recent studies suggest that such attacks have emerged as an important safety risk (Geiping et al., 2024; Gao et al., 2024c). As shown in Figure 1, rather than directly inducing harmful outputs or extracting private information, they trigger excessive and unnecessary generation (Li et al., 2025a), which degrades model throughput, inflates operational costs, and places excessive pressure on shared service resources (Gao et al., 2025). This shift reframes resource usage in LLMs from a performance issue to a core safety concern. Although research on this problem has begun to appear across diverse model settings (Wang et al., 2023b), the field remains fragmented by inconsistent terminology and threat assumptions.
As a result, the scope and core research questions of resource consumption attacks remain insufficiently clarified.

Our review centers on resource consumption threats in large language models, especially malicious behaviors that amplify computational costs through extended or uncontrolled generation. To provide a unified view, we organize this landscape using an effect-oriented taxonomy with two representative regimes: Overthinking, where generation remains relevant to the task but exceeds practical utility, and Unbounded Drift, where the generation trajectory becomes progressively less controlled and less convergent.

[Figure 2: A unified view of resource consumption threats in large language models, organizing representative attacks, mechanism studies, and mitigations across LLMs, RLLMs, MLLMs, and agents, together with the growth of context window sizes (roughly 4K tokens in 2020 to 10M in frontier models by 2025) and the rise in related publications (from 0 in 2020 to 57 in 2025).]
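Read as a decision rule over two criteria — whether generation stays aligned with the original task objective, and whether decoding remains on a controllable path toward termination — the taxonomy can be sketched as follows. The function and label names are ours and purely illustrative, not terminology from the cited works:

```python
def classify_regime(task_aligned: bool, termination_controlled: bool) -> str:
    """Toy decision rule for the survey's two-regime taxonomy.

    task_aligned: does the extended generation still serve the original task?
    termination_controlled: does decoding remain on a controllable path
    toward termination?
    """
    if task_aligned and termination_controlled:
        # Costly but task-relevant elaboration with normal stopping behavior.
        return "Overthinking"
    if not termination_controlled:
        # Repetitive loops, recursive self-extension, weakened semantic control.
        return "Unbounded Drift"
    # Task-misaligned yet still terminating: outside the two core regimes.
    return "Other excessive generation"

print(classify_regime(True, True))    # verbose but on-task reasoning
print(classify_regime(False, False))  # runaway repetition loop
```

The rule also makes explicit that termination control, not output length, is the boundary between the two regimes.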
We also discuss selected studies from machine learning and early deep learning architectures as related evidence that helps contextualize generative resource abuse. As shown in Figure 2, this taxonomy clarifies the scope of resource consumption threats and links attacks to mechanism analysis and mitigation.

Based on this perspective, this survey makes three main contributions. First, we provide a comprehensive overview of recent advances on resource consumption threats. Second, we introduce a unified taxonomy that organizes resource consumption risks into two representative regimes, Overthinking and Unbounded Drift, clarifying the scope of this safety problem. Third, we analyze open challenges in this emerging area and outline promising directions for future research.

2 Safety Implications and Problem Taxonomy

2.1 Attack Impact and System Risks

Resource consumption attacks exploit LLMs' computational characteristics to induce excessive generation, thereby degrading system effectiveness. The risks of such amplification are evident even in non-adversarial deployment settings. For example, the launch of DeepSeek-R1 attracted massive traffic while the computational cost of its chain-of-thought reasoning was underestimated; together, these factors saturated available compute capacity and led to service disruptions.2 This example illustrates provider-side risks such as reduced throughput and increased operational cost. More broadly, such pressure can also propagate to users in the form of higher latency, unstable availability, and increased usage cost. Resource consumption therefore constitutes a systemic safety risk affecting both providers and users in the LLM ecosystem. To better understand this emerging threat landscape, a clearer taxonomy is needed, as illustrated in Figure 3.
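A back-of-the-envelope calculation makes the provider-side pressure concrete: at a fixed token throughput, amplifying per-request output length directly divides the number of requests that can be served within a latency budget. All numbers below are illustrative assumptions, not measurements of any deployed system:

```python
def max_concurrency(capacity_tok_s: float, tokens_per_request: float,
                    latency_budget_s: float) -> float:
    """Requests servable within a latency budget at a fixed token throughput."""
    return capacity_tok_s * latency_budget_s / tokens_per_request

# Illustrative numbers: a server that decodes 10,000 tokens/s in aggregate.
benign = max_concurrency(10_000, tokens_per_request=500, latency_budget_s=1.0)
attacked = max_concurrency(10_000, tokens_per_request=5_000, latency_budget_s=1.0)
print(benign, attacked)  # a 10x output amplification cuts concurrency 10x
```

The same arithmetic explains why amplification attacks degrade service for benign users sharing the capacity, even when no single request looks anomalous.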
2.2 A Taxonomy of Generation Regimes

Accordingly, while the survey reviews attacks, mechanisms, and defenses, the taxonomy focuses on the induced generation behavior. This distinction allows different attack forms to be analyzed under a common lens when they prolong generation in similar ways. We distinguish them along two criteria: whether the extended generation remains aligned with the original task objective, and whether the generation process remains on a controllable path toward termination.

Overthinking refers to excessive generation that remains task-aligned and preserves normal stopping behavior, but incurs unnecessary cost through verbosity, redundant reasoning, or over-elaborate description. Its continuation is mainly sustained by task-relevant but low-utility elaboration.

Unbounded Drift refers to excessive generation in which the decoding trajectory is no longer reliably governed by the original task or by normal convergence dynamics. It typically manifests as repetitive loops, recursive self-extension, or self-reinforcing interaction chains that weaken semantic control, disrupt timely termination, or both.

2 https://www.binance.com/en/square/post/01-26-2025-deepseek-r1-api-19448332337730

[Figure 3: Taxonomy of resource consumption threats, placing representative attacks (e.g., Crabs, Overthinking, Sponge Poisoning, BitHydra, CORBA, LoopLLM, RECUR) along two axes: alignment with the original task objective and a controllable path toward termination.]

The distinction is therefore not based on output length alone. Some cases may evolve from overthinking into drift; in such cases, we classify them by the dominant mechanism sustaining continuation. We provide a more detailed discussion of these impacts in Appendix D.

3 Threat Landscape and Related Precursors

The threat landscape includes both core attacks in generative settings and several adjacent precursor lines.
The former directly exploit the generative process to amplify computation, whereas the latter arise in earlier vision or system pipelines and are included here only when they help contextualize similar resource-amplification mechanisms.

3.1 Early Forms of Attacks

The challenge of resource consumption attacks originated in early deep learning architectures. Specifically, Sloth (Hong et al., 2020) revealed that gradient-based optimization could coerce DNNs into the most computationally expensive inference paths. Building on this concept of computational asymmetry, the seminal work on this attack, Sponge Examples (Shumailov et al., 2021), demonstrated that targeted perturbations of neural activations could dramatically increase energy consumption by disrupting hardware-level execution. This principle was quickly extended to Neural Machine Translation (NMT), where TranSlowDown (Chen et al., 2021) and NMTSloth (Chen et al., 2022a) showed that minute perturbations delaying the End-of-Sentence (EOS) token could force sequences toward maximum length limits, significantly inflating inference costs.

As research progressed, the attack surface expanded from inference-time perturbations to training-time backdoors. Prior work showed that efficiency-targeted backdoors can manipulate model behavior by hijacking inference control flow, including early model integrity attacks (Cinà et al., 2021), later studies on dynamic routing (Chen et al., 2023b), and other deep architectures (Meftah et al., 2025). This was further refined by Sponge Poisoning (Cinà et al., 2025; Lintelo et al., 2024), which embeds backdoors to disrupt hardware-level sparsity and skip-connection logic. These early paradigms also exposed the practical severity of resource-oriented attacks on edge hardware (Wang et al., 2023b; Hasan et al., 2025a).
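The EOS-delay mechanism behind TranSlowDown and NMTSloth can be illustrated with a toy decoding model: if each step independently emits EOS with probability p, output length follows a geometric distribution capped at the maximum length, so even a modest adversarial suppression of p inflates expected cost many-fold. The independence assumption is a simplification of ours, not the attacks' actual formulation:

```python
def expected_length(p_eos: float, max_len: int) -> float:
    """Expected tokens before EOS under a per-step stopping probability p_eos,
    truncated at max_len (geometric distribution with a hard cap)."""
    # E[min(L, max_len)] for L ~ Geometric(p_eos) equals
    # sum_{k=1..max_len} P(L >= k) = (1 - (1 - p_eos)**max_len) / p_eos
    return (1 - (1 - p_eos) ** max_len) / p_eos

benign = expected_length(0.02, max_len=1024)     # roughly 50 tokens on average
attacked = expected_length(0.001, max_len=1024)  # EOS probability suppressed 20x
print(round(benign), round(attacked))
```

Under these illustrative numbers, a 20x suppression of the stopping probability yields well over a tenfold increase in expected decoding cost, which is the asymmetry these attacks exploit.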
Prior work showed that computational overhead can lead to direct physical consequences, including battery drain and device-level Denial-of-Service (DoS) attacks. These attacks reveal that adversarial inputs can amplify seemingly small perturbations into disproportionately high computational costs.

3.2 Attack Surfaces in Large Language Models

3.2.1 Risks of Overthinking

The most direct way to induce resource consumption is to artificially prolong model outputs. Sponge Poisoning in LLM (Cinà et al., 2025) introduces a training-time backdoor that induces sustained output elongation and resource exhaustion. In black-box settings, Crabs (Zhang et al., 2025e) shows that semantic expansion via tree-structured queries can also effectively increase output length. Beyond direct output prolongation, ThinkTrap (Li et al., 2025b) maps discrete tokens into a continuous embedding space and performs optimization in a low-dimensional subspace to induce excessive generation under black-box access.

At a more fine-grained level, resource consumption can even be induced by directly manipulating model parameters. BitHydra (Yan et al., 2025) proposes a bit-flip inference cost attack that alters model weights at the hardware-relevant level to compromise efficiency. RepetitionCurse (Huang et al., 2025d) shows that imbalanced expert routing in MoE can induce analogous repetitive failures. These studies suggest that resource consumption vulnerabilities span multiple levels of the stack, from high-level semantic elicitation to low-level parameter and hardware-oriented manipulation.

3.2.2 Risks of Unbounded Drift

Although existing methods use different induction strategies, they often converge on a shared unbounded drift effect: once termination is destabilized, the generation process can enter repetitive or effectively non-terminating loops. Fixed points (Hammouri et al., 2025) and attention sinks (Yona et al.
, 2025a) exploit intrinsic decoding dynamics, making repetitive trajectories easier to sustain. Others operationalize this effect more directly through entropy-driven optimization. LoopLLM (Li et al., 2025a) induces repetitive generation loops through entropy-based search, while GCG (Geiping et al., 2024), Engorgio (Dong et al., 2024), and LLMEffiChecker (Feng et al., 2024) interfere with termination by manipulating critical tokens. In these attacks, the model is prevented from halting normally and is pushed toward a loop.

Key Insights. Current research lacks a cross-generational perspective, typically focusing on parameter scaling within a single model family rather than analyzing how these issues evolve as model capabilities grow. Furthermore, the field is heavily dominated by white-box methods, making it difficult to study commercial black-box environments. This is particularly true for unbounded drift.

3.3 Attack Surfaces in Reasoning Large Language Models

Reasoning large language models (RLLMs) are prone to inefficiently prolonged generation: even on simple tasks such as 2+3, they may generate excessively long, redundant, and repetitive reasoning processes (Chen et al., 2025), which creates significant opportunities for attacks.

3.3.1 Risks of Overthinking

Chain-of-thought reasoning exposes a distinctive attack surface, as attackers can deliberately induce excessive and unnecessary reasoning. Existing attacks achieve this through diverse strategies, including backdoor triggers in BadReasoner (Yi et al., 2025) and BadThink (Liu et al., 2025b), context-based slowdown attacks that exploit retrieval or search mechanisms (Kumar et al., 2025; Zhu et al., 2025a), and adversarial perturbations that interfere with normal termination (Si et al., 2025).
Together, these studies show that overthinking in reasoning models is not just an inefficiency but an exploitable vulnerability that prolongs reasoning and increases computational cost.

3.3.2 Risks of Unbounded Drift

Unbounded drift poses more severe risks than ordinary overthinking because it can trap reasoning models in self-perpetuating generation loops. RECUR (Wang et al., 2026b) shows that chain-of-thought reasoning may enter repeated reflective cycles, suggesting that such loop-like degeneration may arise broadly across generative architectures.

Key Insights. Recent studies have provided a relatively comprehensive understanding of overthinking in reasoning models, while also showing that unbounded drift poses a severe threat to stable and reliable generation.

3.4 Attack Surfaces in Multimodal Large Language Models

Multimodal large language models (MLLMs) combine computationally intensive multimodal perception with autoregressive generation, thereby exposing a broader attack surface.

3.4.1 Risks of Overthinking

Related risks had already appeared in earlier vision systems, where visual perturbations can weaken sparsity benefits, negate dynamic quantization gains, or overload downstream vision pipelines, as seen in Uniform Inputs (Müller and Quiring, 2024), QuantAttack (Baras et al., 2025), and Phantom Sponges and Poison-splat (Shapira et al., 2023; Schoof et al., 2024; Lu et al., 2024). Although these works do not directly target MLLMs, they foreshadow that seemingly benign vision inputs can be crafted to induce disproportionate resource consumption.

This risk becomes more explicit in MLLMs. Existing attacks manipulate the autoregressive decoding process to trigger excessively long outputs (Chen et al., 2022b; Gao et al., 2024a,b), while more recent methods improve stealth by silently generating invisible tokens that conceal substantial computational overhead (Zhang et al., 2025a).
Other frameworks, such as VLMInferSlow (Wang et al., 2025d) and EO-VLM (Seo et al., 2025), further enhance the practicality of such slowdown attacks under black-box access. Together, these studies show that in MLLMs, overthinking is a concrete attack surface that can be exploited to inflate inference cost and degrade system efficiency.

3.4.2 Risks of Unbounded Drift

LingoLoop (Fu et al., 2025) and RECITE (Gao et al., 2025) reveal a multimodal form of unbounded drift in image-conditioned generation. The former traps the model in linguistically constrained visual descriptions, while the latter induces repetitive visual recall, driving visual-language interaction into a self-reinforcing decoding loop.

Key Insights. Existing studies have revealed significant vulnerabilities in MLLMs, but most focus on single-modal attacks, with limited attention to cross-modal threats. Furthermore, the potential risks in audio and video modalities have not been fully revealed.

3.5 Attack Surfaces in Agentic Systems

Agentic systems enable large language models to interact with real-world environments, but they also introduce new attack surfaces that may increase real-world risks. LLM-based agent systems have also been shown to be vulnerable to resource consumption attacks. LeechHijack (Zhang et al., 2025b) injects auxiliary workloads into an agent's reasoning loop, covertly steering it to perform attacker-specified computation while maintaining seemingly normal outputs. CORBA (Zhou et al., 2025) targets multi-agent systems by propagating self-replicating prompts across agents, inducing recursive interactions that waste computational resources and undermine system availability. Computer-use agents such as OpenClaw (Steinberger, 2026) and Cloud Code (Anthropic, 2025) are designed to interact directly with operating systems.
During execution, these agents may spawn persistent background processes without clear termination conditions, leading to prolonged server-side resource consumption (Shapira et al., 2026).

Beyond computer-use settings, related resource consumption risks have also been observed in other domains. In autonomous driving, existing attacks primarily target perception pipelines, where adversaries increase detection proposals or tracked objects to induce inference latency (Chen et al., 2021, 2024; Muller et al., 2025; Ma et al., 2024). CP-FREEZER (Wang et al., 2025a) focuses on cooperative perception scenarios, where perturbations applied to shared features increase the computational overhead of non-maximum suppression (NMS). SlowLiDAR (Liu et al., 2023) further extends resource consumption from the visual modality to the point cloud modality. Agents deployed on edge devices are likewise vulnerable to attacks that significantly increase energy consumption and inference latency (Hasan et al., 2025b).

Key Insights. Existing studies reveal vulnerabilities in resource consumption within agentic systems, yet analysis of resource use across agent components remains limited. In particular, the resource consumption characteristics of agent memory mechanisms are still largely unexplored. Moreover, many downstream applications of agentic systems have not been systematically studied from the perspective of resource consumption.

4 Mechanisms Underlying Resource Consumption Behaviors

This section reviews mechanistic and interpretability research on the emergence of resource-intensive behaviors in autoregressive generation. Additional technical details are provided in Appendix G.

4.1 Mechanisms Behind Overthinking

Mechanistic analysis of overthinking remains limited.
Unlike unbounded drift, overthinking often stays task-relevant and may appear as an extreme extension of otherwise normal generation, making its causal boundary harder to isolate. As a result, existing studies provide only partial clues rather than a mature taxonomy of mechanisms. Prior work suggests that unnecessarily prolonged generation may arise from both probability-level trapping effects, such as high-inflow tokens that bias decoding toward extended trajectories (Fu et al., 2021), and attention irregularities that drive repeated divergence into loosely related continuations (Yona et al., 2025b). Together, these findings offer preliminary evidence that overthinking may be sustained by local decoding dynamics even when generation remains superficially coherent.

Key Insights. Output prolongation stems from attention divergence and probability traps. However, current interpretability research focuses on simple failures, leaving verbose reasoning and overthinking in Chain-of-Thought settings largely underexplored.

4.2 Mechanisms Behind Unbounded Drift

Theoretical Foundations. Early studies on degenerative generation provide initial clues about the roots of unbounded drift. Prior work suggests that autoregressive decoding can be drawn into repetitive trajectories by attractor-like token dynamics or collapse toward a limited subset of tokens (Fu et al., 2021; Ildiz et al., 2024). Related evidence also points to training objectives as a contributing factor: unlikelihood training (Welleck et al., 2019) was introduced to suppress repetitive outputs beyond standard likelihood optimization, while subsequent work argues that such auxiliary penalties may still be insufficient to fully correct the underlying probability skew (Lin et al., 2021). Together, these studies suggest that unbounded drift may reflect not only inference-time failures but also deeper biases in sequence modeling and training.
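The unlikelihood objective mentioned above augments the standard likelihood loss with penalties on negative candidates (e.g., tokens already generated in the context). A minimal numeric sketch, simplified from the token-level objective of Welleck et al. (2019), with an illustrative weighting term alpha:

```python
import math

def token_loss(p_target: float, p_negatives: list, alpha: float = 1.0) -> float:
    """Likelihood loss on the target token plus an unlikelihood penalty
    -log(1 - p(c)) on each negative candidate c (e.g., repeated tokens)."""
    nll = -math.log(p_target)
    unlikelihood = -sum(math.log(1.0 - p_c) for p_c in p_negatives)
    return nll + alpha * unlikelihood

# A model that puts 40% of its mass on a previously generated token pays
# extra loss; when repetition mass is tiny, the penalty nearly vanishes.
print(token_loss(0.5, [0.4]))
print(token_loss(0.5, [0.01]))
```

The penalty grows without bound as probability mass on a repeated token approaches 1, which is what pushes the model away from the degenerate repetition mode during training.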
Subsequent work has moved beyond output-level penalization toward intervening at the source of repetition. HAPAX (Sahin et al., 2025), for example, excludes induction-head-predictable tokens from the training objective, while Repetition Dropout (Li et al., 2023) reduces reliance on repeated context by randomly dropping attention to repeated words during training. Taken together, these studies suggest that unbounded drift is shaped not only by inference-time dynamics but also by the objectives and optimization signals used during training. Notably, however, this line of analysis has so far been developed primarily for text-based LLMs, with a much less mechanistic understanding of reasoning or multimodal settings.

Circuit and Representation Mechanisms. From a mechanistic perspective, unbounded drift appears to arise not from a single faulty component, but from coordinated repetition-promoting circuits spanning attention heads, downstream neurons, and latent features. Prior work first identifies repetition-related neurons distributed across layers, with intermediate layers detecting repeated patterns and higher layers amplifying context copying (Hiraoka and Inui, 2025). Layer-wise causal analysis further suggests that these neurons do not operate in isolation, but often function downstream of induction heads that propagate copying signals (Doan et al., 2025). Moving beyond individual units, Sparse Autoencoder analyses reveal latent directions that explicitly encode repetition-related features (Yao et al., 2025), indicating that repetitive behavior is represented at the feature level rather than emerging as a purely surface-level artifact. This view is further reinforced by studies connecting induction heads to in-context learning and repetition dynamics (Crosbie and Shutova, 2025; Wang et al., 2025b; Mahaut and Franzon, 2025).
Taken together, these findings suggest that unbounded drift reflects a structured internal bias toward copying and repetition, which can be amplified into persistent looping once normal stopping behavior breaks down.

Behavioral Dynamics. At the behavioral level, repetition exhibits strong self-reinforcing dynamics once triggered. Pseudo-Repetition Penalization (Xu et al., 2022) shows that, after repetition begins, the probability of generating identical content rapidly increases, forming a closed loop. Complementarily, uncertainty-driven fallback modeling (Ivgi et al., 2024) interprets repetition as a fallback behavior under uncertainty, where models revert to low-entropy states by repeating prior text when unable to identify a plausible continuation.

Key Insights. Unbounded drift stems from both model architecture and training methods. While decoding-time interventions offer immediate relief, training-level approaches may provide more stable, fundamental mitigation by addressing root causes rather than surface-level symptoms.

5 Mitigation Strategies for Resource Consumption Threats

This section reviews existing mitigation methods for threats to resource consumption. Importantly, these methods are not equally security-oriented: many aim to improve efficiency under benign conditions, whereas only a smaller subset explicitly defends against adversarial resource amplification. Detailed technical descriptions of mitigation strategies are provided in Appendix H.

5.1 Mitigating Overthinking

LLMs and RLLMs. For reasoning models, most existing mitigations are better understood as efficiency-oriented interventions rather than defenses explicitly designed against adversarial resource amplification.
Overthinking is often studied as an inference-efficiency problem; current methods mainly aim to reduce unnecessary reasoning length under benign conditions, while only indirectly mitigating attack-induced overhead.

At the training level, these methods aim to foster awareness of reasoning length. Wu et al. (2025) introduce budget signals through reward shaping, while parameter-space tuning identifies controllable directions that compress reasoning chains within a single model (Ma et al., 2025).

At the decoding level, lightweight interventions suppress redundant reasoning without retraining, for example, by stopping unnecessary reflection once the model is sufficiently certain, or by interrupting repetitive hidden-state patterns and injecting control vectors to regulate thinking depth (Huang et al., 2025b; Xie et al., 2025; Liu et al., 2025a; Lin et al., 2025; He and Tang, 2025).

At the system level, resource allocation is adjusted more explicitly. Difficulty-adaptive methods allocate inference budget according to query complexity (Wang et al., 2025e), while token-budget-aware prompting and monitoring-of-thought frameworks terminate unnecessary reasoning externally without modifying model parameters (Han et al., 2025; Zhu et al., 2025b).

Overall, these methods show that overthinking can be mitigated at multiple levels, but the current literature remains centered on efficiency optimization rather than attack-specific defense.

Multimodal Large Language Models (MLLMs). In multimodal settings, overthinking can be amplified by both unnecessary reasoning and large external contexts. Certainty-based routing (Lu et al., 2025) dynamically invokes CoT only when the model is uncertain, thereby avoiding prolonged reasoning on simpler inputs. Memory-augmented frameworks (Bhat et al.
, 2025; Ottem, 2025) further mitigate context bloat through structured memory modules and verification-centric filtering, significantly reducing context size without degrading retrieval quality.

Key Insights. Existing mitigations have shown effectiveness in alleviating overthinking. However, current methods predominantly focus on efficiency-oriented mitigations rather than attack-aware defenses, and explicit mitigations for autonomous agent pipelines remain largely unexplored.

5.2 Mitigating Unbounded Drift

General LLMs. Unbounded drift in text generation often manifests as repetitive or uncontrolled decoding. As a result, mitigation strategies in general LLMs mainly focus on detecting or suppressing repetition to prevent outputs from degenerating into runaway generation loops.

At the model level, existing methods mitigate degeneration either through training-time objective design or decoding-time control. Unlikelihood training (Welleck et al., 2019) reduces degenerate text by penalizing repeated tokens, while RAP (Huang et al., 2025a) provides a structured way to tune repetition penalties during decoding while maintaining task performance.

At the system level, robustness is enforced through runtime detection and scheduling safeguards. For instance, RecurrentDetector (Yu et al., 2025) identifies repetitive activation patterns to halt infinite loops. Additionally, PD3F (Zhang et al., 2025d) combines request scheduling with output EOS amplification. Queue-based architectures also help stabilize throughput under heavy workloads (Barek et al., 2025).

Overall, these methods show that uncontrolled repetitive generation in text models can be effectively mitigated through interventions across training, decoding, and system layers.

Reasoning Models. Dedicated defenses targeting non-terminating reasoning in RLLMs remain scarce. Existing techniques such as WSC (Xie et al.
, 2025) and self-affirmation suppression (Liu et al., 2025a) primarily focus on removing redundant reasoning during decoding, which can incidentally alleviate mild repetitive behaviors.

Agentic Systems. Agentic systems introduce additional vulnerabilities due to continuous perception pipelines. Background-attentive adversarial training (Wang et al., 2025c) improves robustness by incorporating perturbation-aware training objectives, effectively restoring inference speed and stability under resource abuse on edge devices.

Key Insights. System-level frameworks effectively mitigate crash-level threats and DoS-style attacks in general LLMs. However, purpose-built crash defenses for MLLMs remain largely absent.

6 Open Challenges and Future Directions

6.1 From Efficiency Optimization to Security Guarantees

A useful distinction is between efficiency-oriented mitigation and security-oriented defense. The former treats excessive generation primarily as a cost problem under benign workloads, whereas the latter treats it as an adversarial resource-amplification threat that requires robustness guarantees. Much of the literature still frames excessive generation as a cost-latency trade-off, aiming to reduce average output length under benign conditions rather than defend against adversarial resource amplification (Jiang et al., 2024; Agrawal et al., 2024; Del Corro et al., 2023).

Looking forward, two research directions appear especially important. First, future research should move from heuristic efficiency control to security-oriented budget protection, treating resource usage as a constrained attack surface rather than merely an optimization target. Second, defenses should protect not only output-level resource usage but also process-level computation, including intermediate reasoning, tool interactions, and multi-turn execution trajectories.
Together, these directions would help elevate resource control from an efficiency concern to a security objective for reliable and sustainable LLMs.

6.2 Understanding Threat Mechanisms

A major limitation of current research is the lack of a unified mechanistic understanding of resource amplification. Existing studies largely focus on specific attack instances, such as reasoning over-expansion or image-coded induction (Gao et al., 2025; Fu et al., 2025), while offering little general theory of how generation trajectories expand under attack manipulation. Future work should therefore integrate the shared dynamics and model-specific properties of different generative systems to develop transferable theories that support cross-model analysis and principled defense design.

6.3 Resource Abuse in LLM Ecosystems

LLMs expose an increasingly serious form of resource abuse, yet existing studies remain limited to a small number of attack surfaces (Zhang et al., 2025b; Zhou et al., 2026). Compared with model-level attacks, these threats are harder to characterize because resource consumption may accumulate across tool calls, intermediate services, memory updates, and multi-agent coordination, while the final output can still appear normal (Lee et al., 2026). Future research should therefore move beyond isolated case studies toward a systematic understanding of agent-level resource abuse, including component-wise resource accounting, cross-step dependency analysis, and defenses that can detect or interrupt malicious amplification throughout execution. This progress is increasingly important as agentic systems are deployed in real-world workflows, where resource abuse can directly undermine the reliability and sustainability of the services they support.
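To make component-wise resource accounting concrete, the sketch below tracks token usage per pipeline component and interrupts execution once a per-component budget is exhausted. This is a minimal illustration of the idea, not an existing framework: the `ResourceLedger` class, the component names, and the budget values are all assumptions introduced here for exposition.

```python
# Minimal sketch of component-wise resource accounting for an agent
# pipeline. Class name, components, and budgets are illustrative
# assumptions, not an existing defense framework.
from collections import defaultdict


class BudgetExceeded(Exception):
    """Raised to interrupt execution once a component's budget is exhausted."""


class ResourceLedger:
    def __init__(self, budgets):
        # budgets: per-component token budgets, e.g. {"tool_calls": 500}
        self.budgets = budgets
        self.usage = defaultdict(int)

    def charge(self, component, tokens):
        """Record token usage for one component and enforce its budget."""
        self.usage[component] += tokens
        if self.usage[component] > self.budgets.get(component, float("inf")):
            raise BudgetExceeded(
                f"{component} used {self.usage[component]} tokens "
                f"(budget {self.budgets[component]})"
            )

    def report(self):
        # Component-wise accounting makes hidden amplification visible
        # even when the final answer still looks normal.
        return dict(self.usage)


ledger = ResourceLedger({"reasoning": 1000, "tool_calls": 500, "memory": 200})
ledger.charge("reasoning", 400)
ledger.charge("tool_calls", 120)
try:
    ledger.charge("tool_calls", 600)  # covert amplification via tool calls
except BudgetExceeded:
    pass  # the defense interrupts the run instead of absorbing the cost
```

The point of the sketch is the placement of the check: enforcement happens at every intermediate charge, so amplification accumulated across tool calls or memory updates is caught mid-execution rather than billed at the end.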
6.4 Toward Standardized Evaluation

A fundamental limitation of current research is the lack of a standardized evaluation framework for resource consumption threats. Existing studies span different model families, modalities, and deployment settings, but they often rely on heterogeneous metrics, which makes cross-study comparison difficult. To address this gap, Appendix F summarizes existing evaluation practices and presents a more unified evaluation framework. In particular, we argue that resource consumption threats should be assessed jointly from model-level behavior, hardware-level pressure, and application-level service impact, so that attacks and defenses can be compared under a common view. Without such a shared protocol, it remains difficult to consistently measure amplification severity, comprehensively evaluate defense coverage, or support downstream governance of resource consumption risks.

7 Conclusion

Resource consumption threats are emerging as an important safety concern for large language models. By inducing excessive and unnecessary generation, they not only reduce efficiency but also create broader risks for system reliability, service availability, and operational cost. This survey provided a unified view of the area through two representative regimes, Overthinking and Unbounded Drift, and reviewed existing work spanning attacks, mechanisms, defenses, and open challenges. Despite recent progress, the field remains fragmented in taxonomy, theory, evaluation, and mitigation. We hope this survey provides a clearer foundation for future research on understanding and mitigating threats to resource consumption and helps advance safer, more sustainable LLMs.

Limitations

This survey has several limitations. First, our discussion centers on resource consumption threats in LLMs, rather than the broader landscape of efficiency, robustness, or availability problems.
In particular, we primarily focus on malicious or adversarial resource abuse that induces excessive generation, and therefore do not aim to comprehensively cover benign efficiency optimization, general systems engineering for acceleration, or all forms of denial-of-service behavior outside the generative process itself. As a result, some closely related work on inference acceleration, scheduling, or hardware optimization is discussed only when it directly informs the security perspective of resource consumption.

Second, the taxonomy proposed in this survey is intended as a unifying conceptual abstraction rather than a complete partition of all possible failure modes. While this distinction helps organize existing studies by the evolution pattern of generation, some attacks may exhibit mixed characteristics, shift between the two regimes across settings, or involve multiple system components simultaneously. This issue is especially relevant in agentic systems, where resource amplification may accumulate across tool calls, memory updates, retrieval steps, and multi-agent coordination, making strict categorization more difficult.

Third, the literature in this area remains highly uneven across model families and attack settings. Existing studies are concentrated on text-based LLMs and a relatively small set of open-source models, while black-box commercial systems remain harder to study systematically. Coverage is also imbalanced across modalities and deployment scenarios: current multimodal work still focuses heavily on image-centered settings, whereas the risks in audio and video modalities remain far less explored; similarly, many downstream agent applications have not yet been systematically examined from the perspective of resource consumption.

Finally, because this is an emerging field, the available empirical evidence remains fragmented, and evaluation practices are not yet standardized.
Many papers report attack success using different metrics, threat models, and deployment assumptions, making direct comparison difficult and potentially limiting the stability of any unified conclusions. For this reason, our synthesis should be understood as a structured overview of the current research landscape rather than a definitive benchmark of attack prevalence, mechanism universality, or defense effectiveness.

Ethical Considerations

This survey reviews resource consumption attacks on LLMs. Although summarizing attack mechanisms may increase understanding of such threats, our goal is strictly defensive: to clarify the threat landscape, support standardized evaluation, and encourage effective mitigation. We therefore focus on conceptual analysis and system-level implications rather than actionable attack instructions.

We further emphasize that resource consumption abuse can have real-world consequences beyond efficiency degradation. In shared LLM infrastructure, excessive generation may reduce service availability, increase operational costs, and harm end users through higher latency, unstable access, and higher usage costs. These risks can become even more severe in agentic systems, where recursive execution and covert workload amplification may disrupt workflows and cause hidden financial loss. By synthesizing this emerging area, we aim to support safer, more reliable, and more sustainable LLMs.

References

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 117–134.

Soogand Alavi, Salar Nozari, and Andrea Luangrath. 2025. Cost transparency of enterprise ai adoption. arXiv preprint arXiv:2511.11761.
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, and 1 others. 2022. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736.

Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, and Thomas Hofmann. 2023. Dynamic context pruning for efficient and interpretable autoregressive transformers. Advances in Neural Information Processing Systems, 36:65202–65223.

Anthropic. 2025. Claude code.

Amit Baras, Alon Zolfi, Yuval Elovici, and Asaf Shabtai. 2025. Quantattack: Exploiting quantization techniques to attack vision transformers. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6730–6740. IEEE.

MD Abdul Barek, Md Bajlur Rashid, Md Mostafizur Rahman, ABM Kamrul Islam Riad, Guillermo Francia, Hossain Shahriar, and Sheikh Iqbal Ahamed. 2025. Vulnerability to stability: Scalable large language model in queue-based web service. In 2025 IEEE 49th Annual Computers, Software, and Applications Conference (COMPSAC), pages 995–1000. IEEE.

Dirk Bergemann, Alessandro Bonatti, and Alex Smolin. 2025. The economics of large language models: Token allocation, fine-tuning, and optimal pricing. In Proceedings of the 26th ACM Conference on Economics and Computation, pages 786–786.

Sachin Bhat, Erik Cambria, and Haibo Peng. 2025. Nmret: A memory-augmented retrieval framework for large language models. Technical report, SenticNet / Nanyang Technological University.

Chen and 1 others. 2021. Translowdown: Efficiency attacks on neural machine translation systems.

Erh-Chung Chen, Pin-Yu Chen, I-Hsin Chung, and Che-Rung Lee. 2024. Overload: Latency attacks on object detection for edge devices. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24716–24725.
Guangyao Chen, Siwei Dong, Yu Shu, Ge Zhang, Jaward Sesay, Börje F Karlsson, Jie Fu, and Yemin Shi. 2023a. Autoagents: A framework for automatic agent generation. arXiv preprint arXiv:2309.17288.

Simin Chen, Hanlin Chen, Mirazul Haque, Cong Liu, and Wei Yang. 2023b. The dark side of dynamic routing neural networks: Towards efficiency backdoor injection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24585–24594.

Simin Chen, Cong Liu, Mirazul Haque, Zihe Song, and Wei Yang. 2022a. Nmtsloth: understanding and testing efficiency degradation of neural machine translation systems. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1148–1160.

Simin Chen, Zihe Song, Mirazul Haque, Cong Liu, and Wei Yang. 2022b. Nicgslowdown: Evaluating the efficiency robustness of neural image caption generation models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15365–15374.

Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. 2025. Do NOT think that much for 2+3=? on the overthinking of long reasoning models. In Forty-second International Conference on Machine Learning.

Antonio Emanuele Cinà, Ambra Demontis, Battista Biggio, Fabio Roli, and Marcello Pelillo. 2025. Energy-latency attacks via sponge poisoning. Information Sciences, 702:121905.

Antonio Emanuele Cinà, Sebastiano Vascon, Ambra Demontis, Battista Biggio, Fabio Roli, and Marcello Pelillo. 2021. The hammer and the nut: Is bilevel optimization really needed to poison linear classifiers? In 2021 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE.

Antonia Creswell and Murray Shanahan. 2022.
Faithful reasoning using large language models. arXiv preprint arXiv:2208.14271.

Joy Crosbie and Ekaterina Shutova. 2025. Induction heads as an essential mechanism for pattern matching in in-context learning. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 5034–5096.

Luciano Del Corro, Allie Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed Awadallah, and Subhabrata Mukherjee. 2023. Skipdecode: Autoregressive skip decoding with batching and caching for efficient llm inference. arXiv preprint arXiv:2307.02628.

Nhi Hoai Doan, Tatsuya Hiraoka, and Kentaro Inui. 2025. Understanding and controlling repetition neurons and induction heads in in-context learning. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 2854–2876.

Ben Dong, Hui Feng, and Qian Wang. 2026. Clawdrain: Exploiting tool-calling chains for stealthy token exhaustion in openclaw agents. Preprint.

Jianshuo Dong, Ziyuan Zhang, Qingjie Zhang, Tianwei Zhang, Hao Wang, Hewu Li, Qi Li, Chao Zhang, Ke Xu, and Han Qiu. 2024. An engorgio prompt makes large language model babble on. arXiv preprint arXiv:2412.19394.

Erin Feiglin, Nir Hutnik, and Raz Lapid. 2026. Benchoverflow: Measuring overflow in large language models via plain-text prompts. arXiv preprint arXiv:2601.08490.

Xiaoning Feng, Xiaohong Han, Simin Chen, and Wei Yang. 2024. Llmeffichecker: Understanding and testing efficiency degradation of large language models. ACM Transactions on Software Engineering and Methodology, 33(7):1–38.

Jared Fernandez, Clara Na, Vashisth Tiwari, Yonatan Bisk, Sasha Luccioni, and Emma Strubell. 2025. Energy considerations of large language model inference and efficiency optimizations.
In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 32556–32569.

Jiyuan Fu, Kaixun Jiang, Lingyi Hong, Jinglun Li, Haijing Guo, Dingkang Yang, Zhaoyu Chen, and Wenqiang Zhang. 2025. Lingoloop attack: Trapping mllms via linguistic context and state entrapment into endless loops. arXiv preprint.

Zihao Fu, Wai Lam, Anthony Man-Cho So, and Bei Shi. 2021. A theoretical analysis of the repetition problem in text generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 12848–12856.

Pascale Fung, Yoram Bachrach, Asli Celikyilmaz, Kamalika Chaudhuri, Delong Chen, Willy Chung, Emmanuel Dupoux, Hongyu Gong, Hervé Jégou, Alessandro Lazaric, and 1 others. 2025. Embodied ai agents: Modeling the world. arXiv preprint arXiv:2506.22355.

Haoran Gao, Yuanhe Zhang, Zhenhong Zhou, Lei Jiang, Fanyu Meng, Yujia Xiao, Li Sun, Kun Wang, Yang Liu, and Junlan Feng. 2025. Resource consumption red-teaming for large vision-language models. arXiv preprint arXiv:2507.18053.

Kuofeng Gao, Yang Bai, Jindong Gu, Shu-Tao Xia, Philip Torr, Zhifeng Li, and Wei Liu. 2024a. Inducing high energy-latency of large vision-language models with verbose images. arXiv preprint arXiv:2401.11170.

Kuofeng Gao, Jindong Gu, Yang Bai, Shu-Tao Xia, Philip Torr, Wei Liu, and Zhifeng Li. 2024b. Energy-latency manipulation of multi-modal large language models via verbose samples. arXiv preprint arXiv:2404.16557.

Kuofeng Gao, Tianyu Pang, Chao Du, Yong Yang, Shu-Tao Xia, and Min Lin. 2024c. Denial-of-service poisoning attacks against large language models. arXiv preprint arXiv:2410.10760.

Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, and Tom Goldstein. 2024. Coercing llms to do and reveal (almost) anything. arXiv preprint arXiv:2402.14020.
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, and 1 others. 2025a. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.

Yi Guo, Kanchana Thilakarathna, Nirhoshan Sivaroopan, Jo Plested, Tim Lynar, Jack Yang, Wangli Yang, and 1 others. 2025b. Prompt-induced overgeneration as denial-of-service: A black-box attack-side benchmark. arXiv preprint arXiv:2512.23779.

Ghaith Hammouri, Kemal Derya, and Berk Sunar. 2025. Non-halting queries: Exploiting fixed points in llms. In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 1–22. IEEE.

Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. 2025. Token-budget-aware llm reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 24842–24855.

Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu. 2023. Reasoning with language model is planning with world model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 8154–8173.

Syed Mhamudul Hasan, Hussein Zangoti, Iraklis Anagnostopoulos, and Abdur R Shahid. 2025a. Sponge attacks on sensing ai: Energy-latency vulnerabilities and defense via model pruning. arXiv preprint arXiv:2505.06454.

Syed Mhamudul Hasan, Hussein Zangoti, Iraklis Anagnostopoulos, and Abdur R. Shahid. 2025b. Sponge attacks on sensing ai: Energy-latency vulnerabilities and defense via model pruning. Preprint.

Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, and 1 others. 2025. Skywork open reasoner 1 technical report. arXiv preprint arXiv:2505.22312.

Yixin He and Lumingyuan Tang. 2025. Learning to ponder: Adaptive reasoning in latent space. arXiv preprint arXiv:2509.24238.
Tatsuya Hiraoka and Kentaro Inui. 2025. Repetition neurons: How do language models produce repetitions? In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 483–495.

Sanghyun Hong, Yiğitcan Kaya, Ionuţ-Vlad Modoranu, and Tudor Dumitraş. 2020. A panda? no, it's a sloth: Slowdown attacks on adaptive multi-exit neural network inference. arXiv preprint arXiv:2010.02432.

Donghao Huang, Thanh-Son Nguyen, Fiona Liausvia, and Zhaoxia Wang. 2025a. Rap: A metric for balancing repetition and performance in open-source large language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1479–1496.

Jiameng Huang, Baijiong Lin, Guhao Feng, Jierun Chen, Di He, and Lu Hou. 2025b. Efficient reasoning for large reasoning language models via certainty-guided reflection suppression. arXiv preprint arXiv:2508.05337.

Junquan Huang, Haotian Wu, Yubo Gao, Yibo Yan, Junyan Zhang, Yonghua Hei, Song Dai, Jie Zhang, Puay Siew Tan, and Xuming Hu. 2025c. Effireasonbench: A unified benchmark for evaluating and advancing efficient reasoning in large language models. arXiv preprint arXiv:2511.10201.

Ruixuan Huang, Qingyue Wang, Hantao Huang, Yudong Gao, Dong Chen, Shuai Wang, and Wei Wang. 2025d. Repetitioncurse: Measuring and understanding router imbalance in mixture-of-experts llms under dos stress. Preprint.

Muhammed Emrullah Ildiz, Yixiao Huang, Yingcong Li, Ankit Singh Rawat, and Samet Oymak. 2024. From self-attention to markov models: Unveiling the dynamics of generative transformers. In International Conference on Machine Learning, pages 20955–20982. PMLR.

Maor Ivgi, Ori Yoran, Jonathan Berant, and Mor Geva. 2024.
From loops to oops: Fallback behaviors of language models under uncertainty. arXiv preprint arXiv:2407.06071.

Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, and Radha Poovendran. 2025. Safechain: Safety of language models with long chain-of-thought reasoning capabilities. In Findings of the Association for Computational Linguistics: ACL 2025, pages 23303–23320.

Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2024. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1658–1677.

Abhinav Kumar, Jaechul Roh, Ali Naseh, Marzena Karpinska, Mohit Iyyer, Amir Houmansadr, and Eugene Bagdasarian. 2025. Overthink: Slowdown attacks on reasoning llms. Preprint, arXiv:2502.02542.

Yohan Lee, Jisoo Jang, Seoyeon Choi, Sangyeop Kim, and Seungtaek Choi. 2026. Overthinking loops in agents: A structural risk via mcp tools. arXiv preprint arXiv:2602.14798.

Huayang Li, Tian Lan, Zihao Fu, Deng Cai, Lemao Liu, Nigel Collier, Taro Watanabe, and Yixuan Su. 2023. Repetition in repetition out: Towards understanding neural text degeneration from the data perspective. Advances in Neural Information Processing Systems, 36:72888–72903.

Xingyu Li, Xiaolei Liu, Cheng Liu, Yixiao Xu, Kangyi Ding, Bangzhou Xin, and Jia-Li Yin. 2025a. Loopllm: Transferable energy-latency attacks in llms via repetitive generation. arXiv preprint arXiv:2511.07876.

Yunzhe Li, Jianan Wang, Hongzi Zhu, James Lin, Shan Chang, and Minyi Guo. 2025b. Thinktrap: Denial-of-service attacks against black-box llm services via infinite thinking. arXiv preprint arXiv:2512.07086.

Xiang Lin, Simeng Han, and Shafiq Joty. 2021. Straight to the gradient: Learning to use novel tokens for neural text generation.
In International Conference on Machine Learning, pages 6642–6653. PMLR.

Zhengkai Lin, Zhihang Fu, Ze Chen, Chao Chen, Liang Xie, Wenxiao Wang, Deng Cai, Zheng Wang, and Jieping Ye. 2025. Controlling thinking speed in reasoning models. arXiv preprint arXiv:2507.03704.

Jona te Lintelo, Stefanos Koffas, and Stjepan Picek. 2024. The skipsponge attack: Sponge weight poisoning of deep neural networks. arXiv preprint arXiv:2402.06357.

Han Liu, Yuhao Wu, Zhiyuan Yu, Yevgeniy Vorobeychik, and Ning Zhang. 2023. Slowlidar: Increasing the latency of lidar-based detection using adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5146–5155.

Kaiyuan Liu, Chen Shen, Zhanwei Zhang, Junjie Liu, Xiaosong Yuan, and 1 others. 2025a. Efficient reasoning through suppression of self-affirmation reflections in large reasoning models. arXiv preprint arXiv:2506.12353.

Shuaitong Liu, Renjue Li, Lijia Yu, Lijun Zhang, Zhiming Liu, and Gaojie Jin. 2025b. Badthink: Triggered overthinking attacks on chain-of-thought reasoning in large language models. Preprint.

Nitin Lodha. 2025. Tokenops: A compiler-style architecture for token optimization in llm api workflows. Technical report, Chitrangana.com.

Jiahao Lu, Yifan Zhang, Qiuhong Shen, Xinchao Wang, and Shuicheng Yan. 2024. Poison-splat: Computation cost attack on 3d gaussian splatting. arXiv preprint arXiv:2410.08190.

Jinghui Lu, Haiyang Yu, Siliang Xu, Shiwei Ran, Guozhi Tang, Siqi Wang, Bin Shan, Teng Fu, Hao Feng, Jingqun Tang, and 1 others. 2025. Prolonged reasoning is not all you need: Certainty-based adaptive routing for efficient llm/mllm reasoning. arXiv preprint arXiv:2505.15154.

Chen Ma, Ningfei Wang, Qi Alfred Chen, and Chao Shen. 2024. Slowtrack: Increasing the latency of camera-based perception in autonomous driving using adversarial examples.
Proceedings of the AAAI Conference on Artificial Intelligence, 38(5):4062–4070.

Xinyin Ma, Guangnian Wan, Runpeng Yu, Gongfan Fang, and Xinchao Wang. 2025. Cot-valve: Length-compressible chain-of-thought tuning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6025–6035.

Matéo Mahaut and Francesca Franzon. 2025. Repetitions are not all alike: distinct mechanisms sustain repetition in language models. arXiv preprint arXiv:2504.01100.

Eran Malach. 2024. Auto-regressive next-token predictors are universal learners. In Proceedings of the 41st International Conference on Machine Learning, pages 34417–34431.

Hanene FZ Brachemi Meftah, Wassim Hamidouche, Sid Ahmed Fezza, Olivier Déforges, and Kassem Kallas. 2025. Energy backdoor attack to deep neural networks. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE.

Xupeng Miao, Chunan Shi, Jiangfei Duan, Xiaoli Xi, Dahua Lin, Bin Cui, and Zhihao Jia. 2024. Spotserve: Serving generative large language models on preemptible instances. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 1112–1127.

Andreas Müller and Erwin Quiring. 2024. The impact of uniform inputs on activation sparsity and energy-latency attacks in computer vision. In 2024 IEEE Security and Privacy Workshops (SPW), pages 104–111. IEEE.

Raymond Muller, Ruoyu Song, Chenyi Wang, Yuxia Zhan, Jean-Phillipe Monteuuis, Yanmao Man, Ming Li, Ryan Gerdes, Jonathan Petit, and Z. Berkay Celik. 2025. Investigating physical latency attacks against camera-based perception. In 2025 IEEE Symposium on Security and Privacy (SP), pages 4588–4605.

Sania Nayab, Giulio Rossolini, Marco Simoni, Andrea Saracino, Giorgio Buttazzo, Nicolamaria Manes, and Fabrizio Giacomelli. 2024.
Concise thoughts: Impact of output length on llm reasoning and cost. arXiv preprint arXiv:2407.19825.

Andreas Ottem. 2025. Meve: A modular system for memory verification and effective context control in language models. In CS & IT Conference Proceedings, volume 15. CS & IT Conference Proceedings.

Vijay Janapa Reddi. 2025. Generative ai at the edge: challenges and opportunities: the next phase in ai deployment. Queue, 23(2):79–137.

Kerem Sahin, Sheridan Feucht, Adam Belfki, Jannik Brinkmann, Aaron Mueller, David Bau, and Chris Wendler. 2025. In-context learning without copying. arXiv preprint arXiv:2511.05743.

Siddharth Samsi, Dan Zhao, Joseph McDonald, Baolin Li, Adam Michaleas, Michael Jones, William Bergeron, Jeremy Kepner, Devesh Tiwari, and Vijay Gadepally. 2023. From words to watts: Benchmarking the energy costs of large language model inference. In 2023 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–9. IEEE.

Coen Schoof, Stefanos Koffas, Mauro Conti, and Stjepan Picek. 2024. Beyond phantomsponges: Enhancing sponge attack on object detection models. In Proceedings of the 2024 ACM Workshop on Wireless Security and Machine Learning, pages 14–19.

Minjae Seo, Myoungsung You, Junhee Lee, Jaehan Kim, Hwanjo Heo, Jintae Oh, and Jinwoo Kim. 2025. Eo-vlm: Vlm-guided energy overload attacks on vision models. arXiv preprint arXiv:2504.08205.

Avishag Shapira, Alon Zolfi, Luca Demetrio, Battista Biggio, and Asaf Shabtai. 2023. Phantom sponges: Exploiting non-maximum suppression to attack deep object detectors. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4571–4580.
Natalie Shapira, Chris Wendler, Avery Yen, Gabriele Sarti, Koyena Pal, Olivia Floody, Adam Belfki, Alex Loftus, Aditya Ratan Jannali, Nikhil Prakash, Jasmine Cui, Giordano Rogers, Jannik Brinkmann, Can Rager, Amir Zur, Michael Ripa, Aruna Sankaranarayanan, David Atkinson, Rohit Gandikota, and 19 others. 2026. Agents of chaos. Preprint.

Ilia Shumailov, Yiren Zhao, Daniel Bates, Nicolas Papernot, Robert Mullins, and Ross Anderson. 2021. Sponge examples: Energy-latency attacks on neural networks. In 2021 IEEE European Symposium on Security and Privacy (EuroS&P), pages 212–231. IEEE.

Wai Man Si, Michael Backes, and Yang Zhang. 2023. Mondrian: Prompt abstraction attack against large language models for cheaper api pricing. arXiv preprint arXiv:2308.03558.

Wai Man Si, Mingjie Li, Michael Backes, and Yang Zhang. 2025. Excessive reasoning attack on reasoning llms. Preprint.

Peter Steinberger. 2026. Openclaw: Personal ai assistant.

Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Na Zou, and 1 others. 2025. Stop overthinking: A survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419.

Guoheng Sun, Ziyao Wang, Bowei Tian, Meng Liu, Zheyu Shen, Shwai He, Yexiao He, Wanghao Ye, Yiting Wang, and Ang Li. 2025. Coin: Counting the invisible reasoning tokens in commercial opaque llm apis. arXiv preprint arXiv:2505.13778.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Chenyi Wang, Ruoyu Song, Raymond Muller, Jean-Philippe Monteuuis, Z. Berkay Celik, Jonathan Petit, Ryan Gerdes, and Ming Li. 2025a. Cp-freezer: Latency attacks against vehicular cooperative perception.
Preprint.

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023a. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291.

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, and 1 others. 2024a. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345.

Shuxun Wang, Qingyu Yin, Chak Tou Leong, Qiang Zhang, and Linyi Yang. 2025b. Induction head toxicity mechanistically explains repetition curse in large language models. arXiv preprint arXiv:2505.13514.

Siqi Wang, Hailong Yang, Xuezhu Wang, Tongxuan Liu, Pengbo Wang, Xuning Liang, Kejie Ma, Tianyu Feng, Xin You, Yongjun Bao, and 1 others. 2024b. Minions: Accelerating large language model inference with aggregated speculative execution. arXiv preprint arXiv:2402.15678.

Tianyi Wang, Huawei Fan, Yuanchao Shu, Peng Cheng, and Cong Wang. 2026a. Rethinking latency denial-of-service: Attacking the llm serving framework, not the model. arXiv preprint arXiv:2602.07878.

Tianyi Wang, Zichen Wang, Cong Wang, Yuanchao Shu, Ruilong Deng, Peng Cheng, and Jiming Chen. 2025c. Can't slow me down: Learning robust and hardware-adaptive object detectors against latency attacks for edge devices. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19230–19240.

Xiasi Wang, Tianliang Yao, Simin Chen, Runqi Wang, Lei Ye, Kuofeng Gao, Yi Huang, and Yuan Yao. 2025d. Vlminferslow: Evaluating the efficiency robustness of large vision-language models as a service. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16035–16050.

Xinglin Wang, Shaoxiong Feng, Yiwei Li, Peiwen Yuan, Yueqi Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, and Kan Li. 2025e.
Make every penny count: Difficulty-adaptive self-consistency for cost-efficient reasoning. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 6904–6917.

Yanlin Wang, Jiadong Wu, Tianyue Jiang, Mingwei Liu, Jiachi Chen, Chong Wang, Ensheng Shi, Xilin Liu, Yuchi Ma, and Zibin Zheng. 2025f. Draincode: Stealthy energy consumption attacks on retrieval-augmented code generation via context poisoning. In 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 778–790. IEEE.

Zijian Wang, Shuo Huang, Yujin Huang, and Helei Cui. 2023b. Energy-latency attacks to on-device neural networks via sponge poisoning. In Proceedings of the 2023 Secure and Trustworthy Deep Learning Systems Workshop, pages 1–11.

Ziwei Wang, Yuanhe Zhang, Jing Chen, Zhenhong Zhou, Ruichao Liang, Ruiying Du, Ju Jia, Cong Wu, and Yang Liu. 2026b. Recur: Resource exhaustion attack via recursive-entropy guided counterfactual utilization and reflection. arXiv preprint arXiv:2602.08214.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, and 1 others. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.

Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. 2019. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319.

Xingyu Wu, Yuchen Yan, Shangke Lyu, Linjuan Wu, Yiwen Qiu, Yongliang Shen, Weiming Lu, Jian Shao, Jun Xiao, and Yueting Zhuang. 2025. Lapo: Internalizing reasoning efficiency via length-adaptive policy optimization. arXiv preprint arXiv:2507.15758.

Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, and 1 others. 2025. The rise and potential of large language model based agents: A survey.
Science China Information Sciences, 68(2):121101.
Wenya Xie, Shaochen Zhong, Hoang Anh Duy Le, Zhaozhuo Xu, Jianwen Xie, and Zirui Liu. 2025. Word salad chopper: Reasoning models waste a ton of decoding budget on useless repetitions, self-knowingly. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 33576–33586.
Jin Xu, Xiaojiang Liu, Jianhao Yan, Deng Cai, Huayang Li, and Jian Li. 2022. Learning to break the loop: Analyzing and mitigating repetitions for neural text generation. Advances in Neural Information Processing Systems, 35:3082–3095.
Xiaobei Yan, Yiming Li, Hao Wang, Han Qiu, and Tianwei Zhang. 2025. BitHydra: Towards bit-flip inference cost attack against large language models. arXiv preprint arXiv:2505.16670.
Junchi Yao, Shu Yang, Jianhua Xu, Lijie Hu, Mengdi Li, and Di Wang. 2025. Understanding the repeat curse in large language models from a feature perspective. In Findings of the Association for Computational Linguistics: ACL 2025, pages 7787–7815, Vienna, Austria. Association for Computational Linguistics.
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.
Biao Yi, Zekun Fei, Jianing Geng, Tong Li, Lihai Nie, Zheli Liu, and Yiming Li. 2025. BadReasoner: Planting tunable overthinking backdoors into large reasoning models for fun or profit. Preprint.
Itay Yona, Ilia Shumailov, Jamie Hayes, Federico Barbero, and Yossi Gandelsman. 2025a. Interpreting the repeated token phenomenon in large language models. arXiv preprint arXiv:2503.08908.
Itay Yona, Ilia Shumailov, Jamie Hayes, and Yossi Gandelsman. 2025b. Interpreting the repeated token phenomenon in large language models.
In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 72535–72555. PMLR.
Junzhe Yu, Yi Liu, Huijia Sun, Ling Shi, and Yuqi Chen. 2025. Breaking the loop: Detecting and mitigating denial-of-service vulnerabilities in large language models. arXiv preprint arXiv:2503.00416.
Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. 2024. InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents. In Findings of the Association for Computational Linguistics: ACL 2024, pages 10471–10506.
Rui Zhang, Zihan Wang, Tianli Yang, Hongwei Li, Wenbo Jiang, Qingchuan Zhao, Yang Liu, and Guowen Xu. 2025a. Hidden tail: Adversarial image causing stealthy resource consumption in vision-language models. arXiv preprint arXiv:2508.18805.
Yuanhe Zhang, Weiliu Wang, Zhenhong Zhou, Kun Wang, Jie Zhang, Li Sun, Yang Liu, and Sen Su. 2025b. LeechHijack: Covert computational resource exploitation in intelligent agent systems. Preprint.
Yuanhe Zhang, Weiliu Wang, Zhenhong Zhou, Kun Wang, Jie Zhang, Li Sun, Yang Liu, and Sen Su. 2025c. LeechHijack: Covert computational resource exploitation in intelligent agent systems. arXiv preprint arXiv:2512.02321.
Yuanhe Zhang, Xinyue Wang, Haoran Gao, Zhenhong Zhou, Fanyu Meng, Yuyao Zhang, and Sen Su. 2025d. PD3F: A pluggable and dynamic dos-defense framework against resource consumption attacks targeting large language models. arXiv preprint arXiv:2505.18680.
Yuanhe Zhang, Zhenhong Zhou, Wei Zhang, Xinyue Wang, Xiaojun Jia, Yang Liu, and Sen Su. 2025e. Crabs: Consuming resource via auto-generation for llm-dos attack under black-box settings. In Findings of the Association for Computational Linguistics: ACL 2025, pages 11128–11150.
Shuli Zhao, Qinsheng Hou, Zihan Zhan, Yanhao Wang, Yuchong Xie, Yu Guo, Libo Chen, Shenghong Li, and Zhi Xue. 2025.
Mind your server: A systematic study of parasitic toolchain attacks on the mcp ecosystem. arXiv preprint arXiv:2509.06572.
Yue Zheng, Yuhao Chen, Bin Qian, Xiufang Shi, Yuanchao Shu, and Jiming Chen. 2025. A review on edge large language models: Design, execution, and applications. ACM Computing Surveys, 57(8).
Kaiyu Zhou, Yongsen Zheng, Yicheng He, Meng Xue, Xueluan Gong, Yuji Wang, and Kwok-Yan Lam. 2026. Beyond max tokens: Stealthy resource amplification via tool calling chains in llm agents. arXiv preprint arXiv:2601.10955.
Zhenhong Zhou, Zherui Li, Jie Zhang, Yuanhe Zhang, Kun Wang, Yang Liu, and Qing Guo. 2025. CORBA: Contagious recursive blocking attacks on multi-agent systems based on large language models. Preprint.
Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, and 1 others. 2024. A survey on efficient inference for large language models. arXiv preprint arXiv:2404.14294.
Zhenhao Zhu, Yue Liu, Zhiwei Xu, Yingwei Ma, Hongcheng Gao, Nuo Chen, Yanpei Guo, Wenjie Qu, Huiying Xu, Zifeng Kang, Xinzhong Zhu, and Jiaheng Zhang. 2025a. ExtendAttack: Attacking servers of lrms via extending reasoning. Preprint.
Zihao Zhu, Hongbao Zhang, Ruotong Wang, Ke Xu, Siwei Lyu, and Baoyuan Wu. 2025b. To think or not to think: Exploring the unthinking vulnerability in large reasoning models. arXiv preprint.

A Scope and Inclusion Criteria

This survey focuses on threats to resource consumption in LLMs. Our primary focus is on adversarial or malicious resource abuse that increases computational cost through extended, excessive, or uncontrolled generation. Accordingly, we treat attacks on LLMs, reasoning models, multimodal generative systems, and agentic pipelines as the core literature when they induce abnormal output length, non-convergent decoding, recursive execution expansion, or related forms of resource amplification.
We also discuss a limited set of closely related works from vision and systems security, including selected latency and sponge-style attacks, but only in a supporting role. These works are included when they provide historical precursors, analogous threat patterns, or mechanistic insights that help explain later resource abuse in generative systems. They should not be interpreted as equally central instances of the problem studied in this survey.

By contrast, we do not aim to comprehensively review benign efficiency optimization, generic inference acceleration, hardware scheduling, or all forms of denial-of-service and availability attacks outside the generative process itself. Such work is referenced only when it directly informs the security perspective on resource consumption threats or helps contextualize the survey's boundary.

B Preliminary

B.1 Autoregressive Generation

Most modern LLMs produce outputs through an autoregressive decoding process (Vaswani et al., 2017; Malach, 2024). Given an input prompt x, the model generates an output sequence y = (y_1, y_2, . . .
, y_T) token by token according to the conditional distribution p(y_t | x, y_{<t}), so that the probability of the full sequence factorizes as p(y | x) = \prod_{t=1}^{T} p(y_t | x, y_{<t}).

Model | Input | Output
GPT-5.4 (<272K context length) [a] | $5.00 | $22.50
GPT-5.4-pro (<272K context length) [a] | $30.00 | $180.00
GPT-5.4-pro (>272K context length) [a] | $60.00 | $270.00
GPT-5 mini [a] | $0.25 | $2.00
Claude Opus 4.6 [b] | $5.00 | $25.00
Claude Sonnet 4.6 [b] | $3.00 | $15.00
Claude Haiku 4.5 [b] | $1.00 | $5.00
Gemini 2.5 Pro (≤200K prompt) [c] | $1.25 | $10.00
Gemini 2.5 Pro (>200K prompt) [c] | $2.50 | $15.00
Gemini 2.5 Flash [c] | $0.15 | $1.25
Gemini 2.5 Flash-Lite [c] | $0.10 | $0.40
Gemini 2.0 Flash [c] | $0.05 | $0.20
Mistral Large 2 [d] | $2.00 | $6.00
Mistral Medium 3 / 3.1 [d] | $0.40 | $2.00
Mistral Small [d] | $0.20 | $0.60
DeepSeek-V3 [e] | ~$0.27 | ~$1.1
DeepSeek-R1 [e] | ~$0.55 | ~$2.2
[a] OpenAI pricing page: https://developers.openai.com/api/docs/pricing
[b] Anthropic pricing page: https://platform.claude.com/docs/en/about-claude/pricing
[c] Google AI pricing page: https://ai.google.dev/pricing
[d] Mistral pricing page: https://mistral.ai/pricing#api
[e] DeepSeek pricing page: https://api-docs.deepseek.com/quick_start/pricing/
Table 2: API pricing comparison of representative mainstream models. Prices are listed in USD per 1M input/output tokens.

Models | 2020 | 2021 | 2022 | 2023 | 2024 | 2025
Mainstream Models | 0.5K | 2K | 8K | 32K | 128K | 128K
Frontier Models | 1K | 4K | 16K | 128K | 512K | 10M
Table 3: Context window comparison of representative mainstream models.

The two regimes differ in the dynamics of the generation process and in the types of risks they introduce.

Overthinking threats. In the Overthinking regime, the generation process remains semantically aligned with the task but becomes unnecessarily verbose or computationally intensive. Such behavior can arise when prompts or inputs induce the model to produce excessively detailed reasoning (Zhu et al., 2025a), redundant explanations (Kumar et al., 2025), or low-information-density descriptions (Shumailov et al., 2021). This phenomenon is particularly prominent in reasoning-oriented models (Wei et al.
, 2022; Creswell and Shanahan, 2022; Hao et al., 2023; Guo et al., 2025a; He et al., 2025) and agent-based systems (Yao et al., 2022; Xi et al., 2025; Wang et al., 2024a; Chen et al., 2023a), where extended reasoning chains or repeated tool interactions may significantly expand the generation trajectory. For example, adversarial manipulations of reasoning prompts can induce models to generate prolonged chains of intermediate reasoning (Si et al., 2025), dramatically increasing token consumption and latency (Yi et al., 2025). Similar amplification effects have also been observed in multimodal systems, where carefully crafted inputs can trigger extremely verbose textual outputs despite containing limited semantic information (Gao et al., 2025). These behaviors may appear benign at the output level, yet they substantially inflate inference cost and degrade overall system throughput. When scaled across large numbers of queries, such amplification can accumulate into significant economic losses or operational overhead.

Several realistic deployment scenarios illustrate how such behavior can be exploited in practice. First, in commercial LLM services, a malicious or poorly regulated service provider could intentionally design models or prompting pipelines that encourage unnecessarily verbose reasoning or redundant explanations, effectively increasing token consumption and inflating user-side API costs (Zhu et al., 2025b). Because users typically pay for output tokens (Alavi et al., 2025), even subtle increases in verbosity can translate into substantial financial overhead when deployed at scale (Sun et al., 2025; Lodha, 2025; Bergemann et al., 2025).
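The per-token economics behind this scenario can be made concrete with a short calculation. The following sketch is illustrative and not drawn from any surveyed attack; the workload sizes and the 20x amplification factor are hypothetical, while the prices mirror one tier from Table 2.

```python
# Illustrative sketch: because output tokens are priced far above input
# tokens for most commercial APIs (see Table 2), verbosity amplification
# translates almost directly into user-side cost amplification.

def api_cost_usd(input_tokens: int, output_tokens: int,
                 price_in_per_m: float, price_out_per_m: float) -> float:
    """Cost of one request in USD, given per-1M-token prices."""
    return (input_tokens * price_in_per_m
            + output_tokens * price_out_per_m) / 1_000_000

# Hypothetical workload: a 200-token prompt that normally yields ~500
# output tokens, versus the same prompt under 20x verbosity amplification.
PRICE_IN, PRICE_OUT = 3.00, 15.00   # e.g., the Claude Sonnet 4.6 tier in Table 2

benign = api_cost_usd(200, 500, PRICE_IN, PRICE_OUT)
amplified = api_cost_usd(200, 10_000, PRICE_IN, PRICE_OUT)

print(f"benign:    ${benign:.4f} per request")
print(f"amplified: ${amplified:.4f} per request ({amplified / benign:.1f}x)")
```

Note that the cost ratio tracks the output amplification almost exactly because the fixed input cost is negligible, which is why per-query verbosity increases compound into large aggregate losses at scale.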
Second, in agent-based ecosystems where multiple tools, plugins, or intermediate services participate in a task pipeline, adversarial intermediaries may inject additional reasoning steps, auxiliary tasks, or low-value instructions into the agent workflow (Zhan et al., 2024; Zhao et al., 2025; Zhang et al., 2025c). Such manipulations can cause the model to perform extra computations that appear semantically related to the task while silently consuming additional API budget or computational resources (Dong et al., 2026; Wang et al., 2025f).

Third, overthinking behaviors can pose availability risks for resource-constrained deployments, such as small service providers or edge devices (Zheng et al., 2025; Reddi, 2025; Fung et al., 2025). In these settings, excessive generation may significantly delay task completion, block concurrent requests, or exhaust limited compute capacity, thereby reducing system responsiveness and degrading overall service reliability (Wang et al., 2023b; Chen et al., 2021).

Taken together, these scenarios suggest that seemingly benign verbosity in generation can become a practical vector for resource abuse when deployed in real-world LLM ecosystems.

Unbounded Drift, in contrast, refers to generation trajectories that progressively deviate from the intended task and fail to converge within a reasonable progression. Such behavior can arise from adversarial prompts (Gao et al., 2025), decoding perturbations (Geiping et al., 2024), or recursive reasoning (Wang et al., 2026b). This phenomenon is particularly prominent in autoregressive decoding and agentic systems (Wang et al., 2023a), where weakened termination signals, repetitive token dynamics, or recursive tool use may cause the generation trajectory to expand far beyond normal task completion.
For example, carefully crafted malicious inputs can trap models in repetitive decoding loops or non-halting generation, forcing them to continue producing tokens until system-imposed limits intervene (Gao et al., 2025). Similar effects can also emerge in agent environments, where recursive tool-calling or multi-step interaction chains expand the execution trajectory far beyond the original task scope (Zhou et al., 2026; Lee et al., 2026). These behaviors are more overtly destructive than Overthinking, since they not only inflate computational cost but also directly threaten service availability and task completion. Several realistic deployment scenarios illustrate how such behavior can be exploited in practice.

First, malicious users or competing service providers may deliberately craft requests that suppress termination or trigger repetitive generation, forcing a provider to allocate disproportionate compute to a small number of adversarial queries (Gao et al., 2024c). When amplified at scale, such requests can monopolize shared resources, degrade service availability, and in extreme cases resemble denial-of-service attacks against generative LLM infrastructure (Li et al., 2025b; Wang et al., 2026a). Recent work explicitly frames this threat as inference-time DoS, where a small malicious input can monopolize GPU time, queue slots, or memory resources, starving legitimate users (Zhang et al., 2025e).

Second, untrusted intermediaries in user-facing pipelines may inject malicious suffixes (Zhao et al., 2025; Si et al., 2023), hidden instructions, or adversarial constructions into otherwise benign requests, causing the model to enter prolonged decoding trajectories or non-convergent loops. In this setting, the immediate consequence is reduced end-user usability, including longer wait times, unstable responsiveness, and higher usage costs.
This threat is especially concerning because the injected content may remain invisible to end users while still shifting the request into a pathologically expensive generation regime.

Third, in agentic environments, adversarial intermediaries or compromised components may induce recursive tool-calling loops or self-amplifying interaction chains that both block task execution and corrupt task usefulness. Such behaviors not only consume excessive resources but may also trap the agent in stalled workflows, produce low-value or erroneous outputs, and obstruct downstream processes that depend on timely completion (Lee et al., 2026; Zhou et al., 2025). Recent studies show that tool-layer manipulation can expand agent trajectories to more than 60,000 tokens (Zhou et al., 2026), inflate costs by hundreds of times, and substantially reduce co-running throughput, while other attacks can covertly parasitize the user's compute budget by injecting unauthorized auxiliary workloads into apparently legitimate workflows (Zhang et al., 2025b).

Taken together, these scenarios suggest that Unbounded Drift is not merely a generation abnormality, but a practical route through which resource abuse can escalate into service disruption, task failure, and system-level availability degradation.

D.3 Why Resource Consumption Security Matters

The preceding analysis highlights that resource consumption is no longer merely an efficiency issue but an emerging security concern in LLMs. Because modern model services operate on shared computational infrastructure, excessive generation or non-convergent interaction patterns can disproportionately consume limited resources and degrade service availability for other users.
As LLMs continue to scale and become embedded in agentic workflows and real-world deployments, such behaviors may amplify operational costs, disrupt service reliability, and threaten the sustainability of model infrastructure. Despite these risks, existing studies remain fragmented across different models and system settings, lacking unified taxonomies. A clearer conceptual framework for understanding threats to resource consumption is therefore essential for advancing reliable and sustainable LLMs.

E A Unified View of the Survey Organization

To clarify the scope and internal structure of this survey, Figure 4 provides a unified overview of how the literature on resource consumption issues is organized in this work. Specifically, we structure the problem into two primary resource consumption regimes, Overthinking and Unbounded Drift, which serve as the main taxonomy throughout the paper. Building on this taxonomy, the survey is further developed from three complementary perspectives: attacks, which examine how resource amplification is induced in different systems; mechanisms, which analyze the underlying dynamics that lead to excessive or non-convergent generation; and defenses, which summarize current mitigation strategies and their coverage.

Within each perspective, existing studies are further grouped by system type, including large language models, reasoning models, multimodal large language models, and agentic systems. This organization is intended to provide a consistent view of the field: the taxonomy captures what kinds of resource consumption behaviors emerge, while the three research perspectives capture how they are induced, why they arise, and how they may be mitigated. In this way, the figure serves as a compact map of the survey, highlighting both the coverage of current research and the uneven maturity of different branches.
Resource Consumption Threat
- Attacks (§3)
  - Overthinking
    - Large Language Model: Sponge Poisoning (Cinà et al., 2025); Crabs (Zhang et al., 2025e); ThinkTrap (Li et al., 2025b); BitHydra (Yan et al., 2025); RepetitionCurse (Huang et al., 2025d)
    - Multimodal Large Language Model: NICGSlowDown (Chen et al., 2022b); Verbose Image (Gao et al., 2024a); Verbose Samples (Gao et al., 2024b); Uniform Inputs (Müller and Quiring, 2024); QuantAttack (Baras et al., 2025); Phantom Sponges (Shapira et al., 2023); Poison-splat (Lu et al., 2024); Hidden Tail (Zhang et al., 2025a); VLMInferSlow (Wang et al., 2025d); EO-VLM (Seo et al., 2025); Enhanced PhantomSponges (Schoof et al., 2024)
    - Large Reasoning Language Model: Overthinking (Kumar et al., 2025); BadReasoner (Yi et al., 2025); Excessive Reasoning Attack (Si et al., 2025); ExtendAttack (Zhu et al., 2025a); BadThink (Liu et al., 2025b)
    - Agentic System: CLAWDrain (Dong et al., 2026)
  - Unbounded Drift
    - Large Language Model: Non-halting Queries (Hammouri et al., 2025); LoopLLM (Li et al., 2025a); Coercing LLMs (Geiping et al., 2024); Engorgio (Dong et al., 2024); LLMEffiChecker (Feng et al., 2024)
    - Multimodal Large Language Model: LingoLoop (Fu et al., 2025); RECITE (Gao et al., 2025)
    - Large Reasoning Language Model: RECUR (Wang et al., 2026b)
    - Agentic System: Overload (Chen et al., 2024); CPFreezer (Wang et al., 2025a); SlowTrack (Ma et al., 2024); SlowLiDAR (Liu et al., 2023); CORBA (Zhou et al., 2025); LeechHijack (Zhang et al., 2025b)
- Mechanisms (§4)
  - Overthinking
    - Large Language Model: High Inflow Words (Fu et al., 2021); Attention Sink Circuit Patching (Yona et al., 2025b)
  - Unbounded Drift
    - Large Language Model: Winner Takes All (Ildiz et al., 2024); High Inflow Words (Fu et al., 2021); Repetition Neurons (Hiraoka and Inui, 2025); Three-segment Neuron Ablation (Doan et al., 2025); Induction Heads for ICL (Crosbie and Shutova, 2025); Repetition Features (Yao et al., 2025); Induction Head Toxicity (Wang et al., 2025b); Repetitions Are Not All Alike (Mahaut and Franzon, 2025); Pseudo-Repetition Penalization (Xu et al., 2022); Uncertainty-Driven Fallback Modeling (Ivgi et al., 2024); Unlikelihood Training (Welleck et al., 2019); ScaleGrad (Lin et al., 2021); HAPAX (Sahin et al., 2025); Repetition Dropout (Li et al., 2023)
- Defense (§5)
  - Overthinking
    - Large Language Model: CCoT (Nayab et al., 2024)
    - Multimodal Large Language Model: CAR (Lu et al., 2025); NMRet (Bhat et al., 2025); MeVe (Ottem, 2025)
    - Large Reasoning Language Model: LAPO (Wu et al., 2025); CoT-Valve (Ma et al., 2025); CGRS (Huang et al., 2025b); WSC (Xie et al., 2025); Self-Affirmation (Liu et al., 2025a); Controlling Thinking Speed (Lin et al., 2025); FR-Ponder (He and Tang, 2025); TALE (Han et al., 2025); MoT (Zhu et al., 2025b); DSC (Wang et al., 2025e)
  - Unbounded Drift
    - Large Language Model: Unlikelihood Training (Welleck et al., 2019); RAP (Huang et al., 2025a); RecurrentDetector (Yu et al., 2025); PD3F (Zhang et al., 2025d); Queue-based (Barek et al., 2025)
    - Large Reasoning Language Model: WSC (Xie et al., 2025); Self-Affirmation (Liu et al., 2025a)
    - Agentic System: Can't Slow Me Down (Wang et al., 2025c)

Figure 4: Overall organization of resource consumption issues across attack, mechanism, and defense perspectives.

F Towards a Standardized Evaluation Framework for Resource Consumption Threats

Resource consumption threats are increasingly recognized as a critical safety concern in LLMs. While many studies have explored attacks and defenses across different modalities, model families, and deployment scenarios, the literature remains fragmented: evaluation metrics, datasets, and experimental setups vary widely, making cross-study comparison challenging. This appendix provides a structured overview of existing work and, based on observed gaps, proposes a unified evaluation framework to guide future research. The following sections first summarize the current landscape of attacks and defenses, then introduce recommended evaluation guidelines for standardized assessment.

F.1 Summary of Existing Work

Table 4 to Table 8 present representative resource consumption attacks, including the datasets used, evaluation metrics, targeted models, and observed efficiency or latency amplification. Table 9 and Table 10 summarize corresponding defenses and their reported effectiveness. These tables illustrate several key points:

High research activity across multiple modalities and settings: Existing studies cover text-based LLMs, reasoning models, multimodal systems, and agentic environments. Attacks have been demonstrated on a wide range of models, including the GPT family, Gemini, DeepSeek, and various perception-model pipelines.

Heterogeneous evaluation protocols: Different works report distinct metrics, such as token amplification, latency increase, attack success rate, or system-level impact. Similarly, datasets vary from standard benchmarks to real-world agent workloads, reflecting diverse experimental contexts.

Fragmented coverage and limited comparability: Most studies focus on specific attack types, model classes, or deployment scenarios. Direct comparison between methods is complicated by differing metrics, model sizes, and evaluation environments. While this demonstrates the growing interest in threats to resource consumption, it also highlights the need for consistent, standardized assessment.
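As a concrete illustration of the model-level metrics recommended in F.2, the following minimal sketch reports output length (as a token-amplification ratio) and inference latency for a stand-in generator. This is our own illustrative harness, not code from any surveyed work; `generate` is a placeholder for a real model call, and the 20x inflation is a hypothetical attack effect.

```python
# Minimal measurement sketch for the two model-level metrics discussed
# in F.2: output length and inference latency. `generate` stands in for
# a real LLM call; replace it with an actual API or local model.
import time

def generate(prompt: str, amplified: bool = False) -> list[str]:
    """Placeholder generator returning output 'tokens'; a hypothetical
    attack inflates output length 20x."""
    n = 2000 if amplified else 100
    return ["tok"] * n

def measure(prompt: str, amplified: bool) -> dict:
    start = time.perf_counter()
    output = generate(prompt, amplified)
    latency = time.perf_counter() - start
    return {"output_tokens": len(output), "latency_s": latency}

baseline = measure("summarize this report", amplified=False)
attacked = measure("summarize this report", amplified=True)
amplification = attacked["output_tokens"] / baseline["output_tokens"]
print(f"token amplification: {amplification:.1f}x; "
      f"latency: {baseline['latency_s']:.4f}s -> {attacked['latency_s']:.4f}s")
```

Against a real model the same harness would expose the attack-versus-baseline gap in both metrics, which is exactly the comparison most of the surveyed attack papers report in incompatible forms.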
These observations motivate the development of a unified evaluation framework that provides a common basis for comparing attacks and defenses and supports reproducible, comprehensive research.

F.2 Proposed Evaluation Guidelines

Given the heterogeneity of existing studies, we recommend evaluating resource consumption threats from three complementary perspectives: model-level behavior, hardware-level pressure, and application-level impact. This design follows the intuition that excessive generation first manifests as abnormal decoding behavior at the model level, then translates into measurable computational burden on the serving hardware, and finally affects the quality of service in real deployments. Evaluating all three levels together provides a more complete view of both attack severity and defense effectiveness.

At the model level, we recommend reporting output length and inference latency. Output length directly captures the extent to which generation is induced by the attack or controlled by the defense. This metric is particularly important because many resource consumption threats operate by triggering unnecessarily long outputs, repetitive reasoning traces, or non-terminating decoding patterns. It is also highly portable across model families, including standard LLMs, reasoning models, multimodal systems, and agentic components that generate intermediate textual outputs. However, output length alone is insufficient, since the same number of tokens may incur very different computational costs depending on the model architecture, decoding strategy, and modality. We therefore additionally recommend inference latency as a direct measure of time overhead. Latency reflects how long the system takes to process a query under attack or defense, and is especially useful for capturing practical degradation in responsiveness.
Together, output length and latency characterize the immediate behavioral footprint of resource amplification at the model side.

At the hardware level, we recommend reporting GPU utilization and memory utilization. These metrics capture whether excessive generation is actually converted into pressure on the underlying compute infrastructure. GPU utilization measures how intensively the accelerator is occupied during inference, and is useful for identifying whether an attack induces sustained computational saturation rather than merely producing longer outputs. Memory utilization, including both allocated and peak memory usage where possible, reflects the storage burden created by long contexts, repeated decoding steps, or multimodal feature processing. This metric is especially important for large models and multimodal systems, where memory bottlenecks may become the dominant constraint even before compute is fully saturated. Compared with model-level metrics, hardware-level metrics are more deployment-sensitive, since they depend on serving frameworks, batching policies, quantization, and device types. Nevertheless, they are essential when the goal is to assess the actual infrastructure impact of a threat or the practical efficiency gain of a defense.

At the application level, we recommend reporting throughput. Throughput reflects the number of requests or tasks that can be completed within a unit of time, and therefore serves as a direct indicator of service capacity under attack or defense. This metric is particularly important because the ultimate consequence of resource consumption threats is often not limited to a single query becoming slower, but rather to the whole system handling fewer users, fewer concurrent tasks, or fewer agent executions.
Throughput thus captures the service-level manifestation of resource abuse, making it especially relevant for online APIs, multi-user serving platforms, and agentic systems with concurrent workflows. In practice, throughput should be interpreted jointly with latency, since a system may maintain acceptable latency for individual queries while still suffering degraded overall capacity under increased workload.

Taken together, these metrics form a layered evaluation framework. Output length and inference latency capture how the model behaves under resource amplification; GPU and memory utilization measure whether this behavior imposes real hardware burden; and throughput reflects whether such a burden ultimately degrades service capacity in practical deployments. This layered design also improves comparability across settings. Model-level metrics are broadly applicable and should be reported in nearly all studies. Hardware-level metrics are particularly important for systems and deployment-oriented evaluations. Application-level metrics are most informative in realistic serving scenarios, especially for production APIs and agentic environments. We therefore recommend that future studies report at least one metric from each level whenever possible, so that resource consumption threats can be assessed not only as generation anomalies, but also as infrastructure and service risks.

G Detailed Technical Analysis of Resource Consumption Mechanisms

Rather than reiterating the high-level taxonomy presented in the main text, this section provides a more fine-grained technical account of the mechanisms underlying resource consumption behaviors in autoregressive generation.
We organize the discussion across four complementary levels of analysis (Markov chain foundations, circuit-level mechanisms, behavioral dynamics, and training-level causes), so that researchers can more precisely trace how repetition and output prolongation arise from theoretical properties, internal model components, emergent generation patterns, and learning objectives, respectively.

G.1 Markov Chain

By modeling autoregressive generation as a Markov chain and deriving theoretical upper bounds on the Average Repetition Probability (ARP), Fu et al. (2021) identify High Inflow Words: tokens toward which disproportionately many others transition with high probability. These tokens form absorbing-like loops within the transition matrix: once the generative process enters such high-inflow states, it becomes trapped in repetitive cycles rather than progressing toward the EOS token. The proposed rebalanced encoding merges high-inflow pairs into single tokens, effectively reducing the inflow term in the ARP bound and significantly lowering repetition rates in both translation and language modeling tasks. From a complementary angle, Ildiz et al. (2024) establish a formal equivalence between single-layer self-attention and Context-Conditioned Markov Chains (CCMC), and analyze the autoregressive trajectory from a single prompt. Due to the non-mixing nature of CCMC, majority tokens undergo self-reinforcement across generation steps, causing the output distribution to collapse into a singleton or highly limited token subset, which provides a principled mathematical account of why LLMs tend to generate repetitive text during prolonged decoding.

G.2 Circuit-Level Mechanisms

Repetition Neurons and Features. Specific neurons within MLP blocks have been identified as direct executors of repetitive content.
Differential activation analysis (Hiraoka and Inui, 2025) revealed "repetition neurons" distributed throughout the model: intermediate-layer neurons tend to detect repeating patterns, while top-layer neurons drive the model to replicate previous context. Layer-wise causal analysis (Doan et al., 2025) further shows that these neurons often function as downstream components of induction heads, serving to amplify copying signals. Moving beyond individual neurons, Yao et al. (2025) utilized Sparse Autoencoders (SAEs) to identify latent directions that specifically encode "repetition features," indicating that the model explicitly represents repetitive behaviors within its feature space.

Induction Heads. Through prefix-matching score analysis and targeted attention knockouts, Crosbie and Shutova (2025) provide quantitative evidence that induction heads are the fundamental operators of few-shot in-context learning (ICL). Because output repetition functions as an uncontrolled, degenerate form of ICL, Induction Head Descaling (Wang et al., 2025b) formalized the "Induction Head Toxicity" theory to explain this dynamic. From a mechanistic perspective, toxicity occurs when induction heads disproportionately dominate the output logits, thereby suppressing contributions from other attention heads. This logit dominance enforces rigid pattern replication and triggers a rapid entropy collapse in the next-token probability distribution. Adding structural nuance to this, Mahaut and Franzon (2025) revealed that repetition is actually sustained by two distinct, parallel mechanisms. The first is the aforementioned ICL-induced repetition, which relies on a specialized, late-developing circuit of attention heads and MLPs that operate with high prediction confidence.
The second is a naturally occurring repetition mechanism that emerges very early in training; rather than using a dedicated circuit, it functions as a degenerate fallback, with attention weights anomalously collapsing onto low-information, structural tokens (e.g., newlines), sustaining repetitive loops even without explicit contextual prompts.

G.3 Behavioral Dynamics

By manually constructing pseudo-repetitive data and comparing token probabilities across increasing repetition counts, Pseudo-Repetition Penalization (Xu et al., 2022) reveals a critical self-reinforcement effect: once a model generates one repeated sentence, the probability of continuing to repeat rises almost monotonically with the number of historical repetitions, eventually stabilizing at a high ceiling value. Sentences with higher initial probabilities exhibit stronger self-reinforcement, explaining why maximization-based decoding is particularly prone to sentence-level loops. Based on this finding, the proposed DITTO method trains models to exponentially decay repetition probability on pseudo data, significantly reducing repetitions without sacrificing perplexity. From a complementary angle, uncertainty-driven fallback modeling (Ivgi et al., 2024) systematically varies model size, pretraining tokens, and instruction tuning to analyze fallback behaviors under epistemic uncertainty. Their experiments reveal a consistent ordering: as uncertainty increases during generation, models shift from producing correct facts to hallucinations, then to degenerate text, and finally to verbatim sequence repetitions, positioning repetition as the simplest fallback state when parametric knowledge is exhausted.

G.4 Training-Level Causes

Unlikelihood training introduces auxiliary losses to penalize repeated tokens (Welleck et al., 2019).
Experimental results demonstrate that the approach significantly reduces repetition and dullness while maintaining competitive perplexity and token accuracy, and produces higher-quality generations under standard decoding strategies such as greedy search and beam search. However, such losses alone do not resolve the skewed token-level probabilities inherent in MLE, motivating direct gradient manipulation as an alternative. ScaleGrad (Lin et al., 2021) directly modifies the gradients of the training objective to encourage the model to assign higher importance to novel tokens during learning. The approach effectively reduces repetition and enhances diversity while maintaining strong performance, as evidenced by both automatic metrics and human evaluations. A more targeted approach, HAPAX (Sahin et al., 2025), omits the loss contribution of any token predictable by induction heads, effectively ensuring that repeated n-grams within a context window never produce gradient signals. Despite a 66% reduction in verbatim copying, HAPAX models surpass the vanilla baseline on 13 of 21 abstractive in-context learning tasks, demonstrating that suppressing inductive copying does not compromise broader ICL capabilities. At the data level, Repetition Dropout (Li et al., 2023) randomly drops attention to repetitive words during training, directly reducing the model's exposure to repeated patterns. This simple strategy substantially lowers the repetition rate in generated text, and further analysis shows that it provides a unified explanation for prior methods: penalizing training-data repetitions emerges as the common and fundamental factor underlying the effectiveness of high-inflow-word mitigation, likelihood-objective modifications, and self-reinforcement suppression. Notably, this effect persists across larger model scales and instruction-tuned settings.
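The token-level unlikelihood idea above admits a compact illustration. The following is a minimal sketch, not the authors' implementation: it combines the standard negative log-likelihood on the target token with a penalty on probability mass assigned to previously generated tokens. The toy probability table and the weight alpha are illustrative assumptions.

```python
import math

# Sketch of a token-level unlikelihood objective in the spirit of
# Welleck et al. (2019): MLE loss on the target token plus a penalty on
# probability assigned to negative candidates (tokens already generated).
def unlikelihood_loss(probs, target, negatives, alpha=1.0):
    """probs: dict token -> model probability at this step.
    target: gold next token; negatives: tokens already in the context."""
    nll = -math.log(probs[target])  # standard MLE term
    # Penalize confidence in historical tokens (skip the target itself).
    ul = -sum(math.log(1.0 - probs[t]) for t in negatives if t != target)
    return nll + alpha * ul

# Toy distribution: the model leans toward repeating "the".
probs = {"the": 0.6, "cat": 0.3, "sat": 0.1}
loss_plain = unlikelihood_loss(probs, "cat", [])       # pure MLE
loss_ul = unlikelihood_loss(probs, "cat", ["the"])     # "the" penalized
print(loss_ul > loss_plain)  # the unlikelihood term raises the loss
```

In a real training loop the negatives would be gathered per position from the preceding context and the term would be backpropagated through the softmax; the sketch only shows the objective's shape.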
H Detailed Discussion of Mitigation Methods

Mitigation is a necessary counterpart to the study of resource consumption threats, especially as recent attacks have shown that resource abuse can escalate from efficiency degradation to severe operational and economic harm. The main text has already provided a high-level overview of the current defense landscape, focusing on the available mitigation strategies, their coverage, and the major gaps that remain. Rather than reiterating that high-level taxonomy, this appendix provides a more fine-grained technical account of representative defense methods, so that future researchers and practitioners can more easily identify the most suitable interventions for different threat regimes and deployment settings. Specifically, we expand the discussion along the same two primary threat classes identified earlier, Overthinking and Unbounded Drift, and further organize existing methods by the system layer at which intervention occurs, including training, decoding, and external control, with emphasis on their technical mechanisms, reported effects, and practical limitations.

H.1 Defenses against Overthinking

Large Language Models (LLMs). This line of work mitigates overthinking primarily through prompt-level output control, where conciseness constraints are explicitly injected into the input to suppress unnecessary reasoning expansion. Constrained-CoT (CCoT) (Nayab et al., 2024) implements this idea by introducing user-specified length budgets (e.g., 15, 30, or 45 words) directly into the prompt, forcing the model to trade off answer correctness against output brevity during decoding.
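Prompt-level length budgeting of this kind is simple to sketch. The snippet below is illustrative only: the prompt wording, the whitespace word count, and the correct-and-within-budget scoring rule are assumptions, not CCoT's exact formulation.

```python
# Sketch of prompt-side length budgeting in the spirit of CCoT
# (Nayab et al., 2024). Prompt wording and the word-count proxy are
# illustrative assumptions.
def constrained_prompt(question, budget_words):
    return f"{question}\nAnswer in at most {budget_words} words."

def concise_accuracy(outputs, golds, k):
    """Count a response as correct only if it contains the gold answer
    AND stays within the k-word budget."""
    hits = sum(1 for out, gold in zip(outputs, golds)
               if gold in out and len(out.split()) <= k)
    return hits / len(outputs)

outs = ["The answer is 42.",
        "After a very long chain of reasoning " * 5 + "42"]
golds = ["42", "42"]
print(constrained_prompt("What is 6 x 7?", 15))
print(concise_accuracy(outs, golds, k=15))  # verbose answer fails the budget
```

Both toy responses contain the gold answer, but only the short one is credited under the budget, which is the essential trade-off the prompt-side constraint enforces.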
To evaluate this trade-off, it defines three correctness metrics: Hard-k Concise Accuracy (HCA), which counts only correct responses within a fixed length threshold; Soft-k Concise Accuracy (SCA), which applies an exponential penalty to moderate length violations; and Consistent Concise Accuracy (CCA), which further measures the stability of concise reasoning across generations. It also introduces the Redundancy Mean Score (RMS) and the Information Flow Score to quantify syntactic redundancy and semantic continuity in generated reasoning traces. In essence, these methods do not modify model parameters or decoding dynamics, but instead rely on external prompt-side constraints to compress generation, which makes them lightweight yet inherently limited against adversarially induced length amplification.

Reasoning Large Language Models (RLLMs). In reasoning models, a more direct line of defense is to internalize length control during training, so that budget awareness becomes an intrinsic capability of the model rather than an external prompt-side constraint. LAPO (Wu et al., 2025) implements this idea with a two-stage reinforcement learning procedure. In the first stage, the model performs GRPO rollouts and records the output lengths of correct responses only; the 30th and 70th percentiles are then used to define a filtered target interval [L_min, L_max], and a linear decay reward penalizes generations that fall outside this range. The median feasible length L_median is then carried into the second stage as an explicit self-budgeting target embedded in the prompt, while a Gaussian-style length-adherence reward encourages the model to match its declared budget during generation. This effectively converts length control from post hoc truncation to learned planning behavior, yielding up to 40.9% token reduction and a 2.3% accuracy gain on mathematical reasoning benchmarks.

CoT-Valve (Ma et al., 2025) addresses the same problem through parameter-space controllability rather than reward shaping. Using LoRA, it identifies a controllable direction in parameter space that governs reasoning length, allowing a single model to compress or expand its chain of thought by adjusting the intervention magnitude. To support this control, the method constructs the MixChain dataset, where each question is paired with reasoning traces of different lengths. Under this design, reasoning compression is achieved as a continuous model-side capability rather than a fixed decoding heuristic: on QwQ-32B, the method reduces GSM8K reasoning length from 741 to 225 tokens with only a 0.15% accuracy drop, and compresses AIME traces from 6,827 to 4,629 tokens with only one additional error.

At the decoding stage, existing defenses mitigate overthinking through lightweight inference-time control without updating model parameters. A first line of work suppresses explicit reflection triggers once continued reasoning becomes unnecessary. CGRS (Huang et al., 2025b), for example, inserts a probe at structural delimiters and estimates current confidence from output entropy; when the confidence exceeds 0.9, it directly downweights reflection-leading words such as "Wait", "But", and "Alternatively" by assigning them large negative logits, thereby preventing further reflection loops at decoding time.

A second line of work detects and interrupts semantically empty repetition from internal representations. WSC (Xie et al., 2025) targets useless self-repetitions that consume decoding budget without adding semantic value. It trains a model-specific linear classifier on hidden states at end-of-chunk positions and triggers chopping when either 2 consecutive long repetitive chunks (>= 10 tokens) or 5 short ones (< 10 tokens) are detected; after interruption, a rescue regeneration prompt such as "Let me reconsider..." is appended under a fixed token budget. The self-affirmation suppression approach (Liu et al., 2025a) focuses on a narrower redundancy mode, namely reflective steps that merely reaffirm earlier correct content, and suppresses them by exploiting probability biases in their leading words, achieving length reductions of 18.7% in the train-free setting and 50.2% in the train-based setting.

A third line of work performs representation-level steering of reasoning depth. Controlling Thinking Speed (Lin et al., 2025) extracts a steering vector by applying PCA to hidden-state difference vectors between fast- and slow-thinking trajectories, and then adjusts reasoning speed through a strength parameter α (α > 0 accelerates thinking and α < 0 slows it down). It further introduces an adaptive variant that estimates real-time reasoning difficulty from the Jensen–Shannon divergence between early-layer and final-layer logits, enabling dynamic switching within a single inference pass and yielding an average +1.3% accuracy gain together with an 8.6% token reduction. FR-Ponder (He and Tang, 2025) adopts a related latent-control strategy, extracting contrastive steering vectors from step-by-step reasoning versus direct-answer prompts and applying additive hidden-state perturbations through a lightweight controller trained with GRPO, with curriculum learning used to align compute allocation to task difficulty; under this design, it reports 30-40% token reduction with up to 10% accuracy improvement on GSM8K, MATH500, and GPQA.

A fourth line of work controls overthinking by allocating sampling-time budgets. DSC (Wang et al., 2025e) first performs batch-level difficulty ranking using the model itself: queries with zero entropy across pre-samples are assigned to an "Easy" group and receive only a single chain-of-thought sample, whereas "Hard" queries are assigned an initial sampling budget estimated from similarly difficult prior queries. If consensus is not reached, the sampling window is further expanded using a Dirichlet-based stopping rule. This makes DSC fundamentally a batch-level budget scheduler rather than a single-query real-time controller, which limits its applicability in latency-sensitive inference settings.

From an external control perspective, existing methods mitigate overthinking by imposing budget signals or supervisory constraints outside the model itself, without directly modifying the underlying reasoning dynamics. TALE (Han et al., 2025) implements this idea through token-budget control conditioned on problem complexity: TALE-EP estimates an appropriate budget via zero-shot prompting, while TALE-PT internalizes budget awareness through SFT/DPO post-training, so that the budget can be explicitly injected into the reasoning process as an external control variable. In a simpler form, CCoT (Nayab et al., 2024) applies the same prompt-side control principle by directly embedding length constraints into user instructions, thereby capping output verbosity without changing model parameters or decoding rules.

A more intervention-oriented line of work introduces external supervisors that monitor and terminate redundant reasoning online. MoT (Zhu et al., 2025b) follows this design by exploiting the manipulability of special delimiter tokens, the same external control surface used in adversarial BoT-style attacks, to insert a plug-and-play monitoring mechanism over the reasoning process.
Rather than compressing reasoning through static prompting alone, it dynamically halts redundant or risky reasoning paths during generation, serving both as an efficiency controller to reduce overthinking and as a safety-oriented supervisor to terminate unsafe reasoning, with a monitoring interval of every 200 tokens.

Multimodal Large Language Models (MLLMs) and Memory Systems. In multimodal and long-context settings, overthinking mitigation must address not only unnecessary reasoning depth, but also the growth of retrieved or cached context. A first line of work controls whether long-form reasoning is invoked at all. CAR (Lu et al., 2025) implements this through perplexity-based routing, dynamically switching between short responses and chain-of-thought reasoning according to the model's estimated confidence, so that expensive long-form reasoning is triggered only for uncertain inputs.

A second line of work controls how much contextual state is carried into generation. NM-Ret (Bhat et al., 2025) manages long-context overhead through a structured memory architecture consisting of a stateful neural memory for abstract long-term storage, a vector store for contextual retrieval, and a reasoning compressor for intermediate context management, thereby reducing context-window growth during multi-step reasoning. MeVe (Ottem, 2025) approaches the same problem from the retrieval side by inserting explicit verification and compression before generation: it performs initial kNN retrieval, cross-encoder-based relevance filtering, BM25 fallback retrieval, relevance-based context prioritization with redundancy removal, and finally token-budgeted greedy packing to control the final context size. Under this design, irrelevant or low-value context is filtered before entering the model, yielding a 57% reduction in context token consumption on Wikipedia datasets and 75% on HotpotQA.
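The token-budgeted greedy packing step described above can be sketched in a few lines. This is a minimal illustration, not MeVe's implementation: passages, relevance scores, and whitespace-based token counting are all assumptions standing in for a real retriever and tokenizer.

```python
# Sketch of token-budgeted greedy packing in the spirit of MeVe
# (Ottem, 2025): take passages in relevance order and add each one only
# if it still fits the remaining token budget. The whitespace split is a
# crude stand-in for a real tokenizer.
def greedy_pack(passages, budget):
    """passages: list of (text, relevance_score); returns packed context."""
    packed, used = [], 0
    for text, _score in sorted(passages, key=lambda p: p[1], reverse=True):
        cost = len(text.split())  # assumed per-passage token cost
        if used + cost <= budget:
            packed.append(text)
            used += cost
    return packed

passages = [
    ("highly relevant short passage", 0.9),
    ("barely relevant but very very very long filler passage", 0.2),
    ("moderately relevant passage", 0.6),
]
print(greedy_pack(passages, budget=8))  # long low-relevance passage dropped
```

The low-relevance long passage is skipped once the budget is exhausted, which is how the final context size stays bounded regardless of how much the retriever returns.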
H.2 Defenses against Unbounded Drift

The following methods target non-convergent generation, where repetition collapse, infinite loops, or runaway execution amplify resource consumption beyond output inefficiency and into crash-level system pressure.

Large Language Models (LLMs). For non-convergent generation in general LLMs, existing defenses mainly intervene at the training and decoding stages. At training time, the core idea is to reshape the learning objective so that repetitive patterns are explicitly suppressed during optimization. Unlikelihood training (Welleck et al., 2019) implements this by introducing negative updates at two granularities. At the token level, previously generated tokens in the context are treated as negative candidates, and an unlikelihood term is added to the standard MLE objective to penalize high probability assigned to those historical tokens, thereby suppressing local repetition and high-frequency token dominance. At the sequence level, tokens belonging to repeated n-grams in the decoded outputs are marked as negative samples during fine-tuning, extending this principle to longer structural repetition patterns. These two objectives are combined in training, though they exhibit different limitations: token-level optimization suffers from a distribution mismatch between training and generation, whereas sequence-level optimization requires decoding full sequences within the training loop and is therefore substantially more expensive. Under beam search, the combined design reduces the 4-gram repetition rate from 0.442 to 0.013 and increases the number of unique generated tokens by 77%.

At decoding time, the focus shifts from suppressing repetition in the learned distribution to controlling it online during generation. RAP (Huang et al., 2025a) formalizes this process through systematic tuning of the Repetition Penalty Parameter (RPP).
Its core measurement component, ReDA, computes a repetition ratio (RR) for each output sequence by detecting not only standard textual repetition but also consecutive non-word-character repetition and space-free long-form repetition through regular-expression-based matching. Given RR, RAP selects the penalty strength by maximizing the score P × F(RR), where P denotes task performance and F(RR) is a penalty function over repetition severity. Among five candidate forms (linear, quadratic, cubic, logarithmic, and exponential), the cubic function (1 − RR)^3 yields the best trade-off according to the reported ablation results.

For more severe resource exhaustion and DoS-style threats, the most mature defenses operate at the system level, where intervention is applied during serving rather than through model retraining. A representative direction is online loop detection. RecurrentDetector (Yu et al., 2025) monitors Transformer activation states at each generation step with a lightweight MLP classifier and terminates generation once the cosine similarity between the current and previous states exceeds 0.95, treating high state recurrence as a signal of non-convergent looping. Under this design, it achieves 95.24% detection accuracy with a 2.59% false positive rate and only 0.36 ms additional latency. Its scope, however, is limited by strong white-box assumptions: it cannot be applied to closed-source APIs such as GPT-4, may be bypassed by polymorphic attacks that preserve semantic loop structure while varying surface tokens, and has been evaluated on only 6 open-source architectures across 4,000 prompts.

A complementary direction performs resource-aware serving control under adversarial load. PD3F (Zhang et al., 2025d) implements this through a two-stage pipeline.
On the input side, it computes a Resource Index from five features (total inference time, peak GPU memory, peak GPU utilization, input length, and output length) and uses this signal to guide dynamic request polling under high-concurrency, malicious prompts. On the output side, it applies an Adaptive End-Based Suppression mechanism that explicitly amplifies the EOS logit to terminate maliciously prolonged generation. Under AutoDoS-style attacks, this design improves legitimate-user throughput by up to 500%, achieves attack-detection accuracy above 99% across AutoDoS, GCG-DoS, and P-DoS settings, and has been validated across six open-source LLMs. Its main limitation is threshold sensitivity: the IQR-based decision rule may misclassify legitimate but computationally intensive requests, such as long code generation, leading to substantial false positives in real deployments.

At a broader infrastructure level, queue-based web service architectures (Barek et al., 2025) mitigate overload by decoupling request admission from direct model execution. Instead of treating every request as an immediately executable generation job, they use distributed queuing to smooth bursty demand, isolate overload pressure from the serving backend, and preserve stable, near-linear scalability under extreme workloads. Compared with model-centric defenses, this line of work does not directly suppress malicious generation behavior but provides an engineering-level containment mechanism for maintaining service stability when resource abuse cannot be fully blocked upstream.

Reasoning Large Language Models (RLLMs). Purpose-built defenses against crash-level and non-terminating behavior in reasoning models remain largely absent. Existing mitigation methods only provide incidental coverage by suppressing semantically empty self-repetition during overthinking control. For example, the self-affirmation suppression approach (Liu et al., 2025a) and WSC (Xie et al., 2025) can interrupt mild repetitive loops because both target redundant reflective steps that consume decoding budget without adding semantic value. However, these methods are not designed to handle adversarially induced collapse. WSC relies on a model-specific classifier, and its regeneration stage may itself enter a loop, while CGRS depends on hardcoded reflection triggers and can therefore be bypassed when prompt injection suppresses or substitutes those exact tokens. As a result, current reasoning-model defenses do not yet provide reliable protection against sustained crash-style attacks, leaving a clear gap between overthinking mitigation and true non-termination defense.

Multimodal Large Language Models (MLLMs). For latency-oriented resource threats in multimodal and edge-deployed systems, existing defenses mainly intervene through hardware-aware adversarial training. Background-attentive adversarial training (Wang et al., 2025c) implements this idea by explicitly coupling perturbation robustness with device-level capacity constraints across heterogeneous GPUs. Technically, it uses binary masks to concentrate perturbation awareness on vulnerable background regions, where latency attacks often induce excessive post-processing overhead, and incorporates objectness loss as an auxiliary signal to distinguish true objects from attacker-induced phantom detections. Under this design, the defense improves both robustness and serving efficiency: on Jetson Orin NX, it restores processing speed from 13 FPS to 43 FPS, achieves 8-10% higher robust accuracy than MTD and OOD, and incurs only 4.4% clean accuracy loss. The reported evaluation covers multiple YOLO-based detectors, including YOLOv3, YOLOv5, and YOLOv8, across autonomous driving and general object detection settings.
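The online loop-detection idea discussed for RecurrentDetector can be sketched with plain cosine similarity over consecutive hidden states. This is an illustrative toy, not the method itself: the 0.95 threshold follows the description above, while the two-dimensional state vectors and the simple pairwise check stand in for real Transformer activations and the learned MLP classifier.

```python
import math

# Sketch in the spirit of RecurrentDetector (Yu et al., 2025): flag
# generation as looping once consecutive activation states become
# near-identical. Toy 2-d vectors replace real hidden states.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def detect_loop(states, threshold=0.95):
    """Return the first step whose state is near-identical to the
    previous one, or None if generation never becomes recurrent."""
    for t in range(1, len(states)):
        if cosine(states[t - 1], states[t]) > threshold:
            return t
    return None

diverse = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # progressing
looping = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.01], [0.01, 1.0]]  # stuck
print(detect_loop(diverse), detect_loop(looping))
```

A serving stack would call such a check each decoding step and truncate generation at the first flagged step, trading a small per-step cost for bounded worst-case output length.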
I Benchmarks and Datasets

Existing benchmarks for resource consumption in large language models remain relatively limited. A small number of benchmarks explicitly focus on excessive generation and DoS-style resource consumption. For example, BenchOverflow (Feiglin et al., 2026) measures the overflow phenomenon, in which benign plaintext prompts trigger abnormally long outputs. Prompt-Induced Over-Generation as Denial-of-Service (Guo et al., 2025b) constructs an attack-side benchmark that evaluates how prompts can induce over-generation under black-box access.

In contrast, a larger body of benchmarks focuses on efficiency issues arising from long-chain reasoning. Stop Overthinking (Sui et al., 2025) surveys efficient reasoning methods and highlights the prevalence of excessive reasoning behavior in chain-of-thought generation. EffiReasonBench (Huang et al., 2025c) proposes a unified benchmark for evaluating efficiency-performance trade-offs across multiple reasoning tasks and models. SafeChain (Jiang et al., 2025) studies the safety implications of long chain-of-thought reasoning and introduces datasets for evaluating safety and robustness in reasoning-intensive settings.

Overall, while recent work has begun to establish evaluation frameworks for both resource-consumption attacks and reasoning efficiency, existing benchmarks remain fragmented across different research objectives. A unified evaluation paradigm for resource-aware generation and robustness is still largely lacking.

J AI Writing Assistance Disclosure

We used AI tools solely for language polishing to improve clarity and readability. The AI tools did not contribute to the scientific content, ideas, analyses, or conclusions of this work.

Method | Dataset | Evaluation Metrics | Models | Efficiency
Sloth (Multi-exit) (Hong et al., 2020) | CIFAR-10, ImageNet, Tiny-ImageNet | Exit Index, Avg Latency, Energy | MSDNet, ResNet (Multi-exit) | N/A
SkipSponge (Lintelo et al., 2024) | CIFAR-10, SVHN | FLOPs Inflation, Accuracy | ResNet, WideResNet | N/A
On-Device Sponge (Wang et al., 2023b) | Speech Commands, IMDB | Battery Drain, Execution Time | MobileNetV2, ShuffleNet | N/A
Dynamic Routing (Chen et al., 2023b) | CIFAR-100, ImageNet | Avg Layers, Activation Ratio | GaterNet, SkipNet | N/A
Sponge Poisoning (Wang et al., 2023b) | ImageNet, CIFAR | Latency, GPU Energy | VGG, ResNet | N/A
Energy Backdoor (Meftah et al., 2025) | GLUE, SQuAD | Energy Increase, Clean Accuracy | BERT, RoBERTa | 512*
Sensing AI Sponge (Hasan et al., 2025a) | Audio/Sensor Data | Sensor Power, Pruning Ratio | CNN, DeepSense | N/A
TransSlowdown (Chen et al., 2021) | WMT'14 (En-De), WMT'16 (En-Ro) | Translation Latency, Token Count, BLEU Score | Transformer, LSTM-based NMT | 1024
NMTSloth (Chen et al., 2022a) | WMT'17 (En-De), WMT'19 (Zh-En) | Real-world Latency, Energy, Response Time | Fairseq, OpenNMT, MarianNMT | 80x

Table 4: Summary of representative resource consumption attacks in early adaptive architectures.

Method | Dataset | Evaluation Metrics | Models | Efficiency
AutoDoS (Zhang et al., 2025e) | ChatDoctor, MMLU, HellaSwag, CodeXGLUE, GSM | Attack Success Rate, Safety Compliance Rate, Task Success, Token Usage, Latency, Compute Cost | GPT, Llama, Qwen, DeepSeek, Ministral, Gamma | 8192
Repeated Token (Yona et al., 2025a) | OpenWebText, Custom Repetition Dataset | Attention Scores, Loss, Output Length | Llama-3, GPT-2, Pythia | 8192
Non-halting Queries (Hammouri et al., 2025) | RAG Systems, 10k Probe Dataset | Avg Token Length, Generation Success | GPT-4o, Llama-3, Gemma-2 | 8192
ThinkTrap (Li et al., 2025b) | Sponge, LLMEffiChecker | Throughput, Response Latency, ASR | GPT-4o, Gemini 2.5 Pro, DeepSeek R1 | 8192
Crabs (AutoDoS) (Zhang et al., 2025e) | MMLU, GSM8K, ChatDoctor, CodeXGLUE | Latency Inflation, Output Length ↑, GPU Memory | GPT-4, Llama-3, Qwen-2, DeepSeek, Gamma | 8192
BitHydra (Yan et al., 2025) | MMLU, GSM8K | Inference Cost, Bit-flip Rate, ASR | Llama-2, OPT, Bloom | 8192
Coercing LLMs (Geiping et al., 2024) | AdvBench, Vicuna-bench | ASR, Safety, Efficiency Loss | Vicuna, Llama-2, GPT-3.5 | 4096
Engorgio Prompt (Dong et al., 2024) | ShareGPT, Alpaca | Response Length, Energy Draw | Llama-3, Mistral, GPT-4 | 8192
LLMEffiChecker (Feng et al., 2024) | Efficiency-Bench, WikiText | Delay, Energy, Throughput | Llama-2, Vicuna, Alpaca | 8192
LoopLLM (Li et al., 2025a) | MT-Bench, Chatbot Arena | Repetition Ratio, Energy, Latency | Llama-3, Qwen-2, Phi-3 | 8192
DoS Poisoning (Gao et al., 2024c) | Anthropic HH-RLHF | Inference Latency, Poisoning Ratio | GPT-Neo, RoBERTa | 1024

Table 5: Summary of representative resource consumption attacks in LLMs.

Method | Dataset | Evaluation Metrics | Models | Efficiency
Hidden Tail (Zhang et al., 2025a) | MS-COCO | Attack Success Rate, Output Length, Latency, Visible Length, Response Quality | Qwen2.5-VL, MiMo-VL-7B-RL, Gemma-3-4B-IT | 1831
NICGSlowDown (Chen et al., 2022b) | MS-COCO, Flickr8k | I-Loops, I-Latency (CPU/GPU) | ResNext-LSTM, GoogLeNet-RNN, MobileNets-LSTM | N/A
Verbose Images (Gao et al., 2024a) | MS-COCO, ImageNet | Sequence Length, Energy, Latency, Uncertainty | BLIP, BLIP-2, InstructBLIP, MiniGPT-4 | 8x
Verbose Samples (Gao et al., 2024b) | MSVD, TGIF | Length, Latency, Energy | VideoChat-2, Video-Vicuna, Video-LLaMA | 4x
Uniform Inputs (Müller and Quiring, 2024) | ImageNet | Activation Density, Activation Sparsity | ResNet, DenseNet, MobileNetV2 | N/A
LingoLoop (Fu et al., 2025) | MS-COCO, ImageNet | Tokens, Energy, Latency | InstructBLIP, Qwen2.5-VL, InternVL3 | 2048
Phantom Sponges (Shapira et al., 2023) | Berkeley DeepDrive (BDD), Mapillary Traffic Sign Dataset (MTSD), LISA, PASCAL VOC | Number of Candidates, Processing Time, Detection Recall | YOLOv5, YOLOv3, YOLOv4 | N/A
Enhanced PhantomSponges (Schoof et al., 2024) | Berkeley DeepDrive | Number of Candidates, Processing Time, Detection Recall | YOLOv5 | N/A
RECITE (Gao et al., 2025) | ImageNet, MMLU, HumanEval, GSM8K | Attack Success Rate, Service Response Latency, GPU Utilization, Average Generation Length | InstructBLIP, LLaVA, Qwen-VL | 2048
EO-VLM (Seo et al., 2025) | - | GPU Power, Inference Time | YOLOv8, MaskDINO, Detectron2 | N/A
QuantAttack (Baras et al., 2025) | ImageNet | GPU Memory, Processing Time, Energy, Outlier Count, Accuracy | Vision Transformer (ViT), Data-efficient Image Transformer (DeiT) | N/A
VLMInferSlow (Wang et al., 2025d) | MS-COCO, ImageNet | I-Length, I-Latency, I-Energy | Flamingo, BLIP, GIT, Florence | N/A
Sponge Examples (Shumailov et al., 2021) | CIFAR-10, SVHN, WikiText-103, En-De | Energy Consumption, Latency, Memory Access | VGG-16, ResNet-18, GPT-2 | 512

Table 6: Summary of representative resource consumption attacks in MLLMs.

Method | Dataset | Evaluation Metrics | Models | Efficiency
OverThink (Kumar et al., 2025) | FreshQA, SQuAD, MuSR | Reasoning Token Amplification, Answer Correctness, Guardrail Evasion, Defense Evaluation | o1, o1-mini, DeepSeek-R1 | 46x
Excessive Reasoning Attack (Si et al., 2025) | GSM8K, ORCA | Reasoning Length and Output Length, Utility Performance (Accuracy) | DeepSeek-R1-Distill-LLaMA, DeepSeek-R1-Distill-Qwen, o1-mini, o3-mini, DeepSeek-R1, QwQ | 9x
BadReasoner (Yi et al., 2025) | - | Reasoning Verbosity, Answer Correctness, Controllability | Marco-o1, QwQ, DeepSeek-R1 | N/A
BadThink (Liu et al., 2025b) | MATH-500, GSM8K | ASR, RIR, TAC, BAD, SD | DeepSeek-R1-Distill-Qwen (1.5B/7B/14B/32B), OpenR1-Qwen-7B, Light-R1-7B-DS | 63.85x
ExtendAttack (Zhu et al., 2025a) | AIME 2024, AIME 2025, HumanEval, BigCodeBench-Complete | Response Length, Latency, Accuracy (Pass@1) | o3, o3-mini, QwQ-32B, Qwen3-32B | N/A
RepetitionCurse (Huang et al., 2025d) | - | Prefill Latency, TTFT, Router Imbalance | Mixtral-8x7B series, Qwen3-30B-A3B series, GPT-OSS-20B/120B, Kimi-Linear-Instruct, DeepSeek-V2-Lite, Llama-4-Scout-17B-16E-Instruct | N/A

Table 7: Summary of representative resource consumption attacks in RLLMs.

Method | Dataset | Evaluation Metrics | Models | Efficiency
SlowLiDAR (Liu et al., 2023) | KITTI, nuScenes | Runtime Latency, Imperceptibility | PointPillars, SECOND, PV-RCNN | 2.7x
CORBA (Zhou et al., 2025) | LLM-MAS simulation tasks | P-ASR, PTN, Availability Degradation | GPT-4o-mini, GPT-4, GPT-3.5-turbo, Gemini-2.0-Flash, Qwen2.5-14B-Instruct, Llama-3.1-70B-Instruct, Gemma-2-27B-it | N/A
CP-FREEZER (Wang et al., 2025a) | OPV2V | End-to-end Latency, Attack Success Rate, Frame Processing Time | OpenCOOD | 90x
SlowTrack (Ma et al., 2024) | MOT17 | Latency Increase (R-Lat), Imperceptibility, System-level Crash Rate | SORT (YOLOv5 detector), FairMOT, ByteTrack, BoT-SORT | 4x
LeechHijack (Zhang et al., 2025b) | GAIA, GPQA, MMLU | Attack Success Rate, Resource Overhead, Detectability (Stealthiness) | DeepSeek, Qwen, GPT, Gemini | N/A
Overload (Chen et al., 2024) | MS-COCO | Inference Time, NMS Latency | YOLOv5 | 10x
Clawdrain (Dong et al., 2026) | Real OpenClaw agent workloads | Token Amplification, Cost Overhead, Attack Success Rate, Stealthiness | OpenClaw v2026.2.9 | ∼9x

Table 8: Summary of representative resource consumption attacks in Agents.

Method | Dataset | Evaluation Metrics | Models | Efficiency | Model Class
PD3F (Zhang et al., 2025d) | MMLU, HellaSwag, HumanEval, GSM, GPQA | Attack Success Rate, Task Success, Token Usage, Latency | Llama, Mistral, Qwen | Total Time 50% ↓ | LLM
RecurrentDetector (Yu et al., 2025) | Custom trigger dataset (2388 inputs), ShareGPT | Accuracy, F1, FPR, Recall, Trigger Attempts, Latency | Llama-3/2 (7/13B), Vicuna-v1.5 (7/13B), Gemma-2 (2B), GPT-4o/mini | RecurrentGenerator: avg attempts 272.1 vs 1679.1 random; RecurrentDetector: 0.36 ms inference | LLM
CCoT (Nayab et al., 2024) | GSM8K, SVAMP, ASDiv | Accuracy, Generation Time, Token Count, HCA/SCA/CCA, RMS, Info Flow | Llama2-70b/7b, Falcon-40b/7b, Vicuna-13b | 5.12 s generation time ↓, 4.41% ACC ↑ | LLM
CoT-Valve (Ma et al., 2025) | GSM8K, PRM800K (ground truth), MixChain C/Z; Eval: GSM8K, AIME24 | Pass@1, Token Count, ACU | QwQ-32B-Preview, DeepSeek-R1-Distill-Llama-8B, LLaMA-3.1/3.2 (8B/1B), Qwen2.5-32B (w/ LIMO) | Token usage 69.6% ↓ | LLM
FR-Ponder (He and Tang, 2025) | GSM8K, MATH500, GPQA | Accuracy, Avg Tokens, Avg FLOPs (log) | LLaMA-3 (8/70B), Qwen-2.5 (0.5/3/7B) | 30-50% token ↓ | LLM
DSC (Wang et al., 2025e) | MATH, GSM8K, CSQA, SQA, Last Letter, Coin Flip | Accuracy, Cost ($), Tokens, Time | GPT-3.5 Turbo, GPT-4, Mistral-7B-Instruct-v0.3 | Cost ↓ 65% (GPT-4), 56% (GPT-3.5) | LLM
MeVe (Ottem, 2025) | English Wikipedia (first 100 articles), HotpotQA subset | Avg Context Tokens, Retrieval Time, Grounding/Relevance Proxies | Embedding, Cross-Encoder, Tokenizer, BM25 (fallback) | Context tokens ↓ 57.7% (Wiki) & 75% (HotpotQA) | LLM
Unlikelihood Training (Welleck et al., 2019) | WikiText-103, GPT-2 fine-tuning corpus | Seq/Token Repetition Metrics, PPL, Accuracy, Human Win Rate | 16-layer Transformer, GPT-2 (medium, pre-trained) | Token-level: 150k updates; seq-level: 1.5k updates | LLM
NM-Ret (Bhat et al., 2025) | CoQA, ai-arxiv2, TextVQA | RAGAS metrics | Gemini LLM, Titans-pytorch NeuralMemory, Qdrant/Chroma, CLIP Embeddings, LightThinker Compressor | LightThinker Compressor reduces token footprint for KV cache | LLM

Table 9: Summary of representative defenses against resource consumption threats (Part 1).

Method | Dataset | Evaluation Metrics | Models | Efficiency | Model Class
RAP (Huang et al., 2025a) | QA: WebQSP, MS MARCO; MT: MAC (Chinese-English) | RR, RAP Score, F1/BERT-F1/COMET | Llama-2 (7/13/70B), Llama-3 (8/70B), Llama-3.1-70B, Gemma-1.1 (2/7B), Phi-3/3.5-mini (3.8B), Mistral-7B v0.2/0.3 | RR ↓ up to 93% (WebQSP) and 74% (MS MARCO) | LLM
Queue-based (Barek et al., 2025) | Queue-based Web Service (Sidekiq) with custom load tests (120/300/600 req) | Total Time, Avg Time per Request | GPT-2, BLOOM, OPT | Avg time 21-35 s | LLM
CAR (Lu et al., 2025) | Multimodal (DocVQA, ChartQA, FUNSD, etc.) & Text (GSM8K, MathQA, StrategyQA) + pilot tasks | ACC (VQA/KIE), EM, Token Count, PPL | Qwen2.5 0.5B/7B, Llama3.1 8B, Qwen2-VL 7B | Token ↓ 21-39%, ACC ↑ 5.5-6.9% | LLM, MLLM
TALE (Han et al., 2025) | GSM8K, GSM8K-Zero, MathBench (Arithmetic/Middle/High/College) | ACC, Output Tokens, Expense | GPT-4o mini, Yi-lightning, GPT-4o, o3-mini, Llama-3.1-8B-Instruct | Token ↓ 64-67%, expense ↓ 45-59% | LLM, RLLM
Underload (Wang et al., 2025c) | PASCAL-VOC, COCO, Berkeley DeepDrive (BDD) | mAP50, FPS, NMS Latency, Compute Overhead, Memory Transfer | YOLOv3, YOLOv5, YOLOv8 | FPS 13 → 43 on Jetson Orin NX | MLLM
Thinking Speed (Lin et al., 2025) | MATH-500, AIME24/25, GPQA-Diamond, LiveCodeBench | Pass@1, Token Count, Latency, Mode Switch | DeepSeek-R1-Distill-Qwen-7B/32B, QwQ-32B, Qwen3-8B | 1.26% acc ↑ and 8.56% token ↓ | RLLM
Self-Affirmation (Liu et al., 2025a) | AIME24, AMC23, GSM8K, MATH500, GPQA-D (train-free & train-based) | Accuracy, Token Count, LR Ratio | R1-Distill-Qwen (1.5B/7B/32B), QwQ-32B, Qwen3-32B | Train-free: 8.4-18.7% token ↓ | RLLM
LAPO (Wu et al., 2025) | Train: 10k math (6k DeepScaleR, 4k MATH); Eval: MATH500, AIME24, AMC23, OlympiadBench, GPQA | Pass@1, Token Count, Trade-off | DeepSeek-R1 1.5B, DeepScaleR-1.5B-Preview | Token ↓ up to 40.9% | RLLM
MoT (Zhu et al.
, 2025b ) AIME 2024, MA TH500, StrongReject, Harmbench, W ild- Jailbreak ASR, R TC, RPC, C A CC, Min Steps, Refusal, Harmful Score DeepSeek R1 (1.5/7/14/32B), Light R1 7B DS, Open R1 7B, QwQ 32B, Marco o1 7B ASR > 90%, R TC ∼ 80% ↓ , RPC –90% RLLM W ord Salad Chopper ( Xie et al. , 2025 ) GSM8K, MA TH500, AIME25, GPQA Diamond A CC, length com- pression ratio DeepSeek R1 Distill Qwen (1.5B/7B), DeepSeek R1 Distill Llama 8B, Qwen3 8B Length compres- sion 13–57% RLLM T able 10: Summary of representativ e defenses against resource consumption threats (Part 2).
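The "Efficiency" columns report two recurring ratio-style metrics: attack-side token amplification (e.g., 46x for OverThink) and defense-side relative token reduction (e.g., token ↓ 64–67% for TALE). A minimal sketch of how such ratios are typically computed is shown below; the function and variable names are illustrative assumptions, not taken from any of the surveyed papers.

```python
def token_amplification(baseline_tokens: int, attacked_tokens: int) -> float:
    """Attack-side metric: how many times more tokens the victim model
    generates under the attack than on the benign baseline query."""
    return attacked_tokens / baseline_tokens

def token_reduction(baseline_tokens: int, defended_tokens: int) -> float:
    """Defense-side metric: fraction of output tokens saved relative
    to the undefended baseline."""
    return 1.0 - defended_tokens / baseline_tokens

# Example: a 200-token benign answer inflated to 9,200 tokens is a 46x
# amplification; trimming a 1,000-token trace to 350 tokens is a 65% reduction.
print(token_amplification(200, 9200))  # 46.0
print(token_reduction(1000, 350))      # ≈ 0.65
```

Latency- and energy-based columns (e.g., I-latency, GPU Power) follow the same pattern, with wall-clock time or measured energy in place of token counts.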