Efficient Reasoning on the Edge



Efcient Reasoning on the Edge Y elysei Bondarenko * , Thomas Hehn * , Rob Hesselink * , Romain Lepert * , Fabio Valerio Massoli * , Evgeny Mironov * , Le yla Mirvakhabov a * , T ribhuvanesh Orekondy * , Spyridon Stasis * , Andrey Kuzmin † , Anna Kuz- ina † , Markus Nagel † , Ankita Na yak † , Corrado Rainone † , Ork de Rooij † , Paul N Whatmough † , Arash Be- hboodi ‡ , Babak Ehteshami Bejnordi ‡ Lar ge language models (LLMs) with chain-of-thought reasoning achieve state-of-the-art performance across complex problem-solving tasks, but their verbose reasoning traces and large context require- ments make them impractical for edge deployment. These challenges include high token generation costs, large KV-cache footprints, and inefciencies when distilling reasoning capabilities into smaller models for mobile devices. Existing approaches often rely on distilling reasoning traces from larger models into smaller models, which are verbose and stylistically redundant, undesirable for on-device inference . In this work, we propose a lightweight approach to enable reasoning in small LLMs us- ing LoRA adapters combined with supervised ne-tuning. W e further introduce budget forcing via reinforcement learning on these adapters , signicantly reducing response length with minimal accu- racy loss. T o address memory -bound decoding, we exploit par allel test-time scaling, impro ving ac- curacy at minor latency increase. Finally , we present a dynamic adapter-s witching mechanism that activates reasoning only when needed and a KV -cache sharing strategy during prompt encoding, re- ducing time-to-rst-tok en for on-device inference. Experiments on Qwen2. 5-7B demonstrate that our method achieves efcient, accurate reasoning under strict resource constraints, making LLM reason- ing practical for mobile scenarios. Videos demonstrating our solution running on mobile devices are av ailable on our project page: https://qualcomm- ai- research.github.io/llm- reasoning- on- edge/ . 
1 Introduction

The success and impact of large language models (LLMs) continue to expand, with reasoning emerging as a fundamental component of their achievements. Commercial-grade coding agents reason about code structure, refactoring, debugging, and dependency resolution [1; 2]. In scientific domains, LLMs are increasingly used to assist professional mathematicians with research-level problems [3; 4; 5]. On mobile devices, reasoning-capable LLMs unlock a new class of intelligent personal assistants able to plan multi-step tasks, respond contextually to user queries, and operate autonomously across apps and interfaces. Recent advances from Gemini and OpenAI demonstrate increasingly capable standalone reasoning [6; 7], and a growing wave of agentic use cases is now emerging where models interact directly with device UIs and real-world services [8; 9; 10; 11]. These achievements, however, come at the cost of generating massive numbers of tokens, with reasoning traces accounting for a large fraction of the overall computation [12].

Deploying reasoning models on edge devices is attractive for mobile scenarios because it keeps sensitive data on-device, reduces round-trip latency, and remains available even under limited connectivity. In practice, however,

* Core contributors  † Contributors  ‡ Project leads  Names within each group are sorted alphabetically.
Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.
Figure 1: Overview of the proposed efficient reasoning framework for edge devices. (a) The model architecture utilizes parameter-efficient LoRA adapters and a lightweight switcher to dynamically route queries. This design allows the base model and the reasoning-activated mode to seamlessly share a reusable KV cache during prefill. (b) Parallel test-time scaling strategy, generating multiple reasoning streams concurrently to improve accuracy without severe latency penalties. (c) The end-to-end deployment pipeline, illustrating the progression from multi-stage training (SFT and budget-forced RL) to quantization, model export, and final on-device execution.

on-device reasoning faces several key limitations. The first is the memory bottleneck: mobile devices, constrained by current DRAM capacities, can typically support smaller models with moderate quantization, and larger models only with more aggressive quantization schemes [13; 14]. The second limitation is the cost of token generation in terms of power consumption, latency, and memory footprint.
Long reasoning traces and large context lengths substantially increase KV cache size, quickly exhausting available memory. Finally, general-purpose LLMs with broad capability scopes are difficult to realize within the model sizes supported by edge devices. While specialized small language models (SLMs) can match the performance of larger models on targeted tasks, model switching introduces additional memory movement overhead. Edge-deployed models must therefore operate within strict memory budgets while achieving usable tokens per second (TPS) and acceptable time to response.

This work proposes an end-to-end pipeline for deploying reasoning-capable language models on edge devices under strict token, latency, and memory budgets. Our design starts from a base non-reasoning instruct model and enables a "reasoning mode" via LoRA adapters, so the same backbone can run either in standard chat mode (no adapters) or in reasoning mode (reasoning adapters enabled). A lightweight switcher routes each incoming query to the appropriate mode, enabling reasoning only when it is likely to help.

At training time, we use parameter-efficient fine-tuning to specialize the base model across domains while keeping deployment practical. We demonstrate that LoRA [15] is effective for enabling domain-targeted reasoning and task specialization. Selecting the LoRA rank as a function of the base model and desired performance is a central question, which we address through extensive analysis. Because LoRA adapters can be toggled at runtime, a single base model can be loaded once and then dynamically adapted to different tasks by enabling or disabling adapters. To enable KV-cache sharing between the base model and the LoRA-augmented reasoning model, we propose masked LoRA training during the prefill phase, which we show has no significant impact on accuracy.
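Because the LoRA update is additive, switching between chat and reasoning modes amounts to enabling or disabling the low-rank branch at runtime. The following is a minimal plain-Python sketch with toy dimensions, not the actual on-device implementation:

```python
def matvec(M, x):
    """Multiply a matrix M (list of rows) by a vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

class LoRALinear:
    """Linear layer with a runtime-toggleable low-rank adapter:
    y = W x + enabled * (alpha / rank) * B A x."""

    def __init__(self, W, A, B, rank, alpha):
        self.W = W            # frozen base weight, shape (d_out, d_in)
        self.A = A            # LoRA down-projection, shape (rank, d_in)
        self.B = B            # LoRA up-projection, shape (d_out, rank)
        self.scale = alpha / rank
        self.enabled = False  # toggled by the switcher at inference time

    def forward(self, x):
        y = matvec(self.W, x)                         # base path, always active
        if self.enabled:                              # reasoning mode only
            delta = matvec(self.B, matvec(self.A, x))
            y = [y_i + self.scale * d_i for y_i, d_i in zip(y, delta)]
        return y
```

With `enabled = False` the layer is exactly the base model; flipping the flag adds the reasoning specialization without reloading any weights, which is what makes a single resident backbone practical on-device.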
Our training recipe follows a two-stage structure commonly used for reasoning models: supervised fine-tuning (SFT) on high-quality reasoning traces, followed by a reinforcement learning (RL) phase for further alignment [16]. Since reasoning verbosity is not explicitly addressed during SFT, and models often become verbose and repetitive after initial training, a key objective of our RL phase is to penalize excessively long reasoning traces via budget forcing [17; 18]. We apply budget forcing during RL and explore different training strategies, reward designs, prompting methods, and hard enforcement mechanisms, distilling best practices from our experiments.

At inference time, we target memory-bound decoding as a primary on-device bottleneck. We leverage parallel test-time scaling with neural verification to improve accuracy without incurring significant latency overhead, particularly in typical on-device settings. Concretely, because on-device inference splits into a compute-bound prefill phase and a memory-bound decoding phase, we can better utilize compute units (e.g., NPUs) by running parallel decoding paths with minimal incremental overhead. We further show that neural verification in an outcome-reward-model style can be implemented using a lightweight verifier head trained on the latent representations of the base model.

Together, these components allow us to start from a non-reasoning model and progressively harvest reasoning performance while maintaining deployability on edge devices. Figure 1 provides a comprehensive overview of our proposed end-to-end framework, illustrating the complete progression from adapter training to on-device deployment. We instantiate this pipeline on the Qwen2.5 series of models, and we enable reasoning for these models using our proposed pipeline. On-device deployment requires additional considerations.
We discuss a range of quantization schemes supporting 4- to 8-bit weight quantization while allowing higher precision for activations to accommodate their dynamic range, minimizing quantization loss throughout the pipeline. The resulting quantized models are then exported and compiled for on-device execution. The entire workflow is implemented using Qualcomm open-source tooling, including Qualcomm FastForward [19] and the Qualcomm GENIE SDK [20]. In this paper, we provide deeper insights into the design choices and empirical analyses underpinning this work. We position the resulting system as a practical blueprint for deploying reasoning-capable language models on resource-constrained edge devices.

2 Reasoning on Edge: System Design

Figure 2: Architecture of the Hybrid Reasoning Model. The pipeline begins with a compact base LLM, which is specialized for reasoning via LoRA-based supervised fine-tuning (SFT). To enforce concise generation and prevent excessive verbosity, these adapters undergo reinforcement learning (RL) with Budget Forcing. Finally, a lightweight Switcher module is introduced to act as a reasoning-needed classifier, creating a hybrid model that dynamically routes incoming queries to either the fast base model or the specialized reasoning adapters based on task complexity.

Developing compact yet capable reasoning models requires a training pipeline that can reliably elicit high-quality chain-of-thought (CoT) behavior from a relatively small base LLM while avoiding unnecessary verbosity and excessive computation. Although pretrained LLMs can sometimes function as zero-shot reasoners [21], explicit reasoning via approaches such as scratchpads [22] or supervised CoT [23] remains essential for strong performance on mathematics and coding tasks.
Recent systems such as OpenAI o1 [2] and specialized small reasoning models, including Tina [24], Phi-4-mini [25], and hybrid reasoning architectures [26], illustrate that such capabilities can be distilled into relatively small models when combined with targeted fine-tuning and alignment techniques. Our objective is to fine-tune a modest base LLM so that it performs competitively on complex reasoning tasks while producing concise outputs and minimizing token generation. To keep adaptation lightweight and deployable, we perform training primarily through Low-Rank Adapters (LoRA) [15], which preserve a reusable frozen backbone and enable modular reasoning specialization. The overall pipeline, presented in Figure 2, integrates supervised fine-tuning, reinforcement learning, and lightweight routing in a manner that preserves efficiency and supports deployment under strict memory and latency constraints.

Supervised fine-tuning constitutes the first stage of our pipeline and is designed to unlock the reasoning capabilities of the pretrained base LLM. Rather than performing dense fine-tuning, we adopt parameter-efficient fine-tuning using LoRA. LoRA has been shown to match or even surpass dense fine-tuning in reasoning settings [27], while enabling the base model to remain frozen and reusable for multiple domains. This stage equips the model with the fundamental ability to reason through multi-step problems, but, as widely observed, also increases verbosity and can lead to unnecessarily long or repetitive traces [16]. To refine the reasoning behavior and control verbosity, we apply reinforcement learning (RL) [28; 29] using a custom reward function tailored to two objectives: accuracy and efficiency. First, we use budget forcing [30], a mechanism that penalizes excessively long responses.
This constraint encourages the model to produce concise reasoning traces without sacrificing correctness. Second, we incorporate an answer-based reward, which directly incentivizes the model to generate correct final answers. For optimization, we employ the group relative policy optimization (GRPO) algorithm [31], which updates the LoRA parameters.

Finally, not all user queries require multi-step reasoning. To avoid unnecessary computation, we introduce a lightweight Switcher module that predicts whether reasoning is needed based on hidden prompt representations. When reasoning is unnecessary, the model bypasses the LoRA adapters and relies on the base model directly, reducing latency and limiting KV-cache growth, an important consideration for edge deployment.

In the following sections, we decompose our end-to-end pipeline into its main components: LoRA-based adaptation (Section 3), dynamic inference-time routing using a Switcher module to activate or bypass these adapters (Section 4), budget-forced RL for verbosity control (Section 5), parallel test-time scaling (Section 6), and the deployment path including quantization and on-device execution (Section 7). In each section, we describe the component and then quantify its impact with targeted experiments, highlighting key insights and focused ablations to isolate which design choices drive accuracy and which improve on-device efficiency.

3 LoRA for Modular Reasoning

In our reasoning framework, we adopt parameter-efficient fine-tuning (PEFT) for two reasons: first, it enables scalable experimentation at low training costs, and second, it produces modular adapters that can be enabled or disabled at runtime, allowing a single base model to switch between general-purpose chat and enhanced reasoning modes. To elicit reasoning behavior, we perform SFT on datasets composed of reasoning traces generated by stronger teacher models such as DeepSeek-R1 [16] and QwQ-32B [32].
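As an illustration of the RL stage described in Section 2, the budget-forcing reward and the GRPO-style group-relative advantage can be sketched as follows. The exact reward shape, penalty coefficient, and budget here are hypothetical stand-ins, not the paper's actual reward design:

```python
def budget_forced_reward(is_correct, n_tokens, budget, penalty=0.5):
    """Answer-based reward minus a length penalty for exceeding the token
    budget (illustrative shape only)."""
    r = 1.0 if is_correct else 0.0
    if n_tokens > budget:
        # Linearly penalize overflow beyond the budget, capped at `penalty`.
        r -= penalty * min((n_tokens - budget) / budget, 1.0)
    return r

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO normalizes each rollout's reward against the mean and standard
    deviation of its group; the resulting advantages drive the policy update."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

In GRPO, several responses are sampled per prompt and rewards are normalized within the group, so under a reward like this one, concise correct traces receive the largest positive advantage while verbose or incorrect ones are pushed down.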
Our experiments show that 3B and 7B models can acquire strong reasoning ability using straightforward SFT on curated trace datasets in a cost-efficient setup, reaching performance comparable to substantially larger distilled baselines (e.g., DeepSeek-R1-Distill-Qwen-7B). These results suggest that strong reasoning does not require heavy distillation pipelines or large training budgets; high-quality trace data (e.g., OpenThoughts3 [33]) combined with lightweight fine-tuning is sufficient to close most of the gap. Overall, this provides a practical and scalable path to building capable reasoning models without expensive infrastructure.

3.1 Experimental Setup

The primary goal of our LoRA adaptation stage is to determine if small, general-purpose instruct models can acquire expert-level reasoning capabilities without full-parameter distillation. Accordingly, we adopt the Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct models [34] as our core experimental backbones. In this section, we outline the datasets and optimization strategies used to elicit reasoning behavior via LoRA, as well as the diverse suite of math, science, and coding benchmarks used to rigorously evaluate the resulting performance trade-offs. We additionally provide an extensive LoRA hyperparameter study designed to identify the most stable and compute-efficient adaptation strategies for both the 3B and 7B backbones.

3.2 Training Details

Data. In our experiments, we utilize two SFT datasets. The first is Mixture of Thoughts (MoT) [35], which contains 350k reasoning traces distilled from the DeepSeek-R1 model. This dataset covers three core domains: Math (93.7k traces), Code (83.1k traces), and Science (173k traces). The second dataset is OpenThoughts3-1.2M (OT3) [33], comprising 850k Math questions, 250k Code questions, and 100k Science questions.
The annotation traces for OT3 were generated using the QwQ-32B model.

Training configuration. All models were trained for 5 epochs using the bfloat16 data type with the DeepSpeed ZeRO-2 configuration and CPU offloading enabled. Across all configurations, we applied a cosine learning-rate schedule with a warmup ratio of 0.1, and weight decay was set to 0.

For the baseline dense training on the MoT dataset, the learning rate was set to 1e-5 for the Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct models, with a global batch size of 128. Model weights were optimized using the AdamW optimizer with (β1, β2) = (0.9, 0.95). For dense training on the OT3 dataset, we followed the recipe described in [33]. Dense models were trained with a learning rate of 8e-5 and a batch size of 512, using AdamW with (β1, β2) = (0.9, 0.999).

In our LoRA training setup, we follow the common practice of employing relatively larger learning rates. We also find that reducing the batch size generally leads to more stable optimization and improved results. Throughout all PEFT experiments, we use a LoRA rank of 128, set the LoRA alpha to twice this value, adopt a learning rate of 2e-4, and train with a batch size of 64.

3.3 Evaluation Details

Benchmarks. We assess the reasoning capabilities of our trained models using a diverse set of benchmarks that span tasks across mathematics, science, and coding domains. To evaluate multi-step mathematical problem-solving, we use challenging competition datasets including AIME 24/25 [36], AMC23 [37], and MATH500 [38]. Scientific reasoning is measured using the PhD-level GPQA Diamond dataset [39].
Finally, to evaluate code generation, we utilize LiveCodeBench (v2, code generation scenario only) [40] for recent competitive programming problems, alongside standard Python programming tasks from HumanEval [41] and MBPP [42], including their rigorously verified EvalPlus variants (HumanEval+ and MBPP+) [43]. Comprehensive descriptions of each benchmark, including problem counts and specific testing scenarios, are provided in Appendix A.

Evaluation pipeline. Our evaluation pipeline is structured as follows. For reasoning benchmarks including AIME24, AIME25, MATH500, GPQA, and AMC, we allow a generation length of up to 32,768 tokens, with the temperature set to 0.6 and top_p to 0.95. For models trained on the OT3 dataset, we adopt the generation parameters recommended by the authors in [33], specifically setting the temperature to 0.7 and top_p to 1.0 and keeping the generation length up to 32,768 tokens. Likewise, for LCB, MBPP, and HumanEval (and their + counterparts) we set the generation parameters as recommended by the models' creators; we set the maximal generation length to 32,768 tokens for LCB, and 1,024 tokens for HumanEval and MBPP.

Given that some benchmarks contain a limited number of questions and are therefore more susceptible to accuracy variance, we perform multiple evaluation runs to ensure robustness, similarly to the protocol outlined in [33]. Specifically, we evaluate the AIME24, AIME25, and AMC datasets 10 times and report the averaged results. For GPQA, we conduct 4 evaluation runs, and for MATH500 we run the evaluation once. For coding benchmarks, the pass@1 score is estimated from a pool of 16 candidate solutions for LCB, and of 200 candidate solutions for HumanEval and MBPP, using the unbiased pass@k estimator first proposed in [41].
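The unbiased estimator of [41] computes, for each problem with n samples of which c pass, the probability that at least one of k drawn samples is correct: pass@k = 1 − C(n−c, k)/C(n, k). A direct implementation of that published formula:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n generated samples,
    c of them correct, k samples drawn for evaluation."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # samples must contain at least one correct solution.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 this reduces to the fraction of correct candidates c/n, so the pass@1 numbers reported here are simply the per-problem success rates averaged over the 16 (LCB) or 200 (HumanEval/MBPP) candidates.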
All evaluations are conducted using the lighteval framework [44] with vLLM support [45], except for those on HumanEval, MBPP, and their enhanced variants, for which we used the EvalPlus package¹.

Table 1: Results of the main experiments. FT data indicates the fine-tuning dataset. The LoRA column specifies the adapter rank, or its absence for dense fine-tuning. LCB refers to LiveCodeBench, and HE denotes HumanEval. R1* corresponds to the R1-Distill-Qwen-7B model released by DeepSeek; we evaluated this model following our standardized evaluation protocol.

          FT data    LoRA     AIME24  AIME25  MATH500  GPQA  AMC23  LCB   HE    HE+   MBPP  MBPP+
Qwen-3B   -          -        0.09    0.03    0.66     0.31  0.36   0.23  0.40  0.64  0.72  0.61
          MoT        -        0.20    0.22    0.80     0.37  0.61   0.38  0.43  0.38  0.34  0.29
          OT3        -        0.51    0.43    0.92     0.40  0.83   0.50  0.48  0.43  0.36  0.30
          OT3        rk 128   0.18    0.21    0.79     0.32  0.55   0.23  0.40  0.35  0.39  0.32
Qwen-7B   -          -        0.10    0.17    0.76     0.37  0.60   0.36  0.83  0.75  0.80  0.68
          R1*        -        0.55    0.40    0.92     0.49  0.89   0.59  0.41  0.38  0.42  0.36
          MoT        -        0.37    0.28    0.90     0.48  0.78   0.55  0.53  0.47  0.60  0.50
          OT3        -        0.61    0.54    0.95     0.47  0.89   0.66  0.58  0.51  0.56  0.46
          OT3        rk 64    0.47    0.39    0.92     0.45  0.78   0.48  0.60  0.54  0.63  0.52
          OT3        rk 128   0.56    0.38    0.93     0.43  0.82   0.54  0.60  0.54  0.57  0.46
          OT3 + MoT  rk 128   0.44    0.39    0.93     0.48  0.83   0.51  0.59  0.52  0.60  0.49

3.4 Results

As shown in Table 1, we evaluate the Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct models under dense and LoRA-based fine-tuning, and compare against key baselines. Fine-tuning on the OpenThoughts3 (OT3) dataset yields the largest and most consistent gains in reasoning performance across both backbones, improving accuracy substantially on math and science benchmarks and also boosting performance on the more reasoning-sensitive coding benchmark (LiveCodeBench).
In contrast, Mixture of Thoughts (MoT) provides clear improvements over the base models, but its gains are consistently smaller than OT3's. In addition, the densely trained Qwen2.5-3B model on OT3 performs on par with, or slightly above, the densely trained Qwen2.5-7B model on MoT, suggesting that higher data quality can partially compensate for smaller backbone size.

For Qwen2.5-7B, LoRA fine-tuning on OT3 recovers most of the dense OT3 improvements, and increasing the adapter rank from 64 to 128 generally narrows the gap on core reasoning benchmarks (e.g., AIME24 and LCB). Notably, OT3 with LoRA rank 128, which requires updating only 4.24% of the parameters compared to dense fine-tuning, reaches performance close to the R1-Distill-Qwen-7B baseline on several reasoning benchmarks, indicating that lightweight adapter training can recover much of a distilled model's capability at significantly lower adaptation cost. For Qwen2.5-3B, however, OT3 LoRA (rank 128) underperforms the dense OT3 model by a large margin across reasoning benchmarks, highlighting that adapter capacity and/or optimization details matter more at smaller scales and motivating our ablations in the subsequent sections.

The coding results reveal a trade-off between reasoning specialization and "direct-answer" code generation. While SFT consistently improves performance on LCB, it always results in some degradation for MBPP, HumanEval, and their respective + variants. This pattern is consistent with a shift toward explicit multi-step reasoning: it helps on harder, reasoning-sensitive coding tasks like LCB, but can be counterproductive on benchmarks that reward short, direct code outputs. Concretely, our SFT stage is designed to elicit reasoning behavior into an otherwise

¹ https://github.com/evalplus/evalplus
non-reasoning model, and LCB is a setting where such explicit reasoning can be beneficial. In contrast, MBPP and HumanEval typically do not prompt the model to reason, and a response must be given directly². Interestingly, for Qwen2.5-7B, OT3 LoRA (rank 64/128) often retains stronger HumanEval/MBPP performance than dense OT3, suggesting that PEFT can partially mitigate the specialization/forgetting trade-off relative to dense fine-tuning. Prior work has similarly observed [46] that SFT on reasoning traces can result in partial forgetting of general capabilities, and that this phenomenon could potentially be mitigated by employing RL fine-tuning [47].

We also explored a two-stage training strategy for the 7B model to see if we could combine the distinct benefits of both datasets. Our motivation was that while the OT3 dataset (generated by the smaller QwQ model) provided the strongest baseline performance, the MoT dataset might contain complementary, highly complex reasoning patterns that the model could absorb in a subsequent training phase. However, interestingly, additional training on MoT following training on OT3 results in minimal changes across benchmarks. Accuracies remain largely stable, except for a 0.12-point decrease on AIME24 and a 0.05-point improvement on GPQA.

3.4.1 LoRA Hyperparameter Study: Rank, Learning Rate, and Batch Size

To study the effect of hyperparameters on the performance of PEFT models, we consider a range of values for learning rate, batch size, and LoRA rank. We trained Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct on a subset of OT3 consisting of 50,000 entries for 1 epoch. For each model, we vary the values within the following ranges: learning rate in {1e-4, 2e-4, 5e-4}, batch size in {32, 64, 128}, and LoRA rank in {32, 64, 128, 256}. LoRA adapters were applied to all linear layers, with α set to 2 × rank and dropout fixed at 0.1.
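The sweep above covers the full cross-product of the three hyperparameters (3 learning rates × 3 batch sizes × 4 ranks = 36 configurations per model). A sketch of enumerating it, with the α = 2 × rank rule applied per configuration (the dict keys are illustrative, not a specific trainer's API):

```python
from itertools import product

learning_rates = [1e-4, 2e-4, 5e-4]
batch_sizes = [32, 64, 128]
lora_ranks = [32, 64, 128, 256]

# One training configuration per (lr, batch size, rank) combination.
configs = [
    {"lr": lr, "batch_size": bs, "rank": r, "alpha": 2 * r, "dropout": 0.1}
    for lr, bs, r in product(learning_rates, batch_sizes, lora_ranks)
]
```

Tying α to the rank keeps the effective adapter scale α/r constant across the sweep, so rank can be varied without also retuning the update magnitude.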
For our ablation study, we evaluate the trained models on math and science benchmarks including AIME24, AIME25, MATH500, GPQA Diamond, and AMC23. Similarly to the setup used in [33], we average the accuracy values over 10 runs for smaller datasets like AIME24, AIME25, and AMC23, over 4 runs for GPQA, and evaluate once for MATH500. We set the temperature to 0.7 and the top_p value to 1.0. Full evaluation results for this ablation study can be found in Appendix B.

Qwen2.5-3B-Instruct results. The full performance of the models trained on the set of considered hyperparameters is summarized in Table 12 in the Appendix. To isolate the effect of each hyperparameter, we summarize results by averaging accuracy over the other two dimensions (learning rate, batch size, and LoRA rank), highlighting the main trends. As shown in Table 2, learning rate has a noticeable impact on performance, with 2e-4 providing the highest overall average accuracy across tasks. We also observe opposite sensitivities across benchmarks: AIME24 and AIME25 improve as the learning rate increases, whereas MATH500 degrades, suggesting that larger learning rates can lead to over-adaptation on more complex mathematical reasoning.

Table 2: Average performance grouped by learning rate for Qwen2.5-3B-Instruct.

LR     AIME24  AIME25  MATH500  GPQA   AMC23  Avg
1e-4   0.040   0.025   0.573    0.277  0.281  0.239
2e-4   0.051   0.033   0.574    0.274  0.299  0.246
5e-4   0.051   0.038   0.556    0.280  0.275  0.240

As presented in Table 3, LoRA rank shows the most consistent positive influence on aggregate performance. Increasing rank generally improves accuracy, with rank 256 achieving the highest overall average. However, rank 128 already performs strongly on most benchmarks, making it a practical operating point when adapter memory is constrained. Across tasks, AIME25 is particularly sensitive to rank, while MATH500 remains comparatively stable.
Overall, higher ranks are preferable when resources permit, but rank 128 offers a favorable accuracy-efficiency trade-off for edge deployment.

² We have nonetheless confirmed that the model's answers are never cut short by the EvalPlus harness.

Table 3: Average performance grouped by LoRA rank for Qwen2.5-3B-Instruct. %TP denotes the percentage of trainable parameters.

Rank  %TP     AIME24  AIME25  MATH500  GPQA   AMC23  Avg
32    1.94%   0.047   0.020   0.568    0.270  0.284  0.238
64    3.88%   0.040   0.023   0.570    0.283  0.284  0.240
128   7.76%   0.054   0.038   0.560    0.272  0.287  0.242
256   15.52%  0.048   0.048   0.573    0.282  0.284  0.247

Finally, batch size had a relatively minor effect compared to learning rate and rank, as shown in Table 4. The best overall performance was observed at a batch size of 64, though the differences were small. MATH500 benefited slightly from larger batches like 128, while AIME25 peaked at 64. These results indicate that batch size can be chosen primarily based on computational constraints without significant impact on overall performance.

Table 4: Average performance grouped by batch size for Qwen2.5-3B-Instruct.

Batch size  AIME24  AIME25  MATH500  GPQA   AMC23  Avg
32          0.054   0.030   0.562    0.277  0.288  0.242
64          0.048   0.038   0.566    0.278  0.287  0.243
128         0.040   0.028   0.575    0.277  0.281  0.240

Qwen2.5-7B-Instruct results. Table 13 in the Appendix summarizes the Qwen2.5-7B-Instruct LoRA ablations over learning rate, batch size, and adapter rank. In contrast to the 3B setting, where learning rate affects performance but the averages vary only modestly across the tested values, the 7B backbone exhibits a narrower stable learning-rate range (Table 5): LR = 1e-4 and 2e-4 train reliably, whereas LR = 5e-4 is often unstable and can lead to collapsed runs.
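The %TP column follows directly from LoRA's parameter count: each linear layer of shape d_out × d_in gains r · (d_in + d_out) trainable parameters, so the trainable fraction scales linearly with rank, which matches the exact doubling from one rank to the next seen in the tables. A sketch with hypothetical layer shapes (not the actual Qwen2.5 architecture):

```python
def lora_trainable_fraction(layer_shapes, rank):
    """Trainable LoRA parameters relative to the frozen base weights when
    adapters of the given rank are attached to every linear layer."""
    base = sum(d_out * d_in for d_out, d_in in layer_shapes)
    lora = sum(rank * (d_out + d_in) for d_out, d_in in layer_shapes)
    return lora / base

# Hypothetical per-block projection shapes, repeated over 36 blocks.
shapes = [(2048, 2048), (2048, 2048), (11008, 2048), (2048, 11008)] * 36
```

Because the fraction is linear in rank, halving the rank exactly halves the adapter memory, which is what makes rank a clean knob for trading accuracy against the on-device footprint.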
Overall, a practical guideline is to use a lower learning rate for stable training on larger backbones, and tune the remaining hyperparameters within that stable regime. For the individual tables presented below, we account for this distinction explicitly. The table reporting averages across different learning rate settings includes all runs, highlighting the instability introduced by higher learning rates. In contrast, the tables reporting averaged accuracies for batch size and LoRA rank exclude runs with a learning rate of 5e-4, as those configurations contain diverged results.

Table 5: Average performance grouped by learning rate for Qwen2.5-7B-Instruct.

LR      AIME24  AIME25  MATH500  GPQA   AMC23  Avg
1e-4    0.158   0.132   0.775    0.366  0.533  0.393
2e-4    0.170   0.148   0.779    0.354  0.543  0.399
5e-4    0.148   0.112   0.635    0.350  0.439  0.337

As shown in Table 6, LoRA rank has a measurable but relatively small impact on Qwen2.5-7B performance: the average score improves from 0.388 (rank 32) to 0.402 (rank 128), while rank 256 is comparable at 0.397. Overall, ranks 64-128 form a tight trade-off region, with rank 128 best on average but only marginally better than lower ranks. Compared to the 3B model, the 7B results are more tightly clustered across ranks, indicating that rank is a weaker lever here than it is for smaller backbones.

Similarly, Table 7 shows that batch size has a negligible effect on Qwen2.5-7B performance in our sweep: the average accuracy varies only from 0.394 to 0.398. This mirrors the 3B trend, suggesting that batch size can largely be chosen based on computational constraints once the learning rate is in a stable regime.

Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.

Table 6: Average performance grouped by LoRA rank for Qwen2.5-7B-Instruct. %TP denotes the percentage of trainable parameters.

Rank  %TP    AIME24  AIME25  MATH500  GPQA   AMC23  Avg
32    1.06%  0.152   0.119   0.776    0.357  0.534  0.388
64    2.12%  0.159   0.150   0.779    0.355  0.535  0.396
128   4.24%  0.178   0.141   0.784    0.367  0.541  0.402
256   8.48%  0.165   0.152   0.771    0.359  0.536  0.397

Table 7: Average performance grouped by batch size for Qwen2.5-7B-Instruct.

Batch size  AIME24  AIME25  MATH500  GPQA   AMC23  Avg
32          0.154   0.148   0.781    0.355  0.530  0.394
64          0.164   0.146   0.774    0.366  0.538  0.398
128         0.172   0.127   0.776    0.358  0.541  0.395

From this hyperparameter study, we conclude that a learning rate of 2e-4, a batch size of 64, and a LoRA rank of 128 gives consistently strong results while providing a good compromise between efficiency and training stability.

4 Dynamic LoRA Routing via the Switcher Module

While reasoning models excel at complex problem-solving, not every user query requires an exhaustive, multi-step CoT. For standard conversational prompts or straightforward factual questions, generating long reasoning traces incurs unnecessary latency and computational overhead, which can be a critical bottleneck for edge devices. To address this, we introduce a lightweight Switcher module that enables dynamic adapter routing. By analyzing the user prompt, the switcher decides whether to bypass or activate the reasoning-specific LoRA adapters. When disabled, the system operates as the highly efficient original instruct model for regular conversation; when activated, it seamlessly transitions into a specialized reasoning engine.

Architecturally, the switcher serves as an auxiliary classification head on top of the base LLM and operates during the prefilling stage. It processes the hidden states from the final transformer layer and computes an averaged sequence representation. Based on this representation, the switcher performs binary classification to determine whether the input sequence corresponds to a reasoning-oriented task. If classified as such, the reasoning LoRA adapters are activated for subsequent decoding.
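As an illustration, such an auxiliary classification head can be sketched in PyTorch (a minimal sketch; the class and argument names are ours, and `hidden_size` stands for the base model's embedding width):

```python
import torch
import torch.nn as nn


class SwitcherHead(nn.Module):
    """Auxiliary binary classifier over the final-layer hidden states.

    Minimal sketch: pools the prompt's last-layer hidden states into a
    single vector and predicts whether reasoning adapters should be enabled.
    """

    def __init__(self, hidden_size: int, mlp_dim: int = 8, p_drop: float = 0.2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, mlp_dim),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(mlp_dim, 1),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from the last layer.
        pooled = hidden_states.mean(dim=1)       # averaged sequence representation
        return torch.sigmoid(self.mlp(pooled))   # probability of a reasoning task
```

At inference time, the pooled score would be thresholded after prefill to decide whether the reasoning LoRA adapters are enabled for decoding.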
The switcher head is implemented as a lightweight multilayer perceptron (MLP) with a single hidden dimension of 8, a ReLU activation function, and a dropout rate of p = 0.2. This compact architecture ensures negligible overhead for efficient on-device inference while maintaining sufficient expressive capacity for sequence-level classification.

On edge devices, the prefill phase is heavily compute-bound, and processing a long input prompt in a single pass can incur prohibitive computational overhead. To mitigate this, practical on-device implementations typically divide the prefill sequence into smaller, discrete chunks. The switcher module is explicitly designed to support this chunked prefill strategy. Rather than buffering the hidden states of the entire prompt to compute a global average, the switcher updates its sequence representation on the fly. Specifically, we compute a running exponential moving average of the hidden states across these chunks. In our setup, we use a chunk size of 128 tokens with a smoothing coefficient of α = 0.5. To enhance robustness to quantization artifacts, we inject independent Gaussian noise with zero mean and standard deviation σ = 0.5 into the averaged representation during training.

Masked LoRA training for KV-cache reuse. A major challenge with dynamically activating LoRA adapters at inference time is KV-cache compatibility. Under standard LoRA training, the model expects the KV cache for the prompt tokens (the prefill phase) to be generated with the LoRA adapters fully active. If a query is routed to the reasoning mode after the base model has already encoded the prompt, the system would typically need to re-encode the entire prompt with the LoRA adapters activated to generate a compatible KV cache. On edge devices, this re-encoding incurs a severe latency and compute penalty.
To eliminate this inefficiency, we introduce a masked LoRA training strategy. During the fine-tuning of the reasoning adapters, we mask (disable) the LoRA weights during the forward pass of the prompt tokens, activating them only for the generation of the response tokens. This forces the LoRA adapters to adapt to the prompt KV cache generated strictly by the base model. Empirically, we observe that this strategy incurs no drop in reasoning accuracy, while allowing the base model and the reasoning mode to seamlessly share a single prefill KV cache, entirely obviating the need to re-encode prompt tokens when switching.

4.1 Training details

To train the switcher, we constructed a small dataset that combines both straightforward conversational or knowledge-based queries and complex queries, enabling the model to learn to distinguish when reasoning is required. The dataset is structured as follows: each entry consists of a question prompt paired with a label 0 or 1, indicating low or high complexity, respectively. Complexity is determined by the source dataset from which the prompt was sampled. We included prompts from datasets covering math and non-math domains to reduce the risk that the switcher overfits to domain-specific cues (e.g., assuming all math questions are inherently more difficult). The final dataset contains approximately 2k samples. Easy queries were randomly drawn from the SQuAD2.0 dataset (600 questions) [48], which consists primarily of general-knowledge comprehension questions, and from the MMLU math subset (419 questions) [38], which includes straightforward mathematical problems. Hard prompts were sourced from a subset of the S1K dataset (500 questions) [17], encompassing challenging questions spanning math, science, and crosswords, as well as from StrategyQA (500 questions) [49], covering reasoning questions from non-scientific domains.
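The masked LoRA strategy described above can be sketched as a LoRA-augmented linear layer whose low-rank update is zeroed out at prompt positions (a minimal PyTorch sketch, not the paper's implementation; class and argument names are ours):

```python
import torch
import torch.nn as nn


class MaskedLoRALinear(nn.Module):
    """LoRA linear whose low-rank update is applied only to response tokens.

    Prompt tokens receive the pure base projection, so the prompt KV cache
    is identical to the base model's and can be shared across modes.
    """

    def __init__(self, base: nn.Linear, rank: int = 128, scale: float = 1.0):
        super().__init__()
        self.base = base
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # standard LoRA init: delta starts at zero
        self.scale = scale

    def forward(self, x: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, in_features); response_mask: (batch, seq) bool,
        # True only at response positions (all False during prompt prefill).
        delta = self.lora_b(self.lora_a(x)) * self.scale
        return self.base(x) + delta * response_mask.unsqueeze(-1).to(x.dtype)
```

With `response_mask` all False (the prefill phase), the layer reduces exactly to the base projection, which is what makes a single shared prefill KV cache possible.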
4.2 Results

The primary motivation for the switcher module is to optimize standard, day-to-day user interactions. In real-world edge deployments, the vast majority of user queries are simple conversations, factual lookups, or basic instructions that do not require multi-step reasoning. By aggressively thresholding the switcher to route these simple queries to the base instruct model, we can achieve large aggregate savings in token generation, latency, and power consumption, reserving the LoRA reasoning adapters strictly for complex tasks.

To rigorously assess the impact of this dynamic routing, we evaluated the switcher's performance on the challenging MATH500 benchmark, as it contains questions of varying complexity. We swept across different switcher confidence thresholds to vary the fraction of prompts routed to the reasoning adapters versus the base model. Figure 3 illustrates how overall model performance changes as a larger fraction of queries is routed to the reasoning model. Here, we study the base Qwen2.5-7B-Instruct model and its counterpart trained on OT3 with LoRA adapters of rank 128 without any budget forcing. As more answers are generated with reasoning, accuracy rises smoothly from the base-model baseline (76.4%) toward the reasoning-only upper bound (93.0%). This demonstrates that the switcher effectively prioritizes the reasoning model on more complicated queries where it is most beneficial, reaching accuracy levels that cannot be achieved by the base model alone.

Figure 3: Impact of the Switcher module on MATH500. Left: Combined model accuracy as a function of the fraction of queries routed to the reasoning adapters. Right: Average completion length versus overall accuracy across different switcher thresholds.

The right panel shows the corresponding computational cost, measured as average completion length. Choosing a higher-accuracy operating point requires a proportional increase in computational cost, while lower-cost regimes are possible when accuracy demands are modest. The switcher thus provides a flexible mechanism for navigating this tradeoff.

5 Budget Forcing and Inference-Time Compute Optimization

CoT prompting [23] scales Inference-Time Compute (ITC) by decoding intermediate reasoning steps, substantially improving LLM performance on complex tasks, but often at the cost of high latency and large token footprints. Theoretical analyses [50] argue that the optimal test-time compute should scale linearly with problem difficulty. Yet unconstrained models routinely violate this optimality, exhibiting degenerate verbosity and overthinking even on trivial tasks [17].

To reconcile reasoning capability with computational efficiency, budget forcing [17] methods aim to align generation with explicit token/compute constraints. RL-based methods, among others [51; 52; 53; 54], have achieved impressive trade-offs between performance and CoT length reduction. These methods typically augment the reward with a length-based penalty term [55] or enforce hard truncation constraints upon reaching a target budget [56].
Formally, a standard budget-forced reward objective can be expressed as an additive penalty:

R(y, x) = R_accuracy(y, x) − λ · R_budget(L)    (1)

where x is the prompt, y is the generated response, R_accuracy(y, x) is the accuracy reward, L is the total token length, and R_budget(L) is a penalty function scaled by the hyperparameter λ.

5.1 Soft-Barrier Reward Formulation

Building upon the foundations of budget-penalized RL, our reward design rests on three core rationales:

1. Avoidance of strict token matching: We do not force the model to exactly match a predefined budget. Doing so would assume perfect a priori knowledge of the optimal compute required for a specific task, which contradicts the premise of generative exploration.
2. Trajectory exploration: The model must retain sufficient degrees of freedom to explore diverse reasoning paths without premature truncation.
3. Prompt-adherent budget compliance: The model must reliably satisfy the user-defined budget constraints provided in the prompt.

To realize these principles, we prompt the model with discrete generation-length budgets, specifically bucketing constraints into 1000, 3000, 4000, and 6000 tokens. Instead of an additive penalty, we introduce a multiplicative, piecewise-linear soft barrier. This barrier decays the budget reward from 1.0 to 0.0 as the generation length exceeds the prompted bucket. The linear decay serves as a buffer, discouraging the model from exceeding the limit without inflicting catastrophic penalties for minor budget infractions. This decay operates within a symmetric window centered around the target budget B, where the half-size of the window, m ∈ [0, 1], is treated as a tunable hyperparameter.
Formally , we dene the budget reward modier as: R budget ( L ) =      1 L ≤ L low p L > L high 1 − (1 − p ) L − L low L high − L low L low < L ≤ L high (2) where L is the total length of the generated response, p is the maximum budget penalty oor , L low = (1 − m ) B , and L high = (1 + m ) B . T hrough empirical observation, we noted that setting a negative penalty oor provided no optimization benets; thus, we set p = 0 . The nal holistic rewar d R is then dened as the product of the task accuracy and the budget compliance modier: R ( y , x ) = R accuracy ( y , x ) × R budget ( L ) ( 3) where R accuracy ( y , x ) ∈ { 0 , 1 } represents the binar y accuracy rew ard. Challenges and rew ard hacking. Because budget forcing imposes a strict constraint on the optimization mani- fold, it inherently induces a trade-off between reasoning performance and compute cost. W e observed that naively applying a length penalty (for instance, penalizing only the tokens within the Co T reasoning trace) is highly suscep- tible to rewar d hacking. During early iterations, the policy rapidly collapses into a degenerate , “lazy” strategy: the model learns to circumvent the penalty by prematurely closing the reasoning block with a token, only to continue its verbose Co T in the nal response output. By penalizing the total gener ation length L rather than just the reasoning trace, our multiplicative formulation effectively neutralizes this exploit. F urthermore , while our - nal reward formulation strips aw ay explicit format-follo wing rewar ds, empirical evaluations conrm that the model consistently maintains the desired structural formatting throughout training. 5 .2 Experimental Setup 5 .2.1 Tr aining Details T o demonstrate the efcacy of our soft-barrier rew ard formulation in compressing Co T reasoning, we conduct ex - tensive experiments on state-of-the-art reasoning models. We utilize the DeepScaleR dataset [ 57 ] as our primary training corpus. 
To maximize training stability and prevent degenerate optimization steps, we apply a rigorous filtering criterion to the dataset: any prompt exhibiting a group reward standard deviation of zero is removed, ensuring that the model always receives a meaningful comparative signal during policy updates.

We optimize our models using GRPO [31]. GRPO is particularly well-suited for reasoning tasks as it bypasses the need for a separate value model by leveraging group-scaled rewards. For a given prompt x, the objective minimizes the following loss:

L_GRPO(θ | x) = −(1/G) Σ_{i=1}^{G} min( ρ_i A_i, clip(ρ_i, 1 − ϵ, 1 + ϵ) A_i ) + β D_KL( π_θ(·|x) ‖ π_ref(·|x) ),    (4)

where G denotes the group size, and the probability ratio ρ_i and the advantage A_i are defined as:

ρ_i = π_θ(y_i | x) / π_old(y_i | x),   A_i = (r_i − μ_r) / (σ_r + ε)    (5)

Here, μ_r and σ_r denote the mean and standard deviation of the rewards within the group, respectively:

μ_r = (1/G) Σ_{j=1}^{G} r_j,   σ_r = sqrt( (1/G) Σ_{j=1}^{G} (r_j − μ_r)^2 )    (6)

Our implementation is built on the trl library (version 0.26.2) [58]. We execute training on a single compute node equipped with 8 NVIDIA H100 (80GB) GPUs. We sample 8 generations per prompt (G = 8) during the GRPO rollouts. A comprehensive summary of the training hyperparameters is provided in Table 14 in Appendix C.

5.2.2 Evaluation Details

To assess the impact of our budget forcing technique on mathematical reasoning, we use the large-scale MATH500 [59] benchmark as our main testbed. For robust and reproducible evaluation, we employ the lighteval framework (version 0.8.1). All inference is accelerated using vLLM (version 0.10.2). To standardize the assessment of pass@1 accuracy, we apply a consistent sampling strategy across all benchmarks: generations are sampled with a temperature of 0.6, a top_p of 0.95, and a maximum completion length extended to 32K tokens to accommodate any lingering verbose trajectories from the baseline models.

5.3 Results: Efficiency-Accuracy Trade-off

As established, our primary objective is to compress the generated reasoning trajectories with minimal degradation in task performance. Because our soft-barrier reward formulation deliberately omits a strict regularizer weight for the budget penalty (see eq. 1), we discovered that the Kullback-Leibler (KL) divergence penalty coefficient in GRPO (β_KL) serves as an effective control mechanism to enforce budget-friendly behavior. Empirically, setting β_KL = 10^-3 yields the optimal balance, significantly reducing generation length with negligible performance drops. Conversely, a relaxed penalty of β_KL = 10^-4 improves formatting adherence at very short completion lengths, albeit at the cost of a slightly higher performance regression when evaluated on larger, unbounded contexts.

Figure 4 illustrates the average completion length distributions for the unconstrained baseline (purple) and two intermediate checkpoints trained with β_KL = 10^-3. The left and right panels depict evaluations where the maximum completion length is strictly capped at 4K and 6K tokens, respectively. To enforce this hard budget during inference, we abruptly truncate the generation upon hitting the token limit and subsequently append a prompt forcing the model to immediately output its final answer. As demonstrated in Figure 4, our RL fine-tuning effectively shifts the distribution density toward significantly shorter lengths. Crucially, the transition from the baseline (purple) through the intermediate checkpoint (blue) to the final policy (green) highlights a stable, progressive optimization trajectory.
Rather than experiencing sudden, erratic policy collapse, the model smoothly and monotonically learns, over the course of training, to generate more concise reasoning traces that still solve the tasks.

To quantify this compression, Figure 5 provides a granular breakdown of the actual CoT length reductions achieved via our RL fine-tuning. Specifically, the right panel of Figure 5 reveals that our approach yields an average completion-length reduction factor of ~2.4x, with maximum compression rates reaching up to ~8x on certain queries. As noted previously, this aggressive reduction in verbosity is achieved while maintaining performance comparable to the base model, with only minimal (and in many instances negligible) accuracy drops. Practically, this reduction directly translates to lower overall inference latency and a faster time-to-final-answer, making advanced reasoning models significantly more viable for deployment in resource-constrained environments.

Figure 4: Average completion length distributions. Left: Evaluation with a forced maximum completion length of 4K tokens. Right: Evaluation with a maximum of 6K tokens. Note that distribution tails extending below zero or above the maximum budget are standard artifacts of kernel density estimation (KDE) curve smoothing. The progression from the baseline (purple) through the intermediate (blue) to the final RL fine-tuned checkpoint (green) demonstrates stable, progressive learning of concise generation (β_KL = 10^-3).

Figure 5: Average completion length comparison. Left: CDF of the average completion length for the base model (orange curve, acc = 82.7%) and the RL fine-tuned model (green curve, acc = 88.3%) with β_KL = 10^-3, under a maximum completion length of 6K. Right: Reduction in completion length for the RL fine-tuned model, using the same models as in the left plot. The RL fine-tuned model achieves an average length reduction of 2.38 ± 0.07x.

Table 8: Performance results on MATH500 for different budget values. Accuracy ↑ (%). For budgets ∈ {1K, 2K, 4K, 6K}, we manually force the model to generate the final answer once the limit is hit. In the last column, we do not force the model to generate the answer; we let it generate tokens until the budget is hit.

Model                    Budget=1K  Budget=2K  Budget=4K  Budget=6K  Average (Budget=32K)
SFT Baseline (r=128)     34         57         73         83         95
BF RL (β_KL = 1e-3)      62         78         85         90         92
BF RL (β_KL = 1e-4)      72         80         84         85         90

5.4 Qualitative Analysis of Budget-Forced CoT

Input prompt. Problem: What is the value of (26^2 − 24^2 − 10)^2 − 10^2?

Baseline response (standard CoT), 3118 tokens: "Okay, so I need to find the value [...] Maybe if I simplify the inner part first [...] Using difference of squares, 26^2 − 24^2 = 100. Then subtract 10 to get 90. Now we have 90^2 − 10^2 [...] Applying difference of squares again: (90 − 10)(90 + 10) = 80 × 100 = 8000. Let me check my steps again [...] Starting over [...] First part 100 [...] Then subtract 10 [...] Alternatively, if I do it the other way [...] So both methods give me 8000. Let me try expanding (26^2 − 24^2 − 10)^2 first [...] 26 squared is 676 [...] 24 squared is 576 [...] Difference is 100 [...] Correct. Alternatively, maybe factoring the entire expression [...] [(26^2 − 24^2 − 10)]^2 − 10^2 [...] This seems more complicated. Wait, just to be thorough, maybe I can compute each step numerically [...] All methods confirm it's 8000."

Budget-forced CoT, 810 tokens: "First, we simplify the expression 26^2 − 24^2 using the difference of squares formula: 26^2 − 24^2 = (26 − 24)(26 + 24) = 2 × 50 = 100. Next, substitute this result back into the original expression: (100 − 10)^2 − 10^2 = 90^2 − 10^2. Apply the difference of squares formula again: 90^2 − 10^2 = (90 − 10)(90 + 10) = 80 × 100 = 8000. Therefore, the final answer is 8000."

Figure 6: Qualitative comparison on algebraic simplification. Middle: The baseline trace correctly identifies the difference-of-squares strategy immediately but engages in excessive self-verification, re-calculating the result via expansion, direct computation, and alternative factorizations. Bottom: The budget-forced trace recognizes the nested difference-of-squares structure and executes the solution linearly without redundant checking.

To examine the mechanics of our budget forcing objective at the trajectory level, we qualitatively compare the reasoning traces of the unconstrained baseline against our budget-forced model across four distinct mathematical domains: number theory, algebraic simplification, pattern recognition, and modular arithmetic (Figures 6-7 and Figures 9-10 in Appendix C). A consistent pattern emerges from this analysis: the unconstrained baseline frequently suffers from severe epistemic hesitation. While it typically identifies the correct logical strategy early in the generation process, it expends thousands of tokens on redundant self-verification, testing alternative (and often less efficient) methods, and hypothesizing trivial errors. For instance, in Figure 6 and Figure 7, the baseline arrives at the correct answer almost immediately but falls into extensive validation loops, re-calculating the result using three to four different approaches.
In contrast, our budget-forced policy learns to confidently trust its initial, correct logical derivations. It successfully prunes these redundant validation loops and verbose syntactic parsing, drastically reducing the trace length while strictly preserving the essential reasoning backbone and human readability. Furthermore, although we omit the special delimiter tokens from the visualizations for brevity, we observe that the budget-forced model robustly generates them, maintaining the required CoT and final-answer formatting constraints.

6 Parallel Test-Time Scaling and Reasoning

During autoregressive generation with LLMs, one of the dominant runtime bottlenecks is repeatedly loading the model weights of the individual layers to produce the next token. As a result, there is often headroom to increase the total amount of computation performed per user query without incurring significant runtime overhead. A straightforward way to do this is to generate multiple independent samples in parallel: instead of producing one CoT trajectory and one final answer, this approach allocates additional compute to produce several trajectories concurrently.

Input prompt. Problem: What is the remainder when 1 + 2 + 3 + 4 + ··· + 9 + 10 is divided by 9?

Baseline response (standard CoT), 2822 tokens: "Okay, so I need to find the remainder [...] Maybe I should just add up all the numbers [...] Sum formula: 10 × 11 / 2 = 55. To find 55 divided by 9 [...] Remainder is 1. Wait, but let me confirm this another way to be sure [...] Let me add the numbers step by step [...] 1 mod 9 is 1 [...] 55 mod 9 is 1. Correct. Alternatively, maybe using the digit-sum rule [...] 10 mod 9 is 1. Alternatively, grouping pairs [...] Therefore, all these methods lead me to the same answer [...] I don't see any mistakes here. The remainder is 1."

Budget-forced CoT, 559 tokens: "First, calculate the sum of integers from 1 to 10 using the formula n(n + 1)/2: Sum = 10 × 11 / 2 = 55. Next, find the remainder when 55 is divided by 9: 55 = 9 × 6 + 1. Therefore, the remainder is 1."

Figure 7: Qualitative comparison on modular arithmetic. Middle: The baseline trace correctly computes the sum and remainder immediately but engages in extensive, redundant verification using four different methods (step-by-step addition, the digit-sum rule, pairing, and re-calculation). Bottom: The budget-forced trace performs the direct calculation and returns the result without hesitation.

Parallel generation is not only attractive from a systems perspective; it also improves accuracy. Across benchmarks and models, generating multiple candidate solutions and then aggregating them has repeatedly shown consistent performance gains. A common and simple aggregation strategy is majority voting over individual answers, where the most frequently produced answer is selected as the final answer. In practice, majority voting is widely used as a reliable tool for achieving an additional boost in final model performance from the same underlying base model.

Initial works [60; 61] indicated that the same LLM can be made to generate multiple diverse and independent responses towards finding a better solution. Scaling compute by sampling multiple independent responses prior to aggregation has been shown [62; 63; 64] to be an effective paradigm that enables smaller models to outperform larger models at the same inference compute budget. In these cases, the key is to score the solutions (e.g., by frequency, or using an external verifier) and select the highest-scoring solution. Parallel reasoning design involves selecting a sampling scheme, designing and employing reward models, and finally choosing an aggregation scheme to produce the final answer. Diversity among responses [62] is crucial to increasing the probability of sampling the correct response.
Increasing diversity is typically achieved by autoregressive generation at a higher sampling temperature [65]. While responses are typically sampled independently, recent works explore inter-dependent sampling [66; 67; 68; 69] and guided search [70; 71; 72] towards generating a better response candidate pool. In this work, we focus on independent sampling schemes. Reward models (also referred to as verifiers) estimate one or more scalar-valued scores for a given response. There is a long line of work on improving reward models, with more accurate reward models [73; 74], more granular score assignment (e.g., per-step scores with process reward models) [75; 76; 77], scaling verification with more compute [78], and extending reward models to reasoning domains beyond mathematics [79; 80]. In this work, we use reward models for scoring the outcome. Finally, aggregation is essential for drafting a specific response from a pool of multiple response-score pairs. Existing works have primarily looked at this from a rank-and-select lens: ranking the responses either by frequency of occurrence ("self-consistency" or "majority voting") [61], or using external reward models to score responses [60; 76]. Note that these strategies lead to a zero-sum situation: the top-ranking solution is selected and the rest are discarded. Consequently, a more recent line of work [81; 82; 83; 84] investigates synthesizing (rather than selecting) a final response based on the candidate responses.

Scaling compute at test time generally implies additional latency and compute overhead, making it especially challenging to realize on resource-constrained edge devices. Some benefits of parallel TTS on edge devices were discussed in [85] with a separate verifier.
This separation prevents efficient reuse of intermediate computations, e.g., through the KV cache. Consequently, scaling parallel compute efficiently has gained attention in the research community. Towards efficient reasoning, several works have investigated generating solutions while performing fewer FLOPs [63] or reducing the memory footprint [86; 87] required during generation. Closest to the verifier design presented in this paper is GenRM [88], where the authors investigate fine-tuning the base generative model to additionally perform prompt-based verification. Such a joint generation-verification paradigm is appealing for edge devices, since generation and verification can be performed with minimal movement of parameters between DRAM and flash memory.

6.1 Efficient Verifier Design for Edge Compute

Parallel reasoning produces a set of N independent CoT traces and corresponding final answers. The central question then becomes: given these N candidates, how do we reliably choose the correct one? Majority voting provides a strong baseline, but it is not always sufficient, especially when the answer space is large or ambiguous. This motivates introducing an explicit selection mechanism that can score candidates and prefer those most likely to be correct, while still benefiting from the diversity created by independent sampling.

A natural approach is to add a verifier model that evaluates each candidate solution. However, deploying a separate verifier on edge devices at the same scale as the generator can be prohibitive in storage and memory footprint, and often becomes especially costly in latency. To avoid this, we aim to reuse the generator as much as possible. Concretely, we keep the same base model and add a lightweight verifier head: a separate linear layer applied to the final token embedding, followed by a sigmoid activation to yield a scalar "correctness" score. Figure 1b illustrates this approach.
This design keeps additional parameters minimal and, crucially, allows KV-cache reuse for all generated responses, since the verifier head can operate on representations already computed during generation. In addition to the linear head, we append a short verification prompt after each generated response, asking the model whether the proposed solution is correct. Empirically, this extra query has proven beneficial compared to relying on the linear head alone without an explicit verification prompt. Operationally, this strategy only requires a small additional prefill step for the verification prompt of each individual generation, while preserving KV-cache reuse from the original generation. The verifier therefore adds only modest overhead per candidate while improving the reliability of the selection process.

To obtain the best results, we combine majority voting with verifier scoring into a weighted majority vote. Instead of counting each candidate answer equally, we weight each candidate's vote by its verifier score. Intuitively, candidates that the verifier deems more likely to be correct contribute more to the final decision, while still retaining the robustness benefits of aggregation across multiple samples. This hybrid approach preserves the simplicity and stability of voting while incorporating a learned notion of solution quality, which is particularly helpful when the candidate set contains a mix of superficially plausible but incorrect solutions alongside correct ones.

6.2 Training and Evaluation Details

Verification is treated as a binary classification problem in which a candidate response of the generator is labeled as correct or incorrect with respect to the ground-truth answer. To generate training data for the verifier, we use 97.5% of the 7.5k questions from the MATH training set [38]. The remaining 2.5% of questions are reserved for validation of the verifier.
To construct a diverse training set and a validation set for fast evaluation, we generate 16 candidate responses for each training question, while we limit ourselves to 4 generated responses in the validation set. The verifier head uses a sigmoid activation to produce a probability-like score, and we train it with binary cross-entropy loss. This setup aligns the verifier objective directly with the downstream selection problem: distinguishing correct from incorrect candidate solutions produced by the generator under the same sampling procedure used at inference time.

Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.

Table 9: Accuracy of our lightweight verifier weighted majority voting compared to majority voting without verifier and greedy decoding on MATH500. The mean and standard deviation are computed from 20 random draws from 16 independent 4-bit-weight-quantized Qwen-2.5-7B-Instruct responses.

Parallel Responses    1           2           4           6           8
Greedy (baseline)     71.0        -           -           -           -
Majority Vote         69.9 ± 1.3  70.0 ± 1.3  75.1 ± 1.0  76.6 ± 1.0  77.5 ± 0.8
Weighted MV (ours)    69.9 ± 1.3  72.7 ± 1.0  76.1 ± 0.9  77.5 ± 0.8  78.2 ± 0.7

6.3 Results

We evaluate the proposed lightweight verifier on MATH500 using a 4-bit-weight-quantized Qwen-2.5-7B-Instruct model, comparing greedy decoding, standard majority voting, and our weighted majority voting. Table 9 summarizes accuracy as a function of the number of parallel responses, sampled with temperature set to 0.7.

Even with very limited parallelism, parallel test-time scaling yields immediate benefits. With just two parallel responses, weighted majority voting improves accuracy to 72.7%, outperforming both the greedy baseline (71.0%) and standard majority voting (70.0%). This highlights a key advantage of incorporating verification: when two sampled responses disagree, majority voting cannot break ties, whereas the verifier-weighted scheme can consistently select the more reliable candidate.
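The verifier-weighted vote itself is a few lines of logic. The following pure-Python sketch (function and variable names are our own, for illustration) shows how verifier scores break the two-response tie that plain majority voting cannot:

```python
from collections import defaultdict

def weighted_majority_vote(answers, scores):
    """Each candidate's vote is weighted by its verifier score; the answer
    with the largest total weight wins. With all weights equal to 1 this
    reduces to plain majority voting."""
    totals = defaultdict(float)
    for ans, s in zip(answers, scores):
        totals[ans] += s
    return max(totals, key=totals.get)

# Two parallel responses disagree: unweighted voting is tied, but the
# verifier-weighted scheme selects the higher-scored candidate.
answers = ["42", "41"]
scores = [0.91, 0.34]      # illustrative verifier scores
winner = weighted_majority_vote(answers, scores)
```

When several candidates share the same final answer, their scores accumulate, which is what preserves the robustness of aggregation at larger parallelism.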
As the degree of parallelism increases, both majority voting and weighted majority voting exhibit steady gains, confirming prior observations that parallel sampling improves the probability of generating a correct solution. However, weighted majority voting consistently outperforms unweighted majority voting across all parallelism levels, and at eight parallel responses, weighted majority voting improves upon the baselines by 10%. Importantly, the variance across random draws is also slightly reduced compared to majority voting, suggesting that verifier weighting provides a more stable aggregation mechanism.

Despite its simplicity, the verifier delivers strong benefits. Architecturally, it amounts to generating only a minimal amount of overhead, effectively one extra token per stream. Because the verifier reuses the generator's KV-cache, it does not require reprocessing the original prompt or response, avoiding the dominant memory and latency costs typically associated with separate verifier models. This makes the performance gains particularly compelling in edge settings, where memory bandwidth and storage are tightly constrained.

7 Quantization

In the following section, we discuss the details of our quantization methodology. We begin by providing a brief overview of neural network quantization and a summary of recent methods for quantizing LLMs in Section 7.1. We outline our strategy for quantizing the base LLM in Section 7.2, demonstrated on Qwen2.5-7B-Instruct, and later equip it with reasoning capabilities in Section 7.3. Lastly, we provide the specifics of quantized model export and deployment on device in Section 7.4.

7.1 Background and Related Work

Quantization. Neural network quantization is one of the most powerful ways to reduce model footprint, data transfer and compute requirements [89; 90].
By quantizing a model, high bit-width floating point weights and activations can be represented using low-bit numbers. Next to reducing model size, the use of low-bit fixed-point representations, such as INT8, can also significantly reduce the latency and energy consumption [91]. We use the following definition of the quantization-dequantization function:

$$\widehat{x} := q(x;\, s, z, b) = s \cdot \Big( \underbrace{\mathrm{clip}\big( \lfloor x/s \rceil + z;\; -2^{b-1},\, 2^{b-1}-1 \big)}_{=:\, x_{\mathbb{Z}}} - z \Big), \qquad (7)$$

where $x$ denotes the quantizer input (i.e., network weight or activation tensor), $s$ the high-precision (FP32/FP16/BF16) quantization scale, $z$ the integer zero offset, and $b$ the bitwidth. $\lfloor\cdot\rceil$ denotes the round-to-nearest-integer operator. $x_{\mathbb{Z}}$ is a $b$-bit integer quantized representation of the input $x$. Quantization parameters $s$, $z$ can be shared across the components of $x$ (typically per-channel or block-wise). This quantization scheme is called uniform affine or asymmetric quantization [92; 89; 93] and is one of the most commonly used quantization schemes because it allows for efficient implementation of fixed-point arithmetic. In the case of symmetric quantization, we restrict the quantization grid to be symmetric around $z = 0$.

Quantization methods can generally be categorized into post-training quantization (PTQ) and quantization-aware training (QAT) families. PTQ algorithms convert pretrained high-precision networks directly into fixed-point models without the need for the original training pipeline [94; 95; 96; 97; 98; 99; 100; 101; 102]. These approaches are fast, easy to use, and typically rely only on a small calibration dataset. In contrast, QAT methods [103; 104; 105; 106] simulate quantization during training to find more optimal solutions, but generally require longer training, more memory, labeled data, and careful hyperparameter tuning.

LLM Quantization.
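The quantization-dequantization function of eq. (7) can be sketched in a few lines of NumPy. This is a toy illustration of the formula only (note that `np.round` implements round-half-to-even rather than a strict round-half-up for the nearest-integer operator):

```python
import numpy as np

def quantize_dequantize(x, s, z, b):
    """Eq. (7): x_hat = s * (clip(round(x/s) + z, -2^(b-1), 2^(b-1)-1) - z).
    Uniform affine quantization; z = 0 gives the symmetric case."""
    x_int = np.clip(np.round(x / s) + z, -2 ** (b - 1), 2 ** (b - 1) - 1)
    return s * (x_int - z)

x = np.array([-1.0, -0.3, 0.0, 0.4, 2.5])
s, z, b = 0.1, 0, 4            # symmetric INT4 grid: integers in [-8, 7]
x_hat = quantize_dequantize(x, s, z, b)
# In-range values snap to the nearest grid point; the outlier 2.5 is
# clipped to 7 * s = 0.7, illustrating the range-precision trade-off.
```

The clipping of 2.5 in this toy example is exactly the outlier problem discussed below: enlarging s would capture it, at the cost of coarser resolution near zero.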
The excessive training cost and memory usage of traditional QAT methods renders them less practical for quantizing modern LLMs, although some works such as LLM-QAT [107] and BitDistiller [108] explore QAT with knowledge distillation. Notably, [109; 110] are the only studies we are aware of that successfully scale QAT to billions of tokens. Several papers explored the combination of QAT and parameter-efficient fine-tuning (PEFT), including [111; 112; 113; 114; 115; 116]. Most of these approaches offer a substantial memory reduction compared to traditional QAT, but generally are not focused on inference efficiency. For instance, QLoRA [111] quantizes the pretrained weights to 4 bit using a (non-uniform) NF4 format but dequantizes them in the forward pass back to BF16.

Post-training quantization of LLMs is a challenging task due to the presence of strong numerical outliers in weights and activations [117; 118; 119; 120; 121]. The core challenge is that quantizing outliers onto a fixed-point grid forces a range-precision trade-off: increasing the dynamic range captures outliers but sacrifices precision near zero, while retaining precision requires clipping them, and both options strongly degrade model performance.

Existing LLM PTQ methods can be broadly categorized into weights-only quantization and weight-activation quantization. Weights-only quantization focuses on converting weights to low-bit values. GPTQ [122] employs second-order information to iteratively round grouped weights and correct the quantization error in the remaining groups. SpQR [123], AWQ [124] and OWQ [125] emphasize the importance of so-called "salient" weights that correspond to high-magnitude activations. Other recent weights-only methods include [126; 127; 128; 129]. Weight-activation quantization compresses both weights and activations.
SmoothQuant [130], LLM.int8()/GPT3.int8() [119] and Outlier Suppression [131] achieve W8A8 quantization by managing activation outliers. LLM.int8() uses mixed-precision decomposition, while the other two employ channel-wise scaling. Some of the other recent weight-and-activation PTQ methods are [132; 133; 134; 135; 136; 137; 138].

LLM Quantization using FPTs. A promising direction in LLM quantization is the use of rotations and other function-preserving transformations (FPTs). Nagel et al. [100] first explored FPTs for CNN quantization, showing that ReLU and per-channel scaling commute, enabling cross-layer rescaling of weights. In the LLM setting, Xiao et al. [130] propose migrating problematic outliers from the activations into the weights through online per-channel scaling applied before linear layers. Follow-up work extends this idea by incorporating shifts into the scaling [134], scaling vectors for queries and keys [139], channel-mixing transforms [129], randomized Hadamard transforms to reduce outliers [140; 141], other online rotations [142], combinations of scaling and rotations [143], and Kronecker-structured matrix transforms [144].

Recently, FPTQuant [145] introduced three novel, lightweight, and expressive FPTs to facilitate quantization of transformers. By leveraging the equivariances and independencies inherent to modern transformers, these FPTs are designed to maintain the model's function while shaping the intermediate activation distributions to be more quantization-friendly. As a result, FPTQuant enables static INT4 quantization with virtually no overhead and no custom kernels, is very fast, and performs on par with or exceeds most prior work.

7.2 Quantizing Base Language Model

Quantization setup.
To run the model efficiently on our hardware, we quantize the weights of all linear layers including the final LM head using INT4 per-channel uniform affine quantization (eq. 7). To further improve efficiency and reduce latency, we use INT8 KV-cache, INT8 input embeddings and INT16 for all remaining activations (all per-tensor). For brevity, we will refer to this configuration as 'W4A16KV8'. We use symmetric quantization for weights, KV-cache and embeddings, and asymmetric quantization for activations, which is a common setting.

Transformations. To maximize the accuracy of the quantized model, we apply the subset of fully-mergeable transformations from FPTQuant [145] (Figure 8):

· a pair of pre-RoPE transforms (T_k, T̄_k), where T_k is applied to keys and T̄_k can be interpreted as an inverse of T_k, applied to the queries;
· (T_u, T_u^{-1}), a per-channel scaler merged into up and down projection weights;
· multi-head value transforms (T_v, T̄_v), which consist of invertible matrices per head merged into value and output weights;
· and a rotation matrix (T_r, T_r^{-1}) for rotating the residuals (applied at the beginning and the very end of each transformer block, and shared).

This set of transformations T := {T_k, T_u, T_v, T_r} helps shape the intermediate activation distributions to be more quantization-friendly, while keeping the unquantized model outputs intact. All of the above FPTs are fully mergeable, so that we can run the model without any extra inference overhead on our hardware.

Training and evaluation details. Following the literature [145; 146], we train the model on DCLM-Edu [147], a cleaner filtered version of DCLM [148] obtained by applying an educational quality classifier [149].
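The "fully mergeable" property can be illustrated with a toy two-layer example: an invertible transform folded into one weight matrix and its inverse into the next leaves the function unchanged while reshaping the intermediate activations. This is a minimal NumPy sketch, not the actual T_k/T_u/T_v/T_r construction; it assumes no nonlinearity between the two layers (with one, only transforms that commute with it, such as per-channel scaling with ReLU [100], can be merged this way):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 8)), rng.normal(size=(4, 8))
x = rng.normal(size=8)

# An invertible transform T and its inverse, merged offline into the weights.
T = rng.normal(size=(8, 8)) + 4.0 * np.eye(8)   # well-conditioned, invertible
W1_merged = T @ W1                # reshapes the intermediate activations T @ W1 @ x
W2_merged = W2 @ np.linalg.inv(T) # the inverse cancels T exactly

y_original = W2 @ (W1 @ x)
y_merged = W2_merged @ (W1_merged @ x)
# Same function, different (hopefully more quantization-friendly) intermediates.
```

Because the merge happens once, offline, the deployed model carries no extra matrices and pays no inference overhead, which is the property exploited above.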
We initialize quantization parameters by minimizing the L_p (p = 2) norm between quantized and unquantized tensors, and then train the transformations T and quantization parameters {s, z} end-to-end, closely following the pipeline of FPTQuant. We simulate quantization using FastForward [19]. For brevity, we denote the aforementioned quantization pipeline as FPTQuant◦.

To assess the predictive performance of the quantized base language model, we follow previous work [122; 130; 139; 145; 144] and report WikiText-2 test perplexity (assuming a sequence length of 4096). We also report an average zero-shot accuracy on a set of common sense reasoning (CSR) tasks that includes PIQA [150], WinoGrande [151], HellaSwag [152], ARC-e and ARC-c [153], and LAMBADA [154]. Finally, we report 5-shot accuracy on MMLU [155]. For CSR and MMLU evaluation, we use the LM Harness framework [156].

Figure 8: Function-Preserving Transformations. We use 4 transform types from FPTQuant: scale-and-rotate transform T_k merged into query and key, a per-channel scaler T_u merged into up and down projection, T_v that consists of invertible matrices per head merged into value and output weights, and a rotation matrix T_r for rotating residuals (shared across layers). After training of the transforms is complete, the transformation parameters from each 'merge group' are merged into the original model weights W.

Table 10: Quantized W4A16KV8 base model (Qwen2.5-7B-Instruct) results. We report WikiText perplexity, average 0-shot CSR, and 5-shot MMLU accuracies. 'L_p' denotes the use of L_p range initialization, 'T' = using the set of mergeable transforms, 'train' = end-to-end training of transformation and quantization parameters.

Method                Bitwidth   L_p  T   train  WikiText-2 (↓)  CSR (↑)  MMLU (↑)
Full-precision        BF16       -    -   -      6.85            72.90    74.28
Min-max quantization  W4A16KV8   -    -   -      102.4           51.71    62.35
                      W4A16KV8   ✓    -   -      9.18            65.83    67.59
                      W4A16KV8   -    ✓   -      8.48            67.85    69.06
                      W4A16KV8   ✓    ✓   -      7.53            70.68    72.26
FPTQuant◦ (ours)      W4A16KV8   ✓    ✓   ✓      7.26            72.94    72.81

Results. We summarize our results for the quantized base model in Table 10. As we can see, the simplest PTQ pipeline with min-max range estimation experiences an unacceptable accuracy/perplexity drop. Both strong numerical outliers in the activations (even with 16 bits!) but mainly the catastrophic loss of precision in the quantized 4-bit weights lead to such poor performance.

Employing the set of function-preserving mergeable transformations T already significantly improves the distribution of weights and activations, leading to much better accuracy, even with min-max range setting. Further, using a better range initialization together with end-to-end learning both progressively recover a greater portion of the full-precision model performance. In the end, we match full-precision accuracy on CSR, and have about a 0.4 perplexity drop on WikiText-2 and just under a 1.5% accuracy drop on MMLU, where the latter is known to be quite a challenging benchmark. Overall, the FPTQuant◦-quantized base model demonstrates strong performance, given that the entire process took less than 24 hours on a single Nvidia H100 80GB GPU.

7.3 Quantization-Aware Modular Reasoning

Quantizing the base model will inevitably affect the underlying activation distributions. To achieve the best performance, it is crucial to account for these changes when applying subsequent fine-tuning, including our Reasoning LoRA protocol described in Section 3. Our approach follows the general paradigm of QLoRA [111] and related techniques [112; 113], in which LoRA adapters are trained on top of a frozen, quantized base model. To further improve memory and runtime efficiency, we quantize the trained LoRA adapter weights to INT8 and use INT16 activations during inference. As with the base model, we use symmetric per-channel quantization for weights and asymmetric per-tensor quantization for activations. We denote the aforementioned technique as Quantization-Aware Modular Reasoning (QAMR).

Training and evaluation details. We follow the training and evaluation protocols described in Section 3. For training, we use the OT3 [33] dataset. To assess the reasoning capabilities of our quantized reasoning model, we use a comprehensive set of benchmarks, including AIME24/25 [36], MATH500 [38], GPQA Diamond [39], and AMC23 [37]. We conduct an ablation study to assess the contributions of both the improved base model quantization pipeline and the proposed QAMR approach using a random subset of 50k training examples, while reserving the full training dataset for the final set of results.

Table 11: Quantized W4A16KV8 reasoning model results (Qwen2.5-7B-Instruct base). We report AIME24/25, MATH500, GPQA, and AMC23 accuracies (higher is better). N = number of OT data examples used for training the reasoning module.

Bitwidth   FPTQuant◦  QAMR  N     AIME24  AIME25  MATH500  GPQA  AMC23  Avg
BF16       -          -     50k   21.8    20.3    82.6     38.6  65.2   45.70
W4A16KV8   -          -     50k   0.0     0.0     0.0      0.0   0.0    0.00
W4A16KV8   -          ✓     50k   17.3    6.3     75.6     33.0  64.0   39.25
W4A16KV8   ✓          ✓     50k   23.3    15.0    79.6     33.7  57.0   41.72
BF16       -          -     1.2M  53.3    33.0    94.0     39.9  82.5   60.54
W4A16KV8   ✓          ✓     1.2M  46.6    36.6    89.6     37.8  80.0   58.12

Results. We observe in Table 11 that a naïvely quantized base model combined with a full-precision reasoning module (i.e., without QAMR) is essentially non-functional. Qualitatively, such a model outputs seemingly random tokens, without any structure or relevance to the tasks at hand.
In contrast, applying QAMR, even with relatively short training, recovers a substantial portion of the performance on tasks such as MATH500 and GPQA. Notably, quantization-aware modular reasoning is crucial for learning anything at all.

Further, using FPTQuant◦ offers a stronger starting point for training the reasoning module and consistently improves performance across all benchmarks except AMC23, for which longer training is required. By shaping the underlying activation distributions to be more quantization-friendly, using the FPTQuant◦-quantized base model leads to significantly fewer training instabilities and enables faster learning compared to a base model quantized with a standard min-max range setting.

Finally, when combined with extended training, our approach achieves performance within roughly 2% of an equivalently trained full-precision reasoning model on average, while being significantly more compact and inference-efficient.

7.4 Verifier Quantization and On-Device Deployment

As the final stage of our quantization pipeline, we address the quantization of the verifier, which is essential for deploying the proposed verifier on resource-constrained hardware. To minimize degradation from reduced numerical precision, we train the verifier directly on embeddings produced by the 4-bit weight-quantized Qwen-2.5-7B-Instruct model obtained in Section 7.2. This choice reduces distribution shift between training and inference, ensuring that the verifier learns to operate on representations similar to those it will encounter at deployment time. After training the verifier head under this setting, we further quantize both activations and verifier head weights to 8-bit representations using FastForward [19].

Once all components, including the base model, reasoning adapters, and verifier, are quantized, we prepare the models for on-device deployment.
The rst step after quantization is the model transformation to assure compat- ibility at pytorch representation level with the format supported by GENIE SDK [ 20 ]. These pertains to aspects of autoregressiv e parallel and sequential generation for prell and decoding, as well as handling the attention opera- tions and masking, and position embeddings. Next, we establish compatibility with GENIE at the ONNX representation lev el. W e use Qualcomm FastForwar d [ 19 ] for this stage and implement transformations at linear layer s and multi-head attention, as well as model parti- tioning. W e use Py tor ch with FastForw ard to get the ONNX graph and the associated quantization encodings. W e export to Deep Learning Container format, and Quantize any remaining non-quantized nodes missed by FastFor - ward ( e.g. biases). We compile for the deployment target (e .g. aarch64-android) and upload to the device using adb (Android Device Bridge). 8 Discussions and Challenges Deploying capable reasoning models on resour ce-constrained edge devices requires navigating a complex trade- off between task performance, latency , memory footprint, and power consumption. In this work, we proposed a practical end-to-end framework to ov ercome these limitations. By decoupling reasoning from the base weights using modular LoRA adapters, we demonstrate that parameter-efcient ne-tuning can achieve competitive rea- soning accuracy relative to computationally expensive full-parameter distillation methods like DeepSeek-R1-Distill. T o optimize day -to-day user interactions, our lightweight dynamic Switcher routes standard conversations to the highly efcient base model, reserving the reasoning adapters strictly for complex queries wher e multi-step logic is required. T o further combat latency , our budget -forced RL alignment explicitly penalizes generation verbosity , yielding a 2.4 × reduction in average reasoning tokens without sacricing task accuracy . 
At inference time, we exploit the memory-bound nature of autoregressive decoding by introducing parallel test-time scaling, coupled with a lightweight verifier that provides up to a 10% accuracy boost on complex reasoning benchmarks. Finally, we show that 4-bit weight quantization via FPTQuant and Quantization-Aware Modular Reasoning preserves this robust performance, achieving within 2% of the full-precision reasoning model's accuracy while delivering massive memory savings. Below, we summarize the key insights derived from each component of our pipeline and outline the remaining challenges that pave the way for future research.

LoRA for modular reasoning. Our experiments revealed that parameter-efficient fine-tuning via LoRA is highly effective at eliciting reasoning capabilities in small (3B and 7B) base models, often rivaling the performance of computationally expensive full-parameter distillation. A key insight is that the success of LoRA is heavily dependent on adapter capacity and base model scale. While a LoRA rank of 128 allowed the 7B model to nearly match dense fine-tuning baselines on challenging benchmarks, the 3B model exhibited a wider performance gap, indicating that smaller backbones are more sensitive to adapter capacity limits. Furthermore, while reasoning specialization improves performance on complex tasks (e.g., LiveCodeBench), it introduces a trade-off, occasionally degrading zero-shot performance on simpler coding tasks that require direct answers. Managing this specialization-forgetting trade-off remains an ongoing challenge.

Dynamic LoRA routing via the switcher. We demonstrated that the computational overhead of reasoning can be drastically reduced by recognizing that not all queries require complex multi-step logic. The lightweight Switcher module successfully acts as an on-demand router, preserving the base model's speed for standard queries while activating reasoning LoRA adapters only when necessary.
A major deployment insight was the necessity of masked LoRA training during the prefill stage. This strategy ensures that the KV-cache generated by the base model can be seamlessly reused by the reasoning adapters, completely eliminating the severe latency penalty of re-encoding prompt tokens when switching modes.

While the current switcher relies on a supervised classifier head to evaluate query complexity, a promising direction for future work is to learn this routing policy via reinforcement learning. This would act synergistically with our budget-forcing objectives: while budget-forced RL explicitly shortens the reasoning traces when LoRA is active, an RL-driven router optimized for both accuracy and length would learn to bypass the adapters entirely whenever possible. Because the frozen base model natively produces direct answers without verbose chain-of-thought, successfully routing a query to the non-LoRA mode automatically yields a drastically shorter response, organically reserving the reasoning adapters strictly for complex queries where the base model would fail.

Future work could extend the switcher beyond binary routing and turn it into a general mechanism for dynamic LoRA selection. Instead of choosing only between the base model and a single reasoning adapter, the system could route each query to a bank of task-specific adapters [157; 158], for example specialized for mathematics, coding, or other domains, allowing the same backbone to support richer capabilities while preserving modular deployment. An especially promising direction is to include adapters tailored for latent reasoning, since recent work suggests that latent-reasoning LoRA adapters can retain strong reasoning performance while substantially reducing the token overhead of explicit CoT generation [159; 160; 161; 162; 163].
In this setting, the switcher would not only decide whether reasoning is needed, but also which form of reasoning is most efficient for a given query, making dynamic adapter routing a natural path toward more capable and more compute-efficient on-device systems.

Budget forcing. We presented our RL recipe to finetune LLMs to generate shorter completions. By leveraging a multiplicative penalty (eq. 3), we successfully aligned LLMs to reduce the number of tokens generated to answer given questions. Our empirical evaluations demonstrate an average completion length reduction of 2.4×, and up to 8× maximum compression, with minimal degradation in task accuracy. Crucially, our formulation structurally mitigates the reward hacking we observed in early experiments, ensuring that efficiency gains stem from genuine rationale compression rather than formatting exploits. While our "soft-barrier" approach offers a robust mechanism for compute-aware inference, it also opens several exciting avenues for future research. We outline these forward-looking directions, which naturally address the current boundaries of our methodology.

Our experiments provide a snapshot of budget forcing on state-of-the-art reasoning models. However, the relationship between base model scale (e.g., parameter count) and the capacity for rationale compression remains an open question. Investigating whether larger, more capable models exhibit greater "epistemic hesitation", and thus offer a larger margin for ITC reduction, will be critical for formulating generalized scaling laws for budget forcing.

A fundamental limitation of current budget forcing approaches, including ours, is the assumption of uniform token cost, where every generated token contributes equally to the budget consumption regardless of its utility.
However, a token representing a crucial logical leap carries significantly higher semantic value than a token used for syntactic glue or hedging (e.g., "Let me think about this..."). A promising future direction lies in developing semantic-aware budget priors that weight penalties dynamically based on information density or local entropy, as explored in [164]. By shifting the optimization objective from pure length minimization to reasoning density maximization, we can encourage models to prioritize high-utility tokens while disproportionately penalizing low-entropy filler, effectively decoupling computational cost from reasoning depth.

Parallel test-time scaling and reasoning. We have demonstrated that even a light verifier design can substantially improve the effectiveness of parallel test-time scaling for reasoning on edge devices. By combining the robustness of aggregation with a learned notion of correctness at negligible additional cost, weighted majority voting offers a practical and efficient approach for deploying parallel reasoning on resource-constrained devices. The current design can be extended to score the steps with a process reward model. Another interesting extension is parallel reasoning schemes with interdependent generation, as in [66; 68; 165].

Quantization. Our experiments on Qwen2.5-7B-Instruct highlight the importance of starting from a strong quantized base model. Leveraging function-preserving transformations, improved range initialization, and joint fine-tuning of transformation and quantization parameters offers a straightforward and lightweight path to quantizing modern LLMs while preserving robust predictive performance. When extending such models with reasoning capabilities, we further show that it is essential to account for distribution shifts induced by quantization.
Inspired by prior PEFT literature, we address this challenge by proposing the Quantization-Aware Modular Reasoning (QAMR) approach, which mitigates the quantization noise and distribution shifts by training reasoning adapters directly on the quantized base model. Finally, our results demonstrate that a relatively compact 4-bit weight-quantized 7B model can achieve reasoning performance comparable to that of substantially larger models.

Reasoning remains a token-generation-intensive task, making it fundamentally memory-bound rather than compute-bound. As a result, further improvements in efficiency and performance will likely depend on reducing the memory footprint of the model weights. A promising direction for future work is to push quantization below 4 bits by leveraging state-of-the-art compression techniques such as QuIP# [166], or exploring 2-3-bit QAT methods such as ParetoQ [109].

9 Conclusion

In this work, we presented an end-to-end framework that makes state-of-the-art LLM reasoning practical on resource-constrained edge devices. We demonstrated that parameter-efficient LoRA adaptation, governed by a dynamic routing switcher, unlocks powerful reasoning capabilities without compromising the speed of everyday interactions. By introducing budget-forced reinforcement learning, we successfully curbed model verbosity to fit strict on-device token limits. To further maximize hardware utilization, we leveraged parallel test-time scaling and a lightweight latent verifier to boost accuracy during the memory-bound generation phase. Throughout this pipeline, hardware-awareness remains the unifying principle: from maximizing KV-cache reuse to intertwining quantization directly into the training of the adapters, switcher, and verifier.
Ultimately, this co-designed approach bridges the gap between cloud-based reasoning and the strict memory, latency, and power budgets of mobile hardware, providing a practical blueprint for on-device AI.

References

[1] AI Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. Claude-3 Model Card, 1(1):4, 2024. (Cited on page 1)

[2] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint, 2024. (Cited on pages 1, 3)

[3] Mohammed Abouzaid, Andrew J Blumberg, Martin Hairer, Joe Kileel, Tamara G Kolda, Paul D Nelson, Daniel Spielman, Nikhil Srivastava, Rachel Ward, Shmuel Weinberger, and Lauren Williams. First proof. arXiv [cs.AI], 5 February 2026. (Cited on page 1)

[4] Yang Cao, Yubin Chen, Xuyang Guo, Zhao Song, Song Yue, Jiahao Zhang, and Jiale Zhao. Evaluating frontier LLMs on PhD-level mathematical reasoning: A benchmark on a textbook in theoretical computer science about randomized algorithms. arXiv [cs.AI], 16 December 2025. (Cited on page 1)

[5] Tony Feng, Trieu H Trinh, Garrett Bingham, Dawsen Hwang, Yuri Chervonyi, Junehyuk Jung, Joonkyung Lee, Carlo Pagano, Sang-Hyun Kim, Federico Pasqualotto, Sergei Gukov, Jonathan N Lee, Junsu Kim, Kaiying Hou, Golnaz Ghiasi, Yi Tay, Yaguang Li, Chenkai Kuang, Yuan Liu, Hanzhao Lin, Evan Zheran Liu, Nigamaa Nayakanti, Xiaomeng Yang, Heng-Tze Cheng, Demis Hassabis, Koray Kavukcuoglu, Quoc V Le, and Thang Luong. Towards autonomous mathematics research. arXiv [cs.LG], 12 February 2026. (Cited on page 1)

[6] OpenAI. First proof?, 13 February 2026.
(Cited on page 1)
[7] David P Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo, Mohammadhossein Bateni, Simina Branzei, Michael P Brenner, Lin Chen, Ying Feng, Lance Fortnow, Gang Fu, Ziyi Guan, Zahra Hadizadeh, Mohammad T Hajiaghayi, Mahdi JafariRaviz, Adel Javanmard, Karthik C S., Ken-Ichi Kawarabayashi, Ravi Kumar, Silvio Lattanzi, Euiwoong Lee, Yi Li, Ioannis Panageas, Dimitris Paparas, Benjamin Przybocki, Bernardo Subercaseaux, Ola Svensson, Shayan Taherijam, Xuan Wu, Eylon Yogev, Morteza Zadimoghaddam, Samson Zhou, Yossi Matias, James Manyika, and Vahab Mirrokni. Accelerating scientific research with Gemini: Case studies and common techniques. arXiv [cs.CL], 16 February 2026. (Cited on page 1)
[8] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 29 September 2022. (Cited on page 1)
[9] Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, and Steven Hoi. MAI-UI technical report: Real-world centric foundation GUI agents. arXiv [cs.CV], 26 December 2025.
(Cited on page 1)
[10] Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, Wanjun Zhong, Yining Ye, Yujia Qin, Yuwen Xiong, Yuxin Song, Zhiyong Wu, Aoyan Li, Bo Li, Chen Dun, Chong Liu, Daoguang Zan, Fuxing Leng, Hanbin Wang, Hao Yu, Haobin Chen, Hongyi Guo, Jing Su, Jingjia Huang, Kai Shen, Kaiyu Shi, Lin Yan, Peiyao Zhao, Pengfei Liu, Qinghao Ye, Renjie Zheng, Shulin Xin, Wayne Xin Zhao, Wen Heng, Wenhao Huang, Wenqian Wang, Xiaobo Qin, Yi Lin, Youbin Wu, Zehui Chen, Zihao Wang, Baoquan Zhong, Xinchun Zhang, Xujing Li, Yuanfan Li, Zhongkai Zhao, Chengquan Jiang, Faming Wu, Haotian Zhou, Jinlin Pang, Li Han, Qi Liu, Qianli Ma, Siyao Liu, Songhua Cai, Wenqi Fu, Xin Liu, Yaohui Wang, Zhi Zhang, Bo Zhou, Guoliang Li, Jiajun Shi, Jiale Yang, Jie Tang, Li Li, Qihua Han, Taoran Lu, Woyu Lin, Xiaokang Tong, Xinyao Li, Yichi Zhang, Yu Miao, Zhengxuan Jiang, Zili Li, Ziyuan Zhao, Chenxin Li, Dehua Ma, Feng Lin, Ge Zhang, Haihua Yang, Hangyu Guo, Hongda Zhu, Jiaheng Liu, Junda Du, Kai Cai, Kuanye Li, Lichen Yuan, Meilan Han, Minchao Wang, Shuyue Guo, Tianhao Cheng, Xiaobo Ma, Xiaojun Xiao, Xiaolong Huang, Xinjie Chen, Yidi Du, Yilin Chen, Yiwen Wang, Zhaojian Li, Zhenzhu Yang, Zhiyuan Zeng, Chaolin Jin, Chen Li, Hao Chen, Haoli Chen, Jian Chen, Qinghao Zhao, and Guang Shi. UI-TARS-2 technical report: Advancing GUI agent with multi-turn reinforcement learning. arXiv [cs.AI], 5 September 2025. (Cited on page 1)
[11] Veuns-Team, Gao Changlong, Gu Zhangxuan, Liu Yulin, Qiu Xinyu, Shen Shuheng, Wen Yue, Xia Tianyu, Xu Zhenyu, Zeng Zhengwen, Zhou Beitong, Zhou Xingran, Chen Weizhi, Dai Sunhao, Dou Jingya, Gong Yichen, Guo Yuan, Guo Zhenlin, Li Feng, Li Qian, Lin Jinzhen, Zhou Yuqi, Zhu Linchao, Chen Liang, Guo Zhenyu, Meng Changhua, and Wang Weiqiang. UI-Venus-1.5 technical report.
arXiv [cs.CV], 9 February 2026. (Cited on page 1)
[12] Measuring thinking efficiency in reasoning models: The missing benchmark. https://nousresearch.com/measuring-thinking-efficiency-in-reasoning-models-the-missing-benchmark/, 14 August 2025. Accessed: 2026-2-19. (Cited on page 1)
[13] Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, S Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, and Mehrdad Farajtabar. LLM in a flash: Efficient large language model inference with limited memory. In ACL, 2024. (Cited on page 2)
[14] Jie Xiao, Qianyi Huang, Xu Chen, and Chen Tian. Understanding large language models in your pockets: Performance study on COTS mobile devices. arXiv [cs.LG], 9 February 2026. (Cited on page 2)
[15] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. (Cited on page 2, 4)
[16] DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948. (Cited on page 2, 4)
[17] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025. (Cited on page 2, 10, 11)
[18] Junyan Li, Chuang Gan, et al. Steering LLM thinking with budget guidance. arXiv preprint arXiv:2506.13752, 2025. NVIDIA & UMass Amherst. (Cited on page 2)
[19] FastForward: Neural network quantization for research and prototyping. https://github.com/Qualcomm-AI-research/fastforward.
(Cited on page 3, 20, 23)
[20] Qualcomm Gen AI Inference Extensions (Genie). https://www.qualcomm.com/developer/software/gen-ai-inference-extensions. (Cited on page 3, 23)
[21] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022. (Cited on page 3)
[22] Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. Show your work: Scratchpads for intermediate computation with language models, 2022. URL https://openreview.net/forum?id=iedYJm92o0a. (Cited on page 3)
[23] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf. (Cited on page 3, 11)
[24] Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, and Willie Neiswanger. Tina: Tiny reasoning models via LoRA. arXiv preprint, 2025. (Cited on page 3)
[25] Haoran Xu, Baolin Peng, Hany Awadalla, Dongdong Chen, Yen-Chun Chen, Mei Gao, Young Jin Kim, Yunsheng Li, Liliang Ren, Yelong Shen, et al. Phi-4-mini-reasoning: Exploring the limits of small reasoning language models in math. arXiv preprint arXiv:2504.21233, 2025.
(Cited on page 3)
[26] Lingjie Jiang, Xun Wu, Shaohan Huang, Qingxiu Dong, Zewen Chi, Li Dong, Xingxing Zhang, Tengchao Lv, Lei Cui, and Furu Wei. Think only when you need with large hybrid-reasoning models. arXiv preprint arXiv:2505.14631, 2025. (Cited on page 3)
[27] John Schulman and Thinking Machines Lab. LoRA without regret. Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20250929. https://thinkingmachines.ai/blog/lora/. (Cited on page 4)
[28] Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. AceReason-Nemotron: Advancing math and code reasoning through reinforcement learning. arXiv preprint arXiv:2505.16400, 2025. (Cited on page 4)
[29] Quy-Anh Dang and Chris Ngo. Reinforcement learning for reasoning in small LLMs: What works and what doesn't. arXiv preprint arXiv:2503.16219, 2025. (Cited on page 4)
[30] Mohammad Ali Alomrani, Yingxue Zhang, Derek Li, Qianyi Sun, Soumyasundar Pal, Zhanguang Zhang, Yaochen Hu, Rohan Deepak Ajwani, Antonios Valkanas, Raika Karimi, et al. Reasoning on a budget: A survey of adaptive and controllable test-time compute in LLMs. arXiv preprint arXiv:2507.02076, 2025. (Cited on page 4)
[31] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. (Cited on page 4, 12)
[32] Qwen Team. QwQ-32B: Embracing the power of reinforcement learning. https://qwen.ai/blog?id=qwq-32b, 2025. (Cited on page 4)
[33] Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, et al. OpenThoughts: Data recipes for reasoning models. arXiv preprint arXiv:2506.04178, 2025.
(Cited on page 4, 5, 7, 22)
[34] Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi Tang, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. Qwen2.5 technical report, 2025. (Cited on page 5)
[35] Hugging Face. Open R1: A fully open reproduction of DeepSeek-R1, January 2025. URL https://github.com/huggingface/open-r1. (Cited on page 5)
[36] Art of Problem Solving. AIME problems and solutions. 2024. URL https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions. (Cited on page 5, 22, 39)
[37] Art of Problem Solving. AMC problems and solutions. 2023. URL https://artofproblemsolving.com/wiki/index.php/AMC_12_Problems_and_Solutions. (Cited on page 5, 22, 39)
[38] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021. (Cited on page 5, 10, 17, 22, 39)
[39] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling, 2024. (Cited on page 5, 22, 39)
[40] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv [cs.
SE], March 2024. (Cited on page 5, 39)
[41] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code. arXiv [cs.LG], July 2021. (Cited on page 5, 6, 39)
[42] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models. arXiv [cs.PL], August 2021. (Cited on page 5, 39)
[43] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. arXiv [cs.SE], May 2023. (Cited on page 5, 39)
[44] Nathan Habib, Clémentine Fourrier, Hynek Kydlíček, Thomas Wolf, and Lewis Tunstall. LightEval: A lightweight framework for LLM evaluation, 2023. URL https://github.com/huggingface/lighteval. (Cited on page 6)
[45] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica.
Efcient memory management for large language model serving with pagedat- tention. In Proceedings of the ACM SIGOPS 29th Sy mposium on Operating Systems Principles , 2023. ( Cited on page 6 ) [46 ] Maggie Huan, Y uetai Li, T uney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha P oovendr an, Graham Neubig, and Xiang Yue . Does math reasoning improve general llm capabilities? understanding transferability of llm reasoning. arXiv preprint arXiv:2507 . 00432 , 2025 . (Cited on page 7 ) [47] Song Lai, Haohan Zhao, Rong Feng, Changyi Ma, Wenzhuo Liu, Hongbo Zhao, Xi Lin, Dong Yi, Min Xie, Qing fu Zhang, et al. Reinforcement ne-tuning naturally mitigates forgetting in continual post-training. arXiv preprint arXiv:2507 . 05386 , 2025. ( Cited on page 7 ) [48 ] Prana v Rajpurkar , Robin Jia, and Perc y Liang. Know what you don’t know: Unansw erable questions for squad, 2018 . URL . (Cited on page 10 ) Qualcomm AI Research is an initiative of Qualcomm T echnologies, Inc. 29 [49 ] Mor Geva, Daniel Khashabi, Elad Segal, T ushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies, 2021. URL https://arxiv. org/abs/2101.02235 . (Cited on page 10 ) [ 50 ] Junyu Zhang, Yif an Sun, Tianang Leng, Jingyan Shen, Liu Ziyin, Paul Pu Liang, and Huan Zhang. When reasoning meets its laws . In NeurIPS 2025 Work shop on Efcient Reasoning , 2025. URL https:// openreview.net/forum?id=lWjcbodr4M . (Cited on page 11 ) [ 51] Silei Xu, Wenhao Xie, Lingxiao Zhao, and Pengcheng He. Chain of draft: Thinking faster by w riting less. arXiv preprint arXiv:2502.18600 , 2025 . URL . (Cited on page 11 ) [ 52] Matthe w Renze and Erhan Guven. The benets of a concise chain of thought on problem-solving in large lan- guage models. arXiv preprint arXiv:2401. 05618 , 2024. URL pdf . (Cited on page 11 ) [ 53 ] Jiaqi Wang, K evin Qinghong Lin, James Cheng, and Mik e Zheng Shou. Think or not? 
Selective reasoning via reinforcement learning for vision-language models. arXiv preprint arXiv:2505.16854, 2025. URL https://arxiv.org/pdf/2505.16854. (Cited on page 11)
[54] Chengyu Huang, Zhengxin Zhang, and Claire Cardie. HAPO: Training language models to reason concisely via history-aware policy optimization. arXiv preprint arXiv:2505.11225, 2025. (Cited on page 11)
[55] Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning. arXiv preprint arXiv:2503.04697, 2025. (Cited on page 11)
[56] Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, et al. DLER: Doing length penalty right-incentivizing more intelligence per token via reinforcement learning. arXiv preprint arXiv:2510.15110, 2025. (Cited on page 11)
[57] Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Tianjun Zhang, Li Erran Li, et al. DeepScaleR: Surpassing o1-preview with a 1.5B model by scaling RL. Notion Blog, 2025. (Cited on page 12)
[58] Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformers Reinforcement Learning, 2020. URL https://github.com/huggingface/trl. (Cited on page 13)
[59] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023. (Cited on page 13)
[60] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
(Cited on page 16)
[61] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022. (Cited on page 16)
[62] Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024. (Cited on page 16)
[63] Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. Inference scaling laws: An empirical analysis of compute-optimal inference for LLM problem-solving. In The Thirteenth International Conference on Learning Representations, 2025. (Cited on page 16, 17)
[64] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. In The Thirteenth International Conference on Learning Representations, 2025. (Cited on page 16)
[65] Matthew Renze. The effect of sampling temperature on problem solving in large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7346–7356, 2024. (Cited on page 16)
[66] Chan-Jan Hsu, Davide Buffelli, Jamie McGowan, Feng-Ting Liao, Yi-Chang Chen, Sattar Vakili, and Da-shan Shiu. Group think: Multiple concurrent reasoning agents collaborating at token level granularity. arXiv preprint arXiv:2505.11107, 2025. (Cited on page 16, 24)
[67] Jiayi Pan, Xiuyu Li, Long Lian, Charlie Snell, Yifei Zhou, Adam Yala, Trevor Darrell, Kurt Keutzer, and Alane Suhr. Learning adaptive parallel reasoning with language models. arXiv preprint, 2025.
(Cited on page 16)
[68] Gleb Rodionov, Roman Garipov, Alina Shutova, George Yakushev, Erik Schultheis, Vage Egiazarian, Anton Sinitsin, Denis Kuznedelev, and Dan Alistarh. Hogwild! inference: Parallel LLM generation via concurrent attention. arXiv preprint arXiv:2504.06261, 2025. (Cited on page 16, 24)
[69] Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Xinyu Yang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, et al. Parallel-R1: Towards parallel thinking via reinforcement learning. arXiv preprint arXiv:2509.07980, 2025. (Cited on page 16)
[70] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36:11809–11822, 2023. (Cited on page 16)
[71] Xuefei Ning, Zinan Lin, Zixuan Zhou, Zifu Wang, Huazhong Yang, and Yu Wang. Skeleton-of-thought: Prompting LLMs for efficient parallel generation. arXiv preprint arXiv:2307.15337, 2023. (Cited on page 16)
[72] Shuangtao Li, Shuaihao Dong, Kexin Luan, Xinhan Di, and Chaofan Ding. Enhancing reasoning through process supervision with Monte Carlo tree search. arXiv preprint, 2025. (Cited on page 16)
[73] Chris Yuhao Liu, Liang Zeng, Jiacai Liu, Rui Yan, Jujie He, Chaojie Wang, Shuicheng Yan, Yang Liu, and Yahui Zhou. Skywork-Reward: Bag of tricks for reward modeling in LLMs. arXiv preprint arXiv:2410.18451, 2024. (Cited on page 16)
[74] Zhilin Wang, Yi Dong, Olivier Delalleau, Jiaqi Zeng, Gerald Shen, Daniel Egert, Jimmy J. Zhang, Makesh Narsimhan Sreedhar, and Oleksii Kuchaiev. HelpSteer2: Open-source dataset for training top-performing reward models, 2024. (Cited on page 16)
[75] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step.
In The Twelfth International Conference on Learning Representations, 2023. (Cited on page 16)
[76] Peiyi Wang, Lei Li, Zhihong Shao, RX Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations. arXiv preprint arXiv:2312.08935, 2023. (Cited on page 16)
[77] Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning. arXiv preprint arXiv:2501.07301, 2025. (Cited on page 16)
[78] Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, and Yu Wu. Inference-time scaling for generalist reward modeling. arXiv preprint arXiv:2504.02495, 2025. (Cited on page 16)
[79] Hyungjoo Chae, Sunghwan Kim, Junhee Cho, Seungone Kim, Seungjun Moon, Gyeom Hwangbo, Dongha Lim, Minjin Kim, Yeonjun Hwang, Minju Gwak, et al. Web-Shepherd: Advancing PRMs for reinforcing web agents. arXiv preprint arXiv:2505.15277, 2025. (Cited on page 16)
[80] Zhenfang Chen, Delin Chen, Rui Sun, Wenjun Liu, and Chuang Gan. Scaling autonomous agents via automatic reward modeling and planning. arXiv preprint, 2025. (Cited on page 16)
[81] Ammar Khairi, Daniel D'souza, Marzieh Fadaee, and Julia Kreutzer. Making, not taking, the best of n. arXiv preprint arXiv:2510.00931, 2025. (Cited on page 17)
[82] Yafu Li, Zhilin Wang, Tingchen Fu, Ganqu Cui, Sen Yang, and Yu Cheng. From drafts to answers: Unlocking LLM potential via aggregation fine-tuning. arXiv preprint, 2025. (Cited on page 17)
[83] Jianing Qi, Xi Ye, Hao Tang, Zhigang Zhu, and Eunsol Choi. Learning to reason across parallel samples for LLM reasoning. arXiv preprint arXiv:2506.09014, 2025.
(Cited on page 17)
[84] Wenting Zhao, Pranjal Aggarwal, Swarnadeep Saha, Asli Celikyilmaz, Jason Weston, and Ilia Kulikov. The majority is not always right: RL training for solution aggregation. arXiv preprint arXiv:2509.06870, 2025. (Cited on page 17)
[85] Zixu Hao, Jianyu Wei, Tuowei Wang, Minxing Huang, Huiqiang Jiang, Shiqi Jiang, Ting Cao, and Ju Ren. Scaling LLM test-time compute with mobile NPU on smartphones. In EuroSys 2026. ACM, November 2025. URL https://www.microsoft.com/en-us/research/publication/scaling-llm-test-time-compute-with-mobile-npu-on-smartphones/. (Cited on page 17)
[86] Hanshi Sun, Momin Haider, Ruiqi Zhang, Huitao Yang, Jiahao Qiu, Ming Yin, Mengdi Wang, Peter Bartlett, and Andrea Zanette. Fast best-of-n decoding via speculative rejection. Advances in Neural Information Processing Systems, 37:32630–32652, 2024. (Cited on page 17)
[87] Coleman Hooper, Sehoon Kim, Suhong Moon, Kerem Dilmen, Monishwaran Maheswaran, Nicholas Lee, Michael W Mahoney, Sophia Shao, Kurt Keutzer, and Amir Gholami. ETS: Efficient tree search for inference-time scaling. arXiv preprint, 2025. (Cited on page 17)
[88] Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal. Generative verifiers: Reward modeling as next-token prediction. In The Thirteenth International Conference on Learning Representations, 2025. (Cited on page 17)
[89] Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018. (Cited on page 19)
[90] Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart Van Baalen, and Tijmen Blankevoort. A white paper on neural network quantization. arXiv preprint arXiv:2106.08295, 2021. (Cited on page 19)
[91] M. Horowitz. 1.1 computing's energy problem (and what we can do about it).
In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 10–14, 2014. doi: 10.1109/ISSCC.2014.6757323. (Cited on page 19)
[92] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1):6869–6898, 2017. (Cited on page 19)
[93] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016. (Cited on page 19)
[94] Ron Banner, Yury Nahshan, Elad Hoffer, and Daniel Soudry. Post-training 4-bit quantization of convolution networks for rapid-deployment. arXiv preprint arXiv:1810.05723, 2018. (Cited on page 19)
[95] Yaohui Cai, Zhewei Yao, Zhen Dong, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. ZeroQ: A novel zero shot quantization framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13169–13178, 2020. (Cited on page 19)
[96] Yoni Choukroun, Eli Kravchik, Fan Yang, and Pavel Kisilev. Low-bit quantization of neural networks for efficient inference. In ICCV Workshops, pages 3009–3018, 2019. (Cited on page 19)
[97] Itay Hubara, Yury Nahshan, Yair Hanani, Ron Banner, and Daniel Soudry. Improving post training neural quantization: Layer-wise calibration and integer programming. arXiv preprint arXiv:2006.10518, 2020. (Cited on page 19)
[98] Eldad Meller, Alexander Finkelstein, Uri Almog, and Mark Grobman. Same, same but different: Recovering neural network quantization error through weight factorization. In International Conference on Machine Learning, pages 4486–4495. PMLR, 2019.
(Cited on page 19)
[99] Ritchie Zhao, Yuwei Hu, Jordan Dotzel, Chris De Sa, and Zhiru Zhang. Improving neural network quantization without retraining using outlier channel splitting. In International Conference on Machine Learning, pages 7543–7552. PMLR, 2019. (Cited on page 19)
[100] Markus Nagel, Mart van Baalen, Tijmen Blankevoort, and Max Welling. Data-free quantization through weight equalization and bias correction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019. (Cited on page 19, 20)
[101] Markus Nagel, Rana Ali Amjad, Mart van Baalen, Christos Louizos, and Tijmen Blankevoort. Up or Down? Adaptive rounding for post-training quantization, April 2020. (Cited on page 19)
[102] Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei Wang, and Shi Gu. BRECQ: Pushing the limit of post-training quantization by block reconstruction. arXiv preprint arXiv:2102.05426, 2021. (Cited on page 19)
[103] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, pages 1737–1746. PMLR, 2015. (Cited on page 19)
[104] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2704–2713, 2018. (Cited on page 19)
[105] Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra S. Modha. Learned step size quantization. In International Conference on Learning Representations (ICLR), 2020. (Cited on page 19)
[106] Markus Nagel, Marios Fournarakis, Yelysei Bondarenko, and Tijmen Blankevoort. Overcoming oscillations in quantization-aware training. In International Conference on Machine Learning, pages 16318–16330. PMLR, 2022. (Cited on page 19)
[107] Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. LLM-QAT: Data-free quantization aware training for large language models. arXiv preprint arXiv:2305.17888, 2023. (Cited on page 19)
[108] Dayou Du, Yijia Zhang, Shijie Cao, Jiaqi Guo, Ting Cao, Xiaowen Chu, and Ningyi Xu. BitDistiller: Unleashing the potential of sub-4-bit LLMs via self-distillation. arXiv preprint, 2024. (Cited on page 19)
[109] Zechun Liu, Changsheng Zhao, Hanxian Huang, Sijia Chen, Jing Zhang, Jiawei Zhao, Scott Roy, Lisa Jin, Yunyang Xiong, Yangyang Shi, Lin Xiao, Yuandong Tian, Bilge Soran, Raghuraman Krishnamoorthi, Tijmen Blankevoort, and Vikas Chandra. ParetoQ: Scaling laws in extremely low-bit LLM quantization, 2025. URL https://arxiv.org/abs/2502.02631. (Cited on page 19, 25)
[110] Mengzhao Chen, Chaoyi Zhang, Jing Liu, Yutao Zeng, Zeyue Xue, Zhiheng Liu, Yunshui Li, Jin Ma, Jie Huang, Xun Zhou, et al. Scaling law for quantization-aware training. arXiv preprint, 2025. (Cited on page 19)
[111] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems, 36, 2024. (Cited on page 19, 21)
[112] Yuhui Xu, Lingxi Xie, Xiaotao Gu, Xin Chen, Heng Chang, Hengheng Zhang, Zhensu Chen, Xiaopeng Zhang, and Qi Tian. QA-LoRA: Quantization-aware low-rank adaptation of large language models. arXiv preprint arXiv:2309.14717, 2023. (Cited on page 19, 21)
[113] Yixiao Li, Yifan Yu, Chen Liang, Pengcheng He, Nikos Karampatziakis, Weizhu Chen, and Tuo Zhao.
Loftq: Lora-ne- tuning-aw are quantization for large language models. arXiv preprint arXiv:2310 . 08659 , 2023 . ( Cited on page 19 , 21 ) [114] Han Guo, Philip Greengard, Eric P Xing, and Y oon Kim. Lq-lora: Low -rank plus quantized matrix decomposi- tion for efcient language model netuning. arXiv preprint , 2023 . (Cited on page 19 ) [115 ] Jeonghoon Kim, Jung Hyun Lee , Sungdong Kim, Joonsuk Park, Kang Min Y oo, Se Jung Kwon, and Dongsoo Lee. Memory-efcient ne-tuning of compressed large language models via sub-4-bit integer quantization. Advances in Neural Information Processing Systems , 36 , 2024 . (Cited on page 19 ) [116 ] Y elysei Bondarenko , Riccardo Del Chiaro, and Markus Nagel. Low -rank quantization-aw are training for llms. arXiv preprint arXiv:2406 . 06385 , 2024. ( Cited on page 19 ) [11 7] Y elysei Bondarenko , Markus Nagel, and Tijmen Blankevoort. Understanding and overcoming the challenges of efcient transformer quantization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages 7947–7969 , Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp- main.627. URL https: //aclanthology.org/2021.emnlp- main.627 . (Cited on page 19 ) [118 ] Olga Kov aleva, Saurabh Kulshreshtha, Anna Rogers, and Anna Rumshisky . Bert busters: Outlier dimensions that disrupt transformers. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 , pages 3392–3405 , 2021. (Cited on page 19 ) [119 ] T im Dettmers, Mike Lewis, Y ounes Belkada, and Luke Zettlemoyer . Gpt3 . int8 (): 8-bit matrix multiplication for transformers at scale. In Advances in Neural Information Processing Systems , 2022. (Cited on page 19 ) Qualcomm AI Research is an initiative of Qualcomm T echnologies, Inc. 34 [120 ] Y elysei Bondarenko, Markus Nagel, and Tijmen Blanke voort. 
Quantizable T ransformers: Removing Outliers by Helping Attention Heads Do Nothing. Advances in Neural Information Processing Systems , 2023 . URL https://arxiv.org/abs/2306.12929v2 . (Cited on page 19 ) [121] Mingjie Sun, Xinlei Chen, J Zico Kolter , and Zhuang Liu. Massive activations in large language models. arXiv preprint arXiv:2402.1 7762 , 2024. ( Cited on page 19 ) [122] Elias Frantar , Saleh Ashkboos, T orsten Hoeer , and Dan Alistarh. Gptq: Accurate post-tr aining quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.1 7323 , 2022. ( Cited on page 19 , 20 ) [123 ] T im Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev , Elias Frantar , Saleh Ashkboos, Alexander Borzunov , T orsten Hoeer , and Dan Alistarh. Spqr: A sparse-quantized representation for near - lossless llm weight compression. arXiv preprint arXiv:2306 .030 78 , 2023 . (Cited on page 19 ) [124] Ji Lin, Jiaming T ang, Haotian T ang, Shang Y ang, Xingy u Dang, and Song Han. A wq: Activation-a war e weight quantization for llm compression and acceleration. arXiv preprint arXiv:2306. 00978 , 2023 . (Cited on page 19 ) [125 ] Changhun Lee, Jungy u Jin, T aesu Kim, Hyungjun Kim, and Eunhyeok Park. Owq: Outlier-a ware weight quan- tization for efcient ne-tuning and inference of large language models. In Proceedings of the AAAI Confer - ence on Articial Intelligence , volume 38 , pages 13355–13364, 2024. ( Cited on page 19 ) [126 ] Y ongkweon Jeon, Chungman Lee, Kyungphil Park, and Ho-young Kim. A frustratingly easy post-training quantization scheme for llms. In Proceedings of the 2023 Conference on Empirical Methods in Natural Lan- guage Processing , pages 14446–14461, 2023 . (Cited on page 19 ) [127] Jung Hyun L ee, Jeonghoon Kim, Se Jung K won, and Dongsoo Lee . Flexround: Learnable r ounding based on element-wise division for post-training quantization. In International Conference on Machine Learning , pages 18913–18939 . PMLR, 2023. 
( Cited on page 19 ) [128 ] Y an Luo, Y angcheng Gao, Zhao Zhang, Jicong Fan, Haijun Zhang, and Mingliang Xu. Long-range zero-shot generative deep network quantization. Neural Networks , 166:683–691, 2023. ( Cited on page 19 ) [129 ] Jerry Chee, Y aohui Cai, V olodymy r Kuleshov , and Christopher M De Sa. Quip: 2-bit quantization of large language models with guarantees. Advances in Neural Information Processing Systems , 36, 2024. (Cited on page 19 , 20 ) [130 ] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and Efcient Post - T raining Quantization for L arge L anguage Models, March 2024 . URL http://arxiv. org/abs/2211.10438 . arXiv:2211.10438 [cs]. ( Cited on page 19 , 20 ) [131] Xiuying Wei, Y unchen Zhang, Xiangguo Zhang, Ruihao Gong, Shanghang Zhang, Qi Zhang, Fengwei Yu, and Xianglong Liu. Outlier suppression: Pushing the limit of low-bit transformer language models. arXiv preprint arXiv:2209 .13325 , 2022. (Cited on page 19 ) [132] Jangwhan Lee, Minsoo Kim, Seungcheol Baek, Seok Joong Hwang, Wonyong Sung, and Jung wook Choi. En- hancing computation efciency in large language models through weight and activation quantization. arXiv preprint arXiv:2311. 05161 , 2023. ( Cited on page 19 ) [133 ] Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, and Bohan Zhuang. Qllm: Accurate and efcient low -bitwidth quantization for large language models. arXiv preprint arXiv:2310 . 08041 , 2023 . (Cited on page 19 ) Qualcomm AI Research is an initiative of Qualcomm T echnologies, Inc. 35 [134] Xiuying Wei, Yunchen Zhang, Y uhang Li, Xiangguo Zhang, Ruihao Gong, Jinyang Guo, and Xianglong Liu. Outlier Suppression+: Accurate quantization of large language models b y equivalent and optimal shifting and scaling, October 2023 . URL . arXiv:2304. 09145 [ cs]. 
(Cited on page 19 , 20 ) [135 ] Zhihang Y uan, Lin Niu, Jiawei Liu, Wenyu Liu, Xing gang W ang, Yuzhang Shang, Guang yu Sun, Qiang Wu, Jiaxiang Wu, and Bingzhe W u. Rptq: Reorder -based post-training quantization for large language models. arXiv preprint arXiv:2304 . 01089 , 2023. ( Cited on page 19 ) [136 ] Hanlin T ang, Yifu Sun, Decheng Wu, Kai Liu, Jianchen Zhu, and Zhanhui K ang. Easyquant: An efcient data- free quantization algorithm for llms. arXiv preprint arXiv:2403 . 02775 , 2024. ( Cited on page 19 ) [137] Zhewei Y ao, Reza Y azdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Y uxiong He. Zeroquant: Efcient and affordable post -training quantization for large-scale transform- ers. In S. K oyejo , S. Mohamed, A. Agarw al, D . Belgrave , K. Cho, and A. Oh, editors, Advances in Neural Information Processing Sy stems , volume 35 , pages 2 7168–27183 . Curran Associates , Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/ adf7fa39d65e2983d724ff7da57f00ac- Paper- Conference.pdf . (Cited on page 19 ) [138 ] Y ujun Lin, Haotian T ang, Shang Y ang, Zhekai Zhang, Guang xuan Xiao, Chuang Gan, and Song Han. Qserve: W 4a8kv4 quantization and system co-design for efcient llm serving. arXiv preprint arXiv:2405. 04532 , 2024 . (Cited on page 19 ) [139 ] W enqi Shao, Mengzhao Chen, Zhaoy ang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao , Y u Qiao, and Ping Luo. OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models, March 2024 . URL . arXiv:2308 .13137 [cs ]. (Cited on page 20 ) [140 ] Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Martin Jaggi, Dan Alistarh, T orsten Hoeer , and James Hensman. QuaRot: Outlier -Free 4-Bit Inference in Rotated LLMs, March 2024 . URL https://arxiv.org/abs/2404.00456v1 . 
(Cited on page 20 ) [141] Zechun Liu, Changsheng Zhao, Igor Fedoro v , Bilge Soran, Dhruv Choudhary , Raghuraman Krishnamoorthi, Vikas Chandr a, Yuandong Tian, and Tijmen Blanke voort. SpinQuant: LLM quantization with learned rotations, May 2024. URL . (Cited on page 20 ) [142] Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Y ingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, and Ying Wei. DuQuant: Distributing Outlier s via Dual T ransformation Makes Stronger Quantized LLMs. In Ad- vances in Neural Information Processing Systems . arXiv , November 2024. doi: 10.48550/arXiv.2406.01721. URL . arXiv:2406. 01721. ( Cited on page 20 ) [143 ] Xing Hu, Yuan Cheng, Dawei Y ang, Zhixuan Chen, Zukang Xu, Jiang yong Yu, XUCHEN, Zhihang Y uan, Zhe jiang, and Sifan Zhou. OS TQuant: Rening Large Language Model Quantization with Orthogonal and Scal- ing T ransformations for Better Distribution Fitting. In The Thirteenth International Conference on Learning Representations , 2025 . URL https://openreview.net/forum?id=rAcgDBdKnP . (Cited on page 20 ) [144] Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, JiaxinHu, Xianzhi Yu, Lu Hou, Chun Y uan, Xin Jiang, Wulong Liu, and Jun Y ao. Flatquant: Flatness matters for LLM quantization. In Forty- second International Conference on Machine Learning , 2025 . URL https://openreview.net/forum? id=uTz2Utym5n . (Cited on page 20 ) [145 ] Boris van Breugel, Y elysei Bondarenko , Paul Whatmough, and Markus Nagel. Fptquant: Function-pr eserving transforms for llm quantization. arXiv preprint arXiv:2506 . 04985 , 2025. ( Cited on page 20 ) Qualcomm AI Research is an initiative of Qualcomm T echnologies, Inc. 36 [146 ] Jung Hyun Lee , Seung jae Shin, Vinnam Kim, Jaeseong Y ou, and An Chen. Unifying block -wise ptq and distillation-based qat for progressiv e quantization to ward 2-bit instruction-tuned llms . arXiv preprint arXiv:2506 .09104 , 2025 . 
(Cited on page 20 ) [147] Loubna Ben Allal, Anton Lozhkov , Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo , Le wis T unstall, Andrés Maraoti, Hynek Kydlíček, Agustín Piqueres Lajarín, V aibhav Sriv astav , et al. Smollm2: When smol goes big–data-centric training of a small language model. arXiv preprint arXiv:2502. 0273 7 , 2025. ( Cited on page 20 ) [148 ] Jeffr ey Li, Ale x F ang, Georgios Smy rnis, Maor Ivgi, Matt Jor dan, Samir Yitzhak Gadre, Hritik Bansal, Etash Guha, Sedrick Scott Keh, Kushal Arora, et al. Datacomp-lm: In search of the next generation of training sets for language models. Adv ances in Neural Information Processing Systems , 37:14200–14282, 2024. (Cited on page 20 ) [149 ] Guilherme P enedo, Hynek Kydlíček, Anton Lozhkov , Margaret Mitchell, Colin A Raffel, Leandro V on Werra, Thomas Wolf , et al. The neweb datasets: Decanting the web for the nest text data at scale. Advances in Neural Information Processing Systems , 37:30811–30849 , 2024. ( Cited on page 20 ) [150 ] Y onatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Y ejin Choi. PIQA: Reasoning about Physical Commonsense in Natural Language. Proceedings of the AAAI Conference on Articial Intelligence , 34(05 ): 7 432–7 439 , April 2020 . IS SN 237 4-3468 . doi: 10.1609/aaai.v34i05.6239. URL https://ojs.aaai. org/index.php/AAAI/article/view/6239 . Number: 05 . ( Cited on page 20 ) [151] K eisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Y ejin Choi. WinoGrande: an adversarial winograd schema challenge at scale. Commun. A CM , 64( 9 ):99–106 , August 2021. ISSN 0001-0782. doi: 10.1145/3474381. URL https://dl.acm.org/doi/10.1145/3474381 . (Cited on page 20 ) [152] Row an Zellers , Ari Holtzman, Y onatan Bisk, Ali Farhadi, and Y ejin Choi. HellaSw ag: Can a Machine Really Finish Y our Sentence? , May 2019 . URL . arXiv:1905 . 07830 [cs]. 
( Cited on page 20 ) [153 ] P eter Clark, Isaac Cowhey , Oren Etzioni, T ushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind T afjord. Think you have Solved Question Answering? T ry ARC , the AI2 Reasoning Challenge, March 2018 . URL . arXiv:1803. 05457 [cs ]. (Cited on page 20 ) [154] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc-Quan Pham, Raffaella Bernardi, Sandro Pezzelle , Marco Baroni, Gemma Boleda, and Raquel Fernández. The lambada dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Com- putational Linguistics (V olume 1: Long Papers) , pages 1525–1534, 2016 . ( Cited on page 20 ) [155 ] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language under standing. arXiv preprint arXiv:2009 . 03300 , 2020. (Cited on page 20 ) [156 ] Lintang Sutawika, Hailey Schoelkopf , Leo Gao, Baber Abbasi, Stella Biderman, Jonathan T ow , Charles Lover - ing, Jason Phang, Anish Thite , Thomas Wang, et al. Eleutherai/lm-ev aluation-harness: v0 . 4. 9 . Zenodo , 2025 . ( Cited on page 20 ) [157] Shaohan Huang and Furu Wei. Mixture of lora experts. In ICLR 2024 , April 2024. ( Cited on page 24 ) [158 ] W enfeng Feng, Chuzhan Hao, Yue wei Zhang, Y u Han, and Hao Wang. Mixture-of-lor as: An efcient multi- task tuning method for large language models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics , Language Resources and Evaluation (LREC-COLING 2024) , pages 113 71– 11380 , 2024. (Cited on page 24 ) Qualcomm AI Research is an initiative of Qualcomm T echnologies, Inc. 37 [159 ] Shibo Hao, Sainbay ar Sukhbaatar , DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Y uandong Tian. T raining large language models to reason in a continuous latent space. arXiv preprint arXiv:2412. 06769 , 2024 . 
(Cited on page 24 ) [160 ] Zhenyi Shen, Hanqi Y an, Linhai Zhang, Zhanghao Hu, Y ali Du, and Y ulan He . Codi: Compressing chain-of- thought into continuous space via self-distillation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages 677–693 , 2025 . (Cited on page 24 ) [161] Haoyi Wu, Zhihao T eng, and K ewei T u. P arallel continuous chain-of-thought with jacobi iteration. In Proceed- ings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages 914–926 , 2025. ( Cited on page 24 ) [162] Anna Kuzina, Maciej Pióro, and Babak Ehteshami Bejnordi. K ava: Latent reasoning via compressed KV- cache distillation. In The F ourteenth International Conference on Learning Representations , 2026 . URL https://openreview.net/forum?id=ePrhcLbtGv . (Cited on page 24 ) [163 ] Xiaoqiang Wang, Suy uchen Wang, Yun Zhu, and Bang Liu. System-1. 5 reasoning: T rav ersal in language and latent spaces with dy namic shortcuts. arXiv preprint , 2025 . (Cited on page 24 ) [164] Fabio Valerio Massoli, Andr ey Kuzmin, and Arash Behboodi. Reasoning as compression: Unif ying budget forcing via the conditional information bottleneck. arXiv [ cs.LG] , 9 March 2026 . (Cited on page 24 ) [165 ] Gabriele Cesa, Thomas Hehn, Aleix T orres-Camps, Àlex Batlle Casellas, Jordi Ros-Giralt, Arash Behboodi, and T ribhuvanesh Orek ondy . LaneroPE: Positional encoding for collaborative parallel reasoning and gener- ation. In W orkshop on Multi-Agent Learning and Its Opportunities in the Era of Generative AI , 2026 . URL https://openreview.net/forum?id=6WAuvwZjmw . (Cited on page 24 ) [166 ] Albert T seng, Jerry Chee, Qingyao Sun, V olodymy r Kuleshov , and Christopher De Sa. Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks. Proceedings of machine learning research , 235:48630 , 2024. (Cited on page 25 ) Qualcomm AI Research is an initiative of Qualcomm T echnologies, Inc. 
A Benchmark Description

To comprehensively assess the reasoning capabilities of our fine-tuned models, we leverage a diverse suite of benchmarks spanning the mathematics, science, and coding domains.

· AIME 24/25 [36] consists of 30 highly challenging mathematics competition problems from the 2024 and 2025 American Invitational Mathematics Examination. The questions cover a range of topics, including algebra, geometry, and number theory, and are aimed at high-school students. The problems require multi-step reasoning, and the answers are integers between 0 and 999.
· MATH500 [38] is a benchmark consisting of 500 mathematical questions spanning different topics, including algebra, geometry, number theory, precalculus, and probability. The problems require multi-step solutions, and answers may include LaTeX-formatted expressions.
· GPQA Diamond [39] consists of 198 PhD-level science questions from physics, chemistry, and biology. All questions are presented in a multiple-choice format.
· AMC23 [37] contains 40 problems from the 2023 American Mathematics Competition with integer answers.
· LiveCodeBench [40] is a continuously updated (hence "live") coding benchmark. In every release, a certain number of coding problems is sourced from competitive programming platforms (e.g., CodeForces, LeetCode), and each problem is used to build four coding "scenarios": Code Generation, Code Repair, Test Output Prediction, and Code Execution. In this work we use the v2 release, comprising 511 problems, and confine ourselves to the Code Generation scenario.
· HumanEval and HumanEval+ [41; 43] are the original and improved versions of the HumanEval benchmark, comprising 164 problems in which a model, given a Python function's signature and docstring, must generate its body. The resulting function is then verified with several unit tests.
HumanEval+ [43] improves the original benchmark by increasing the number of unit tests used for verification by 80 times.
· MBPP and MBPP+ [42; 43] are the original and improved versions of the Most Basic Python Programs benchmark, consisting of 1000 basic Python programming tasks sourced from human coders. Each task involves writing a simple Python function based on natural-language requirements and three unit tests it must pass. The MBPP+ enhancement selects a subset of 378 tasks and increases the number of unit tests by 35 times.

B LoRA ablation study

This appendix provides the complete, detailed results of the parameter-efficient fine-tuning (PEFT) ablation study introduced in Section 3.4.1, in Tables 12 and 13. The study explores the impact of varying learning rates, batch sizes, and LoRA adapter ranks on the reasoning capabilities of both the 3B and 7B model backbones. The models were trained on a 50,000-entry subset of the OpenThoughts3 (OT3) dataset for one epoch and evaluated across core mathematical and scientific benchmarks (AIME24, AIME25, MATH500, GPQA, and AMC23).

Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.

Table 12: Ablation results for Qwen2.5-3B-Instruct. LR denotes learning rate and BS stands for batch size. In each learning rate subgroup, the best performance is marked in bold.

LR      BS   Rank  AIME24  AIME25  MATH500  GPQA  AMC23  Avg
0.0001  32   32    0.05    0.01    0.55     0.25  0.24   0.220
0.0001  32   64    0.03    0.02    0.58     0.29  0.28   0.240
0.0001  32   128   0.08    0.03    0.56     0.28  0.30   0.250
0.0001  32   256   0.03    0.04    0.57     0.28  0.28   0.240
0.0001  64   32    0.05    0.01    0.54     0.27  0.28   0.230
0.0001  64   64    0.03    0.03    0.57     0.29  0.29   0.242
0.0001  64   128   0.02    0.03    0.58     0.27  0.30   0.240
0.0001  64   256   0.04    0.06    0.59     0.30  0.34   0.266
0.0001  128  32    0.04    0.01    0.61     0.26  0.29   0.242
0.0001  128  64    0.05    0.00    0.59     0.29  0.27   0.240
0.0001  128  128   0.03    0.01    0.56     0.27  0.25   0.224
0.0001  128  256   0.03    0.05    0.58     0.27  0.25   0.236
0.0002  32   32    0.08    0.01    0.58     0.26  0.33   0.252
0.0002  32   64    0.05    0.03    0.59     0.30  0.32   0.258
0.0002  32   128   0.06    0.04    0.57     0.25  0.29   0.242
0.0002  32   256   0.07    0.04    0.60     0.28  0.34   0.266
0.0002  64   32    0.03    0.02    0.56     0.27  0.30   0.236
0.0002  64   64    0.07    0.03    0.58     0.26  0.28   0.244
0.0002  64   128   0.07    0.06    0.55     0.29  0.25   0.244
0.0002  64   256   0.06    0.04    0.58     0.27  0.30   0.250
0.0002  128  32    0.00    0.03    0.59     0.27  0.28   0.234
0.0002  128  64    0.00    0.02    0.55     0.28  0.28   0.226
0.0002  128  128   0.05    0.04    0.55     0.27  0.30   0.242
0.0002  128  256   0.07    0.04    0.59     0.29  0.32   0.262
0.0005  32   32    0.05    0.02    0.55     0.31  0.29   0.244
0.0005  32   64    0.05    0.03    0.53     0.28  0.27   0.232
0.0005  32   128   0.06    0.04    0.54     0.26  0.27   0.234
0.0005  32   256   0.04    0.05    0.53     0.28  0.24   0.228
0.0005  64   32    0.06    0.04    0.56     0.26  0.29   0.242
0.0005  64   64    0.05    0.05    0.57     0.28  0.30   0.250
0.0005  64   128   0.06    0.04    0.57     0.29  0.27   0.246
0.0005  64   256   0.03    0.05    0.54     0.28  0.24   0.228
0.0005  128  32    0.06    0.03    0.57     0.28  0.26   0.240
0.0005  128  64    0.03    0.00    0.57     0.28  0.27   0.230
0.0005  128  128   0.06    0.05    0.56     0.27  0.35   0.258
0.0005  128  256   0.06    0.06    0.58     0.29  0.25   0.248

Table 13: Ablation results for Qwen2.5-7B-Instruct. LR denotes learning rate and BS stands for batch size. In each learning rate subgroup, the best performance is marked in bold.

LR      BS   Rank  AIME24  AIME25  MATH500  GPQA  AMC23  Avg
0.0001  32   32    0.18    0.12    0.79     0.36  0.50   0.390
0.0001  32   64    0.13    0.16    0.78     0.36  0.51   0.388
0.0001  32   128   0.17    0.11    0.78     0.37  0.55   0.396
0.0001  32   256   0.14    0.14    0.79     0.35  0.54   0.392
0.0001  64   32    0.14    0.12    0.77     0.35  0.54   0.384
0.0001  64   64    0.16    0.17    0.76     0.38  0.56   0.406
0.0001  64   128   0.15    0.15    0.78     0.39  0.55   0.404
0.0001  64   256   0.15    0.15    0.76     0.38  0.53   0.394
0.0001  128  32    0.12    0.12    0.77     0.37  0.55   0.386
0.0001  128  64    0.18    0.13    0.78     0.35  0.51   0.390
0.0001  128  128   0.19    0.11    0.78     0.36  0.52   0.392
0.0001  128  256   0.18    0.11    0.76     0.37  0.53   0.390
0.0002  32   32    0.12    0.14    0.77     0.36  0.54   0.386
0.0002  32   64    0.14    0.16    0.79     0.34  0.51   0.388
0.0002  32   128   0.18    0.18    0.78     0.35  0.54   0.406
0.0002  32   256   0.18    0.17    0.77     0.35  0.56   0.406
0.0002  64   32    0.21    0.11    0.79     0.36  0.52   0.398
0.0002  64   64    0.16    0.16    0.77     0.37  0.58   0.408
0.0002  64   128   0.20    0.14    0.80     0.38  0.52   0.408
0.0002  64   256   0.14    0.17    0.76     0.33  0.52   0.384
0.0002  128  32    0.15    0.11    0.77     0.35  0.57   0.390
0.0002  128  64    0.18    0.12    0.79     0.33  0.55   0.394
0.0002  128  128   0.18    0.15    0.77     0.35  0.57   0.404
0.0002  128  256   0.20    0.17    0.79     0.38  0.54   0.416
0.0005  32   32    0.17    0.15    0.76     0.38  0.53   0.398
0.0005  32   64    0.11    0.09    0.67     0.40  0.50   0.354
0.0005  32   128   0.16    0.11    0.76     0.37  0.47   0.374
0.0005  32   256   0.00    0.00    0.00     0.24  0.00   0.048
0.0005  64   32    0.17    0.14    0.76     0.38  0.54   0.398
0.0005  64   64    0.15    0.14    0.77     0.34  0.50   0.380
0.0005  64   128   0.22    0.12    0.79     0.37  0.59   0.418
0.0005  64   256   0.00    0.00    0.01     0.27  0.02   0.060
0.0005  128  32    0.19    0.13    0.79     0.34  0.55   0.400
0.0005  128  64    0.17    0.12    0.77     0.38  0.55   0.398
0.0005  128  128   0.23    0.17    0.78     0.37  0.53   0.416
0.0005  128  256   0.21    0.17    0.76     0.36  0.49   0.398

C Budget Forcing Details

Table 14 reports the hyperparameters we used in the budget forcing RL training.

Table 14: GRPO Training Hyperparameters. All experiments share these settings unless otherwise noted.
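The rank swept in Tables 12 and 13 sets the size of the trainable low-rank update that LoRA adds to each frozen weight matrix. The following is a minimal sketch of an adapted linear layer in NumPy; the function name, shapes, and the alpha/r scaling convention are illustrative assumptions, not our actual training code.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """Forward pass of a linear layer with a LoRA adapter (illustrative).

    W: frozen base weight, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    Only A and B are updated during fine-tuning, so the number of trainable
    parameters grows linearly with the rank r instead of d_out * d_in.
    """
    r = A.shape[0]
    # Base output plus the low-rank update, scaled by alpha / r.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 8
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))  # B = 0 at init, so the adapter starts as a no-op
x = rng.standard_normal((4, d_in))

# With B initialized to zero, the adapted layer matches the base layer exactly.
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)
```

The rank trades adapter capacity against on-device memory, which is why the ablation sweeps r over {32, 64, 128, 256}.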
Hyperparameter                   Value
Optimizer                        AdamW
Learning Rate                    2 × 10^-5
LR Scheduler                     Cosine
Warmup Ratio                     0.05
Batch Size (Global)              256
Generations per Prompt (G)       8
Temperature                      0.8
Max Completion Length            6144
Max Gradient Norm                1.0
KL Penalty Coefficient (β_KL)    {10^-3, 10^-4}
Training Steps                   200

We report in Figures 9 and 10 additional qualitative comparisons between the reasoning traces of the unconstrained baseline and our budget-forced model.

Input Prompt
Problem: When the positive integers with exactly three positive divisors are listed in ascending order, what is the fifth number listed?

Baseline Response (Standard CoT) - 2465 tokens
Okay, so I need to find the fifth positive integer [...] Let's start by recalling [...] Numbers with exactly three divisors are squares of primes. Let me confirm this [...] 4 has 1, 2, 4 [...] 9. So the fifth number should be 121. Wait a second, hold on. Let me make sure I didn't skip any primes. Let me list primes in order: 1st prime: 2, 2nd prime: 3 [...] But hold on, let me check if there's any other number [...] Suppose someone considers 16 [...] 16 is 2^4, five divisors. What about 12? Six divisors. Maybe I made a mistake here. Let me verify once again. Let me list the numbers [...] Alternatively, maybe a cube of a prime? 8 is 2^3, four divisors [...] Maybe a product of two primes? 6 is 2 × 3, four divisors. Therefore, I think I can be confident [...] Wait, but hold on [...] The fifth prime is 11, so 11 squared is 121.

Budget Forced CoT - 581 tokens
First, recall that if a number has exactly three divisors, it must be a square of a prime number. To find the fifth such number, we list the squares of the first five prime numbers. The primes are 2, 3, 5, 7, 11. Squaring them gives 4, 9, 25, 49, and 11^2 = 121. Therefore, the final answer is 121.

Figure 9: Qualitative comparison on number theory reasoning.
Text highlighted in red denotes redundant verification and verbal parsing, while bold text identifies essential reasoning steps. We use "[...]" as a placeholder for brevity. Top: Prompt. Middle: The Baseline trace correctly identifies the property (p^2) early on but falls into extensive, redundant verification loops checking composite numbers and re-listing primes (highlighted in red). Bottom: The Budget Forced trace directly applies the prime-square property and computes the result without unnecessary hesitation or syntactic noise.

Input Prompt
Problem: What is the 2003rd term of the sequence of odd numbers 1, 3, 5, 7, ...?

Baseline Response (Standard CoT) - 3434 tokens
Okay, so I need to find the 2003rd term [...] Let me think. First, I remember that the sequence is [...] The nth term is 2n − 1. Wait, let me check that. If n = 1, then 1 [...] n = 2 gives 3 [...] So that formula checks out. Alternatively, maybe I can think of the sequence as starting at 1 and adding 2 [...] arithmetic sequence formula 1 + (n − 1) · 2 [...] expands to 2n − 1. Maybe someone might think the nth term is n^2 [...] or 2n [...] let me check term 5 [...]. Correct. Therefore, no matter how I approach it [...] Wait, but just to be thorough, let me compute 2 × 2003 first [...] Substituting n = 2003 gives 2(2003) − 1 = 4005.

Budget Forced CoT - 829 tokens
First, observe that the sequence of odd numbers is given by the formula a_n = 2n − 1. To find the 2003rd term, we substitute n = 2003 into the formula: a_2003 = 2(2003) − 1 = 4006 − 1 = 4005. Therefore, the final answer is 4005.

Figure 10: Qualitative comparison on pattern recognition.
Middle: The Baseline trace correctly identifies the formula 2n − 1 initially but spends nearly 1000 tokens validating it against alternative arithmetic formulas (a + (n − 1)d) and hypothetical user errors (confusing term number with value, testing n^2, etc.). Bottom: The Budget Forced trace directly retrieves the formula and computes the specific term requested.
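The shorter traces in Figures 9 and 10 come from reinforcing correct completions that stay within a token budget. As an illustration of the group-relative update that the G = 8 generations per prompt in Table 14 enable, the sketch below (standard library only) normalizes each completion's reward against its group; the length-penalized reward function is a hypothetical stand-in, not our exact budget-forcing reward.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled completion's reward
    by the mean and standard deviation of its group (one group = G
    completions sampled for the same prompt)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]

def budget_reward(correct, n_tokens, budget=581, penalty=1e-3):
    """Hypothetical budget-forcing reward: +1 for a correct answer, minus a
    linear penalty for tokens beyond the target budget (illustrative only)."""
    return float(correct) - penalty * max(0, n_tokens - budget)

# G = 8 completions for one prompt: (is_correct, response length in tokens).
group = [(True, 2465), (True, 581), (False, 900), (True, 700),
         (False, 3000), (True, 650), (True, 1200), (False, 480)]
rewards = [budget_reward(c, n) for c, n in group]
advs = grpo_advantages(rewards)

# Short correct completions get the largest advantage; long or incorrect
# ones are pushed down relative to their group.
best = max(range(len(group)), key=lambda i: advs[i])
assert group[best] == (True, 581)
```

In GRPO the policy gradient weights each completion's log-probability by its advantage, so under a reward of this shape the model is steered toward concise traces like the Budget Forced CoT above while incorrect or verbose completions are suppressed.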
