
NeuroLoRA: Context-Aware Neuromodulation for Parameter-Efficient Multi-Task Adaptation

Yuxin Yang¹  Haoran Zhang¹  Mingxuan Li²  Jiachen Xu¹  Ruoxi Shen²  Zhenyu Wang¹  Tianhao Liu²  Siqi Chen¹  Weilin Huang²
¹Shanghai University  ²Fudan University

Abstract

Parameter-Efficient Fine-Tuning (PEFT) techniques, particularly Low-Rank Adaptation (LoRA), have become essential for adapting Large Language Models (LLMs) to downstream tasks. While the recent FlyLoRA framework successfully leverages bio-inspired sparse random projections to mitigate parameter interference, it relies on a static, magnitude-based routing mechanism that is agnostic to input context. In this paper, we propose NeuroLoRA, a novel Mixture-of-Experts (MoE) based LoRA framework inspired by biological neuromodulation—the dynamic regulation of neuronal excitability based on context. NeuroLoRA retains the computational efficiency of frozen random projections while introducing a lightweight, learnable neuromodulation gate that contextually rescales the projection space prior to expert selection. We further propose a Contrastive Orthogonality Loss to explicitly enforce separation between expert subspaces, enhancing both task decoupling and continual learning capacity. Extensive experiments on MMLU, GSM8K, and ScienceQA demonstrate that NeuroLoRA consistently outperforms FlyLoRA and other strong baselines across single-task adaptation, multi-task model merging, and sequential continual learning scenarios, while maintaining comparable parameter efficiency.

1 Introduction

The paradigm of pre-training followed by task-specific fine-tuning has become the dominant approach for deploying Large Language Models (LLMs) (Brown et al., 2020; Achiam et al., 2023; Grattafiori et al., 2024).
However, Full Fine-Tuning (FFT) of models with billions of parameters incurs prohibitive computational and storage costs, motivating the development of Parameter-Efficient Fine-Tuning (PEFT) methods (Houlsby et al., 2019; Lester et al., 2021; Li and Liang, 2021). Among these, Low-Rank Adaptation (LoRA) (Hu et al., 2022) has emerged as the de facto standard, freezing the pre-trained weights and injecting trainable low-rank decomposition matrices into the model's linear layers.

Despite its efficiency, standard LoRA assumes a single global low-rank update shared across all inputs. This design leads to intra-task interference—where diverse input patterns within a single task compete for the same limited rank capacity—and inter-task interference when merging adapters trained on different tasks (Yu et al., 2020; Yadav et al., 2023; Zou et al., 2025b). To address these limitations, Mixture-of-Experts (MoE) architectures have been integrated into the LoRA framework (Dou et al., 2024; Wu et al., 2024; Li et al., 2024), enabling conditional activation of task-specific parameter subsets.

A recent advancement, FlyLoRA (Zou et al., 2025b), draws inspiration from the fruit fly olfactory circuit (Dasgupta et al., 2017) to replace trainable routers with a frozen sparse random projection matrix. By selecting experts based on the magnitude of the projected input (a "Winner-Take-All" mechanism analogous to Kenyon Cell activation), FlyLoRA achieves strong task decoupling and enables training-free model merging.

However, FlyLoRA's reliance on a fixed random projection represents a purely structural imitation of biological neural circuits, overlooking a critical functional component: neuromodulation (Marder, 2012).
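The Winner-Take-All routing just described can be sketched in a few lines. The following is a minimal illustration with NumPy; shapes, names, and the random seed are our own choices, not the authors' released implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_projection(r, d_in, rho=0.25):
    """Frozen FlyLoRA-style projection: entries in {0, +1, -1} drawn with
    probabilities {1 - rho, rho/2, rho/2}."""
    return rng.choice([0.0, 1.0, -1.0], size=(r, d_in),
                      p=[1 - rho, rho / 2, rho / 2])

def winner_take_all(A, x, k):
    """Route by magnitude: the k largest |(Ax)_i| pick the active experts."""
    h = A @ x
    active = np.argsort(-np.abs(h))[:k]
    return active, h

A = sparse_projection(r=32, d_in=128)   # frozen, never trained
x = rng.standard_normal(128)            # a single token representation
active, h = winner_take_all(A, x, k=8)
# `active` depends only on this token's vector and the frozen A,
# independent of any surrounding context.
```

Because `A` is fixed, the same token vector always routes to the same experts, which is exactly the context-agnostic behavior the paper targets.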
In biological brains, neural responses are dynamically modulated by neurotransmitters (e.g., dopamine, acetylcholine) that alter neuronal sensitivity based on internal states and external context (Doya, 2002). FlyLoRA's routing is reactive to the current token's magnitude profile but agnostic to the broader semantic context. This static routing can be suboptimal for tasks requiring contextual reasoning, where the appropriate expert specialization depends on the meaning of preceding tokens rather than the current token's magnitude alone.

To bridge this gap, we propose NeuroLoRA (Neuromodulated Low-Rank Adaptation). NeuroLoRA preserves the computational efficiency and merging compatibility of FlyLoRA's frozen sparse projection while augmenting it with a Context-Aware Neuromodulation Gate. This lightweight gating component dynamically rescales the axes of the random projection space based on input context, effectively modulating the routing decision boundary without altering the projection matrix itself. Furthermore, we introduce a Contrastive Orthogonality Loss (L_orth) that explicitly enforces separation between expert subspaces, providing stronger guarantees than the probabilistic near-orthogonality of random projections alone. This design additionally confers natural advantages for continual learning, as the enforced subspace separation mitigates catastrophic forgetting (Kirkpatrick et al., 2017) when the model is sequentially adapted to new tasks.

Our contributions are summarized as follows:

• We identify the limitation of static magnitude-based routing in FlyLoRA and propose a biologically motivated solution grounded in the principle of neuromodulation.

• We introduce NeuroLoRA, which incorporates a lightweight context-aware gating mechanism to dynamically modulate the frozen projection space, enabling context-sensitive expert activation while preserving training-free merging compatibility.
• We propose a Contrastive Orthogonality Loss that explicitly penalizes expert subspace overlap, enhancing task decoupling for both multi-task merging and continual learning.

• Extensive experiments on three benchmarks (MMLU, ScienceQA, GSM8K) demonstrate that NeuroLoRA outperforms FlyLoRA and other strong baselines in single-task, multi-task merging, and continual learning settings.

2 Related Work

2.1 Parameter-Efficient Fine-Tuning

PEFT aims to adapt pre-trained LLMs to downstream tasks while updating only a small fraction of parameters. Early approaches include Adapter layers (Houlsby et al., 2019), which insert bottleneck modules between Transformer (Vaswani et al., 2017) layers, and Prefix-Tuning (Li and Liang, 2021), which prepends learnable continuous tokens to the input. Prompt Tuning (Lester et al., 2021) further simplifies this by learning soft prompts in the input embedding space.

Hu et al. (2022) introduced LoRA, which injects low-rank matrices B ∈ R^{d_out×r} and A ∈ R^{r×d_in} into frozen linear layers, where r ≪ min(d_in, d_out). Subsequent work has focused on improving rank allocation and training dynamics. AdaLoRA (Zhang et al., 2023) dynamically prunes singular values via importance scoring to allocate rank budget across layers. QLoRA (Dettmers et al., 2024) combines 4-bit quantization of the base model with LoRA adapters, substantially reducing memory footprint. DoRA (Liu et al., 2024) decomposes the weight update into magnitude and directional components to improve learning stability. While effective for single-task adaptation, these methods are susceptible to parameter interference in multi-task and continual learning settings (Zou et al., 2025b; Wang et al., 2023).

2.2 Mixture-of-Experts for LoRA

Mixture-of-Experts (MoE) architectures conditionally activate subsets of parameters based on input characteristics (Shazeer et al., 2017; Fedus et al., 2022; Lepikhin et al., 2021), enabling increased model capacity without proportional computational cost. Integrating MoE with LoRA has emerged as a principled approach to multi-task adaptation. LoRAMoE (Dou et al., 2024) employs a localized balancing constraint to prevent expert collapse during instruction tuning. MoLE (Wu et al., 2024) and MixLoRA (Li et al., 2024) utilize top-k gating networks to route tokens to different LoRA expert modules. Zadouri et al. (2023) investigated the parameter efficiency limits of MoE-LoRA configurations. AdaMix (Wang et al., 2022) randomly mixes adapter experts during training for improved generalization.

However, these methods rely on explicit, trainable router networks (typically linear layers followed by softmax), which introduce additional learnable parameters and—more critically—make training-free model merging infeasible since the routing logic is learned and task-specific. NeuroLoRA differs fundamentally by retaining a frozen projection for routing while introducing dynamic modulation.

2.3 Bio-Inspired Representations and Model Merging

The fruit fly olfactory circuit has inspired efficient locality-sensitive hashing algorithms (Dasgupta et al., 2017). The core principle—projecting inputs into a high-dimensional sparse space to separate overlapping patterns—was applied to LoRA by FlyLoRA (Zou et al., 2025b), which treats the down-projection matrix A as a frozen sparse random projection and uses magnitude-based Winner-Take-All activation for expert selection. This design inherently supports Model Merging techniques such as Task Arithmetic (Ilharco et al., 2023) and TIES-Merging (Yadav et al., 2023), due to the approximate orthogonality of high-dimensional random vectors.
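The approximate-orthogonality property underpinning training-free merging is easy to check numerically. A minimal sketch (the dimension is illustrative, chosen to match a typical LLM hidden size):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096  # e.g. a typical LLM hidden size

# The cosine similarity of two independent random directions concentrates
# around zero at rate O(1/sqrt(d)) -- here roughly 1/64.
u = rng.standard_normal(d)
v = rng.standard_normal(d)
cos = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
print(abs(cos))  # small: on the order of 0.016 for d = 4096
```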
Beyond task decoupling and merging, recent work has revealed that the structural properties of the fly olfactory circuit—in particular, its sparse, high-dimensional, and approximately orthogonal random projections—naturally mitigate the stability-plasticity dilemma central to continual learning (Zou et al., 2025a). Fly-CL (Zou et al., 2026) further extended this insight into a practical continual representation learning framework, demonstrating that fly-inspired decorrelation mechanisms can significantly reduce inter-task interference while maintaining training efficiency.

Our work identifies a critical gap in FlyLoRA: while it imitates the structure (fixed connectivity) of the fly olfactory circuit, it omits the functional dynamics of biological neural systems. By introducing a neuromodulation gate inspired by the role of modulatory neurotransmitters in biological brains (Marder, 2012; Doya, 2002), NeuroLoRA combines the structural stability of random projections with contextual adaptability.

2.4 Continual Learning with PEFT

Continual learning (CL) seeks to sequentially learn new tasks without catastrophically forgetting previously acquired knowledge (Kirkpatrick et al., 2017; Li and Hoiem, 2017; McCloskey and Cohen, 1989). Classical CL strategies include regularization-based methods such as Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) and Learning without Forgetting (LwF) (Li and Hoiem, 2017), replay-based methods (Rebuffi et al., 2017; Chaudhry et al., 2019), and architecture-based methods that allocate dedicated parameters per task (Rusu et al., 2016).

The intersection of PEFT and continual learning has attracted increasing attention. Razdaibiedina et al. (2023) proposed Progressive Prompts, which sequentially append new soft prompts while freezing previous ones. O-LoRA (Wang et al., 2023) constrains each new task's LoRA subspace to be orthogonal to previously learned subspaces, directly addressing inter-task interference. Biderman et al. (2024) empirically demonstrated that LoRA provides a degree of implicit regularization against forgetting due to its low-rank constraint. InfLoRA (Liang and Li, 2024) further forms LoRA parameters within the subspace spanned by prior task representations.

From a bio-inspired perspective, Zou et al. (2025a) provided theoretical and empirical evidence that the sparse, expansive coding of the fly olfactory circuit inherently alleviates catastrophic forgetting by projecting task representations into nearly orthogonal subspaces. Zou et al. (2026) built upon this insight to develop Fly-CL, a fly-inspired continual learning framework that achieves efficient decorrelation across sequential tasks with reduced training overhead. NeuroLoRA's design extends this line of work: the frozen projection matrix A provides the same structural stability, the context-aware modulation gate further adapts expert selection to task-specific patterns, and the Contrastive Orthogonality Loss explicitly enforces non-overlapping expert subspaces, reducing destructive interference between sequentially learned tasks.

3 Methodology

3.1 Preliminaries: Revisiting FlyLoRA

Standard LoRA (Hu et al., 2022) approximates the weight update ∆W ∈ R^{d_out×d_in} for a frozen pre-trained weight matrix W_0 with two low-rank matrices: B ∈ R^{d_out×r} and A ∈ R^{r×d_in}, where rank r ≪ min(d_in, d_out). The modified forward pass is:

    y = W_0 x + (α/r) B A x    (1)

where α is a scaling hyperparameter.

FlyLoRA (Zou et al., 2025b) reinterprets this decomposition as a rank-wise Mixture of Experts, drawing on the sparse, expansive coding principle of the fruit fly olfactory system (Dasgupta et al., 2017). It introduces two key modifications:

1. Frozen Sparse Projection.
The down-projection matrix A is initialized as a sparse random matrix with sparsity ratio ρ and remains frozen throughout training. Each element A_ij is drawn from {0, +1, −1} with probabilities {1 − ρ, ρ/2, ρ/2}. This mimics the fixed, random connections from Projection Neurons (PNs) to Kenyon Cells (KCs) in the fly olfactory circuit.

2. Implicit Magnitude-Based Routing. For an input x ∈ R^{d_in}, FlyLoRA computes the projection h = Ax ∈ R^r. The top-k indices in the absolute values |h| determine the set of activated experts I_active, mimicking the Winner-Take-All mechanism observed in KCs. The output is then computed as:

    ∆y = Σ_{i∈I_active} B_{:,i} · h_i    (2)

The critical limitation of this design lies in its context-agnostic routing. The expert selection for a token x depends solely on its intrinsic vector representation and the fixed projection A. For instance, the token "bank" should ideally activate different experts in the contexts of "river bank" versus "investment bank," but a static, magnitude-based router cannot distinguish between these cases, since the token embeddings at the LoRA injection point carry limited contextual differentiation.

3.2 NeuroLoRA: Dynamic Modulation of Static Projections

We draw inspiration from neuromodulation in biological neural systems (Marder, 2012). Neuromodulators such as dopamine and acetylcholine do not directly transmit primary sensory information; instead, they alter the excitability and gain of target neurons, effectively changing how those neurons respond to the same input depending on the organism's internal state or external context (Doya, 2002). NeuroLoRA simulates this mechanism by introducing a lightweight, context-aware gate that dynamically modulates the static projection space of matrix A.

Context-Aware Neuromodulation Gate. For each input token x ∈ R^{d_in}, we generate a modulation vector m_x ∈ R^r via a bottleneck gating network E_ϕ:

    m_x = σ(W_2 · GELU(W_1 x)) ⊙ γ + β    (3)

where W_1 ∈ R^{d_h×d_in} and W_2 ∈ R^{r×d_h} are learnable projections with bottleneck dimension d_h ≪ d_in, σ(·) denotes the sigmoid function, and ⊙ is the Hadamard product. The learnable parameters γ ∈ R^r and β ∈ R^r are initialized to 1 and 0 respectively, so that m_x is approximately uniform at initialization (the sigmoid of a small pre-activation is close to 0.5 in every dimension). A uniform modulation vector leaves the top-k ranking unchanged, so NeuroLoRA's expert selection reduces to FlyLoRA's at the start of training, providing a stable initialization.

Modulated Expert Selection. The expert selection is performed on the dynamically re-weighted projection. Instead of computing h = Ax directly, we apply the modulation:

    h′ = (Ax) ⊙ m_x    (4)

This operation can be understood as dynamically rescaling the axes of the r-dimensional random projection space. Dimensions deemed relevant by the context (via m_x) are amplified, increasing their probability of being selected by the top-k operation, while contextually irrelevant dimensions are suppressed.

The active expert set is then determined by:

    I_active = TopK(|h′|, k)    (5)

and the final low-rank update is:

    ∆y = Σ_{i∈I_active} B_{:,i} · h′_i    (6)

Crucially, the projection matrix A remains frozen and sparse throughout training. Only the gating parameters {W_1, W_2, γ, β} and the up-projection matrix B are updated, preserving both parameter efficiency and merging compatibility.

3.3 Contrastive Orthogonality Loss

FlyLoRA relies on the Johnson–Lindenstrauss lemma (Johnson and Lindenstrauss, 1984), which guarantees that random projections approximately preserve pairwise distances. In high dimensions, random vectors are nearly orthogonal with high probability. However, this is a probabilistic guarantee that weakens for practical, finite ranks (e.g., r = 32), where non-negligible correlations between expert columns of B can arise during training.
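Putting the pieces of §3.2 together, the modulated forward pass can be sketched end-to-end. This is a hedged sketch with small illustrative dimensions and our own weight initialization, not the released implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_h, r, k = 128, 16, 32, 8   # illustrative sizes (we take d_out = d_in = d)

A = rng.choice([0.0, 1.0, -1.0], size=(r, d), p=[0.75, 0.125, 0.125])  # frozen
B = rng.standard_normal((d, r)) * 0.01     # trainable up-projection
W1 = rng.standard_normal((d_h, d)) * 0.01  # gate bottleneck (trainable)
W2 = rng.standard_normal((r, d_h)) * 0.01
gamma, beta = np.ones(r), np.zeros(r)      # gate scale/shift initialization

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def neurolora_delta(x):
    m = sigmoid(W2 @ gelu(W1 @ x)) * gamma + beta  # Eq. (3): modulation vector
    h = (A @ x) * m                                # Eq. (4): rescaled projection
    active = np.argsort(-np.abs(h))[:k]            # Eq. (5): top-k expert set
    return B[:, active] @ h[active], active        # Eq. (6): low-rank update

x = rng.standard_normal(d)
dy, active = neurolora_delta(x)
```

Only B and the gate parameters would receive gradients during training; A stays frozen as in FlyLoRA.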
To strengthen subspace separation from probabilistic to explicitly enforced, we introduce a Contrastive Orthogonality Loss L_orth. For a given input x with active expert set I_active and inactive set I_inactive = {1, ..., r} \ I_active, we penalize the cosine similarity between active and inactive expert columns:

    L_orth = (1 / (|I_active| · |I_inactive|)) Σ_{i∈I_active} Σ_{j∈I_inactive} ( B_{:,i}^⊤ B_{:,j} / (∥B_{:,i}∥_2 ∥B_{:,j}∥_2) )²    (7)

This loss encourages the columns of B corresponding to different expert specializations to occupy distinct, non-interfering subspaces. This property is beneficial for both (1) multi-task model merging, as it reduces destructive interference when combining adapters, and (2) continual learning, as it encourages new tasks to utilize subspaces that are orthogonal to those of prior tasks.

The total training objective is:

    L_total = L_task + λ L_orth    (8)

where λ is a hyperparameter controlling the strength of orthogonality regularization.

4 Experiments

4.1 Experimental Setup

Base Model. All experiments are conducted on Llama-3-8B (Grattafiori et al., 2024). LoRA modules are applied to the query, key, value, and output projection matrices in all attention layers.

Datasets. Following the FlyLoRA evaluation protocol (Zou et al., 2025b), we evaluate on three benchmarks spanning diverse reasoning capabilities:

• MMLU (Hendrycks et al., 2021): A comprehensive benchmark covering 57 subjects for evaluating general knowledge and language understanding.

• ScienceQA (Lu et al., 2022): A multimodal science question answering benchmark requiring scientific reasoning (we use the text-only subset).

• GSM8K (Cobbe et al., 2021): A dataset of grade-school math word problems requiring multi-step arithmetic reasoning.

Baselines. We compare NeuroLoRA against the following methods:

• LoRA (Hu et al., 2022): Standard low-rank adaptation with rank r = 32.

• AdaLoRA (Zhang et al., 2023): Adaptive rank allocation via singular value pruning.

• DoRA (Liu et al., 2024): Weight-decomposed low-rank adaptation.

• MoLE (Wu et al., 2024): MoE-LoRA with learned top-k gating.

• FlyLoRA (Zou et al., 2025b): Bio-inspired MoE-LoRA with frozen sparse projection and magnitude-based routing.

Implementation Details. For FlyLoRA and NeuroLoRA, we set the total rank r = 32, active rank k = 8, and sparsity ratio ρ = 0.25. The bottleneck dimension of the neuromodulation gate is d_h = 64. All methods are trained using AdamW (Loshchilov and Hutter, 2019) with β_1 = 0.9, β_2 = 0.95, and weight decay 0.01. We use a cosine learning rate schedule with an initial learning rate of 2 × 10⁻⁴ for B parameters and 5 × 10⁻⁴ for the modulation gate parameters, with a linear warmup over the first 3% of training steps. The batch size is 16 with gradient accumulation over 4 steps (effective batch size 64). All models are trained for 3 epochs. The orthogonality loss weight is λ = 0.1, selected via grid search over {0.01, 0.05, 0.1, 0.2} on a held-out validation set. Experiments are conducted on 4 NVIDIA A100 (80GB) GPUs using DeepSpeed ZeRO Stage 2 (Rajbhandari et al., 2020). We report the mean over 3 random seeds.

4.2 Single-Task Results

Table 1 summarizes the single-task adaptation performance.

    Method      Params (%)  MMLU  SciQA  GSM8K  Avg.
    Full FT     100         66.8  96.2   62.4   75.1
    LoRA        0.83        64.2  94.0   56.2   71.5
    AdaLoRA     0.81        64.5  93.6   56.8   71.6
    DoRA        0.84        64.8  94.3   57.4   72.2
    MoLE        0.35        63.9  93.1   55.8   70.9
    FlyLoRA     0.13        65.1  94.1   58.7   72.6
    NeuroLoRA   0.14        66.3  95.5   61.2   74.3

Table 1: Single-task performance comparison on Llama-3-8B. Params (%) denotes the percentage of trainable parameters relative to the full model. NeuroLoRA achieves the best performance across all three benchmarks while maintaining comparable parameter efficiency to FlyLoRA.
NeuroLoRA achieves the highest accuracy across all three benchmarks, outperforming FlyLoRA by +1.2 on MMLU, +1.4 on ScienceQA, and +2.5 on GSM8K. The largest improvement is observed on GSM8K, which requires multi-step mathematical reasoning. We attribute this to NeuroLoRA's ability to dynamically modulate expert selection based on the evolving problem state: in multi-step reasoning, the appropriate computation at each step depends heavily on the results of preceding steps, a dependency that FlyLoRA's static router cannot capture. Notably, NeuroLoRA approaches Full Fine-Tuning performance on MMLU (66.3 vs. 66.8) while using only 0.14% of the trainable parameters.

4.3 Multi-Task Model Merging

A key advantage of FlyLoRA's frozen projection design is compatibility with training-free model merging. We evaluate whether NeuroLoRA preserves this property by independently training adapters on each of the three tasks and merging them using Task Arithmetic (Ilharco et al., 2023) and TIES-Merging (Yadav et al., 2023).

    Method     Avg. (Indiv.)  Task Arith.     TIES
    LoRA       71.5           58.2 (-18.6%)   60.5 (-15.4%)
    DoRA       72.2           59.8 (-17.2%)   61.7 (-14.5%)
    FlyLoRA    72.6           68.9 (-5.1%)    69.6 (-4.1%)
    NeuroLoRA  74.3           71.8 (-3.4%)    72.3 (-2.7%)

Table 2: Multi-task model merging results (average accuracy across three tasks). Percentages indicate relative degradation from individual task performance. NeuroLoRA exhibits the smallest performance drop under both merging strategies.

As shown in Table 2, NeuroLoRA exhibits the least degradation under both merging strategies. With Task Arithmetic, NeuroLoRA loses only 3.4% relative performance compared to 5.1% for FlyLoRA and 18.6% for standard LoRA. This improvement is attributed to the Contrastive Orthogonality Loss, which explicitly enforces subspace separation during training, providing stronger inter-task decoupling than the probabilistic near-orthogonality of random projections alone.
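Eq. (7), the loss behind this decoupling, amounts to the mean squared cosine similarity between active and inactive columns of B. A sketch of how it could be computed for one input (the expert set and the use of NumPy are our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, r, k = 128, 32, 8
B = rng.standard_normal((d_out, r))                # r expert columns
active = np.array([0, 3, 5, 7, 11, 19, 23, 30])    # hypothetical top-k set
inactive = np.setdiff1d(np.arange(r), active)

Bn = B / np.linalg.norm(B, axis=0, keepdims=True)  # column-normalize
cos = Bn[:, active].T @ Bn[:, inactive]            # pairwise cosine similarities
L_orth = float(np.mean(cos ** 2))                  # Eq. (7): mean squared cosine
```

Since the mean runs over the k × (r − k) active/inactive pairs, the prefactor 1/(|I_active| · |I_inactive|) in Eq. (7) is absorbed by `np.mean`.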
The merging of the modulation gate parameters is handled by simple averaging, as the gate operates as a multiplicative modifier on the shared frozen projection space.

4.4 Continual Learning

The subspace separation properties of NeuroLoRA suggest a natural advantage for continual learning scenarios, where models must sequentially adapt to new tasks without forgetting previously learned knowledge. We evaluate this in a sequential fine-tuning setting.

Setup. We sequentially fine-tune Llama-3-8B on the three datasets in the order MMLU → ScienceQA → GSM8K, where each task is trained for 3 epochs before proceeding to the next. After training on all three tasks, we evaluate performance on all tasks. We additionally report Backward Transfer (BWT) (Lopez-Paz and Ranzato, 2017), which quantifies the average performance change on previously learned tasks after learning new ones (negative values indicate forgetting):

    BWT = (1 / (T − 1)) Σ_{i=1}^{T−1} (R_{T,i} − R_{i,i})    (9)

where R_{j,i} denotes the accuracy on task i after training on task j, and T is the total number of tasks.

We compare against sequential variants of the baselines, as well as established continual learning strategies:

• LoRA-Seq: Sequential LoRA without any anti-forgetting mechanism.

• EWC+LoRA: LoRA with Elastic Weight Consolidation (Kirkpatrick et al., 2017) applied to the LoRA parameters.

• O-LoRA (Wang et al., 2023): LoRA with orthogonal subspace constraints between tasks.

• FlyLoRA-Seq: Sequential FlyLoRA without explicit forgetting mitigation.

Results. Table 3 presents the continual learning results.

    Method          MMLU  SciQA  GSM8K  Avg.  BWT
    LoRA-Seq        51.8  84.2   55.4   63.8  −11.3
    EWC+LoRA        56.1  87.5   54.8   66.1  −7.5
    O-LoRA          58.2  89.4   55.6   67.7  −5.8
    FlyLoRA-Seq     59.7  90.1   57.6   69.1  −4.3
    NeuroLoRA-Seq   62.1  92.0   60.4   71.5  −2.6

Table 3: Continual learning results (MMLU → ScienceQA → GSM8K). BWT measures backward transfer; values closer to zero indicate less forgetting. NeuroLoRA achieves the best trade-off between plasticity (learning new tasks) and stability (retaining old tasks).

NeuroLoRA-Seq achieves the highest average accuracy (71.5) and the least forgetting (BWT = −2.6), outperforming all baselines. Compared to FlyLoRA-Seq (BWT = −4.3), the improvement is substantial and consistent across all three metrics. We attribute this to two complementary mechanisms: (1) the Contrastive Orthogonality Loss (L_orth) actively pushes expert subspaces apart during each task's training, creating "free" capacity for subsequent tasks; and (2) the context-aware modulation gate naturally routes inputs from different tasks to different regions of the projection space, providing an implicit form of task-specific parameter allocation without explicit task identity. These findings are consistent with the theoretical analysis of Zou et al. (2025a), who showed that the fly olfactory circuit's sparse random projections inherently mitigate the stability-plasticity dilemma, and with the practical continual learning gains demonstrated by Fly-CL (Zou et al., 2026). NeuroLoRA extends these bio-inspired advantages by adding dynamic modulation atop the static structural properties.

Notably, NeuroLoRA-Seq achieves GSM8K accuracy of 60.4 in the continual setting, which is 98.7% of its single-task performance (61.2), suggesting minimal interference from prior task training on mathematical reasoning.

    Variant                       GSM8K  ∆
    NeuroLoRA (full)              61.2   —
    w/o L_orth (λ = 0)            60.1   −1.1
    w/o Modulation Gate           58.7   −2.5
    w/ Static Gate (m_x = const)  59.4   −1.8
    w/ Trainable A                57.8   −3.4
    d_h = 32                      60.6   −0.6
    d_h = 128                     61.1   −0.1

Table 4: Ablation study on GSM8K. Removing the modulation gate (equivalent to FlyLoRA) incurs the largest degradation. Making A trainable degrades performance, confirming the importance of the frozen projection for generalization.
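For concreteness, the BWT metric of Eq. (9) reduces to a small computation over the accuracy matrix R. A sketch with hypothetical accuracies (not the paper's runs):

```python
# R[j][i] = accuracy on task i after finishing training on task j.
# Entries above the diagonal are not needed for BWT.
R = [
    [62.5, None, None],
    [61.0, 92.4, None],
    [62.1, 92.0, 60.4],   # last row: after the final task
]

T = len(R)
# Eq. (9): average change on earlier tasks after all training is done.
bwt = sum(R[T - 1][i] - R[i][i] for i in range(T - 1)) / (T - 1)
print(round(bwt, 2))  # -0.4 for these hypothetical numbers
```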
4.5 Ablation Study

We conduct ablation experiments on GSM8K to verify the contribution of each component.

Several observations emerge from Table 4: (1) Removing the modulation gate (reverting to FlyLoRA) causes the largest single-component drop (−2.5), confirming that context-aware routing is the primary source of improvement. (2) Replacing the context-dependent gate with a static learnable vector (m_x = const) recovers only part of the gain (−1.8), demonstrating that the input-dependent nature of the modulation is essential. (3) Making A trainable degrades performance by −3.4, consistent with FlyLoRA's finding that frozen projections provide beneficial regularization. (4) The bottleneck dimension d_h shows diminishing returns beyond 64, and d_h = 64 provides a favorable trade-off between expressiveness and parameter overhead.

4.6 Sensitivity Analysis

We analyze the sensitivity of NeuroLoRA to two key hyperparameters: the active rank k and the orthogonality loss weight λ.

Effect of Active Rank k. We vary k ∈ {4, 8, 12, 16} with fixed r = 32 on GSM8K. Performance peaks at k = 8 (61.2), with k = 4 yielding 59.8 (insufficient capacity) and k = 16 yielding 60.3 (reduced specialization due to too many active experts). This aligns with FlyLoRA's finding that moderate sparsity (k/r = 0.25) provides the best balance.

Effect of λ. With λ = 0.01, orthogonality regularization is too weak to meaningfully separate subspaces (60.4). Performance improves with λ = 0.1 (61.2) and begins to degrade at λ = 0.2 (60.7), as excessive orthogonality pressure constrains the expressiveness of B.

5 Discussion

Static vs. Dynamic Routing. FlyLoRA's design philosophy holds that "an expert is aware of its own capacity" through magnitude-based selection.
While this is effective for pattern-matching tasks (e.g., knowledge recall in MMLU), it is insufficient for tasks requiring compositional reasoning (e.g., multi-step problem solving in GSM8K). The performance gap between FlyLoRA and NeuroLoRA on GSM8K (+2.5) versus MMLU (+1.2) supports this interpretation. NeuroLoRA's modulation gate provides a mechanism analogous to biological neuromodulation, where the same neural circuit can produce qualitatively different responses depending on modulatory context (Marder, 2012).

Computational Overhead. The neuromodulation gate introduces d_in × d_h + d_h × r + 2r additional parameters per LoRA layer. With d_in = 4096, d_h = 64, and r = 32, this amounts to 264,256 parameters per layer—a negligible overhead of approximately 0.003% of the base model. The projection A remains frozen and sparse, meaning NeuroLoRA maintains the training speed advantages of FlyLoRA over dense methods.

Implications for Continual Learning. The continual learning results in Section 4.4 suggest that the combination of frozen routing structure and enforced subspace orthogonality provides an effective inductive bias for mitigating catastrophic forgetting. Unlike methods that require explicit task boundaries or replay buffers (Kirkpatrick et al., 2017; Rebuffi et al., 2017), NeuroLoRA's approach is fully online and does not store any data from previous tasks. The context-aware gate implicitly partitions the expert space based on input characteristics rather than task labels, making it applicable to task-agnostic continual learning scenarios.

6 Conclusion

We presented NeuroLoRA, an advancement of the MoE-LoRA framework that incorporates biological insights from neuromodulation. By introducing a lightweight context-aware gating mechanism atop the sparse random projections pioneered by FlyLoRA, NeuroLoRA resolves the limitations of static, magnitude-based routing.
The addition of a Contrasti ve Orthogonality Loss further strengthens expert subspace separation, benefiting both multi- task merging and continual learning. Extensiv e experiments demonstrate that NeuroLoRA consis- tently outperforms FlyLoRA and other strong base- lines across single-task adaptation (+1.7 av erage improv ement), model merging ( − 3.4% vs. − 5.1% degradation), and continual learning (BWT of − 2.6 vs. − 4.3). These results suggest that simulating not only the structur e (connectivity) but also the dy- namics (modulation) of biological neural systems is a promising direction for parameter-ef ficient adap- tation of large language models. Limitations Our ev aluation is limited to a single base model (Llama-3-8B) and three benchmarks. While the benchmarks span knowledge, scientific reason- ing, and mathematical reasoning, ev aluation on a broader set of tasks and model scales would strengthen the generality of our conclusions. The continual learning experiments consider a fixed task order; future work should inv estigate sensi- ti vity to task ordering and longer task sequences. Additionally , the current neuromodulation gate op- erates on indi vidual tokens; incorporating cross- token (sequence-le vel) context via attention pool- ing may yield further improv ements. References Josh Achiam, Stev en Adler , Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, and 1 others. 2023. GPT-4 techni- cal report. arXiv preprint . Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Ha vens, V italiy Chiley , Jonathan Frankle, and 1 others. 2024. LoRA learns less and forgets less. arXiv preprint . T om Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry , Amanda Askell, and 1 others. 2020. Language models are fe w- shot learners. 
In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901.

Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Pawan Kumar Dokania, Philip HS Torr, and Marc'Aurelio Ranzato. 2019. On tiny episodic memories in continual learning. arXiv preprint.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, and 1 others. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Sanjoy Dasgupta, Charles F Stevens, and Saket Navlakha. 2017. A neural algorithm for a fundamental computing problem. Science, 358(6364):793–796.

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2024. QLoRA: Efficient finetuning of quantized language models. Advances in Neural Information Processing Systems, 36.

Shihan Dou, Enyu Zhou, Yan Liu, Songyang Gao, Jun Wei, Huajian Shen, Yuhao Xiong, Junjie Shan, Ermo Shan, Xuwu Huang, and 1 others. 2024. LoRAMoE: Alleviating world knowledge forgetting in large language models via MoE-style plugin. arXiv preprint arXiv:2312.09979.

Kenji Doya. 2002. Metalearning and neuromodulation. Neural Networks, 15(4–6):495–506.

William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39.

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The llama 3 herd of models. arXiv preprint.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. In International Conference on Learning Representations.
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.

Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. 2023. Editing models with task arithmetic. In International Conference on Learning Representations.

William B Johnson and Joram Lindenstrauss. 1984. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206.

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, and 1 others. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526.

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2021. GShard: Scaling giant models with conditional computation and automatic sharding. In International Conference on Learning Representations.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059.

Dengchun Li, Yingzi Mei, Zhengmao Zhuang, Mingyu Zhang, Deng Cai, Lei Li, and Minlie Huang. 2024. MixLoRA: Enhancing large language models fine-tuning with LoRA-based mixture of experts. arXiv preprint arXiv:2404.15159.

Xiang Lisa Li and Percy Liang. 2021.
Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190.

Zhizhong Li and Derek Hoiem. 2017. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947.

Yan-Shuo Liang and Wu-Jun Li. 2024. InfLoRA: Interference-free low-rank adaptation for continual learning. arXiv preprint.

Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. 2024. DoRA: Weight-decomposed low-rank adaptation. arXiv preprint arXiv:2402.09353.

David Lopez-Paz and Marc'Aurelio Ranzato. 2017. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, volume 30.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.

Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. 2022. Learn to explain: Multimodal reasoning via thought chains for science question answering. In Advances in Neural Information Processing Systems, volume 35, pages 2507–2521.

Eve Marder. 2012. Neuromodulation of neuronal circuits: Back to the future. Neuron, 76(1):1–11.

Michael McCloskey and Neal J Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24:109–165.

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. ZeRO: Memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16.

Anastasia Razdaibiedina, Yuning Mao, Rui Hou, Madian Khabsa, Mike Lewis, and Amjad Almahairi. 2023. Progressive prompts: Continual learning for language models. arXiv preprint.
Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. 2017. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2001–2010.

Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. 2016. Progressive neural networks. arXiv preprint arXiv:1606.04671.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30.

Yaqing Wang, Subhabrata Mukherjee, Xiaodong Liu, Jing Gao, Ahmed Hassan Awadallah, and Jianfeng Gao. 2022. AdaMix: Mixture-of-adaptations for parameter-efficient model tuning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5744–5760.

Yiming Wang, Yong Yu, Lingqiao Zeng, and Zuowei Li. 2023. O-LoRA: Orthogonal low-rank adaptation of large language models. arXiv preprint arXiv:2406.01434.

Xun Wu, Shaohan Hu, Ying Shi, Bowen Liu, Xin Geng, Furu Jiao, Jiang Bian, and Furu Wei. 2024. Mixture of LoRA experts. arXiv preprint arXiv:2404.13628.

Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. 2023. TIES-merging: Resolving interference when merging models. In Advances in Neural Information Processing Systems, volume 36.

Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. 2020. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems, 33:5824–5836.
Ted Zadouri, Thomas Hartvigsen, Gabriel Ilharco, Ali Farhadi, and Sara Hooker. 2023. Pushing mixture of experts to the limit: Extremely parameter efficient MoE for instruction tuning. arXiv preprint arXiv:2309.05444.

Qingru Zhang, Minshuo Chen, Alexander Bukharin, Nikos Karampatziakis, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. 2023. AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning. arXiv preprint.

Heming Zou, Yunliang Zang, and Xiangyang Ji. 2025a. Structural features of the fly olfactory circuit mitigate the stability-plasticity dilemma in continual learning. arXiv preprint arXiv:2502.01427.

Heming Zou, Yunliang Zang, Wutong Xu, and Xiangyang Ji. 2026. Fly-CL: A fly-inspired framework for enhancing efficient decorrelation and reduced training time in pre-trained model-based continual representation learning. In The Fourteenth International Conference on Learning Representations.

Heming Zou, Yunliang Zang, Wutong Xu, Yao Zhu, and Xiangyang Ji. 2025b. FlyLoRA: Boosting task decoupling and parameter efficiency via implicit rank-wise mixture-of-experts. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.

A Derivation of the Orthogonality Loss Bound

We provide a brief analysis of the expected value of $\mathcal{L}_{\mathrm{orth}}$ under random initialization. Let the columns of $B \in \mathbb{R}^{d_{\mathrm{out}} \times r}$ be initialized i.i.d. from $\mathcal{N}(0, \frac{1}{d_{\mathrm{out}}} I)$. For two independent columns $B_{:,i}$ and $B_{:,j}$, the squared cosine similarity

$$c_{ij}^2 = \left( \frac{B_{:,i}^{\top} B_{:,j}}{\|B_{:,i}\| \, \|B_{:,j}\|} \right)^2$$

has expected value $\mathbb{E}[c_{ij}^2] = \frac{1}{d_{\mathrm{out}}}$ for large $d_{\mathrm{out}}$. With $d_{\mathrm{out}} = 4096$, this yields $\mathbb{E}[\mathcal{L}_{\mathrm{orth}}] \approx 2.4 \times 10^{-4}$ at initialization. During training, without $\mathcal{L}_{\mathrm{orth}}$, gradient updates can increase inter-expert correlations to $O(10^{-2})$; the orthogonality loss prevents this by maintaining $c_{ij}^2$ near its initialization-level bound.
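The expectation E[c_ij^2] = 1/d_out claimed in Appendix A can be checked numerically. The following Monte Carlo sketch (ours, not from the paper's code) draws independent Gaussian columns and averages their squared cosine similarity; note that the Gaussian variance does not matter, since cosine similarity is scale-invariant.

```python
# Monte Carlo check of the appendix claim: for independent Gaussian columns
# in R^{d_out}, the expected squared cosine similarity is approximately 1/d_out.
import random

def sq_cosine(u, v):
    # Squared cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot * dot / (sum(a * a for a in u) * sum(b * b for b in v))

random.seed(0)
d_out, trials = 4096, 200
est = sum(
    sq_cosine([random.gauss(0, 1) for _ in range(d_out)],
              [random.gauss(0, 1) for _ in range(d_out)])
    for _ in range(trials)
) / trials
print(est, 1 / d_out)  # both values are close to 2.4e-4
```

The per-sample standard deviation of c_ij^2 is about sqrt(2)/d_out, so a few hundred trials suffice to confirm the 1/d_out scaling.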
B Full Hyperparameter Configuration

Hyperparameter          Value
----------------------  --------------------------------
Base model              Llama-3-8B
LoRA target modules     W_q, W_k, W_v, W_o
Total rank r            32
Active rank k           8
Sparsity ρ              0.25
Gate bottleneck d_h     64
Optimizer               AdamW
β1, β2                  0.9, 0.95
Weight decay            0.01
Learning rate (B)       2 × 10^-4
Learning rate (gate)    5 × 10^-4
LR schedule             Cosine with linear warmup
Warmup ratio            3%
Batch size              16 (× 4 gradient accumulation)
Training epochs         3
λ (L_orth)              0.1
α (LoRA scaling)        16
Precision               BF16 mixed precision
GPUs                    4 × NVIDIA A100 (80GB)
Framework               DeepSpeed ZeRO Stage 2

Table 5: Complete hyperparameter configuration.
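For reproduction, the settings in Table 5 can be collected into a plain config dict. This is a convenience sketch only; the key names are ours and do not correspond to any released codebase.

```python
# Table 5 settings as a plain config dict. Key names are illustrative,
# chosen by us; values are taken directly from the table.
CONFIG = {
    "base_model": "Llama-3-8B",
    "target_modules": ["W_q", "W_k", "W_v", "W_o"],
    "total_rank": 32,          # r
    "active_rank": 8,          # k
    "sparsity_rho": 0.25,
    "gate_bottleneck": 64,     # d_h
    "optimizer": "AdamW",
    "betas": (0.9, 0.95),
    "weight_decay": 0.01,
    "lr_B": 2e-4,
    "lr_gate": 5e-4,
    "lr_schedule": "cosine_with_linear_warmup",
    "warmup_ratio": 0.03,
    "batch_size": 16,
    "grad_accum": 4,
    "epochs": 3,
    "lambda_orth": 0.1,
    "lora_alpha": 16,
    "precision": "bf16",
}

# Effective batch size seen by the optimizer:
print(CONFIG["batch_size"] * CONFIG["grad_accum"])  # -> 64
```

Note that the gate uses a higher learning rate (5e-4) than the B matrices (2e-4), consistent with the gate being a small, freshly initialized module.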
