HypeLoRA: Hyper-Network-Generated LoRA Adapters for Calibrated Language Model Fine-Tuning
Bartosz Trojan¹,² [0009-0005-2649-3194] and Filip Gębala¹,³ [0009-0002-3020-4365]

¹ Upper-Secondary Schools of Communications in Cracow
² bartosztrojanofficial@gmail.com
³ fgebalaofficial@gmail.com

Abstract. Modern Transformer-based models frequently suffer from miscalibration, producing overconfident predictions that do not reflect true empirical frequencies. This work investigates the calibration dynamics of LoRA (Low-Rank Adaptation) and a novel hyper-network-based adaptation framework as parameter-efficient alternatives to full fine-tuning for RoBERTa. Evaluating across the GLUE benchmark, we demonstrate that LoRA-based adaptation consistently achieves calibration parity with (and on specific tasks exceeds) full fine-tuning, while maintaining significantly higher parameter efficiency. We further explore a dynamic approach in which a shared hyper-network generates the LoRA factors (the A and B matrices) to induce structural coupling across layers. This approach produced results similar to standard LoRA fine-tuning, even achieving a better MCC on the CoLA dataset. Our study also reveals a critical trade-off: constraining the adaptation space (e.g., freezing the A matrices) acts as a powerful regularizer that improves Expected Calibration Error (ECE), but necessitates a carefully balanced sacrifice in downstream task accuracy. To support future research, we provide a unified and reproducible implementation of contemporary calibration metrics, including ECE, MCE, and ACE. Our findings clarify the relationship between parameter efficiency and probabilistic reliability, positioning structured low-rank updates as a viable foundation for uncertainty-aware Transformer architectures.
https://github.com/btrojan-official/HypeLoRA

Keywords: Fine-tuning · LoRA · Hyper-network · RoBERTa · Calibration · ECE

1 Introduction

Transformer-based architectures have become the dominant paradigm across a wide range of machine learning domains, including natural language processing [4], computer vision [5], and speech recognition [1]. While these models achieve state-of-the-art predictive accuracy, it is now well established that their probabilistic outputs are often poorly calibrated [3]. Formally, a model is considered calibrated if the predicted confidence of a given class matches the true empirical frequency of that class among all samples assigned that confidence score — that is, among all inputs where the model outputs probability p, exactly a fraction p should truly belong to the predicted class.

In this work, we explore parameter-efficient alternatives for calibrating transformer-based encoders. We evaluate a RoBERTa model, both fully fine-tuned and augmented with LoRA, on the GLUE benchmark using a comprehensive set of calibration metrics, including Expected Calibration Error (ECE), Adaptive Calibration Error (ACE), and Maximum Calibration Error (MCE). In addition, we investigate the use of hyper-networks as a mechanism for inducing dynamic calibration behavior within frozen transformer architectures.

Additionally, we study a framework in which a hyper-network generates low-rank calibration signals. This setup aims to induce structural coupling across layers, forcing the model to learn a global calibration strategy rather than independent adjustments. We anticipated that this shared generation process would act as a regularizer, tempering overly sharp probability distributions.
While this approach ultimately fails to yield consistent improvements, it provides valuable insight into the limitations of hyper-network-driven calibration for transformers.

Beyond empirical evaluation, we also consolidate and implement a unified set of recent calibration metrics, providing clear and reproducible reference implementations. This addresses the current fragmentation in calibration evaluation practices and facilitates more systematic comparison across methods.

Our contributions can be summarized as follows:
– We provide an evaluation of standard and LoRA fine-tuning for calibration of RoBERTa models on the GLUE benchmark using multiple complementary calibration metrics, i.e., ECE, MCE, and ACE.
– We investigate the use of hyper-networks and LoRA as mechanisms for systematic model calibration and for analyzing the underlying causes of model failure.
– We implement modern calibration metrics in a single evaluation framework.

2 Related Work

Pre-training. Large-scale self-supervised pre-training on generic corpora yields representations that transfer across downstream tasks, as demonstrated by masked language modeling and autoregressive objectives [4,16]. Domain- and task-adaptive pre-training on unlabeled data can further improve performance [7], yet the high computational cost of pre-training large backbones motivates parameter-efficient adaptation methods.

Fine-tuning large language models. Full-parameter fine-tuning is the most direct adaptation approach, but it is computationally and memory demanding for large models [17]. When a single backbone such as RoBERTa [16] must be tailored to many tasks, storing full task-specific parameters and optimizer states becomes infeasible, motivating more resource-efficient alternatives.

Parameter-efficient fine-tuning. PEFT methods update only a small parameter subset per task while achieving performance comparable to full fine-tuning.
Adapters [9] insert trainable bottleneck modules into each transformer block. Prefix-tuning [13] prepends learnable tokens to the input, optimizing a continuous prompt without modifying model weights. LoRA [10] injects trainable low-rank matrices into attention layers without any activation function, substantially reducing trainable parameter count and memory cost.

Model calibration [25]. A calibrated model's confidence estimates align with empirical likelihoods: formally, perfect calibration requires P(Y = y | Q = q) = q over the joint distribution P(Q, Y). We evaluate calibration primarily via Expected Calibration Error (ECE) [6], alongside Classwise ECE, MCE, ACE, Thresholded ACE, and Brier Score [19,12,20,2]. Post-hoc methods such as temperature scaling [18], Platt scaling [22], and isotonic regression [26] are simple but globally applied and brittle under distribution shift. Training-time approaches like label smoothing [15] and confidence-aware regularization [14] are more expressive but require costly retraining.

Hyper-networks. Hyper-networks parameterize weights as functions of external inputs, enabling dynamic generation of adaptation parameters [8]. In PEFT, conditioning hyper-networks on task identity to generate adapter or low-rank update parameters improves parameter sharing and multi-task performance without per-task parameter duplication [21]. Generating full weight tensors remains challenging due to scalability and stability concerns [11], so practical approaches constrain generation to compact structures such as bottleneck adapters or low-rank factors.

3 Proposed Approach

We propose a parameter-efficient calibration mechanism that injects low-rank updates into a frozen RoBERTa encoder [16] via a compact hyper-network (Figure 1).
The hyper-network conditions on a learned embedding associated with each target weight matrix, producing coordinated adaptation signals across all transformer layers while keeping all backbone parameters frozen.

3.1 Problem Definition

Let f_θ(x) denote a pretrained transformer encoder with frozen parameters θ, producing a probability distribution over C classes for input x:

    p_θ(y | x) = softmax(f_θ(x)).    (1)

Our goal is to improve predictive calibration without modifying pretrained weights. Instead of post-hoc logit adjustments (e.g., temperature scaling), we introduce structured low-rank perturbations inside the transformer layers.

3.2 Layer-Conditioned Low-Rank Updates with Hyper-Network

For a weight matrix W ∈ R^{d×d} of the pretrained transformer, we apply a LoRA-style [10] low-rank perturbation:

    W' = W + α A B,    (2)

where A ∈ R^{d×r} and B ∈ R^{r×d} are low-rank factors of rank r < d, and α is a fixed scaling coefficient. We apply this update to the Query and Value projection matrices in each attention block. Unlike standard LoRA, which trains all of the A and B matrices independently, we generate these factors via a shared hyper-network H_φ, conditioned on a layer embedding.

Fig. 1. A hyper-network generates the weights for the Query and Value matrices in all attention blocks of the RoBERTa model, while the original pretrained weights remain frozen. The figure illustrates the approach in which the hyper-network produces both the A and B matrices. In this work, we also present a variant where only the B matrices are generated by the hyper-network, and the A matrices are fixed with randomly initialized values; this variant is not shown in the figure.
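As a point of reference before the hyper-network variant, the plain LoRA-style update of Eq. (2) can be sketched as follows. This is our own illustrative code using the paper's d = 768 and r = 8, not the authors' implementation:

```python
import numpy as np

def lora_update(W, A, B, alpha=1.0):
    """LoRA-style low-rank perturbation W' = W + alpha * A @ B (Eq. 2)."""
    return W + alpha * A @ B

d, r = 768, 8                              # hidden size and rank used in the paper
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))            # frozen pretrained weight
A = rng.standard_normal((d, r)) * 0.01     # A in R^{d x r}
B = np.zeros((r, d))                       # B in R^{r x d}, zero-initialized as in LoRA

W_prime = lora_update(W, A, B)
assert np.allclose(W_prime, W)             # with B = 0, adaptation starts as a no-op

# Trainable parameters per adapted matrix vs. full fine-tuning of that matrix:
print(d * r + r * d, "vs", d * d)          # → 12288 vs 589824
```

The parameter count illustrates the efficiency argument: two rank-8 factors replace roughly 590k trainable entries with about 12k per adapted matrix.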
Each target weight matrix — specifically the Query and Value projections in each layer — is associated with a dedicated learned embedding e ∈ R^{d_h}. Concretely, if both the A and B factors are generated for the Query and Value matrices in layer ℓ, there are four embeddings per layer: e_A^Q[ℓ], e_B^Q[ℓ], e_A^V[ℓ], e_B^V[ℓ]. The hyper-network H_φ maps each such embedding e to its target low-rank factor:

    A_ℓ^Q = H_φ(e_A^Q[ℓ]),   B_ℓ^Q = H_φ(e_B^Q[ℓ]),
    A_ℓ^V = H_φ(e_A^V[ℓ]),   B_ℓ^V = H_φ(e_B^V[ℓ]).    (3)

The resulting vectors are reshaped into matrices of appropriate dimensions and applied to their respective weight matrices. This ties all layer adaptations through a single generator, enforcing structural coherence and drastically reducing parameter count relative to per-layer LoRA. The architecture of H_φ is either a lightweight MLP or a small Transformer encoder operating over all embeddings jointly; implementation details are given in Section 4.3.

Variants. We consider two operating modes depending on whether the A matrices are generated or fixed:

– Full generation: all A and B matrices are produced by H_φ, meaning that there are four embeddings per layer in this scenario.
– Fixed-A: the A matrices are initialized once from a Kaiming uniform distribution (similarly to [10]) and held fixed throughout training; only the B matrices are generated by H_φ, meaning that there are only two embeddings per layer.

The intuition is that fixed random A matrices act as structured noise injection into the frozen base model, encouraging the learned B matrices to be more robust.

3.3 Training and Inference

During training, H_φ generates the A and B factors for all target weight matrices. They are applied transiently in the forward pass — the stored pretrained weights are never modified.
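This generate-and-apply step (Eq. 3) can be sketched roughly as follows; the single-linear-layer H_φ and all variable names are our own simplification for illustration, not the authors' implementation (the paper's H_φ is a deeper MLP or a Transformer encoder):

```python
import numpy as np

d, r, d_h = 768, 8, 128   # RoBERTa hidden size, LoRA rank, embedding dim (from the paper)
rng = np.random.default_rng(0)

# One learned embedding per target factor; here a single layer's Query projection.
e_QA = rng.standard_normal(d_h)   # embedding for the A factor
e_QB = rng.standard_normal(d_h)   # embedding for the B factor

# Toy shared hyper-network H_phi: one linear map from embedding space to the
# flattened d*r factor space, shared across all layers and factors.
W_h = rng.standard_normal((d * r, d_h)) * 0.02

def hyper_generate(e):
    """H_phi(e): produce a flattened low-rank factor from an embedding."""
    return W_h @ e

A_q = hyper_generate(e_QA).reshape(d, r)   # A_l^Q = H_phi(e_A^Q[l])
B_q = hyper_generate(e_QB).reshape(r, d)   # B_l^Q = H_phi(e_B^Q[l])

# The factors are applied transiently in the forward pass; the frozen
# pretrained weight W itself is never modified.
W = rng.standard_normal((d, d))
alpha = 1.0
W_eff = W + alpha * A_q @ B_q
print(W_eff.shape)  # → (768, 768)
```

In the Fixed-A variant, A_q would instead be drawn once (Kaiming uniform) and frozen, with only B_q produced by the generator.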
Gradients flow through the low-rank projections back to H_φ, whose parameters φ are updated while θ, the original pretrained model weights, remains frozen. We optimize the standard cross-entropy loss.

At inference, the hyper-network again produces the low-rank factors on the fly, or they can be precomputed once to reduce runtime overhead. This design maintains a strict separation between the frozen backbone and the learned adaptation, ensuring a minimal memory footprint and no risk of corrupting the pretrained representations.

4 Experimental Setup

4.1 Datasets

We evaluate all experiments on the General Language Understanding Evaluation (GLUE) benchmark, a collection of sentence- and sentence-pair language understanding tasks designed to provide a standardized comparison of model performance across diverse NLP capabilities [24]. In this work, we use only the GLUE tasks formulated as classification problems.

– CoLA (Corpus of Linguistic Acceptability). A single-sentence acceptability task where the model predicts whether a sentence is linguistically acceptable (binary). CoLA has 8.5k training examples and 1k test samples.
– SST-2 (Stanford Sentiment Treebank). A single-sentence sentiment classification task where the model predicts positive vs. negative sentiment (binary). SST-2 has 67k training examples and 1.8k test examples.
– QNLI (Question-answering Natural Language Inference). A sentence-pair task (QA/NLI) where the model predicts whether a context sentence contains the answer to a question (binary). QNLI has 105k training examples and 5.4k test examples.
– MRPC (Microsoft Research Paraphrase Corpus). A sentence-pair paraphrase task where the model predicts whether two sentences are paraphrases (binary). MRPC has 3.7k training examples and 1.7k test examples.
– RTE (Recognizing Textual Entailment).
A sentence-pair inference task (NLI) where the model predicts whether a hypothesis is entailed by a premise (binary). RTE has 2.5k training and 3k test examples.
– MNLI (Multi-Genre Natural Language Inference). A sentence-pair natural language inference task where the model predicts the relationship between a premise and a hypothesis (entailment, contradiction, or neutral; three-way classification). MNLI has 393k training examples and 20k test examples.

4.2 Training and Evaluation Setup

Unless stated otherwise, we follow the experimental protocol from the original LoRA work [10] for training and evaluation. The two main differences are (i) the introduction of a hyper-network to generate the low-rank update parameters and (ii) the specific injection strategy used to apply these generated updates within the frozen RoBERTa encoder [16]. For our approach, we use different peak learning rates (e.g., 1e−5 for the MLP and 4e−4 for the Transformer), as we observed that lower learning rates lead to more stable and effective training when using the MLP-based hyper-network.

4.3 Hyper-network Architectures

We consider two hyper-network architectures for generating the LoRA factors: a multilayer perceptron (MLP) and a Transformer encoder [23]. In both cases, each transformer layer identifier ℓ is represented by a learnable embedding of dimension 128.

MLP. The MLP consists of four fully connected layers. The input dimension is 128, all hidden layers have width 2048, and GELU is used as the activation function. The output layer projects to the flattened LoRA parameter space. In all experiments, the hidden size of RoBERTa is 768 and the LoRA rank is r = 8, so the output dimension corresponds to 768 × 8 per generated factor. All MLP weights are initialized from a normal distribution.
In the configuration where the A matrices are fixed, they are initialized once using Kaiming uniform initialization and kept frozen, while only B is generated by the hyper-network.

Transformer. In the Transformer-based variant, all layer embeddings (dimension 128) are first projected with a linear layer to a hidden dimension of 256 and then processed jointly by a Transformer encoder with 2 layers, 16 attention heads, and model dimension 256. The outputs corresponding to each layer are passed through a final linear projection to produce the flattened LoRA parameters, either for both the A and B matrices or only for the B matrices when the A matrices are fixed. All learnable parameters are initialized from a normal distribution, except for the fixed A matrices, which use Kaiming uniform initialization.

4.4 Metrics

Exact calibration measurement with finite samples is not possible due to continuous predicted confidence values. In practice, calibration error is approximated by partitioning N predictions into M bins {b_1, ..., b_M} based on predicted probabilities and comparing average confidence with empirical accuracy within each bin. We set M = 10 in all of our experiments. The most widely used metric is Expected Calibration Error (ECE) [19], defined as the weighted average of per-bin confidence–accuracy gaps:

    ECE = Σ_{m=1}^{M} (|b_m| / N) · |acc(b_m) − conf(b_m)|,    (4)

where |b_m| is the number of samples in bin b_m, acc(b_m) is the fraction of correct predictions, and conf(b_m) is the mean predicted probability within that bin.

The remaining metrics follow the same bin-level discrepancy idea with specific modifications. Maximum Calibration Error (MCE) [19] replaces the weighted average with the worst-case bin maximum. Classwise ECE (CECE) [12] computes the ECE-style sum separately for each class and averages across all K classes.
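A minimal implementation of the binned ECE of Eq. (4), matching our setting of M = 10 equal-width bins, might look like the following sketch (our own illustration, not necessarily the repository code):

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Expected Calibration Error (Eq. 4): the weighted average of
    |acc(b_m) - conf(b_m)| over M equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(confidences)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for m in range(n_bins):
        lo, hi = edges[m], edges[m + 1]
        # Bins are (lo, hi]; the first bin also includes confidence exactly 0.
        mask = (confidences > lo) & (confidences <= hi)
        if m == 0:
            mask |= confidences == 0.0
        if mask.any():
            acc = correct[mask].mean()        # acc(b_m): fraction correct
            conf = confidences[mask].mean()   # conf(b_m): mean confidence
            total += (mask.sum() / n) * abs(acc - conf)
    return total

# Perfectly calibrated toy case: 0.8-confidence predictions, right 80% of the time.
print(round(ece([0.8] * 10, [1] * 8 + [0] * 2), 3))  # → 0.0
```

MCE would take the maximum of the per-bin gaps instead of the weighted sum, and ACE would replace the fixed-width edges with equal-population quantile edges.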
Adaptive Calibration Error (ACE) [20] replaces fixed-width bins with equal-population bins, also averaging per class. Thresholded ACE (TACE) [20] further restricts ACE to predictions whose confidence exceeds a threshold ε, reducing the influence of near-zero probabilities. Finally, the Brier Score (BS) [2] is a proper scoring rule computing the mean squared error between predicted probabilities and one-hot targets across all classes, providing a combined measure of calibration and sharpness.

5 Results

Here we present the results of our experiments, including calibration analysis for standard Fine-Tuning (FT), Low-Rank Adaptation (LoRA), and our proposed hyper-network-based variants.

5.1 Calibration of the Existing Methods

Table 1 presents task performance and calibration metrics for standard Fine-Tuning and LoRA fine-tuning across selected GLUE benchmarks. LoRA consistently matches or improves predictive performance relative to full Fine-Tuning, with noticeable gains on CoLA, MRPC, QNLI, and RTE, while maintaining comparable results on MNLI.

Table 1. Performance and calibration metrics across GLUE benchmarks for RoBERTa-large. FT denotes standard fine-tuning [16]; LoRA denotes Low-Rank Adaptation [10]. Following [24], we report Matthews Correlation Coefficient (MCC) for CoLA, F1 for MRPC, and Accuracy for the other tasks. Minor score deviations from [10] are attributable to different random seeds; additionally, MRPC reports F1 (as in [16]) rather than accuracy, and RTE (FT) is initialized from MNLI-pretrained weights for fair comparison [10].

FT
Metric          CoLA           SST-2          QNLI           MRPC           RTE            MNLI
Score ↑         61.68 ± 0.73   94.50 ± 0.41   92.64 ± 0.22   90.37 ± 0.41   85.32 ± 1.37   87.17
ECE ↓           0.136 ± 0.008  0.035 ± 0.017  0.072 ± 0.002  0.111 ± 0.023  0.104 ± 0.045  0.074
CECE ↓          0.138 ± 0.007  0.036 ± 0.016  0.072 ± 0.002  0.114 ± 0.021  0.113 ± 0.038  0.050
MCE ↓           0.339 ± 0.074  0.277 ± 0.242  0.539 ± 0.111  0.308 ± 0.163  0.309 ± 0.159  0.202
ACE ↓           0.134 ± 0.009  0.033 ± 0.016  0.071 ± 0.002  0.110 ± 0.021  0.103 ± 0.039  0.073
TACE_0.01 ↓     0.469 ± 0.006  0.461 ± 0.013  0.482 ± 0.007  0.441 ± 0.020  0.413 ± 0.058  0.492
Brier Score ↓   0.290 ± 0.006  0.096 ± 0.006  0.145 ± 0.004  0.245 ± 0.013  0.266 ± 0.009  0.207

LoRA
Score ↑         63.94 ± 0.21   94.99 ± 0.18   93.07 ± 0.05   93.18 ± 0.40   87.61 ± 0.21   87.19
ECE ↓           0.120 ± 0.025  0.046 ± 0.001  0.036 ± 0.011  0.088 ± 0.012  0.124 ± 0.007  0.043
CECE ↓          0.123 ± 0.024  0.046 ± 0.001  0.037 ± 0.011  0.088 ± 0.011  0.125 ± 0.006  0.029
MCE ↓           0.282 ± 0.038  0.340 ± 0.111  0.123 ± 0.052  0.424 ± 0.097  0.502 ± 0.156  0.257
ACE ↓           0.114 ± 0.024  0.040 ± 0.003  0.035 ± 0.010  0.084 ± 0.012  0.114 ± 0.002  0.043
TACE_0.01 ↓     0.443 ± 0.026  0.477 ± 0.007  0.452 ± 0.012  0.447 ± 0.005  0.451 ± 0.024  0.432
Brier Score ↓   0.271 ± 0.020  0.096 ± 0.002  0.114 ± 0.004  0.180 ± 0.017  0.244 ± 0.009  0.193

From a calibration perspective, ECE remains relatively stable across seeds for both methods, typically exhibiting low variance. CECE closely follows ECE, reflecting the predominantly binary structure of the evaluated tasks. As expected, MCE shows substantially higher variability due to its sensitivity to worst-case confidence bins. ACE aligns closely with ECE, indicating an approximately uniform distribution of samples across confidence bins.

Across tasks, LoRA demonstrates improved calibration on QNLI and MNLI, where ECE and Brier Score are consistently lower than under full Fine-Tuning. However, this behavior is not uniform. On SST-2 and RTE, Fine-Tuning achieves lower ECE values, suggesting better calibration despite slightly weaker predictive performance. The elevated values of TACE_0.01 (TACE with threshold equal to 0.01) across both methods indicate that predictions are concentrated near extreme probabilities, leading to high-confidence outputs and increased calibration penalties when misclassifications occur.

Overall, LoRA does not uniformly improve calibration over full Fine-Tuning. Its effect is task-dependent, and improvements in predictive performance do not systematically translate into better uncertainty estimation.

5.2 Our Approach: Calibration with Hyper-Network

Table 2 reports the performance and calibration of the proposed hyper-network variants on CoLA and SST-2, compared to the LoRA baseline. We evaluate both MLP-based and Transformer-based hyper-networks, with either full generation of adapter matrices (A_gen) or with the A matrices fixed (A_fix).

On CoLA, the Transformer-based hyper-network with fully generated matrices (Transformer A_gen) achieves the highest average performance, but also shows high variability across different seeds.

Table 2. We compare four hyper-network configurations combining two architectures (MLP and Transformer) with two weight-generation strategies (A_gen and A_fix), each averaged over 3 seeds. Transformer A_gen outperforms the LoRA baseline on CoLA, though LoRA remains stronger on SST-2, suggesting task-dependent behavior. Interestingly, Transformer A_fix achieves the best calibration across all runs despite not leading in Matthews Correlation Coefficient (MCC) or Accuracy.

                    CoLA                            SST-2
                    MCC ↑           ECE ↓           Acc ↑           ECE ↓
MLP A_gen           63.54 ± 0.31    0.116 ± 0.019   94.84 ± 0.30    0.037 ± 0.007
MLP A_fix           48.25 ± 13.01   0.130 ± 0.015   92.89 ± 1.41    0.047 ± 0.011
Transformer A_gen   64.42 ± 1.75    0.119 ± 0.009   94.78 ± 0.08    0.040 ± 0.003
Transformer A_fix   60.69 ± 0.35    0.100 ± 0.010   94.56 ± 0.08    0.028 ± 0.004
LoRA                63.94 ± 0.21    0.120 ± 0.025   94.99 ± 0.18    0.046 ± 0.001
It surpasses LoRA fine-tuning. However, its calibration remains similar to LoRA's, indicating no inherent calibration advantage from full matrix generation. When the A matrices are fixed (Transformer A_fix), the MCC value decreases, but ECE improves, representing the best calibration among all evaluated configurations. We hypothesize that this calibration improvement is caused by a reduction in effective degrees of freedom: fixing the A matrices constrains the adaptation space, acting as an implicit regularizer that prevents the model from becoming overconfident.

The MLP-based variants follow the same qualitative pattern but exhibit reduced stability. While MLP A_gen maintains competitive performance, fixing the A matrices leads to a substantial drop in performance without improving calibration relative to the Transformer-based fixed configuration.

On SST-2, LoRA achieves the highest predictive accuracy. Both hyper-network variants with generated matrices remain competitive. Again, freezing the A matrices results in a small but consistent decrease in accuracy. However, the Transformer-based fixed configuration yields the lowest ECE, significantly outperforming both LoRA and the fully generated variant.

Contrary to our initial hypothesis, fully generated hyper-network adapters do not systematically improve calibration over LoRA, despite occasionally improving task performance (notably on CoLA). Instead, calibration improvements emerge primarily when the A matrices are fixed. This constraint introduces a structured transformation of the adapter input space, limiting overconfident predictions and acting as a form of implicit regularization. The improvement in calibration is accompanied by a clear trade-off in predictive performance, particularly pronounced on CoLA.

Additionally, as shown in Fig.
2, we observe that extended training consistently leads to worsening calibration across all configurations, even when task metrics continue to improve. This behavior suggests progressive overfitting to the training objective, resulting in sharper predictive distributions and degraded uncertainty estimation.

Fig. 2. Evaluation results on the CoLA and SST-2 benchmarks, reported as Matthews Correlation Coefficient and accuracy (top row) alongside Expected Calibration Error (bottom row), averaged across 3 independent random seeds. A_gen means both matrices A and B are generated, and A_fix means the A matrices are fixed. LoRA [10] is included as a baseline. Fixing matrix A improves model calibration, albeit at the cost of task performance across both datasets.

6 Conclusion

In this work, we investigated parameter-efficient adaptation mechanisms as a novel approach to calibration for transformer-based language models. We analyzed standard Fine-Tuning and LoRA fine-tuning [10] applied to RoBERTa [16], and introduced a hyper-network-based variant in which the low-rank update matrices are generated conditionally on transformer layer identity.

Our evaluation across multiple GLUE benchmarks [24] yields three main findings. First, LoRA provides calibration comparable to full Fine-Tuning while retaining parameter efficiency, though calibration improvements remain task-dependent.
Second, hyper-network-generated low-rank factors yield calibration broadly similar to standard LoRA, suggesting that structural cross-layer coupling alone is insufficient for systematic confidence correction; however, our proposed Transformer-based hyper-network LoRA variant (Transformer A_gen) showed promising results, outperforming standard LoRA on the CoLA benchmark. Third, freezing all of the A matrices (A_fix) modestly improves calibration at the cost of task performance, revealing a tension between representation stability and predictive sharpness. We additionally provide a unified, reproducible implementation of six calibration metrics (ECE, CECE, MCE, ACE, TACE, Brier Score), contributing toward more systematic evaluation standards in transformer calibration research.

Further Work. Future work should investigate the mechanism behind the A_fix calibration improvement — whether it stems from reduced flexibility, additional noise, or modified optimization dynamics. The CoLA advantage of the Transformer hyper-network motivates further study of this architecture. Extending evaluation to multi-class and out-of-distribution benchmarks would clarify whether the observed improvements generalize beyond binary GLUE tasks.

Acknowledgments. The authors are grateful to Dr. Kamil Książek, Dr. Tomasz Kuśmierczyk, and Prof. Jacek Tabor of the Jagiellonian University for their invaluable guidance and for providing access to computational resources.

Disclosure of Interests. The authors have no competing interests to declare that are relevant to the content of this article.

References

1. Baevski, A., Zhou, Y., Mohamed, A., Auli, M.: wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems 33, 12449–12460 (2020)
2.
Brier, G.W.: Verification of forecasts expressed in terms of probability. Monthly Weather Review 78, 1–3 (1950)
3. Desai, S., Durrett, G.: Calibration of pre-trained transformers. arXiv preprint arXiv:2003.07892 (2020)
4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186 (2019)
5. Dosovitskiy, A.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
6. Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q.: On calibration of modern neural networks. In: Proceedings of the 34th International Conference on Machine Learning, Volume 70. pp. 1321–1330. ICML'17, JMLR.org (2017)
7. Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., Smith, N.A.: Don't stop pretraining: Adapt language models to domains and tasks. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pp. 8342–8360. Association for Computational Linguistics, Online (Jul 2020). https://doi.org/10.18653/v1/2020.acl-main.740
8. Ha, D., Dai, A., Le, Q.V.: HyperNetworks. arXiv preprint arXiv:1609.09106 (2016)
9. He, R., Liu, L., Ye, H., Tan, Q., Ding, B., Cheng, L., Low, J., Bing, L., Si, L.: On the effectiveness of adapter-based tuning for pretrained language model adaptation. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). pp.
2208–2222 (2021)
10. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: ICLR. OpenReview.net (2022)
11. Jia, X., De Brabandere, B., Tuytelaars, T., Gool, L.V.: Dynamic filter networks. Advances in Neural Information Processing Systems 29 (2016)
12. Kull, M., Perello Nieto, M., Kängsepp, M., Silva Filho, T., Song, H., Flach, P.: Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration. Advances in Neural Information Processing Systems 32 (2019)
13. Li, X.L., Liang, P.: Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190 (2021)
14. Liang, G., Zhang, Y., Jacobs, N.: Neural network calibration for medical imaging classification using DCA regularization. In: International Conference on Machine Learning, Workshop on Uncertainty and Robustness in Deep Learning (2020)
15. Liu, B., Ben Ayed, I., Galdran, A., Dolz, J.: The devil is in the margin: Margin-based label smoothing for network calibration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 80–88 (2022)
16. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach. In: International Conference on Learning Representations (ICLR) (2020)
17. Lv, K., Yang, Y., Liu, T., Guo, Q., Qiu, X.: Full parameter fine-tuning for large language models with limited resources. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 8187–8198. Association for Computational Linguistics, Bangkok, Thailand (Aug 2024).
https://doi.org/10.18653/v1/2024.acl-long.445
18. Mozafari, A.S., Gomes, H.S., Leão, W., Janny, S., Gagné, C.: Attended temperature scaling: A practical approach for calibrating deep neural networks. arXiv preprint arXiv:1810.11586 (2018)
19. Naeini, M.P., Cooper, G., Hauskrecht, M.: Obtaining well calibrated probabilities using Bayesian binning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 29 (2015)
20. Nixon, J., Dusenberry, M.W., Zhang, L., Jerfel, G., Tran, D.: Measuring calibration in deep learning. In: CVPR Workshops, vol. 2 (2019)
21. Ortiz-Barajas, J.G., Gómez-Adorno, H., Solorio, T.: HyperLoader: Integrating hypernetwork-based LoRA and adapter layers into multi-task transformers for sequence labelling (2024)
22. Platt, J., et al.: Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers 10(3), 61–74 (1999)
23. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
24. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.: GLUE: A multi-task benchmark and analysis platform for natural language understanding. In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. pp. 353–355 (2018)
25. Wang, C.: Calibration in deep learning: A survey of the state-of-the-art. arXiv preprint arXiv:2308.01222 (2023)
26. Zadrozny, B., Elkan, C.: Transforming classifier scores into accurate multiclass probability estimates. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 694–699 (2002)