In-Context Symbolic Regression for Robustness-Improved Kolmogorov–Arnold Networks

Francesco Sovrano¹ [0000-0002-6285-1041], Lidia Losavio¹ [0000-0002-8834-7388], Giulia Vilone² [0000-0002-4401-5664], and Marc Langheinrich¹ [0009-0009-0800-7251]

¹ University of Italian-Speaking Switzerland (USI)
{francesco.sovrano,lidia.anna.maria.losavio,marc.langheinrich}@usi.ch
² Analog Devices International
giulia.vilone@analog.com

Abstract. Symbolic regression aims to replace black-box predictors with concise analytical expressions that can be inspected and validated in scientific machine learning. Kolmogorov–Arnold Networks (KANs) are well suited to this goal because each connection between adjacent units (an "edge") is parametrised by a learnable univariate function that can, in principle, be replaced by a symbolic operator. In practice, however, symbolic extraction is a bottleneck: the standard KAN-to-symbol approach fits operators to each learned edge function in isolation, making the discrete choice sensitive to initialisation and non-convex parameter fitting, and ignoring how local substitutions interact through the full network. We study in-context symbolic regression for operator extraction in KANs and present two complementary instantiations. Greedy in-context Symbolic Regression (GSR) performs greedy, in-context selection by choosing edge replacements according to end-to-end loss improvement after brief fine-tuning. Gated Matching Pursuit (GMP) amortises this in-context selection by training a differentiable gated operator layer that places an operator library behind sparse gates on each edge; after convergence, gates are discretised (optionally followed by a short in-context greedy refinement pass). We quantify robustness via one-factor-at-a-time (OFAT) hyper-parameter sweeps and assess both predictive error and qualitative consistency of recovered formulas.
Across several experiments, greedy in-context symbolic regression achieves up to a 99.8% reduction in median OFAT test MSE.

Code & Data: https://github.com/Francesco-Sovrano/In-Context-Symbolic-Regression-KAN

Keywords: Symbolic Regression · Explainable AI · Kolmogorov–Arnold Networks · Matching Pursuit · Hyper-Parameter Robustness · Scientific Machine Learning

1 Introduction

Explainable AI (XAI) is increasingly expected to support scientific workflows: uncovering functional relationships, proposing compact mechanisms, and producing artefacts that can be inspected and validated [8,24]. In this setting, robustness is essential. If a method produces different analytical expressions for the same dataset under small changes to the random seed, hyper-parameters, architecture width, or representation resolution, then its "explanation" becomes hard to reproduce and difficult to trust [3,1,9].

Symbolic regression is attractive as an intrinsically interpretable modelling paradigm: it returns an explicit symbolic formula that fits the data, rather than a black-box predictor accompanied by post-hoc explanations [23,19,24]. This has driven renewed interest in symbolic regression systems, including modern approaches that combine combinatorial search over expressions with continuous optimisation, simplification, and constant-fitting heuristics [13,25,30,7,14].

Kolmogorov–Arnold Networks (KANs) [17] offer a promising bridge between neural learning and symbolic formulas, grounded in the Kolmogorov–Arnold superposition view [12,4]. Unlike Multilayer Perceptrons (MLPs), KANs place learnable univariate functions on edges and compute node outputs by summation. This design (and extensions for scientific discovery [16]) makes it natural to visualise the learned edge functions, replace them with symbolic operators, and compose them into a final expression.
In practice, however, symbolic extraction remains a computational and methodological bottleneck: many learned numeric edge functions must be converted into discrete choices from an operator library. A common strategy for KANs, which we call AutoSym, processes each edge independently. It samples the learned univariate function from its numerical parametrisation (often a spline) and selects the library operator that best fits the sampled curve, optionally with a simplicity penalty [16]. This pipeline is fragile for two reasons:

1. Instability from isolated curve fitting. When candidate operators include free parameters, e.g., a · sin(bx + c) + d, per-edge fitting is non-convex and sensitive to initialisation and local minima. Moreover, expressive operator families can fit many shapes, so multiple candidates often achieve similar curve-fitting scores, making the choice ambiguous [1,9]. As a result, small changes in KAN initialisation, spline grid, or training can produce different "best" operators for the same edge.

2. Error propagation from ignoring context. Even when two operators fit the local spline equally well, their global effect inside the full KAN can differ. Because AutoSym commits to edge-wise decisions in isolation, it cannot account for interactions between edges. Errors in early symbolic choices may distort downstream computations, forcing later edges to compensate and potentially preventing recovery of the ground-truth composition.

Fig. 1 summarises why isolated per-edge extraction is unstable and can propagate errors through the network.

Fig. 1. Problem overview: isolated per-edge KAN-to-symbol fitting (AutoSym) is unstable and ignores end-to-end context.

A more stable alternative is to evaluate candidates in context.
For a given edge, we temporarily replace its numeric function with a candidate operator, briefly fine-tune the full network, and score the candidate by the resulting end-to-end loss. We then revert to the pre-trial state and repeat this process for the remaining candidates, finally committing to the operator that yields the largest reduction in global loss after short fine-tuning. Repeating this procedure while prioritising the edge that most improves the objective resembles Matching Pursuit (MP), a classic greedy method for sparse approximation that selects atoms from a dictionary to reduce the residual as quickly as possible [21,22,29]. Applied to KAN symbolic regression, this yields a Greedy in-context Symbolic Regression (GSR) procedure that is substantially more stable than isolated per-edge fitting because it selects operators by their end-to-end loss improvement.

The drawback of greedy in-context selection is computational cost, and it still treats symbolic structure as a post-training decision. To address this, we propose Gated Matching Pursuit (GMP). The key idea is to replace the numeric edge parametrisation (e.g., a spline basis) with a gated operator mechanism that places the entire operator library behind a differentiable gate. During training, the model learns gate weights per edge, effectively performing operator selection as part of optimisation. After convergence, we discretise the gates to obtain a single operator per edge and optionally apply a short greedy refinement pass. Our approach is inspired by function-combination mechanisms in KAN-like architectures [28] and by mixture-of-experts gating [10,26]. Continuous relaxations for discrete selection (e.g., Gumbel–Softmax) provide practical tools to encourage near-discrete gates during training [11,20,18].
We evaluate GSR and GMP through the lens of hyper-parameter robustness: we sweep random seeds and architectural choices (notably network width, regularisation strength, and pruning schedule) and study how predictive performance and recovered symbolic structure vary across runs. Experiments are conducted on SRBench [14], a standard benchmark suite for symbolic regression, using tasks from its Feynman collection. We find that AutoSym is highly sensitive under these sweeps, often producing different operators and final expressions for minor perturbations. In contrast, in-context symbolic regression methods (GSR and GMP) generally exhibit higher hyper-parameter robustness in our experiments, leading to more consistent operator recovery and more stable final formulas; GSR is typically the strongest performer, while GMP remains competitive.

Overall, this paper makes the following contributions:

– We identify failure modes of isolated, per-edge spline-to-symbol fitting for KAN symbolic regression and connect them to XAI robustness concerns [8,24,3].
– We formalise a Matching-Pursuit-inspired greedy in-context symbolic regression procedure for KANs that improves robustness by selecting operators via end-to-end loss after short fine-tuning, thereby reducing error propagation [21,29].
– We introduce a gated operator layer that performs amortised in-context operator selection during training, reducing candidate-evaluation cycles; this improves efficiency while retaining much of the robustness benefit of in-context selection [28,26].
– We empirically demonstrate that in-context symbolic regression (GSR, GMP) improves hyper-parameter robustness over isolated per-edge fitting, and often yields lower error with qualitatively more consistent recovered formulas.
– We release a working replication package with the code, sweep definitions, and plotting utilities needed to reproduce the experiments, tables, and figures in this paper: https://github.com/Francesco-Sovrano/In-Context-Symbolic-Regression-KAN.

2 Related Work

This work connects four lines of research: robustness in XAI, symbolic regression, KANs and KAN-to-symbol extraction, and greedy/gated selection mechanisms.

Robustness in explainable modelling. A central distinction in XAI is between post-hoc explanations and intrinsically interpretable models [8,24]. Post-hoc methods such as LIME and SHAP provide feature-attribution explanations for arbitrary predictors [23,19], but their outputs can be sensitive to sampling, perturbation schemes, and other implementation choices [3,1,9]. Prior work therefore emphasises robustness as a practical requirement: explanations should not change substantially under small perturbations of training conditions, random seeds, or model specification when predictive behaviour is similar [27,3]. Symbolic regression produces a model that is itself an explanation. This shifts robustness from a diagnostic property to a core requirement: instability corresponds to recovering different candidate "laws" from the same data, which undermines reproducibility in scientific use [14].

Symbolic regression. Classical symbolic regression traces back to genetic programming and evolutionary search [13], with early systems demonstrating recovery of compact physical relations from data [25]. More recent methods combine discrete search with continuous optimisation and incorporate simplification and constant-fitting heuristics [30,7]. Related directions include sparse model discovery with predefined feature libraries (e.g., SINDy) [6].
Across these approaches, performance depends on the operator set, noise level, and evaluation protocol, motivating standardised benchmarks such as SRBench [14]. In this paper we follow this practice and evaluate on SRBench tasks, with a focus on the hyper-parameter robustness of structure recovery.

KANs and symbolic extraction. KANs were introduced as an alternative to MLPs in which each edge carries a learnable univariate function (commonly parameterised by splines) and nodes aggregate inputs by summation [17]. Follow-up work extended KANs toward scientific discovery, including multiplicative variants and tooling for symbolic conversion [16]. Related architectures explore alternative edge parametrisations and basis functions [2,28]. FastKAN replaces the B-spline basis with Gaussian radial basis functions and shows that spline-based KAN layers can be approximated by radial basis function networks [15]; we include FastKAN as a baseline. A common use case is to train a numeric KAN and then convert it to a closed-form expression by fitting each learned edge function to an operator library and composing the resulting operators [16]. As discussed in our introduction, this per-edge extraction is local and can be sensitive to non-convex parameter fitting and to interactions between edges, motivating methods that evaluate operators in context and/or integrate selection into training.

Greedy pursuit for in-context selection. MP is a greedy procedure for sparse approximation that iteratively selects dictionary atoms to reduce the residual [21]. Variants such as Orthogonal Matching Pursuit (OMP) provide improved recovery in certain regimes [22,29]. We adapt the same "select-and-refine" principle to KAN symbolic extraction: candidates are scored by end-to-end loss after brief fine-tuning, and selections are committed iteratively.
This in-context selection directly targets the global objective, rather than relying on isolated curve fitting.

Gating mechanisms and continuous relaxations. Mixture-of-experts models use gating networks to select or weight expert components [10], with sparsely gated variants enabling efficient scaling by activating only a subset of experts per input [26]. Discrete selection can be approximated with continuous relaxations such as Gumbel–Softmax [11,20], and sparsity can be encouraged via regularisation (e.g., ℓ0-style penalties) [18]. We use gating at the level of KAN edges: each edge maintains a gated mixture over a symbolic operator library during training, which is later discretised to obtain a single operator per edge.

3 Background

This section introduces the notation and components used in the proposed methods: the KAN layer, the MultKAN extension, the standard per-edge symbolic extraction baseline, and MP, which serves as the greedy template underlying GSR and the refinement stage of GMP.

KAN layers. The Kolmogorov–Arnold representation theorem states that continuous multivariate functions on compact domains can be expressed using compositions of univariate functions and addition [12,4]. KANs instantiate this idea by parameterising each edge with a learnable univariate function [17]. Consider a layer mapping x ∈ R^{d_in} into y ∈ R^{d_out}. A KAN layer computes

    y_j = \sum_{i=1}^{d_{in}} \phi_{j,i}(x_i),    (1)

where \phi_{j,i} : R → R is the learnable 1D function associated with edge (i → j). In the original formulation, each \phi_{j,i} is represented by a spline basis [17,5]. Stacking layers yields compositions of these edge functions across depth.

MultKAN. Many scientific expressions involve explicit products. MultKAN extends KANs by adding multiplication modules (or multiplication nodes) so that multiplicative interactions are represented directly [16].
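As a toy illustration of Eq. 1, a single additive KAN layer can be sketched in a few lines (a minimal sketch with hypothetical names; a real implementation parameterises each \phi_{j,i} with trainable splines rather than fixed callables):

```python
import numpy as np

def kan_layer(x, edge_fns):
    """Additive KAN layer (Eq. 1): y_j = sum_i phi_{j,i}(x_i).

    x        : input vector of shape (d_in,)
    edge_fns : nested list where edge_fns[j][i] is the univariate
               function phi_{j,i} attached to edge (i -> j)
    """
    return np.array([sum(edge_fns[j][i](x[i]) for i in range(len(x)))
                     for j in range(len(edge_fns))])

# With phi_{0,0} = sin and phi_{0,1} = square, the single output is
# sin(x_0) + x_1^2 -- the running example used later in Section 4.
edges = [[np.sin, lambda t: t ** 2]]
y = kan_layer(np.array([0.0, 3.0]), edges)
```

Stacking such layers composes the edge functions across depth, which is what symbolic extraction later has to unravel.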
Our methods apply to both KAN and MultKAN; unless otherwise stated, we use the additive KAN notation in Eq. 1.

Per-edge symbolic extraction baseline. Let \phi_{j,i} denote a trained edge function represented numerically (e.g., by spline coefficients). The standard KAN-to-symbol baseline approximates \phi_{j,i} by selecting an operator family from a library:

    \phi_{j,i}(x) ≈ g_k(x; θ),

where g_k is the k-th candidate operator and θ are its continuous parameters (e.g., scale/shift/frequency).

Examples of operator families. A typical library contains a set of univariate primitives with affine reparametrisations. For example:

    g_sin(x; θ) = a sin(bx + c) + d,
    g_exp(x; θ) = a exp(bx + c) + d,
    g_log(x; θ) = a log(bx + c) + d   (with bx + c > 0),

where θ collects the corresponding coefficients (i.e., θ = (a, b, c, d)). Here, g_k denotes the symbolic form (e.g., "sine"), while θ captures the fitted continuous parameters for that form.

Baseline fitting procedure. Given samples {(x_t, \phi_{j,i}(x_t))}_{t=1}^{T} from the numeric edge function, the baseline fits θ for each candidate family by minimising a local regression loss, e.g.

    \min_θ (1/T) \sum_{t=1}^{T} ( \phi_{j,i}(x_t) − g_k(x_t; θ) )²,

and then selects the candidate k with the best local score, i.e., lowest Mean Squared Error (MSE) or highest R², optionally with a simplicity penalty [16]. In the experiments reported here we set that simplicity weight to zero. The reason is technical: on the bounded domains induced by the training data, several operator families can fit the same sampled edge almost equally well after affine reparametrisation, so a hand-assigned complexity score can dominate near-ties for reasons unrelated to end-to-end fidelity.
Because such scores depend on the chosen library and are not invariant to algebraically equivalent representations, they can systematically steer AutoSym toward a cheaper-but-wrong family. We therefore disable this heuristic to isolate the effect of local-versus-in-context evaluation.

Baseline limitations. Per-edge fitting is often non-convex in θ, and several operator families can fit the same sampled curve comparably well under affine reparametrisations, e.g., a · sin(bx + c) + d. As a result, the selected operator can depend on initialisation and optimisation details. Moreover, a good local fit does not guarantee that replacing \phi_{j,i} by g_k(·; θ) preserves end-to-end performance once the edge is inserted back into the full network.

Matching Pursuit. MP is a greedy method for building a sparse approximation from a fixed library (dictionary) of candidate components [21]. Starting from an initial approximation, it repeatedly (i) selects the single candidate that yields the largest improvement according to a criterion, and (ii) updates the approximation before proceeding to the next selection. Orthogonal variants such as OMP modify the update step to re-estimate coefficients after each selection [22,29].

In this paper, MP is used as an algorithmic template rather than a signal-processing tool. The "dictionary" is the symbolic operator library, and candidates are instantiated by assigning an operator family to a specific edge (with its parameters fitted). The improvement criterion is not the residual norm, but the end-to-end loss improvement of the whole KAN. This is why GSR evaluates a candidate by inserting it into the network, fitting its continuous parameters (and, if needed, allowing a short end-to-end re-fit of affected parameters), and measuring the resulting loss. The same select-update structure also motivates the optional greedy refinement pass applied after gate discretisation in GMP.
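The baseline's near-tie problem is easy to reproduce. The sketch below (our own illustration, not the AutoSym code) freezes the inner parameters b = 1, c = 0 and fits only the outer affine pair (a, d) by linear least squares; even so, several operator families fit a sampled sine edge on a narrow domain almost equally well:

```python
import numpy as np

def fit_outer_affine(g, xs, ys):
    """Fit y ~ a * g(x) + d by linear least squares. The real baseline
    also fits the inner (b, c), which makes the problem non-convex; we
    freeze them here to keep the sketch convex and deterministic."""
    A = np.stack([g(xs), np.ones_like(xs)], axis=1)
    coef, *_ = np.linalg.lstsq(A, ys, rcond=None)
    mse = float(np.mean((A @ coef - ys) ** 2))
    return coef, mse

# A "learned edge" sampled on a narrow range: on [-0.5, 0.5] the curves
# of sin, tanh and the identity are nearly affine-equivalent, so their
# local fitting scores are close -- the near-tie ambiguity discussed above.
xs = np.linspace(-0.5, 0.5, 200)
ys = np.sin(xs)
scores = {name: fit_outer_affine(g, xs, ys)[1]
          for name, g in [("sin", np.sin),
                          ("tanh", np.tanh),
                          ("identity", lambda x: x)]}
```

All three scores are tiny, so small perturbations of training or sampling can flip the arg-min; in-context evaluation breaks such ties by global loss instead.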
4 Proposed Methods

We aim to turn a trained numeric KAN whose edges are spline functions into an interpretable symbolic KAN whose edges come from a small operator library (e.g., sin, polynomials, exp). We propose two complementary conversion strategies. GSR is a post-hoc procedure that converts one edge at a time by trying candidate operators, briefly fine-tuning the whole network, and committing the operator that yields the best end-to-end loss. GMP accelerates this search by learning soft operator gates during training, pruning each edge to a small top-k candidate set, discretising, and optionally running a short greedy refinement restricted to those candidates. Fig. 2 summarises the workflow.

Fig. 2. Method overview: GSR selects operators by end-to-end loss improvement; GMP amortises in-context selection via sparse operator gates during training, then discretises (optionally refined by a short greedy pass) to reduce candidate-trial cost.

4.1 Problem formulation

We start from a trained numeric KAN (or MultKAN) model f_num whose edge functions \phi^{num}_{j,i} : R → R are represented numerically (in practice, splines; cf. Eq. 1). We seek a functionally similar but more interpretable model f_sym obtained by replacing a subset of these spline edges with closed-form operators. Let L = {g_k}_{k=1}^{K} be a library of univariate operator forms (e.g., sin, exp, log, polynomials; see Section 3). When an edge e = (i → j) is converted, we represent it as

    \phi^{sym}_{j,i}(x) = α_e g_{k_e}(β_e x + γ_e) + δ_e,    (2)

where k_e ∈ {1, ..., K} selects the operator form and (α_e, β_e, γ_e, δ_e) are learnable affine parameters that absorb scale/shift. We call an edge numeric if it is still spline-based, and symbolic once it takes the form in Eq. 2.
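A converted edge in the form of Eq. 2 is just an operator form wrapped in a learnable affine reparametrisation; a minimal sketch (the class name is ours):

```python
import math

class SymbolicEdge:
    """Symbolic edge (Eq. 2): phi(x) = alpha * g(beta * x + gamma) + delta,
    where g is one operator form from the library L and (alpha, beta,
    gamma, delta) are the learnable affine parameters absorbing scale/shift."""

    def __init__(self, g, alpha=1.0, beta=1.0, gamma=0.0, delta=0.0):
        self.g = g
        self.alpha, self.beta = alpha, beta
        self.gamma, self.delta = gamma, delta

    def __call__(self, x):
        return self.alpha * self.g(self.beta * x + self.gamma) + self.delta

# k_e = "sin" with alpha = 2 and delta = 1 represents 2 * sin(x) + 1:
edge = SymbolicEdge(math.sin, alpha=2.0, delta=1.0)
```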
Our objective is to obtain f_sym that preserves predictive performance on held-out data (e.g., J(f_sym; D_val) ≤ J(f_num; D_val) + ε), while favouring stable operator choices across seeds/hyper-parameters and an efficient conversion budget. When a domain-specific complexity prior is well defined, it can be added on top of this objective, but we do not enforce such a prior in the experiments of this paper.

Running example. If the ground truth is y = sin(x_1) + x_2², a trained numeric KAN typically learns spline edges that approximate sin(·) and (·)² on the data range. Our methods replace those spline edges with the corresponding operator forms from L, while briefly re-optimising the full network to account for interactions between edges.

4.2 Greedy in-context Symbolic Regression

Edge ranking (existing KAN heuristic). At each iteration, we prioritise which numeric edge to convert using an importance score s_e (as commonly used for pruning in KANs). Greedily, we select the single remaining numeric edge with the highest score, e* = arg max_{e numeric} s_e, so early conversions target the most influential edges. In our implementation, s_e is recomputed from the current network state before each selection (optionally amortised by updating only after committing a batch of edges).

In each iteration, GSR selects the most important remaining numeric edge e* (according to the importance score s_e), then evaluates candidate operators in context. Concretely, for each g ∈ L (or a pruned subset) we temporarily replace e* with g (including affine parameters), fine-tune the full network for τ steps, measure the resulting end-to-end loss on a small validation split, and restore the original parameters. We finally commit the operator that yields the lowest loss and proceed to the next edge.
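The trial loop just described can be condensed into a sketch (a hypothetical interface of our own; the fine-tuning step is stubbed out as a no-op here, whereas the real procedure runs τ optimisation steps per candidate):

```python
import numpy as np

def in_context_select(edges, target_edge, library, loss_fn, fine_tune):
    """One GSR selection: try each candidate on `target_edge`, briefly
    adapt the whole model, score the end-to-end loss, and commit only
    the best candidate. Trials never mutate the incoming `edges`."""
    best_g, best_loss = None, float("inf")
    for g in library:
        trial = dict(edges)           # snapshot of the pre-trial state
        trial[target_edge] = g
        trial = fine_tune(trial)      # stand-in for tau fine-tuning steps
        loss = loss_fn(trial)
        if loss < best_loss:
            best_g, best_loss = g, loss
    committed = dict(edges)
    committed[target_edge] = best_g   # commit the winner only
    return committed, best_loss

# Toy model for y = sin(x0) + x1^2: edge "e1" is still "numeric"
# (a crude identity stand-in) and is converted by global loss.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(256, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2

def loss_fn(edges):
    pred = edges["e0"](X[:, 0]) + edges["e1"](X[:, 1])
    return float(np.mean((pred - y) ** 2))

edges = {"e0": np.sin, "e1": lambda t: t}
library = [np.sin, np.exp, lambda t: t ** 2]
edges, loss = in_context_select(edges, "e1", library, loss_fn,
                                fine_tune=lambda e: e)
```

Here the square candidate drives the global loss to zero and is committed, even though, locally, several operators might fit a spline stand-in of (·)² comparably well.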
This procedure, formalised as Algorithm 1, mitigates error propagation because each symbolic choice is evaluated in the context of all previously committed symbolic choices and the remaining numeric edges. It also mitigates initialisation sensitivity: rather than selecting the operator that best fits a spline in isolation (where local minima dominate), we select the operator that yields the best global objective after brief adaptation.

Complexity of naive GSR. Let K = |L| be the library size and τ be the fine-tuning budget per trial. Naively, each symbolic selection costs K trial runs, each with τ optimisation steps, yielding O(Kτ) steps per converted edge. For M edges, this is O(MKτ) optimisation steps, often unacceptable in practice.

4.3 Gated operator layers for amortised in-context selection

To reduce the number of costly in-context trial runs required by GSR, we integrate operator selection directly into training by introducing gated operator layers. This yields GMP, an amortised variant of in-context selection: rather than explicitly trying every operator for every edge, each edge maintains a differentiable mixture over the operator library, and training learns both (i) which operator(s) to prefer and (ii) the corresponding continuous parameters. After training, we discretise the gates to obtain a single symbolic operator per edge
and optionally apply a short greedy refinement pass restricted to the retained candidates.

Algorithm 1 Greedy in-context Symbolic Regression for KANs
Require: Trained numeric KAN f, dataset D, operator library L, fine-tuning steps τ, max symbolic edges M
Ensure: Symbolic KAN model f_sym
1:  Compute initial edge importance scores {s_e} for all edges e
2:  Initialise set of converted edges S ← ∅
3:  for m = 1 to M do
4:      Recompute/update edge importance scores {s_e} (optional)
5:      Select next edge e* ← arg max_{e ∉ S} s_e
6:      Initialise best loss J* ← +∞ and best operator g* ← None
7:      for each candidate operator g ∈ L (or a pruned subset) do
8:          Snapshot model parameters (including all edges)
9:          Replace edge e* with operator g (initialise/retain affine params)
10:         Fine-tune full model for τ steps on D
11:         Evaluate end-to-end loss J ← J(f; D_val)
12:         Restore snapshot
13:         if J < J* then
14:             J* ← J, g* ← g
15:         end if
16:     end for
17:     Commit: replace edge e* with g* permanently
18:     Optional: brief fine-tuning after committing
19:     Update S ← S ∪ {e*}
20: end for
21: return f as f_sym

Gated edge parametrisation. For each edge (i → j), instead of a spline \phi_{j,i} we use a soft selection over the operator library L = {g_k}_{k=1}^{K}:

    \phi_{j,i}(x) = \sum_{k=1}^{K} π^{(j,i)}_k [ α^{(j,i)}_k g_k(β^{(j,i)}_k x + γ^{(j,i)}_k) + δ^{(j,i)}_k ],    (3)

where π^{(j,i)} ∈ Δ^{K−1} is a probability vector over operators and (α, β, γ, δ) are per-operator affine parameters that absorb scale/shift, matching the symbolic edge form in Eq. 2. We parameterise the gate via logits ℓ^{(j,i)} ∈ R^K and a softmax:

    π^{(j,i)}_k = exp(ℓ^{(j,i)}_k) / \sum_{r=1}^{K} exp(ℓ^{(j,i)}_r).    (4)

Intuitively, each edge carries a mixture of candidate symbolic operators, and optimisation increases the weight of operators that reduce end-to-end loss in context (because the mixture is trained as part of the full network).

Stabilising the gate with variance compression.
Different operators can produce outputs with very different scales and heavy-tailed responses, which can destabilise optimisation and make the logits ℓ^{(j,i)} overly sensitive to outliers. To mitigate this, we compress each operator output z with a scaled asinh transform before mixing:

    \tilde{z} = s · asinh(z / s),    (5)

where the scale parameter log s is learned jointly with (ℓ, α, β, γ, δ) (optionally per edge/operator). This transform is approximately linear for |z| ≪ s and grows only logarithmically for |z| ≫ s, damping extreme values while preserving small variations.

Algorithm 2 Gated Matching Pursuit
Require: Dataset D, operator library L, gated KAN architecture, training steps T, pruning schedule, top-k value k, refinement steps τ
Ensure: Symbolic KAN f_sym
1: Initialise gated KAN: each edge is a mixture over L (Eq. 3)
2: for t = 1 to T do
3:     Update network parameters by minimising J(·; D) + λ_ent R_ent + λ_ℓ1 R_ℓ1
4:     if pruning step then
5:         For each edge, keep top-k operators by gate probability; mask out the rest
6:     end if
7: end for
8: Discretise: for each edge, choose g* = arg max_{g ∈ L} π(g); replace mixture with g*
9: Optional refinement: run GSR (Alg. 1) but restrict each edge's candidate set to its retained top-k
10: return symbolic model f_sym

Encouraging sparsity and enabling discretisation. To obtain a final symbolic model, each edge should select (approximately) a single operator. We encourage near-discrete gates with two complementary heuristics: (i) entropy regularisation, R_ent = \sum_{(j,i)} H(π^{(j,i)}), to favour peaky distributions; and (ii) periodic top-k pruning, which keeps only the k highest-probability operators per edge and masks out the rest.

Training, pruning, and discretisation.
GMP trains the gated KAN end-to-end by minimising

    J(·; D) + λ_ent R_ent + λ_ℓ1 R_ℓ1,

where R_ℓ1 is an ℓ1 penalty on the gate parameters. During training, we periodically prune each edge to its top-k operators according to π^{(j,i)}. A hand-crafted simplicity term could be added here in principle, but we keep it disabled in all reported experiments for the same identifiability reasons discussed for AutoSym above. After training, we discretise each edge by taking arg max_k π^{(j,i)}_k, replacing the mixture in Eq. 3 with that single operator and retaining its learned affine parameters.

Optional greedy refinement (restricted GSR). Although gating already performs in-context selection during training, discretisation is a hard decision and may occasionally pick between near-ties. To validate and potentially correct these discrete choices, we optionally run a short greedy refinement pass using GSR (Algorithm 1), but restricting each edge's candidate set to its retained top-k operators (as determined by pruning). This preserves the computational benefits of GMP while adding a targeted in-context check at the end.

Why GMP should improve robustness over isolated per-edge fitting. GMP avoids isolated spline-to-operator curve fitting and instead learns operator preferences jointly with the full network objective, reducing sensitivity to local minima and per-edge initialisation. Sparsity objectives and top-k pruning further reduce ambiguity among similarly fitting operators and help prevent oscillation. Finally, the optional restricted GSR refinement evaluates discrete operator choices explicitly in context, correcting occasional discretisation errors at low additional cost.
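A single gated edge (Eqs. 3-5) can be sketched as follows (our own minimal NumPy illustration; a real implementation would use a differentiable framework so that the logits, affine parameters, and log s all receive gradients):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()         # numerically stable softmax (Eq. 4)
    e = np.exp(z)
    return e / e.sum()

def gated_edge(x, logits, ops, affine, s=1.0):
    """Gated edge forward pass: mix the operator library with gate
    probabilities pi (Eq. 3), compressing each operator output with the
    scaled asinh transform s * asinh(z / s) before mixing (Eq. 5)."""
    pi = softmax(logits)
    out = 0.0
    for k, g in enumerate(ops):
        a, b, c, d = affine[k]
        z = a * g(b * x + c) + d
        out += pi[k] * s * np.arcsinh(z / s)
    return out, pi

def gate_entropy(pi):
    """H(pi), penalised via R_ent to favour peaky, near-discrete gates."""
    return float(-(pi * np.log(pi + 1e-12)).sum())

ops = [np.sin, np.exp, lambda t: t ** 2]
affine = [(1.0, 1.0, 0.0, 0.0)] * len(ops)
out, pi = gated_edge(0.3, np.array([8.0, 0.0, 0.0]), ops, affine)
```

Discretisation then replaces the mixture with arg max_k π_k; with logits (8, 0, 0) the gate already concentrates almost all mass on the first operator, so its entropy is far below the uniform value ln K.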
5 Experiments

We evaluate our methods with a focus on robustness: whether small, routine changes in training and conversion settings lead to large changes in predictive performance and in the recovered symbolic model.

5.1 Tasks and data protocol

We use regression datasets from the SRBench Feynman benchmark [14], a standard benchmark suite for symbolic regression. We specifically chose the Feynman with-units suite because it provides controlled scientific symbolic-regression tasks with known closed-form targets, making it suitable for studying symbolic recovery in KANs. Within that suite, we evaluate 10 targets (I.10.7, I.12.1, I.13.4, II.34.29a, I.9.18, I.12.4, I.34.1, II.6.15a, II.6.15b, and II.21.32), selected without further per-problem tuning so as to keep the OFAT study computationally tractable while reducing cherry-picking. Each dataset contains multiple input variables and a single scalar target.

For each dataset, we construct a training and test split by sampling up to 2000 training points and 1000 test points from the available rows (when a dataset contains fewer rows, we use the maximum available under these caps). Unless otherwise stated, the split is obtained by a seeded random permutation of rows. Predictive performance is measured by test MSE.

5.2 Compared pipelines

All methods use the same univariate operator library L with K = 25 operator forms: constants and identity; polynomial powers x²–x⁵; inverse powers 1/x–1/x³; √· and 1/√·; log and exp; sin, cos, tan, tanh; |·| and sgn; arctan; arcsin; arccos; arctanh; and a Gaussian primitive exp(−x²). Each symbolic edge additionally includes a learnable affine reparametrisation as in Eq. 2.
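For concreteness, such a library can be declared as name-to-callable pairs; the sketch below covers 9 of the 25 forms (the domain guards on log and √· are our implementation choice, not specified in the text, so that trial evaluations on arbitrary edge inputs stay finite):

```python
import numpy as np

# Subset of the K = 25 operator library (the full library also includes
# higher powers, inverse powers, arc-functions, |.|, sgn, and constants).
LIBRARY = {
    "identity": lambda x: x,
    "square":   lambda x: x ** 2,
    "sqrt":     lambda x: np.sqrt(np.clip(x, 0.0, None)),   # guard: x >= 0
    "log":      lambda x: np.log(np.clip(x, 1e-8, None)),   # guard: x > 0
    "exp":      np.exp,
    "sin":      np.sin,
    "cos":      np.cos,
    "tanh":     np.tanh,
    "gauss":    lambda x: np.exp(-x ** 2),
}
```

Each entry would then be wrapped in the affine reparametrisation of Eq. 2 before insertion into the network.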
Using a relatively large library makes operator selection substantially more challenging; in particular, it increases the combinatorial ambiguity faced by post-hoc selection and makes GMP harder to optimise, since learning sparse and confident gates becomes more complex as K grows.

We compare five pipelines:

1. AutoSym (baseline). Train a numeric MultKAN [16] whose edges are spline functions, then replace each remaining active edge independently by fitting candidate operators to its learned one-dimensional curve and selecting the best local fit. In our experiments we disable any explicit complexity bias in this local selection to isolate the effect of in-context evaluation (simplicity weight = 0).
2. FastKAN + AutoSym. Same post-hoc per-edge extraction as AutoSym, but the numeric edge parametrisation uses radial basis functions instead of splines [15].
3. GSR (cf. Section 4.2). Train the same numeric model as AutoSym, then perform greedy in-context symbolic regression: iteratively choose one still-numeric edge using the model's edge-importance scores, try candidate operators on that edge, briefly refit the full model, and commit the operator that yields the lowest loss after this short refit.
4. FastKAN + GSR (cf. Section 4.2). Same greedy in-context conversion as GSR, starting from a radial-basis numeric parametrisation [15].
5. GMP (cf. Section 4.3). Train a model whose edges are differentiable gated mixtures over the operator library, with gate sparsity encouraged by entropy regularisation and periodic top-k pruning. During pruning rounds, each edge is restricted to a small shortlist of operators according to its gate weights. After training, we discretise each edge by selecting the operator with the largest gate probability and retaining its learned affine parameters.
We then optionally apply a short restricted greedy refinement pass, using GSR only over each edge's retained top-k candidates. This yields an efficiency-oriented in-context baseline that reduces candidate-trial cost while preserving much of the robustness benefit of in-context selection.

5.3 Training schedule, sensitivity sweep, and reported metrics

Model family and training schedule. All methods share the same base architecture: a single hidden MultKAN layer followed by a scalar output. The hidden layer contains m additive units and two multiplication units, i.e. width [m, 2], where m is varied in the sweep. For spline-based models, inputs are mapped onto a fixed grid resolution of 20 knots/centres (depending on the numeric parametrisation) and we use cubic B-splines (degree 3). The grid range is set per dataset to the minimum and maximum observed in the training inputs (global min/max across all input dimensions).

Each run follows the same multi-stage schedule: an initial fit without regularisation, followed by several prune-and-refit cycles with regularisation enabled, then a final non-regularised fit. Symbolic extraction (AutoSym or greedy conversion) is applied after this final fit, and we perform a short final polishing fit afterwards. Training uses the Adam optimizer with learning rate 10^−2. Each fit stage uses a fixed budget of 200 optimisation steps. Greedy candidate evaluations use a budget of 100 steps per candidate. During prune-and-refit cycles, pruning uses a node threshold of 0.1; edge-threshold pruning is disabled (edge threshold = 0.0). The regularised refit uses the same optimiser and step budget as the non-regularised stages, differing only by the regularisation coefficient λ (below). For GMP, gate sparsity is encouraged with an entropy penalty weight of 10^−3 and an ℓ1 gate penalty weight of 10^−2.
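The greedy in-context conversion used by GSR (pipelines 3–4) can be illustrated on a deliberately tiny stand-in model. Here the "network" is just a linear combination of chosen operators and the brief end-to-end refit is replaced by a least-squares solve; the real method instead orders edges by importance scores and refits the full KAN with Adam for a fixed step budget. All names below are our own.

```python
import numpy as np

# Candidate operators for each "edge" of the toy model.
CANDS = {"sin": np.sin, "square": lambda x: x ** 2, "exp": np.exp}

def refit(ops, x, y):
    """Least-squares refit of the linear coefficients given the chosen
    operators (stands in for the short gradient-based refit in the paper);
    returns the resulting training MSE."""
    A = np.stack([g(x) for g in ops], axis=1)
    c, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.mean((A @ c - y) ** 2)

def greedy_select(x, y, n_edges=2):
    """Greedy in-context selection: for each still-numeric edge, try every
    candidate, refit the whole model, and commit the lowest-loss operator."""
    chosen = []
    for _ in range(n_edges):                       # one edge at a time
        best = min(CANDS, key=lambda name: refit(
            [CANDS[n] for n in chosen] + [CANDS[name]], x, y))
        chosen.append(best)                        # commit the in-context winner
    return chosen

x = np.linspace(-2, 2, 200)
y = 0.5 * np.sin(x) + 2.0 * x ** 2                 # ground truth: sin + square
```

On this toy target, `greedy_select(x, y)` first commits the operator that best explains the dominant term and then the one that cleans up the residual, exactly the "evaluate in context, then commit" pattern of Algorithm 1.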
Gate pruning uses an initial cap of 10 operators per edge and decreases to the final shortlist size (top-k) of 5 across pruning cycles. In all reported GMP experiments, we enabled the optional restricted-GSR refinement after gate discretisation; this refinement used the same short candidate-evaluation budget as GSR and was restricted to each edge's retained top-k shortlist.

One-factor-at-a-time sensitivity sweep. We operationalise robustness as low sensitivity of predictive performance and recovered formulas to routine experimental perturbations. To measure this sensitivity, we use a one-factor-at-a-time sweep around a reference configuration. For each dataset we vary exactly one factor at a time while holding the others fixed:

– Hidden width: number of additive units m ∈ {5, 10, 20, 50, 100} (multiplication units fixed to 2).
– Regularisation strength λ: {10^−4, 10^−3, 10^−2, 10^−1} during prune-and-refit cycles.
– Number of pruning cycles: {1, 3, 5}.
– Random seed: {1, 2, 3}, controlling initialisation and the random train-test split.

The reference configuration is (m, λ, #cycles, seed) = (5, 10^−2, 3, 1). This is the default anchor setting used in three places: it is the unperturbed point around which each OFAT sweep is constructed, it is the configuration repeated across the factor-specific sweeps, and it is the fixed setting used for the seed-only repeats reported in Table 1. Counting repeated appearances of that anchor yields 15 runs per dataset (12 unique configurations), and we run all five pipelines for each run.

Metrics. For each method and dataset we report:

– Predictive accuracy: test MSE.
– Sensitivity-based robustness proxy: the distribution of test MSE over the sweep, summarised by the median (and dispersion via quartiles). Lower, tighter distributions indicate greater robustness to the perturbed factor.

Statistical comparison.
For each dataset, we select the method with the lowest median test MSE as the reference. We then compare the reference to each other method using a one-sided Mann–Whitney U test with alternative hypothesis MSE(ref) < MSE(other). We correct for multiple comparisons (reference vs. each competitor) using Holm correction. To complement significance testing, we report effect size using Cliff's δ with 95% bootstrap confidence intervals.

6 Results

We report two complementary views of sensitivity, which together support our robustness claims. First, Table 1 gives a seed-sensitivity snapshot at the fixed reference configuration, summarised as mean ± std over repeated seeds. Second, Figure 3 visualises the OFAT hyper-parameter sensitivity distributions obtained by varying width, λ, and the number of pruning cycles around that same reference configuration; Table 2 summarises their median performance relative to the AutoSym baseline, and Table 3 reports the corresponding distribution-level statistical comparisons. Figure 3 is therefore not an average over the seed repeats in Table 1, and Table 1 is not an average of the points shown in Figure 3. The rankings can differ because the two summaries answer different questions.

6.1 Seed sensitivity at the reference configuration

Table 1 reports test MSE as mean ± std over available seeds for the reference configuration only; this is a seed-sensitivity snapshot rather than an OFAT summary. Entries marked with † use fewer than three successful seeds, and some GMP runs are unavailable under the default settings.
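The statistical protocol described in Section 5.3 (one-sided Mann–Whitney U, Holm correction, Cliff's δ, and a bootstrap CI for the median gap) can be sketched as follows. The helper names (`holm`, `cliffs_delta`, `median_gap_ci`) are our own and the toy arrays are illustrative, not the paper's data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def holm(pvals):
    """Holm step-down correction: multiply the i-th smallest p-value by
    (m - i), enforce monotonicity over the sorted sequence, cap at 1."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    m = len(p)
    adj = np.empty(m)
    running = 0.0
    for rank, idx in enumerate(order):
        running = max(running, (m - rank) * p[idx])
        adj[idx] = min(1.0, running)
    return adj

def cliffs_delta(a, b):
    """Cliff's delta: P(a > b) - P(a < b); negative favours smaller `a`."""
    a, b = np.asarray(a), np.asarray(b)
    diff = a[:, None] - b[None, :]
    return (np.sum(diff > 0) - np.sum(diff < 0)) / (len(a) * len(b))

def median_gap_ci(other, ref, n_boot=2000, seed=0):
    """Bootstrap 95% CI for median(other) - median(ref)."""
    rng = np.random.default_rng(seed)
    gaps = [np.median(rng.choice(other, len(other))) -
            np.median(rng.choice(ref, len(ref))) for _ in range(n_boot)]
    return np.percentile(gaps, [2.5, 97.5])

# One-sided test that the reference method's MSEs are stochastically smaller.
ref = np.array([0.01, 0.02, 0.015, 0.012])
other = np.array([0.2, 0.5, 0.3, 0.4])
p = mannwhitneyu(ref, other, alternative="less").pvalue
```

In the paper's protocol, `holm` is applied per dataset across the reference-vs-competitor p-values, and the CI is reported for "median(other) − median(ref)", so positive intervals favour the reference.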
Across the 10 datasets, the lowest mean test MSE at this fixed setting is achieved by FastKAN+GSR on I.10.7, I.12.1, I.12.4, and I.34.1; by GSR on I.13.4, I.9.18, and II.6.15b; by FastKAN+AutoSym on II.34.29a and II.21.32; and by GMP on II.6.15a, although that last result is based on fewer successful seeds and should be interpreted cautiously. These fixed-setting means are not directly comparable to the OFAT medians in Figure 3: the latter aggregate all valid configurations in the sweep, whereas Table 1 averages only repeated seeds of one selected setting.

The missing or limited GMP entries are consistent with a schedule-mismatch failure mode under the shared training protocol: gated operator layers can separate more slowly than the greedy post-hoc variants, so pruning may remove viable operator paths before the gates have stabilised. We therefore treat these cases as procedural failures under the current schedule rather than as definitive evidence that GMP cannot fit the task; a longer pre-pruning phase or a milder pruning schedule may recover some of them. Overall, the in-context greedy pipelines (GSR and FastKAN+GSR) are the strongest and most consistently competitive at the reference setting, while several post-hoc pipelines exhibit substantial stochastic variability on selected targets. Importantly, the standard deviations reveal substantial stochastic variability for some pipelines and datasets (e.g., AutoSym on I.13.4), motivating the broader OFAT robustness analysis below.

6.2 Hyper-parameter sensitivity under OFAT sweeps

We analyse sensitivity to hyper-parameters by aggregating each method's test MSE values across the OFAT sweep dimensions of hidden width, λ, and the number of pruning cycles (we exclude the explicit "seed" factor here, since seed

Table 1.
Seed sensitivity at the reference configuration (width [5, 2], λ = 10^−2, and three pruning cycles). We report test MSE as mean ± std over available seeds for this fixed setting only. This table quantifies sensitivity to stochasticity at one chosen configuration; lower means and smaller standard deviations indicate greater robustness, but it is not an average of the OFAT sweep in Figure 3. Entries marked with † use fewer than three successful seeds; for single-seed cases, only the observed value is reported. Entries marked N/A indicate that no successful run was obtained under the shared training/pruning schedule. Lower is better. The lowest mean is in bold; the second-lowest mean is underlined.

Feynman Dataset | AutoSym | FastKAN+AutoSym | GSR | FastKAN+GSR | GMP
I.13.4 | 1.31e3 ± 2.21e3 | 1.29e1 ± 1.77e1 | 7.35e-1 ± 3.94e-1 | 1.42e0 ± 6.61e-1 | N/A
I.10.7 | 6.14e-2 ± 1.38e-2 | 8.49e-1 ± 8.64e-1 | 2.60e-3 ± 2.50e-3 | 1.90e-3 ± 3.00e-3 | 1.14e-2 ± 4.20e-3
I.12.1 | 1.07e2 ± 1.43e2 | 4.86e-1 ± 4.79e-1 | 7.18e-2 ± 2.25e-2 | 1.30e-2 ± 1.31e-2 | 1.25e0 ± 9.99e-1
II.34.29a | 7.18e-4 ± 1.23e-4 | 8.60e-5 ± 1.07e-4 | 6.43e-4 ± 3.50e-4 | 5.99e-4 ± 3.77e-4 | 1.18e-2 ± 1.38e-2
I.9.18 | 1.40e-1 ± 2.15e-1 | 6.05e-3 ± 9.40e-3 | 2.89e-4 ± 1.53e-4 | 1.57e-3 ± 2.01e-3 | N/A
I.12.4 | 4.64e-4 ± 3.70e-4 | 1.12e-4 ± 1.86e-4 | 1.24e-4 ± 2.50e-5 | 8.20e-5 ± 6.60e-5 | 1.73e-4 ± 3.60e-5
I.34.1 | 1.27e-1 ± 1.03e-1 | 3.39e-1 ± 4.75e-1 | 2.02e-2 ± 3.16e-2 | 8.77e-3 ± 1.20e-2 | 3.53e-2 ± 2.39e-2
II.6.15a | 1.28e-1 ± 1.12e-1 | 4.51e0 ± 7.81e0 | 3.13e-3 ± 2.33e-3 | 2.96e-3 ± 1.19e-3 | 2.22e-3 †
II.6.15b | 7.67e-4 ± 2.12e-4 | 3.86e-4 ± 4.29e-4 | 3.25e-4 ± 2.28e-4 | 5.00e-4 ± 3.53e-4 | 8.33e-4 ± 9.80e-5
II.21.32 | 1.15e-3 ± 1.50e-3 † | 1.00e-5 ± 4.00e-6 † | 7.80e-5 ± 5.20e-5 | 8.80e-5 ± 4.00e-5 | 1.02e-3 †

sensitivity is already reported in Table 1). Figure 3 shows the resulting hyper-parameter sensitivity distributions as violin plots on a log-scale y-axis (necessary due to the wide dynamic range of MSE).
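The sensitivity-based robustness proxy used throughout (the median and dispersion of the OFAT test-MSE distribution) can be computed in a few lines; the numbers below are illustrative only, not the paper's data.

```python
import numpy as np

def ofat_summary(mses):
    """Summarise one pipeline's OFAT test-MSE distribution by its median and
    interquartile range; lower and tighter means more robust."""
    mses = np.asarray(mses, dtype=float)
    q1, med, q3 = np.percentile(mses, [25, 50, 75])
    return {"median": med, "iqr": q3 - q1, "n": len(mses)}

# Hypothetical OFAT runs for two pipelines on one dataset.
runs = {"GSR": [0.01, 0.02, 0.015, 0.011], "AutoSym": [0.1, 5.0, 0.3, 80.0]}
summary = {name: ofat_summary(v) for name, v in runs.items()}
```

Under this proxy, the first pipeline would be deemed more robust: both its median and its IQR are smaller, matching the "lower, tighter violin" reading used for Figure 3.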
Lower medians and tighter distributions indicate that a method is less sensitive to routine hyper-parameter choices and therefore more robust under this operationalisation. Some methods, especially GMP on selected datasets, produced no valid OFAT observations under the shared pruning schedule; these cases are shown as explicit missing-value markers rather than being silently dropped.

We report distribution-level comparisons of OFAT robustness using one-sided Mann–Whitney U tests with Holm correction (details in Section 5). In Table 3, p_Holm indicates significance after correction, Cliff's δ quantifies the effect size (more negative favours the reference), and the 95% bootstrap CI for "median(other) − median(ref)" reports the median gap (positive favours the reference). Table 3 summarises these results and highlights which comparisons remain significant after Holm correction. Across datasets, the statistical conclusions are consistent with the distributions in Fig. 3. For interpretability, note that among the 10 datasets the largest median improvement over the AutoSym baseline occurs on Feynman I.12.1: FastKAN+GSR reduces the median OFAT test MSE from 9.49 to 0.0212, i.e., a 100 · (1 − 0.0212/9.49) ≈ 99.8% reduction (Table 2). Across the 10 datasets, an in-context variant attains the lowest median OFAT test MSE on seven targets: FastKAN+GSR on I.10.7, I.12.1, and I.34.1, and GSR on I.13.4, II.21.32, I.12.4, and I.9.18. FastKAN+AutoSym attains the lowest OFAT median on the remaining three targets: II.34.29a, II.6.15a, and II.6.15b.
These OFAT median rankings do not always match the fixed-setting mean rankings in Table 1, because the OFAT analysis aggregates all valid settings in the

[Figure 3: ten violin panels, one per Feynman target (I.10.7, I.12.1, I.12.4, I.13.4, I.34.1, I.9.18, II.21.32, II.34.29a, II.6.15a, II.6.15b), each comparing AutoSym, FastKAN+AutoSym, GSR, FastKAN+GSR, and GMP; y-axis: test MSE (log).]

Fig. 3. OFAT hyper-parameter sensitivity distributions. Violins summarise test MSE across all valid one-factor-at-a-time runs obtained by varying hidden width, λ, and the number of pruning cycles around the reference configuration; dots denote individual observations. This figure aggregates hyper-parameter perturbations only and does not average over the seed-only repeats from Table 1. Red × markers indicate that a method produced no valid OFAT runs for that dataset and is therefore absent from the violin aggregation. Lower, tighter distributions indicate lower sensitivity and hence greater robustness.

Table 2. Median OFAT test MSE of the best pipeline vs. AutoSym baseline, and percent reduction 100 · (1 − med(best)/med(AutoSym)).
Feynman Dataset | med(best) | med(AutoSym) | Reduction (%)
I.10.7 | 1.34e-3 | 7.36e-2 | 98.2
I.12.1 | 2.12e-2 | 9.49e0 | 99.8
I.13.4 | 6.43e-1 | 6.94e1 | 99.1
II.34.29a | 3.05e-5 | 2.17e-4 | 85.9
II.21.32 | 3.25e-5 | 1.13e-3 | 97.1
II.6.15a | 2.23e-3 | 2.58e-3 | 13.6
II.6.15b | 6.70e-5 | 2.22e-4 | 69.8
I.12.4 | 1.41e-4 | 3.81e-4 | 63.0
I.34.1 | 2.09e-3 | 1.88e-2 | 88.9
I.9.18 | 2.48e-4 | 9.92e-2 | 99.8

sweep whereas Table 1 summarises only repeated seeds of one selected configuration. The contrast is clearest on II.21.32: FastKAN+AutoSym has the lowest mean at the reference configuration in Table 1, but GSR has the lowest median across the full OFAT sweep in Figure 3 and Table 2; Table 3 further shows that the GSR vs FastKAN+AutoSym comparison is not Holm-significant on this dataset. In the clearest cases favouring in-context selection (namely I.10.7, I.12.1, I.13.4, I.9.18, and I.34.1), the best greedy variant significantly outperforms AutoSym after Holm correction, often with large effect sizes. At the same time, the wins of FastKAN+AutoSym on II.34.29a, II.6.15a, and II.6.15b indicate that the underlying numeric parametrisation can in some cases make local post-hoc extraction sufficiently robust. Other datasets are less clear-cut. For example, on I.12.4, GSR achieves the best OFAT median, but after correction only the comparison with GMP remains significant. On II.6.15a, none of the pairwise differences remains significant after correction despite a small median advantage for FastKAN+AutoSym.

7 Discussion

This section interprets the robustness patterns observed in the seed-sensitivity snapshot (Table 1) and in the OFAT sensitivity distributions (Fig. 3), as well as the distribution-level comparisons (Table 3). We focus on how these results relate to robustness under routine experimental choices.

Interpreting OFAT violins as sensitivity signals. The OFAT violin plots in Fig. 3 provide a compact summary of hyper-parameter sensitivity.
Since each violin represents the distribution of test MSE obtained by varying one factor at a time (width, λ, and the number of pruning cycles) around the reference configuration, the vertical extent of a violin (i.e., the spread of MSE values on the log scale) can be read as an empirical proxy for robustness to routine hyper-parameter choices: taller violins indicate larger performance variability, while flatter violins indicate greater robustness. This interpretation is consistent with our robustness

Table 3. One-sided Mann–Whitney U tests comparing the best pipeline (lowest median OFAT test MSE) to the others. p-values are Holm-corrected per dataset; effect size is Cliff's δ (negative favours the best pipeline). Bootstrap 95% CIs are reported for the median difference (other − best). Only methods with valid OFAT distributions are included; omitted rows correspond to unavailable distributions (notably selected GMP cases). Significance markers: * for p_Holm < 0.05, ** for p_Holm < 0.01, and *** for p_Holm < 0.001.
Comparison (best → other) | med(best) | med(other) | p_Holm | Cliff's δ | CI

Feynman I.10.7 (best: FastKAN+GSR)
FastKAN+GSR → AutoSym (baseline) | 1.34e-3 | 7.36e-2 | 1.89e-4 *** | -0.944 | [4.51e-2, 1.55e0]
FastKAN+GSR → FastKAN+AutoSym | 1.34e-3 | 4.52e-2 | 3.59e-3 ** | -0.736 | [1.35e-3, 1.84e0]
FastKAN+GSR → GMP | 1.34e-3 | 1.06e-2 | 6.01e-3 ** | -0.667 | [-2.83e-4, 1.43e-2]
FastKAN+GSR → GSR | 1.34e-3 | 2.38e-3 | 6.27e-2 | -0.375 | [-3.39e-3, 5.05e-3]

Feynman I.12.1 (best: FastKAN+GSR)
FastKAN+GSR → AutoSym (baseline) | 2.12e-2 | 9.49e0 | 2.06e-4 *** | -0.983 | [4.14e0, 2.72e2]
FastKAN+GSR → FastKAN+AutoSym | 2.12e-2 | 1.03e0 | 4.42e-4 *** | -0.917 | [9.89e-2, 5.20e0]
FastKAN+GSR → GMP | 2.12e-2 | 1.23e-1 | 6.21e-4 *** | -0.868 | [9.25e-2, 1.98e0]
FastKAN+GSR → GSR | 2.12e-2 | 3.31e-2 | 1.18e-1 | -0.306 | [-4.62e-3, 6.06e-2]

Feynman I.13.4 (best: GSR)
GSR → AutoSym (baseline) | 6.43e-1 | 6.94e1 | 5.32e-5 *** | -1.000 | [4.57e1, 1.06e2]
GSR → FastKAN+AutoSym | 6.43e-1 | 9.59e0 | 2.27e-3 ** | -0.758 | [1.10e0, 3.28e1]
GSR → FastKAN+GSR | 6.43e-1 | 1.02e0 | 2.60e-2 * | -0.485 | [-1.22e-1, 6.52e-1]

Feynman II.34.29a (best: FastKAN+AutoSym)
FastKAN+AutoSym → AutoSym (baseline) | 3.05e-5 | 2.17e-4 | 1.07e-2 * | -0.569 | [2.94e-5, 5.48e-4]
FastKAN+AutoSym → FastKAN+GSR | 3.05e-5 | 4.87e-4 | 1.07e-2 * | -0.653 | [2.28e-4, 8.29e-4]
FastKAN+AutoSym → GMP | 3.05e-5 | 5.72e-3 | 4.23e-3 ** | -0.783 | [1.07e-3, 2.50e-2]
FastKAN+AutoSym → GSR | 3.05e-5 | 6.72e-4 | 1.07e-2 * | -0.653 | [3.03e-4, 7.67e-4]

Feynman II.21.32 (best: GSR)
GSR → AutoSym (baseline) | 3.25e-5 | 1.13e-3 | 4.06e-2 * | -0.604 | [-4.10e-5, 2.06e-3]
GSR → FastKAN+AutoSym | 3.25e-5 | 4.80e-5 | 5.51e-1 | 0.0238 | [-8.70e-5, 6.50e-5]
GSR → FastKAN+GSR | 3.25e-5 | 6.60e-5 | 2.84e-1 | -0.264 | [-7.30e-5, 3.48e-4]

Feynman II.6.15a (best: FastKAN+AutoSym)
FastKAN+AutoSym → AutoSym (baseline) | 2.23e-3 | 2.58e-3 | 1.00e0 | 0.0833 | [-6.77e0, 1.28e-1]
FastKAN+AutoSym → FastKAN+GSR | 2.23e-3 | 2.39e-3 | 1.00e0 | 0.0139 | [-6.77e0, 2.88e-3]
FastKAN+AutoSym → GSR | 2.23e-3 | 2.37e-3 | 1.00e0 | 0.167 | [-6.77e0, 1.32e-3]

Feynman II.6.15b (best: FastKAN+AutoSym)
FastKAN+AutoSym → AutoSym (baseline) | 6.70e-5 | 2.22e-4 | 5.24e-3 ** | -0.708 | [1.30e-5, 8.67e-4]
FastKAN+AutoSym → FastKAN+GSR | 6.70e-5 | 1.68e-4 | 4.60e-2 * | -0.431 | [-5.00e-6, 2.46e-4]
FastKAN+AutoSym → GMP | 6.70e-5 | 7.76e-4 | 2.73e-4 *** | -0.917 | [2.13e-4, 8.67e-4]
FastKAN+AutoSym → GSR | 6.70e-5 | 2.16e-4 | 4.60e-2 * | -0.486 | [1.45e-5, 2.25e-4]

Feynman I.12.4 (best: GSR)
GSR → AutoSym (baseline) | 1.41e-4 | 3.81e-4 | 6.23e-1 | -0.125 | [-1.05e-4, 5.59e-4]
GSR → FastKAN+AutoSym | 1.41e-4 | 3.27e-4 | 5.07e-1 | -0.250 | [-1.35e-4, 5.59e-4]
GSR → FastKAN+GSR | 1.41e-4 | 1.47e-4 | 6.23e-1 | 0.000 | [-1.36e-4, 5.62e-5]
GSR → GMP | 1.41e-4 | 2.54e-4 | 2.96e-3 ** | -0.788 | [1.59e-5, 5.59e-4]

Feynman I.34.1 (best: FastKAN+GSR)
FastKAN+GSR → AutoSym (baseline) | 2.09e-3 | 1.88e-2 | 2.12e-3 ** | -0.783 | [9.68e-3, 1.61e0]
FastKAN+GSR → FastKAN+AutoSym | 2.09e-3 | 8.57e-2 | 1.89e-4 *** | -0.944 | [6.28e-2, 9.99e-1]
FastKAN+GSR → GMP | 2.09e-3 | 5.96e-2 | 9.31e-4 *** | -0.848 | [2.49e-2, 6.28e-2]
FastKAN+GSR → GSR | 2.09e-3 | 2.72e-3 | 6.42e-1 | 0.0833 | [-3.26e-3, 1.91e-3]

Feynman I.9.18 (best: GSR)
GSR → AutoSym (baseline) | 2.48e-4 | 9.92e-2 | 5.24e-5 *** | -1.000 | [1.49e-2, 3.89e-1]
GSR → FastKAN+AutoSym | 2.48e-4 | 1.05e-3 | 2.90e-3 ** | -0.722 | [3.90e-4, 1.48e-2]
GSR → FastKAN+GSR | 2.48e-4 | 3.37e-4 | 1.85e-1 | -0.222 | [-1.30e-4, 2.89e-4]

objective, because it reflects how strongly the pipeline's predictive behaviour changes under small, standard tuning decisions.

Implications of in-context selection and numeric parametrisation. Taken together, the multi-seed snapshot at the reference setting (Table 1) and the OFAT sensitivity distributions (Fig. 3), supported by distribution-level tests (Table 3), motivate three empirical observations: (i) In the datasets where an in-context variant achieves the lowest median OFAT MSE (GSR or FastKAN+GSR), the corresponding OFAT violins are concentrated at lower error levels and typically exhibit reduced dispersion relative to post-hoc edge-wise extraction.
(ii) Switching the numeric edge parametrisation from splines to radial basis functions can improve robustness for certain targets (e.g., II.34.29a), but it does not uniformly eliminate the variability induced by local per-edge fitting across all datasets. (iii) When in-context variants outperform post-hoc baselines with Holm-significant differences, the accompanying effect sizes are typically large (Cliff's δ close to −1), indicating that the performance advantage is not limited to a small median shift but reflects a broad separation between the OFAT distributions; the bootstrap intervals for "median(other) − median(best)" are predominantly positive in these cases, supporting that the median gaps are practically meaningful across the sweep.

Why in-context evaluation improves robustness. A central failure mode of isolated per-edge extraction is that a locally good one-dimensional fit can be globally wrong once composed through additions and multiplication units. By selecting operators based on the end-to-end loss after brief refitting, GSR (and the discretisation stage of GMP) directly tests whether a symbolic substitution remains compatible with the rest of the network. This "in-context" check tends to reject operators that match an edge curve in isolation but introduce brittle interactions downstream, which is consistent with the tighter OFAT sensitivity distributions observed for the greedy pipelines on most targets, with an in-context variant achieving the best median OFAT MSE on seven of the ten datasets.

Why the best pipeline can be target-dependent. FastKAN+AutoSym is best by median on three datasets: II.34.29a, II.6.15a, and II.6.15b. This suggests that for some targets the radial-basis parametrisation yields edge functions that are easier to discretise reliably with local post-hoc matching. By contrast, GSR or FastKAN+GSR is best on the remaining seven datasets.
End-to-end in-context evaluation is therefore generally more robust, but not universally dominant. More broadly, these results suggest that robustness is shaped jointly by the numeric inductive bias (how edges are parametrised and trained) and the symbolic selection rule used for discretisation.

Efficiency–robustness trade-off. While GSR can be the most accurate and robust, it incurs a higher conversion cost because it evaluates many candidate operators in context. GMP reduces this cost by learning sparse gates and restricting in-context evaluation to a per-edge top-k shortlist, which directly reduces the number of candidate fine-tuning loops. The mixed results across targets indicate that gate learning can be an effective accelerator, but that part of GMP's mixed robustness may be procedural rather than intrinsic: the gated operator layer can converge more slowly than the greedy baselines, so applying the same prune-and-refit cadence can prune viable operator paths before gate separation stabilises. This offers a plausible explanation for the missing or invalid GMP runs in Table 1 and Fig. 3, and suggests that longer pre-pruning training or gentler pruning schedules could improve the efficiency–robustness trade-off.

8 Limitations

Our study is a first step toward robustness-aware symbolic extraction for KANs. Below we outline key limitations and, for each, the concrete steps we took to mitigate its impact in the present work, together with what remains open.

Benchmark scope and ecological validity. We evaluate on ten targets from SRBench's Feynman suite under a fixed training protocol.
This controlled setting does not span the full range of regimes relevant to scientific discovery (e.g., heavy label noise, covariate shift, sparse observations, or high-dimensional inputs), so the results should be interpreted as evidence about robustness within this envelope rather than as a general guarantee. Mitigation. We fixed the evaluation pool to the SRBench Feynman with-units suite, which offers controlled scientific targets with known formulas, and then used a limited subset without per-problem retuning to reduce cherry-picking under our compute budget.

Limited robustness factors and interaction effects. Our robustness analysis performs one-factor-at-a-time sweeps over width, regularisation strength, and pruning rounds, plus a limited set of random seeds. This isolates individual sensitivities but does not fully characterise higher-order interactions (e.g., particular width–pruning combinations that jointly trigger collapse). A larger factorial design or targeted interaction probes (e.g., conditional sweeps around failure regions) would provide stronger guarantees at higher computational cost. Mitigation. We (i) sweep the most consequential knobs for KAN extraction (capacity, sparsification, and pruning), (ii) replicate runs across multiple seeds to reduce the risk of anecdotal conclusions, and (iii) report variability across configurations rather than only best-case points.

Operator-library design and identifiability. Operator recovery is constrained by the library and by representational equivalences: distinct operators (or affine reparametrisations) can be indistinguishable on a bounded domain, and mathematically equivalent expressions can appear in different syntactic forms after discretisation and simplification. Consequently, "correct recovery" is not always uniquely defined without an explicit equivalence protocol. Mitigation.
We keep the operator set fixed across pipelines, evaluate candidates on the same input domain, and treat expression recovery primarily as a robustness/interpretability signal rather than as a strict exact-match objective. We also apply consistent post-hoc simplification so that superficial syntactic differences are reduced when reporting recovered formulas.

What is optimised vs. what is explained. Our primary quantitative metric is test MSE. Predictive accuracy is necessary but not sufficient for high-quality explanations: users often care about structural correctness, robustness of the extracted form under perturbations, and faithfulness in a causal/functional sense beyond numeric fit. Mitigation. We complement MSE with qualitative inspection of recovered expressions and, crucially, with a robustness-oriented evaluation: we stress the pipelines under controlled hyper-parameter and pruning perturbations to assess whether the extracted forms persist or degrade. This directly targets one dimension of explanation reliability (robustness to plausible training variations), even when a formal structural metric is unavailable. The study is therefore best read as comparing pipeline robustness under a shared practical protocol, not as isolating the effect of in-context selection under a strictly matched compute budget.

9 Conclusion and Future Work

We studied symbolic extraction for KANs from an XAI perspective, emphasising robustness under routine experimental choices and operationalising it through sensitivity to controlled perturbations.
We introduced two in-context pipelines that select symbolic operators based on their end-to-end effect on the network after brief fine-tuning: Greedy in-context Symbolic Regression, which evaluates candidates explicitly during conversion, and Gated Matching Pursuit, which amortises much of this selection by learning sparse operator gates and then discretising from a small top-k shortlist.

Across ten Feynman targets, the results indicate that in-context selection can substantially increase hyper-parameter robustness relative to isolated per-edge curve matching. On seven of the ten targets, an in-context variant attains the best OFAT median test MSE, often with tighter sensitivity distributions, suggesting improved robustness to width, regularisation, and pruning schedules. We also find that changing the numeric parametrisation (splines vs. radial basis functions) can shift which pipeline is most robust, with FastKAN+AutoSym performing best on II.34.29a, II.6.15a, and II.6.15b under the OFAT median criterion. This highlights that robustness is a property of the full pipeline rather than of symbolic extraction alone.

A few directions for future work could improve robustness and practical usefulness. One is broader evaluation, both on additional SRBench suites and on real scientific datasets, especially in noisy, low-sample, and out-of-distribution regimes where robustness matters most. Another is to go beyond test error by incorporating measures of structural agreement, such as operator-set overlap, edit distance over expression trees, or equivalence-class scoring, and by examining interaction effects through more systematic sweep designs.
It would also be worth studying tighter relaxations for operator selection, including L0/Concrete gates, hierarchical operator libraries, and adaptive operator generation, to reduce ambiguity without substantially increasing compute.

Acknowledgments. This work was funded by the Swiss Innovation Agency (Innosuisse) under grant agreement 119.321 INT-ICT.

References

1. Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., Kim, B.: Sanity checks for saliency maps. In: Advances in Neural Information Processing Systems. vol. 31, pp. 9505–9515 (2018)
2. Aghaei, A.A.: rKAN: Rational Kolmogorov–Arnold networks. arXiv preprint arXiv:2406.14495 (2024)
3. Alvarez-Melis, D., Jaakkola, T.S.: On the robustness of interpretability methods. arXiv preprint arXiv:1806.08049 (2018)
4. Arnold, V.I.: On the functions of three variables. Doklady Akademii Nauk SSSR 114(4), 679–681 (1957)
5. de Boor, C.: A Practical Guide to Splines, Applied Mathematical Sciences, vol. 27. Springer-Verlag, New York, NY (1978). https://doi.org/10.1007/978-1-4612-6333-3
6. Brunton, S.L., Proctor, J.L., Kutz, J.N.: Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences 113(15), 3932–3937 (2016). https://doi.org/10.1073/pnas.1517384113
7. Cranmer, M.: Interpretable machine learning for science with PySR and SymbolicRegression.jl. arXiv preprint arXiv:2305.01582 (2023). https://doi.org/10.48550/arXiv.2305.01582
8. Doshi-Velez, F., Kim, B.: Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608 (2017)
9. Ghorbani, A., Abid, A., Zou, J.: Interpretation of neural networks is fragile. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 3681–3688 (2019).