Sparse Additive Model Pruning for Order-Based Causal Structure Learning

Kentaro Kanamori, Hirofumi Suzuki, Takuya Takagi
Artificial Intelligence Laboratory, Fujitsu Limited
{k.kanamori, suzuki-hirofumi, takagi.takuya}@fujitsu.com

Abstract

Causal structure learning, also known as causal discovery, aims to estimate causal relationships between variables in the form of a causal directed acyclic graph (DAG) from observational data. One of the major frameworks is the order-based approach, which first estimates a topological order of the underlying DAG and then prunes spurious edges from the fully-connected DAG induced by the estimated topological order. Previous studies often focus on the former ordering step because it can dramatically reduce the search space of DAGs. In practice, however, the latter pruning step is equally crucial for ensuring both computational efficiency and estimation accuracy. Most existing methods employ a pruning technique based on generalized additive models and hypothesis testing, commonly known as CAM-pruning. However, this approach can be a computational bottleneck, as it requires repeatedly fitting additive models for all variables. Furthermore, it may harm estimation quality due to multiple testing. To address these issues, we introduce a new pruning method based on sparse additive models, which enables direct pruning of redundant edges without relying on hypothesis testing. We propose an efficient algorithm for learning sparse additive models by combining the randomized tree embedding technique with group-wise sparse regression. Experimental results on both synthetic and real datasets demonstrate that our method is significantly faster than existing pruning methods while maintaining comparable or superior accuracy.

1 Introduction

In several scientific fields, such as biology and economics, it is often important to identify causal relationships between variables from observational data.
Copyright © 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Causal structure learning, also known as causal discovery, aims to estimate them in the form of a causal directed acyclic graph (DAG) (Pearl 2009). A causal DAG enables us to understand the relationships between variables and to predict the effect of interventions on them, which is crucial for decision-making in various applications (Peters, Janzing, and Schölkopf 2017). One of the promising algorithmic frameworks for causal structure learning is the order-based approach (Teyssier and Koller 2005). Under some conditions on the data-generating process, previous studies have shown that the underlying causal DAG is identifiable from purely observational data (Peters et al. 2014). However, we need to search for a causal graph while ensuring it is acyclic, which incurs super-exponential computational costs in the number of variables (Chickering 1996). To alleviate this issue, the order-based approach first estimates a topological order of the underlying causal DAG and then prunes spurious edges from the fully-connected DAG induced by the estimated topological order, as illustrated in Figure 1. By estimating a topological order in advance, we can obtain a causal DAG without explicitly imposing the acyclicity constraint, which reduces the search space dramatically (Rolland et al. 2022).

While existing studies often focus on the former ordering step, in practice, the latter pruning step is equally crucial for ensuring both computational efficiency and estimation accuracy. Most existing methods employ a pruning algorithm based on generalized additive models (GAMs) (Hastie and Tibshirani 1986), known as CAM-pruning (Bühlmann, Peters, and Ernest 2014).
It first fits a GAM regressing each variable on its candidate parents, and then identifies redundant parents by hypothesis testing on the fitted GAM (Marra and Wood 2011). However, fitting GAMs is generally expensive, especially in high-dimensional cases. Since CAM-pruning needs to repeatedly fit GAMs for all variables, it often becomes the bottleneck of the entire algorithm (Rolland et al. 2022; Montagna et al. 2023). Furthermore, CAM-pruning also requires repeated hypothesis testing, which can degrade the estimation accuracy due to multiple testing (Huang et al. 2018).

The goal of this paper is to propose an alternative pruning method that addresses the aforementioned limitations of existing pruning methods. It enables us to accelerate existing order-based causal structure learning algorithms without compromising their estimation quality.

Our Contributions  In this paper, we propose a new pruning method for order-based causal structure learning. Our key idea is to learn a sparse additive model (Ravikumar et al. 2009) that regresses each variable on its candidate parents, which enables us to directly prune redundant candidate parents without requiring hypothesis testing. To accelerate the process of fitting sparse additive models, we propose a new framework, named Sparse Additive Randomized TRee Ensemble (SARTRE), which combines the randomized tree embedding and group-wise sparse regression techniques.

Figure 1: An overview of the order-based causal structure learning algorithm. Given an observational dataset, it first estimates a topological order of the underlying causal DAG (in the example, (X_2, X_3, X_1, X_4)). Then, it prunes spurious edges (here, (X_2, X_3) and (X_3, X_4)) from the fully-connected DAG induced by the estimated topological order. This paper focuses on the latter step and aims to propose an efficient and accurate pruning method.

Our contributions are summarized as follows:
• We introduce a new efficient framework for learning a sparse additive model, named SARTRE. We consider a special case of the additive model, where the shape function for each variable is expressed as a linear combination of weighted indicator functions over a set of intervals. We propose to generate the set of intervals by randomized tree embedding (Moosmann, Triggs, and Jurie 2006), and show that we can efficiently learn the sparse weight vector via group lasso regression (Yuan and Lin 2006).
• We propose an efficient pruning method for order-based causal structure learning by leveraging our SARTRE framework. Given an estimated topological order, our method can efficiently prune redundant edges from the fully-connected DAG induced by that order without requiring hypothesis testing. Our method can be combined with any causal ordering algorithm, such as SCORE (Rolland et al. 2022).
• Through numerical experiments on synthetic and real datasets, we demonstrate that our method achieves a significant speedup compared to existing pruning methods, including CAM-pruning (Bühlmann, Peters, and Ernest 2014). Furthermore, our method achieves comparable or superior accuracy to existing methods, suggesting that it can be a promising alternative to current pruning methods.

2 Preliminaries

2.1 Causal Structure Learning

Causal structure learning, also known as causal discovery, aims to estimate a causal graph that expresses the causal relationships between variables from observational data. For a set of variables [d] := {1, ..., d}, we consider an additive noise model (ANM) with a directed acyclic graph (DAG) G (Peters et al. 2014). More precisely, we assume that a random variable X = (X_1, ..., X_d) ∈ R^d is generated by the following structural equation model for each i ∈ [d]:

    X_i = f_i(X_{pa(i)}) + ε_i,

where X_{pa(i)} is the vector of variables that are parents of X_i in G, f_i is a deterministic link function, and ε_i ~ N(0, σ_i²) is an independent Gaussian noise variable. As in previous studies (Peters et al. 2014; Rolland et al. 2022), we assume that the link functions f_i are nonlinear and twice continuously differentiable in every component. Under mild conditions, the ANM defined above is known to be identifiable; that is, the underlying causal DAG G can be uniquely recovered from observational data generated according to the joint distribution of X (Peters et al. 2014). Motivated by this fact, several algorithms for estimating the causal DAG G from observational data have been proposed so far (Bühlmann, Peters, and Ernest 2014; Montagna et al. 2023; Xu et al. 2024).

2.2 Order-Based Causal Structure Learning

One of the major frameworks for estimating a causal DAG G is the order-based approach (Teyssier and Koller 2005). Given an observational sample S = {x^1, ..., x^n} ⊆ R^d, it divides the task of estimating G into two steps: (1) estimating a topological order of G, and (2) pruning redundant edges from the fully-connected DAG induced by the estimated topological order. Figure 1 presents an overview of the order-based approach. Previous studies often focus on the former step and have proposed several methods for estimating a topological order from a given observational sample S. For the latter pruning step, most existing methods employ a nonlinear variable selection algorithm based on generalized additive models (GAMs) (Hastie and Tibshirani 1986).

Ordering Step  Given a sample S, the ordering step aims to estimate a topological order π of the underlying DAG G.
A topological order π is expressed as a permutation over [d] such that, for any i, j ∈ [d], π(i) < π(j) holds if X_j is a descendant of X_i in G. Existing methods often estimate a topological order in a bottom-up greedy manner that iteratively identifies a leaf variable, i.e., a variable that is not a parent of any other remaining variable (Peters et al. 2014). One of the state-of-the-art methods is the SCORE algorithm (Rolland et al. 2022), which identifies a leaf variable by leveraging the score matching technique. For the underlying joint distribution p(x) of X, let s(x) := ∇ log p(x) be the logarithmic gradient of p(x), which is known as the score function. Rolland et al. (2022) proved that X_i is a leaf variable if and only if the variance of ∂s_i(X)/∂x_i is zero. Based on this fact, SCORE identifies a leaf variable X_l by

    l = argmin_{i ∈ [d]} Var_X[∂s_i(X)/∂x_i].

Note that Var_X[∂s_i(X)/∂x_i] can be estimated by the second-order Stein gradient estimator (Li and Turner 2018) with a given sample S. After identifying a leaf variable X_l, SCORE removes it from the set of variables and repeats this procedure for the remaining variables. Once all variables are removed, we obtain an estimated topological order π̂.

Figure 2: An example of our SARTRE framework. We consider the same example as in Figure 1, where the estimated topological order π̂ = (2, 3, 1, 4) is given and we aim to identify the parents of the variable X_4 from its candidate parents p̂a(4) = {2, 3, 1}. (a) Model overview: our method first generates binary representation vectors (φ_2(X_2), φ_3(X_3), φ_1(X_1)) by randomized tree embedding. Then, it learns an additive model f̂_4(X_{p̂a(4)}) = g_{4,2}(X_2) + g_{4,3}(X_3) + g_{4,1}(X_1), where each shape function is defined by g_{4,j}(X_j) = β_{4,j}⊤ φ_j(X_j). By optimizing the coefficient vectors (β_{4,2}, β_{4,3}, β_{4,1}) through group lasso regression, we obtain a sparse additive model f̂_4. (b) Shape functions: for each shape function g_{4,j}, if β_{4,j} = 0 holds, we have g_{4,j}(X_j) = 0 for any X_j, which enables us to prune the corresponding candidate parent X_j. In this example, β_{4,3} = 0 holds, and thus we can prune X_3.

Pruning Step  Given an estimated topological order π̂ and a sample S, the pruning step aims to remove spurious edges and identify the true parents of each variable X_i in the underlying DAG G. Once a topological order π̂ is determined, we can construct the fully-connected DAG by adding a directed edge from X_j to X_i for each j, i ∈ [d] such that π̂(j) < π̂(i) holds. Let p̂a(i) := {j ∈ [d] | π̂(j) < π̂(i)} be the set of candidate parents of X_i with respect to π̂. Then, the task of the pruning step is to select the true parents of each X_i from p̂a(i), which can be regarded as a variable selection problem. Most existing methods employ a variable selection method based on GAMs, known as CAM-pruning (Bühlmann, Peters, and Ernest 2014). The idea behind CAM-pruning is to assume that the link function f_i can be expressed as an additive model. It first fits an additive model that regresses X_i on its candidate parents X_{p̂a(i)}:

    f̂_i(X_{p̂a(i)}) = Σ_{j ∈ p̂a(i)} g_{i,j}(X_j),

where g_{i,j} is a nonlinear shape function for each j ∈ p̂a(i).
Then, CAM-pruning selects a subset of p̂a(i) by testing, for each j ∈ p̂a(i), whether g_{i,j} is significantly different from zero. That is, it removes each candidate parent X_j if the null hypothesis g_{i,j}(X_j) = 0 is accepted.

While CAM-pruning is known to be a powerful method, it faces two critical challenges. First, it needs to fit an additive model for each variable X_i, and repeating this procedure for all variables can be computationally expensive, especially in high-dimensional cases (Montagna et al. 2023). Second, it requires hypothesis testing for every pair of a variable and its candidate parent, which can harm the correctness of pruning due to multiple testing (Huang et al. 2018). To address these issues, the aim of this paper is to propose an alternative pruning method that can be applied to high-dimensional datasets while avoiding multiple testing.

3 Algorithm

In this section, we propose an efficient and accurate pruning method for order-based causal structure learning. We assume that we are given a topological order π̂ over the variables [d], estimated from an observational sample S = {x^1, ..., x^n} ⊆ R^d by some algorithm. Note that our method can be combined with any existing ordering algorithm, including CAM (Bühlmann, Peters, and Ernest 2014), SCORE (Rolland et al. 2022), and CaPS (Xu et al. 2024). Given an estimated topological order π̂, we aim to identify the true parents pa(i) of each variable X_i by removing spurious variables from its candidate parents p̂a(i) given by π̂. Our main idea is to learn a sparse additive model (Ravikumar et al. 2009) that regresses each variable X_i on a small subset of its candidate parents X_{p̂a(i)}. By learning a sparse additive model, we can directly prune redundant candidate parents without relying on hypothesis testing.
However, fitting a sparse additive model for each variable X_i can be computationally expensive, as is the case with CAM-pruning. Since we need to repeat this procedure for all variables, the computational cost can be prohibitive in practice. To address this issue, we propose a new framework for efficiently learning sparse additive models by combining randomized tree embedding (Moosmann, Triggs, and Jurie 2006) and group lasso regression (Yuan and Lin 2006).

3.1 Sparse Additive Randomized Tree Ensemble

First, we introduce Sparse Additive Randomized TRee Ensemble (SARTRE), a new framework for learning sparse additive models. We consider a special case of the additive model in which each shape function is expressed as a linear combination of weighted indicator functions over a set of intervals. More precisely, we consider an additive model f̂_i(X_{p̂a(i)}) = Σ_{j ∈ p̂a(i)} g_{i,j}(X_j) whose shape functions g_{i,j} are defined as follows:

    g_{i,j}(X_j) = Σ_{k=1}^{l_j} β_{i,j,k} · φ_{j,k}(X_j),

where l_j is the total number of intervals for X_j, φ_{j,k}(X_j) = I[X_j ∈ r_{j,k}] is the indicator function of the k-th interval r_{j,k} ⊂ R, and β_{i,j,k} is a coefficient for r_{j,k}. Each interval r_{j,k} is expressed as r_{j,k} = (a_{j,k}, b_{j,k}] with lower and upper bounds a_{j,k}, b_{j,k} ∈ R.

Figure 3: An example of the tree embedding technique. Given an ensemble of decision trees that takes X_j as input, each leaf k of a tree corresponds to an interval r_{j,k} of X_j. For example, a tree with splits X_j ≤ 0.4, X_j ≤ 0.7, and X_j ≤ 0.8 yields the intervals r_{j,1} = (−∞, 0.4], r_{j,2} = (0.4, 0.7], r_{j,3} = (0.7, 0.8], and r_{j,4} = (0.8, ∞), while a second tree with splits X_j ≤ 0.6 and X_j ≤ 0.9 yields r_{j,5} = (−∞, 0.6], r_{j,6} = (0.6, 0.9], and r_{j,7} = (0.9, ∞). Thus, we can obtain a set of intervals R_j by collecting the intervals corresponding to the leaves in a given tree ensemble.
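As a concrete illustration, the indicator embedding over one tree's intervals amounts to a one-hot encoding via a sorted-search. The sketch below is ours, not the authors' implementation: the function name `embed` is hypothetical, and the split points are taken from the first tree of Figure 3.

```python
import numpy as np

def embed(x, cuts):
    # One-hot interval embedding for a single tree: `cuts` are the sorted
    # split points a_1 < ... < a_{l-1}, defining the half-open intervals
    # (-inf, a_1], (a_1, a_2], ..., (a_{l-1}, inf).
    phi = np.zeros((len(x), len(cuts) + 1), dtype=int)
    idx = np.searchsorted(cuts, x, side="left")  # interval index of each sample
    phi[np.arange(len(x)), idx] = 1
    return phi

# Split points of the first tree in Figure 3: X_j <= 0.4, 0.7, 0.8.
phi = embed(np.array([0.3, 0.75, 0.95]), np.array([0.4, 0.7, 0.8]))
print(phi)  # [[1 0 0 0], [0 0 1 0], [0 0 0 1]]: rows select r_{j,1}, r_{j,3}, r_{j,4}
```

Stacking such blocks over all trees in the ensemble gives the full embedding vector φ_j(X_j) defined next.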
For notational convenience, we write φ_j(X_j) := (φ_{j,1}(X_j), ..., φ_{j,l_j}(X_j)) ∈ {0, 1}^{l_j}, which can be regarded as an embedding vector of X_j with respect to the set of intervals R_j := {r_{j,1}, ..., r_{j,l_j}}. Our shape function can then be expressed as g_{i,j}(X_j) = β_{i,j}⊤ φ_j(X_j), where β_{i,j} := (β_{i,j,1}, ..., β_{i,j,l_j}) ∈ R^{l_j} is the coefficient vector for the pair of a variable X_i and its candidate parent X_j.

To learn our additive model f̂_i, we need to generate a set of intervals R_j and optimize the coefficient vector β_{i,j} for each j ∈ p̂a(i). By definition, g_{i,j}(X_j) = 0 if β_{i,j,k} = 0 holds for all k ∈ [l_j]. This implies that we can obtain a sparse additive model by forcing β_{i,j} = 0 for as many candidate parents j ∈ p̂a(i) as possible. Based on this observation, our SARTRE framework consists of the following two steps: (1) randomized tree embedding, which efficiently generates a set of intervals R_j in an unsupervised manner; and (2) group-wise sparse regression, which optimizes each coefficient vector β_{i,j} by group lasso regression. Figure 2 illustrates an overview of our SARTRE framework.

Randomized Tree Embedding  To generate a set of intervals R_j for each j ∈ p̂a(i), we employ the embedding technique based on tree ensemble models (Moosmann, Triggs, and Jurie 2006). A tree ensemble model consists of decision trees that partition the input space and assign a constant value to each partition. Given a tree ensemble h_j : R → R that takes the single variable X_j as input, each leaf k of a decision tree in the ensemble corresponds to an interval r_{j,k}. This suggests that we can obtain R_j by collecting the intervals corresponding to the leaves in a given tree ensemble h_j.
Such a technique is known as tree embedding (Moosmann, Triggs, and Jurie 2006), and it enables us to easily express nonlinear relationships (Friedman and Popescu 2008; Feng and Zhou 2018). Figure 3 illustrates a running example of how to extract a set of intervals R_j from a tree ensemble h_j.

There are several possible ways to obtain a tree ensemble h_j. One straightforward approach is to train a tree ensemble h_j that regresses X_i on X_j by some supervised learning algorithm, such as random forests (Breiman 2001) or gradient boosting (Friedman 2000). To accelerate this generating process, we instead employ an ensemble of completely randomized trees (Geurts, Ernst, and Wehenkel 2006). Because these trees are constructed by randomly selecting a split point for each node without any target variable, we can generate R_j more efficiently than with the supervised approach. Note that the intervals R_j generated in such an unsupervised and randomized manner are not always optimal in terms of regression performance. However, since we optimize the corresponding coefficient vectors β_{i,j} in the next step, they may not harm the overall performance as long as the total number of intervals is sufficiently large (Geurts, Ernst, and Wehenkel 2006).

Algorithm 1: SARTRE-pruning
 1: for j = 1, ..., d do                ▷ Randomized Tree Embedding
 2:     Fit a randomized tree ensemble h_j;
 3:     Extract R_j = {r_{j,1}, ..., r_{j,l_j}} from h_j;
 4: end for
 5: Construct the fully-connected DAG Ĝ induced by π̂;
 6: for i = 1, ..., d do                ▷ Group-wise Sparse Regression
 7:     β̂_i ← argmin_{β_i ∈ R^{p_i}} L(β_i) + λ · Σ_{j ∈ p̂a(i)} ‖β_{i,j}‖_2;
 8:     for j ∈ p̂a(i) do
 9:         if β̂_{i,j,1} = ··· = β̂_{i,j,l_j} = 0 then
10:             Remove the edge (X_j, X_i) from Ĝ;
11:         end if
12:     end for
13: end for
14: return Ĝ
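The unsupervised interval-generation step can be sketched as follows. This is a simplified stand-in for fitting an ensemble of completely randomized trees: on a single input variable, a tree with at most L leaves partitions the real line at its L − 1 split points, so drawing random thresholds inside the observed range reproduces the interval-collection step. The function name and the i.i.d.-uniform thresholds (rather than recursive random splitting) are our own simplifications, not the paper's implementation.

```python
import numpy as np

def random_intervals(x, n_trees=5, max_leaves=8, seed=0):
    # For each "tree", draw max_leaves - 1 random thresholds inside the
    # observed range of x and collect the resulting half-open intervals
    # (lower, upper], mimicking the leaves of a completely randomized tree.
    rng = np.random.default_rng(seed)
    intervals = []
    for _ in range(n_trees):
        cuts = np.sort(rng.uniform(x.min(), x.max(), size=max_leaves - 1))
        bounds = np.concatenate(([-np.inf], cuts, [np.inf]))
        intervals += list(zip(bounds[:-1], bounds[1:]))
    return intervals

x = np.random.default_rng(1).normal(size=2000)
R_j = random_intervals(x)  # 5 trees x 8 leaves -> 40 candidate intervals,
                           # matching the experimental setting l_j <= 40
```

Because the intervals are generated without looking at any target variable, the same R_j can be reused for every variable whose candidate parents include X_j.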
Group-Wise Sparse Regression  Given the generated intervals R_j for each j ∈ p̂a(i), we optimize the corresponding coefficient vectors β_{i,j} by a group-wise sparse regression technique (Yuan and Lin 2006; Massias, Gramfort, and Salmon 2018). Let p_i = Σ_{j ∈ p̂a(i)} l_j be the total number of intervals over all candidate parents of X_i. We denote the concatenated embedding and coefficient vectors by Φ_i(x) := (φ_j(x_j))_{j ∈ p̂a(i)} ∈ {0, 1}^{p_i} and β_i := (β_{i,j})_{j ∈ p̂a(i)} ∈ R^{p_i}, respectively. Then, our additive model can be expressed as f̂_i(x_{p̂a(i)}) = β_i⊤ Φ_i(x). This implies that, once the intervals R_j are fixed, our additive model can be regarded as a linear model over the embedding Φ_i with coefficient vector β_i. In addition, the coefficient vector β_i has a group structure, where each group corresponds to one candidate parent of X_i, and we aim to encourage group-wise sparsity in β_i to obtain a sparse additive model.

By leveraging these facts, we can fit the coefficient vector β_i by group lasso regression (Yuan and Lin 2006). Our problem is formulated as follows:

    min_{β_i ∈ R^{p_i}}  L(β_i) + λ · Σ_{j ∈ p̂a(i)} ‖β_{i,j}‖_2,

where L(β_i) := Σ_{m=1}^{n} (x^m_i − β_i⊤ Φ_i(x^m))² and λ > 0 is a regularization parameter. The first term of the objective function is the standard squared loss, and the second term is the group lasso penalty that encourages group-wise sparsity in β_i. We can solve this optimization problem with existing efficient algorithms, such as block-coordinate descent (Friedman, Hastie, and Tibshirani 2010) or dual extrapolation (Massias, Gramfort, and Salmon 2018).
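The group lasso problem above can also be solved by a short proximal-gradient (ISTA) loop with block soft-thresholding. The following is a minimal sketch under that simpler solver, not the faster block-coordinate-descent or dual-extrapolation routines the paper relies on; all names are hypothetical.

```python
import numpy as np

def group_lasso(Phi, y, groups, lam, n_iter=1000):
    """Minimize ||y - Phi @ beta||^2 + lam * sum_g ||beta_g||_2 by proximal
    gradient descent. groups[k] gives the candidate-parent group of column k."""
    beta = np.zeros(Phi.shape[1])
    step = 1.0 / (2.0 * np.linalg.norm(Phi, 2) ** 2)  # 1 / Lipschitz const. of the loss
    for _ in range(n_iter):
        z = beta - step * 2.0 * Phi.T @ (Phi @ beta - y)  # gradient step on the loss
        for g in np.unique(groups):
            idx = groups == g
            norm = np.linalg.norm(z[idx])
            # Proximal step: shrink each group's norm, zeroing the group
            # entirely when it falls below the threshold step * lam.
            z[idx] = 0.0 if norm <= step * lam else (1.0 - step * lam / norm) * z[idx]
        beta = z
    return beta

# Two candidate parents with three intervals each; only parent 0 matters.
rng = np.random.default_rng(0)
Phi = rng.integers(0, 2, size=(200, 6)).astype(float)
groups = np.array([0, 0, 0, 1, 1, 1])
y = Phi[:, :3] @ np.array([1.0, -2.0, 0.5])
beta = group_lasso(Phi, y, groups, lam=5.0)
```

Candidate parents whose entire coefficient group is driven exactly to zero by the proximal step are the ones pruned in the next subsection.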
Figure 4: Experimental results of the baseline comparison on the synthetic datasets (SHD, SID, and running time on ER1, ER4, SF1, and SF4; methods: SCORE, DAS, SARTRE). For all metrics, smaller values are better. The shaded areas indicate the standard deviations over 10 trials. We varied the number of variables d from 10 to 50. Our SARTRE was significantly faster than the baselines while maintaining comparable or superior SHD and SID.

3.2 Overall Pruning Algorithm

We now propose a pruning algorithm for order-based causal structure learning based on our SARTRE framework. Algorithm 1 presents our algorithm, named SARTRE-pruning. Given an estimated topological order π̂ and a sample S, it first generates a set of intervals R_j for each variable X_j by randomized tree embedding. Then, for each variable X_i, it trains a sparse additive model f̂_i(X_{p̂a(i)}) by optimizing the coefficient vector β̂_i through group lasso regression. For a candidate parent j ∈ p̂a(i), if all the coefficients β̂_{i,j,k} are zero, we conclude that X_j is not a parent of X_i and prune the edge (X_j, X_i) from the fully-connected DAG Ĝ induced by π̂. Finally, our algorithm returns the pruned DAG Ĝ.
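The zero-group pruning rule reduces to a norm check per candidate parent. Below is a minimal sketch with a hypothetical data layout (`beta_hat[(i, j)]` holding the fitted group for candidate parent j of variable i); the coefficient values are made up to reproduce the Figure 1 example, where the order is (X_2, X_3, X_1, X_4) and the edges (X_2, X_3) and (X_3, X_4) are pruned.

```python
import numpy as np

def prune_edges(order, beta_hat, tol=0.0):
    # Keep edge X_j -> X_i iff the fitted group norm ||beta_{i,j}|| is nonzero.
    edges = []
    for pos, i in enumerate(order):
        for j in order[:pos]:  # candidate parents precede i in the order
            if np.linalg.norm(beta_hat[(i, j)]) > tol:
                edges.append((j, i))
    return edges

# Hypothetical fitted coefficients: beta_{3,2} and beta_{4,3} zeroed out.
order = [2, 3, 1, 4]
z, nz = np.zeros(4), np.ones(4)
beta_hat = {(3, 2): z, (1, 2): nz, (1, 3): nz,
            (4, 2): nz, (4, 3): z, (4, 1): nz}
print(prune_edges(order, beta_hat))  # [(2, 1), (3, 1), (2, 4), (1, 4)]
```

The surviving edges match the pruned DAG of Figure 1: X_2 → X_1, X_3 → X_1, X_2 → X_4, and X_1 → X_4.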
Our SARTRE-pruning has several advantages over the existing pruning methods (Bühlmann, Peters, and Ernest 2014; Montagna et al. 2023). In terms of computational efficiency, our algorithm generates a set of intervals R_j for each variable in an unsupervised manner (lines 1–4) before the pruning procedure. Since the generated intervals R_j are independent of any target variable, we can reuse R_j for all the variables whose candidate parents include X_j. Thus, all we need in the pruning procedure is to optimize the coefficient vector β̂_i for each X_i (lines 6–13), which is more efficient than fitting an additive model from scratch. Furthermore, our algorithm obtains a sparse additive model for each variable X_i that encourages as many shape functions g_{i,j} as possible to be zero (Ravikumar et al. 2009). This enables us to directly identify spurious edges without multiple testing.

3.3 Theoretical Analysis

Finally, we discuss the representation ability of our additive model compared to the standard one used in CAM-pruning. As a shape function, existing implementations of CAM-pruning often employ a smoothing spline (Hastie, Tibshirani, and Friedman 2009), which is a smooth piecewise polynomial function. In contrast, our shape function g_{i,j} does not include polynomial terms and consists only of a linear combination of weighted indicator functions over a set of intervals. Despite this simple structure, Proposition 1 shows that our shape function can express any continuous function arbitrarily well.

Proposition 1. For a variable X_j, assume X_j ∈ [a_j, b_j] for some a_j < b_j. Then, for any continuous function g* : [a_j, b_j] → R and any ε > 0, there exists a shape function g_{i,j} of our form such that max_{x_j ∈ [a_j, b_j]} |g*(x_j) − g_{i,j}(x_j)| < ε.

Proof (sketch).
By definition, our shape function g_{i,j} can be expressed as a piecewise constant function, which is a universal approximator of continuous functions (Cybenko 1989). This fact implies the existence of a g_{i,j} that approximates a given continuous function g* arbitrarily well.

Proposition 1 suggests that our additive model potentially has a rich representation ability, even though its shape functions are simpler than those used in CAM-pruning. Unfortunately, it is not trivial to determine a sufficient number of intervals l_j or to learn the appropriate coefficients β_{i,j}. While we leave these questions for future work, in the next section we empirically demonstrate that our method works well in practice.

Figure 5: Experimental results of the baseline comparison by varying the number of samples n from 1000 to 5000 on the ER1 and ER4 datasets (SHD, SID, and running time). Our SARTRE was faster than the baselines without significantly degrading SHD and SID.

4 Experiments

To investigate the efficacy of our framework, we conducted numerical experiments on synthetic and real datasets. All the code was implemented in Python 3.10. All the experiments were conducted on macOS Sequoia with an Apple M2 Ultra CPU and 128 GB of memory. Due to page limitations, the complete results are shown in the Appendix.

4.1 Experimental Setup

Datasets  We examine the performance of our method on synthetic datasets generated from a nonlinear ANM with Gaussian noise. Following Rolland et al. (2022), we generated the link functions f_i by sampling Gaussian processes with a unit-bandwidth RBF kernel.
We considered two types of causal DAGs: Erdős–Rényi (ER) and scale-free (SF) models. For a fixed number of variables d, we varied the sparsity of the sampled graph by setting the average number of edges to d or 4d (denoted ER1/SF1 and ER4/SF4, respectively). We also used real and semi-real datasets: Sachs (Sachs et al. 2005) and fMRI (Smith et al. 2011).

Baselines  We compare our method (SARTRE) with two existing baselines. One baseline is SCORE (Rolland et al. 2022), which estimates a topological order by score matching and then prunes spurious edges by CAM-pruning. The other baseline is DAS (Montagna et al. 2023), a fast variant of SCORE that filters redundant candidate parents before applying CAM-pruning. Some other methods, such as CAM (Bühlmann, Peters, and Ernest 2014), NOTEARS (Zheng et al. 2018, 2020), and GraN-DAG (Lachapelle et al. 2020), are not included in our comparison since they were outperformed by SCORE in (Rolland et al. 2022). We implemented our SARTRE by combining the ordering step of SCORE with Algorithm 1. Note that all methods use the same ordering step and differ only in the pruning step. For SARTRE, we set λ = 0.1 because it performed the best in our sensitivity analyses, which are shown in the Appendix.

Figure 6: Experimental results for the high-dimensional cases on the ER1 and ER4 datasets (SHD, SID, and running time for DAS and SARTRE). We varied the number of variables d ∈ {64, 128, 256, 512}. We observed that our SARTRE was faster and attained better SHD and SID than DAS.
We also set the total number of trees to 5 and the maximum number of leaves of each tree in the ensemble h_j to 8, which means that the total number of intervals l_j is at most 40. We repeated each experiment 10 times and report the average values of (i) the structural Hamming distance (SHD), (ii) the structural intervention distance (SID) (Peters and Bühlmann 2015), and (iii) the running time [s] of each method.

4.2 Experimental Results

Baseline Comparison  First, we evaluate the performance of each method by varying the number of variables d. Figure 4 shows the average SHD, SID, and running time of each method on the synthetic datasets with n = 2000. We can see that our SARTRE was significantly faster than the baselines as the number of variables d approached 50. Furthermore, while SARTRE maintained SHD and SID comparable to the baselines on ER4 and SF4, it achieved the best SHD and SID on ER1 and SF1. This indicates that SARTRE was more accurate than the baselines when the underlying causal graph was sparse, while remaining competitive on dense graphs. In summary, we confirmed that our SARTRE achieved a significant speedup compared to the baselines while maintaining comparable or superior accuracy.

We also compare the methods by varying the number of samples n. Figure 5 presents the results on the ER1 and ER4 datasets with d = 20. We observed that the running time of DAS approached that of SCORE as the number of samples n increased, while SARTRE remained faster than both baselines. In terms of accuracy, SCORE attained better SHD and SID than SARTRE on ER4. One possible reason is that the hypothesis testing of CAM-pruning remains accurate when n is large relative to d and the underlying DAG is dense. Nevertheless, SARTRE outperformed SCORE on ER1 and DAS on both ER1 and ER4. These results indicate that our SARTRE often performed well with respect to the number of samples as well.

Table 1: Experimental results on real datasets. We sampled each dataset by bootstrap sampling and repeated this procedure for 10 trials. We can see that our SARTRE was faster than the baselines while maintaining comparable accuracy.

            Sachs                  fMRI
Method    SHD    SID    Time    SHD    SID    Time
SCORE     43.2   102.4  14.7    19.6   71.8   11.4
DAS       27.6   70.3   6.42    11.6   58.6   4.75
SARTRE    22.7   58.0   3.62    12.9   60.0   3.18

High-Dimensional Case  Next, we examine our method in high-dimensional cases. We varied d over {64, 128, 256, 512} and fixed n = 2000. Due to the high dimensionality, we omitted the ordering step of each method and used the ground-truth topological order as input. Figure 6 shows the results on the ER1 and ER4 datasets for DAS and SARTRE. We observed that SARTRE was faster than DAS while maintaining better SHD and SID in all cases. From these results, we confirmed that our SARTRE performed better than the existing scalable algorithm even in high-dimensional cases.

Real Datasets  Finally, we evaluate our method on real and semi-real datasets. We sampled a subset of each dataset by bootstrap sampling with n = 2000. Table 1 shows the average SHD, SID, and running time of each method on the Sachs and fMRI datasets over 10 trials. We observed that SARTRE was faster than the baselines without significantly degrading SHD and SID. These results suggest that our SARTRE performed well on real datasets, as well as on synthetic datasets.

5 Related Work

Causal structure learning, also referred to as causal discovery, has been a fundamental task in statistics, machine learning, and artificial intelligence for decades (Pearl 2009; Peters, Janzing, and Schölkopf 2017). Existing studies can be categorized into three main approaches, as well as mixtures of them: constraint-based (Spirtes and Glymour 1991), score-based (Chickering 2003), and functional model-based methods (Shimizu et al. 2006).
This paper focuses on functional model-based methods and, in particular, considers the additive noise model (ANM) (Peters et al. 2014), a popular and widely studied model in the literature.

To estimate a causal DAG from observational data, several algorithms have been proposed (Shimizu et al. 2011; Ghoshal and Honorio 2018; Zheng et al. 2018; Cai et al. 2018; Lachapelle et al. 2020; Zheng et al. 2020). In this paper, we focus on the order-based approach (Teyssier and Koller 2005), which first estimates a topological order of the underlying causal DAG and then prunes spurious edges from the fully-connected DAG induced by the estimated order. Existing studies often focus on the former step and have proposed several methods for estimating a topological order, such as CAM (Bühlmann, Peters, and Ernest 2014), RESIT (Peters et al. 2014), SCORE (Rolland et al. 2022), and others (Xu et al. 2024; Wang et al. 2021; Sanchez et al. 2023). For the latter step, most of these methods employ CAM-pruning (Bühlmann, Peters, and Ernest 2014), which identifies spurious edges by nonlinear feature selection based on GAMs. One exception is DAS (Montagna et al. 2023), which accelerates CAM-pruning by removing redundant candidate parents in advance by leveraging score matching (Rolland et al. 2022). In contrast, this paper proposes an alternative pruning method that addresses the limitations of CAM-pruning in terms of efficiency and accuracy. We demonstrated that our method can significantly accelerate existing order-based causal structure learning algorithms without compromising accuracy, enabling their application to more high-dimensional settings.

Our SARTRE can be regarded as a new nonlinear variable selection method, as well as a new learning algorithm for sparse additive models (Ravikumar et al. 2009; Haris, Simon, and Shojaie 2022).
It combines randomized tree embedding (Moosmann, Triggs, and Jurie 2006; Geurts, Ernst, and Wehenkel 2006; Feng and Zhou 2018) with group lasso regression (Yuan and Lin 2006; Massias, Gramfort, and Salmon 2018). While frameworks for learning GAMs with tree-based shape functions exist (Lou et al. 2013), they are not designed to encourage sparsity. Since variable selection plays a crucial role in many applications, our SARTRE has the potential to be applied to various tasks beyond causal structure learning (Marra and Wood 2011).

6 Conclusion

This paper proposed a new pruning method that enhances the efficiency and accuracy of nonlinear order-based causal structure learning. To address the limitations of the existing pruning method based on additive models with hypothesis testing, we introduced a new framework, named SARTRE, for learning a sparse additive model by combining randomized tree embedding and group-wise sparse regression techniques. Our method can efficiently learn a sparse additive model for each variable and its candidate parents, which enables us to directly prune redundant edges without hypothesis testing. Experimental results demonstrated that our method was significantly faster than the existing methods while maintaining comparable or superior accuracy.

Limitations and Future Work One limitation of our method is that some hyperparameters need to be determined in advance. In our experiments, we used the same values for λ and l_j across all settings and confirmed that our method performed stably well in all situations. However, it would be beneficial to develop a method that can automatically tune these values based on the data. Another limitation is the lack of theoretical guarantees for our pruning quality. While we showed that our SARTRE has rich representation ability in Proposition 1, guaranteeing the correctness of pruning remains an open problem.
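To make the combination described above concrete, here is a minimal, self-contained sketch of the core idea: interval-indicator features standing in for the leaves of a randomized tree ensemble, followed by a group lasso with one coefficient group per candidate parent. The helper names (`interval_features`, `group_lasso`), the proximal-gradient solver, and the toy data are our own simplifications for illustration; this is not the authors' implementation, which uses a tree ensemble and the celer solver.

```python
# Sketch of sparse-additive-model pruning for one target variable (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def interval_features(x, n_cuts=8):
    """One-hot indicators of randomized intervals of x (a stand-in for the
    leaf indicators of an extremely randomized tree ensemble)."""
    cuts = np.sort(rng.uniform(x.min(), x.max(), n_cuts))
    idx = np.searchsorted(cuts, x)        # interval index in 0..n_cuts
    return np.eye(n_cuts + 1)[idx]        # (n, n_cuts + 1) indicator matrix

def group_lasso(Phi, y, groups, lam=0.1, n_iter=2000):
    """Proximal gradient descent for (1/2n)||y - Phi b||^2 + lam * sum_g ||b_g||_2."""
    n, p = Phi.shape
    beta = np.zeros(p)
    step = 1.0 / (np.linalg.norm(Phi, 2) ** 2 / n)   # 1 / Lipschitz constant
    for _ in range(n_iter):
        beta = beta - step * (Phi.T @ (Phi @ beta - y) / n)
        for g in groups:                              # block soft-thresholding
            norm = np.linalg.norm(beta[g])
            beta[g] = 0.0 if norm <= step * lam else (1 - step * lam / norm) * beta[g]
    return beta

# Toy data: y depends nonlinearly on X[:, 0] only; X[:, 1] is a spurious parent.
n = 500
X = rng.uniform(-1, 1, (n, 2))
y = np.sin(3 * X[:, 0]) + 0.05 * rng.standard_normal(n)

blocks = [interval_features(X[:, j]) for j in range(2)]
Phi = np.hstack(blocks)
starts = np.cumsum([0] + [b.shape[1] for b in blocks])
groups = [np.arange(starts[j], starts[j + 1]) for j in range(2)]

beta = group_lasso(Phi, y, groups, lam=0.05)
parents = [j for j, g in enumerate(groups) if np.linalg.norm(beta[g]) > 1e-8]
print(parents)
```

Candidate parents whose coefficient group is driven exactly to zero by the block soft-thresholding step are discarded, which is what lets this style of pruning dispense with hypothesis testing.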
Finally, we need to examine our method in more general and practical settings. For example, investigating the performance of our method in the presence of latent confounders is important future work.

Ethical Statement

Our proposed method, named SARTRE, is a new pruning algorithm for nonlinear causal structure learning from observational data. We believe that our method can be applied to enhance the understanding of complex systems across various fields, including biology, economics, and the social sciences. It enables researchers to discover causal relationships in data that may not be readily observable, leading to more informed decision-making. We acknowledge that the use of causal structure learning methods can have significant ethical implications, particularly in sensitive areas such as healthcare or criminal justice. To prevent potential misuse, we need to ensure that our method is used responsibly and ethically. Overall, we believe that our method can have a positive impact on society by enhancing the understanding of complex systems and facilitating better decision-making, provided it is used adequately.

Acknowledgments

We wish to thank Yuta Fujishige, Shun Yanashima, and Ryosuke Ozeki for making a number of valuable suggestions. We also thank the anonymous reviewers for their insightful comments.

References

Breiman, L. 2001. Random Forests. Machine Learning, 45(1): 5–32.

Bühlmann, P.; Peters, J.; and Ernest, J. 2014. CAM: Causal additive models, high-dimensional order search and penalized regression. Annals of Statistics, 42(6): 2526–2556.

Cai, R.; Qiao, J.; Zhang, Z.; and Hao, Z. 2018. SELF: Structural Equational Likelihood Framework for Causal Discovery. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 1787–1794.

Chickering, D. M. 1996. Learning Bayesian Networks is NP-Complete. In Learning from Data: Artificial Intelligence and Statistics V, 121–130.

Chickering, D. M.
2003. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3: 507–554.

Cybenko, G. 1989. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4): 303–314.

Feng, J.; and Zhou, Z.-H. 2018. AutoEncoder by Forest. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 2967–2973.

Friedman, J.; Hastie, T.; and Tibshirani, R. 2010. A note on the group lasso and a sparse group lasso. arXiv.

Friedman, J. H. 2000. Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics, 29: 1189–1232.

Friedman, J. H.; and Popescu, B. E. 2008. Predictive learning via rule ensembles. The Annals of Applied Statistics, 2(3): 916–954.

Geurts, P.; Ernst, D.; and Wehenkel, L. 2006. Extremely randomized trees. Machine Learning, 63(1): 3–42.

Ghoshal, A.; and Honorio, J. 2018. Learning linear structural equation models in polynomial time and sample complexity. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics, 1466–1475.

Haris, A.; Simon, N.; and Shojaie, A. 2022. Generalized Sparse Additive Models. Journal of Machine Learning Research, 23(70): 1–56.

Hastie, T.; and Tibshirani, R. 1986. Generalized Additive Models. Statistical Science, 1(3): 297–310.

Hastie, T.; Tibshirani, R.; and Friedman, J. H. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition. Springer Series in Statistics. Springer.

Huang, B.; Zhang, K.; Lin, Y.; Schölkopf, B.; and Glymour, C. 2018. Generalized Score Functions for Causal Discovery. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1551–1560.

Lachapelle, S.; Brouillard, P.; Deleu, T.; and Lacoste-Julien, S. 2020. Gradient-Based Neural DAG Learning. In Proceedings of the 8th International Conference on Learning Representations.

Li, Y.; and Turner, R.
E. 2018. Gradient Estimators for Implicit Models. In Proceedings of the 6th International Conference on Learning Representations.

Lou, Y.; Caruana, R.; Gehrke, J.; and Hooker, G. 2013. Accurate Intelligible Models with Pairwise Interactions. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 623–631.

Marra, G.; and Wood, S. N. 2011. Practical variable selection for generalized additive models. Computational Statistics & Data Analysis, 55(7): 2372–2387.

Massias, M.; Gramfort, A.; and Salmon, J. 2018. Celer: a Fast Solver for the Lasso with Dual Extrapolation. In Proceedings of the 35th International Conference on Machine Learning, 3315–3324.

Montagna, F.; Noceti, N.; Rosasco, L.; Zhang, K.; and Locatello, F. 2023. Scalable Causal Discovery with Score Matching. In Proceedings of the 2nd Conference on Causal Learning and Reasoning, 752–771.

Moosmann, F.; Triggs, B.; and Jurie, F. 2006. Fast Discriminative Visual Codebooks using Randomized Clustering Forests. In Proceedings of the 19th Annual Conference on Neural Information Processing Systems, 985–992.

Pearl, J. 2009. Causality: Models, Reasoning and Inference. Cambridge University Press, 2nd edition.

Peters, J.; and Bühlmann, P. 2015. Structural Intervention Distance for Evaluating Causal Graphs. Neural Computation, 27(3): 771–799.

Peters, J.; Janzing, D.; and Schölkopf, B. 2017. Elements of Causal Inference: Foundations and Learning Algorithms. The MIT Press.

Peters, J.; Mooij, J. M.; Janzing, D.; and Schölkopf, B. 2014. Causal Discovery with Continuous Additive Noise Models. Journal of Machine Learning Research, 15(58): 2009–2053.

Ravikumar, P.; Lafferty, J.; Liu, H.; and Wasserman, L. 2009. Sparse Additive Models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 71(5): 1009–1030.
Rolland, P.; Cevher, V.; Kleindessner, M.; Russell, C.; Janzing, D.; Schölkopf, B.; and Locatello, F. 2022. Score Matching Enables Causal Discovery of Nonlinear Additive Noise Models. In Proceedings of the 39th International Conference on Machine Learning, 18741–18753.

Sachs, K.; Perez, O.; Pe'er, D.; Lauffenburger, D. A.; and Nolan, G. P. 2005. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721): 523–529.

Sanchez, P.; Liu, X.; O'Neil, A. Q.; and Tsaftaris, S. A. 2023. Diffusion Models for Causal Discovery via Topological Ordering. In Proceedings of the 11th International Conference on Learning Representations.

Shimizu, S.; Hoyer, P. O.; Hyvärinen, A.; and Kerminen, A. 2006. A Linear Non-Gaussian Acyclic Model for Causal Discovery. Journal of Machine Learning Research, 7: 2003–2030.

Shimizu, S.; Inazumi, T.; Sogawa, Y.; Hyvärinen, A.; Kawahara, Y.; Washio, T.; Hoyer, P. O.; and Bollen, K. 2011. DirectLiNGAM: A Direct Method for Learning a Linear Non-Gaussian Structural Equation Model. Journal of Machine Learning Research, 12: 1225–1248.

Smith, S. M.; Miller, K. L.; Salimi-Khorshidi, G.; Webster, M.; Beckmann, C. F.; Nichols, T. E.; Ramsey, J. D.; and Woolrich, M. W. 2011. Network modelling methods for FMRI. NeuroImage, 54(2): 875–891.

Spirtes, P.; and Glymour, C. 1991. An algorithm for fast recovery of sparse causal graphs. Social Science Computer Review, 9: 67–72.

Teyssier, M.; and Koller, D. 2005. Ordering-based search: a simple and effective algorithm for learning Bayesian networks. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, 584–590.

Wang, X.; Du, Y.; Zhu, S.; Ke, L.; Chen, Z.; Hao, J.; and Wang, J. 2021. Ordering-Based Causal Discovery with Reinforcement Learning. In Proceedings of the 30th International Joint Conference on Artificial Intelligence, 3566–3573.
Xu, Z.; Li, Y.; Liu, C.; and Gui, N. 2024. Ordering-based causal discovery for linear and nonlinear relations. In Proceedings of the 38th Annual Conference on Neural Information Processing Systems, 4315–4340.

Yuan, M.; and Lin, Y. 2006. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1): 49–67.

Zheng, X.; Aragam, B.; Ravikumar, P.; and Xing, E. P. 2018. DAGs with NO TEARS: continuous optimization for structure learning. In Proceedings of the 32nd Annual Conference on Neural Information Processing Systems, 9492–9503.

Zheng, X.; Dan, C.; Aragam, B.; Ravikumar, P.; and Xing, E. 2020. Learning Sparse Nonparametric DAGs. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, 3414–3425.

A Omitted Proof

Proof of Proposition 1. First, we show that our shape function g_{i,j} is a piecewise constant function. For a set of intervals R_j = {r_{j,1}, ..., r_{j,l_j}} with r_{j,k} = (a_{j,k}, b_{j,k}] for k ∈ [l_j], let Γ = {γ_{j,1}, ..., γ_{j,q_j}} be the sorted set of all the boundaries a_{j,k}, b_{j,k} in R_j, where q_j is the total number of unique boundaries. Then, we can express g_{i,j} as follows:

    g_{i,j}(X_j) = Σ_{p=1}^{q_j − 1} α_p · I[X_j ∈ (γ_p, γ_{p+1}]],

where α_p is the sum of the coefficients β_{i,j,k} such that r_{j,k} ∩ (γ_p, γ_{p+1}] ≠ ∅. This shows that g_{i,j} is a piecewise constant function.

Next, we show that a piecewise constant function is a universal approximator of any continuous function over a bounded interval [a_j, b_j] ⊂ R. For any continuous function g*: [a_j, b_j] → R and ε > 0, there exists δ > 0 such that |g*(x) − g*(x′)| < ε for any x, x′ ∈ [a_j, b_j] with |x − x′| < δ. Here, we set Γ = {γ_{j,1}, ..., γ_{j,q_j}} such that γ_{p+1} − γ_p < δ for all p ∈ [q_j − 1], γ_1 = a_j, and γ_{q_j} = b_j.
We also set α_p = g*(γ_p) for all p ∈ [q_j]. Then, for any x_j ∈ [a_j, b_j], there exists p ∈ [q_j − 1] such that x_j ∈ (γ_p, γ_{p+1}] and |x_j − γ_p| ≤ γ_{p+1} − γ_p < δ. By combining the above two results, we have |g*(x_j) − g(x_j)| = |g*(x_j) − α_p| = |g*(x_j) − g*(γ_p)| < ε. Since this holds for any x_j ∈ [a_j, b_j], we have max_{x_j ∈ [a_j, b_j]} |g*(x_j) − g(x_j)| < ε, which concludes the proof.

B Complete Experimental Results

B.1 Baseline Comparison

Figures 7 and 8 show the complete experimental results of the baseline comparison on the synthetic datasets.

B.2 High-Dimensional Cases

Figure 9 shows the complete experimental results of the high-dimensional cases.

B.3 Mixture of Linear and Nonlinear Link Functions

To further investigate the performance of our method, we conducted experiments on cases where the link functions are a mixture of linear and nonlinear functions. We generated each link function f_i as either a linear function or a nonlinear function with probability 0.5. Instead of SCORE, we employed CaPS (Xu et al. 2024) as the ordering method, since it is a variant of SCORE that can handle both linear and nonlinear link functions. Figure 10 shows the experimental results. We observed a similar trend to the previous experiments shown in the main paper. These results suggest that our SARTRE performed similarly even when the link functions are a mixture of linear and nonlinear functions.

B.4 Sensitivity Analysis

We conducted sensitivity analyses on the regularization parameter λ by varying it from 0.1 to 0.3 with a step size of 0.05. In addition to the average SHD and SID, we also measured the average values of F1 score, precision, and recall. Figures 11 and 12 show the experimental results. We can see that λ = 0.1 attained the best SHD, SID, F1 score, and recall in most cases.
This suggests that there exists a parameter setting for λ that stably achieves better performance across different situations. However, we observed that the precision of λ = 0.1 was often lower than that of the others, which is likely because a larger λ leads to a sparser graph. We also observed that the performance of λ = 0.1 was often worse than the others when n = 1000, which is likely due to overfitting. We leave further investigation of the sensitivity of λ, and of how to automatically tune λ based on a given dataset, to future work.

C Additional Comments on Existing Assets

All the code used in our experiments was implemented in Python 3.10 with scikit-learn 1.5.2,¹ pyGAM 0.9.1,² and celer 0.7.3.³ scikit-learn 1.5.2 and celer 0.7.3 are publicly available under the BSD-3-Clause license. pyGAM 0.9.1 is publicly available under the Apache License 2.0. All the real datasets used in our experiments are publicly available and do not contain any identifiable information or offensive content. As they are accompanied by appropriate citations in the main body, see the corresponding references for more details. All the experiments were conducted on macOS Sequoia with an Apple M2 Ultra CPU and 128 GB of memory.
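The SHD reported throughout counts the edge insertions, deletions, and reversals needed to transform the estimated DAG into the true one. As a minimal sketch under one common convention (a reversed edge counts as a single error), where the helper `shd` and the toy matrices are our own illustration rather than the paper's evaluation code:

```python
import numpy as np

def shd(A_true, A_est):
    """Structural Hamming distance between two DAG adjacency matrices:
    the number of edge additions, deletions, and reversals needed to
    turn A_est into A_true (a reversed edge counts once)."""
    diff = np.abs(A_true - A_est)
    # A reversed edge shows up at both (i, j) and (j, i) in `diff`;
    # subtract one of the two occurrences so it counts as a single error.
    reversed_edges = np.triu(diff * diff.T)
    return int(diff.sum() - reversed_edges.sum())

A_true = np.array([[0, 1, 0],
                   [0, 0, 1],
                   [0, 0, 0]])
A_est  = np.array([[0, 0, 0],
                   [1, 0, 1],
                   [0, 0, 0]])   # edge 0 -> 1 reversed; edge 1 -> 2 correct
print(shd(A_true, A_est))        # -> 1
```

SID, in contrast, is interventional and has no comparably short closed form; see Peters and Bühlmann (2015) for its definition.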
¹ https://github.com/scikit-learn/scikit-learn
² https://github.com/dswah/pyGAM
³ https://github.com/mathurinm/celer

[Figure 7: Experimental results of baseline comparison on the synthetic datasets (SHD, SID, and running time on ER1, ER4, SF1, and SF4). The shaded areas indicate the standard deviations over 10 trials. We varied the number of variables d from 10 to 50 and fixed the number of samples n to 2000.]

[Figure 8: Experimental results of baseline comparison on the synthetic datasets. The shaded areas indicate the standard deviations over 10 trials. We varied the number of samples n from 1000 to 5000 and fixed the number of variables d to 20.]

[Figure 9: Experimental results of the high-dimensional cases on the ER1 and ER4 datasets. The shaded areas indicate the standard deviations over 10 trials. We varied the number of variables d ∈ {64, 128, 256, 512} and fixed the number of samples n to 2000.]

[Figure 10: Experimental results of the mixed cases. We generated each link function f_i as either a linear function or a nonlinear function with probability 0.5. The shaded areas indicate the standard deviations over 10 trials. While we varied the number of variables d from 10 to 50 and fixed the number of samples n to 2000 in (a), we varied the number of samples n from 1000 to 5000 and fixed the number of variables d to 20 in (b). We observed that our SARTRE performed well even when the link functions are a mixture of linear and nonlinear functions.]

[Figure 11: Experimental results of the sensitivity analysis on the regularization parameter λ. The shaded areas indicate the standard deviations over 10 trials. We varied the number of variables d from 10 to 50 and fixed the number of samples n to 2000.]

[Figure 12: Experimental results of the sensitivity analysis on the regularization parameter λ. The shaded areas indicate the standard deviations over 10 trials. We varied the number of samples n from 1000 to 5000 and fixed the number of variables d to 20.]
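As a numerical companion to the proof of Proposition 1 in Appendix A, the following sketch checks that a piecewise-constant function with breakpoints spaced less than δ apart, with α_p = g*(γ_p), stays uniformly within ε of a continuous g*. The choice of g*, the grid size, and the tolerance are arbitrary choices of ours for illustration.

```python
import numpy as np

# Numerical check of the Appendix A construction (our own illustration):
# breakpoints gamma spaced < delta, with alpha_p = g*(gamma_p), give a
# piecewise-constant function uniformly close to a continuous g*.
a, b = 0.0, 1.0
g_star = lambda x: np.sin(6 * x)      # any continuous function on [a, b]
q = 2000                               # number of breakpoints
gamma = np.linspace(a, b, q)           # gamma_{p+1} - gamma_p < delta
alpha = g_star(gamma)                  # alpha_p = g*(gamma_p)

# Evaluate the piecewise-constant approximation on a fine test grid.
x = np.linspace(a, b, 100_000)
p = np.clip(np.searchsorted(gamma, x, side="left") - 1, 0, q - 2)
approx_err = np.max(np.abs(g_star(x) - alpha[p]))
print(approx_err < 0.01)               # -> True
```

Here the spacing is 1/(q − 1) ≈ 5e-4, and since |d g*/dx| ≤ 6 the uniform error is at most about 0.003, comfortably below the 0.01 tolerance, mirroring the ε–δ argument in the proof.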