Data-driven configuration tuning of glmnet to balance accuracy and computation time

Shuhei Muroya¹* and Kei Hirose²†

¹ Joint Graduate School of Mathematics for Innovation, Kyushu University, Fukuoka, Japan
² Institute of Mathematics for Industry, Kyushu University, Fukuoka, Japan

Abstract

glmnet is a widely adopted R package for lasso estimation due to its computational efficiency. Despite its popularity, glmnet sometimes yields solutions that are substantially different from the true ones because of the inappropriate default configuration of the algorithm. The accuracy of the obtained solutions can be improved by appropriately tuning the configuration. However, improving accuracy typically increases computational time, resulting in a trade-off between accuracy and computational efficiency. Therefore, it is essential to establish a systematic approach to determining an appropriate configuration. To address this need, we propose a unified data-driven framework specifically designed to optimize the configuration by balancing the trade-off between accuracy and computational efficiency. We generate large-scale simulated datasets and apply glmnet under various configurations to obtain accuracy and computation time. Based on these results, we construct neural networks that predict accuracy and computation time from data characteristics and configuration. Given a new dataset, our framework uses the neural networks to explore the configuration space and derive a Pareto front that represents the trade-off between accuracy and computational cost. This front allows us to automatically identify the configuration that maximizes accuracy under a user-specified time constraint. The proposed method is implemented in the R package glmnetconf, available at https://github.com/Shuhei-Muroya/glmnetconf.git.
Keywords: lasso, glmnet, hyperparameter optimization, computational efficiency

1 Introduction

The lasso (least absolute shrinkage and selection operator; Tibshirani, 1996) is a popular method for regression that uses an ℓ₁ penalty to obtain sparse regression coefficients. The lasso can be applied to high-dimensional data, where the number of predictors exceeds the number of observations, and it provides interpretable results. Due to these features, the lasso is widely applied across various fields such as signal processing (Candès and Wakin, 2008), genomics (Bøvelstad et al., 2007), and astronomy (Lu and Li, 2015).

∗ Email: muroya.shuhei.697@s.kyushu-u.ac.jp
† Email: hirose@imi.kyushu-u.ac.jp

Figure 1: [Three panels showing coefficient paths against log(λ): (a) glmnet (default), (b) glmnet (manual), (c) LARS.] The solution path for the same dataset by each package. The experimental setting is identical to that in Section 4, with N = 1500, p = 800, ρ = 0.5. For clarity, we display the solution path for only the first 10 coefficients to avoid visual congestion. glmnet (default) denotes the estimator by glmnet using the default configuration, whereas glmnet (manual) denotes the estimator by glmnet whose configuration is manually optimized by the authors. LARS denotes the estimator by the LARS algorithm, which provides the exact solution path and thus serves as a reference (ground truth). By comparison, the results of glmnet (manual) are seen to be close to those of LARS.

Let N be the number of observations and p be the number of predictors. Let X ∈ ℝ^{N×p} be the design matrix with rows x_i ∈ ℝ^p for i = 1, …, N, and let y ∈ ℝ^N be the response vector. We assume that the explanatory variables are standardized and the response vector is centered.
Under these assumptions, the lasso estimates the coefficient vector β ∈ ℝ^p by solving

    minimize_β  (1/(2N)) ‖y − Xβ‖₂² + λ‖β‖₁,    (1)

where λ > 0 is a regularization parameter, and ‖·‖₂ and ‖·‖₁ denote the ℓ₂- and ℓ₁-norms, respectively. Since the lasso solution does not generally have a closed-form expression because of the non-differentiability of the ℓ₁ norm, various algorithms have been proposed to solve this problem (Fu, 1998; Osborne et al., 2000; Efron et al., 2004; Daubechies et al., 2004; Beck and Teboulle, 2009; Friedman et al., 2010; Boyd et al., 2011). In particular, the coordinate descent algorithm (Fu, 1998; Friedman et al., 2010) and the Least Angle Regression (LARS) algorithm (Efron et al., 2004) have been widely used. The coordinate descent algorithm provides a fast approximate solution by iteratively updating each coefficient. In contrast, LARS yields the exact entire solution path for the lasso problem at a higher computational cost. The coordinate descent algorithm and the LARS algorithm are implemented in the R packages glmnet and lars, respectively. The glmnet package is especially widely used due to its computational efficiency. In fact, it was downloaded over 1.3 million times in 2024, more than ten times the downloads of lars, according to the CRAN download logs provided by cranlogs (Csárdi, 2019). However, our numerical experiments reveal that the glmnet solution path can differ significantly from the exact solution path for correlated high-dimensional data. These discrepancies appear to be caused by the default settings in glmnet. In particular, the convergence threshold and the specification of the λ sequence play critical roles. Hereafter, we refer to these settings as the configuration of glmnet.
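To make the objective in (1) concrete, the following is a minimal numerical sketch of its evaluation. This is a Python/NumPy illustration for exposition only; the paper's software is the R package glmnet, and the function name `lasso_objective` is ours.

```python
import numpy as np

def lasso_objective(X, y, beta, lam):
    """Objective (1): (1/(2N)) * ||y - X beta||_2^2 + lam * ||beta||_1."""
    N = X.shape[0]
    residual = y - X @ beta
    return residual @ residual / (2 * N) + lam * np.abs(beta).sum()
```

At β = 0 this reduces to ‖y‖₂²/(2N) (the null deviance scale used by the convergence criterion discussed in Section 2.2), and the penalty contributes λ‖β‖₁ on top of the quadratic loss.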
To illustrate how the configuration affects the results, Figure 1 compares the solution paths of the first 10 coefficients from three estimators for a given dataset: glmnet (default), glmnet (manual), and LARS. The glmnet (default) and glmnet (manual) labels denote the estimators obtained using the default and manually tuned configurations, respectively. The label LARS corresponds to the exact solution path computed by the lars package. As shown in the figure, the solution path of glmnet (default) is substantially different from that of LARS, whereas the path of glmnet (manual) is much closer to that of LARS. This result implies that appropriate tuning of the configuration is crucial for obtaining an accurate solution path. In practice, users often rely on the default configuration without knowing its critical impact. This is partly because glmnet returns results without warning, even when the default configuration is inappropriate for the given dataset. Furthermore, manual tuning is rarely performed, as it requires expert knowledge of the underlying algorithm. Crucially, improving accuracy typically increases computational time, resulting in a trade-off between accuracy and computational efficiency. While an appropriate configuration should ideally be determined for each dataset to balance this trade-off, a systematic approach for such tuning has not yet been established. Therefore, we propose a data-driven framework that automatically determines an appropriate configuration of glmnet based on the characteristics of the dataset. Specifically, we generate large-scale artificial datasets covering a wide range of data characteristics. For each dataset, we solve the lasso problem using glmnet under various configurations. Additionally, we employ LARS to obtain the exact solution path as a reference for evaluating accuracy.
We then record the accuracy of the glmnet solution path and its computation time, including cross-validation. These simulation results are used to train a neural network that learns the relationship among data characteristics, configuration, and the corresponding performance. Once trained, the neural network can predict accuracy and computation time for a new dataset and configuration. Based on these predictions, the Pareto front is derived to capture the trade-off between accuracy and computation time. From this front, our proposed framework automatically selects the configuration that achieves the highest possible accuracy without exceeding the user-specified computation time. A key feature of our framework is its ability to explicitly manage the trade-off between accuracy and computation time. This capability enables users to perform configuration tuning that explicitly accounts for computational costs. Furthermore, the tuning process itself is fast, thereby keeping the total run time shorter than that of LARS.

The organization of the paper is as follows. In Section 2, we present the background and motivation for configuration tuning. In Section 3, we explain our proposed method and demonstrate how to tune the configuration of the glmnet function based on the dataset. Section 4 evaluates the performance of our proposed method through numerical simulations and an application to compressed sensing.

2 Algorithmic details and default configuration in glmnet

This section reviews the computational details of the coordinate descent algorithm and the default configuration of glmnet. We specifically discuss why this default configuration can result in inappropriate solutions. We also briefly describe the LARS algorithm as a reference for the exact solution path.
2.1 Coordinate descent algorithm

The R package glmnet implements the coordinate descent algorithm to solve the lasso problem efficiently (Friedman et al., 2010). For a given value of λ, the coordinate descent algorithm computes an approximate solution through an iterative procedure. To obtain the entire solution path, the algorithm is repeatedly applied over a sequence of λ values. Linearly interpolating these solutions yields an approximate solution path.

The coordinate descent algorithm optimizes one coefficient at a time while holding the others fixed, and cycles through all coefficients until convergence. For a fixed value of the regularization parameter λ, the glmnet package minimizes the objective function in (1) by iteratively updating each coefficient using the coordinate descent algorithm. At iteration t, the update for the j-th coefficient β_j^{t+1} is given by

    β_j^{t+1} = S_λ( (1/N) X_j^⊤ r^{(j)} ),

where X_j denotes the j-th column of X, S_λ(z) = sign(z)(|z| − λ)₊ is the soft-thresholding operator, and r^{(j)} represents the partial residual vector with elements r_i^{(j)} = y_i − Σ_{k≠j} x_{ik} β_k^t. This procedure cyclically updates all coefficients until convergence.

2.2 Configuration details: the convergence threshold and the sequence of λ

This subsection examines the roles and default settings of two key components: the convergence threshold and the sequence of λ values. They are commonly used in both the glmnet() and cv.glmnet() functions of the glmnet package. Here, the function glmnet() computes a solution path on a grid of λ values, while cv.glmnet() performs cross-validation to select the optimal λ from this path.

The convergence threshold. The convergence threshold, denoted by τ, determines the stopping criterion for the coordinate descent algorithm.
Specifically, the iterative updates terminate when the improvement in the objective function falls below the product of τ and the null deviance. The smaller τ is, the stricter the stopping condition becomes, which can improve the accuracy of the solution but also increase computation time. The default value of τ is 10⁻⁷.

The λ sequence. The sequence of λ values defines the grid over which the lasso solution path is computed, as discussed in the previous subsection. Extending the range of λ and refining the grid yields a more accurate solution path, but increases computation time. The λ sequence is determined by its range, defined by the maximum value λ_max and the minimum value λ_min, and the number of grid points n_λ. In the glmnet package, the default values for these parameters are specified as follows. λ_max is defined as the smallest λ for which all coefficients are zero, given by

    λ_max = max_j (1/N) |X_j^⊤ y|.

The default lower bound λ_min^def is defined as:

    λ_min^def = 10⁻² λ_max if N < p,  and  10⁻⁴ λ_max if N ≥ p.

The default sequence of λ values is then generated on a logarithmic scale from λ_max down to λ_min^def with n_λ^def = 100 points.

2.3 Limitations of the default configuration

We investigate why the default configuration of glmnet may lead to inaccurate results for highly correlated datasets.

The convergence threshold. Although the default threshold of 10⁻⁷ is computationally efficient, our numerical experiments suggest that a stricter threshold is often necessary to ensure accuracy. There are two main reasons for this requirement. First, high correlations among predictors lead to a flat lasso objective function. In such cases, the coordinate descent algorithm tends to move in a zig-zag pattern with very small update steps. Consequently, the improvement in the objective function at each step becomes extremely small, often causing the algorithm to terminate prematurely.
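The effect of the threshold τ can be illustrated with a small coordinate-descent sketch. This is a Python illustration, not the glmnet internals: the stopping rule below (objective improvement per full sweep below τ times the null deviance) is a simplified stand-in for glmnet's internal criterion, and `coordinate_descent` is our own name.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator S_lam(z) = sign(z) * (|z| - lam)_+."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def coordinate_descent(X, y, lam, tau, max_sweeps=10_000):
    """Cyclic coordinate descent for objective (1).

    Stops when the objective improvement over one full sweep drops below
    tau * (null deviance) -- a simplified stand-in for glmnet's rule.
    """
    N, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / N      # (1/N) X_j' X_j (= 1 if standardized)
    null_dev = y @ y / (2 * N)             # objective value at beta = 0
    r = y.copy()                           # full residual, kept up to date

    def objective():
        return r @ r / (2 * N) + lam * np.abs(beta).sum()

    prev = objective()
    for _ in range(max_sweeps):
        for j in range(p):
            r += X[:, j] * beta[j]                               # partial residual r^(j)
            beta[j] = soft_threshold(X[:, j] @ r / N, lam) / col_sq[j]
            r -= X[:, j] * beta[j]
        cur = objective()
        if prev - cur < tau * null_dev:    # improvement below threshold: stop
            break
        prev = cur
    return beta
```

On highly correlated predictors, a loose τ makes the loop exit after the first slow, zig-zagging sweeps, while a stricter τ continues along the same monotone descent path and can only reach an equal or lower objective value.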
Indeed, Massias et al. (2018) note that stopping rules based only on changes in the primal objective can lead to suboptimal solutions, and they recommend monitoring the duality gap as a more rigorous criterion. However, since our aim is to improve glmnet without modifying its internal source code, we do not adopt the duality-gap criterion. Instead, we address this issue by using a significantly stricter threshold to improve the numerical precision.

Second, a stricter threshold is necessary to provide a more accurate initialization for the warm start strategy. As already mentioned, the algorithm is repeatedly applied over a sequence of λ values, denoted by λ₁ = λ_max > λ₂ > ⋯ > λ_{n_λ} = λ_min. In this sequential process, the algorithm employs a warm start strategy, where the solution obtained at the previous λ is used to initialize the optimization for the current λ. If the optimization at the previous step stops too early due to a loose threshold, the resulting suboptimal solution provides an inaccurate starting point for the next step. With a flat objective function, the solver may fail to move sufficiently away from this poor initialization, because the update steps are small and the stopping criterion is satisfied. Consequently, the accumulation of such errors may cause the computed solution to deteriorate progressively. Therefore, maintaining a tight convergence threshold is essential to prevent this error accumulation and to improve the reliability of the entire solution path.

The λ sequence.

• Range of the sequence. The default sequence spans from λ_max down to either 10⁻² λ_max or 10⁻⁴ λ_max, depending on whether N < p. However, numerical experiments indicate that this range is sometimes too narrow to fully capture the behavior of the true solution path.
In particular, when compared with the exact path obtained by LARS, the default sequence often fails to explore the region of sufficiently small λ, where additional changes in the zero–nonzero pattern can occur. If these regions are omitted, the solution path computed by glmnet may miss important structural changes in the coefficients. From the viewpoint of cross-validation, a narrow range restricts the diversity of candidate models. More critically, the default sequence may fail to include the optimal λ, since the optimal λ tends to be small when the correlation among predictors is high (Hebiri and Lederer, 2013). Therefore, extending the range of the λ sequence towards zero is essential to increase the probability that the optimal λ is included in the candidate set.

• Number of grid points. The default number of grid points is n_λ^def = 100. However, this fixed number may be insufficient relative to the dimension p. The LARS algorithm (Section 2.4) implies that the active set of the lasso solution changes at least min{N, p} times along the path. Thus, when N and p are larger than 100, the default grid cannot capture all changes in the true solution path, and linear interpolation between coarse grid points may degrade the accuracy of the approximated path. From the perspective of cross-validation, a small number of λ candidates means that the search space for selecting λ becomes too limited, which can result in suboptimal model selection.

The discussions above demonstrate that the default configuration, which is independent of the dataset, is insufficient to maintain numerical accuracy. Although manual tuning of the configuration is possible, an automated approach adapted to the dataset is highly desirable in practice.
Therefore, we propose a data-driven automated framework that determines the appropriate configuration to achieve accuracy comparable to LARS, while maintaining computational efficiency.

2.4 LARS algorithm and solution path accuracy

Efron et al. (2004) proposed the Least Angle Regression (LARS) algorithm, which provides an exact computation of the entire solution path of the lasso problem (1). The algorithm begins at λ = ∞, where the lasso solution is trivially 0 ∈ ℝ^p. As λ decreases, it computes a piecewise linear and continuous solution path. Each knot along this path corresponds to a point where the active set A = {j : β_j(λ) ≠ 0} changes. At every iteration, the algorithm updates the direction of the coefficient path so that the Karush–Kuhn–Tucker (KKT) optimality conditions remain satisfied. To determine this direction, the algorithm must compute the inverse of the Gram matrix (X_A^⊤ X_A)⁻¹, where X_A denotes the submatrix of active predictors. Because the active set A changes sequentially along the path, the LARS algorithm requires such matrix inversions to be performed at least min{N, p} times. Consequently, the computational cost increases rapidly with the number of variables p.

In this study, we utilize LARS as a reference in three ways: (i) the exact path serves as the ground truth for evaluating approximation accuracy, (ii) the exact number of knots is used to investigate the validity of the default λ grid in glmnet, and (iii) the computation time provides an upper bound for efficiency comparisons.

3 Proposed method

3.1 Overview of our proposal

Our framework aims to automatically determine the appropriate configuration for a given dataset. Specifically, it aims to maximize accuracy given a user-specified computation time, denoted as T_hope. To this end, we focus on tuning two key parameters: the convergence threshold τ and the sequence length n_λ.
The detailed definition of n_λ is provided in Section 3.2. Figure 2 illustrates the overall workflow of our proposal, which consists of two main steps:

• Step 1: Construction of the predictive model (Section 3.2). The upper panel of Figure 2 shows the preparatory stage. Starting from diverse simulation parameters, we generate a summary dataset to train a predictive model, referred to as glmnet-MLP. This model learns the mapping between dataset characteristics (e.g., N, p, γ), configurations (τ, n_λ), and the resulting performance metrics, specifically the computation time and the Solution Path Error (SPE). Sections 3.2.1 and 3.2.2 provide the details of this process, including the formal definition of SPE, the generation of the summary dataset, and the training strategy.

• Step 2: Configuration tuning using the predictive model (Section 3.3). The lower panel of Figure 2 presents the execution phase. Given a new dataset, the framework extracts its features and utilizes the trained glmnet-MLP to predict performance. Finally, by deriving the Pareto front of the predicted SPE and computation time, the best configuration is automatically selected to maximize accuracy while satisfying the time constraint T_hope. The details of this tuning strategy are provided in Section 3.3.2.

Figure 2: [Flowchart. Step 1: simulation parameters (e.g., N, p, Σ, β) → generate summary dataset (e.g., N, p, γ, SPE, T_glmnet) (Section 3.2.1) → train MLP (Section 3.2.2) → predictive model (glmnet-MLP). Step 2: new dataset (X*, y*) → extract dataset features (e.g., N, p, γ) → predict SPE and computation time using the loaded glmnet-MLP → derive Pareto front and select configuration under time constraint T_hope (Section 3.3.2) → best configuration.] Overview of the proposed framework.
The process consists of two phases: Step 1 constructs a predictive model using a summary dataset generated from simulation parameters. Step 2 utilizes this trained model to predict performance metrics for a target dataset, selecting the best configuration that satisfies the time constraint T_hope.

3.2 Step 1: Construction of the predictive model

In this step, we construct a predictive model using a multilayer perceptron (MLP) (Rumelhart et al., 1986), referred to as glmnet-MLP. The objective of this model is to predict the performance metrics, specifically the SPE and the computation time T_{glmnet,τ,n_λ}. Here, T_{glmnet,τ,n_λ} is defined as the total run time, including the cross-validation process for selecting the optimal λ. Regarding the dataset characteristics, we specifically include the sample size N, the number of predictors p, and the eigenvalue features of the covariance matrix. The eigenvalues are included to capture the correlation structure among the predictors. In addition to these data features, we incorporate the configuration parameters: the convergence threshold τ and the length of the λ sequence, n_λ.

One characteristic of our framework is the construction of the λ sequence using n_λ. Unlike the default configuration, we propose a flexible construction where the sequence length is determined by n_λ (n_λ > n_λ^def). Specifically, we extend the default sequence by appending (n_λ − n_λ^def) additional values evenly spaced between the default minimum λ_min^def and 0. Through this parameterized construction, the complex problem of designing an appropriate λ sequence is effectively reduced to determining a single optimal value for n_λ.

3.2.1 Construction of the summary dataset

To train the glmnet-MLP, we construct a large-scale dataset, which we refer to as the summary dataset. This dataset is created by generating artificial datasets and recording the corresponding glmnet performance.
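The λ-sequence construction described above (the default log-spaced grid, extended toward zero by n_λ − n_λ^def evenly spaced values) can be sketched as follows. This is a Python illustration; details such as whether the endpoints λ_min^def and 0 are themselves included among the appended values are our assumptions, not necessarily glmnetconf's exact implementation (here both endpoints are excluded).

```python
import numpy as np

def default_lambda_sequence(X, y, n_default=100):
    """Default glmnet-style grid: n_default log-spaced values from lambda_max
    down to lambda_min^def (1e-2 * lambda_max if N < p, else 1e-4 * lambda_max)."""
    N, p = X.shape
    lam_max = np.max(np.abs(X.T @ y)) / N
    lam_min = lam_max * (1e-2 if N < p else 1e-4)
    return np.exp(np.linspace(np.log(lam_max), np.log(lam_min), n_default))

def extended_lambda_sequence(X, y, n_lambda, n_default=100):
    """Append (n_lambda - n_default) values evenly spaced between the default
    minimum and 0, endpoints excluded (an implementation choice of this sketch)."""
    seq = default_lambda_sequence(X, y, n_default)
    n_extra = n_lambda - n_default
    if n_extra > 0:
        extra = np.linspace(seq[-1], 0.0, n_extra + 2)[1:-1]
        seq = np.concatenate([seq, extra])
    return seq
```

With this parameterization, choosing the grid reduces to choosing the single integer n_λ, exactly as described above.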
Each sample in the summary dataset consists of the data characteristics (N, p, eigenvalues), the configuration (τ, n_λ), and the resulting performance metrics (SPE, computation time). The detailed construction procedure is as follows:

1. Parameter setting and feature extraction: Specify the simulation parameters: sample size N, number of predictors p, covariance matrix Σ, true coefficients β, and error variance σ². At this stage, compute the eigenvalue features of Σ. Select the top and bottom five eigenvalues, denoted as γ_k (k = ±1, …, ±5), where positive and negative indices correspond to the largest and smallest eigenvalues, respectively.

2. Data generation: Using the specified parameters, generate a synthetic dataset (X, y) according to

    x_i ∼ N(0, Σ) (i = 1, …, N),  ϵ ∼ N(0, σ²I),  y = Xβ + ϵ,

where N(µ, Σ) denotes the multivariate normal distribution with mean µ and covariance matrix Σ.

3. Performance evaluation: Compute the lasso solutions using glmnet under various configurations (τ, n_λ). For each configuration, we quantify the discrepancy between the approximate solution path and the exact path using the Solution Path Error (SPE), defined as:

    SPE_{τ,n_λ} = (1/k) Σ_{i=1}^{k} (1/√p) ‖β_true(λ_i) − β̂^{glmnet}_{τ,n_λ}(λ_i)‖₂,

where {λ_i}_{i=1}^{k} is a reference sequence of k = 20 points logarithmically spaced from λ_max to λ_start = 0.001. Here, β_true(λ_i) is the exact solution obtained via LARS, and β̂^{glmnet}_{τ,n_λ}(λ_i) is the solution estimated by glmnet. We also record the computation time T_{glmnet,τ,n_λ}.

4. Data aggregation: Record the combination of the data characteristics, configuration, and performance metrics as a single data point:

    (N, p, γ₁, …, γ₋₁, τ, n_λ, SPE_{τ,n_λ}, T_{glmnet,τ,n_λ}).

5. Iteration: Repeat Steps 1–4 under various parameter settings.
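The SPE computed in Step 3 can be sketched as follows. This is a Python illustration that takes the two paths already evaluated on the common reference grid of k = 20 λ values; the function name `solution_path_error` is ours.

```python
import numpy as np

def solution_path_error(beta_true, beta_hat):
    """SPE: mean over the k reference lambdas of the scaled l2 distance
    (1/sqrt(p)) * ||beta_true(lambda_i) - beta_hat(lambda_i)||_2.

    Both arguments are (k, p) arrays holding the exact (LARS) and
    approximate (glmnet) paths evaluated on the same reference grid."""
    k, p = beta_true.shape
    diffs = np.linalg.norm(beta_true - beta_hat, axis=1) / np.sqrt(p)
    return diffs.mean()
```

Identical paths give SPE = 0, and a constant coordinate-wise discrepancy of c in one coefficient across the grid gives SPE = c/√p, so the 1/√p scaling keeps the error comparable across dimensions.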
Consequently, this process yielded a total of 810,492 samples, which constitute the summary dataset. Detailed specifications of the simulation parameters and the summary dataset are provided in Appendix A.

3.2.2 Training strategy and determination of network architecture

We train the glmnet-MLP using the summary dataset. The dataset is randomly split into training, validation, and test sets. Prior to training, the target variables (SPE and computation time) are log-transformed and standardized to stabilize the learning process. To obtain predictions on the original scale, we apply the inverse transformations. To determine the optimal network architecture (e.g., number of layers and units) and the learning rate, we employ Bayesian optimization. We formulate the task as a black-box optimization problem to minimize the validation error, implemented using the Optuna framework (Akiba et al., 2019). Further details regarding the training protocol, the search space for hyperparameters, and the final network architecture are provided in Appendix B.

3.3 Step 2: Configuration tuning using the predictive model

3.3.1 Definition of Pareto front

First, we introduce the concept of Pareto optimality for a multi-objective optimization problem. Consider the problem of simultaneously minimizing a vector-valued objective function f : X → ℝ^M:

    min_{x∈X} f(x) = min_{x∈X} ( f₁(x), …, f_M(x) ).    (2)

In general, a unique solution that minimizes all objective functions simultaneously does not exist. Instead, we seek Pareto optimal solutions, which represent optimal trade-offs among the objectives.

Definition 1 (Weak dominance). For x, x′ ∈ X, if f_m(x) ≤ f_m(x′) for all m = 1, …, M, we say that f(x) weakly dominates f(x′).

A Pareto optimal solution is defined as follows:

Definition 2 (Pareto optimal solution and Pareto front).
We say that x* ∈ X is a Pareto optimal solution if there exists no x ∈ X such that f(x) weakly dominates f(x*) with f(x) ≠ f(x*). In addition, we define the Pareto front as the set of the objective values of Pareto optimal solutions. The Pareto front P* is given by:

    P* = { f(x*) | x* ∈ X is a Pareto optimal solution }.

Note that, theoretically, there may exist an infinite number of Pareto optimal solutions. Thus, we need to select the best solution from the set of Pareto optimal solutions.

3.3.2 Pareto front for optimizing configuration

In this section, we describe the procedure to tune the glmnet configuration using the trained predictive model. Our goal is to determine the optimal configuration (τ*, n_λ*) for a new dataset (X*, y*) under a user-specified computation time constraint, denoted as T_hope. The specific procedure is as follows:

1. Feature extraction: Compute the data characteristics for the target dataset (X*, y*). Specifically, calculate the sample size N, the dimension p, and the eigenvalue statistics γ_i (i = ±1, …, ±5) derived from the sample covariance matrix of X*. Note that unlike the training phase (Step 1), where the eigenvalues were computed from the true covariance matrix Σ, here they are derived from the sample covariance matrix of X*.

2. Model setup: Fix these extracted features in glmnet-MLP. Consequently, the MLP functions as a mapping from a configuration (τ, n_λ) to the predicted SPE and computation time. This mapping corresponds to the objective function f(x) in Eq. (2).

3. Random sampling: Randomly sample K configurations {(τ_k, n_λ^k)}_{k=1}^{K} from the search space, where τ is sampled from [10⁻⁹, 10⁻⁷] on a log scale and n_λ from [100, 2p].

4.
Performance prediction: Obtain the predictions {SPE_{τ_k,n_λ^k}, T_{glmnet,τ_k,n_λ^k}}_{k=1}^{K} by substituting the sampled configurations {(τ_k, n_λ^k)}_{k=1}^{K} into the mapping defined in the Model setup step.

5. Pareto front extraction: Identify the discrete Pareto front P̂* from the set of predicted outcomes {SPE_{τ_k,n_λ^k}, T_{glmnet,τ_k,n_λ^k}}_{k=1}^{K}.

6. Best configuration selection: From the Pareto front P̂*, select the optimal configuration (τ*, n_λ*) that minimizes the SPE subject to a user-specified computation time constraint T_hope. The index of the best configuration k* is determined by:

    k* = argmin_{k∈{1,…,K}} SPE_{τ_k,n_λ^k}  subject to  T_{glmnet,τ_k,n_λ^k} < T_hope,  (SPE_{τ_k,n_λ^k}, T_{glmnet,τ_k,n_λ^k}) ∈ P̂*.

Finally, the best configuration is given by (τ*, n_λ*) = (τ_{k*}, n_λ^{k*}).

Figure 3: [Scatter plot of predicted computation time against predicted SPE, with points labeled Not Pareto Front, Pareto Front, and Best Configuration, and a dashed line at T_hope = 20.] Visualization of the Pareto front derived from the glmnet-MLP for the same dataset used in Figure 1. The horizontal and vertical axes represent the predicted SPE and computation time, respectively. The blue points represent the set of Pareto optimal solutions. From this set, the red triangle highlights the best configuration selected based on the user-specified time constraint (T_hope = 20 s), indicated by the horizontal dashed line. Under this constraint, the best configuration was identified as (τ*, n_λ*) = (1.159 × 10⁻⁹, 864).

By applying this tuning procedure to the same dataset used in Figure 1, we obtain the Pareto front shown in Figure 3. In this example, we set the time constraint to T_hope = 20 seconds. This approach offers significant advantages in terms of both efficiency and interpretability. First, the optimization process is extremely fast; for instance, computing the Pareto front for Figure 3 took only about one second.
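Steps 5 and 6 of the procedure, discrete Pareto-front extraction and constrained selection, can be sketched as follows. This is a Python illustration built directly on the weak-dominance definition of Section 3.3.1, not the glmnetconf implementation; the function names are ours.

```python
import numpy as np

def pareto_front(points):
    """Indices of Pareto-optimal rows of `points`, shape (K, 2): (SPE, time),
    both minimized. A point survives if no other point weakly dominates it
    while differing from it (Definitions 1 and 2)."""
    keep = []
    for i, pi in enumerate(points):
        dominated = any(
            np.all(pj <= pi) and np.any(pj < pi)
            for j, pj in enumerate(points) if j != i
        )
        if not dominated:
            keep.append(i)
    return keep

def best_configuration(configs, spe, time, t_hope):
    """Among Pareto-optimal candidates with predicted time below t_hope,
    pick the one with the smallest predicted SPE (Steps 5-6)."""
    points = np.column_stack([spe, time])
    front = pareto_front(points)
    feasible = [i for i in front if time[i] < t_hope]
    if not feasible:
        return None                         # no configuration meets the budget
    k_star = min(feasible, key=lambda i: spe[i])
    return configs[k_star]
```

Because each candidate is checked against all others, this brute-force front costs O(K²) comparisons, which is negligible for the few thousand sampled configurations considered here.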
The only computationally intensive step is the eigenvalue calculation. Once the features are extracted, evaluating thousands of configurations via the neural network takes negligible time. This efficiency meets the requirement discussed in Section 1 to optimize the configuration as quickly as possible. Second, the Pareto front provides visual clarity regarding the trade-off between SPE and computation time. This allows users to intuitively assess the cost of accuracy. For example, in Figure 3, one can observe a substantial difference in SPE between computation times of 20 seconds and 5 seconds. Based on this visualization, users can make informed decisions, such as whether to relax or tighten the constraint T_hope to achieve the desired balance.

We implemented the proposed framework as an R package named glmnetconf. This package provides the configuration tuning framework proposed in this study. Furthermore, it incorporates a mechanism to select the appropriate package (i.e., glmnet or lars) considering computation time. The details of this solver selection and specific usage examples are provided in Appendix C.

4 Numerical experiments

4.1 Simulation

In this section, we verify that our proposed method properly tunes the configuration (τ, n_λ) through numerical experiments. The simulation dataset with N observations and p predictors is generated as follows:

    x_i ~iid~ N(0, (1 − ρ)I_p + ρ 1_p 1_p^⊤),  X = (x₁, …, x_N)^⊤,
    β = P (1, …, 1, 0, …, 0)^⊤  (⌊p/2⌋ ones followed by p − ⌊p/2⌋ zeros),
    ε ∼ N(0, I_N),  y = Xβ + ε,

where P is a random permutation matrix of size p × p, and ⌊·⌋ denotes the floor function. In this simulation, we compare the performance of the following three methods:

• glmnet (default): glmnet with the default configuration.

• glmnet (proposed): glmnet with the configuration optimized by our proposed method with T_hope = 20 s.
• LARS: Serves as a reference to provide the exact solution path by the lars package.

We conducted the experiments for all combinations of N, p ∈ {100, 500, 1000, 1500, 2000} and ρ ∈ {0, 0.1, 0.3, 0.5, 0.7, 0.9} over 100 simulation runs. To evaluate the predictive performance, we employed the root mean square error (RMSE) computed on test datasets of 100 samples. We also measured the computation time for each method. The regularization parameter λ was selected via ten-fold cross-validation by cv.glmnet() and cv.lars(). From the perspective of numerical stability, we specified mode = "step" in cv.lars() when N = p, while choosing mode = "fraction" otherwise.

Figure 4 presents the results of the numerical experiment. In each panel, the vertical axis represents the average RMSE, and the horizontal axis represents the sample size N. The panels are organized by combinations of the number of predictors p and the correlation ρ. When ρ = 0, the test errors of all three methods are similar across all combinations of N and p. However, when ρ > 0, the test error of glmnet (default) tends to be higher than that of LARS. This result indicates that the default configuration is not appropriate for such correlated data. In contrast, glmnet (proposed) achieves performance comparable to that of LARS in most cases. Although slightly higher errors are observed when p = 2000, this can be attributed to the imposed T_hope, reflecting the trade-off between computational time and accuracy.

Figure 5 reports the average computation time of the experiments using the same layout as Figure 4. Among the three methods, LARS consistently required the longest computation time for p ≥ 1000, and its runtime increased drastically with larger N and p. In contrast, glmnet (proposed) was significantly faster in these settings, with runtimes consistently staying close to T_hope.
Despite this speed advantage, Figure 4 confirms that their predictive accuracy remains comparable. Taken together, these results demonstrate that glmnet (proposed) achieves accuracy comparable to that of LARS while significantly reducing computation time. This suggests that our proposed method successfully selects the appropriate configuration for glmnet adaptively based on the dataset. It is worth noting that for p ≤ 500, glmnet (proposed) occasionally exhibited slightly longer computation times than LARS. This behavior is attributable to the setting of T_hope: because the budget is generous for small problems, glmnet (proposed) utilizes this available time to maximize accuracy.

Figure 4: Comparison of prediction accuracy (RMSE) across different sample sizes N. The plot compares glmnet (default), glmnet (proposed) tuned with T_hope = 20 s, and LARS as the exact reference. The results are averaged over 100 simulation runs. The panels correspond to different combinations of the number of predictors p and the correlation among the predictors ρ. Notably, glmnet (proposed) consistently achieves accuracy comparable to the exact LARS solution across all settings.

4.2 Application to compressed sensing

Compressed sensing (Candès and Wakin, 2008) is a signal processing technique that reconstructs a signal from a compressed representation obtained via a random projection matrix. In this section, we apply our proposed framework to solve the lasso problem arising in compressed sensing.
We compare the reconstruction accuracy and computation time of glmnet (proposed) against glmnet (default) and LARS. We used an image from the MNIST dataset (LeCun et al., 1998) for the experiment, resizing it to 32 × 32 pixels.

First, the image was compressed as follows. The image matrix was vectorized in column-major order to form θ ∈ R^{1024}. Let N′ denote the dimension of the compressed data. We generated a random projection matrix Z ∈ R^{N′ × 1024}, where each element Z_ij was drawn independently from N(0, 1). The vector θ was then compressed into y = Zθ ∈ R^{N′}. In this experiment, we set the compressed dimension to N′ = 700. This process reduces the dimensionality from 1024 to N′, effectively compressing the data.

Next, we reconstructed the original image θ using the compressed vector y and the projection matrix Z.

Figure 5: Comparison of computation time (seconds) across different sample sizes N. Similar to Figure 4, this plot compares glmnet (default), glmnet (proposed) tuned with T_hope = 20 s, and LARS. The results are averaged over 100 simulation runs. The panels correspond to different combinations of the number of predictors p and the correlation among the predictors ρ. The proposed method is not only significantly faster than LARS but also satisfies T_hope in the majority of cases.

Employing a two-level wavelet basis matrix Ψ, the reconstruction corresponds to solving the following lasso problem:

\hat{\beta} = \operatorname*{argmin}_{\beta} \; \frac{1}{2N'} \| y - X\beta \|_2^2 + \lambda \| \beta \|_1,

where X = ZΨ ∈ R^{N′ × 1024}. The reconstructed image is obtained by \hat{\theta} = \Psi \hat{\beta}.
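Before turning to the results, the compression and reconstruction pipeline above can be sketched in Python as a simplified, self-contained illustration. Two assumptions differ from the paper's setup: the wavelet basis Ψ is replaced by the identity (i.e., θ is assumed sparse directly, so X = Z), and the lasso is solved by plain iterative soft-thresholding (ISTA; Daubechies et al., 2004) rather than glmnet or lars:

```python
import numpy as np

def ista_lasso(X, y, lam, n_iter=500):
    """Minimize (1/(2*n_prime)) * ||y - X b||_2^2 + lam * ||b||_1
    by iterative soft-thresholding (ISTA)."""
    n_prime, p = X.shape
    step = n_prime / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n_prime
        z = b - step * grad
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding
    return b

rng = np.random.default_rng(0)
p, n_comp = 256, 120                       # signal dimension and compressed dimension N'
theta = np.zeros(p)
theta[rng.choice(p, 10, replace=False)] = rng.standard_normal(10)  # sparse signal
Z = rng.standard_normal((n_comp, p))       # random projection, Z_ij ~ N(0, 1)
y = Z @ theta                              # compressed measurements y = Z * theta
theta_hat = ista_lasso(Z, y, lam=0.01)
print(np.linalg.norm(theta_hat - theta) / np.linalg.norm(theta))  # relative error
```

With far fewer measurements than dimensions (120 versus 256), the lasso still recovers the sparse signal to small relative error, which is exactly the mechanism the MNIST experiment exploits at scale.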
Using this formulation, we evaluated the performance of the proposed method.

Figure 6 illustrates the reconstruction results. To quantify the reconstruction quality, we evaluated the RMSE against the original image on the pixel value scale [0, 255]. The glmnet (default) yielded a high RMSE of 31.57, resulting in a degraded image lacking sharpness. In contrast, the proposed method achieved an RMSE of 14.53, which is significantly lower than the default and comparable to the RMSE of 12.22 obtained by the exact solution of LARS. Regarding computational efficiency, glmnet (proposed) required only approximately one-fourth of the computation time of LARS. These results demonstrate that our proposed framework successfully tunes the configuration to achieve accuracy comparable to LARS while maintaining significantly lower computational cost.

4.3 Discussion

Our results demonstrate that the proposed method successfully tunes the configuration to achieve accuracy comparable to the exact solution of LARS, while approximately satisfying the specified computation time constraint T_hope. This suggests that our framework selects an appropriate configuration that overcomes the limitations inherent in the default configuration. The observed improvement in test error is primarily attributable to the expanded search range for λ: this wider range enables cross-validation to identify optimal λ values that restrictive default grids often miss. Regarding computation time, the prediction accuracy of the glmnet-MLP proved reasonably reliable, which enabled the successful selection of configurations that adhered to the time constraint T_hope.

However, a primary limitation of our framework lies in the range of the training dataset. Our model was trained on synthetic datasets generated from multivariate normal distributions within specific ranges of sample size N and dimension p.
Since it is practically impossible to learn the characteristics of all possible data distributions, this dependence on the training dataset is unavoidable. In particular, caution is required when extrapolating to cases where N or p exceeds the upper bounds of the training range defined above. Nevertheless, the compressed sensing experiment provided a promising indication of robustness: although N and p were within the training range, the structural properties of the design matrix X differed from the multivariate normal assumption used in training. The successful application in this context demonstrates our method's potential applicability to datasets with design matrices outside the training distribution.

Finally, regarding hardware dependency, computation time inevitably varies across different computing environments. From a practical standpoint, however, the order of magnitude is often more critical than precise timing. Minor deviations of a few seconds are generally acceptable in real-world applications, provided the algorithm operates within the expected time scale.

5 Conclusion

In this study, we established a data-driven framework for configuration tuning of glmnet by learning from large-scale artificial datasets. Our approach explicitly models the trade-off between accuracy and computation time. This capability enables the identification of a configuration that achieves accuracy comparable to LARS while satisfying user-specified time constraints. By employing Pareto optimization, we achieved a tuning mechanism that is both transparent and efficient.

In future work, we aim to address the limitations discussed in Section 4.3. Specifically, we plan to enhance the model's generalizability by expanding the training dataset to include a wider range of sample sizes and dimensions (N, p) and diverse data distributions.
Furthermore, extending this framework to other families of generalized linear models (GLMs) supported by glmnet (e.g., logistic and Poisson regression) represents a promising avenue, given their shared algorithmic structure.

Figure 6: Visual comparison of reconstruction results. (a) The original image. The bottom row displays the reconstructed images along with their RMSE values calculated against the original image (on a [0, 255] scale): (b) default glmnet (RMSE: 31.57), (c) glmnet tuned by the proposed method (RMSE: 14.53), and (d) LARS (RMSE: 12.22). Notably, regarding computation time, glmnet (proposed) required only 13.2 s, whereas LARS required 53.4 s. This demonstrates that our approach successfully identifies an appropriate configuration that is both accurate and computationally efficient.

Appendix A Details of the summary dataset

A.1 Hyperparameters for summary dataset

In Section 3.2.1, we described the generation of the design matrix X and response y used to build the summary dataset for training the glmnet-MLP. To generate X and y, we need to specify the parameters Σ, β, and σ. This appendix provides the specific details of these parameter settings.

Structure of the true covariance matrix Σ. We employed four types of covariance matrix for Σ:

1. Compound symmetry covariance matrix:

\Sigma = (1 - \rho) I_p + \rho \mathbf{1}_p \mathbf{1}_p^\top =
\begin{pmatrix}
1 & \rho & \cdots & \rho \\
\rho & 1 & \cdots & \rho \\
\vdots & \vdots & \ddots & \vdots \\
\rho & \rho & \cdots & 1
\end{pmatrix}
\quad (0 < \rho < 1)

2. AR(1) covariance matrix:

\begin{pmatrix}
1 & \rho & \cdots & \rho^{p-1} \\
\rho & 1 & \cdots & \rho^{p-2} \\
\vdots & \vdots & \ddots & \vdots \\
\rho^{p-1} & \rho^{p-2} & \cdots & 1
\end{pmatrix}
\quad (0 < \rho < 1)

3. Random structured covariance matrix: The construction procedure for the random structured covariance matrix is based on Hirose et al. (2017). The specific steps are as follows:

(a) Define the set D = [−0.75, −0.25] ∪ [0.25, 0.
75], and construct a p × p matrix E, where each element E_{i,j} is drawn independently from a uniform distribution U(D).

(b) Assign 0 to some off-diagonal elements of the generated matrix E. The number and specific locations of these assignments are determined by uniform random selection.

(c) Compute \tilde{E} = (E + E^\top)/2.

(d) Calculate \tilde{\Omega} = \tilde{E} + (0.1 - \lambda_{\min}) I, where \lambda_{\min} is the minimum eigenvalue of \tilde{E}.

(e) Let L = \mathrm{diag}(\tilde{\Omega}^{-1}), and then compute \Omega = L^{1/2} \tilde{\Omega} L^{1/2}.

(f) Finally, the random structured covariance matrix C is given by C = \Omega^{-1}.

4. Inverse of the random structured covariance matrix: We adopt the inverse of C as the covariance matrix.

Structure of the true coefficient β. We prepared the following four structural patterns for β ∈ R^p:

1. ⌊p/2⌋ elements are 1, and the others are 0.

2. ⌊p/10⌋ elements are 1, and the others are 0.

3. ⌊p/2⌋ elements are generated from N(0, 1), and the others are 0.

4. ⌊p/10⌋ elements are generated from N(0, 1), and the others are 0.

In all cases, the positions of the β elements are randomly permuted.

Structure of the true error standard deviation σ. We employed two settings for the noise level σ ∈ R:

1. σ = 1

2. σ = p/10

A.2 Scope of the summary dataset

This section describes the scope of the summary dataset. Specifically, the distribution of the sample size N and the number of predictors p directly determines the applicable range of our proposed method. Figure 7 illustrates the distribution of N and p within the dataset. The dataset covers a broad spectrum of dimensions, explicitly defined by the ranges N ∈ [30, 3000] and p ∈ [10, 2980]. To ensure the accuracy of computation time measurements, we avoided large-scale parallelization. This constraint significantly increased the total time required to generate the summary dataset. Consequently, we employed a denser sampling strategy in regions where N and p are small.
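To make the A.1 settings concrete, the random structured covariance construction of steps (a)–(f) can be sketched as follows. This is an illustrative Python translation of the R procedure; the function name and the fraction of zeroed off-diagonal entries are assumptions for the sketch:

```python
import numpy as np

def random_structured_cov(p, zero_frac=0.5, seed=0):
    """Random structured covariance following A.1 (after Hirose et al., 2017)."""
    rng = np.random.default_rng(seed)
    # (a) entries from U(D), D = [-0.75, -0.25] U [0.25, 0.75]:
    #     a uniform magnitude in [0.25, 0.75] with a random sign
    E = rng.uniform(0.25, 0.75, size=(p, p)) * rng.choice([-1.0, 1.0], size=(p, p))
    # (b) zero out a randomly chosen subset of off-diagonal entries
    mask = rng.random((p, p)) < zero_frac
    np.fill_diagonal(mask, False)
    E[mask] = 0.0
    E_tilde = (E + E.T) / 2                               # (c) symmetrize
    lam_min = np.linalg.eigvalsh(E_tilde).min()
    Omega_tilde = E_tilde + (0.1 - lam_min) * np.eye(p)   # (d) shift so eigmin = 0.1 > 0
    # (e) L = diag(Omega_tilde^{-1}); Omega = L^{1/2} Omega_tilde L^{1/2}
    L_half = np.diag(np.sqrt(np.diag(np.linalg.inv(Omega_tilde))))
    Omega = L_half @ Omega_tilde @ L_half
    return np.linalg.inv(Omega)                           # (f) C = Omega^{-1}

C = random_structured_cov(30)
print(np.linalg.eigvalsh(C).min() > 0)  # True: C is positive definite
```

The diagonal shift in step (d) guarantees positive definiteness, so both C and its inverse (setting 4 above) are valid covariance matrices.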
A.3 Computational environment

All numerical experiments were conducted on a server running Ubuntu 24.04.1 LTS (Linux kernel 6.8.0), equipped with an AMD EPYC 7763 64-core processor (up to 3.5 GHz) and 2 TB of DDR4-3200 ECC memory. Computational tasks were implemented in R version 4.3.3. Parallel processing with 10 logical cores was employed during the generation of summary datasets to enhance computational efficiency, using the doParallel (v1.0.17) and foreach (v1.5.2) packages. In contrast, the other simulation procedures were executed in a single-threaded manner. The R environment was linked against the reference BLAS (v3.12.0) and LAPACK (v3.12.0) libraries. The versions of glmnet and lars were 4.1.8 and 1.3, respectively.

B Details of training strategy and hyperparameters

In this section, we provide detailed specifications of the training process and the resulting model architecture for the glmnet-MLP described in Section 3.2.2.

Data preparation. The summary dataset was split into 80%, 10%, and 10% for training, validation, and testing, respectively. As mentioned in the main text, target variables were log-transformed and standardized.

Figure 7: Heatmap showing the distribution of combinations of N and p in the summary dataset. The color intensity represents the number of data points; redder regions indicate a higher concentration of samples. White regions indicate sparse sampling (low count), not necessarily the absence of data.

Optimization setup. The Bayesian optimization was performed using the BoTorchSampler (Balandat et al., 2020) within Optuna, based on Gaussian process regression and the Expected Improvement acquisition function. We executed the optimization for 500 trials. Throughout the process, the activation function was fixed to the Swish function (Ramachandran et al., 2017):

f(x) = \frac{x}{1 + e^{-x}}.
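For concreteness, the Swish activation above is simply the input scaled by its sigmoid, which can be written as a one-line sketch:

```python
import math

def swish(x):
    """Swish activation f(x) = x / (1 + exp(-x)) (Ramachandran et al., 2017)."""
    return x / (1.0 + math.exp(-x))

print(swish(0.0))              # 0.0: Swish passes through the origin
print(round(swish(1.0), 4))    # 0.7311
```

Unlike ReLU, Swish is smooth and slightly non-monotonic for negative inputs, which is often cited as helping gradient-based training of small regression networks like the one used here.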
The number of epochs was set to 500, and the mini-batch size was fixed at 20,263. The search space for the optimization was defined as follows:

• Number of hidden layers: {1, 2, 3};

• Number of units per layer: {1, 2, . . . , 64};

• Learning rate: [10^{-5}, 10^{-1}] (log scale).

Resulting model hyperparameters. As a result of the optimization, a three-layer network structure was selected, with 64, 61, and 57 units in the respective hidden layers. The optimal learning rate was determined to be approximately 7.6 × 10^{-4}. This configuration was adopted as the final glmnet-MLP for evaluation.

C R Package glmnetconf

We developed the R package glmnetconf to implement our tuning method and make it easily accessible to a broad user base. Our implementation includes not only the configuration tuning method but also an R package selection method. Since lars and glmnet each possess distinct advantages, the appropriate choice depends on the user's objective. Specifically, if an exact solution is required without regard to computation time, lars is the optimal choice. Conversely, if computational efficiency is prioritized, glmnet is preferable. Therefore, our package implements a function that selects the R package based on the dataset X, y and T_hope.

Figure 8: Workflow of our package glmnetconf. [Flowchart: given the input X, y, and T_hope, the framework predicts T_lars for package selection; if T_hope > T_lars, it solves the lasso problem by lars; otherwise, it tunes the configuration for glmnet.]

C.1 The workflow of glmnetconf

Suppose we have a dataset X, y and a desired computation time T_hope. Figure 8 shows the workflow of our proposed package. First, our framework determines which package to employ. We predict the computation time of lars, denoted as T_lars, based on X, y. If the predicted T_lars is smaller than T_hope, our framework selects lars to ensure an exact solution.
On the other hand, if T_lars exceeds T_hope, our framework selects glmnet. In this scenario, the configuration for glmnet is tuned by our proposed method in Section 3.

C.2 Prediction model for the lars computation time

Similar to the glmnet-MLP, we construct a predictive model to forecast the computation time of lars, denoted as T_lars. The input features consist of the sample size N, the dimension p, and the selected eigenvalues of the sample covariance matrix of X (γ_{±1}, . . . , γ_{±5}). The output is the predicted computation time T_lars. The training dataset for the lars-MLP was collected during the generation of the summary dataset described in Section 3.2.1. The resulting dataset comprises 68,013 samples. Using this dataset, we trained the model employing the Adam optimizer.

The network architecture and hyperparameters were determined via Bayesian optimization, following the same protocol and search space as the glmnet-MLP. The optimization yielded a three-layer hidden network with 45, 44, and 37 units in the respective layers, and a learning rate of approximately 1.08 × 10^{-3}. Consistent with the glmnet-MLP, we employed the Swish activation function and set the number of epochs to 500. However, the batch size was set to 1,700 for this model.

Figure 9: Comparison of predicted versus actual computation times for lars across different sample sizes N. The layout is identical to that of Figure 4. The lars-MLP effectively captures the overall trend of the computation time. However, it tends to underestimate the actual computation time when both N and p are large.
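As a sketch of the feature construction described above, the lars-MLP inputs (sample size N, dimension p, and the five largest and five smallest eigenvalues of the sample covariance of X) could be computed as follows. This is a Python illustration; the helper name and the exact feature ordering are assumptions, not the package's internals:

```python
import numpy as np

def summary_features(X, k=5):
    """Features in the style of the lars-MLP input: N, p, and the k largest
    and k smallest eigenvalues of the sample covariance matrix of X."""
    N, p = X.shape
    S = np.cov(X, rowvar=False)           # p x p sample covariance
    eig = np.sort(np.linalg.eigvalsh(S))  # ascending eigenvalues
    top = eig[-k:][::-1]                  # k largest, descending
    bottom = eig[:k]                      # k smallest, ascending
    return np.concatenate(([N, p], top, bottom))

rng = np.random.default_rng(0)
feats = summary_features(rng.standard_normal((200, 50)))
print(feats.shape)  # (12,)
```

Only extreme eigenvalues are kept, so the feature vector has a fixed length (here 2 + 5 + 5 = 12) regardless of p, which is what allows a single MLP to handle datasets of varying dimension.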
Figure 9 compares the predicted computation time with the actual runtime of lars, using the simulation dataset described in Section 4. The model effectively captures the overall trend of the computation time. However, it tends to underestimate the runtime when both N and p are large. This bias is likely due to the scarcity of training samples in high-dimensional regions, limited by the high computational cost of data generation.

C.3 Usage example

This section demonstrates the usage of the glmnetconf package. The primary function, auto_lasso(), automates the entire tuning process. Specifically, it automatically selects the appropriate package and tunes the configuration based on the input dataset and T_hope. Listing 1 shows a usage example with a synthetic dataset, where T_hope = 20 seconds. The dataset is generated via the function data_generation(), following the exact same simulation settings described in Section 4. In this example, we set the sample size to N = 1500, the dimension to p = 800, and the correlation to ρ = 0.5. The script not only executes the proposed automated workflow via auto_lasso() but also compares its predictive performance against the standard usage of cv.glmnet() (default configuration). This comparison illustrates how the proposed method achieves competitive accuracy while satisfying the time constraint.

library(glmnetconf); library(glmnet)

# Setup: Generate X_train, y_train, X_test, y_test.
dat <- data_generation(N_train = 1500, N_test = 100, p = 800, rho = 0.5, sparse_rate = 0.5, sigma = 1)
X_train <- dat$X_train
y_train <- dat$y_train
X_test <- dat$X_test
y_test <- dat$y_test

# --- 1.
# Proposed Method (glmnetconf) ---
# Perform automatic tuning with a time constraint T_hope = 20 s
fit_glmnetconf <- auto_lasso(X_train, y_train, new_x = X_test, T_hope = 20)
mse_glmnetconf <- mean((y_test - fit_glmnetconf$prediction)^2)

# --- 2. Benchmark (glmnet default) ---
# Standard cross-validation with default settings
fit_glm <- cv.glmnet(X_train, y_train, alpha = 1)
pred_glm <- predict(fit_glm, newx = X_test, s = "lambda.min")
mse_glm <- mean((y_test - pred_glm)^2)

# --- 3. Performance Comparison ---
print(c(glmnetconf = mse_glmnetconf, glmnet = mse_glm))

# (Optional) Check tuned configuration and Pareto front
# print(fit_glmnetconf$configuration)
# print(fit_glmnetconf$Pareto_front)

Listing 1: Usage example of glmnetconf. The script demonstrates the proposed tuning method with a time constraint and compares its test error against the default configuration of glmnet.

References

Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M. (2019). Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '19, pages 2623–2631, New York, NY, USA. Association for Computing Machinery.

Balandat, M., Karrer, B., Jiang, D. R., Daulton, S., Letham, B., Wilson, A. G., and Bakshy, E. (2020). BoTorch: A framework for efficient Monte-Carlo Bayesian optimization.
In Advances in Neural Information Processing Systems 33.

Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202.

Bøvelstad, H., Nygård, S., Størvold, H., Aldrin, M., Borgan, Ø., Frigessi, A., and Lingjærde, O. (2007). Predicting survival from microarray data—a comparative study. Bioinformatics, 23(16):2080–2087.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122.

Candès, E. J. and Wakin, M. B. (2008). An introduction to compressive sampling. IEEE Signal Processing Magazine, 25(2):21–30.

Csárdi, G. (2019). cranlogs: Download Logs from the 'RStudio' 'CRAN' Mirror. R package version 2.1.1.

Daubechies, I., Defrise, M., and De Mol, C. (2004). An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413–1457.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32(2):407–499.

Friedman, J. H., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33:1–22.

Fu, W. J. (1998). Penalized regressions: The bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416.

Hebiri, M. and Lederer, J. (2013). How correlations influence lasso prediction. IEEE Transactions on Information Theory, 59(3):1846–1854.

Hirose, K., Fujisawa, H., and Sese, J. (2017). Robust sparse Gaussian graphical modeling. Journal of Multivariate Analysis, 161:172–190.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11):2278–2324.

Lu, Y. and Li, X. (2015). Estimating stellar atmospheric parameters based on lasso and support-vector regression. Monthly Notices of the Royal Astronomical Society, 452(2):1394–1401.

Massias, M., Gramfort, A., and Salmon, J. (2018). Celer: A fast solver for the lasso with dual extrapolation. In Dy, J. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3315–3324. PMLR.

Osborne, M., Presnell, B., and Turlach, B. (2000). A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389–403.

Ramachandran, P., Zoph, B., and Le, Q. V. (2017). Searching for activation functions. arXiv preprint arXiv:1710.05941.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088):533–536.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.