Data-driven configuration tuning of glmnet to balance accuracy and computation time

Shuhei Muroya¹* and Kei Hirose²†

¹ Joint Graduate School of Mathematics for Innovation, Kyushu University, Fukuoka, Japan
² Institute of Mathematics for Industry, Kyushu University, Fukuoka, Japan

Abstract

glmnet is a widely adopted R package for lasso estimation due to its computational efficiency. Despite its popularity, glmnet sometimes yields solutions that are substantially different from the true ones because of the inappropriate default configuration of the algorithm. The accuracy of the obtained solutions can be improved by appropriately tuning the configuration. However, improving accuracy typically increases computational time, resulting in a trade-off between accuracy and computational efficiency. Therefore, it is essential to establish a systematic approach to determining an appropriate configuration. To address this need, we propose a unified data-driven framework specifically designed to optimize the configuration by balancing the trade-off between accuracy and computational efficiency. We generate large-scale simulated datasets and apply glmnet under various configurations to obtain accuracy and computation time. Based on these results, we construct neural networks that predict accuracy and computation time from data characteristics and configuration. Given a new dataset, our framework uses the neural networks to explore the configuration space and derive a Pareto front that represents the trade-off between accuracy and computational cost. This front allows us to automatically identify the configuration that maximizes accuracy under a user-specified time constraint. The proposed method is implemented in the R package glmnetconf, available at https://github.com/Shuhei-Muroya/glmnetconf.git.
Keywords: lasso, glmnet, hyperparameter optimization, computational efficiency

1 Introduction

The lasso (least absolute shrinkage and selection operator; Tibshirani, 1996) is a popular method for regression that uses an ℓ₁ penalty to obtain sparse regression coefficients. The lasso can be applied to high-dimensional data, where the number of predictors exceeds the number of observations, and it provides interpretable results. Due to these features, the lasso is widely applied across various fields such as signal processing (Candès and Wakin, 2008), genomics (Bøvelstad et al., 2007), and astronomy (Lu and Li, 2015).

∗ Email: muroya.shuhei.697@s.kyushu-u.ac.jp
† Email: hirose@imi.kyushu-u.ac.jp

Figure 1: [Three panels showing coefficient paths against log(λ): (a) glmnet (default), (b) glmnet (manual), (c) LARS.] The solution path for the same dataset by each package. The experimental setting is identical to that in Section 4, with N = 1500, p = 800, ρ = 0.5. For clarity, we display the solution path for only the first 10 coefficients to avoid visual congestion. glmnet (default) denotes the estimator by glmnet using the default configuration, whereas glmnet (manual) denotes the estimator by glmnet whose configuration is manually optimized by the authors. LARS denotes the estimator by the LARS algorithm, which provides the exact solution path and thus serves as a reference (ground truth). By comparison, the results of glmnet (manual) are seen to be close to those of LARS.

Let N be the number of observations and p be the number of predictors. Let X ∈ ℝ^{N×p} be the design matrix with rows x_i ∈ ℝ^p for i = 1, …, N, and let y ∈ ℝ^N be the response vector. We assume that the explanatory variables are standardized and the response vector is centered.
Under these assumptions, the lasso estimates the coefficient vector β ∈ ℝ^p by solving

    minimize_β  (1/(2N)) ‖y − Xβ‖₂² + λ‖β‖₁,    (1)

where λ > 0 is a regularization parameter, and ‖·‖₂ and ‖·‖₁ denote the ℓ₂- and ℓ₁-norms, respectively. Since the lasso solution does not generally have a closed-form expression because of the non-differentiability of the ℓ₁ norm, various algorithms have been proposed to solve this problem (Fu, 1998; Osborne et al., 2000; Efron et al., 2004; Daubechies et al., 2004; Beck and Teboulle, 2009; Friedman et al., 2010; Boyd et al., 2011). In particular, the coordinate descent algorithm (Fu, 1998; Friedman et al., 2010) and the Least Angle Regression (LARS) algorithm (Efron et al., 2004) have been widely used. The coordinate descent algorithm provides a fast approximate solution by iteratively updating each coefficient. In contrast, LARS yields the exact entire solution path for the lasso problem at a higher computational cost. The coordinate descent algorithm and the LARS algorithm are implemented in the R packages glmnet and lars, respectively. The glmnet package is especially widely used due to its computational efficiency. In fact, it was downloaded over 1.3 million times in 2024, more than ten times the downloads of lars, according to the CRAN download logs provided by cranlogs (Csárdi, 2019). However, our numerical experiments reveal that the glmnet solution path can differ significantly from the exact solution path for correlated high-dimensional data. These discrepancies appear to be caused by the default settings in glmnet. In particular, the convergence threshold and the specification of the λ sequence play critical roles. Hereafter, we refer to these settings as the configuration of glmnet.
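To make the objective in (1) concrete, the following is a minimal numerical sketch of its evaluation. This is a Python/NumPy illustration for exposition only; the paper's software is the R package glmnet, and the function name `lasso_objective` is ours.

```python
import numpy as np

def lasso_objective(X, y, beta, lam):
    """Objective (1): (1/(2N)) * ||y - X beta||_2^2 + lam * ||beta||_1."""
    N = X.shape[0]
    residual = y - X @ beta
    return residual @ residual / (2 * N) + lam * np.abs(beta).sum()
```

At β = 0 this reduces to ‖y‖₂²/(2N) (the null deviance scale used by the convergence criterion discussed in Section 2.2), and the penalty contributes λ‖β‖₁ on top of the quadratic loss.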
To illustrate how the configuration affects the results, Figure 1 compares the solution paths of the first 10 coefficients from three estimators for a given dataset: glmnet (default), glmnet (manual), and LARS. The glmnet (default) and glmnet (manual) labels denote the estimators obtained using the default and manually tuned configurations, respectively. The label LARS corresponds to the exact solution path computed by the lars package. As shown in the figure, the solution path of glmnet (default) is substantially different from that of LARS, whereas the path of glmnet (manual) is much closer to that of LARS. This result implies that appropriate tuning of the configuration is crucial for obtaining an accurate solution path. In practice, users often rely on the default configuration without knowing its critical impact. This is partly because glmnet returns results without warning, even when the default configuration is inappropriate for the given dataset. Furthermore, manual tuning is rarely performed, as it requires expert knowledge of the underlying algorithm. Crucially, improving accuracy typically increases computational time, resulting in a trade-off between accuracy and computational efficiency. While an appropriate configuration should ideally be determined for each dataset to balance this trade-off, a systematic approach for such tuning has not yet been established. Therefore, we propose a data-driven framework that automatically determines an appropriate configuration of glmnet based on the characteristics of the dataset. Specifically, we generate large-scale artificial datasets covering a wide range of data characteristics. For each dataset, we solve the lasso problem using glmnet under various configurations. Additionally, we employ LARS to obtain the exact solution path as a reference for evaluating accuracy.
We then record the accuracy of the glmnet solution path and its computation time, including cross-validation. These simulation results are used to train a neural network that learns the relationship among data characteristics, configuration, and the corresponding performance. Once trained, the neural network can predict accuracy and computation time for a new dataset and configuration. Based on these predictions, the Pareto front is derived to capture the trade-off between accuracy and computation time. From this front, our proposed framework automatically selects the configuration that achieves the highest possible accuracy without exceeding the user-specified computation time. A key feature of our framework is its ability to explicitly manage the trade-off between accuracy and computation time. This capability enables users to perform configuration tuning that explicitly accounts for computational costs. Furthermore, the tuning process itself is fast, thereby keeping the total run time shorter than that of LARS.

The organization of the paper is as follows. In Section 2, we present the background and motivation for configuration tuning. In Section 3, we explain our proposed method and demonstrate how to tune the configuration of the glmnet function based on the dataset. Section 4 evaluates the performance of our proposed method through numerical simulations and an application to compressed sensing.

2 Algorithmic details and default configuration in glmnet

This section reviews the computational details of the coordinate descent algorithm and the default configuration of glmnet. We specifically discuss why this default configuration can result in inappropriate solutions. We also briefly describe the LARS algorithm as a reference for the exact solution path.
2.1 Coordinate descent algorithm

The R package glmnet implements the coordinate descent algorithm to solve the lasso problem efficiently (Friedman et al., 2010). For a given value of λ, the coordinate descent algorithm computes an approximate solution through an iterative procedure. To obtain the entire solution path, the algorithm is repeatedly applied over a sequence of λ values. Linearly interpolating these solutions yields an approximate solution path.

The coordinate descent algorithm optimizes one coefficient at a time while holding the others fixed, and cycles through all coefficients until convergence. For a fixed value of the regularization parameter λ, the glmnet package minimizes the objective function in (1) by iteratively updating each coefficient using the coordinate descent algorithm. At iteration t, the update for the j-th coefficient β_j^{t+1} is given by

    β_j^{t+1} = S_λ( (1/N) X_j^⊤ r^{(j)} ),

where X_j denotes the j-th column of X, S_λ(z) = sign(z)(|z| − λ)₊ is the soft-thresholding operator, and r^{(j)} represents the partial residual vector with elements r_i^{(j)} = y_i − Σ_{k≠j} x_{ik} β_k^t. This procedure cyclically updates all coefficients until convergence.

2.2 Configuration details: the convergence threshold and the sequence of λ

This subsection examines the roles and default settings of two key components: the convergence threshold and the sequence of λ values. They are commonly used in both the glmnet() and cv.glmnet() functions of the glmnet package. Here, the function glmnet() computes a solution path on a grid of λ values, while cv.glmnet() performs cross-validation to select the optimal λ from this path.

The convergence threshold. The convergence threshold, denoted by τ, determines the stopping criterion for the coordinate descent algorithm.
Specifically, the iterative updates terminate when the improvement in the objective function falls below the product of τ and the null deviance. The smaller τ is, the stricter the stopping condition becomes, which can improve the accuracy of the solution but also increase computation time. The default value of τ is 10⁻⁷.

The λ sequence. The sequence of λ values defines the grid over which the lasso solution path is computed, as discussed in the previous subsection. Extending the range of λ and refining the grid yields a more accurate solution path, but increases computation time. The λ sequence is determined by its range, defined by the maximum value λ_max and the minimum value λ_min, and the number of grid points n_λ. In the glmnet package, the default values for these parameters are specified as follows. λ_max is defined as the smallest λ for which all coefficients are zero, given by

    λ_max = max_j (1/N) |X_j^⊤ y|.

The default lower bound λ_min^def is defined as:

    λ_min^def = 10⁻² λ_max if N < p,  and  10⁻⁴ λ_max if N ≥ p.

The default sequence of λ values is then generated on a logarithmic scale from λ_max down to λ_min^def with n_λ^def = 100 points.

2.3 Limitations of the default configuration

We investigate why the default configuration of glmnet may lead to inaccurate results for highly correlated datasets.

The convergence threshold. Although the default threshold of 10⁻⁷ is computationally efficient, our numerical experiments suggest that a stricter threshold is often necessary to ensure accuracy. There are two main reasons for this requirement. First, high correlations among predictors lead to a flat lasso objective function. In such cases, the coordinate descent algorithm tends to move in a zig-zag pattern with very small update steps. Consequently, the improvement in the objective function at each step becomes extremely small, often causing the algorithm to terminate prematurely.
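The effect of the threshold τ can be illustrated with a small coordinate-descent sketch. This is a Python illustration, not the glmnet internals: the stopping rule below (objective improvement per full sweep below τ times the null deviance) is a simplified stand-in for glmnet's internal criterion, and `coordinate_descent` is our own name.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator S_lam(z) = sign(z) * (|z| - lam)_+."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def coordinate_descent(X, y, lam, tau, max_sweeps=10_000):
    """Cyclic coordinate descent for objective (1).

    Stops when the objective improvement over one full sweep drops below
    tau * (null deviance) -- a simplified stand-in for glmnet's rule.
    """
    N, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / N      # (1/N) X_j' X_j (= 1 if standardized)
    null_dev = y @ y / (2 * N)             # objective value at beta = 0
    r = y.copy()                           # full residual, kept up to date

    def objective():
        return r @ r / (2 * N) + lam * np.abs(beta).sum()

    prev = objective()
    for _ in range(max_sweeps):
        for j in range(p):
            r += X[:, j] * beta[j]                               # partial residual r^(j)
            beta[j] = soft_threshold(X[:, j] @ r / N, lam) / col_sq[j]
            r -= X[:, j] * beta[j]
        cur = objective()
        if prev - cur < tau * null_dev:    # improvement below threshold: stop
            break
        prev = cur
    return beta
```

On highly correlated predictors, a loose τ makes the loop exit after the first slow, zig-zagging sweeps, while a stricter τ continues along the same monotone descent path and can only reach an equal or lower objective value.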
Indeed, Massias et al. (2018) note that stopping rules based only on changes in the primal objective can lead to suboptimal solutions, and they recommend monitoring the duality gap as a more rigorous criterion. However, since our aim is to improve glmnet without modifying its internal source code, we do not adopt the duality-gap criterion. Instead, we address this issue by using a significantly stricter threshold to improve the numerical precision.

Second, a stricter threshold is necessary to provide a more accurate initialization for the warm start strategy. As already mentioned, the algorithm is repeatedly applied over a sequence of λ values, denoted by λ₁ = λ_max > λ₂ > ⋯ > λ_{n_λ} = λ_min. In this sequential process, the algorithm employs a warm start strategy, where the solution obtained at the previous λ is used to initialize the optimization for the current λ. If the optimization at the previous step stops too early due to a loose threshold, the resulting suboptimal solution provides an inaccurate starting point for the next step. With a flat objective function, the solver may fail to move sufficiently away from this poor initialization, because the update steps are small and the stopping criterion is satisfied. Consequently, the accumulation of such errors may cause the computed solution to deteriorate progressively. Therefore, maintaining a tight convergence threshold is essential to prevent this error accumulation and to improve the reliability of the entire solution path.

The λ sequence.

• Range of the sequence. The default sequence spans from λ_max down to either 10⁻² λ_max or 10⁻⁴ λ_max, depending on whether N < p. However, numerical experiments indicate that this range is sometimes too narrow to fully capture the behavior of the true solution path.
In particular, when compared with the exact path obtained by LARS, the default sequence often fails to explore the region of sufficiently small λ, where additional changes in the zero–nonzero pattern can occur. If these regions are omitted, the solution path computed by glmnet may miss important structural changes in the coefficients. From the viewpoint of cross-validation, a narrow range restricts the diversity of candidate models. More critically, the default sequence may fail to include the optimal λ, since the optimal λ tends to be small when the correlation among predictors is high (Hebiri and Lederer, 2013). Therefore, extending the range of the λ sequence towards zero is essential to increase the probability that the optimal λ is included in the candidate set.

• Number of grid points. The default number of grid points is n_λ^def = 100. However, this fixed number may be insufficient relative to the dimension p. The LARS algorithm (Section 2.4) implies that the active set of the lasso solution changes at least min{N, p} times along the path. Thus, when N and p are larger than 100, the default grid cannot capture all changes in the true solution path, and linear interpolation between coarse grid points may degrade the accuracy of the approximated path. From the perspective of cross-validation, a small number of λ candidates means that the search space for selecting λ becomes too limited, which can result in suboptimal model selection.

The discussions above demonstrate that the default configuration, which is independent of the dataset, is insufficient to maintain numerical accuracy. Although manual tuning of the configuration is possible, an automated approach adapted to the dataset is highly desirable in practice.
Therefore, we propose a data-driven automated framework that determines the appropriate configuration to achieve accuracy comparable to LARS, while maintaining computational efficiency.

2.4 LARS algorithm and solution path accuracy

Efron et al. (2004) proposed the Least Angle Regression (LARS) algorithm, which provides an exact computation of the entire solution path of the lasso problem (1). The algorithm begins at λ = ∞, where the lasso solution is trivially 0 ∈ ℝ^p. As λ decreases, it computes a piecewise linear and continuous solution path. Each knot along this path corresponds to a point where the active set A = {j : β_j(λ) ≠ 0} changes. At every iteration, the algorithm updates the direction of the coefficient path so that the Karush–Kuhn–Tucker (KKT) optimality conditions remain satisfied. To determine this direction, the algorithm must compute the inverse of the Gram matrix (X_A^⊤ X_A)⁻¹, where X_A denotes the submatrix of active predictors. Because the active set A changes sequentially along the path, the LARS algorithm requires such matrix inversions to be performed at least min{N, p} times. Consequently, the computational cost increases rapidly with the number of variables p.

In this study, we utilize LARS as a reference in three ways: (i) the exact path serves as the ground truth for evaluating approximation accuracy, (ii) the exact number of knots is used to investigate the validity of the default λ grid in glmnet, and (iii) the computation time provides an upper bound for efficiency comparisons.

3 Proposed method

3.1 Overview of our proposal

Our framework aims to automatically determine the appropriate configuration for a given dataset. Specifically, it aims to maximize accuracy given a user-specified computation time, denoted as T_hope. To this end, we focus on tuning two key parameters: the convergence threshold τ and the sequence length n_λ.
The detailed definition of n_λ is provided in Section 3.2. Figure 2 illustrates the overall workflow of our proposal, which consists of two main steps:

• Step 1: Construction of the predictive model (Section 3.2). The upper panel of Figure 2 shows the preparatory stage. Starting from diverse simulation parameters, we generate a summary dataset to train a predictive model, referred to as glmnet-MLP. This model learns the mapping between dataset characteristics (e.g., N, p, γ), configurations (τ, n_λ), and the resulting performance metrics, specifically the computation time and the Solution Path Error (SPE). Sections 3.2.1 and 3.2.2 provide the details of this process, including the formal definition of SPE, the generation of the summary dataset, and the training strategy.

• Step 2: Configuration tuning using the predictive model (Section 3.3). The lower panel of Figure 2 presents the execution phase. Given a new dataset, the framework extracts its features and utilizes the trained glmnet-MLP to predict performance. Finally, by deriving the Pareto front of the predicted SPE and computation time, the best configuration is automatically selected to maximize accuracy while satisfying the time constraint T_hope. The details of this tuning strategy are provided in Section 3.3.2.

Figure 2: [Flowchart. Step 1: simulation parameters (e.g., N, p, Σ, β) → generate summary dataset (e.g., N, p, γ, SPE, T_glmnet) (Section 3.2.1) → train MLP (Section 3.2.2) → predictive model (glmnet-MLP). Step 2: new dataset (X*, y*) → extract dataset features (e.g., N, p, γ) → predict SPE and computation time using the loaded glmnet-MLP → derive Pareto front and select configuration under time constraint T_hope (Section 3.3.2) → best configuration.] Overview of the proposed framework.
The process consists of two phases: Step 1 constructs a predictive model using a summary dataset generated from simulation parameters. Step 2 utilizes this trained model to predict performance metrics for a target dataset, selecting the best configuration that satisfies the time constraint T_hope.

3.2 Step 1: Construction of the predictive model

In this step, we construct a predictive model using a multilayer perceptron (MLP) (Rumelhart et al., 1986), referred to as glmnet-MLP. The objective of this model is to predict the performance metrics, specifically the SPE and the computation time T_{glmnet,τ,n_λ}. Here, T_{glmnet,τ,n_λ} is defined as the total run time, including the cross-validation process for selecting the optimal λ. Regarding the dataset characteristics, we specifically include the sample size N, the number of predictors p, and the eigenvalue features of the covariance matrix. The eigenvalues are included to capture the correlation structure among the predictors. In addition to these data features, we incorporate the configuration parameters: the convergence threshold τ and the length of the λ sequence, n_λ.

One characteristic of our framework is the construction of the λ sequence using n_λ. Unlike the default configuration, we propose a flexible construction where the sequence length is determined by n_λ (n_λ > n_λ^def). Specifically, we extend the default sequence by appending (n_λ − n_λ^def) additional values evenly spaced between the default minimum λ_min^def and 0. Through this parameterized construction, the complex problem of designing an appropriate λ sequence is effectively reduced to determining a single optimal value for n_λ.

3.2.1 Construction of the summary dataset

To train the glmnet-MLP, we construct a large-scale dataset, which we refer to as the summary dataset. This dataset is created by generating artificial datasets and recording the corresponding glmnet performance.
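The λ-sequence construction described above (the default log-spaced grid, extended toward zero by n_λ − n_λ^def evenly spaced values) can be sketched as follows. This is a Python illustration; details such as whether the endpoints λ_min^def and 0 are themselves included among the appended values are our assumptions, not necessarily glmnetconf's exact implementation (here both endpoints are excluded).

```python
import numpy as np

def default_lambda_sequence(X, y, n_default=100):
    """Default glmnet-style grid: n_default log-spaced values from lambda_max
    down to lambda_min^def (1e-2 * lambda_max if N < p, else 1e-4 * lambda_max)."""
    N, p = X.shape
    lam_max = np.max(np.abs(X.T @ y)) / N
    lam_min = lam_max * (1e-2 if N < p else 1e-4)
    return np.exp(np.linspace(np.log(lam_max), np.log(lam_min), n_default))

def extended_lambda_sequence(X, y, n_lambda, n_default=100):
    """Append (n_lambda - n_default) values evenly spaced between the default
    minimum and 0, endpoints excluded (an implementation choice of this sketch)."""
    seq = default_lambda_sequence(X, y, n_default)
    n_extra = n_lambda - n_default
    if n_extra > 0:
        extra = np.linspace(seq[-1], 0.0, n_extra + 2)[1:-1]
        seq = np.concatenate([seq, extra])
    return seq
```

With this parameterization, choosing the grid reduces to choosing the single integer n_λ, exactly as described above.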
Each sample in the summary dataset consists of the data characteristics (N, p, eigenvalues), the configuration (τ, n_λ), and the resulting performance metrics (SPE, computation time). The detailed construction procedure is as follows:

1. Parameter setting and feature extraction: Specify the simulation parameters: sample size N, number of predictors p, covariance matrix Σ, true coefficients β, and error variance σ². At this stage, compute the eigenvalue features of Σ. Select the top and bottom five eigenvalues, denoted as γ_k (k = ±1, …, ±5), where positive and negative indices correspond to the largest and smallest eigenvalues, respectively.

2. Data generation: Using the specified parameters, generate a synthetic dataset (X, y) according to

    x_i ∼ N(0, Σ) (i = 1, …, N),  ϵ ∼ N(0, σ²I),  y = Xβ + ϵ,

where N(µ, Σ) denotes the multivariate normal distribution with mean µ and covariance matrix Σ.

3. Performance evaluation: Compute the lasso solutions using glmnet under various configurations (τ, n_λ). For each configuration, we quantify the discrepancy between the approximate solution path and the exact path using the Solution Path Error (SPE), defined as:

    SPE_{τ,n_λ} = (1/k) Σ_{i=1}^{k} (1/√p) ‖β_true(λ_i) − β̂^{glmnet}_{τ,n_λ}(λ_i)‖₂,

where {λ_i}_{i=1}^{k} is a reference sequence of k = 20 points logarithmically spaced from λ_max to λ_start = 0.001. Here, β_true(λ_i) is the exact solution obtained via LARS, and β̂^{glmnet}_{τ,n_λ}(λ_i) is the solution estimated by glmnet. We also record the computation time T_{glmnet,τ,n_λ}.

4. Data aggregation: Record the combination of the data characteristics, configuration, and performance metrics as a single data point:

    (N, p, γ₁, …, γ₋₁, τ, n_λ, SPE_{τ,n_λ}, T_{glmnet,τ,n_λ}).

5. Iteration: Repeat Steps 1–4 under various parameter settings.
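The SPE computed in Step 3 can be sketched as follows. This is a Python illustration that takes the two paths already evaluated on the common reference grid of k = 20 λ values; the function name `solution_path_error` is ours.

```python
import numpy as np

def solution_path_error(beta_true, beta_hat):
    """SPE: mean over the k reference lambdas of the scaled l2 distance
    (1/sqrt(p)) * ||beta_true(lambda_i) - beta_hat(lambda_i)||_2.

    Both arguments are (k, p) arrays holding the exact (LARS) and
    approximate (glmnet) paths evaluated on the same reference grid."""
    k, p = beta_true.shape
    diffs = np.linalg.norm(beta_true - beta_hat, axis=1) / np.sqrt(p)
    return diffs.mean()
```

Identical paths give SPE = 0, and a constant coordinate-wise discrepancy of c in one coefficient across the grid gives SPE = c/√p, so the 1/√p scaling keeps the error comparable across dimensions.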
Consequently, this process yielded a total of 810,492 samples, which constitute the summary dataset. Detailed specifications of the simulation parameters and the summary dataset are provided in Appendix A.

3.2.2 Training strategy and determination of network architecture

We train the glmnet-MLP using the summary dataset. The dataset is randomly split into training, validation, and test sets. Prior to training, the target variables (SPE and computation time) are log-transformed and standardized to stabilize the learning process. To obtain predictions on the original scale, we apply the inverse transformations. To determine the optimal network architecture (e.g., number of layers and units) and the learning rate, we employ Bayesian optimization. We formulate the task as a black-box optimization problem to minimize the validation error, implemented using the Optuna framework (Akiba et al., 2019). Further details regarding the training protocol, the search space for hyperparameters, and the final network architecture are provided in Appendix B.

3.3 Step 2: Configuration tuning using the predictive model

3.3.1 Definition of Pareto front

First, we introduce the concept of Pareto optimality for a multi-objective optimization problem. Consider the problem of simultaneously minimizing a vector-valued objective function f : X → ℝ^M:

    min_{x∈X} f(x) = min_{x∈X} ( f₁(x), …, f_M(x) ).    (2)

In general, a unique solution that minimizes all objective functions simultaneously does not exist. Instead, we seek Pareto optimal solutions, which represent optimal trade-offs among the objectives.

Definition 1 (Weak dominance). For x, x′ ∈ X, if f_m(x) ≤ f_m(x′) for all m = 1, …, M, we say that f(x) weakly dominates f(x′).

A Pareto optimal solution is defined as follows:

Definition 2 (Pareto optimal solution and Pareto front).
We say that x* ∈ X is a Pareto optimal solution if there exists no x ∈ X such that f(x) weakly dominates f(x*) with f(x) ≠ f(x*). In addition, we define the Pareto front as the set of the objective values of Pareto optimal solutions. The Pareto front P* is given by:

    P* = { f(x*) | x* ∈ X is a Pareto optimal solution }.

Note that, theoretically, there may exist an infinite number of Pareto optimal solutions. Thus, we need to select the best solution from the set of Pareto optimal solutions.

3.3.2 Pareto front for optimizing configuration

In this section, we describe the procedure to tune the glmnet configuration using the trained predictive model. Our goal is to determine the optimal configuration (τ*, n_λ*) for a new dataset (X*, y*) under a user-specified computation time constraint, denoted as T_hope. The specific procedure is as follows:

1. Feature extraction: Compute the data characteristics for the target dataset (X*, y*). Specifically, calculate the sample size N, the dimension p, and the eigenvalue statistics γ_i (i = ±1, …, ±5) derived from the sample covariance matrix of X*. Note that unlike the training phase (Step 1), where the eigenvalues were computed from the true covariance matrix Σ, here they are derived from the sample covariance matrix of X*.

2. Model setup: Fix these extracted features in glmnet-MLP. Consequently, the MLP functions as a mapping from a configuration (τ, n_λ) to the predicted SPE and computation time. This mapping corresponds to the objective function f(x) in Eq. (2).

3. Random sampling: Randomly sample K configurations {(τ_k, n_λ^k)}_{k=1}^{K} from the search space, where τ is sampled from [10⁻⁹, 10⁻⁷] on a log scale and n_λ from [100, 2p].

4.
Performance prediction: Obtain the predictions {SPE_{τ_k,n_λ^k}, T_{glmnet,τ_k,n_λ^k}}_{k=1}^{K} by substituting the sampled configurations {(τ_k, n_λ^k)}_{k=1}^{K} into the mapping defined in the Model setup step.

5. Pareto front extraction: Identify the discrete Pareto front P̂* from the set of predicted outcomes {SPE_{τ_k,n_λ^k}, T_{glmnet,τ_k,n_λ^k}}_{k=1}^{K}.

6. Best configuration selection: From the Pareto front P̂*, select the optimal configuration (τ*, n_λ*) that minimizes the SPE subject to a user-specified computation time constraint T_hope. The index of the best configuration k* is determined by:

    k* = argmin_{k∈{1,…,K}} SPE_{τ_k,n_λ^k}  subject to  T_{glmnet,τ_k,n_λ^k} < T_hope,  (SPE_{τ_k,n_λ^k}, T_{glmnet,τ_k,n_λ^k}) ∈ P̂*.

Finally, the best configuration is given by (τ*, n_λ*) = (τ_{k*}, n_λ^{k*}).

Figure 3: [Scatter plot of predicted computation time against predicted SPE, with points labeled Not Pareto Front, Pareto Front, and Best Configuration, and a dashed line at T_hope = 20.] Visualization of the Pareto front derived from the glmnet-MLP for the same dataset used in Figure 1. The horizontal and vertical axes represent the predicted SPE and computation time, respectively. The blue points represent the set of Pareto optimal solutions. From this set, the red triangle highlights the best configuration selected based on the user-specified time constraint (T_hope = 20 s), indicated by the horizontal dashed line. Under this constraint, the best configuration was identified as (τ*, n_λ*) = (1.159 × 10⁻⁹, 864).

By applying this tuning procedure to the same dataset used in Figure 1, we obtain the Pareto front shown in Figure 3. In this example, we set the time constraint to T_hope = 20 seconds. This approach offers significant advantages in terms of both efficiency and interpretability. First, the optimization process is extremely fast; for instance, computing the Pareto front for Figure 3 took only about one second.
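Steps 5 and 6 of the procedure, discrete Pareto-front extraction and constrained selection, can be sketched as follows. This is a Python illustration built directly on the weak-dominance definition of Section 3.3.1, not the glmnetconf implementation; the function names are ours.

```python
import numpy as np

def pareto_front(points):
    """Indices of Pareto-optimal rows of `points`, shape (K, 2): (SPE, time),
    both minimized. A point survives if no other point weakly dominates it
    while differing from it (Definitions 1 and 2)."""
    keep = []
    for i, pi in enumerate(points):
        dominated = any(
            np.all(pj <= pi) and np.any(pj < pi)
            for j, pj in enumerate(points) if j != i
        )
        if not dominated:
            keep.append(i)
    return keep

def best_configuration(configs, spe, time, t_hope):
    """Among Pareto-optimal candidates with predicted time below t_hope,
    pick the one with the smallest predicted SPE (Steps 5-6)."""
    points = np.column_stack([spe, time])
    front = pareto_front(points)
    feasible = [i for i in front if time[i] < t_hope]
    if not feasible:
        return None                         # no configuration meets the budget
    k_star = min(feasible, key=lambda i: spe[i])
    return configs[k_star]
```

Because each candidate is checked against all others, this brute-force front costs O(K²) comparisons, which is negligible for the few thousand sampled configurations considered here.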
The only computationally intensive step is the eigenvalue calculation. Once the features are extracted, evaluating thousands of configurations via the neural network takes negligible time. This efficiency meets the requirement discussed in Section 1 to optimize the configuration as quickly as possible. Second, the Pareto front provides visual clarity regarding the trade-off between SPE and computation time. This allows users to intuitively assess the cost of accuracy. For example, in Figure 3, one can observe a substantial difference in SPE between computation times of 20 seconds and 5 seconds. Based on this visualization, users can make informed decisions, such as whether to relax or tighten the constraint T_hope to achieve the desired balance.

We implemented the proposed framework as an R package named glmnetconf. This package provides the configuration tuning framework proposed in this study. Furthermore, it incorporates a mechanism to select the appropriate package (i.e., glmnet or lars) considering computation time. The details of this solver selection and specific usage examples are provided in Appendix C.

4 Numerical experiments

4.1 Simulation

In this section, we verify that our proposed method properly tunes the configuration (τ, n_λ) through numerical experiments. The simulation dataset with N observations and p predictors is generated as follows:

    x_i ~iid~ N(0, (1 − ρ)I_p + ρ 1_p 1_p^⊤),  X = (x₁, …, x_N)^⊤,
    β = P (1, …, 1, 0, …, 0)^⊤  (⌊p/2⌋ ones followed by p − ⌊p/2⌋ zeros),
    ε ∼ N(0, I_N),  y = Xβ + ε,

where P is a random permutation matrix of size p × p, and ⌊·⌋ denotes the floor function. In this simulation, we compare the performance of the following three methods:

• glmnet (default): glmnet with the default configuration.

• glmnet (proposed): glmnet with the configuration optimized by our proposed method with T_hope = 20 s.
• LARS: Serves as a reference to provide the exact solution path by the lars package.

We conducted the experiments for all combinations of N, p ∈ {100, 500, 1000, 1500, 2000} and ρ ∈ {0, 0.1, 0.3, 0.5, 0.7, 0.9} over 100 simulation runs. To evaluate the predictive performance, we employed the root mean square error (RMSE) computed on test datasets of 100 samples. We also measured the computation time for each method. The regularization parameter λ was selected via ten-fold cross-validation by cv.glmnet() and cv.lars(). From the perspective of numerical stability, we specified mode = "step" in cv.lars() when N = p, while choosing mode = "fraction" otherwise.

Figure 4 presents the results of the numerical experiment. In each panel, the vertical axis represents the average RMSE, and the horizontal axis represents the sample size N. The panels are organized by combinations of the number of predictors p and the correlation ρ. When ρ = 0, the test errors of all three methods are similar across all combinations of N and p. However, when ρ > 0, the test error of glmnet (default) tends to be higher than that of LARS. This result indicates that the default configuration is not appropriate for such correlated data. In contrast, glmnet (proposed) achieves performance comparable to that of LARS in most cases. Although slightly higher errors are observed when p = 2000, this can be attributed to the imposed T_hope, reflecting the trade-off between computational time and accuracy.

Figure 5 reports the average computation time of the experiments using the same layout as Figure 4. Among the three methods, LARS consistently required the longest computation time for p ≥ 1000, and its runtime increased drastically with larger N and p. In contrast, glmnet (proposed) was significantly faster in these settings, with runtimes consistently staying close to T_hope.
Despite this speed advantage, Figure 4 confirms that their predictive accuracy remains comparable. Taken together, these results demonstrate that glmnet (proposed) achieves accuracy comparable to that of LARS while significantly reducing computation time. This suggests that our proposed method successfully selects the appropriate configuration for glmnet adaptively based on the dataset. It is worth noting that for p ≤ 500, glmnet (proposed) occasionally exhibited slightly longer computation times than LARS. This behavior is attributable to the setting of T_hope: because the budget is generous for small problems, glmnet (proposed) utilizes this available time to maximize accuracy.

Figure 4: Comparison of prediction accuracy (RMSE) across different sample sizes N. The plot compares glmnet (default), glmnet (proposed) tuned with T_hope = 20 s, and LARS as the exact reference. The results are averaged over 100 simulation runs. The panels correspond to different combinations of the number of predictors p and the correlation among the predictors ρ. Notably, glmnet (proposed) consistently achieves accuracy comparable to the exact LARS solution across all settings.

4.2 Application to compressed sensing

Compressed sensing (Candès and Wakin, 2008) is a signal processing technique that reconstructs a signal from a compressed representation obtained via a random projection matrix. In this section, we apply our proposed framework to solve the lasso problem arising in compressed sensing.
We compare the reconstruction accuracy and computation time of glmnet (proposed) against glmnet (default) and LARS. We used an image from the MNIST dataset (LeCun et al., 1998) for the experiment, resizing it to 32 × 32 pixels.

First, the image was compressed as follows. The image matrix was vectorized in column-major order to form θ ∈ R^{1024}. Let N′ denote the dimension of the compressed data. We generated a random projection matrix Z ∈ R^{N′ × 1024}, where each element Z_ij was drawn independently from N(0, 1). The vector θ was then compressed into y = Zθ ∈ R^{N′}. In this experiment, we set the compressed dimension to N′ = 700. This process reduces the dimensionality from 1024 to N′, effectively compressing the data.

Next, we reconstructed the original image θ using the compressed vector y and the projection matrix Z.

Figure 5: Comparison of computation time (seconds) across different sample sizes N. Similar to Figure 4, this plot compares glmnet (default), glmnet (proposed) tuned with T_hope = 20 s, and LARS. The results are averaged over 100 simulation runs. The panels correspond to different combinations of the number of predictors p and the correlation among the predictors ρ. The proposed method is not only significantly faster than LARS but also satisfies T_hope in the majority of cases.

Employing a two-level wavelet basis matrix Ψ, the reconstruction corresponds to solving the following lasso problem:

\hat{\beta} = \operatorname*{argmin}_{\beta} \; \frac{1}{2N'} \| y - X\beta \|_2^2 + \lambda \| \beta \|_1,

where X = ZΨ ∈ R^{N′ × 1024}. The reconstructed image is obtained by \hat{\theta} = \Psi \hat{\beta}.
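Before turning to the results, the compression and reconstruction pipeline above can be sketched in Python as a simplified, self-contained illustration. Two assumptions differ from the paper's setup: the wavelet basis Ψ is replaced by the identity (i.e., θ is assumed sparse directly, so X = Z), and the lasso is solved by plain iterative soft-thresholding (ISTA; Daubechies et al., 2004) rather than glmnet or lars:

```python
import numpy as np

def ista_lasso(X, y, lam, n_iter=500):
    """Minimize (1/(2*n_prime)) * ||y - X b||_2^2 + lam * ||b||_1
    by iterative soft-thresholding (ISTA)."""
    n_prime, p = X.shape
    step = n_prime / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n_prime
        z = b - step * grad
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding
    return b

rng = np.random.default_rng(0)
p, n_comp = 256, 120                       # signal dimension and compressed dimension N'
theta = np.zeros(p)
theta[rng.choice(p, 10, replace=False)] = rng.standard_normal(10)  # sparse signal
Z = rng.standard_normal((n_comp, p))       # random projection, Z_ij ~ N(0, 1)
y = Z @ theta                              # compressed measurements y = Z * theta
theta_hat = ista_lasso(Z, y, lam=0.01)
print(np.linalg.norm(theta_hat - theta) / np.linalg.norm(theta))  # relative error
```

With far fewer measurements than dimensions (120 versus 256), the lasso still recovers the sparse signal to small relative error, which is exactly the mechanism the MNIST experiment exploits at scale.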
Using this formulation, we evaluated the performance of the proposed method.

Figure 6 illustrates the reconstruction results. To quantify the reconstruction quality, we evaluated the RMSE against the original image on the pixel value scale [0, 255]. The glmnet (default) yielded a high RMSE of 31.57, resulting in a degraded image lacking sharpness. In contrast, the proposed method achieved an RMSE of 14.53, which is significantly lower than the default and comparable to the RMSE of 12.22 obtained by the exact solution of LARS. Regarding computational efficiency, glmnet (proposed) required only approximately one-fourth of the computation time of LARS. These results demonstrate that our proposed framework successfully tunes the configuration to achieve accuracy comparable to LARS while maintaining significantly lower computational cost.

4.3 Discussion

Our results demonstrate that the proposed method successfully tunes the configuration to achieve accuracy comparable to the exact solution of LARS, while approximately satisfying the specified computation time constraint T_hope. This suggests that our framework selects an appropriate configuration that overcomes the limitations inherent in the default configuration. The observed improvement in test error is primarily attributable to the expanded search range for λ: this wider range enables cross-validation to identify optimal λ values that restrictive default grids often miss. Regarding computation time, the prediction accuracy of the glmnet-MLP proved reasonably reliable, which enabled the successful selection of configurations that adhered to the time constraint T_hope.

However, a primary limitation of our framework lies in the range of the training dataset. Our model was trained on synthetic datasets generated from multivariate normal distributions within specific ranges of sample size N and dimension p.
Since it is practically impossible to learn the characteristics of all possible data distributions, this dependence on the training dataset is unavoidable. In particular, caution is required when extrapolating to cases where N or p exceeds the upper bounds of the training range defined above. Nevertheless, the compressed sensing experiment provided a promising indication of robustness: although N and p were within the training range, the structural properties of the design matrix X differed from the multivariate normal assumption used in training. The successful application in this context demonstrates our method's potential applicability to datasets with design matrices outside the training distribution.

Finally, regarding hardware dependency, computation time inevitably varies across different computing environments. From a practical standpoint, however, the order of magnitude is often more critical than precise timing. Minor deviations of a few seconds are generally acceptable in real-world applications, provided the algorithm operates within the expected time scale.

5 Conclusion

In this study, we established a data-driven framework for configuration tuning of glmnet by learning from large-scale artificial datasets. Our approach explicitly models the trade-off between accuracy and computation time. This capability enables the identification of a configuration that achieves accuracy comparable to LARS while satisfying user-specified time constraints. By employing Pareto optimization, we achieved a tuning mechanism that is both transparent and efficient.

In future work, we aim to address the limitations discussed in Section 4.3. Specifically, we plan to enhance the model's generalizability by expanding the training dataset to include a wider range of sample sizes and dimensions (N, p) and diverse data distributions.
Furthermore, extending this framework to other families of generalized linear models (GLMs) supported by glmnet (e.g., logistic and Poisson regression) represents a promising avenue, given their shared algorithmic structure.

Figure 6: Visual comparison of reconstruction results. (a) The original image. The bottom row displays the reconstructed images along with their RMSE values calculated against the original image (on a [0, 255] scale): (b) default glmnet (RMSE: 31.57), (c) glmnet tuned by the proposed method (RMSE: 14.53), and (d) LARS (RMSE: 12.22). Notably, regarding computation time, glmnet (proposed) required only 13.2 s, whereas LARS required 53.4 s. This demonstrates that our approach successfully identifies an appropriate configuration that is both accurate and computationally efficient.

Appendix A Details of the summary dataset

A.1 Hyperparameters for summary dataset

In Section 3.2.1, we described the generation of the design matrix X and response y used to build the summary dataset for training the glmnet-MLP. To generate X and y, we need to specify the parameters Σ, β, and σ. This appendix provides the specific details of these parameter settings.

Structure of the true covariance matrix Σ. We employed four types of covariance matrix for Σ:

1. Compound symmetry covariance matrix:

\Sigma = (1 - \rho) I_p + \rho \mathbf{1}_p \mathbf{1}_p^\top =
\begin{pmatrix}
1 & \rho & \cdots & \rho \\
\rho & 1 & \cdots & \rho \\
\vdots & \vdots & \ddots & \vdots \\
\rho & \rho & \cdots & 1
\end{pmatrix}
\quad (0 < \rho < 1)

2. AR(1) covariance matrix:

\begin{pmatrix}
1 & \rho & \cdots & \rho^{p-1} \\
\rho & 1 & \cdots & \rho^{p-2} \\
\vdots & \vdots & \ddots & \vdots \\
\rho^{p-1} & \rho^{p-2} & \cdots & 1
\end{pmatrix}
\quad (0 < \rho < 1)

3. Random structured covariance matrix: The construction procedure for the random structured covariance matrix is based on Hirose et al. (2017). The specific steps are as follows:

(a) Define the set D = [−0.75, −0.25] ∪ [0.25, 0.
75], and construct a p × p matrix E, where each element E_{i,j} is drawn independently from a uniform distribution U(D).

(b) Assign 0 to some off-diagonal elements of the generated matrix E. The number and specific locations of these assignments are determined by uniform random selection.

(c) Compute \tilde{E} = (E + E^\top)/2.

(d) Calculate \tilde{\Omega} = \tilde{E} + (0.1 - \lambda_{\min}) I, where \lambda_{\min} is the minimum eigenvalue of \tilde{E}.

(e) Let L = \mathrm{diag}(\tilde{\Omega}^{-1}), and then compute \Omega = L^{1/2} \tilde{\Omega} L^{1/2}.

(f) Finally, the random structured covariance matrix C is given by C = \Omega^{-1}.

4. Inverse of the random structured covariance matrix: We adopt the inverse of C as the covariance matrix.

Structure of the true coefficient β. We prepared the following four structural patterns for β ∈ R^p:

1. ⌊p/2⌋ elements are 1, and the others are 0.

2. ⌊p/10⌋ elements are 1, and the others are 0.

3. ⌊p/2⌋ elements are generated from N(0, 1), and the others are 0.

4. ⌊p/10⌋ elements are generated from N(0, 1), and the others are 0.

In all cases, the positions of the β elements are randomly permuted.

Structure of the true error standard deviation σ. We employed two settings for the noise level σ ∈ R:

1. σ = 1

2. σ = p/10

A.2 Scope of the summary dataset

This section describes the scope of the summary dataset. Specifically, the distribution of the sample size N and the number of predictors p directly determines the applicable range of our proposed method. Figure 7 illustrates the distribution of N and p within the dataset. The dataset covers a broad spectrum of dimensions, explicitly defined by the ranges N ∈ [30, 3000] and p ∈ [10, 2980]. To ensure the accuracy of computation time measurements, we avoided large-scale parallelization. This constraint significantly increased the total time required to generate the summary dataset. Consequently, we employed a denser sampling strategy in regions where N and p are small.
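To make the A.1 settings concrete, the random structured covariance construction of steps (a)–(f) can be sketched as follows. This is an illustrative Python translation of the R procedure; the function name and the fraction of zeroed off-diagonal entries are assumptions for the sketch:

```python
import numpy as np

def random_structured_cov(p, zero_frac=0.5, seed=0):
    """Random structured covariance following A.1 (after Hirose et al., 2017)."""
    rng = np.random.default_rng(seed)
    # (a) entries from U(D), D = [-0.75, -0.25] U [0.25, 0.75]:
    #     a uniform magnitude in [0.25, 0.75] with a random sign
    E = rng.uniform(0.25, 0.75, size=(p, p)) * rng.choice([-1.0, 1.0], size=(p, p))
    # (b) zero out a randomly chosen subset of off-diagonal entries
    mask = rng.random((p, p)) < zero_frac
    np.fill_diagonal(mask, False)
    E[mask] = 0.0
    E_tilde = (E + E.T) / 2                               # (c) symmetrize
    lam_min = np.linalg.eigvalsh(E_tilde).min()
    Omega_tilde = E_tilde + (0.1 - lam_min) * np.eye(p)   # (d) shift so eigmin = 0.1 > 0
    # (e) L = diag(Omega_tilde^{-1}); Omega = L^{1/2} Omega_tilde L^{1/2}
    L_half = np.diag(np.sqrt(np.diag(np.linalg.inv(Omega_tilde))))
    Omega = L_half @ Omega_tilde @ L_half
    return np.linalg.inv(Omega)                           # (f) C = Omega^{-1}

C = random_structured_cov(30)
print(np.linalg.eigvalsh(C).min() > 0)  # True: C is positive definite
```

The diagonal shift in step (d) guarantees positive definiteness, so both C and its inverse (setting 4 above) are valid covariance matrices.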
A.3 Computational environment

All numerical experiments were conducted on a server running Ubuntu 24.04.1 LTS (Linux kernel 6.8.0), equipped with an AMD EPYC 7763 64-core processor (up to 3.5 GHz) and 2 TB of DDR4-3200 ECC memory. Computational tasks were implemented in R version 4.3.3. Parallel processing with 10 logical cores was employed during the generation of summary datasets to enhance computational efficiency, using the doParallel (v1.0.17) and foreach (v1.5.2) packages. In contrast, the other simulation procedures were executed in a single-threaded manner. The R environment was linked against the reference BLAS (v3.12.0) and LAPACK (v3.12.0) libraries. The versions of glmnet and lars were 4.1.8 and 1.3, respectively.

B Details of training strategy and hyperparameters

In this section, we provide detailed specifications of the training process and the resulting model architecture for the glmnet-MLP described in Section 3.2.2.

Data preparation. The summary dataset was split into 80%, 10%, and 10% for training, validation, and testing, respectively. As mentioned in the main text, target variables were log-transformed and standardized.

Figure 7: Heatmap showing the distribution of combinations of N and p in the summary dataset. The color intensity represents the number of data points; redder regions indicate a higher concentration of samples. White regions indicate sparse sampling (low count), not necessarily the absence of data.

Optimization setup. The Bayesian optimization was performed using the BoTorchSampler (Balandat et al., 2020) within Optuna, based on Gaussian process regression and the Expected Improvement acquisition function. We executed the optimization for 500 trials. Throughout the process, the activation function was fixed to the Swish function (Ramachandran et al., 2017):

f(x) = \frac{x}{1 + e^{-x}}.
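For concreteness, the Swish activation above is simply the input scaled by its sigmoid, which can be written as a one-line sketch:

```python
import math

def swish(x):
    """Swish activation f(x) = x / (1 + exp(-x)) (Ramachandran et al., 2017)."""
    return x / (1.0 + math.exp(-x))

print(swish(0.0))              # 0.0: Swish passes through the origin
print(round(swish(1.0), 4))    # 0.7311
```

Unlike ReLU, Swish is smooth and slightly non-monotonic for negative inputs, which is often cited as helping gradient-based training of small regression networks like the one used here.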
The number of epochs was set to 500, and the mini-batch size was fixed at 20,263. The search space for the optimization was defined as follows:

• Number of hidden layers: {1, 2, 3};

• Number of units per layer: {1, 2, . . . , 64};

• Learning rate: [10^{-5}, 10^{-1}] (log scale).

Resulting model hyperparameters. As a result of the optimization, a three-layer network structure was selected, with 64, 61, and 57 units in the respective hidden layers. The optimal learning rate was determined to be approximately 7.6 × 10^{-4}. This configuration was adopted as the final glmnet-MLP for evaluation.

C R Package glmnetconf

We developed the R package glmnetconf to implement our tuning method and make it easily accessible to a broad user base. Our implementation includes not only the configuration tuning method but also an R package selection method. Since lars and glmnet each possess distinct advantages, the appropriate choice depends on the user's objective. Specifically, if an exact solution is required without regard to computation time, lars is the optimal choice. Conversely, if computational efficiency is prioritized, glmnet is preferable. Therefore, our package implements a function that selects the R package based on the dataset X, y and T_hope.

Figure 8: Workflow of our package glmnetconf. [Flowchart: given the input X, y, and T_hope, the framework predicts T_lars for package selection; if T_hope > T_lars, it solves the lasso problem by lars; otherwise, it tunes the configuration for glmnet.]

C.1 The workflow of glmnetconf

Suppose we have a dataset X, y and a desired computation time T_hope. Figure 8 shows the workflow of our proposed package. First, our framework determines which package to employ. We predict the computation time of lars, denoted as T_lars, based on X, y. If the predicted T_lars is smaller than T_hope, our framework selects lars to ensure an exact solution.
On the other hand, if T_lars exceeds T_hope, our framework selects glmnet. In this scenario, the configuration for glmnet is tuned by our proposed method in Section 3.

C.2 Prediction model for the lars computation time

Similar to the glmnet-MLP, we construct a predictive model to forecast the computation time of lars, denoted as T_lars. The input features consist of the sample size N, the dimension p, and the selected eigenvalues of the sample covariance matrix of X (γ_{±1}, . . . , γ_{±5}). The output is the predicted computation time T_lars. The training dataset for the lars-MLP was collected during the generation of the summary dataset described in Section 3.2.1. The resulting dataset comprises 68,013 samples. Using this dataset, we trained the model employing the Adam optimizer.

The network architecture and hyperparameters were determined via Bayesian optimization, following the same protocol and search space as the glmnet-MLP. The optimization yielded a three-layer hidden network with 45, 44, and 37 units in the respective layers, and a learning rate of approximately 1.08 × 10^{-3}. Consistent with the glmnet-MLP, we employed the Swish activation function and set the number of epochs to 500. However, the batch size was set to 1,700 for this model.

Figure 9: Comparison of predicted versus actual computation times for lars across different sample sizes N. The layout is identical to that of Figure 4. The lars-MLP effectively captures the overall trend of the computation time. However, it tends to underestimate the actual computation time when both N and p are large.
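As a sketch of the feature construction described above, the lars-MLP inputs (sample size N, dimension p, and the five largest and five smallest eigenvalues of the sample covariance of X) could be computed as follows. This is a Python illustration; the helper name and the exact feature ordering are assumptions, not the package's internals:

```python
import numpy as np

def summary_features(X, k=5):
    """Features in the style of the lars-MLP input: N, p, and the k largest
    and k smallest eigenvalues of the sample covariance matrix of X."""
    N, p = X.shape
    S = np.cov(X, rowvar=False)           # p x p sample covariance
    eig = np.sort(np.linalg.eigvalsh(S))  # ascending eigenvalues
    top = eig[-k:][::-1]                  # k largest, descending
    bottom = eig[:k]                      # k smallest, ascending
    return np.concatenate(([N, p], top, bottom))

rng = np.random.default_rng(0)
feats = summary_features(rng.standard_normal((200, 50)))
print(feats.shape)  # (12,)
```

Only extreme eigenvalues are kept, so the feature vector has a fixed length (here 2 + 5 + 5 = 12) regardless of p, which is what allows a single MLP to handle datasets of varying dimension.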
Figure 9 compares the predicted computation time with the actual runtime of lars, using the simulation dataset described in Section 4. The model effectively captures the overall trend of the computation time. However, it tends to underestimate the runtime when both N and p are large. This bias is likely due to the scarcity of training samples in high-dimensional regions, limited by the high computational cost of data generation.

C.3 Usage example

This section demonstrates the usage of the glmnetconf package. The primary function, auto_lasso(), automates the entire tuning process. Specifically, it automatically selects the appropriate package and tunes the configuration based on the input dataset and T_hope. Listing 1 shows a usage example with a synthetic dataset, where T_hope = 20 seconds. The dataset is generated via the function data_generation(), following the exact same simulation settings described in Section 4. In this example, we set the sample size to N = 1500, the dimension to p = 800, and the correlation to ρ = 0.5. The script not only executes the proposed automated workflow via auto_lasso() but also compares its predictive performance against the standard usage of cv.glmnet() (default configuration). This comparison illustrates how the proposed method achieves competitive accuracy while satisfying the time constraint.

library(glmnetconf); library(glmnet)

# Setup: Generate X_train, y_train, X_test, y_test.
dat <- data_generation(N_train = 1500, N_test = 100, p = 800, rho = 0.5, sparse_rate = 0.5, sigma = 1)
X_train <- dat$X_train
y_train <- dat$y_train
X_test <- dat$X_test
y_test <- dat$y_test

# --- 1.
# Proposed Method (glmnetconf) ---
# Perform automatic tuning with a time constraint T_hope = 20 s
fit_glmnetconf <- auto_lasso(X_train, y_train, new_x = X_test, T_hope = 20)
mse_glmnetconf <- mean((y_test - fit_glmnetconf$prediction)^2)

# --- 2. Benchmark (glmnet default) ---
# Standard cross-validation with default settings
fit_glm <- cv.glmnet(X_train, y_train, alpha = 1)
pred_glm <- predict(fit_glm, newx = X_test, s = "lambda.min")
mse_glm <- mean((y_test - pred_glm)^2)

# --- 3. Performance Comparison ---
print(c(glmnetconf = mse_glmnetconf, glmnet = mse_glm))

# (Optional) Check tuned configuration and Pareto front
# print(fit_glmnetconf$configuration)
# print(fit_glmnetconf$Pareto_front)

Listing 1: Usage example of glmnetconf. The script demonstrates the proposed tuning method with a time constraint and compares its test error against the default configuration of glmnet.

References

Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M. (2019). Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '19, pages 2623–2631, New York, NY, USA. Association for Computing Machinery.

Balandat, M., Karrer, B., Jiang, D. R., Daulton, S., Letham, B., Wilson, A. G., and Bakshy, E. (2020). BoTorch: A framework for efficient Monte-Carlo Bayesian optimization.
In Advances in Neural Information Processing Systems 33.

Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202.

Bøvelstad, H., Nygård, S., Størvold, H., Aldrin, M., Borgan, Ø., Frigessi, A., and Lingjærde, O. (2007). Predicting survival from microarray data—a comparative study. Bioinformatics, 23(16):2080–2087.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122.

Candès, E. J. and Wakin, M. B. (2008). An introduction to compressive sampling. IEEE Signal Processing Magazine, 25(2):21–30.

Csárdi, G. (2019). cranlogs: Download Logs from the 'RStudio' 'CRAN' Mirror. R package version 2.1.1.

Daubechies, I., Defrise, M., and De Mol, C. (2004). An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413–1457.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32(2):407–499.

Friedman, J. H., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33:1–22.

Fu, W. J. (1998). Penalized regressions: The bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416.

Hebiri, M. and Lederer, J. (2013). How correlations influence lasso prediction. IEEE Transactions on Information Theory, 59(3):1846–1854.

Hirose, K., Fujisawa, H., and Sese, J. (2017). Robust sparse Gaussian graphical modeling. Journal of Multivariate Analysis, 161:172–190.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11):2278–2324.

Lu, Y. and Li, X. (2015). Estimating stellar atmospheric parameters based on lasso and support-vector regression. Monthly Notices of the Royal Astronomical Society, 452(2):1394–1401.

Massias, M., Gramfort, A., and Salmon, J. (2018). Celer: A fast solver for the lasso with dual extrapolation. In Dy, J. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3315–3324. PMLR.

Osborne, M., Presnell, B., and Turlach, B. (2000). A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389–403.

Ramachandran, P., Zoph, B., and Le, Q. V. (2017). Searching for activation functions. arXiv preprint arXiv:1710.05941.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088):533–536.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.