Fenchel-Young Estimators of Perturbed Utility Models

Xi Lin$^a$, Yafeng Yin$^{a,b,*}$, and Tianming Liu$^a$

$^a$ Department of Civil and Environmental Engineering, University of Michigan, Ann Arbor, MI 48109, USA
$^b$ Department of Industrial and Operations Engineering, University of Michigan, Ann Arbor, MI 48109, USA
$^*$ Corresponding author. Email addresses: {xilina, yafeng, tianmliu}@umich.edu.

February 26, 2026

Abstract

The Perturbed Utility Model framework offers a powerful generalization of discrete choice analysis, unifying models like Multinomial Logit and Sparsemax through convex optimization. However, standard Maximum Likelihood Estimation (MLE) faces severe theoretical and numerical challenges when applied to this broader class, particularly regarding non-convexity and instability in sparse regimes. To resolve these issues, this paper introduces a unified estimation framework based on the Fenchel–Young loss. By leveraging the intrinsic convex conjugate structure of PUMs, we demonstrate that the Fenchel–Young estimator guarantees global convexity and bounded gradients, providing a mathematically natural alternative to MLE. Addressing the critical challenge of data scarcity, we further extend this framework via Wasserstein Distributionally Robust Optimization. We first derive an exact finite-dimensional reformulation of the infinite-dimensional primal problem, establishing its theoretical convexity. However, recognizing that the resulting worst-case constraints involve computationally intractable inner maximizations, we subsequently construct a tractable safe approximation by exploiting the global Lipschitz continuity of the Fenchel–Young loss. Through this tractable formulation, we uncover a rigorous geometric unification: two canonical regularization techniques, standard $\ell_2$-regularization and the margin-enforcing Hinge loss, emerge mathematically as specific limiting cases of our distributionally robust estimator. Extensive experiments on synthetic data and the Swissmetro benchmark validate that the proposed framework significantly outperforms traditional methods, recovering stable preferences even under severe data limitations.

1 Introduction

Since the seminal contributions of Daniel McFadden established the econometrics of discrete choice (McFadden, 1972), this framework has become the dominant paradigm for analyzing decision-making behavior across a wide range of fields, including transportation (McFadden, 1974; Ben-Akiva and Lerman, 1985), marketing (Malhotra, 1984), and health economics (Clark et al., 2014). Within this paradigm, individuals are assumed to select alternatives so as to maximize utility subject to stochastic perturbations, giving rise to the Random Utility Model (RUM). Among its many operationalizations, the Multinomial Logit (MNL) model has emerged as the canonical baseline due to its analytical tractability and closed-form probability representation (Hausman and McFadden, 1984). In addition, the MNL model possesses a profound computational elegance: estimating its parameters from data via Maximum Likelihood Estimation (MLE; Wald, 1949) constitutes a globally convex optimization problem (Donoso and de Grange, 2010). This convexity ensures uniqueness of the estimator and numerical stability independent of initialization, rendering inference both theoretically well-posed and computationally efficient.
The alignment between behavioral microfoundations and convex optimization geometry has therefore cemented the MNL-MLE framework as the classical foundation for the inverse recovery of agent preferences from observed choice data.

Despite its foundational role, this classical RUM framework exhibits well-documented structural limitations. Most notably, the Independence of Irrelevant Alternatives (IIA) property imposes restrictive substitution patterns (Ray, 1973), while the reliance on specific distributional assumptions for stochastic errors constrains behavioral flexibility. These limitations have motivated a rich body of extensions, including Probit (Daganzo et al., 1977) and Nested Logit formulations (Wen and Koppelman, 2001), which relax distributional and correlation constraints at the cost of increased computational complexity.

More recently, discrete choice theory has undergone a conceptual generalization through the Perturbed Utility Model (PUM) framework (McFadden and Fosgerau, 2012; Fudenberg et al., 2015). Rather than deriving choice probabilities from exogenously specified stochastic error distributions, the PUM framework formulates decision-making as utility maximization under a convex perturbation (or regularization) functional defined over the probability simplex. Formally, the decision-maker selects a probability vector that maximizes expected utility net of a convex penalty capturing information-processing costs, bounded rationality, or an intrinsic preference for randomization. Central to this framework is the duality between the perturbation function and the surplus (expected maximum utility) function. Building on McFadden's surplus theory (McFadden, 1981), the expected maximum perturbed utility defines a convex potential whose gradient yields the choice probabilities. Through Legendre-Fenchel duality (Touchette, 2005; McFadden and Fosgerau, 2012), any admissible discrete choice model can thus be generated from a convex perturbation function, establishing a one-to-one correspondence between stochastic choice behavior and convex regularization structures. Classical models are recovered as special cases; for example, the MNL model arises when the perturbation corresponds to negative Shannon entropy (Anderson et al., 1992).

This convex-analytic perspective reveals discrete choice models as points within a broader geometric space of decision mechanisms. It also clarifies the equivalence between additive random utility formulations and regularized optimization representations, a connection that has been operationalized in evolutionary game dynamics and stochastic learning models through smooth best-response mappings (Hofbauer and Sandholm, 2002). Complementarily, revealed-preference foundations for perturbed utility have been established by Fudenberg et al. (2015), who provide axiomatic conditions under which stochastic choice data admit a convex perturbation representation.

Owing to its structural generality and analytical tractability, the PUM framework has begun to permeate transportation science and related fields. It has proven particularly valuable in stochastic route choice modeling and entropy-regularized traffic assignment, where convex potential structures facilitate equilibrium analysis and computational scalability (Fosgerau et al., 2022; Yao et al., 2024).
More broadly, the framework provides a unifying language for integrating behavioral realism, information frictions, and optimization geometry within a single modeling architecture.

Given the structural elegance of the PUM framework as a generalization of classical discrete choice, a fundamental question arises regarding its statistical inference: how can the parameters of a general PUM be reliably estimated from observed choice data? The most immediate strategy is to extend the classical MLE paradigm, which serves as the gold standard for MNL estimation. In the specific case of the logit model, the negative log-likelihood is globally convex with respect to the utility parameters, ensuring tractable optimization and statistically efficient estimators. However, these desirable properties do not automatically generalize to arbitrary perturbations. The convex geometry induced by entropic regularization is highly specific; alternative perturbation structures may violate log-concavity, rendering the likelihood non-convex. More critically, many modern PUM specifications induce sparse choice probabilities. Unlike the MNL model, which assigns strictly positive probability to all alternatives (dense support), perturbations such as quadratic regularization (e.g., Sparsemax; Martins and Astudillo, 2016; Gabaix, 2014) frequently yield corner solutions with strictly zero probabilities for low-utility alternatives. In such regimes, the standard MLE objective fails catastrophically: the logarithm of a zero probability diverges to negative infinity, rendering the loss function undefined and the optimization numerically unstable. This structural incompatibility suggests that likelihood geometry is misaligned with the broader convex structure of PUMs.

Beyond the geometric challenges of estimation, a parallel statistical hurdle persists: data scarcity. Paradoxically, despite the pervasive narrative of a big data era, practical applications of discrete choice modeling frequently encounter severe limitations in sample size. Constraints such as high survey costs, limited sensor deployment, and stringent privacy regulations often force analysts to rely on small datasets. In these small-sample regimes, MNL estimators are known to exhibit finite-sample bias and overfitting tendencies (Albert and Anderson, 1984; Firth, 1993; Sur and Candès, 2019). These vulnerabilities are amplified in flexible PUM specifications, where high-dimensional perturbation structures exacerbate statistical instability.

In this paper, we develop a unified estimation framework for PUMs grounded in Fenchel–Young loss functions (Blondel et al., 2019, 2020). Exploiting the intrinsic convex conjugacy that defines the PUM architecture, we show that the Fenchel–Young estimator constitutes the most natural scoring rule for this class of models. By aligning the geometry of the loss function with the convex structure of the surplus function, we prove that the Fenchel–Young estimator yields a globally convex optimization problem for any admissible PUM, even those inducing sparse probabilities, while maintaining desirable asymptotic statistical properties. To address the critical challenge of estimation in data-scarce regimes, we extend this framework by incorporating Wasserstein Distributionally Robust Optimization (DRO; Mohajerin Esfahani and Kuhn, 2018; Kuhn et al., 2019; Gao and Kleywegt, 2023).
We construct a Wasserstein ambiguity set around the empirical distribution to explicitly immunize the estimator against the finite-sample bias and overfitting risks inherent to small datasets. Crucially, we first derive an exact finite-dimensional reformulation of the worst-case problem, proving that it preserves the underlying convexity of the nominal Fenchel–Young objective. However, acknowledging that the exact constraints involve computationally prohibitive inner maximizations, we subsequently develop a tractable safe approximation by leveraging the Lipschitz continuity of the loss function. This yields a computationally efficient convex program solvable via standard duality-based techniques. A notable byproduct of this robustification is the emergence of a unifying geometric interpretation: the classical $\ell_2$ regularization and the Hinge loss arise as limiting cases of our approximate Wasserstein Fenchel–Young estimator under specific ambiguity set configurations.

The remainder of this paper is organized as follows. Section 2 provides a rigorous characterization of the PUM framework for discrete choice, establishing the foundational link between convex perturbation functions and choice probability generation. In Section 3, we introduce the Fenchel–Young estimator, analyzing its global convexity and asymptotic properties as a mathematically natural alternative to MLE. Section 4 addresses the challenge of estimation under data scarcity by developing the Wasserstein DRO framework. Here, we derive the tractable dual reformulations and establish the theoretical connections to $\ell_2$ regularization and the Hinge loss. Section 5 presents comprehensive numerical experiments on both synthetic datasets and the Swissmetro dataset (Bierlaire et al., 2001) to validate the robustness and efficiency of the proposed methods. Finally, Section 6 concludes the paper.

2 Perturbed Utility Models

2.1 Basics and Examples

In this subsection, we provide a characterization of the PUM framework as established by McFadden and Fosgerau (2012) and Fudenberg et al. (2015), which derives choice probabilities through a convex optimization principle. This perspective is particularly advantageous for the upcoming Fenchel–Young estimation as it exposes the geometric structure of the choice probability mapping.

2.1.1 The Primal Formulation: Utility Maximization with Regularization

Consider a decision-maker $n$ facing a finite set of $m$ mutually exclusive alternatives $\mathcal{C}$. Let $V_n \in \mathbb{R}^{|\mathcal{C}|}$ denote the vector of systematic (deterministic) utilities, where $V_{ni} = \beta^\top x_{ni}$ is the utility of alternative $i \in \mathcal{C}$, $x_{ni}$ is a vector of observed attributes, and $\beta$ is the vector of unknown parameters to be estimated. In the PUM framework, the decision-maker is assumed to maximize the expected systematic utility minus a convex penalty function that represents the cost of information processing or the intrinsic preference for randomization. Let $p_n \in \Delta$ be the choice probability vector, where $\Delta = \{ p \in \mathbb{R}^{|\mathcal{C}|} \mid p \ge 0, \sum_{i\in\mathcal{C}} p_i = 1 \}$ is the probability simplex. The choice probability vector $p_n$ is defined as the unique solution to the following convex optimization problem:

$$ p_n(V_n) = \arg\max_{p \in \Delta} \Big\{ \sum_{i\in\mathcal{C}} p_i V_{ni} - \Lambda(p) \Big\}, \qquad (1) $$

where $\Lambda : \Delta \to \mathbb{R} \cup \{\infty\}$ is the perturbation function (or regularizer). We assume $\Lambda$ satisfies the following standard regularity conditions:

(A1) $\Lambda$ is strictly convex on the relative interior of $\Delta$.
(A2) $\Lambda$ is continuously differentiable on the relative interior of $\Delta$.

(A3) The gradient of $\Lambda$ diverges at the boundary of $\Delta$ (for models with full support) or admits a subgradient structure allowing for corner solutions.

2.1.2 Convex Conjugacy and the Generalized Williams-Daly-Zachary Theorem

The structural properties of the PUM are best understood through Fenchel-Legendre duality. The convex conjugate of the perturbation function $\Lambda(p)$, typically interpreted as the surplus function or the expected maximum utility, is defined as:

$$ \Omega(V_n) = \sup_{p \in \Delta} \big\{ p^\top V_n - \Lambda(p) \big\}. \qquad (2) $$

By the envelope theorem and the properties of convex conjugation, the relationship between the utility vector and the choice probability is given by the gradient of the surplus function. This yields a generalized version of the Williams-Daly-Zachary theorem:

Proposition 1 (PUM Duality). Under assumptions (A1)-(A2), the surplus function $\Omega(V_n)$ is convex and differentiable everywhere on $\mathbb{R}^{|\mathcal{C}|}$. Furthermore, the optimal choice probability vector satisfying Eq. (1) is given by:

$$ p_n = \nabla_V \Omega(V_n). \qquad (3) $$

This duality is central to our estimation strategy. It implies that estimating $\beta$ amounts to choosing parameters so that the probability vectors generated by Eq. (3) align with observed choice realizations under an appropriate loss function.

2.1.3 Additive Separability and the Choice Kernel

A particularly tractable subclass of PUMs arises when the perturbation function $\Lambda(p)$ decomposes into a sum of univariate convex functions (McFadden and Fosgerau, 2012). This structural assumption, known as additive separability, is satisfied by the majority of discrete choice models used in practice, including MNL and Sparsemax. Additive separability yields semi-analytical choice kernels, while still encompassing a broad class of non-logit and sparse choice models. Furthermore, we note that non-separable perturbations can encode richer cross-alternative interactions; although they typically lead to implicit choice mappings and intractable inference in conventional settings, our proposed framework successfully overcomes these computational hurdles to accommodate non-separable cases as well.

Definition 1 (Additive Separable PUM). A PUM is said to be additive separable if the regularization function admits the form:

$$ \Lambda(p) = \sum_{i\in\mathcal{C}} h(p_i), \qquad (4) $$

where $h : [0,1] \to \mathbb{R}$ is a strictly convex, continuously differentiable function.

For additive separable models, the complex interdependence between alternatives in the optimization problem (1) is decoupled. The first-order optimality conditions imply that for each alternative $i$, the relationship between the probability $p_i$ and the utility $V_i$ is mediated solely by a global normalization constant.

Definition 2 (Choice Kernel). Let $h(\cdot)$ be the component-wise convex regularizer. The Choice Kernel, denoted by $\psi : \mathbb{R} \to [0,1]$, is defined as the inverse of the derivative of the regularizer:

$$ \psi(z) \triangleq (h')^{-1}(z). \qquad (5) $$

Note that, by the convexity of $h(\cdot)$, the choice kernel function $\psi(\cdot)$ must be a monotone function. Using the choice kernel, the optimal choice probability for any alternative $i$ takes the form:

$$ p_i(V) = \psi(V_i - \lambda(V)), \qquad (6) $$

where $\lambda(V)$ is the Lagrange multiplier associated with the simplex constraint $\sum_j p_j = 1$. The value of $\lambda(V)$ is uniquely determined as the solution to the implicit equation:

$$ \sum_{j\in\mathcal{C}} \psi(V_j - \lambda) = 1. \qquad (7) $$
The choice kernel $\psi$ characterizes the local response of the decision-maker to utility stimuli. It maps the effective utility (systematic utility $V_i$ adjusted by the shadow price $\lambda$ of the normalization constraint) directly to the choice probability.

2.1.4 Specific Instances

The PUM framework unifies various discrete choice specifications through the choice of $\Lambda(p)$.

Multinomial Logit (MNL): The standard MNL model is recovered when the perturbation function is the negative Shannon entropy scaled by a dispersion parameter $\mu > 0$:

$$ \Lambda_{\text{Logit}}(p) = \mu \sum_{i\in\mathcal{C}} p_i \ln p_i. \qquad (8) $$

The corresponding conjugate function is the familiar Log-Sum-Exp function:

$$ \Omega_{\text{Logit}}(V) = \mu \ln \Big( \sum_{i\in\mathcal{C}} \exp(V_i/\mu) \Big). \qquad (9) $$

The Sparsemax Model (Quadratic Regularization): A distinct and distinctively robust instance of the PUM arises when the entropic penalty is replaced by a quadratic Euclidean norm, scaled by a dispersion parameter $\mu > 0$. Specifically, let the regularization function be defined as the scaled squared $\ell_2$-norm:

$$ \Lambda_{\text{Sparse}}(p) = \frac{\mu}{2} \| p \|_2^2. \qquad (10) $$

Substituting this into the primal problem (1) reveals that the choice probability vector is the Euclidean projection of the scaled utility vector $V_n/\mu$ onto the probability simplex $\Delta$. Mathematically, this is equivalent to solving the minimum distance problem:

$$ p_n^{\text{Sparse}} = \arg\min_{p \in \Delta} \Big\| p - \frac{V_n}{\mu} \Big\|_2^2. \qquad (11) $$

Unlike the Multinomial Logit model, which yields strictly positive probabilities for all alternatives (dense support), the Sparsemax model admits a closed-form solution with compact support:

$$ p_{ni} = \max\Big( 0, \frac{V_{ni}}{\mu} - \lambda(V_n, \mu) \Big), \qquad (12) $$

where $\lambda(V_n, \mu)$ is a threshold function determined by the normalization constraint $\sum_i p_{ni} = 1$. Here, the parameter $\mu$ governs the sparsity of the decision: a smaller $\mu$ amplifies the utility differences (similar to a lower temperature in Logit), pushing the projection towards the simplex vertices (hard-max), whereas a larger $\mu$ shrinks the effective utilities towards zero, resulting in a more uniform and denser probability distribution.

The Cauchy Model: To capture behaviors with heavier tails than the Gumbel distribution (implied by MNL), one can employ a perturbation derived from the Cauchy distribution. The specific additive regularizer is given by:

$$ \Lambda_{\text{Cauchy}}(p) = -\frac{\mu}{\pi} \sum_{i\in\mathcal{C}} \ln\Big( \cos\Big( \pi \Big( p_i - \frac{1}{2} \Big) \Big) \Big). \qquad (13) $$

The resulting choice kernel corresponds to the cumulative distribution function of the standard Cauchy distribution. The explicit probability function is derived from the first-order optimality condition $\nabla\Lambda(p) = V - \lambda\mathbf{1}$. Specifically, differentiating the regularization term yields the tangent link function:

$$ \frac{V_{ni} - \lambda}{\mu} = \tan\Big( \pi \Big( p_{ni} - \frac{1}{2} \Big) \Big). \qquad (14) $$

Inverting this relationship gives the choice probability:

$$ p_{ni} = \frac{1}{2} + \frac{1}{\pi} \arctan\Big( \frac{V_{ni} - \lambda}{\mu} \Big), \qquad (15) $$

where $\lambda$ ensures that all probabilities sum to 1.

Unlike MNL and Sparsemax, this model is generated by a heavy-tailed perturbation. This structural characteristic results in choice probabilities that decay polynomially rather than exponentially as utility decreases, maintaining a fatter tail than the standard MNL model. Consequently, the Cauchy specification is particularly effective for capturing noisy decision-making scenarios where agents retain a non-negligible likelihood of selecting low-utility alternatives, thereby accommodating significant deviations from strict utility maximization.
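To make the choice-kernel machinery concrete, the following minimal sketch (assuming NumPy and SciPy are available; all function names are illustrative, not part of any established package) solves the normalization equation (7) by bracketed root finding and then evaluates Eq. (6), for the Sparsemax and Cauchy kernels with $\mu = 1$.

```python
# Minimal sketch of the choice-kernel computation in Eqs. (6)-(7).
# Assumes NumPy/SciPy; kernel and function names are illustrative.
import numpy as np
from scipy.optimize import brentq

def sparsemax_kernel(z):
    # psi(z) = max(0, z) for Lambda(p) = (1/2)||p||^2; the simplex
    # normalization keeps each probability at or below 1 automatically.
    return np.maximum(0.0, z)

def cauchy_kernel(z):
    # psi(z) = 1/2 + arctan(z)/pi, the standard Cauchy CDF -- Eq. (15), mu = 1.
    return 0.5 + np.arctan(z) / np.pi

def pum_probabilities(V, psi):
    """Solve sum_j psi(V_j - lam) = 1 for lam, then return p_i = psi(V_i - lam)."""
    G = lambda lam: psi(V - lam).sum() - 1.0
    # G is monotonically decreasing in lam; bracket the root generously.
    lam = brentq(G, V.min() - 10.0, V.max() + 10.0)
    return psi(V - lam)

V = np.array([2.0, 1.0, -0.5])
print(pum_probabilities(V, sparsemax_kernel))  # sparse: low-utility alternative gets 0
print(pum_probabilities(V, cauchy_kernel))     # dense, heavy-tailed probabilities
```

Because $G(\lambda) = \sum_j \psi(V_j - \lambda) - 1$ is monotonically decreasing, any bracketing method converges; Brent's method is used here purely for convenience.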
3 Fenchel–Young Estimation of Perturbed Utility Models

Having established the structural link between the utility vector and choice probabilities via the gradient of the surplus function, we now turn to the estimation of the model parameters. Let the dataset consist of $N$ independent observations. For each decision-maker $n$, let $x_n$ denote the matrix of observed attributes, and let $y_n \in \{0,1\}^{|\mathcal{C}|}$ denote the observed choice as a one-hot vector, where $y_{ni} = 1$ if alternative $i$ is chosen and $0$ otherwise.

3.1 The Pathologies of Maximum Likelihood Estimation

While MLE constitutes the standard approach for discrete choice analysis, its reliance on the logarithmic scoring rule proves structurally inadequate for the broader class of PUMs. This incompatibility manifests as two critical pathologies. First, in models with sparse choice kernels, the assignment of zero probability to an observed alternative triggers a mathematical singularity, causing the log-likelihood to diverge to negative infinity. Second, even in regimes with full support, the negative log-likelihood generally fails to preserve global convexity for some perturbation functions, rendering the optimization landscape numerically unstable.

3.1.1 The Zero-Probability Singularity

A fundamental theoretical distinction between the classical MNL and the broader class of PUMs lies in the geometry of the probability simplex. While the entropic perturbation of the MNL forces the solution into the relative interior of the simplex (ensuring $p_{ni} > 0$ for all $i$), modern PUM specifications such as Sparsemax are explicitly designed to yield corner solutions. In these sparse regimes, low-utility alternatives receive strictly zero probabilities. This sparsity, while computationally and behaviorally desirable, renders the standard MLE structurally undefined. Consider the negative log-likelihood objective function for a dataset of $N$ observations:

$$ \mathcal{L}_{\text{NLL}}(\beta) = -\sum_{n=1}^{N} \sum_{i\in\mathcal{C}} y_{ni} \ln\big( p_{ni}(\beta) \big), \qquad (16) $$

where $y_n$ encodes the observed choice. If the model predicts $p_{n,y_n}(\beta) = 0$ for even a single observation $n$, the loss function suffers a mathematical singularity:

$$ \lim_{p \to 0^+} -\ln(p) = +\infty. \qquad (17) $$

Consequently, the objective function becomes unbounded, and the gradient is undefined. This is not merely a numerical nuisance but a catastrophic failure of the estimator: the logarithmic geometry imposes an infinite penalty on "hard" mistakes, preventing the optimization algorithm from navigating the parameter space.

3.1.2 Non-Convexity

Beyond the numerical singularities encountered in sparse regimes, MLE faces a severe theoretical impediment even when applied to models with dense support. The reliance on the logarithmic scoring rule implies that the convexity of the estimation problem is not guaranteed for general perturbations, but rather contingent on specific tail behaviors. The negative log-likelihood loss for a PUM is given by $\ell_{\text{NLL}}(\beta) = -\ln p_y(V(\beta))$. For this objective to be convex with respect to linear utility parameters $\beta$, the choice probability map $p(V) = \nabla\Omega(V)$ must be log-concave. While this condition holds for MNL (where probabilities are derived from the Gumbel distribution, which is log-concave), it fails for some models induced by heavy-tailed perturbations. Below we provide a concrete example.

Example 1 (Non-convexity of MLE for Cauchy PUMs).
Consider a binary choice model generated by a heavy-tailed perturbation, specifically one where the choice kernel $\psi(\cdot)$ corresponds to the cumulative distribution function of the standard Cauchy distribution:

$$ p_1(z) = \psi(z) = \frac{1}{2} + \frac{1}{\pi} \arctan(z), \qquad z = V_1 - V_2. \qquad (18) $$

Within the additive separable PUM framework, this kernel arises from a one-dimensional regularizer $h : [0,1] \to \mathbb{R}$ such that $\psi(z) = (h')^{-1}(z)$. So we have

$$ h(p) = -\frac{1}{\pi} \log\Big( \cos\Big( \pi \Big( p - \frac{1}{2} \Big) \Big) \Big) + C, \qquad (19) $$

which is strictly convex on $(0,1)$. The corresponding Cauchy PUM is then defined by the additive separable regularizer

$$ \Lambda_{\text{Cauchy}}(p) = \sum_{i\in\mathcal{C}} h(p_i) = -\frac{1}{\pi} \sum_{i\in\mathcal{C}} \log\Big( \cos\Big( \pi \Big( p_i - \frac{1}{2} \Big) \Big) \Big) + C', \qquad p \in \Delta. \qquad (20) $$

Despite being a valid PUM generated by a strictly convex $\Lambda_{\text{Cauchy}}$, the induced choice probability $p_1(z)$ is not log-concave. The second derivative of the negative log-likelihood with respect to $z$, $\frac{d^2}{dz^2}[-\ln\psi(z)]$, becomes negative in the left tail (specifically for $z < -\sqrt{3}$), indicating local non-convexity. Consequently, standard gradient-based MLE solvers may get trapped in local optima or diverge, whereas the Fenchel–Young estimator for the exact same model remains globally convex and tractable.

These structural failures necessitate a fundamental departure from the logarithmic scoring rule. Rather than forcing an ill-fitting geometry upon the model, we propose an estimation framework derived directly from the PUM's intrinsic convex conjugate structure. This leads to the Fenchel–Young loss.

3.2 The Loss Function

To address the above problems, and to provide a unified convex optimization framework for all PUMs, we adopt the Fenchel–Young loss (FYLoss) as our estimation objective. Leveraging the duality structure of PUMs, the loss for an observation $(x, y)$ is defined as:

$$ \ell_{\text{FY}}(\beta; x, y) := \Omega(V(x; \beta)) - y^\top V(x; \beta), \qquad (21) $$

where $V(x; \beta) = [\beta^\top x_1, \ldots, \beta^\top x_K]^\top$ is the vector of systematic utilities, and $\Omega(V) = \sup_{p\in\Delta}\{ p^\top V - \Lambda(p) \}$ is the surplus function. The term $y^\top V(x; \beta)$ represents the utility of the observed choice via the inner product with the ground-truth one-hot vector $y \in \{0,1\}^K$.

It is necessary to clarify the theoretical justification for treating the discrete realization $y$ within this convex analysis framework. The fundamental Fenchel–Young inequality is originally defined over the continuous probability simplex $\Delta$: for any potential vector $V$ and any probability vector $p \in \Delta$, the inequality $\Omega(V) + \Lambda(p) \ge p^\top V$ holds universally. Importantly, the realized choice $y$ (a one-hot vector) represents a vertex, and thus an extreme point, of the simplex $\Delta$. Since $y \in \Delta$, the inequality remains valid when directly applying the observed choice as the dual variable. The canonical Fenchel–Young loss is therefore rigorously defined as the duality gap inherent in this inequality:

$$ \mathcal{L}_{\text{gap}}(\beta; x, y) := \underbrace{\Omega(V(x)) + \Lambda(y) - y^\top V(x)}_{\ge 0}. $$

Crucially, the Fenchel–Young loss provides a solid statistical guarantee. By construction, the loss constitutes a proper scoring rule generated by the convex function $\Omega$. This ensures that minimizing the expected risk inherently drives the model to recover the true conditional probabilities of the data-generating process, rather than merely fitting discrete labels.

Proposition 2 (Proper Scoring).
Let the realized choice $y$ be drawn from a true conditional probability distribution $p^* \in \Delta$, such that $\mathbb{E}[y] = p^*$. Assume the surplus function $\Omega$ is strictly convex and differentiable. The expected Fenchel–Young loss, defined as $R(V) = \mathbb{E}_{y \sim p^*}[\ell_{\text{FY}}(V; y)]$, is minimized if and only if the model's predicted probability distribution $\hat{p} = \nabla\Omega(V)$ exactly matches the true distribution $p^*$.

Proof. Recall that the operational Fenchel–Young loss is given by $\ell_{\text{FY}}(V; y) = \Omega(V) - y^\top V$. We consider the population risk minimization problem:

$$ \min_V R(V) = \mathbb{E}_{y \sim p^*}\big[ \Omega(V) - y^\top V \big]. \qquad (22) $$

By the linearity of expectation, and noting that $\Omega(V)$ is deterministic with respect to the sampling of $y$, the objective function can be rewritten as:

$$ R(V) = \Omega(V) - \mathbb{E}_{y \sim p^*}[y]^\top V. \qquad (23) $$

Since $y$ is a one-hot vector representation of the choice, its expectation is precisely the true probability vector, i.e., $\mathbb{E}_{y \sim p^*}[y] = p^*$. Substituting this back yields:

$$ R(V) = \Omega(V) - (p^*)^\top V. \qquad (24) $$

Since $\Omega$ is strictly convex, the risk function $R(V)$ is also strictly convex with respect to $V$ (up to a shift by a linear term). A necessary and sufficient condition for optimality is that the gradient of $R(V)$ vanishes:

$$ \nabla_V R(V) = \nabla\Omega(V) - p^* = 0. \qquad (25) $$

Rearranging the terms, the optimality condition requires:

$$ \nabla\Omega(V) = p^*. \qquad (26) $$

By the definition of the model's prediction mapping in PUMs, $\hat{p} := \nabla\Omega(V)$. Therefore, the stationary point satisfies $\hat{p} = p^*$. Due to the strict convexity of $\Omega$, this global minimum is unique in terms of the predicted distribution. Thus, minimizing the FYLoss naturally enforces probability matching.

In the context of parameter estimation, the term $\Lambda(y)$ depends exclusively on the fixed observed label $y$ and is independent of the utility parameters $\beta$. It therefore acts merely as an additive constant. Consequently, omitting $\Lambda(y)$ simplifies the operational objective function in Eq. (21) without altering the gradient dynamics or the location of the optimal solution. The estimation problem over a dataset of $N$ samples is formulated as the Empirical Risk Minimization (ERM) of the FYLoss:

$$ \min_\beta J(\beta) = \frac{1}{N} \sum_{n=1}^{N} \big[ \Omega(V_n; \beta) - y_n^\top V_n \big], \qquad (27) $$

where $V_n = V(x_n; \beta)$ denotes the utility vector for the $n$-th sample.

3.3 Geometric Interpretation of the Fenchel–Young Estimator

The Fenchel–Young estimation framework admits a profound geometric interpretation that links convex analysis to statistical consistency. Fundamentally, the estimation process operates on the landscape defined by the surplus function $\Omega$. A candidate parameter vector $\beta$ maps attributes $x$ to systematic utilities $V$, which in turn induce a predicted probability distribution via the gradient map $\hat{p} = \nabla\Omega(V)$. Rather than forcing a literal geometric projection of empirical labels onto the model manifold, the optimization of $\beta$ minimizes the Fenchel–Young loss, a divergence measure rigorously defined as the duality gap between the prediction and the realization. Although evaluating this gap against a discrete vertex $y$ (the observed choice) is performed pointwise, minimizing this loss over the population drives the predicted gradients to align with the true conditional probabilities in expectation.
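As a concrete illustration of the operational objective (27), the sketch below evaluates the empirical Fenchel–Young risk for the Sparsemax PUM with $\mu = 1$, computing $\Omega(V)$ through the identity $\Omega(V) = p^\top V - \Lambda(p)$ at the optimal $p$ (Eq. (30) below). This is a minimal illustration under assumed array shapes, not a reference implementation.

```python
# Minimal sketch of the empirical Fenchel-Young risk (27) for the
# Sparsemax PUM (mu = 1); names and shapes are illustrative. NumPy only.
import numpy as np

def sparsemax(V):
    """Euclidean projection of V onto the simplex (closed form, Eq. (12))."""
    u = np.sort(V)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(V) + 1) > css)[0][-1]
    return np.maximum(V - css[rho] / (rho + 1.0), 0.0)

def fy_loss(V, y):
    """Omega(V) - y^T V, with Omega(V) = p^T V - Lambda(p) at the optimal p."""
    p = sparsemax(V)
    omega = p @ V - 0.5 * (p @ p)         # Lambda(p) = ||p||^2 / 2
    return omega - y @ V

def empirical_risk(beta, X, Y):
    """Average FY loss; X has shape (N, K, d), Y is one-hot with shape (N, K)."""
    V = X @ beta                           # systematic utilities, shape (N, K)
    return np.mean([fy_loss(v, y) for v, y in zip(V, Y)])

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4, 3))
Y = np.eye(4)[rng.integers(0, 4, size=50)]
beta = np.array([0.5, -1.0, 0.3])
print(empirical_risk(beta, X, Y))  # finite even when some predicted p_i are exactly 0
```

Note that, unlike the negative log-likelihood, the objective stays finite even when the predicted probability of the observed alternative is exactly zero.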
Unlike generic loss functions (such as squared error) that may impose an extrinsic geometry mismatched with the choice model, the Fenchel–Young loss is endogenous: it is the unique Bregman divergence generated by the very same potential function $\Omega$ that governs the probability generation. This intrinsic consistency renders the Fenchel–Young estimator theoretically natural for PUMs, ensuring that the metric used for estimation is structurally congruent with the mechanism used for prediction.

3.3.1 Bregman Divergence

In the discrete choice setting, the observed label is represented as a one-hot vector $y = e_y$, and the model prediction is the probability vector

$$ p = \nabla_V \Omega(V) \in \Delta, \qquad (28) $$

i.e., the gradient of the surplus function. Thus $\ell_{\text{FY}}$ is nonnegative, and it vanishes exactly when the predicted distribution $p$ matches the target distribution $y$ (so that $y \in \partial\Omega(V)$).

Using convex conjugacy, the Fenchel–Young loss can be written as a Bregman divergence in probability space. Recall that, by definition of the conjugate,

$$ \Omega(V) = \sup_{p\in\Delta} \big\{ p^\top V - \Lambda(p) \big\}, \qquad p = \nabla\Omega(V) \iff V = \nabla\Lambda(p). \qquad (29) $$

Evaluating the supremum at $p$ gives

$$ \Omega(V) = p^\top V - \Lambda(p). \qquad (30) $$

Substituting this into the Fenchel–Young loss yields

$$ \ell_{\text{FY}}(V; y) = \big( p^\top V - \Lambda(p) \big) + \Lambda(y) - y^\top V \qquad (31) $$
$$ = \Lambda(y) - \Lambda(p) - (y - p)^\top V \qquad (32) $$
$$ = \Lambda(y) - \Lambda(p) - (y - p)^\top \nabla\Lambda(p), \qquad (33) $$

where in the last step we used $V = \nabla\Lambda(p)$. The right-hand side is exactly the Bregman divergence generated by $\Lambda$:

$$ D_\Lambda(y \,\|\, p) := \Lambda(y) - \Lambda(p) - (y - p)^\top \nabla\Lambda(p). \qquad (34) $$

Therefore,

$$ \ell_{\text{FY}}(V; y) = D_\Lambda(y \,\|\, p), \qquad (35) $$

so the Fenchel–Young loss is the Bregman divergence between the target distribution $y$ and the predicted distribution $p$, with generator $\Lambda$.

3.3.2 Geometric Picture

The Bregman divergence $D_\Lambda(y \,\|\, p)$ has a simple geometric meaning: it measures how much $\Lambda(y)$ exceeds the first-order (affine) approximation of $\Lambda$ around $p$. In other words, it is the vertical gap between the value of $\Lambda$ at $y$ and the tangent hyperplane of $\Lambda$ at $p$. This gap is always nonnegative by convexity, and it is zero if and only if $y = p$.

From this viewpoint, minimizing the Fenchel–Young loss over the dataset is equivalent to projecting the empirical distribution (the aggregate of observed one-hot labels) onto the manifold of model distributions

$$ \mathcal{M} = \big\{ p(\cdot; \beta) = \nabla\Omega(V(\cdot; \beta)) : \beta \in \mathbb{R}^d \big\} $$

under the Bregman geometry induced by $\Lambda$. Specifically, the estimator $\hat{\beta}$ seeks the distribution on $\mathcal{M}$ that is closest to the data-generating process in terms of the expected divergence. Rather than matching each realization pointwise, this process constitutes an information projection of the observed statistics onto the model family, aligning the predicted probabilities with the realized choices in expectation.

3.3.3 Equivalence to Maximum Likelihood Estimation for Multinomial Logit

For the Shannon-entropy regularizer $\Lambda_{\text{Shannon}}(p) = \mu \sum_i p_i \log p_i$ with a smoothing parameter $\mu > 0$, the convex conjugate is the scaled log-sum-exp potential $\Omega_\mu(V) = \mu \log \sum_i \exp(V_i/\mu)$, and the induced PUM coincides with the MNL model with temperature $\mu$. In this case, the Fenchel–Young loss reduces exactly to the negative log-likelihood scaled by $\mu$.

Figure 1: Geometric Interpretation of Fenchel–Young Estimation: Duality between Utility and Probability Spaces. The left panel displays the surplus function $\Omega(V)$ in the utility space.
The model prediction $p$ corresponds to the gradient (slope) of the tangent at the current utility $V_{\text{curr}}$. The estimation goal is to align this slope with the target slope defined by the observed label $y$. The right panel displays the conjugate regularization function $\Lambda(p)$ in the probability space. Here, the Fenchel–Young loss is visualized as the Bregman divergence $D_\Lambda(y \,\|\, p)$, i.e., the vertical gap between the function value $\Lambda(y)$ and the tangent hyperplane constructed at the prediction $p$. Crucially, the two views are mathematically equivalent via the Legendre-Fenchel transform: the slope in the left panel ($p$) becomes the coordinate in the right panel, and the slope in the right panel ($V$) corresponds to the coordinate in the left.

Proposition 3 (FY vs. MLE for Multinomial Logit). Consider the MNL model with utilities $V_{ni}(\beta) = \beta^\top x_{ni}$ and temperature $\mu$, where the choice probabilities are given by

$$ p_{ni}(\beta) = \frac{\exp(V_{ni}(\beta)/\mu)}{\sum_j \exp(V_{nj}(\beta)/\mu)}. $$

For an observation $(x_n, y_n)$, the associated Fenchel–Young loss with scaled Shannon entropy is

$$ \ell_{\text{FY}}(\beta; x_n, y_n) = \mu \log \sum_j \exp\big( \beta^\top x_{nj}/\mu \big) - \beta^\top x_{n y_n}. $$

This coincides with $\mu$ times the negative log-likelihood: $\ell_{\text{FY}} = \mu \cdot (-\log p_{n y_n}(\beta))$. Since $\mu$ is a strictly positive constant, the FY estimator and the maximum likelihood estimator for this model are identical: $\hat{\beta}_{\text{FY}} = \hat{\beta}_{\text{MLE}}$.

Intuitively, this equivalence follows from the fact that the Bregman divergence induced by the scaled negative Shannon entropy is exactly the scaled Kullback-Leibler divergence. For one-hot labels $e_{y_n}$ and model probabilities $p_n(\beta)$, the Fenchel–Young loss can be written as

$$ \ell_{\text{FY}}(\beta; x_n, y_n) = \mu\, D_{\text{KL}}\big( e_{y_n} \,\|\, p_n(\beta) \big), $$

up to an additive constant independent of $\beta$. Minimizing the average FY loss therefore amounts to minimizing the empirical KL divergence between empirical label distributions and model predictions, which is exactly maximum likelihood estimation for the multinomial logit model.

3.4 Convexity Analysis

A major advantage of the Fenchel–Young estimation framework is that the convexity of the optimization problem follows directly from the definition of PUMs, without requiring complex analysis of the Hessian or log-concavity conditions.

Proposition 4 (Global Convexity of PUM Estimation). For any PUM defined by a convex regularizer $\Lambda$, the estimation problem (27) is a convex optimization problem.

Proof. The objective function is a sum of terms $\ell_{\text{FY}}(\beta; x, y)$. Recall that $\Omega(V)$ is defined as the convex conjugate (Fenchel-Legendre transform) of the regularizer $\Lambda(p)$. A fundamental result in convex analysis states that the conjugate of any function is always a convex function. Since the systematic utility $V(x)$ is a linear function of the parameters $\beta$, and convexity is preserved under affine composition, the term $\Omega(V(\beta))$ is convex with respect to $\beta$. The second term, $-\beta^\top x_{n,y_n}$, is linear in $\beta$. Therefore, the loss function $\ell_{\text{FY}}$, being the sum of a convex function and a linear function, is globally convex with respect to $\beta$. This guarantees that any local minimum of (27) is a global minimum, ensuring the feasibility of numerical optimization.

3.5 Solution Approach

Having established the global convexity of the Fenchel–Young estimation problem in Proposition 4, the optimization landscape is guaranteed to be free of local minima.
Consequently, the estimation task simplifies to a standard convex optimization problem, where the global optimum can be reliably recovered via iterative first-order methods, provided that the gradient of the objective function is computable. A remarkable property of the proposed framework is that, for any admissible PUM, regardless of the specific complexity of the choice kernel, the gradient with respect to the utility parameters $\beta$ always simplifies to a unified and physically interpretable form:

$$ \nabla_\beta J(\beta) = \frac{1}{N} \sum_{n=1}^{N} \Big( \underbrace{\sum_{i\in\mathcal{C}} p_{ni}(\beta)\, x_{ni}}_{\text{predicted expected features}} - \underbrace{x_{n,y_n}}_{\text{observed features}} \Big). \qquad (36) $$

This formulation reveals that the optimization process is fundamentally driven by moment matching: the solver iteratively adjusts $\beta$ to align the model's predicted aggregate attributes with the empirical observations, with the specifics of the PUM entering solely through the calculation of the probability weights $p_{ni}(\beta)$.

Crucially, while the gradient structure remains unified, the computational complexity of evaluating the probability weights $p_{ni}(\beta)$ varies significantly across specifications. Unlike MNL, which admits a closed-form expression, determining the choice probabilities for a general perturbation function $\Lambda$ necessitates solving the primal convex optimization problem inherent to the PUM definition. This implies that the gradient evaluation involves an embedded optimization layer: for every observation $n$ at every iterative step, the solver must numerically retrieve the optimal vector $p_n$. Although this nested structure does not compromise the global convergence properties of the outer algorithm, it inevitably increases the computational cost per iteration, potentially creating a bottleneck for large-scale applications.

Fortunately, within the Fenchel–Young framework, we can circumvent this computational bottleneck by exploiting the variational definition of the conjugate function $\Omega$. Instead of treating the probability evaluation as a subroutine to be solved to optimality at every step, we can substitute the definition of $\Omega(V_n)$ back into the empirical risk objective. This allows us to reformulate the original nested minimization problem into a single, unified convex-concave saddle point problem. By lifting the probability vectors from implicit mappings to explicit optimization variables, we arrive at the following minimax formulation:

$$ \min_{\beta\in\mathbb{R}^d} \max_{p^\circ \in \Delta^N} \; \mathcal{L}(\beta, p^\circ) := \frac{1}{N} \sum_{n=1}^{N} \Big[ (p_n - y_n)^\top V_n(\beta) - \Lambda(p_n) \Big], \qquad (37) $$

where $p^\circ = (p_1, \ldots, p_N)$ denotes the stacked vector of choice probabilities for all observations, and $\Delta^N = \Delta \times \cdots \times \Delta$ is the Cartesian product of simplexes. It can be verified that this objective is globally convex (specifically, linear) with respect to $\beta$ and globally concave with respect to the collection of probability vectors $p^\circ$.

To address these challenges while strictly avoiding expensive inner-loop optimization, we employ the projected extragradient method (Mokhtari et al., 2020). This algorithm utilizes a prediction-correction mechanism to stabilize the update trajectory and guarantees global convergence for smooth operators. Let $\tau > 0$ and $\sigma > 0$ denote the primal and dual step sizes, respectively.
At iteration $k$, we first compute a temporary look-ahead point $(\tilde{\beta}^{(k)}, \tilde{p}^{\circ(k)})$ by taking a standard gradient step from the current position:

$$ \tilde{p}_n^{(k)} = \mathcal{P}_\Delta\Big( p_n^{(k)} + \sigma \big[ V_n(\beta^{(k)}) - \nabla\Lambda(p_n^{(k)}) \big] \Big), \quad \forall n = 1, \ldots, N, \qquad (38a) $$
$$ \tilde{\beta}^{(k)} = \beta^{(k)} - \tau \Big[ \frac{1}{N} \sum_{n=1}^{N} X_n^\top \big( p_n^{(k)} - y_n \big) \Big], \qquad (38b) $$

where $\mathcal{P}_\Delta(\cdot)$ denotes the Euclidean projection onto the probability simplex, and $\nabla\Lambda(\cdot)$ is the gradient of the perturbation function. Note that the bracketed term in the second equation corresponds precisely to the gradient of the primal objective $\nabla J(\beta)$, i.e., Eq. (36), but here the probability vector $p_n$ is treated as an independent dual variable rather than an implicit function of $\beta$. We then update the variables using the gradients evaluated at the predicted point, rather than the current point, to damp oscillations:

$$ p_n^{(k+1)} = \mathcal{P}_\Delta\Big( p_n^{(k)} + \sigma \big[ V_n(\tilde{\beta}^{(k)}) - \nabla\Lambda(\tilde{p}_n^{(k)}) \big] \Big), \quad \forall n = 1, \ldots, N, \qquad (39a) $$
$$ \beta^{(k+1)} = \beta^{(k)} - \tau \Big[ \frac{1}{N} \sum_{n=1}^{N} X_n^\top \big( \tilde{p}_n^{(k)} - y_n \big) \Big]. \qquad (39b) $$

According to Mokhtari et al. (2020), the above projected extragradient method achieves a convergence rate of $O(1/k)$, which is sublinear. However, for the significant subclass of additive separable PUMs (including Sparsemax and Cauchy, among others), we can exploit the structural decomposability of the regularizer to achieve substantially faster convergence. In these cases, the perturbation function separates as $\Lambda(p) = \sum_{i\in\mathcal{C}} h(p_i)$, where $h$ is a strictly convex scalar function. This structure decouples the primal optimization problem, allowing the optimal choice probabilities $p_n(\beta)$ to be recovered through a computationally efficient, semi-analytical mapping rather than a high-dimensional iterative solver. Specifically, the probability for alternative $i$ is given by the inverse of the derivative of the regularizer:

$$ p_{ni}(\beta) = \psi(V_{ni}(\beta) - \lambda_n), \quad \text{with } \psi = (h')^{-1}, \qquad (40) $$

where $\lambda_n \in \mathbb{R}$ is the unique Lagrange multiplier (normalization constant) determined by solving the scalar root-finding equation:

$$ \sum_{j\in\mathcal{C}} \psi(V_{nj}(\beta) - \lambda_n) = 1. \qquad (41) $$

The numerical resolution of this equation is both well-posed and highly efficient. Since the mapping $\psi(\cdot)$ is strictly increasing (being the inverse derivative of a strictly convex function), the aggregate function $G(\lambda_n) = \sum_{j\in\mathcal{C}} \psi(V_{nj} - \lambda_n) - 1$ is strictly monotonically decreasing in $\lambda_n$. This monotonicity guarantees the existence and uniqueness of the Lagrange multiplier, enabling the use of standard root-finding algorithms with guaranteed convergence. For instance, derivative-free methods such as the golden section search offer robust global convergence at a linear rate. More aggressively, if $\psi$ is differentiable, Newton's method can be employed to achieve a quadratic convergence rate, effectively doubling the digits of precision at every step and typically resolving $\lambda_n$ to machine tolerance within a few iterations.

Consequently, since the inner forward problem is reduced to a low-cost scalar operation, the computational bottleneck associated with the nested optimization is effectively eliminated. By combining this rapid evaluation of choice probabilities with the analytical gradient derived in Eq. (36), we can solve the Fenchel–Young estimation problem using advanced gradient-based optimization algorithms that exploit the objective's smoothness and convexity; a sketch of the saddle-point alternative (38)-(39) is given below.
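The following minimal sketch implements one possible reading of the updates (38)-(39) for the quadratic (Sparsemax-type) regularizer, where $\nabla\Lambda(p) = p$; step sizes, iteration counts, and all names are arbitrary illustrative choices, not tuned or canonical values.

```python
# Minimal sketch of the projected extragradient updates (38)-(39) on the
# saddle problem (37), assuming Lambda(p) = ||p||^2 / 2 so grad Lambda(p) = p.
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def extragradient(X, Y, tau=0.05, sigma=0.05, iters=1000):
    """X: attributes (N, K, d); Y: one-hot labels (N, K). Returns beta estimate."""
    N, K, d = X.shape
    beta = np.zeros(d)
    P = np.full((N, K), 1.0 / K)                      # dual variables p_n
    for _ in range(iters):
        V = X @ beta                                  # utilities V_n(beta), shape (N, K)
        # Look-ahead step, Eqs. (38a)-(38b).
        P_t = np.array([project_simplex(p + sigma * (v - p)) for p, v in zip(P, V)])
        beta_t = beta - tau * np.einsum('nkd,nk->d', X, P - Y) / N
        # Correction step, Eqs. (39a)-(39b): gradients at the look-ahead point.
        V_t = X @ beta_t
        P = np.array([project_simplex(p + sigma * (v - pt))
                      for p, v, pt in zip(P, V_t, P_t)])
        beta = beta - tau * np.einsum('nkd,nk->d', X, P_t - Y) / N
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3, 2))
Y = np.eye(3)[rng.integers(0, 3, size=200)]
print(extragradient(X, Y))
```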
A quintessential example is Nesterov's Accelerated Gradient (NAG; Nesterov, 1983). Unlike standard gradient descent, NAG incorporates a momentum term to correct the search trajectory based on previous iterations. This modification allows the algorithm to achieve the theoretically optimal convergence rate of $O(1/k^2)$ for smooth convex functions, ensuring that the global optimum is reached with significantly fewer gradient evaluations.

3.6 Asymptotic Properties

In addition to convexity and numerical tractability, a viable estimation method must also possess sound large-sample statistical behavior. In this section, we study the asymptotic properties of Fenchel–Young estimators for general PUMs. Our goal is twofold. First, we show that under mild regularity conditions and correct model specification, the FY estimator consistently recovers the true parameter vector as the sample size grows, i.e., it converges in probability to the data-generating parameter. Second, we establish an asymptotic normality result: after appropriate rescaling by $\sqrt{N}$, the estimation error converges in distribution to a multivariate normal law with a well-defined covariance matrix. We proceed by presenting two lemmas first.

Lemma 1 (Continuity and domination of the FY loss). Let $\ell_{\text{FY}}(\beta; x, y)$ denote the Fenchel–Young loss associated with a PUM, and let $\mathcal{B} \subset \mathbb{R}^d$ be a compact parameter set. Assume that the data are generated under a probability measure $\mathbb{P}^\star$ on $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{Y} = \{ e_1, \ldots, e_K \}$. Then the following continuity and domination conditions hold:

1. For each $(x, y)$, the map $\beta \mapsto \ell_{\text{FY}}(\beta; x, y)$ is continuous on $\mathcal{B}$.
L et ˆ β N b e the estimator obtaine d by minimizing the empiric al F enchel–Y oung loss over a dataset of N i.i.d. samples gener ate d fr om a wel l-sp e cifie d PUM with true p ar ameter β ⋆ . Assume the fol lowing r e gularity c onditions hold: 1. R e gularity of Ω : The surplus function Ω is twic e c ontinuously differ entiable in a neighb orho o d of V ( X ; β ⋆ ) almost sur ely. 2. Non-singularity: The exp e cte d Hessian matrix H ( β ⋆ ) : = E [ ∇ 2 ℓ FY ( β ⋆ ; X , Y )] is p ositive definite. 3. Finite Moments: E [ ∥ X ∥ 2 ] < ∞ and the c ovarianc e of the gr adient J ( β ⋆ ) is finite. 16 Then, as N → ∞ , √ N ( ˆ β N − β ⋆ ) d − → N ( 0 , V sand ) , wher e the asymptotic varianc e is given by the sandwich form: V sand = H ( β ⋆ ) − 1 J ( β ⋆ ) H ( β ⋆ ) − 1 . (42) Pr o of. Let ℓ n ( β ) : = ℓ FY ( β ; X n , Y n ) denote the loss for the n -th observ ation, and let g n ( β ) : = ∇ ℓ n ( β ) denote its gradien t. The estimator satisfies the empirical first-order condition 1 N P N n =1 g n ( ˆ β N ) = 0 . With Prop osition 2, we hav e confirmed the Fisher consistency , implying E [ g ( β ⋆ )] = 0 . Since ˆ β N p − → β ⋆ (Theorem 1), we expand the empirical gradien t around β ⋆ : 0 = 1 N N X n =1 g n ( ˆ β N ) = 1 N N X n =1 g n ( β ⋆ ) + 1 N N X n =1 ∇ 2 ℓ n ( ˜ β N ) ! ( ˆ β N − β ⋆ ) , where ˜ β N lies on the line segment b et ween ˆ β N and β ⋆ . Rearranging terms and scaling by √ N : √ N ( ˆ β N − β ⋆ ) = − 1 N N X n =1 ∇ 2 ℓ n ( ˜ β N ) ! − 1 1 √ N N X n =1 g n ( β ⋆ ) ! . W e analyze the con vergence of the tw o comp onen ts: • Hessian Con v ergence: By the Uniform La w of Large Numbers and the assumed regularit y of the Hessian, the brack eted term con verges in probability to the p opulation Hessian: 1 N N X n =1 ∇ 2 ℓ n ( ˜ β N ) p − → E [ ∇ 2 ℓ ( β ⋆ )] = H ( β ⋆ ) . Note that for PUMs, the Hessian tak es the form ∇ 2 ℓ ( β ) = X [ ∇ 2 Ω( V )] X ⊤ . • Cen tral Limit Theorem: The term 1 √ N P g n ( β ⋆ ) is a sum of i.i.d. zero-mean random v ectors with co v ariance matrix: J ( β ⋆ ) : = E [ g ( β ⋆ ) g ( β ⋆ ) ⊤ ] . By the Multiv ariate CL T, this term con verges in distribution to N ( 0 , J ( β ⋆ )). Com bining these results via Slutsky’s Theorem, we obtain: √ N ( ˆ β N − β ⋆ ) d − → − H − 1 · N ( 0 , J ) ∼ N  0 , H − 1 JH − 1  . It is crucial to distinguish b et ween t wo cases regarding the asymptotic efficiency: • Multinomial Logit: F or MNL, the F enc hel–Y oung loss coincides with the negative log- lik eliho o d. By the Bartlett identities, the exp ected Hessian equals the cov ariance of the score ( H = J = I Fisher ). The sandwich v ariance simplifies to I − 1 Fisher , ac hieving the Cram ´ er-Rao lo wer b ound (asymptotic efficiency). 17 • Sparsemax: F or PUMs with sparse choice kernels, while consistency holds, the regularit y conditions required for standard asymptotic normality (specifically twice con tin uous differentia- bilit y) ma y not hold everywhere due to the non-smo othness of the Sparsemax transform at the b oundaries of the supp ort. In these cases, H  = J generally . The matrix H reflects the curv ature of the conv ex conjugate Ω, which ma y hav e flat regions (zero curv ature) where probabilities are sparse, whereas J reflects the v ariance of the residuals. Consequently , V sand ⪰ I − 1 Fisher (in the L¨ owner order), implying that the F enchel–Y oung estimator, while robust, ma y b e theoretically less efficient than the MLE in the large-sample limit. 
4 Wasserstein Distance-based Distributionally Robust Estimation

4.1 The Curse of Limited Data

Standard Empirical Risk Minimization (ERM) strategies, including the proposed Fenchel–Young framework, operate under the tacit assumption that the empirical distribution $\hat{\mathbb{P}}_N$, constructed from observed samples, serves as a sufficient statistic for the true underlying data-generating distribution $\mathbb{P}$. Asymptotically, as the sample size $N \to \infty$, we know that $\hat{\mathbb{P}}_N$ converges to $\mathbb{P}$, guaranteeing the consistency of the estimator. However, in the regime of limited data, this asymptotic promise breaks down. The empirical distribution becomes a sparse, high-variance approximation that fails to capture the true support and tail behavior of the underlying preference distribution. Consequently, any estimator that strictly minimizes risk with respect to $\hat{\mathbb{P}}_N$ tends to overfit the idiosyncratic noise of the finite sample rather than capturing the generalized structural parameters.

Crucially, this statistical vulnerability is inherent to the estimation of PUMs due to the fundamental nonlinearity of the choice mechanism. The optimality condition for the Fenchel–Young estimator requires finding the root of the gradient equations: $\sum_n (\nabla\Omega(V_n(\beta)) - y_n) \otimes x_n = 0$. Here, the mapping from parameters to probabilities, $\beta \mapsto \nabla\Omega(V(\beta))$, involves the choice kernel $\psi(\cdot)$ (e.g., the projection operator in Sparsemax), which is highly nonlinear. While these estimating equations are consistent, the operation of retrieving parameters from noisy finite data is subject to Jensen's inequality due to the curvature of the surplus function $\Omega$. This induces a systematic finite-sample bias, typically manifesting as magnitude inflation ($\mathbb{E}[\|\hat{\beta}\|] > \|\beta_{\text{true}}\|$), a phenomenon that stems from the geometry of the PUM itself.

Although the Fenchel–Young loss provides numerical stability through bounded gradients, stability does not equate to statistical robustness against sampling error. In small-sample regimes, the discrepancy between the empirical distribution $\hat{\mathbb{P}}_N$ and the true distribution $\mathbb{P}$ remains a dominant source of error. To ensure reliable inference under data scarcity, we must therefore go beyond standard ERM. Instead of optimizing solely over the unreliable empirical distribution, we propose a robust modification that explicitly accounts for the distributional uncertainty surrounding $\hat{\mathbb{P}}_N$.

4.2 Wasserstein Distance-based Formulation and its Convex Dual Formulation

Let $\mathcal{Z} = \mathcal{X} \times \mathcal{Y} \subseteq \mathbb{R}^d \times \{ e_1, \ldots, e_K \}$ denote the sample space, where $\mathcal{Y}$ consists of the standard basis vectors in $\mathbb{R}^K$ representing the $K$ discrete alternatives. We define the ambiguity set $\mathcal{B}_\epsilon(\hat{\mathbb{P}}_N)$ as the set of all probability distributions $\mathbb{Q}$ supported on $\mathcal{Z}$ such that their Wasserstein distance to the empirical distribution satisfies $W_c(\mathbb{Q}, \hat{\mathbb{P}}_N) \le \epsilon$. The radius $\epsilon > 0$ serves as a budget for distributional uncertainty.

The choice of the ground metric $c(z, z')$ is pivotal for the performance of PUM estimation. To capture the distinct nature of continuous attributes and discrete categorical labels, we propose a hybrid transport cost function.
For two data points $z = (x, y)$ and $z' = (x', y')$, we define:

$$ c(z, z') \triangleq \| x - x' \|_S + \kappa \cdot \mathbb{I}(y \neq y'), \qquad (43) $$

where $\| u \|_S = \sqrt{u^\top S^{-1} u}$ is the Mahalanobis distance weighted by the empirical covariance matrix $S$ of the attributes, and $\kappa > 0$ is a penalty parameter for label mismatch. This metric induces anisotropic robustness in the feature space while allowing for controlled label noise to prevent perfect separation. The primal robust estimation problem is formulated as:

$$ \min_\beta \sup_{\mathbb{Q} \in \mathcal{B}_\epsilon(\hat{\mathbb{P}}_N)} \mathbb{E}_{\mathbb{Q}}[\ell_{\text{FY}}(\beta; x, y)], \qquad (44) $$

where $\ell_{\text{FY}}(\beta; x, y) = \Omega(V(x)) - y^\top V(x)$ denotes the Fenchel–Young loss derived from the PUM's surplus function $\Omega$, and $y \in \mathcal{Y}$ represents the observed choice as a one-hot vector.

4.2.1 Exact Dual Reformulation

We now derive the exact reformulation of the infinite-dimensional distributionally robust problem (44). The primal problem seeks to minimize the worst-case expected loss over the Wasserstein ball $\mathcal{B}_\epsilon(\hat{\mathbb{P}}_N)$. Since the Fenchel–Young loss $\ell_{\text{FY}}(\beta; x, y)$ is continuous in $x$ and the uncertainty set $\Xi = \mathcal{X} \times \mathcal{Y}$ is a closed subset of a Polish space, the strong duality result for Wasserstein DRO holds (Mohajerin Esfahani and Kuhn, 2018). This allows us to rephrase the problem as a minimization over the dual variable $\gamma \ge 0$ (the Lagrange multiplier for the Wasserstein constraint) and auxiliary variables $\{s_n\}_{n=1}^N$ representing the worst-case loss for each sample.

Theorem 3 (Exact Reformulation). Consider the Fenchel–Young distributionally robust estimator with the hybrid transport cost defined by $c((x, y), (x', y')) = \| x - x' \|_S + \kappa\, \mathbb{I}(y \neq y')$. The infinite-dimensional primal problem

$$ \min_\beta \sup_{\mathbb{Q} \in \mathcal{B}_\epsilon(\hat{\mathbb{P}}_N)} \mathbb{E}_{\mathbb{Q}}[\ell_{\text{FY}}(\beta; x, y)] \qquad (45) $$

is equivalent to the following finite-dimensional convex optimization problem:

$$ \begin{aligned} \min_{\beta, \gamma, \{s_n\}} \quad & \gamma\epsilon + \frac{1}{N} \sum_{n=1}^N s_n \\ \text{s.t.} \quad & s_n \ge \sup_{x\in\mathcal{X}} \big\{ \ell_{\text{FY}}(\beta; x, e_i) - \gamma \| x - x_n \|_S \big\} - \gamma\kappa\, \mathbb{I}(e_i \neq y_n), \quad \forall n \in \{1, \ldots, N\}, \; \forall i \in \{1, \ldots, K\}, \\ & \gamma \ge 0. \end{aligned} \qquad (46) $$

Proof. The proof proceeds in two steps. First, we invoke the strong duality theorem for Wasserstein DRO. Provided the underlying sample space is a Polish space, the transport cost is lower semicontinuous, and the loss is upper semicontinuous and integrably bounded, strong duality holds according to Blanchet and Murthy (2019). In our setting, $\ell_{\text{FY}}$ is continuous (as shown in Lemma 1) and the hybrid cost is a metric, thereby satisfying these regularity assumptions. Thus, the worst-case risk is exactly given by:

$$ \min_{\beta, \gamma \ge 0} \Big\{ \gamma\epsilon + \frac{1}{N} \sum_{n=1}^N \sup_{z \in \mathcal{X}\times\mathcal{Y}} \big( \ell_{\text{FY}}(\beta; z) - \gamma c(z, z_n) \big) \Big\}. $$
4.2.1 Exact Dual Reformulation

We now derive the exact reformulation of the infinite-dimensional distributionally robust problem (44). The primal problem seeks to minimize the worst-case expected loss over the Wasserstein ball B_ϵ(P̂_N). Since the Fenchel-Young loss ℓ_FY(β; x, y) is continuous in x and the sample space Z = X × Y is a closed subset of a Polish space, the strong duality result for Wasserstein DRO holds (Mohajerin Esfahani and Kuhn, 2018). This allows us to rephrase the problem as a minimization over the dual variable γ ≥ 0 (the Lagrange multiplier for the Wasserstein constraint) and auxiliary variables {s_n}_{n=1}^N representing the worst-case loss for each sample.

Theorem 3 (Exact Reformulation). Consider the Fenchel-Young distributionally robust estimator with the hybrid transport cost defined by c((x, y), (x′, y′)) = ‖x − x′‖_S + κ 𝕀(y ≠ y′). The infinite-dimensional primal problem
\[
\min_{\beta} \sup_{Q \in \mathcal{B}_\epsilon(\hat{P}_N)} \mathbb{E}_{Q}\big[\ell_{\mathrm{FY}}(\beta; x, y)\big] \tag{45}
\]
is equivalent to the following finite-dimensional convex optimization problem:
\[
\begin{aligned}
\min_{\beta, \gamma, \{s_n\}} \quad & \gamma \epsilon + \frac{1}{N} \sum_{n=1}^{N} s_n \\
\text{s.t.} \quad & s_n \geq \sup_{x \in \mathcal{X}} \big\{ \ell_{\mathrm{FY}}(\beta; x, e_i) - \gamma \|x - x_n\|_S \big\} - \gamma \kappa \, \mathbb{I}(e_i \neq y_n), \quad \forall n \in \{1, \dots, N\}, \; \forall i \in \{1, \dots, K\}, \\
& \gamma \geq 0 . && (46)
\end{aligned}
\]

Proof. The proof proceeds in two steps. First, we invoke the strong duality theorem for Wasserstein DRO. Provided the underlying sample space is a Polish space, the transport cost is lower semicontinuous, and the loss is upper semicontinuous and integrably bounded, strong duality holds according to Blanchet and Murthy (2019). In our setting, ℓ_FY is continuous (as shown in Lemma 1) and the hybrid cost is a metric, thereby satisfying these regularity assumptions. Thus, the worst-case risk is exactly given by
\[
\min_{\beta, \gamma \geq 0} \left\{ \gamma \epsilon + \frac{1}{N} \sum_{n=1}^{N} \sup_{z \in \mathcal{X} \times \mathcal{Y}} \big( \ell_{\mathrm{FY}}(\beta; z) - \gamma c(z, z_n) \big) \right\}.
\]
Second, we decompose the supremum over the joint space X × Y. Since Y = {e_1, ..., e_K} is a finite discrete set, the global supremum is equivalent to the maximum over i of the conditional suprema given y = e_i:
\[
\sup_{z \in \mathcal{X} \times \mathcal{Y}} (\cdots) = \max_{i \in \{1, \dots, K\}} \sup_{x \in \mathcal{X}} \big\{ \ell_{\mathrm{FY}}(\beta; x, e_i) - \gamma \big( \|x - x_n\|_S + \kappa \, \mathbb{I}(e_i \neq y_n) \big) \big\}.
\]
Substituting this decomposition into the robust constraints for s_n yields the formulation in (46).

Then, we verify the convexity. The objective function γϵ + (1/N)Σ s_n is linear in the decision variables (β, γ, {s_n}), and is therefore convex. Next, consider the inequality constraints defining the feasible set. For each sample n and label i, the constraint can be rewritten as s_n ≥ ϕ_ni(β, γ), where
\[
\phi_{ni}(\beta, \gamma) := \sup_{x \in \mathcal{X}} \big\{ \ell_{\mathrm{FY}}(\beta; x, e_i) - \gamma \|x - x_n\|_S \big\} - \gamma \kappa \, \mathbb{I}(e_i \neq y_n).
\]
We observe that for any fixed x ∈ X, the term ℓ_FY(β; x, e_i) is convex in β (due to the convexity of Ω), and the term −γ(‖x − x_n‖_S + κ𝕀(e_i ≠ y_n)) is linear (and thus convex) in γ. Consequently, the expression inside the supremum is jointly convex with respect to (β, γ). Since the pointwise supremum of a family of convex functions preserves convexity, the function ϕ_ni(β, γ) is jointly convex. The constraints s_n ≥ ϕ_ni(β, γ) thus define the epigraphs of convex functions, which are convex sets. Since the intersection of convex sets is convex, the feasible region of (46) is convex. Therefore, the problem minimizes a convex objective over a convex feasible set, characterizing it as a convex optimization problem.

Despite the theoretical convexity of the exact reformulation (46), the problem remains computationally prohibitive for general PUMs. The evaluation of the constraint function ϕ_ni(β, γ) necessitates solving the inner maximization problem sup_x {ℓ_FY(β; x, e_i) − γ‖x − x_n‖_S}. Since the Fenchel-Young loss ℓ_FY is convex with respect to the feature vector x, this inner problem amounts to maximizing a convex function, which is a classic non-convex optimization problem. Consequently, the constraints and their gradients cannot be evaluated efficiently, rendering standard convex solvers inapplicable.

To circumvent this intractability, we proceed by deriving a safe convex approximation. In the next section, we utilize the Lipschitz continuity of the loss function to construct a tractable upper bound for the worst-case risk. This approach allows us to replace the implicit and hard-to-compute worst-case constraints with an explicit constraint on the parameter norm, derived from bounding the global Lipschitz constant of the model.

4.2.2 Bounding the Lipschitz Constant with PUM Properties

Recall that for a single observation (x, y), where y ∈ {0, 1}^K represents the choice as a one-hot vector, the FY loss is
\[
\ell_{\mathrm{FY}}(\beta; x, y) = \Omega(V(x)) - y^\top V(x),
\]
where V(x) = [V_1(x), ..., V_K(x)]⊤, V_i(x) = β⊤x_i, and x = (x_1, ..., x_K) collects the attributes of all alternatives. By the PUM construction, ∇_V Ω(V) = p(x; β) ∈ ∆, so the gradient of the FY loss with respect to V is simply the difference between the predicted probability and the label vector:
\[
\nabla_V \ell_{\mathrm{FY}} = \nabla_V \Omega(V) - y = p(x; \beta) - y .
\]
For each alternative i, the Jacobian of V(x) with respect to x_i satisfies
\[
\nabla_{x_i} V_j(x) = \begin{cases} \beta, & j = i, \\ 0, & j \neq i, \end{cases}
\]
which yields, by the chain rule,
\[
\nabla_{x_i} \ell_{\mathrm{FY}}(\beta; x, y) = \big( p_i(x; \beta) - y_i \big) \beta .
\]
Stacking over i = 1, ..., K, the gradient with respect to the full feature vector x is
\[
\nabla_x \ell_{\mathrm{FY}}(\beta; x, y) = \big( (p_1 - y_1)\beta, \dots, (p_K - y_K)\beta \big).
\]

Euclidean bound. The Euclidean norm of this gradient satisfies
\[
\big\| \nabla_x \ell_{\mathrm{FY}}(\beta; x, y) \big\|_2^2 = \sum_{i=1}^{K} (p_i - y_i)^2 \|\beta\|_2^2 = \|\beta\|_2^2 \, \|p - y\|_2^2 .
\]
Since p ∈ ∆ and y is a vertex of the simplex (one-hot), we have the simple bound
\[
\|p - y\|_2^2 = \sum_{i=1}^{K} (p_i - y_i)^2 = (1 - p_k)^2 + \sum_{j \neq k} p_j^2 \leq (1 - p_k)^2 + (1 - p_k) \leq 2 ,
\]
where k is the index of the observed class (i.e., y_k = 1).
Hence,
\[
\|p - y\|_2 \leq \sqrt{2}, \qquad \big\| \nabla_x \ell_{\mathrm{FY}}(\beta; x, y) \big\|_2 \leq \sqrt{2}\, \|\beta\|_2 . \tag{47}
\]
Thus, with respect to the Euclidean norm, the FY loss is √2‖β‖₂-Lipschitz in x, for any PUM (we only used p ∈ ∆).

Mahalanobis bound. The DRO formulation uses the Mahalanobis norm ‖u‖_S = √(u⊤S⁻¹u), whose dual norm is
\[
\|g\|_{S,*} := \sup_{\|u\|_S \leq 1} g^\top u = \sqrt{g^\top S g} .
\]
By the spectral bound g⊤Sg ≤ λ_max(S)‖g‖₂², we obtain
\[
\big\| \nabla_x \ell_{\mathrm{FY}}(\beta; x, y) \big\|_{S,*} \leq \sqrt{\lambda_{\max}(S)} \, \big\| \nabla_x \ell_{\mathrm{FY}}(\beta; x, y) \big\|_2 \leq \sqrt{2 \lambda_{\max}(S)}\, \|\beta\|_2 .
\]
Since the Lipschitz constant L_y(β) with respect to the Mahalanobis norm is given by L_y(β) = sup_{x∈X} ‖∇_x ℓ_FY(β; x, y)‖_{S,*}, we may take the uniform bound
\[
L(\beta) := \max_y L_y(\beta) \leq \sqrt{2 \lambda_{\max}(S)}\, \|\beta\|_2 . \tag{48}
\]
Therefore, substituting the bound in (47) or (48) into the dual constraint γ ≥ L(β), we may enforce the (slightly stronger) second-order cone constraint
\[
\gamma \geq c_S \|\beta\|_2, \qquad c_S := \sqrt{2 \lambda_{\max}(S)} \;\text{ or }\; \sqrt{2} .
\]

4.2.3 Tractable Dual Reformulation

We now construct a safe convex approximation. By utilizing the global Lipschitz constant derived above, we replace the implicit local gradient constraints with a uniform upper bound. This relaxation transforms the problem into a computationally efficient finite-dimensional convex optimization problem.

Theorem 4 (Tractable Reformulation). Given that the Lipschitz constant of the map x ↦ ℓ_FY(β; x, e_i) is L(β), the exact distributionally robust problem can be safely approximated by the following finite-dimensional convex optimization problem, which provides an upper bound on the worst-case risk:
\[
\begin{aligned}
\min_{\beta, \gamma, \{s_n\}} \quad & \gamma \epsilon + \frac{1}{N} \sum_{n=1}^{N} s_n && (49) \\
\text{subject to} \quad & \gamma \geq L(\beta), && (50) \\
& s_n \geq \ell_{\mathrm{FY}}(\beta; x_n, y_n), \quad \forall n = 1, \dots, N, && (51) \\
& s_n \geq \ell_{\mathrm{FY}}(\beta; x_n, e_i) - \gamma \kappa, \quad \forall n = 1, \dots, N, \; \forall i \text{ such that } e_i \neq y_n . && (52)
\end{aligned}
\]

Proof. We start from the exact semi-infinite reformulation derived in Theorem 3, specifically Equation (46). The constraints require that for every sample n and class i:
\[
s_n \geq \sup_{x \in \mathcal{X}} \big\{ \ell_{\mathrm{FY}}(\beta; x, e_i) - \gamma \|x - x_n\|_S \big\} - \gamma \kappa \, \mathbb{I}(e_i \neq y_n). \tag{53}
\]
The inner maximization problem involves the convex loss function and is generally intractable. To construct a tractable upper bound, we utilize the Lipschitz continuity of the loss. By the definition of the Lipschitz constant L(β), we have the inequality
\[
\ell_{\mathrm{FY}}(\beta; x, e_i) \leq \ell_{\mathrm{FY}}(\beta; x_n, e_i) + L(\beta) \|x - x_n\|_S . \tag{54}
\]
Substituting this upper bound into the supremum yields:
\[
\sup_{x \in \mathcal{X}} \big\{ \ell_{\mathrm{FY}}(\beta; x, e_i) - \gamma \|x - x_n\|_S \big\} \leq \ell_{\mathrm{FY}}(\beta; x_n, e_i) + \sup_{x \in \mathcal{X}} \big\{ (L(\beta) - \gamma) \|x - x_n\|_S \big\}. \tag{55}
\]
If we enforce the constraint γ ≥ L(β), the term (L(β) − γ) is non-positive. Consequently, the supremum over the unbounded domain X is attained at x = x_n (where the norm is zero), and the term vanishes. Conversely, if γ < L(β), the supremum explodes to +∞, making the minimization over γ infeasible or suboptimal. Therefore, by restricting the search space to γ ≥ L(β), we can replace the intractable supremum with the nominal loss value:
\[
\sup_{x \in \mathcal{X}} \big\{ \ell_{\mathrm{FY}}(\beta; x, e_i) - \gamma \|x - x_n\|_S \big\} \leq \ell_{\mathrm{FY}}(\beta; x_n, e_i). \tag{56}
\]
Applying this relaxation to the constraints in (46) for both the correct label case (e_i = y_n, where the indicator is 0) and the incorrect label case (e_i ≠ y_n, where the indicator is 1) directly yields the finite system of constraints (50)-(52). Since the feasible set of this approximate problem is a subset of the exact problem's feasible set (due to the upper bound on the s_n constraints), the optimal value provides an upper bound on the true worst-case risk, making it a safe approximation.
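The constant c_S entering the cone constraint is directly computable from the data. The sketch below, with illustrative names, evaluates c_S = √(2 λ_max(S)) and numerically spot-checks the ingredient ‖p − y‖₂ ≤ √2 that underlies the bound.

    import numpy as np

    def lipschitz_coefficient(S):
        """c_S = sqrt(2 * lambda_max(S)): uniform Lipschitz coefficient of the
        FY loss in x, measured in the dual Mahalanobis norm."""
        return np.sqrt(2.0 * np.linalg.eigvalsh(S)[-1])  # eigvalsh sorts ascending

    # Spot check of ||p - y||_2 <= sqrt(2) for a simplex point p and one-hot y.
    p = np.array([0.5, 0.3, 0.2])
    y = np.array([0.0, 1.0, 0.0])
    assert np.linalg.norm(p - y) <= np.sqrt(2.0)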
By substituting the explicit Lipschitz constant derived above (i.e., L(β) = c_S‖β‖₂), the problem admits the explicit formulation:
\[
\begin{aligned}
\min_{\beta, \gamma, \{s_n\}} \quad & \gamma \epsilon + \frac{1}{N} \sum_{n=1}^{N} s_n \\
\text{s.t.} \quad & s_n \geq \ell_{\mathrm{FY}}(\beta; x_n, y_n), \quad n = 1, \dots, N, \\
& s_n \geq \ell_{\mathrm{FY}}(\beta; x_n, e_i) - \gamma \kappa, \quad n = 1, \dots, N, \; \forall i \text{ s.t. } e_i \neq y_n, \\
& \gamma \geq c_S \|\beta\|_2 . && (57)
\end{aligned}
\]
Since ℓ_FY(β; x, y) is convex in β and the constraint γ ≥ c_S‖β‖₂ defines a convex second-order cone, problem (57) is a globally convex optimization problem. It can be solved efficiently using standard conic solvers or first-order methods, where the distributionally robust formulation effectively adds a geometry-aware norm regularization to the standard Fenchel-Young loss.

4.2.4 Solution Algorithm for the Tractable Approximation

While the tractable reformulation derived in Theorem 4 constitutes a valid second-order cone program that can be solved via interior-point methods, evaluating its constraints still requires solving inner optimization problems: each evaluation of ℓ_FY(β; x_n, y_n) entails solving the variational problem that defines the surplus function Ω. To enable efficient large-scale estimation, we propose a first-order primal-dual algorithm by restructuring the problem into a convex-concave saddle point formulation.

Recall that the robustified loss function essentially involves the surplus function Ω(V). By invoking the variational definition Ω(V) = sup_{p∈∆} {p⊤V − Λ(p)}, we can lift the implicit probability vectors to explicit dual optimization variables. Consequently, the minimization of the robust objective is equivalent to the following minimax problem:
\[
\min_{\substack{\beta \in \mathbb{R}^d, \, \gamma \in \mathbb{R} \\ \gamma \geq c_S \|\beta\|_2}} \; \max_{p^\circ \in \Delta^N} \; \mathcal{L}(\beta, \gamma, p^\circ) := \gamma \epsilon + \frac{1}{N} \sum_{n=1}^{N} \Big\{ \underbrace{p_n^\top V(x_n; \beta) - \Lambda(p_n)}_{\text{variational form of } \Omega(V_n)} + R_n(\beta, \gamma) \Big\}, \tag{58}
\]
where p∘ = (p_1, ..., p_N) denotes the collection of dual probability vectors, and R_n(β, γ) represents the adversarial correction term derived from the finite constraints:
\[
R_n(\beta, \gamma) := \max_{i \in \{1, \dots, K\}} \big\{ -e_i^\top V(x_n; \beta) - \mathbb{I}(e_i \neq y_n) \, \gamma \kappa \big\}. \tag{59}
\]
The saddle point formulation presented in Eq. (58) involves the term R_n(β, γ), which is defined as the pointwise maximum of a finite set of affine functions. Consequently, the objective function is non-smooth with respect to the primal variables β and γ, as the gradient is undefined at points where multiple indices i achieve the maximum simultaneously. To circumvent this issue and recover a fully smooth objective, we can further dualize the discrete maximization by introducing an auxiliary adversarial probability vector q_n ∈ ∆_K for each sample. By exploiting the variational identity max_i {z_i} = max_{q∈∆_K} q⊤z, we transform the non-smooth piecewise linear term into a smooth bilinear term. This leads to the following bilinear saddle point problem:
\[
\min_{\substack{\beta \in \mathbb{R}^d, \, \gamma \in \mathbb{R} \\ \gamma \geq c_S \|\beta\|_2}} \; \max_{\substack{p^\circ \in \Delta^N \\ q^\circ \in \Delta^N}} \; \mathcal{L}_{\mathrm{bilinear}}(\beta, \gamma, p^\circ, q^\circ), \tag{60}
\]
where q∘ = (q_1, ..., q_N) denotes the collection of adversarial label distributions. The smoothed objective function is given by:
\[
\mathcal{L}_{\mathrm{bilinear}} := \frac{1}{N} \sum_{n=1}^{N} \Big[ (p_n - q_n)^\top V(x_n; \beta) - \Lambda(p_n) \Big] + \gamma \left( \epsilon - \frac{\kappa}{N} \sum_{n=1}^{N} \sum_{i=1}^{K} q_{ni} \, \mathbb{I}(e_i \neq y_n) \right). \tag{61}
\]
The mathematical structure of Eq. (60) is favorable for numerical optimization. Specifically, the objective function L_bilinear is convex-concave: it is affine (and thus convex) with respect to the primal variables (β, γ), and concave with respect to the dual variables (p∘, q∘). Notably, the strict convexity of the regularizer Λ implies that the objective is strictly concave in p∘, ensuring the uniqueness of the optimal prediction distribution. Furthermore, unlike the original formulation involving the pointwise maximum, this bilinear formulation is smooth with Lipschitz continuous gradients. Combined with the fact that the constraints, i.e., the second-order cone for the primal variables and the probability simplices for the dual variables, are simple convex sets admitting efficient closed-form projections, this problem can be solved efficiently using the projected extragradient algorithm, as demonstrated in Section 3.5. In algorithmic practice, to further enhance convergence stability and mitigate potential oscillatory behavior inherent in min-max optimization, a two-time-scale strategy can be incorporated into this framework (Lin et al., 2025), i.e., setting a longer step size for the inner iterations than for the outer iterations.
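The following self-contained numpy sketch illustrates the two-time-scale projected extragradient scheme for (60). It assumes linear-in-parameters utilities V(x_n; β) = X_n β with X_n ∈ R^{K×d}, takes the gradient of the perturbation function Λ as a callable, and uses illustrative step sizes and names; it is a sketch of the scheme described above, not a prescribed implementation.

    import numpy as np

    def project_simplex(v):
        """Euclidean projection of a vector v onto the probability simplex."""
        u = np.sort(v)[::-1]
        css = np.cumsum(u) - 1.0
        idx = np.arange(1, len(v) + 1)
        rho = np.nonzero(u - css / idx > 0)[0][-1]
        return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

    def project_soc(beta, gamma, c):
        """Projection onto the second-order cone {(beta, gamma): c*||beta||_2 <= gamma}."""
        r = np.linalg.norm(beta)
        if c * r <= gamma:
            return beta, gamma
        if r <= -c * gamma:
            return np.zeros_like(beta), 0.0
        t = (r + c * gamma) / (1.0 + c * c)   # closest boundary point has ||beta|| = t
        return (t / max(r, 1e-12)) * beta, c * t

    def solve_fy_dro(X, y_idx, grad_Lambda, eps, kappa, c_S,
                     n_iter=5000, eta=0.02, eta_dual=0.1):
        """Two-time-scale projected extragradient for the saddle point (60)."""
        N, K, d = X.shape
        beta, gamma = np.zeros(d), 1.0
        p = np.full((N, K), 1.0 / K)          # dual prediction distributions p_n
        q = np.full((N, K), 1.0 / K)          # adversarial label distributions q_n
        M = 1.0 - np.eye(K)[y_idx]            # M[n, i] = 1 if e_i != y_n

        def grads(beta, gamma, p, q):
            V = X @ beta                                     # (N, K) utilities
            g_beta = np.einsum('nki,nk->i', X, p - q) / N
            g_gamma = eps - kappa * np.sum(q * M) / N
            g_p = (V - np.apply_along_axis(grad_Lambda, 1, p)) / N
            g_q = (-V - gamma * kappa * M) / N
            return g_beta, g_gamma, g_p, g_q

        proj = lambda Z: np.apply_along_axis(project_simplex, 1, Z)
        for _ in range(n_iter):
            gb, gg, gp, gq = grads(beta, gamma, p, q)        # extrapolation step
            bh, gh = project_soc(beta - eta * gb, gamma - eta * gg, c_S)
            ph, qh = proj(p + eta_dual * gp), proj(q + eta_dual * gq)
            gb, gg, gp, gq = grads(bh, gh, ph, qh)           # update step
            beta, gamma = project_soc(beta - eta * gb, gamma - eta * gg, c_S)
            p, q = proj(p + eta_dual * gp), proj(q + eta_dual * gq)
        return beta, gamma

Setting eta_dual larger than eta (or running several dual updates per primal update) instantiates the two-time-scale refinement, which damps the oscillatory behavior discussed above.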
4.3 Equivalences of the Tractable Reformulation in Two Limiting Cases

Recall that the basic (non-robust) Fenchel-Young estimator is obtained by minimizing the empirical Fenchel-Young loss:
\[
\min_{\beta} \; J_{\mathrm{FY}}(\beta) := \frac{1}{N} \sum_{n=1}^{N} \ell(\beta; x_n, y_n). \tag{62}
\]
In contrast, utilizing the tractable reformulation derived in Theorem 4, the Wasserstein DRO problem is approximated by the finite-dimensional dual problem (57). Critically, this formulation relies on the global Lipschitz upper bound c_S (derived from the PUM structure and the Mahalanobis metric) to replace the intractable local gradient constraints of the exact formulation. For any fixed pair (β, γ), the optimal slack variables are given by
\[
s_n^\star(\beta, \gamma) = \max\Big\{ \ell(\beta; x_n, y_n), \; \max_{e_i \neq y_n} \big( \ell(\beta; x_n, e_i) - \gamma \kappa \big) \Big\}, \tag{63}
\]
where the inner maximization is taken over all alternative choice vectors e_i distinct from the observed choice y_n. Substituting s*_n into (57), the tractable DRO problem can be written purely in terms of (β, γ) as
\[
\min_{\beta, \gamma} \; \gamma \epsilon + \frac{1}{N} \sum_{n=1}^{N} \tilde{\ell}_{\mathrm{DRO}}(\beta, \gamma; x_n, y_n) \quad \text{s.t.} \quad \gamma \geq c_S \|\beta\|_2, \tag{64}
\]
where the robustified per-sample loss is defined as
\[
\tilde{\ell}_{\mathrm{DRO}}(\beta, \gamma; x_n, y_n) := \max\Big\{ \ell(\beta; x_n, y_n), \; \max_{e_i \neq y_n} \big( \ell(\beta; x_n, e_i) - \gamma \kappa \big) \Big\}. \tag{65}
\]
Therefore, the Wasserstein-DRO estimator can be interpreted as a twofold modification of the basic Fenchel-Young formulation (62). First, the empirical FY loss for each observation is replaced by the worst-case loss over adversarial label perturbations, penalized by γκ; this inflates the objective whenever alternative choice vectors (e_i) can explain the data almost as well as the observed vector (y_n). Second, the constraint γ ≥ c_S‖β‖₂ together with the term γϵ induces an implicit norm-regularization on β, whose strength is controlled by the Wasserstein radius ϵ. In this sense, the DRO formulation is equivalent to minimizing a robustified Fenchel-Young loss plus a distributional-uncertainty-driven penalty on the magnitude of the utility coefficients.

4.3.1 Limiting Case: The Regularization Equivalence

We now turn our attention to the specific asymptotic regime where the penalty for label flipping becomes prohibitively large, i.e., κ → ∞. This regime corresponds to a modeling scenario where the analyst places absolute trust in the observed discrete choices (represented by one-hot vectors y_n) but acknowledges potential measurement errors or unobserved heterogeneity in the attributes (features x_n).

As κ → ∞, the adversarial term ℓ(β; x_n, e_i) − γκ tends to −∞ for any finite dual variable γ > 0 and any alternative choice vector e_i ≠ y_n. Consequently, the maximization over the alternative choices becomes inactive, as the adversary cannot afford the infinite cost of falsifying the observed choice. Mathematically, the robustified loss function collapses to the nominal loss:
\[
\lim_{\kappa \to \infty} \tilde{\ell}_{\mathrm{DRO}}(\beta, \gamma; x_n, y_n) = \ell(\beta; x_n, y_n). \tag{66}
\]
In this limit, the auxiliary variables s_n are constrained solely by the nominal Fenchel-Young loss. The optimization problem (64) simplifies to minimizing γϵ + (1/N)Σ ℓ(β; x_n, y_n) subject to the gradient norm constraint γ ≥ c_S‖β‖₂. Since the objective function is monotone increasing in γ, and ϵ represents a positive uncertainty budget, the optimal dual variable γ* must bind at its lower bound to minimize the objective. Thus, at optimality, we have:
\[
\gamma^\star = c_S \|\beta\|_2 . \tag{67}
\]
Substituting this optimal γ* back into the objective function yields a reduced-form estimator, which is presented in the following theorem.

Theorem 5 (Limiting Equivalence to ℓ2-Regularization). Consider the Wasserstein Distributionally Robust Optimization problem formulated in (64) with the hybrid transport cost. In the asymptotic regime where the label mismatch penalty κ → ∞, the robust estimator is characterized as the solution to the following ℓ2-regularized Empirical Risk Minimization problem:
\[
\hat{\beta}_{\kappa \to \infty} \in \arg\min_{\beta} \Bigg\{ \underbrace{\frac{1}{N} \sum_{n=1}^{N} \ell_{\mathrm{FY}}(\beta; x_n, y_n)}_{\text{Empirical Risk}} + \underbrace{\lambda_{\mathrm{reg}} \|\beta\|_2}_{\text{Regularization}} \Bigg\}, \tag{68}
\]
where the regularization coefficient is explicitly determined by the ambiguity set radius ϵ and the geometric Lipschitz constant c_S via the relation λ_reg := ϵ c_S.

This theorem elucidates a fundamental duality implied by the tractable DRO formulation: robustness against feature perturbations (under the Lipschitz approximation) is mathematically equivalent to norm-regularization in the parameter space. The term ϵ controls the radius of the uncertainty ball in the feature space, which translates directly into the strength of the shrinkage penalty λ_reg applied to the utility parameters.

Furthermore, the presence of the constant c_S (derived from the Mahalanobis metric) highlights the adaptive nature of this regularization. Unlike standard Ridge regression, which applies isotropic shrinkage, the DRO-derived regularization is geometry-aware: it penalizes the parameter magnitude more heavily in directions where the data variance (and thus the potential for adversarial perturbation) is larger. In summary, the general Wasserstein-DRO formulation proposed in this paper encapsulates the standard regularized estimator as a special limiting case, which resonates with the results of Blanchet and Murthy (2019). However, by allowing for a finite κ, our formulation provides a richer class of estimators that simultaneously perform parameter shrinkage (via γ) to combat high variance and loss truncation (via κ) to robustify against outliers and perfect separation.
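For the MNL surplus Ω(V) = log Σ_i exp(V_i), the limiting estimator (68) can be sketched in a few lines. Using scipy's L-BFGS on the penalty λ‖β‖₂ (nonsmooth at β = 0, hence the small random initialization) is a pragmatic simplification on our part, not part of the theorem.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import logsumexp

    def fy_l2_objective(beta, X, y_idx, lam):
        """Empirical FY risk for the MNL surplus Omega(V) = logsumexp(V),
        plus the norm penalty lam * ||beta||_2 of Theorem 5 (lam = eps * c_S)."""
        V = X @ beta                                           # (N, K) utilities
        risk = logsumexp(V, axis=1) - V[np.arange(len(y_idx)), y_idx]
        return risk.mean() + lam * np.linalg.norm(beta)

    def fit_fy_l2(X, y_idx, lam, seed=0):
        d = X.shape[2]
        x0 = 0.01 * np.random.default_rng(seed).standard_normal(d)
        res = minimize(fy_l2_objective, x0, args=(X, y_idx, lam),
                       method="L-BFGS-B")
        return res.x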
4.3.2 Limiting Case: The Hinge Loss Equivalence

In this limiting instance, we consider the regime where both the label mismatch penalty κ and the Wasserstein ambiguity radius ϵ approach zero simultaneously (κ → 0, ϵ → 0), while their magnitudes remain comparable. Under this asymptotic condition, the geometric structure of the ambiguity set B_ϵ(P̂_N) undergoes a fundamental transformation with respect to the hybrid transport cost. Since the radius ϵ becomes infinitesimal, the allowable budget for transporting probability mass across the continuous feature space effectively vanishes. Consequently, the decision-maker implicitly assumes near-perfect reliability of the observed attributes x, as the cost of perturbing features exceeds the tight uncertainty budget. The effective coverage of the Wasserstein ball is thus restricted almost exclusively to the discrete label space. In this regime, the robustness mechanism decouples from the feature geometry (eliminating the parameter shrinkage effect linked to ‖β‖₂) and focuses entirely on the discrete adversarial perturbations.

Based on the robustified loss formulation in Eq. (64), as both ϵ and κ tend to zero, a non-trivial equilibrium requires the dual variable γ to grow sufficiently large to counterbalance the vanishing penalty κ. Specifically, for the adversarial term ℓ(β; x_n, e_i) − γκ to remain active and meaningful, the product γκ must converge to a finite constant. This implies γ → ∞, rendering the gradient norm constraint γ ≥ c_S‖β‖₂ asymptotically inactive (slack), thereby decoupling the dual variable from the parameter magnitude. Defining the effective hinge threshold τ = γκ, the regularization term transforms as γϵ = τ(ϵ/κ). Consequently, the DRO problem (64) can be reformulated as a hierarchical two-level optimization: an outer optimization over the global threshold τ (weighted by the ratio ϵ/κ), and an inner unconstrained minimization over β using a pure truncated loss function:
\[
\min_{\tau \geq 0} \left\{ \frac{\epsilon}{\kappa} \tau + \min_{\beta} \left[ \frac{1}{N} \sum_{n=1}^{N} \max\Big( \ell(\beta; x_n, y_n), \; \max_{e_i \neq y_n} \big\{ \ell(\beta; x_n, e_i) - \tau \big\} \Big) \right] \right\}. \tag{69}
\]
The inner minimization problem functions as an adversarial margin enforcement mechanism. Governed by the pointwise maximum operator max(ℓ_real, ℓ_adv − τ), the optimization objective dynamically adapts to the classification quality of each sample. For "easy" samples where the model predicts the true label with high confidence—resulting in a negligible nominal loss ℓ(β; x_n, y_n) but a large adversarial loss—the objective switches to the adversarial component max_{e_i ≠ y_n}{ℓ(β; x_n, e_i) − τ}. This prevents the optimizer from settling for mere correctness; instead, it compels the model to further suppress the utility of competitive incorrect classes, thereby enforcing a robust safety margin τ. Conversely, for "hard" samples or outliers where the model misclassifies the data (yielding a large nominal loss), the operator retains the nominal loss to prioritize error correction.
Mathematically, this mirrors a generalized multi-class Hinge loss, ensuring the decision boundary is regularized by strictly separated confidence intervals while maintaining sensitivity to significant errors. The outer optimization over the scalar variable τ effectively searches for the globally optimal safety margin. By minimizing the aggregate objective, this step balances the empirical margin violations from the inner problem against the linear penalty weighted by the ratio ϵ/κ. This structural insight naturally leads to the following theorem, which formally characterizes the estimator in this limiting regime.

Theorem 6 (Limiting Equivalence to Margin-Optimized Hinge Loss). Consider the asymptotic regime where the label mismatch penalty κ and the Wasserstein ambiguity radius ϵ vanish simultaneously, i.e., κ → 0 and ϵ → 0, such that their ratio converges to a finite constant ν := lim(ϵ/κ) > 0. Under this condition, the distributional robustness with respect to the hybrid transport cost degenerates into a pure label-uncertainty problem. Consequently, the Wasserstein distributionally robust estimator β̂_DRO is characterized as the solution to the following bilevel convex optimization problem:
\[
\hat{\beta}_{\mathrm{DRO}} \in \arg\min_{\beta} \left\{ \min_{\tau \geq 0} \left( \nu \tau + \frac{1}{N} \sum_{n=1}^{N} \max\Big( \ell_{\mathrm{FY}}(\beta; x_n, y_n), \; \max_{e_i \neq y_n} \big( \ell_{\mathrm{FY}}(\beta; x_n, e_i) - \tau \big) \Big) \right) \right\}. \tag{70}
\]
Here, the inner optimization minimizes a generalized multi-class Hinge loss with a dynamic safety margin τ, while the outer optimization determines the globally optimal margin τ* by balancing the empirical margin violations against the linear penalty ντ.

In practical optimization scenarios where the specific ratio ν = ϵ/κ is a priori uncertain or treated as a hyperparameter, varying ν is mathematically equivalent to sweeping across different values of the safety margin τ. Consequently, rather than solving the full bilevel optimization for a fixed ν, one can treat τ directly as a tunable hyperparameter. In this context, it suffices to solve the inner minimization problem for a range of candidate τ values, thereby generating a regularization path of estimators that enforce varying degrees of adversarial margin.
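For a fixed margin τ, the inner problem of (70) is straightforward to express directly. Below is a minimal numpy sketch for the MNL surplus; the masking trick and the function names are our illustrative choices.

    import numpy as np
    from scipy.special import logsumexp

    def fy_margin_objective(beta, X, y_idx, tau):
        """Generalized multi-class hinge objective of Theorem 6 (inner problem):
        mean_n max( l(y_n), max_{i != y_n} l(e_i) - tau ), with l the FY loss."""
        V = X @ beta                                   # (N, K) utilities
        losses = logsumexp(V, axis=1)[:, None] - V     # l(beta; x_n, e_i) for all i
        nominal = losses[np.arange(len(y_idx)), y_idx]
        mask = np.eye(V.shape[1], dtype=bool)[y_idx]   # True at the observed class
        adversarial = np.where(mask, -np.inf, losses - tau).max(axis=1)
        return np.maximum(nominal, adversarial).mean()

Sweeping τ over a grid and minimizing this objective for each value traces out exactly the regularization path described above.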
4.4 Determining the Wasserstein Radius

In this subsection, we establish a theoretical guideline for selecting the Wasserstein radius ϵ, the critical hyperparameter in our distributionally robust framework. To isolate the impact of distributional uncertainty in the feature space, we focus on the regime where the label flipping penalty is prohibitively large (i.e., κ → ∞). As established in Theorem 5, the DRO formulation in this limit is equivalent to the standard regularized Fenchel-Young estimator:
\[
\hat{\beta}_\lambda = \arg\min_{\beta} \big\{ J(\beta) + \lambda \|\beta\|_2 \big\}, \tag{71}
\]
where J(β) := (1/N)Σ_n ℓ_FY(β; x_n, y_n) is the empirical risk, and the regularization coefficient is directly proportional to the radius: λ = ϵ c_S.

We first analyze the geometric sensitivity of the estimator to the regularization strength. The following lemma demonstrates that, under standard regularity conditions, the displacement of the estimator from the nominal solution is locally linear with respect to the regularization coefficient (and consequently, the Wasserstein radius).

Lemma 3 (Linear Sensitivity of the Regularized Estimator). Let β̂_0 be the minimizer of the unregularized empirical risk J(β) and assume β̂_0 ≠ 0. Assume that J is twice continuously differentiable in a neighborhood of β̂_0 and that the empirical Hessian matrix H_N := ∇²J(β̂_0) is positive definite. Then, for a sufficiently small regularization coefficient λ > 0, the distance between the robust estimator β̂_λ and the nominal estimator β̂_0 satisfies
\[
\| \hat{\beta}_\lambda - \hat{\beta}_0 \|_2 \approx \lambda \cdot \big\| H_N^{-1} u \big\|_2 , \tag{72}
\]
where u = β̂_0/‖β̂_0‖₂ is the unit direction vector of the parameters. Consequently, the estimation bias induced by the Wasserstein constraint is approximately proportional to the radius ϵ:
\[
\| \hat{\beta}_\epsilon - \hat{\beta}_0 \|_2 \propto \epsilon . \tag{73}
\]

Proof. The first-order optimality condition for the regularized problem is given by
\[
\nabla J(\hat{\beta}_\lambda) + \lambda \frac{\hat{\beta}_\lambda}{\|\hat{\beta}_\lambda\|_2} = 0 . \tag{74}
\]
Consider the first-order Taylor expansion of the gradient ∇J around the unregularized minimizer β̂_0. Noting that ∇J(β̂_0) = 0, we have
\[
\nabla J(\hat{\beta}_\lambda) \approx H_N (\hat{\beta}_\lambda - \hat{\beta}_0). \tag{75}
\]
Substituting this into the optimality condition yields
\[
H_N (\hat{\beta}_\lambda - \hat{\beta}_0) \approx -\lambda \frac{\hat{\beta}_\lambda}{\|\hat{\beta}_\lambda\|_2}. \tag{76}
\]
For small λ, the direction of the estimator does not change abruptly, so we approximate β̂_λ/‖β̂_λ‖₂ ≈ β̂_0/‖β̂_0‖₂ =: u. Solving for the displacement gives
\[
\hat{\beta}_\lambda - \hat{\beta}_0 \approx -\lambda H_N^{-1} u . \tag{77}
\]
Taking the Euclidean norm on both sides completes the proof.

The lemma above establishes that the Wasserstein radius ϵ acts as a linear control knob for the magnitude of the estimator's displacement from the empirical minimizer. The remaining question is: how much displacement is optimal? From a statistical learning perspective, the goal of regularization is to counteract the variance of the estimator inherent to finite sampling. The regularization strength should therefore be calibrated to the magnitude of the natural estimation error. If ϵ is too small, it fails to stabilize the high-variance components; if ϵ is too large, it introduces excessive bias that overwhelms the true signal. Based on the asymptotic normality of Fenchel-Young estimators (Theorem 2), we derive the optimal scaling law for the Wasserstein radius with respect to the sample size N and the parameter dimension d.

Proposition 5 (Optimal Scaling of the Wasserstein Radius). Consider a well-specified PUM with a true parameter vector β* ∈ R^d. Let β̂_0 denote the unregularized empirical estimator based on N independent samples. Under the regularity conditions of Theorem 2, the expected Euclidean estimation error due to variance scales as O(√(d/N)). To effectively mitigate this finite-sample error (variance) while minimizing the introduction of systematic bias, the optimal Wasserstein radius ϵ should be chosen proportional to the inverse Signal-to-Noise Ratio:
\[
\epsilon \propto \frac{1}{\|\beta^*\|_2} \sqrt{\frac{d}{N}} . \tag{78}
\]

Proof. The derivation relies on matching the regularization strength to the optimal bias-variance trade-off characterizing the asymptotic regime. First, consider the variance. According to Theorem 2, the unregularized estimator exhibits asymptotic normality, implying that the estimation error due to finite-sample variance scales as
\[
\mathbb{E}\big[ \| \hat{\beta}_0 - \beta^* \|_2 \big] \sim O\left( \sqrt{\frac{d}{N}} \right). \tag{79}
\]
Second, consider the bias. Lemma 3 establishes that the Wasserstein radius ϵ induces a linear displacement (bias) of magnitude ‖β̂_ϵ − β̂_0‖₂ ∝ ϵ. To achieve statistical optimality (i.e., minimal Mean Squared Error), this induced bias must be calibrated to counteract the variance effectively.
We draw an analogy to the theory of optimal shrinkage (e.g., Ridge regression or James-Stein estimation). In standard Ridge regression with a quadratic penalty λ_Ridge‖β‖₂², the optimal regularization parameter is known to scale inversely with the signal energy: λ*_Ridge ∝ d/‖β*‖₂². The effective "shrinkage force" exerted by this optimal quadratic penalty on the parameter vector is given by the gradient magnitude:
\[
\big\| \nabla \big( \lambda^\star_{\mathrm{Ridge}} \|\beta\|_2^2 \big) \big\| = 2 \lambda^\star_{\mathrm{Ridge}} \|\beta\|_2 \propto \frac{1}{\|\beta^*\|_2} .
\]
In our limiting DRO formulation, the penalty is linear in the norm (λ‖β‖₂), exerting a constant shrinkage force of magnitude λ = ϵc_S. Equating the DRO shrinkage force to the optimal Ridge shrinkage force yields the scaling law for the regularization coefficient:
\[
\epsilon c_S \propto \frac{1}{\|\beta^*\|_2} . \tag{80}
\]
Finally, incorporating the sample-size dependence of the noise level (which dictates the scale of the required correction), the optimal radius must decay with the accumulation of evidence. Following standard concentration results for Wasserstein balls, which suggest a radius scaling of O(√(d/N)) to cover the true distribution, and modulating this by the signal-to-noise intensity derived above, we arrive at the unified scaling law:
\[
\epsilon \propto \frac{1}{\|\beta^*\|_2} \sqrt{\frac{d}{N}} . \tag{81}
\]

This proposition refines the practical guidelines for hyperparameter tuning by explicitly incorporating the model signal strength. While the radius should decay with the square root of the sample size (√N) and grow with model complexity (√d), it is inversely related to the magnitude of the utility parameters (‖β*‖₂). Intuitively, this reflects a signal-to-noise adaptation: when parameters are small (weak signal), the empirical risk landscape is flat and susceptible to noise, requiring a larger Wasserstein radius (stronger robustness) to stabilize the estimator. Conversely, when parameters are large (strong signal), the choice probabilities are distinct, and the estimator naturally requires less regularization.
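As a compact illustration of Proposition 5, the helper below turns the scaling law (81) into a concrete radius choice. Both the proportionality constant c0 and the substitution of a pilot estimate for the unknown ‖β*‖₂ are assumptions of this sketch, not prescriptions of the proposition.

    import numpy as np

    def wasserstein_radius(beta_pilot, d, N, c0=1.0):
        """Radius choice eps = c0 * sqrt(d/N) / ||beta||_2 from Proposition 5.

        beta_pilot : a pilot (e.g., unregularized) estimate standing in for the
                     unknown true parameter beta* -- an assumption of this sketch.
        c0         : problem-dependent proportionality constant to be calibrated.
        """
        return c0 * np.sqrt(d / N) / np.linalg.norm(beta_pilot)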
5 Numerical Studies

In this section, we present a comprehensive series of numerical experiments designed to empirically validate the theoretical framework established in the preceding sections. The primary objective is to demonstrate the practical efficacy and computational behavior of the proposed estimation methods through specific computational examples. The section is organized into three distinct parts. The first subsection tests the convergence performance of the proposed algorithms. The second subsection utilizes synthetic datasets generated under controlled conditions to systematically investigate the finite-sample performance of the Fenchel-Young estimator. In this context, we place particular emphasis on the Wasserstein DRO formulation, analyzing its numerical properties and verifying its capacity to mitigate finite-sample bias as predicted by theory. In the third subsection, we transition to an empirical application using the Swissmetro dataset, a canonical benchmark in discrete choice analysis. This real-world case study allows us to rigorously test the proposed robust models against standard specifications, assessing their estimation stability and predictive accuracy in a practical transportation planning scenario.

5.1 Testing the Solution Algorithms

In this subsection, we evaluate the computational performance and convergence properties of the proposed solution algorithms for PUM estimation featuring non-separable perturbation functions. Specifically, we test the optimization algorithms designed for both the standard Fenchel-Young estimator and the tractable saddle-point reformulation of the Wasserstein DRO estimator, which employs the Projected Extragradient (PEG) method. To assess the algorithms beyond the standard additively separable cases, we construct a non-separable quadratic regularizer. This is operationalized by defining a strictly positive definite matrix Q ∈ R^{m×m}, generated as Q = K⊤K + I_m, where K is a random matrix with entries drawn from a standard normal distribution, and I_m is the identity matrix. The corresponding perturbation function takes the form Λ(p) = (μ/2) p⊤Qp, introducing complex cross-alternative correlations that challenge the optimization process.
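This test regularizer is easy to reproduce; the following sketch mirrors the construction Q = K⊤K + I_m described above (the function name and defaults are illustrative).

    import numpy as np

    def make_nonseparable_perturbation(m, mu=1.0, seed=0):
        """Build Q = K^T K + I_m (strictly positive definite) and return the
        perturbation Lambda(p) = (mu/2) p^T Q p together with its gradient."""
        rng = np.random.default_rng(seed)
        K = rng.standard_normal((m, m))      # random matrix K, as in the text
        Q = K.T @ K + np.eye(m)
        Lam = lambda p: 0.5 * mu * p @ Q @ p
        Lam_grad = lambda p: mu * (Q @ p)
        return Lam, Lam_grad

The returned gradient can be passed directly as the grad_Lambda callable of the extragradient sketch in Section 4.2.4.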
Figure 2: Convergence performance of the Projected Extragradient algorithm for the standard Fenchel-Young estimator under a non-separable quadratic perturbation. Panel (a) illustrates the rapid stabilization of the parameter error in Euclidean distance. Panel (b) demonstrates the monotonic decrease of the KKT residual on a logarithmic scale. The algorithm achieves linear convergence in this scenario.

Figure 2 illustrates the convergence behavior of the Fenchel-Young estimator under the Projected Extragradient algorithm. The numerical experiment is conducted on a large-scale problem featuring a 10-dimensional parameter space, 10 discrete alternatives, and 5,000 observations. Remarkably, the algorithm completes 500 iterations in merely 0.77 seconds, demonstrating the exceptional computational efficiency and stability of the PEG method for such problem scales. To quantitatively assess optimality, the KKT residual is computed as the norm of the gradient mapping, i.e., the Euclidean distance between the current primal-dual variables and their projected states following a gradient step, normalized by the respective step sizes. Panel (b) confirms the strict monotonic decrease of this KKT residual, validating the theoretical convergence of the saddle-point formulation. However, as observed in Panel (a), the parameter error relative to the true data-generating parameters is not strictly monotonically decreasing. This phenomenon occurs because the empirical risk minimizer, derived from a finite sample of 5,000 observations, inherently deviates from the true population parameters due to sampling noise. Consequently, the optimization trajectory may temporarily traverse regions in the parameter space that are closer to the true values before ultimately settling at the empirical optimum.

Figure 3: Convergence performance of the solution algorithm for the Wasserstein DRO estimator. Panel (a) illustrates the smooth stabilization of the parameter error. Panel (b) demonstrates the oscillatory decrease of the KKT residual on a logarithmic scale over 5,000 iterations.

Figure 3 illustrates the convergence behavior of the Wasserstein DRO estimator. This numerical experiment is conducted in a data-scarce regime, featuring a 10-dimensional parameter space, 10 discrete alternatives, and a limited sample size of only 50 observations. The algorithm completes 5,000 iterations in just 1.26 seconds. Notably, the bilinear saddle-point formulation introduces the dual variable γ and the adversarial label distributions q∘. The purely linear coupling of these variables can cause severe oscillatory "bang-bang" behavior and destabilize convergence under certain parameter configurations. To address this inherent instability in the min-max game, the algorithm employs the two-time-scale strategy: by setting a larger step size for the inner maximization variables and performing multiple inner iterations per primal update, the adversarial distribution is forced to sufficiently approximate the local worst-case scenario.

5.2 Experiments on the Synthetic Dataset

To evaluate the finite-sample performance and robustness properties of the proposed estimators, we construct a synthetic experimental environment governed by known PUM ground truths. The data generation process is specifically designed to simulate realistic decision-making scenarios characterized by inherent trade-offs rather than obvious dominance. For a set of m alternatives and k attributes, we first generate baseline feature vectors. Individual heterogeneity and measurement errors are introduced by superimposing independent Gaussian noise onto these baseline features for each of the N decision-makers. The choice outcomes y_n are sampled according to the probabilities derived from the linear systematic utilities V_ni = β⊤x_ni. To rigorously stress-test the estimators, we deliberately focus on a data-scarce, high-variance regime (e.g., N = 40 with a low norm for the true parameters). This challenging setting serves to highlight the instability of standard Fenchel-Young estimation and demonstrate the stabilizing effect of the Wasserstein-DRO framework. In the subsequent analysis, we operationalize the two limiting regimes of the DRO framework by denoting the hyperparameter for the ℓ2-regularized estimator as λ_reg, and the penalty parameter corresponding to the Hinge loss estimator as λ_flip.

5.2.1 The ℓ2-Regularized Estimator

First, we evaluate the performance of the ℓ2-regularized Fenchel-Young estimator, which corresponds to the limiting regime of the Wasserstein robust framework where the label mismatch penalty κ → ∞. We set the data generation process to follow a standard MNL specification. In our evaluation, we first validate the derived scaling law, then benchmark the performance and sample-size sensitivity of the proposed estimator against the standard Fenchel-Young estimator, and finally evaluate the general performance of the two estimators.

First, we validate the theoretical scaling law for the regularization parameter derived in Proposition 5. We propose the following empirical formula for the regularization coefficient:
\[
\lambda_{\mathrm{pred}} = 0.13 \, \frac{\sqrt{d/N}}{\|\beta^*\|_2} . \tag{82}
\]
In our experiment, to ensure robustness, we first generate 100 test cases by randomly varying the sample size N ∈ [20, 200], the feature dimension d, and the signal strength ‖β*‖₂. For each generated instance, to assess the efficacy of this heuristic, we perform an exhaustive line search to identify the "oracle" regularization parameter λ_opt that minimizes the mean squared error (MSE), where the MSE is evaluated by Monte Carlo estimation over 100,000 randomly generated problem instances.
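For reference, the heuristic (82) amounts to the following one-line rule; substituting a pilot estimate for the unknown ‖β*‖₂ in practice is an assumption of this sketch.

    import numpy as np

    def lambda_reg_scaling(beta_norm, d, N):
        """Empirical scaling law (82): lambda_pred = 0.13 * sqrt(d/N) / ||beta*||_2."""
        return 0.13 * np.sqrt(d / N) / beta_norm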
Figure 4 compares the predicted parameters against the optimal ones found via line search. The scatter plot reveals a strong log-linear correlation, with data points clustering tightly along the y = x identity line. Quantitatively, using this closed-form heuristic yields an estimator whose MSE is, on average, only approximately 23% higher than the theoretical minimum achievable by the oracle parameter. This confirms that the proposed scaling law provides a reliable, data-driven guideline for hyperparameter tuning, effectively capturing the optimal bias-variance trade-off without the need for computationally expensive cross-validation.

Next, we conduct a controlled Monte Carlo simulation to evaluate and benchmark the estimator's sensitivity to sample size. We fix the parameter vector and generate the data using β* = [1.0, 2.0, 0.5]⊤. We vary the sample size N from 20 to 200 and, for each sample size, conduct Monte Carlo estimation over 100,000 randomly generated problem instances. The regularized estimator uses the dynamic scaling law validated in the previous section. The results, illustrated in Figure 5, demonstrate the superior predictive accuracy of the Wasserstein-robust framework. Crucially, the ℓ2-regularized estimator achieves a lower MSE than the original unregularized Fenchel-Young estimator across the entire range of sample sizes investigated (N ≤ 200). This performance gap is particularly pronounced in the limited-data regime, where the robust estimator significantly reduces estimation error by curbing the variance inherent to small samples. Notably, even as the sample size increases towards 200, the regularized estimator maintains a consistent advantage over the original model, indicating that the proposed scaling law effectively balances bias and variance without being overtaken by the unregularized MLE within this range.

Figure 4: Validation of the regularization scaling law. The scatter plot (on log-log axes, R² = 0.917) compares the optimal regularization parameter λ_opt obtained via exhaustive line search against the predicted value from the empirical formula. The strong alignment along the y = x line demonstrates that the formula λ_reg = 0.13√(d/N)/‖β*‖₂ yields results that are highly consistent with the optimal parameters.

Next, we evaluate and benchmark the estimators' robustness to varying parameter scales by testing the model across a diverse range of β magnitudes. To this end, we first generate 1,000 sets of parameters with β drawn uniformly from [−5, 5]³; for each set of parameters, we vary the sample size from 20 to 200 and use Monte Carlo experiments with 100,000 repetitions to obtain the estimator performance. The comparative performance of the proposed estimator is shown in Table 1.

Table 1: Robustness of the proposed ℓ2 estimator to varying parameter scales. The reported win rate represents the percentage of tested parameters for which the ℓ2 estimator achieves a lower MSE. The average MSE reduction indicates the average percentage of MSE reduced by the ℓ2 estimator compared to the standard estimator. Overall, the ℓ2 estimator is superior in all cases and can significantly reduce the MSE of the parameter estimates.

Sample size | Reg win rate | Average MSE reduction
20          | 100%         | 99.5%
60          | 100%         | 99.9%
100         | 100%         | 99.8%
140         | 100%         | 99.6%
180         | 100%         | 99.4%
Figure 5: Comparison of MSE between the estimators. Notably, when the sample size is small (N ≤ 200), the ℓ2-regularized Fenchel-Young estimator achieves a significantly lower average MSE compared to the original FY estimator, demonstrating better robustness in data-scarce regimes.

The results in Table 1 illustrate that the advantage of the ℓ2 estimator is parameter-independent and consistent. For all sample sizes and parameters tested, the ℓ2 estimator always outperforms the standard FY estimator. Furthermore, the MSE reduction is consistent, as the ℓ2 estimator reduces the average MSE by over 99.4%.

5.2.2 The Hinge Loss Estimator

In parallel to the previous analysis, we now turn our attention to the pure margin-enforcing estimator. This corresponds to the regime where the parameter regularization is removed (λ_reg = 0) and the distributional robustness relies entirely on the label mismatch penalty λ_flip. The experimental setup remains identical to that of the ℓ2-regularized estimator, covering a diverse range of sample sizes N and dimensions d under an MNL specification.

A key hypothesis of our framework is that the optimal safety margin should be inversely proportional to the statistical difficulty of the problem. We specifically test the empirical scaling law:
\[
\lambda_{\mathrm{pred}} = 2.9 \left( \frac{\sqrt{d/N}}{\|\beta^*\|_2} \right)^{-1} . \tag{83}
\]
To validate this, we perform a grid search to determine the optimal λ_flip for each generated instance. The numerical experiment procedure is consistent with the setting and parameters of the scaling-law validation in Section 5.2.1. As shown in Figure 6, there is a high degree of correlation between the optimal values and those predicted by the inverse formula. This strong alignment confirms that the proposed heuristic accurately captures the relationship between data sparsity and the necessary adversarial margin.

Figure 6: Validation of the margin scaling law. The scatter plot (on log-log axes, R² = 0.736) compares the optimal label mismatch penalty λ_flip obtained via exhaustive grid search against the value predicted by the empirical inverse-complexity formula λ_pred ∝ (√(d/N)/‖β*‖₂)^{-1}. The tight clustering along the identity line (y = x) confirms that the optimal safety margin is inversely proportional to the statistical difficulty of the problem.
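Analogously to (82), the inverse law (83) can be written as a one-line rule; as before, substituting a pilot estimate for ‖β*‖₂ is an assumption of this sketch.

    import numpy as np

    def lambda_flip_scaling(beta_norm, d, N):
        """Empirical inverse scaling law (83):
        lambda_pred = 2.9 * (sqrt(d/N) / ||beta*||_2)^(-1),
        so the optimal margin grows with the signal-to-noise ratio."""
        return 2.9 * beta_norm / np.sqrt(d / N)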
Next, we benchmark the performance of the proposed Hinge estimator against the standard FY estimator. First, retaining the numerical experiment procedure used to generate Figure 5, we evaluate the relative performance using β* = [1.0, 2.0, 0.5]⊤. The corresponding MSE comparisons of the pure margin estimator are illustrated in Figure 7. Similar to the regularized case, the Hinge Loss estimator consistently achieves a significantly lower mean squared error than the original unregularized Fenchel-Young estimator across the entire observed range (N ≤ 200). This confirms that enforcing an adversarial safety margin is a highly effective strategy for robustness, capable of suppressing estimation error even in the absence of explicit parameter shrinkage.

However, a notable distinction arises in the shape of the convergence trajectory. Unlike the smooth descent observed in the ℓ2-regularized model, the MSE of the Hinge Loss estimator does not decrease monotonically with the sample size. This fluctuation is likely attributable to the nature of the hyperparameter specification. While the scaling law for λ_reg is supported by a rigorous theoretical derivation based on asymptotic normality, the inverse scaling formula for λ_flip is primarily empirical. Consequently, the predicted margin may not be locally optimal for every specific sample size, leading to minor oscillations in performance despite the overall advantage in robustness.

Figure 7: Convergence of the MSE as the sample size N increases. The Wasserstein robust estimator (green solid line), configured with a pure adversarial margin (λ_reg = 0), demonstrates superior performance in the small-sample regime (N ≤ 200) compared to the standard FY estimator (red dashed line).

Finally, we evaluate the robustness of the Hinge estimator with respect to the magnitude and shape of the ground-truth parameters. The numerical experiment procedure and parameters are the same as those of Table 1 in Section 5.2.1. The results are illustrated in Table 2.

Table 2: Robustness of the proposed Hinge estimator to varying parameter scales. The win rate indicates the frequency with which the Hinge estimator achieves a lower MSE, while the average MSE reduction measures its percentage improvement over the standard approach. In all cases, the Hinge estimator proves strictly superior, significantly diminishing the MSE of the parameter estimates.

Sample size | Hinge win rate | Average MSE reduction
20          | 100%           | 99.2%
60          | 100%           | 99.9%
100         | 100%           | 99.9%
140         | 100%           | 99.6%
180         | 100%           | 98.8%

Similar to the ℓ2 estimator, the findings in Table 2 demonstrate that the Hinge estimator's superiority is both parameter-invariant and robust across different sample sizes. In every tested scenario, the Hinge estimator systematically outperforms the standard FY benchmark. This performance edge is remarkably stable, with the proposed method consistently reducing the average MSE by more than 98.8%.

5.2.3 Comparisons between Two Wasserstein Fenchel-Young Estimators

Having analyzed the two limiting regimes of the Wasserstein distributionally robust framework separately, we now turn to a direct comparison between the ℓ2-regularized estimator (corresponding to κ → ∞) and the Hinge Loss estimator (corresponding to κ, ϵ → 0). To ensure the generality of our conclusions, we conducted extensive numerical experiments covering a wide grid of configurations, varying the sample size N, the feature dimension d, and the signal strength ‖β*‖₂. The results display a remarkably consistent pattern across virtually all tested scenarios. As exemplified in Figure 8, when both estimators are tuned to their respective optimal hyperparameters, the Hinge Loss estimator consistently yields a lower MSE than its ℓ2-regularized counterpart.
This performance gap suggests that the topology of the ambiguity set defined by the label mismatch cost is often more closely aligned with the intrinsic geometry of discrete choice errors than is the feature transport cost. We attribute this phenomenon to the different mechanisms by which the two estimators induce robustness. The ℓ2-regularization imposes a global, isotropic shrinkage on the parameter vector, effectively reducing variance by biasing all estimates towards zero regardless of the specific data structure. In contrast, the Hinge Loss estimator operates via a more refined truncation mechanism. By enforcing an adversarial margin, it specifically robustifies the model against samples that are predicted with insufficient confidence, while preventing the model from becoming over-confident on "easy" samples. This acts as a selective regularization that adapts to the hardness of individual observations, offering a more precise trade-off between bias and variance than the blunt instrument of uniform norm contraction.

Figure 8: One problem instance used to test the two robust estimators. The Hinge Loss estimator achieves a smaller MSE at its optimal hyperparameter.

5.2.4 Testing the Sparsemax PUM

In this final set of experiments, we extend our validation framework to the Sparsemax model. Unlike the Multinomial Logit model, Sparsemax induces sparse probability distributions in which low-utility alternatives are assigned exactly zero probability, providing a rigorous testbed for inherent data sparsity.

First, to evaluate the small-sample performance of the proposed estimators in the Sparsemax case, we test the estimators on β* = [1.0, 2.0, 0.5]⊤. Using this parameter, we vary the sample size from 20 to 200 and use Monte Carlo experiments with 100,000 repetitions to estimate the MSE. We benchmark four distinct estimators: the misspecified MLE (incorrectly assuming a Logit distribution), the standard FY estimator (correctly specified but unregularized), and two Wasserstein-derived formulations—the robust ℓ2 estimator (using the dynamic regularization λ_reg) and the Hinge Loss estimator (using the dynamic margin λ_flip). The comparative results are summarized in Figure 9.

Figure 9: Comparisons among four different estimators on the Sparsemax instances.

Three key observations emerge from this analysis. First, the misspecified MLE consistently yields the worst performance across all instances. This confirms that forcing a dense probability model (Logit) onto sparse data introduces a severe structural bias that cannot be overcome by sample size alone. Second, consistent with our MNL findings, both Wasserstein-derived robust estimators significantly outperform the standard unregularized FY estimator. The regularization effectively curbs the high variance associated with the non-smooth gradient transitions of the Sparsemax function in limited-data regimes. Third, in the head-to-head comparison between the two robust formulations, the Hinge Loss estimator demonstrates superior adaptability. While both methods provide substantial improvements, the Hinge estimator achieves the lowest MSE in the majority of test cases. This suggests that the adversarial margin mechanism, which focuses regularization on the decision boundaries, is geometrically more compatible with the sparse support of the Sparsemax model than the isotropic parameter shrinkage imposed by the ℓ2-norm.
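To make the sparsity mechanism concrete, the following is a minimal sketch of the Sparsemax choice kernel, i.e., the Euclidean projection of the utility vector onto the probability simplex; it is included for illustration and is not the benchmarked implementation.

    import numpy as np

    def sparsemax(v):
        """Sparsemax choice probabilities: project utilities v onto the simplex."""
        u = np.sort(v)[::-1]                 # utilities sorted in descending order
        css = np.cumsum(u) - 1.0
        k = np.arange(1, len(v) + 1)
        support = u - css / k > 0            # alternatives kept in the support
        rho = k[support][-1]
        tau = css[support][-1] / rho         # threshold shared by supported items
        return np.maximum(v - tau, 0.0)

For example, utilities [2.0, 1.5, 0.1] map to probabilities [0.75, 0.25, 0.0]: the lowest-utility alternative receives exactly zero mass, which is the sparsity the experiments above exploit.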
We also evaluate the robustness of the estimators in the Sparsemax case using a set of different parameter vectors β. The results are shown in Table 3. On average, both robust estimators reduce the MSE by more than 99.9%.

Table 3: Robustness of the proposed estimators to varying parameter scales in the Sparsemax setting. The win rate denotes the frequency with which the corresponding estimator achieves a lower MSE than the standard FY estimator, while the average MSE reduction quantifies the mean improvement in estimation accuracy. These results demonstrate both estimators' consistent superiority and their ability to significantly enhance parameter precision across all evaluated cases.

Sample size | Reg win rate | Reg MSE reduction | Hinge win rate | Hinge MSE reduction
20          | 100%         | 99.9%             | 100%           | 99.9%
60          | 100%         | 99.9%             | 100%           | 99.9%
100         | 100%         | 99.9%             | 100%           | 99.9%
140         | 100%         | 99.9%             | 100%           | 99.9%
180         | 100%         | 99.9%             | 100%           | 99.9%

5.3 Experiments on the Swissmetro Dataset

To validate the practical efficacy of the proposed estimators, we conduct an empirical study using the Swissmetro dataset, a canonical benchmark in discrete choice analysis (Bierlaire et al., 2001). Collected in 1998, this Stated Preference (SP) survey captures the decision-making behavior of commuters and business travelers along the St. Gallen-Geneva corridor. Respondents were presented with hypothetical choice situations involving three alternatives: private car, existing public train services, and the Swissmetro—a revolutionary underground maglev system operating at high speeds. The choices are driven by key attributes such as travel time, travel cost, and headway. By simulating data-scarce scenarios via subsampling from this real-world dataset, we rigorously evaluate the ability of our Wasserstein robust estimation framework to recover stable preference parameters and mitigate overfitting compared to standard Fenchel-Young estimators (equivalent to MLE in the MNL context).

In our experimental setup, we first pre-process the raw data by retaining only observations where the car alternative was available and removing invalid choice indicators, ensuring a consistent choice set. The systematic utility functions are specified according to the standard model structure defined in Bierlaire et al. (2001), incorporating attributes such as travel time, cost, frequency, and season ticket ownership (GA) across the three alternatives: Train, Swissmetro, and Car. To evaluate the estimators in a controlled data-scarce environment, we adopt a "population-proxy" approach in which the parameters estimated from the full dataset serve as the ground truth (Oracle). We then conduct a Monte Carlo simulation by randomly subsampling datasets of size N ranging from 20 to 400. For each N, we average the performance metrics over 5,000 independent trials, comparing the standard unregularized estimator (MLE) against our two Wasserstein estimation formulations: the ℓ2-regularized estimator and the Hinge Loss estimator. Crucially, the regularization hyperparameters are determined directly via our scaling laws, thereby testing the validity of our derivation without relying on cross-validation.
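The evaluation loop can be summarized by the following hedged sketch; the data container, fitting routine, and log-likelihood function are placeholders for the actual pipeline and are not defined in the paper.

    import numpy as np

    def mean_gap(data, oracle_beta, fit_fn, full_loglik, N, trials=5000, seed=0):
        """Population-proxy protocol: subsample N rows, fit an estimator, and
        average the full-data log-likelihood gap to the Oracle parameters."""
        rng = np.random.default_rng(seed)
        gaps = []
        for _ in range(trials):
            idx = rng.choice(len(data), size=N, replace=False)
            beta_hat = fit_fn(data[idx])
            gaps.append(abs(full_loglik(beta_hat, data)
                            - full_loglik(oracle_beta, data)))
        return float(np.mean(gaps))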
The empirical results presented in Figure 10 provide compelling evidence for the superiority of the Wasserstein estimation framework in data-scarce regimes. As illustrated in Panel (b), there is a dramatic divergence in estimator performance when the sample size N is small (N < 100). The standard MLE suffers from severe overfitting, exhibited by an estimation error gap exceeding 10⁴ at N = 20. This empirically confirms the "curse of limited data" discussed in Section 4.1, where the unregularized estimator chases the idiosyncratic noise of the finite sample rather than capturing the structural preferences.

In sharp contrast, both Wasserstein-derived estimators maintain tight proximity to the Oracle benchmark throughout the sampling range. The robustification significantly suppresses the variance, reducing the log-likelihood gap by nearly two orders of magnitude in the smallest samples. Notably, the Hinge Loss estimator demonstrates exceptional stability in the extreme low-sample limit, supporting the hypothesis that enforcing an adversarial margin is particularly effective when data is insufficient to define a clear decision boundary.

Figure 10: Performance comparisons on the Swissmetro dataset. (a) The evolution of the log-likelihood on the full dataset as the sample size N increases. (b) The absolute gap |ℓ(β̂) − ℓ*| relative to the Oracle estimator, plotted on a logarithmic scale. The Wasserstein FY estimators (both ℓ2 and Hinge) demonstrate faster convergence and lower error in data-scarce regimes compared to MLE.

Furthermore, as N increases, all estimators show consistent convergence, yet the robust formulations consistently outperform the MLE. Most significantly, these performance gains were realized using the closed-form scaling laws derived in our theoretical analysis (e.g., λ ∝ √(d/N)), without relying on computationally expensive cross-validation.

To rigorously assess the robustness of our framework under realistic survey conditions, we introduce a second experimental setup focusing on decision-maker-based sampling (ID sampling). Unlike the previous i.i.d. row sampling, this approach respects the panel structure of the Swissmetro dataset, where each respondent typically answers nine distinct choice situations involving micro-variations in attributes. We simulate data scarcity by randomly drawing K unique decision-makers, varying K from 6 to 40, and utilizing all associated choice observations for estimation. The results, visualized in Figure 11, corroborate our earlier findings: the Wasserstein robust estimators (both ℓ2 and Hinge) consistently outperform the standard MLE, maintaining a lower estimation error throughout the sampling range. However, two critical distinctions emerge when contrasting this cluster sampling with the previous random sampling regime.
The results, visualized in Figure 11, corroborate our earlier findings: the Wasserstein robust estimators (both ℓ2 and Hinge) consistently outperform the standard MLE, maintaining a lower estimation error throughout the sampling range. However, two critical distinctions emerge when contrasting this cluster sampling with the previous random sampling regime.

First, despite comparable total observation counts, the log-likelihood gaps |ℓ(β̂) − ℓ*| for all three estimators are markedly higher in the ID sampling case. This indicates that ID sampling exacerbates the negative effects of the small-sample regime: the high intra-personal correlation implies that the effective sample size is lower than the nominal observation count, making the empirical distribution significantly sparser and more prone to overfitting the idiosyncratic preferences of the few selected individuals.

Second, to counteract this intensified structural noise, the robust estimators required stronger regularization penalties. In this specific instance, the effective hyperparameters were scaled to λ_reg = 2 and λ_flip = 0.03, values notably stronger than those required in the i.i.d. setting. This suggests that when data is not only scarce but also clustered, the "uncertainty radius" around the empirical distribution must be expanded, necessitating a more aggressive robustification strategy to recover stable population-level parameters; an illustrative effective-sample-size calculation is sketched after Figure 11 below.

Figure 11: Performance comparisons on the Swissmetro dataset under decision-maker-based sampling. (a) The evolution of Log-Likelihood on the full dataset as the number of unique decision-makers K increases (ranging from 6 to 40). (b) The absolute gap |ℓ(β̂) − ℓ*| relative to the Oracle estimator, plotted on a logarithmic scale. The Wasserstein FY estimators (both ℓ2 and Hinge) demonstrate superior stability and faster convergence compared to MLE.
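The paper does not derive this cluster adjustment formally; the direction of the effect can nevertheless be illustrated with the standard Kish design-effect heuristic from survey statistics, in which intra-cluster correlation ρ shrinks the effective sample size, and plugging that smaller size into a √(d/N)-type scaling law inflates the penalty. The values ρ = 0.5 and d = 7 below are purely illustrative assumptions.

```python
import numpy as np

def effective_sample_size(n_obs, cluster_size, rho):
    """Kish design-effect heuristic from survey statistics:
    n_eff = n / (1 + (m - 1) * rho) for intra-cluster correlation rho
    and m observations per cluster. Used here only to illustrate why
    ID sampling behaves like a much smaller i.i.d. sample."""
    return n_obs / (1.0 + (cluster_size - 1.0) * rho)

# 6 respondents x 9 situations = 54 nominal observations.
n_eff = effective_sample_size(54, cluster_size=9, rho=0.5)
print(round(n_eff, 1))                   # ~10.8 effective observations

# Plugging n_eff instead of the nominal count into a sqrt(d/N)-type
# scaling law roughly doubles the penalty, consistent in direction with
# the stronger hyperparameters reported above.
d = 7
print(np.sqrt(d / 54), np.sqrt(d / n_eff))
```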
6 Concluding Remarks

This paper establishes a unified, statistically robust estimation framework for Perturbed Utility Models (PUMs), offering a mathematically rigorous alternative to classical Maximum Likelihood Estimation (MLE). We highlighted the structural deficiencies of the traditional logarithmic scoring rule, specifically its susceptibility to non-convexity and numerical instability when evaluating modern, sparse choice architectures like the Sparsemax model. To resolve these fundamental pathologies, we introduced the Fenchel-Young estimator, which serves as a geometrically natural objective tailored to PUMs. By directly aligning the estimation loss with the convex conjugate structure of the choice probability mapping, the proposed Fenchel-Young estimator guarantees global convexity and bounded gradients, thereby securing computational stability even in strictly sparse regimes.

Furthermore, to overcome the pervasive challenge of data scarcity, we extended this framework through Wasserstein Distributionally Robust Optimization (DRO). This robust formulation explicitly incorporates distributional uncertainty surrounding the empirical data, successfully curtailing finite-sample bias and overfitting without compromising computational tractability. A central theoretical contribution of this extension is the rigorous unification of previously distinct regularization paradigms. Specifically, we proved that both standard ℓ2-regularization and the margin-enforcing Hinge loss naturally emerge as exact limiting cases of one tractable approximation of the Wasserstein-robust Fenchel-Young estimator.

Our theoretical claims were thoroughly validated through extensive numerical studies, encompassing both controlled synthetic environments and the empirical Swissmetro benchmark. Across these experiments, the proposed robust estimators consistently demonstrated superior performance over standard methodologies, recovering stable preferences in data-scarce and high-noise regimes where traditional estimators fail. Ultimately, this unified framework provides a cohesive, mathematically grounded toolkit for discrete choice analysis across transportation, economics, and marketing, ensuring reliable parameter inference even under severe data limitations and model sparsity.

Several promising directions for future research emerge from this work. First, our current framework treats the perturbation function Λ as a fixed structural primitive (e.g., assuming a priori an entropic or quadratic form). A natural extension would be to incorporate the perturbation mechanism itself into the estimation process: developing methods to jointly learn the systematic utility parameters and the shape of the choice kernel would allow for more flexible, data-driven representations of decision-maker uncertainty and substitution patterns. Second, while this analysis focused on standard discrete choice scenarios with explicit choice sets, extending these robust estimators to more complex, structured domains is a critical frontier. In particular, applying the Wasserstein-Fenchel-Young framework to stochastic route choice problems, where the choice set comprises combinatorial paths in a network, presents unique challenges. Adapting the convexity results and dual reformulations to handle the recursive nature of such high-dimensional decision environments represents fertile ground for future investigation.

Acknowledgment

The work described in this paper was partly supported by research grants from the National Science Foundation (CMMI-2233057).

References

Albert, A. and Anderson, J. A. (1984). On the existence of maximum likelihood estimates in logistic regression models. Biometrika, 71(1):1–10.

Anderson, S. P., de Palma, A., and Thisse, J.-F. (1992). Discrete Choice Theory of Product Differentiation. MIT Press, Cambridge, MA.

Ben-Akiva, M. E. and Lerman, S. R. (1985). Discrete Choice Analysis: Theory and Application to Travel Demand, volume 9. MIT Press.

Bierlaire, M., Axhausen, K., and Abay, G. (2001). The acceptance of modal innovation: The case of Swissmetro. In Swiss Transport Research Conference, volume 1.

Blanchet, J. and Murthy, K. (2019). Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 44(2):565–600.

Blondel, M., Martins, A., and Niculae, V. (2019). Learning classifiers with Fenchel-Young losses: Generalized entropies, margins, and algorithms. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 606–615. PMLR.

Blondel, M., Martins, A. F., and Niculae, V. (2020). Learning with Fenchel-Young losses. Journal of Machine Learning Research, 21(35):1–69.
Clark, M. D., Determann, D., Petrou, S., Moro, D., and de Bekker-Grob, E. W. (2014). Discrete choice experiments in health economics: a review of the literature. PharmacoEconomics, 32(9):883–902.

Daganzo, C. F., Bouthelier, F., and Sheffi, Y. (1977). Multinomial probit and qualitative choice: A computationally efficient algorithm. Transportation Science, 11(4):338–358.

Donoso, P. and de Grange, L. (2010). A microeconomic interpretation of the maximum entropy estimator of multinomial logit models and its equivalence to the maximum likelihood estimator. Entropy, 12(10):2077–2084.

Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80(1):27–38.

Fosgerau, M., Paulsen, M., and Rasmussen, T. K. (2022). A perturbed utility route choice model. Transportation Research Part C: Emerging Technologies, 136:103514.

Fudenberg, D., Iijima, R., and Strzalecki, T. (2015). Stochastic choice and revealed perturbed utility. Econometrica, 83(6):2371–2409.

Gabaix, X. (2014). A sparsity-based model of bounded rationality. The Quarterly Journal of Economics, 129(4):1661–1710.

Gao, R. and Kleywegt, A. (2023). Distributionally robust stochastic optimization with Wasserstein distance. Mathematics of Operations Research, 48(2):603–655.

Hausman, J. and McFadden, D. (1984). Specification tests for the multinomial logit model. Econometrica: Journal of the Econometric Society, pages 1219–1240.

Hofbauer, J. and Sandholm, W. H. (2002). On the global convergence of stochastic fictitious play. Econometrica, 70(6):2265–2294.

Kuhn, D., Esfahani, P. M., Nguyen, V. A., and Shafieezadeh-Abadeh, S. (2019). Wasserstein distributionally robust optimization: Theory and applications in machine learning. In Operations Research & Management Science in the Age of Analytics, pages 130–166. INFORMS.

Lin, T., Jin, C., and Jordan, M. I. (2025). Two-timescale gradient descent ascent algorithms for nonconvex minimax optimization. Journal of Machine Learning Research, 26(11):1–45.

Malhotra, N. K. (1984). The use of linear logit models in marketing research. Journal of Marketing Research, 21(1):20–31.

Martins, A. and Astudillo, R. (2016). From softmax to sparsemax: A sparse model of attention and multi-label classification. In International Conference on Machine Learning, pages 1614–1623. PMLR.

McFadden, D. (1972). Conditional logit analysis of qualitative choice behavior.

McFadden, D. (1974). The measurement of urban travel demand. Journal of Public Economics, 3(4):303–328.

McFadden, D. (1981). Econometric models of probabilistic choice. Structural Analysis of Discrete Data with Econometric Applications, pages 198–272.

McFadden, D. L. and Fosgerau, M. (2012). A theory of the perturbed consumer with general budgets. Technical report, National Bureau of Economic Research.

Mohajerin Esfahani, P. and Kuhn, D. (2018). Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1):115–166.

Mokhtari, A., Ozdaglar, A. E., and Pattathil, S. (2020). Convergence rate of O(1/k) for optimistic gradient and extragradient methods in smooth convex-concave saddle point problems. SIAM Journal on Optimization, 30(4):3230–3251.

Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376.

Newey, W. K. and McFadden, D. (1994). Large sample estimation and hypothesis testing. In Handbook of Econometrics, volume 4, pages 2111–2245. Elsevier.

Ray, P. (1973). Independence of irrelevant alternatives. Econometrica: Journal of the Econometric Society, pages 987–991.
Sur, P. and Candès, E. J. (2019). A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences, 116(29):14516–14525.

Touchette, H. (2005). Legendre-Fenchel transforms in a nutshell. URL: http://www.maths.qmul.ac.uk/ht/archive/lfth2.pdf, page 25.

Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. The Annals of Mathematical Statistics, 20(4):595–601.

Wen, C.-H. and Koppelman, F. S. (2001). The generalized nested logit model. Transportation Research Part B: Methodological, 35(7):627–641.

Yao, R., Fosgerau, M., Paulsen, M., and Rasmussen, T. K. (2024). Perturbed utility stochastic traffic assignment. Transportation Science, 58(4):876–895.

A Proofs

A.1 Proof of Lemma 1

Fix (x, y). Recall that for any PUM, the Fenchel–Young loss can be written as

    ℓ_FY(β; x, y) = Ω(V(x; β)) − y⊤V(x; β),

where y ∈ {0, 1}^K is the one-hot representation of the observed choice.

We first show that Ω(·) is continuous. By definition,

    Ω(V) = sup_{p ∈ Δ} { p⊤V − Λ(p) }.

For each fixed p ∈ Δ, the map V ↦ p⊤V − Λ(p) is affine (hence continuous) in V. Since Δ is compact and Λ is continuous on Δ, the supremum over a compact set of continuous functions is continuous; thus Ω(V) is continuous in V. Because V(x; β) is affine in β, the composition β ↦ Ω(V(x; β)) is continuous. The second term β ↦ y⊤V(x; β) is linear in V and thus continuous in β. Therefore β ↦ ℓ_FY(β; x, y) = Ω(V(x; β)) − y⊤V(x; β) is continuous on B for every fixed (x, y).

We now work on a compact parameter set B and first obtain a linear growth bound for Ω(V). Since Δ is compact and Λ is continuous on Δ, there exists a finite constant c_Λ := sup_{p ∈ Δ} |Λ(p)| < ∞. Then for any V ∈ R^K,

    Ω(V) = sup_{p ∈ Δ} { p⊤V − Λ(p) } ≤ sup_{p ∈ Δ} p⊤V + c_Λ.

Because p ∈ Δ is a probability vector, we have p⊤V ≤ ‖V‖_∞, so Ω(V) ≤ ‖V‖_∞ + c_Λ. Similarly, using inf_{p ∈ Δ} Λ(p) > −∞, one obtains a matching lower bound, and thus |Ω(V)| ≤ ‖V‖_∞ + C_0 for some finite constant C_0 depending only on Λ.

Next, for each (x, y) and β ∈ B,

    ℓ_FY(β; x, y) = Ω(V(x; β)) − y⊤V(x; β),

so |ℓ_FY(β; x, y)| ≤ |Ω(V(x; β))| + |y⊤V(x; β)|. Since y is a one-hot vector (and thus y ∈ Δ), we have |y⊤V| ≤ ‖V‖_∞. Therefore

    |ℓ_FY(β; x, y)| ≤ 2‖V(x; β)‖_∞ + C_0.

Since B is compact, there exists C_β > 0 such that ‖β‖ ≤ C_β for all β ∈ B. Moreover, assuming each component of V(x; β) is of the form V_i(x; β) = β⊤x_i, we have

    ‖V(x; β)‖_∞ ≤ ‖β‖ max_i ‖x_i‖ ≤ C_β max_i ‖x_i‖ ≤ C_β ‖x‖,

for a suitable norm ‖·‖ on the feature space collecting all {x_k}. Thus we obtain

    |ℓ_FY(β; x, y)| ≤ C_1 ‖x‖ + C_2

for finite constants C_1, C_2 independent of β and depending only on B and Λ. Define the envelope M(x, y) := C_1 ‖x‖ + C_2. Under the standing moment condition E_{P⋆}[‖X‖] < ∞, the envelope M(X, Y) is integrable. By construction, |ℓ_FY(β; x, y)| ≤ M(x, y) for all β ∈ B and P⋆-almost all (x, y). This proves the domination claim.

A.2 Proof of Theorem 1

We split the proof into two main steps.

Part 1: The population risk is uniquely minimized at β⋆.
Recall the Fenchel–Young loss for a single observation (x, y):

    ℓ_FY(β; x, y) = Ω(V(x; β)) − y⊤V(x; β),

where V(x; β) ∈ R^K is the vector of systematic utilities and y ∈ {0, 1}^K is the one-hot representation of the observed choice. Denote the population (conditional) risk at covariate value x by

    L_x(β) := E_{P⋆}[ ℓ_FY(β; X, Y) | X = x ] = Ω(V(x; β)) − E_{P⋆}[Y | X = x]⊤ V(x; β).

By the well-specified assumption, there exists a true parameter β⋆ such that the conditional expectation of the observed choice matches the model probability: E_{P⋆}[Y | X = x] = p(x; β⋆) for P⋆-a.e. x, where p(x; β) = ∇_V Ω(V(x; β)) is the choice probability vector predicted by the PUM.

Fix an x in the support (we omit the x-dependence from the notation when clear). Let V := V(x; β), V⋆ := V(x; β⋆), and p⋆ := p(x; β⋆). Then the conditional risk simplifies to

    L_x(β) = Ω(V) − (p⋆)⊤V.

By the definition of the convex conjugate, Ω(V) = sup_{p ∈ Δ} { p⊤V − Λ(p) }, we have the Fenchel–Young inequality

    Ω(V) ≥ (p⋆)⊤V − Λ(p⋆) for all p⋆ ∈ Δ.

Rearranging this inequality yields a lower bound on the conditional risk:

    L_x(β) = Ω(V) − (p⋆)⊤V ≥ −Λ(p⋆).

Now evaluate the risk at the true parameter β⋆:

    L_x(β⋆) = Ω(V⋆) − (p⋆)⊤V⋆.

From the properties of convex conjugates and the identity p⋆ = ∇Ω(V⋆), the supremum in the definition of Ω is attained exactly at p⋆. Thus equality holds:

    Ω(V⋆) = (p⋆)⊤V⋆ − Λ(p⋆),

which implies L_x(β⋆) = −Λ(p⋆). Combining the inequality with the value at β⋆, we obtain

    L_x(β) ≥ L_x(β⋆) for P⋆-a.e. x and all β ∈ B.

Moreover, equality L_x(β) = L_x(β⋆) holds if and only if equality holds in the Fenchel–Young inequality, which occurs if and only if p⋆ is a (sub)gradient of Ω at V:

    p⋆ = ∇Ω(V) ⟹ p(x; β⋆) = p(x; β).

Therefore, for P⋆-almost all x,

    L_x(β) = L_x(β⋆) ⟺ p(x; β) = p(x; β⋆).

The total population risk is R(β) := E_{P⋆}[L_X(β)]. Integrating the pointwise inequality gives R(β) ≥ R(β⋆). Furthermore, if R(β) = R(β⋆), then L_x(β) = L_x(β⋆) almost surely, which implies p(x; β) = p(x; β⋆) almost surely. By the identifiability Lemma 2, this implies β = β⋆. Thus β⋆ is the unique minimizer.

Part 2: Uniform convergence and consistency.

Define the empirical FY risk as

    R_N(β) := (1/N) Σ_{n=1}^{N} ℓ_FY(β; X_n, Y_n).

By Lemma 1, the loss function is continuous on the compact set B and dominated by an integrable envelope. These conditions are sufficient to invoke the Uniform Law of Large Numbers (see, e.g., Newey and McFadden, 1994, Lemma 2.4), yielding

    sup_{β ∈ B} |R_N(β) − R(β)| → 0 in probability.

We now apply the standard consistency argument for extremum estimators. Fix ε > 0. Since B is compact and β⋆ is the unique minimizer of the continuous function R(β), the minimum of R over the complement of the ε-ball is strictly separated from the global minimum:

    δ_ε := inf_{β ∈ B : ‖β − β⋆‖ ≥ ε} R(β) − R(β⋆) > 0.

Consider the event {‖β̂_N − β⋆‖ ≥ ε}. On this event we have R(β̂_N) ≥ R(β⋆) + δ_ε. At the same time, by definition of the estimator, R_N(β̂_N) ≤ R_N(β⋆).
We can bound the population risk difference:

    R(β̂_N) − R(β⋆) = [R(β̂_N) − R_N(β̂_N)] + [R_N(β̂_N) − R_N(β⋆)] + [R_N(β⋆) − R(β⋆)]
                    ≤ |R(β̂_N) − R_N(β̂_N)| + 0 + |R_N(β⋆) − R(β⋆)|
                    ≤ 2 sup_{β ∈ B} |R_N(β) − R(β)|.

Combining these inequalities, the event ‖β̂_N − β⋆‖ ≥ ε implies that 2 sup_{β ∈ B} |R_N(β) − R(β)| ≥ δ_ε. Therefore,

    P(‖β̂_N − β⋆‖ ≥ ε) ≤ P( sup_{β ∈ B} |R_N(β) − R(β)| ≥ δ_ε / 2 ).

By uniform convergence, the probability on the right-hand side converges to 0 as N → ∞. Thus β̂_N → β⋆ in probability, completing the proof.
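Although not part of the formal argument, the Fenchel–Young mechanics invoked in both proofs can be checked numerically. The sketch below takes the entropic perturbation Λ(p) = Σ_k p_k log p_k, for which Ω is the log-sum-exp function and ∇Ω the softmax map; the test vectors are arbitrary and the check is purely illustrative.

```python
import numpy as np

def Omega(V):
    """Convex conjugate of the negative-entropy perturbation over the
    simplex: Omega(V) = sup_p { p.V - sum(p log p) } = log-sum-exp(V)."""
    m = V.max()
    return m + np.log(np.exp(V - m).sum())

def Lambda(p):
    """Entropic perturbation function (negative entropy)."""
    return float(np.sum(p * np.log(p)))

def softmax(V):
    """Gradient of Omega: the MNL choice-probability map."""
    Z = np.exp(V - V.max())
    return Z / Z.sum()

rng = np.random.default_rng(1)
V = rng.normal(size=4)

# Fenchel-Young inequality used to lower-bound the conditional risk:
# Omega(V) >= p.V - Lambda(p) for every p in the simplex.
for _ in range(5):
    p = rng.dirichlet(np.ones(4))
    assert Omega(V) >= p @ V - Lambda(p) - 1e-12

# Equality is attained exactly at p* = grad Omega(V) = softmax(V), the
# step that yields L_x(beta*) = -Lambda(p*) in Part 1 of the proof.
p_star = softmax(V)
print(abs(Omega(V) - (p_star @ V - Lambda(p_star))))  # ~0 up to rounding
```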
