Amortized Inference for Correlated Discrete Choice Models via Equivariant Neural Networks
Easton Huch*   Michael Keane†

March 2026

Abstract

Discrete choice models are fundamental tools in management science, economics, and marketing for understanding and predicting decision-making. Logit-based models are dominant in applied work, largely due to their convenient closed-form expressions for choice probabilities. However, these models entail restrictive assumptions on the stochastic utility component, constraining our ability to capture realistic and theoretically grounded choice behavior, most notably substitution patterns. In this work, we propose an amortized inference approach using a neural network emulator to approximate choice probabilities for general error distributions, including those with correlated errors. Our proposal includes a specialized neural network architecture and accompanying training procedures designed to respect the invariance properties of discrete choice models. We provide group-theoretic foundations for the architecture, including a proof of universal approximation given a minimal set of invariant features. Once trained, the emulator enables rapid likelihood evaluation and gradient computation. We use Sobolev training, augmenting the likelihood loss with a gradient-matching penalty so that the emulator learns both choice probabilities and their derivatives. We show that emulator-based maximum likelihood estimators are consistent and asymptotically normal under mild approximation conditions, and we provide sandwich standard errors that remain valid even with imperfect likelihood approximation. Simulations show significant gains over the GHK simulator in accuracy and speed.

Keywords: amortized inference; DeepSet; discrete choice; invariant theory; multinomial probit; neural network emulator; permutation equivariance; Sobolev training

* Postdoctoral Fellow, Johns Hopkins Carey Business School, ehuch@jhu.edu.
† Carey Distinguished Professor, Johns Hopkins Carey Business School, mkeane14@jhu.edu.

1 Introduction

Discrete choice models are a widely used tool in management science, economics, marketing, and other fields for understanding how individuals and organizations make decisions among finite sets of alternatives (McFadden, 1974). These models provide a structural framework that delivers interpretable parameters, such as willingness-to-pay and demand elasticities, and that also enables predictions about the effects of hypothetical interventions, including pricing changes, product introductions, and policy modifications.

The dominant discrete choice model in applied work is the multinomial logit (MNL), which assumes that the utility consumer i obtains from alternative j takes the form U_j = v_j + ϵ_j, where v_j is the deterministic component of utility, typically written as a linear function x_j^⊤ β of alternative j's attributes x_j, and ϵ_j is a Type I extreme value (Gumbel) error that is independent and identically distributed (iid) across alternatives j = 1, ..., K. This setup yields simple closed-form choice probabilities: the probability of choosing alternative j is given by the "softmax" function exp(v_j) / ∑_{k=1}^K exp(v_k). The resulting computational convenience has made MNL the default choice in countless applications.
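To fix ideas, the following minimal sketch (NumPy; the utility values are illustrative, not from the paper) computes the softmax choice probabilities and the odds-ratio property discussed next.

```python
import numpy as np

def mnl_probs(v):
    """Softmax choice probabilities for deterministic utilities v."""
    e = np.exp(v - v.max())          # subtract max for numerical stability
    return e / e.sum()

v = np.array([1.0, 0.5, -0.2])
p = mnl_probs(v)
print(p, p.sum())                    # probabilities sum to one

# The odds ratio between alternatives 1 and 2 depends only on v_1 - v_2, so
# it is unchanged when a third alternative is added or modified (IIA).
print(p[0] / p[1], np.exp(v[0] - v[1]))
```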
However, this convenience comes at a well-known cost. The iid Gumbel error assumption implies the restrictive independence of irrelevant alternatives (IIA) property: the odds ratio between any two alternatives is unaffected by the presence or attributes of other alternatives, which can lead to unrealistic substitution patterns. A popular generalization of MNL is the mixed logit model (mixed-MNL), in which the utility weight β is allowed to be heterogeneous across consumers. Estimation is straightforward via simulation methods (Train, 2009). The mixed-MNL relaxes IIA at the aggregate level, but IIA still holds at the level of consumer i's individual choices given the corresponding random utility weight, β_i.

The multinomial probit (MNP) is a well-known alternative to MNL that relaxes IIA and allows for flexible substitution patterns by assuming ϵ_1, ..., ϵ_K come from a multivariate normal distribution with cross-alternative correlations. Despite this advantage, MNP has seen relatively limited adoption, largely due to its computational demands. MNP choice probabilities have no closed form and require evaluating multivariate normal rectangle probabilities. Estimation relies on more sophisticated simulation methods than needed for mixed-MNL: the GHK simulator in a classical setting (Geweke, 1989; Hajivassiliou and McFadden, 1998; Keane, 1994) or MCMC in a Bayesian setting (McCulloch and Rossi, 1994).

We propose a fundamentally different approach: rather than simulating choice probabilities anew for each likelihood evaluation, we train a neural network emulator to directly approximate the choice probability function. This strategy, known as amortized inference, shifts the computational burden from inference time to a one-time training phase. Once trained, the emulator provides rapid deterministic approximations to choice probabilities via simple function calls. The amortized inference framework has proven highly successful in multiple scientific domains for approximating computationally intensive simulation models (Lueckmann et al., 2019; Cranmer et al., 2020). Here, we adapt it to discrete choice models, extending beyond MNL and MNP to general correlated error distributions.

In the econometrics literature, Norets (2012) employs a related strategy, approximating the expected value function in dynamic discrete choice models via a neural network. He applies this strategy in the context of a specific parametric model, so changes in the dynamic model require modifying and retraining the neural network. In contrast, our emulator operates on the generic choice problem given only the deterministic utilities and the parameters of the error distribution, enabling modifications to the parametric model (e.g., the functional form of the deterministic utilities) without altering or retraining the emulator.

Our methodology relies on several new contributions. First, we develop a neural network architecture specifically designed for choice models. It respects their fundamental invariance properties, which we formalize as symmetries (group actions) of the choice-probability map.[1] Choice probabilities are invariant to location shifts (adding a constant to all utilities) and scale transformations (multiplying utilities by a positive constant), and equivariant with respect to permutations of choice alternatives.

[1] If these properties are not embedded in the architecture, the NN must learn them during training. As a continuum of models generates equivalent choice probabilities, this would slow learning considerably.
Our architecture incorporates a preprocessing transformation that reduces the size of the feature space by enforcing location and scale invariance. The processed features are then passed to a per-alternative encoder module based on the DeepSet architecture (Zaheer et al., 2017). The output is then concatenated across alternatives and processed through equivariant layers that impose a sum-to-one constraint on the output probabilities while respecting the permutation equivariance property.

Second, we establish the theoretical foundations of our architecture. We prove that it can universally approximate choice probabilities on compact subsets of the parameter space, outside a measure-zero exceptional set. This result connects our architecture to the theory of orbit separation under group actions, extending the universal approximation results of Blum-Smith et al. (2025) on symmetric matrices to the joint space of utility vectors and centered covariance matrices. Blum-Smith et al. (2025) employ Galois theory to prove generic orbit separation, while we provide a direct proof based on an invariant reconstruction argument.

Third, we establish the statistical properties of maximum likelihood estimators (MLEs) formed using an emulator approximation of the true discrete choice probabilities. We show that if the emulator approximates the true log likelihood sufficiently well (specifically, if the average approximation error is o_p(n^{-1})), then the emulator-based estimator inherits the consistency and asymptotic normality of the exact MLE. When this condition is not met, we show that valid inference can still be obtained via sandwich standard errors, treating the emulator as a working model in the quasi-maximum likelihood framework.

By careful design, our preprocessing transformations and architecture enforce smoothness of choice probabilities with respect to the inputs: the deterministic utilities and corresponding scale (covariance) matrix. We use Sobolev training (Czarnecki et al., 2017) to match emulator gradients to those of the log choice probabilities. Together, these design choices enable reliable use of automatic differentiation for downstream model-fitting and inference tasks.

With a pretrained emulator, generalizing from logit to probit (or any other error distribution) requires only the replacement of closed-form softmax probabilities with emulator evaluations of choice probabilities when forming the likelihood. In the probit case, given a fixed computational budget, we show via simulations that an ML estimator using our amortized inference procedure matches or exceeds the performance of one using the GHK algorithm in terms of estimation error and coverage rates. Furthermore, our approach can easily handle models other than MNP where an efficient simulation algorithm like GHK is unavailable.

For concreteness, much of our exposition focuses on MNP models. However, the emulator approach and supporting theory are largely agnostic to the assumed parametric form of the errors. Generalizing to other error distributions (e.g., correlated Gumbel) is straightforward, requiring only modest changes to the training data generation and emulator inputs.
The remainder of this paper is organized as follows. Section 2 summarizes related literature on choice modeling. Section 3 describes the MNP model and our inferential goals. Section 4 presents the emulator architecture and training procedure. Section 5 establishes the theoretical properties of the architecture and emulator-based estimators. Section 6 presents simulation results, and Section 7 concludes.

2 Related Literature

The literature on applying machine learning methods to discrete choice can be divided into two broad streams. The first stream maintains the classic random utility model (RUM) of choice, in which choice probabilities are generated by a population of rational consumer types with different (but stable) preference orderings over the universe of choice objects, as explained by Block and Marschak (1960) and McFadden and Richter (1990).[2] MNL, MNP, and mixed-MNL are all members of the RUM class provided one maintains the RUM assumption that the utility of an option depends only on its own characteristics. Key papers in this strand start from this basic MNL structure and use neural networks to generalize the functional form of utility: Bentz and Merunka (2000), Sifringer et al. (2020), Wang et al. (2020), Han et al. (2022), and Singh et al. (2023). Two important recent papers extend this work: Aouad and Desir (2025) develop an architecture that implements the mixed logit model with a flexible distribution of taste heterogeneity (RUMnet), and Bagheri et al. (2025) develop another architecture that generalizes the Gumbel error assumption (RUM-NN). Both papers use softmax-smoothed sample averages to approximate choice probabilities within the loss function, resulting in increased computation time relative to pure logit-based models.

[2] Suppose there are K̄ objects in the universe and a consumer is presented with a choice set (or "assortment") that contains K ≤ K̄ elements. A key implication of RUM is that the utility a consumer derives from product k is invariant to the choice set, ruling out context or assortment effects.

Aouad and Desir (2025) build on the generic approximation property of mixed-MNL models shown in McFadden and Train (2000). This result relies on a flexible basis expansion of the utility function and a potentially nonparametric mixing distribution. In practice, these model attributes are unknown to analysts. As emphasized in both Aouad and Desir (2025) and McFadden and Train (2000), this approximation property is not unique to Gumbel errors. Rather, it holds more generally for a larger class of hierarchical discrete choice models, including mixed-MNP models. McFadden and Train (2000) use this approximation property to justify adoption of mixed-MNL as a computationally convenient alternative to other models lacking closed-form choice probabilities, such as MNP. But this justification of mixed-MNL becomes less compelling given a practical, general-purpose alternative like the emulators we propose here.

The second stream dispenses with the RUM structure, often motivated by a desire to relax restrictive MNL assumptions like IIA and to allow more flexible substitution patterns. Some papers in this stream view discrete choice as a general classification problem that is amenable to machine learning methods. This is exemplified by Wang and Ross (2018), Lhéritier et al. (2019), Rosenfeld et al. (2020), Chen and Mišić (2022), and Chen et al. (2025).
Others maintain an MNL structure at the top level, meaning that choice probabilities are determined by a vector of alternative-specific utilities that enter a softmax function, but a neural net is used to construct the alternative-specific utility functions in flexible ways that deviate from RUM assumptions. For instance, the utility of alternative j is allowed to depend on attributes of other alternatives to generate context effects. This is exemplified by Wang et al. (2021), Wong and Farooq (2021), Cai et al. (2022), Pfannschmidt et al. (2022), and Berbeglia and Venkataraman (2025).

Between these two streams, a fundamental tension emerges: while RUM models are favored for their interpretability and grounding in economic theory, the dominant RUM models in empirical work (linear MNL and mixed-MNL) enforce constraints on deterministic utilities and substitution patterns that may be too restrictive in some applications; in particular, they assume IIA at the level of individual choices. The first stream aims to address this challenge by adding flexibility to RUM models. However, many of these proposals maintain the assumption of independent logit errors, constraining substitution patterns and potentially biasing utility estimates. The exceptions, namely RUMnet and RUM-NN, address substitution, but they do so at the expense of interpretability and/or computational efficiency due to their (a) (potentially) nonlinear utility functions and (b) expensive sample-average approximations of the likelihood function. The second stream abandons the RUM structure altogether, making it difficult or impossible to obtain reliable inferences for many economically meaningful quantities, such as consumer welfare, willingness-to-pay measures, demand elasticities, and substitution effects. In summary, this tension results in a tradeoff between modeling flexibility and interpretation, ranging from rigid interpretable models like MNL on one hand to flexible atheoretic classification methods on the other.

MNP models constitute a notable exception to the above tradeoff as they allow for flexible substitution patterns via interpretable covariance relationships. Moreover, if additional flexibility is desired, they can be extended to allow the deterministic utilities to follow the functional form of a neural network (Hruschka, 2007). Conversely, if researchers desire a more parsimonious or interpretable model, analysts can constrain the covariance structure via penalization methods as in Jiang et al. (2025) or via factor structures as we illustrate in Section 6. Despite these virtues, empirical applications of MNP are relatively sparse in the literature due, in large part, to the difficulty of evaluating MNP choice probabilities.

More broadly, MNP models are just one member of a larger class of RUM models featuring correlated error terms. Analogous to MNP, other members can be generated by assuming errors of the form ϵ = Σ^{1/2} ϵ*, where Σ^{1/2} is a square-root scale matrix and ϵ* is a vector of exchangeable errors.
For example, ϵ*_j iid∼ Gumbel(0, 1) results in a correlated Gumbel distribution with scale matrix Σ. As another example, we could assume ϵ* ∼ Multivariate-t(0, I, ν) to capture heavy-tail behavior via the degrees-of-freedom parameter ν > 0 (I denotes an identity matrix). Bagheri et al. (2025) consider correlated error distributions formed in this manner, resorting to expensive simulation methods to approximate the choice probabilities because an efficient algorithm (GHK) is available only in the Gaussian case.

The primary contribution of this work is an amortized inference framework that produces accurate, reusable, and computationally efficient estimates of choice probabilities for RUM models with general error distributions, including those featuring nontrivial correlation structures. The framework presents a potential resolution to the flexibility–interpretability tradeoff highlighted above, enabling practical estimation of flexible choice models without sacrificing the economically meaningful insights and parsimony of RUM models. The framework is supported by strong theoretical justification, including a universal approximation guarantee and asymptotic inference results under mild approximation conditions (see Section 5). Moreover, the framework is largely complementary to recent advances in machine learning methods, including the two streams discussed above. In many cases, these methods could be enhanced with emulator-based likelihood evaluations, resulting in flexible models with interpretable substitution patterns and manageable computational demands.

3 Problem Setup

This section summarizes the problem setup, including the invariance properties of discrete choice models and our inferential goals.

3.1 Discrete Choice Models

We consider a decision-maker choosing among K mutually exclusive alternatives. The decision-maker assigns a latent utility U_j to each alternative j ∈ {1, ..., K} and selects the alternative with the highest utility:

    Y = arg max_{j ∈ {1,...,K}} U_j.    (1)

The latent utilities decompose into deterministic and stochastic components:

    U_j = v_j + ϵ_j,   j = 1, ..., K,    (2)

where v_j is the deterministic (or systematic) utility that depends on observable characteristics, and ϵ_j is a centered random error capturing unobserved factors. As explained in Section 1, the deterministic utility typically takes a linear form v_j = x_j^⊤ β, where x_j is a vector of alternative-specific attributes and β is a parameter vector to be estimated.

Different distributional assumptions on the error vector ϵ = (ϵ_1, ..., ϵ_K)^⊤ yield different choice models. The MNL model assumes that the ϵ_j are independent and identically distributed with the Gumbel(0, 1) distribution. In contrast, the MNP model assumes that ϵ follows a multivariate normal distribution with mean zero and covariance matrix Σ. Below, we consider correlated models in which ϵ = Σ^{1/2} ϵ*, where Σ^{1/2} is a square-root scale matrix as described in Section 2. We impose the following assumption on ϵ*.

Assumption 1. The error vector ϵ* admits a density, f_{ϵ*}: R^K → [0, ∞), with respect to Lebesgue measure. Moreover, the elements of ϵ* are exchangeable so that f_{ϵ*}(P_π ϵ′) = f_{ϵ*}(ϵ′) for all ϵ′ ∈ R^K and any permutation matrix, P_π.

This setup and assumption include the MNP as a leading example with f_{ϵ*} equal to the PDF of the N(0_K, I_K) distribution. However, the setup is general enough to include many other possibilities, such as the correlated Gumbel and multivariate-t distributions described in Section 2.
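To make the setup concrete, the following minimal sketch (NumPy; all parameter values are illustrative, not from the paper) simulates choices from the correlated-error RUM ϵ = Σ^{1/2} ϵ* for the three error families above and estimates choice probabilities by crude frequency.

```python
import numpy as np

rng = np.random.default_rng(0)

def choice_freqs(v, Sigma_half, sampler, R=200_000):
    """Crude-frequency estimate of P_j(v, Sigma): draw exchangeable eps*,
    form eps = Sigma^{1/2} eps*, and count argmax wins (error O(R^{-1/2}))."""
    U = v + sampler(R, len(v)) @ Sigma_half.T
    return np.bincount(U.argmax(axis=1), minlength=len(v)) / R

def mvt(R, K, nu=5.0):
    """Multivariate-t(0, I, nu): Gaussian draws over a shared chi-square scale."""
    return rng.standard_normal((R, K)) / np.sqrt(rng.chisquare(nu, (R, 1)) / nu)

K = 3
v = np.array([0.5, 0.0, -0.5])
A = rng.normal(size=(K, K))
Sigma_half = np.linalg.cholesky(A @ A.T)      # any square-root scale matrix

samplers = {
    "probit (MNP)":      lambda R, K: rng.standard_normal((R, K)),
    "correlated Gumbel": lambda R, K: rng.gumbel(size=(R, K)),
    "multivariate-t(5)": mvt,
}
for name, s in samplers.items():
    print(name, choice_freqs(v, Sigma_half, s))
```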
We defer treatment of additional regularity conditions until Section 5, when we develop the theoretical properties of our framework in the context of an assumed parametric model with parameter vector θ ∈ Θ ⊂ R^p.

We denote the choice probability for alternative j as

    P_j(v, Σ) = Pr(U_j ≥ U_k for all k ≠ j) = Pr(ϵ_k − ϵ_j ≤ v_j − v_k for all k ≠ j),    (3)

where v = (v_1, ..., v_K)^⊤. In practice, we may also choose to include additional parameters governing the distribution of ϵ*, such as the degrees-of-freedom parameter, ν, described in Section 2, but these parameters are suppressed in the notation for simplicity. The choice probability can be expressed as the following integral:

    P_j(v, Σ) = ∫_{R_j} f_{ϵ*}(ϵ*) dϵ*,    (4)
    R_j = { ϵ* ∈ R^K : v_j + Σ^{1/2}_{j•} ϵ* ≥ v_k + Σ^{1/2}_{k•} ϵ* for all k ≠ j },    (5)

where R_j is the region of the error space where alternative j is chosen and Σ^{1/2}_{j•} denotes the jth row of Σ^{1/2}, so that Σ^{1/2}_{j•} ϵ* = ϵ_j. In general, this integral is not analytically tractable. In MNP models, in particular, it does not have a closed-form solution for K ≥ 3, necessitating numerical methods, such as the GHK algorithm, which simulates these probabilities via recursive conditioning. For each evaluation, GHK requires R simulation draws, with the approximation error decreasing at rate O(R^{-1/2}).

3.2 Invariance Properties

Choice probabilities satisfy three fundamental invariance properties.

Location invariance. Adding a constant to all utilities does not change choice probabilities:

    P_j(v + C 1_K, Σ) = P_j(v, Σ),    (6)

where 1_K is the K-vector of ones and C ∈ R may be random. This follows because choices depend only on utility differences.

Scale invariance. Scaling all utilities and the scale matrix preserves choice probabilities:

    P_j(α v, α² Σ) = P_j(v, Σ) for all α > 0.    (7)

This follows because the scaled model is equivalent to the original model with v and ϵ replaced by α v and α ϵ, respectively.

Permutation equivariance. Relabeling alternatives permutes the choice probabilities correspondingly. For any permutation π of {1, ..., K} with associated permutation matrix P_π:

    P_j(v, Σ) = P_{π^{-1}(j)}(P_π v, P_π Σ P_π^⊤).    (8)

A naive attempt at constructing an emulator may not respect these properties. Translating, rescaling, or permuting the alternatives could produce different emulator predictions, resulting in slow learning and poor generalization because the emulator would need to learn these properties manually. In Section 4, we propose a solution that automatically satisfies these properties based on two design components: a preprocessing step that fixes the location and scale, and an equivariant architecture that ensures permutation equivariance without sacrificing expressivity.
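The three properties are easy to verify numerically. The sketch below (NumPy; illustrative values, not from the paper) checks them pathwise for a Monte Carlo probit approximation by reusing a common set of error draws.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 4
v = rng.normal(size=K)
A = rng.normal(size=(K, K)); Sigma = A @ A.T        # positive-definite scale matrix
E = rng.standard_normal((200_000, K)) @ np.linalg.cholesky(Sigma).T  # eps ~ N(0, Sigma)

def freq(v, E):
    """Crude-frequency choice probabilities given error draws E."""
    U = v + E
    return np.bincount(U.argmax(axis=1), minlength=len(v)) / len(E)

p = freq(v, E)
# Location invariance: adding a constant to all utilities changes nothing.
assert np.allclose(p, freq(v + 3.7, E))
# Scale invariance: scaling v by alpha and Sigma by alpha^2 (i.e., eps by alpha).
assert np.allclose(p, freq(2.5 * v, 2.5 * E))
# Permutation equivariance: relabeling alternatives permutes the probabilities.
perm = np.array([2, 0, 3, 1]); inv = np.argsort(perm)
assert np.allclose(p, freq(v[perm], E[:, perm])[inv])
print("all three invariance properties hold pathwise")
```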
The above invariance properties imply that the choice model is not identified without normalization. We assume throughout that the model has been appropriately normalized so that the parameter vector θ is identified. This is typically achieved in practice by setting the utility of one alternative to zero deterministically and fixing one of the nonzero diagonal elements in Σ (see Keane (1992)). Similarly, we assume that the data exhibit sufficient variation to identify the remaining parameters, which typically requires variation in alternative-specific covariates, such as prices.

3.3 Inferential Goals

Given observations {(y_i, X_i)}_{i=1}^n, where y_i ∈ {1, ..., K} is the observed choice and X_i represents the covariates, we seek to estimate the parameter vector, θ. This parameter vector includes both the coefficients affecting deterministic utilities and additional parameters governing the scale matrix. The (averaged) log-likelihood function is

    ℓ_n(θ) = (1/n) ∑_{i=1}^n log P_{y_i}{v_i(X_i, θ), Σ_i(X_i, θ)},    (9)

where v_i(X_i, θ) and Σ_i(X_i, θ) denote the utility vector and scale matrix for observation i as functions of the covariates and parameters. Maximum likelihood estimation requires maximizing ℓ_n(θ) over Θ, which in turn requires repeated evaluation of ℓ_n(θ) and its gradient ∇_θ log P_j(v, Σ). Similarly, Bayesian methods such as Hamiltonian Monte Carlo require repeated evaluations of both ℓ_n(θ) and ∇_θ log P_j(v, Σ), often thousands of times throughout the estimation process.
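As a concrete reference point, the sketch below (NumPy; a linear-utility MNP with a crude-frequency probability approximation standing in for the emulator; all values illustrative) assembles the averaged log-likelihood of eq. (9).

```python
import numpy as np

rng = np.random.default_rng(2)

def avg_loglik(beta, data, choice_probs, Sigma):
    """Averaged log-likelihood of eq. (9) for linear utilities v_ij = x_ij' beta;
    choice_probs(v, Sigma) may be any approximation (crude frequency here;
    GHK or a trained emulator in practice)."""
    ll = 0.0
    for y_i, X_i in data:
        p_i = choice_probs(X_i @ beta, Sigma)
        ll += np.log(max(p_i[y_i], 1e-12))   # guard against log(0) from MC noise
    return ll / len(data)

def mc_probs(v, Sigma, R=20_000):
    U = v + rng.standard_normal((R, len(v))) @ np.linalg.cholesky(Sigma).T
    return np.bincount(U.argmax(axis=1), minlength=len(v)) / R

K, p, n = 3, 2, 50
beta0 = np.array([0.0, 1.0])
Sigma = np.eye(K)
data = []
for _ in range(n):                            # simulate a small probit dataset
    X = rng.normal(size=(K, p))
    U = X @ beta0 + rng.multivariate_normal(np.zeros(K), Sigma)
    data.append((U.argmax(), X))

print(avg_loglik(beta0, data, mc_probs, Sigma))
```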
Our goal is to construct a neural network emulator P̂_j(v, Σ; ϕ) that provides fast, accurate approximations of the true choice probabilities, along with analytic gradients via automatic differentiation. The emulator is trained once on simulated data spanning the relevant input space, after which it can be used for rapid inference on many datasets and models.

4 Emulator Design

This section outlines the emulator preprocessing transformations, neural network architecture, training procedure, and inference process.

4.1 Design Goals

The emulator must satisfy four desiderata:

1. Respect invariance properties. The emulator should respect the invariance properties described in Section 3.2: invariance with respect to location shifts and scale transformations, and equivariance with respect to permutations.

2. Provide smooth approximations. The emulator should be differentiable with respect to its inputs, enabling gradient-based optimization and automatic differentiation.

3. Generalize across specifications. A single trained emulator should work for diverse utility and covariance structures.

4. Be computationally efficient. Emulator evaluation should be fast enough to enable routine use in estimation.

We achieve these goals through a combination of preprocessing transformations and a carefully designed neural network architecture.

4.2 Preprocessing: Centering and Scaling

We address location and scale invariance through preprocessing transformations that project the inputs onto a canonical subspace.

Centering. Define the centering matrix

    M = I_K − (1/K) 1_K 1_K^⊤,    (10)

which we employ to project the vector of utilities, U ∈ R^K, onto the subspace orthogonal to 1_K, resulting in transformed utilities Ũ = M(v + ϵ) = v + ϵ − 1_K(v̄ + ϵ̄), where v̄ = (1/K) ∑_{j=1}^K v_j and ϵ̄ = (1/K) ∑_{j=1}^K ϵ_j. These transformed utilities sum to zero within each realization: ∑_{j=1}^K Ũ_j = 0. In effect, this transformation results in modified emulator inputs:

    ṽ = M v = v − 1_K v̄,   Σ̃ = M Σ M^⊤.    (11)

Scaling. We normalize by the trace of the centered scale matrix, producing transformed utilities U* = √(K / tr(Σ̃)) Ũ. The final transformed parameters are then

    v* = √(K / tr(Σ̃)) ṽ,   Σ* = (K / tr(Σ̃)) Σ̃.    (12)

We can omit the case tr(Σ̃) = 0 as this results in known, deterministic choices.

The complete transformation produces normalized inputs (v*, Σ*) satisfying:

1. ∑_{j=1}^K v*_j = 0 (centered deterministic utilities),
2. Σ* 1_K = 0 (centered scale matrix),
3. tr(Σ*) = K (scale-normalized),
4. Σ* is positive semidefinite with rank at most K − 1.

This normalization reduces the size of the emulator's input space, accelerating learning and improving generalization. Importantly, this transformation commutes with permutation matrices: for any permutation matrix P_π, applying the transformation to (P_π v, P_π Σ P_π^⊤) yields (P_π v*, P_π Σ* P_π^⊤). This ensures that the preprocessing preserves permutation equivariance.
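A minimal implementation of the preprocessing map (NumPy; illustrative inputs), followed by checks of its claimed properties, including commutation with permutations:

```python
import numpy as np

def preprocess(v, Sigma):
    """Project (v, Sigma) onto the canonical subspace: center with
    M = I - (1/K) 11', then rescale so that tr(Sigma*) = K (eqs. 10-12)."""
    K = len(v)
    M = np.eye(K) - np.ones((K, K)) / K
    v_t = M @ v                       # centered utilities, sum to zero
    S_t = M @ Sigma @ M.T             # centered scale matrix, rows sum to zero
    c = K / np.trace(S_t)             # assumes tr > 0 (nondeterministic choice)
    return np.sqrt(c) * v_t, c * S_t

rng = np.random.default_rng(3)
K = 4
v = rng.normal(size=K)
A = rng.normal(size=(K, K)); Sigma = A @ A.T
v_s, S_s = preprocess(v, Sigma)
print(np.isclose(v_s.sum(), 0), np.allclose(S_s @ np.ones(K), 0),
      np.isclose(np.trace(S_s), K))
# The transformation commutes with permutations:
perm = np.array([1, 3, 0, 2])
v_p, S_p = preprocess(v[perm], Sigma[np.ix_(perm, perm)])
print(np.allclose(v_p, v_s[perm]), np.allclose(S_p, S_s[np.ix_(perm, perm)]))
```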
4.3 Neural Network Architecture

After preprocessing, we construct a neural network that maps (v*, Σ*) to choice probabilities while maintaining permutation equivariance. The architecture consists of three components: a per-alternative encoder, permutation-equivariant layers, and an output layer. Below, we interpret Σ as a covariance matrix to simplify the exposition, referring to its diagonal and off-diagonal elements as variances and covariances, respectively. This interpretation is correct for MNP models. For other models, it is only illustrative as these parameters may not exactly coincide with the variances and covariances.

Per-alternative encoder. For each alternative j ∈ {1, ..., K}, we construct a representation z_j ∈ R^{d_z} that captures how alternative j relates to all other alternatives. This representation is built from two complementary DeepSet networks, a diagonal DeepSet and an off-diagonal DeepSet, whose outputs are combined with alternative j's own features. The diagonal DeepSet processes pairwise relationships between j and each other alternative, while the off-diagonal DeepSet summarizes the covariance structure among alternatives other than j. In the descriptions below, the features labeled as "base inputs" are sufficient for universal approximation of choice probabilities (see Section 5.1). We include additional features to improve the expressivity of the emulator, as we find this improves its predictions.

Diagonal DeepSet. For each pair (j, k) with k ≠ j, we construct a feature vector d_{jk} containing differentiable functions of the utilities and covariances associated with alternatives j and k. The base inputs are the utilities v*_j and v*_k, the variances Σ*_{jj} and Σ*_{kk}, and the covariance Σ*_{jk}. From these, we derive additional feature transformations that facilitate learning, including standard deviations σ_j = √Σ*_{jj} and σ_k = √Σ*_{kk}, the correlation ρ_{jk} = Σ*_{jk} / (σ_j σ_k), and the standardized utility difference (z-score):

    z_{jk} = (v*_j − v*_k) / √(Σ*_{jj} + Σ*_{kk} − 2 Σ*_{jk}).    (13)

The diagonal DeepSet processes these features using the architecture of Zaheer et al. (2017):

    h^diag_j = ρ_diag( ∑_{k ≠ j} ϕ_diag(d_{jk}) ),    (14)

where ϕ_diag is a multi-layer perceptron (MLP) applied to each pair, the sum aggregates over all k ≠ j, and ρ_diag is another MLP that produces a learned nonlinear representation of the sum. We use the notation ϕ_diag and ρ_diag following Zaheer et al. (2017); the MLPs ρ_diag and ρ_off (defined below) can be distinguished from the correlation ρ_{jk} by their subscripts.

Singh et al. (2023) also employ a variant of the DeepSet architecture. However, their method assumes iid errors and operates on problem-specific features, which requires their neural network to be modified and retrained to accommodate new data structures. In contrast, our architecture relies on generic RUM model inputs, such as deterministic utilities.

Off-diagonal DeepSet. For each pair (k, l) with k < l and k, l ≠ j, we construct a feature vector o_{kl} containing differentiable, symmetric functions comparing alternatives k and l; we require symmetry because the pairs do not possess a natural ordering. The base input is Σ*_{kl}, and we include other derived features, such as the correlation ρ_{kl}, the squared utility difference (v*_k − v*_l)², the squared z-score z²_{kl}, and the variance sum Σ*_{kk} + Σ*_{ll}. The off-diagonal DeepSet has an analogous structure to that of the diagonal DeepSet:

    h^off_j = ρ_off( ∑_{k < l; k,l ≠ j} ϕ_off(o_{kl}) ).    (15)

The outputs of the two DeepSets are combined with alternative j's own feature vector, s_j, to form the encoder representation z_j, and the encoder representations are then processed by the permutation-equivariant layers and output layer described at the start of this subsection.

4.4 Training Procedure

The emulator is trained on simulated choice problems: each training example pairs canonical inputs (v*, Σ*) with simulated choice frequencies and target gradients. The training loss combines a cross-entropy term with a gradient-matching penalty, L(ϕ) = L_CE(ϕ) + λ_grad L_grad(ϕ), where λ_grad > 0 controls the weight of the gradient-matching term. The cross-entropy component is the multinomial negative log-likelihood (up to constants):

    L_CE(ϕ) = − ∑_{j=1}^K P̂^sim_j log P̂_j(v*, Σ*; ϕ),    (21)

where P̂^sim_j = n_j / ∑_{k=1}^K n_k is the simulated choice frequency for alternative j.

For the gradient-matching component, we follow the stochastic Sobolev training approach of Czarnecki et al. (2017), which avoids computing full Jacobian matrices by instead matching directional derivatives along random directions. Let d = (d_v, vech(D_Σ)) denote a random direction in the joint space of utilities and covariances, where d_v ∈ R^K, D_Σ ∈ R^{K×K} is symmetric, and vech(·) denotes the half-vectorization (upper triangular elements including the diagonal). The gradient-matching loss is:

    L_grad(ϕ) = ∑_{j=1}^K P̂^sim_j ( ∇_d log P̂_j − ∇_d log P^target_j )²,    (22)

where ∇_d denotes the directional derivative along d. Weighting by the simulated choice frequencies P̂^sim_j focuses learning on alternatives that are frequently chosen, whose gradients most influence parameter estimation in downstream inference tasks.

Constraint-respecting directions. The canonical inputs (v*, Σ*) lie on a constrained manifold:

    X_K = { (v*, Σ*) : ∑_{j=1}^K v*_j = 0, Σ* 1_K = 0, tr(Σ*) = K, Σ* ⪰ 0 }.    (23)

Random directions must respect these constraints to remain in the tangent space of X_K. We construct valid directions as follows (see the sketch below):

1. Sample d̃_v ∼ N(0, I_K) and center: d_v = d̃_v − d̄_v 1_K, where d̄_v = (1/K) ∑_j d̃_{v,j}.

2. Sample a symmetric matrix D̃_Σ with iid standard normal entries on and above the diagonal, then double-center:

    D′_Σ = M D̃_Σ M,    (24)

where M = I_K − (1/K) 1_K 1_K^⊤ is the centering matrix from Section 4.2.

3. Remove the trace component to ensure tr(D_Σ) = 0:

    D_Σ = D′_Σ − (tr(D′_Σ) / (K − 1)) M.    (25)

4. Set d = (d_v, vech(D_Σ)) and normalize: d ← d / ∥d∥.

Computing directional derivatives. We compute the emulator's directional derivatives via finite differences:

    ∇_d log P̂_j ≈ [ log P̂_j(v* + ϵ d_v, Σ* + ϵ D_Σ; ϕ) − log P̂_j(v* − ϵ d_v, Σ* − ϵ D_Σ; ϕ) ] / (2ϵ),    (26)

where ϵ > 0 is a small step size (we use ϵ = 10^{-5}). This approach is computationally efficient, requiring only two additional forward passes per direction regardless of the input dimension. Alternatively, one could compute the directional derivative via automatic differentiation as in Czarnecki et al. (2017).
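The direction sampler and the finite-difference directional derivative admit a compact sketch (NumPy; illustrative, using the vech-based normalization of step 4):

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_direction(K):
    """Random direction in the tangent space of X_K (steps 1-4 above):
    centered utility part; doubly centered, trace-free covariance part."""
    d_v = rng.standard_normal(K)
    d_v -= d_v.mean()                                    # step 1: center
    T = rng.standard_normal((K, K))
    T = np.triu(T) + np.triu(T, 1).T                     # step 2: symmetric draw
    M = np.eye(K) - np.ones((K, K)) / K
    D = M @ T @ M                                        # double-center: D 1 = 0
    D -= (np.trace(D) / (K - 1)) * M                     # step 3: tr(D) = 0
    iu = np.triu_indices(K)                              # step 4: normalize vech
    norm = np.sqrt(d_v @ d_v + (D[iu] ** 2).sum())
    return d_v / norm, D / norm

def directional_deriv(f, v, S, d_v, D, eps=1e-5):
    """Central finite difference along (d_v, D): two forward passes (eq. 26)."""
    return (f(v + eps * d_v, S + eps * D) - f(v - eps * d_v, S - eps * D)) / (2 * eps)

K = 4
d_v, D = sample_direction(K)
print(np.isclose(d_v.sum(), 0.0),
      np.allclose(D @ np.ones(K), 0.0),
      np.isclose(np.trace(D), 0.0))
```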
Target directional derivatives are computed from pre-stored Jacobians. During data generation, we compute and store the full Jacobians of log P^target_j with respect to v* and Σ*. During training, the target directional derivative is obtained via the inner product:

    ∇_d log P^target_j = ⟨ ∂ log P^target_j / ∂v*, d_v ⟩ + ⟨ ∂ log P^target_j / ∂Σ*, D_Σ ⟩_F,    (27)

where ⟨·,·⟩_F denotes the Frobenius inner product.

Target gradient computation. Computing target gradients for Sobolev training requires differentiating through the choice simulation process. Since discrete choices are inherently non-differentiable, we employ a smooth relaxation using a temperature-scaled softmax. For each Monte Carlo draw r with utility vector U^(r) = v* + ϵ^(r), we approximate the hard choice indicator with:

    Ỹ^(r)_j = exp(τ U^(r)_j) / ∑_{k=1}^K exp(τ U^(r)_k),    (28)

where τ > 0 is a temperature parameter. As τ → ∞, this converges to the hard argmax; for finite τ, it provides a differentiable approximation. The target log-probability is then:

    log P^target_j = log( (1/R) ∑_{r=1}^R Ỹ^(r)_j ),    (29)

and the target gradients ∂ log P^target_j / ∂v* and ∂ log P^target_j / ∂Σ* are obtained via automatic differentiation through this expression. This approach is closely related to the "Gumbel–softmax trick" frequently used for gradient estimation with neural networks (Jang et al., 2017; Maddison et al., 2017). We use τ = 3 in our experiments, which provides a balance between gradient accuracy (low bias) and computational stability.
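A PyTorch sketch of the relaxed target (eqs. 28–29); for simplicity it uses a generic full-rank Σ and differentiates through its Cholesky factor, whereas the centered Σ* is rank-deficient and would require an eigendecomposition-based square root. All values are illustrative.

```python
import torch

def target_logprob(v, Sigma, j, tau=3.0, R=4096, gen=None):
    """Smoothed target log-probability (eqs. 28-29): a temperature-scaled
    softmax over R simulated utility vectors replaces the hard argmax."""
    L = torch.linalg.cholesky(Sigma)              # differentiable square root
    eps = torch.randn(R, v.shape[0], generator=gen) @ L.T
    Y = torch.softmax(tau * (v + eps), dim=1)     # soft choice indicators
    return torch.log(Y[:, j].mean())

K = 3
gen = torch.Generator().manual_seed(0)
v = torch.tensor([0.4, 0.0, -0.4], requires_grad=True)
Sigma = (0.7 * torch.eye(K) + 0.3 * torch.ones(K, K)).requires_grad_(True)
logp = target_logprob(v, Sigma, j=0, gen=gen)
logp.backward()        # gradients wrt v and Sigma via automatic differentiation
print(round(logp.item(), 4))
print(v.grad)
print(Sigma.grad)
```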
Training details. Our training procedure makes use of two replay buffers, a common technique in deep learning (Mnih et al., 2015). The first replay buffer stores precomputed training examples including inputs (v*, Σ*), simulated choice frequencies, and target Jacobians. The second stores random directions for the gradient-matching loss. At each training step, we sample random pairs from the replay buffers, compute the loss, and update the network parameters via stochastic gradient descent. The replay buffer accelerates training by allowing reuse of expensive high-precision simulations and gradient computations. We set the gradient penalty weight λ_grad = 10^{-8} in our experiments; empirically, we find that this value provides sufficiently well-behaved derivatives.

Multi-K training. Remarkably, none of the weight vectors and matrices has a dimension depending on K. The dimension K enters the neural network only through the input data dimension; thus, we can design a single emulator capable of processing multiple values of K simultaneously. The training procedures described above can then be adapted to iterate over a range of K values, say K ∈ {3, 4, 5}, and the resulting emulator can then learn to approximate choice probabilities for a varying number of choices (see Section 6.3). Under this configuration, we recommend including K as an element of s_j.

4.5 Inference with the Trained Emulator

Once trained, the emulator enables rapid inference. Given a new dataset {(y_i, X_i)}_{i=1}^n and candidate parameter vector θ:

1. Compute the deterministic utility vector v_i(X_i, θ) and scale matrix Σ_i(X_i, θ) for each observation.

2. Apply the preprocessing transformation to obtain (v*_i, Σ*_i).

3. Evaluate the emulator to obtain estimated choice probabilities P̂_{ij} for all alternatives.

4. Compute the log likelihood ℓ̂_n(θ) = (1/n) ∑_{i=1}^n log P̂_{i,y_i}.

As the preprocessing transformation and neural network are both smooth functions, the gradients of the emulator log likelihood with respect to θ can be computed exactly via automatic differentiation. This enables the use of gradient-based optimizers for maximum likelihood estimation, as well as gradient-based sampling methods such as Hamiltonian Monte Carlo for Bayesian inference. The smoothness of the emulator also facilitates computation of standard errors, which can be estimated via outer products of gradients (see Section 5.2).

5 Theory

This section establishes the theoretical foundations for the emulator architecture and the statistical properties of emulator-based discrete choice estimators.

5.1 Universal Approximation

We first show that the per-alternative encoder architecture can universally approximate choice probabilities. The key insight is that the multisets processed by the encoder generically determine the input up to the relevant symmetry group.

For each alternative j, define the invariant function g_j: X_K → (Multiset, Multiset) by:

    T_j(v*, Σ*) = {{ (v*_k, Σ*_{kk}, Σ*_{jk}) : k ≠ j }},    (30)
    O_j(Σ*) = {{ Σ*_{kl} : 1 ≤ k < l ≤ K, k ≠ j, l ≠ j }},    (31)
    g_j(v*, Σ*) = (T_j(v*, Σ*), O_j(Σ*)).    (32)

The multiset T_j consists of K − 1 triples, each coupling an alternative k's utility and variance with its covariance with alternative j. The multiset O_j consists of the off-diagonal covariances among alternatives other than j. These are precisely the base inputs processed by the per-alternative encoder's diagonal and off-diagonal DeepSets.

Theorem 1 (Generic Separation). For each alternative j ∈ {1, ..., K}, there exists a closed, measure-zero set B_j ⊂ X_K such that for any (v*_1, Σ*_1), (v*_2, Σ*_2) ∈ X_K \ B_j:

    g_j(v*_1, Σ*_1) = g_j(v*_2, Σ*_2) ⟹ (v*_1, Σ*_1) and (v*_2, Σ*_2) are in the same S^(j)_{K−1}-orbit,    (33)

where S^(j)_{K−1} is the subgroup of permutations fixing alternative j.

The proof, given in Appendix A.8, proceeds in three steps:

Step 1: Genericity conditions. We define two genericity conditions on (v*, Σ*): (a) the triples (v*_k, Σ*_{kk}, Σ*_{jk}) for k ≠ j are pairwise distinct as elements of R³; and (b) for each k ≠ j, there exists a unique subset of off-diagonal covariances that sums to −Σ*_{kk} − Σ*_{jk}. Condition (b) follows from the row-sum constraint Σ* 1_K = 0: the off-diagonal entries in row k (excluding Σ*_{kj}) sum to −Σ*_{kk} − Σ*_{kj}.

Step 2: Recovery procedure. When both conditions hold, we can recover (v*, Σ*) from g_j up to relabeling of alternatives other than j. First, order the triples in T_j lexicographically to assign labels to alternatives. Then, for each alternative k, identify its row of off-diagonal entries as the unique subset in O_j with the required sum. Finally, recover shared entries Σ*_{kl} as the singleton intersection of the subsets for rows k and l.

Step 3: Measure-zero exceptional set. The set B_j where either condition fails is a finite union of zero sets of nontrivial analytic functions on the affine subspace X_K. By the principle that proper analytic subvarieties have measure zero (Mityagin, 2020), B_j has measure zero.
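Concretely, encoding each multiset by its sorted elements yields a canonical representation of g_j; the sketch below (NumPy; illustrative values) verifies invariance under a relabeling that fixes j.

```python
import numpy as np

def g_j(v, S, j):
    """Base invariant features for alternative j: the multiset T_j of triples
    (v_k, S_kk, S_jk) for k != j, and the multiset O_j of off-diagonal entries
    among the remaining alternatives (eqs. 30-32).  Sorting gives a canonical
    representation of each multiset."""
    others = [k for k in range(len(v)) if k != j]
    T = sorted((v[k], S[k, k], S[j, k]) for k in others)
    O = sorted(S[k, l] for i, k in enumerate(others) for l in others[i + 1:])
    return T, O

rng = np.random.default_rng(5)
K, j = 5, 2
v = rng.normal(size=K)
A = rng.normal(size=(K, K)); S = A @ A.T
# Permute the alternatives other than j: g_j is unchanged (S_{K-1}^{(j)}-invariance).
perm = np.array([3, 1, 2, 0, 4])          # fixes position j = 2
assert g_j(v, S, j) == g_j(v[perm], S[np.ix_(perm, perm)], j)
print("g_j is invariant to relabeling the other alternatives")
```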
This separation result, combined with the Stone–Weierstrass theorem, yields universal approximation:

Theorem 2 (MLP Universal Approximation). Under Assumption 1, for each alternative j, the choice probability function P_j can be uniformly approximated on any compact subset of X_K \ B_j by a multi-layer perceptron taking the components of g_j as input.

Proof sketch. We apply the proof strategy of Blum-Smith et al. (2025). The choice probability P_j is continuous and S^(j)_{K−1}-invariant. By Theorem 1, the functions encoding g_j (sorted triples and sorted off-diagonal entries) separate points in the quotient space (X_K \ B_j)/S^(j)_{K−1}. Since S^(j)_{K−1} is finite (hence compact), standard results imply that the quotient is Hausdorff (Bredon, 1972). By the Stone–Weierstrass theorem, the algebra generated by these functions is dense in continuous functions on compact subsets of the quotient. Pulling back to X_K and applying the universal approximation theorem for MLPs (Cybenko, 1989; Hornik et al., 1989) yields the result. See Appendix A.8 for details.

Corollary 1 (Architecture Universal Approximation). Let B = ∪_{j=1}^K B_j. Under Assumption 1, for each alternative j, the choice probability function P_j can be uniformly approximated on any compact subset of X_K \ B_j by the per-alternative encoder. Further, the vector of choice probabilities (P_1, ..., P_K) can be uniformly approximated on any compact subset of X_K \ B by the neural network architecture of Section 4.3.

Proof. Apply the proof of Theorem 2 in tandem with universal approximation results for DeepSet (Zaheer et al., 2017) and equivariant architectures (Segol and Lipman, 2020).

The exceptional set B has measure zero, so it does not affect approximation in practice: for any probability distribution over X_K that is absolutely continuous with respect to Lebesgue measure, inputs lie in B with probability zero.

Corollary 1 guarantees that the per-alternative encoder alone is sufficient for universal approximation. The inclusion of the final equivariant layers ensures that the estimated probabilities sum to one and improves the expressivity of the neural network. For large choice sets, we may opt for a simpler architecture that applies the per-alternative encoder to only the chosen alternative with no equivariant layers, resulting in a single output, P̂_j. This modification maintains the universal approximation property; however, it relaxes the sum-to-one constraint in return for computational efficiency.

Corollary 1 could also be extended to hold simultaneously for a finite set of K values under a suitable extension of the space X_K and its exceptional sets. Section 6.3 presents numerical results showing that the proposed architecture can provide accurate approximations for multiple values of K simultaneously.

5.2 Consistency and Asymptotic Normality

We now establish that emulator-based estimators inherit the asymptotic properties of exact maximum likelihood estimators under appropriate conditions.

Let {(y_i, X_i)}_{i=1}^n be iid observations from the true model with parameter θ_0 ∈ Θ.
Define the true and emulator log-likelihood contributions as

    m(y, X; θ) = log P_y{v(X, θ), Σ(X, θ)},    (34)
    m̂_n(y, X; θ) = log P̂_y{v(X, θ), Σ(X, θ); ϕ_n},    (35)

with corresponding score functions

    ψ(y, X; θ) = ∇_θ m(y, X; θ),   ψ̂_n(y, X; θ) = ∇_θ m̂_n(y, X; θ).    (36)

The true and emulator (averaged) log likelihoods are ℓ_n(θ) = (1/n) ∑_{i=1}^n m(y_i, X_i; θ) and ℓ̂_n(θ) = (1/n) ∑_{i=1}^n m̂_n(y_i, X_i; θ). Define the MLE and emulator-based estimator as

    θ̃_n = arg max_{θ ∈ Θ} ℓ_n(θ),   θ̂_n = arg max_{θ ∈ Θ} ℓ̂_n(θ).    (37)

We require that the emulator log likelihood converges to the true log likelihood.

Assumption 2 (Emulator Approximation Quality). For some α ≥ 0, the emulator satisfies

    sup_{θ ∈ Θ} | ℓ̂_n(θ) − ℓ_n(θ) | = o_p(n^{−α}).    (38)

The n referenced in Assumption 2 is the sample size of the observed data. Consequently, this assumption involves an asymptotic regime in which the emulator becomes increasingly accurate as n → ∞, which is achievable by progressively (a) increasing the number of simulated training examples and (b) increasing the neural network complexity (i.e., the number of layers and hidden units). Appendix B.1 provides more primitive sufficient conditions for Assumption 2 with references to closely related work on convergence rates of neural networks and their usage in econometric theory. Notably, for models with analytic choice probabilities, we expect this assumption to require only O(n^{α+δ}) training examples for any δ > 0.

We also require sufficient regularity of the true model.

Assumption 3 (Regularity). The model satisfies:

(i) θ_0 is the unique maximizer of E{m(y, X; θ)} in the compact set Θ with θ_0 ∈ int(Θ) and E{|m(y, X; θ_0)|} < ∞;
(ii) m(y, X; θ) is twice continuously differentiable in θ for all (y, X);
(iii) E{sup_{θ ∈ Θ} ∥ψ(y, X; θ)∥²} < ∞;
(iv) E{sup_{θ ∈ Θ} ∥∇²_θ m(y, X; θ)∥} < ∞;
(v) the Fisher information J(θ_0) = E{ψ(y, X; θ_0) ψ(y, X; θ_0)^⊤} is positive definite.

Theorem 3 (Consistency). Under Assumptions 2 and 3, θ̂_n →p θ_0 as n → ∞.

Proof. By Assumption 2 with α ≥ 0, sup_θ |ℓ̂_n(θ) − ℓ_n(θ)| →p 0. Further, the regularity conditions of Assumption 3 are sufficient to apply the uniform law of large numbers (ULLN) so that sup_θ |ℓ_n(θ) − E{m(y, X; θ)}| →p 0. The triangle inequality gives uniform convergence of ℓ̂_n to E{m(y, X; θ)}, which is uniquely maximized at θ_0 by Assumption 3(i). By the consistency theorem for extremum estimators (Newey and McFadden, 1994, Theorem 2.1), θ̂_n →p θ_0.

Under the stronger condition α ≥ 1, the emulator-based estimator is asymptotically normal.

Theorem 4 (Asymptotic Normality). Under Assumptions 2 and 3 with α ≥ 1,

    √n (θ̂_n − θ_0) →d N(0, J(θ_0)^{−1}).    (39)

The proof, given in Appendix B.2, shows that θ̂_n is asymptotically equivalent to θ̃_n.

Remark 1 (Approximate Maximizers). If a maximum is not uniquely attained, one can work with approximate maximizers satisfying ℓ̂_n(θ̂_n) ≥ sup_θ ℓ̂_n(θ) − o_p(n^{−α}). The consistency and asymptotic normality results continue to hold; see Newey and McFadden (1994, Theorem 2.1) and Appendix B.2.
To estimate the Fisher information, define the true and emulator outer-product-of-scores estimators:

    J_n(θ) = (1/n) ∑_{i=1}^n ψ(y_i, X_i; θ) ψ(y_i, X_i; θ)^⊤,    (40)
    Ĵ_n(θ) = (1/n) ∑_{i=1}^n ψ̂_n(y_i, X_i; θ) ψ̂_n(y_i, X_i; θ)^⊤.    (41)

Assumption 4 (Gradient Approximation Quality). The emulator satisfies

    sup_{θ ∈ Θ} ∥ Ĵ_n(θ) − J_n(θ) ∥ →p 0 as n → ∞.    (42)

Assumption 4 is justified by Sobolev training, which penalizes discrepancies between emulator and target gradients. The penalization is required because training on log-likelihood values alone does not guarantee convergence of gradients. As with Assumption 2, see Appendix B.1 for more primitive sufficient conditions and additional discussion.

Proposition 1 (Consistent Estimation of Fisher Information). Under Assumptions 2, 3, and 4, Ĵ_n(θ̂_n) →p J(θ_0).

Proof. By the triangle inequality,

    ∥Ĵ_n(θ̂_n) − J(θ_0)∥ ≤ ∥Ĵ_n(θ̂_n) − J_n(θ̂_n)∥ + ∥J_n(θ̂_n) − J(θ̂_n)∥ + ∥J(θ̂_n) − J(θ_0)∥.

The first term vanishes by Assumption 4. For the second term, the regularity conditions of Assumption 3 are sufficient to apply the ULLN to J_n; thus, this term also converges in probability to zero. Finally, the third term vanishes by continuity of J and consistency of θ̂_n.

The emulator score ψ̂_n(y_i, X_i; θ) is computable via automatic differentiation through the smooth neural network. Standard errors for θ̂_n are obtained as the square roots of the diagonal elements of Ĵ_n(θ̂_n)^{−1}/n. An alternative estimator based on the Hessian, J̃_n = −(1/n) ∑_{i=1}^n ∇²_θ m̂_n(y_i, X_i; θ̂_n), requires second-derivative accuracy not directly targeted by first-order Sobolev training. The score-based estimator Ĵ_n(θ̂_n) is therefore preferred.

5.3 Inference Under Misspecification

When the emulator does not perfectly approximate the true likelihood, we can still obtain valid inference by treating the emulator as a working model, following the quasi-maximum likelihood framework (White, 1982). In this section, we treat the neural network as fixed with n, dropping the n subscript from the observation log likelihood, m̂(y, X; θ), and score function, ψ̂(y, X; θ). Define the pseudo-true parameter as the population maximizer of the emulator log likelihood:

    θ†_0 = arg max_{θ ∈ Θ} E{m̂(y, X; θ)}.    (43)

When the emulator closely approximates the true likelihood, θ†_0 ≈ θ_0. The asymptotic distribution depends on two matrices:

    A(θ) = −E{∇²_θ m̂(y, X; θ)},   B(θ) = E{ψ̂(y, X; θ) ψ̂(y, X; θ)^⊤}.    (44)

Under correct specification (i.e., a perfect emulator), the information matrix equality gives A(θ_0) = B(θ_0) = J(θ_0). Under misspecification, these matrices differ, necessitating the sandwich covariance form. Under regularity conditions stated in Appendix B.3, we obtain the following results.

Theorem 5 (Consistency Under Misspecification). Under Assumptions 5 and 6, θ̂_n →p θ†_0.

Theorem 6 (Asymptotic Normality Under Misspecification). Under Assumptions 5 and 6,

    √n (θ̂_n − θ†_0) →d N{0, A(θ†_0)^{−1} B(θ†_0) A(θ†_0)^{−⊤}}.    (45)

The proofs follow standard M-estimator arguments; see Appendix B.4.
The sandwich covariance V = A(θ†_0)^{−1} B(θ†_0) A(θ†_0)^{−⊤} can be consistently estimated by V̂_n = Â_n^{−1} B̂_n Â_n^{−⊤}, where

    Â_n = −(1/n) ∑_{i=1}^n ∇²_θ m̂(y_i, X_i; θ̂_n),   B̂_n = (1/n) ∑_{i=1}^n ψ̂(y_i, X_i; θ̂_n) ψ̂(y_i, X_i; θ̂_n)^⊤.    (46)

Both quantities are computable via automatic differentiation. Note that B̂_n corresponds to Ĵ_n(θ̂_n) from the previous section under an asymptotic regime with a fixed emulator.

Theorem 6 provides valid inference for θ†_0, not necessarily θ_0. If we employ an asymptotic regime where the emulator varies with n, then the pseudo-true parameter also varies with n, so we denote it as θ†_n. When α ∈ [0, 1) in Assumption 2, we have θ†_n →p θ_0 as n → ∞, but this convergence may be too slow relative to √n sampling variability. The sandwich standard errors remain valid for θ†_n, but confidence intervals should not be interpreted as targeting θ_0. Even if A(θ†_0) and B(θ†_0) both approximate J(θ_0), the centering at θ†_n ≠ θ_0 introduces asymptotic bias. Valid inference for θ_0 requires α ≥ 1 (Theorem 4).

Similarly, Proposition 1 provides conditions under which B̂_n →p J(θ_0). However, our first-order Sobolev training procedure provides no guarantee that the emulator Hessian will approximate the Hessian of the true model; thus, Â_n need not converge to any quantity from the true model. However, Theorem 6 requires only that A(θ†_0) be nonsingular, not that it match the true model. When the emulator additionally provides accurate second derivatives, the sandwich covariance matrix reduces to J(θ_0)^{−1}.
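The sandwich computation is mechanical given automatic differentiation. The sketch below (PyTorch; a simple MNL working model stands in for the emulator log-likelihood, and all values are illustrative) estimates Â_n, B̂_n, and the resulting standard errors.

```python
import torch
from torch.autograd.functional import hessian, jacobian

def loglik_i(theta, y, x):
    """Working-model log-likelihood contribution m_hat(y, X; theta); a linear
    MNL stands in for the emulator log-likelihood in this sketch."""
    v = x @ theta
    return v[y] - torch.logsumexp(v, dim=0)

def sandwich_cov(theta_hat, data):
    """Covariance estimate A_n^{-1} B_n A_n^{-T} / n (Theorem 6 / eq. 46)."""
    p = theta_hat.shape[0]
    A = torch.zeros(p, p); B = torch.zeros(p, p)
    for y, x in data:
        f = lambda t: loglik_i(t, y, x)
        s = jacobian(f, theta_hat)               # per-observation score
        A -= hessian(f, theta_hat)               # minus Hessian, accumulated
        B += torch.outer(s, s)                   # outer product of scores
    A /= len(data); B /= len(data)
    A_inv = torch.linalg.inv(A)
    return A_inv @ B @ A_inv.T / len(data)

# toy data: K = 3 alternatives, p = 2 covariates, true beta = (0, 1)
torch.manual_seed(0)
n, K, p = 500, 3, 2
theta0 = torch.tensor([0.0, 1.0])
data = []
for _ in range(n):
    x = torch.randn(K, p)
    y = torch.distributions.Categorical(logits=x @ theta0).sample()
    data.append((y, x))
se = torch.sqrt(torch.diag(sandwich_cov(theta0, data)))
print(se)    # standard errors at theta0 (in practice, evaluate at theta_hat)
```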
6 Simulation Study

We evaluate the performance of emulator-based MNP estimators across a range of sample sizes, numbers of alternatives, and model specifications. We compare their statistical performance and computational requirements to estimators based on the GHK simulator.

6.1 Simulation Design

Model specification. We consider MNP models with K ∈ {3, 5, 10} alternatives and p = 2 covariates. Alternative K serves as the reference with utility normalized to zero, representing an outside option. For each non-reference alternative j ∈ {1, ..., K − 1}, the deterministic utility is v_j = x_j^⊤ β, where x_j ∈ R^p is a vector of alternative-specific attributes and β = (0, 1)^⊤ is the true coefficient vector.

We consider two specifications for Σ, denoting them as Dense and Factor. The Dense specification assumes a dense structure for Σ_{1:(K−1),1:(K−1)}, the upper (K − 1) × (K − 1) submatrix of Σ. For identification, we set Σ_{11} = 1 and Σ_{•K} = Σ_{K•}^⊤ = 0_K; the latter equation sets the variance and covariances of the zero-utility option to zero.

The Factor specification assumes a one-factor structure for Σ_{1:(K−1),1:(K−1)} of the form Σ_{1:(K−1),1:(K−1)} = Ψ + γγ^⊤, where γ ∈ R^{K−1} and Ψ ∈ R^{(K−1)×(K−1)} is a diagonal matrix of uniquenesses. For identification, we impose an anchored normalization that fixes the first uniqueness and first factor loading:

    Ψ_{11} = 0,   γ_1 = 1.    (47)

As with the Dense specification, we set Σ_{•K} = Σ_{K•}^⊤ = 0_K. These constraints imply Σ_{11} = 1, fixing the scale of the model, and Σ_{1j} = γ_j for j ≥ 2, so that the factor loadings directly encode covariances with the first alternative. Given these constraints, we parameterize the model in terms of K − 2 free uniquenesses (Ψ_{22}, ..., Ψ_{K−1,K−1}) and K − 2 free loadings (γ_2, ..., γ_{K−1}). This anchored normalization is designed to eliminate flat directions in the likelihood surface.

Table 1 shows the number of nonredundant parameters by type for the Dense and Factor specifications. With the anchored normalization, the Factor specification is exactly identified for K = 3 and overidentified for K ≥ 4. For K = 10, the Factor specification results in fewer than half as many parameters as the Dense specification (18 vs. 46).

    Specification   K    Coefficients   Variances (Uniquenesses)   Covariances (Loadings)   Total
    Dense           3    2              1                          1                        4
                    5    2              3                          6                        11
                    10   2              8                          36                       46
    Factor          3    2              1                          1                        4
                    5    2              3                          3                        8
                    10   2              8                          8                        18

    Table 1: Parameter counts by Σ specification (Dense or Factor) and K.

Data generation. For each replication, we generate covariates x_ij iid∼ N(0, I_p) for observations i = 1, ..., n and alternatives j = 1, ..., K − 1, with x_iK = 0 for the reference alternative.

For the Dense specification, we draw the covariance matrix according to a Wishart(I_{K−1}, K + 10) distribution and then normalize it so that Σ_{11} = 1, resulting in a probability density function (PDF) given by

    [ Γ(s) · |Σ_{1:(K−1),1:(K−1)}|^5 ] / [ 2^s · Γ_{K−1}((K + 10)/2) · ((1/2) tr(Σ_{1:(K−1),1:(K−1)}))^s ],    (48)

where s = (K + 10)(K − 1)/2 and Γ_{K−1}(·) denotes the (K − 1)-dimensional multivariate gamma function (see Appendix C.2).

For the Factor specification, we sample the free uniquenesses on the log scale, drawing log(Ψ_{jj}) iid∼ N(0, 0.1²) for j = 2, ..., K − 1, and exponentiate to obtain Ψ_{jj} > 0. For the factor loadings, we draw γ_j iid∼ N(0, 0.2²) for j = 2, ..., K − 1. The fixed values Ψ_{11} = 0 and γ_1 = 1 complete the parameterization.

Given the covariates and true parameters, we compute deterministic utilities and simulate choices y_i ∈ {1, ..., K} according to one of two data-generating processes (DGPs): Probit or Emulator. The Probit RUM generates choices from the exact MNP model, drawing errors ϵ_i ∼ N(0, Σ) and selecting the utility-maximizing alternative. The Emulator RUM generates choices by sampling directly from the categorical distribution defined by the emulator's predicted choice probabilities. When coupled with the Emulator RUM, the emulator-based estimation method is correctly specified, enabling a comparison of emulator-based estimators under both correct specification (Emulator RUM) and approximate inference (Probit RUM).

Estimation. To ensure positive definiteness of Σ, we estimate the Dense specification on the scale of the Cholesky factor, with the diagonal elements optimized on the log scale to enforce positivity. Similarly, we estimate the Factor specification with the free uniquenesses on the log scale, optimizing log(Ψ_{jj}) for j = 2, ..., K − 1.
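A minimal sketch (NumPy; illustrative parameter values) of the anchored Factor parameterization, mapping unconstrained parameters to the scale matrix:

```python
import numpy as np

def factor_sigma(log_psi_free, gamma_free, K):
    """Anchored one-factor scale matrix for the K-1 non-reference alternatives:
    Sigma = Psi + gamma gamma' with Psi_11 = 0 and gamma_1 = 1 fixed (eq. 47);
    free uniquenesses enter on the log scale to enforce positivity."""
    psi = np.concatenate(([0.0], np.exp(log_psi_free)))     # Psi_11 = 0 anchored
    gamma = np.concatenate(([1.0], gamma_free))             # gamma_1 = 1 anchored
    S = np.diag(psi) + np.outer(gamma, gamma)               # (K-1) x (K-1) block
    full = np.zeros((K, K))                                 # reference row/col = 0
    full[:K - 1, :K - 1] = S
    return full

K = 5
theta_psi = np.log([0.9, 1.1, 1.0])      # K - 2 = 3 free uniquenesses
theta_gam = np.array([0.3, -0.2, 0.1])   # K - 2 = 3 free loadings
Sigma = factor_sigma(theta_psi, theta_gam, K)
print(np.isclose(Sigma[0, 0], 1.0))      # Sigma_11 = 1 implied by the anchoring
print(np.allclose(Sigma[:K - 1, 0], np.concatenate(([1.0], theta_gam))))
```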
The regularization consists of two components: a barrier penalty and a pseudo-prior. The barrier penalty is applied to log(Ψ_jj) for the Factor specification and to the log-diagonal Cholesky parameters for the Dense specification. For a latent parameter η, the barrier penalty is λ_reg · log(η + 8)/n. The pseudo-prior terms take the form λ_reg · log{p(θ)}/n, where p(θ) is the density function of θ under the data-generating process. We add these penalties together to form a combined penalty of the form λ_reg · r(θ)/n, where r(θ) sums the barrier and log-prior penalties and λ_reg is a regularization hyperparameter. We then estimate θ as the maximizer of the sum of the (approximate) log likelihood and λ_reg · r(θ)/n. We take λ_reg = 0.1 so that the penalty is small and asymptotically negligible. In practice, we find that these penalties lead to more reliable convergence of the model-fitting algorithm and help address the weak identification of the covariance parameters that arises when K is large and n is small. We account for these penalties in the standard error computation by estimating Cov(θ̂_n) as {n · Ĵ_n(θ̂_n) − λ_reg · ∇²_θ r(θ̂_n)}^{-1}. Again, the impact of the penalization is asymptotically negligible because the term n · Ĵ_n(θ̂_n) dominates asymptotically.

For all methods, we maximize the (approximate) regularized log likelihood using L-BFGS optimization (Liu and Nocedal, 1989). In practice, we found it necessary to limit the minimum step size (we used 10^{-6}) and allow small decreases in the objective function to enable the optimizer to escape local minima.
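As a hedged sketch of this estimation step (not the authors' implementation), the snippet below maximizes an approximate log likelihood plus the penalty λ_reg · r(θ)/n with `torch.optim.LBFGS`. The safeguards mentioned above (minimum step size, tolerated objective decreases) are not exposed by the stock optimizer and are omitted; `approx_loglik` and `r_penalty` are assumed callables.

```python
import torch

def fit_theta(approx_loglik, r_penalty, theta_init, n, lam_reg=0.1):
    """Maximize approx_loglik(theta) + lam_reg * r_penalty(theta) / n via L-BFGS."""
    theta = theta_init.clone().requires_grad_(True)
    opt = torch.optim.LBFGS([theta], max_iter=200, line_search_fn="strong_wolfe")

    def closure():
        opt.zero_grad()
        # Minimize the negative penalized objective.
        loss = -(approx_loglik(theta) + lam_reg * r_penalty(theta) / n)
        loss.backward()
        return loss

    opt.step(closure)
    return theta.detach()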
Experimental design. We consider sample sizes n ∈ {1,000, 10,000, 100,000} and conduct 100 replications for each configuration. We compute performance metrics aggregated across all parameter types. We compute Monte Carlo standard errors for the coverage and average timings using scaled empirical standard deviations. For the remaining metrics, we apply bootstrap resampling (across repetitions) with 1,000 bootstrap samples to estimate the Monte Carlo standard errors.

Computational environment. All simulations were implemented in PyTorch (Paszke et al., 2019) and executed on a personal laptop with 48 GB RAM and an Apple M4 Pro processor. We fit the emulators using the Apple Metal Performance Shaders (MPS). To obtain comparable estimation timings, however, we performed the rest of the simulation on CPU.

6.2 Simulation Results

We fit the emulators for 400,000 training episodes using 400,000 pregenerated training examples, each consisting of 10^6 choices. Details on the optimizer, learning rate schedule, and network architecture are provided in Appendix C.1. This model-fitting procedure completed in 2.1, 4.6, and 16.6 hours for K = 3, 5, and 10, respectively, representing a modest level of computational effort.

[Figure 2: three panels, (a) K = 3, (b) K = 5, (c) K = 10, each plotting training and evaluation cross-entropy loss (log scale) against episode number.]

Figure 2: Training and evaluation loss over 400,000 training episodes for emulators with K ∈ {3, 5, 10} alternatives. The training loss (blue) decreases consistently throughout training, and the evaluation loss (orange) closely tracks the training loss, indicating that the emulators do not overfit the simulated training data.

Figure 2 displays the training and evaluation loss for the emulators over time. At each value of K ∈ {3, 5, 10}, the training loss decreases consistently over time even at the end of training, indicating that further training would likely yield additional improvements. The evaluation loss closely tracks the training loss, implying that the emulator generalizes well with minimal overfitting to the simulated training probabilities. We compute the evaluation loss every 200 episodes; these periodic calculations produce significantly less visible variation in the evaluation loss compared to the training loss.

Table 2 displays performance metrics for K = 3 comparing the emulator and GHK estimation methods. The performance metrics are the root mean squared error (RMSE), root mean squared bias (RMS Bias), ratio of estimated to empirical standard errors (SE Ratio), 95% confidence interval coverage rates (Coverage), and elapsed model-fitting time in seconds (Time (s)).

n        DGP       Method    RMSE           RMS Bias       SE Ratio     Coverage     Time (s)
1,000    Emulator  Emulator  0.159 (0.016)  0.010 (0.009)  1.00 (0.05)  0.94 (0.01)  0.11 (0.00)
1,000    Probit    Emulator  0.206 (0.032)  0.049 (0.018)  0.95 (0.08)  0.95 (0.01)  0.10 (0.00)
1,000    Probit    GHK(10)   0.202 (0.031)  0.047 (0.019)  0.96 (0.08)  0.94 (0.02)  0.08 (0.00)
1,000    Probit    GHK(50)   0.203 (0.032)  0.048 (0.018)  0.96 (0.09)  0.95 (0.01)  0.14 (0.00)
1,000    Probit    GHK(250)  0.207 (0.033)  0.049 (0.019)  0.95 (0.08)  0.95 (0.01)  0.41 (0.01)
10,000   Emulator  Emulator  0.052 (0.006)  0.007 (0.004)  0.94 (0.07)  0.94 (0.01)  0.36 (0.00)
10,000   Probit    Emulator  0.053 (0.009)  0.008 (0.004)  0.99 (0.11)  0.94 (0.01)  0.36 (0.00)
10,000   Probit    GHK(10)   0.052 (0.008)  0.010 (0.003)  1.00 (0.09)  0.94 (0.01)  0.30 (0.00)
10,000   Probit    GHK(50)   0.052 (0.009)  0.007 (0.004)  1.00 (0.10)  0.95 (0.01)  0.44 (0.01)
10,000   Probit    GHK(250)  0.053 (0.009)  0.008 (0.005)  0.99 (0.11)  0.94 (0.01)  0.93 (0.01)
100,000  Emulator  Emulator  0.015 (0.002)  0.001 (0.001)  0.97 (0.08)  0.94 (0.01)  1.80 (0.01)
100,000  Probit    Emulator  0.016 (0.002)  0.001 (0.001)  1.06 (0.07)  0.98 (0.01)  1.82 (0.02)
100,000  Probit    GHK(10)   0.020 (0.002)  0.007 (0.001)  0.83 (0.06)  0.83 (0.02)  0.90 (0.01)
100,000  Probit    GHK(50)   0.016 (0.002)  0.002 (0.001)  1.03 (0.08)  0.97 (0.01)  1.90 (0.02)
100,000  Probit    GHK(250)  0.016 (0.002)  0.001 (0.001)  1.06 (0.07)  0.98 (0.01)  7.22 (0.09)

Table 2: Comparison of Emulator and GHK estimation methods for multinomial probit models with K = 3. The emulator performs competitively with GHK(50) and GHK(250) and requires computation time between that of GHK(10) and GHK(50). Performance metrics are averaged across all parameters (coefficients, variances, and covariances) with Monte Carlo standard errors shown in parentheses.

The emulator methods perform well across all sample sizes, matching the statistical performance of GHK(50) and GHK(250) at a lower computational cost. GHK(10) significantly undercovers at n = 100,000, exhibiting substantially higher RMS Bias than the other methods. This poor performance is likely an artifact of the GHK simulator providing biased estimates on the log scale due to Jensen's inequality; for sufficiently large n, this bias is eventually nonnegligible relative to 1/√n sampling error.
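The log-scale bias can be illustrated with a toy stand-in for the simulator (our own example, not GHK itself): each draw below is an unbiased, strictly positive estimate of a probability p, yet the average of R draws is downward biased for log p, with bias of order 1/R.

```python
import torch

torch.manual_seed(0)
p, sigma = 0.3, 0.5
for R in (10, 50, 250):
    # Each entry is unbiased for p on the level scale (lognormal stand-in).
    draws = p * torch.exp(sigma * torch.randn(50_000, R) - 0.5 * sigma**2)
    bias = draws.mean(dim=1).log().mean() - torch.log(torch.tensor(p))
    print(f"R={R:4d}  E[log p_hat] - log p = {bias.item():+.5f}")  # ~ -(e^{sigma^2}-1)/(2R)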
Table 3 displays similar results for K = 5. The emulator is competitive with GHK(50) and GHK(250) at all sample sizes. Compared to GHK(50), the compute time for the Emulator method is slightly lower for n = 1,000, approximately equal for n = 10,000, and slightly higher for n = 100,000.

Table 4 displays performance results for K = 10. The emulator performs comparably to GHK(250) with much lower computational requirements. At n = 1,000, the emulator fits the model in less than 3 seconds on average, outpacing even GHK(10). The computation time of the emulator falls between that of GHK(10) and GHK(50) at n = 10,000 and between GHK(50) and GHK(250) at n = 100,000. Compared to Tables 2 and 3, the standard error estimates are less reliable for the smaller sample sizes, presumably due to weak identification; the problem is amplified when we omit the penalty on Σ. The standard errors improve substantially with the sample size (with or without penalization), reaching SE Ratios of 0.92–1.07 for the Emulator and GHK(250) methods at n = 100,000.

n        DGP       Method    RMSE           RMS Bias       SE Ratio     Coverage     Time (s)
1,000    Emulator  Emulator  0.243 (0.017)  0.032 (0.012)  1.11 (0.06)  0.94 (0.01)  0.41 (0.01)
1,000    Probit    Emulator  0.281 (0.020)  0.045 (0.016)  1.04 (0.04)  0.93 (0.01)  0.42 (0.00)
1,000    Probit    GHK(10)   0.278 (0.015)  0.074 (0.017)  0.95 (0.03)  0.89 (0.01)  0.29 (0.00)
1,000    Probit    GHK(50)   0.280 (0.017)  0.051 (0.017)  1.02 (0.04)  0.92 (0.01)  0.62 (0.01)
1,000    Probit    GHK(250)  0.284 (0.020)  0.046 (0.017)  1.03 (0.04)  0.93 (0.01)  2.02 (0.03)
10,000   Emulator  Emulator  0.086 (0.006)  0.010 (0.004)  0.99 (0.05)  0.95 (0.01)  2.27 (0.02)
10,000   Probit    Emulator  0.084 (0.005)  0.009 (0.004)  1.06 (0.04)  0.96 (0.01)  2.26 (0.02)
10,000   Probit    GHK(10)   0.095 (0.006)  0.032 (0.006)  0.88 (0.03)  0.89 (0.01)  0.89 (0.01)
10,000   Probit    GHK(50)   0.086 (0.006)  0.013 (0.005)  1.02 (0.04)  0.95 (0.01)  2.27 (0.03)
10,000   Probit    GHK(250)  0.084 (0.005)  0.009 (0.004)  1.06 (0.04)  0.96 (0.01)  4.95 (0.05)
100,000  Emulator  Emulator  0.028 (0.004)  0.003 (0.001)  0.97 (0.09)  0.95 (0.01)  18.09 (0.15)
100,000  Probit    Emulator  0.028 (0.002)  0.002 (0.001)  1.00 (0.06)  0.95 (0.01)  18.30 (0.17)
100,000  Probit    GHK(10)   0.056 (0.004)  0.024 (0.003)  0.49 (0.02)  0.60 (0.02)  3.75 (0.03)
100,000  Probit    GHK(50)   0.030 (0.002)  0.004 (0.002)  0.93 (0.05)  0.92 (0.01)  9.45 (0.11)
100,000  Probit    GHK(250)  0.028 (0.002)  0.002 (0.001)  1.00 (0.06)  0.95 (0.01)  42.96 (0.49)

Table 3: Comparison of Emulator and GHK estimation methods for multinomial probit models with K = 5. As in Table 2, the emulator performs competitively with GHK(50) and GHK(250) at a comparable or lower computational cost.
n        DGP       Method    RMSE           RMS Bias       SE Ratio     Coverage     Time (s)
1,000    Emulator  Emulator  0.324 (0.014)  0.195 (0.011)  1.56 (0.14)  0.93 (0.01)  2.24 (0.02)
1,000    Probit    Emulator  0.343 (0.016)  0.207 (0.012)  1.71 (0.24)  0.93 (0.01)  2.21 (0.02)
1,000    Probit    GHK(10)   0.388 (0.017)  0.217 (0.013)  0.35 (0.07)  0.22 (0.02)  5.86 (0.30)
1,000    Probit    GHK(50)   0.363 (0.016)  0.215 (0.013)  0.70 (0.07)  0.54 (0.03)  14.16 (0.95)
1,000    Probit    GHK(250)  0.351 (0.017)  0.214 (0.012)  1.43 (0.11)  0.91 (0.01)  14.87 (0.41)
10,000   Emulator  Emulator  0.172 (0.010)  0.073 (0.007)  1.74 (0.19)  0.96 (0.01)  19.28 (0.26)
10,000   Probit    Emulator  0.200 (0.011)  0.080 (0.010)  1.81 (0.20)  0.95 (0.01)  19.04 (0.19)
10,000   Probit    GHK(10)   0.236 (0.014)  0.123 (0.010)  1.57 (0.52)  0.79 (0.02)  8.95 (0.30)
10,000   Probit    GHK(50)   0.203 (0.010)  0.090 (0.010)  1.29 (0.07)  0.91 (0.01)  33.54 (0.83)
10,000   Probit    GHK(250)  0.198 (0.010)  0.079 (0.010)  1.41 (0.07)  0.95 (0.01)  50.22 (0.70)
100,000  Emulator  Emulator  0.069 (0.004)  0.013 (0.003)  1.07 (0.05)  0.95 (0.01)  165.59 (1.15)
100,000  Probit    Emulator  0.082 (0.005)  0.020 (0.004)  0.92 (0.03)  0.93 (0.01)  164.88 (1.49)
100,000  Probit    GHK(10)   0.139 (0.009)  0.073 (0.006)  0.48 (0.02)  0.60 (0.01)  31.48 (0.35)
100,000  Probit    GHK(50)   0.085 (0.005)  0.027 (0.004)  0.89 (0.03)  0.90 (0.01)  93.23 (0.98)
100,000  Probit    GHK(250)  0.081 (0.005)  0.020 (0.004)  0.97 (0.03)  0.93 (0.01)  404.22 (5.59)

Table 4: Comparison of Emulator and GHK estimation methods for multinomial probit models with K = 10. The emulator performs comparably to GHK(250) in terms of statistical performance but requires far less computation time. Compare to Tables 2 and 3.

n        DGP       Method    RMSE           RMS Bias       SE Ratio     Coverage     Time (s)
1,000    Emulator  Emulator  0.089 (0.004)  0.022 (0.006)  1.18 (0.05)  0.95 (0.01)  0.11 (0.00)
1,000    Probit    Emulator  0.095 (0.004)  0.007 (0.005)  1.10 (0.04)  0.96 (0.01)  0.10 (0.00)
1,000    Probit    GHK(10)   0.096 (0.004)  0.011 (0.005)  1.08 (0.04)  0.96 (0.01)  0.08 (0.00)
1,000    Probit    GHK(50)   0.094 (0.004)  0.007 (0.005)  1.11 (0.04)  0.96 (0.01)  0.15 (0.00)
1,000    Probit    GHK(250)  0.095 (0.004)  0.007 (0.005)  1.10 (0.04)  0.96 (0.01)  0.42 (0.01)
10,000   Emulator  Emulator  0.038 (0.002)  0.003 (0.002)  0.99 (0.04)  0.93 (0.01)  0.39 (0.01)
10,000   Probit    Emulator  0.038 (0.002)  0.007 (0.003)  1.00 (0.05)  0.96 (0.01)  0.38 (0.00)
10,000   Probit    GHK(10)   0.040 (0.002)  0.013 (0.002)  0.99 (0.05)  0.95 (0.01)  0.33 (0.00)
10,000   Probit    GHK(50)   0.038 (0.002)  0.008 (0.003)  1.00 (0.05)  0.96 (0.01)  0.48 (0.01)
10,000   Probit    GHK(250)  0.038 (0.002)  0.007 (0.003)  1.00 (0.05)  0.95 (0.01)  0.96 (0.01)
100,000  Emulator  Emulator  0.011 (0.001)  0.001 (0.001)  1.08 (0.05)  0.96 (0.01)  1.97 (0.02)
100,000  Probit    Emulator  0.011 (0.001)  0.001 (0.001)  1.04 (0.04)  0.96 (0.01)  1.91 (0.01)
100,000  Probit    GHK(10)   0.016 (0.001)  0.009 (0.001)  0.88 (0.04)  0.82 (0.02)  0.98 (0.01)
100,000  Probit    GHK(50)   0.012 (0.001)  0.002 (0.001)  1.04 (0.04)  0.95 (0.01)  2.07 (0.03)
100,000  Probit    GHK(250)  0.011 (0.001)  0.001 (0.001)  1.04 (0.05)  0.96 (0.01)  7.52 (0.09)

Table 5: Comparison of Emulator and GHK estimation methods for factor-structured multinomial probit models with K = 3. Performance metrics are averaged over the factor parameters (coefficients, uniquenesses, and factor loadings). The Emulator's performance mirrors that of GHK(250) at a much lower computational cost.

Table 5 displays analogous results for the Factor specification with K = 3. Performance metrics are averaged over the nonredundant elements of β, Ψ, and γ.
The Emulator's performance mirrors that of GHK(250) but at a much lower computational cost. Tables 6 and 7 display analogous results for 5 and 10 alternatives, respectively. In both cases, the Emulator performs comparably to GHK(50) and GHK(250) despite not being trained specifically for factor models. The Emulator is faster than GHK(50) for n ∈ {1,000, 10,000}. At n = 100,000, the Emulator's computation time falls between that of GHK(50) and GHK(250).

Across all scenarios, the Emulator method performs competitively with or significantly better than GHK. We expect that the relative performance of the emulator would increase with (a) additional training and (b) implementation on specialized deep-learning hardware, such as GPUs. The latter is likely to be especially advantageous for the emulator because its computations are trivially parallelizable and highly optimized on modern GPUs.

n        DGP       Method    RMSE           RMS Bias       SE Ratio     Coverage     Time (s)
1,000    Emulator  Emulator  0.128 (0.004)  0.016 (0.005)  1.20 (0.04)  0.97 (0.01)  0.32 (0.01)
1,000    Probit    Emulator  0.131 (0.003)  0.012 (0.004)  1.19 (0.03)  0.97 (0.01)  0.32 (0.00)
1,000    Probit    GHK(10)   0.134 (0.004)  0.017 (0.005)  1.12 (0.03)  0.95 (0.01)  0.23 (0.00)
1,000    Probit    GHK(50)   0.131 (0.003)  0.012 (0.004)  1.19 (0.03)  0.98 (0.01)  0.49 (0.01)
1,000    Probit    GHK(250)  0.132 (0.003)  0.012 (0.004)  1.19 (0.03)  0.97 (0.01)  1.61 (0.02)
10,000   Emulator  Emulator  0.059 (0.002)  0.007 (0.003)  1.01 (0.04)  0.95 (0.01)  1.92 (0.02)
10,000   Probit    Emulator  0.059 (0.003)  0.005 (0.002)  1.04 (0.04)  0.95 (0.01)  1.97 (0.02)
10,000   Probit    GHK(10)   0.062 (0.003)  0.011 (0.003)  0.93 (0.04)  0.93 (0.01)  0.77 (0.01)
10,000   Probit    GHK(50)   0.059 (0.003)  0.004 (0.003)  1.02 (0.04)  0.95 (0.01)  2.02 (0.03)
10,000   Probit    GHK(250)  0.059 (0.003)  0.007 (0.003)  1.04 (0.04)  0.95 (0.01)  4.67 (0.08)
100,000  Emulator  Emulator  0.019 (0.001)  0.002 (0.001)  1.02 (0.03)  0.96 (0.01)  15.46 (0.18)
100,000  Probit    Emulator  0.020 (0.001)  0.006 (0.001)  1.04 (0.03)  0.94 (0.01)  15.58 (0.18)
100,000  Probit    GHK(10)   0.029 (0.001)  0.017 (0.001)  0.75 (0.03)  0.75 (0.02)  3.21 (0.04)
100,000  Probit    GHK(50)   0.019 (0.001)  0.004 (0.001)  1.02 (0.03)  0.95 (0.01)  8.47 (0.12)
100,000  Probit    GHK(250)  0.019 (0.001)  0.002 (0.001)  1.04 (0.03)  0.96 (0.01)  37.30 (0.59)

Table 6: Comparison of Emulator and GHK estimation methods for factor-structured multinomial probit models with K = 5. As in Table 5, the emulator performs competitively with GHK(50) and GHK(250) at a comparable or lower computational cost.
n        DGP       Method    RMSE           RMS Bias       SE Ratio     Coverage     Time (s)
1,000    Emulator  Emulator  0.188 (0.004)  0.023 (0.004)  1.14 (0.02)  0.96 (0.00)  1.06 (0.01)
1,000    Probit    Emulator  0.198 (0.005)  0.030 (0.005)  1.10 (0.02)  0.96 (0.01)  1.01 (0.01)
1,000    Probit    GHK(10)   0.220 (0.005)  0.059 (0.006)  0.92 (0.02)  0.90 (0.01)  0.77 (0.01)
1,000    Probit    GHK(50)   0.205 (0.005)  0.039 (0.005)  1.06 (0.02)  0.95 (0.01)  2.44 (0.03)
1,000    Probit    GHK(250)  0.199 (0.005)  0.030 (0.005)  1.11 (0.02)  0.96 (0.01)  5.93 (0.08)
10,000   Emulator  Emulator  0.080 (0.002)  0.011 (0.003)  1.24 (0.04)  0.98 (0.00)  7.58 (0.06)
10,000   Probit    Emulator  0.082 (0.002)  0.012 (0.003)  1.22 (0.04)  0.97 (0.01)  7.48 (0.06)
10,000   Probit    GHK(10)   0.103 (0.002)  0.049 (0.003)  0.95 (0.03)  0.88 (0.01)  4.03 (0.04)
10,000   Probit    GHK(50)   0.087 (0.002)  0.019 (0.003)  1.14 (0.04)  0.96 (0.01)  14.83 (0.16)
10,000   Probit    GHK(250)  0.084 (0.002)  0.012 (0.003)  1.20 (0.04)  0.97 (0.01)  24.85 (0.25)
100,000  Emulator  Emulator  0.043 (0.002)  0.004 (0.002)  0.94 (0.04)  0.93 (0.01)  69.83 (0.57)
100,000  Probit    Emulator  0.036 (0.001)  0.004 (0.001)  1.11 (0.04)  0.97 (0.01)  69.05 (0.66)
100,000  Probit    GHK(10)   0.071 (0.002)  0.047 (0.002)  0.65 (0.02)  0.60 (0.02)  19.73 (0.20)
100,000  Probit    GHK(50)   0.039 (0.001)  0.012 (0.001)  1.04 (0.04)  0.93 (0.01)  55.94 (0.75)
100,000  Probit    GHK(250)  0.036 (0.001)  0.004 (0.001)  1.12 (0.05)  0.96 (0.01)  218.97 (2.94)

Table 7: Comparison of Emulator and GHK estimation methods for factor-structured multinomial probit models with K = 10. The emulator performs comparably to GHK(250) in terms of statistical performance but requires far less computation time. Compare to Tables 5 and 6.

6.3 Multi-K Training Accuracy

We conclude this section by providing numerical results from training a neural network emulator to approximate choice probabilities for K ∈ {3, 4, 5} simultaneously. We use the same neural network as we used in the previous section for K = 5. As above, we train for 400,000 episodes, but we loop over training examples with K = 3, 4, and 5 alternatives within each episode. The training process completes in 13.0 hours.

[Figure 3: two panels, (a) training and evaluation cross-entropy loss (log scale) against episode number; (b) estimation error (P − P̂) against estimated probability (P̂) for K = 3, 4, 5.]

Figure 3: (a) Training and evaluation loss over 400,000 training episodes for emulators with K ∈ {3, 4, 5} alternatives. The training loss (blue) decreases consistently throughout training, and the evaluation loss (orange) closely tracks the training loss, indicating minimal overfitting. (b) Emulator probability estimates compared to estimation errors. The emulator simultaneously learns the choice probabilities for K = 3, 4, 5 with small error.

Figure 3a shows the training and evaluation loss steadily decreasing throughout the training procedure, approaching levels similar to those of Figure 2. Figure 3b plots the estimated choice probabilities against estimation errors for K ∈ {3, 4, 5} with 400 simulated examples each. The small errors indicate that the model accurately approximates the choice probabilities across all three values of K simultaneously.
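Below is a heavily simplified sketch of this multi-K training loop (our own illustration: a stand-in MLP over padded features replaces the actual DeepSet/equivariant architecture and Sobolev penalty, random placeholder targets replace the pregenerated examples, and Adam stands in for the optimizer detailed in Appendix C.1). Only the looping over K ∈ {3, 4, 5} within each episode reflects the procedure described above.

```python
import torch
from torch import nn

class StandInEmulator(nn.Module):
    def __init__(self, max_K=5, hidden=64):
        super().__init__()
        d = max_K + max_K * (max_K + 1) // 2   # padded (v, upper-tri Sigma) features
        self.net = nn.Sequential(nn.Linear(d, hidden), nn.Tanh(),
                                 nn.Linear(hidden, max_K))

    def forward(self, v, Sigma):               # v: (B, K), Sigma: (B, K, K)
        B, K = v.shape
        iu = torch.triu_indices(K, K)
        feats = torch.cat([v, Sigma[:, iu[0], iu[1]]], dim=1)
        feats = nn.functional.pad(feats, (0, self.net[0].in_features - feats.shape[1]))
        return self.net(feats)[:, :K].log_softmax(dim=1)  # mask logits to active K

model = StandInEmulator()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for episode in range(100):                     # 400,000 episodes in the paper
    for K in (3, 4, 5):                        # loop over K within each episode
        v = torch.randn(32, K)
        A = torch.randn(32, K, K)
        Sigma = A @ A.transpose(1, 2)          # placeholder PSD scale matrices
        target = torch.softmax(torch.randn(32, K), dim=1)  # placeholder probabilities
        loss = -(target * model(v, Sigma)).sum(dim=1).mean()  # cross-entropy loss
        opt.zero_grad(); loss.backward(); opt.step()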
7 Conclusion

This paper proposes an amortized inference approach for discrete choice models. The approach uses a novel neural network architecture to approximate choice probabilities under general error distributions, including those featuring nontrivial cross-alternative correlations, such as MNP models. This generality enables flexible modeling of empirical choice behavior, including complex substitution patterns, without losing the theoretical grounding or interpretability of RUM models.

Because the emulator operates on the generic choice problem, it is agnostic to the assumed parametric form of the deterministic utilities or the accompanying scale (covariance) matrix. Given a pretrained emulator, this generality enables drop-in adoption without the need for custom architecture changes or additional training. In particular, generalizing from logit to probit requires only the replacement of closed-form softmax probabilities with emulator evaluations. Other correlated error distributions are similarly straightforward.

Our proposed architecture and training process feature a preprocessing transformation, DeepSet modules (Zaheer et al., 2017), equivariant layers, smooth activation functions, and Sobolev training (Czarnecki et al., 2017). Together, these design choices produce smooth approximations of choice probabilities that respect their invariance properties and accelerate learning via parameter sharing. We show that the architecture possesses a universal approximation property, extending recent theoretical results for symmetric matrices (Blum-Smith et al., 2025) to the joint space of utility vectors and centered scale (covariance) matrices.

In Monte Carlo simulations, we find that emulator-based estimators perform as well as or better than GHK-based estimators in terms of both statistical performance and computational requirements when estimating MNP models. These findings support our theoretical results showing that emulator-based estimators inherit the properties of exact MLEs under mild approximation conditions. We focus on MNP models in the simulation study to enable comparison to the GHK simulator, which is directly applicable only to Gaussian errors. The computational benefits of the amortized inference approach are amplified by the fact that emulator evaluations are trivially parallelizable and highly optimized on modern deep learning hardware, such as GPUs.

Several limitations merit discussion. Practitioners must ensure sufficient overlap between the distribution of inputs in their application and the training distribution. Additionally, our asymptotic theory requires the emulator approximation error to vanish at an appropriate rate, meaning applications with very large samples may require more extensively trained emulators. Lastly, models with correlated errors require identifying restrictions not needed for independent logit-based models. Although our approach enables simple and efficient computation of choice probabilities for these models, the data requirements and modeling decisions needed to ensure proper identification remain a separate challenge.

The framework also allows numerous extensions. Our emulator architecture and training procedure could be specialized to specific settings, such as factor-structured scale matrices or mixture distributions. We expect these could be handled by reconfiguring the modules in our architecture and, in some cases, generalizing the preprocessing transformation.
In particular, the two preceding examples would likely require an additional DeepSet operation over factors or mixture components. Our emulators could also be used in tandem with other modern choice modeling architectures, such as those described in Section 2. Previous work has applied neural-network-assisted estimation approaches to other econometric models with intractable likelihood functions, such as dynamic discrete choice (Norets, 2012) and consumer search (Wei and Jiang, 2025), in the context of specific parametric models. The amortized inference approach presented here could likely be generalized to many of these settings, enabling efficient amortized inference without modifying or retraining the neural network.

Acknowledgments

The authors acknowledge the use of AI tools to assist in the development of this manuscript. All content has been manually reviewed for accuracy.

References

Aouad, A. and Desir, A. (2025). Representing random utility choice models with neural networks. Management Science, Articles in Advance.

Bagheri, N., Ghasri, M., and Barlow, M. (2025). A neural estimation framework for discrete choice models with arbitrary error distributions. Journal of Choice Modelling, 57:100583.

Bentz, Y. and Merunka, D. (2000). Neural networks and the multinomial logit for brand choice modelling: a hybrid approach. Journal of Forecasting, 19(3):177–200.

Berbeglia, G. and Venkataraman, A. (2025). The generalized stochastic preference choice model. arXiv preprint arXiv:1803.04244.

Bloch, H. and Marschak, J. (1960). Stochastic rationality and revealed stochastic preference. In Contributions to Probability and Statistics. Stanford University Press.

Blum-Smith, B., Huang, N. T., Cuturi, M., and Villar, S. (2025). Functions on symmetric matrices and point clouds via lightweight invariant features from Galois theory. SIAM Journal on Applied Algebra and Geometry, 9(4):902–938.

Bredon, G. E. (1972). Introduction to Compact Transformation Groups, volume 46. Academic Press, New York.

Cai, Z., Wang, H., Talluri, K., and Li, X. (2022). Deep learning for choice modeling. arXiv preprint arXiv:2208.09325.

Chen, N., Gallego, G., and Tang, Z. (2025). The use of binary choice forests to model and estimate discrete choices. arXiv preprint arXiv:1908.01109.

Chen, Y.-C. and Mišić, V. V. (2022). Decision forest: A nonparametric approach to modeling irrational choice. Management Science, 68(10):7090–7111.

Cocola, J. and Hand, P. (2020). Global convergence of Sobolev training for overparameterized neural networks. In International Conference on Machine Learning, Optimization, and Data Science, pages 574–586. Springer.

Cranmer, K., Brehmer, J., and Louppe, G. (2020). The frontier of simulation-based inference. Proceedings of the National Academy of Sciences, 117(48):30055–30062.

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314.

Czarnecki, W. M., Osindero, S., Jaderberg, M., Swirszcz, G., and Pascanu, R. (2017). Sobolev training for neural networks. In Advances in Neural Information Processing Systems, volume 30, pages 4278–4287.

De Ryck, T., Lanthaler, S., and Mishra, S. (2021). On the approximation of functions by tanh neural networks. Neural Networks, 143:732–750.

Dugas, C., Bengio, Y., Bélisle, F., Nadeau, C., and Garcia, R. (2000). Incorporating second-order functional knowledge for better option pricing. In Leen, T., Dietterich, T., and Tresp, V., editors, Advances in Neural Information Processing Systems, volume 13. MIT Press.
E, W. and Wang, Q. (2018). Exponential convergence of the deep neural network approximation for analytic functions. Science China Mathematics, 61(10):1733–1740.

Farrell, M. H., Liang, T., and Misra, S. (2021). Deep neural networks for estimation and inference. Econometrica, 89(1):181–213.

Geweke, J. (1989). Bayesian inference in econometric models using Monte Carlo integration. Econometrica, 57(6):1317–1339.

Hajivassiliou, V. A. and McFadden, D. L. (1998). The method of simulated scores for the estimation of LDV models. Econometrica, 66(4):863–896.

Han, Y., Pereira, F. C., Ben-Akiva, M., and Zegras, C. (2022). A neural-embedded choice model: Learning taste representation with strengthened interpretability. Transportation Research Part B: Methodological, 163:166–186.

Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366.

Hruschka, H. (2007). Using a heterogeneous multinomial probit model with a neural net extension to model brand choice. Journal of Forecasting, 26(2):113–127.

Jang, E., Gu, S., and Poole, B. (2017). Categorical reparameterization with Gumbel–softmax. In International Conference on Learning Representations.

Jiang, Z., Li, J., and Zhang, D. (2025). A high-dimensional choice model for online retailing. Management Science, 71(4):3320–3339.

Keane, M. P. (1992). A note on identification in the multinomial probit model. Journal of Business & Economic Statistics, 10(2):193–200.

Keane, M. P. (1994). A computationally practical simulation estimator for panel data. Econometrica, 62(1):95–116.

Kingma, D. P. and Ba, J. L. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations.

Lhéritier, A., Bocamazo, M., Delahaye, T., and Acuna-Agost, R. (2019). Airline itinerary choice modeling using machine learning. Journal of Choice Modelling, 31:198–209.

Liu, D. C. and Nocedal, J. (1989). On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1):503–528.

Loshchilov, I. and Hutter, F. (2019). Decoupled weight decay regularization. In International Conference on Learning Representations.

Lueckmann, J.-M., Bassetto, G., Karaletsos, T., and Macke, J. H. (2019). Likelihood-free inference with emulator networks. In Ruiz, F., Zhang, C., Liang, D., and Bui, T., editors, Proceedings of The 1st Symposium on Advances in Approximate Bayesian Inference, volume 96 of Proceedings of Machine Learning Research, pages 32–53. PMLR.

Maddison, C. J., Mnih, A., and Teh, Y. W. (2017). The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations.

McCulloch, R. and Rossi, P. E. (1994). An exact likelihood analysis of the multinomial probit model. Journal of Econometrics, 64(1):207–240.

McFadden, D. (1974). Conditional logit analysis of qualitative choice behavior. In Zarembka, P., editor, Frontiers in Econometrics, pages 105–142. Academic Press, New York.

McFadden, D. and Richter, M. (1990). Stochastic rationality and revealed stochastic preference.
In Preferences, Uncertainty, and Optimality: Essays in Honor of Leo Hurwicz. Westview Press.

McFadden, D. and Train, K. (2000). Mixed MNL models for discrete response. Journal of Applied Econometrics, 15(5):447–470.

Mityagin, B. S. (2020). The zero set of a real analytic function. Mathematical Notes, 107(3):529–530.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

Newey, W. K. and McFadden, D. (1994). Large sample estimation and hypothesis testing. In Engle, R. F. and McFadden, D. L., editors, Handbook of Econometrics, volume 4, pages 2111–2245. Elsevier.

Norets, A. (2012). Estimation of dynamic discrete choice models using artificial neural network approximations. Econometric Reviews, 31(1):84–106.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32.

Pfannschmidt, K., Gupta, P., Haddenhorst, B., and Hüllermeier, E. (2022). Learning context-dependent choice functions. International Journal of Approximate Reasoning, 140:116–155.

Ramachandran, P., Zoph, B., and Le, Q. V. (2018). Searching for activation functions. In International Conference on Learning Representations.

Rosenfeld, N., Oshiba, K., and Singer, Y. (2020). Predicting choice with set-dependent aggregation. In III, H. D. and Singh, A., editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 8220–8229. PMLR.

Schmidt-Hieber, J. (2020). Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics, 48(4):1875–1897.

Segol, N. and Lipman, Y. (2020). On universal equivariant set networks. In International Conference on Learning Representations.

Shen, X., Jiang, C., Sakhanenko, L., and Lu, Q. (2023). Asymptotic properties of neural network sieve estimators. Journal of Nonparametric Statistics, 35(4):839–868.

Sifringer, B., Lurkin, V., and Alahi, A. (2020). Enhancing discrete choice models with representation learning. Transportation Research Part B: Methodological, 140:236–261.

Singh, A., Liu, Y., and Yoganarasimhan, H. (2023). Choice models and permutation invariance: Demand estimation in differentiated products markets. arXiv preprint arXiv:2307.07090.

Train, K. E. (2009). Discrete Choice Methods with Simulation. Cambridge University Press, Cambridge, 2nd edition.

van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, UK.

Wang, F. and Ross, C. L. (2018). Machine learning travel mode choices: Comparing the performance of an extreme gradient boosting model with a multinomial logit model. Transportation Research Record, 2672(47):35–45.

Wang, S., Mo, B., and Zhao, J. (2020). Deep neural networks for choice analysis: Architecture design with alternative-specific utility functions.
Transportation Research Part C: Emerging Technologies, 112:234–251.

Wang, S., Mo, B., and Zhao, J. (2021). Theory-based residual neural networks: A synergy of discrete choice models and deep neural networks. Transportation Research Part B: Methodological, 146:333–358.

Wei, Y. M. and Jiang, Z. (2025). Estimating parameters of structural models using neural networks. Marketing Science, 44(1):102–128.

White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50(1):1–25.

Wong, M. and Farooq, B. (2021). ResLogit: A residual neural network logit model for data-driven choice modelling. Transportation Research Part C: Emerging Technologies, 126:103050.

Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. (2017). Deep sets. In Advances in Neural Information Processing Systems, volume 30.

Supplementary Materials for "Amortized Inference for Correlated Discrete Choice Models via Equivariant Neural Networks"

Easton Huch, Postdoctoral Fellow, Johns Hopkins Carey Business School
Michael Keane, Carey Distinguished Professor, Johns Hopkins Carey Business School

Appendix A presents results and proofs regarding generic separation and universal approximation. Section A.8 proves the results given in Section 5.1. Appendix B provides proofs of the estimation theory results in Section 5.2. Appendix C provides details on the simulation study.

A Generic Separation and Universal Approximation

This appendix develops the group-theoretic foundations underlying the per-alternative encoder architecture presented in the main text. We establish that certain invariants generically separate orbits under permutation groups, yielding universal approximation results via the Stone–Weierstrass theorem.

Let S_K denote the space of symmetric real-valued K × K matrices. We consider an extended group G_K acting on S_K by permutation, scaling, and translation by matrices in the null space of double-centering. Section A.1 introduces the necessary notation and establishes basic properties of the centering matrix and the normalized double-centering function f. Sections A.2–A.4 establish that the composition h = g ∘ f, which first applies normalized double-centering and then extracts multisets of diagonal and off-diagonal elements, generically separates G_K-orbits. The exceptional set where separation fails is closed, G_K-invariant, and has measure zero.

Section A.5 considers the restriction to the centered, trace-normalized subspace S*_K = Im(f), where g generically separates S_K-orbits and the Stone–Weierstrass theorem yields universal approximation. Section A.6 develops a variant for a distinguished index j, showing that an invariant g_j generically separates orbits under S^{(j)}_{K−1}, the subgroup fixing j. Sections A.7–A.8 extend the theory to incorporate deterministic utilities, culminating in proofs of Theorem 1 and Theorem 2 from the main text.

Throughout, X and Y represent scale matrices. As in Section 4.3, we sometimes refer to these matrices as covariance matrices although this interpretation is not strictly correct outside of MNP models.

A.1 Preliminaries

Throughout, let K ≥ 2 be the number of alternatives. Let I_K denote the K × K identity matrix and 1_K ∈ R^K the vector of all ones.

Definition 1 (Centering Matrix). The centering matrix M ∈ R^{K×K} is defined as

\[
M = I_K - \frac{1}{K}\mathbf{1}_K \mathbf{1}_K^\top.
\]
Proposition 2 (Properties of the Centering Matrix). The centering matrix M is symmetric (M^⊤ = M), idempotent (M² = M), and annihilates constant vectors (M 1_K = 0).

Proof. Symmetry is immediate. For idempotency,

\[
M^2 = I_K - \frac{2}{K}\mathbf{1}_K\mathbf{1}_K^\top + \frac{1}{K^2}\mathbf{1}_K(\mathbf{1}_K^\top \mathbf{1}_K)\mathbf{1}_K^\top
    = I_K - \frac{2}{K}\mathbf{1}_K\mathbf{1}_K^\top + \frac{1}{K}\mathbf{1}_K\mathbf{1}_K^\top = M.
\]

Finally, M 1_K = 1_K − (1/K) 1_K (1_K^⊤ 1_K) = 1_K − 1_K = 0. ∎

Definition 2 (Double-Centering Function). Define the domain U = {X ∈ S_K : tr(MXM) ≠ 0} and the function f : U → S_K by

\[
f(X) = \frac{K \cdot MXM}{\operatorname{tr}(MXM)}.
\]

Proposition 3 (Explicit Formula). For X ∈ S_K, let X̄_{i•} = (1/K)∑_{k=1}^K X_{ik}, X̄_{•j} = (1/K)∑_{k=1}^K X_{kj}, and X̄_{••} = (1/K²)∑_{i,j} X_{ij}. Then:
(i) [MXM]_{ij} = X_{ij} − X̄_{i•} − X̄_{•j} + X̄_{••}.
(ii) tr(MXM) = tr(X) − K X̄_{••}.

Proof. Part (i) follows by direct expansion. For part (ii), use idempotency and the cyclic property of the trace: tr(MXM) = tr(M²X) = tr(MX) = tr(X) − (1/K) 1_K^⊤ X 1_K = tr(X) − K X̄_{••}. ∎

Proposition 4 (Domain Properties). The set S_K \ U = {X ∈ S_K : tr(MXM) = 0} is a hyperplane in S_K with Lebesgue measure zero. The domain U is open and G_K-invariant.

Proof. By Proposition 3(ii), tr(MXM) is a nontrivial linear function, so its zero set is a measure-zero hyperplane. Openness follows from continuity. For G_K-invariance: scaling by c ≠ 0 gives tr(M(cX)M) = c·tr(MXM) ≠ 0; null-space translation by N = u 1_K^⊤ + 1_K u^⊤ satisfies MNM = 0 since M 1_K = 0; permutation preserves the trace by the cyclic property. ∎

Definition 3 (Extended Group Action). Let S_K denote the symmetric group on K elements with associated permutation matrices P_π, and let R* = R \ {0}. Define the extended group G_K = R^K ⋊ (R* × S_K), which acts on S_K by

\[
(c, \mathbf{u}, \pi) \cdot X = c\, P_\pi X P_\pi^\top + \mathbf{u}\mathbf{1}_K^\top + \mathbf{1}_K\mathbf{u}^\top.
\]

Proposition 5 (f and G_K). The function f satisfies:
(i) Equivariance under permutation: f(P_π X P_π^⊤) = P_π f(X) P_π^⊤.
(ii) Invariance under scaling: f(cX) = f(X) for c ≠ 0.
(iii) Invariance under null-space translation: f(X + u 1_K^⊤ + 1_K u^⊤) = f(X).

Proof. For (i), since P_π 1_K = 1_K, we have P_π M = M P_π, so M(P_π X P_π^⊤)M = P_π MXM P_π^⊤, and the trace is preserved. Part (ii) follows from f(cX) = cK·MXM / {c·tr(MXM)} = f(X). For (iii), M(u 1_K^⊤ + 1_K u^⊤)M = 0 since M 1_K = 0. ∎

Lemma 1 (Preimage Characterization). If f(X) = f(Y) for X, Y ∈ U, then there exist c ∈ R* and w ∈ R^K such that X = cY + w 1_K^⊤ + 1_K w^⊤.

Proof. Let t_X = tr(MXM) and t_Y = tr(MYM). The assumption f(X) = f(Y) implies MXM = c·MYM, where c = t_X/t_Y ∈ R*. For symmetric Z, let µ_Z = Z 1_K / K (row averages) and m_Z = 1_K^⊤ Z 1_K / K² (grand mean). Expanding MZM = Z − µ_Z 1_K^⊤ − 1_K µ_Z^⊤ + m_Z 1_K 1_K^⊤ and applying this to both sides:

\[
X = cY + (\boldsymbol{\mu}_X - c\boldsymbol{\mu}_Y)\mathbf{1}_K^\top + \mathbf{1}_K(\boldsymbol{\mu}_X - c\boldsymbol{\mu}_Y)^\top - (m_X - c\,m_Y)\mathbf{1}_K\mathbf{1}_K^\top.
\]

Setting w = µ_X − cµ_Y − {(m_X − c m_Y)/2} 1_K gives the result. ∎

Definition 4 (Centered, Trace-Normalized Subspace). Define S*_K = {Y ∈ S_K : Y 1_K = 0, tr(Y) = K}.

Definition 5 (Invariant Function g). For Y ∈ S*_K, define g(Y) = (D_K(Y), O_K(Y)), where D_K(Y) = {{Y_ii : i = 1, ..., K}} is the multiset of diagonal entries and O_K(Y) = {{Y_ij : 1 ≤ i < j ≤ K}} is the multiset of off-diagonal entries.
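The following quick numerical check (purely illustrative, not part of the formal development) verifies the properties in Propositions 2 and 5 for a random symmetric matrix.

```python
import torch

K = 5
I, one = torch.eye(K), torch.ones(K, 1)
M = I - one @ one.T / K                        # centering matrix

def f(X):
    Z = M @ X @ M
    return K * Z / torch.trace(Z)              # defined when tr(MXM) != 0

X = torch.randn(K, K); X = (X + X.T) / 2       # random symmetric matrix (in U a.s.)
u = torch.randn(K, 1)
assert torch.allclose(M @ M, M, atol=1e-6)                     # idempotent
assert torch.allclose(M @ one, torch.zeros(K, 1), atol=1e-6)   # annihilates 1_K
assert torch.allclose(f(3.7 * X), f(X), atol=1e-5)             # scale invariance
assert torch.allclose(f(X + u @ one.T + one @ u.T), f(X), atol=1e-5)  # null-space shift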
A.2 Generic Injectivity

We now define the genericity conditions under which g separates orbits.

Definition 6 (Genericity Conditions). Let Y ∈ S_K. We say Y satisfies:
(a) Condition (a) if all entries are distinct: Y_ij ≠ Y_kl for (i, j) ≠ (k, l) with i ≤ j, k ≤ l.
(b) Condition (b) if for each i, there exists exactly one (K−1)-subset S of off-diagonal positions such that ∑_{(k,l)∈S} Y_kl = −Y_ii.

Condition (b) exploits the row-sum constraint Y 1_K = 0: the off-diagonal entries in row i sum to −Y_ii, so the canonical subset S*_i = {(i, j) : j ≠ i} always satisfies the equation. Condition (b) requires this subset to be unique among the $\binom{K(K-1)/2}{K-1}$ possible (K−1)-subsets.

Lemma 2 (Recovery Lemma). If Y ∈ S*_K satisfies conditions (a) and (b), then g(Y) determines Y up to S_K-orbit.

Proof. Given multisets D_K and O_K, we reconstruct Y up to relabeling:
Step 1: Order the distinct diagonal elements as Y_{(1)(1)} > ··· > Y_{(K)(K)}, assigning canonical labels to alternatives.
Step 2: For each (i), identify R_{(i)} ⊂ O_K as the unique (K−1)-subset summing to −Y_{(i)(i)} (it exists by condition (b)).
Step 3: For i ≠ j, recover Y_{(i)(j)} as the unique element in R_{(i)} ∩ R_{(j)} (a singleton by condition (a)).
Any Y′ with g(Y′) = g(Y) satisfying (a) and (b) yields the same reconstruction up to relabeling. ∎

A.3 The Exceptional Set

Definition 7 (Exceptional Set). Define E = U^C ∪ {X ∈ U : f(X) violates (a) or (b)}.

Lemma 3 (Non-Triviality of Constraints). The following hold:
(i) For distinct positions (i, j) ≠ (k, l), the function f(X)_ij − f(X)_kl is not identically zero on U.
(ii) For each index i and (K−1)-subset S ≠ S*_i, the function ∑_{(k,l)∈S} f(X)_kl + f(X)_ii is not identically zero on U.

Proof. Consider diagonal X = diag(λ_1, ..., λ_K) with distinct λ_j and ∑_j λ_j ≠ 0. Then, writing λ̄ = (1/K)∑_j λ_j, we have [MXM]_ii = λ_i(1 − 2/K) + λ̄/K and [MXM]_ij = (λ̄ − λ_i − λ_j)/K for i ≠ j. For generic λ_j, all entries of f(X) are distinct, proving (i). For (ii), the linear combination for S ≠ S*_i has a nonzero coefficient on λ_i, so it is nontrivial. ∎

Proposition 6 (Properties of E). The set E is closed, G_K-invariant, and has Lebesgue measure zero.

Proof. Closed: Conditions (a) and (b) are open (strict inequalities and unique subset-sums are preserved under small perturbations), so E^C is open. G_K-invariant: By Proposition 5, scaling and null-space translation leave f(X) unchanged. Permutation conjugates f(X), preserving distinctness and subset-sum structure. Measure zero: Each entry f(X)_ij is a rational (hence analytic) function on U by Proposition 3. By Lemma 3, the constraints defining violations of (a) and (b) are nontrivial analytic functions. Their zero sets have measure zero by Mityagin (2020), and E is a finite union of such sets. ∎

A.4 Basic Result for the Full Space

Theorem 7 (Main Theorem). For any X, Y ∈ S_K \ E: g{f(X)} = g{f(Y)} implies that X and Y are in the same G_K-orbit.

Proof. By Lemma 2, g{f(X)} = g{f(Y)} implies f(Y) = P_π f(X) P_π^⊤ for some π ∈ S_K. By Proposition 5(i), f(P_π X P_π^⊤) = f(Y). By Lemma 1, Y = c P_π X P_π^⊤ + w 1_K^⊤ + 1_K w^⊤ for some c ∈ R*, w ∈ R^K. ∎
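To make the recovery argument underlying these results concrete, here is a brute-force sketch of the three steps in the proof of Lemma 2 (our own illustration; the subset search is exponential in K and intended only to mirror the argument, not for practical use).

```python
from itertools import combinations

def recover_up_to_relabeling(diag_multiset, off_multiset, tol=1e-9):
    """Rebuild Y from g(Y) when conditions (a) and (b) hold."""
    K = len(diag_multiset)
    d = sorted(diag_multiset, reverse=True)          # Step 1: canonical labels
    rows = []
    for i in range(K):                               # Step 2: unique subset sums
        hits = [set(S) for S in combinations(range(len(off_multiset)), K - 1)
                if abs(sum(off_multiset[k] for k in S) + d[i]) < tol]
        assert len(hits) == 1, "condition (b) violated"
        rows.append(hits[0])
    Y = [[0.0] * K for _ in range(K)]
    for i in range(K):
        Y[i][i] = d[i]
        for j in range(i + 1, K):                    # Step 3: singleton intersections
            (k,) = rows[i] & rows[j]
            Y[i][j] = Y[j][i] = off_multiset[k]
    return Y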
A.5 Restriction to Centered, Trace-Normalized Matrices

Proposition 7 (Properties of S*_K). The subspace S*_K is a nonempty, closed affine subspace of dimension (K+1)(K−2)/2, satisfies S*_K = Im(f), and is S_K-invariant.

Proof. The dimension follows from dim(S_K) − (K+1) = K(K+1)/2 − K − 1 = (K+1)(K−2)/2. For S*_K = Im(f): if X ∈ U, then f(X) 1_K = 0 and tr(f(X)) = K, so Im(f) ⊆ S*_K; conversely, if Y ∈ S*_K, then MYM = Y and f(Y) = Y. For S_K-invariance: (P_π Y P_π^⊤) 1_K = P_π Y 1_K = 0, and the trace is preserved. ∎

Definition 8 (Restricted Exceptional Set). Define E* = {Y ∈ S*_K : Y violates (a) or (b)}.

Theorem 8 (Generic Separation on S*_K). The set E* is closed, S_K-invariant, and has measure zero in S*_K. For any Y_1, Y_2 ∈ S*_K \ E*: g(Y_1) = g(Y_2) implies that Y_1 and Y_2 are in the same S_K-orbit.

Proof. The properties of E* follow from Proposition 6 restricted to S*_K, using Lemma 3 (which provides nontrivial constraints via S*_K = Im(f)). Separation follows from Lemma 2. ∎

Corollary 2 (Universal Approximation on S*_K). Any continuous, S_K-invariant function on S*_K \ E* can be uniformly approximated on compact subsets by an MLP taking the sorted diagonal and off-diagonal entries as input.

Proof. Following Blum-Smith et al. (2025), encode g via continuous functions: g^d_k(Y) = the kth largest diagonal entry, and g^o_ℓ(Y) = the ℓth largest off-diagonal entry. These are S_K-invariant and separate points in (S*_K \ E*)/S_K by Theorem 8. Since S_K is finite (hence compact), the quotient is Hausdorff (Bredon, 1972). By Stone–Weierstrass, the algebra generated by these functions is dense. By the universal approximation theorem (Cybenko, 1989; Hornik et al., 1989), MLPs can approximate any continuous function of these inputs. ∎

A.6 Restriction to Submatrices

For a distinguished index j, we develop an invariant separating S^{(j)}_{K−1}-orbits, where S^{(j)}_{K−1} = {π ∈ S_K : π(j) = j}.

Definition 9 (Submatrix Invariant g_j). For Y ∈ S*_K and index j, define g_j(Y) = (P_j(Y), O_j(Y)), where

P_j(Y) = {{(Y_kk, Y_jk) : k ≠ j}},  O_j(Y) = {{Y_kl : 1 ≤ k < l ≤ K, k ≠ j, l ≠ j}}.

The multiset P_j couples each diagonal entry Y_kk (for k ≠ j) with the covariance Y_jk. The multiset O_j contains the off-diagonal entries of Y_{−j,−j}. Note that Y_jj = K − ∑_{k≠j} Y_kk is implicitly recoverable.

Definition 10 (Genericity Conditions for Submatrices). We say Y ∈ S*_K satisfies (a_j) if all entries except possibly Y_jj are distinct, and (b_j) if for each k ≠ j, there is a unique (K−2)-subset S of off-diagonal positions in Y_{−j,−j} with ∑_{(k,m)∈S} Y_km = −Y_kk − Y_jk. The target sum −Y_kk − Y_jk arises from the row-sum constraint: Y_kk + Y_kj + ∑_{l≠k, l≠j} Y_kl = 0.

Theorem 9 (Generic Separation via g_j). Define E_j = {Y ∈ S*_K : Y violates (a_j) or (b_j)}. Then E_j is closed, S^{(j)}_{K−1}-invariant, and has measure zero. For Y_1, Y_2 ∈ S*_K \ E_j: g_j(Y_1) = g_j(Y_2) implies that Y_1 and Y_2 are in the same S^{(j)}_{K−1}-orbit.

Proof. The proof follows Theorem 8. For separation, the recovery procedure parallels Lemma 2: order the pairs in P_j lexicographically, identify each row's off-diagonal entries via the unique subset-sum, and recover shared entries as singleton intersections. ∎
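For concreteness, the sorted encodings of g and g_j used in these separation results can be computed as follows (an illustrative sketch; the function names are our own).

```python
import torch

def g_features(Y):
    """Sorted diagonal and off-diagonal multisets (S_K-invariant)."""
    K = Y.shape[0]
    diag = torch.sort(torch.diagonal(Y), descending=True).values
    iu = torch.triu_indices(K, K, offset=1)
    off = torch.sort(Y[iu[0], iu[1]], descending=True).values
    return diag, off

def g_j_features(Y, j):
    """Lexicographically sorted pairs (Y_kk, Y_jk) for k != j, plus sorted
    off-diagonal entries of Y_{-j,-j} (invariant to permutations fixing j)."""
    K = Y.shape[0]
    keep = [k for k in range(K) if k != j]
    pairs = torch.stack([torch.diagonal(Y)[keep], Y[j, keep]], dim=1)
    pairs = pairs[torch.sort(pairs[:, 1], stable=True).indices]   # secondary key
    pairs = pairs[torch.sort(pairs[:, 0], stable=True).indices]   # primary key
    sub = Y[keep][:, keep]
    iu = torch.triu_indices(K - 1, K - 1, offset=1)
    off = torch.sort(sub[iu[0], iu[1]]).values
    return pairs, off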
Corollary 3 (Universal Approximation via g_j). Any continuous, S^{(j)}_{K−1}-invariant function on S*_K \ E_j can be uniformly approximated on compact subsets by an MLP taking the sorted pairs from P_j and sorted entries from O_j as input.

Proof. The proof is analogous to Corollary 2, with pairs replacing diagonal entries. ∎

A.7 Extension to Utility-Covariance Pairs: Global Invariants

We extend the theory to pairs (v, Y) consisting of a utility vector and a covariance matrix.

Definition 11 (Centered Utility-Covariance Space). Define X_K = {(v, Y) ∈ R^K × S_K : ∑_{i=1}^K v_i = 0, Y 1_K = 0, tr(Y) = K}. This space has dimension (K−1) + (K+1)(K−2)/2 = (K² + K − 4)/2. It is closed, Hausdorff, and S_K-invariant under π · (v, Y) = (P_π v, P_π Y P_π^⊤).

Definition 12 (Global Joint Invariant). For (v, Y) ∈ X_K, define g̃(v, Y) = (P(v, Y), O(Y)), where P(v, Y) = {{(v_i, Y_ii) : i = 1, ..., K}} and O(Y) = {{Y_ij : 1 ≤ i < j ≤ K}}. The key extension is that P contains pairs (v_i, Y_ii) coupling utilities with variances.

Definition 13 (Genericity Conditions). We say (v, Y) ∈ X_K satisfies (ã) if the pairs (v_i, Y_ii) are pairwise distinct and the off-diagonal entries are pairwise distinct, and (b̃) if condition (b) holds for Y.

Theorem 10 (Generic Separation on X_K). Define B̃ = {(v, Y) ∈ X_K : (ã) or (b̃) fails}. Then B̃ is closed, S_K-invariant, and has measure zero. For (v_1, Y_1), (v_2, Y_2) ∈ X_K \ B̃: g̃(v_1, Y_1) = g̃(v_2, Y_2) implies that the pairs are in the same S_K-orbit.

Proof. The proof follows Theorem 8. The recovery procedure orders pairs lexicographically to assign labels, then recovers off-diagonal entries via subset-sum intersections. The measure-zero argument extends since v_i − v_j is nontrivial on X_K. ∎

A.8 Extension to Utility-Covariance Pairs: Per-Alternative Invariants

This section establishes Theorem 1 (Generic Separation) and Theorem 2 (MLP Universal Approximation) from the main text.

Definition 14 (Per-Alternative Invariant g_j). For (v, Y) ∈ X_K and index j, define g_j(v, Y) = (T_j(v, Y), O_j(Y)), where

T_j(v, Y) = {{(v_k, Y_kk, Y_jk) : k ≠ j}},  O_j(Y) = {{Y_kl : 1 ≤ k < l ≤ K, k ≠ j, l ≠ j}}.

This combines the extensions from Sections A.6 and A.7: T_j contains triples (v_k, Y_kk, Y_jk) coupling each alternative k's utility and variance with its covariance with j, while O_j contains the off-diagonal entries of Y_{−j,−j}. The values v_j = −∑_{k≠j} v_k and Y_jj = K − ∑_{k≠j} Y_kk are implicitly recoverable.

Definition 15 (Genericity Conditions). We say (v, Y) ∈ X_K satisfies (ã_j) if the triples (v_k, Y_kk, Y_jk) for k ≠ j are pairwise distinct and the off-diagonal entries of Y_{−j,−j} are pairwise distinct, and (b̃_j) if condition (b_j) holds.

Definition 16 (Exceptional Set). Define B_j = {(v, Y) ∈ X_K : (ã_j) or (b̃_j) fails}.

Proposition 8 (Properties of B_j). The set B_j is closed, S^{(j)}_{K−1}-invariant, and has measure zero in X_K.

Proof. The proof follows Proposition 6. Closedness holds because (ã_j) and (b̃_j) are open conditions. Invariance under S^{(j)}_{K−1} holds because permutations fixing j preserve distinctness and subset-sum structure. For measure zero: coincidence of two triples defines the intersection of three hyperplanes {v_k = v_l} ∩ {Y_kk = Y_ll} ∩ {Y_jk = Y_jl}, each nontrivial; the subset-sum constraints are similarly nontrivial by arguments analogous to Lemma 3(ii). ∎

Lemma 4 (Recovery Lemma).
If (v, Y) ∈ X_K satisfies (ã_j) and (b̃_j), then g_j(v, Y) determines (v, Y) up to S^{(j)}_{K−1}-orbit.

Proof. The proof extends Lemma 2 to triples. Order the triples in T_j lexicographically to assign labels (1), ..., (K−1). For each (k), identify R_{(k)} ⊂ O_j as the unique (K−2)-subset summing to −Y_{(k)(k)} − Y_{j(k)}. Recover Y_{(k)(l)} as the singleton R_{(k)} ∩ R_{(l)}. Recover v_j and Y_jj from the constraints. ∎

We now prove the main results from Section 5.

Proof of Theorem 1. By Proposition 8, B_j is closed, S^{(j)}_{K−1}-invariant, and has measure zero. For (v_1, Y_1), (v_2, Y_2) ∈ X_K \ B_j with g_j(v_1, Y_1) = g_j(v_2, Y_2), Lemma 4 implies that both pairs are determined up to S^{(j)}_{K−1}-orbit by this common value. ∎

Proof of Theorem 2. Encode g_j via continuous functions: sort the triples lexicographically to obtain 3(K−1) functions g^t_{k,1} = v_{(k)}, g^t_{k,2} = Y_{(k)(k)}, g^t_{k,3} = Y_{j(k)}; sort the off-diagonal entries to obtain (K−1)(K−2)/2 functions g^o_ℓ. These are continuous and S^{(j)}_{K−1}-invariant, and they separate points on (X_K \ B_j)/S^{(j)}_{K−1} by Theorem 1. Since S^{(j)}_{K−1} is finite, X_K / S^{(j)}_{K−1} is Hausdorff (Bredon, 1972). By Stone–Weierstrass, the algebra generated by these functions is dense on compact subsets. By the universal approximation theorem (Cybenko, 1989; Hornik et al., 1989), MLPs can approximate any continuous function of these inputs. Because the choice probabilities are defined according to the integral given in (4) under Assumption 1, they are continuous functions; thus, the result holds. ∎

B Estimation Theory

This appendix contains additional information on the estimation theory results in Sections 5.2 and 5.3. Section B.1 provides additional discussion of the approximation conditions, Assumptions 2 and 4. Section B.2 provides a proof of Theorem 4. Section B.3 contains the regularity conditions for Section 5.3. Section B.4 provides a proof of Theorem 6.

B.1 Emulator Approximation Assumptions

This section provides additional discussion of Assumptions 2 and 4, showing how they can be derived from more primitive conditions on the emulator and justifying these conditions through related theoretical results.

Primitive assumptions. We consider a sieve regime in which a sequence of emulators P̂_n is trained with increasing precision as the sample size n grows. This is achieved by progressively (a) increasing the number of simulated training examples s = s_n and (b) increasing the neural network complexity c = c_n (e.g., the number of layers or hidden units). We impose the following primitive conditions:

(P1) Compact covariate support: The covariate space C is compact.

(P2) Smooth utility and covariance mappings: The mappings θ ↦ v*_i(X_i, θ) and θ ↦ Σ*(X_i, θ) are continuously differentiable for all X_i ∈ C.

(P3) Interior probability condition: There exist a compact set X̄_K ⊂ X_K and p > 0 such that inf_{(v*,Σ*)∈X̄_K} min_j P_j(v*, Σ*) ≥ p, and the image of C × Θ under the mapping (X, θ) ↦ (v*(X, θ), Σ*(X, θ)) is contained in X̄_K.

(P4) Probability-scale approximation: For some α ≥ 0, the emulator satisfies

\[
\sup_{(\mathbf{v}^*, \Sigma^*) \in \bar{\mathcal{X}}_K} \max_{j=1,\ldots,K} \bigl| \hat{P}_{n,j}(\mathbf{v}^*, \Sigma^*) - P_j(\mathbf{v}^*, \Sigma^*) \bigr| = o_p(n^{-\alpha/2}). \tag{49}
\]
(P5) Gradient approximation: The emulator gradients satisfy

\[
\sup_{(\mathbf{v}^*, \Sigma^*) \in \bar{\mathcal{X}}_K} \max_{j=1,\ldots,K} \bigl\| \nabla_{(\mathbf{v}^*, \Sigma^*)} \log \hat{P}_{n,j}(\mathbf{v}^*, \Sigma^*) - \nabla_{(\mathbf{v}^*, \Sigma^*)} \log P_j(\mathbf{v}^*, \Sigma^*) \bigr\| = o_p(1). \tag{50}
\]

We next show how these assumptions imply Assumptions 2 and 4. We then provide justification of assumptions (P4) and (P5) using related results in approximation theory.

From probability-scale errors to log-likelihood errors. We now show that conditions (P1)–(P4) imply Assumption 2. Write the log-likelihood error as

\[
\hat{\ell}_n(\theta) - \ell_n(\theta) = \frac{1}{n}\sum_{i=1}^n \xi_i, \quad \text{where } \xi_i = \log \hat{P}_{n,y_i}(\theta) - \log P_{y_i}(\theta), \tag{51}
\]

and y_i ∈ {1, ..., K} denotes the observed choice. Conditional on the covariates and the trained emulator, the ξ_i are independent. We decompose into bias and centered components:

\[
\hat{\ell}_n(\theta) - \ell_n(\theta) = \underbrace{\frac{1}{n}\sum_{i=1}^n E(\xi_i \mid X_i)}_{\text{bias}} + \underbrace{\frac{1}{n}\sum_{i=1}^n \{\xi_i - E(\xi_i \mid X_i)\}}_{\text{centered term}}. \tag{52}
\]

Let δ_n = sup_{(v*,Σ*)∈X̄_K} max_j |P̂_{n,j} − P_j| = o_p(n^{−α/2}) denote the probability-scale error from (P4). Taylor expanding log P̂_{n,j} around P_j gives

\[
\log \hat{P}_{n,j} - \log P_j = \frac{\hat{P}_{n,j} - P_j}{P_j} - \frac{(\hat{P}_{n,j} - P_j)^2}{2P_j^2} + O\!\left(\frac{\delta_n^3}{p^3}\right). \tag{53}
\]

The conditional expectation of ξ_i is

\[
E(\xi_i \mid X_i) = \sum_{j=1}^K P_j \,(\log \hat{P}_{n,j} - \log P_j). \tag{54}
\]

The first-order contribution is ∑_j P_j · (P̂_{n,j} − P_j)/P_j = ∑_j (P̂_{n,j} − P_j) = 0, since both probability vectors sum to one. Therefore,

\[
E(\xi_i \mid X_i) = -\sum_{j=1}^K \frac{(\hat{P}_{n,j} - P_j)^2}{2P_j} + O\!\left(\frac{\delta_n^3}{p^3}\right) = O\!\left(\frac{\delta_n^2}{p}\right). \tag{55}
\]

For the centered term, the mean value theorem gives |log P̂_{n,j} − log P_j| ≤ |P̂_{n,j} − P_j|/p ≤ δ_n/p, and hence |ξ_i| ≤ δ_n/p. The conditional variance satisfies Var(ξ_i | X_i) ≤ E(ξ_i² | X_i) ≤ δ_n²/p². Since the centered terms ξ_i − E(ξ_i | X_i) are conditionally independent with mean zero, the variance of their sample average is

\[
\operatorname{Var}\!\left[\frac{1}{n}\sum_{i=1}^n \{\xi_i - E(\xi_i \mid X_i)\} \,\middle|\, X_1, \ldots, X_n\right] = \frac{1}{n^2}\sum_{i=1}^n \operatorname{Var}(\xi_i \mid X_i) \le \frac{\delta_n^2}{np^2}. \tag{56}
\]

By Chebyshev's inequality, the centered term is O_p{δ_n/(p√n)}. Combining via the triangle inequality:

\[
|\hat{\ell}_n(\theta) - \ell_n(\theta)| = O\!\left(\frac{\delta_n^2}{p}\right) + O_p\!\left(\frac{\delta_n}{\sqrt{n}\,p}\right). \tag{57}
\]

For α ∈ [0, 1], the bias term O(δ_n²/p) = o(n^{−α}) dominates the centered term O_p{δ_n/(p√n)} = o_p{n^{−(α+1)/2}}. Since conditions (P1) and (P2) ensure that the mapping from (X, θ) to (v*, Σ*) is continuous on the compact set C × Θ, the bound holds uniformly over θ ∈ Θ:

\[
\sup_{\theta \in \Theta} |\hat{\ell}_n(\theta) - \ell_n(\theta)| = o_p(n^{-\alpha}), \tag{58}
\]

which is Assumption 2.

From gradient approximation to Fisher information consistency. We now show that conditions (P1), (P2), (P3), and (P5) imply Assumption 4. By the chain rule,

\[
\nabla_\theta \log \hat{P}_{n,j}(\theta) = \{\nabla_\theta(\mathbf{v}^*, \Sigma^*)\}^\top \nabla_{(\mathbf{v}^*, \Sigma^*)} \log \hat{P}_{n,j}, \tag{59}
\]

and similarly for ∇_θ log P_j(θ). By condition (P2), the Jacobian ∇_θ(v*, Σ*) is continuous on the compact set C × Θ and hence uniformly bounded: sup_{(X,θ)∈C×Θ} ‖∇_θ(v*, Σ*)‖ ≤ C for some constant C < ∞. Let

\[
\varepsilon_n = \sup_{(\mathbf{v}^*, \Sigma^*) \in \bar{\mathcal{X}}_K} \max_j \bigl\| \nabla_{(\mathbf{v}^*, \Sigma^*)} \log \hat{P}_{n,j} - \nabla_{(\mathbf{v}^*, \Sigma^*)} \log P_j \bigr\| = o_p(1) \tag{60}
\]

denote the gradient approximation error from (P5). Then

\[
\sup_{\theta \in \Theta} \bigl\| \nabla_\theta \log \hat{P}_{n,j}(\theta) - \nabla_\theta \log P_j(\theta) \bigr\| \le C\varepsilon_n = o_p(1). \tag{61}
\]
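As a quick numerical sanity check of the second-order bias expansion in (53)–(55) above (illustrative only): for a perturbation δ of the probability vector with ∑_j δ_j = 0, the expected log-likelihood error ∑_j P_j (log P̂_j − log P_j) matches −∑_j δ_j²/(2P_j) up to third-order terms.

```python
import torch

P = torch.tensor([0.5, 0.3, 0.2])
delta = 1e-3 * torch.tensor([1.0, -2.0, 1.0])  # perturbation summing to zero
P_hat = P + delta
exact = (P * (P_hat.log() - P.log())).sum()    # E(xi_i | X_i) for this example
second_order = -(delta**2 / (2 * P)).sum()
print(exact.item(), second_order.item())       # agree up to O(delta^3) terms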
From gradient approximation to Fisher information consistency. We now show that conditions (P1), (P2), (P3), and (P5) imply Assumption 4. By the chain rule,
$$\nabla_\theta \log \hat{P}_{n,j}(\theta) = \{\nabla_\theta (v^*, \Sigma^*)\}^\top \nabla_{(v^*, \Sigma^*)} \log \hat{P}_{n,j}, \tag{59}$$
and similarly for $\nabla_\theta \log P_j(\theta)$. By condition (P2), the Jacobian $\nabla_\theta (v^*, \Sigma^*)$ is continuous on the compact set $\mathcal{C} \times \Theta$ and hence uniformly bounded: $\sup_{(X, \theta) \in \mathcal{C} \times \Theta} \|\nabla_\theta (v^*, \Sigma^*)\| \leq C$ for some constant $C < \infty$. Let
$$\varepsilon_n = \sup_{(v^*, \Sigma^*) \in \bar{\mathcal{X}}_K} \max_j \left\| \nabla_{(v^*, \Sigma^*)} \log \hat{P}_{n,j} - \nabla_{(v^*, \Sigma^*)} \log P_j \right\| = o_p(1) \tag{60}$$
denote the gradient approximation error from (P5). Then
$$\sup_{\theta \in \Theta} \left\| \nabla_\theta \log \hat{P}_{n,j}(\theta) - \nabla_\theta \log P_j(\theta) \right\| \leq C \varepsilon_n = o_p(1). \tag{61}$$

Recall the emulator and true outer-product-of-scores estimators are
$$\hat{J}_n(\theta) = \frac{1}{n}\sum_{i=1}^n \nabla_\theta \log \hat{P}_{n,y_i}(\theta)\, \nabla_\theta \log \hat{P}_{n,y_i}(\theta)^\top, \tag{62}$$
$$J_n(\theta) = \frac{1}{n}\sum_{i=1}^n \nabla_\theta \log P_{y_i}(\theta)\, \nabla_\theta \log P_{y_i}(\theta)^\top. \tag{63}$$

Let $\hat{s}_i = \nabla_\theta \log \hat{P}_{n,y_i}(\theta)$ and $s_i = \nabla_\theta \log P_{y_i}(\theta)$. The identity
$$\hat{s}_i \hat{s}_i^\top - s_i s_i^\top = (\hat{s}_i - s_i)\hat{s}_i^\top + s_i(\hat{s}_i - s_i)^\top \tag{64}$$
implies
$$\|\hat{s}_i \hat{s}_i^\top - s_i s_i^\top\| \leq \|\hat{s}_i - s_i\| \cdot (\|\hat{s}_i\| + \|s_i\|). \tag{65}$$

The true scores $s_i$ are uniformly bounded: (P3) ensures that all relevant inputs $(v^*, \Sigma^*)$ lie in $\bar{\mathcal{X}}_K$, where $P_j \geq p > 0$, guaranteeing bounded gradients $\nabla_{(v^*, \Sigma^*)} \log P_j$, and (P2) ensures the Jacobian $\nabla_\theta (v^*, \Sigma^*)$ is continuous and hence bounded on $\mathcal{C} \times \Theta$. Let $M = \sup_{\theta \in \Theta} \max_j \|\nabla_\theta \log P_j(\theta)\| < \infty$, so that $\|s_i\| \leq M$ for all $i$. Since $\|\hat{s}_i - s_i\| \leq C\varepsilon_n = o_p(1)$, the emulator scores satisfy $\|\hat{s}_i\| \leq \|s_i\| + C\varepsilon_n \leq M + o_p(1)$. Therefore,
$$\sup_{\theta \in \Theta} \|\hat{J}_n(\theta) - J_n(\theta)\| \leq \frac{1}{n}\sum_{i=1}^n \|\hat{s}_i - s_i\| \cdot (\|\hat{s}_i\| + \|s_i\|) \leq \frac{1}{n}\sum_{i=1}^n C\varepsilon_n \{2M + o_p(1)\} = o_p(1), \tag{66}$$
which is Assumption 4.
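The per-observation bound (65), and the resulting control of $\|\hat{J}_n - J_n\|$, can likewise be checked on synthetic scores. The values below are illustrative draws, not outputs of the paper's emulator.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 4

s = rng.standard_normal((n, d))                   # "true" scores s_i
s_hat = s + 1e-2 * rng.standard_normal((n, d))    # perturbed emulator scores

J = s.T @ s / n                                   # J_n as in (63)
J_hat = s_hat.T @ s_hat / n                       # J_n-hat as in (62)

gap = np.linalg.norm(J_hat - J, ord=2)            # spectral-norm gap
# Average of the per-observation bound from (65):
bound = np.mean(np.linalg.norm(s_hat - s, axis=1)
                * (np.linalg.norm(s_hat, axis=1) + np.linalg.norm(s, axis=1)))
print(gap <= bound)   # True: the gap respects the bound used in (66)
```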
Justification from approximation theory. The primitive conditions (P4) and (P5) are supported by related theoretical results on neural network approximation and estimation, though a complete theory tailored to our specific setting remains an area for future work. The total emulator error can be decomposed into approximation error and estimation error:
$$\sup_{(v^*, \Sigma^*) \in \bar{\mathcal{X}}_K} \max_{j=1,\dots,K} |\hat{P}_{n,j}(v^*, \Sigma^*) - P_j(v^*, \Sigma^*)| \leq \underbrace{a(c_n)}_{\text{approximation}} + \underbrace{e(s_n, c_n)}_{\text{estimation}}, \tag{67}$$
where $a(c)$ is the distance from the true choice probability function to the best approximator in the network class of complexity $c$, and $e(s, c)$ is the deviation of the trained network from this best-in-class approximator due to training on $s$ simulated examples.

For the approximation error, provided the choice probabilities are real-analytic functions of $(v^*, \Sigma^*)$ on the interior of the parameter space, neural networks with smooth activations can approximate them with exponentially small error. De Ryck et al. (2021) establish that tanh networks with two hidden layers achieve approximation error decaying as $O\{N^{k/(d+1)} \exp(-C \cdot N^{1/(d+1)} \log N)\}$ in Sobolev norms $W^{k,\infty}$, where $N$ is the network width, $d$ is the input dimension, and $C > 0$ depends on the analyticity parameters of the target function. Earlier work established similar exponential rates for deep ReLU networks (E and Wang, 2018), though under the stronger assumption that the target function admits an absolutely convergent power series expansion. Given these exponential convergence results, achieving approximation error $\varepsilon$ requires network complexity scaling only polylogarithmically in $1/\varepsilon$.

For the estimation error, standard results from empirical process theory yield bounds of order $O\{\sqrt{c \log(sc)/s}\}$ in $L_2$ norms, where $c$ is a complexity parameter (Schmidt-Hieber, 2020; Farrell et al., 2021; Shen et al., 2023). For our setting, however, we require an $L_\infty$ bound. The results below assume minimally that $e(s, c) = O\{c^k\, \mathrm{polylog}(s)/\sqrt{s}\}$ for some $k < \infty$. Then, with network complexity growing polylogarithmically in the target accuracy, the dependence on $c$ is negligible, and the estimation error is effectively of order $O(1/\sqrt{s})$ up to logarithmic factors.

Achieving the rate given in (P4) requires both $a(c_n) = o(n^{-\alpha/2})$ and $e(s_n, c_n) = o_p(n^{-\alpha/2})$; note that the approximation error is nonstochastic because it concerns pure function approximation. The exponential approximation rates for analytic functions imply that the first condition is satisfied with polylogarithmic growth in $c_n$. The second condition requires $s_n$ to grow faster than $n^\alpha$ up to logarithmic factors. In practice, since simulation from the MNP model is computationally inexpensive, generating a training set with $s_n \gg n$ is feasible.

The cases $\alpha = 0$ and $\alpha = 1$ correspond to different inferential goals. When $\alpha = 0$, Assumption 2 requires only that the log-likelihood error vanish in probability, which suffices for consistency of the emulator-based estimator (Theorem 3). When $\alpha = 1$, the stronger requirement that the error vanish faster than $n^{-1}$ ensures that the emulator-based estimator is asymptotically equivalent to the true MLE, inheriting its $\sqrt{n}$-consistency and asymptotic normality (Theorem 4). The latter requires a larger training set: $s_n$ growing faster than $n$, versus $s_n$ growing faster than $n^0 = 1$.

For the gradient approximation in (P5), Czarnecki et al. (2017) show that neural networks are universal approximators in Sobolev norms, and the bounds of De Ryck et al. (2021) hold in $W^{k,\infty}$ norms, establishing that smooth networks can simultaneously approximate both function values and derivatives. Additionally, Cocola and Hand (2020) prove global convergence of gradient flow for Sobolev training with overparameterized two-layer networks, ensuring that the trained network achieves small Sobolev loss under appropriate conditions. These results support the plausibility of (P5), though the precise rates and conditions for our setting merit further investigation.
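For concreteness, a Sobolev-style objective of the kind these results justify can be stated in a few lines. The following is a minimal PyTorch sketch under assumed interfaces—`emulator` maps flattened canonicalized inputs to log choice probabilities, and `direction` holds pre-drawn tangent-space directions as in the training procedure of Section C.1—not the paper's exact loss, which additionally uses the soft-relaxation gradient targets of Section 4.

```python
import torch

def sobolev_loss(emulator, x, logp_target, dirgrad_target, direction, lam=1.0):
    """Value-plus-directional-derivative loss (a sketch, with assumed shapes).

    emulator      : module mapping inputs x (batch, D) to log choice
                    probabilities (batch, K) -- a stand-in interface
    logp_target   : (batch, K) simulated log choice frequencies
    dirgrad_target: (batch, K) target directional derivatives
    direction     : (batch, D) pre-drawn tangent-space directions
    """
    # Forward-over-reverse trick: one jvp call returns both the outputs
    # and their directional derivatives along `direction`.
    logp, dirgrad = torch.autograd.functional.jvp(
        emulator, x, direction, create_graph=True)
    value_loss = ((logp - logp_target) ** 2).mean()
    grad_loss = ((dirgrad - dirgrad_target) ** 2).mean()
    return value_loss + lam * grad_loss
```

Driving both terms to zero enforces simultaneous accuracy of probabilities and scores, which is exactly the content of conditions (P4) and (P5).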
B.2 Proof of Theorem 4

Proof. We prove the result for approximate maximizers satisfying $\hat{\ell}_n(\hat{\theta}_n) \geq \sup_{\theta \in \Theta} \hat{\ell}_n(\theta) - \eta_n$, where $\eta_n = o_p(n^{-1})$. Exact maximizers correspond to $\eta_n = 0$. We proceed in three steps: first bounding the difference in log-likelihoods, then showing that the estimators are asymptotically equivalent, and finally appealing to the asymptotic normality of the exact MLE.

Step 1: Bounding the difference in log-likelihoods. Define $\delta_n = \sup_{\theta \in \Theta} |\hat{\ell}_n(\theta) - \ell_n(\theta)| = o_p(n^{-1})$ by Assumption 2 with $\alpha \geq 1$. By the triangle inequality,
$$|\ell_n(\hat{\theta}_n) - \ell_n(\tilde{\theta}_n)| \leq |\ell_n(\hat{\theta}_n) - \hat{\ell}_n(\hat{\theta}_n)| + |\hat{\ell}_n(\hat{\theta}_n) - \ell_n(\tilde{\theta}_n)|. \tag{68}$$

By Assumption 2, the first term is bounded by $\delta_n$. For the second term, we establish upper and lower bounds on $\hat{\ell}_n(\hat{\theta}_n)$. Since $\tilde{\theta}_n$ maximizes $\ell_n$:
$$\hat{\ell}_n(\hat{\theta}_n) \leq \ell_n(\hat{\theta}_n) + \delta_n \leq \ell_n(\tilde{\theta}_n) + \delta_n. \tag{69}$$

For the lower bound, by the approximate-maximizer property and the fact that $\tilde{\theta}_n \in \Theta$:
$$\hat{\ell}_n(\hat{\theta}_n) \geq \sup_{\theta \in \Theta} \hat{\ell}_n(\theta) - \eta_n \geq \hat{\ell}_n(\tilde{\theta}_n) - \eta_n \geq \ell_n(\tilde{\theta}_n) - \delta_n - \eta_n. \tag{70}$$

Combining these bounds, $|\hat{\ell}_n(\hat{\theta}_n) - \ell_n(\tilde{\theta}_n)| \leq \delta_n + \eta_n$. Therefore,
$$|\ell_n(\hat{\theta}_n) - \ell_n(\tilde{\theta}_n)| \leq 2\delta_n + \eta_n = o_p(n^{-1}), \tag{71}$$
which implies $n|\ell_n(\hat{\theta}_n) - \ell_n(\tilde{\theta}_n)| = o_p(1)$.

Step 2: Asymptotic equivalence of the estimators. We show that $\sqrt{n}(\hat{\theta}_n - \tilde{\theta}_n) \xrightarrow{p} 0$. Fix $\epsilon > 0$ and define the neighborhood
$$B_{n,\epsilon} = \left\{\theta \in \Theta : \sqrt{n}\|\theta - \tilde{\theta}_n\| \leq \epsilon\right\}. \tag{72}$$
We show that $\hat{\theta}_n \in B_{n,\epsilon}$ with probability approaching one.

For any $\theta \notin B_{n,\epsilon}$, we have $\|\theta - \tilde{\theta}_n\| > \epsilon/\sqrt{n}$. Since $\tilde{\theta}_n$ maximizes $\ell_n$, the score vanishes: $\frac{1}{n}\sum_{i=1}^n \psi(y_i, X_i; \tilde{\theta}_n) = 0$. By Taylor expansion around $\tilde{\theta}_n$,
$$\ell_n(\theta) = \ell_n(\tilde{\theta}_n) + \tfrac{1}{2}(\theta - \tilde{\theta}_n)^\top \nabla^2_\theta \ell_n(\bar{\theta})(\theta - \tilde{\theta}_n), \tag{73}$$
where $\bar{\theta}$ lies on the segment between $\theta$ and $\tilde{\theta}_n$. By consistency, both $\hat{\theta}_n \xrightarrow{p} \theta_0$ and $\tilde{\theta}_n \xrightarrow{p} \theta_0$, so $\bar{\theta} \xrightarrow{p} \theta_0$. By the uniform law of large numbers and Assumption 3(ii) and (iv),
$$\nabla^2_\theta \ell_n(\bar{\theta}) \xrightarrow{p} -J(\theta_0). \tag{74}$$

Let $\lambda_{\min} > 0$ denote the minimum eigenvalue of $J(\theta_0)$, which is positive by Assumption 3(v). For sufficiently large $n$, with probability approaching one, the Hessian $\nabla^2_\theta \ell_n(\bar{\theta})$ has all eigenvalues bounded above by $-\lambda_{\min}/2$. Thus, for $\theta \notin B_{n,\epsilon}$,
$$n\left\{\ell_n(\tilde{\theta}_n) - \ell_n(\theta)\right\} \geq \frac{\lambda_{\min}}{4}\, n\|\theta - \tilde{\theta}_n\|^2 > \frac{\lambda_{\min}}{4} \cdot \epsilon^2 = \kappa > 0 \tag{75}$$
with probability approaching one as $n \to \infty$.

Now suppose, for contradiction, that $\Pr(\hat{\theta}_n \notin B_{n,\epsilon})$ does not converge to zero. On the event that both $\hat{\theta}_n \notin B_{n,\epsilon}$ and the Hessian bound holds, we have $n\{\ell_n(\tilde{\theta}_n) - \ell_n(\hat{\theta}_n)\} > \kappa$. Since the Hessian bound holds with probability approaching one, the event $\{n\{\ell_n(\tilde{\theta}_n) - \ell_n(\hat{\theta}_n)\} > \kappa\}$ has probability bounded away from zero along a subsequence. But this contradicts $n|\ell_n(\hat{\theta}_n) - \ell_n(\tilde{\theta}_n)| = o_p(1)$ from Step 1. Therefore, $\Pr(\hat{\theta}_n \in B_{n,\epsilon}) \to 1$. Since $\epsilon > 0$ was arbitrary, we conclude that $\sqrt{n}(\hat{\theta}_n - \tilde{\theta}_n) \xrightarrow{p} 0$.

Step 3: Asymptotic normality. By standard maximum likelihood theory (Theorem 5.39 and Lemma 7.6 of van der Vaart (1998)), the regularity conditions in Assumption 3 ensure that the model is differentiable in quadratic mean at $\theta_0$ with nonsingular Fisher information. Since $\tilde{\theta}_n$ is consistent, $\sqrt{n}(\tilde{\theta}_n - \theta_0) \xrightarrow{d} N\{0, J(\theta_0)^{-1}\}$. Since $\sqrt{n}(\hat{\theta}_n - \tilde{\theta}_n) \xrightarrow{p} 0$, Slutsky's theorem gives
$$\sqrt{n}(\hat{\theta}_n - \theta_0) = \sqrt{n}(\hat{\theta}_n - \tilde{\theta}_n) + \sqrt{n}(\tilde{\theta}_n - \theta_0) \xrightarrow{d} N\{0, J(\theta_0)^{-1}\}. \tag{76}$$

B.3 Regularity Conditions for Inference Under Misspecification

This appendix states the regularity conditions required for the results in Section 5.3. These conditions parallel standard assumptions for M-estimators (van der Vaart, 1998, Chapter 5) and are mild for neural network emulators with smooth activation functions. Recall that in Section 5.3 we treat the neural network as fixed (not varying with $n$), so we use $\hat{m}$ and $\hat{\psi}$ without $n$ subscripts.

Assumption 5 (Identification and Uniform Convergence). The following conditions hold:

(i) The parameter space $\Theta$ is compact.

(ii) The pseudo-true parameter $\theta^\dagger_0 = \arg\max_{\theta \in \Theta} E\{\hat{m}(y, X; \theta)\}$ exists uniquely and lies in the interior of $\Theta$.

(iii) The pseudo-true parameter is well separated: for every $\epsilon > 0$,
$$\sup_{\theta : \|\theta - \theta^\dagger_0\| \geq \epsilon} E\{\hat{m}(y, X; \theta)\} < E\{\hat{m}(y, X; \theta^\dagger_0)\}. \tag{77}$$

(iv) The sample criterion function converges uniformly to its expectation:
$$\sup_{\theta \in \Theta} \left| \frac{1}{n}\sum_{i=1}^n \hat{m}(y_i, X_i; \theta) - E\{\hat{m}(y, X; \theta)\} \right| \xrightarrow{p} 0. \tag{78}$$

(v) The estimator $\hat{\theta}_n$ is an approximate maximizer of the sample criterion function:
$$\frac{1}{n}\sum_{i=1}^n \hat{m}(y_i, X_i; \hat{\theta}_n) \geq \sup_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^n \hat{m}(y_i, X_i; \theta) - o_p(n^{-1}). \tag{79}$$

Assumption 6 (Regularity for Asymptotic Normality). The following conditions hold:

(i) (Differentiability) For each $(y, X)$, the map $\theta \mapsto \hat{m}(y, X; \theta)$ is differentiable on $\Theta$ with gradient $\hat{\psi}(y, X; \theta) = \nabla_\theta \hat{m}(y, X; \theta)$.

(ii) (Local dominance) There exist $\delta > 0$ and a measurable function $\dot{m}(y, X)$ with $E\{\dot{m}(y, X)^2\} < \infty$ such that
$$\sup_{\theta : \|\theta - \theta^\dagger_0\| \leq \delta} \|\hat{\psi}(y, X; \theta)\| \leq \dot{m}(y, X). \tag{80}$$

(iii) (Second-order expansion) The map $\theta \mapsto E\{\hat{m}(y, X; \theta)\}$ admits a second-order Taylor expansion at $\theta^\dagger_0$:
$$E\{\hat{m}(y, X; \theta)\} = E\{\hat{m}(y, X; \theta^\dagger_0)\} - \tfrac{1}{2}(\theta - \theta^\dagger_0)^\top A(\theta^\dagger_0)(\theta - \theta^\dagger_0) + o(\|\theta - \theta^\dagger_0\|^2), \tag{81}$$
where $A(\theta^\dagger_0) = -E\{\nabla^2_\theta \hat{m}(y, X; \theta^\dagger_0)\}$ is positive definite.

B.4 Proof of Theorems 5 and 6

Proof of Theorem 5. Define the sample and population criterion functions
$$M_n(\theta) = \frac{1}{n}\sum_{i=1}^n \hat{m}(y_i, X_i; \theta), \qquad M(\theta) = E\{\hat{m}(y, X; \theta)\}. \tag{82}$$

By Assumption 5(iv), $\sup_{\theta \in \Theta} |M_n(\theta) - M(\theta)| \xrightarrow{p} 0$. By Assumption 5(ii)–(iii), $\theta^\dagger_0$ is the unique maximizer of $M(\theta)$ and is well separated in the sense that $\sup_{\theta : \|\theta - \theta^\dagger_0\| \geq \epsilon} M(\theta) < M(\theta^\dagger_0)$ for every $\epsilon > 0$. By Assumption 5(v), $M_n(\hat{\theta}_n) \geq \sup_{\theta \in \Theta} M_n(\theta) - o_p(n^{-1})$, which implies $M_n(\hat{\theta}_n) \geq \sup_{\theta \in \Theta} M_n(\theta) - o_p(1)$. The result $\hat{\theta}_n \xrightarrow{p} \theta^\dagger_0$ now follows from Theorem 5.7 of van der Vaart (1998).

Proof of Theorem 6. We verify the conditions of Theorem 5.23 of van der Vaart (1998) and apply it to obtain the result.

By Assumption 6(i), the map $\theta \mapsto \hat{m}(y, X; \theta)$ is differentiable at $\theta^\dagger_0$ for every $(y, X)$, with derivative $\hat{\psi}(y, X; \theta^\dagger_0)$. By Assumption 6(ii), the score is locally dominated by a square-integrable function, which implies the Lipschitz condition
$$|\hat{m}(y, X; \theta_1) - \hat{m}(y, X; \theta_2)| \leq \dot{m}(y, X)\|\theta_1 - \theta_2\| \tag{83}$$
for $\theta_1, \theta_2$ in a neighborhood of $\theta^\dagger_0$, via the mean value theorem. Assumption 6(iii) provides the required second-order Taylor expansion of $\theta \mapsto E\{\hat{m}(y, X; \theta)\}$ at the point of maximum $\theta^\dagger_0$, with nonsingular second-derivative matrix $-A(\theta^\dagger_0)$.

By Theorem 5, $\hat{\theta}_n \xrightarrow{p} \theta^\dagger_0$. By Assumption 5(v), the near-maximization condition $M_n(\hat{\theta}_n) \geq \sup_{\theta \in \Theta} M_n(\theta) - o_p(n^{-1})$ is satisfied. Applying Theorem 5.23 of van der Vaart (1998):
$$\sqrt{n}(\hat{\theta}_n - \theta^\dagger_0) = A(\theta^\dagger_0)^{-1} \frac{1}{\sqrt{n}}\sum_{i=1}^n \hat{\psi}(y_i, X_i; \theta^\dagger_0) + o_p(1). \tag{84}$$

By Assumption 6(ii), $E\{\|\hat{\psi}(y, X; \theta^\dagger_0)\|^2\} \leq E\{\dot{m}(y, X)^2\} < \infty$. The central limit theorem and Slutsky's lemma yield
$$\sqrt{n}(\hat{\theta}_n - \theta^\dagger_0) \xrightarrow{d} N\{0,\, A(\theta^\dagger_0)^{-1} B(\theta^\dagger_0) A(\theta^\dagger_0)^{-\top}\}, \tag{85}$$
where $B(\theta^\dagger_0) = E\{\hat{\psi}(y, X; \theta^\dagger_0)\, \hat{\psi}(y, X; \theta^\dagger_0)^\top\}$.
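The limiting covariance in (85) suggests the usual plug-in sandwich estimator. The sketch below assumes the per-observation gradients and Hessians of $\hat{m}$ at $\hat{\theta}_n$ are available (e.g., via automatic differentiation); the interface is ours, chosen for illustration, not the paper's.

```python
import numpy as np

def sandwich_cov(psi, hess):
    """Plug-in estimate of the sandwich covariance A^{-1} B A^{-T} in (85).

    psi  : (n, d) per-observation gradients of m-hat at theta-hat
    hess : (n, d, d) per-observation Hessians of m-hat at theta-hat
    """
    n = psi.shape[0]
    A = -hess.mean(axis=0)        # estimate of A(theta-dagger)
    B = psi.T @ psi / n           # estimate of B(theta-dagger)
    A_inv = np.linalg.inv(A)
    return A_inv @ B @ A_inv.T

# Standard errors for theta-hat:
#   np.sqrt(np.diag(sandwich_cov(psi, hess)) / n)
```

These are the sandwich standard errors that remain valid even when the emulator-based likelihood is an imperfect approximation.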
C Simulation Study Details

This appendix provides additional details on the factor-model parameterization, the emulator training procedure, and the scaled Wishart distribution used in the simulation study.

C.1 Emulator Training

We train the emulator using the Adam optimizer (Kingma and Ba, 2015) with the AdamW weight-decay variant (Loshchilov and Hutter, 2019). The base learning rate is 0.003, with momentum parameters $\beta_1 = 0.9$ and $\beta_2 = 0.999$, numerical stability constant $\epsilon = 10^{-8}$, and weight decay $3 \times 10^{-6}$.

Learning rate schedule. We use a learning rate schedule with linear warmup followed by inverse-square-root decay. During the first 5,000 steps, the learning rate increases linearly from zero to the base rate. After warmup, the learning rate decays as $\sqrt{t_{\mathrm{warmup}}/t}$, where $t$ is the current step and $t_{\mathrm{warmup}} = 5{,}000$.
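This schedule is simple to state in code. The sketch below interprets the decay term as a multiplicative factor on the base rate (so the schedule is continuous at the end of warmup), which is our reading of the description above.

```python
def learning_rate(step, base_lr=0.003, warmup=5_000):
    """Linear warmup to `base_lr`, then inverse-square-root decay."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * (warmup / step) ** 0.5

# learning_rate(5_000) == 0.003; learning_rate(20_000) == 0.0015
```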
Dropout schedule. We apply dropout during training to encourage diversity among learned representations. We set the dropout rate to $1/(10t)$, where $t$ indexes the training episodes from 1 to 400,000.

Training procedure. We pregenerate 400,000 training examples, each consisting of a canonicalized input $(v^*, \Sigma^*)$, simulated choice frequencies based on $10^6$ Monte Carlo draws, and target gradients computed using the soft relaxation described in Section 4. During training, we randomly pair these examples with pre-computed tangent-space directions for the directional-derivative loss. Gradients are clipped to a maximum norm of 0.01. We train for 400,000 optimizer steps with a batch size of 10,000.

Network architecture. Table 8 summarizes the neural network architecture for each value of $K$. All networks use the Swish activation function. Larger values of $K$ require larger networks to maintain approximation accuracy.

Component               K = 3    K = 5    K = 10
------------------------------------------------
Diagonal DeepSet
  ϕ output dim          8        24       32
  ϕ hidden layers       0        1        1
  ϕ hidden dim          N/A      24       32
  ρ output dim          8        24       32
  ρ hidden layers       0        0        0
  Parameters            152      1,440    2,432
Off-diagonal DeepSet
  ϕ output dim          8        24       32
  ϕ hidden layers       0        1        1
  ϕ hidden dim          N/A      24       32
  ρ output dim          8        24       32
  ρ hidden layers       0        0        0
  Parameters            128      1,368    2,336
Combining MLP
  Output dim            8        24       32
  Hidden dim            12       48       64
  Hidden layers         2        2        2
  Parameters            572      6,312    10,976
Equivariant layers
  Hidden dim            4        12       16
  Hidden layers         1        1        1
  Parameters            76       612      1,072
Total parameters        928      9,732    16,816

Table 8: Neural network architecture by number of alternatives $K$. All configurations use an initial dropout rate of 0.1 and Swish activations. When the number of hidden layers is 0, the hidden dimension is not applicable (N/A).

C.2 Derivation of Scaled Wishart PDF

This section derives the PDF of a scaled Wishart random matrix as described in Section 6.1.

Let $W \sim \mathcal{W}_d(V, n)$ denote a $d \times d$ Wishart random matrix with scale matrix $V$ and $n \geq d$ degrees of freedom. We derive the distribution of $\Sigma = W/W_{jj}$, where $W_{jj}$ is the $(j, j)$ element of $W$ for some $j \in \{1, \dots, d\}$. The Wishart density is
$$f_W(W) = \frac{|W|^{(n-d-1)/2} \exp\left\{-\tfrac{1}{2}\mathrm{tr}(V^{-1}W)\right\}}{2^{nd/2}\, |V|^{n/2}\, \Gamma_d(n/2)}, \tag{86}$$
where $\Gamma_d(\cdot)$ is the multivariate gamma function.

The inverse transformation is $W = W_{jj}\Sigma$. Note that $\Sigma$ is positive definite with $\Sigma_{jj} = 1$, so it has $d(d+1)/2 - 1$ free parameters. Together with $W_{jj}$, this matches the $d(d+1)/2$ unique elements of $W$. The Jacobian determinant of this transformation is $w_{jj}^{d(d+1)/2 - 1}$, where $w_{jj}$ denotes the value of $W_{jj}$. Substituting $W = w_{jj}\Sigma$ into the Wishart density and applying the Jacobian yields the joint density
$$f_{W_{jj}, \Sigma}(w_{jj}, \Sigma) = \frac{|\Sigma|^{(n-d-1)/2}}{2^{nd/2}\, |V|^{n/2}\, \Gamma_d(n/2)} \cdot w_{jj}^{nd/2 - 1} \exp\left\{-\frac{w_{jj}}{2}\mathrm{tr}(V^{-1}\Sigma)\right\}. \tag{87}$$

Integrating over $w_{jj} > 0$ using the gamma integral gives the marginal density of $\Sigma$:
$$f_\Sigma(\Sigma) = \frac{\Gamma(nd/2)}{2^{nd/2}\, |V|^{n/2}\, \Gamma_d(n/2)} \cdot \frac{|\Sigma|^{(n-d-1)/2}}{\left\{\tfrac{1}{2}\mathrm{tr}(V^{-1}\Sigma)\right\}^{nd/2}} \tag{88}$$
for symmetric, positive definite $\Sigma$ with $\Sigma_{jj} = 1$ for some fixed $j \in \{1, \dots, d\}$.

In the simulation study, we use $d = K - 1$, $n = K + 10$, $V = I_{K-1}$, and $j = 1$. Substituting these values and simplifying yields the density given in Section 6.1.
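Sampling from this scaled Wishart distribution is straightforward: draw $W$ from the parent Wishart and normalize by its $(j, j)$ element. A short sketch using SciPy with the simulation-study values ($j = 1$, which is index 0 in code):

```python
import numpy as np
from scipy.stats import wishart

K = 5
d, dof, j = K - 1, K + 10, 0     # simulation-study values; j = 1 is index 0
V = np.eye(d)

rng = np.random.default_rng(0)
W = wishart.rvs(df=dof, scale=V, random_state=rng)
Sigma = W / W[j, j]              # scaled Wishart draw with Sigma[j, j] = 1
print(Sigma[j, j])               # exactly 1.0
```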