Extension of Coupling via the Projection of Optimal Transport


Authors: Jakwang Kim, Young-Heon Kim, Chan Park

Abstract. In many statistical settings, two types of data are available: coupled data, which preserve the joint structure among variables but are limited in size due to cost or privacy constraints, and marginal data, which are available at larger scales but lack joint structure. Since standard methods require coupled data, marginal information is often discarded. We propose a fully nonparametric procedure that integrates decoupled marginal data with a limited amount of coupled data to improve the downstream analysis. The approach can be understood as an extension of coupling via projection in optimal transport. Specifically, the estimator is a solution of the optimal transport projection over the space of probability measures, which provides a natural geometric interpretation. Not only is its stability established, but its sample complexity is also derived using recent advances in statistical optimal transport. In addition, we present its explicit formula based on the "shadow," a notion introduced by Eckstein and Nutz. Furthermore, the estimator can be approximated in almost linear time and in parallel by the entropic shadow, which demonstrates the theoretical and practical strengths of our methods. Lastly, we present experiments with real and synthetic data to justify the performance of our method.

Contents
1. Introduction
1.1. Overview of Our Approach
1.2. Organization
2. Preliminaries
2.1. Review: Optimal Transport
2.2. Notation
3. Main results
3.1. Optimal transport projection

Date: March 31, 2026.
Key words and phrases. Extension of coupling, Reconstruction of joint distribution, Optimal transport, Optimal transport projection, Wasserstein distance projection, Shadow, Entropic Shadow.
*: Equal contribution. JK is supported by CUHK-SZ start-up UDF03004229.
YHK is partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), with Discovery Grants RGPIN-2019-03926 and RGPIN-2025-06747, as well as Exploration Grant NFRFE-2019-00944 from the New Frontiers in Research Fund (NFRF). YHK is also a member of the Kantorovich Initiative (KI), which is supported by the PIMS Research Network (PRN) program of the Pacific Institute for the Mathematical Sciences (PIMS). We thank PIMS for their generous support. This research was partially done while YHK was visiting the Korea Advanced Institute of Science and Technology (KAIST), and he thanks their generous support and excellent research environment.
© 2026 by the authors. All rights reserved.

3.2. Entropic regularization
4. Limit distribution and confidence set: finite support case
5. Simulation
5.1. Infinite Support Case
5.2. Finite Support Case
6. Real-world Application
7. Conclusion
References
Appendices
A. Appendix: Optimal transport projection framework
A.1. Entropic optimal transport
A.2. Limit distribution for finite support case
A.3. The asymptotic confidence sets
B. Appendix: Additional Simulation Results

1. Introduction

Many statistical methods aim to elucidate relationships among a set of variables $Z$. Standard approaches—regression, machine learning, and related tools—assume access to independent and identically distributed (i.i.d.) observations $Z_1, \ldots, Z_m$, each a complete realization of the random vector $Z$ drawn from its joint distribution. This intact joint structure is crucial: it ensures that the observed sample reflects the true relationships in $Z$, up to sampling error, and theoretical guarantees of statistical methods are expressed in terms of the sample size, to the extent that certain properties (e.g., consistency, central limit theorems) are attained as $m \to \infty$.
Throughout this manuscript, we refer to data preserving this joint structure as coupled data.

Unfortunately, fully coupled data are often available only for a limited fraction of the population, typically due to privacy, administrative constraints, or high collection costs. For instance, the U.S. Census Bureau's American Community Survey (ACS) 2018 5-year Public Use Microdata Sample (PUMS) contains approximately $m \approx 6.13 \times 10^5$ individual records for Illinois. Each record includes multiple attributes—such as demographic and socioeconomic characteristics—jointly observed, allowing estimation of relationships among variables (e.g., conditional probabilities). Although this sample is relatively large, it represents only about 4.8% of Illinois's total population ($n \approx 1.28 \times 10^7$), leaving the vast majority unrepresented in the joint model. In contrast, marginal or low-dimensional aggregate statistics are often available at much larger scales. The ACS 2018 5-year Summary File, for example, provides population-level marginal distributions of each characteristic in Illinois, covering the full population. However, these data do not preserve the joint structure present in PUMS, leaving the dependencies between variables unobserved. We refer to such datasets—containing only marginal information without the coupling structure—as marginal data.

In practice, marginal data are often regarded as "auxiliary" because they cannot be readily utilized by off-the-shelf statistical methods, which typically require coupled data as input. Consequently, researchers often restrict their analysis to the coupled dataset, justifying this choice with the assumption that the data are a representative sample of the population. While this justification is well-grounded, relying solely on coupled data inherently discards other available information.
This observation leads to the overarching question of this paper: "Can we leverage the information in marginal data to enhance the estimation of relationships among variables?"

The challenge of enhancing statistical inference by integrating detailed but small-scale coupled data with aggregate but large-scale marginal data is well established across statistics and econometrics. In the survey sampling literature, this is primarily framed as calibration [34, 57, 74, 101, 102], a method that re-weights survey samples so that weighted aggregates match known population totals. In the missing data literature, marginal data provide crucial information for imputing unobserved components of the coupled data, thereby enabling recovery of the full joint distribution [2, 90]. Similarly, in econometrics [22, 52, 54, 75], marginal data (often referred to as "macro" data) serve as moment restrictions within likelihood or generalized method of moments (GMM; [51]) frameworks to improve the precision of micro-level parameter estimates. More recently, researchers have utilized deep generative methods to create synthetic populations that mirror coupled data while respecting the structure of marginal data [88, 91].

Crucially, although these approaches incorporate marginal data as constraints or as known ground truth, they differ from our objective in several important respects. First, calibration methods primarily aim to improve estimation of population means for selected variables, rather than to recover complex relationships among variables; moreover, some calibration techniques can produce negative or unstable weights, leading to counterintuitive or unreliable results.
Second, in the missing data literature, marginal data are leveraged under specific parametric assumptions about the missingness mechanism and often require computationally intensive Bayesian inference to reconstruct the joint distribution, including careful justification of prior selection. Third, GMM approaches with marginal constraints, as commonly used in econometrics, generally target estimation of specific low-dimensional parameters (e.g., regression coefficients), rather than recovery of the full joint distribution. Finally, although recent deep generative methods share our goal of reconstructing lost coupling structure, their black-box nature often limits interpretability and obscures the underlying statistical principles.

1.1. Overview of Our Approach. Our approach complements the existing literature by providing a statistically principled framework that explicitly reconstructs the joint coupling structure without relying on black-box architectures or restrictive parametric assumptions.

We formally introduce the problem and our main contributions as follows. For a space $S$, $\mathcal{P}(S)$ denotes the set of probability measures on $S$. Given a probability measure $\pi^0 \in \mathcal{P}(\mathcal{X}_1 \times \cdots \times \mathcal{X}_K)$, let $\{Z_j = (Z_{1j}, \ldots, Z_{Kj}) : j = 1, \ldots, m\}$ be the set of coupled data, i.i.d. from $\pi^0$, whose marginal distributions are $\mu^i \in \mathcal{P}(\mathcal{X}_i)$ for $i = 1, \ldots, K$. Denoting by $\boldsymbol{\mu} := (\mu^1, \ldots, \mu^K)$ the vector of $K$ marginals, this can be rewritten compactly as

(1.1)    $\pi^0 \in \Pi(\boldsymbol{\mu}) := \Pi(\mu^1, \ldots, \mu^K)$,

which is the set of joint distributions over $\mathcal{X}_1 \times \cdots \times \mathcal{X}_K$ whose marginals are $\boldsymbol{\mu}$. On the other hand, there are marginal data $\{X_{1j} : j = 1, \ldots, n\}, \ldots, \{X_{Kj} : j = 1, \ldots, n\}$. We assume that they are originally i.i.d. from the same $\pi^0$ but observed only marginally; hence $X_{ij} \sim \mu^i$ independently for $i = 1, \ldots, K$.
In the ACS application of Section 6, each $Z_j$ represents the $j$-th individual's jointly observed record of $K = 5$ socioeconomic characteristics: health insurance status ($Z_{1j}$), working-age indicator ($Z_{2j}$), sex ($Z_{3j}$), race ($Z_{4j}$), and education level ($Z_{5j}$), drawn from the PUMS dataset of size $m \approx 6.13 \times 10^5$. In addition, $\{X_{ij} : j = 1, \ldots, n\}$ represents the population-level distribution of the $i$-th characteristic. For example, $\{X_{3j}\}$ is the marginal distribution of sex, obtained from the Summary File covering $n \approx 1.28 \times 10^7$ individuals in Illinois. Lastly, $\pi^0$ represents the unknown joint distribution of these characteristics within the Illinois population.

In practice, even when $n$ is much larger than $m$ (as in the ACS example above) and the marginal data contain information about $\pi^0$, they are often discarded because many existing statistical approaches require the input data to exhibit coupling structures. The goal is to reconstruct the coupling of marginal datasets by using a relatively small amount of coupled data, so that the marginal data can be incorporated into the analysis and overall inference can be improved. To formalize, define empirical measures based on the samples:

$$\pi^0_m := \frac{1}{m} \sum_{j=1}^m \delta_{Z_j}, \qquad \mu^i_n := \frac{1}{n} \sum_{j=1}^n \delta_{X_{ij}} \quad \text{for } i = 1, \ldots, K, \qquad \boldsymbol{\mu}_n := (\mu^1_n, \ldots, \mu^K_n).$$

Our main problem is then expressed as:

(1.2)    optimally reconstruct $\pi^0$ from $\pi^0_m$ and $\boldsymbol{\mu}_n$.

Although $\pi^0 \in \Pi(\boldsymbol{\mu})$, $\pi^0_m \notin \Pi(\boldsymbol{\mu}_n)$ almost surely. However, it is still reasonable to expect that under the hypothesis (1.1) there exists a $\hat{\pi} \in \Pi(\boldsymbol{\mu}_n)$ that is similar to $\pi^0_m$, thereby providing a good candidate for a coupling among the $\mu^i$'s. Our approach to (1.2) is to find a point in $\Pi(\boldsymbol{\mu}_n)$ which is closest to $\pi^0_m$ in the Wasserstein distance ($W_p$; see Section 2.2); it is indeed a metric projection.
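As a concrete illustration of these empirical objects, the sketch below builds $\pi^0_m$ and $\boldsymbol{\mu}_n$ for a toy $K = 2$ setting with binary coordinates (the supports, sample sizes, and random data here are illustrative assumptions, not the paper's ACS setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative K = 2 setting: a small coupled sample of size m and
# much larger marginal-only samples of size n
m, n = 100, 10_000
Z = rng.integers(0, 2, size=(m, 2))   # coupled data Z_j = (Z_{1j}, Z_{2j})
X1 = rng.integers(0, 2, size=n)       # marginal data for coordinate 1
X2 = rng.integers(0, 2, size=n)       # marginal data for coordinate 2

# pi0_m: empirical joint measure as a table over the product support {0,1}^2
pi0_m = np.zeros((2, 2))
np.add.at(pi0_m, (Z[:, 0], Z[:, 1]), 1.0 / m)

# mu_n^i: empirical marginal measures as weight vectors
mu1_n = np.bincount(X1, minlength=2) / n
mu2_n = np.bincount(X2, minlength=2) / n

print(pi0_m, mu1_n, mu2_n)
```

The marginals of `pi0_m` (the $\nu^i_m$ below) generally differ from `mu1_n` and `mu2_n`, which is exactly why $\pi^0_m \notin \Pi(\boldsymbol{\mu}_n)$ almost surely and a projection step is needed.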
Such an optimal transport projection has been considered in another context by Eckstein and Nutz [37, Remark 4.2] (and [6]) for a different purpose: they considered a particular type of such a projection, called the shadow, and used it in studying the stability of Sinkhorn iterations. Our problem, instead, is statistical, and we use such an optimal transport projection as a method to extend the given sampled coupling to a coupling of the entire set of marginals, serving as an estimator of the hidden ground-truth coupling; we view it as a key novelty of our present paper to frame marginal data integration as an OT projection problem. We use the shadow method of [37] to characterize our statistical estimator. In addition, we establish new sample-complexity bounds tailored to the two-sample structure $(m, n)$ of our problem, derive the limit distribution of the estimator under a finite-support assumption, and provide associated confidence sets based on the recent works [62, 67, 95].

The projection is formulated (see Eckstein and Nutz [37, Remark 4.2]) as

(1.3)    $\pi^0_m \longmapsto \hat{\pi} := \operatorname*{argmin}_{\pi' \in \Pi(\boldsymbol{\mu}_n)} W_p(\pi', \pi^0_m)$.

In the ACS application, $\hat{\pi}$ is the estimated joint distribution of the five socioeconomic variables for the full Illinois population. It is constructed by finding the coupling in $\Pi(\boldsymbol{\mu}_n)$—the set of joint distributions whose marginals match those observed in the Summary File—that is closest to the empirical joint distribution $\pi^0_m$ from the smaller PUMS dataset in terms of the Wasserstein metric. In other words, $\hat{\pi}$ extends the coupling structure learned from the PUMS to the entire population while respecting the population-level marginal constraints provided by the Summary File. Since $W_p$ is continuous and $\Pi(\boldsymbol{\mu}_n)$ is weakly compact, (1.3) always attains a solution.
Under the hypothesis (1.1), $\pi^0$ is the unique minimizer of the Wasserstein projection of $\pi^0$ onto $\Pi(\boldsymbol{\mu})$; therefore it is expected that $\hat{\pi}$ is close to $\pi^0$ when $\boldsymbol{\mu}_n$ is close to $\boldsymbol{\mu}$. (However, note that $\hat{\pi}$ is not unique in typical cases, as the marginals $\mu^i_n$ are discrete measures; see Remark 3.5.)

The main contributions of this work can be summarized as follows.

(i) A fully nonparametric approach to (1.2) via (1.3). In words, we frame the reconstruction problem as a minimization problem over Wasserstein space. Leveraging the geometry of this space, our framework provides a principled alternative to existing methods with several advantages. Unlike traditional survey calibration, it goes beyond simple re-weighting to reconstruct the full joint distribution geometrically, avoiding negative weights and the limitation of focusing only on population means. Compared with missing data methods, it does not require parametric assumptions about the missingness mechanism or the specification of subjective priors. Finally, unlike deep generative approaches, the Wasserstein-based projection replaces black-box optimization with an interpretable and mathematically grounded geometric problem.

(ii) We establish several theoretically and practically significant properties of the estimator. In particular, we prove consistency and derive upper and lower bounds on its convergence rate, characterizing its finite-sample complexity (Theorem 3.1 and Corollary 3.3 in Section 3). In the special case where the support of the probability measure is finite, we further describe the estimator's limiting distribution and the corresponding confidence sets, enabling formal statistical inference (Theorem 4.5 in Section 4).

(iii) We also provide concrete and efficient computation methods for $\hat{\pi}$ using the shadow and entropic regularization, demonstrating that it is efficiently computable.
These methods are illustrated in Section 5 and Section 6 through simulation studies and real-world applications.

1.2. Organization. We briefly discuss the background of optimal transport and prerequisite notation in Section 2. In Section 3, we introduce the optimal transport projection framework and its theoretical results in detail. We continue with the limit distribution and the confidence sets of the proposed estimator for the finite-support case in Section 4. In Section 5 and Section 6, synthetic and real data analyses support the practicality of our method. Lastly, we conclude this paper in Section 7. More details on the proofs and simulation results are given in Appendix A and Appendix B, respectively.

2. Preliminaries

2.1. Review: Optimal Transport. In the present paper, we propose an optimal transport (OT) method to reconstruct couplings from partial data. OT was proposed by Monge [73] and developed by Kantorovich [55, 56], who introduced the Kantorovich relaxation, for which he received the Nobel Prize in economic sciences. Since then, OT has interwoven partial differential equations, geometry, and probability [7, 14, 16, 17, 21, 41, 42, 49, 58, 68, 69, 80, 81]. Recently, OT has played a central role in data science: optimization of mean-field two-layer neural networks [24, 70, 89, 94], diffusion models [31, 50, 100], variational inference [35, 63], and adversarial training [45, 87].

In the late 18th century, the Monge problem [73] was introduced: find a transport map minimizing the cost subject to the pushforward condition. Later, Kantorovich studied its convex relaxation, in which the solution space is extended to the set of couplings [55, 56]. Here, we focus on a special case of the Kantorovich problem, obtained when a Polish space satisfies $\mathcal{X} = \mathcal{Y}$ and $c = d^p$, where $d$ is a metric on $\mathcal{X}$.
The $p$-Wasserstein distance for $1 \le p \le \infty$, also called the Kantorovich–Rubinstein distance, is defined over $\mathcal{P}_p(\mathcal{X})$, the set of probability measures with finite $p$-th moment, as follows:

$$W_p(\mu, \nu) := \inf_{\gamma \in \Pi(\mu, \nu)} \left( \int_{\mathcal{X} \times \mathcal{X}} d(x, y)^p \, d\gamma(x, y) \right)^{\frac{1}{p}} \quad \text{for } 1 \le p < \infty,$$
$$W_\infty(\mu, \nu) := \inf_{\gamma \in \Pi(\mu, \nu)} \operatorname*{ess\,sup}_{(X, Y) \sim \gamma} d(X, Y).$$

It is well known that $\mathcal{P}_p(\mathcal{X})$ equipped with $W_p$ is a Polish space; in particular, $(\mathcal{P}_2(\mathcal{X}), W_2)$ exhibits a Riemannian structure [80].

An important ingredient used in this work is the recent progress in statistical optimal transport, which, roughly speaking, aims at understanding the behaviour of $W_p(\mu_n, \mu)$, where $\mu_n$ is an empirical distribution of i.i.d. samples drawn from $\mu$. Since Dudley [36] paved this field, there have been numerous papers in this direction [10–12, 33, 43, 53, 64, 72, 77].

An interesting extension of OT is multimarginal optimal transport (MOT). Given $\mu^i \in \mathcal{P}(\mathcal{X}_i)$ for $i = 1, \ldots, K$ and $c : \mathcal{X}_1 \times \cdots \times \mathcal{X}_K \to \mathbb{R}$, MOT is defined as

$$\min_{\pi \in \Pi(\mu^1, \ldots, \mu^K)} \int c(x_1, \ldots, x_K) \, d\pi(x_1, \ldots, x_K).$$

Since Gangbo and Święch [44] studied the multimarginal Monge problem, there have been many subsequent papers establishing its geometric and analytic aspects [59–61, 82–84]. MOT problems have also been found in other fields: density functional theory in physics [15, 25, 26, 71, 92], theoretical economics [18, 23, 39], Wasserstein barycenters [1, 8, 19, 28, 32, 96], and adversarial training in machine learning [46, 47, 97].

2.2. Notation. Let $\mathcal{Z} := \mathcal{X}_1 \times \cdots \times \mathcal{X}_K$ and $\mathbf{x} := (x_1, \ldots, x_K) \in \mathcal{Z}$. Recall that $\boldsymbol{\mu} := (\mu^1, \ldots, \mu^K)$ is the vector of $K$ marginals, and similarly $\boldsymbol{\mu}_n := (\mu^1_n, \ldots, \mu^K_n)$, where the $\mu^i_n$'s are empirical measures of the $\mu^i$'s. For simplicity, we use $(\otimes \boldsymbol{\mu}) := \mu^1 \otimes \cdots \otimes \mu^K$, $\Pi(\boldsymbol{\mu}) := \Pi(\mu^1, \ldots, \mu^K)$ and $\Pi(\boldsymbol{\mu}, \pi^0) := \Pi(\mu^1, \ldots, \mu^K, \pi^0)$.
Here, $\Pi(\mu^1, \ldots, \mu^K)$ denotes the set of probability measures with marginals $\mu^1, \ldots, \mu^K$. Given metric spaces $(\mathcal{X}_i, d_i)$ for $i = 1, \ldots, K$, define the separable product metric $d_{\mathcal{Z},p}$ over $\mathcal{Z}$ as

(2.1)
$$d_{\mathcal{Z},p}(\mathbf{x}, \mathbf{z}) := \begin{cases} \left( \sum_{i=1}^K d_i(x_i, z_i)^p \right)^{1/p} & \text{for } p \in [1, \infty), \\ \max_{i=1,\ldots,K} d_i(x_i, z_i) & \text{for } p = \infty. \end{cases}$$

Unless otherwise noted, the $p$-Wasserstein distances on $\mathcal{X}_i$ and $\mathcal{Z}$ are defined with respect to $d_i$ and $d_{\mathcal{Z},p}$, respectively. For notational convenience and clarity, we use $W_p(\pi, \pi')$ for the $p$-Wasserstein distance on $\mathcal{Z}$ only, and

$$W_p(\boldsymbol{\mu}, \boldsymbol{\nu}) := \left( \sum_{i=1}^K W_p^p(\mu^i, \nu^i) \right)^{\frac{1}{p}} \text{ for } p \in [1, \infty), \qquad W_\infty(\boldsymbol{\mu}, \boldsymbol{\nu}) := \max_{i=1,\ldots,K} W_\infty(\mu^i, \nu^i).$$

On the space $\mathcal{Z} \times \mathcal{Z}$, $\mathrm{P}_{x_i}$ and $\mathrm{P}_{z_i}$ denote the projections onto $\mathcal{X}_i$ of the first copy of $\mathcal{Z}$ and of the second, respectively. Similarly, we use $\mathrm{P}_{\mathbf{x}}$ and $\mathrm{P}_{\mathbf{z}}$ to denote the projections onto the first copy of $\mathcal{Z}$ and the second, respectively.

3. Main results

3.1. Optimal transport projection. Since $\pi^0 \in \Pi(\boldsymbol{\mu})$, note that $\pi^0$ is the unique minimizer for the Wasserstein projection onto $\Pi(\boldsymbol{\mu})$, that is,

$$\pi^0 = \operatorname*{argmin}_{\pi' \in \Pi(\boldsymbol{\mu})} W_p(\pi', \pi^0).$$

It is expected that a solution $\hat{\pi}$ of (1.3), even though it is nonunique in typical cases (see Remark 3.5), converges to $\pi^0$ as $m$ and $n$, the numbers of samples, grow. Not only is this true, but upper and lower bounds on the $p$-Wasserstein distance between $\hat{\pi}$ and $\pi^0$ can also be obtained, as a simple consequence of [37, Lemma 3.2]. The bounds are described in terms of the distances between the true and empirical distributions only.

Theorem 3.1 (Stability). Let $\nu^i_m$ be the $i$-th marginal of $\pi^0_m$ for $i = 1, \ldots, K$, and $\boldsymbol{\nu}_m := (\nu^1_m, \ldots, \nu^K_m)$. Then

$$W_p(\boldsymbol{\mu}, \boldsymbol{\mu}_n) \le W_p(\hat{\pi}, \pi^0) \le W_p(\boldsymbol{\mu}_n, \boldsymbol{\nu}_m) + W_p(\pi^0_m, \pi^0).$$

Proof.
Since $\hat{\pi} \in \Pi(\boldsymbol{\mu}_n)$, the lower bound follows as

$$W_p(\hat{\pi}, \pi^0) \ge \min_{\pi \in \Pi(\boldsymbol{\mu}_n)} W_p(\pi, \pi^0) = W_p(\boldsymbol{\mu}, \boldsymbol{\mu}_n),$$

where the last equality is by [37, Lemma 3.2] and (1.1). The upper bound is attained by [37, Lemma 3.2] as

$$W_p(\hat{\pi}, \pi^0) \le W_p(\hat{\pi}, \pi^0_m) + W_p(\pi^0_m, \pi^0) = \min_{\pi \in \Pi(\boldsymbol{\mu}_n)} W_p(\pi, \pi^0_m) + W_p(\pi^0_m, \pi^0) = W_p(\boldsymbol{\mu}_n, \boldsymbol{\nu}_m) + W_p(\pi^0_m, \pi^0). \qquad \square$$

Remark 3.2 (Statistical interpretation of the stability bounds). The two terms on the right-hand side of the upper bound in Theorem 3.1 have distinct statistical interpretations. The term $W_p(\boldsymbol{\mu}_n, \boldsymbol{\nu}_m)$ measures how well the marginals of the small coupled data ($\boldsymbol{\nu}_m$) match the marginals of the large decoupled data ($\boldsymbol{\mu}_n$); this term is small when the coupled sample is representative of the population marginals. The term $W_p(\pi^0_m, \pi^0)$ measures the quality of the coupled empirical measure as an approximation of the true joint distribution $\pi^0$; this is the standard sampling error from the coupled data alone, which vanishes as $m$ grows. The lower bound $W_p(\boldsymbol{\mu}, \boldsymbol{\mu}_n)$ shows that the estimator cannot do better than the marginal convergence rate determined by the large dataset of size $n$. Together, these bounds confirm qualitatively that the proposed estimator benefits from both data sources.

Combined with [43, Theorem 1], Theorem 3.1 yields upper and lower bounds on the sample complexity of $\hat{\pi}$. Here we assume for simplicity that all dimensions are sufficiently larger than $p$.

Corollary 3.3 (Sample complexity). Assume that $\mathcal{X}_i = \mathbb{R}^{d_i}$ and $d_i > 2p$ for $1 \le i \le K$, and also that there is some $q > \frac{p \sum d_i}{\sum d_i - p}$ such that the $q$-th moment of $\pi^0$ (hence of all the $\mu^i$'s) is finite. Then there are constants $C_1, C_2, C_3$, which do not depend on $n, m$ but on $p$, $\sum d_i$, $q$ and the $q$-th moments of each of the measures, such that

$$C_1 \left( \sum_{i=1}^K n^{-\frac{p}{d_i}} \right)^{\frac{1}{p}} \le \mathbb{E}\, W_p(\hat{\pi}, \pi^0) \le C_2\, m^{-\frac{1}{d_1 + \cdots + d_K}} + C_3 \left( \sum_{i=1}^K n^{-\frac{p}{d_i}} \right)^{\frac{1}{p}}.$$

Remark 3.4 (Curse of dimensionality). The corollary quantifies the estimation error in terms of the two sample sizes $m$ and $n$. Folklore in the OT community says that the rate of convergence of empirical distributions in the Wasserstein distance suffers from the curse of dimensionality: $O(n^{-1/d})$, where $d$ is the ambient dimension. Since the marginal terms $n^{-p/d_i}$ are negligible when $n \gg m$, the overall error is dominated by the coupled data term $m^{-1/(d_1 + \cdots + d_K)}$. This reconfirms that the estimation error exhibits the curse of dimensionality, which conforms with the fact that estimating the joint distribution becomes increasingly difficult in higher dimensions.

On the other hand, if the support of the measure enjoys intrinsic low dimensionality, the exponents can be improved. Niles-Weed and Bach [76] proved that the optimal rate of convergence of empirical distributions in the Wasserstein distance is $O(n^{-1/k})$, where $k$ relates to the Wasserstein dimension, provided that the support is compact. Thus, the sample complexity of $\hat{\pi}$ can be improved if $\pi^0$ possesses intrinsic low dimensionality.

The next natural question is how to compute $\hat{\pi}$. Rewriting (1.3), it is equivalent to the following MOT problem:

(3.1)
$$\min_{\gamma \in \Pi(\boldsymbol{\mu}_n, \pi^0_m)} \left( \int d_{\mathcal{Z},p}(\mathbf{x}, \mathbf{z})^p \, d\gamma \right)^{\frac{1}{p}} \quad \text{for } p \in [1, \infty),$$
$$\min_{\gamma \in \Pi(\boldsymbol{\mu}_n, \pi^0_m)} \max_{i=1,\ldots,K} \operatorname*{ess\,sup}_{(X_i, Z_i) \sim \gamma^i} d_i(X_i, Z_i) \quad \text{for } p = \infty,$$

where the marginal constraints should be understood as $(\mathrm{P}_{\mathbf{x}})_\#(\gamma) = \pi'$ for some $\pi' \in \Pi(\boldsymbol{\mu}_n)$ and $(\mathrm{P}_{\mathbf{z}})_\#(\gamma) = \pi^0_m$. Note that $\gamma^i = (\mathrm{P}_{x_i}, \mathrm{P}_{z_i})_\# \gamma \in \Pi(\mu^i_n, \nu^i_m)$. Denoting an optimal multimarginal coupling of (3.1) by $\hat{\gamma}$, an optimal solution $\hat{\pi}$ of (1.3) is obtained from $\hat{\gamma}$ by $\hat{\pi} := (\mathrm{P}_{\mathbf{x}})_\#(\hat{\gamma})$.

Remark 3.5.
Note, for instance, that the above explains the nonuniqueness of $\hat{\pi}$, as an optimizer $\hat{\gamma}$ of the MOT problem (3.1) is generally not unique.

OT (linear programming in general) is solvable in $O(n^3)$ computational complexity by the simplex method or the Hungarian method. In particular, (3.1) requires $O(n^{3(K+1)})$ complexity. Note that solving the generic MOT is NP-hard as the number of marginals increases [4]. Using the separability of $d_{\mathcal{Z},p}$, however, $\hat{\pi}$ can be computed efficiently based on the shadow, as proposed by Eckstein and Nutz [37]:

Theorem 3.6 (Shadow [37]). Let $\hat{\gamma}^i$ be an optimal coupling for $W_p(\mu^i_n, \nu^i_m)$, which can be written as $\hat{\gamma}^i(dx_i, dz_i) = \nu^i_m(dz_i)\, \hat{\kappa}^i(dx_i \mid z_i)$ by the disintegration formula. Then $\hat{\gamma}(d\mathbf{x}, d\mathbf{z}) = \pi^0_m(d\mathbf{z})\, \hat{\kappa}(d\mathbf{x} \mid \mathbf{z})$ is optimal for (3.1), where

$$\hat{\kappa}(d\mathbf{x} \mid \mathbf{z}) := \hat{\kappa}^1(dx_1 \mid z_1) \otimes \cdots \otimes \hat{\kappa}^K(dx_K \mid z_K).$$

In particular, an optimal $\hat{\pi}$ (the shadow of $\pi^0_m$ onto $\Pi(\boldsymbol{\mu}_n)$) for (1.3) is obtained by

$$\hat{\pi}(d\mathbf{x}) = \int_{\mathcal{Z}} \pi^0_m(d\mathbf{z})\, \hat{\kappa}(d\mathbf{x} \mid \mathbf{z}).$$

Furthermore, (1.3) = (3.1) = $W_p(\boldsymbol{\mu}_n, \boldsymbol{\nu}_m)$.

Proof. See [37, Lemma 3.2]. □

Remark 3.7. Note that optimal solutions to the optimal transport projection problem (1.3), equivalently (3.1), are not necessarily given by a shadow; see Eckstein and Nutz [37, Remark 4.2].

The computational benefit of the shadow is straightforward. Each $\hat{\gamma}^i$ can be computed with $O(n^3)$ computational cost. Overall, the total computational cost of $\hat{\gamma}$ is $O(Kn^3)$, which is much less than the $O(n^{3(K+1)})$ complexity of (3.1). In addition to this efficiency, parallel computation of $\hat{\gamma}$ is applicable in this case, which further improves the speed of the computation.

3.2. Entropic regularization. There is an estimator with almost linear time complexity to approximate OT, based on the so-called entropic regularization method proposed by Cuturi [27].
This framework allows one to compute the approximate solution in almost linear time [3, 66]. Fix $\eta > 0$, called the entropic parameter. For our MOT problem (3.1), the usual entropic regularized version is defined as follows:

$$\min_{\gamma \in \Pi(\boldsymbol{\mu}_n, \pi^0_m)} \int d_{\mathcal{Z},p}(\mathbf{x}, \mathbf{z})^p \, d\gamma + \eta\, \mathrm{KL}\big(\gamma \,\|\, (\otimes \boldsymbol{\mu}_n) \otimes \pi^0_m\big) \quad \text{for } p \in [1, \infty),$$
$$\min_{\gamma \in \Pi(\boldsymbol{\mu}_n, \pi^0_m)} \max_{i=1,\ldots,K} \operatorname*{ess\,sup}_{(X_i, Z_i) \sim \gamma} d_i(X_i, Z_i) + \eta\, \mathrm{KL}\big(\gamma \,\|\, (\otimes \boldsymbol{\mu}_n) \otimes \pi^0_m\big) \quad \text{for } p = \infty.$$

Due to the strict convexity of the KL divergence, the entropic MOT admits a unique optimal solution. Also, importantly, the entropic MOT boosts the speed of computation. However, this approach does not take advantage of the separability of the cost $d_{\mathcal{Z},p}(\mathbf{x}, \mathbf{z})^p$.

Remark 3.8. The generic $K$-marginal MOT requires $O(\eta^{-2} K^3 n^K \log n)$ computational cost if each of the marginals has $O(n)$ points [66, Theorem 16].

Let us introduce the entropic shadow, which allows us to leverage the separability structure significantly. Consider the following entropic regularized version of (3.1):

(3.2)
$$\min_{\gamma \in \Pi(\boldsymbol{\mu}_n, \pi^0_m)} \sum_{i=1}^K \int d_i(x_i, z_i)^p \, d\gamma^i + \eta\, \mathrm{KL}(\gamma^i \,\|\, \mu^i_n \otimes \nu^i_m) \quad \text{for } p \in [1, \infty),$$
$$\min_{\gamma \in \Pi(\boldsymbol{\mu}_n, \pi^0_m)} \max_{i=1,\ldots,K} \operatorname*{ess\,sup}_{(X_i, Z_i) \sim \gamma^i} d_i(X_i, Z_i) + \eta\, \mathrm{KL}(\gamma^i \,\|\, \mu^i_n \otimes \nu^i_m) \quad \text{for } p = \infty.$$

Denoting by $\hat{\gamma}^{i,\eta} \in \Pi(\mu^i_n, \nu^i_m)$ the (unique) optimal entropic coupling for

$$\int d_i(x_i, z_i)^p \, d\gamma^i + \eta\, \mathrm{KL}(\gamma^i \,\|\, \mu^i_n \otimes \nu^i_m) \quad \text{or} \quad \operatorname*{ess\,sup}_{(X_i, Z_i) \sim \gamma^i} d_i(X_i, Z_i) + \eta\, \mathrm{KL}(\gamma^i \,\|\, \mu^i_n \otimes \nu^i_m),$$

define the kernel $\hat{\kappa}^{i,\eta}(dx_i \mid z_i)$ such that $\hat{\kappa}^{i,\eta}(dx_i \mid z_i)\, \nu^i_m(dz_i) = \hat{\gamma}^{i,\eta}(dx_i, dz_i)$. Similarly to Theorem 3.6, the optimal entropic MOT coupling for (3.2), denoted by $\hat{\gamma}^\eta$, is obtained as $\hat{\gamma}^\eta(d\mathbf{x}, d\mathbf{z}) = \pi^0_m(d\mathbf{z})\, \hat{\kappa}^\eta(d\mathbf{x} \mid \mathbf{z})$, where $\hat{\kappa}^\eta = \hat{\kappa}^{1,\eta}(dx_1 \mid z_1) \otimes \cdots \otimes \hat{\kappa}^{K,\eta}(dx_K \mid z_K)$.
Then the entropic estimator $\hat{\pi}^\eta$ is obtained by

(3.3)    $\hat{\pi}^\eta(d\mathbf{x}) = \int_{\mathcal{Z}} \pi^0_m(d\mathbf{z})\, \hat{\kappa}^\eta(d\mathbf{x} \mid \mathbf{z})$.

The name entropic shadow reflects that $\hat{\pi}^\eta$ is the shadow of $\pi^0_m$ obtained via the entropic regularized $p$-Wasserstein distance.

Remark 3.9 (Nonuniqueness). Notice that since the entropic marginal couplings $\hat{\gamma}^{i,\eta}$ are uniquely determined, the entropic shadow is also uniquely determined; however, this does not imply uniqueness of an optimizer for (3.2), as one can consider a counterexample similar to that given in [37, Remark 4.2].

A strong advantage of the entropic shadow is its computational speed and efficiency. Note that it can be computed in parallel as before, and each $\hat{\gamma}^{i,\eta}$ can be computed by the Sinkhorn algorithm, whose computational cost is almost linear. Overall, it decreases the computational cost to $\tilde{O}(Kn^2)$, which is much less than $\tilde{O}(K^3 n^K)$.

A drawback of the entropic approach is that $\hat{\pi}^\eta$ does not recover the true $\pi^0$ in general as $n, m \to \infty$ for fixed $\eta > 0$, due to the entropic bias. One could ask whether the bias vanishes as the regularization disappears. Such a convergence is well known in the literature. Under mild conditions, Γ-convergence guarantees the convergence of the entropic estimator to $\pi^0$; see Theorem A.4.

Remark 3.10. Let us briefly review the history of the Γ-convergence of entropic OT to vanilla OT. Léonard [65] proves the Γ-convergence by relating it to large deviation theory, under the assumption that $c$ is nonnegative and lower semicontinuous. Carlier et al. [20] prove the same result for $c = d^2$ with an emphasis on numerical applications. Bernton et al. [9] prove the same result with large deviation theory and provide a characterization of the rate function in terms of Kantorovich potentials based on a geometric argument. Nutz and Wiesel [79] prove the convergence of entropic dual potentials to Kantorovich potentials.
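For finitely supported data, the entropic shadow can be assembled exactly as described: run Sinkhorn coordinate-by-coordinate, form the kernels $\hat{\kappa}^{i,\eta}$, and average them against $\pi^0_m$ as in (3.3). A minimal sketch on a toy $K = 2$ instance (the weights, supports, and the textbook `sinkhorn` routine are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def sinkhorn(a, b, C, eta, n_iter=300):
    """Entropic coupling argmin_g <C, g> + eta * KL(g || a x b) over Pi(a, b)."""
    K = np.exp(-C / eta) * a[:, None] * b[None, :]   # Gibbs kernel w.r.t. a x b
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):                          # alternating scalings
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

# toy K = 2 instance on supports {0, 1}: coupled empirical measure pi0_m,
# large-sample marginals mu_i; nu_i are the marginals of pi0_m
pts = np.array([0.0, 1.0])
pi0_m = np.array([[0.5, 0.0],
                  [0.0, 0.5]])
nu = [pi0_m.sum(axis=1), pi0_m.sum(axis=0)]
mu = [np.array([0.3, 0.7]), np.array([0.4, 0.6])]
C = np.abs(pts[:, None] - pts[None, :]) ** 2         # per-coordinate cost d_i^p

# entropic couplings gamma^{i,eta} in Pi(mu_i, nu_i) and kernels kappa^{i,eta}(x|z)
eta = 0.5
kappa = []
for i in range(2):
    g = sinkhorn(mu[i], nu[i], C, eta)
    kappa.append(g / g.sum(axis=0, keepdims=True))   # divide by nu_i (column sums)

# entropic shadow (3.3): pi_eta(x1,x2) = sum_z pi0_m(z1,z2) kappa_1(x1|z1) kappa_2(x2|z2)
pi_eta = np.einsum('st,as,bt->ab', pi0_m, kappa[0], kappa[1])
print(pi_eta)
```

The marginals of `pi_eta` match `mu` up to the Sinkhorn tolerance; shrinking `eta` moves the result toward the unregularized shadow at the cost of slower Sinkhorn convergence.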
Proposition 3.11 (Γ-convergence). As $\eta \to 0$, (3.2) Γ-converges to (3.1). Furthermore, there is a subsequence of solutions of (3.2) which converges to a solution of (3.1).

Proof. See [78, Section 5]. □

Corollary 3.12 (Vanishing bias). There is a subsequence of $\hat{\pi}^\eta$ converging to $\pi^0$ weakly as $\eta \to 0$ and $n, m \to \infty$.

Proof. By [48, Theorem 1.4], $\hat{\gamma}^\eta \to \gamma^\eta$ weakly as $n, m \to \infty$, and hence $\hat{\pi}^\eta \to \pi^\eta$ weakly as well. At the same time, Proposition 3.11 implies that there is a subsequence of $\pi^\eta$ converging to $\pi^0$. Choosing $(n_k, m_k, \eta_k)$ accordingly, therefore, one can find a subsequence of $\hat{\pi}^{\eta_k}_{n_k, m_k}$ converging to $\pi^0$ weakly. □

4. Limit distribution and confidence set: finite support case

Discrete data are common in many applications. For example, in the ACS dataset, all five variables (health insurance status, working-age indicator, sex, education level, and race) are categorical. Binary variables take two values, while race has four categories, meaning that the joint distribution spans at most $2 \times 2 \times 2 \times 4 \times 2 = 64$ cells; thus, $\pi^0$ is inherently a finitely supported distribution. More generally, survey and administrative data in social science, public health, and government often involve categorical or discretized responses, making finite support a natural assumption. Motivated by this, we focus on the case where $\pi^0$ (hence each of the $\mu^i$'s) has finite support, aligning with this practical motivation.

The goal of this section is to derive the limit distribution of the estimator $\hat{\pi}$. For the finite-support case, our problem reduces to a finite-dimensional linear program; hence, understanding the limit distribution of $\hat{\pi}$ requires studying the asymptotic behavior of the corresponding linear program, which has recently been investigated by Sommerfeld and Munk [95], Klatt et al. [62] and Liu et al. [67].
, x is i } and s i := |X i | for i = 1 , . . . , K and s 0 := | spt( π 0 ) | ≤ s 1 · · · s K . Also, w e will use S + := s 0 + s 1 + · · · + s K and S ∗ := s 0 s 1 . . . s K . After vectorizing all the ob jects, ( 3.1 ) b ecomes a finite-dimensional linear program: (4.1) min γ ∈ R S ∗ ⟨ c , γ ⟩ , s.t. A γ = b n,m , γ ≥ 0 where c ∈ R S ∗ is the v ectorization of the s 1 × · · · × s K × s 0 cost tensor (defined as d p Z ,p ), A ∈ R S + × S ∗ is the pro jection matrix corresponding to the marginal constrain ts, and b n,m :=  µ n π 0 m  , b :=  µ π 0  ∈ R s 1 × · · · × R s K × R s 0 are the vector of marginal distributions. R emark 4.1 . F or generic MOT, a canonical wa y to vectorize the tensors c and γ is to write them in lexicographical order. Example 4.2 (Marginal constraint vectors in the ACS application) . T o il lustr ate the structur e of b n,m and b , c onsider a simplifie d version of the ACS applic ation with K = 2 variables: sex ( X 3 , binary: s 1 = 2 ) and r ac e ( X 4 , 4-level: s 2 = 4 ). The joint supp ort has s 0 ≤ s 1 · s 2 = 8 c el ls. Then b n,m =    µ 3 n µ 4 n π 0 m    =    ( ˆ p Male , ˆ p F emale ) ( ˆ p White , ˆ p Black , ˆ p Asian , ˆ p Others ) ( ˆ p Male,White , ˆ p Male,Black , . . . , ˆ p F emale,Others )    ∈ R 2 × R 4 × R 8 , wher e ˆ p denotes empiric al pr op ortions: the first two blo cks ar e estimate d fr om the Summary File (i.e., mar ginal data), and the last blo ck fr om the PUMS (i.e., c ouple d data). The p opulation c ounterp art b has the same structur e with true pr op ortions r eplacing ˆ p . The terms c and A are fixed and only b n,m v aries, so w e denote the set of solutions for ( 4.1 ) by γ ( b n,m ). Klatt et al. [ 62 ] and Liu et al. [ 67 ] study the limit distribution of γ ( b n,m ) for the case (4.2) r n,m ( b n,m − b ) d − → G 12 JAKW ANG KIM*, YOUNG-HEON KIM, AND CHAN P ARK* for some random pro cess G with rate r n,m → ∞ as n, m → ∞ . 
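To make the vectorized program (4.1) concrete, here is a minimal sketch that solves a toy two-marginal instance with `scipy.optimize.linprog`. The 2 × 2 support, cost, and marginal proportions are illustrative choices of ours, not the ACS setup or the paper's code.

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance of the vectorized linear program (4.1): couple two binary
# marginals (s1 = s2 = 2), so gamma lives in R^4 (lexicographic order).
c = np.array([0.0, 1.0, 1.0, 0.0])      # cost |z1 - z2|, vectorized

# A stacks the marginal-projection constraints (one row per cell of each
# marginal); b holds the observed marginal proportions.
A = np.array([
    [1, 1, 0, 0],   # gamma(0,0) + gamma(0,1) = mu1(0)
    [0, 0, 1, 1],   # gamma(1,0) + gamma(1,1) = mu1(1)
    [1, 0, 1, 0],   # gamma(0,0) + gamma(1,0) = mu2(0)
    [0, 1, 0, 1],   # gamma(0,1) + gamma(1,1) = mu2(1)
], dtype=float)
b = np.array([0.3, 0.7, 0.4, 0.6])

res = linprog(c, A_eq=A, b_eq=b, bounds=(0, None), method="highs")
gamma = res.x.reshape(2, 2)
print(gamma)   # optimal coupling; here mass concentrates on the diagonal
```

In the actual estimator, b would additionally stack the vectorized pilot coupling π_{0m}, and A the corresponding projection rows, but the LP structure is the same.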
For the limit distribution of γ(b_{n,m}), the following assumptions on (4.1) are imposed.

Assumption 4.3 ([67, Assumption 1]). The constraint matrix A has full rank, the optimal solution set γ(b) is nonempty and bounded, and Slater's condition [13, Section 5.2.3] is satisfied, i.e., there exists γ_0 ∈ R^{S_*} such that γ_0 > 0 and Aγ_0 = b.

Remark 4.4 (Slater's condition). In the statistical context, Slater's condition means that there must exist a joint distribution over Z with strictly positive probability on every cell of the product support. In the ACS application, this requires every combination of health insurance status, working age, sex, race, and education to have positive probability in the Illinois population, which is a mild and verifiable condition given the large population size.

Let π̂(b_{n,m}) denote a solution of the projection, i.e., π̂(b_{n,m}) = (P_x)_# γ(b_{n,m}). Also, let G_{μ,π_0} be the limit distribution of b_{n,m} given in (4.2); see (A.8) for more details. We apply [62, Theorem 6.1] and [67, Theorem 4] to obtain the limit distribution of π̂(b_{n,m}) as follows.

Theorem 4.5. Assume that m/(n + m) → λ ∈ (0, 1) as n, m → ∞, spt(π_0) is finite, and c = d^p_{Z,p} in (2.1). Let π̂(b_{n,m}) be a solution of (1.3) with input μ_n and π_{0m}. Then

  √(nm/(n + m)) (π̂(b_{n,m}) − π_0) →_d (P_x)_# (p^*(G_{μ,π_0})),

where p^*(G_{μ,π_0}) is the set of optimal solutions of the following linear program:

  min_{p ∈ R^{S_*}} ⟨c, p⟩  s.t.  Ap = G_{μ,π_0},  p_{ij} ≥ 0 for all (ij) ∉ spt((Id, Id)_#(π_0)).

Proof. Theorem A.7 shows that (4.1) satisfies Assumption 4.3. Then, by Theorem A.6, we have

(4.3)  √(nm/(n + m)) (γ(b_{n,m}) − γ(b)) →_d p^*(G_{μ,π_0}),

where G_{μ,π_0} is the Gaussian defined in (A.8) and p^*(G_{μ,π_0}) is the set of optimal solutions to the linear program

(4.4)  min_{p ∈ R^{S_*}} ⟨c, p⟩  s.t.  Ap = G_{μ,π_0},  p_{ij} ≥ 0 for all (ij) ∉ spt(γ(b)).

This follows from Theorem A.6 ([62, Theorem 6.1] and [67, Theorem 4]). Note that γ(b) = {(Id, Id)_#(π_0)} since c = 0 only on the diagonal. Recalling that π̂(b_{n,m}) = (P_x)_#(γ(b_{n,m})), applying (P_x)_# to (4.3) yields the conclusion by the continuous mapping theorem. □

Remark 4.6 (Asymptotic behavior and efficiency). The convergence rate in Theorem 4.5, namely √(nm/(n + m)), is parametric (root-n type) in the combined sample sizes. When n ≫ m, as in the ACS data, this rate simplifies to approximately √m, indicating that the smaller sample governs the effective rate. The main practical benefit of incorporating marginal data is not an increase in the convergence rate itself, but that the estimator π̂ provides a more accurate point estimate of π_0 than the empirical measure based only on the coupled data, π_{0m}. This improvement is reflected in the narrower confidence intervals shown in Sections 5 and 6.

For statistical inference tasks, one needs more than Theorem 4.5, since it does not provide any data-driven description of asymptotically valid confidence sets. Two approaches can be used to obtain such confidence sets. The first is subsampling [85, 86], also known as the m out of n bootstrap. The validity of subsampling relies on the existence of an asymptotic distribution for the estimator, which is ensured by Theorem 4.5. The procedure is implemented in available software, such as the moonboot R package [29, 30], and we refer readers to these references for implementation details.

The second approach follows [67], in which asymptotic confidence sets are constructed using the auxiliary linear program (4.4).
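The subsampling procedure mentioned above can be sketched in a few lines. This is a generic m-out-of-n bootstrap routine with a placeholder statistic (the sample mean on synthetic data), not the paper's OT estimator or the moonboot implementation; in the paper's setting, `statistic` would solve the projection LP on each subsample and `rate` would be √(nm/(n + m)).

```python
import numpy as np

rng = np.random.default_rng(0)

def subsampling_ci(data, statistic, rate, b, B=500, alpha=0.05, rng=rng):
    """m-out-of-n bootstrap (subsampling) confidence interval.

    Valid whenever rate(n) * (statistic_n - theta) has a limit
    distribution, as Theorem 4.5 guarantees for the OT estimator.
    """
    n = len(data)
    theta_hat = statistic(data)
    roots = np.empty(B)
    for j in range(B):
        sub = rng.choice(data, size=b, replace=False)   # draw WITHOUT replacement
        roots[j] = rate(b) * (statistic(sub) - theta_hat)
    lo_q, hi_q = np.quantile(roots, [alpha / 2, 1 - alpha / 2])
    # Invert the root: theta lies in [theta - hi_q/rate(n), theta - lo_q/rate(n)]
    return theta_hat - hi_q / rate(n), theta_hat - lo_q / rate(n)

data = rng.normal(loc=1.0, scale=1.0, size=2000)
ci = subsampling_ci(data, np.mean, rate=np.sqrt, b=200)
print(ci)   # should cover the true mean 1.0 with high probability
```

The key requirement is that the subsample size b grows with n while b/n → 0, which is what distinguishes subsampling from the (here invalid) standard bootstrap.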
Although their result rigorously guarantees an at least 1 − α confidence set, it is too conservative to apply directly to real data, compared with the subsampling method. The full statement of this approach is presented in Section A.3.

Remark 4.7. While the bootstrap [38] is widely used and often viewed as a convenient tool for practical inference, we do not recommend it for inference on the estimator obtained from the linear program (4.1). The bootstrap is known to be invalid for non-regular estimators [5, 40, 93], and the estimator derived from (4.1) is generally non-regular. As a result, bootstrap procedures may fail to yield valid statistical inference. Our simulation results indeed confirm that the standard bootstrap can produce invalid confidence intervals; see Section 5.

5. Simulation

We conducted simulation studies to investigate the finite-sample performance of the proposed approach.

5.1. Infinite Support Case. First, we considered a setting with infinite support, i.e., continuous random variables. The number of coupled observations was varied over m ∈ {50, 100, 200} and the number of decoupled observations over n ∈ {5m, 10m, 25m}. We then generated (m + n) i.i.d. coupled data from the bivariate normal distribution

  (Z_1, Z_2) ∼ π_0 = N( [0, 0], [[1, ρ], [ρ, 1]] ).

The covariance parameter ρ was varied over ρ ∈ {0, 0.25, 0.75} to examine the performance of the proposed method under conditions ranging from no correlation to a strong association between Z_1 and Z_2. The marginal distributions of Z_1 and Z_2 are both μ = ν = N(0, 1). The coupled dataset consisted of the first m (Z_1, Z_2) pairs, while the decoupled dataset was obtained by de-coupling the remaining n pairs and randomly permuting them. We estimated the coupling for the marginal data from (3.2) with the squared Euclidean cost.
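The entropically regularized problem (3.2) is typically solved by Sinkhorn iterations. The routine below is the textbook two-marginal version on a toy discrete example of our own choosing, not the paper's full multi-marginal projection estimator; η plays the role of the regularization parameter discussed above.

```python
import numpy as np

def sinkhorn(mu, nu, C, eta, n_iter=2000):
    """Entropic OT between discrete marginals mu, nu with cost matrix C.

    Standard Sinkhorn scaling; the paper's estimator couples K marginals
    with a pilot coupled sample, but the basic update is the same.
    """
    K = np.exp(-C / eta)                  # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(n_iter):
        v = nu / (K.T @ u)                # match second marginal
        u = mu / (K @ v)                  # match first marginal
    return u[:, None] * K * v[None, :]    # entropic coupling

# Toy example: two 3-point marginals on a grid, squared Euclidean cost.
x = np.array([0.0, 1.0, 2.0])
C = (x[:, None] - x[None, :]) ** 2
mu = np.array([0.2, 0.5, 0.3])
nu = np.array([0.3, 0.4, 0.3])
plan = sinkhorn(mu, nu, C, eta=0.5)
print(plan.sum(axis=1), plan.sum(axis=0))   # approximately mu and nu
```

Each iteration costs one matrix-vector product, which is the source of the almost-linear-time approximation mentioned in the abstract; smaller η yields couplings closer to the unregularized LP solution but slower, less stable iterations.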
For the entropic regularization parameter η, we considered two settings: (i) fixed values η ∈ {2^0, 2^{-3}, 2^{-6}}; and (ii) a data-driven choice of η based on a cross-validation procedure, which we denote by η_cv; the details of this procedure are provided in Algorithm 1. Roughly speaking, the cross-validation procedure consists of the following steps: (i) partition the coupled dataset into K non-overlapping, approximately equal-sized folds; (ii) apply the proposed extension-of-coupling framework to the dataset excluding one fold; (iii) evaluate the cost on the held-out fold; (iv) repeat steps (ii) and (iii) for each of the K folds. Finally, η_cv is defined as the value of η that minimizes the average cost across all held-out folds. To reduce the dependence on a particular split, steps (i)-(iv) can be repeated R times. For the simulation study, we used R = 10 repetitions and K = 5 folds, with the candidate set of η values given by E = {2^0, 2^{-1}, 2^{-2}, 2^{-3}, 2^{-4}, 2^{-5}, 2^{-6}}.

Algorithm 1: Repeated Cross-validation for Selecting η
1: Define a set of candidate values for η, denoted by E = {η_1, ..., η_J}
2: Let R be the number of cross-validation repetitions
3: for j = 1, ..., J do
4:   for r = 1, ..., R do
5:     Partition I = {1, ..., m} into K non-overlapping, approximately equal-sized folds, denoted by I_{r,1}, ..., I_{r,K}
6:     for k = 1, ..., K do
7:       Let π̂^{(j)}_{r,k} denote an estimate of π_0 obtained by using I \ I_{r,k} as the coupled dataset and I_{r,k} as the marginal dataset, with η_j as the entropic regularization parameter
8:       Evaluate the cost over I_{r,k} using the estimated coupling:
           C^{(j)}_{r,k} = (1/|I_{r,k}|) Σ_{i ∈ I_{r,k}} ∫ c((x, y), (x_i, y_i)) dπ̂^{(j)}_{r,k}(x, y)
9:     end for
10:  end for
11: end for
12: Select the value of η that minimizes the average cost across folds and repetitions:
      η_cv = η_{j*}, where j* = argmin_{j ∈ {1,...,J}} (1/R) Σ_{r=1}^R Σ_{k=1}^K C^{(j)}_{r,k}

To assess the performance of the estimated coupling π̂, we computed estimates of the covariance and the cumulative distribution function at (−0.25, −0.25), given by

  ρ̂ = ∫ z_1 z_2 dπ̂(z_1, z_2) − ∫ z_1 dπ̂(z_1, z_2) · ∫ z_2 dπ̂(z_1, z_2),
  F̂(−0.25, −0.25) = ∫ 1(z_1 ≤ −0.25, z_2 ≤ −0.25) dπ̂(z_1, z_2).

If the estimated coupling π̂ is close to the true coupling π_0, then ρ̂ and F̂(−0.25, −0.25) will also be close to ρ and F(−0.25, −0.25) = ∫ 1(z_1 ≤ −0.25, z_2 ≤ −0.25) dπ_0(z_1, z_2). For comparison, we also computed the same quantities based on the m coupled data, denoted by ρ̃ and F̃(−0.25, −0.25); note that these estimators are consistent by the law of large numbers.

Table 1 presents the root mean squared error (RMSE) of ρ̃ and ρ̂, averaged over 500 repetitions, for each combination of (m, n, ρ). First, for each ρ and fixed ratio n/m, the RMSE of ρ̃ decreases with m, approximately at the rate 1/√m, which corresponds to the parametric rate. Similarly, the RMSE of ρ̂ also decreases with m at roughly the same rate.

Second, the performance of ρ̂ depends on both the entropic regularization parameter η and the true value of ρ.
Specifically, when ρ = 0 (i.e., Z_1 and Z_2 are independent), larger values of η yield the smallest RMSE among the candidate choices. Conversely, when ρ = 0.75 (i.e., Z_1 and Z_2 are strongly dependent), smaller values of η achieve the lowest RMSE. Moreover, if η is mismatched to ρ (for example, a small η when ρ is small, or a large η when ρ is large), the resulting ρ̂ performs worse than ρ̃. This indicates that, under such settings, the proposed optimal extension-of-coupling method offers no practical advantage, as the estimator based solely on the coupled data performs better.

Third, regardless of the value of ρ, the cross-validation procedure yields a ρ̂ that outperforms ρ̃. This shows that selecting η via cross-validation consistently enables the use of marginal data to obtain an estimator more efficient than one based solely on the coupled data. As noted above, such uniform performance gains are not achievable with a fixed, a priori choice of η. Moreover, the last column of Table 1 shows that the procedure tends to select a large η when ρ is small and a small η when ρ is large, indicating that cross-validation automatically adapts to the underlying dependence structure.
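The selection loop of Algorithm 1 can be sketched generically. In this sketch, `fit` and `cost` are hypothetical stand-ins (a mean shrunken toward zero by η, scored by held-out squared error), since the paper's actual estimator solves an entropic OT projection on each training fold; only the repeated K-fold bookkeeping mirrors Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(1)

def select_eta(data, candidates, fit, cost, K=5, R=10, rng=rng):
    """Repeated K-fold cross-validation for eta, following Algorithm 1.

    `fit(train, eta)` returns an estimate and `cost(estimate, fold)` its
    held-out cost; both are user-supplied stand-ins for the paper's
    entropic coupling estimator and transport cost.
    """
    scores = np.zeros(len(candidates))
    for j, eta in enumerate(candidates):
        for _ in range(R):                        # R repeated splits
            idx = rng.permutation(len(data))
            for fold in np.array_split(idx, K):   # K held-out folds
                train = np.setdiff1d(idx, fold)
                scores[j] += cost(fit(data[train], eta), data[fold])
        scores[j] /= R                            # average over repetitions
    return candidates[int(np.argmin(scores))]

# Toy stand-ins (NOT the paper's estimator): shrink the training mean
# toward zero by eta; score by squared error on the held-out fold.
fit = lambda train, eta: (1 - eta) * train.mean()
cost = lambda est, fold: (est - fold.mean()) ** 2

data = rng.normal(loc=2.0, size=200)
eta_cv = select_eta(data, candidates=[0.0, 0.5, 0.9], fit=fit, cost=cost)
print(eta_cv)   # for data centered far from zero, little shrinkage wins
```

Swapping in the entropic projection estimator for `fit` and the held-out transport cost C^{(j)}_{r,k} for `cost` recovers Algorithm 1 exactly.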
Table 1. Summary of simulation results under the infinite support case. The four subcolumns under "RMSE" report the RMSE of the estimators of ρ, computed over 500 simulation repetitions. All RMSE values are scaled by a factor of 100. The last column, "Avg log_2 η_cv," reports the mean of log_2 η_cv across the 500 repetitions.

  ρ     n/m    m     ρ̃      ρ̂(η=2^0)  ρ̂(η=2^-6)  ρ̂(η=η_cv)   Avg log_2 η_cv
 0.00    5     50   14.19     8.71      14.34       13.00         -2.63
 0.00    5    100   10.02     6.13      10.01        9.05         -2.75
 0.00    5    200    6.89     4.22       6.93        6.19         -2.71
 0.00   10     50   14.19     8.69      14.35       12.96         -2.53
 0.00   10    100   10.02     6.11      10.00        9.18         -3.08
 0.00   10    200    6.89     4.21       6.93        6.17         -2.70
 0.00   25     50   14.19     8.66      14.29       13.02         -2.63
 0.00   25    100   10.02     6.11      10.00        9.06         -2.83
 0.00   25    200    6.89     4.21       6.94        6.19         -2.70
 0.25    5     50   14.46    12.88      13.47       13.77         -4.03
 0.25    5    100   10.28    11.76       9.46       10.04         -4.49
 0.25    5    200    7.11    10.49       6.58        6.83         -5.21
 0.25   10     50   14.46    12.79      13.35       13.71         -3.79
 0.25   10    100   10.28    11.68       9.40       10.01         -4.51
 0.25   10    200    7.11    10.47       6.53        6.82         -5.22
 0.25   25     50   14.46    12.78      13.41       13.72         -3.82
 0.25   25    100   10.28    11.67       9.35        9.88         -4.67
 0.25   25    200    7.11    10.47       6.54        6.78         -5.14
 0.75    5     50   17.25    31.27       9.09        9.15         -5.77
 0.75    5    100   12.61    30.63       6.52        6.53         -5.97
 0.75    5    200    8.72    29.77       4.32        4.32         -6.00
 0.75   10     50   17.25    31.05       8.00        8.09         -5.73
 0.75   10    100   12.61    30.36       5.66        5.66         -5.99
 0.75   10    200    8.72    29.74       3.91        3.91         -6.00
 0.75   25     50   17.25    30.89       7.36        7.44         -5.79
 0.75   25    100   12.61    30.41       5.22        5.23         -5.99
 0.75   25    200    8.72    29.73       3.61        3.61         -6.00

In Appendix B, we present the results for F̂(−0.25, −0.25), which exhibit patterns similar to those in Table 1. These simulation results indicate that the proposed method is consistent with the properties described in Corollary 3.12 and demonstrates practical merit when the entropic regularization parameter η is selected in a data-driven manner.

5.2. Finite Support Case. Next, we examined a setting with finite support. In this setting, we fixed the size of the decoupled dataset at n = 10^6 and varied the number of coupled observations over m ∈ {1000, 2000, 4000, 8000, 16000, 32000}.
We generated m + n i.i.d. observations of the discrete variables (Z_1, Z_2, Z_3) ∈ {1, 2} ⊗ {1, 2, 3, 4} ⊗ {1, 2, 3, 4, 5} according to the following joint distribution:

  Pr(Z_1 = z_1, Z_2 = z_2, Z_3 = z_3) =
    7/105   if z_1 = 2, z_2 = 4, z_3 ∈ {1, 2, 3, 4},
    42/105  if z_1 = 2, z_2 = 4, z_3 = 5,
    1/105   otherwise.

The coupled dataset was formed from the first m observations, while the decoupled dataset was generated by decoupling the remaining n observations. Given the finite support of the data, we estimated the coupling for the marginal data using the method described in Section 4. We employed a separable cost function defined as

  c((z_1, z_2, z_3), (z'_1, z'_2, z'_3)) = Σ_{j=1}^3 |z_j − z'_j|.

To assess the accuracy of the estimated coupling π̂, we estimated the conditional probability ψ_0 = Pr(Z_3 = 5 | Z_1 = 2, Z_2 = 4) = 0.6. If π̂ is close to the true coupling π_0, the resulting estimate ψ̂ should also be close to ψ_0. For comparison, we also computed the estimator

  ψ̃ = Σ_{i=1}^m I(Z_{i,1} = 2, Z_{i,2} = 4, Z_{i,3} = 5) / Σ_{i=1}^m I(Z_{i,1} = 2, Z_{i,2} = 4),

where (Z_{i,1}, Z_{i,2}, Z_{i,3}) denotes the i-th coupled observation. Note that ψ̃ is the maximum likelihood estimator of ψ_0 based on the coupled data and is a consistent estimator of ψ_0. For inference, we compare 95% confidence intervals for ψ_0 constructed using the following methods:

• CI(ψ̃, boot): a 95% confidence interval based on ψ̃ obtained via the nonparametric bootstrap.
• CI(ψ̂, boot): a 95% confidence interval based on ψ̂ obtained via the nonparametric bootstrap.
• CI(ψ̂, ss): a 95% confidence interval based on ψ̂ constructed using subsampling.
• CI(ψ̂, Liu): a 95% confidence interval based on ψ̂ constructed using the method of Liu et al. [67].
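As a sanity check on this design, the coupled-data MLE ψ̃ can be simulated directly. The snippet below resamples the joint distribution above and computes ψ̃; it is our own illustration of the benchmark estimator, not the paper's simulation code or the OT-based ψ̂.

```python
import numpy as np

rng = np.random.default_rng(2)

# Joint distribution of (Z1, Z2, Z3) from Section 5.2.
support = np.array([(z1, z2, z3) for z1 in (1, 2)
                    for z2 in (1, 2, 3, 4) for z3 in (1, 2, 3, 4, 5)])
probs = np.full(len(support), 1 / 105)
grp = (support[:, 0] == 2) & (support[:, 1] == 4)
probs[grp & (support[:, 2] != 5)] = 7 / 105    # 4 cells
probs[grp & (support[:, 2] == 5)] = 42 / 105   # 1 cell
assert np.isclose(probs.sum(), 1.0)            # 35 + 28 + 42 = 105 cells' mass

# Coupled-data MLE of psi0 = Pr(Z3 = 5 | Z1 = 2, Z2 = 4) = 42/70 = 0.6.
m = 4000
Z = support[rng.choice(len(support), size=m, p=probs)]
in_group = (Z[:, 0] == 2) & (Z[:, 1] == 4)
psi_tilde = (in_group & (Z[:, 2] == 5)).sum() / in_group.sum()
print(psi_tilde)   # consistent for psi0 = 0.6 as m grows
```

The OT-based ψ̂ would instead be read off the projected coupling π̂ fitted on both the coupled and decoupled samples.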
Table 2 reports the empirical bias and root mean squared error (RMSE) of ψ̃ and ψ̂, as well as the coverage probabilities and average lengths of the associated confidence intervals, based on 1000 repetitions. The estimator ψ̃ exhibits nearly negligible bias across all sample sizes, whereas ψ̂ shows a slightly larger, yet modest, finite-sample bias that diminishes as m increases. Despite this slight bias, ψ̂ demonstrates notable gains in efficiency, yielding a consistently lower RMSE than ψ̃ for all m.

Regarding inference, the bootstrap confidence interval for ψ̃ achieves empirical coverage rates very close to the nominal 95% level across all values of m. This is expected, as the bootstrap is a valid method for ψ̃, which is the maximum likelihood estimator of ψ_0 based on the coupled data. For ψ̂, the standard bootstrap interval tends to undercover (ranging around 88% to 90%). This aligns with our discussion in Section 4, where we noted that the bootstrap is generally an invalid inference method for our estimator. Conversely, the confidence intervals based on subsampling and the approach of Liu et al. [67] are valid, as they successfully attain the nominal coverage. Between these two, the subsampling confidence interval performs better, with empirical coverage approaching the nominal level as m increases, while the method of Liu et al. [67] tends to be conservative, consistently overcovering (around 99%). We remark that the lengths of these valid intervals for ψ̂ are shorter than CI(ψ̃, boot), indicating that our approach enables more efficient inference.
Overall, the simulation results demonstrate the promise of our approach: while ψ̃, which relies only on the coupled data, provides minimal bias and valid inference via the bootstrap, ψ̂, which further incorporates marginal data, offers more efficient inference for the underlying distribution.

Table 2. Summary of simulation results under the finite support case. All bias, RMSE, and CI length values are scaled by a factor of 1000. CI coverage is reported as a percentage.

 m        Bias           RMSE           CI Coverage (%)                        CI Length
(×10^3)   ψ̃     ψ̂      ψ̃     ψ̂      ψ̃,boot  ψ̂,boot  ψ̂,SS  ψ̂,Liu     ψ̃,boot  ψ̂,boot  ψ̂,SS  ψ̂,Liu
  1      -1.1  -6.2     19.1  12.1      94.7    88.4    91.5   98.7        74.4    40.5    39.0   66.2
  2      -0.8  -4.3     13.5   8.5      94.3    89.5    92.6   98.9        52.6    29.1    28.6   46.7
  4      -0.4  -2.9      9.3   5.9      96.1    90.1    94.1   99.5        37.2    20.8    20.3   33.0
  8      -0.2  -2.0      6.7   4.2      95.4    90.1    94.8   98.8        26.3    14.8    14.9   23.6
 16      -0.3  -1.5      4.8   2.9      94.9    89.4    96.5   99.3        18.6    10.5    10.7   16.9
 32      -0.1  -1.1      3.4   2.2      94.8    88.9    94.3   98.9        13.1     7.5     7.7   12.2

6. Real-world Application

We applied our proposed method to a real-world survey dataset. Consistent with Section 1, we focused on individuals in Illinois, using the U.S. Census Bureau's ACS 2018 datasets for both the coupled and marginal data. Specifically, the coupled dataset consisted of the ACS 2018 5-year PUMS with m ≈ 6.13 × 10^5, while the marginal dataset was drawn from the ACS 2018 5-year Summary File, representing the Illinois population with a total size of n ≈ 1.28 × 10^7.
For these datasets, we considered five discrete variables (Z_1, Z_2, Z_3, Z_4, Z_5) ∈ {0, 1} ⊗ {0, 1} ⊗ {0, 1} ⊗ {0, 1, 2, 3} ⊗ {0, 1}, defined as:

• Z_1 = I(Individual has health insurance)
• Z_2 = I(Age ∈ [25, 64], i.e., working age)
• Z_3 = I(Sex = Male)
• Z_4 = I(Race = White) + 2 I(Race = Black) + 3 I(Race = Asian)
• Z_5 = I(Education ≤ High school)

Using these variables, we examined the relationship between health insurance status and socioeconomic factors through ψ_{0,ijkl} = Pr(Z_1 = 0 | Z_2 = i, Z_3 = j, Z_4 = k, Z_5 = l), the conditional probability of not having health insurance given socioeconomic factors. Following the simulation study in Section 5, we compared two estimators: ψ̂_{ijkl}, obtained from our proposed method in Section 4, and ψ̃_{ijkl}, the maximum likelihood estimator based on the coupled data, and evaluated four 95% confidence intervals for inference.

Tables 3 and 4 present the estimation results for ψ_{0,00kl}, representing the conditional probabilities of not having health insurance (scaled as percentages) for females outside the working-age range in Illinois, stratified by race and educational attainment. Consistent with the simulation results in Section 5, incorporating the marginal data via our proposed estimator ψ̂ yields noticeable differences compared to the standard maximum likelihood estimator ψ̃ derived solely from the coupled data. While the overall values of the two estimators are similar, ψ̂ produces slightly higher estimates for most demographic groups (White, Black, and Others) and slightly lower estimates for the Asian demographic.

In terms of confidence intervals, ψ̂ generally delivers tighter confidence intervals under both bootstrap and subsampling procedures compared with bootstrap intervals based on ψ̃.
However, although the 95% bootstrap confidence intervals based on ψ̂ are substantially narrower than the alternatives, simulation results indicate that they may fail to attain the nominal 95% coverage level. Moreover, while the confidence intervals proposed by Liu et al. [67] are theoretically valid, they are typically wider than the bootstrap intervals based on ψ̃, resulting in less efficient inference. Consistent with the simulation findings, we therefore recommend subsampling confidence intervals for inference.

Comparing the lengths of the 95% subsampling confidence intervals based on ψ̂ with the bootstrap intervals based on ψ̃ highlights the efficiency gains achieved by incorporating marginal information. For example, for ψ_{0,0011} (the first row in Tables 3 and 4), the corresponding interval widths are 0.26 and 0.31, respectively. This implies that, when an investigator relies solely on coupled data and conducts inference using ψ̃, approximately (0.31/0.26)^2 ≈ 1.41 times more coupled data is required to achieve the level of precision attainable when incorporating marginal data through ψ̂. Thus, leveraging marginal data enables efficient statistical inference without requiring substantially more coupled data to achieve the same level of precision.

Overall, these results reinforce a key takeaway for real-world applications: while incorporating decoupled marginal data enables the construction of improved OT-based estimators, the choice of inference procedure remains critical for balancing efficiency with statistical validity.

Table 3. Summary of the real-world application for females outside the working-age group based on ψ̃. All values are scaled by a factor of 100. Numbers in square brackets indicate the lengths of the corresponding confidence intervals.

Race     Education ≤ HS?   ψ̃ Estimate (%)   95% Bootstrap CI
White    Yes               2.98              (2.83, 3.14) [0.31]
         No                3.72              (3.33, 4.13) [0.80]
Black    Yes               5.89              (4.96, 6.78) [1.82]
         No                6.40              (5.71, 7.11) [1.41]
Asian    Yes               1.77              (1.58, 1.98) [0.40]
         No                2.80              (2.12, 3.53) [1.40]
Others   Yes               5.04              (3.90, 6.21) [2.31]
         No                3.78              (1.93, 6.10) [4.17]

Table 4. Summary of the real-world application for females outside the working-age group based on ψ̂. All values are scaled by a factor of 100. Numbers in square brackets indicate the lengths of the corresponding confidence intervals.

Race     Education ≤ HS?   ψ̂ Estimate (%)   95% Bootstrap CI       95% Subsampling CI     95% CI of Liu et al. [67]
White    Yes               3.01              (2.88, 3.15) [0.27]    (2.85, 3.11) [0.26]    (2.84, 3.18) [0.34]
         No                3.80              (3.45, 4.16) [0.71]    (3.32, 4.08) [0.76]    (3.42, 4.19) [0.78]
Black    Yes               5.73              (4.90, 6.55) [1.65]    (4.55, 6.38) [1.83]    (4.79, 6.74) [1.96]
         No                6.73              (6.11, 7.40) [1.29]    (5.90, 7.26) [1.36]    (6.03, 7.45) [1.42]
Asian    Yes               1.80              (1.63, 1.98) [0.35]    (1.56, 1.92) [0.37]    (1.57, 2.03) [0.46]
         No                2.91              (2.29, 3.55) [1.25]    (1.84, 3.24) [1.40]    (2.21, 3.61) [1.41]
Others   Yes               4.91              (3.93, 5.90) [1.98]    (3.47, 5.51) [2.04]    (3.60, 5.96) [2.36]
         No                4.31              (3.04, 5.96) [2.92]    (1.03, 5.05) [4.02]    (2.73, 6.23) [3.51]

7. Conclusion

In this work, we study the extension of coupling via the optimal transport projection framework. Our proposed approach provides the most geometrically optimal coupling among all possible ones without any assumptions, which guarantees full generality for any type of data.
In contrast to previous literature, which does not establish an intuitive interpretation, the proposed estimator admits a very natural geometric interpretation as a projection over the Wasserstein space, which allows us to verify consistency, stability, sample complexity, limit distributions, and confidence sets. Furthermore, it enjoys nice computational benefits through the shadow and entropic regularization, by which the corresponding optimization is solvable in almost linear time. We also provide several real and synthetic data experiments that empirically support the strength of the proposed method. A practically important finding is that subsampling confidence intervals based on the OT estimator π̂ are narrower than bootstrap confidence intervals based on the coupled-data-only estimator, as demonstrated in both the simulation and the ACS application. This means that, by incorporating freely available marginal data through our framework, one can achieve more efficient inference about the joint distribution without collecting additional coupled data. We believe that our work opens a new direction for recovering a true coupling distribution from a fully nonparametric perspective.

There are many new follow-up questions of interest to both theorists and practitioners. Let us pose three. The first concerns the relation of structural assumptions on π_0 to the estimation. Although our framework works for general π_0, specialists have their own a priori knowledge about it which is informative for inference. Thus, it is desirable to exploit this a priori knowledge to improve the accuracy of the proposed estimator. For instance, if π_0 is generated by some map T which is smooth and monotone, in other words, if strong correlation exists among the variables, is there a better way to incorporate this into the method?
In con trast, if π 0 is close to a product measure, is just the pro duct of empirical measures optimal? The second is to extend the limit distribution theory to con tinuous settings. The limit distribution w e develop is v alid only for finite supp ort cases. There is a fundamen tal bottleneck for this limitation, whic h is related to the limit distribution of W asserstein distances of empirical distributions. That for dimensions greater than or equal to 3 is unknown due to the failure of classical empirical process theory to apply to it. Dev eloping limit distributions b ey ond finite settings is imp ortan t 20 JAKW ANG KIM*, YOUNG-HEON KIM, AND CHAN P ARK* for statistical inference, yet adequate mathematical to ols for studying them still app ear to b e lacking. The last question is ab out a data-driven c hoice of the entropic parameter dis- cussed in Section 5 . Our pro cedure is a v ariant of cross-v alidation, and under cer- tain regularity conditions, cross-v alidation is known to b e consisten t for selecting the (near) optimal regularization parameter [ 98 , 103 ]. While our approach show ed strong p erformance in sim ulations, it remains an interesting question whether it can b e formally justified from a theoretical standp oint. References [1] Martial Agueh and Guillaume Carlier. Barycenters in the Wasserstein space. SIAM J. Math. Anal. , 43(2):904–924, 2011. [2] Olanrew a ju Ak ande, Gabriel Madson, D Sunshine Hillygus, and Jerome P Reiter. Leveraging auxiliary information on marginal distributions in nonig- norable mo dels for item and unit nonresp onse. Journal of the R oyal Statistic al So ciety Series A: Statistics in So ciety , 184(2):643–662, 2021. [3] Jason Altsch uler, Jonathan Niles-W eed, and Philipp e Rigollet. Near-linear time approximation algorithms for optimal transp ort via sinkhorn iteration. A dvanc es in neur al information pr o c essing systems , 30, 2017. [4] Jason M. Altsch uler and Enric Boix-Adser` a. 
W asserstein barycenters are NP-hard to compute. SIAM Journal on Mathematics of Data Scienc e , 4(1): 179–203, 2022. [5] Donald WK Andrews. Inconsistency of the b o otstrap when a parameter is on the b oundary of the parameter space. Ec onometric a , pages 399–405, 2000. [6] Erhan Bayraktar, Stephan Eckstein, and Xin Zhang. Stabilit y and sample complexit y of div ergence regularized optimal transp ort. Bernoul li , 31(1):213– 239, 2025. [7] Jean-Da vid Benamou and Y ann Brenier. A computational fluid mechanics solution to the Monge-Kantoro vich mass transfer problem. Numer. Math. , 84 (3):375–393, 2000. [8] Jean-Da vid Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyr ´ e. Iterative bregman pro jections for regularized transp ortation problems. SIAM Journal on Scientific Computing , 37(2):A1111–A1138, 2015. [9] Esp en Bernton, Promit Ghosal, and Marcel Nutz. Entropic optimal transp ort: geometry and large deviations. Duke Math. J. , 171(16):3363–3400, 2022. [10] Emman uel Boissard. Simple b ounds for conv ergence of empirical and o ccu- pation measures in 1-Wasserstein distance. Ele ctr on. J. Pr ob ab. , 16:no. 83, 2296–2333, 2011. [11] Emman uel Boissard and Thibaut Le Gouic. On the mean sp eed of conv ergence of empirical and o ccupation measures in Wasserstein distance. Ann. Inst. Henri Poinc ar´ e Pr ob ab. Stat. , 50(2):539–563, 2014. [12] F ran c cois Bolley , Arnaud Guillin, and C´ edric Villani. Quantitativ e concen- tration inequalities for empirical measures on non-compact spaces. Pr ob ab. The ory R elate d Fields , 137(3-4):541–593, 2007. [13] Stephen Boyd and Liev en V andenberghe. Convex optimization . Cambridge Univ ersit y Press, Cambridge, 2004. ISBN 0-521-83378-7. [14] Y ann Brenier. Polar factorization and monotone rearrangement of vector- v alued functions. Comm. Pur e Appl. Math. , 44(4):375–417, 1991. EXTENSION OF COUPLING VIA OT 21 [15] Giusepp e Buttazzo, Luigi De P ascale, and P aola Gori-Giorgi. 
Optimal- transp ort formulation of electronic densit y-functional theory . Physic al R eview A , 85(6):062502, 2012. [16] Luis Caffarelli and Alessio Figalli. Regularit y of solutions to the parab olic fractional obstacle problem. J. R eine Angew. Math. , 680:191–233, 2013. [17] Luis A. Caffarelli. Monotonicity prop erties of optimal transp ortation and the FK G and related inequalities. Comm. Math. Phys. , 214(3):547–563, 2000. [18] Guillaume Carlier and Iv ar Ekeland. Matching for teams. Ec onomic The ory , 42:397–418, 2010. [19] Guillaume Carlier, Adam Ob erman, and Edouard Oudet. Numerical methods for matc hing for teams and Wasserstein barycen ters. ESAIM Math. Mo del. Numer. Anal. , 49(6):1621–1642, 2015. [20] Guillaume Carlier, Vincent Duv al, Gabriel Peyr ´ e, and Bernhard Schmitzer. Con v ergence of entropic schemes for optimal transp ort and gradient flows. SIAM J. Math. Anal. , 49(2):1385–1418, 2017. [21] J. A. Carrillo, M. DiF rancesco, A. Figalli, T. Lauren t, and D. Slepˇ cev. Global-in-time weak measure solutions and finite-time aggregation for nonlo- cal interaction equations. Duke Math. J. , 156(2):229–271, 2011. [22] Xiaohong Chen, Han Hong, and Alessandro T arozzi. Semiparametric e ffi- ciency in gmm models with auxiliary data. The Annals of Statistics , 36(2): 808–843, 2008. [23] Pierre-Andr ´ e Chiapp ori, Robert J. McCann, and Lars P . Nesheim. Hedonic price equilibria, stable matching, and optimal transp ort: equiv alence, top ol- ogy , and uniqueness. Ec onom. The ory , 42(2):317–354, 2010. [24] Lenaic Chizat and F rancis Bach. On the global conv ergence of gradient de- scen t for ov er-parameterized mo dels using optimal transp ort. A dvanc es in neur al information pr o c essing systems , 31, 2018. [25] Maria Colombo, Luigi De Pascale, and Simone Di Marino. Multimarginal optimal transport maps for one-dimensional repulsive costs. Canad. J. Math. , 67(2):350–368, 2015. [26] Co dina Cotar, Gero F riesec ke, and Claudia Kl ¨ upp elb erg. 
Density functional theory and optimal transportation with Coulomb cost. Comm. Pure Appl. Math., 66(4):548–599, 2013.
[27] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In C. J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.
[28] Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In International Conference on Machine Learning, pages 685–693. PMLR, 2014.
[29] Christoph Dalitz and Felix Lögler. moonboot: An R package implementing m-out-of-n bootstrap methods. The R Journal, 17(3):125–137, October 2025.
[30] Christoph Dalitz and Felix Lögler. moonboot: An R package implementing m-out-of-n bootstrap methods. The R Journal, 17:125–137, 2025. https://github.com/cdalitz/moonboot/.
[31] Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion Schrödinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34:17695–17709, 2021.
[32] Julie Delon and Agnès Desolneux. A Wasserstein-type distance in the space of Gaussian mixture models. SIAM Journal on Imaging Sciences, 13(2):936–970, 2020.
[33] Steffen Dereich, Michael Scheutzow, and Reik Schottstedt. Constructive quantization: approximation by empirical measures. Ann. Inst. Henri Poincaré Probab. Stat., 49(4):1183–1203, 2013.
[34] Jean-Claude Deville and Carl-Erik Särndal. Calibration estimators in survey sampling. Journal of the American Statistical Association, 87(418):376–382, 1992.
[35] Michael Ziyang Diao, Krishna Balasubramanian, Sinho Chewi, and Adil Salim. Forward-backward Gaussian variational inference via JKO in the Bures-Wasserstein space. In International Conference on Machine Learning, pages 7960–7991.
PMLR, 2023.
[36] R. M. Dudley. The speed of mean Glivenko-Cantelli convergence. The Annals of Mathematical Statistics, 40(1):40–50, 1969.
[37] Stephan Eckstein and Marcel Nutz. Quantitative stability of regularized optimal transport and convergence of Sinkhorn's algorithm. SIAM Journal on Mathematical Analysis, 54(6):5922–5948, 2022.
[38] Bradley Efron and Robert J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall/CRC, 1994.
[39] Ivar Ekeland. An optimal matching problem. ESAIM Control Optim. Calc. Var., 11(1):57–71, 2005.
[40] Zheng Fang and Andres Santos. Inference on directionally differentiable functions. The Review of Economic Studies, 86(1):377–412, 2019.
[41] A. Figalli, F. Maggi, and A. Pratelli. A mass transportation approach to quantitative isoperimetric inequalities. Invent. Math., 182(1):167–211, 2010.
[42] A. Figalli, N. Fusco, F. Maggi, V. Millot, and M. Morini. Isoperimetry and stability properties of balls with respect to nonlocal energies. Comm. Math. Phys., 336(1):441–507, 2015.
[43] Nicolas Fournier and Arnaud Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probab. Theory Related Fields, 162(3-4):707–738, 2015.
[44] Wilfrid Gangbo and Andrzej Święch. Optimal maps for the multidimensional Monge-Kantorovich problem. Communications on Pure and Applied Mathematics, 51:23–45, 1998.
[45] Nicolás García Trillos and Ryan W. Murray. Adversarial classification: Necessary conditions and geometric flows. Journal of Machine Learning Research, 23(187):1–38, 2022.
[46] Nicolás García Trillos, Matt Jacobs, and Jakwang Kim. On the existence of solutions to adversarial training in multiclass classification. European Journal of Applied Mathematics, pages 1–21, 2024.
[47] Nicolás García Trillos, Matt Jacobs, Jakwang Kim, and Matthew Werenski.
An optimal transport approach for computing adversarial training lower bounds in multiclass classification. J. Mach. Learn. Res., 25(1), January 2024.
[48] Promit Ghosal, Marcel Nutz, and Espen Bernton. Stability of entropic optimal transport and Schrödinger bridges. Journal of Functional Analysis, 283(9):109622, 2022.
[49] Nathael Gozlan, Cyril Roberto, Paul-Marie Samson, and Prasad Tetali. Kantorovich duality for general transport costs and applications. Journal of Functional Analysis, 273(11):3327–3405, 2017.
[50] Mohamed Hamdouche, Pierre Henry-Labordère, and Huyên Pham. Generative modeling for time series via Schrödinger bridge. arXiv preprint arXiv:2304.05093, 2023.
[51] Lars Peter Hansen. Large sample properties of generalized method of moments estimators. Econometrica, 50(4):1029–1054, 1982.
[52] Judith K. Hellerstein and Guido W. Imbens. Imposing moment restrictions from auxiliary data by weighting. Review of Economics and Statistics, 81(1):1–14, 1999.
[53] Jan-Christian Hütter and Philippe Rigollet. Minimax estimation of smooth optimal transport maps. Ann. Statist., 49(2):1166–1194, 2021.
[54] Guido W. Imbens and Tony Lancaster. Combining micro and macro data in microeconometric models. The Review of Economic Studies, 61(4):655–680, 1994.
[55] Leonid V. Kantorovich. On the translocation of masses. In Dokl. Akad. Nauk. USSR (NS), volume 37, pages 199–201, 1942.
[56] Leonid V. Kantorovich. On a problem of Monge. In CR (Doklady) Acad. Sci. URSS (NS), volume 3, pages 225–226, 1948.
[57] Jae Kwang Kim and Mingue Park. Calibration estimation in survey sampling. International Statistical Review, 78(1):21–39, 2010.
[58] Young-Heon Kim and Robert J. McCann. Continuity, curvature, and the general covariance of optimal transportation. J. Eur. Math. Soc. (JEMS), 12(4):1009–1040, 2010.
[59] Young-Heon Kim and Brendan Pass.
Multi-marginal optimal transport on Riemannian manifolds. American Journal of Mathematics, 137:1045–1060, 2013.
[60] Young-Heon Kim and Brendan Pass. A general condition for Monge solutions in the multi-marginal optimal transport problem. SIAM Journal on Mathematical Analysis, 46(2):1538–1550, 2014.
[61] Jun Kitagawa and Brendan Pass. The multi-marginal optimal partial transport problem. Forum Math. Sigma, 3:Paper No. e17, 28, 2015.
[62] Marcel Klatt, Axel Munk, and Yoav Zemel. Limit laws for empirical optimal solutions in random linear programs. Ann. Oper. Res., 315(1):251–278, 2022.
[63] Marc Lambert, Sinho Chewi, Francis Bach, Silvère Bonnabel, and Philippe Rigollet. Variational inference via Wasserstein gradient flows. Advances in Neural Information Processing Systems, 35:14434–14447, 2022.
[64] Michel Ledoux and Jie-Xiang Zhu. On optimal matching of Gaussian samples III. Probab. Math. Statist., 41(2):237–265, 2021.
[65] Christian Léonard. From the Schrödinger problem to the Monge-Kantorovich problem. J. Funct. Anal., 262(4):1879–1920, 2012.
[66] Tianyi Lin, Nhat Ho, Marco Cuturi, and Michael I. Jordan. On the complexity of approximating multimarginal optimal transport. Journal of Machine Learning Research, 23(65):1–43, 2022.
[67] Shuyu Liu, Florentina Bunea, and Jonathan Niles-Weed. Asymptotic confidence sets for random linear programs. In The Thirty Sixth Annual Conference on Learning Theory, pages 3919–3940. PMLR, 2023.
[68] John Lott and Cédric Villani. Ricci curvature for metric-measure spaces via optimal transport. Annals of Mathematics, 169:903–991, 2004.
[69] Robert John McCann. A Convexity Theory for Interacting Gases and Equilibrium Crystals. Princeton University, 1994.
[70] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks.
Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.
[71] Christian B. Mendl and Lin Lin. Kantorovich dual solution for strictly correlated electrons in atoms and molecules. Physical Review B, 87(12):125106, 2013.
[72] Quentin Mérigot, Filippo Santambrogio, and Clément Sarrazin. Non-asymptotic convergence bounds for Wasserstein approximation using point clouds. Advances in Neural Information Processing Systems, 34:12810–12821, 2021.
[73] Gaspard Monge. Mémoire sur la théorie des déblais et des remblais. Mem. Math. Phys. Acad. Royale Sci., pages 666–704, 1781.
[74] Giorgio E. Montanari and M. Giovanna Ranalli. Nonparametric model calibration estimation in survey sampling. Journal of the American Statistical Association, 100(472):1429–1442, 2005.
[75] Aviv Nevo. Using weights to adjust for sample selection when auxiliary information is available. Journal of Business & Economic Statistics, 21(1):43–52, 2003.
[76] Jonathan Niles-Weed and Francis Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli, 25(4A):2620–2648, 2019.
[77] Jonathan Niles-Weed and Quentin Berthet. Minimax estimation of smooth densities in Wasserstein distance. Ann. Statist., 50(3):1519–1540, 2022.
[78] Marcel Nutz. Introduction to Entropic Optimal Transport, volume 129. 2022.
[79] Marcel Nutz and Johannes Wiesel. Entropic optimal transport: convergence of potentials. Probab. Theory Related Fields, 184(1-2):401–424, 2022.
[80] Felix Otto. The geometry of dissipative evolution equations: the porous medium equation. Comm. Partial Differential Equations, 26:101–174, 2001.
[81] Felix Otto and Cédric Villani. Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. Journal of Functional Analysis, 173:361–400, 2000.
[82] Brendan Pass.
Multi-marginal optimal transport: theory and applications. ESAIM Math. Model. Numer. Anal., 49(6):1771–1790, 2015.
[83] Brendan Pass and Yair Shenfeld. A dynamical formulation of multi-marginal optimal transport, 2025.
[84] Brendan Pass and Adolfo Vargas-Jiménez. A general framework for multi-marginal optimal transport. Math. Program., 208(1-2):75–110, 2024.
[85] Dimitris N. Politis and Joseph P. Romano. Large sample confidence regions based on subsamples under minimal assumptions. The Annals of Statistics, pages 2031–2050, 1994.
[86] D. N. Politis, J. P. Romano, and M. Wolf. Subsampling. Springer Series in Statistics. Springer New York, 1999. ISBN 9780387988542.
[87] Muni Sreenivas Pydi and Varun Jog. The many faces of adversarial risk. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021.
[88] Xiao Qian, Utkarsh Gangwal, Shangjia Dong, and Rachel Davidson. A deep generative framework for joint households and individuals population synthesis. Applied Soft Computing, page 114375, 2025.
[89] G. M. Rotskoff and E. Vanden-Eijnden. Trainability and accuracy of artificial neural networks: an interacting particle system approach. Comm. Pure Appl. Math., 75(9):1889–1935, 2022.
[90] Mauricio Sadinle and Jerome P. Reiter. Sequentially additive nonignorable missing data modelling using auxiliary marginal information. Biometrika, 106(4):889–911, 2019.
[91] Abdoul Razac Sané, Rachid Belaroussi, Pierre Hankach, and Pierre-Olivier Vandanjon. Population synthesis with deep generative model: A joint household-individual approach. Computational Urban Science, 5(1):34, 2025.
[92] Michael Seidl, Paola Gori-Giorgi, and Andreas Savin. Strictly correlated electrons in density-functional theory: A general formulation with applications to spherical densities. Physical Review A, 75(4):042511, 2007.
[93] Jun Shao. Bootstrap sample size in nonregular cases.
Proceedings of the American Mathematical Society, 122(4):1251–1262, 1994.
[94] Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks: A law of large numbers. SIAM Journal on Applied Mathematics, 80(2):725–752, 2020.
[95] Max Sommerfeld and Axel Munk. Inference for empirical Wasserstein distances on finite spaces. Journal of the Royal Statistical Society Series B: Statistical Methodology, 80(1):219–238, 2017.
[96] Sanvesh Srivastava, Cheng Li, and David B. Dunson. Scalable Bayes via barycenter in Wasserstein space. The Journal of Machine Learning Research, 19(1):312–346, 2018.
[97] Nicolás García Trillos, Jakwang Kim, and Matt Jacobs. The multimarginal optimal transport formulation of adversarial multiclass classification. J. Mach. Learn. Res., 24(1), March 2024.
[98] Aad W. van der Vaart, Sandrine Dudoit, and Mark J. van der Laan. Oracle inequalities for multi-fold cross validation. Statistics & Decisions, 24(3):351–371, 2006.
[99] Cédric Villani. Optimal Transport, volume 338 of Grundlehren der mathematischen Wissenschaften. Springer-Verlag, Berlin, 2009. ISBN 978-3-540-71049-3. Old and new.
[100] Gefei Wang, Yuling Jiao, Qian Xu, Yang Wang, and Can Yang. Deep generative learning via Schrödinger bridge. In International Conference on Machine Learning, pages 10794–10804. PMLR, 2021.
[101] Changbao Wu. Optimal calibration estimators in survey sampling. Biometrika, 90(4):937–951, 2003.
[102] Changbao Wu and Randy R. Sitter. A model-calibration approach to using complete auxiliary information from survey data. Journal of the American Statistical Association, 96(453):185–193, 2001.
[103] Yuhong Yang. Consistency of cross validation for comparing regression procedures. The Annals of Statistics, 35(6):2450–2473, 2007.

Appendices
A. Appendix: Optimal transport projection framework

The problem (1.3) can be generalized as follows. Let $\mu^i \in \mathcal{P}(\mathcal{X}_i)$ for $i = 1, \dots, K$ and $\pi_0 \in \mathcal{P}(\mathcal{Z})$. Assume that $c : \mathcal{X}_1 \times \cdots \times \mathcal{X}_K \times \mathcal{Z} \to \mathbb{R}$ is continuous. The problem of interest is

(A.1)  $\min_{\pi \in \Pi(\boldsymbol{\mu})} \min_{\gamma \in \Pi(\pi, \pi_0)} \int c(\mathbf{x}, \mathbf{z}) \, d\gamma(\mathbf{x}, \mathbf{z})$.

This is the optimal transport projection of $\pi_0$ onto $\Pi(\boldsymbol{\mu})$. This projection problem can be equivalently formulated as the following MOT problem:

(A.2)  $\min_{\gamma \in \Pi(\boldsymbol{\mu}, \pi_0)} \int c \, d\gamma$,

where the marginal constraints are given as $(P_{x_i})_\#(\gamma) = \mu^i$ for $i = 1, \dots, K$ and $(P_z)_\#(\gamma) = \pi_0$. Under the assumption on $c$ above, (A.2) has a solution [99, Theorem 4.1]. Let us denote an optimal multimarginal coupling by $\gamma^*$. Then a solution for (A.1), denoted by $\pi^*$, is attained from $\gamma^*$ by $\pi^* := (P_{\mathbf{x}})_\#(\gamma^*)$, where $P_{\mathbf{x}}$ is the projection onto $\mathcal{X}_1 \times \cdots \times \mathcal{X}_K$.

The first theorem concerns the stability of (A.2), and hence of (A.1), with respect to the input measures. The convergence result is well known in the literature: see [99, Theorem 5.20].

Theorem A.1. Let $m = m(t), n = n(t) \to \infty$ as $t \to \infty$. Assume that $\pi_{0m} \to \pi_0$ and $\mu^i_n \to \mu^i$ weakly for $1 \le i \le K$, and that
$\liminf_{t \to \infty} \min_{\gamma \in \Pi(\mu^1_n, \dots, \mu^K_n, \pi_{0m})} \int c \, d\gamma < \infty$.
If $\{\gamma^*_t\}$ is the sequence of minimizers for (A.2) with $\mu^1_{n(t)}, \dots, \mu^K_{n(t)}, \pi_{0m(t)}$, then there is a subsequence that converges to $\gamma^*$, and $\gamma^*$ is a minimizer for (A.2) with $\mu^1, \dots, \mu^K, \pi_0$. Furthermore, there is a subsequence of $\{\pi^*_t := (P_{\mathbf{x}})_\#(\gamma^*_t)\}$ that converges to some $\pi^*$, and $\pi^*$ is a solution for (A.1) with $\mu^1, \dots, \mu^K, \pi_0$.

For separable cost functions, written as a sum of cost functions of pairwise coordinates, one can obtain $\gamma^*$ and $\pi^*$ more explicitly via the shadow proposed by Eckstein and Nutz [37]. Note that there may be other optimal solutions not obtained by shadow.

Theorem A.2 (Separable case).
Assume that $\mathcal{Z} = \mathcal{Z}_1 \times \cdots \times \mathcal{Z}_K$ and the cost function is separable in the sense that
$c(\mathbf{x}, \mathbf{z}) = \sum_{i=1}^K c_i(x_i, z_i)$
for $c_i : \mathcal{X}_i \times \mathcal{Z}_i \to \mathbb{R}$. For each $i = 1, \dots, K$, let $\nu^i := (P_{z_i})_\#(\pi_0)$, where $P_{z_i}$ is the projection from $\mathcal{Z}$ to $\mathcal{Z}_i$. Then

(A.3)  $\min_{\gamma \in \Pi(\boldsymbol{\mu}, \pi_0)} \int c \, d\gamma = \sum_{i=1}^K \min_{\gamma^i \in \Pi(\mu^i, \nu^i)} \int c_i \, d\gamma^i$.

Let $\gamma^{i,*}$ be an optimal coupling for $\min_{\gamma^i \in \Pi(\mu^i, \nu^i)} \int c_i \, d\gamma^i$, which can be written as $\gamma^{i,*}(dx_i, dz_i) = \nu^i(dz_i)\,\kappa^i(dx_i \mid z_i)$ by the disintegration formula. Then $\gamma^*(d\mathbf{x}, d\mathbf{z}) = \pi_0(d\mathbf{z})\,\kappa(d\mathbf{x} \mid \mathbf{z})$ is optimal, where $\kappa(d\mathbf{x} \mid \mathbf{z}) := \kappa^1(dx_1 \mid z_1) \otimes \cdots \otimes \kappa^K(dx_K \mid z_K)$. In particular, an optimal $\pi^*$ for (A.1) is also obtained by $\pi^*(d\mathbf{x}) = \int_{\mathcal{Z}} \pi_0(d\mathbf{z})\,\kappa(d\mathbf{x} \mid \mathbf{z})$.

Recall the problem setting. Let $\pi_0 \in \Pi(\boldsymbol{\mu})$ be the true coupling generating the data. We denote by $\{\mathbf{Z}_j = (Z_{1j}, \dots, Z_{Kj}) : j = 1, \dots, m\}$ the set of coupled data i.i.d. from $\pi_0$, and by $\{X_{1j} : j = 1, \dots, n\}, \dots, \{X_{Kj} : j = 1, \dots, n\}$ the sets of marginal data. We use the empirical measures
$\pi_{0m} := \frac{1}{m}\sum_{j=1}^m \delta_{\mathbf{Z}_j}$, $\mu^i_n := \frac{1}{n}\sum_{j=1}^n \delta_{X_{ij}}$ for $i = 1, \dots, K$.
The problem is to reconstruct a coupling consistent with the marginal observations: it should lie in $\Pi(\boldsymbol{\mu}_n)$ and be as close to $\pi_{0m}$ as possible. Using the optimal transport projection framework, or the equivalent multimarginal optimal transport formulation, one can construct a coupling over the $\mu^i_n$'s as follows: first compute $\hat{\gamma}$, a solution for
$\min_{\gamma \in \Pi(\boldsymbol{\mu}_n, \pi_{0m})} \int c \, d\gamma$,
and then obtain the estimated coupling as the pushforward $\hat{\pi} = (P_{\mathbf{x}})_\#(\hat{\gamma})$.
As a corollary of Theorem A.1, we can derive the consistency of the estimator $\hat{\pi}$.

Corollary A.3 (Consistency).
Assume that $\mathcal{Z} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_K$, $c \in L^1((\otimes \boldsymbol{\mu}_n) \otimes \pi_0)$, $c$ is nonnegative with $c(\mathbf{x}, \mathbf{x}) = 0$ for all $\mathbf{x} \in \mathcal{Z}$, and $\pi_0 \in \Pi(\boldsymbol{\mu})$. Let $m = m(t), n = n(t) \to \infty$ as $t \to \infty$, and let $\hat{\pi}_t$ be a solution for (A.1) with empirical input. Then there is a subsequence of $\{\hat{\pi}_t\}$ that converges to $\pi_0$.

A.1. Entropic optimal transport. The entropic regularization of (A.1) with a general cost $c$ is defined as

(A.4)  $\min_{\pi \in \Pi(\boldsymbol{\mu})} \min_{\gamma \in \Pi(\pi, \pi_0)} \int c(\mathbf{x}, \mathbf{z}) \, d\gamma(\mathbf{x}, \mathbf{z}) + \eta \, \mathrm{KL}\big(\gamma \,\|\, (\otimes \boldsymbol{\mu}) \otimes \pi_0\big)$.

It is equivalent to

(A.5)  $\min_{\gamma \in \Pi(\boldsymbol{\mu}, \pi_0)} \int c \, d\gamma + \eta \, \mathrm{KL}\big(\gamma \,\|\, (\otimes \boldsymbol{\mu}) \otimes \pi_0\big)$,

which is also regarded as the entropic regularization of (A.2). Due to the strong convexity of the KL divergence, there is a unique optimal coupling $\gamma^\eta_\star$ for (A.5). This leads to a unique solution for (A.4), denoted by $\pi^\eta_\star$. Note that since $(P_{\mathbf{x}})_\# \gamma^\eta_\star \ll (\otimes \boldsymbol{\mu})$, $\pi^\eta_\star$ is absolutely continuous with respect to $(\otimes \boldsymbol{\mu})$.

For the separable cost case, i.e., $c(\mathbf{x}, \mathbf{z}) = \sum_i c_i(x_i, z_i)$, one can utilize the entropic shadow approach. Given a coupling $\gamma \in \Pi(\boldsymbol{\mu}, \pi_0)$, let $\gamma^i(dx_i, dz_i) := (P_{x_i}, P_{z_i})_\#(\gamma)$ be its marginal coupling between $\mu^i$ and $\nu^i = (P_{z_i})_\#(\pi_0)$. Consider

(A.6)  $\min_{\gamma \in \Pi(\boldsymbol{\mu}, \pi_0)} \sum_{i=1}^K \int c_i(x_i, z_i) \, \gamma(d\mathbf{x}, d\mathbf{z}) + \eta \, \mathrm{KL}(\gamma^i \,\|\, \mu^i \otimes \nu^i)$.

For $i = 1, \dots, K$, let $\gamma^{i,\eta} \in \Pi(\mu^i, \nu^i)$ be optimal for
$\min_{\gamma^i \in \Pi(\mu^i, \nu^i)} \int c_i \, d\gamma^i + \eta \, \mathrm{KL}(\gamma^i \,\|\, \mu^i \otimes \nu^i)$,
and let $\kappa^{i,\eta}(dx_i \mid z_i)\,\nu^i(dz_i) = \gamma^{i,\eta}(dx_i, dz_i)$ by the disintegration formula. The MOT entropic coupling for (A.6), denoted by $\gamma^\eta_\dagger$, is constructed by
$\gamma^\eta_\dagger(d\mathbf{x}, d\mathbf{z}) = \pi_0(d\mathbf{z})\,\kappa^\eta(d\mathbf{x} \mid \mathbf{z})$,
where $\kappa^\eta = \kappa^{1,\eta}(dx_1 \mid z_1) \otimes \cdots \otimes \kappa^{K,\eta}(dx_K \mid z_K)$. Then, similarly, the projection solution is obtained by $\pi^\eta_\dagger := (P_{\mathbf{x}})_\# \gamma^\eta_\dagger$. As mentioned, due to the entropic bias, neither estimator $\pi^\eta_\star$ nor $\pi^\eta_\dagger$ is necessarily the same as $\pi^*$.
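To make the shadow construction of Theorem A.2 concrete: for one-dimensional coordinates with quadratic cost and equal-size uniform empirical measures, each pairwise optimal coupling $\gamma^{i,*}$ is the monotone (sorted) matching, so the kernels $\kappa^i$ are deterministic maps and $\hat{\pi}$ is obtained by pushing each coupled point through them. The sketch below is an illustration under these assumptions only (the helper names are ours, not from the paper), for $K = 2$:

```python
# Shadow construction for a separable quadratic cost (K = 2, 1-D coordinates).
# Assumption (for this sketch only): the coupled sample Z and the marginal
# samples X1, X2 all have the same size, so each pairwise optimal coupling
# is the monotone (sorted) matching and kappa^i is a map.

def monotone_map(source, target):
    """Optimal map for 1-D quadratic cost between two equal-size uniform
    empirical measures: match points by rank."""
    order = sorted(range(len(source)), key=lambda j: source[j])
    tgt = sorted(target)
    T = [0.0] * len(source)
    for rank, j in enumerate(order):
        T[j] = tgt[rank]
    return T

def shadow_projection(Z, X1, X2):
    """Push each coupled point (z1, z2) through the pairwise optimal maps
    kappa^1, kappa^2; the resulting point cloud is the projection pi-hat."""
    z1 = [z[0] for z in Z]
    z2 = [z[1] for z in Z]
    T1 = monotone_map(z1, X1)  # kappa^1 : nu^1 -> mu^1
    T2 = monotone_map(z2, X2)  # kappa^2 : nu^2 -> mu^2
    return [(T1[j], T2[j]) for j in range(len(Z))]

# Toy data: a comonotone coupled sample plus two marginal samples.
Z = [(0.1, 0.2), (0.5, 0.6), (0.9, 1.0)]
X1 = [0.0, 0.4, 1.0]
X2 = [0.3, 0.5, 1.1]
pi_hat = shadow_projection(Z, X1, X2)
print(pi_hat)  # [(0.0, 0.3), (0.4, 0.5), (1.0, 1.1)]
```

The entropic shadow replaces each monotone matching by a Sinkhorn coupling $\gamma^{i,\eta}$, so that each $z_i$ is sent to a mixture over the support of $\mu^i_n$ rather than to a single point.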
It is known that $\pi^\eta_\star$ and $\pi^\eta_\dagger$ converge to $\pi^*$ as $\eta \to 0$ in the sense of $\Gamma$-convergence under mild conditions.

Definition A.4. Given a sequence of functionals $\{F_n\}$ over a topological space $X$, the $\Gamma$-lower limit $\Gamma\text{-}\liminf_{n \to \infty} F_n : X \to \mathbb{R} \cup \{\pm\infty\}$ and the $\Gamma$-upper limit $\Gamma\text{-}\limsup_{n \to \infty} F_n : X \to \mathbb{R} \cup \{\pm\infty\}$ are defined for every $x \in X$ by
$\big(\Gamma\text{-}\liminf_{n \to \infty} F_n\big)(x) := \inf\big\{ \liminf_{n \to \infty} F_n(x_n) : x_n \to x \big\}$,
$\big(\Gamma\text{-}\limsup_{n \to \infty} F_n\big)(x) := \inf\big\{ \limsup_{n \to \infty} F_n(x_n) : x_n \to x \big\}$.
The $\Gamma$-limit is $F : X \to \mathbb{R} \cup \{\pm\infty\}$ if for every $x \in X$,
$F(x) = \big(\Gamma\text{-}\lim_{n \to \infty} F_n\big)(x) = \big(\Gamma\text{-}\liminf_{n \to \infty} F_n\big)(x) = \big(\Gamma\text{-}\limsup_{n \to \infty} F_n\big)(x)$.

Proposition A.5 ($\Gamma$-convergence). Assume that $c$ is either nonnegative and lower semicontinuous, or continuous with $c \in L^1((\otimes \boldsymbol{\mu}) \otimes \pi_0)$. Then, as $\eta \to 0$, (A.5) and (A.6) $\Gamma$-converge to (A.2) and (A.3), respectively. Furthermore, there is a subsequence of $\pi^\eta_\star$ which converges to a solution for (A.1). Similarly, there is a subsequence of $\pi^\eta_\dagger$ which converges to a solution for (A.3).

A.2. Limit distribution for the finite support case. Recall the linear program of interest

(A.7)  $\min_{\gamma \in \mathbb{R}^{S_*}} \langle c, \gamma \rangle$, s.t. $A\gamma = b_{n,m}$, $\gamma \ge 0$,

where $c \in \mathbb{R}^{S_*}$ is the vectorization of the $s_1 \times \cdots \times s_K \times s_0$ cost tensor (defined as $d^p_{\mathcal{Z},p}$), $A \in \mathbb{R}^{S_+ \times S_*}$ is the projection matrix corresponding to the marginal constraints, and
$b_{n,m} := \begin{pmatrix} \boldsymbol{\mu}_n \\ \pi_{0m} \end{pmatrix}$, $b := \begin{pmatrix} \boldsymbol{\mu} \\ \pi_0 \end{pmatrix} \in \mathbb{R}^{s_1} \times \cdots \times \mathbb{R}^{s_K} \times \mathbb{R}^{s_0}$
are the vectors of marginal distributions. The lemma below is the black box used to obtain Theorem 4.5.

Lemma A.6 ([67, Theorem 4]). Suppose (A.7) satisfies the assumptions of Theorem 4.3. If $b_{n,m}$ has distributional limit $G$ with rate $r_{n,m}$, and $\gamma(b)$ is a singleton, then
$r_{n,m}\big(\gamma(b_{n,m}) - \gamma(b)\big) \xrightarrow{d} p^*_b(G)$,
where $p^*_b(G)$ is the set of optimal solutions to the following linear program:
$\min_{p \in \mathbb{R}^{S_*}} \langle c, p \rangle$, s.t.
$Ap = G$, $p_{ij} \ge 0$ for all $(ij) \notin \mathrm{spt}(\gamma(b))$.
Here $\gamma(b_{n,m}) - \gamma(b) := \{\gamma - \gamma(b) : \gamma \in \gamma(b_{n,m})\}$.

The next lemma verifies that our problem satisfies the assumptions of Theorem 4.3.

Lemma A.7. (A.7) satisfies the assumptions of Theorem 4.3.

Proof. First, since our problem is an MOT problem, the solution set $\gamma(b) \subseteq [0,1]^{S_*}$ is always nonempty and bounded. In particular, for generic $c$ with vanishing diagonal, i.e., $c(\mathbf{x}, \mathbf{x}) = 0$ for all $\mathbf{x} \in \mathcal{Z}$, $\gamma(b)$ is $\{(\mathrm{Id}, \mathrm{Id})_\#(\pi_0)\}$, hence a singleton.

The projection matrix $A$ has rank $S_+ - K$, so it does not have full row rank. However, there is a canonical way to adjust (A.7) so that $A$ has full rank, as discussed in [62, Section 6]. Since $\mu^1, \dots, \mu^K$ are probability measures, it suffices to consider $\mu^{1,\dagger}, \dots, \mu^{K,\dagger}$, where $\mu^\dagger$ is the truncation of a probability vector obtained by removing its last coordinate. Accordingly, let $A^\dagger \in \mathbb{R}^{(S_+ - K) \times S_*}$ denote the matrix obtained from $A$ by removing its $s_1$-th, $(s_1 + s_2)$-th, ..., and $(s_1 + \cdots + s_K)$-th rows. Then the linear constraint $A\gamma = b$ reduces to
$A^\dagger \gamma = \begin{pmatrix} \boldsymbol{\mu}^\dagger \\ \pi_0 \end{pmatrix}$,
by which the full rank assumption is satisfied. To avoid multiplying notation, we abuse the notation $A\gamma = b$, which should be understood in the sense discussed above.

The limit distribution of $b_{n,m}$ is as follows. Assuming $\frac{m}{n+m} \to \lambda \in (0,1)$ as $n, m \to \infty$, it holds that

(A.8)  $\sqrt{\frac{nm}{n+m}}\,(b_{n,m} - b) \xrightarrow{d} G_{\boldsymbol{\mu}, \pi_0} = \begin{pmatrix} \sqrt{\lambda}\, G_{\boldsymbol{\mu}} \\ \sqrt{1-\lambda}\, G_{\pi_0} \end{pmatrix}$,

where $G_{\boldsymbol{\mu}}$ is the centered Gaussian on $\mathbb{R}^{s_1 + \cdots + s_K}$ with block-diagonal covariance matrix $\mathrm{diag}\big(\Sigma(\mu^1), \dots, \Sigma(\mu^K)\big)$, with $\Sigma(\mu^i)$ the $s_i \times s_i$ covariance matrix of the multinomial distribution $\mu^i$, and $G_{\pi_0}$ is the centered Gaussian on $\mathbb{R}^{s_0}$ with covariance matrix $\Sigma(\pi_0)$, the $s_0 \times s_0$ covariance matrix of the multinomial distribution $\pi_0$. The ratio $\lambda$ is natural: see [62, Remark 6.2] for more details.

The last condition is Slater's condition.
Such a $\gamma_0$ in Theorem 4.3 is sometimes called strictly feasible. Slater's condition implies that strong duality holds: see [13, Section 5.2.3] for more details. This condition is trivially satisfied in our setting by $\gamma_0 = (\otimes \boldsymbol{\mu}_n) \otimes \pi_{0m}$. Therefore, Theorem 4.3 for (A.7) is satisfied. □

A.3. The asymptotic confidence sets. Below, we present the asymptotic confidence sets for $\hat{\pi}$ from [67]. Let $[s] := \{1, \dots, s\}$. Recall that the columns of $A$ can be indexed by $[s_1] \times \cdots \times [s_K] \times [s_0]$ ($A$ has $S_*$ columns), arranged in lexicographical order. For any subset $\mathrm{I} \subseteq [s_1] \times \cdots \times [s_K] \times [s_0]$, we denote by $A_{\mathrm{I}}$ the $S_+ \times |\mathrm{I}|$ submatrix of $A$ obtained by taking the columns of $A$ contained in $\mathrm{I}$. Analogously, for $\gamma \in \mathbb{R}^{S_*}$, we denote by $\gamma_{\mathrm{I}}$ the vector of length $|\mathrm{I}|$ consisting of the coordinates of $\gamma$ contained in $\mathrm{I}$.

Definition A.8. A set $\mathrm{I} \subseteq [s_1] \times \cdots \times [s_K] \times [s_0]$ is a basis if $|\mathrm{I}| = k$ and $\mathrm{rank}(A_{\mathrm{I}}) = k$. Given a basis $\mathrm{I}$, the basic solution $\gamma(\mathrm{I}; b)$ is the vector $\gamma$ satisfying
$\gamma(\mathrm{I}; b)_{\mathrm{I}} = A_{\mathrm{I}}^{-1} b$, $\gamma(\mathrm{I}; b)_{\mathrm{I}^c} = 0$.
If it is feasible, i.e., $A_{\mathrm{I}}^{-1} b \ge 0$, $\gamma(\mathrm{I}; b)$ is called a basic feasible solution.

Definition A.9.
$\mathcal{I}(b) := \{\mathrm{I} : \mathrm{I} \text{ is a basis and } \gamma(\mathrm{I}; b) \text{ is a basic feasible solution}\}$,
$\mathcal{I}^*(b) := \{\mathrm{I} \in \mathcal{I}(b) : \gamma(\mathrm{I}; b) \text{ is optimal for (A.7)}\}$.
The set of optimal basic solutions of (A.7) is denoted by $V(b) := \{\gamma(\mathrm{I}; b) : \mathrm{I} \in \mathcal{I}^*(b)\}$.

Let $\gamma(b)$ be the set of optimal solutions for (A.7). In the language of linear programming, $V(b)$ is the set of extreme (vertex) solutions, whose elements cannot be written as a convex combination of other elements of $\gamma(b)$. In fact, $\gamma(b) = \mathrm{conv}(V(b))$, the convex hull of $V(b)$.
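On a tiny instance, Definitions A.8 and A.9 can be checked by brute force: enumerate all bases $\mathrm{I}$, solve $A_{\mathrm{I}} x = b$, keep the basic feasible solutions, and select the optimal vertices. The sketch below (an illustrative $2 \times 2$ transport problem with $A$ already in the adjusted full-rank form; the helper names are ours) uses exact rational arithmetic:

```python
from fractions import Fraction as F
from itertools import combinations

# Full-rank constraint matrix of a 2x2 transport LP: one mu-row is dropped,
# as in the adjustment A -> A-dagger described above.
A = [[F(1), F(1), F(0), F(0)],   # mu row for x = 0 (row for x = 1 dropped)
     [F(1), F(0), F(1), F(0)],   # nu row for z = 0
     [F(0), F(1), F(0), F(1)]]   # nu row for z = 1
b = [F(1, 2), F(1, 2), F(1, 2)]
c = [F(0), F(1), F(1), F(0)]     # vanishing-diagonal cost

def solve(AI, rhs):
    """Solve the 3x3 system AI x = rhs exactly; return None if singular."""
    M = [row[:] + [r] for row, r in zip(AI, rhs)]
    for col in range(3):
        piv = next((i for i in range(col, 3) if M[i][col] != 0), None)
        if piv is None:
            return None
        M[col], M[piv] = M[piv], M[col]
        for i in range(3):
            if i != col and M[i][col] != 0:
                f = M[i][col] / M[col][col]
                M[i] = [a - f * p for a, p in zip(M[i], M[col])]
    return [M[i][3] / M[i][i] for i in range(3)]

vertices = []
for I in combinations(range(4), 3):           # candidate bases, |I| = rank(A)
    AI = [[A[r][j] for j in I] for r in range(3)]
    x = solve(AI, b)
    if x is not None and all(xi >= 0 for xi in x):   # basic feasible solution
        gamma = [F(0)] * 4
        for pos, j in enumerate(I):
            gamma[j] = x[pos]
        vertices.append(gamma)

best = min(vertices, key=lambda g: sum(ci * gi for ci, gi in zip(c, g)))
print(best)  # the diagonal coupling (1/2, 0, 0, 1/2), attaining cost 0
```

Note that distinct bases can yield the same vertex (degeneracy), so $V(b)$ may be smaller than the number of optimal bases in $\mathcal{I}^*(b)$.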
Notice that $\gamma(b)$ is always bounded when $b$ is a marginal (probability) constraint vector, since (A.7) is a (multimarginal) optimal transport problem.

Now, suppose a statistician solves (A.7) with empirical input $b_{n,m}$ and obtains a (random) solution $\hat{\gamma}(b_{n,m}) \in V(b_{n,m})$ with corresponding basis $\mathrm{I}_n \in \mathcal{I}^*(b_{n,m})$. The goal is to construct a confidence set based on $\hat{\gamma}(b_{n,m})$ that contains at least one element of $\gamma(b)$ with high probability. Let $\gamma(\mathrm{I}_n; b)$ be the basic solution obtained from the random basis $\mathrm{I}_n$, which may be neither feasible for (A.7) nor optimal. To remedy these issues, define the projection

(A.9)  $\gamma^*_n := \operatorname{argmin}_{\gamma \in \gamma(b)} \|\gamma(\mathrm{I}_n; b) - \gamma\|$.

The following result concerns the confidence set for (A.7).

Lemma A.10 ([67, Theorem 8]). Suppose that (A.7) satisfies the assumptions of Theorem 4.3 and $b_{n,m}$ has distributional limit $G$. Let $G_\alpha$ be an open set such that $\mathbb{P}\{G \in G_\alpha\} \ge 1 - \alpha$. Then

(A.10)  $\liminf_{n \to \infty} \mathbb{P}\{r_{n,m}(\hat{\gamma}(b_{n,m}) - \gamma^*_n) \in \gamma(\mathrm{I}_n; G_\alpha)\} \ge 1 - \alpha$,

where $\gamma(\mathrm{I}_n; G_\alpha) := \{\gamma(\mathrm{I}_n; G) : G \in G_\alpha\}$.

Corollary A.11 (Confidence set for an optimal solution; [67, Corollary 9]). In the setting of Lemma A.10, the set
$C_n := \{\hat{\gamma}(b_{n,m}) - r_{n,m}^{-1}\gamma : \gamma \in \gamma(\mathrm{I}_n; G_\alpha)\}$
contains an element of $\gamma(b)$ with asymptotic probability at least $1 - \alpha$.

From Corollary A.11, one can obtain asymptotic confidence sets containing $\pi_0$ with probability at least $1 - \alpha$ by projection. We omit the proof.

Theorem A.12 (Confidence set). Fix $\alpha \in (0, 1)$. Assume that $\frac{m}{n+m} \to \lambda \in (0,1)$ as $n, m \to \infty$ and $c = d^p_{\mathcal{Z},p}$ in (2.1). Let $\hat{\pi}(b_{n,m})$ be a solution for (1.3) with input $\boldsymbol{\mu}_n$ and $\pi_{0m}$. Let $G_\alpha$ be an open set such that $\mathbb{P}\{G_{\boldsymbol{\mu}, \pi_0} \in G_\alpha\} \ge 1 - \alpha$, where $G_{\boldsymbol{\mu}, \pi_0}$ is the limit distribution of $b_{n,m}$ given in (4.2).
Then the confidence set
$C_\alpha := \big\{ \hat{\pi}(b_{n,m}) - \sqrt{\tfrac{n+m}{nm}}\,\pi : \pi \in \pi(\tilde{\mathrm{I}}; G_\alpha) \big\}$
contains $\pi_0$ with asymptotic probability at least $1 - \alpha$, where $\tilde{\mathrm{I}} \in \mathcal{I}^*(b_{n,m})$ and $\pi(\tilde{\mathrm{I}}; G_\alpha) := \{(P_{\mathbf{x}})_\# \gamma(\tilde{\mathrm{I}}; G) : G \in G_\alpha\}$.

B. Appendix: Additional Simulation Results

Table 5 presents the root mean squared error (RMSE) of $\tilde{F}(-0.25, -0.25)$ and $\hat{F}(-0.25, -0.25)$, averaged over 500 repetitions, for each combination of $(m, n, \rho)$. The results are largely consistent with those in Table 1. First, for each $\rho$ and fixed ratio $n/m$, the RMSE values of $\tilde{F}$ and $\hat{F}$ decrease with $m$, approximately at the rate $1/\sqrt{m}$. Second, the optimal $\eta$ depends on $\rho$: smaller $\eta$ is preferred for large $\rho$ and larger $\eta$ for small $\rho$. Third, the cross-validation procedure yields $\hat{F}$ that outperforms $\tilde{F}$, demonstrating the practical advantage of incorporating marginal data for more efficient estimation.

ρ | n/m | m | $\tilde{F}$ | $\hat{F}$ ($\eta = 2^0$) | $\hat{F}$ ($\eta = 2^{-6}$) | $\hat{F}$ ($\eta = \eta_{cv}$) | Average $\log_2 \eta_{cv}$
0.00 | 5 | 50 | 5.18 | 2.29 | 3.22 | 3.20 | -2.63
0.00 | 5 | 100 | 3.65 | 1.63 | 2.36 | 2.29 | -2.75
0.00 | 5 | 200 | 2.55 | 1.06 | 1.48 | 1.42 | -2.71
0.00 | 10 | 50 | 5.18 | 1.86 | 2.94 | 2.87 | -2.53
0.00 | 10 | 100 | 3.65 | 1.34 | 2.14 | 2.16 | -3.08
0.00 | 10 | 200 | 2.55 | 0.92 | 1.39 | 1.35 | -2.70
0.00 | 25 | 50 | 5.18 | 1.55 | 2.72 | 2.77 | -2.63
0.00 | 25 | 100 | 3.65 | 1.17 | 2.04 | 1.99 | -2.83
0.00 | 25 | 200 | 2.55 | 0.80 | 1.34 | 1.29 | -2.70
0.25 | 5 | 50 | 5.46 | 2.83 | 3.32 | 3.53 | -4.03
0.25 | 5 | 100 | 4.01 | 2.30 | 2.47 | 2.71 | -4.49
0.25 | 5 | 200 | 2.84 | 1.91 | 1.60 | 1.76 | -5.21
0.25 | 10 | 50 | 5.46 | 2.41 | 2.96 | 3.27 | -3.79
0.25 | 10 | 100 | 4.01 | 2.11 | 2.24 | 2.48 | -4.51
0.25 | 10 | 200 | 2.84 | 1.77 | 1.47 | 1.66 | -5.22
0.25 | 25 | 50 | 5.46 | 2.19 | 2.69 | 3.04 | -3.82
0.25 | 25 | 100 | 4.01 | 1.98 | 2.13 | 2.41 | -4.67
0.25 | 25 | 200 | 2.84 | 1.69 | 1.39 | 1.57 | -5.14
0.75 | 5 | 50 | 6.26 | 6.34 | 3.42 | 3.60 | -5.77
0.75 | 5 | 100 | 4.33 | 6.09 | 2.60 | 2.63 | -5.97
0.75 | 5 | 200 | 3.10 | 5.88 | 1.92 | 1.80 | -6.00
0.75 | 10 | 50 | 6.26 | 6.08 | 2.95 | 3.18 | -5.73
0.75 | 10 | 100 | 4.33 | 5.95 | 2.21 | 2.22 | -5.99
0.75 | 10 | 200 | 3.10 | 5.82 | 1.67 | 1.53 | -6.00
0.75 | 25 | 50 | 6.26 | 5.99 | 2.57 | 2.80 | -5.79
0.75 | 25 | 100 | 4.33 | 5.93 | 2.01 | 1.99 | -5.99
0.75 | 25 | 200 | 3.10 | 5.77 | 1.54 | 1.39 | -6.00

Table 5. Each cell reports the RMSE of the estimators for $F(-0.25, -0.25)$, computed over 500 simulation repetitions.
The RMSE values are scaled by a factor of 100. The last column, labeled "Average $\log_2 \eta_{cv}$," reports the mean of $\log_2 \eta_{cv}$ across the 500 repetitions.

School of Data Science, The Chinese University of Hong Kong, Shenzhen, Guangdong, 518172, P.R. China.
Email address: jakwangkim@cuhk.edu.cn

Department of Mathematics, University of British Columbia, 1984 Mathematics Road, Vancouver, British Columbia, V6T 1Z2, Canada.
Email address: yhkim@math.ubc.ca

Department of Statistics, University of Illinois Urbana-Champaign, 136 Computing Applications Building, 605 E Springfield Ave, Champaign, IL 61820, USA.
Email address: parkchan@illinois.edu
