Concentration of discrepancy-based approximate Bayesian computation via Rademacher complexity


Authors: Sirio Legramanti, Daniele Durante, Pierre Alquier

Sirio Legramanti (Department of Economics, University of Bergamo, sirio.legramanti@unibg.it), Daniele Durante (Department of Decision Sciences and Institute for Data Science and Analytics, Bocconi University, daniele.durante@unibocconi.it), and Pierre Alquier (Department of Information Systems, Decision Sciences and Statistics, ESSEC Business School, alquier@essec.edu)

There has been increasing interest in summary-free solutions for approximate Bayesian computation (ABC) which replace distances among summaries with discrepancies between the empirical distributions of the observed data and the synthetic samples generated under the proposed parameter values. The success of these strategies has motivated theoretical studies on the limiting properties of the induced posteriors. However, there is still a lack of a theoretical framework for summary-free ABC that (i) is unified, rather than discrepancy-specific, (ii) does not require constraining the analysis to data generating processes and statistical models meeting specific regularity conditions, but rather facilitates the derivation of limiting properties that hold uniformly, and (iii) relies on verifiable assumptions that provide more explicit concentration bounds clarifying which factors govern the limiting behavior of the ABC posterior. We address this gap via a novel theoretical framework that introduces the concept of Rademacher complexity in the analysis of the limiting properties of discrepancy-based ABC posteriors, including in non-i.i.d. and misspecified settings. This yields a unified theory that relies on constructive arguments and provides more informative asymptotic results and uniform concentration bounds, even in settings not covered by current studies.
These advancements are obtained by relating the asymptotic properties of summary-free ABC posteriors to the behavior of the Rademacher complexity associated with the chosen discrepancy within the family of integral probability semimetrics (IPS). The IPS class extends summary-based distances, and also includes the widely-implemented Wasserstein distance and maximum mean discrepancy (MMD), among others. As clarified in specialized theoretical analyses of popular IPS discrepancies and via illustrative simulations, this new perspective improves the understanding of summary-free ABC.

MSC2020 subject classifications: Primary 62F15; secondary 62C10, 62E20.
Keywords and phrases: ABC, integral probability semimetrics, MMD, Rademacher complexity, Wasserstein distance.

1. Introduction. The growing complexity of statistical models in modern applications not only yields intractable likelihoods, but also raises substantial challenges in the identification of effective summary statistics (see, e.g., Fearnhead and Prangle, 2012; Marin et al., 2014; Frazier et al., 2018). Such drawbacks have motivated an increasing adoption of ABC solutions, along with a shift away from summary-based implementations (see, e.g., Marin et al., 2012) and towards summary-free strategies relying on discrepancies among the empirical distributions of the observed and synthetic data (see, e.g., Drovandi and Frazier, 2022). These solutions provide samples from an approximate posterior distribution for the parameter of interest θ ∈ Θ ⊆ R^p under the only requirement that simulating from the assumed model µ_θ is feasible. This is achieved by retaining all those values of θ, drawn from the prior, that produced synthetic samples z_{1:m} = (z_1, …, z_m) from µ_θ whose empirical distribution is close enough to that of the observed data y_{1:n} = (y_1, …, y_n), under the chosen discrepancy.
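This accept/reject scheme can be sketched in a few lines. The following minimal illustration is a sketch rather than any implementation from the paper: the Gaussian location model, the Uniform prior, and the use of the distance between empirical means as the discrepancy (a summary-based discrepancy, cf. Example 2.4) are all hypothetical choices for exposition.

```python
import random

random.seed(1)

def discrepancy(z, y):
    # Toy stand-in for Delta(z, y): distance between empirical means,
    # i.e., a summary-based discrepancy with the single summary f(x) = x.
    return abs(sum(z) / len(z) - sum(y) / len(y))

def rejection_abc(y, prior, simulator, eps, draws=5000):
    """Retain prior draws theta whose synthetic sample z ~ mu_theta is
    eps-close to the observed data y under the chosen discrepancy."""
    accepted = []
    for _ in range(draws):
        theta = prior()
        z = simulator(theta, len(y))      # m = n synthetic observations
        if discrepancy(z, y) <= eps:
            accepted.append(theta)
    return accepted

# Hypothetical toy model: y_i ~ N(theta_star, 1), theta_star = 2, Uniform(-5, 5) prior.
theta_star = 2.0
y = [random.gauss(theta_star, 1.0) for _ in range(200)]
post = rejection_abc(
    y,
    prior=lambda: random.uniform(-5.0, 5.0),
    simulator=lambda th, m: [random.gauss(th, 1.0) for _ in range(m)],
    eps=0.2,
)
print(len(post), round(sum(post) / len(post), 2))  # accepted draws cluster near theta_star
```

The retained draws form a sample from the ABC posterior induced by the chosen discrepancy and threshold.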
Remarkable examples of the above implementations are ABC versions that employ the maximum mean discrepancy (MMD) (Park, Jitkrittum and Sejdinovic, 2016), Kullback–Leibler (KL) divergence (Jiang, Wu and Wong, 2018), Wasserstein distance (Bernton et al., 2019), energy statistic (Nguyen et al., 2020), Hellinger and Cramér–von Mises distances (Frazier, 2020), and γ-divergence (Fujisawa et al., 2021); see also Gutmann et al. (2018), Forbes et al. (2021) and Wang, Kaji and Rockova (2022) for additional examples of summary-free ABC strategies. By overcoming the need to pre-select summaries, all these solutions reduce the potential information loss of summary-based ABC, thereby yielding improved performance in simulation studies and illustrative applications. These promising empirical results have motivated active research on the theoretical properties of the induced ABC posterior, with a main focus on the limiting behavior under different asymptotic regimes for the tolerance threshold and the sample size (Jiang, Wu and Wong, 2018; Bernton et al., 2019; Nguyen et al., 2020; Frazier, 2020; Fujisawa et al., 2021). Among these regimes, of particular interest are the two situations in which the ABC threshold is either fixed or progressively shrinks as both n and m diverge. In the former case, the focus of the theory is on assessing whether the control on the discrepancy among the empirical distributions established by the selected ABC threshold yields a pseudo-posterior whose asymptotic form guarantees the same threshold-control on the discrepancy among the underlying truths (e.g., Jiang, Wu and Wong, 2018).
The latter addresses instead the more challenging theoretical question of whether a suitably-decaying ABC threshold can provide meaningful rates of concentration for the sequence of ABC posteriors around those θ values yielding a µ_θ close enough to the data generating process µ∗, as n and m diverge (e.g., Bernton et al., 2019). Available contributions along these lines of research have the merit of providing theoretical support to several versions of summary-free ABC. However, current theory is often tailored to the specific discrepancy analyzed, and generally relies on difficult-to-verify existence assumptions and concentration inequalities that constrain the analysis, either implicitly or explicitly, to data generating processes and statistical models which meet suitable regularity conditions, thereby lacking results that hold uniformly. Recalling Bernton et al. (2019) and Nguyen et al. (2020), this also yields concentration bounds involving sequences of control functions which are not made explicit. As such, although convergence and concentration can still be proved, the core factors that govern these asymptotic properties remain unexplored, thus limiting the methodological impact of current theory, while hindering the derivation of novel informative results in more challenging settings. For example, the available theoretical studies pre-assume that the discrepancy among the empirical distributions µ̂_{z_{1:m}} and µ̂_{y_{1:n}} suitably converges to the one among the corresponding truths µ_θ and µ∗, or, alternatively, that both µ̂_{z_{1:m}} and µ̂_{y_{1:n}} converge, under the selected discrepancy, to µ_θ and µ∗, respectively.
While these assumptions can be verified under suitable conditions and for specific discrepancies, as clarified within Sections 2 and 3, an in-depth theoretical understanding of summary-free ABC necessarily requires relating convergence to the learning properties of the selected discrepancy, rather than pre-assuming it. Such a more precise but yet-unexplored theoretical treatment has the potential to shed light on the factors that govern the limiting behavior of discrepancy-based ABC posteriors. In addition, it could provide more general, verifiable and explicit sufficient conditions under which popular discrepancies are guaranteed to ensure that the previously pre-assumed convergence and concentration hold uniformly over models and data generating processes, while facilitating the study of the limiting behavior of discrepancy-based ABC posteriors in more general situations where these convergence guarantees may be lacking. In this article, we address the above gap by introducing an innovative theoretical framework which analyzes the limiting properties of discrepancy-based ABC posteriors, under a unified perspective and for different asymptotic regimes, through the concept of Rademacher complexity (see, e.g., Wainwright, 2019, Chapter 4), within the general class of integral probability semimetrics (IPS) (e.g., Müller, 1997; Sriperumbudur et al., 2012). Such a class naturally generalizes distances among summaries and includes the widely-implemented MMD and Wasserstein distance, among others. As clarified in Sections 2–3 and in Appendix C of the Supplementary Material, this perspective, which to the best of our knowledge is novel within ABC, allows us to derive unified, informative and uniform concentration bounds for ABC posteriors under several discrepancies, in possibly misspecified and non-i.i.d. contexts.
Moreover, it relies on more constructive arguments that clarify under which sufficient conditions a discrepancy within the IPS class guarantees uniform convergence and concentration of the induced ABC posterior, i.e., without necessarily requiring suitable regularity conditions for the underlying data generating process µ∗ and the assumed statistical model. This yields an important theoretical and methodological advancement, since µ∗ is often unknown in practice and, hence, verifying regularity conditions on the data generating process is generally unfeasible. Crucially, the theoretical framework we introduce allows us to prove informative theoretical results even in yet-unexplored settings that possibly lack those convergence guarantees assumed in the literature. More specifically, in these settings we derive novel upper and lower bounds for the limiting acceptance probabilities that clarify in which contexts the ABC posterior is still well-defined for n large enough. When this is the case, it is further possible to obtain informative supersets for the support of such a posterior. These show that, when relaxing standard convergence assumptions employed in state-of-the-art theoretical studies, the control established by a fixed ABC threshold on the discrepancy among the empirical distributions does not necessarily translate, asymptotically, into the same control on the discrepancy among the corresponding truths, but rather yields an upper bound equal to the sum of the ABC threshold and a multiple of the Rademacher complexity, namely a measure of richness of the class of functions that uniquely identifies the chosen IPS; see Section 3.1. The above results clarify the fundamental relation between the limiting behavior of ABC posteriors and the learning properties of the chosen discrepancy, when measured via Rademacher complexity.
In addition, the bounds derived clarify that a sufficient condition to recover a limiting pseudo-posterior with the same threshold-control on the discrepancy among the truths as the one enforced on the corresponding empirical distributions is that the selected discrepancy has a Rademacher complexity vanishing to zero in the large-data limit. As proved within Section 3.2, this setting also allows constructive derivations of novel, informative and uniform concentration bounds for discrepancy-based ABC posteriors in the challenging regime where the threshold shrinks towards zero as both m and n diverge. This is facilitated by the existence of meaningful upper bounds for the Rademacher complexity associated with popular ABC discrepancies, along with the availability of constructive conditions for the derivation of these bounds (Sriperumbudur et al., 2012). Such results leverage fundamental connections between the Rademacher complexity and other key quantities in statistical learning theory, such as the Vapnik–Chervonenkis (VC) dimension and the notion of uniform Glivenko–Cantelli classes (e.g., Wainwright, 2019, Chapter 4). This yields an improved understanding of the factors that govern the concentration of discrepancy-based ABC posteriors under a unified perspective that further allows us to (i) quantify rates of concentration and (ii) directly translate any advancement on Rademacher complexity into novel ABC theory. Section 4 illustrates point (i) through a specific focus on MMD with routinely-implemented bounded kernels, and also clarifies that in the absence of guarantees on uniformly-vanishing Rademacher complexity (e.g., under the Wasserstein distance in unbounded data spaces) concentration results can still be derived, but at the expense of regularity conditions for the data generating process µ∗ and the assumed model.
Point (ii) is clarified in Appendix C of the Supplementary Material, where we extend the theory from Section 3 to non-i.i.d. settings, leveraging results in Mohri and Rostamizadeh (2008) on the Rademacher complexity for β-mixing processes (e.g., Doukhan, 1994). The illustrative simulation studies in Section 5 show that the theoretical results derived in Sections 3–4 find empirical support in practice, including in scenarios characterized by model misspecification and data contamination (theoretical and empirical results under non-i.i.d. data generating processes can be found in Appendix C of the Supplementary Material). These findings suggest that discrepancies with guarantees of uniformly-vanishing Rademacher complexity provide a robust and sensible choice when the assumed statistical model and/or the underlying data generating process do not necessarily meet specific regularity conditions, or when it is not possible to verify these conditions. This is a common situation in applications, since the data generating process is unknown in practice. As discussed in Section 6, the unexplored bridge between discrepancy-based ABC and the Rademacher complexity introduced in this article can also be leveraged to derive even more general theory by exploiting the active literature on the Rademacher complexity. For example, combining our perspective with the recent unified treatment of IPS and f-divergences (Agrawal and Horel, 2021; Birrell et al., 2022) might set the premises to derive similarly-interpretable and general results for other discrepancies employed within ABC, such as the Kullback–Leibler divergence (Jiang, Wu and Wong, 2018) and the Hellinger distance (Frazier, 2020).
More generally, our contribution can have implications even beyond ABC, in particular for generalized Bayesian inference via pseudo-posteriors based on discrepancies (Bissiri, Holmes and Walker, 2016; Chérief-Abdellatif and Alquier, 2020; Matsubara et al., 2022; Frazier, Knoblauch and Drovandi, 2024). Proofs and additional results can be found in the Supplementary Material.

2. Integral probability semimetrics and Rademacher complexity. Denote with y_{1:n} = (y_1, …, y_n) ∈ Y^n an i.i.d. sample from µ∗ ∈ P(Y), where P(Y) is the space of probability measures on Y, and assume that Y is a metric space endowed with distance ρ; see Appendix C in the Supplementary Material for extensions to non-i.i.d. settings. Given a statistical model {µ_θ : θ ∈ Θ ⊆ R^p} in P(Y) and a prior distribution π on θ, rejection ABC iteratively samples θ from π, draws a synthetic i.i.d. sample z_{1:m} = (z_1, …, z_m) from µ_θ, and retains θ as a sample from the ABC posterior if the discrepancy ∆(z_{1:m}, y_{1:n}) between z_{1:m} and y_{1:n} is below a user-selected threshold ε_n ≥ 0. Although m may differ from n, here we follow common practice in ABC theory (e.g., Bernton et al., 2019; Frazier, 2020) and set m = n. Rejection ABC does not sample from the exact posterior π_n(θ) ∝ π(θ) µ_θ^n(y_{1:n}), but rather from the ABC posterior defined as

π_n^{(ε_n)}(θ) ∝ π(θ) ∫_{Y^n} 1{∆(z_{1:n}, y_{1:n}) ≤ ε_n} µ_θ^n(dz_{1:n}),

whose properties clearly depend on the chosen discrepancy ∆(·, ·). Within summary-based ABC, ∆(z_{1:n}, y_{1:n}) is a suitable distance, typically Euclidean, among summaries computed from the synthetic sample z_{1:n} and the observed data y_{1:n}. However, recalling, e.g., Marin et al. (2014) and Frazier et al. (2018), the identification of summaries that do not lead to information loss is challenging for those complex models requiring ABC implementations. To overcome these challenges, the ABC literature has progressively moved towards the adoption of discrepancies D : P(Y) × P(Y) → [0, ∞] between the empirical distributions of the synthetic and observed data, that is,

∆(z_{1:n}, y_{1:n}) = D(µ̂_{z_{1:n}}, µ̂_{y_{1:n}}) = D(n^{-1} Σ_{i=1}^n δ_{z_i}, n^{-1} Σ_{i=1}^n δ_{y_i}),

where δ_x is the Dirac delta at a generic x ∈ Y. Popular examples are ABC versions based on MMD, KL divergence, Wasserstein distance, energy statistic, Hellinger and Cramér–von Mises distances, and γ-divergence, whose limiting properties have been studied in Park, Jitkrittum and Sejdinovic (2016), Jiang, Wu and Wong (2018), Bernton et al. (2019), Nguyen et al. (2020), Frazier (2020) and Fujisawa et al. (2021) under different asymptotic regimes and with a common reliance on existence assumptions to ease the proofs. As a first step toward our unified and constructive theoretical framework, we shall emphasize that, although most of the above contributions treat the discrepancies separately, some of these choices share, in fact, a common origin. For example, MMD, the Wasserstein distance and the energy statistic all belong to the IPS class (Müller, 1997) in Definition 2.1. Such a class also includes summary-based distances.

DEFINITION 2.1 (Integral probability semimetric, IPS). Let F be a class of measurable functions f : Y → R. Then, the integral probability semimetric D_F between µ_1, µ_2 ∈ P(Y) is defined as

D_F(µ_1, µ_2) := sup_{f ∈ F} | ∫_Y f dµ_1 − ∫_Y f dµ_2 |,

where F is the pre-specified class of functions that characterizes D_F.

Examples 2.2–2.4 show that routinely-employed discrepancies both in summary-free and summary-based ABC (e.g., Park, Jitkrittum and Sejdinovic, 2016; Bernton et al.
, 2019; Nguyen et al., 2020; Drovandi and Frazier, 2022) are, in fact, IPS with a known characterizing family F that uniquely identifies each discrepancy.

EXAMPLE 2.2 (Wasserstein-1 distance). Let us consider the Lipschitz seminorm defined as ||f||_L := sup{|f(x) − f(x′)| / ρ(x, x′) : x ≠ x′ in Y}. If F = {f : ||f||_L ≤ 1}, then D_F coincides with the Kantorovich metric. Recalling the Kantorovich–Rubinstein theorem (e.g., Dudley, 2018), when Y is separable this metric coincides with the dual representation of the Wasserstein-1 distance, which is therefore recovered as an example of IPS. Recall also that such a distance is defined on the space of probability measures having finite first moment (e.g., Sriperumbudur et al., 2012; Bernton et al., 2019).

EXAMPLE 2.3 (MMD and energy distance). Given a positive-definite kernel k(·, ·) on Y × Y, let F = {f : ||f||_H ≤ 1}, where H is the reproducing kernel Hilbert space corresponding to k(·, ·). Then D_F is the MMD (e.g., Muandet et al., 2017). When k(·, ·) is characteristic, then D_F is not only a semimetric, but also a metric, i.e., D_F(µ_1, µ_2) = 0 implies µ_1 = µ_2. Relevant examples of routinely-implemented characteristic kernels in Y = R^d are the Gaussian exp(−‖x − x′‖²/σ²) and Laplace exp(−‖x − x′‖/σ) ones (Muandet et al., 2017). Note that MMD is also directly related to the energy distance, due to the correspondence between positive-definite kernels and negative-definite functions (Sejdinovic et al., 2013).

EXAMPLE 2.4 (Summary-based distances). Classical summary-based ABC employs distances among a finite set of summaries f_1, …, f_K (e.g., Drovandi and Frazier, 2022). Interestingly, these distances can be recovered as a special case of MMD with a suitably-defined kernel and, hence, are guaranteed to belong to the IPS class. In particular, for a pre-selected vector of summaries f(x) = [f_1(x), …, f_K(x)] ∈ H = R^K equipped with the standard Euclidean norm ‖f(x)‖²_H = ⟨f(x), f(x)⟩ = f_1²(x) + · · · + f_K²(x), one can define the kernel k(x, x′) = ⟨f(x), f(x′)⟩ to obtain classical summary-based ABC recast within the MMD framework. As a result, popular kernels, such as the Gaussian one, relying on an infinite-dimensional feature space can be interpreted as a limiting version of summary-based ABC.

Although Examples 2.2–2.4 characterize the most popular instances of IPS discrepancies employed in ABC, it shall be emphasized that other interesting semimetrics belong to the IPS class (e.g., Sriperumbudur et al., 2012; Birrell et al., 2022). Two relevant ones are the total variation (TV) and Kolmogorov–Smirnov distances discussed in the Supplementary Material. While ABC based on discrepancies, such as those presented above, overcomes the need to pre-select summaries, it still requires the choice of a discrepancy. This motivates theory on the asymptotic properties of the induced ABC posterior, under general discrepancies, for n → ∞ and relevant thresholding schemes. Considering here a unified perspective which applies to the whole IPS class, the current theory focuses both on fixed ε_n = ε settings, and also on ε_n → ε∗ regimes, where ε∗ = inf_{θ∈Θ} D_F(µ_θ, µ∗) is the lowest attainable discrepancy between the assumed model and the data generating process (Jiang, Wu and Wong, 2018; Bernton et al., 2019; Nguyen et al., 2020; Frazier, 2020; Frazier, Robert and Rousseau, 2020; Fujisawa et al., 2021).
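To fix ideas, the empirical MMD of Example 2.3 can be computed directly from two samples. The sketch below uses the Gaussian kernel with a hypothetical bandwidth σ = 1 and the simple biased (V-statistic) estimator; it is illustrative only, not an implementation from the paper.

```python
import math
import random

random.seed(0)

def gaussian_kernel(x, xp, sigma=1.0):
    # Characteristic kernel exp(-(x - x')^2 / sigma^2) from Example 2.3, with d = 1.
    return math.exp(-(x - xp) ** 2 / sigma ** 2)

def mmd(xs, ys, kernel=gaussian_kernel):
    """Biased (V-statistic) estimate of MMD(mu_hat_x, mu_hat_y): the IPS
    whose class F is the unit ball of the RKHS associated with `kernel`."""
    kxx = sum(kernel(a, b) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(kernel(a, b) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(kernel(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return math.sqrt(max(kxx + kyy - 2.0 * kxy, 0.0))

same = [random.gauss(0.0, 1.0) for _ in range(100)]
also_same = [random.gauss(0.0, 1.0) for _ in range(100)]
shifted = [random.gauss(3.0, 1.0) for _ in range(100)]
print(round(mmd(same, also_same), 3))  # small: both samples from N(0, 1)
print(round(mmd(same, shifted), 3))    # larger: the distributions differ
```

Since the Gaussian kernel is characteristic, the population MMD vanishes only when the two distributions coincide, which is what the two printed values illustrate at the sample level.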
Under these settings, available theory investigates whether the ε_n-control on D_F(µ̂_{z_{1:n}}, µ̂_{y_{1:n}}) established by rejection ABC translates into meaningful convergence results, and upper bounds on the rates of concentration for the sequence of discrepancy-based ABC posteriors around the true data generating process µ∗, both under correctly specified models where µ∗ = µ_{θ∗} for some θ∗ ∈ Θ, and in misspecified contexts where µ∗ does not belong to {µ_θ : θ ∈ Θ ⊆ R^p}. A common strategy to derive the aforementioned results is to study convergence of the empirical distributions µ̂_{z_{1:n}} and µ̂_{y_{1:n}} to the corresponding truths µ_θ and µ∗, in the chosen discrepancy D_F, after noticing that, since D_F is a semimetric, we have

D_F(µ_θ, µ∗) ≤ D_F(µ̂_{z_{1:n}}, µ_θ) + D_F(µ̂_{z_{1:n}}, µ̂_{y_{1:n}}) + D_F(µ̂_{y_{1:n}}, µ∗).   (1)

To this end, available theoretical studies pre-assume suitable convergence results for the empirical measures along with specific concentration inequalities regulated by non-explicit sequences of control functions; see Assumptions 1–2 in Proposition 3 of Bernton et al. (2019), and Assumptions A1–A2 in Nguyen et al. (2020). Alternatively, it is possible to directly require that D_F(µ̂_{z_{1:n}}, µ̂_{y_{1:n}}) → D_F(µ_θ, µ∗) almost surely as n → ∞ (see Jiang, Wu and Wong, 2018; Nguyen et al., 2020, Theorems 1 and 2, respectively). However, as highlighted by the same authors, these conditions (i) can be difficult to verify for several discrepancies, (ii) do not allow one to assess whether some of these discrepancies can achieve convergence and concentration uniformly over P(Y), and (iii) often yield bounds which hinder an in-depth understanding of the factors regulating these limiting properties.
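The decomposition in (1) can be checked directly whenever F is finite and Y is discrete, since then D_F is computable exactly. The sketch below, with hypothetical distributions on Y = {0, 1, 2} and a hypothetical three-function class, verifies the bound numerically; since D_F is a semimetric, the inequality holds deterministically.

```python
import random

random.seed(3)

# Hypothetical finite setting: Y = {0, 1, 2} and a small class F of bounded
# functions, so the IPS D_F(mu1, mu2) = sup_{f in F} |E_mu1 f - E_mu2 f| is exact.
F = [lambda x: 1.0 if x == 0 else 0.0,
     lambda x: 1.0 if x == 1 else 0.0,
     lambda x: x / 2.0]

def d_F(mu1, mu2):
    # Distributions are represented as dicts {point: probability}.
    return max(abs(sum(p * f(x) for x, p in mu1.items())
                   - sum(p * f(x) for x, p in mu2.items())) for f in F)

def empirical(xs):
    return {v: xs.count(v) / len(xs) for v in set(xs)}

mu_star = {0: 0.5, 1: 0.3, 2: 0.2}    # data generating process (hypothetical)
mu_theta = {0: 0.4, 1: 0.4, 2: 0.2}   # assumed model at some theta (hypothetical)

def draw(mu, size):
    pts, ws = zip(*mu.items())
    return random.choices(pts, weights=ws, k=size)

y_hat = empirical(draw(mu_star, 500))   # empirical measure of the observed data
z_hat = empirical(draw(mu_theta, 500))  # empirical measure of the synthetic data

lhs = d_F(mu_theta, mu_star)
rhs = d_F(z_hat, mu_theta) + d_F(z_hat, y_hat) + d_F(y_hat, mu_star)
print(lhs <= rhs)  # True: the triangle bound in (1) holds
```

The interesting question, addressed next, is when the first and third terms on the right-hand side are themselves small, so that the bound is informative.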
In addition, when the above assumptions are not met, the asymptotic behavior of ABC posteriors remains unexplored. As a first step towards addressing the above issues, notice that the aforementioned results, when contextualized within the IPS class, are inherently related to the richness of the known class of functions F that identifies each D_F; see Definition 2.1. Intuitively, if F is too rich, it is always possible to find a function f ∈ F yielding large discrepancies between the empiricals and the corresponding truths even when µ̂_{z_{1:n}} and µ̂_{y_{1:n}} are arbitrarily close to µ_θ and µ∗, respectively. Hence, D_F(µ̂_{z_{1:n}}, µ_θ) and D_F(µ̂_{y_{1:n}}, µ∗) will remain large with positive probability, making the triangle inequality in (1) of limited interest, since a low D_F(µ̂_{z_{1:n}}, µ̂_{y_{1:n}}) will not necessarily imply a small D_F(µ_θ, µ∗). In fact, in this context, it is not even guaranteed that D_F(µ̂_{z_{1:n}}, µ̂_{y_{1:n}}) will converge to D_F(µ_θ, µ∗). This suggests that the limiting properties of the ABC posterior are inherently related to the richness of F. In Section 3 we prove that this is the case when such richness is measured through the notion of Rademacher complexity clarified in Definition 2.5; see also Chapter 4 in Wainwright (2019) for an introduction.

DEFINITION 2.5 (Rademacher complexity). Given a vector x_{1:n} = (x_1, …, x_n) of i.i.d. random variables with distribution µ ∈ P(Y), the Rademacher complexity R_{µ,n}(F) of a class F of real-valued measurable functions is

R_{µ,n}(F) = E_{x_{1:n}, ϵ_{1:n}}[ sup_{f∈F} |(1/n) Σ_{i=1}^n ϵ_i f(x_i)| ],

where ϵ_1, …, ϵ_n denote i.i.d. Rademacher variables, that is, P(ϵ_i = 1) = P(ϵ_i = −1) = 1/2.
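Definition 2.5 lends itself to straightforward Monte Carlo approximation. The sketch below estimates R_{µ,n}(F) for a hypothetical finite class of bounded threshold functions with µ = N(0, 1), and illustrates its decay toward zero as n grows; both the class and the distribution are illustrative choices, not ones studied in the paper.

```python
import random

random.seed(2)

# Hypothetical 1-uniformly-bounded class F: five threshold functions
# f_t(x) = sign(x - t), so ||f||_inf <= 1 for every f in F.
F = [lambda x, t=t: 1.0 if x > t else -1.0 for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]

def rademacher_complexity(n, reps=400, sample=lambda: random.gauss(0.0, 1.0)):
    """Monte Carlo estimate of R_{mu,n}(F) = E sup_{f in F} |n^{-1} sum_i eps_i f(x_i)|,
    with x_i i.i.d. from mu and eps_i i.i.d. Rademacher signs."""
    total = 0.0
    for _ in range(reps):
        xs = [sample() for _ in range(n)]
        eps = [random.choice((-1.0, 1.0)) for _ in range(n)]
        total += max(abs(sum(e * f(x) for e, x in zip(eps, xs))) / n for f in F)
    return total / reps

for n in (10, 100, 1000):
    print(n, round(rademacher_complexity(n), 3))  # decays roughly like n^(-1/2)
```

For a finite, b-bounded class such as this one, standard bounds give R_{µ,n}(F) of order b·(log|F| / n)^{1/2}, consistent with the decay observed in the printed estimates.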
As is clear from the above definition, high values of the Rademacher complexity R_{µ,n}(F) mean that F is rich enough to contain functions that can closely capture, on average, even the behavior of pure-noise vectors. Hence, if R_{µ,n}(F) is bounded away from zero for every n, then F might be excessively rich for statistical purposes. Conversely, if R_{µ,n}(F) goes to zero as n diverges, then F is a more parsimonious class. Lemma 2.6 formalizes this intuition.

LEMMA 2.6 (Theorem 4.10 and Proposition 4.12 in Wainwright (2019)). Let x_{1:n} be i.i.d. from some distribution µ ∈ P(Y). Then, for any b-uniformly bounded class F, i.e., any class F of functions f such that ‖f‖_∞ ≤ b, any integer n ≥ 1 and scalar δ ≥ 0, it holds that

P_{x_{1:n}}[ D_F(µ̂_{x_{1:n}}, µ) ≤ 2 R_{µ,n}(F) + δ ] ≥ 1 − exp(−nδ²/2b²),   (2)

and

P_{x_{1:n}}[ D_F(µ̂_{x_{1:n}}, µ) ≥ R_{µ,n}(F)/2 − sup_{f∈F} |E(f)| / (2 n^{1/2}) − δ ] ≥ 1 − exp(−nδ²/2b²).

Lemma 2.6 provides bounds for the probability that D_F(µ̂_{x_{1:n}}, µ) takes values below a multiple and above a fraction of the Rademacher complexity of F. Recalling the previous discussion on concentration of ABC posteriors, this result is crucial to study the convergence of both µ̂_{z_{1:n}} and µ̂_{y_{1:n}} to µ_θ and µ∗, respectively. Our theory in Section 3 proves that this is possible when R_n(F) := sup_{µ∈P(Y)} R_{µ,n}(F) goes to 0 as n → ∞. According to Lemma 2.6, this is a necessary and sufficient condition for D_F(µ̂_{x_{1:n}}, µ) → 0 as n → ∞, in P_{x_{1:n}}-probability, uniformly over P(Y). See Appendix C in the Supplementary Material for extensions to non-i.i.d. settings.

3. Asymptotic properties of discrepancy-based ABC posterior distributions.
As anticipated in Section 2, the asymptotic results we derive leverage Lemma 2.6 to connect the properties of the ABC posterior under an IPS discrepancy D_F with the behavior of the Rademacher complexity of the associated family F. Section 3.1 clarifies the importance of this bridge by studying the limiting properties of the ABC posterior when ε_n = ε is fixed and n → ∞, including in yet-unexplored situations where convergence guarantees for the empirical measures and for the discrepancy among such measures are not necessarily available. The theoretical results we derive in this case suggest that the use of discrepancies whose Rademacher complexity decays to zero in the limit is a sufficient condition to obtain strong convergence guarantees. On the basis of these findings, Section 3.2 focuses on the regime ε_n → ε∗ = inf_{θ∈Θ} D_F(µ_θ, µ∗) as n → ∞, and derives unified and informative concentration bounds that hold uniformly over P(Y) and are based on more constructive conditions than those employed in the available discrepancy-specific theory. More specifically, to prove our theoretical results in Sections 3.1–3.2, we rely on some or all of the following assumptions:

(I) the observed data y_{1:n} are i.i.d. from a data generating process µ∗;
(II) there exist L, c_π ∈ R_+ such that, for ε̄ small, π({θ : D_F(µ_θ, µ∗) ≤ ε∗ + ε̄}) ≥ c_π ε̄^L;
(III) F is b-uniformly bounded, i.e., there exists b ∈ R_+ such that ||f||_∞ ≤ b for any f ∈ F;
(IV) R_n(F) := sup_{µ∈P(Y)} R_{µ,n}(F) goes to zero as n → ∞.

Condition (I) is the only assumption made on the data generating process and is present in, e.g., Nguyen et al. (2020) and in the supplementary materials of Bernton et al. (2019). Although the theory we derive in Appendix C of the Supplementary Material relaxes (I) to study convergence and concentration also beyond the i.i.d. context, it shall be emphasized that some of the assumptions considered in the literature may not hold even in i.i.d. settings. Hence, an improved understanding of the ABC properties under (I) is crucial to clarify the range of applicability and potential limitations of available existence theory under more complex, possibly non-i.i.d. regimes. In fact, as shown within Section 3.1, certain discrepancies may yield posteriors that are either not well-defined in the limit or lack strong convergence guarantees even in i.i.d. settings. Assumption (II) is standard in Bayesian asymptotics and ABC theory (Bernton et al., 2019; Nguyen et al., 2020; Frazier, 2020), and requires that sufficient prior mass is placed on those parameter values yielding models µ_θ close to µ∗, under D_F. Finally, (III)–(IV) provide conditions on F ensuring that D_F yields meaningful concentration bounds, leveraging Lemma 2.6. Crucially, (III)–(IV) are imposed directly on the user-selected discrepancy D_F, rather than on the data generating process or on the statistical model, thereby providing general constructive conditions that are more realistic to verify in practice than regularity assumptions on the unknown µ∗ or on {µ_θ : θ ∈ Θ ⊆ R^p}, while guiding the choice of D_F. Notice that (III)–(IV) depend, in principle, on P(Y). Nonetheless, as clarified within Section 3.3, available bounds on the Rademacher complexity ensure that these two conditions hold for the relevant IPS discrepancies in Examples 2.2–2.4 either under no constraints on P(Y), thereby circumventing such a dependence, or for suitable spaces Y (e.g., bounded ones).
Unlike constraints on the unknown $\mu^*$ and on $\{\mu_\theta : \theta \in \Theta \subseteq \mathbb{R}^p\}$, conditions on $\mathcal{Y}$ can be verified in practice; see Examples 3.5–3.7, and refer to the Supplementary Material for the analysis of other relevant IPS discrepancies. Moreover, Assumptions (III) and (IV) are generally not stronger than those implicit in the current theory on discrepancy-based ABC, and are inherently related to the notion of uniform Glivenko–Cantelli classes (see, e.g., Wainwright, 2019, Chapter 4), arguably a minimal sufficient requirement for establishing convergence and concentration properties that, unlike those currently derived, hold uniformly over $\mathcal{P}(\mathcal{Y})$. This latter point is further clarified in Propositions 4.3 and 4.4, which state concentration results for MMD with unbounded kernel in $\mathbb{R}^d$ (which also covers ABC with unbounded summary statistics) and for the Wasserstein-1 distance in $\mathbb{R}^d$, respectively. Although these two discrepancies meet (III)–(IV) for suitable constraints on $\mathcal{Y}$, in general unbounded $\mathcal{Y}$ settings these conditions are no longer guaranteed for such discrepancies. In this case, Propositions 4.3 and 4.4 clarify that concentration properties can still be derived, but at the expense of additional constraints on $\mu^*$ and $\{\mu_\theta : \theta \in \Theta \subseteq \mathbb{R}^p\}$, which yield results that hold uniformly only over a suitable subset of $\mathcal{P}(\mathcal{Y})$, rather than over the entire $\mathcal{P}(\mathcal{Y})$. It shall also be emphasized that the convergence theory derived in Section 3.1 does not necessarily require a vanishing Rademacher complexity to obtain informative results. However, it also clarifies that, when this condition is verified, the limiting pseudo-posterior inherits the same threshold-control on $\mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*)$ as the one established by rejection ABC on $\mathcal{D}_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \hat\mu_{y_{1:n}})$.
As anticipated above, Sections 3.1–3.2 consider the two scenarios where the sample size $n$ diverges to infinity, while the ABC threshold $\varepsilon_n$ is either fixed at $\varepsilon$ or progressively shrinks to $\varepsilon^* = \inf_{\theta \in \Theta} \mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*)$. If the model is well-specified, i.e., $\mu^* = \mu_{\theta^*}$ for some, possibly not unique, $\theta^* \in \Theta$, then $\varepsilon^* = 0$. Although the regime characterized by $\varepsilon_n \to 0$ with fixed $n$ is also of interest, this setting can be readily addressed under the unified IPS class via a direct adaptation of available results in, e.g., Jiang, Wu and Wong (2018) and Bernton et al. (2019); see also Miller and Dunson (2019) for results in the context of coarsened posteriors under the regime with $n \to \infty$ and fixed $\varepsilon_n = \varepsilon$, related to those in Jiang, Wu and Wong (2018).

3.1. Limiting behavior with fixed $\varepsilon_n = \varepsilon$ and $n \to \infty$. Current ABC theory for the setting with $\varepsilon_n = \varepsilon$ fixed and $n \to \infty$ studies convergence of the discrepancy-based ABC posterior
$$\pi_n^{(\varepsilon)}(\theta) \propto \pi(\theta) \int_{\mathcal{Y}^n} \mathbb{1}\{\mathcal{D}_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \hat\mu_{y_{1:n}}) \le \varepsilon\}\, \mu_\theta^n(dz_{1:n}),$$
under the key assumption that $\mathcal{D}_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \hat\mu_{y_{1:n}}) \to \mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*)$ almost surely as $n \to \infty$ (see, e.g., Jiang, Wu and Wong, 2018, Theorem 1). When such a condition is met, it can be shown that
$$\lim_{n \to \infty} \pi_n^{(\varepsilon)}(\theta) \propto \pi(\theta)\, \mathbb{1}\{\mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*) \le \varepsilon\}.$$
This means that the limiting ABC posterior coincides with the prior constrained to the support $\{\theta : \mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*) \le \varepsilon\}$, where $\varepsilon$ is the same threshold employed for $\mathcal{D}_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \hat\mu_{y_{1:n}})$. Although this is a key result, verifying the convergence assumption behind the above finding can be difficult for several discrepancies (e.g., Jiang, Wu and Wong, 2018). More crucially, such theory fails to provide informative results in those situations where $\mathcal{D}_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \hat\mu_{y_{1:n}})$ is not guaranteed to converge to $\mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*)$.
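The pseudo-posterior $\pi_n^{(\varepsilon)}$ above is targeted by plain rejection ABC: draw $\theta$ from the prior, simulate $z_{1:n}$ from $\mu_\theta$, and accept whenever the discrepancy between the two empirical measures is at most $\varepsilon$. A minimal sketch follows; the Gaussian location model, uniform prior and single bounded summary $f(x) = \tanh(x)$ are our own toy choices for illustration, not the paper's examples.

```python
import math
import random

def d_summary(xs, ys):
    """IPS discrepancy induced by the single bounded summary f(x) = tanh(x):
    D_F(mu_hat_x, mu_hat_y) = |mean_i f(x_i) - mean_j f(y_j)|."""
    mf = lambda zs: sum(math.tanh(z) for z in zs) / len(zs)
    return abs(mf(xs) - mf(ys))

def rejection_abc(y_obs, simulate, prior_sample, eps, n_draws=2000, rng=None):
    """Rejection ABC: keep theta whenever the discrepancy between the
    synthetic and observed empirical measures is at most eps."""
    rng = rng or random.Random(0)
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample(rng)
        z = simulate(theta, len(y_obs), rng)
        if d_summary(z, y_obs) <= eps:
            accepted.append(theta)
    return accepted

# Toy example: y ~ N(theta*, 1) with theta* = 1, prior Uniform(-3, 3).
rng = random.Random(42)
y_obs = [rng.gauss(1.0, 1.0) for _ in range(200)]
simulate = lambda th, n, r: [r.gauss(th, 1.0) for _ in range(n)]
prior = lambda r: r.uniform(-3.0, 3.0)
post = rejection_abc(y_obs, simulate, prior, eps=0.05, rng=random.Random(7))
p_n = len(post) / 2000  # Monte Carlo estimate of the acceptance probability
print(p_n, sum(post) / len(post))  # accepted draws cluster near theta* = 1
```

The accepted draws approximate $\pi_n^{(\varepsilon)}$, and the acceptance fraction estimates the probability $p_n$ whose limiting behavior is studied in Theorem 3.1.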
In fact, under this setting, the asymptotic properties of discrepancy-based ABC posteriors have been overlooked, and it is not even clear whether a well-defined $\pi_n^{(\varepsilon)}$ exists in the limit. Indeed, for specific choices of the discrepancy and threshold, the ABC posterior may not be well-defined even for fixed $n$. For example, when $\mathcal{D}_{\mathcal{F}}$ is the TV distance (see Appendix A in the Supplementary Material), then $\mathcal{D}_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \hat\mu_{y_{1:n}}) = 1$ almost surely whenever $z_{1:n}$ and $y_{1:n}$ are drawn from two continuous distributions $\mu_\theta$ and $\mu^*$, respectively. Hence, for any $\varepsilon < 1$, the ABC posterior is not defined, even if $\mu_\theta$ and $\mu^*$ are not mutually singular. Note that, as clarified in Appendix A, the Rademacher complexity of the TV distance is always bounded away from zero in such a setting. In this case and, more generally, for IPS discrepancies whose $\mathcal{R}_n(\mathcal{F})$ does not vanish, Lemma 2.6 together with the triangle inequality in (1) still allows us to derive new informative upper and lower bounds on the limiting acceptance probabilities, which further guarantee that $\mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*)$ can be bounded from above, for a sufficiently large $n$, by $\varepsilon + 4\mathcal{R}_n(\mathcal{F})$, even without assuming that $\mathcal{D}_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \hat\mu_{y_{1:n}}) \to \mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*)$ almost surely as $n \to \infty$, as in the current convergence theory. Theorem 3.1 formalizes this intuition for the whole class of ABC posteriors arising from discrepancies in the IPS family.

THEOREM 3.1. Let $\mathcal{D}_{\mathcal{F}}$ be an IPS as in Definition 2.1. Moreover, assume (I), (III), and let $c_{\mathcal{F}} = 4 \limsup_n \mathcal{R}_n(\mathcal{F})$. Then, the acceptance probability
$$p_n = \int_\Theta \int_{\mathcal{Y}^n} \mathbb{1}\{\mathcal{D}_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \hat\mu_{y_{1:n}}) \le \varepsilon\}\, \mu_\theta^n(dz_{1:n})\, \pi(d\theta)$$
of rejection ABC with discrepancy $\mathcal{D}_{\mathcal{F}}$ satisfies, for any fixed $\varepsilon > 0$,
$$\pi\{\theta : \mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*) \le \varepsilon - c_{\mathcal{F}}\} - e_n \le p_n \le \pi\{\theta : \mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*) \le \varepsilon + c_{\mathcal{F}}\} + e_n, \quad (3)$$
almost surely with respect to $y_{1:n}$ i.i.d.
$\sim \mu^*$, as $n \to \infty$, where $e_n = \exp(-\sqrt{n}/(2b^2)) = o(1)$. In particular, whenever $\varepsilon > \tilde\varepsilon + c_{\mathcal{F}}$, with $\tilde\varepsilon = \inf\{\epsilon > 0 : \pi\{\theta : \mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*) \le \epsilon\} > 0\}$, the probability $p_n$ is bounded away from $0$ for $n$ large enough. This implies that, for every fixed $\varepsilon > \tilde\varepsilon + c_{\mathcal{F}}$, the ABC posterior is always well-defined for large enough $n$, and its support is almost surely included, asymptotically, in the set $\{\theta : \mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*) \le \varepsilon + 4\mathcal{R}_n(\mathcal{F})\}$. Namely,
$$\pi_n^{(\varepsilon)}\{\theta : \mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*) \le \varepsilon + 4\mathcal{R}_n(\mathcal{F})\} \to 1,$$
for any fixed $\varepsilon > \tilde\varepsilon + c_{\mathcal{F}}$, almost surely with respect to $y_{1:n} \overset{\text{i.i.d.}}{\sim} \mu^*$, as $n \to \infty$.

Theorem 3.1 clarifies that, even when relaxing the assumptions behind current theory, the Rademacher complexity framework still allows us to obtain guarantees on the acceptance probabilities, existence and limiting support of the ABC posterior. Such a relaxation highlights the key role of the richness of $\mathcal{F}$ in driving these properties. As previously discussed, the higher $\mathcal{R}_n(\mathcal{F})$ is, the weaker are the guarantees that $\mathcal{D}_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \hat\mu_{y_{1:n}})$ is close to $\mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*)$ in the large-data limit. In fact, recalling Section 2, when $\mathcal{R}_n(\mathcal{F})$ does not vanish with $n$ there is no guarantee that the convergence of $\mathcal{D}_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \hat\mu_{y_{1:n}})$ to $\mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*)$ assumed in the literature holds. Nonetheless, the inequalities in Lemma 2.6 still yield informative results that translate these arguments into an inflation of the superset containing the support of the limiting ABC posterior, and into a more conservative choice of the threshold $\varepsilon$ to ensure that such a posterior is well-defined for large $n$. Notice that in Theorem 3.1 the dependence on the dimensions of the parameter and data spaces is embedded within $\mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*)$; see Equation (6) in the Appendix of Chérief-Abdellatif and Alquier (2022) for an example making this dependence explicit.
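The total-variation pathology noted before Theorem 3.1 is easy to reproduce: two samples from continuous distributions almost surely share no atoms, so the TV distance between their empirical measures equals $1$ even when the underlying distributions coincide. A minimal sketch, with toy Gaussian data chosen only for illustration:

```python
import random

def tv_empirical(xs, ys):
    """Total-variation distance between two empirical measures, computed as
    half the L1 distance between their atom weights."""
    support = set(xs) | set(ys)
    px = {a: xs.count(a) / len(xs) for a in support}
    py = {a: ys.count(a) / len(ys) for a in support}
    return 0.5 * sum(abs(px[a] - py[a]) for a in support)

rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(100)]
ys = [rng.gauss(0.0, 1.0) for _ in range(100)]  # same distribution as xs
tv = tv_empirical(xs, ys)
print(tv)  # equals 1 (up to float rounding): the atoms are all distinct
```

For any $\varepsilon < 1$ the indicator in the ABC kernel is then never activated, so the TV-based ABC posterior is not defined, exactly as discussed above.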
Recall that, for the ABC posterior to be well-defined in the limit (with a fixed threshold $\varepsilon$), the acceptance probability $p_n$ needs to be bounded away from zero. According to Theorem 3.1, this is guaranteed as $n \to \infty$ whenever $\varepsilon > \tilde\varepsilon + c_{\mathcal{F}}$. For instance, recalling Section 2, in misspecified models $\mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*) \ge \varepsilon^*$. As a consequence, $\varepsilon > c_{\mathcal{F}}$ is not sufficient to guarantee that $\pi\{\theta : \mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*) \le \varepsilon - c_{\mathcal{F}}\}$ in (3) is bounded away from zero (in fact, when $c_{\mathcal{F}} < \varepsilon < c_{\mathcal{F}} + \varepsilon^*$ the lower bound in (3) approaches zero from below). To achieve this result, $\varepsilon - c_{\mathcal{F}}$ needs to define a ball with radius not smaller than the one to which the prior is already guaranteed to assign strictly positive mass. This implies $\varepsilon > \tilde\varepsilon + c_{\mathcal{F}}$, with $c_{\mathcal{F}} = 4\limsup_n \mathcal{R}_n(\mathcal{F})$.

The results in Theorem 3.1 further clarify that, when $\mathcal{R}_n(\mathcal{F}) \to 0$ as $n \to \infty$, stronger and uniform convergence statements can be derived, both in correctly-specified and in misspecified models. More specifically, in this setting, the acceptance probability $p_n$ in Theorem 3.1 approaches $\pi\{\theta : \mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*) \le \varepsilon\}$ in the limit. Moreover, the ABC posterior is well-defined for any $\varepsilon > \tilde\varepsilon$, and its asymptotic support obeys the same $\varepsilon$-control as the one established by the ABC procedure on the discrepancy among the empirical distributions. These results allow us to establish the almost-sure convergence of the ABC posterior stated in Corollary 3.2.

COROLLARY 3.2. Under the conditions in Theorem 3.1, if also $\mathcal{R}_n(\mathcal{F}) \to 0$ as $n \to \infty$ (i.e., Assumption (IV) is satisfied), then, for any fixed $\varepsilon > \tilde\varepsilon$, it holds that
$$\pi_n^{(\varepsilon)}(\theta) \to \pi(\theta \mid \mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*) \le \varepsilon) \propto \pi(\theta)\, \mathbb{1}\{\mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*) \le \varepsilon\}, \quad (4)$$
almost surely with respect to $y_{1:n} \overset{\text{i.i.d.}}{\sim} \mu^*$, as $n \to \infty$.
Corollary 3.2 clarifies that, to obtain within the whole IPS class a convergence result in line with those stated in the current theory, it is not necessary to assume that $\mathcal{D}_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \hat\mu_{y_{1:n}})$ converges almost surely to $\mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*)$. In fact, it is sufficient to verify that $\mathcal{R}_n(\mathcal{F}) \to 0$ as $n \to \infty$. Crucially, this condition is imposed directly on the selected discrepancy within the IPS class, rather than on the model or the underlying data generating process, and, therefore, can be constructively verified for all the discrepancies in Examples 2.2–2.4. For instance, while Jiang, Wu and Wong (2018) are unable to provide conclusive guidelines on whether (4) holds under the Wasserstein distance and MMD, our Corollary 3.2 can be directly applied to prove convergence for both these divergences, leveraging the results and discussions in Section 3.3.

Corollary 3.2 also accounts for misspecified models. In fact, the limiting pseudo-posterior in (4) is well-defined only when $\varepsilon > \tilde\varepsilon$. For example, when $\mu^*$ is not within $\{\mu_\theta : \theta \in \Theta \subseteq \mathbb{R}^p\}$, then $\mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*) \ge \varepsilon^*$. Hence, for any $\varepsilon \le \varepsilon^*$, we have that $\mathbb{1}\{\mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*) \le \varepsilon\} = 0$ for any $\theta \in \Theta$, and the limiting pseudo-posterior is not well-defined.

Motivated by the convergence result in Corollary 3.2, the theory in Section 3.2 provides an in-depth study of the concentration properties of the ABC posterior when the tolerance threshold progressively shrinks. As is clear from (4), employing a vanishing threshold might guarantee that, in the limit, the ABC posterior concentrates all its mass around $\mu^*$, in correctly-specified settings, or around the distribution $\mu_{\theta^*}$ closest to $\mu^*$ among those in $\{\mu_\theta : \theta \in \Theta \subseteq \mathbb{R}^p\}$, in misspecified regimes. These results and the associated rates of concentration are provided in Section 3.2 by leveraging, again, Rademacher complexity theory.

3.2. Concentration as $\varepsilon_n \to \varepsilon^*$ and $n \to \infty$.
Theorem 3.3 states our main concentration result. As in related ABC theory (see, e.g., Proposition 3 in Bernton et al. (2019) and Nguyen et al. (2020)), we also leverage the triangle inequality (1). However, we crucially avoid pre-assuming convergence of $\mathcal{D}_{\mathcal{F}}(\hat\mu_{y_{1:n}}, \mu^*)$, and we do not rely on concentration inequalities for $\mathcal{D}_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \mu_\theta)$ regulated by non-explicit sequences of control functions $c_n(\theta) f_n(\bar\varepsilon_n)$ (Bernton et al., 2019) and $c_n(\theta) s_n(\bar\varepsilon_n)$ (Nguyen et al., 2020). Rather, we leverage Lemma 2.6 under the proposed Rademacher complexity framework to obtain more direct and informative results that are also guaranteed to hold uniformly over $\mathcal{P}(\mathcal{Y})$. Indeed, Lemma 2.6 ensures that both $\mathcal{D}_{\mathcal{F}}(\hat\mu_{y_{1:n}}, \mu^*)$ and $\mathcal{D}_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \mu_\theta)$ exceed $2\mathcal{R}_n(\mathcal{F})$ with vanishing probability. Therefore, when $\mathcal{R}_n(\mathcal{F}) \to 0$, we have that $\mathcal{D}_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \hat\mu_{y_{1:n}}) \approx \mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*)$, by combining (1) with
$$\mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*) \ge -\mathcal{D}_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \mu_\theta) + \mathcal{D}_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \hat\mu_{y_{1:n}}) - \mathcal{D}_{\mathcal{F}}(\hat\mu_{y_{1:n}}, \mu^*).$$
This means that, if $\mathcal{D}_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \hat\mu_{y_{1:n}})$ is small, then $\mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*)$ is also small with high probability. This clarifies the importance of Assumption (IV), which is further supported by the fact that, if $\mathcal{R}_n(\mathcal{F})$ does not shrink to zero, then, by the second inequality in Lemma 2.6, the two empirical measures $\hat\mu_{y_{1:n}}$ and $\hat\mu_{z_{1:n}}$ do not converge to $\mu^*$ and $\mu_\theta$, respectively, and hence there is no guarantee that a small $\mathcal{D}_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \hat\mu_{y_{1:n}})$ would imply a vanishing $\mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*)$.

THEOREM 3.3. Let $\bar\varepsilon_n \to 0$ as $n \to \infty$, with $n\bar\varepsilon_n^2 \to \infty$ and $\bar\varepsilon_n / \mathcal{R}_n(\mathcal{F}) \to \infty$.
Then, if $\mathcal{D}_{\mathcal{F}}$ is from the IPS class in Definition 2.1 and Assumptions (I)–(IV) hold, the ABC posterior with threshold $\varepsilon_n = \varepsilon^* + \bar\varepsilon_n$ satisfies
$$\pi_n^{(\varepsilon^* + \bar\varepsilon_n)}\Big\{\theta : \mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*) > \varepsilon^* + \frac{4\bar\varepsilon_n}{3} + 2\mathcal{R}_n(\mathcal{F}) + \Big(\frac{2b^2}{n}\log\frac{n}{\bar\varepsilon_n^L}\Big)^{1/2}\Big\} \le \frac{2 \cdot 3^L}{c_\pi n},$$
with $P_{y_{1:n}}$-probability going to $1$ as $n \to \infty$.

The proof of Theorem 3.3 can be found in Appendix D of the Supplementary Material and follows arguments similar to those used to establish the concentration results in Bernton et al. (2019) and Nguyen et al. (2020), which, in turn, extend those in Frazier et al. (2018). However, as mentioned above, those proofs are specific to a single discrepancy, pre-assume the convergence of $\mathcal{D}_{\mathcal{F}}(\hat\mu_{y_{1:n}}, \mu^*)$, and rely on concentration inequalities for $\mathcal{D}_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \mu_\theta)$ that depend on non-explicit sequences of control functions. Theorem 3.3 overcomes these issues and proves a unified theory based on the single concentration inequality in Lemma 2.6. This yields technical differences in the proof and, more importantly, introduces a novel and broadly applicable perspective for the analysis of the concentration properties of discrepancy-based ABC posteriors. Note that in Theorem 3.3 the constant $b$ can typically be set equal to $1$, either by definition or upon normalization of the class of $b$-uniformly bounded functions. Moreover, as clarified in the Supplementary Material, Theorem 3.3 also holds when replacing $n/\bar\varepsilon_n^L$ and $c_\pi n$ with $M_n/\bar\varepsilon_n^L$ and $c_\pi M_n$, respectively, for any sequence $M_n > 1$. Nonetheless, to ensure concentration, it suffices to let $M_n = n$. In such a case, the quantities $4\bar\varepsilon_n/3$, $2\mathcal{R}_n(\mathcal{F})$, $[(2b^2/n)\log(n/\bar\varepsilon_n^L)]^{1/2}$ and $2 \cdot 3^L/(c_\pi n)$ converge to $0$ as $n \to \infty$, under the settings of Theorem 3.3. This implies that the ABC posterior asymptotically concentrates around those $\theta$ values yielding a $\mu_\theta$ within discrepancy $\varepsilon^*$ from $\mu^*$.
Appendix B in the Supplementary Material translates these concentration results from the space of distributions to the space of parameters.

In contrast to available theory, the concentration result in Theorem 3.3 holds uniformly over $\mathcal{P}(\mathcal{Y})$, and replaces the currently-employed non-explicit control functions with a known and well-studied quantity, i.e., the Rademacher complexity. Notice that, although Theorem 3.3 provides a unified statement for all the IPS discrepancies, the bound we derive depends on $\mathcal{R}_n(\mathcal{F})$, which is specific to each discrepancy $\mathcal{D}_{\mathcal{F}}$ and plays a fundamental role in controlling the rate of concentration of the ABC posterior. In particular, to make the bound as tight as possible, we must choose $\bar\varepsilon_n$ such that $4\bar\varepsilon_n/3$ and $[(2b^2/n)\log(n/\bar\varepsilon_n^L)]^{1/2}$ are of the same order. Neglecting all terms in $\log\log n$, such a choice leads to setting $\bar\varepsilon_n$ of the order $[\log(n)/n]^{1/2}$. In this case, the constraint $\bar\varepsilon_n/\mathcal{R}_n(\mathcal{F}) \to \infty$ may not be satisfied, thereby requiring a larger $\bar\varepsilon_n$, such as $\mathcal{R}_n(\mathcal{F})\log\log(n)$. Summarizing, when $\bar\varepsilon_n = \max\{[\log(n)/n]^{1/2}, \mathcal{R}_n(\mathcal{F})\log\log(n)\}$ it follows, under the conditions of Theorem 3.3, that
$$\pi_n^{(\varepsilon^* + \bar\varepsilon_n)}(\{\theta : \mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*) > \varepsilon^* + O(\bar\varepsilon_n)\}) \le 2 \cdot 3^L/(c_\pi n).$$
Notice that the conditions $n\bar\varepsilon_n^2 \to \infty$ and $\bar\varepsilon_n/\mathcal{R}_n(\mathcal{F}) \to \infty$ do not allow setting $\bar\varepsilon_n < 1/\sqrt{n}$. Although this regime is of interest, we are not aware of explicit results in the discrepancy-based ABC literature for such a setting. In fact, available studies rely on similar restrictions for some sequence of functions $f_n(\bar\varepsilon_n)$ which is not made explicit, except for specific examples that still point toward setting $\bar\varepsilon_n = [\log(n)/n]^{1/2} > 1/\sqrt{n}$ (e.g., Bernton et al., 2019, supplementary materials).
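For concreteness, the schedule $\bar\varepsilon_n = \max\{[\log(n)/n]^{1/2}, \mathcal{R}_n(\mathcal{F})\log\log(n)\}$ can be tabulated under the bound $\mathcal{R}_n(\mathcal{F}) \le n^{-1/2}$, which holds, e.g., for MMD with a kernel bounded by $1$ (see Example 3.6). The sketch below is our own numerical illustration checking that the two side conditions of Theorem 3.3, $n\bar\varepsilon_n^2 \to \infty$ and $\bar\varepsilon_n/\mathcal{R}_n(\mathcal{F}) \to \infty$, indeed grow along this schedule.

```python
import math

def eps_bar(n, rademacher):
    """Threshold schedule suggested after Theorem 3.3:
    eps_bar_n = max{ sqrt(log(n)/n), R_n(F) * log(log(n)) }."""
    return max(math.sqrt(math.log(n) / n),
               rademacher(n) * math.log(math.log(n)))

# For MMD with a kernel bounded by 1, R_n(F) <= n^{-1/2} (Example 3.6).
r = lambda n: n ** -0.5
for n in (10**2, 10**4, 10**6):
    e = eps_bar(n, r)
    # n * e^2 and e / R_n(F) must both diverge for Theorem 3.3 to apply.
    print(n, e, n * e**2, e / r(n))
```

Along this schedule the threshold itself vanishes while both side quantities grow (here like $\log n$ and $\sqrt{\log n}$, respectively), so the concentration statement applies.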
Faster rates for the ABC threshold have been considered by Li and Fearnhead (2018) in the context of summary-based ABC, but with a substantially different theoretical focus and asymptotic regime relative to the one considered here. Recalling, e.g., Theorem 1 in Frazier et al. (2018), within our theoretical setting the rate $1/\sqrt{n}$ (up to log terms) cannot be improved for summary-based ABC; see also Section 4.1 for additional discussion. It shall also be emphasized that $\bar\varepsilon_n < 1/\sqrt{n}$ would not yield a faster concentration rate in Theorem 3.3, because of the terms $2\mathcal{R}_n(\mathcal{F})$ and $[(2b^2/n)\log(n/\bar\varepsilon_n^L)]^{1/2}$ in the bound.

Theorem 3.3 holds under both well-specified and misspecified models. In the former case, $\mu^* = \mu_{\theta^*}$ for some $\theta^* \in \Theta$. Therefore, $\varepsilon^* = 0$ and the ABC posterior concentrates around those $\theta$ yielding $\mu_\theta = \mu^*$. Conversely, when the model is misspecified, the ABC posterior concentrates on those $\theta$ yielding a $\mu_\theta$ within discrepancy $\varepsilon^*$ from $\mu^*$. Since $\varepsilon^* = \inf_{\theta \in \Theta} \mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*)$, the ABC posterior will concentrate on those $\theta$ such that $\mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*)$ is arbitrarily close to $\varepsilon^*$. Example 3.4 provides an explicit bound for $\varepsilon^*$ in a simple yet noteworthy class of misspecified models.

EXAMPLE 3.4 (Huber contamination model). In this model, with probability $1 - \alpha_n$ the data are from a distribution $\mu_{\theta^*}$ belonging to the assumed model $\{\mu_\theta : \theta \in \Theta \subseteq \mathbb{R}^p\}$, while with probability $\alpha_n$ they arise from a contaminating distribution $\mu_C$. Therefore, the data generating process is $\mu^* = (1 - \alpha_n)\mu_{\theta^*} + \alpha_n \mu_C$, with $\alpha_n \in [0, 1)$ controlling the amount of contamination. In such a context, Definition 2.1 and Assumption (III) imply
$$\varepsilon^* \le \mathcal{D}_{\mathcal{F}}(\mu_{\theta^*}, \mu^*) = \mathcal{D}_{\mathcal{F}}(\mu_{\theta^*}, (1 - \alpha_n)\mu_{\theta^*} + \alpha_n \mu_C) = \alpha_n \mathcal{D}_{\mathcal{F}}(\mu_{\theta^*}, \mu_C) \le 2b\alpha_n.$$
By plugging this bound into Theorem 3.3, we obtain the same statement with $\varepsilon^*$ replaced by $2b\alpha_n$, where $\alpha_n \in [0, 1)$ is the amount of contamination.
This means that the ABC posterior asymptotically contracts in a neighborhood of the contaminated model $\mu^*$ of radius at most $2b\alpha_n$. The previous bound on $\varepsilon^*$, combined with a triangle inequality, also implies
$$\mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*) \ge \mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu_{\theta^*}) - \mathcal{D}_{\mathcal{F}}(\mu_{\theta^*}, \mu^*) \ge \mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu_{\theta^*}) - 2b\alpha_n.$$
Therefore, replacing $\mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*)$ with $\mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu_{\theta^*})$ guarantees concentration also around the uncontaminated model $\mu_{\theta^*}$, with $\varepsilon^*$ replaced by $2b\alpha_n + 2b\alpha_n = 4b\alpha_n$, thus ensuring robustness to Huber contamination.

3.3. Validity of the assumptions. The main theorems in Sections 3.1–3.2 (i.e., Theorems 3.1 and 3.3) leverage Assumptions (I)–(IV). As anticipated in Section 3, Assumption (I) is useful to formally clarify the range of applicability, along with the possible limitations, of the current existence theory even in simple i.i.d. settings, and it will be relaxed in Appendix C of the Supplementary Material to study uniform convergence and concentration of discrepancy-based ABC posteriors beyond the i.i.d. context. Assumption (II) is not specific to our framework. Rather, it defines a standard minimal requirement routinely employed in Bayesian asymptotics and ABC theory (e.g., Bernton et al., 2019; Nguyen et al., 2020; Frazier, 2020). Conversely, Assumptions (III)–(IV) provide sufficient conditions on the IPS discrepancy $\mathcal{D}_{\mathcal{F}}$ to obtain guarantees of uniform convergence and concentration under the proposed Rademacher complexity perspective. These two conditions essentially replace the pre-assumed convergence of $\mathcal{D}_{\mathcal{F}}(\hat\mu_{y_{1:n}}, \mu^*)$ and the non-explicit bounds within the concentration inequalities for $\mathcal{D}_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \mu_\theta)$ leveraged by the current discrepancy-specific theory.
While these latter assumptions may implicitly require regularity conditions on $\{\mu_\theta : \theta \in \Theta \subseteq \mathbb{R}^p\}$ and on the unknown data generating process $\mu^*$, Assumptions (III)–(IV) are made directly on the user-selected discrepancy $\mathcal{D}_{\mathcal{F}}$. Leveraging the available bounds on the Rademacher complexity, these two conditions crucially allow us to state results that hold uniformly over $\mathcal{P}(\mathcal{Y})$. Notice that, to derive these uniform convergence and concentration properties, an arguably minimal sufficient requirement is that $\mathcal{F}$ is a uniform Glivenko–Cantelli class (see, e.g., Dudley, Giné and Zinn, 1991, Proposition 10), i.e., for $x_{1:n}$ i.i.d. from $\mu \in \mathcal{P}(\mathcal{Y})$,
$$\sup_{\mu \in \mathcal{P}(\mathcal{Y})} \sup_{f \in \mathcal{F}} \Big| \frac{1}{n}\sum_{i=1}^n f(x_i) - \int_{\mathcal{Y}} f\, d\mu \Big| = \sup_{\mu \in \mathcal{P}(\mathcal{Y})} \mathcal{D}_{\mathcal{F}}(\hat\mu_{x_{1:n}}, \mu) \to 0, \quad (5)$$
in $P_{x_{1:n}}$-probability as $n \to \infty$. In fact, as already discussed within Section 2, the lack of uniform convergence guarantees for both $\mathcal{D}_{\mathcal{F}}(\hat\mu_{y_{1:n}}, \mu^*)$ and $\mathcal{D}_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \mu_\theta)$ would fail to ensure that the control established by the ABC threshold on $\mathcal{D}_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \hat\mu_{y_{1:n}})$ applies, asymptotically, to $\mathcal{D}_{\mathcal{F}}(\mu_\theta, \mu^*)$, uniformly over $\mathcal{P}(\mathcal{Y})$. Interestingly, although the uniform Glivenko–Cantelli property in (5) might seem more general and weaker than Assumptions (III)–(IV), by the upper and lower bounds in Lemma 2.6, along with the subsequent discussion, it follows immediately that (5) is exactly equivalent to Assumption (IV), under (III); see also Chapter 4 in Wainwright (2019). Regarding (III), notice that, as discussed in Dudley, Giné and Zinn (1991), when $\mathcal{F}$ is a uniform Glivenko–Cantelli class, then $\bar{\mathcal{F}} := \{\bar f(\cdot) := f(\cdot) - \inf_x f(x),\ f \in \mathcal{F}\}$ is uniformly bounded and $\mathcal{D}_{\bar{\mathcal{F}}} = \mathcal{D}_{\mathcal{F}}$. Indeed, for any $f \in \mathcal{F}$,
$$\int \bar f\, d\mu_1 - \int \bar f\, d\mu_2 = \int [f(x) - \inf_x f(x)]\, d\mu_1 - \int [f(x) - \inf_x f(x)]\, d\mu_2 = \int f\, d\mu_1 - \int f\, d\mu_2.$$
Thus, when the uniform Glivenko–Cantelli property in (5) holds, it is always possible to re-define $\mathcal{F}$, without affecting $\mathcal{D}_{\mathcal{F}}$, in order to ensure that (III) is verified, and hence also (IV), as a consequence of the above discussion.

The above connection clarifies that Assumptions (III)–(IV) are arguably at the core of the uniform convergence and concentration properties of discrepancy-based ABC posteriors. Moreover, although (5) is inherently related to (III)–(IV), such a uniform Glivenko–Cantelli property only states a convergence-in-probability result, which can be crucially refined through the notion of Rademacher complexity under the more precise concentration inequalities in Lemma 2.6. Recalling the theoretical results in Sections 3.1 and 3.2, this allows us not only to state convergence and concentration of specific ABC posteriors, but also to clarify the factors governing these limiting properties and possibly derive the associated rates.

As clarified in Examples 3.5–3.7, Assumptions (III)–(IV) can generally be verified for the key IPS discrepancies presented in Examples 2.2–2.4, leveraging known upper bounds on the Rademacher complexity, along with connections between this measure and other well-studied quantities in statistical learning theory. See, in particular, Chapter 4.3 of Wainwright (2019) for an overview of several useful techniques for upper-bounding the Rademacher complexity via, e.g., the notions of polynomial discrimination and VC dimension. The validity of (III)–(IV) for two other instances of the IPS class (i.e., the total variation distance and the Kolmogorov–Smirnov distance) is discussed in Appendix A within the Supplementary Material. Notice that, albeit interesting, these two discrepancies are less common in ABC implementations relative to the Wasserstein distance, MMD and summary-based distances.
EXAMPLE 3.5 (Wasserstein-1 distance). When $\mathcal{Y}$ is a bounded subset of $\mathbb{R}^d$, Assumptions (III)–(IV) hold without further constraints under the Wasserstein-1 distance. In particular, (III) follows immediately from the definition $\mathcal{F} = \{f : \|f\|_L \le 1\}$, together with the fact that the diameter of $\mathcal{Y}$ is finite in this case (e.g., Villani, 2021, Remark 1.15). Assumption (IV) is instead a direct consequence of the bounds in Sriperumbudur et al. (2012). Although it would be desirable to remove such a constraint on $\mathcal{Y}$, it shall be emphasized that this condition is ubiquitous in state-of-the-art concentration results for empirical measures under the Wasserstein distance that are guaranteed to hold uniformly over $\mathcal{P}(\mathcal{Y})$ (e.g., Talagrand, 1994; Sriperumbudur et al., 2012; Ramdas, Trillos and Cuturi, 2017; Weed and Bach, 2019). One possibility to preserve (III)–(IV) under the Wasserstein-1 distance beyond bounded $\mathcal{Y}$ is to consider a variable transformation via a monotone function $g(\cdot)$ (e.g., the logistic transform) mapping from $\mathcal{Y} = \mathbb{R}^d$ to a bounded subset of $\mathbb{R}^d$. In the original unbounded $\mathcal{Y}$, this transformation induces a Wasserstein-1 distance based on a bounded metric $\bar\rho(x, x') = \rho(g(x), g(x'))$. As such, when $\mathcal{Y}$ is not a bounded subset of $\mathbb{R}^d$, defining $\mathcal{F}$ as the class of Lipschitz functions with respect to $\bar\rho(x, x') = \rho(g(x), g(x'))$ satisfies (III) and (IV). However, this requires care and further research on how $g(\cdot)$ affects the properties of the discrepancy. Alternatively, as shown in Proposition 4.4 within Section 4.2, concentration results for the Wasserstein-1 distance in unbounded $\mathcal{Y} \subseteq \mathbb{R}^d$ can still be derived, but at the expense of some restrictions on $\mu \in \mathcal{P}(\mathcal{Y})$.

EXAMPLE 3.6 (MMD). The properties of MMD inherently depend on the selected kernel $k(\cdot, \cdot)$.
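As a brief computational aside on Example 3.5: for $d = 1$ and two empirical measures with the same number of atoms, the Wasserstein-1 distance is obtained exactly by matching sorted samples, since the monotone coupling is optimal on $\mathbb{R}$. A minimal sketch, with toy uniform samples chosen only for illustration:

```python
import random

def wasserstein1_1d(xs, ys):
    """Wasserstein-1 distance between two empirical measures on R with the
    same number of atoms: the monotone (sorted) matching is optimal in 1d."""
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

rng = random.Random(0)
xs = [rng.uniform(0.0, 1.0) for _ in range(500)]
ys = [rng.uniform(0.2, 1.2) for _ in range(500)]  # same law shifted by 0.2
w = wasserstein1_1d(xs, ys)
print(w)  # approximately recovers the population distance 0.2
```

On a bounded support, as here, Assumptions (III)–(IV) hold, and such empirical distances concentrate around their population counterparts uniformly over $\mathcal{P}(\mathcal{Y})$.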
This is evident from the inequalities $\mathcal{R}_{\mu,n}(\mathcal{F}) \le [\mathbb{E}_x k(x, x)/n]^{1/2}$, with $x \sim \mu$ (see, e.g., Lemma 22 in Bartlett and Mendelson, 2002), and $|f(x)| \le [k(x, x)]^{1/2}\|f\|_{\mathcal{H}}$ for every $x \in \mathcal{Y}$ (see, e.g., Equation 16 in Hofmann, Schölkopf and Smola, 2008). By these two inequalities, all bounded kernels, including, e.g., the commonly-implemented Gaussian $\exp(-\|x - x'\|^2/\sigma^2)$ and Laplace $\exp(-\|x - x'\|/\sigma)$ ones, ensure that Assumptions (III) and (IV) are always met without requiring additional regularity conditions on $\mu \in \mathcal{P}(\mathcal{Y})$, nor constraints on $\mathcal{Y}$. Instead, when $k(\cdot, \cdot)$ is unbounded, the inequality $\mathcal{R}_{\mu,n}(\mathcal{F}) \le [\mathbb{E}_x k(x, x)/n]^{1/2}$ is only informative for those $\mu$ such that $\mathbb{E}_x k(x, x) < \infty$, with $x \sim \mu$, whereas the bound on $|f(x)|$ does not guarantee that $\mathcal{F}$ is a uniformly bounded class in general, unless additional conditions are imposed. Due to the relevance of these MMD instances, and the direct connections with classical summary-based ABC implementations leveraging unbounded summaries, Proposition 4.3 in Section 4.1 derives specific theory proving that concentration results similar to those in Theorem 3.3 can be obtained, under different assumptions, including conditions on $\mu \in \mathcal{P}(\mathcal{Y})$, also for unbounded kernels.

EXAMPLE 3.7 (Summary-based distance). As discussed in Example 2.4, classical ABC implementations relying on a finite set of summaries $f_1, \ldots, f_K$ with $K < \infty$ can be seen as a special case of MMD by letting $f(x) = [f_1(x), \ldots, f_K(x)]$ and $k(x, x') = \langle f(x), f(x') \rangle$. Leveraging this bridge and the results for MMD discussed in Example 3.6, it is clear that, if $\sup_{x \in \mathcal{Y}} \langle f(x), f(x) \rangle$ is finite (i.e., the induced kernel is bounded), then (III) and (IV) are satisfied without requiring regularity conditions on $\mu$ or constraints on $\mathcal{Y}$.
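The bridge in Example 3.7 can be verified numerically: with the induced kernel $k(x, x') = \langle f(x), f(x') \rangle$, the (biased, V-statistic) empirical MMD coincides, by bilinearity, with the Euclidean distance between the empirical summary vectors. A minimal sketch; the bounded summaries $\tanh$ and $\sin$ are our own toy choices, not the paper's examples.

```python
import math
import random

def mmd_squared(xs, ys, k):
    """Biased (V-statistic) estimate of MMD^2 between empirical measures."""
    kxx = sum(k(a, b) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(k(a, b) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(k(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy

# Summary map f(x) = (f_1(x), f_2(x)) with bounded summaries (b = 1 each),
# and the induced linear kernel k(x, x') = <f(x), f(x')>.
f = lambda x: (math.tanh(x), math.sin(x))
k_summary = lambda x, y: sum(u * v for u, v in zip(f(x), f(y)))

rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(100)]
ys = [rng.gauss(0.5, 1.0) for _ in range(100)]

mmd = math.sqrt(max(mmd_squared(xs, ys, k_summary), 0.0))
sx = [sum(f(x)[j] for x in xs) / len(xs) for j in (0, 1)]
sy = [sum(f(y)[j] for y in ys) / len(ys) for j in (0, 1)]
dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(sx, sy)))
print(mmd, dist)  # the two quantities coincide up to float rounding
```

Since the summaries here are bounded, the induced kernel satisfies the boundedness condition of Example 3.6, so (III)–(IV) hold for this toy discrepancy.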
While this result clarifies that ABC with bounded summaries achieves uniform convergence and concentration, classical ABC implementations often employ unbounded summaries, such as moments, i.e., $f(x) = [x, x^2, \ldots, x^K]$. In this case (III) is not satisfied. In fact, recalling again Example 2.4, such a setting is a special case of MMD with an unbounded kernel and, hence, lacks guarantees that (III)–(IV) hold, unless further conditions are imposed, e.g., on $\mathcal{Y}$. Nonetheless, this connection also clarifies that the concentration theory we derive in Proposition 4.3 for MMD with unbounded kernels directly applies to classical ABC with unbounded summaries.

Examples 3.5–3.7 show that (III)–(IV) can be realistically verified for the key instances of the IPS class presented in Section 2, and generally hold either under no additional conditions or for suitable constraints on $\mathcal{Y}$, which can be checked directly on the basis of the support of the data analyzed. From a practical perspective, this is an important gain relative to the need of verifying more sophisticated regularity conditions on the assumed model and on the unknown data generating process. Notice that the boundedness condition on $\mathcal{Y}$ in Example 3.5 interestingly relates to Assumptions 1 and 2 of Bernton et al. (2019), which have been verified when $\mathcal{Y}$ is a bounded subset of $\mathbb{R}^d$ by, e.g., Weed and Bach (2019). In this context, our Rademacher complexity perspective further refines the important results in Bernton et al. (2019) by clarifying that it is possible to derive convergence and concentration results for Wasserstein-ABC that are regulated by a known complexity measure and hold uniformly over the space of probability measures defined on a bounded $\mathcal{Y}$. Notice that an alternative possibility to verify Assumptions 1–2 in Bernton et al.
(2019) is to leverage the results in Fournier and Guillin (2015), who replace the boundedness condition in Weed and Bach (2019) with assumptions on the existence of exponential moments; see also the supplementary materials in Bernton et al. (2019). A similar direction within our Rademacher complexity framework would be to assume that the class of functions $\mathcal{F}$ defining the Wasserstein distance admits a uniform Glivenko–Cantelli property over a subset $\tilde{\mathcal{P}}(\mathcal{Y})$ of $\mathcal{P}(\mathcal{Y})$ that comprises probability measures on $\mathcal{Y}$ meeting some suitable regularity conditions. Recalling the previous discussion on the connection between (5) and our assumptions, this would imply a relaxation of (III)–(IV) allowing the theory in Sections 3.1 and 3.2 to still hold for statistical models and data generating processes belonging to $\tilde{\mathcal{P}}(\mathcal{Y})$. Propositions 4.3 and 4.4 explore results along these lines for MMD with unbounded kernel in $\mathbb{R}^d$ (which also covers ABC with unbounded summary statistics) and for the Wasserstein-1 distance in $\mathbb{R}^d$, respectively, which do not meet (III) and (IV) when $\mathcal{Y}$ is unbounded. These two propositions clarify that concentration results can still be stated under alternative, yet related, proofs based on suitable relaxations of (III)–(IV). Nonetheless, these relaxations require checking that $\{\mu_\theta : \theta \in \Theta \subseteq \mathbb{R}^p\}$ and $\mu^*$ meet the conditions characterizing $\tilde{\mathcal{P}}(\mathcal{Y})$, which can be difficult since, again, $\mu^*$ is generally not known in practice. Conversely, when (III)–(IV) hold (e.g., for MMD with bounded kernels and the Wasserstein-1 distance on bounded subsets of $\mathbb{R}^d$), the convergence and concentration results in Sections 3.1–3.2 are guaranteed without the need to worry about the peculiar properties of the assumed model and of the generally-unknown data generating process.

4.
Asymptotic properties of ABC with maximum mean discrepancy and Wasserstein-1 distance. Sections 4.1–4.2 specialize the theory derived in Section 3 to two remarkable distances within the IPS class, namely MMD (which includes distances among summaries as a special case) and the Wasserstein-1 distance, respectively. Recalling Examples 3.5–3.7, these discrepancies are covered by the general results in Section 3 under the assumption that the kernel or the space Y is bounded. For completeness, we also extend such concentration results to situations where these two conditions are not met; see Propositions 4.3 and 4.4.

4.1. Asymptotic properties of ABC with maximum mean discrepancy. As discussed in Sections 1–3, MMD stands out as a prominent example of discrepancy within summary-free ABC. Nonetheless, an in-depth and comprehensive study of the limiting properties of MMD-ABC is still lacking. In fact, no theory is available in the original article proposing MMD-ABC methods (Park, Jitkrittum and Sejdinovic, 2016), while convergence in the fixed ε_n = ε and n → ∞ regime is explored in Jiang, Wu and Wong (2018), but without conclusive results. Nguyen et al. (2020) study convergence and concentration of ABC with the energy distance in both fixed and vanishing ε_n settings. Recalling Example 2.3, the direct correspondence between MMD and the energy distance would allow one to translate these results to the MMD framework. However, as highlighted by the same authors, the theory derived relies on difficult-to-verify existence assumptions which yield bounds depending on control functions that are not made explicit.
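One standard form of the MMD–energy correspondence recalled above states that the energy distance equals twice the squared MMD induced by the distance kernel k(x, x') = (‖x‖ + ‖x'‖ − ‖x − x'‖)/2. This identity can be checked numerically; the sketch below uses our own (hypothetical) function names and plug-in (V-statistic) estimates, and is not taken from the paper:

```python
import numpy as np

def mean_dist(x, y):
    # average Euclidean distance over all pairs (V-statistic convention)
    return np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1).mean()

def energy_distance(x, y):
    return 2 * mean_dist(x, y) - mean_dist(x, x) - mean_dist(y, y)

def mmd_sq_distance_kernel(x, y):
    # squared MMD under k(a, b) = 0.5 * (||a|| + ||b|| - ||a - b||)
    def k_mean(a, b):
        na = np.linalg.norm(a, axis=-1)
        nb = np.linalg.norm(b, axis=-1)
        d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
        return 0.5 * (na[:, None] + nb[None, :] - d).mean()
    return k_mean(x, x) + k_mean(y, y) - 2 * k_mean(x, y)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(200, 2))
y = rng.normal(0.5, 1.0, size=(200, 2))
print(energy_distance(x, y), 2 * mmd_sq_distance_kernel(x, y))  # equal up to float error
```

The agreement is exact (an algebraic identity of the pairwise-distance averages), which is why results stated for the energy distance transfer to this MMD and vice versa.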
Leveraging Theorems 3.1–3.3 and Corollary 3.2, along with the available upper bounds on the Rademacher complexity of MMD (see Example 3.6), Corollaries 4.1–4.2 substantially refine and expand the available knowledge on the limiting properties of MMD-ABC with routinely-implemented bounded kernels. Crucially, MMD with such kernels automatically satisfies (III)–(IV) without additional constraints on the model or on the data generating process. Notice that Corollaries 4.1–4.2 also hold for summary-based distances with bounded summaries, as a direct consequence of the discussion in Examples 2.4 and 3.7.

COROLLARY 4.1. Consider the MMD with a bounded kernel k(·,·) defined on R^d, where |k(x,x)| ≤ 1 for any x ∈ R^d. Then, under (I), the acceptance probability p_n of the rejection-based ABC routine employing discrepancy D_MMD and threshold ε satisfies
$$ p_n \to \pi\{\theta : D_{\mathrm{MMD}}(\mu_\theta, \mu^*) \le \varepsilon\}, $$
almost surely with respect to y_{1:n} ~ i.i.d. µ*, as n → ∞, for any µ_θ, µ* ∈ P(Y). Moreover, let ε̃ be defined as in Theorem 3.1. Then, for any fixed ε > ε̃, it holds that
$$ \pi_n^{(\varepsilon)}(\theta) \to \pi(\theta \mid D_{\mathrm{MMD}}(\mu_\theta, \mu^*) \le \varepsilon) \propto \pi(\theta)\,\mathbb{1}\{D_{\mathrm{MMD}}(\mu_\theta, \mu^*) \le \varepsilon\}, $$
almost surely with respect to y_{1:n} ~ i.i.d. µ*, as n → ∞.

The above result applies, e.g., to the routinely-implemented Gaussian k(x,x') = exp(−‖x − x'‖²/σ²) and Laplace k(x,x') = exp(−‖x − x'‖/σ) kernels on R^d, which are both bounded by 1, thereby implying ‖f‖_∞ ≤ 1 and R_n(F) ≤ n^{−1/2}. These results are also crucial to prove the concentration statement in Corollary 4.2 below.

COROLLARY 4.2. Consider the MMD with a bounded kernel k(·,·) defined on R^d, where |k(x,x)| ≤ 1 for any x ∈ R^d.
Then, for any µ_θ, µ* ∈ P(Y), under (I)–(II) and the settings of Theorem 3.3, with ε̄_n = [log(n)/n]^{1/2}, we have that
$$ \pi_n^{(\varepsilon^* + \bar\varepsilon_n)}\Big\{\theta : D_{\mathrm{MMD}}(\mu_\theta, \mu^*) > \varepsilon^* + \Big(\tfrac{10}{3} + (L+2)^{1/2}\Big) \cdot \Big(\tfrac{\log n}{n}\Big)^{1/2}\Big\} \le \frac{2 \cdot 3^L c_\pi}{n}, $$
with P_{y_{1:n}}-probability going to 1 as n → ∞.

Notice that R_n(F) ≤ n^{−1/2} implies R_n(F) log log(n) ≤ log log(n)/n^{1/2} ≤ [log(n)/n]^{1/2} for any n ≥ 1. Hence, as a consequence of the previous discussion, the concentration rate is essentially minimized by setting ε̄_n = [log(n)/n]^{1/2}. This result interestingly aligns with the optimal rate derived by Frazier et al. (2018) for summary-based ABC under sub-Gaussian assumptions. The reason for such an agreement is immediately clear after noticing that MMD with bounded kernels includes, as a special case, ABC with bounded summaries.

Corollaries 4.1–4.2 are effective examples of the potential of Theorems 3.1–3.3 and Corollary 3.2, which can be readily specialized to any discrepancy within the IPS class. For example, in the context of MMD with Gaussian and Laplace kernels, Corollary 4.2 ensures informative posterior concentration without requiring assumptions on {µ_θ : θ ∈ Θ ⊆ R^p} or µ*. Similar results can be obtained for all IPS discrepancies as long as (III)–(IV) are satisfied and R_n(F) admits explicit upper bounds. For instance, if Y is bounded, this is possible for the Wasserstein-1 distance in Example 3.5, leveraging the bounds for R_n(F) in Sriperumbudur et al. (2012).

While (III) and (IV) hold for MMD with a bounded kernel without additional assumptions, the currently-available bounds on the Rademacher complexity ensure that MMD with an unbounded kernel meets the above conditions only under specific models and data generating processes, even within the i.i.d. setting.
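For intuition on the bounded-kernel MMD appearing in Corollaries 4.1–4.2, the following minimal Python sketch (ours, not part of the paper or its R implementation) computes the plug-in MMD between two empirical measures under a Gaussian kernel; since |k| ≤ 1 the resulting discrepancy can never exceed 2:

```python
import numpy as np

def gaussian_gram(a, b, sigma2=1.0):
    # bounded kernel k(x, x') = exp(-||x - x'||^2 / sigma2), so |k(x, x)| <= 1
    sq = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / sigma2)

def mmd(x, z, sigma2=1.0):
    # plug-in estimate of D_MMD between the two empirical measures
    val = (gaussian_gram(x, x, sigma2).mean()
           + gaussian_gram(z, z, sigma2).mean()
           - 2 * gaussian_gram(x, z, sigma2).mean())
    return np.sqrt(max(val, 0.0))

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=(100, 2))
z = rng.normal(3.0, 1.0, size=(100, 2))
print(mmd(x, z))  # positive, and never larger than 2 since |k| <= 1
```

The boundedness that drives (III)–(IV) is thus visible at the level of the estimator itself: every Gram entry lies in (0, 1], so the discrepancy is uniformly controlled regardless of the underlying distributions.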
In this context, it is however possible to revisit the results for the Wasserstein case in Proposition 3 of Bernton et al. (2019) under the new Rademacher complexity framework introduced in the present article. In particular, as shown in Proposition 4.3, under MMD with an unbounded kernel, the existence Assumptions 1 and 2 in Bernton et al. (2019) can be directly related to constructive conditions on the kernel, inherently related to our Assumption (IV). This in turn yields informative concentration inequalities that are reminiscent of those in Theorem 3.3 and Corollary 4.2. Notice that these inequalities also hold for summary-based ABC with routinely-used unbounded summaries (e.g., moments), as a direct consequence of the discussion in Example 3.7.

PROPOSITION 4.3. Consider the MMD with unbounded kernel k(·,·) on R^d. Assume (I)–(II) along with (A1) E_y[k(y,y)] < ∞, (A2) ∫_Θ E_z[k(z,z)] π(dθ) < ∞, and (A3) there exist constants δ_0 > 0 and c_0 > 0 such that E_z[k(z,z)] < c_0 for any θ satisfying
$$ \big(\mathbb{E}_{z,z'}[k(z,z')] - 2\,\mathbb{E}_{z,y}[k(y,z)] + \mathbb{E}_{y,y'}[k(y,y')]\big)^{1/2} \le \varepsilon^* + \delta_0, $$
where z, z' ~ µ_θ and y, y' ~ µ*. Then, when n → ∞, ε̄_n → 0 and n ε̄_n² → ∞, for some C ∈ (0, ∞) and any M_n ∈ (0, ∞), it holds that
$$ \pi_n^{(\varepsilon^* + \bar\varepsilon_n)}\Big\{\theta : D_{\mathrm{MMD}}(\mu_\theta, \mu^*) > \varepsilon^* + \frac{4\bar\varepsilon_n}{3} + \Big(\frac{M_n}{n\,\bar\varepsilon_n^L}\Big)^{1/2}\Big\} \le \frac{C}{M_n}, $$
with P_{y_{1:n}}-probability going to 1 as n → ∞.

A popular example of unbounded kernel is provided by the polynomial one, defined as k(x,x') = (1 + a⟨x, x'⟩)^q for some integer q ∈ {2, 3, ...} and constant a > 0. Under such a kernel, it can be easily shown that if E_y[‖y‖^q] < ∞, and θ ↦ E_z(‖z‖^q) is π-integrable, then Assumptions (A1)–(A3) are satisfied.
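As an illustration of such unbounded kernels, the sketch below (our own, hypothetical naming) computes a plug-in MMD under the degree-2 polynomial kernel; the Gaussian samples used have finite moments of every order, so moment conditions of the (A1)–(A3) type hold in this toy setting:

```python
import numpy as np

def poly_gram(a, b, deg=2, scale=1.0):
    # unbounded polynomial kernel k(x, x') = (1 + scale * <x, x'>)^deg
    return (1.0 + scale * (a @ b.T)) ** deg

def mmd_poly(x, z, deg=2, scale=1.0):
    # plug-in MMD; well-behaved here since Gaussians have finite moments
    val = (poly_gram(x, x, deg, scale).mean()
           + poly_gram(z, z, deg, scale).mean()
           - 2 * poly_gram(x, z, deg, scale).mean())
    return np.sqrt(max(val, 0.0))

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=(500, 2))
z = rng.normal(1.0, 1.0, size=(500, 2))
print(mmd_poly(x, x), mmd_poly(x, z))  # the first is exactly 0 by construction
```

Unlike the Gaussian-kernel case, nothing caps the Gram entries here, which is precisely why uniform control over P(Y) fails and conditions such as (A1)–(A3) are needed.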
The latter conditions essentially require that the kernel has finite expectation under both µ* and µ_θ for suitable θ ∈ Θ̄ ⊆ Θ, and is uniformly bounded for those µ_θ close to µ*. Recalling Example 2.3 and the bound R_{µ,n}(F) ≤ [E_{x∼µ} k(x,x)/n]^{1/2}, (A1)–(A3) are inherently related to (IV), which however additionally requires these expectations to be finite for any µ ∈ P(Y). Notice that, while the concentration result in Proposition 4.3 does not make explicit the dependence on the regularity of the different moments (e.g., Frazier et al., 2018), as clarified in the polynomial kernel example, such a dependence on q is embedded within the definition of the kernel, which in turn enters the expression for D_MMD. As such, translating the results in Proposition 4.3 from our overarching focus on the space of distributions to the space of parameters would make this dependence explicit. Such a mapping is challenging to obtain analytically in general settings. Hence, further refinements along these lines are left for future research.

Before moving to the Wasserstein-1 distance, notice that, by Proposition 4.3, a sensible setting for ε̄_n in the unbounded-kernel case would be ε̄_n = (M_n/n)^{1/(2+L)}. Indeed, with this choice, (M_n/(n ε̄_n^L))^{1/2} = (M_n/n)^{(2+L−L)/(2(2+L))} = ε̄_n, so the radius in Proposition 4.3 becomes (4/3 + 1)ε̄_n. This yields
$$ \pi_n^{(\varepsilon^* + \bar\varepsilon_n)}\big(\{\theta : D_{\mathrm{MMD}}(\mu_\theta, \mu^*) > \varepsilon^* + (7/3)(M_n/n)^{1/(2+L)}\}\big) \le C/M_n, $$
which is essentially the tightest possible order of magnitude for the bound. In the unbounded-kernel setting, M_n = n would not be suitable, but any M_n → ∞ slower than n can work; e.g., M_n = n^{1/2} yields
$$ \pi_n^{(\varepsilon^* + \bar\varepsilon_n)}\big(\{\theta : D_{\mathrm{MMD}}(\mu_\theta, \mu^*) > \varepsilon^* + (7/3)(1/n)^{1/(2L+4)}\}\big) \le C/n^{1/2}. $$

4.2. Asymptotic properties of ABC with Wasserstein-1 distance. As pointed out earlier, the general results in Section 3 apply directly to the Wasserstein-1 distance (D_WASS) when the data space Y is bounded.
As clarified in Proposition 4.4, if Y is unbounded, concentration results can still be derived for D_WASS, but at the expense of regularity conditions on {µ_θ : θ ∈ Θ ⊆ R^p} and µ*. These results and the associated proof mirror those for the MMD with unbounded kernel in Proposition 4.3, and rely on exponential moment assumptions that allow to leverage the recent concentration bounds for D_WASS in Lei (2020).

PROPOSITION 4.4. Consider the Wasserstein-1 distance D_WASS (see Example 2.2) on Y = R^d, with ground distance ρ(x,x') = ‖x − x'‖. Besides (I)–(II), assume there exists a small enough c > 0 such that (A1') E_y(exp(c‖y‖)) < ∞ and (A2') sup_{µ_θ, θ∈Θ} E_z(exp(c‖z‖)) < ∞, where z ~ µ_θ and y ~ µ*. Then, when n → ∞ and ε̄_n → 0 such that n ε̄_n² → ∞ and also n^{−1/max(d,3)} ≪ ε̄_n, for some C ∈ (0, ∞) and any M_n ∈ (0, ∞), it holds that
$$ \pi_n^{(\varepsilon^* + \bar\varepsilon_n)}\Big\{\theta : D_{\mathrm{WASS}}(\mu_\theta, \mu^*) > \varepsilon^* + \frac{4\bar\varepsilon_n}{3} + \Big(\frac{1}{c'n}\log\frac{M_n}{\bar\varepsilon_n^L}\Big)^{1/2} + c_1 n^{-1/\max(d,3)}\Big\} \le \frac{C}{M_n}, $$
with P_{y_{1:n}}-probability going to 1 as n → ∞.

The results and proof of Proposition 4.4 are inherently related to those in Section 2.1 of the supplementary materials in Bernton et al. (2019). However, rather than leveraging the concentration bounds by Fournier and Guillin (2015), Proposition 4.4 exploits the more recent ones in Lei (2020), under the exponential moment assumptions (A1')–(A2') that are generally satisfied for sub-exponential random variables, including sub-Gaussian ones. We refer to Lei (2020) for a more detailed discussion of the advantages of these bounds compared to those in Fournier and Guillin (2015). Notice that, under Proposition 4.4, setting ε̄_n = n^{−1/max(d,3)} [log(n)]^{1/2} is an admissible choice to ensure concentration, but the induced rate would be worse than the one in Theorem 3.3.
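For intuition on the quantities entering Proposition 4.4, recall that in the univariate case the Wasserstein-1 distance between empirical measures of two equal-size samples reduces to the mean absolute difference of the sorted values, which makes small experiments cheap. A minimal sketch (ours, restricted to d = 1):

```python
import numpy as np

def wasserstein1_1d(x, y):
    # For equal-size univariate samples, the Wasserstein-1 distance between the
    # empirical measures is the mean absolute difference of the sorted values.
    assert x.shape == y.shape
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 2.0, 3.0])
print(wasserstein1_1d(x, y))  # 1.0: every unit of mass moves by exactly one
```

Sensitivity to the tails is also visible in this formula: moving a single sample point far away shifts the distance by an amount proportional to that displacement, in line with the exponential moment assumptions (A1')–(A2') needed on unbounded Y.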
Deriving optimal rates for the Wasserstein-1 distance in unbounded Y is not within the scope of Proposition 4.4, whose aim is, instead, to clarify that concentration beyond the settings considered in Section 3 can still be achieved, but at the cost of regularity conditions on {µ_θ : θ ∈ Θ ⊆ R^p} and µ* that do not allow to state results uniformly over P(Y). As such, we leave questions on the optimality of the bounds for the Wasserstein distance in unbounded spaces Y for future research, and refer to Bernton et al. (2019) for more in-depth and specialized results on the properties of ABC under such a distance.

5. Illustrative simulation in i.i.d. settings. Let us illustrate the theory derived in the previous sections through a simple, yet insightful, simulation in i.i.d. settings (see Appendix C in the Supplementary Material for empirical results under non-i.i.d. regimes). Note that several empirical studies have already compared the performance of summary-free ABC under different discrepancies and complex non-i.i.d. models. Recalling Drovandi and Frazier (2022), all these analyses clarify the practical feasibility of ABC based on various discrepancies, including those within the IPS class, in several i.i.d. and non-i.i.d. scenarios which require ABC procedures. This feasibility is also supported by recent software (e.g., Dutta et al., 2021).

Rather than replicating the available studies on benchmark examples, we complement current empirical evidence on summary-free ABC by focusing on a misspecified and contaminated scenario that clarifies the possible challenges in convergence and concentration encountered even in basic i.i.d. settings. As clarified in Table 1 and Figure 1, this scenario also showcases a key consequence of the novel theoretical results in Sections 2–4.
Namely, that effective IPS discrepancies with guarantees of uniform convergence and concentration are a safe and sensible choice in the absence of knowledge on the specific properties of µ*.

TABLE 1
Concentration and runtimes in seconds (for a single discrepancy evaluation) of ABC under MMD with Gaussian kernel, Wasserstein-1 distance, summary-based distance (mean) and KL divergence, for a misspecified Huber contamination model with varying α ∈ {0.05, 0.10, 0.15}. MSE = Ê_{µ*}[Ê_ABC(θ − θ0)²].

                        MSE (α = 0.05)   MSE (α = 0.10)   MSE (α = 0.15)   time
(IPS) MMD                    0.024            0.027            0.031       < 0.01"
(IPS) Wasserstein-1          0.027            0.067            0.122       < 0.01"
(IPS) summary (mean)         0.841            2.648            2.835       < 0.01"
(non-IPS) KL                 0.073            0.076            0.077       < 0.01"

[Figure 1 appears here: one panel per discrepancy (MMD, Wasserstein-1, summary (mean), KL), each displaying the ABC posterior density of θ for the three contamination levels α ∈ {0.05, 0.10, 0.15}.]

Fig 1: Graphical representation of the ABC posterior for θ under MMD with Gaussian kernel, Wasserstein-1 distance, summary-based distance (mean) and KL divergence for one simulated dataset from a misspecified Huber contamination model with varying α ∈ {0.05, 0.10, 0.15}; see also Example 3.4 for details. The red dashed line corresponds to the location parameter θ0 = 1 of the uncontaminated model.

Recalling Example 3.4, we consider, in particular, an uncontaminated bivariate Student's t distribution µ_{θ0} with 3 degrees of freedom, mean vector (1, 1), and dispersion matrix having entries σ11 = σ22 = 1 and σ12 = σ21 = 0.5. Such an uncontaminated data generating process is then perturbed with three different levels α_n = α ∈ {0.05, 0.10, 0.15} of contamination from a Student's t distribution µ_C having the same parameters as µ_{θ0}, except for the mean vector, which is set to (20, 20). As such, the data y_{1:n} from µ* = (1 − α)µ_{θ0} + αµ_C are obtained by
sampling n = 100 draws from the bivariate Student's t µ_{θ0} and then replacing (100·α)% of these draws with samples from the contaminating Student's t distribution µ_C. For Bayesian inference, we focus on the parameter θ ∈ R defining the unknown location vector [1,1]^⊺θ, and consider a misspecified bivariate Gaussian model µ_θ with mean vector [1,1]^⊺θ and known covariance matrix coinciding with that of the uncontaminated Student's t data generating process. Such a choice is interesting in providing a model that is slightly misspecified even when the data are not contaminated. Notice that, although this model does not necessarily require an ABC approach to allow Bayesian inference, as discussed above, the issues outlined in Table 1 and Figure 1 for certain discrepancies, even in such a basic example, provide a useful empirical insight that complements those in the extensive quantitative studies already available in the literature for more complex settings.

In performing ABC under the above model and for the different discrepancies of interest, we employ rejection-based ABC with m = n = 100 and a N(0, 1) prior for θ. Following the standard practice in comparing discrepancies (Drovandi and Frazier, 2022), we specify a common budget of T = 25,000 simulations and define the ABC threshold so as to retain, for every discrepancy analyzed, the 1% (i.e., 250) values of θ that generated the synthetic data closest to the observed ones under such a discrepancy. Although the theory in Section 3 can potentially guide the choice of the threshold, similar guidelines are not yet available beyond the IPS class. Hence, to ensure a fair assessment across all the discrepancies, we rely on the recommended practice in comparing the different ABC implementations.
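The rejection-ABC scheme just described can be sketched end-to-end in a few lines. The Python sketch below is ours (not the paper's R implementation): it uses MMD with a Gaussian kernel, a crude median-heuristic length-scale, and a much smaller budget than the T = 25,000 used in the study, so the numbers are only indicative:

```python
import numpy as np

rng = np.random.default_rng(3)
S = np.array([[1.0, 0.5], [0.5, 1.0]])          # dispersion of the Student's t
L_disp = np.linalg.cholesky(S)

def mvt(n, loc, df=3):
    # bivariate Student's t with dispersion S, df degrees of freedom
    g = rng.standard_normal((n, 2)) @ L_disp.T
    w = rng.chisquare(df, size=(n, 1)) / df
    return loc + g / np.sqrt(w)

def observed(n, alpha):
    # Huber contamination: a share alpha of the draws relocated to (20, 20)
    y = mvt(n, np.array([1.0, 1.0]))
    k = int(round(alpha * n))
    y[:k] = mvt(k, np.array([20.0, 20.0]))
    return y

def mmd(x, z, sigma2):
    gram = lambda a, b: np.exp(-np.sum((a[:, None] - b[None, :]) ** 2, -1) / sigma2)
    return np.sqrt(max(gram(x, x).mean() + gram(z, z).mean() - 2 * gram(x, z).mean(), 0.0))

n, T, alpha = 100, 1000, 0.10                    # reduced budget for illustration
y = observed(n, alpha)
sq = np.sum((y[:, None] - y[None, :]) ** 2, -1)
sigma2 = np.median(sq[sq > 0])                   # simple median-heuristic length-scale
chol_g = np.linalg.cholesky(3.0 * S)             # Gaussian model: covariance of the t_3
thetas = rng.standard_normal(T)                  # draws from the N(0, 1) prior
dist = np.array([mmd(y, rng.standard_normal((n, 2)) @ chol_g.T + t, sigma2) for t in thetas])
accepted = thetas[np.argsort(dist)[: T // 100]]  # retain the 1% closest, as in the paper
print(accepted.mean())                           # ABC posterior mean, expected near theta_0 = 1
```

The accepted draws approximate the MMD-ABC posterior; robustness to the contaminated points comes from the bounded kernel, which caps the influence of the outlying mass at (20, 20).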
As clarified in Table 1 and Figure 1, the discrepancies assessed are those most commonly used in ABC, namely, the Wasserstein-1 distance as in Example 2.2 (Bernton et al., 2019), MMD with Gaussian kernel as defined in Example 2.3 (Park, Jitkrittum and Sejdinovic, 2016; Nguyen et al., 2020), and a summary-based distance leveraging the sample mean as the summary statistic (see Example 2.4). For comparison, we also consider a popular discrepancy that does not belong to the IPS class, i.e., the Kullback–Leibler divergence (Jiang, Wu and Wong, 2018). All these discrepancies can be effectively implemented in R, leveraging, in particular, the libraries transport and eummd. For MMD, the choice of the length-scale parameter σ² is based on the median heuristic (Gretton et al., 2012) automatically implemented in the function mmd.

Leveraging the samples from the ABC posterior for θ under each discrepancy, we first estimate the mean squared error Ê_ABC(θ − θ0)² with respect to the location θ0 = 1 of the uncontaminated Student's t, under the different discrepancies, thereby assessing performance with a focus on a common metric on the space of parameters. Table 1 displays these error estimates, under each discrepancy and level of contamination, further averaged over 50 simulated datasets to obtain a comprehensive assessment based on replicated studies. The results in Table 1, together with the graphical representation in Figure 1 of the ABC posteriors for one simulated dataset from the contaminated model with varying α ∈ {0.05, 0.10, 0.15}, effectively illustrate an important practical consequence of the theory derived in Sections 2–4.
Namely, when one is not sure, or cannot check, whether the assumed statistical model and/or the underlying data generating process meet specific regularity conditions, effective IPS discrepancies with guarantees of uniform convergence and concentration (e.g., MMD with bounded kernel) provide a robust and safe default choice. When the level of contamination is mild (α = 0.05), Wasserstein-ABC achieves comparable concentration. This result aligns with Proposition 4.4, which however depends on the specific properties of the model and the underlying data generating process. Such a dependence is evident when the amount of contamination grows to α = 0.10 and α = 0.15. In these settings, the performance of Wasserstein-ABC tends to slightly deteriorate, mainly due to a location shift in the induced posterior. This shift is even more evident for summary-based ABC relying on the sample mean, which is not robust to location contaminations. Conversely, the Kullback–Leibler divergence preserves robustness, but at the expense of a lower concentration.

Note that all the discrepancies analyzed in Table 1 and Figure 1 are well-defined under the Student's t and Gaussian distributions considered in this simulation study. In particular, since the Student's t has 3 degrees of freedom, its mean and variance are finite. Thus, the Wasserstein-1 distance is well-defined for both the assumed model and the underlying data generating process.

Considering the running times, as displayed in Table 1, all the discrepancies under analysis can be evaluated in the order of milliseconds for a sample of size n = m = 100 on a standard laptop. This enables scalable and effective R implementations.

6. Discussion.
This article provides theoretical advancements with respect to the recent literature on the asymptotic properties of discrepancy-based ABC posteriors by connecting these properties with the behavior of the Rademacher complexity associated with the chosen discrepancy. As clarified in the article and in the Supplementary Material, although the Rademacher complexity has never been considered in ABC, this notion yields a powerful and promising perspective to derive general, informative and uniform convergence and concentration properties.

While the above contribution already provides key advancements, the proposed perspective based on Rademacher complexity has broader scope and sets the premises for additional future research. For example, as clarified in this article, any novel result and bound on the Rademacher complexity of specific discrepancies can be directly applied to ABC theory through our framework. This may yield tighter or more explicit bounds, possibly holding under milder assumptions and more general discrepancies. For instance, to our knowledge, informative bounds for the Rademacher complexity of the Wasserstein-1 distance are currently available only for bounded Y and, hence, it would be of interest to leverage future findings on the unbounded-Y case to further refine Proposition 4.4 and broaden the range of models for which our theory, when specialized to the Wasserstein-1 distance, applies. To this end, it might also be promising to explore results on local Rademacher complexities (e.g., Bartlett, Bousquet and Mendelson, 2005), along with the proposed variable transformation strategy in Example 3.5. This solution requires the choice of a mapping g(·), which clearly influences the learning properties of the induced distance and, as such, requires further investigation.
Finally, although Appendix C in the Supplementary Material provides important extensions to non-i.i.d. settings, other relaxations beyond β-mixing processes could be of interest, such as, for instance, the case of independent but not identically distributed data. This setting can be addressed via the residual reconstruction strategy (e.g., Bernton et al., 2019, Section 4.2.3), which would imply studying discrepancies among empirical distributions of residuals, for which i.i.d. assumptions are again reasonable.

Although our focus is on the currently-active convergence and concentration theory for discrepancy-based ABC, other properties, such as the accuracy in uncertainty quantification of credible intervals and the limiting shapes of ABC posteriors in correctly-specified models, have attracted interest in the context of summary-based and summary-free ABC (Frazier et al., 2018; Frazier, 2020; Wang, Kaji and Ročková, 2022). While this direction goes beyond the scope of our article, extending and unifying such results, as done for concentration properties, is also of interest.

While the IPS class is broad, it does not cover all discrepancies employed in ABC. For example, the KL divergence (Jiang, Wu and Wong, 2018) and the Hellinger distance (Frazier, 2020) are not IPS, but rather belong to the class of f-divergences. While this latter family has important differences relative to the IPS class, it is worth investigating f-divergences in the light of our results. To accomplish this goal, an option is to exploit the unified treatment of these two classes in, e.g., Agrawal and Horel (2021) and Birrell et al. (2022).
More generally, our results could also stimulate methodological and theoretical advancements in generalized likelihood-free Bayesian inference via discrepancy-based pseudo-posteriors (e.g., Bissiri, Holmes and Walker, 2016; Jewson, Smith and Holmes, 2018; Miller and Dunson, 2019; Chérief-Abdellatif and Alquier, 2020; Matsubara et al., 2022). The recent contribution by Frazier, Knoblauch and Drovandi (2024) establishes, among other key results, important connections between this latter framework and ABC that could facilitate these advancements.

Finally, notice that, for the sake of simplicity and ease of comparison with related studies, we have focused on rejection ABC, and constrained the number m of synthetic samples to be equal to the sample size n of the observed data. While these settings are standard in state-of-the-art theory (Bernton et al., 2019; Frazier, 2020), other ABC routines and alternative scenarios where m grows, e.g., sub-linearly, with n deserve further investigation. This latter regime would be of interest in settings where the simulation of synthetic data is computationally expensive.

Acknowledgments. We are grateful to the Editor, the Associate Editor and the referees for their constructive feedback, which helped us improve the preliminary version of the article.

SUPPLEMENTARY MATERIAL TO "CONCENTRATION OF DISCREPANCY-BASED APPROXIMATE BAYESIAN COMPUTATION VIA RADEMACHER COMPLEXITY"

APPENDIX A: ADDITIONAL INTEGRAL PROBABILITY SEMIMETRICS

A.1. Two additional examples of IPS discrepancies. While MMD, Wasserstein-1 and summary-based distances provide the most notable examples of IPS discrepancies employed in ABC, two other relevant IPS instances are the total variation (TV) distance and the Kolmogorov–Smirnov (KS) distance, discussed below.

EXAMPLE A.1. (Total variation distance).
Although the total variation distance is not a common choice within discrepancy-based ABC, it still provides a notable example of IPS, obtained when F is the class of measurable functions whose sup-norm is bounded by 1; i.e., F = {f : ‖f‖_∞ ≤ 1}.

EXAMPLE A.2. (Kolmogorov–Smirnov distance). When Y = R and F = {1_{(−∞,a]} : a ∈ R}, then D_F is the Kolmogorov–Smirnov distance, which can also be written as D_F(µ_1, µ_2) = sup_{x∈Y} |F_1(x) − F_2(x)|, where F_1 and F_2 are the cumulative distribution functions associated with µ_1 and µ_2, respectively.

A.2. Validity of (III)–(IV) for the TV distance and Kolmogorov–Smirnov distance. Examples A.3 and A.4 verify the validity of Assumptions (III) and (IV) under the TV distance and the Kolmogorov–Smirnov distance, respectively.

EXAMPLE A.3. (Total variation distance). The TV distance satisfies (III) by definition, but in general not Assumption (IV), unless the cardinality |Y| of Y is finite. In fact, when Y = R and µ ∈ P(Y) is continuous, the probability that there exists an index i ≠ i′ such that x_i = x_{i′} is zero. Hence, with probability 1, for any vector ϵ_{1:n} of Rademacher variables there always exists a function f_ϵ from Y to {0, 1} such that f_ϵ(x_i) = 1{ϵ_i = 1}. Therefore,
$$ \sup_{f \in F} \Big| \frac{1}{n} \sum_{i=1}^{n} \epsilon_i f(x_i) \Big| \ge \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{\epsilon_i = 1\}, $$
which implies that the Rademacher complexity R_{µ,n}(F) is bounded below by (1/n) Σ_{i=1}^n P(ϵ_i = 1) = 1/2. Nonetheless, as mentioned above, the TV distance can still satisfy (IV) in specific contexts. For instance, leveraging the bound in Lemma 5.2 of Massart (2000), when the cardinality |Y| of Y is finite, there will be replicates in [f(x_1), ..., f(x_n)] whenever n > |Y|.
Hence, as n → ∞, it will be impossible to find a function in F which can interpolate any noise vector of Rademacher variables with [f(x_1), ..., f(x_n)], thus ensuring R_n(F) → 0.

EXAMPLE A.4. (Kolmogorov–Smirnov distance). The KS distance meets (III) by definition and, similarly to MMD with bounded kernels, also satisfies condition (IV) without the need to impose additional constraints on the model µ_θ or on the data generating process. More specifically, Assumption (IV) follows from the inequality R_{µ,n}(F) ≤ 2[log(n + 1)/n]^{1/2} in Chapter 4.3.1 of Wainwright (2019). This is a consequence of the bounds on R_{µ,n}(F) when F is a class of b-uniformly bounded functions such that, for some ν ≥ 1, it holds card{f(x_{1:n}) : f ∈ F} ≤ (n + 1)^ν for any n and x_{1:n} in Y^n. When F = {1_{(−∞,a]} : a ∈ R}, each x_{1:n} divides the real line into at most n + 1 intervals, and every indicator function within F takes value 1 for all x_i ≤ a and zero otherwise, meaning that card{f(x_{1:n}) : f ∈ F} ≤ (n + 1). Therefore, applying Equation (4.24) in Wainwright (2019) with b = 1 and ν = 1 yields R_{µ,n}(F) ≤ 2[log(n + 1)/n]^{1/2} for any µ ∈ P(Y), which implies that Assumption (IV) is met. These derivations clarify the usefulness of the available techniques for upper bounding the Rademacher complexity (e.g., Wainwright, 2019, Chapter 4.3), leveraging, in this case, the notion of polynomial discrimination and the closely-related VC dimension.

APPENDIX B: CONCENTRATION IN THE SPACE OF PARAMETERS

Theorem 3.3 in the main article is stated for neighborhoods within the space of distributions. Although such a perspective is in line with the overarching focus of current theory for discrepancy-based ABC (Jiang, Wu and Wong, 2018; Bernton et al., 2019; Nguyen et al.
, 2020; Frazier, 2020; Fujisawa et al., 2021), it shall be emphasized that similar results can also be derived in the space of parameters. To this end, it suffices to adapt Corollary 1 in Bernton et al. (2019) to our general framework, under the same additional assumptions, which are adapted below to the whole IPS class.

(V) The minimizer θ* of D_F(µ_θ, µ*) exists and is well separated, meaning that for any δ > 0 there is a δ′ > 0 such that inf_{θ∈Θ : d(θ,θ*)>δ} D_F(µ_θ, µ*) > D_F(µ_{θ*}, µ*) + δ′;
(VI) The parameters θ are identifiable, and there exist positive constants K > 0, ν > 0 and an open neighborhood U ⊂ Θ of θ* such that, for any θ ∈ U, it holds that d(θ, θ*) ≤ K[D_F(µ_θ, µ*) − ε*]^ν.

Assumptions (V) and (VI) essentially require that the parameters θ are identifiable, sufficiently well-separated, and that the distance d(·,·) between parameter values has some reasonable correspondence with the discrepancy D_F(·,·) among the associated distributions. Although these two assumptions introduce a condition on the model, it shall be emphasized that (V) and (VI) are not specific to our framework (e.g., Frazier et al., 2018; Bernton et al., 2019; Frazier, 2020). On the contrary, these identifiability conditions are arguably customary and minimal requirements in parameter inference. Moreover, these two assumptions have been checked in Chérief-Abdellatif and Alquier (2022) for MMD and in Bernton et al. (2019) for the Wasserstein distance, which are arguably the two most remarkable examples of IPS employed in the ABC context. Under (V) and (VI), it is possible to state Corollary B.1.

COROLLARY B.1. Assume (I)–(IV) along with (V)–(VI), and that D_F denotes a discrepancy within the IPS class in Definition 2.1. Moreover, take ε̄_n → 0 as n → ∞, with n ε̄_n² → ∞ and ε̄_n / R_n(F) → ∞.
Then, the ABC posterior with threshold ε_n = ε* + ε̄_n satisfies
$$ \pi_n^{(\varepsilon^* + \bar\varepsilon_n)}\Big\{\theta : d(\theta, \theta^*) > K\Big[\frac{4\bar\varepsilon_n}{3} + 2 R_n(F) + \Big(\frac{2b^2}{n}\log\frac{n}{\bar\varepsilon_n^L}\Big)^{1/2}\Big]^{\nu}\Big\} \le \frac{2 \cdot 3^L c_\pi}{n}, $$
with P_{y_{1:n}}-probability going to 1 as n → ∞.

As for Theorem 3.3, Corollary B.1 also holds more generally when replacing both n/ε̄_n^L and c_π/n with M_n/ε̄_n^L and c_π/M_n, respectively, for any M_n > 1. The proof of Corollary B.1 follows directly from Theorem 3.3 and Assumptions (V)–(VI), thereby allowing to inherit the discussion after Theorem 3.3 also when the concentration is measured directly within the parameter space via d(θ, θ*). For instance, when d(·,·) is the Euclidean distance and ν = 1, this implies that whenever R_n(F) = O(n^{−1/2}) the contraction rate will be of order O([log(n)/n]^{1/2}), which is the expected rate in parametric models.

APPENDIX C: EXTENSION TO NON-I.I.D. SETTINGS

Although the theoretical results in Sections 2–4 provide an improved understanding of the limiting properties of discrepancy-based ABC posteriors, the i.i.d. assumption in (I) rules out important settings which often require ABC. A remarkable case is that of time-dependent observations (e.g., Fearnhead and Prangle, 2012; Bernton et al., 2019; Nguyen et al., 2020; Drovandi and Frazier, 2022).

Section C.1 clarifies that the theory derived under i.i.d. assumptions in Sections 2–4 can be naturally extended to these non-i.i.d. settings, leveraging results for the Rademacher complexity of β-mixing stochastic processes (Mohri and Rostamizadeh, 2008). Examples C.3–C.4 below show that such a class embraces several processes of direct practical interest. Extensions beyond this class, albeit relevant, are challenging even when the focus is on proving simpler, non-uniform, concentration results for a single discrepancy.
Hence, these extensions are left for future research, which could be facilitated by the derivation of Rademacher complexity bounds for general processes beyond the $\beta$-mixing ones studied in Mohri and Rostamizadeh (2008).

C.1. Convergence and concentration beyond i.i.d. settings. Let us assume again that $\mathcal{Y}$ is a metric space endowed with distance $\rho$. However, unlike the i.i.d. setting considered in Section 2, we now focus on the situation in which the observed data $y_{1:n} = (y_1, \ldots, y_n) \in \mathcal{Y}^n$ are dependent and drawn from the joint distribution $\mu^{*(n)} \in \mathcal{P}(\mathcal{Y}^n)$, where $\mathcal{P}(\mathcal{Y}^n)$ is the space of probability measures on $\mathcal{Y}^n$. Under this more general framework, the i.i.d. case is recovered by assuming that $\mu^{*(n)}$ can be expressed as a product, i.e., $\mu^{*(n)} = \prod_{i=1}^n \mu^*$. In the following, this product structure is not imposed. Instead, we only assume that the marginal of $\mu^{*(n)}$ is constant, and denote it by $\mu^*$. Such an assumption is met whenever $y_{1:n}$ is extracted from a stationary stochastic process $(y_t)_{t\in\mathbb{Z}}$, thus embracing a broader variety of applications of direct interest. In these settings, a statistical model is defined as a collection of distributions in $\mathcal{P}(\mathcal{Y}^n)$, i.e., $\{\mu^{(n)}_\theta : \theta \in \Theta \subseteq \mathbb{R}^p\}$, with a constant marginal denoted by $\mu_\theta$. Notice that these constant-marginal assumptions on $\mu^*$ and $\mu_\theta$ are also made in the available concentration theory under non-i.i.d. settings (see e.g., Bernton et al., 2019; Nguyen et al., 2020) when requiring convergence of $D_F(\hat\mu_{y_{1:n}}, \mu^*)$ and suitable concentration inequalities for $D_F(\hat\mu_{z_{1:n}}, \mu_\theta)$. As a result, the settings we consider are not more restrictive than those addressed in discrepancy-specific theory. In fact, both Bernton et al. (2019) and Nguyen et al.
(2020) explicitly refer to stationary processes when discussing the validity of the assumptions on $D_F(\hat\mu_{y_{1:n}}, \mu^*)$ and $D_F(\hat\mu_{z_{1:n}}, \mu_\theta)$ in non-i.i.d. contexts.

Given the above statistical model, a prior $\pi$ on $\theta$ and a generic IPS discrepancy $D_F$, the ABC posterior with threshold $\varepsilon_n \ge 0$ is defined as

$$\pi^{(\varepsilon_n)}_n(\theta) \propto \pi(\theta) \int_{\mathcal{Y}^n} \mathbb{1}\{D_F(\hat\mu_{z_{1:n}}, \hat\mu_{y_{1:n}}) \le \varepsilon_n\} \, \mu^{(n)}_\theta(dz_{1:n}).$$

This definition is the same as the one in Section 2, with the only difference that $\mu^n_\theta = \prod_{i=1}^n \mu_\theta$ is replaced by the joint $\mu^{(n)}_\theta$, since the data are no longer assumed to be independent. In order to extend the convergence result in Corollary 3.2, together with the concentration statement in Theorem 3.3, to the above framework, we require an analog of Equation (2) in Lemma 2.6 for time-dependent data. This generalization can be derived by leveraging results in Mohri and Rostamizadeh (2008) based on the notion of $\beta$-mixing coefficients.

DEFINITION C.1 ($\beta$-mixing). Consider the stationary sequence $(x_t)_{t\in\mathbb{Z}}$ of random variables, and let $\sigma_j^{j'}$ be the $\sigma$-algebra generated by the random variables $x_k$, $j \le k \le j'$, for any $j, j' \in \mathbb{Z} \cup \{-\infty, +\infty\}$. Then, for any integer $k > 0$, the $\beta$-mixing coefficient of $(x_t)_{t\in\mathbb{Z}}$ is defined as

$$\beta(k) = \sup_{t \in \mathbb{Z}} \mathbb{E}\Big[ \sup_{A \in \sigma_{t+k}^{\infty}} \big|P(A \mid \sigma_{-\infty}^{t}) - P(A)\big| \Big].$$

If $\beta(k) \to 0$ as $k \to \infty$, then the stochastic process $(x_t)_{t\in\mathbb{Z}}$ is said to be $\beta$-mixing.

Intuitively, $\beta(k)$ measures the dependence between the past (before $t$) and the future (after $t+k$) of the process. When this dependence is weak, we expect $\beta(k)$ to decay to 0 quickly as $k \to \infty$. In the most extreme case, when the $x_t$'s are i.i.d., we have $\beta(k) = 0$ for all $k > 0$. More generally, as clarified in Definition C.1, a process with $\beta(k) \to 0$ as $k \to \infty$ is called $\beta$-mixing.
We refer the reader to Doukhan (1994) for an in-depth study of the main properties of $\beta$-mixing processes, along with a more comprehensive discussion of relevant examples. The most notable ones are also presented below.

Leveraging the notion of $\beta$-mixing coefficient, Lemma C.2 extends Lemma 2.6 to the dependent setting. The proof can be found in Appendix D and combines Proposition 2 and Lemma 2 in Mohri and Rostamizadeh (2008). For readability, let us also introduce the notation $s_n = \lfloor n/(2\lfloor\sqrt{n}\rfloor)\rfloor$. Note that $s_n \sim \sqrt{n}/2$ as $n \to \infty$, and thus $s_n \to \infty$.

LEMMA C.2. Define $s_n = \lfloor n/(2\lfloor\sqrt{n}\rfloor)\rfloor$. Moreover, consider the stationary stochastic process $(x_t)_{t\in\mathbb{Z}}$ and denote by $\beta(k)$, $k \in \mathbb{N}$, its $\beta$-mixing coefficients. Let $\mu^{(n)}$ be the joint distribution of a sample $x_{1:n}$ extracted from $(x_t)_{t\in\mathbb{Z}}$ and denote by $\mu = \mu^{(1)}$ its constant marginal. Then, for any $b$-uniformly bounded class $\mathcal{F}$, any integer $n \ge 1$ and scalar $\delta \ge 0$,

(C.1) $$P_{x_{1:n}}\Big[ D_F(\hat\mu_{x_{1:n}}, \mu) \le 2 R_{\mu, s_n}(\mathcal{F}) + \frac{4b}{\sqrt{n}} + \delta \Big] \ge 1 - 2\exp\Big(-\frac{s_n \delta^2}{2b^2}\Big) - 2 s_n \beta(\lfloor\sqrt{n}\rfloor),$$

with $R_{\mu,s_n}(\mathcal{F})$ the Rademacher complexity in Definition 2.5 for an i.i.d. sample of size $s_n$ from $\mu$.

Equation (C.1) extends (2) beyond i.i.d. settings. This extension provides a bound that still depends on the Rademacher complexity in Definition 2.5 for an i.i.d. sample, in this case from the common marginal $\mu$ of the process $(x_t)_{t\in\mathbb{Z}}$. As such, Assumption (IV) requires no modification, and no additional validity checks relative to those discussed in Section 3.3. This suggests that the Rademacher complexity framework might also be leveraged to derive improved convergence and concentration results for discrepancy-based ABC posteriors in more general situations that do not necessarily meet Assumption (I).
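As a quick numerical sanity check on the block-size notation above (our own sketch, not part of the paper), the quantity $s_n = \lfloor n/(2\lfloor\sqrt{n}\rfloor)\rfloor$ can be computed with integer arithmetic, and the ratio $s_n/(\sqrt{n}/2)$ is indeed close to 1 for large $n$:

```python
import math

def s_n(n: int) -> int:
    # s_n = floor(n / (2 * floor(sqrt(n)))), the effective i.i.d. block size in Lemma C.2
    return n // (2 * math.isqrt(n))

# s_n grows like sqrt(n)/2, so the ratio s_n / (sqrt(n)/2) approaches 1
for n in [100, 1_000, 10_000, 1_000_000]:
    print(n, s_n(n), s_n(n) / (math.sqrt(n) / 2))
```

For perfect squares the ratio is exactly 1 (e.g. $s_{10000} = 50$); for general $n$ the floors introduce only a vanishing relative error.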
To prove these results we again leverage Assumptions (II), (III) and (IV), and replace (I) with condition (VII).

(VII) The data $y_{1:n}$ are from a $\beta$-mixing stochastic process $(y_t)_{t\in\mathbb{Z}}$ with mixing coefficients $\beta(k) \le C_\beta e^{-\gamma k^\xi}$ for some $C_\beta, \gamma, \xi > 0$, common marginal $\mu^*$, and generic joint $\mu^{*(n')}$ for a sample $y_{1:n'}$ from $(y_t)_{t\in\mathbb{Z}}$, for any $n' \in \mathbb{N}$. The same $\beta$-mixing conditions also hold for the process $(z_t)_{t\in\mathbb{Z}}$ associated with the synthetic data $z_{1:n}$ from the assumed model. In this case, the joint distribution for a generic sample $z_{1:n'}$ is $\mu^{(n')}_\theta$, $\theta \in \Theta$, and the common marginal is denoted by $\mu_\theta$. For simplicity, and without loss of generality, we also assume that the constants $C_\beta$, $\gamma$ and $\xi$ are the same for $(y_t)_{t\in\mathbb{Z}}$ and $(z_t)_{t\in\mathbb{Z}}$.

Assumption (VII) is clearly more general than (I). As discussed previously, it embraces several stochastic processes of substantial interest in practical applications, including those in Examples C.3-C.4 below; see Doukhan (1994) for additional examples and discussion.

EXAMPLE C.3 (Doeblin-recurrent Markov chains). Let $(x_t)_{t\in\mathbb{Z}}$ be a Markov chain on $\mathcal{Y} \subset \mathbb{R}^d$ with transition kernel $P(\cdot,\cdot)$. Such a Markov chain is said to be Doeblin-recurrent if there exist a probability measure $q$, a constant $0 < c \le 1$ and an integer $r > 0$ such that, for any measurable set $A$ and any $x \in \mathbb{R}^d$, $P^r(x, A) \ge c\,q(A)$. When this is the case, $(x_t)_{t\in\mathbb{Z}}$ is $\beta$-mixing with $\beta(k) \le 2(1-c)^{k/r}$; see e.g., Theorem 1 on page 88 of Doukhan (1994).

EXAMPLE C.4 (Hidden Markov chains). Assume $(x_t)_{t\in\mathbb{Z}}$ is a $\beta$-mixing stochastic process with coefficients $\beta_x(k)$, $k \in \mathbb{N}$. If $\tilde{x}_t = F(x_t, \varepsilon_t)$ with $\varepsilon_t$ i.i.d., then the $\beta$-mixing coefficients of $(\tilde{x}_t)_{t\in\mathbb{Z}}$ satisfy $\beta_{\tilde{x}}(k) = \beta_x(k)$. Therefore, $(\tilde{x}_t)_{t\in\mathbb{Z}}$ is also $\beta$-mixing and inherits the bounds on $\beta_x(k)$.
These processes are often used in practice with $(x_t)_{t\in\mathbb{Z}}$ being a Markov chain. In this case $(\tilde{x}_t)_{t\in\mathbb{Z}}$ is called a hidden Markov chain.

Section 2.4.2 of Doukhan (1994) also provides conditions on $F$ and on the i.i.d. sequence $(\varepsilon_t)_{t\in\mathbb{Z}}$ ensuring that a stationary process $(x_t)_{t\in\mathbb{Z}}$ satisfying $x_t = F(x_{t-1}, \ldots, x_{t-k}, \varepsilon_t)$ exists and is $\beta$-mixing. Lemma C.5 specializes this result to Gaussian AR(1) processes, which are considered in the empirical study in Section C.2.

LEMMA C.5 (Gaussian AR(1) process). Consider a generic sequence $(\varepsilon_t)_{t\in\mathbb{Z}}$ of i.i.d. random variables from a $N(0,\sigma^2)$. Moreover, let $-1 < \theta < 1$ and $\psi \in \mathbb{R}$. Then the stationary solution to $x_t = \psi + \theta x_{t-1} + \varepsilon_t$ is $\beta$-mixing with coefficients $\beta(k) \le |\theta|^k/(2\sqrt{1-\theta^2}) = (2\sqrt{1-\theta^2})^{-1}\exp(-k\log(1/|\theta|))$, $k \in \mathbb{N}$, thus meeting (VII).

Notice that the empirical study in Section C.2 focuses on inference for the AR parameter $\theta$. Clearly, in this case it is not sufficient to consider the marginal distribution of each $x_t$. Rather, one should leverage the bivariate distribution of the pairs $\tilde{x}_t := (x_t, x_{t+1})$; see also Bernton et al. (2019), where this strategy is named delay reconstruction. This procedure simply shifts the focus to the bivariate stochastic process $(\tilde{x}_t)_{t\in\mathbb{Z}}$, but does not alter the mixing properties. In particular, if $\beta_x(k)$ and $\beta_{\tilde{x}}(k)$ are the mixing coefficients of $(x_t)_{t\in\mathbb{Z}}$ and $(\tilde{x}_t)_{t\in\mathbb{Z}}$, respectively, then from Definition C.1 we have $\beta_{\tilde{x}}(k) = \beta_x(k-1)$ for $k \ge 1$. Notice that identifiability is key to ensuring concentration in the space of the parameters of interest, as in Corollary B.1.
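Delay reconstruction itself is a one-line operation: the scalar series is replaced by its consecutive pairs before any empirical discrepancy is computed. A minimal sketch (the helper name `delay_pairs` and the variable names are ours, not from the paper):

```python
import numpy as np

def delay_pairs(x: np.ndarray) -> np.ndarray:
    """Map a scalar series (x_0, ..., x_n) to the pairs (x_t, x_{t+1})."""
    return np.column_stack([x[:-1], x[1:]])

rng = np.random.default_rng(0)

# Simulate a stationary Gaussian AR(1) path x_t = theta * x_{t-1} + eps_t
theta, n = 0.5, 101
x = np.empty(n)
x[0] = rng.normal(scale=np.sqrt(1 / (1 - theta**2)))  # stationary initial draw
for t in range(1, n):
    x[t] = theta * x[t - 1] + rng.normal()

pairs = delay_pairs(x)  # shape (100, 2): the bivariate inputs to the discrepancy
print(pairs.shape)
```

Any empirical discrepancy (MMD, Wasserstein, a summary statistic) is then evaluated on these two-dimensional points rather than on the raw scalars.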
This motivates further research, beyond the scope of this article, to derive delay reconstruction strategies ensuring identifiability in more complex processes.

Leveraging Lemma C.2 along with the newly-introduced Assumption (VII), Proposition C.6 states convergence of the ABC posterior when $\varepsilon_n = \varepsilon$ is fixed and $n \to \infty$.

PROPOSITION C.6. Under Assumptions (III), (IV) and (VII), for any $\varepsilon > \tilde\varepsilon$, it holds that

(C.2) $$\pi^{(\varepsilon)}_n(\theta) \to \pi(\theta \mid D_F(\mu_\theta, \mu^*) \le \varepsilon) \propto \pi(\theta)\,\mathbb{1}\{D_F(\mu_\theta, \mu^*) \le \varepsilon\},$$

almost surely with respect to $y_{1:n} \sim \mu^{*(n)}$, as $n \to \infty$.

According to Proposition C.6, replacing Assumption (I) with (VII) does not alter the uniform convergence properties of the ABC posterior originally stated in Corollary 3.2 under the i.i.d. assumption. This allows us to inherit the discussion after Corollary 3.2 also beyond i.i.d. settings, while suggesting that similar extensions are possible in the regime $\varepsilon_n \to \varepsilon^*$ as $n \to \infty$. These extensions are stated in Theorem C.7, which provides an important generalization of Theorem 3.3 beyond the i.i.d. case.

THEOREM C.7. Let $\varepsilon_n = \varepsilon^* + \bar\varepsilon_n$, and assume (II), (III), (IV) and (VII). Then, if $\bar\varepsilon_n \to 0$ is such that $\sqrt{n}\,\bar\varepsilon_n^2 \to \infty$ and $\bar\varepsilon_n/R_{s_n}(\mathcal{F}) \to \infty$, with $s_n = \lfloor n/(2\lfloor\sqrt{n}\rfloor)\rfloor$, we have

$$\pi^{(\varepsilon^* + \bar\varepsilon_n)}_n \Big\{ \theta : D_F(\mu_\theta, \mu^*) > \varepsilon^* + \frac{4\bar\varepsilon_n}{3} + 2 R_{s_n}(\mathcal{F}) + \frac{4b}{\sqrt{n}} + \Big( \frac{2b^2}{s_n} \log \frac{n}{\bar\varepsilon_n^L} \Big)^{1/2} \Big\} \le \frac{4 \cdot 3^L}{c_\pi n},$$

with $P_{y_{1:n}}$-probability going to 1 as $n \to \infty$, where $R_{s_n} = \sup_{\mu \in \mathcal{P}(\mathcal{Y})} R_{\mu, s_n}$.

As with Proposition C.6, Theorem C.7 shows that informative concentration inequalities similar to those derived in Section 3.2 can be obtained beyond the i.i.d. setting.
These results provide insights comparable to those in Theorem 3.3, with the only differences that we now require $\sqrt{n}\,\bar\varepsilon_n^2 \to \infty$ rather than $n\bar\varepsilon_n^2 \to \infty$, and that the term $2b^2/n$ within the bound in Theorem 3.3 is replaced by $2b^2/s_n$, with $s_n \sim \sqrt{n}/2$ as $n \to \infty$. This means that $\bar\varepsilon_n$ must shrink to zero at a rate at least $n^{1/4}$ slower than the one allowed in the i.i.d. setting. This is an interesting result clarifying that, when moving beyond i.i.d. regimes, concentration can still be achieved, although at a slower rate. Such a rate might be pessimistic in some models, and we believe it may be improved by future refinements of Lemma C.2. Notice that (VII) could be relaxed to include $\beta$-mixing processes whose coefficients $\beta(k)$ vanish at a non-exponential rate, e.g., $\beta(k) \sim 1/(k+1)^\xi$ for some $\xi > 0$. In this case, we could still use Lemma C.2 to prove concentration, but with a smaller $s_n$, leading to even slower rates. However, for the sake of readability, we do not state the most general result. As for processes that are not $\beta$-mixing, we are not aware of results similar to Lemma C.2 in that context. This is an important direction for future research.

C.2. Illustrative simulation in non-i.i.d. settings. Let us illustrate the results in Section C.1 on a simple simulation study focusing on a contaminated Gaussian AR(1) process. More specifically, the uncontaminated data are generated from the model $y^*_t = 0.5\,y^*_{t-1} + \varepsilon_t$ for $t = 1, \ldots, 100$, with $\varepsilon_t \sim N(0,1)$ independently, and initial state $y^*_0 \sim N(0,1)$. Then, similarly to the simulation study in Section 5, these data are contaminated with a growing fraction $\alpha \in \{0.05, 0.10, 0.15\}$ of independent realizations from a $N(20,1)$. As such, each observed data point $y_t$ is either equal to $y^*_t$ or to a sample from $N(20,1)$, for $t = 1, \ldots, 100$.
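A minimal sketch of this data-generating process (our own illustration, not the authors' code; here each point after the initial state is contaminated independently with probability $\alpha$, one natural reading of the Huber contamination described above):

```python
import numpy as np

rng = np.random.default_rng(1)

def contaminated_ar1(n=100, theta0=0.5, alpha=0.10, loc=20.0):
    """Gaussian AR(1) path y*_t = theta0 * y*_{t-1} + eps_t, eps_t ~ N(0, 1),
    with a fraction alpha of the points replaced by independent N(loc, 1) draws."""
    y = np.empty(n + 1)
    y[0] = rng.normal()  # initial state y*_0 ~ N(0, 1)
    for t in range(1, n + 1):
        y[t] = theta0 * y[t - 1] + rng.normal()
    # Huber-type contamination of y_1, ..., y_n
    contaminate = np.zeros(n + 1, dtype=bool)
    contaminate[1:] = rng.random(n) < alpha
    y[contaminate] = rng.normal(loc=loc, size=contaminate.sum())
    return y

y = contaminated_ar1(alpha=0.10)
print(y.shape)  # the observed series y_0, ..., y_100
```

The resulting series would then be delay-reconstructed into consecutive pairs before computing any of the discrepancies compared below.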
For Bayesian inference, we assume an AR(1) model $z_t = \theta z_{t-1} + \varepsilon_t$, with $\varepsilon_t \sim N(0,1)$, and focus on learning $\theta$ via discrepancy-based ABC under a uniform prior on $[-1,1]$ for $\theta$. Rejection ABC is implemented under the same settings and discrepancies considered in Section 5. However, as discussed in Section C.1, in this case we focus on distances among the empirical distributions of the $n = m = 100$ observed $(y_0, y_1), (y_1, y_2), \ldots, (y_{99}, y_{100})$ and synthetic $(z_0, z_1), (z_1, z_2), \ldots, (z_{99}, z_{100})$ pairs. This is consistent with the delay reconstruction strategy in Bernton et al. (2019) and is motivated by the fact that the information on $\theta$ lies in the bivariate distributions, rather than in the marginals. For the same reason, in implementing summary-based ABC we consider the sample covariance rather than the sample mean. Table C.1 summarizes the concentration achieved by the different discrepancies analyzed under the aforementioned non-i.i.d. data-generating process and model, at varying contamination $\alpha \in \{0.05, 0.10, 0.15\}$. The results are coherent with those displayed in Table 1 for the i.i.d. scenario, and further clarify that discrepancies with guarantees of uniform convergence and concentration generally provide a robust choice, including in non-i.i.d. contexts.

TABLE C.1
Concentration and runtimes in seconds (for a single discrepancy evaluation) of ABC under MMD with Gaussian kernel, Wasserstein-1 distance, summary-based distance (covariance) and KL divergence for an AR(1) Huber contamination model with $\alpha \in \{0.05, 0.10, 0.15\}$. MSE $= \hat{\mathbb{E}}_{\mu^{*(n)}}[\hat{\mathbb{E}}_{ABC}(\theta - \theta_0)^2]$, $\theta_0 = 0.5$.

                                   MSE (α=0.05)   MSE (α=0.10)   MSE (α=0.15)   time
(IPS)     MMD                          0.029          0.036          0.049      < 0.01"
(IPS)     Wasserstein-1                0.043          0.091          0.180      < 0.01"
(IPS)     summary (covariance)         0.575          0.998          1.001      < 0.01"
(non-IPS) KL                           0.058          0.060          0.061      < 0.01"

APPENDIX D: PROOFS OF THEOREMS, COROLLARIES AND PROPOSITIONS

PROOF OF THEOREM 3.1. Note that, by leveraging the first inequality in Lemma 2.6, we have $P_{y_{1:n}}[D_F(\hat\mu_{y_{1:n}}, \mu^*) > 2R_{\mu^*,n}(\mathcal{F}) + \delta] \le \exp(-n\delta^2/2b^2)$. Hence, setting $\delta = 1/n^{1/4}$, and recalling that $R_{\mu^*,n}(\mathcal{F}) \le R_n(\mathcal{F})$, it follows that $P_{y_{1:n}}[D_F(\hat\mu_{y_{1:n}}, \mu^*) > 2R_n(\mathcal{F}) + 1/n^{1/4}] \le \exp(-\sqrt{n}/2b^2)$; note that $\sum_{n\ge 0}\exp(-\sqrt{n}/2b^2) < \infty$. Therefore, if we define the event

(D.1) $$E_n = \{D_F(\hat\mu_{y_{1:n}}, \mu^*) \le 2R_n(\mathcal{F}) + 1/n^{1/4}\},$$

then $\mathbb{1}\{E_n^c\} \to 0$ almost surely with respect to $y_{1:n} \overset{\text{i.i.d.}}{\sim} \mu^*$ as $n \to \infty$. Now, notice that

$$\pi^{(\varepsilon)}_n\{\theta : D_F(\mu_\theta, \mu^*) \le \varepsilon\} = \pi^{(\varepsilon)}_n\{\theta : D_F(\mu_\theta, \mu^*) \le \varepsilon\}\mathbb{1}\{E_n\} + \pi^{(\varepsilon)}_n\{\theta : D_F(\mu_\theta, \mu^*) \le \varepsilon\}\mathbb{1}\{E_n^c\}.$$

Hence, in the following we focus on $\pi^{(\varepsilon)}_n\{\theta : D_F(\mu_\theta, \mu^*) \le \varepsilon\}\mathbb{1}\{E_n\}$. To this end, recall that

$$\pi^{(\varepsilon)}_n(\theta) \propto \pi(\theta)\int \mathbb{1}\{D_F(\hat\mu_{y_{1:n}}, \hat\mu_{z_{1:n}}) \le \varepsilon\}\,\mu^n_\theta(dz_{1:n}) = \pi(\theta)\int \mathbb{1}\{D_F(\mu_\theta, \mu^*) \le \varepsilon + W_F(z_{1:n})\}\,\mu^n_\theta(dz_{1:n}) =: \pi(\theta)\,p_n(\theta),$$

where $W_F(z_{1:n}) = D_F(\mu_\theta, \mu^*) - D_F(\hat\mu_{y_{1:n}}, \hat\mu_{z_{1:n}})$, whereas $p_n(\theta)$ denotes the probability of generating a sample $z_{1:n}$ from $\mu^n_\theta$ that leads to accepting the parameter value $\theta$. Note that, by applying the triangle inequality twice, we have $-D_F(\hat\mu_{z_{1:n}}, \mu_\theta) - D_F(\hat\mu_{y_{1:n}}, \mu^*) \le W_F(z_{1:n}) \le D_F(\hat\mu_{z_{1:n}}, \mu_\theta) + D_F(\hat\mu_{y_{1:n}}, \mu^*)$ and, hence, $|W_F(z_{1:n})| \le D_F(\hat\mu_{z_{1:n}}, \mu_\theta) + D_F(\hat\mu_{y_{1:n}}, \mu^*)$.
This implies that the quantity $p_n(\theta)$ can be bounded below and above as follows:

$$\int \mathbb{1}\{D_F(\mu_\theta,\mu^*) \le \varepsilon - D_F(\hat\mu_{z_{1:n}},\mu_\theta) - D_F(\hat\mu_{y_{1:n}},\mu^*)\}\,\mu^n_\theta(dz_{1:n}) \le p_n(\theta) \le \int \mathbb{1}\{D_F(\mu_\theta,\mu^*) \le \varepsilon + D_F(\hat\mu_{z_{1:n}},\mu_\theta) + D_F(\hat\mu_{y_{1:n}},\mu^*)\}\,\mu^n_\theta(dz_{1:n}).$$

Applying Lemma 2.6 again yields $P_{z_{1:n}}[D_F(\hat\mu_{z_{1:n}},\mu_\theta) > 2R_{\mu_\theta,n}(\mathcal{F}) + \delta] \le \exp(-n\delta^2/2b^2)$. Therefore, setting $\delta = 1/n^{1/4}$, and recalling that $R_{\mu_\theta,n}(\mathcal{F}) \le R_n(\mathcal{F})$ and that we are on the event given in (D.1), it follows that

(D.2) $$-\exp(-\sqrt{n}/2b^2) + \mathbb{1}\{D_F(\mu_\theta,\mu^*) \le \varepsilon - 4R_n(\mathcal{F}) - 2/n^{1/4}\} \le p_n(\theta) \le \exp(-\sqrt{n}/2b^2) + \mathbb{1}\{D_F(\mu_\theta,\mu^*) \le \varepsilon + 4R_n(\mathcal{F}) + 2/n^{1/4}\}.$$

Now, notice that the acceptance probability is defined as $p_n = \int p_n(\theta)\,\pi(d\theta)$. Hence, integrating the above inequalities with respect to $\pi(\theta)$ yields, for $n$ large enough,

$$\pi\{\theta : D_F(\mu_\theta,\mu^*) \le \varepsilon - c_F\} - e_n \le p_n \le \pi\{\theta : D_F(\mu_\theta,\mu^*) \le \varepsilon + c_F\} + e_n,$$

where $c_F = 4\limsup R_n(\mathcal{F})$, as in Equation (3), thus concluding the first part of the proof.

To proceed with the second part of the proof, notice that, by the definition of $\tilde\varepsilon = \inf\{\epsilon > 0 : \pi\{\theta : D_F(\mu_\theta,\mu^*) \le \epsilon\} > 0\}$, the left-hand side of the above inequality is bounded away from zero for $n$ large enough whenever $\varepsilon - c_F > \tilde\varepsilon$. This implies that the acceptance probability $p_n$ is also strictly positive. As a consequence, for such $n$, it follows that

$$\pi^{(\varepsilon)}_n(A) = \frac{\int p_n(\theta)\mathbb{1}_A(\theta)\,\pi(d\theta)}{\int p_n(\theta)\,\pi(d\theta)} = \frac{\int p_n(\theta)\mathbb{1}_A(\theta)\,\pi(d\theta)}{p_n}$$

is well-defined for any event $A$.
Then, leveraging the upper bound in (D.2) yields

$$\pi^{(\varepsilon)}_n\{\theta : D_F(\mu_\theta,\mu^*) > \varepsilon + 4R_n(\mathcal{F})\} = \frac{\int p_n(\theta)\,\mathbb{1}\{D_F(\mu_\theta,\mu^*) > \varepsilon + 4R_n(\mathcal{F})\}\,\pi(d\theta)}{\int p_n(\theta)\,\pi(d\theta)} \le \frac{\int \mathbb{1}\{D_F(\mu_\theta,\mu^*) \le \varepsilon + 4R_n(\mathcal{F}) + 2/n^{1/4}\}\,\mathbb{1}\{D_F(\mu_\theta,\mu^*) > \varepsilon + 4R_n(\mathcal{F})\}\,\pi(d\theta)}{p_n} + \frac{\int \exp(-\sqrt{n}/2b^2)\,\mathbb{1}\{D_F(\mu_\theta,\mu^*) > \varepsilon + 4R_n(\mathcal{F})\}\,\pi(d\theta)}{p_n}.$$

To conclude the proof it is now necessary to control both terms. Note that we have already proved that the denominator $p_n$ is bounded away from zero for $n$ large enough. Both numerators are bounded by 1 and go to 0 as $n \to \infty$. Thus, by the dominated convergence theorem, both summands in the above upper bound for $\pi^{(\varepsilon)}_n\{\theta : D_F(\mu_\theta,\mu^*) > \varepsilon + 4R_n(\mathcal{F})\}$ go to zero. This implies, as a direct consequence, that $\pi^{(\varepsilon)}_n\{\theta : D_F(\mu_\theta,\mu^*) \le \varepsilon + 4R_n(\mathcal{F})\} \to 1$, almost surely with respect to $y_{1:n} \overset{\text{i.i.d.}}{\sim} \mu^*$ as $n \to \infty$, thereby concluding the proof.

PROOF OF COROLLARY 3.2. Note that, by combining Equation (2) in Lemma 2.6 with the fact that $\sum_{n>0}\exp[-n\delta^2/(2b^2)] < \infty$, the Borel-Cantelli lemma implies that both $D_F(\hat\mu_{z_{1:n}},\mu_\theta)$ and $D_F(\hat\mu_{y_{1:n}},\mu^*)$ converge to 0 almost surely when $R_n(\mathcal{F}) \to 0$ as $n \to \infty$. Hence, since $-D_F(\hat\mu_{z_{1:n}},\mu_\theta) - D_F(\hat\mu_{y_{1:n}},\mu^*) \le D_F(\mu_\theta,\mu^*) - D_F(\hat\mu_{y_{1:n}},\hat\mu_{z_{1:n}}) \le D_F(\hat\mu_{z_{1:n}},\mu_\theta) + D_F(\hat\mu_{y_{1:n}},\mu^*)$, it follows that $D_F(\hat\mu_{z_{1:n}},\hat\mu_{y_{1:n}}) \to D_F(\mu_\theta,\mu^*)$ almost surely as $n \to \infty$. Combining this result with the proof of Theorem 1 in Jiang, Wu and Wong (2018) yields the statement of Corollary 3.2. Notice that, as discussed in Section 3.1, the limiting pseudo-posterior in Corollary 3.2 is well-defined only for those $\varepsilon > \tilde\varepsilon$, with $\tilde\varepsilon$ as in Theorem 3.1.

PROOF OF THEOREM 3.3.
Since Lemma 2.6 and $R_n(\mathcal{F}) = \sup_{\mu\in\mathcal{P}(\mathcal{Y})} R_{\mu,n}(\mathcal{F}) \ge R_{\mu,n}(\mathcal{F})$ hold for every $\mu \in \mathcal{P}(\mathcal{Y})$, then, for every integer $n \ge 1$ and any scalar $\delta \ge 0$, Equation (2) implies $P_{x_{1:n}}[D_F(\hat\mu_{x_{1:n}},\mu) \le 2R_n(\mathcal{F}) + \delta] \ge 1 - \exp(-n\delta^2/2b^2)$. Moreover, since this result holds for any $\delta \ge 0$, it follows that $P_{x_{1:n}}[D_F(\hat\mu_{x_{1:n}},\mu) \le 2R_n(\mathcal{F}) + (c_1 - 2R_n(\mathcal{F}))] \ge 1 - \exp[-n(c_1 - 2R_n(\mathcal{F}))^2/2b^2]$ for any $c_1 \ge 2R_n(\mathcal{F})$. Hence,

(D.3) $$P_{x_{1:n}}[D_F(\hat\mu_{x_{1:n}},\mu) \le c_1] \ge 1 - \exp[-n(c_1 - 2R_n(\mathcal{F}))^2/2b^2].$$

Recalling the settings of Theorem 3.3, consider the sequence $\bar\varepsilon_n \to 0$ as $n \to \infty$, with $n\bar\varepsilon_n^2 \to \infty$ and $\bar\varepsilon_n/R_n(\mathcal{F}) \to \infty$, which is possible by Assumption (IV). These regimes imply that $\bar\varepsilon_n$ goes to zero more slowly than $R_n(\mathcal{F})$ and, hence, for $n$ large enough, $\bar\varepsilon_n/3 > 2R_n(\mathcal{F})$. Therefore, under Assumptions (I)-(III), it is now possible to apply (D.3) to $y_{1:n}$, setting $c_1 = \bar\varepsilon_n/3$, which yields $P_{y_{1:n}}[D_F(\hat\mu_{y_{1:n}},\mu^*) \le \bar\varepsilon_n/3] \ge 1 - \exp[-n(\bar\varepsilon_n/3 - 2R_n(\mathcal{F}))^2/2b^2]$. Since $-n(\bar\varepsilon_n/3 - 2R_n(\mathcal{F}))^2 = -n\bar\varepsilon_n^2[1/9 + 4(R_n(\mathcal{F})/\bar\varepsilon_n)^2 - (4/3)R_n(\mathcal{F})/\bar\varepsilon_n]$, it follows that $-n(\bar\varepsilon_n/3 - 2R_n(\mathcal{F}))^2 \to -\infty$ as $n \to \infty$. From the above settings we also have that $n\bar\varepsilon_n^2 \to \infty$ and $R_n(\mathcal{F})/\bar\varepsilon_n \to 0$ as $n \to \infty$. Therefore, as a consequence, we obtain $1 - \exp[-n(\bar\varepsilon_n/3 - 2R_n(\mathcal{F}))^2/2b^2] \to 1$ as $n \to \infty$. Hence, in the rest of this proof we restrict to the event $\{D_F(\hat\mu_{y_{1:n}},\mu^*) \le \bar\varepsilon_n/3\}$. Denote by $P_{\theta,z_{1:n}}$ the joint distribution of $\theta \sim \pi$ and $z_{1:n}$ i.i.d. from $\mu_\theta$.
By the definition of conditional probability, for any $c_2$, including $c_2 > 2R_n(\mathcal{F})$, it follows that

(D.4) $$\pi^{(\varepsilon^*+\bar\varepsilon_n)}_n(\{\theta : D_F(\mu_\theta,\mu^*) > \varepsilon^* + 4\bar\varepsilon_n/3 + c_2\}) = \frac{P_{\theta,z_{1:n}}[D_F(\mu_\theta,\mu^*) > \varepsilon^* + 4\bar\varepsilon_n/3 + c_2,\; D_F(\hat\mu_{z_{1:n}},\hat\mu_{y_{1:n}}) \le \varepsilon^* + \bar\varepsilon_n]}{P_{\theta,z_{1:n}}[D_F(\hat\mu_{z_{1:n}},\hat\mu_{y_{1:n}}) \le \varepsilon^* + \bar\varepsilon_n]}.$$

To derive an upper bound for the above ratio, we first identify an upper bound for its numerator. In addressing this goal, we leverage the triangle inequality $D_F(\mu_\theta,\mu^*) \le D_F(\hat\mu_{z_{1:n}},\mu_\theta) + D_F(\hat\mu_{z_{1:n}},\hat\mu_{y_{1:n}}) + D_F(\hat\mu_{y_{1:n}},\mu^*)$, since $D_F$ is a semimetric, and the previously-proved result that the event $\{D_F(\hat\mu_{y_{1:n}},\mu^*) \le \bar\varepsilon_n/3\}$ has $P_{y_{1:n}}$-probability going to 1, thereby obtaining

$$P_{\theta,z_{1:n}}[D_F(\mu_\theta,\mu^*) > \varepsilon^* + 4\bar\varepsilon_n/3 + c_2,\; D_F(\hat\mu_{z_{1:n}},\hat\mu_{y_{1:n}}) \le \varepsilon^* + \bar\varepsilon_n] \le P_{\theta,z_{1:n}}[D_F(\hat\mu_{z_{1:n}},\hat\mu_{y_{1:n}}) + D_F(\hat\mu_{z_{1:n}},\mu_\theta) + D_F(\hat\mu_{y_{1:n}},\mu^*) > \varepsilon^* + 4\bar\varepsilon_n/3 + c_2,\; D_F(\hat\mu_{z_{1:n}},\hat\mu_{y_{1:n}}) \le \varepsilon^* + \bar\varepsilon_n] \le P_{\theta,z_{1:n}}[D_F(\hat\mu_{z_{1:n}},\mu_\theta) + D_F(\hat\mu_{y_{1:n}},\mu^*) > \bar\varepsilon_n/3 + c_2] \le P_{\theta,z_{1:n}}[D_F(\hat\mu_{z_{1:n}},\mu_\theta) > c_2].$$

Rewriting $P_{\theta,z_{1:n}}[D_F(\hat\mu_{z_{1:n}},\mu_\theta) > c_2]$ as $\int_{\theta\in\Theta} P_{z_{1:n}}[D_F(\hat\mu_{z_{1:n}},\mu_\theta) > c_2 \mid \theta]\,\pi(d\theta)$ and applying (D.3) to $z_{1:n}$ yields

$$\int_{\theta\in\Theta} P_{z_{1:n}}[D_F(\hat\mu_{z_{1:n}},\mu_\theta) > c_2 \mid \theta]\,\pi(d\theta) = \int_{\theta\in\Theta}\big(1 - P_{z_{1:n}}[D_F(\hat\mu_{z_{1:n}},\mu_\theta) \le c_2 \mid \theta]\big)\,\pi(d\theta) \le \int_{\theta\in\Theta}\exp[-n(c_2 - 2R_n(\mathcal{F}))^2/2b^2]\,\pi(d\theta) = \exp[-n(c_2 - 2R_n(\mathcal{F}))^2/2b^2].$$

Hence, the numerator of the ratio in Equation (D.4) can be upper bounded by $\exp[-n(c_2 - 2R_n(\mathcal{F}))^2/2b^2]$ for any $c_2 > 2R_n(\mathcal{F})$.
As for the denominator, defining the event $E_n := \{\theta \in \Theta : D_F(\mu_\theta,\mu^*) \le \varepsilon^* + \bar\varepsilon_n/3\}$ and applying the triangle inequality again, we have that

$$P_{\theta,z_{1:n}}[D_F(\hat\mu_{z_{1:n}},\hat\mu_{y_{1:n}}) \le \varepsilon^* + \bar\varepsilon_n] \ge \int_{E_n} P_{z_{1:n}}[D_F(\hat\mu_{z_{1:n}},\hat\mu_{y_{1:n}}) \le \varepsilon^* + \bar\varepsilon_n \mid \theta]\,\pi(d\theta) \ge \int_{E_n} P_{z_{1:n}}[D_F(\hat\mu_{y_{1:n}},\mu^*) + D_F(\mu_\theta,\mu^*) + D_F(\hat\mu_{z_{1:n}},\mu_\theta) \le \varepsilon^* + \bar\varepsilon_n \mid \theta]\,\pi(d\theta) \ge \int_{E_n} P_{z_{1:n}}[D_F(\hat\mu_{z_{1:n}},\mu_\theta) \le \bar\varepsilon_n/3 \mid \theta]\,\pi(d\theta),$$

where the last inequality follows directly from the fact that it is possible to restrict to the event $\{D_F(\hat\mu_{y_{1:n}},\mu^*) \le \bar\varepsilon_n/3\}$, and that we are integrating over $E_n := \{\theta \in \Theta : D_F(\mu_\theta,\mu^*) \le \varepsilon^* + \bar\varepsilon_n/3\}$. Applying (D.3) to $z_{1:n}$ again, with $c_1 = \bar\varepsilon_n/3 > 2R_n(\mathcal{F})$, the last term of the above inequality can be further lower bounded by

$$\int_{E_n}\big(1 - \exp[-n(\bar\varepsilon_n/3 - 2R_n(\mathcal{F}))^2/2b^2]\big)\,\pi(d\theta) = \pi(E_n)\big(1 - \exp[-n(\bar\varepsilon_n/3 - 2R_n(\mathcal{F}))^2/2b^2]\big),$$

with $\pi(E_n) \ge c_\pi(\bar\varepsilon_n/3)^L$ by (II) and, as shown before, $1 - \exp[-n(\bar\varepsilon_n/3 - 2R_n(\mathcal{F}))^2/2b^2] \to 1$ as $n \to \infty$, which implies, for $n$ large enough, $1 - \exp[-n(\bar\varepsilon_n/3 - 2R_n(\mathcal{F}))^2/2b^2] > 1/2$. Leveraging both results, the denominator in (D.4) is lower bounded by $(c_\pi/2)(\bar\varepsilon_n/3)^L$. Let us now combine the upper and lower bounds derived, respectively, for the numerator and the denominator of the ratio in (D.4), to obtain

(D.5) $$\pi^{(\varepsilon^*+\bar\varepsilon_n)}_n(\{\theta\in\Theta : D_F(\mu_\theta,\mu^*) > \varepsilon^* + 4\bar\varepsilon_n/3 + c_2\}) \le \frac{\exp[-n(c_2 - 2R_n(\mathcal{F}))^2/2b^2]}{(c_\pi/2)(\bar\varepsilon_n/3)^L},$$

with $P_{y_{1:n}}$-probability going to 1 as $n \to \infty$. To conclude the proof, it suffices to replace $c_2$ in (D.5) with $2R_n(\mathcal{F}) + \sqrt{(2b^2/n)\log(M_n/\bar\varepsilon_n^L)}$, which is never lower than $2R_n(\mathcal{F})$. Finally, setting $M_n = n$ yields the statement of Theorem 3.3.

PROOF OF COROLLARY B.1.
Corollary B.1 follows by replacing the bounds in the proof of Corollary 1 of Bernton et al. (2019) with the newly-derived ones in Theorem 3.3.

PROOF OF COROLLARY 4.1. Recall that in the case of MMD with kernels bounded by 1 we have $R_n(\mathcal{F}) \le n^{-1/2}$. Hence, regarding the upper and lower bounds on $p_n$ in (3), it holds that

$$\pi\{\theta : D_{MMD}(\mu_\theta,\mu^*) \le \varepsilon - c_F\} \ge \pi\{\theta : D_{MMD}(\mu_\theta,\mu^*) \le \varepsilon - 4/\sqrt{n}\}, \qquad \pi\{\theta : D_{MMD}(\mu_\theta,\mu^*) \le \varepsilon + c_F\} \le \pi\{\theta : D_{MMD}(\mu_\theta,\mu^*) \le \varepsilon + 4/\sqrt{n}\}.$$

Combining the above inequalities with the result in (3), and taking the limit as $n \to \infty$, proves the first part of the statement. The second part is a direct application of Corollary 3.2 to the case of MMD with bounded kernels, after noticing that the aforementioned inequality $R_n(\mathcal{F}) \le n^{-1/2}$ implies $R_n(\mathcal{F}) \to 0$ as $n \to \infty$.

PROOF OF COROLLARY 4.2. To prove Corollary 4.2, it suffices to plug $\bar\varepsilon_n = [(\log n)/n]^{1/2}$ and $b = 1$ into the statement of Theorem 3.3, and then upper-bound the resulting radius via the inequalities $R_n(\mathcal{F}) \le n^{-1/2}$ and $\log n \ge 1$. The latter holds for any $n \ge 3$ and hence as $n \to \infty$.

PROOF OF PROPOSITION 4.3. We first show that, under (A1)-(A3), Assumptions 1 and 2 of Bernton et al. (2019) are satisfied under MMD when $f_n(\bar\varepsilon_n) = 1/(n\bar\varepsilon_n^2)$ and $c(\theta) = \mathbb{E}_z[k(z,z)]$, with $z \sim \mu_\theta$. Consistent with the above goal, first recall that, by standard properties of MMD,

(D.6) $$D^2_{MMD}(\mu_1,\mu_2) = \mathbb{E}_{x_1,x_1'}[k(x_1,x_1')] - 2\,\mathbb{E}_{x_1,x_2}[k(x_1,x_2)] + \mathbb{E}_{x_2,x_2'}[k(x_2,x_2')],$$

with $x_1, x_1' \sim \mu_1$ and $x_2, x_2' \sim \mu_2$, all independently; see e.g., Chérief-Abdellatif and Alquier (2022). Since $k(x,x') = \langle\phi(x),\phi(x')\rangle_{\mathcal{H}}$ (see e.g., Muandet et al.
, 2017), the above result implies that

(D.7) $$D^2_{MMD}(\mu_1,\mu_2) = \mathbb{E}_{x_1,x_1'}[\langle\phi(x_1),\phi(x_1')\rangle_{\mathcal{H}}] - 2\,\mathbb{E}_{x_1,x_2}[\langle\phi(x_1),\phi(x_2)\rangle_{\mathcal{H}}] + \mathbb{E}_{x_2,x_2'}[\langle\phi(x_2),\phi(x_2')\rangle_{\mathcal{H}}] = \|\mathbb{E}_{x_1}[\phi(x_1)]\|^2_{\mathcal{H}} - 2\langle\mathbb{E}_{x_1}[\phi(x_1)], \mathbb{E}_{x_2}[\phi(x_2)]\rangle_{\mathcal{H}} + \|\mathbb{E}_{x_2}[\phi(x_2)]\|^2_{\mathcal{H}} = \|\mathbb{E}_{x_1}[\phi(x_1)] - \mathbb{E}_{x_2}[\phi(x_2)]\|^2_{\mathcal{H}}.$$

Leveraging Equations (D.6)-(D.7) and basic Markov inequalities, for any $\bar\varepsilon_n \ge 0$ it holds that

$$P_{y_{1:n}}[D_{MMD}(\hat\mu_{y_{1:n}},\mu^*) > \bar\varepsilon_n] \le (1/\bar\varepsilon_n^2)\,\mathbb{E}_{y_{1:n}}\big[D^2_{MMD}(\hat\mu_{y_{1:n}},\mu^*)\big] = (1/\bar\varepsilon_n^2)\,\mathbb{E}_{y_{1:n}}\big[\|(1/n)\textstyle\sum_{i=1}^n \phi(y_i) - \mathbb{E}_y[\phi(y)]\|^2_{\mathcal{H}}\big] \le [1/(n^2\bar\varepsilon_n^2)]\textstyle\sum_{i=1}^n \mathbb{E}_{y_i}[\|\phi(y_i)\|^2_{\mathcal{H}}] \le [1/(n\bar\varepsilon_n^2)]\,\mathbb{E}_{y_1}[\|\phi(y_1)\|^2_{\mathcal{H}}] = [1/(n\bar\varepsilon_n^2)]\,\mathbb{E}_{y_1}[k(y_1,y_1)] = [1/(n\bar\varepsilon_n^2)]\,\mathbb{E}_y[k(y,y)],$$

with $y \sim \mu^*$. Since $[1/(n\bar\varepsilon_n^2)]\,\mathbb{E}_y[k(y,y)] \to 0$ as $n \to \infty$ by condition (A1), we have that $D_{MMD}(\hat\mu_{y_{1:n}},\mu^*) \to 0$ in $P_{y_{1:n}}$-probability as $n \to \infty$, thus meeting Assumption 1 in Bernton et al. (2019). Moreover, as a direct consequence of the above derivations, $P_{z_{1:n}}[D_{MMD}(\hat\mu_{z_{1:n}},\mu_\theta) > \bar\varepsilon_n] \le [1/(n\bar\varepsilon_n^2)]\,\mathbb{E}_z[k(z,z)]$. Thus, setting $1/(n\bar\varepsilon_n^2) = f_n(\bar\varepsilon_n)$ and $\mathbb{E}_z[k(z,z)] = c(\theta)$, with $z \sim \mu_\theta$, ensures that $P_{z_{1:n}}[D_{MMD}(\hat\mu_{z_{1:n}},\mu_\theta) > \bar\varepsilon_n] \le c(\theta)f_n(\bar\varepsilon_n)$, with $f_n(u) = 1/(nu^2)$ strictly decreasing in $u$ for any fixed $n$, and $f_n(u) \to 0$ as $n \to \infty$ for fixed $u$. Moreover, by Assumptions (A2)-(A3), $c(\theta) = \mathbb{E}_z[k(z,z)]$ is $\pi$-integrable and there exist a $\delta_0 > 0$ and a $c_0 > 0$ such that $c(\theta) < c_0$ for any $\theta$ satisfying $(\mathbb{E}_{z,z'}[k(z,z')] - 2\,\mathbb{E}_{z,y}[k(y,z)] + \mathbb{E}_{y,y'}[k(y,y')])^{1/2} = D_{MMD}(\mu_\theta,\mu^*) \le \varepsilon^* + \delta_0$. This ensures that Assumption 2 in Bernton et al. (2019) holds. Finally, note that Assumption 3 in Bernton et al.
(2019) is verified by our Assumption (II). Therefore, under the assumptions in Proposition 4.3, it is possible to apply Proposition 3 in Bernton et al. (2019) with $f_n(\bar\varepsilon_n) = 1/(n\bar\varepsilon_n^2)$, $c(\theta) = \mathbb{E}_z[k(z,z)]$ and $R = M_n$, which yields the concentration result in Proposition 4.3.

PROOF OF PROPOSITION 4.4. Assumptions (A1') and (A2') imply that the norm $\|x\|_{\psi_1}$ (see Equation (14) in Lei (2020) for a definition) is uniformly bounded when $x \sim \mu_\theta$ and $x \sim \mu^*$. Thus, we can use (15) in Lei (2020) to obtain $P_{z_{1:n}}(D_{wass}(\mu_\theta, \hat\mu_{z_{1:n}}) > u) \le \exp[-c'n(u - c_1 n^{-1/\max(d,3)})_+^2] =: f_n(u)$ and, similarly, $P_{y_{1:n}}(D_{wass}(\mu^*, \hat\mu_{y_{1:n}}) > u) \le f_n(u)$, where $c'$ depends only on the constant $c$ in (A1') and (A2'), and $c_1$ is a universal constant. Notice that (15) in Lei (2020) requires $d > 2$. If $d = 1$, we can define $x' = (x, 0, 0) \in \mathbb{R}^3$ and apply the result in $\mathbb{R}^3$ (we can proceed similarly if $d = 2$). This is why Proposition 4.4 is stated with $\max(d,3)$. The above bounds ensure that Assumptions 1-2 in Bernton et al. (2019) are met. Moreover, condition (II) verifies Assumption 3 in Bernton et al. (2019). Thus, we can apply Proposition 3 of Bernton et al. (2019) with $f_n(\cdot)$ defined as above and under the vanishing conditions on $\bar\varepsilon_n$ in Proposition 4.4. This implies that, when $n \to \infty$ and $\bar\varepsilon_n \to 0$ such that $f_n(\bar\varepsilon_n) \to 0$, for some $C \in (0,\infty)$ and any $M_n \in (0,\infty)$, with $P_{y_{1:n}}$-probability going to 1 as $n \to \infty$, it holds that

$$\pi^{(\varepsilon^*+\bar\varepsilon_n)}_n(\{\theta : D_{wass}(\mu_\theta,\mu^*) > \varepsilon^* + 4\bar\varepsilon_n/3 + f_n^{-1}(\bar\varepsilon_n^L/M_n)\}) \le C/M_n.$$

Recall that our restriction $n^{-1/\max(d,3)} \ll \bar\varepsilon_n$, together with $n\bar\varepsilon_n^2 \to \infty$, implies that for $n$ large enough $f_n(\cdot)$ is invertible and $f_n(\bar\varepsilon_n) \to 0$.
On such a range, we have that $f_n^{-1}(\bar\varepsilon_n^L/M_n) = [(1/(c'n))\log(M_n/\bar\varepsilon_n^L)]^{1/2} + c_1 n^{-1/\max(d,3)}$, which concludes the proof.

PROOF OF LEMMA C.2. Let $(x_t)_{t\in\mathbb{Z}}$ be a stochastic process with $\beta$-mixing coefficients $\beta(k)$, $k \in \mathbb{N}$. Moreover, denote by $\mu^{(n)}$ and $\mu = \mu^{(1)}$ the joint distribution of a sample $x_{1:n}$ from $(x_t)_{t\in\mathbb{Z}}$ and its constant marginal, respectively. Then, by combining Proposition 2 and Lemma 2 in Mohri and Rostamizadeh (2008) under our notation, we have that, for any $b$-uniformly bounded class $\mathcal{F}$, any integer $n \ge 1$ and any scalar $\delta \ge 0$, the inequality

$$P_{x_{1:n}}\big[D_F(\hat\mu_{x_{1:n}},\mu) > 2R_{\mu,n/(2K)}(\mathcal{F}) + \delta\big] \le 2\exp(-n\delta^2/Kb^2) + 2(n/2K - 1)\beta(K)$$

holds for every integer $K > 0$ such that $n/(2K) \in \mathbb{N}$, where $R_{\mu,n/(2K)}$ is the Rademacher complexity based on an i.i.d. sample of size $n/(2K)$ from $\mu$; see Definition 2.5. The above concentration inequality also implies

(D.8) $$P_{x_{1:n}}\big[D_F(\hat\mu_{x_{1:n}},\mu) \le 2R_{\mu,n/(2K)}(\mathcal{F}) + \delta\big] \ge 1 - 2\exp(-n\delta^2/Kb^2) - 2(n/2K - 1)\beta(K) \ge 1 - 2\exp[-n\delta^2/(4Kb^2)] - 2(n/2K)\beta(K).$$

Notice that, in order to ensure that both $2\exp[-n\delta^2/(4Kb^2)]$ and $2(n/2K)\beta(K)$ vanish to zero (under Assumption (VII) for $\beta(K)$), it is tempting to apply (D.8) with $K = n^\alpha$ for some $0 < \alpha < 1$. Unfortunately, there is no reason for such a $K$ to be an integer. A solution would be to let $K = \lfloor n^\alpha\rfloor$, but in this case $n/(2K)$ might not be an integer. To address these issues, it is necessary to consider a careful modification of (D.8). To this end, write the Euclidean division $n = n' + r$, where $n' = 2Kh \le n$, $h = \lfloor n/(2K)\rfloor$ and $0 \le r < 2K$.
Then, under the common marginal assumption, and recalling also Proposition 1 in Mohri and Rostamizadeh (2008) together with the triangle inequality and the fact that the functions $f$ within $\mathcal{F}$ are $b$-uniformly bounded, we have that
$$D_{\mathcal{F}}(\hat\mu_{x_{1:n}}, \mu) = \sup_{f\in\mathcal{F}} \Big| \frac{1}{n}\sum_{i=1}^{n} [f(x_i) - \mathbb{E}_\mu f(x)] \Big| \le \sup_{f\in\mathcal{F}} \Big| \frac{1}{n}\sum_{i=1}^{n'} [f(x_i) - \mathbb{E}_\mu f(x)] \Big| + \frac{1}{n}\sup_{f\in\mathcal{F}} \Big| \sum_{i=n'+1}^{n'+r} [f(x_i) - \mathbb{E}_\mu f(x)] \Big|$$
$$\le \sup_{f\in\mathcal{F}} \Big| \frac{1}{n'}\sum_{i=1}^{n'} [f(x_i) - \mathbb{E}_\mu f(x)] \Big| + \frac{2br}{n} \le D_{\mathcal{F}}(\hat\mu_{x_{1:n'}}, \mu) + \frac{4bK}{n},$$
where the last inequality follows directly from the definition of $D_{\mathcal{F}}(\hat\mu_{x_{1:n'}}, \mu)$ together with the fact that $0 \le r < 2K$. Applying now (D.8) to $D_{\mathcal{F}}(\hat\mu_{x_{1:n'}}, \mu)$ yields
$$P_{x_{1:n}}\big(D_{\mathcal{F}}(\hat\mu_{x_{1:n'}}, \mu) + 4bK/n \le 2\mathcal{R}_{\mu,h}(\mathcal{F}) + \delta + 4bK/n\big) \ge 1 - 2\exp(-h\delta^2/(2b^2)) - 2h\beta(K).$$
Therefore, since $D_{\mathcal{F}}(\hat\mu_{x_{1:n}}, \mu) \le D_{\mathcal{F}}(\hat\mu_{x_{1:n'}}, \mu) + 4bK/n$, we also have that
$$P_{x_{1:n}}\big[D_{\mathcal{F}}(\hat\mu_{x_{1:n}}, \mu) \le 2\mathcal{R}_{\mu,h}(\mathcal{F}) + \delta + 4bK/n\big] \ge 1 - 2\exp(-h\delta^2/(2b^2)) - 2h\beta(K).$$
To conclude the proof, notice that to prove convergence and concentration of the ABC posterior it will be sufficient to let $K = \lfloor\sqrt{n}\rfloor$. Therefore, replacing $K = \lfloor\sqrt{n}\rfloor$ in the above inequality and within the expression for $h = \lfloor n/(2K)\rfloor$, we have
$$P_{x_{1:n}}\Big[D_{\mathcal{F}}(\hat\mu_{x_{1:n}}, \mu) \le 2\mathcal{R}_{\mu,s_n}(\mathcal{F}) + \frac{4b}{\sqrt{n}} + \delta\Big] \ge 1 - 2\exp(-s_n\delta^2/(2b^2)) - 2 s_n \beta(\lfloor\sqrt{n}\rfloor),$$
where $s_n = \lfloor n/(2\lfloor\sqrt{n}\rfloor)\rfloor$ and the term $4b/\sqrt{n}$ follows directly from the fact that $\lfloor\sqrt{n}\rfloor/n \le \sqrt{n}/n = 1/\sqrt{n}$.

PROOF OF LEMMA C.5. The proof follows the arguments used in Chapter 2 of Doukhan (1994) to study general Markov chains.
In particular, when $(x_t)_{t\in\mathbb{Z}}$ is a stationary Markov chain with invariant distribution $\pi(\cdot)$ and transition kernel $P(\cdot,\cdot)$, a result proven in Davydov (1974) and recalled in pages 87–88 of Doukhan (1994) gives $\beta(k) = \mathbb{E}_{x\sim\pi}\|P^k(x,\cdot) - \pi(\cdot)\|_{TV}$. When $-1 < \theta < 1$, standard results for the AR(1) model in Lemma C.5 lead to the invariant distribution $\pi = N(\psi/(1-\theta), \sigma^2/(1-\theta^2))$. As for $P^k(x,\cdot)$, notice that, under such an AR(1) model with starting point $x$, we can write
$$x_k = \theta x_{k-1} + \psi + \varepsilon_k = \theta(\theta x_{k-2} + \psi + \varepsilon_{k-1}) + \psi + \varepsilon_k = \cdots = \theta^k x + \psi\sum_{l=0}^{k-1}\theta^l + \sum_{l=0}^{k-1}\theta^l\varepsilon_{k-l}.$$
Therefore, $P^k(x,\cdot) = N\big(\theta^k x + \psi\sum_{l=0}^{k-1}\theta^l,\ \sigma^2\sum_{l=0}^{k-1}\theta^{2l}\big)$. Moreover, by direct application of standard properties of finite power series, we have $\sum_{l=0}^{k-1}\theta^l = (1-\theta^k)/(1-\theta)$ and $\sum_{l=0}^{k-1}\theta^{2l} = (1-\theta^{2k})/(1-\theta^2)$. Since our goal is to derive an upper bound for $\beta(k)$, and since the KL divergence between Gaussian densities is available in closed form, let us first consider Pinsker's inequality $\|P^k(x,\cdot) - \pi(\cdot)\|_{TV} \le [D_{KL}(P^k(x,\cdot), \pi(\cdot))/2]^{1/2}$, where $D_{KL}$ denotes the KL divergence. Since both $P^k(x,\cdot)$ and $\pi(\cdot)$ are Gaussian, then
$$D_{KL}(P^k(x,\cdot), \pi(\cdot)) = \frac{1}{2}\Big[\frac{\sigma^2(1-\theta^{2k})}{\sigma^2} - 1 + \frac{[\theta^k x + \psi(1-\theta^k)/(1-\theta) - \psi/(1-\theta)]^2}{\sigma^2/(1-\theta^2)} + \log\frac{\sigma^2}{\sigma^2(1-\theta^{2k})}\Big]$$
$$= \frac{1}{2}\Big[(1-\theta^{2k}) - 1 + \frac{\theta^{2k}[x - \psi/(1-\theta)]^2}{\sigma^2/(1-\theta^2)} + \log\frac{1}{1-\theta^{2k}}\Big] = \frac{1}{2}\Big[-\theta^{2k} + \frac{\theta^{2k}[x - \psi/(1-\theta)]^2}{\sigma^2/(1-\theta^2)} + \log\Big(1 + \frac{\theta^{2k}}{1-\theta^{2k}}\Big)\Big]$$
$$\le \frac{1}{2}\Big[-\theta^{2k} + \frac{\theta^{2k}[x - \psi/(1-\theta)]^2}{\sigma^2/(1-\theta^2)} + \frac{\theta^{2k}}{1-\theta^{2k}}\Big] \le \frac{\theta^{2k}}{2}\Big[-1 + \frac{[x - \psi/(1-\theta)]^2}{\sigma^2/(1-\theta^2)} + \frac{1}{1-\theta^2}\Big].$$
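Since every quantity in the display above is available in closed form, the algebra can be double-checked numerically. The sketch below (with arbitrary AR(1) parameter values) compares the simplified expression for $D_{KL}(P^k(x,\cdot), \pi(\cdot))$ with the generic Gaussian formula $\frac{1}{2}[v_1/v_2 - 1 + (m_1 - m_2)^2/v_2 + \log(v_2/v_1)]$, and also checks that $[\mathbb{E}_{x\sim\pi} D_{KL}/2]^{1/2}$ is dominated by the mixing bound $|\theta|^k/(2\sqrt{1-\theta^2})$ stated in Lemma C.5.

```python
import math

def kl_gauss(m1, v1, m2, v2):
    # KL( N(m1, v1) || N(m2, v2) ) for univariate Gaussians
    return 0.5 * (v1 / v2 - 1.0 + (m1 - m2) ** 2 / v2 + math.log(v2 / v1))

theta, psi, sigma2 = 0.8, 0.5, 1.3   # arbitrary AR(1) parameters, |theta| < 1
m2 = psi / (1 - theta)               # invariant mean
v2 = sigma2 / (1 - theta ** 2)       # invariant variance
for k in range(1, 20):
    for x in [-3.0, -0.5, 0.0, 1.7]:
        # k-step transition kernel P^k(x, .)
        m1 = theta ** k * x + psi * (1 - theta ** k) / (1 - theta)
        v1 = sigma2 * (1 - theta ** (2 * k)) / (1 - theta ** 2)
        # simplified form derived in the proof
        simplified = 0.5 * (-theta ** (2 * k)
                            + theta ** (2 * k) * (x - m2) ** 2 / v2
                            + math.log(1.0 / (1 - theta ** (2 * k))))
        assert math.isclose(kl_gauss(m1, v1, m2, v2), simplified,
                            rel_tol=1e-9, abs_tol=1e-12)
    # E_pi KL has a closed form, since E_pi (x - m2)^2 = v2
    exp_kl = 0.5 * math.log(1.0 / (1 - theta ** (2 * k)))
    bound = abs(theta) ** k / (2 * math.sqrt(1 - theta ** 2))
    assert math.sqrt(exp_kl / 2) <= bound + 1e-12
```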
Therefore, by leveraging the above result, together with standard properties of the expectation, we have
$$\beta(k) \le \mathbb{E}_{x\sim\pi}\,[D_{KL}(P^k(x,\cdot),\pi(\cdot))/2]^{1/2} \le [\mathbb{E}_{x\sim\pi}\,D_{KL}(P^k(x,\cdot),\pi(\cdot))/2]^{1/2}$$
$$\le \Big\{\frac{\theta^{2k}}{4}\Big[-1 + \frac{\mathbb{E}_{x\sim\pi}[x - \psi/(1-\theta)]^2}{\sigma^2/(1-\theta^2)} + \frac{1}{1-\theta^2}\Big]\Big\}^{1/2} = \sqrt{\frac{\theta^{2k}}{4(1-\theta^2)}} = \frac{|\theta|^k}{2\sqrt{1-\theta^2}},$$
which concludes the proof.

PROOF OF PROPOSITION C.6. Under Assumption (VII), for any fixed $\delta > 0$, we have that $\sum_{n>0}[2\exp(-s_n\delta^2/(2b^2)) + 2 s_n C_\beta\exp(-\gamma\lfloor\sqrt{n}\rfloor^\xi)] < \infty$. Therefore, combining Lemma C.2 with Assumption (IV), both $D_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \mu_\theta)$ and $D_{\mathcal{F}}(\hat\mu_{y_{1:n}}, \mu^*)$ converge to 0 almost surely as $n \to \infty$, by the Borel–Cantelli lemma. As a result, since
$$-D_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \mu_\theta) - D_{\mathcal{F}}(\hat\mu_{y_{1:n}}, \mu^*) \le D_{\mathcal{F}}(\mu_\theta, \mu^*) - D_{\mathcal{F}}(\hat\mu_{y_{1:n}}, \hat\mu_{z_{1:n}}) \le D_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \mu_\theta) + D_{\mathcal{F}}(\hat\mu_{y_{1:n}}, \mu^*),$$
it holds that $D_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \hat\mu_{y_{1:n}}) \to D_{\mathcal{F}}(\mu_\theta, \mu^*)$ almost surely as $n \to \infty$. To conclude, it suffices to apply again the proof of Theorem 1 in Jiang, Wu and Wong (2018); see also the proof of Corollary 3.2.

PROOF OF THEOREM C.7. To prove Theorem C.7 we will follow the same line of reasoning as in the proof of Theorem 3.3. However, in this case we leverage Lemma C.2 instead of Lemma 2.6. To this end, letting $\delta = c_1 - 2\mathcal{R}_{s_n}(\mathcal{F}) - 4b/\sqrt{n}$ with $c_1 \ge 2\mathcal{R}_{s_n}(\mathcal{F}) + 4b/\sqrt{n}$ and $\mathcal{R}_{s_n} = \sup_{\mu\in\mathcal{P}(\mathbb{Y})}\mathcal{R}_{\mu,s_n}$, we obtain, under Assumption (VII), Equation (D.9) below, instead of (D.3):
(D.9)
$$P_{x_{1:n}}\big[D_{\mathcal{F}}(\hat\mu_{x_{1:n}}, \mu) \le c_1\big] \ge 1 - 2\exp\big[-s_n\big(c_1 - 2\mathcal{R}_{s_n}(\mathcal{F}) - 4b/\sqrt{n}\big)^2/(2b^2)\big] - 2 s_n C_\beta\exp(-\gamma\lfloor\sqrt{n}\rfloor^\xi).$$
As in Theorem 3.3, let $c_1 = \bar\varepsilon_n/3$ and notice that, by the settings of Theorem C.7, for $n$ large enough $\bar\varepsilon_n/3 > 2\mathcal{R}_{s_n}(\mathcal{F}) + 4b/\sqrt{n}$.
Therefore, applying (D.9) to $y_{1:n}$, with $c_1 = \bar\varepsilon_n/3$, leads to the following lower bound:
$$P_{y_{1:n}}\big[D_{\mathcal{F}}(\hat\mu_{y_{1:n}}, \mu^*) \le \bar\varepsilon_n/3\big] \ge 1 - 2\exp\big[-s_n\big(\bar\varepsilon_n/3 - 2\mathcal{R}_{s_n}(\mathcal{F}) - 4b/\sqrt{n}\big)^2/(2b^2)\big] - 2 s_n C_\beta\exp(-\gamma\lfloor\sqrt{n}\rfloor^\xi).$$
Recall that, under the settings of Theorem C.7, we have that $\sqrt{n}\,\bar\varepsilon_n^2 \to \infty$ and $\bar\varepsilon_n/\mathcal{R}_{s_n}(\mathcal{F}) \to \infty$; therefore, $s_n(\bar\varepsilon_n/3 - 2\mathcal{R}_{s_n}(\mathcal{F}) - 4b/\sqrt{n})^2 \sim s_n\bar\varepsilon_n^2/9 \sim \sqrt{n}\,\bar\varepsilon_n^2 \to \infty$. Combining this result with the fact that $2 s_n C_\beta\exp(-\gamma\lfloor\sqrt{n}\rfloor^\xi) \to 0$ as $n \to \infty$, it follows that the lower bound for $P_{y_{1:n}}[D_{\mathcal{F}}(\hat\mu_{y_{1:n}}, \mu^*) \le \bar\varepsilon_n/3]$ goes to 1 as $n \to \infty$. Hence, in the remaining part of the proof, we will restrict to the event $\{D_{\mathcal{F}}(\hat\mu_{y_{1:n}}, \mu^*) \le \bar\varepsilon_n/3\}$.

Let $P_{\theta,z_{1:n}}$ correspond to the joint distribution of $\theta \sim \pi$ and $z_{1:n}$ from $\mu_\theta^{(n)}$. Then, as a direct consequence of the definition of conditional probability, for every positive $c_2$, including $c_2 > 2\mathcal{R}_{s_n}(\mathcal{F}) + 4b/\sqrt{n}$, it follows that
(D.10)
$$\pi_n^{(\varepsilon^* + \bar\varepsilon_n)}\big(\big\{\theta : D_{\mathcal{F}}(\mu_\theta, \mu^*) > \varepsilon^* + 4\bar\varepsilon_n/3 + c_2\big\}\big) = \frac{P_{\theta,z_{1:n}}\big[D_{\mathcal{F}}(\mu_\theta, \mu^*) > \varepsilon^* + 4\bar\varepsilon_n/3 + c_2,\ D_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \hat\mu_{y_{1:n}}) \le \varepsilon^* + \bar\varepsilon_n\big]}{P_{\theta,z_{1:n}}\big[D_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \hat\mu_{y_{1:n}}) \le \varepsilon^* + \bar\varepsilon_n\big]}.$$
To upper bound the ratio in (D.10), let us first derive an upper bound for the numerator. To this end, consider the triangle inequality $D_{\mathcal{F}}(\mu_\theta, \mu^*) \le D_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \mu_\theta) + D_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \hat\mu_{y_{1:n}}) + D_{\mathcal{F}}(\hat\mu_{y_{1:n}}, \mu^*)$ (recall that $D_{\mathcal{F}}$ is a semimetric), along with the previously-proved result that the event $\{D_{\mathcal{F}}(\hat\mu_{y_{1:n}}, \mu^*) \le \bar\varepsilon_n/3\}$ has $P_{y_{1:n}}$-probability going to 1.
Hence, for $n$ large enough we have
$$P_{\theta,z_{1:n}}\big[D_{\mathcal{F}}(\mu_\theta, \mu^*) > \varepsilon^* + 4\bar\varepsilon_n/3 + c_2,\ D_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \hat\mu_{y_{1:n}}) \le \varepsilon^* + \bar\varepsilon_n\big]$$
$$\le P_{\theta,z_{1:n}}\big[D_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \hat\mu_{y_{1:n}}) + D_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \mu_\theta) + D_{\mathcal{F}}(\hat\mu_{y_{1:n}}, \mu^*) > \varepsilon^* + 4\bar\varepsilon_n/3 + c_2,\ D_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \hat\mu_{y_{1:n}}) \le \varepsilon^* + \bar\varepsilon_n\big]$$
$$\le P_{\theta,z_{1:n}}\big[D_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \mu_\theta) + D_{\mathcal{F}}(\hat\mu_{y_{1:n}}, \mu^*) > \bar\varepsilon_n/3 + c_2\big] \le P_{\theta,z_{1:n}}\big[D_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \mu_\theta) > c_2\big],$$
where $P_{\theta,z_{1:n}}[D_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \mu_\theta) > c_2] = \int_{\theta\in\Theta} P_{z_{1:n}}[D_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \mu_\theta) > c_2 \mid \theta]\,\pi(\mathrm{d}\theta)$. Therefore, leveraging the above result and applying (D.9) to $z_{1:n}$ yields
$$\int_{\theta\in\Theta} P_{z_{1:n}}\big[D_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \mu_\theta) > c_2 \mid \theta\big]\,\pi(\mathrm{d}\theta) = \int_{\theta\in\Theta}\big(1 - P_{z_{1:n}}\big[D_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \mu_\theta) \le c_2 \mid \theta\big]\big)\,\pi(\mathrm{d}\theta)$$
$$\le 2\exp\big[-s_n\big(c_2 - 2\mathcal{R}_{s_n}(\mathcal{F}) - 4b/\sqrt{n}\big)^2/(2b^2)\big] + 2 s_n C_\beta\exp(-\gamma\lfloor\sqrt{n}\rfloor^\xi).$$
This controls the numerator in (D.10). As for the denominator, defining the event $E_n := \{\theta \in \Theta : D_{\mathcal{F}}(\mu_\theta, \mu^*) \le \varepsilon^* + \bar\varepsilon_n/3\}$ and applying again the triangle inequality, we have that
$$P_{\theta,z_{1:n}}\big[D_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \hat\mu_{y_{1:n}}) \le \varepsilon^* + \bar\varepsilon_n\big] \ge \int_{E_n} P_{z_{1:n}}\big[D_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \hat\mu_{y_{1:n}}) \le \varepsilon^* + \bar\varepsilon_n \mid \theta\big]\,\pi(\mathrm{d}\theta)$$
$$\ge \int_{E_n} P_{z_{1:n}}\big[D_{\mathcal{F}}(\hat\mu_{y_{1:n}}, \mu^*) + D_{\mathcal{F}}(\mu_\theta, \mu^*) + D_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \mu_\theta) \le \varepsilon^* + \bar\varepsilon_n \mid \theta\big]\,\pi(\mathrm{d}\theta) \ge \int_{E_n} P_{z_{1:n}}\big[D_{\mathcal{F}}(\hat\mu_{z_{1:n}}, \mu_\theta) \le \bar\varepsilon_n/3 \mid \theta\big]\,\pi(\mathrm{d}\theta).$$
The last inequality follows from the fact that we can restrict to the event $\{D_{\mathcal{F}}(\hat\mu_{y_{1:n}}, \mu^*) \le \bar\varepsilon_n/3\}$, and that we are integrating over $E_n := \{\theta \in \Theta : D_{\mathcal{F}}(\mu_\theta, \mu^*) \le \varepsilon^* + \bar\varepsilon_n/3\}$.
Let us now apply again (D.9) to $z_{1:n}$, with $c_1 = \bar\varepsilon_n/3$, to further lower bound the last term of the above inequality by
$$\int_{E_n}\big[1 - 2\exp[-s_n(\bar\varepsilon_n/3 - 2\mathcal{R}_{s_n}(\mathcal{F}) - 4b/\sqrt{n})^2/(2b^2)] - 2 s_n C_\beta\exp(-\gamma\lfloor\sqrt{n}\rfloor^\xi)\big]\,\pi(\mathrm{d}\theta)$$
$$= \pi(E_n)\big[1 - 2\exp[-s_n(\bar\varepsilon_n/3 - 2\mathcal{R}_{s_n}(\mathcal{F}) - 4b/\sqrt{n})^2/(2b^2)] - 2 s_n C_\beta\exp(-\gamma\lfloor\sqrt{n}\rfloor^\xi)\big].$$
Note that, by Assumption (II), $\pi(E_n) \ge c_\pi(\bar\varepsilon_n/3)^L$. Moreover, as shown before, the quantity $1 - 2\exp[-s_n(\bar\varepsilon_n/3 - 2\mathcal{R}_{s_n}(\mathcal{F}) - 4b/\sqrt{n})^2/(2b^2)] - 2 s_n C_\beta\exp(-\gamma\lfloor\sqrt{n}\rfloor^\xi)$ goes to 1 when $n \to \infty$, which also implies that, for $n$ large enough, this quantity exceeds $1/2$. Therefore, leveraging both results, the denominator in (D.10) can be lower bounded by the term $(c_\pi/2)(\bar\varepsilon_n/3)^L$.

To proceed with the proof, let us combine the upper and lower bounds derived, respectively, for the numerator and the denominator of the ratio in (D.10). This yields
(D.11)
$$\pi_n^{(\varepsilon^* + \bar\varepsilon_n)}\big(\big\{\theta : D_{\mathcal{F}}(\mu_\theta, \mu^*) > \varepsilon^* + 4\bar\varepsilon_n/3 + c_2\big\}\big) \le \frac{2\exp[-s_n(c_2 - 2\mathcal{R}_{s_n}(\mathcal{F}) - 4b/\sqrt{n})^2/(2b^2)] + 2 s_n C_\beta\exp(-\gamma\lfloor\sqrt{n}\rfloor^\xi)}{(c_\pi/2)(\bar\varepsilon_n/3)^L},$$
with $P_{y_{1:n}}$-probability going to 1 as $n \to \infty$.

Replacing $c_2$ in (D.11) with $2\mathcal{R}_{s_n}(\mathcal{F}) + [(2b^2/s_n)\log(n/\bar\varepsilon_n^L)]^{1/2} + 4b/\sqrt{n}$ gives
$$\pi_n^{(\varepsilon^* + \bar\varepsilon_n)}\Big(\Big\{\theta : D_{\mathcal{F}}(\mu_\theta, \mu^*) > \varepsilon^* + \frac{4\bar\varepsilon_n}{3} + 2\mathcal{R}_{s_n}(\mathcal{F}) + \frac{4b}{\sqrt{n}} + \Big(\frac{2b^2}{s_n}\log\frac{n}{\bar\varepsilon_n^L}\Big)^{1/2}\Big\}\Big) \le \frac{4\cdot 3^L}{n c_\pi}\Big(1 + \frac{n s_n C_\beta\exp(-\gamma\lfloor\sqrt{n}\rfloor^\xi)}{\bar\varepsilon_n^L}\Big).$$
To conclude, note that $n s_n C_\beta\exp(-\gamma\lfloor\sqrt{n}\rfloor^\xi)/\bar\varepsilon_n^L = n^{1+L/2} s_n C_\beta\exp(-\gamma\lfloor\sqrt{n}\rfloor^\xi)/(n\bar\varepsilon_n^2)^{L/2}$, where the numerator goes to 0 and the denominator goes to $\infty$ when $n \to \infty$, under the setting of Theorem C.7.
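This vanishing can be illustrated numerically. The sketch below uses arbitrary admissible choices, $L = 2$, $\xi = 1$, $\gamma = C_\beta = 1$, and $\bar\varepsilon_n = n^{-1/8}$ (so that $\sqrt{n}\,\bar\varepsilon_n^2 = n^{1/4} \to \infty$, as required by Theorem C.7), and checks that the ratio $n^{1+L/2} s_n C_\beta\exp(-\gamma\lfloor\sqrt{n}\rfloor^\xi)/(n\bar\varepsilon_n^2)^{L/2}$ shrinks to zero along a growing sequence of sample sizes.

```python
import math

# Arbitrary admissible constants for illustration
L, xi, gamma, C_beta = 2, 1.0, 1.0, 1.0

def ratio(n):
    eps = n ** (-1.0 / 8)           # admissible rate: sqrt(n) * eps^2 -> infinity
    s_n = n // (2 * math.isqrt(n))  # s_n = floor(n / (2 floor(sqrt(n))))
    num = n ** (1 + L / 2) * s_n * C_beta * math.exp(-gamma * math.isqrt(n) ** xi)
    den = (n * eps ** 2) ** (L / 2)
    return num / den

vals = [ratio(n) for n in [10**3, 10**4, 10**5, 10**6]]
assert all(a > b for a, b in zip(vals, vals[1:]))  # monotonically shrinking
assert vals[-1] < 1e-100                           # effectively zero
```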
Therefore, with $P_{y_{1:n}}$-probability going to 1 as $n \to \infty$, we have
$$\pi_n^{(\varepsilon^* + \bar\varepsilon_n)}\Big(\Big\{\theta : D_{\mathcal{F}}(\mu_\theta, \mu^*) > \varepsilon^* + \frac{4\bar\varepsilon_n}{3} + 2\mathcal{R}_{s_n}(\mathcal{F}) + \frac{4b}{\sqrt{n}} + \Big(\frac{2b^2}{s_n}\log\frac{n}{\bar\varepsilon_n^L}\Big)^{1/2}\Big\}\Big) \le \frac{4\cdot 3^L}{n c_\pi},$$
concluding the proof.

REFERENCES

AGRAWAL, R. and HOREL, T. (2021). Optimal bounds between f-divergences and integral probability metrics. Journal of Machine Learning Research 22 1–59.
BARTLETT, P. L., BOUSQUET, O. and MENDELSON, S. (2005). Local Rademacher complexities. Annals of Statistics 33 1497–1537.
BARTLETT, P. L. and MENDELSON, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research 3 463–482.
BERNTON, E., JACOB, P. E., GERBER, M. and ROBERT, C. P. (2019). Approximate Bayesian computation with the Wasserstein distance. Journal of the Royal Statistical Society Series B: Statistical Methodology 81 235–269.
BIRRELL, J., DUPUIS, P., KATSOULAKIS, M. A., PANTAZIS, Y. and REY-BELLET, L. (2022). (f, Γ)-divergences: Interpolating between f-divergences and integral probability metrics. Journal of Machine Learning Research 23 1–70.
BISSIRI, P. G., HOLMES, C. C. and WALKER, S. G. (2016). A general framework for updating belief distributions. Journal of the Royal Statistical Society Series B: Statistical Methodology 78 1103–1130.
CHÉRIEF-ABDELLATIF, B.-E. and ALQUIER, P. (2020). MMD-Bayes: Robust Bayesian estimation via maximum mean discrepancy. In Symposium on Advances in Approximate Bayesian Inference 1–21. PMLR.
CHÉRIEF-ABDELLATIF, B.-E. and ALQUIER, P. (2022). Finite sample properties of parametric MMD estimation: Robustness to misspecification and dependence. Bernoulli 28 181–213.
DAVYDOV, Y. A. (1974).
Mixing conditions for Markov chains. Theory of Probability & Its Applications 18 312–328.
DOUKHAN, P. (1994). Mixing: Properties and Examples 85. Springer Science & Business Media.
DROVANDI, C. and FRAZIER, D. T. (2022). A comparison of likelihood-free methods with and without summary statistics. Statistics and Computing 32 1–23.
DUDLEY, R. M. (2018). Real Analysis and Probability. CRC Press.
DUDLEY, R. M., GINÉ, E. and ZINN, J. (1991). Uniform and universal Glivenko-Cantelli classes. Journal of Theoretical Probability 4 485–510.
DUTTA, R., SCHOENGENS, M., PACCHIARDI, L., UMMADISINGU, A., WIDMER, N., KÜNZLI, P., ONNELA, J.-P. and MIRA, A. (2021). ABCpy: A high-performance computing perspective to approximate Bayesian computation. Journal of Statistical Software 100 1–38.
FEARNHEAD, P. and PRANGLE, D. (2012). Constructing summary statistics for approximate Bayesian computation: Semi-automatic approximate Bayesian computation. Journal of the Royal Statistical Society Series B: Statistical Methodology 74 419–474.
FORBES, F., NGUYEN, H. D., NGUYEN, T. T. and ARBEL, J. (2021). Approximate Bayesian computation with surrogate posteriors. hal-03139256v4.
FOURNIER, N. and GUILLIN, A. (2015). On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields 162 707–738.
FRAZIER, D. T. (2020). Robust and efficient approximate Bayesian computation: A minimum distance approach. arXiv preprint arXiv:2006.14126.
FRAZIER, D. T., KNOBLAUCH, J. and DROVANDI, C. (2024). The impact of loss estimation on Gibbs measures. arXiv preprint arXiv:2404.15649.
FRAZIER, D. T., ROBERT, C. P. and ROUSSEAU, J. (2020).
Model misspecification in approximate Bayesian computation: Consequences and diagnostics. Journal of the Royal Statistical Society Series B: Statistical Methodology 82 421–444.
FRAZIER, D. T., MARTIN, G. M., ROBERT, C. P. and ROUSSEAU, J. (2018). Asymptotic properties of approximate Bayesian computation. Biometrika 105 593–607.
FUJISAWA, M., TESHIMA, T., SATO, I. and SUGIYAMA, M. (2021). γ-ABC: Outlier-robust approximate Bayesian computation based on a robust divergence estimator. In International Conference on Artificial Intelligence and Statistics 1783–1791. PMLR.
GRETTON, A., BORGWARDT, K. M., RASCH, M. J., SCHÖLKOPF, B. and SMOLA, A. (2012). A kernel two-sample test. Journal of Machine Learning Research 13 723–773.
GUTMANN, M. U., DUTTA, R., KASKI, S. and CORANDER, J. (2018). Likelihood-free inference via classification. Statistics and Computing 28 411–425.
HOFMANN, T., SCHÖLKOPF, B. and SMOLA, A. J. (2008). Kernel methods in machine learning. Annals of Statistics 36 1171–1220.
JEWSON, J., SMITH, J. Q. and HOLMES, C. (2018). Principles of Bayesian inference using general divergence criteria. Entropy 20 442.
JIANG, B., WU, T.-Y. and WONG, W. H. (2018). Approximate Bayesian computation with Kullback-Leibler divergence as data discrepancy. In International Conference on Artificial Intelligence and Statistics 1711–1721. PMLR.
LEI, J. (2020). Convergence and concentration of empirical measures under Wasserstein distance in unbounded functional spaces. Bernoulli 26 767–798.
LI, W. and FEARNHEAD, P. (2018). On the asymptotic efficiency of approximate Bayesian computation estimators. Biometrika 105 285–299.
MARIN, J.-M., PUDLO, P., ROBERT, C. P. and RYDER, R. J. (2012).
Approximate Bayesian computational methods. Statistics and Computing 22 1167–1180.
MARIN, J.-M., PILLAI, N. S., ROBERT, C. P. and ROUSSEAU, J. (2014). Relevant statistics for Bayesian model choice. Journal of the Royal Statistical Society Series B: Statistical Methodology 76 833–859.
MASSART, P. (2000). Some applications of concentration inequalities to statistics. Annales de la Faculté des sciences de Toulouse: Mathématiques 9 245–303.
MATSUBARA, T., KNOBLAUCH, J., BRIOL, F.-X. and OATES, C. J. (2022). Robust generalised Bayesian inference for intractable likelihoods. Journal of the Royal Statistical Society Series B: Statistical Methodology 84 997–1022.
MILLER, J. W. and DUNSON, D. B. (2019). Robust Bayesian inference via coarsening. Journal of the American Statistical Association 114 1113–1125.
MOHRI, M. and ROSTAMIZADEH, A. (2008). Rademacher complexity bounds for non-i.i.d. processes. In Advances in Neural Information Processing Systems 21 1097–1104.
MUANDET, K., FUKUMIZU, K., SRIPERUMBUDUR, B. and SCHÖLKOPF, B. (2017). Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning 10 1–141.
MÜLLER, A. (1997). Integral probability metrics and their generating classes of functions. Advances in Applied Probability 29 429–443.
NGUYEN, H. D., ARBEL, J., LÜ, H. and FORBES, F. (2020). Approximate Bayesian computation via the energy statistic. IEEE Access 8 131683–131698.
PARK, M., JITKRITTUM, W. and SEJDINOVIC, D. (2016). K2-ABC: Approximate Bayesian computation with kernel embeddings. In International Conference on Artificial Intelligence and Statistics 398–407. PMLR.
RAMDAS, A., TRILLOS, N. G.
and CUTURI, M. (2017). On Wasserstein two-sample testing and related families of nonparametric tests. Entropy 19 47.
SEJDINOVIC, D., SRIPERUMBUDUR, B., GRETTON, A. and FUKUMIZU, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics 41 2263–2291.
SRIPERUMBUDUR, B. K., FUKUMIZU, K., GRETTON, A., SCHÖLKOPF, B. and LANCKRIET, G. R. (2012). On the empirical estimation of integral probability metrics. Electronic Journal of Statistics 6 1550–1599.
TALAGRAND, M. (1994). The transportation cost from the uniform measure to the empirical measure in dimension ≥ 3. Annals of Probability 22 919–959.
VILLANI, C. (2021). Topics in Optimal Transportation (Second Edition). American Mathematical Society.
WAINWRIGHT, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press.
WANG, Y., KAJI, T. and ROCKOVA, V. (2022). Approximate Bayesian computation via classification. Journal of Machine Learning Research 23 1–49.
WEED, J. and BACH, F. (2019). Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli 25 2620–2648.
