Smoothness Adaptivity in Constant-Depth Neural Networks: Optimal Rates via Smooth Activations
Yuhao Liu$^1$, Zilin Wang$^2$, Lei Wu$^{2,3,4}$, Shaobo Zhang$^2$

$^1$ Department of Mathematical Sciences, Tsinghua University
$^2$ School of Mathematical Sciences, Peking University
$^3$ Center for Machine Learning Research, Peking University
$^4$ AI for Science Institute, Beijing

yh-liu21@mails.tsinghua.edu.cn, wangzilin@stu.pku.edu.cn, leiwu@math.pku.edu.cn, zhangshaobo@stu.pku.edu.cn

All authors contributed equally, and the order follows the alphabetical convention.

Abstract

Smooth activation functions are ubiquitous in modern deep learning, yet their theoretical advantages over non-smooth counterparts remain poorly understood. In this work, we characterize both approximation and statistical properties of neural networks with smooth activations over the Sobolev space $W^{s,\infty}([0,1]^d)$ for arbitrary smoothness $s > 0$. We prove that constant-depth networks equipped with smooth activations automatically exploit arbitrarily high orders of target-function smoothness, achieving the minimax-optimal approximation and estimation error rates (up to logarithmic factors). In sharp contrast, networks with non-smooth activations, such as ReLU, lack this adaptivity: their attainable approximation order is strictly limited by depth, and capturing higher-order smoothness requires proportional depth growth. These results identify activation smoothness as a fundamental mechanism, alternative to depth, for attaining statistical optimality. Technically, our results are established via a constructive approximation framework that produces explicit neural network approximators with carefully controlled parameter norms and model size. This complexity control ensures statistical learnability under empirical risk minimization (ERM) and removes the impractical sparsity constraints commonly required in prior analyses.

1 Introduction

Neural networks constitute a central model class in modern machine learning, with applications spanning computer vision, natural language processing, scientific computing, and generative modeling (Krizhevsky et al., 2012; Vaswani et al., 2017; Raissi et al., 2019; Radford et al., 2019). A central theoretical question concerns how architectural design gives rise to their observed performance. Among architectural choices, activation functions introduce nonlinearity and determine how representations are composed across layers. Clarifying their structural role and their interaction with other architectural dimensions is therefore essential for a principled understanding of neural network expressivity and generalization.

In the early development of neural networks, sigmoid-type activations were introduced as differentiable surrogates of hard threshold units (McCulloch & Pitts, 1943; Rumelhart et al., 1986). However, their saturation in the tails leads to near-zero derivatives, resulting in vanishing gradients and hindering the training of deep networks (Hochreiter, 1991; Bengio et al., 1994). To mitigate this issue, the non-smooth Rectified Linear Unit (ReLU) was introduced and quickly became dominant due to its simplicity and its avoidance of gradient saturation (Nair & Hinton, 2010; Glorot et al., 2011). Since then, ReLU has played a crucial role not only in influential architectures such as AlexNet (Krizhevsky et al., 2012) and ResNet (He et al., 2016), but also in much of the theoretical analysis of deep learning (Schmidt-Hieber, 2020; E et al., 2022).
Recently, activation design has witnessed a renewed embrace of smooth nonlinearities after years of ReLU dominance, though not as a return to classical sigmoid-type functions. Contemporary choices such as the Gaussian Error Linear Unit (GELU) (Hendrycks & Gimpel, 2016) and the Sigmoid Linear Unit (SiLU) (Ramachandran et al., 2017; Elfwing et al., 2018), together with their gated variants like SwiGLU (Shazeer, 2020), are designed to preserve smoothness while mitigating gradient saturation. This emphasis on smoothness is particularly evident in scientific computing, where higher-order derivatives are often required, for example in neural PDE solvers (Raissi et al., 2019; Weinan & Yu, 2018; Lu et al., 2021b). Beyond scientific applications, smooth activations have also become standard components of modern large-scale models. They are adopted in language models such as GPT (Radford et al., 2019), LLaMA (Grattafiori et al., 2024), and DeepSeek (Liu et al., 2024), and are widely used in vision transformers and diffusion models (Dosovitskiy et al., 2021; Ho et al., 2020).

Theory, however, has not kept pace with this shift in practice. Historically, it has evolved along two largely separate directions. Early work on neural networks with smooth activations primarily focused on approximation power (DeVore et al., 1989; Mhaskar, 1996; Pinkus, 1999; Barron, 1993). These results were developed in the context of shallow networks and addressed approximation error alone, without considering statistical generalization or the role of network depth. In contrast, more recent theoretical advances, motivated by the rise of deep learning, have emphasized the role of depth in enhancing expressivity and statistical efficiency, particularly for ReLU networks (Yarotsky, 2017; Telgarsky, 2016; Schmidt-Hieber, 2020). Within this line of work, depth is often identified as the primary mechanism underlying smoothness adaptivity, that is, the ability of the model to adapt to the smoothness of the target function. As a consequence of this historical trajectory, the statistical role of activation smoothness has yet to be systematically characterized. In particular, existing results provide limited insight into how activation smoothness interacts with network depth in contributing to approximation and statistical efficiency, beyond what is achieved through depth alone.

1.1 Our Contributions

In this work, we take a step toward addressing this gap. We study neural networks with smooth activation functions for learning functions in the Sobolev space $W^{s,\infty}([0,1]^d)$ for arbitrary $s > 0$. Our analysis is constructive: we explicitly build constant-depth network approximators with carefully controlled complexity and derive corresponding finite-sample estimation rates. Together, these results provide a unified characterization of approximation and statistical performance under smooth activations. The main contributions are summarized below.

• Smoothness adaptivity at constant depth. We prove that constant-depth (depth 6 or 7, depending on the metric) neural networks equipped with smooth activation functions achieve the optimal approximation rate $O(N^{-s/d})$ for arbitrary smoothness $s > 0$, where $N$ denotes the total number of parameters.
Building upon this constructive approximation, we further establish that learning with ERM attains the minimax-optimal estimation rate $\tilde{O}(n^{-2s/(2s+d)})$ up to logarithmic factors, where $n$ is the sample size. Thus, both approximation and statistical optimality are achieved at constant depth. Unlike prior works (Bauer & Kohler, 2019; Schmidt-Hieber, 2020; Suzuki, 2019; Ohn & Kim, 2019), this adaptivity is "automatic": it requires neither increasing the depth with the sample size nor imposing intractable sparsity constraints.

• Depth bottleneck for non-smooth activations. To highlight the fundamental role of activation smoothness, we establish an approximation lower bound for constant-depth ReLU networks. It shows that non-smooth activations cannot attain the minimax-optimal rate for arbitrary smoothness at fixed depth; rather, their achievable rate is intrinsically limited by depth. This yields a provable separation between smooth and non-smooth activations. Complementary numerical experiments demonstrate that shallow networks with smooth activations exhibit faster generalization convergence when learning smooth targets, empirically supporting the theoretical separation.

In summary, our results provide a smoothness-based perspective on the statistical and approximation advantages of smooth activations over their non-smooth counterparts. Technically, our analysis builds on two main ingredients: (i) a novel multiscale approximation scheme for piecewise constant functions that eliminates the need for sparsity constraints (Appendix B.4); and (ii) a weighted superposition principle that lifts localized approximation guarantees to global $L^\infty$ error bounds (Appendix B.8).

Rethinking the role of depth. Our findings motivate a reconsideration of the role of depth in existing deep learning theory. A substantial body of work establishes approximation and generalization guarantees for deep ReLU networks (Yarotsky, 2017; Liang & Srikant, 2017; Schmidt-Hieber, 2020; Kohler & Langer, 2021; Suzuki, 2019), reinforcing the view that increasing depth is essential for achieving smoothness adaptivity (Telgarsky, 2016; Vardi & Shamir, 2020). By contrast, we show that when smooth activations are employed, constant depth suffices to attain optimal rates over Sobolev classes. Taken together, these results indicate that depth is not the only mechanism underlying smoothness adaptivity; activation regularity itself provides an alternative and theoretically sufficient route.

1.2 Related Work

The theoretical study of neural networks dates back to the universal approximation theorem (Cybenko, 1989; Hornik et al., 1989; Hornik, 1991), which guarantees that shallow networks can approximate continuous functions on a compact domain to arbitrary accuracy. Subsequent research has established quantitative approximation and estimation rates along two main directions.

One research direction aims to understand how neural networks overcome the curse of dimensionality. A representative result is Barron's theorem (Barron, 1993), which identifies function classes admitting dimension-independent approximation rates. This framework has since been substantially extended in multiple directions (DeVore, 1998; Kurková & Sanguineti, 2002; Bach, 2017; E et al., 2018, 2022; Siegel & Xu, 2020; Wu & Long, 2022; Caragea et al., 2023; Siegel & Xu, 2024; Chen et al., 2025).
A complementary direction studies approximation over classical smooth function spaces such as Hölder and Sobolev spaces (Mhaskar & Micchelli, 1995; Mhaskar, 1996; Pinkus, 1999). The central question is whether neural networks can adapt to the smoothness of the target function and achieve the corresponding minimax-optimal rates. Our work belongs to this line of research, where we examine how activation regularity interacts with network depth and width in determining smoothness adaptivity. A detailed comparison with prior results is provided below, with a structured summary in Table 1.

Approximation results. Yarotsky (2017) and Liang & Srikant (2017) show that deep ReLU networks with $N$ nonzero parameters achieve the optimal approximation rate $\tilde{O}(N^{-s/d})$ over $W^{s,\infty}([0,1]^d)$ (DeVore et al., 1989). These results highlight depth as a fundamental mechanism for achieving smoothness adaptivity, and have motivated a substantial body of subsequent work on the approximation power of deep networks (Petersen & Voigtlaender, 2018; Bölcskei et al., 2019; Shen et al., 2019, 2020; Gühring et al., 2020; Kohler & Langer, 2021; Lu et al., 2021a; Suzuki & Nitanda, 2021; Gribonval et al., 2022; Hon & Yang, 2022; Kohler et al., 2022; Shen et al., 2022; Kohler et al., 2023; Siegel, 2023; Yang et al., 2023; Zhang et al., 2024b,a; Yang & He, 2025). However, the underlying constructions require the depth to increase with the target accuracy $\epsilon$ or the smoothness $s$. In this sense, smoothness adaptivity is achieved through increasing depth. For smooth activation functions, classical results show that shallow networks with infinitely differentiable, non-polynomial activations can achieve optimal approximation rates (Mhaskar, 1996; Pinkus, 1999).

Table 1: Comparison of approximation and learning results under different activation functions and architectural regimes. We compare prior work with our results along three key dimensions: depth requirement, sparsity constraint, and norm control. Unless otherwise stated, the listed methods achieve the optimal approximation rate $O(N^{-s/d})$ and nearly optimal estimation rate $\tilde{O}(n^{-2s/(2s+d)})$, where $N$ and $n$ denote the number of non-zero parameters and samples, respectively. In summary, existing results typically require either (i) depth growing with the target accuracy or smoothness, (ii) sparsity constraints, or (iii) smoothness saturation. Our result is the only one that simultaneously achieves constant depth, explicit norm control, no sparsity constraints, and adaptivity to arbitrarily high smoothness orders.

| Activation | Reference | Depth | Free of sparsity | Norm control | Remark |
|---|---|---|---|---|---|
| ReLU | Yarotsky (2017), Liang & Srikant (2017), Schmidt-Hieber (2020) | $O(\log(1/\epsilon))$ | × | ✓ | |
| ReLU | Kohler & Langer (2021) | $O(\log(1/\epsilon))$ | ✓ | ✓ | |
| ReLU | Petersen & Voigtlaender (2018), Nakada & Imaizumi (2020) | $O(s \log s)$ | × | ✓ | |
| ReLU | Yang & Zhou (2024) | 2 | ✓ | ✓ | $s < \frac{d+3}{2}$ |
| ReLU$^k$ | Petrushev (1998), Pinkus (1999), Yang & Zhou (2025) | 2 | ✓ | ✓ | $s < \frac{2k+d+1}{2}$ |
| Smooth | Mhaskar (1996), Pinkus (1999) | 2 | ✓ | × | No learning guarantees |
| Smooth | De Ryck et al. (2021), Bauer & Kohler (2019) | 3 | × | ✓ | |
| Smooth | Ohn & Kim (2019) | $O(\log(1/\epsilon))$ | × | ✓ | |
| Smooth | Ours | 6, 7 | ✓ | ✓ | |
However, these constructions lack explicit complexity control and often require extremely large parameter magnitudes, leaving their statistical learnability unclear. Recent works incorporate complexity control either through sparsity constraints (De Ryck et al., 2021) or via higher-order but non-smooth activations such as ReLU$^k$ (Mao et al., 2024; Yang & Zhou, 2025), where smoothness adaptivity remains restricted. In this work, we show that constant-depth networks equipped with general smooth activations achieve full smoothness adaptivity, namely the optimal approximation rate for arbitrarily high smoothness orders, while maintaining explicit norm control and without imposing sparsity constraints. In contrast to ReLU-based analyses, depth growth is not required to attain these rates.

Generalization results. Beyond approximation, a central question is whether neural networks can attain optimal finite-sample estimation rates. For deep ReLU networks, optimal rates over Sobolev-type spaces have been established (Schmidt-Hieber, 2020; Suzuki, 2019). These guarantees, however, rely either on sparsity constraints or on increasing network depth; even constructions that avoid explicit sparsity still require depth growth (Kohler & Langer, 2021). For networks with smooth activations, generalization theory remains comparatively less developed. Existing results typically impose sparsity constraints (Bauer & Kohler, 2019; Ohn & Kim, 2019) or provide non-adaptive guarantees, such as those established for ReLU$^k$ networks (Yang & Zhou, 2024, 2025). Building on our complexity-controlled approximation results, we show that constant-depth networks equipped with general smooth activations achieve minimax-optimal estimation rates, without imposing sparsity constraints and without requiring depth growth.

2 Preliminaries

Notations. We write $\mathbb{N} := \{0, 1, 2, \ldots\}$ for the set of non-negative integers, and $\mathbb{N}^d := \{(\alpha_1, \ldots, \alpha_d) : \alpha_i \in \mathbb{N},\ i = 1, \ldots, d\}$ for its $d$-fold Cartesian product. For a multi-index $\alpha = (\alpha_1, \ldots, \alpha_d) \in \mathbb{N}^d$, we define $|\alpha| := \sum_{i=1}^d \alpha_i$. For a positive integer $K$, we write $[K] := \{1, 2, \ldots, K\}$ and $[K]^d$ for its $d$-fold Cartesian product. For non-negative functions $f$ and $g$, we write $f(x) \lesssim g(x)$ (equivalently, $f(x) = O(g(x))$) to mean that there exists a constant $C > 0$ such that $f(x) \leqslant C g(x)$ for all $x$ under consideration. We write $f(x) = \tilde{O}(g(x))$ to suppress polylogarithmic factors, i.e., $f(x) = O(g(x)\,\mathrm{polylog}(x))$. Moreover, we write $f(x) \eqsim g(x)$ if both $f(x) \lesssim g(x)$ and $g(x) \lesssim f(x)$ hold. For a region $D \subset \mathbb{R}^d$, the indicator function is denoted by $\mathbf{1}_D(\cdot)$, i.e., $\mathbf{1}_D(x) = 1$ if $x \in D$ and $0$ otherwise.

The target function class. Throughout the paper, we fix the domain $\Omega = [0,1]^d$ and consider the Sobolev space $W^{s,\infty}(\Omega)$ with $s > 0$, which serves as the target function class in our analysis. Concretely, $W^{s,\infty}(\Omega)$ is defined as follows:

• (Integer-order Sobolev spaces) For $s \in \mathbb{N}$, the space $W^{s,\infty}(\Omega)$ consists of functions $u \in L^\infty(\Omega)$ such that the weak derivatives $D^\alpha u \in L^\infty(\Omega)$ for all $\alpha \in \mathbb{N}^d$ with $|\alpha| \leqslant s$. The norm is given by
$$\|u\|_{W^{s,\infty}(\Omega)} = \max_{|\alpha| \leqslant s} \|D^\alpha u\|_{L^\infty(\Omega)}, \qquad u \in W^{s,\infty}(\Omega).$$

• (Fractional-order Sobolev spaces) Let $s = m + \zeta$, where $m = \lfloor s \rfloor$ and $\zeta \in (0,1)$. For $u \in W^{m,\infty}(\Omega)$, define the Hölder seminorm
$$[u]_{C^{m,\zeta}(\Omega)} = \max_{|\alpha| = m} \sup_{x \neq y \in \Omega} \frac{|D^\alpha u(x) - D^\alpha u(y)|}{\|x - y\|_2^{\zeta}}.$$
We define $W^{s,\infty}(\Omega) = \{u \in W^{m,\infty}(\Omega) : [u]_{C^{m,\zeta}(\Omega)} < \infty\}$, equipped with the norm
$$\|u\|_{W^{s,\infty}(\Omega)} = \|u\|_{W^{m,\infty}(\Omega)} + [u]_{C^{m,\zeta}(\Omega)}.$$

The neural network model. We consider the model of fully-connected networks with activation function $\phi: \mathbb{R} \to \mathbb{R}$, applied componentwise. For depth $L \geqslant 1$, width $M \geqslant 1$, and parameter bound $B > 0$, let
$$\mathcal{H}_{\phi,L}(d_{\mathrm{in}}, d_{\mathrm{out}}, M, B) = \Big\{ x \mapsto W_L\,\phi\big(W_{L-1}\,\phi(\cdots \phi(W_1 x + b_1)\cdots) + b_{L-1}\big) + b_L : (W_\ell, b_\ell)_{\ell=1}^L \in \Theta \Big\},$$
where the parameter set $\Theta = \Theta(L, d_{\mathrm{in}}, d_{\mathrm{out}}, M, B)$ is given by
$$\Theta := \Big\{ (W_\ell, b_\ell)_{\ell=1}^L : W_1 \in \mathbb{R}^{M \times d_{\mathrm{in}}},\ W_\ell \in \mathbb{R}^{M \times M}\ (2 \leqslant \ell \leqslant L-1),\ W_L \in \mathbb{R}^{d_{\mathrm{out}} \times M},$$
$$\qquad\qquad b_\ell \in \mathbb{R}^{M}\ (1 \leqslant \ell \leqslant L-1),\ b_L \in \mathbb{R}^{d_{\mathrm{out}}},\ \|W_\ell\|_{\infty,\infty} \leqslant B,\ \|b_\ell\|_\infty \leqslant B\ (1 \leqslant \ell \leqslant L) \Big\},$$
where $\|W\|_{\infty,\infty} := \max_{i,j} |W_{ij}|$ and $\|b\|_\infty := \max_i |b_i|$.
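For concreteness, the following is a minimal PyTorch sketch of the class $\mathcal{H}_{\phi,L}(d_{\mathrm{in}}, d_{\mathrm{out}}, M, B)$; the class name, the default sizes, and the choice $\phi = \mathrm{GELU}$ are illustrative and not part of the construction used in our proofs.

```python
import torch
import torch.nn as nn

class ConstantDepthMLP(nn.Module):
    """Fully connected network in H_{phi,L}(d_in, d_out, M, B):
    L affine maps, L-1 hidden layers of width M, activation phi applied
    componentwise, and all weight/bias entries kept in [-B, B]."""

    def __init__(self, d_in, d_out, L=6, M=128, B=10.0, phi=nn.GELU()):
        super().__init__()
        dims = [d_in] + [M] * (L - 1) + [d_out]
        self.layers = nn.ModuleList(
            [nn.Linear(dims[i], dims[i + 1]) for i in range(L)]
        )
        self.phi, self.B = phi, B

    def clip_parameters(self):
        # enforce ||W_l||_{infty,infty} <= B and ||b_l||_infty <= B
        with torch.no_grad():
            for p in self.parameters():
                p.clamp_(-self.B, self.B)

    def forward(self, x):
        for layer in self.layers[:-1]:
            x = self.phi(layer(x))
        return self.layers[-1](x)   # no activation after the last affine map

net = ConstantDepthMLP(d_in=3, d_out=1)
print(net(torch.rand(5, 3)).shape)   # torch.Size([5, 1])
```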
The activation assumptions. Our primary objective is to understand how structural properties of the activation function influence the approximation power of neural networks. Specifically, we impose the following assumptions on $\phi$.

Assumption 2.1 (Smoothness). The activation function $\phi$ is infinitely differentiable and is not a polynomial.

Assumption 2.2 (Lipschitz continuity). $|\phi(x) - \phi(y)| \leqslant \|\phi\|_{\mathrm{Lip}}\, |x - y|$ for all $x, y \in \mathbb{R}$.

This condition ensures that the activation grows at most linearly at infinity. It excludes activation functions with super-linear growth, such as SwiGLU (Shazeer, 2020), whose tail behavior is quadratic. We emphasize that this assumption is imposed for technical simplicity; our analysis can be extended to activations with polynomial growth tails with minor modifications.

Assumption 2.3. The activation function $\phi$ satisfies one of the following:

• (Heaviside-like) There exists a constant $C_1 > 0$ such that
$$|\phi(t) - H(t)| \leqslant C_1 \min\Big\{1, \frac{1}{|t|}\Big\}, \qquad \forall\, t \in \mathbb{R}, \qquad (1)$$
where $H: \mathbb{R} \to \mathbb{R}$ is the Heaviside function, $H(t) = 1$ for $t \geqslant 0$ and $H(t) = 0$ for $t < 0$.

• (ReLU-like) There exists a constant $C_2 > 0$ such that
$$|\phi(t) - \mathrm{ReLU}(t)| \leqslant C_2, \qquad \forall\, t \in \mathbb{R}, \qquad (2)$$
where $\mathrm{ReLU}(t) = \max\{t, 0\}$.

Most activation functions used in practice, including GELU, SiLU, tanh, and the sigmoid function, satisfy the above assumptions. In particular, GELU and SiLU are ReLU-like, whereas $\frac{1}{2}(1 + \tanh(\cdot))$ and the sigmoid function are Heaviside-like. See Appendix A.2 for detailed verification.

3 Approximation Theory: Smoothness as an Alternative to Depth

The following theorem establishes an $L^2$ approximation of functions in $W^{s,\infty}([0,1]^d)$ by constant-depth neural networks.

Theorem 3.1 ($L^2$ approximation). Let $\phi$ satisfy Assumptions 2.1–2.3. For any $s > 0$ and any $f_\star \in W^{s,\infty}([0,1]^d)$ with $\|f_\star\|_{W^{s,\infty}([0,1]^d)} \leqslant 1$, and for any $\epsilon \in (0,1)$, there exists a constant-depth neural network $g \in \mathcal{H}_{\phi,L}(d, 1, M_\epsilon, B_\epsilon)$ with
$$L = 6, \qquad M_\epsilon \lesssim \Big(\frac{1}{\epsilon}\Big)^{\frac{d}{2s}}, \qquad B_\epsilon \lesssim \Big(\frac{1}{\epsilon}\Big)^{\max\big\{\frac{d^2}{2s} + d,\ \frac{d}{s} + 2,\ \frac{d+4}{2s} + 5,\ \lceil s \rceil\big\}}, \qquad (3)$$
such that $\|g - f_\star\|_{L^2([0,1]^d)} \leqslant \epsilon$.

A proof sketch of this theorem is provided in Section 6.1 and the detailed proof can be found in Appendix B. Theorem 3.1 establishes that smooth activation functions enable constant-depth neural networks to achieve adaptivity to arbitrarily high smoothness orders of the target function. In particular, to attain approximation accuracy $\epsilon$, the network width satisfies $M_\epsilon \lesssim \epsilon^{-\frac{d}{2s}}$, so that, under constant depth, the total number of parameters scales as $O(M_\epsilon^2) = O(\epsilon^{-\frac{d}{s}})$. This matches the optimal nonlinear approximation rate for functions in $W^{s,\infty}([0,1]^d)$. By comparison, the results of De Ryck et al. (2021) achieve a comparable approximation order only under sparsity constraints on the number of nonzero parameters, while the total parameter count remains of higher order. Moreover, in our construction the parameter norms are polynomially controlled, $B_\epsilon = O(\mathrm{poly}(\epsilon^{-1}))$, a property that ensures statistical learnability under finite samples.
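As a rough numerical illustration of the scaling in (3), the snippet below tabulates the width and total parameter count of the depth-6 construction; setting every constant hidden by $\lesssim$ to one is an assumption made purely for illustration.

```python
def network_size(eps, s, d, L=6):
    """Illustrative width and parameter count for the construction of Theorem 3.1
    (constants suppressed by the paper's "lesssim" notation are set to 1)."""
    M = eps ** (-d / (2 * s))                       # width M_eps ~ eps^{-d/(2s)}
    # parameters: first layer M*d, L-2 hidden layers M*M, last layer M, plus biases
    n_params = M * d + (L - 2) * M * M + M + (L - 1) * M + 1
    return M, n_params

for eps in [1e-1, 1e-2, 1e-3]:
    M, N = network_size(eps, s=2.0, d=4)
    print(f"eps={eps:.0e}  width~{M:,.0f}  params~{N:,.0f}  eps^(-d/s)={eps ** (-4 / 2):,.0f}")
```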
Remark 3.2 (Width–norm trade-off). Theorem 3.1 provides a sufficient joint scaling of the network width and parameter norm to guarantee a target approximation accuracy. In particular, it establishes that there exist $(M_\epsilon, B_\epsilon)$ with polynomial dependence on $\epsilon^{-1}$ such that the desired approximation error is achieved. The result, however, does not fully characterize the trade-off between width and parameter norm. That is, we do not determine the best achievable approximation accuracy under a prescribed width $W$ and parameter-norm bound $B$. Such a characterization would provide a more refined understanding of the approximation process. For the purposes of this paper, the above result is sufficient, as the subsequent generalization analysis only requires explicit polynomial control of both width and norm.

As a refinement of Theorem 3.1, we also establish an $L^\infty$ approximation result.

Theorem 3.3 ($L^\infty$ approximation). Suppose $\phi$ and $f_\star$ satisfy the assumptions of Theorem 3.1. Then, for any $\epsilon \in (0,1)$, there exists a neural network $g \in \mathcal{H}_{\phi,L}(d, 1, M_\epsilon, B_\epsilon)$ with
$$L = 7, \qquad M_\epsilon \lesssim \Big(\frac{1}{\epsilon}\Big)^{\frac{d}{2s}}, \qquad B_\epsilon \lesssim \Big(\frac{1}{\epsilon}\Big)^{\max\big\{\frac{d^2}{2s} + d,\ \frac{d}{s} + 2,\ \frac{d+4}{2s} + 1,\ \lceil s \rceil,\ 6s + 4\big\}}, \qquad (4)$$
such that $\|g - f_\star\|_{L^\infty([0,1]^d)} \leqslant \epsilon$.

Compared with the $L^2$ approximation guarantee in Theorem 3.1, the $L^\infty$ result only requires a modest increase in depth from 6 to 7. The proof for $L^\infty$ approximation follows a strategy similar to that of Theorem 3.1, with the addition of a single layer designed to implement a weighted superposition principle. This mechanism effectively upgrades localized approximation guarantees to a global $L^\infty$ bound. For a complete derivation, we refer the reader to Appendix B.8.

4 Learning Theory: Achieving Optimal Risk without Sparsity

We now leverage the above constructive approximation results to derive generalization bounds for learning target functions in $W^{s,\infty}([0,1]^d)$. Let $\{(x_i, y_i)\}_{i=1}^n$ be i.i.d. samples generated according to
$$y_i = f_\star(x_i) + \xi_i, \qquad i = 1, \ldots, n, \qquad (5)$$
where $x_i \sim \rho$ with $\rho$ being a distribution supported on $[0,1]^d$ with density $p$ satisfying $0 \leqslant p(x) \lesssim 1$ for all $x \in [0,1]^d$, and the noises $\xi_i \sim \mathcal{N}(0, \sigma^2)$ are independent of $x_i$. We consider ERM over the network class:
$$\hat f_n = \operatorname*{argmin}_{f \in \mathcal{H}_{\phi,L}(d, 1, M_n, B_n)}\ \frac{1}{n} \sum_{i=1}^n \big(y_i - (T_F f)(x_i)\big)^2. \qquad (6)$$
Here $T_F$ denotes the truncation operator, defined for $F > 0$ by
$$(T_F f)(x) = \begin{cases} f(x), & \text{if } |f(x)| \leqslant F, \\ \operatorname{sign}(f(x))\, F, & \text{if } |f(x)| > F. \end{cases}$$
This truncation ensures uniform boundedness of the hypothesis class and is standard in the generalization analysis of ERM estimators.
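Below is a minimal sketch of the truncated ERM objective (6); the network architecture, the smooth stand-in target, the optimizer, and the iteration count are illustrative choices, since the theory concerns the exact empirical risk minimizer rather than a particular training algorithm.

```python
import torch

def truncate(f_x, F=2.0):
    # truncation operator T_F: clamp network outputs to [-F, F]
    return torch.clamp(f_x, -F, F)

def erm_objective(net, x, y, F=2.0):
    # empirical risk (1/n) sum_i (y_i - (T_F f)(x_i))^2 from Eq. (6)
    return ((y - truncate(net(x), F).squeeze(-1)) ** 2).mean()

# illustrative data from model (5): y = f_star(x) + noise, x ~ Unif[0,1]^d
d, n = 3, 512
x = torch.rand(n, d)
f_star = lambda z: torch.sin(2 * torch.pi * z.sum(dim=1)) / 2   # smooth stand-in target
y = f_star(x) + 0.1 * torch.randn(n)

net = torch.nn.Sequential(            # a small smooth-activation network (illustrative)
    torch.nn.Linear(d, 64), torch.nn.GELU(),
    torch.nn.Linear(64, 64), torch.nn.GELU(),
    torch.nn.Linear(64, 1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = erm_objective(net, x, y)
    loss.backward()
    opt.step()
    with torch.no_grad():             # keep parameters in the ell_infty ball of radius B
        for p in net.parameters():
            p.clamp_(-10.0, 10.0)
```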
Theorem 4.1. Let $\phi$ satisfy Assumptions 2.1–2.3 and assume the noise variance $\sigma^2 \gtrsim 1$. Fix any $s > 0$ and let $f_\star \in W^{s,\infty}([0,1]^d)$ satisfy $\|f_\star\|_{W^{s,\infty}([0,1]^d)} \leqslant 1$. Then, for any positive integer $n$, choosing
$$L = 6, \qquad M_n \eqsim n^{\frac{d}{4s + 2d}}, \qquad B_n \eqsim n^{\max\big\{\frac{d}{2},\ 1,\ \frac{d + 10s + 4}{2(2s + d)},\ \frac{s\lceil s \rceil}{2s + d}\big\}}, \qquad F = 2, \qquad (7)$$
we have
$$\mathbb{E}\,\big\|T_F \hat f_n - f_\star\big\|^2_{L^2(\rho)} \lesssim n^{-\frac{2s}{2s + d}} \log n, \qquad (8)$$
where the expectation is taken over the sampling of the training data.

We provide a proof sketch of this theorem in Section 6.2, with the complete derivation deferred to Appendix C. The fundamental insight of the proof lies in the fact that the constructive approximation in Theorem 3.1 allows for precise control over the complexity of the hypothesis class. It is well known that the minimax optimal rate for learning functions in the Sobolev class $W^{s,\infty}([0,1]^d)$ is $O(n^{-\frac{2s}{2s+d}})$ (Stone, 1982; Tsybakov, 2009). Therefore, Theorem 4.1 shows that, when equipped with smooth activation functions, constant-depth neural networks achieve this optimal rate (up to logarithmic factors). In contrast, for non-smooth activations, achieving the same rate typically requires the network depth to grow with the sample size (Schmidt-Hieber, 2020; Suzuki, 2019; Ohn & Kim, 2019). Compared to these prior works, we successfully remove the requirement of sparsity constraints, thereby rendering the ERM practically implementable.

In Theorem 4.1, empirical risk minimization is performed over neural networks whose parameters satisfy an $\ell^\infty$-norm constraint. A slight modification of the proof shows that the same optimal risk bound holds under the more commonly used $\ell^2$-norm constraint (Theorem C.4 in Appendix C). Given the close connection between $\ell^2$ regularization and the weight decay techniques widely used in practice, this extension further aligns our theoretical guarantees with standard training procedures.

5 The Depth Bottleneck for Non-Smooth Activations

We first establish a quantitative limitation of constant-depth ReLU networks, whose proof is deferred to Appendix D.1.

Proposition 5.1 (Approximation lower bound for constant-depth ReLU networks). Fix a depth $L \geqslant 2$ and a smoothness parameter $s > 0$. Then there exists a constant $C_{s,L} > 0$, depending only on $s$ and $L$, such that for every $M \geqslant 2$,
$$\sup_{\|f_\star\|_{W^{s,\infty}([0,1])} \leqslant 1}\ \inf_{g \in \mathcal{H}_{\mathrm{ReLU},L}(1, 1, M, \infty)} \|g - f_\star\|_{L^2([0,1])} \geqslant C_{s,L}\, (M \log M)^{-2 \min\{L-1,\, s\}}.$$

For fixed depth $L$, the total number of parameters satisfies $N \eqsim M^2 L \eqsim M^2$. Therefore, up to logarithmic factors, the approximation rate is lower bounded by $N^{-\min\{L-1, s\}}$, which saturates at order $N^{-(L-1)}$ once $s > L - 1$. Thus, constant-depth ReLU networks cannot adapt to arbitrarily high smoothness of the target function by increasing width alone. This stands in sharp contrast to Theorem 3.1, where constant-depth networks with smooth activations achieve approximation rates of order $N^{-s}$ for arbitrary $s > 0$. Hence, smooth activations enable full smoothness adaptivity even at fixed depth, whereas ReLU networks exhibit an intrinsic smoothness ceiling determined by depth.

The limitation originates from the piecewise linear structure of ReLU networks. For fixed depth $L$, a ReLU network represents a piecewise linear function whose number of linear regions grows at most polynomially in the width $M$, with degree depending on $L$. As a result, the network's effective approximation order is fundamentally constrained by this polynomial growth, preventing it from exploiting higher-order smoothness beyond $L - 1$. While increasing depth can exponentially increase the number of linear regions and thereby enhance approximation power (Yarotsky, 2017; Liang & Srikant, 2017), the proposition above demonstrates that width alone is insufficient to overcome this smoothness barrier at constant depth.
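The mechanism can be checked directly in one dimension: a depth-2 ReLU network is piecewise linear with at most $M + 1$ pieces, regardless of its weights. The following snippet (random weights, purely illustrative) enumerates the breakpoints of such a network and verifies that it is affine between them.

```python
import numpy as np

# A depth-2 ReLU network f(x) = sum_j w2_j * ReLU(w1_j * x + b1_j) + b2 on [0,1]
# is piecewise linear: each hidden unit contributes at most one breakpoint at
# x = -b1_j / w1_j, so there are at most M + 1 linear pieces, however large M is.
rng = np.random.default_rng(0)
M = 20
w1, b1 = rng.normal(size=M), rng.normal(size=M)
w2, b2 = rng.normal(size=M), rng.normal()

breakpoints = -b1 / w1
inside = np.sort(breakpoints[(breakpoints > 0) & (breakpoints < 1)])
print(f"width M = {M}: {len(inside)} breakpoints in (0,1), "
      f"hence at most {len(inside) + 1} <= M + 1 linear pieces")

# sanity check: the network is affine between consecutive breakpoints
grid = np.concatenate(([0.0], inside, [1.0]))
for a, b in zip(grid[:-1], grid[1:]):
    xs = np.linspace(a + 1e-9, b - 1e-9, 5)
    f = np.maximum(np.outer(xs, w1) + b1, 0.0) @ w2 + b2
    assert np.allclose(np.diff(f, 2), 0.0, atol=1e-8)   # zero second difference -> affine
```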
Numerical evidence for generalization superiority. Having established a sharp approximation separation at constant depth, we now investigate its implications for finite-sample learning. Deriving a sharp generalization lower bound for constant-depth neural networks with non-smooth activation functions, analogous to Proposition 5.1, is technically challenging. The main difficulty lies in the fact that classical information-theoretic tools for lower bounds, such as Fano's inequality (Cover, 1999; Tsybakov, 2009), are formulated in a minimax framework over all estimators and therefore do not directly capture model-specific structural constraints (e.g., fixed depth and non-smooth activations). Obtaining model-specific lower bounds of this type remains an interesting open direction.

Nevertheless, we provide empirical evidence that supports the generalization separation. We generate a smooth target function using random Fourier features and learn it using two-layer neural networks equipped with various activation functions. Training is performed using the full-batch Adam optimizer to minimize the empirical risk. For each activation, the learning rate and $\ell^2$-regularization hyperparameter are tuned over the same grid, and we report the best achieved performance. Further implementation details are deferred to Appendix D.2.

[Figure 1: log–log plot of generalization error versus sample size ($n = 2^{10}, 2^{11}, 2^{12}$) for ReLU, Tanh, and GELU; fitted exponents $-0.657$, $-0.804$, and $-0.785$, respectively.]

Figure 1: Generalization error versus sample size for two-layer networks trained with different activation functions. Markers denote the measured generalization errors at each sample size (averaged over 5 runs), and solid lines show least-squares fits of the form $E(n) \propto n^{-\alpha}$. The fitted exponents $\alpha$, reported in the legend, indicate a faster decay of the generalization error for smooth activations as the sample size increases.

Figure 1 shows the log–log scaling of generalization error versus sample size. Smooth activations (tanh and GELU) exhibit a steeper decay slope than ReLU, consistent with our theory. While optimization effects cannot be completely ruled out, these empirical results support that smooth activations enable constant-depth networks to better exploit target smoothness, thereby improving sample efficiency when learning smooth functions.
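A minimal sketch of this experiment is given below; the random-Fourier-feature target, the network width, the sample-size grid, and the optimizer settings are illustrative stand-ins for the configuration reported in Appendix D.2.

```python
import numpy as np
import torch
import torch.nn as nn

def make_target(d=2, n_features=32, seed=0):
    # smooth target built from random Fourier features (illustrative stand-in)
    g = torch.Generator().manual_seed(seed)
    W = torch.randn(n_features, d, generator=g)
    b = 2 * torch.pi * torch.rand(n_features, generator=g)
    a = torch.randn(n_features, generator=g) / n_features ** 0.5
    return lambda x: torch.cos(x @ W.T + b) @ a

def fit_two_layer(x, y, act, width=256, epochs=2000, lr=1e-2, wd=1e-4):
    net = nn.Sequential(nn.Linear(x.shape[1], width), act, nn.Linear(width, 1))
    opt = torch.optim.Adam(net.parameters(), lr=lr, weight_decay=wd)  # full-batch Adam
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((net(x).squeeze(-1) - y) ** 2).mean()
        loss.backward()
        opt.step()
    return net

d, f_star = 2, make_target()
x_test = torch.rand(20000, d)
y_test = f_star(x_test)
for name, act in [("ReLU", nn.ReLU()), ("Tanh", nn.Tanh()), ("GELU", nn.GELU())]:
    errs, ns = [], [2 ** k for k in range(10, 13)]          # n = 1024, 2048, 4096
    for n in ns:
        x = torch.rand(n, d)
        y = f_star(x) + 0.05 * torch.randn(n)
        net = fit_two_layer(x, y, act)
        with torch.no_grad():
            errs.append(((net(x_test).squeeze(-1) - y_test) ** 2).mean().item())
    slope = np.polyfit(np.log(ns), np.log(errs), 1)[0]      # fitted exponent -alpha
    print(f"{name}: fitted decay exponent {slope:.2f}")
```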
6 Proof Sketches

6.1 Proof Sketch of Theorem 3.1

We approximate $f_\star$ using piecewise polynomials as an intermediate representation. This reduces the problem to three building blocks: (i) monomials, (ii) piecewise constant functions, and (iii) products of these two components; see Figure 2 for an illustration. Steps (i) and (iii) are implemented via finite-difference approximations of derivatives, a classical technique in neural network approximation (Pinkus, 1999). For step (ii), we employ a multiscale construction based on a coarse-to-refined grid partition. This allows us to represent a piecewise constant function with $K^{2d}$ refined cells using a constant-depth network of width $O(K^d)$. Concretely, let
$$C(x) = \sum_{\boldsymbol{i} \in [K]^d} \sum_{\boldsymbol{j} \in [K]^d} c_{\boldsymbol{i},\boldsymbol{j}}\, \mathbf{1}_{\Omega^K_{\boldsymbol{i},\boldsymbol{j}}}(x)$$
be a piecewise constant function on the refined partition $\{\Omega^K_{\boldsymbol{i},\boldsymbol{j}}\}_{\boldsymbol{i},\boldsymbol{j} \in [K]^d}$, where $\{\Omega^K_{\boldsymbol{i}}\}_{\boldsymbol{i} \in [K]^d}$ denotes the coarse partition and, for each coarse cell $\Omega^K_{\boldsymbol{i}}$, the sets $\{\Omega^K_{\boldsymbol{i},\boldsymbol{j}}\}_{\boldsymbol{j} \in [K]^d}$ form its $K^d$ refined subcells (see Figure 2). Denote by $a^K_{\boldsymbol{i}}$ a fixed reference point of $\Omega^K_{\boldsymbol{i}}$ (e.g., its lower-left corner). Then one can write
$$C(x) = \sum_{\boldsymbol{j} \in [K]^d} \Bigg[\sum_{\boldsymbol{i} \in [K]^d} c_{\boldsymbol{i},\boldsymbol{j}}\, \mathbf{1}_{\Omega^K_{\boldsymbol{i}}}(x)\Bigg]\, \mathbf{1}_{\Omega^K_{\boldsymbol{1},\boldsymbol{j}}}\Bigg(x - \sum_{\boldsymbol{i} \in [K]^d} a^K_{\boldsymbol{i}}\, \mathbf{1}_{\Omega^K_{\boldsymbol{i}}}(x)\Bigg),$$
where $x - a^K_{\boldsymbol{i}}$ maps $x$ to its local position within that coarse cell. Observe that the constituent components include the functions $\sum_{\boldsymbol{i} \in [K]^d} c_{\boldsymbol{i},\boldsymbol{j}}\, \mathbf{1}_{\Omega^K_{\boldsymbol{i}}}$ and $\sum_{\boldsymbol{i} \in [K]^d} a^K_{\boldsymbol{i}}\, \mathbf{1}_{\Omega^K_{\boldsymbol{i}}}$, which are piecewise constant with respect to the same coarse grid partition of size $K^d$, and the collection of $K^d$ indicator functions $\mathbf{1}_{\Omega^K_{\boldsymbol{1},\boldsymbol{j}}}$ on refined cells. Consequently, each component admits an efficient approximation by a constant-depth network of width $O(K^d)$, thereby establishing an overall width bound of order $O(K^d)$ for the approximation of $C$. Figures 2(b) and (c) provide a visual illustration of this approximation for $d = 1$ and $K = 2$. See Appendix B.4 for details.

Remark 6.1. The multiscale decomposition in step (ii) is crucial for controlling the network width. A naive construction that directly assigns one unit (or a small block) to each of the $K^{2d}$ refined cells typically requires width $O(K^{2d})$, leading to a much larger parameter count and increased model complexity. In contrast, the multiscale strategy keeps the width at $O(K^d)$, which is necessary to obtain the optimal approximation rate without imposing special sparsity constraints. Similar ideas also appear in Kohler & Langer (2021).

Figure 2: Illustration of the approximator construction for $f_\star$ in Theorem B.19 with $d = 1$ and $K = 2$. (a) Approximate $f_\star$ by piecewise polynomials, realized as the product of global polynomials and piecewise constant functions. (b) The 4-piece piecewise constant function on refined cells is decomposed into a summation of two 2-piece functions defined over coarse cells, multiplied by refined-cell indicator functions. (c) The refined-cell indicator functions are realized by taking the extracted relative position information $x - a(x)$ as the input for the reference indicators $\mathbf{1}_{[0, 0.25]}$ and $\mathbf{1}_{[0.25, 0.5]}$, which correspond to the refined cells contained within the leftmost coarse cell $[0, 0.5]$.
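The multiscale identity above can be verified numerically. The following sketch does so for $d = 1$ and an arbitrary $K$; the half-open cell convention used to resolve shared boundaries is an implementation choice.

```python
import numpy as np

# Check the multiscale identity for d = 1: a piecewise constant function on the
# K^2 refined cells is rebuilt from K coarse-cell functions composed with the K
# reference indicators on the first coarse cell, shifted by the corner a_i.
K = 4
rng = np.random.default_rng(1)
c = rng.normal(size=(K, K))                       # c[i, j]: value on refined cell (i, j)

def coarse_idx(x):   return min(int(x * K), K - 1)           # index i of the coarse cell
def refined_idx(x):  return int(x * K * K) % K               # index j within that coarse cell

def C_direct(x):
    return c[coarse_idx(x), refined_idx(x)]

def C_multiscale(x):
    i = coarse_idx(x)
    a_i = i / K                                   # corner of the coarse cell containing x
    u = x - a_i                                   # local position, lies in [0, 1/K)
    total = 0.0
    for j in range(K):                            # reference refined cells of the first coarse cell
        indicator = 1.0 if j / K ** 2 <= u < (j + 1) / K ** 2 else 0.0
        total += c[i, j] * indicator              # coarse-cell coefficient times reference indicator
    return total

xs = rng.uniform(0, 1, size=10000)
assert all(np.isclose(C_direct(x), C_multiscale(x)) for x in xs)
print("multiscale decomposition matches the direct evaluation on", len(xs), "points")
```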
6.2 Proof Sketch of Theorem 4.1

To achieve approximation accuracy $\epsilon$ using the neural network approximation in Theorem 3.1, the complexity of the model class with input $x \in [0,1]^d$, which contains the constructed approximant, can be characterized by the following logarithmic covering number bound:
$$\log \mathcal{N}\big(\tau, \mathcal{H}_{\phi,6}(d, 1, M_\epsilon, B_\epsilon), \|\cdot\|_\infty\big) \lesssim \Big(\frac{1}{\epsilon}\Big)^{\frac{d}{s}} \Big(\log\frac{1}{\tau} + \log\frac{1}{\epsilon}\Big). \qquad (9)$$
Next, applying Lemma C.1, we derive the following bound on the generalization error:
$$\mathbb{E}\,\big\|T_F \hat f_n - f_\star\big\|^2_{L^2(\rho)} \lesssim \epsilon^2 + \frac{1}{n}\Big(\frac{1}{\epsilon}\Big)^{\frac{d}{s}}\Big(\log\frac{1}{\tau} + \log\frac{1}{\epsilon}\Big) + \tau.$$
By balancing the trade-off between approximation accuracy and model complexity, we choose
$$\epsilon \eqsim n^{-\frac{s}{2s+d}}, \qquad \tau \eqsim n^{-\frac{2s}{2s+d}},$$
which leads to the following generalization error bound:
$$\mathbb{E}\,\big\|T_F \hat f_n - f_\star\big\|^2_{L^2(\rho)} \lesssim n^{-\frac{2s}{2s+d}} \log n.$$
The detailed proof is provided in Appendix C.
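The balancing step can be checked numerically: with $\epsilon \eqsim n^{-s/(2s+d)}$, the squared approximation error and the (log-free) complexity term coincide and both equal $n^{-2s/(2s+d)}$. The values of $s$ and $d$ below are arbitrary illustrative choices.

```python
import numpy as np

s, d = 2.5, 3.0                                   # illustrative smoothness and dimension
for n in [1e3, 1e5, 1e7]:
    eps = n ** (-s / (2 * s + d))                 # choice of eps in the proof sketch
    approx = eps ** 2                             # squared approximation error
    complexity = eps ** (-d / s) / n              # covering-number term from Eq. (9), logs dropped
    rate = n ** (-2 * s / (2 * s + d))            # target rate n^{-2s/(2s+d)}
    print(f"n={n:.0e}  eps^2={approx:.3e}  complexity={complexity:.3e}  rate={rate:.3e}")
```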
7 Conclusions

We have developed a unified, constructive analysis of both approximation and generalization for constant-depth neural networks equipped with smooth activation functions over the Sobolev space $W^{s,\infty}([0,1]^d)$. We constructed explicit constant-depth networks with smooth activations whose parameter norms are carefully controlled. These networks attain the minimax-optimal approximation rate for arbitrary smoothness $s > 0$, thereby demonstrating smoothness adaptivity at fixed depth. Moreover, the norm-controlled construction enables a sharp statistical analysis, showing that empirical risk minimization over this model class achieves the minimax-optimal estimation rate (up to logarithmic factors). In contrast, we established approximation lower bounds for ReLU networks, showing that their smoothness adaptivity is fundamentally limited when the depth is kept constant. Taken together, these results reveal that depth is not the sole mechanism for achieving optimal rates; activation-induced smoothness provides an alternative route to smoothness adaptivity and statistical optimality.

Looking ahead, several important directions remain open. First, while our results establish statistical optimality in the noisy regression setting, the precise learning behavior of smooth-activation networks in the noiseless regime remains to be fully understood. Whether similar optimality guarantees can be established in this setting is an interesting theoretical question (Chen et al., 2025). Second, in scientific computing applications such as PDE solvers, performance is often evaluated under stronger norms, such as higher-order Sobolev norms. The approximation and estimation rates of neural networks with smooth activations under these stronger norms, as well as their potential optimality in this regime, remain largely unexplored. Clarifying these questions would further illuminate the role of activation smoothness in high-accuracy numerical and scientific learning tasks.

Acknowledgments

This work was supported by the National Key R&D Program of China (No. 2022YFA1008200) and the National Natural Science Foundation of China (NSFC 12522120). The authors thank Juncai He, Juno Kim, and Zikai Shen for helpful discussions.

References

Francis Bach. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18(19):1–53, 2017.

Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.

Benedikt Bauer and Michael Kohler. On deep learning as a remedy for the curse of dimensionality in nonparametric regression. The Annals of Statistics, 47(4):2261–2285, 2019.

Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.

Helmut Bölcskei, Philipp Grohs, Gitta Kutyniok, and Philipp Petersen. Optimal approximation with sparsely connected deep neural networks. SIAM Journal on Mathematics of Data Science, 1(1):8–45, 2019. doi: 10.1137/18M118709X.

Susanne C Brenner and L Ridgway Scott. The Mathematical Theory of Finite Element Methods. Springer, 2008.

Andrei Caragea, Philipp Petersen, and Felix Voigtlaender. Neural network approximation and estimation of classifiers with classification boundary in a Barron class. The Annals of Applied Probability, 33(4):3039–3079, 2023.

Hongrui Chen, Jihao Long, and Lei Wu. A duality framework for analyzing random feature and two-layer neural networks. The Annals of Statistics, 53(3):1044–1067, 2025.

Ernesto Corominas and Ferran Sunyer Balaguer. Condiciones para que una función infinitamente derivable sea un polinomio. Revista Matemática Hispanoamericana, 14(1):26–43, 1954.

Thomas M Cover. Elements of Information Theory. John Wiley & Sons, 1999.

George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.

Tim De Ryck, Samuel Lanthaler, and Siddhartha Mishra. On the approximation of functions by tanh neural networks. Neural Networks, 143:732–750, 2021.

Ronald A DeVore. Nonlinear approximation. Acta Numerica, 7:51–150, 1998.

Ronald A DeVore, Ralph Howard, and Charles Micchelli. Optimal nonlinear approximation. Manuscripta Mathematica, 63(4):469–478, 1989.

William F Donoghue. Distributions and Fourier Transforms. Academic Press, 1969.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. International Conference on Learning Representations, 2021.

Weinan E, Chao Ma, and Lei Wu. A priori estimates of the population risk for two-layer neural networks. arXiv preprint arXiv:1810.06397, 2018.

Weinan E, Chao Ma, and Lei Wu. The Barron space and the flow-induced function spaces for neural network models. Constructive Approximation, 55(1):369–406, 2022.

Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11, 2018.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. International Conference on Artificial Intelligence and Statistics, 2011.

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, et al. The LLaMA 3 herd of models, 2024.
Rémi Gribonval, Gitta Kutyniok, Morten Nielsen, and Felix Voigtlaender. Approximation spaces of deep neural networks. Constructive Approximation, 55(1):259–367, 2022.

Ingo Gühring, Gitta Kutyniok, and Philipp Petersen. Error bounds for approximations with deep ReLU neural networks in $W^{s,p}$ norms. Analysis and Applications, 18(05):803–859, 2020.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition, 2016.

Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 2020.

Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Technische Universität München, 91(1):31, 1991.

Sean Hon and Haizhao Yang. Simultaneous neural network approximation for smooth functions. Neural Networks, 154:152–164, 2022.

Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

Michael Kohler and Sophie Langer. On the rate of convergence of fully connected deep neural network regression estimates. The Annals of Statistics, 49(4):2231–2249, 2021.

Michael Kohler, Adam Krzyżak, and Sophie Langer. Estimation of a function of low local dimensionality by deep neural networks. IEEE Transactions on Information Theory, 68(6):4032–4042, 2022.

Michael Kohler, Sophie Langer, and Ulrich Reif. Estimation of a regression function on a manifold by fully connected deep neural networks. Journal of Statistical Planning and Inference, 222:160–181, 2023.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 2012.

Věra Kůrková and Marcello Sanguineti. Bounds on rates of variable-basis and neural-network approximation. IEEE Transactions on Information Theory, 47(6):2659–2665, 2002.

Shiyu Liang and R Srikant. Why deep neural networks for function approximation? International Conference on Learning Representations, 2017.

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.

Jianfeng Lu, Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation for smooth functions. SIAM Journal on Mathematical Analysis, 53(5):5465–5506, 2021a.

Lu Lu, Pengzhan Jin, Guofei Pang, Zhongqiang Zhang, and George Em Karniadakis. Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3):218–229, 2021b.
Tong Mao, Jonathan W Siegel, and Jinchao Xu. Approximation rates for shallow ReLU$^k$ neural networks on Sobolev spaces via the Radon transform. arXiv preprint arXiv:2408.10996, 2024.

Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.

Hrushikesh N Mhaskar. Neural networks for optimal approximation of smooth and analytic functions. Neural Computation, 8(1):164–177, 1996.

Hrushikesh N Mhaskar and Charles A Micchelli. Degree of approximation by neural and translation networks with a single hidden layer. Advances in Applied Mathematics, 16(2):151–183, 1995.

Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. International Conference on Machine Learning, 2010.

Ryumei Nakada and Masaaki Imaizumi. Adaptive approximation and generalization of deep neural network with intrinsic dimensionality. Journal of Machine Learning Research, 21(174):1–38, 2020.

Ilsang Ohn and Yongdai Kim. Smooth function approximation by deep neural networks with general activation functions. Entropy, 21(7):627, 2019.

Philipp Petersen and Felix Voigtlaender. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Networks, 108:296–330, 2018.

Pencho P Petrushev. Approximation by ridge functions and neural networks. SIAM Journal on Mathematical Analysis, 30(1):155–189, 1998.

Allan Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143–195, 1999.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.

Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.

Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.

David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics, 48(4):1875–1897, 2020.

Noam Shazeer. GLU variants improve Transformer. arXiv preprint arXiv:2002.05202, 2020.

Zuowei Shen, Haizhao Yang, and Shijun Zhang. Nonlinear approximation via compositions. Neural Networks, 119:74–84, 2019.

Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation characterized by number of neurons. Communications in Computational Physics, 28(5):1768–1811, November 2020. doi: 10.4208/cicp.oa-2020-0149.

Zuowei Shen, Haizhao Yang, and Shijun Zhang. Optimal approximation rate of ReLU networks in terms of width and depth. Journal de Mathématiques Pures et Appliquées, 157:101–135, 2022.
Jonathan W Siegel. Optimal approximation rates for deep ReLU neural networks on Sobolev and Besov spaces. Journal of Machine Learning Research, 24(357):1–52, 2023.

Jonathan W Siegel and Jinchao Xu. Approximation rates for neural networks with general activation functions. Neural Networks, 128:313–321, 2020.

Jonathan W Siegel and Jinchao Xu. Sharp bounds on the approximation rates, metric entropy, and n-widths of shallow neural networks. Foundations of Computational Mathematics, 24(2):481–537, 2024.

Charles J Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, 10(4):1040–1053, 1982.

Taiji Suzuki. Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality. International Conference on Learning Representations, 2019.

Taiji Suzuki and Atsushi Nitanda. Deep learning is adaptive to intrinsic dimensionality of model smoothness in anisotropic Besov space. Advances in Neural Information Processing Systems, 2021.

Matus Telgarsky. Benefits of depth in neural networks. Conference on Learning Theory, 2016.

Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, 2009.

Gal Vardi and Ohad Shamir. Neural networks with small weights and depth-separation barriers. Advances in Neural Information Processing Systems, 2020.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 2017.

E Weinan and Bing Yu. The deep Ritz method: A deep learning-based numerical algorithm for solving variational problems. Communications in Mathematics and Statistics, 6(1):1–12, 2018.

Lei Wu and Jihao Long. A spectral-based analysis of the separation between two-layer neural networks and linear methods. Journal of Machine Learning Research, 23(119):1–34, 2022.

Yahong Yang and Juncai He. Deep neural networks with general activations: Super-convergence in Sobolev norms. arXiv preprint arXiv:2508.05141, 2025.

Yahong Yang, Yue Wu, Haizhao Yang, and Yang Xiang. Nearly optimal approximation rates for deep super ReLU networks on Sobolev spaces. arXiv preprint arXiv:2310.10766, 2023.

Yunfei Yang and Ding-Xuan Zhou. Nonparametric regression using over-parameterized shallow ReLU neural networks. Journal of Machine Learning Research, 25(165):1–35, 2024.

Yunfei Yang and Ding-Xuan Zhou. Optimal rates of approximation by shallow ReLU$^k$ neural networks and applications to nonparametric regression. Constructive Approximation, 62(2):329–360, 2025.

Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.

Shijun Zhang, Jianfeng Lu, and Hongkai Zhao. Deep network approximation: Beyond ReLU to diverse activation functions. Journal of Machine Learning Research, 25(35):1–39, 2024a.
Zihan Zhang, Lei Shi, and Ding-Xuan Zhou. Classification with deep neural networks and logistic loss. Journal of Machine Learning Research, 25(125):1–117, 2024b.

Appendix

Table of Contents
A Technical Preliminaries
  A.1 Additional Notations
  A.2 Verification of Smooth Activation Assumptions
B Approximation Theory: Proofs and Technical Details
  B.1 Domain Partition Construction
  B.2 Bramble–Hilbert Lemma
  B.3 Approximation of Monomials
  B.4 Approximation of Piecewise Constants
  B.5 Approximation of Piecewise Polynomials
  B.6 Proof of Theorem 3.1 ($L^2$ Approximation)
  B.7 Approximation of Weight Functions
  B.8 Proof of Theorem 3.3 ($L^\infty$ Approximation)
C Learning Theory: Proofs and Technical Details
  C.1 Covering Number Bounds
  C.2 Proof of Theorem 4.1 (Risk Bound)
  C.3 Optimal Risk Bound under $\ell^2$ Norm Constraints
D Supplementary Details to Section 5
  D.1 Proof of Proposition 5.1 (ReLU Lower Bound)
  D.2 Setup for Generalization Experiments

A Technical Preliminaries

A.1 Additional Notations

• Let $\mathbb{Z}$ denote the set of integers. Let $\mathbb{N} := \{0, 1, 2, \ldots\}$ denote the set of natural numbers. Accordingly, $\mathbb{N}^d$ denotes the set of $d$-dimensional multi-indices $\boldsymbol{i} = (i_1, \ldots, i_d)$ where each component $i_\ell \in \mathbb{N}$ for $\ell = 1, \ldots, d$. We denote $\mathbb{N}_+ := \mathbb{N} \setminus \{0\}$ as the set of positive integers.

• For any integer $K \geqslant 1$, we denote the set $[K] := \{1, 2, \ldots, K\}$ and the set $[\tilde K] := \{0, 1, \ldots, K\}$. Accordingly, $[K]^d$ (and similarly $[\tilde K]^d$) denotes the set of $d$-dimensional multi-indices $\boldsymbol{i} = (i_1, \ldots, i_d)$, where each component $i_l \in [K]$ (or $i_l \in [\tilde K]$) for $l = 1, \ldots, d$.

• For non-negative functions $f$ and $g$, we write $f(x) \lesssim g(x)$ or $f(x) = O(g(x))$ to indicate that there exists a constant $C$ relying only on the dimension $d$, the smoothness $s$, and the activation function $\phi$ such that $f(x) \leqslant C g(x)$. We use $f(x) = \tilde O(g(x))$ to suppress polylogarithmic factors, i.e., $f(x) = O(g(x)\,\mathrm{polylog}(x))$. Also, we write $f(x) \eqsim g(x)$ if both $f(x) \lesssim g(x)$ and $g(x) \lesssim f(x)$ hold.

• For a given neural network $g$, let $\theta(g)$ denote the parameter vector comprising all its weight matrices and bias vectors. We denote the maximum parameter magnitude by the infinity norm $\|\theta(g)\|_\infty$.

• For a matrix $A = (a_{ij})_{i \in [m], j \in [n]} \in \mathbb{R}^{m \times n}$, the norm $\|\cdot\|_{\infty,\infty}$ is given by $\|A\|_{\infty,\infty} = \max_{i \in [m], j \in [n]} |a_{ij}|$ and the norm $\|\cdot\|_{1,\infty}$ is given by $\|A\|_{1,\infty} = \max_{i \in [m]} \sum_{j=1}^n |a_{ij}|$.

• A $d$-dimensional multi-index is a tuple $\alpha = (\alpha_1, \ldots, \alpha_d) \in \mathbb{N}^d$. Several related notations are listed below (executable counterparts of these notations are sketched after this list):
  – $|\alpha| := \sum_{i=1}^d \alpha_i$;
  – $\alpha! := \prod_{i=1}^d \alpha_i!$;
  – $x^\alpha := x_1^{\alpha_1} \cdots x_d^{\alpha_d}$, where $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$;
  – $D^\alpha := \frac{\partial^{|\alpha|}}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}$;
  – $\binom{k}{\alpha} := \frac{k!}{\alpha!}$, where $k = |\alpha|$.

• For any $x \in \mathbb{R}$, let $\lfloor x \rfloor := \max\{n : n \leqslant x,\ n \in \mathbb{Z}\}$ and $\lceil x \rceil := \min\{n : n \geqslant x,\ n \in \mathbb{Z}\}$.

• Let $\mathbf{1}_\Omega(x)$ denote the indicator function of the region $\Omega$, i.e., $\mathbf{1}_\Omega(x) = 1$ if $x \in \Omega$ and $0$ otherwise.
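For readers who prefer executable definitions, the following small helpers mirror the multi-index notation above; the function names are ours and purely illustrative.

```python
import math
from itertools import product

def abs_alpha(alpha):               # |alpha| = sum_i alpha_i
    return sum(alpha)

def fact_alpha(alpha):              # alpha! = prod_i alpha_i!
    return math.prod(math.factorial(a) for a in alpha)

def x_pow_alpha(x, alpha):          # x^alpha = prod_i x_i^{alpha_i}
    return math.prod(xi ** ai for xi, ai in zip(x, alpha))

def multi_indices(d, max_order):    # all alpha in N^d with |alpha| <= max_order
    return [a for a in product(range(max_order + 1), repeat=d) if sum(a) <= max_order]

alpha = (2, 0, 1)
print(abs_alpha(alpha), fact_alpha(alpha), x_pow_alpha((0.5, 2.0, 3.0), alpha))
print(len(multi_indices(d=2, max_order=3)))   # 10 multi-indices with |alpha| <= 3 in N^2
```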
A.2 Verification of Smooth Activation Assumptions

In this section, we formally verify that widely adopted smooth activation functions, including the sigmoid, $\frac{1}{2}(\tanh + 1)$, SiLU, and GELU, satisfy Assumptions 2.1, 2.2 and 2.3.

The condition of being infinitely differentiable and non-polynomial, as stipulated in Assumption 2.1, is trivially satisfied for the aforementioned activations. Regarding Assumption 2.2, observe that for all considered functions the first derivative $\phi'$ satisfies the uniform bound $\sup_{t \in \mathbb{R}} |\phi'(t)| < 2$. Consequently, by the mean value theorem, these activations possess the Lipschitz continuity property required by Assumption 2.2 with $\|\phi\|_{\mathrm{Lip}} = 2$. The remainder of this section focuses on verifying the approximation properties outlined in Assumption 2.3.

Heaviside-like case. We verify that the sigmoid and $\frac{1}{2}(\tanh + 1)$ activation functions satisfy the Heaviside-like condition.

• Sigmoid function: Consider the sigmoid activation $\sigma(t) = (1 + e^{-t})^{-1}$. We observe:
  – For $t > 0$, $|\sigma(t) - H(t)| = \big|\frac{1}{1 + e^{-t}} - 1\big| = \frac{e^{-t}}{1 + e^{-t}} < e^{-t}$.
  – For $t < 0$, $|\sigma(t) - H(t)| = \frac{1}{1 + e^{-t}} = \frac{e^{t}}{1 + e^{t}} < e^{-|t|}$.
  Note that $e^{-|t|} \leqslant \min\big\{\frac{1}{e|t|}, 1\big\}$; hence the assumption holds with $C_1 = 1$.

• Tanh-based function: Consider the activation function $\phi(\cdot) = \frac{1}{2}(\tanh(\cdot) + 1)$. We analyze its approximation of $H(\cdot)$ as follows:
  – For $t > 0$, $|\phi(t) - H(t)| = \frac{1}{2}(1 - \tanh(t)) = \frac{e^{-t}}{e^{t} + e^{-t}} < e^{-2t}$.
  – For $t < 0$, $|\phi(t) - H(t)| = \frac{1}{2}(\tanh(t) + 1) = \frac{e^{t}}{e^{t} + e^{-t}} < e^{2t} = e^{-2|t|}$.
  Note that $e^{-2|t|} \leqslant \min\big\{\frac{1}{2e|t|}, 1\big\}$; hence the assumption holds with $C_1 = 1$.

ReLU-like case. We verify that the GELU and SiLU activation functions satisfy the ReLU-like condition.

• SiLU: Consider the SiLU activation $\phi(t) = t\,\sigma(t)$, where $\sigma$ is the sigmoid function. The deviation is as follows:
  – For $t \geqslant 0$, $|\phi(t) - \mathrm{ReLU}(t)| = |t\,\sigma(t) - t| = t(1 - \sigma(t)) = \frac{t\, e^{-t}}{1 + e^{-t}} = \frac{t}{1 + e^{t}}$.
  – For $t < 0$, $|\phi(t) - \mathrm{ReLU}(t)| = |t|\,\sigma(t) = \frac{|t|}{1 + e^{-t}} = \frac{|t|}{1 + e^{|t|}}$.
  Since $f(u) = \frac{u}{1 + e^{u}} \leqslant 1$ for $u \geqslant 0$, the assumption holds with $C_2 = 1$.

• GELU: Consider the GELU activation $\phi(t) = t\,\Phi(t)$, where $\Phi$ is the cumulative distribution function of the standard normal distribution, $\Phi(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-x^2/2}\, dx$. The error satisfies:
  – For $t \geqslant 0$, $|\phi(t) - \mathrm{ReLU}(t)| = |t\,\Phi(t) - t| = t(1 - \Phi(t))$.
  – For $t < 0$, $|\phi(t) - \mathrm{ReLU}(t)| = |t|\,\Phi(t)$.
  By symmetry, we have $\Phi(-|t|) = 1 - \Phi(|t|)$; thus $|\phi(t) - \mathrm{ReLU}(t)| = |t|(1 - \Phi(|t|))$ for all $t \in \mathbb{R}$. We estimate the tail integral as follows:
$$1 - \Phi(t) = \frac{1}{\sqrt{2\pi}} \int_t^\infty e^{-x^2/2}\, dx \leqslant \frac{1}{\sqrt{2\pi}} \int_t^\infty \frac{x}{t}\, e^{-x^2/2}\, dx = \frac{1}{t\sqrt{2\pi}} \Big[-e^{-x^2/2}\Big]_t^\infty = \frac{e^{-t^2/2}}{t\sqrt{2\pi}}.$$
  Multiplying by $t$, we obtain $|\phi(t) - \mathrm{ReLU}(t)| \leqslant \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2} \leqslant \frac{1}{\sqrt{2\pi}}$; hence the assumption holds with $C_2 = (2\pi)^{-1/2}$.
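The bounds derived above can also be confirmed numerically on a fine grid; the grid range and tolerance below are illustrative, and the GELU is evaluated through the Gaussian CDF from SciPy.

```python
import numpy as np
from scipy.stats import norm

t = np.linspace(-50, 50, 400001)
heaviside = (t >= 0).astype(float)
relu = np.maximum(t, 0.0)

sigmoid = 1.0 / (1.0 + np.exp(-t))
tanh_act = 0.5 * (np.tanh(t) + 1.0)
silu = t * sigmoid
gelu = t * norm.cdf(t)

# Heaviside-like bound: |phi(t) - H(t)| <= C_1 * min(1, 1/|t|) with C_1 = 1
bound = np.minimum(1.0, 1.0 / np.maximum(np.abs(t), 1e-12))
print(np.all(np.abs(sigmoid - heaviside) <= bound),
      np.all(np.abs(tanh_act - heaviside) <= bound))

# ReLU-like bound: |phi(t) - ReLU(t)| <= C_2 with C_2 = 1 (SiLU), (2*pi)^{-1/2} (GELU)
print(np.max(np.abs(silu - relu)) <= 1.0,
      np.max(np.abs(gelu - relu)) <= 1.0 / np.sqrt(2 * np.pi) + 1e-12)
```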
Multiplying b y t , w e obtain | ϕ ( t ) − ReLU( t ) | ⩽ 1 √ 2 π e − t 2 / 2 ⩽ 1 √ 2 π , the assumption holds with C 2 = (2 π ) − 1 / 2 . B Appro ximation Theory: Pro ofs and T ec hnical Details In this section, w e presen t the construction for approximators for f ⋆ giv en in Theorem 3.1 and 3.3 . B.1 Domain P artition Construction Our analysis require the following partition of the input domain. • Let K ∈ N + . F or any m ulti-index i = ( i 1 , · · · , i d ) ∈ [ K ] d , the i -th coarse cell of the uniform K d grid on the hypercub e [0 , 1] d is the axis-aligned cell Ω K i : = d Y l =1 i l − 1 K , i l K , 20 whose low er-left corner is a K i : = i 1 − 1 K , · · · , i d − 1 K ∈ R d . F or an y j = ( j 1 , · · · , j d ) ∈ [ K ] d , the j -th refined cell of the uniform K d subgrid of Ω K i is defined by Ω K i , j : = d Y l =1 ( i l − 1) K + j l − 1 K 2 , ( i l − 1) K + j l K 2 . Figure 3 depicts the spatial configuration of the cells Ω K i , Ω K i , j and the corner p oint a K i . • F or K ∈ N + and δ ∈ (0 , 1 3 K ), w e define the in terior region associated with the coarse cell Ω K i b y Ω K i , int ( δ ) : = d Y l =1 i l − 1 K + δ, i l K − δ . The corresp onding band region is given b y Ω K i , band ( δ ) : = Ω K i \ Ω K i , int ( δ ) . Similarly , for δ ∈ (0 , 1 3 K 2 ), we define the interior region asso ciated with the refined cell Ω K i , j as Ω K i , j , int ( δ ) : = d Y l =1 ( i l − 1) K + j l − 1 K 2 + δ, ( i l − 1) K + j l K 2 − δ , and the asso ciated band region is Ω K i , j , band ( δ ) : = Ω K i , j \ Ω K i , j , int ( δ ) . Figure 3 depicts the spatial configuration of the interior and band regions. a 2 ( 1 , 2 ) 2 ( 1 , 2 ) , b a n d ( ) 2 ( 1 , 2 ) , i n t ( ) = [ 0 , 1 ] 2 2 ( 2 , 1 ) , ( 2 , 2 ) , b a n d ( ) 2 ( 2 , 1 ) , ( 2 , 2 ) , i n t ( ) 2 ( 2 , 1 ) Figure 3: Visualization of the hierarc hical grid structure ( K = 2 , d = 2), detailing the coarse and refined cells along with their resp ectiv e interior and band regions. 21 • F or K ∈ N + , v = ( v 1 , · · · , v d ) ∈ [2] d and i , j ∈ [ e K ] d , we define the shifted refined cell: Ω K, v i , j : = d Y l =1 [2( i l − 1) K + 2 j l + v l − 1] 2 K 2 , [2( i l − 1) K + 2 j l + v l + 1] 2 K 2 ! \ [0 , 1] d , where i , j ∈ { 0 , 1 , · · · , K } d . F or δ ∈ 0 , 1 6 K 2 , we define the asso ciated interior region as: Ω K, v i , j , int ( δ ) : = d Y l =1 [2( i l − 1) K + 2 j l + v l − 1] 2 K 2 + δ, [2( i l − 1) K + 2 j l + v l + 1] 2 K 2 − δ ! \ [0 , 1] d , and the asso ciated band region is: Ω K, v i , j , band ( δ ) := Ω K, v i , j \ Ω K, v i , j , int ( δ ) . Finally , we define the ov erall shift in terior and band regions as: Ω K, v int ( δ ) := [ i , j ∈ [ e K ] d Ω K, v i , j , int ( δ ) , Ω K, v band ( δ ) := [ i , j ∈ [ e K ] d Ω K, v i , j , band ( δ ) . B.2 Bram ble-Hilb ert Lemma In this subsection, we approximate the target function f ⋆ b y a piecewise p olynomial. A closely related conclusion can b e found in Brenner & Scott ( 2008 ); for the self-con tained presentation, w e restate the argumen t and provide a proof. Lemma B.1 (Bramble-Hilbert lemma) . F or s > 0 , let f ⋆ ∈ W s, ∞ (Ω) . Then, ther e exists a pie c ewise p olynomial p of or der ⌈ s ⌉ − 1 on the p artition { Ω K i , j } i , j ∈ [ K ] d such that ∥ f ⋆ − p ∥ L ∞ (Ω) ⩽ c 1 ( s, d ) ∥ f ⋆ ∥ W s, ∞ (Ω) K − 2 s . 
Sp e cific al ly, the pie c ewise p olynomial p c an b e written as p ( x ) = X i ∈ [ K ] d , j ∈ [ K ] d p i , j ( x ) 1 Ω K i , j ( x ) , wher e e ach p olynomial p i , j ( x ) = P | α | < ⌈ s ⌉ a α , i , j x α has or der ⌈ s ⌉ − 1 and the c o efficients satisfy | a α , i , j | ⩽ c 2 ( s, d ) ∥ f ⋆ ∥ W s, ∞ (Ω) , ∀ i , j ∈ [ K ] d , | α | < ⌈ s ⌉ . Her e, c 1 ( s, d ) and c 2 ( s, d ) ar e two c onstants that dep end only on the smo othness s and dimension d . Pr o of. W e construct a p olynomial on eac h small cub e Ω K i , j to approximate f ⋆ lo cally . F or ease of notation, we neglect the subscript i , j when there is no confusion. Let e Ω = Ω K i , j b e a cub e with side length h ( h = K − 2 for partition { Ω K i , j } i , j ∈ [ K ] d ) and f ⋆ ∈ W s, ∞ ( e Ω). W e are going to construct a p olynomial p of order ⌈ s ⌉ − 1 on e Ω such that ∥ f ⋆ − p ∥ L ∞ ( e Ω) ⩽ c 1 ( s, d ) h s ∥ f ⋆ ∥ W s, ∞ ( e Ω) , and the co efficien ts of p satisfy | a α | ⩽ c 2 ( s, d ) ∥ f ⋆ ∥ W s, ∞ ( e Ω) . W e then giv e the construction. Let ψ ∈ C ∞ ( R d ) b e a cut-off function satisfying the following conditions: 22 • ψ is supported on e Ω; • ψ is non-negativ e, i.e., ψ ( y ) ⩾ 0 for all y ∈ R d ; • R e Ω ψ ( y ) d y = 1. Then we define the av eraged T aylor p olynomial of order ⌈ s ⌉ − 1 of f ⋆ as Q ⌈ s ⌉− 1 f ⋆ ( · ) = Z e Ω T ⌈ s ⌉− 1 y f ⋆ ( · ) ψ ( y ) d y , where T ⌈ s ⌉− 1 y f ⋆ is the T a ylor p olynomial of order ⌈ s ⌉ − 1 of f ⋆ at y : T ⌈ s ⌉− 1 y f ⋆ ( · ) = X | α | < ⌈ s ⌉ 1 α ! D α f ⋆ ( y )( · − y ) α . W e claim that p = Q ⌈ s ⌉− 1 f ⋆ satisfies the desired approximation prop ert y and co efficien t b ound. In the follo wing, w e will treat the cases of integer s and non-integer s separately . Case 1: s = m is an in teger. In this case, we hav e ⌈ s ⌉ = m . W e note that the target function f ⋆ admits the follo wing T a ylor expansion with in tegral remainder: f ⋆ ( x ) = T m − 1 y f ⋆ ( x ) + X | α | = m 1 α ! ( x − y ) α Z 1 0 mt m − 1 D α f ⋆ ( x + t ( y − x )) d t. The abov e T a ylor formula is classical for C ∞ functions. Since here f ⋆ ∈ W m, ∞ ( e Ω) does not necessarily guaran tee the p oin twise existence of D α f ⋆ , it should be understo od in the weak sense. By integrating the ab o v e equation against the cut-off function ψ ( · ) o ver e Ω, we obtain f ⋆ ( x ) − ( Q m − 1 f ⋆ )( x ) = Z e Ω f ⋆ ( x ) − ( T m − 1 y f ⋆ )( x ) ψ ( y ) d y = X | α | = m Z e Ω 1 α ! ( x − y ) α Z 1 0 mt m − 1 D α f ⋆ ( x + t ( y − x )) d t ψ ( y ) d y ⩽ X | α | = m 1 α ! sup y | x − y | | α | ∥ f ⋆ ∥ W m, ∞ ( e Ω) Z e Ω | ψ ( y ) | d y · Z 1 0 mt m − 1 d t ⩽ c 1 ( m, d ) h m ∥ f ⋆ ∥ W m, ∞ ( e Ω) . T o b ound the co efficien ts, w e expand the polynomial Q m − 1 f ⋆ and rewrite it as ( Q m − 1 f ⋆ )( x ) = X | α | k . Thus B m ( q ; k ) = 0 in this case. If k = m , the only admissible multi-index is α = (1 , 1 , . . . , 1). Therefore, B m ( q ; m ) = m ! q 1 q 2 · · · q m , whic h completes the pro of. Using Lemma B.2 , we can emplo y the central difference sc heme to appro ximate q 1 q 2 · · · q m . Lemma B.3. L et m ∈ N + , and supp ose ϕ ∈ C m +1 ( R ) and x 0 ∈ R satisfy ϕ ( m ) ( x 0 ) = 0 . F or any ve ctor q = ( q 1 , · · · , q m ) ∈ R m and step size 0 < h < 1 , define the function T ( x 0 ) m as T ( x 0 ) m ( q , h ) = 1 2 m h m ϕ ( m ) ( x 0 ) X ν ∈{± 1 } m m Y i =1 ν i ! ϕ x 0 + h m X i =1 ν i q i ! . Then, T ( x 0 ) m ( q , h ) − m Y i =1 q i ⩽ hA m +1 ( m + 1)! | ϕ ( m ) ( x 0 ) | sup | t − x 0 | ⩽ A ϕ ( m +1) ( t ) , wher e A : = P m i =1 | q i | . Pr o of. 
By T aylor’s theorem with the in tegral remainder, for an y t ∈ R , ϕ ( x 0 + t ) = m − 1 X k =0 ϕ ( k ) ( x 0 ) k ! t k + t m ( m − 1)! Z 1 0 (1 − s ) m − 1 ϕ ( m ) ( x 0 + st ) d s. Setting t = hS ν ( q ), where S ν ( q ) = P m i =1 ν i q i , and inserting this expansion into the definition of T ( x 0 ) m giv es T ( x 0 ) m ( q , h ) = 1 h m ϕ ( m ) ( x 0 ) m − 1 X k =0 h k ϕ ( k ) ( x 0 ) k ! B m ( q ; k ) + R, (10) 25 where B m ( q ; k ) = 1 2 m X ν ∈{± 1 } m m Y i =1 ν i ! [ S ν ( q )] k , and the remainder term R is R = 1 2 m ( m − 1)! ϕ ( m ) ( x 0 ) X ν ∈{± 1 } m m Y i =1 ν i ! [ S ν ( q )] m Z 1 0 (1 − s ) m − 1 ϕ ( m ) ( x 0 + shS ν ( q ))d s. By Lemma B.2 , one has B m ( q ; k ) = 0 for all k < m . Hence all lo wer-order contributions v anish, and only the remainder term in ( 10 ) remains. W e decomp ose the argumen t of ϕ ( m ) in R as ϕ ( m ) ( x 0 + shS ν ( q )) = ϕ ( m ) ( x 0 ) + h ϕ ( m ) ( x 0 + shS ν ( q )) − ϕ ( m ) ( x 0 ) i . Using Lemma B.2 again and the identities B m ( q ; m ) = 1 2 m X ν ∈{± 1 } m m Y i =1 ν i ! [ S ν ( q )] m = m ! m Y i =1 q i , Z 1 0 (1 − s ) m − 1 d s = 1 m , w e obtain the follo wing error estimate T ( x 0 ) m ( q , h ) − m Y i =1 q i ⩽ 1 2 m ( m − 1)! | ϕ ( m ) ( x 0 ) | X ν ∈{± 1 } m " | S ν ( q ) | m Z 1 0 (1 − s ) m − 1 × ϕ ( m ) ( x 0 + shS ν ( q )) − ϕ ( m ) ( x 0 ) d s # . By the mean v alue theorem, ϕ ( m ) ( x 0 + shS ν ( q )) − ϕ ( m ) ( x 0 ) ⩽ sh | S ν ( q ) | sup | t − x 0 | ⩽ sh | S ν ( q ) | ϕ ( m +1) ( t ) ⩽ shA sup | t − x 0 | ⩽ A ϕ ( m +1) ( t ) , where the last inequalit y comes from 0 < h < 1 , 0 < s < 1 and | S ν ( q ) | ⩽ A : = P m i =1 | q i | . Using the b ounds | S ν ( q ) | m ⩽ A m and Z 1 0 (1 − s ) m − 1 s d s = 1 m ( m + 1) , w e finally obtain T ( x 0 ) m ( q , h ) − m Y i =1 q i ⩽ hA m +1 ( m + 1)! | ϕ ( m ) ( x 0 ) | sup | t − x 0 | ⩽ A ϕ ( m +1) ( t ) . With Lemma B.2 , w e can construct t w o-lay er neural net works to appro ximate all monomials of the form x α 1 1 x α 2 2 · · · x α d d . Lemma B.4. L et d ∈ N + and let α = ( α 1 , . . . , α d ) ∈ N d satisfy ∥ α ∥ 0 : = P d j =1 α j = m ⩾ 1 . Assume ϕ ∈ C m +1 ( R ) is non-p olynomial. Then for any Q > 0 and sufficiently smal l ϵ ∈ (0 , 1) , ther e exists a neur al network g ∈ H ϕ, 2 ( d, 1 , 2 m , B ϵ ) , 26 with B ϵ satisfying B ϵ = 1 ϵ m ( mQ ) m ( m +1) h sup | t − t 0 | ⩽ mQ | ϕ ( m +1) ( t ) | i m [2( m + 1)!] m | ϕ ( m ) ( t 0 ) | m +1 . (11) for some t 0 ∈ R , such that sup x ∈ [ − Q,Q ] d g ( x ) − x α 1 1 · · · x α d d < ϵ. Pr o of. Since ϕ ∈ C m +1 ( R ) and is not a polynomial, there exists t 0 ∈ R suc h that ϕ ( m ) ( t 0 ) = 0. P artition the index set 1 , · · · , m in to disjoin t subsets I 1 , · · · , I d defined by I l = i : 1 + l − 1 X j =0 α j ⩽ i ⩽ l X l =0 α j , 1 ⩽ l ⩽ d, where α 0 = 0. Then | I l | = α l for every l . Define the neural netw ork g ( x ) = 1 2 m h m ϕ ( m ) ( t 0 ) X ν ∈{± 1 } m m Y i =1 ν i ! ϕ t 0 + h d X l =1 X j ∈ I l ν j x l , where h > 0 is a parameter to be c hosen. Setting q j = x l , j ∈ I l , for l = 1 , · · · , d, then expression inside ϕ b ecomes t 0 + hS ν ( q ) with S ν ( q ) = P m i =1 ν i q i . Thus, by Lemma B.3 , w e ha ve | g ( x ) − x α 2 1 x α 2 2 · · · x α d d | ⩽ hA m +1 ( m + 1)! | ϕ ( m ) ( t 0 ) | sup | t − t 0 | ⩽ A ϕ ( m +1) ( t ) , where A = P m i =1 | q i | = P d l =1 α l | x l | . Since | x l | ⩽ Q , w e ha v e A ⩽ mQ . Cho ose h = ϵ ( m + 1)! | ϕ ( m ) ( t 0 ) | ( mQ ) m +1 [sup | t − t 0 | ⩽ mQ | ϕ ( m +1) ( t ) | ] . 
With this c hoice, sup x ∈ [ − Q,Q ] d | g ( x ) − x α 1 1 x α 2 2 · · · x α d d | ⩽ ϵ. There are 2 m neurons in the hidden lay er of g and all weigh ts and biases in g are b ounded in magnitude by max t 0 , 1 2 m h m | ϕ ( m ) ( t 0 ) | , h X j ∈ I l ν j = 1 2 m h m | ϕ ( m ) ( t 0 ) | , for sufficiently small ϵ . This equals B ϵ as stated in ( 11 ). W e complete the pro of. R emark B.5 . In Lemma B.4 , the condition “sufficiently small ϵ ∈ (0 , 1)” implies the existence of a threshold ϵ 0 ∈ (0 , 1) suc h that the statement holds for all 0 < ϵ < ϵ 0 . Crucially , ϵ 0 is indep enden t of the parameters K and δ , but is allow ed to depend on problem sp ecifications, suc h as activ ation ϕ , dimension d , smoothness s , among others. W e adopt this con ven tion throughout the subsequen t analysis. R emark B.6 . If the input amplitude Q in Lemma B.4 is indep endent of ϵ, δ, K , then norm B ϵ defined in ( 11 ) satisfies B ϵ ≲ (1 /ϵ ) m . 27 The construction in Lemma B.4 relies on the existence of a p oin t x 0 at whic h ϕ ( m ) ( x 0 ) = 0. The follo wing lemma establishes that for any smo oth nonp olynomial function ϕ , there exists suc h a p oin t for all m ∈ N . Lemma B.7 ( Donoghue ( 1969 ); Corominas & Balaguer ( 1954 )) . L et ϕ ∈ C ∞ ( R ) . If ϕ is not a p olynomial, then ther e exists a p oint t 0 ∈ R such that ϕ ( m ) ( t 0 ) = 0 , m ∈ N . In the subsequent analysis, if ϕ satisfies the smo othness assumption in Assumption 2.3 , we assume that the p oint t 0 is such that ϕ ( m ) ( t 0 ) = 0 , m ∈ N . As a corollary of Lemma B.4 , we obtain an appro ximation of the identit y function, whic h serv es as a fundamen tal to ol in the subsequen t analysis. Corollary B.8. Fix Q > 0 . L et ϕ b e an activation function satisfying Assumption 2.1 with ϕ ′ ( t 0 ) = 0 . F or any L ⩾ 2 , and sufficiently smal l ϵ ∈ (0 , 1) , ther e exists a neur al network g ∈ H ϕ,L (1 , 1 , 2 , B ϵ ) , with B ϵ satisfying B ϵ ⩽ L − 1 ϵ Q 2 sup | t − t 0 | ⩽ Q | ϕ ′′ ( t ) | 4 | ϕ ′ ( t 0 ) | 2 , such that sup x ∈ [ − Q,Q ] | g ( x ) − x | < ϵ. Pr o of. By Lemma B.4 , there exist neural netw orks { ν i } L − 1 i =1 suc h that for eac h i , sup x ∈ [ − Q − 1 ,Q +1] | ν i ( x ) − x | ⩽ ϵ L − 1 . Eac h net work ν i has depth 2, width 2, and parameter norm b ounded by ∥ θ ( ν i ) ∥ ∞ ⩽ L − 1 ϵ Q 2 sup | t − t 0 | ⩽ Q | ϕ ′′ ( t ) | 4 | ϕ ′ ( t 0 ) | 2 . (12) W e define the comp osite net work g : = ν L − 1 ◦ ν L − 2 ◦ · · · ◦ ν 1 . Then for x ∈ [ − Q, Q ], the appro ximation error is b ounded by: | g ( x ) − x | ⩽ L − 1 X l =2 | ν l ◦ · · · ◦ ν 1 ( x ) − ν l − 1 ◦ · · · ◦ ν 1 ( x ) | + | ν 1 ( x ) − x | ⩽ ( L − 1) ϵ L − 1 ⩽ ϵ. It follo ws that the depth of g is L , its width is 2. Since the parameter norm b ound ( 12 ) holds for each comp osition ν i , it also applies to g . This completes the pro of. B.4 Appro ximation of Piecewise Constants In this subsection, we construct shallow neural netw ork appro ximations for the follo wing piecewise constant functions, consisting of K 2 d distinct pieces: C ( x ) = X i ∈ [ K ] d , j ∈ [ K ] d c i , j 1 Ω K i , j ( x ) (13) W e b egin b y reformulating C ( x ) using the follo wing lemma. 28 Lemma B.9. L et C ( x ) b e define d by ( 13 ) . Then C ( x ) = X j ∈ [ K ] d X i ∈ [ K ] d c i , j 1 Ω K i ( x ) 1 Ω K 1 , j x − X i ∈ [ K ] d a K i 1 Ω K i ( x ) . (14) Her e, a K i denotes the lower-left vertex asso ciate d with the c el l Ω K i . Sp e cific al ly, for i = ( i 1 , . . . , i d ) , a K i : = i 1 − 1 K , · · · , i d − 1 K . Pr o of. 
F or an y i ∈ [ K ] d , j ∈ [ K ] d , the follo wing iden tity holds: 1 Ω K i , j ( x ) = 1 Ω K i ( x ) 1 Ω K 1 , j x − a K i = 1 Ω K i ( x ) 1 Ω K 1 , j x − X i ∈ [ K ] d a K i 1 Ω K i ( x ) . Substituting this representation of 1 Ω K i , j ( x ) into ( 13 ), summing first ov er i and subsequently o ver j , yields the desired expression for the piecewise constan t function C in ( 14 ). With the reform ulation of C provided by Lemma B.9 , the problem of approximating a piecewise constant function with K 2 d pieces is reduced to appro ximating K d distinct piecewise constan t functions, eac h of whic h consists of K d regions of the form { Ω K i } i ∈ [ K ] d . More explicitly , for each j ∈ [ K ] d , we consider the function C j ( x ) = X i ∈ [ K ] d c i , j 1 Ω K i ( x ) , j ∈ [ K ] d . (15) It is imp ortan t to note that all these K d piecewise constan t functions share exactly the same partition { Ω K i } i ∈ [ K ] d . Besides, we also need to extract the position of x relativ e to the lo wer-left corner a K i of the region Ω K i con taining it, and subsequently determine which refined grid cell Ω K 1 , j this relative co ordinate falls in to. W e first construct approximations of indicator functions in one dimension. Lemma B.10. L et ϕ satisfy Assumption 2.3 . F or any a < b and δ ∈ 0 , b − a 3 , and any ϵ > 0 sufficiently smal l, ther e exists a neur al network g ∈ H ϕ, 2 1 , 1 , 2 , 4 C 1 ( | a | + | b | + 1) ϵδ , wher e C 1 is the c onstant sp e cifie d in Assumption 2.3 , such that • (Appr oximation) | g ( x ) − 1 [ a,b ) ( x ) | < ϵ, x / ∈ [ a, a + δ ] ∪ [ b − δ , b ] . • (Bounde dness) ∥ g ∥ L ∞ ( R ) ⩽ 2( C 1 + 1) . Pr o of. Define g ( x ) = ϕ β x − a + δ 2 − ϕ β x − b − δ 2 , (16) where β > 0 will b e c hosen later. F or ev ery x / ∈ [ a, a + δ ] ∪ [ b − δ , b ], w e ha v e 1 [ a,b ) ( x ) = 1 [ a + δ 2 ,b − δ 2 ) ( x ) = H β x − a + δ 2 − H β x − b − δ 2 , 29 Hence, still for x / ∈ [ a, a + δ ] ∪ [ b − δ , b ] g ( x ) − 1 [ a,b ) ( x ) ⩽ ( ϕ − H ) β x − a + δ 2 + ( ϕ − H ) β x − b − δ 2 . By the Heaviside-lik e assumption on ϕ , each term is bounded by 2 C 1 / ( β δ ). Cho osing β = 4 C 1 δ ϵ therefore gives g ( x ) − 1 [ a,b ) ( x ) ⩽ ϵ, x / ∈ [ a, a + δ ] ∪ [ b − δ , b ] . Moreo ver, since ∥ H ∥ L ∞ ( R ) ⩽ 1 and ϕ satisfies the Hea viside-like condition, w e ha ve ∥ ϕ ∥ L ∞ ( R ) ⩽ C 1 + 1. Consequently ∥ g ∥ L ∞ ( R ) ⩽ 2( C 1 + 1) . Finally , from the explicit construction ( 16 ), eac h parameter in volv ed in g is b ounded in magnitude by max 4 C 1 ( | a | + | b | + 1) ϵδ , 1 = 4 C 1 ( | a | + | b | + 1) ϵδ , for sufficiently small ϵ . W e complete the pro of. Lemma B.11. With the same assumptions as in L emma B.10 , exc ept that ϕ is a R eLU-like activation function, ther e exists a neur al network g ∈ H ϕ, 2 1 , 1 , 4 , 4 C 2 ( | a | + | b | + 1) ϵδ , wher e C 2 is the c onstant sp e cifie d in Assumption 2.3 , such that • (Appr oximation) | g ( x ) − 1 [ a,b ) ( x ) | < ϵ, x / ∈ [ a, a + δ ] ∪ [ b − δ , b ] . • (Bounde dness) ∥ g ∥ L ∞ ( R ) ⩽ 2 . Pr o of. Define ψ ( x ) = 1 β δ ReLU( β ( x − a )) − 1 β δ ReLU( β ( x − ( a + δ ))) + 1 β δ ReLU( β ( x − ( b − δ ))) + 1 β δ ReLU( β ( x − b )) , (17) where β > 0 is a parameter to b e chosen. One c hecks directly that ψ ( x ) = 1 [ a,b ) ( x ) , x / ∈ [ a, a + δ ] ∪ [ b − δ, b ] (18) and that ∥ ψ ∥ L ∞ ( R ) ⩽ 1. No w construct g by replacing eac h ReLU activ ation in ( 17 ) with ϕ . The ReLU-like condition implies the uniform approximation b ound | ψ ( x ) − g ( x ) | ⩽ 4 C 2 β δ , x ∈ R . 
Choosing β = 4C_2/(ϵδ) and combining (18) gives

|g(x) − 1_{[a,b)}(x)| ⩽ ϵ, x ∉ [a, a + δ] ∪ [b − δ, b].

Furthermore, since ∥ψ∥_{L∞(R)} ⩽ 1, it follows that for sufficiently small ϵ,

∥g∥_{L∞(R)} ⩽ 1 + ϵ < 2.

Finally, from the explicit construction of g, every parameter in the network is bounded in magnitude by

max{1/(βδ), β(|a| + |b| + 1)} = max{ϵ/(4C_2), 4C_2(|a| + |b| + 1)/(ϵδ)} = 4C_2(|a| + |b| + 1)/(ϵδ).

This completes the proof.

We next construct approximations of indicator functions in general dimension d ⩾ 1. The following two lemmas characterize neural network approximations of all indicator functions associated with the coarse grid.

Lemma B.12. Let ϕ satisfy Assumptions 2.1–2.3. Fix d ∈ N_+ and K ∈ N_+ with K sufficiently large. Then for any sufficiently small ϵ ∈ (0, 1) and any δ ∈ (0, 1/(3K)), there exists a neural network

g ∈ H_{ϕ,3}(d, K^d, 2^{d+1} K^d, B_{ϵ,δ}),

with B_{ϵ,δ} satisfying B_{ϵ,δ} ≲ max{3/(ϵδ), 1/ϵ^d}, such that, for each i ∈ [K]^d, the i-th output [g(x)]_i satisfies:

• (Approximation) |[g(x)]_i − 1_{Ω_i^K}(x)| ⩽ ϵ for x ∉ Ω_{i,band}^K(δ).

• (Boundedness) For x ∈ R^d,
  |[g(x)]_i| ⩽ 2^{d+1}(1 + C_1)^d for Heaviside-like ϕ, and |[g(x)]_i| ⩽ 2^{d+1} for ReLU-like ϕ, (19)
  where C_1 is the constant appearing in Assumption 2.3.

Proof. We prove the result for Heaviside-like ϕ; the proof for ReLU-like ϕ follows in the same manner. For x ∈ R^d, by invoking Lemma B.10, for sufficiently small ϵ_1 > 0 and δ ∈ (0, 1/(3K)), there exist dK sub-networks

h_{l,i} ∈ H_{ϕ,2}(d, 1, 2, B_{ϵ_1}), with B_{ϵ_1} = 12C_1/(ϵ_1 δ),

satisfying, for all l ∈ [d] and i ∈ [K],

|h_{l,i}(x) − 1_{[(i−1)/K, i/K)}(x_l)| ⩽ ϵ_1, x_l ∉ [(i−1)/K, (i−1)/K + δ] ∪ [i/K − δ, i/K],

and ∥h_{l,i}∥_{L∞(R^d)} ⩽ 2(1 + C_1). Furthermore, by Lemma B.4, for sufficiently small ϵ_2 ∈ (0, 1), there exists a neural network ψ ∈ H_{ϕ,2}(d, 1, 2^d, B_{ϵ_2}) with B_{ϵ_2} ≲ (1/ϵ_2)^d such that, for q = (q_1, · · · , q_d) ∈ R^d,

sup_{q ∈ [−2(1+C_1), 2(1+C_1)]^d} |ψ(q) − q_1 q_2 · · · q_d| ⩽ ϵ_2.

Next, for i = (i_1, · · · , i_d), we define the i-th output of g as

[g(x)]_i = ψ(h_{1,i_1}(x), · · · , h_{d,i_d}(x)).

For sufficiently small ϵ_1, if x ∈ Ω_{i,int}^K(δ), we have |h_{l,i_l}(x) − 1| ⩽ ϵ_1, hence

|∏_{l=1}^{d} h_{l,i_l}(x) − 1| ⩽ max{(1 + ϵ_1)^d − 1, 1 − (1 − ϵ_1)^d} < 2dϵ_1, x ∈ Ω_{i,int}^K(δ).

Conversely, for x ∉ Ω_i^K, there exists at least one index l such that |h_{l,i_l}(x)| ⩽ ϵ_1, and therefore

|∏_{l=1}^{d} h_{l,i_l}(x)| ⩽ 2^d(1 + C_1)^d ϵ_1, x ∉ Ω_i^K.

By choosing ϵ_1 = ϵ/(2^{d+1}(1 + C_1)^d) and ϵ_2 = ϵ/2, we obtain

|[g(x)]_i − 1_{Ω_i^K}(x)| < 2^d(1 + C_1)^d ϵ_1 + ϵ_2 < ϵ, x ∉ Ω_{i,band}^K(δ).

Moreover, since ∥h_{l,i_l}∥_{L∞(R^d)} ⩽ 2(1 + C_1), we obtain

|[g(x)]_i| ⩽ 2^d(1 + C_1)^d + ϵ_2 < 2^{d+1}(1 + C_1)^d, x ∈ R^d.

Finally, regarding the network architecture, g consists of 2dK neurons in the first hidden layer and 2^d K^d neurons in the second hidden layer. Therefore, for sufficiently large K, the width of g is bounded by max{2dK, 2^d K^d} ⩽ 2^{d+1} K^d. Given that ∥θ(h_{l,i})∥_∞ = B_{ϵ_1}, ∥θ(ψ)∥_∞ = B_{ϵ_2}, and the connecting weights between the sub-networks (the output layer of the h_{l,i} and the input layer of ψ) are bounded by O(1), we conclude that ∥θ(g)∥_∞ is bounded by

B_{ϵ,δ} ≲ max{B_{ϵ_1}, B_{ϵ_2}, 1} ≲ max{3/(ϵδ), 1/ϵ^d}.

This completes the proof.
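To make the construction above concrete, the following minimal numerical sketch (assuming NumPy) instantiates the two-neuron univariate units of Lemma B.10 and their coordinate-wise product as in Lemma B.12, for the sigmoid activation, for which C_1 = 1 by Section A.2. For simplicity the product of the univariate outputs is taken exactly, standing in for the multiplication sub-network ψ of Lemma B.4, and the parameters K, δ, ϵ are illustrative choices rather than the ones dictated by the proofs.

```python
import numpy as np

def sigmoid(t):
    # clipped for numerical stability at the large |t| produced by beta ~ 1/(eps*delta)
    return 1.0 / (1.0 + np.exp(-np.clip(t, -60.0, 60.0)))

def indicator_1d(x, a, b, delta, eps):
    # Two-neuron unit of Lemma B.10: phi(beta*(x - a - delta/2)) - phi(beta*(x - b + delta/2)),
    # with beta = 4*C1/(eps*delta) and C1 = 1 for the sigmoid.
    beta = 4.0 / (eps * delta)
    return sigmoid(beta * (x - (a + delta / 2))) - sigmoid(beta * (x - (b - delta / 2)))

def coarse_cell_indicator(x, i, K, delta, eps):
    # Approximate indicator of the coarse cell Omega_i^K (Lemma B.12); the exact product
    # below stands in for the two-layer multiplication sub-network psi of Lemma B.4.
    vals = np.ones(x.shape[0])
    for l, il in enumerate(i):
        vals *= indicator_1d(x[:, l], (il - 1) / K, il / K, delta, eps)
    return vals

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, K, delta, eps = 2, 4, 0.01, 1e-2          # delta < 1/(3K), as required
    i = (2, 3)                                   # the cell [1/4, 1/2) x [1/2, 3/4)
    x = rng.uniform(0.0, 1.0, size=(20000, d))
    lo, hi = (np.array(i) - 1) / K, np.array(i) / K
    inside = np.all((x >= lo) & (x < hi), axis=1)
    # band region of the cell: points inside it within delta of one of its faces
    face_dist = np.minimum(x - lo, hi - x).min(axis=1)
    in_band = inside & (face_dist < delta)
    err = np.abs(coarse_cell_indicator(x, i, K, delta, eps) - inside.astype(float))
    print("max error off the band region:", err[~in_band].max())
```

Off the band region the printed error stays well below ϵ, while no control is claimed (or observed) inside the band, mirroring the two regimes in the lemma.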
Using the neural netw ork appro ximations for all indicator functions ov er the coarse grid in Lemma B.12 , we can appro ximate piecewise constan t functions o ver the coarse grid { Ω K i } i ∈ [ K ] d . Lemma B.13. L et ϕ satisfy Assumptions 2.1 – 2.3 . L et c i , j b e given in ( 13 ) and assume that max i , j {| c i , j |} ⩽ c max . Fix d ∈ N + and K ∈ N + with K sufficiently lar ge. Then for any sufficiently smal l ϵ ∈ (0 , 1) and any δ ∈ (0 , 1 3 K ) , ther e exists a neur al network g ∈ H ϕ, 3 d, K d , 2 d +1 K d , B ϵ,δ,K , with B ϵ,δ,K satisfying B ϵ,δ,K ≲ max ( c max K d ϵδ , c d max K d 2 ϵ d ) , (20) such that, for e ach j ∈ [ K ] d , the j -th output [ g ( x )] j satisfies: • (Appr oximation) F or C j given in ( 15 ) , | [ g ( x )] j − C j ( x ) | ⩽ ϵ, x / ∈ ∪ i Ω K i , band ( δ ) . • (Bounde dness) F or x ∈ R d , | [ g ( x )] i | ⩽ ( c max 2 d +2 (1 + C 1 ) d , He aviside-like ϕ, c max 2 d +2 , R eLU-like ϕ, wher e C 1 is the c onstant app e aring in Assumption 2.3 . Pr o of. W e prov e the result for the Heaviside-lik e ϕ , and the pro of for the ReLU-like ϕ follows in the same manner. Let ψ b e the neural net work constructed in Lemma B.12 , sufficiently small ϵ 1 ∈ (0 , 1), and δ ∈ (0 , 1 3 K ). W e ha v e ∥ θ ( ψ ) ∥ ∞ ≲ max { 3 / ( ϵ 1 δ ) , (1 /ϵ 1 ) d } . Define the neural net work g b y specifying its j -th output comp onen t as [ g ( x )] j = X i ∈ [ K ] d c i , j [ ψ ( x )] i . 32 Then, for an y x / ∈ S i Ω K i , band ( δ ) and an y j ∈ [ K ] d , we obtain | [ g ( x )] j − C j ( x ) | = X i ∈ [ K ] d c i , j [ ψ ( x )] i − X i ∈ [ K ] d c i , j 1 Ω K i ( x ) ⩽ X i ∈ [ K ] d | c i , j | [ ψ ( x )] i − 1 Ω K i ( x ) ⩽ c max K d ϵ 1 . By selecting ϵ 1 = ϵ c max K d , we obtain j ∈ [ K ] d | [ g ( x )] j − C j ( x ) | ⩽ ϵ, x / ∈ ∪ i Ω K i , band ( δ ) , j ∈ [ K ] d . Moreo ver, if x / ∈ [0 , 1) d , then C j ( x ) = 0. Hence, for an y j ∈ [ K ] d | [ g ( x )] j | = | [ g ( x )] j − C j ( x ) | ⩽ ϵ. On the other hand, if x ∈ [0 , 1) d , let i ∈ [ K ] d b e such that x ∈ Ω K i . Then | [ g ( x )] j | ⩽ | c i , j || [ ψ ( x )] i | + X l ∈ [ K ] d , l = i | c l , j || [ ψ ( x )] l − 1 Ω K l ( x ) | ⩽ 2 d +1 c max (1 + C 1 ) d + ϵ, Therefore, for ϵ sufficiently small, w e obtain the uniform b ound for x ∈ R d | [ g ( x )] j | ⩽ c max 2 d +2 (1 + C 1 ) d , j ∈ [ K ] d . Finally , the input dimension, output dimension, and width of g coincide with those of ψ . Moreo ver, ∥ θ ( g ) ∥ ∞ is b ounded b y B ϵ,δ,K ⩽ c max ∥ θ ( ψ ) ∥ ∞ ≲ max ( c max K d ϵδ , c d max K d 2 ϵ d ) , as stated in ( 20 ). This concludes the pro of. Next, we construct neural netw ork mo dules that approximate the mapping x 7→ x − X i ∈ [ K ] d a K i 1 Ω K i ( x ) , thereb y extracting the relativ e position of x within its asso ciated partition grid cell. Lemma B.14. L et ϕ satisfy Assumptions 2.1 – 2.3 . Fix d ∈ N + and K ∈ N + with K sufficiently lar ge. Then for any sufficiently smal l ϵ > 0 and any δ ∈ (0 , 1 3 K ) , ther e exists a neur al network g ∈ H ϕ, 2 ( d, d, 6 dK, B ϵ,δ,K ) , with B ϵ,δ,K ≲ K / ( ϵδ ) , such that g ( x ) − x − X i ∈ [ K ] d a K i 1 Ω K i ( x ) ∞ ⩽ ϵ, ∀ x / ∈ ∪ i ∈ [ K ] d Ω K i , band ( δ ) . (21) 33 Pr o of. W e prov e the result for the Hea viside-like ϕ , and the pro of for the ReLU-like ϕ follows in the same manner. W e first rewrite the x − P i ∈ [ K ] d a K i 1 Ω K i ( x ) in the comp onent form as: x − X i ∈ [ K ] d a K i 1 Ω K i ( x ) = x 1 − P K i =1 i − 1 K 1 [ i − 1 K , i K ) ( x 1 ) · · · x d − P K i =1 i − 1 K 1 [ i − 1 K , i K ) ( x d ) . 
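As a brief illustration of the map that Lemma B.14 approximates, and before the neural approximation is assembled in the next step, the sketch below (assuming NumPy) evaluates the network form [g(x)]_l = x_l − Σ_{i=1}^{K} ((i−1)/K) h_{l,i}(x) coordinate-wise, using the sigmoid units of Lemma B.10 for the h_{l,i} and the exact identity in place of the identity sub-network of Corollary B.8; all numeric parameters are illustrative.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-np.clip(t, -60.0, 60.0)))

def h_unit(x, a, b, delta, eps):
    # Two-neuron indicator unit of Lemma B.10 (sigmoid case, C1 = 1).
    beta = 4.0 / (eps * delta)
    return sigmoid(beta * (x - (a + delta / 2))) - sigmoid(beta * (x - (b - delta / 2)))

def relative_position(x, K, delta, eps):
    # Coordinate-wise network form of Lemma B.14:
    #   [g(x)]_l = x_l - sum_i ((i-1)/K) * h_{l,i}(x_l),
    # with the identity sub-network of Corollary B.8 replaced by the exact identity.
    out = np.array(x, dtype=float, copy=True)
    for i in range(1, K + 1):
        out -= ((i - 1) / K) * h_unit(x, (i - 1) / K, i / K, delta, eps)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    K, delta, eps = 4, 0.01, 1e-3
    x = rng.uniform(0.0, 1.0, size=(50000, 2))
    target = x - np.floor(x * K) / K                   # x - a_K(x): offset from the lower-left corner
    # off the band region: every coordinate at distance > delta from the coarse grid planes
    off_band = (np.abs(x * K - np.round(x * K)) > delta * K).all(axis=1)
    err = np.abs(relative_position(x, K, delta, eps) - target).max(axis=1)
    print("max coordinate-wise error off the band region:", err[off_band].max())
```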
By Corollary B.8 , for sufficiently small ϵ 1 , there exist neural netw orks ψ l ∈ H ϕ, 2 ( d, 1 , 2 , B ϵ 1 ) , l ∈ [ d ] , with B ϵ 1 ≲ 1 /ϵ 1 suc h that | ψ l ( x ) − x l | ⩽ ϵ 1 , x ∈ [0 , 1) d . Besides, by Lemma B.10 , there exist dK sub-netw orks for sufficien tly small ϵ 2 h l,i l ∈ H ϕ, 2 ( d, 1 , 2 , B ϵ 2 ,δ ) , with B ϵ 2 ,δ ≲ 1 / ( ϵ 2 δ )satisfying, for all l ∈ [ d ], i l ∈ [ K ] h l,i l ( x ) − 1 [ i l − 1 K , i l K ) ( x l ) ⩽ ϵ 2 , x / ∈ Ω K i , band ( δ ) . Define the l -th output of the constructed neural net works g as [ g ( x )] l = ψ l ( x ) − K X i =1 i − 1 K h l,i ( x ) , l ∈ [ d ] . Then, for an y l ∈ [ d ], for x / ∈ Ω K i , band ( δ ), we hav e [ g ( x )] l − x l − K X i =1 i − 1 K 1 [ i − 1 K , i K ) ( x l ) ! ⩽ | ψ l ( x ) − x l | + K X i =1 i − 1 K h l,i ( x ) − 1 [ i − 1 K , i K ) ( x l ) ⩽ ϵ 1 + K ϵ 2 ⩽ ϵ, where we choose ϵ 1 = ϵ 2 and ϵ 2 = ϵ 2 K in the last inequalit y . This prov es the approximation guaran ty in ( 21 ). Finally , eac h netw ork ψ l and eac h subnetw ork h l,i emplo ys tw o hidden neurons. Consequen tly , g comprises a total of 2 d ( K + 1) hidden neurons, and its width is therefore b ounded by 6 dK for all sufficien tly large K . Moreov er, ∥ θ ( g ) ∥ ∞ is b ounded b y B ϵ,δ,K ⩽ max { B ϵ 1 , B ϵ 2 ,δ } ≲ K ϵδ . This completes the pro of. Using the relative-position extraction established in Lemma B.14 , w e subsequen tly construct neural netw ork appro ximations of indicator functions on the refined grid cells. Lemma B.15. L et ϕ satisfy Assumptions 2.1 – 2.3 . Fix d ∈ N + and K ∈ N + with K sufficiently lar ge. Then for any sufficiently smal l ϵ > 0 and any δ ∈ (0 , 1 3 K 2 ) , ther e exists a neur al network g ∈ H ϕ, 4 d, K d , 2 d +2 K d , B ϵ,δ,K , 34 with B ϵ,δ,K satisfying B ϵ,δ,K ≲ max ( K d ϵδ 2 , K d 2 ϵ d ) , (22) such that, for e ach j ∈ [ K ] d , the j -th output [ g ( x )] j satisfies: • (Appr oximation) F or x ∈ ∪ i , j Ω K i , j , int ( δ ) , [ g ( x )] j − 1 Ω K 1 , j x − X i ∈ [ K ] d a K i 1 Ω K i ( x ) ⩽ ϵ K d . • (Bounde dness) F or x ∈ [0 , 1] d , X l ∈ [ K ] d | [ g ( x )] l | ⩽ ( 2 d +2 (1 + C 1 ) d , He aviside-like ϕ, 2 d +2 , R eLU-like ϕ, wher e C 1 is the c onstant app e aring in Assumption 2.3 . Pr o of. W e prov e the result for the Heaviside-lik e ϕ , and the pro of for the ReLU-like ϕ follows in the same manner. Let ψ 1 denote the neural net work constructed in Lemma B.14 with ϵ = δ / 2. F rom the construction of ψ 1 , we hav e ∥ θ ( ψ 1 ) ∥ ∞ ≲ K /δ 2 . F or i , j ∈ [ K ] d and x ∈ Ω K i , j , int ( δ ), w e can deduce the following: x − X i ∈ [ K ] d a K i 1 Ω K i ( x ) ∈ Ω K 1 , j , int ( δ ) , whic h implies that ψ 1 ( x ) ∈ Ω K 1 , j , int δ 2 , x ∈ Ω K i , j , int ( δ ) . Analogous to the construction in Lemma B.12 , w e construct a neural netw ork ψ 2 to appro ximate the family of indicator functions supp orted on the refined cells of the b ottom-left coarse cell (i.e., { Ω K 1 , j } j ∈ [ K ] d ). F or sufficiently small ϵ 1 ∈ (0 , 1) and δ ∈ (0 , 1 3 K 2 ), there exists a netw ork ψ 2 ∈ H ϕ, 3 ( d, K d , 2 d +1 K d , B ϵ 1 ,δ ) with B ϵ 1 ,δ ≲ max { 1 / ( ϵ 1 δ ) , (1 /ϵ 1 ) d } , such that for all j ∈ [ K ] d , the following approximation error bound holds: [ ψ 2 ( x )] j − 1 Ω K 1 , j ( x ) ⩽ ϵ 1 , x / ∈ Ω K 1 , j , band δ 2 . Moreo ver, the output of ψ 2 also satisfies the b ound giv en in ( 19 ). No w, define g = ψ 2 ◦ ψ 1 . W e then obtain the approximation [ g ( x )] j − 1 Ω K 1 , j x − X i ∈ [ K ] d a K i 1 Ω K i ( x ) ⩽ ϵ 1 , x ∈ [ i , j Ω K i , j , int ( δ ) . 
F or a fixed x ∈ [0 , 1] d , if ψ 1 ( x ) ∈ Ω K 1 , we denote j as the index such that ψ 1 ( x ) ∈ Ω K 1 ,j . Then due to the b oundedness of ψ 2 , we hav e the follo wing b ound: X l ∈ [ K ] d | [ g ( x )] l | = | g ( x ) | j + X l ∈ [ K ] d , l = j [ ψ 2 ( ψ 1 ( x ))] l − 1 Ω K 1 , l ( ψ 1 ( x )) ⩽ 2 d +1 (1 + C 1 ) d + K d ϵ 1 ⩽ 2 d +2 (1 + C 1 ) d , 35 where the last inequality follows from c ho osing ϵ 1 = ϵ/K d for sufficien tly small ϵ . On the other hand, if ψ 1 ( x ) / ∈ Ω K 1 , we can bound the norm similarly b y K d ϵ 1 ⩽ ϵ . Finally , note that the depth of ψ 1 is 2 and that of ψ 2 is 3, so the depth of composition g = ψ 2 ◦ ψ 1 is depth 4. The width of g is b ounded b y max { 6 dK, 2 d +1 K d } ⩽ 2 d +2 K d for sufficien tly large K . Moreov er, by the construction of ψ 1 , the norm of parameters in the output la yer of ψ 1 is b ounded b y O (1 /δ ), the input la yer of ψ 2 is b ounded b y O (1 / ( ϵ 1 δ )), therefore the norm of parameters connecting ψ 1 and ψ 2 are b ounded b y O (1 / ( ϵ 1 δ 2 )). Th us, the norm of the net work g is b ounded b y B ϵ,δ,K ≲ max ∥ θ ( ψ 1 ) ∥ ∞ , ∥ θ ( ψ 2 ) ∥ ∞ , 1 δ 2 ϵ 1 ≲ max ( K d ϵδ 2 , K d 2 ϵ d ) as stated in ( 22 ). This concludes the pro of. Finally , by combining the neural net work approximations for piecewise constan t functions o ver the coarse grid in Lemma B.13 , with the neural netw ork approximations for indicator functions on the refined grid cells in Lemma B.15 , we can construct neural netw ork appro ximations for piecewise constant functions o v er the refined grid cells, consisting of K 2 d pieces. Lemma B.16. L et ϕ satisfy Assumptions 2.1 – 2.3 . L et C ( x ) b e given in ( 13 ) and assume that max i , j {| c i , j |} ⩽ c max . Fix d ∈ N + and K ∈ N + with K sufficiently lar ge. Then for any sufficiently smal l ϵ > 0 and any δ ∈ (0 , 1 3 K 2 ) , ther e exists a neur al network g ∈ H ϕ, 5 ( d, 1 , 2 d +3 K d , B ϵ,δ,K ) , with B ϵ,δ,K ≲ ( c max + 1) 2 d max ( K d 2 ϵ d , K 2 d ϵ 2 , K d ϵδ 2 ) , (23) such that the fol lowing holds: • (Appr oximation) | g ( x ) − C ( x ) | ⩽ ϵ, x ∈ ∪ i , j Ω K i , j , int ( δ ) . • (Bounde dness) F or x ∈ [0 , 1] d , | g ( x ) | ⩽ ( c max 4 d +3 (1 + C 1 ) 2 d , He aviside-like ϕ, c max 4 d +3 , R eLU-like ϕ, wher e C 1 is the c onstant app e aring in Assumption 2.3 . Pr o of. W e prov e the result for the Hea viside-like ϕ , and the pro of for the ReLU-like ϕ follows in the same manner. Let ϵ 1 b e sufficien tly small. W e b egin by considering the neural netw ork constructed in Lemma B.13 with sufficien tly small ϵ 1 . F rom the construction of ψ 1 , we hav e ∥ θ ( ψ 1 ) ∥ ∞ ≲ max { c max K d / ( ϵ 1 δ ) , c d max K d 2 /ϵ d 1 } . By the appro ximation guaran ty for ψ 1 , we hav e | [ ψ 1 ( x )] j − C j ( x ) | ⩽ ϵ 1 , x ∈ ∪ i , j Ω K i , j , int ( δ ) . Next, by Corollary B.8 , we obtain K d subnet works { ψ 2 , j } j ∈ [ K ] d , each with depth 3 and ∥ θ ( ψ 2 , j ) ∥ ∞ ≲ 1 /ϵ 1 the norm of parameters in the input lay er for each ψ 2 , j is b ounded b y O (1). These subnetw orks satisfy | ψ 2 , j ( x ) − x | ⩽ ϵ 1 , for | x | ⩽ c max 2 d +2 (1 + C 1 ) d , j ∈ [ K ] d . Define the net work ψ 3 b y [ ψ 3 ( x )] j = ψ 2 , j ([ ψ 1 ( x )] j ) , 36 so that ψ 3 has depth 4 and width 2 d +2 K d . The norm of parameters connecting ψ 1 and ψ 2 , j is b ounded b y ∥ θ ( ψ 1 ) ∥ ∞ , since the since the w eights in the input la y er of ψ 2 , j are O (1). 
Therefore, w e ha ve the b ound for ∥ θ ( ψ 3 ) ∥ ∞ ∥ θ ( ψ 3 ) ∥ ∞ ≲ max {∥ θ ( ψ 1 ) ∥ ∞ , ∥ θ ( ψ 2 , j ) ∥ ∞ } ≲ max ( ( c max + 1) K d ϵ 1 δ , c d max K d 2 ϵ d 1 ) . F or the constructed ψ 3 , we hav e | [ ψ 3 ( x )] j − C j ( x ) | ⩽ 2 ϵ 1 , x ∈ ∪ i , j Ω K i , j , int ( δ ) and | [ ψ 3 ( x )] j | ⩽ | [ ψ 1 ( x )] j | + ϵ 1 ⩽ c max 2 d +3 (1 + C 1 ) d , x ∈ R d , j ∈ [ K ] d . Let ψ 4 b e the net work constructed in Lemma B.15 with ϵ = ϵ 1 . The depth and width of ψ 4 are 4 and 2 d +2 K d , resp ectiv ely . Moreov er, ∥ θ ( ψ 4 ) ∥ ∞ ≲ max { K d / ( ϵ 1 δ 2 ) , K d 2 /ϵ d 1 } and the norm of parameters in the output la yer of ψ 4 is b ounded by O ( K d 2 /ϵ d 1 ). F or the constructed ψ 4 , w e ha ve [ ψ 4 ( x )] j − e I j ( x ) ⩽ ϵ 1 K d , x ∈ ∪ i , j Ω K i , j , int ( δ ) , where e I j is the shorthand notation for e I j ( x ) = 1 Ω K 1 , j x − X i ∈ [ K ] d a K i 1 Ω K i ( x ) and X l ∈ [ K ] d | [ ψ 4 ( x )] l | ⩽ 2 d +2 (1 + C 1 ) d x ∈ R d . Again b y Lemma B.4 , for sufficien tly small ϵ 2 , there exist K d subnet works { ψ 5 , j } j ∈ [ K ] d suc h that | ψ 5 , j ( x, y ) − xy | ⩽ ϵ 2 , 0 ⩽ | x | , | y | < max { c max , 1 } 2 d +3 (1 + C 1 ) d . The norm of ψ 5 , j is b ounded by O 1 ϵ 2 2 , and the norm of the parameters in the input lay er of ψ 5 , j is b ounded b y O (1). Define the final constructed neural net work g as g ( x ) = X j ∈ [ K ] d ψ 5 , j ([ ψ 3 ( x )] j , [ ψ 4 ( x )] j ) . W e hav e the following appro ximation error b ound for constructed g | g ( x ) − C ( x ) | = X j ∈ [ K ] d ψ 5 , j ([ ψ 3 ( x )] j , [ ψ 4 ( x )] j ) − X j ∈ [ K ] d C j ( x ) e I j ( x ) ⩽ X j ∈ [ K ] d | ψ 5 , j ([ ψ 3 ( x )] j , [ ψ 4 ( x )] j ) − [ ψ 3 ( x )] j [ ψ 4 ( x )] j | + X j ∈ [ K ] d [ ψ 3 ( x )] j [ ψ 4 ( x )] j − C j ( x ) e I j ( x ) ⩽ K d ϵ 2 + X j ∈ [ K ] d | [ ψ 4 ( x )] j | | [ ψ 3 ( x )] j − C j ( x ) | + X j ∈ [ K ] d | C j ( x ) | [ ψ 4 ( x )] j − e I j ( x ) ⩽ K d ϵ 2 + 2 ϵ 1 X j ∈ [ K ] d | [ ψ 4 ( x )] j | + c max ϵ 1 ⩽ K d ϵ 2 + c max + 2 d +3 (1 + C 1 ) d ϵ 1 ⩽ ϵ, x ∈ ∪ i , j Ω K i , j , int ( δ ) . 37 where we choose ϵ 1 = ϵ 2 ( c max + 2 d +3 (1 + C 1 ) d ) , ϵ 2 = ϵ 2 K d in the last inequality . Moreo v er, w e ha ve the uniform b ound for the output of g as | g ( x ) | = X j ∈ [ K ] d ψ 5 , j ([ ψ 3 ( x )] j , [ ψ 4 ( x )] j ) − [ ψ 3 ( x )] j [ ψ 4 ( x )] j + [ ψ 3 ( x )] j [ ψ 4 ( x )] j ⩽ X j ∈ [ K ] d | ψ 5 , j ([ ψ 3 ( x )] j , [ ψ 4 ( x )] j ) − [ ψ 3 ( x )] j [ ψ 4 ( x )] j | + X j ∈ [ K ] d | [ ψ 3 ( x )] j [ ψ 4 ( x )] j | ⩽ X j ∈ [ K ] d ϵ 2 + X j ∈ [ K ] d max x ∈ R d [ ψ 3 ( x )] j [ ψ 4 ( x )] j ⩽ ϵ 2 + c max 2 d +3 (1 + C 1 ) d X j ∈ [ K ] d [ ψ 4 ( x )] j ⩽ ϵ 2 + c max 2 2 d +5 (1 + C 1 ) 2 d ⩽ c max 4 d +3 (1 + C 1 ) 2 d , x ∈ [0 , 1] d . Finally , regarding the architecture of g , the depth is 5, and the width can b e b ounded by 2 d +2 K d + 2 d +2 K d = 2 d +3 K d . The norm of parameters connecting ψ 3 , ψ 4 to { ψ 5 , j } j is b ounded b y max {∥ θ ( ψ 3 ) ∥ ∞ , ∥ θ ( ψ 4 ) ∥ ∞ } , since the weigh ts in the input la yer of { ψ 5 , j } j are O (1). Therefore, ∥ θ ( g ) ∥ ∞ is b ounded b y B ϵ,δ,K ≲ max {∥ θ ( ψ 3 ) ∥ ∞ , ∥ θ ( ψ 4 ) ∥ ∞ , ∥ θ ( ψ 5 , j ) ∥ ∞ } ≲ max ( ( c max + 1) K d ϵ 1 δ , c d max K d 2 ϵ d 1 , K d ϵ 1 δ 2 , K d 2 ϵ d 1 , 1 ϵ 2 2 ) ≲ ( c max + 1) 2 d max ( K d 2 ϵ d , K 2 d ϵ 2 , K d ϵδ 2 ) , as stated in ( 23 ). This concludes the pro of. R emark B.17 . 
In fact, one can approximate piecewise constan t functions with K 2 d pieces o v er refined grids b y directly approximating all indicator functions on each of the K 2 d refined grid cells. Ho wev er, although the num b er of non-zero parameters in the newly constructed netw ork is O ( K 2 d ), the width of the netw ork is also O ( K 2 d ), leading to a total num b er of parameters that gro ws as O ( K 4 d ). This discrepancy betw een the count of non-zero parameters and the total parameter space imp oses an impractical sparsit y constrain t on the newly constructed netw ork. B.5 Appro ximation of Piecewise Polynomials By com bining the appro ximations for piecewise constan t functions established in Lemma B.16 along with the approximation for monomials derived in Lemma B.4 , one can construct neural net work architectures for piecewise p olynomials. Lemma B.18. L et ϕ satisfy Assumptions 2.1 – 2.3 . L et d ∈ N + and α = ( α 1 , · · · α d ) ∈ N d satisfy ∥ α ∥ 0 : = P d j =1 α j = m ⩾ 1 . L et C ( x ) b e given in ( 13 ) and assume that max i , j {| c i , j |} ⩽ c max . Fix d ∈ N + and K ∈ N + with K sufficiently lar ge. Then for any sufficiently smal l ϵ > 0 and any δ ∈ (0 , 1 3 K 2 ) , ther e exists a neur al network g ∈ H ϕ, 6 ( d, 1 , 2 d +4 K d , B ϵ,δ,K ) , with B ϵ,δ,K ≲ ( c max + 1) 3 d + m max ( K d 2 ϵ d , K 2 d ϵ 2 , K d ϵδ 2 , 1 ϵ m ) , (24) 38 such that the fol lowing pr op erties hold: • (Appr oximation) | g ( x ) − C ( x ) x α | ⩽ ϵ x ∈ ∪ i , j Ω K i , j , int ( δ ) . • (Bounde dness) F or x ∈ [0 , 1) d , | g ( x ) | ⩽ ( c max 4 d +4 (1 + C 1 ) 2 d , He aviside-like ϕ, c max 4 d +4 , R eLU-like ϕ, wher e C 1 is the c onstant app e aring in Assumption 2.3 . Pr o of. W e prov e the result for the Hea viside-like ϕ , and the pro of for the ReLU-like ϕ follows in the same manner. F or sufficiently small ϵ 1 > 0, by Lemma B.4 , there exists a neural netw ork ψ 1 with depth 2, width 2 m , and ∥ θ ( ψ 1 ) ∥ ∞ ≲ (1 /ϵ 1 ) m suc h that sup x ∈ [ − 1 , 1] d | ψ 1 ( x ) − x α 1 1 · · · x α d d | ⩽ ϵ 1 , and | ψ 1 ( x ) | ⩽ 3 2 , x ∈ [ − 1 , 1] d . Next, b y Corollary B.8 , there exists a neural netw ork ν with depth 4, width 2, and parameter norm ∥ θ ( ν ) ∥ ∞ ≲ 1 /ϵ 1 , such that sup x ∈ [ − 3 2 , 3 2 ] | ν ( x ) − x | ⩽ ϵ 1 . Additionally , | ν ( x ) | < 2 for − 3 2 < x < 3 2 , and the parameter norms in the input and output la yers are bounded b y O (1) and O (1 /ϵ 1 ), resp ectively . W e now define the neural netw ork ψ 2 as the comp osition ψ 2 : = ν ◦ ψ 1 . By the prop erties of the netw ork composition, w e ha ve the following approximation b ound ψ 2 ( x ) − x α 1 1 · · · x α d d ⩽ 5 ϵ 1 , x ∈ [ − 1 , 1] d , and | ψ 2 ( x ) | ⩽ 2 , x ∈ [ − 1 , 1] d . Considering the arc hitecture for ψ 2 , its depth and width are 5 and 2 m , resp ectiv ely , and ∥ θ ( ψ 2 ) ∥ ∞ ≲ (1 /ϵ 1 ) m . Moreov er, the norm of the parameters in the output lay er of ψ 2 is b ounded b y O (1 /ϵ 1 ). Let ψ 3 b e the neural netw ork constructed in Lemma B.16 with ϵ = ϵ 1 . W e ha ve the appro ximation b ound | ψ 3 ( x ) − C ( x ) | ⩽ ϵ 1 , x ∈ ∪ i , j Ω K i , j , int ( δ ) . F urthermore, b y Lemma B.4 , there exists a neural netw ork ψ 4 with depth 2, width 4 and ∥ θ ( ψ 4 ) ∥ ∞ ≲ (1 /ϵ 1 ) 2 , such that | ψ 4 ( x, y ) − xy | ⩽ ϵ 1 , 0 ⩽ | x | , | y | < c max 4 d +3 (1 + C 1 ) 2 d . Moreo ver, the norm for parameters in the input la yer for ψ 4 is b ounded b y O (1). 
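The pairwise multiplication sub-network ψ_4 invoked here is an instance of the central-difference construction of Lemmas B.3 and B.4 with m = 2. The sketch below (assuming NumPy, the sigmoid activation, and the base point t_0 = 1, at which σ''(t_0) ≠ 0) checks numerically that the four-neuron unit reproduces the product q_1 q_2; the step size h and the input range Q are illustrative choices rather than those prescribed in the proofs.

```python
import numpy as np
from itertools import product

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sigmoid_d2(t):
    # second derivative of the sigmoid: sigma''(t) = s * (1 - s) * (1 - 2s)
    s = sigmoid(t)
    return s * (1.0 - s) * (1.0 - 2.0 * s)

def product_unit(q1, q2, h=1e-3, t0=1.0):
    """Two-layer central-difference unit of Lemmas B.3/B.4 with m = 2:
    T(q, h) = (2^m h^m sigma^{(m)}(t0))^{-1} * sum_{nu in {+-1}^m} (prod_i nu_i) * sigma(t0 + h * sum_i nu_i q_i),
    which approximates q1 * q2 for small h."""
    acc = 0.0
    for nu in product((-1.0, 1.0), repeat=2):
        acc += nu[0] * nu[1] * sigmoid(t0 + h * (nu[0] * q1 + nu[1] * q2))
    return acc / (4.0 * h**2 * sigmoid_d2(t0))

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    Q = 2.0                                   # input amplitude, playing the role of Q in Lemma B.4
    qs = rng.uniform(-Q, Q, size=(1000, 2))
    errs = [abs(product_unit(a, b) - a * b) for a, b in qs]
    print("max |T(q,h) - q1*q2| on [-Q,Q]^2:", max(errs))
    # the error shrinks as h shrinks, consistent with the O(h) bound of Lemma B.3
    errs_half = [abs(product_unit(a, b, h=5e-4) - a * b) for a, b in qs]
    print("with h/2:", max(errs_half))
```

Note the trade-off that drives the bound (11): as h decreases, the approximation error falls while the outer weight 1/(2^m h^m ϕ^{(m)}(t_0)) grows, which is exactly why B_ϵ scales like (1/ϵ)^m.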
Finally , we define the constructed neural netw ork g as g ( x ) = ψ 4 ( ψ 2 ( x ) , ψ 3 ( x )) , x ∈ [0 , 1) d , then we obtain the approximation b ound for x ∈ ∪ i , j Ω K i , j , int ( δ ) | g ( x ) − C ( x ) x α | ⩽ | ψ 4 ( ψ 2 ( x ) , ψ 3 ( x )) − ψ 2 ( x ) ψ 3 ( x ) | + | ψ 2 ( x ) ψ 3 ( x ) − C ( x ) x α | ⩽ ϵ 1 + | ψ 2 ( x ) || ψ 3 ( x ) − C ( x ) | + | C ( x ) || ψ 2 ( x ) − x α | ⩽ ϵ 1 + 2 ϵ 1 + 5 c max ϵ 1 ⩽ ϵ, 39 where w e c ho ose ϵ 1 = ϵ 3+5 c max in the last inequalit y . Moreov er, w e can bound g ( x ) for x ∈ [0 , 1) d as follows: | g ( x ) | ⩽ | ψ 4 ( ψ 2 ( x ) , ψ 3 ( x )) − ψ 2 ( x ) ψ 3 ( x ) | + | ψ 2 ( x ) ψ 3 ( x ) | ⩽ ϵ 1 + 2 × c max 4 d +3 (1 + C 1 ) 2 d ⩽ c max 4 d +4 (1 + C 1 ) 2 d . Finally , the depth of g is 6, and its width can b e b ounded b y 2 d +3 K d + max { 2 m , 4 } ⩽ 2 d +4 K d . The norm of parameters connecting ψ 2 , ψ 3 , ψ 4 is b ounded b y max {∥ θ ( ψ 3 ) ∥ ∞ , ∥ θ ( ψ 4 ) ∥ ∞ } , since the weigh ts in the input lay er of ψ 4 are O (1). Therefore, ∥ θ ( g ) ∥ ∞ can b e bounded b y B ϵ,δ,K ≲ max {∥ θ ( ψ 2 ) ∥ ∞ , ∥ θ ( ψ 3 ) ∥ ∞ , ∥ θ ( ψ 4 ) ∥ ∞ } ≲ ( c max + 1) 3 d + m max ( K d 2 ϵ d , K 2 d ϵ 2 , K d ϵδ 2 , 1 ϵ m ) , as stated in ( 24 ). This concludes the pro of. B.6 Pro of of Theorem 3.1 ( L 2 Appro ximation) By com bining the piecewise polynomial approximation in Lemma B.1 with the neural net work constructions in Lemma B.18 , we obtain the following approximation guarant y for f ⋆ ∈ W s, ∞ ([0 , 1] d ). Theorem B.19. L et ϕ satisfy Assumptions 2.1 – 2.3 . F or any s > 0 and any f ⋆ ∈ W s, ∞ ([0 , 1] d ) with ∥ f ⋆ ∥ W s, ∞ ([0 , 1]) d ⩽ 1 . L et ϵ ∈ (0 , 1) b e sufficiently smal l, and we define K : = ⌈ (2 c 1 ( s, d ) /ϵ ) 1 / 2 s ⌉ . Then, for any δ ∈ (0 , 1 3 K 2 ) , ther e exists a neur al network g ∈ H ϕ, 6 ( d, 1 , M ϵ , B ϵ,δ ) , with M ϵ ≲ 1 ϵ d 2 s , B ϵ,δ ≲ max ( 1 ϵ max n d 2 2 s + d, d s +2 , ⌈ s ⌉ o , 1 δ 2 ϵ d 2 s +1 ) , (25) such that the fol lowing pr op erties hold: • (Appr oximation) ∥ g − f ⋆ ∥ L ∞ ( S i , j Ω K i , j , in t ( δ ) ) ⩽ ϵ. • (Bounde dness) F or x ∈ [0 , 1) d , | g ( x ) | ⩽ ( ⌈ s ⌉ d c 2 ( s, d )4 d +4 (1 + C 1 ) 2 d , He aviside-like ϕ, ⌈ s ⌉ d c 2 ( s, d )4 d +4 , R eLU-like ϕ, (26) Pr o of. W e prov e the result for the Heaviside-lik e ϕ , and the pro of for the ReLU-like ϕ follows in the same manner. By Lemma B.1 , for f ⋆ , there exists a piecewise polynomial p = P | α | < ⌈ s ⌉ p α defined ov er partition { Ω K i , j } i , j ∈ [ K ] d suc h that ∥ p − f ⋆ ∥ L ∞ ([0 , 1] d ) ⩽ c 1 ( s, d ) ∥ f ⋆ ∥ W s, ∞ ([0 , 1] d ) K − 2 s , with the magnitudes of the co efficien ts in each piecewise monomial p α b ounded, i.e. c max ⩽ c 2 ( s, d ) ∥ f ⋆ ∥ W s, ∞ ([0 , 1] d ) ⩽ c 2 ( s, d ). Next, by Lemma B.18 , for sufficien tly small ϵ 1 > 0, there exists a set of neural net w orks { ψ α } | α | < ⌈ s ⌉ suc h that | ψ α ( x ) − p α ( x ) | < ϵ 1 , x ∈ ∪ i , j Ω K i , j , int ( δ ) , 40 with b ounded output: | ψ α ( x ) | < c 2 ( s, d )4 d +4 (1 + C 1 ) 2 d Define the neural netw ork g = P | α | < ⌈ s ⌉ ψ α . Then we can b ound the L ∞ error for g appro ximating f ⋆ as follows: ∥ g − f ⋆ ∥ L ∞ ( S i , j Ω K i , j , in t ( δ ) ) ⩽ ∥ p − f ⋆ ∥ L ∞ ( S i , j Ω K i , j , in t ( δ ) ) + ∥ g − p ∥ L ∞ ( S i , j Ω K i , j , in t ( δ ) ) ⩽ ∥ p − f ⋆ ∥ L ∞ ([0 , 1] d ) + X | α | < ⌈ s ⌉ ∥ ψ α − p α ∥ L ∞ ( S i , j Ω K i , j , in t ( δ ) ) ⩽ c 1 ( s, d ) K − 2 s + ⌈ s ⌉ d ϵ 1 , Set K = & 2 c 1 ( s, d ) ϵ 1 2 s ' , ϵ 1 = ϵ 2 ⌈ s ⌉ d . Then we obtain ∥ g − f ⋆ ∥ L ∞ ( S i , j Ω K i , j , in t ( δ ) ) ⩽ ϵ. 
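The choice of K above is driven by the K^{−2s} rate of Lemma B.1. As a quick one-dimensional sanity check of that rate (assuming NumPy), the sketch below replaces the averaged Taylor polynomial with an ordinary Taylor expansion at each cell midpoint, takes f(x) = sin(2πx) as an illustrative target with s = 3 (so polynomials of order ⌈s⌉ − 1 = 2), and reports the error on a partition of [0, 1] into K² cells; the product err · K^{2s} stabilizing as K grows indicates the claimed rate.

```python
import numpy as np
from math import factorial

def local_taylor_error(f_derivs, r, K):
    """Sup-norm error of the degree-r Taylor polynomial, expanded at each cell midpoint,
    over a uniform partition of [0, 1] into K^2 cells of width K^-2, mirroring the
    piecewise polynomial of Lemma B.1 (ordinary Taylor polynomial in place of the
    averaged one)."""
    h = 1.0 / K**2
    xs = np.linspace(0.0, 1.0, 2001)
    cells = np.minimum((xs / h).astype(int), K**2 - 1)
    mids = (cells + 0.5) * h
    p = np.zeros_like(xs)
    for k in range(r + 1):
        p += f_derivs[k](mids) * (xs - mids) ** k / factorial(k)
    return np.abs(f_derivs[0](xs) - p).max()

if __name__ == "__main__":
    # derivatives of f(x) = sin(2*pi*x)
    w = 2 * np.pi
    f_derivs = [lambda x: np.sin(w * x),
                lambda x: w * np.cos(w * x),
                lambda x: -w**2 * np.sin(w * x)]
    r = 2                                    # polynomial order ceil(s) - 1 with s = 3
    for K in (2, 3, 4, 6):
        err = local_taylor_error(f_derivs, r, K)
        print(f"K={K:d}  sup error={err:.3e}  err * K^(2s)={err * K**(2 * (r + 1)):.3f}")
```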
And the output for g is b ounded b y: | g ( x ) | ⩽ X | α | < ⌈ s ⌉ | ψ α ( x ) | ⩽ ⌈ s ⌉ d c 2 ( s, d )4 d +4 (1 + C 1 ) 2 d , x ∈ [0 , 1) d . And the width of g is b ounded b y M ϵ = ⌈ s ⌉ d 2 d +4 K d ≲ 1 ϵ d 2 s , and the parameter norm of the netw ork g is bounded b y B ϵ,δ ≲ max ( K d 2 ϵ d 1 , K 2 d ϵ 2 1 , K d ϵ 1 δ 2 , max | α | < ⌈ s ⌉ 1 ϵ | α | 1 ) ≲ max ( 1 ϵ max n d 2 2 s + d, d s +2 , ⌈ s ⌉ o , 1 δ 2 ϵ d 2 s +1 ) , as stated in ( 3 ). This concludes the pro of. With the approximation of f ⋆ b y g established on S i , j ∈ [ K ] d Ω K i , j , int ( δ ) and uniform b oundedness ensured on [0 , 1) d in Theorem B.19 , w e now pro ceed to the proof of Theorem 3.1 . Pr o of for The or em 3.1 . It suffices to prov e for sufficien tly small ϵ < ϵ 0 , where ϵ 0 dep ends on s , d and ϕ . W e pro ve the result for the Heaviside-lik e ϕ , and the pro of for the ReLU-lik e ϕ follows in the same manner. By Theorem B.19 , for any sufficien tly small ϵ 1 > 0, with K : = ⌈ (2 c 1 ( s, d ) /ϵ 1 ) 1 / 2 s ⌉ and δ ∈ (0 , 1 / (3 K 2 )), there exists a neural net w ork g such that ∥ g − f ⋆ ∥ L ∞ ( S i , j Ω K i , j , in t ( δ ) ) ⩽ ϵ 1 . (27) and, for all x ∈ [0 , 1) d , | g ( x ) | ⩽ ⌈ s ⌉ d c 2 ( s, d )4 d +4 (1 + C 1 ) 2 d , W e estimate the L 2 ([0 , 1] d ) approximation by decomp osing the domain into interior ( ∪ i , j Ω K i , j , int ( δ )) and b oundary regions ( ∪ i , j Ω K i , j , band ( δ )): ∥ g − f ⋆ ∥ L 2 ([0 , 1] d ) ⩽ ∥ g − f ⋆ ∥ L 2 ( S i , j Ω K i , j , in t ( δ ) ) | {z } ( a ) + ∥ g − f ⋆ ∥ L 2 ( S i , j Ω K i , j , band ( δ ) ) | {z } ( b ) ⩽ ϵ 1 + ⌈ s ⌉ d c 2 ( s, d )4 d +4 (1 + C 1 ) 2 d × q 2 d ( K − 2 ) d − 1 δ × K 2 d = ϵ 1 + √ 2 dc 2 ( s, d )4 d +4 (1 + C 1 ) 2 d K √ δ . 41 Here, term (a) is controlled b y the appro ximation guarantee in ( 27 ), while term (b) follo ws from the uniform b oundedness ( B.6 ) together with the measure estimate of the b oundary region. W e now choose ϵ 1 = ϵ 2 , δ = ϵ 2 2 4 d +19 c 2 2 ( s, d ) d (1 + C 1 ) 4 d K 2 , whic h yields ∥ g − f ⋆ ∥ L 2 ([0 , 1] d ) ⩽ ϵ. Finally , by the construction in Theorem B.19 , the netw ork width is b ounded by O ((1 /ϵ ) d 2 s ) and the parameter norm ∥ θ ( g ) ∥ ∞ is b ounded b y: B ϵ ≲ max 1 ϵ max n d 2 2 s + d, d s +2 , ⌈ s ⌉ o 1 , 1 δ 2 ϵ d 2 s +1 1 ≲ 1 ϵ max n d 2 2 s + d, d s +2 , d +4 2 s +5 , ⌈ s ⌉ o , as stated in ( 3 ). This concludes the pro of. B.7 Appro ximation of W eight F unctions In the remaining part of this section, we strengthen the appro ximation result of Theorem 3.1 by impro ving the error guarantee from the L 2 ([0 , 1] d ) to L ∞ ([0 , 1] d ). The constructed netw orks giv en in Theorem 3.1 fail to achiev e an uniform L ∞ ([0 , 1] d ) approximation on the region ∪ i , j Ω K i , j , band ( δ ). The obstruction stems from the fact that indicator functions asso ciated with this region cannot b e uniformly appro ximated, as sho wn in Lemma B.12 . T o circumv en t this issue, w e in tro duce a weigh t function that assigns negligible mass to the asso ciated region, thereb y suppressing the appro ximation error. W e begin b y formally defining the basis functions emplo y ed in the construction of the w eight functions, which are tailored to the activ ation function classes as sp ecified in Assumption 2.3 . Definition B.20 (Basis function) . Let ϕ satisfy Assumption 2.3 . F or any K ∈ N + and any δ ∈ 0 , 1 12 K 2 , and an y β > 0, w e define the basis function B β ,δ,K ϕ as follows: • If ϕ is Heaviside-lik e, then: B β ,δ,K ϕ ( x ) = ϕ ( β ( x − 3 δ )) − ϕ β ( x + 3 δ − K − 2 ) . 
• If ϕ is ReLU-like, then: B β ,δ,K ϕ ( x ) = 1 2 δ β h ϕ ( β ( x − 2 δ )) − ϕ ( β ( x − 4 δ )) − ϕ β ( x − K − 2 + 4 δ ) + ϕ β ( x − K − 2 + 2 δ ) i . Building up on the basis functions established in Definition B.20 , we pro ceed to define the univ ariate weigh t functions as follows. Definition B.21 (Univ ariate weigh t function) . Let ϕ satisfy Assumption 2.3 . F or an y K ∈ N + and any δ ∈ 0 , 1 12 K 2 and any β > 0, w e define the primary w eigh t function w β ,δ,K ϕ, 1 : [0 , 1] → R piecewise via translation: w β ,δ,K ϕ, 1 ( x ) : = B β ,δ,K ϕ x − k K 2 , x ∈ k K 2 , k + 1 K 2 , (28) where k = 0 , 1 , · · · , K 2 − 1. 42 The complementary weigh t function w β ,δ,K ϕ, 2 is defined as: w β ,δ,K ϕ, 2 ( x ) : = 1 − w β ,δ,K ϕ, 1 ( x ) , x ∈ [0 , 1) . Finally , we imp ose p erio dic b oundary conditions such that w β ,δ,K ϕ,i (1) : = w β ,δ,K ϕ,i (0) for i ∈ { 1 , 2 } . The univ ariate w eigh t functions exhibit the following prop erties. Prop osition B.22. L et w β ,δ,K ϕ, 1 and w β ,δ,K ϕ, 2 b e the 1D weight functions define d in Definition B.21 . F or any ϵ ∈ (0 , 1) and any β ⩾ 2 δ ϵ max { C 1 , C 2 } , the fol lowing pr op erties hold: • (Partition of unity) w β ,δ,K ϕ, 1 ( x ) + w β ,δ,K ϕ, 2 ( x ) = 1 , x ∈ [0 , 1] . • (L o c al ly quasi-vanishing b ehavior) The weight functions ar e effe ctively supp orte d away fr om b and r e gions: | w β ,δ,K ϕ,i ( x ) | ⩽ ϵ, x ∈ Ω K,i band (2 δ ) , i = 1 , 2 , wher e the b and r e gions Ω K,i band (2 δ ) ar e define d as: Ω K, 1 band (2 δ ) : = K 2 [ k =0 − 2 δ + k K 2 , 2 δ + k K 2 \ [0 , 1] , Ω K, 2 band (2 δ ) : = K 2 − 1 [ k =0 2 k + 1 2 K 2 − 2 δ, 2 k + 1 2 K 2 + 2 δ . • (Bounde dness) max n ∥ w β ,δ,K ϕ, 1 ∥ L ∞ ([0 , 1]) , ∥ w β ,δ,K ϕ, 2 ∥ L ∞ ([0 , 1]) o ⩽ 2 C 1 + 3 . Pr o of. By Definition B.21 , the weigh t functions w β ,δ,K ϕ, 1 and w β ,δ,K ϕ, 2 are p erio dic with p erio d T = K − 2 . It therefore suffices to v erify the stated prop erties on the in terv al [0 , K − 2 ). Moreo v er, the partition-of-unit y property follows directly from Definition B.21 . W e then prov e the remaining t wo prop erties. W e first consider Heaviside-lik e activ ations and then deal with the ReLU-like case. • (Hea viside-lik e) Let χ be the indicator function defined b y χ ( x ) : = H ( β ( x − 3 δ )) − H ( β ( x + 3 δ − K − 2 )) W e hav e the following prop erties. – F or x ∈ [0 , 2 δ ] ∪ [ 1 K 2 − 2 δ, 1 K 2 ], χ ( x ) = 0. – F or x ∈ [ 1 2 K 2 − 2 δ, 1 2 K 2 + 2 δ ], χ ( x ) = 1. On the interv al I : = [0 , 2 δ ] ∪ [ 1 2 K 2 − 2 δ, 1 2 K 2 + 2 δ ] ∪ [ 1 K 2 − 2 δ, 1 K 2 ], the deviation b etw een w β ,δ,K ϕ, 1 ( x ) and χ ( x ) is b ounded by: | w β ,δ,K ϕ, 1 ( x ) − χ ( x ) | ⩽ | ( ϕ − H )( β ( x − 3 δ )) | + | ( ϕ − H )( β ( x + 3 δ − K − 2 )) | ( a ) ⩽ C 1 β | x − 3 δ | − 1 + | x + 3 δ − K − 2 | − 1 ( b ) ⩽ 2 C 1 β δ ( c ) ⩽ ϵ, where ( a ) follows from the Heaviside-lik e assumption in ( 1 ), ( b ) follows from the fact that x ∈ I ensures the distance from x to 3 δ and K − 2 − 3 δ is at least δ , and ( c ) follo ws from β ⩾ 2 C 1 δ ϵ . 43 The lo cally quasi-v anishing prop erties for w eigh t functions follow directly: | w β ,δ,K ϕ, 1 ( x ) | = | w β ,δ,K ϕ, 1 ( x ) − χ ( x ) | ⩽ ϵ, x ∈ [0 , 2 δ ] ∪ 1 K 2 − 2 δ, 1 K 2 , | w β ,δ,K ϕ, 2 ( x ) | = | 1 − w β ,δ,K ϕ, 1 ( x ) | = | χ ( x ) − w β ,δ,K ϕ, 1 ( x ) | ⩽ ϵ, x ∈ 1 2 K 2 − 2 δ, 1 2 K 2 + 2 δ . 
Moreo ver, by the Heaviside-lik e assumption in ( 1 ) and ∥ χ ∥ L ∞ ([0 , 1]) ⩽ 1, w e obtain: ∥ w β ,δ,K ϕ, 1 ∥ L ∞ ([0 , 1]) ⩽ 2 C 1 + 1 , ∥ w β ,δ,K ϕ, 2 ∥ L ∞ ([0 , 1]) ⩽ 1 + ∥ w β ,δ,K ϕ, 1 ∥ L ∞ ([0 , 1]) ⩽ 2 C 1 + 2 . • (ReLU-lik e) W e define the nominal trap ezoidal profile g via: χ ( x ) : = 1 2 δ β h ReLU( β ( x − 2 δ )) − ReLU( β ( x − 4 δ )) − ReLU( β ( x − K − 2 + 4 δ )) + ReLU( β ( x − K − 2 + 2 δ )) i . Similar to the Heaviside case, the function g satisfies – F or x ∈ [0 , 2 δ ] ∪ [ 1 K 2 − 2 δ, 1 K 2 ], χ ( x ) = 0. – F or x ∈ [ 1 2 K 2 − 2 δ, 1 2 K 2 + 2 δ ], χ ( x ) = 1. F or x ∈ R , the deviation b etw een w β ,δ,K ϕ, 1 ( x ) and χ ( x ) is b ounded as: w β ,δ,K ϕ, 1 ( x ) − χ ( x ) ⩽ 1 2 δ β 2 X j =1 | ϕ ( β ( x − 2 j δ )) − ReLU( β ( x − 2 j δ )) | + 1 2 δ β 2 X j =1 | ϕ ( β ( x − K − 2 + 2 j δ )) − ReLU( β ( x − K − 2 + 2 j δ )) | ( a ) ⩽ 2 C 2 δ β ( b ) ⩽ ϵ, (29) where ( a ) follows from the ReLU-lik e assumption in ( 2 ) and ( b ) follo ws from β ⩾ 2 C 2 δ β . The lo cally quasi-v anishing prop erties follow the same logic as the Heaviside-lik e case. Regarding b oundedness, since ∥ χ ∥ L ∞ ([0 , 1]) ⩽ 1, and com bining the error b ound in ( 29 ), w e establish the b oundedness for the weigh t functions: ∥ w β ,δ,K ϕ, 1 ∥ L ∞ ([0 , 1]) ⩽ 1 + ϵ < 2 , ∥ w β ,δ,K ϕ, 2 ∥ L ∞ ([0 , 1]) ⩽ 1 + ∥ w β ,δ,K ϕ, 1 ∥ L ∞ ([0 , 1]) < 3 . W e no w generalize the weigh t functions to arbitrary dimensions using a tensor product approac h. Definition B.23 (Multiv ariate w eigh t functions) . Let the parameters β , δ, K and the activ ation function ϕ satisfy the same conditions giv en in Definition B.21 . F or d ∈ N + and a m ulti-index v = ( v 1 , . . . , v d ) ∈ [2] d , we define the d -v ariate weigh t function w β ,δ,K ϕ, v : [0 , 1] d → R as: w β ,δ,K ϕ, v ( x ) := d Y l =1 w β ,δ,K ϕ,v l ( x l ) . 44 The multiv ariate w eight functions satisfy the follo wing properties. Prop osition B.24. L et w β ,δ,K ϕ, v denote the d -variate weight functions define d in Definition B.23 . F or any sufficiently smal l ϵ ∈ (0 , 1) and any β ⩾ 2(2 C 1 +3) d − 1 δ ϵ max { C 1 , C 2 } , the fol lowing pr op erties hold: • (Partition of unity) P v ∈ [2] d w β ,δ,K ϕ, v ( x ) = 1 , x ∈ [0 , 1] d . • (L o c al ly quasi-vanishing b ehavior) The weight function is effe ctively supp orte d away fr om the b and r e gion Ω K, v band (2 δ ) : | w β ,δ,K ϕ, v ( x ) | ⩽ ϵ, x ∈ Ω K, v band (2 δ ) . • (Bounde dness) w β ,δ,K ϕ, v L ∞ ([0 , 1] d ) ⩽ (2 C 1 + 3) d . Pr o of. W e no w establish the three prop erties: • (P artition of unit y) Exploiting the tensor pro duct structure, the summation o ver the m ulti-index v ∈ [2] d factorizes into a product of univ ariate sums: X v ∈ [2] d w β ,δ,K ϕ, v ( x ) = X v 1 ∈ [2] w β ,δ,K ϕ,v 1 ( x 1 ) X v 2 ∈ [2] w β ,δ,K ϕ,v 2 ( x 2 ) · · · X v d ∈ [2] w β ,δ,K ϕ,v d ( x d ) = d Y l =1 X v l ∈ [2] w β ,δ,K ϕ,v l ( x l ) = 1 , x ∈ [0 , 1] d , where w e use the partition of unity prop erty for univ ariate w eigh t functions w β ,δ,K ϕ, 1 ( x ) + w β ,δ,K ϕ, 2 ( x ) = 1 from Prop osition B.22 . • (Locally quasi-v anishing b eha vior) Let ϵ 1 ∈ (0 , 1) and assume β ⩾ 2 δ ϵ 1 max { C 1 , C 2 } . Consider an arbitrary p oin t x ∈ Ω K, v band (2 δ ). By the definition of the m ultiv ariate band region, there exists at least one co ordinate index l ∈ [ d ] such that x l ∈ Ω K,v l band (2 δ ). 
Thus, w e obtain: w β ,δ,K ϕ, v ( x ) = | w β ,δ,K ϕ,v l ( x l ) | Y k = l | w β ,δ,K ϕ,v k ( x k ) | ( a ) ⩽ (2 C 1 + 3) d − 1 ϵ 1 ( b ) ⩽ ϵ, x ∈ Ω K, v band (2 δ ) , where ( a ) follows from the lo cally quasi-v anishing b eha vior of the univ ariate w eight function w β ,δ,K ϕ,v l and the b oundedness of the remaining d − 1 univ ariate weigh t functions as giv en in Prop osition B.22 , and ( b ) is obtained by setting ϵ 1 = ϵ/ (2 C 1 + 3) d − 1 and c ho osing β ⩾ 2(2 C 1 +3) d − 1 δ ϵ max { C 1 , C 2 } . • (Boundedness) Lev eraging the univ ariate b ounds established in Proposition B.22 , we deduce the uniform b ound for multiv ariate w eight functions: w β ,δ,K ϕ, v L ∞ ([0 , 1] d ) ⩽ d Y l =1 w β ,δ,K ϕ.v l L ∞ ([0 , 1]) ⩽ (2 C 1 + 3) d . 45 W e now construct neural netw ork appro ximators for the univ ariate w eight functions. Noticing that w β ,δ,K ϕ, 1 and w β ,δ,K ϕ, 2 exhibit K 2 p eriods on [0 , 1], a straigh tforward construction uses a shallow neural net w ork of O ( K 2 ) width. Ho wev er, for the case d = 1, this O ( K 2 ) width dominates the width scaling of O ( K ) established for appro ximators constructed in Theorem 3.1 , thereb y inflating the total parameter complexit y for the subsequent L ∞ ([0 , 1] d ) appro ximation. T o mitigate this, w e exploit the p eriodicity to approximate the w eight functions with a net work of width O ( K ), while maintaining the same order of total parameters. Specifically , w e appro ximate the w eigh t functions on [0 , K − 1 ] with K p erio ds, and for all x ∈ [0 , 1], w e extract the relative p osition within its asso ciated coarse interv als. Define a K as: a K ( x ) := K − 1 X i =0 i K 1 [ i K , i +1 K ) ( x ) , and use the relative information x − a K ( x ) as the input to the netw orks. The follo wing lemma establishes neural netw ork approximators for univ ariate w eight functions on [0 , 1). Lemma B.25. L et ϕ satisfy Assumptions 2.1 – 2.3 , and let w β ,δ,K ϕ,i for i = 1 , 2 denote the univariate weight functions define d in Definition B.21 with the same p ar ameters β , δ, K . F or any sufficiently smal l ϵ ∈ (0 , 1) , sufficiently lar ge K and any β ⩾ 4 K δ ϵ max { C 1 , C 2 } , ther e exist neur al networks { ψ i } 2 i =1 such that ψ i ∈ H ϕ, 3 (1 , 1 , M K , B β ,ϵ,δ,K ) , i = 1 , 2 , with M K ≲ K , B β ,ϵ,δ,K ≲ K 2 β 2 ϵδ 2 , (30) such that the fol lowing pr op erties hold: • (Appr oximation) L et Ω K,i coarse , band ( δ ) denote the b and r e gion asso ciate d with the c o arse c el ls on [0 , 1] : Ω K, 1 coarse , band ( δ ) : = K [ k =0 − δ + k K , δ + k K \ [0 , 1] , Ω K, 2 coarse , band ( δ ) : = K − 1 [ k =0 2 k + 1 2 K − δ, 2 k + 1 2 K + δ , Then, for x ∈ [0 , 1) \ Ω K,i coarse , band ( δ ) , | ψ i ( x ) − w β ,K,δ ϕ,i ( x ) | ⩽ ϵ, i = 1 , 2 . • (Bounde dness) ∥ ψ i ∥ L ∞ ( R ) ⩽ 2 C 1 + 4 , i = 1 , 2 . Pr o of. It suffices to prov e the result for ψ 1 and w β ,δ,K ϕ, 1 , as the argument for the complementary ψ 2 and w β ,δ,K ϕ, 2 follo ws symmetrically . Hea viside-lik e case. W e first construct a tw o-la yer netw ork η 1 designed to approximate the w eight function on the reference domain [0 , K − 1 ]: η 1 ( x ) : = K − 1 X k =0 ϕ β x − k K 2 − 3 δ − ϕ β x − k + 1 K 2 + 3 δ . (31) 46 Fix an in terv al index k ∈ { 0 , . . . , K − 1 } . F or an y x ∈ [ k K 2 , k +1 K 2 ), the perio dicit y of w β ,δ,K ϕ, 1 implies that w β ,δ,K ϕ, 1 ( x ) = w β ,δ,K ϕ, 1 x − k K 2 , x ∈ k K 2 , k + 1 K 2 , k = 0 , 1 , · · · , K − 1 . 
Then the appro ximation error is determined by the tails of the remaining terms l = k in ( 31 ): | η 1 ( x ) − w β ,δ,K ϕ, 1 ( x ) | ⩽ K − 1 X l =0 ,l = k ϕ β x − l K 2 − 3 δ − ϕ β x − l + 1 K 2 + 3 δ ( a ) ⩽ K − 1 X l =0 ,l = k H β x − l K 2 − 3 δ − H β x − l + 1 K 2 + 3 δ + C 1 β K − 1 X l =0 ,l = k " x − l K 2 − 3 δ − 1 + x − l + 1 K 2 + 3 δ − 1 # ( b ) ⩽ 2 C 1 K 3 β δ ( c ) ⩽ ϵ 2 . (32) Here, Step ( a ) follows from the Heaviside-lik e assumption given in ( 2.3 ). Step ( b ) follows b ecause the distance from x ∈ [ k K 2 , k +1 K 2 ) to the switching p oints l /K 2 + 3 δ, ( l + 1) /K 2 − 3 δ of an y l = k term is at least 3 δ . Step ( c ) follo ws from β ⩾ 4 C 1 K 3 δ ϵ . Next, using the approximation guarantee given in ( 32 ), w e obtain: ∥ η 1 ∥ L ∞ ([0 ,K − 1 ]) ⩽ ∥ w β ,δ,K ϕ, 1 ∥ L ∞ ([0 ,K − 1 ]) + ϵ 2 ( a ) ⩽ 2 C 1 + 3 + ϵ 2 < 2 C 1 + 4 , where ( a ) comes from the boundedness of w β ,δ,K − 1 ϕ, 1 stated in Proposition B.22 . F or x / ∈ [0 , K − 1 ], a similar analysis to ( 32 ) yields: | η 1 ( x ) | ⩽ C 1 β K − 1 X k =0 " x − k K 2 − 3 δ − 1 + x − k + 1 K 2 + 3 δ − 1 # ⩽ 2 C 1 K 3 β δ ⩽ ϵ 2 < 1 . Th us, ∥ η 1 ∥ L ∞ ( R ) ⩽ 2 C 1 + 4. By Lemma B.14 , there exists a net work π ∈ H ϕ, 2 (1 , 1 , 6 K, B ϵ 1 ,δ,K ), where B ϵ 1 ,δ,K ≲ K ϵ 1 δ , appro ximating the relative p osition map x 7→ x − a K ( x ) such that: | π ( x ) − ( x − a K ( x )) | ⩽ ϵ 1 , x ∈ [0 , 1] \ Ω K, 1 coarse , band ( δ ) . (33) W e define the final appro ximation ψ 1 : = η 1 ◦ π . The uniform b oundedness of ψ 1 follo ws immediately from the b oundedness of the outer function η 1 : ∥ ψ 1 ∥ L ∞ ( R ) ⩽ ∥ η 1 ∥ L ∞ ( R ) ⩽ 2 C 1 + 4 . F or any x ∈ [0 , 1] \ Ω K, 1 coarse , band ( δ ), the error b etw een ψ 1 ( x ) and w β ,δ,K ϕ, 1 ( x ) is b ounded as: | ψ 1 ( x ) − w β ,δ,K ϕ, 1 ( x ) | = | η 1 ( π ( x )) − w β ,δ,K ϕ, 1 ( x − a K ( x )) | ⩽ | η 1 ( π ( x )) − η 1 ( x − a K ( x )) | | {z } ( a ) + | η 1 ( x − a K ( x )) − w β ,δ,K ϕ, 1 ( x − a K ( x )) | | {z } ( b ) ⩽ 2 β K ∥ ϕ ∥ Lip ϵ 1 + ϵ 2 ( c ) ⩽ ϵ, 47 where term ( a ) is b ounded by the approximation guaran tee giv en in ( 33 ), and there are 2 K activ ations with Lipschitz constant ∥ ϕ ∥ Lip in the construction of ( 31 ), while term ( b ) follo ws from x − a K ( x ) ∈ [0 , K − 1 ] and the approximation guaran tee in ( 32 ). Finally , ( c ) follows from c ho osing ϵ 1 = ϵ 4 K β ∥ ϕ ∥ Lip . The widths of η 1 and π are b ounded b y 6 K and 2 K , resp ectiv ely; thus, the width of the comp osition ψ 1 is b ounded b y 6 K . F urthermore, ∥ θ ( ψ 1 ) ∥ ∞ is b ounded b y : ∥ θ ( ψ 1 ) ∥ ∞ ⩽ ∥ θ ( η 1 ) ∥ ∞ ∥ θ ( π ) ∥ ∞ ≲ β B ϵ 1 ,δ,K ≲ K 2 β 2 ϵδ . ReLU-lik e case. The pro of for ReLU-lik e activ ations ϕ follo ws similarly to that for Hea viside-like activ ations. The netw ork η 1 is reconstructed as: η 1 ( x ) : = 1 2 δ β K − 1 X k =0 " ϕ β x − k K 2 − 2 δ − ϕ β x − k K 2 − 4 δ − ϕ β x − k + 1 K 2 + 4 δ + ϕ β x − k + 1 K 2 + 2 δ # . Consider an in terv al [ k K 2 , k +1 K 2 ). F or any index l = k , the asso ciated ReLU trapezoid v anishes on this in terv al. Consequen tly , the error is dominated b y the deviation of ϕ from ReLU: | η 1 ( x ) − w β ,δ,K ϕ, 1 ( x ) | ( a ) ⩽ 1 2 δ β K − 1 X l =0 ,l = k ReLU( β ( x − l /K 2 − 2 δ )) − ReLU( β ( x − l/K 2 − 4 δ )) − ReLU( β ( x − ( l + 1) /K 2 + 4 δ )) + ReLU( β ( x − ( l + 1) /K 2 + 2 δ )) + 2 C 2 K δ β ( b ) ⩽ ϵ 2 , (34) where ( a ) follo ws from the ReLU-lik e assumption given in ( 2 ) and ( b ) follows from β ⩾ 4 C 2 K δ ϵ . Define ψ 1 : = η 1 ◦ π . 
Following the same analysis as for Heaviside-like activations, we obtain the boundedness $\|\psi_1\|_{L^\infty(\mathbb R)}\leqslant\|\eta_1\|_{L^\infty(\mathbb R)}\leqslant2C_1+4$, and the approximation error between $\psi_1$ and $w^{\beta,\delta,K}_{\phi,1}$ is bounded as:
$$
|\psi_1(x)-w^{\beta,\delta,K}_{\phi,1}(x)|
\leqslant\underbrace{|\eta_1(\pi(x))-\eta_1(x-a_K(x))|}_{(a)}
+\underbrace{|\eta_1(x-a_K(x))-w^{\beta,\delta,K}_{\phi,1}(x-a_K(x))|}_{(b)}
\leqslant\frac{4K\|\phi\|_{\mathrm{Lip}}\,\epsilon_1}{\delta}+\frac{\epsilon}{2}\leqslant\epsilon,
\qquad x\in[0,1]\setminus\Omega^{K,1}_{\mathrm{coarse,band}}(\delta),
$$
where $(a)$ is bounded by the approximation guarantee given in (33) together with the fact that the construction of $\eta_1$ contains $4K$ activations with Lipschitz constant $\|\phi\|_{\mathrm{Lip}}$, and $(b)$ is bounded by the approximation guarantee given in (34) after setting $\epsilon_1=\frac{\delta\epsilon}{8K\|\phi\|_{\mathrm{Lip}}}$.

Finally, the depth of $\psi_1$ is 3 and its width is bounded by $O(K)$. The parameter norm satisfies
$$
\|\theta(\psi_1)\|_\infty\leqslant\|\theta(\eta_1)\|_\infty\,\|\theta(\pi)\|_\infty\lesssim\max\Bigl\{\beta,\frac{1}{\beta\delta}\Bigr\}B_{\epsilon_1,\delta,K}\lesssim\frac{K^2\beta^2}{\epsilon\delta^2},
$$
as stated in (30). This concludes the proof.

Lemma B.25 establishes the approximation of the univariate weight functions outside the coarse band region. To achieve a uniform approximation over the entire interval $[0,1]$, we multiply the network output by a spatial quasi-indicator function supported on the complement of the coarse bands. This strategy exploits the fact that within the coarse band regions the target weight functions are quasi-vanishing, while the neural network output remains uniformly bounded; consequently, the multiplication effectively suppresses the error in the band regions. In the following, we provide the definition of the quasi-indicator functions.

Definition B.26 (Quasi-indicator function). Let $\phi$ satisfy Assumption 2.3, with $K\in\mathbb N_+$, $\delta\in(0,\frac{1}{8K})$ and $\tilde\beta>0$. The primary quasi-indicator function is defined as:

• If $\phi$ is Heaviside-like, then
$$
I^{\tilde\beta,\delta,K}_{\phi,1}(x):=\sum_{k=0}^{K}\Bigl[\phi\Bigl(\tilde\beta\Bigl(x-\frac kK-\frac32\delta\Bigr)\Bigr)-\phi\Bigl(\tilde\beta\Bigl(x-\frac{k+1}{K}+\frac32\delta\Bigr)\Bigr)\Bigr].
\tag{35}
$$

• If $\phi$ is ReLU-like, then
$$
I^{\tilde\beta,\delta,K}_{\phi,1}(x):=\frac{1}{2\delta\tilde\beta}\sum_{k=0}^{K}\Bigl[\phi\Bigl(\tilde\beta\Bigl(x-\frac kK-2\delta\Bigr)\Bigr)-\phi\Bigl(\tilde\beta\Bigl(x-\frac kK-4\delta\Bigr)\Bigr)-\phi\Bigl(\tilde\beta\Bigl(x-\frac{k+1}{K}+4\delta\Bigr)\Bigr)+\phi\Bigl(\tilde\beta\Bigl(x-\frac{k+1}{K}+2\delta\Bigr)\Bigr)\Bigr].
\tag{36}
$$

In both cases, the complementary quasi-indicator function is defined by the translation:
$$
I^{\tilde\beta,\delta,K}_{\phi,2}(x):=I^{\tilde\beta,\delta,K}_{\phi,1}\Bigl(x+\frac{1}{2K}\Bigr),\qquad x\in[0,1].
$$

The quasi-indicator functions have the following properties:

Proposition B.27. Let $I^{\tilde\beta,\delta,K}_{\phi,i}$ denote the quasi-indicator functions defined in Definition B.26. For any sufficiently small $\epsilon\in(0,1)$ and any $\tilde\beta\geqslant\frac{8K}{\delta\epsilon}\max\{C_1,C_2\}$, the following properties hold for $i=1,2$:

• (Indicator approximation) The function approximates the indicator function as:
$$
\Bigl|I^{\tilde\beta,\delta,K}_{\phi,i}(x)-\mathbf 1_{(\Omega^{K,i}_{\mathrm{coarse,band}}(2\delta))^c}(x)\Bigr|\leqslant\epsilon,\qquad x\in D^{K,i}(\delta),
$$
where $(\Omega^{K,i}_{\mathrm{coarse,band}}(2\delta))^c:=[0,1]\setminus\Omega^{K,i}_{\mathrm{coarse,band}}(2\delta)$ and the region $D^{K,i}(\delta)$ is defined as:
$$
D^{K,i}(\delta):=\bigl([0,1]\setminus\Omega^{K,i}_{\mathrm{coarse,band}}(2\delta)\bigr)\cup\Omega^{K,i}_{\mathrm{coarse,band}}(\delta).
$$

• (Boundedness) $\|I^{\tilde\beta,\delta,K}_{\phi,i}\|_{L^\infty([0,1])}\leqslant2C_1+2$.

Proof. The proof proceeds analogously to that of Proposition B.22 and Lemma B.25. First, regarding the approximation error, we observe that substituting the exact Heaviside function into (35) (or the exact ReLU function into (36)) recovers the indicator function supported on $[0,1]\setminus\Omega^{K,i}_{\mathrm{coarse,band}}(2\delta)$ for all $x\in D^{K,i}(\delta)$. To establish the error bound for the approximator $I^{\tilde\beta,\delta,K}_{\phi,1}$, consider an arbitrary $x\in D^{K,1}(\delta)$.
By construction, the distance between $x$ and the transition points $\frac{k+\frac{i-1}{2}}{K}\pm\frac32\delta$ is bounded from below by $\frac12\delta$. Consequently, invoking the asymptotic properties of $\phi$ (specifically the Heaviside-like condition (1) or the ReLU-like condition (2)) ensures that the approximation error can be made arbitrarily small for sufficiently large $\tilde\beta$. The boundedness arguments for the quasi-indicator functions mirror those established for $\psi_i$ in Lemma B.25.

Leveraging the quasi-indicator functions, we now construct neural approximators for the univariate weight functions $w^{\beta,\delta,K}_{\phi,i}$ over the global domain $[0,1]$.

Lemma B.28. Let $\phi$ satisfy Assumptions 2.1–2.3, and let $w^{\beta,\delta,K}_{\phi,i}$ ($i=1,2$) denote the univariate weight functions defined in Definition B.21 with parameters $\beta,\delta,K$. For any sufficiently small $\epsilon\in(0,1)$, sufficiently large $K\in\mathbb N_+$, $\delta\in(0,\frac{1}{12K^2})$, and any $\beta\geqslant\frac{(24C_1+60)K}{\delta\epsilon}\max\{C_1,C_2\}$, there exist neural networks $\{g_i\}_{i=1,2}$ such that
$$
g_i\in\mathcal H_{\phi,4}(1,1,M_K,B_{\beta,\epsilon,\delta,K}),\quad i=1,2,
\qquad\text{with}\quad M_K\lesssim K,\quad B_{\beta,\epsilon,\delta,K}\lesssim\frac{K^2\beta^2}{\epsilon^2\delta^2},
\tag{37}
$$
such that the following properties hold for $i=1,2$:

• (Uniform approximation) For all $x\in[0,1]$, $\ |g_i(x)-w^{\beta,\delta,K}_{\phi,i}(x)|\leqslant\epsilon$.

• (Boundedness) $\|g_i\|_{L^\infty([0,1])}\leqslant(2C_1+5)^2$.

Proof. We focus on the construction of the approximator for $w^{\beta,\delta,K}_{\phi,1}$; the construction for $w^{\beta,\delta,K}_{\phi,2}$ proceeds analogously.

First, for sufficiently small $\epsilon_1\in(0,1)$, by Lemma B.25 there exists a neural network $\psi_1$ of depth 3, width $O(K)$, and parameter norm $O\bigl(\frac{K^2\beta^2}{\epsilon_1\delta^2}\bigr)$ such that
$$
|\psi_1(x)-w^{\beta,\delta,K}_{\phi,1}(x)|\leqslant\epsilon_1,\qquad x\in[0,1]\setminus\Omega^{K,1}_{\mathrm{coarse,band}}(\delta),
\tag{38}
$$
provided that $\beta\geqslant\frac{4K}{\delta\epsilon_1}\max\{C_1,C_2\}$. Furthermore, the network satisfies the uniform bound $\|\psi_1\|_{L^\infty([0,1])}\leqslant2C_1+4$.

Second, according to Definition B.26, the quasi-indicator $I^{\tilde\beta,\delta,K}_{\phi,1}$ is represented by a neural network of depth 2, width $O(K)$, and parameter norm $O(\tilde\beta/\delta)$. Furthermore, by Proposition B.27, upon choosing $\tilde\beta=\frac{16K}{\delta\epsilon_2}\max\{C_1,C_2\}$, we obtain the approximation error bound:
$$
\Bigl|I^{\tilde\beta,\delta,K}_{\phi,1}(x)-\mathbf 1_{(\Omega^{K,1}_{\mathrm{coarse,band}}(2\delta))^c}(x)\Bigr|\leqslant\frac{\epsilon_2}{2},\qquad x\in D^{K,1}(\delta),
\tag{39}
$$
and the uniform bound $\|I^{\tilde\beta,\delta,K}_{\phi,1}\|_{L^\infty([0,1])}\leqslant2C_1+2$.

Next, by invoking Corollary B.8, for sufficiently small $\epsilon_2\in(0,1)$, there exists a neural network $\mu$ of depth 2, width 2, and parameter norm $O(1/\epsilon_2)$ such that
$$
|\mu(x)-x|\leqslant\frac{\epsilon_2}{2},\qquad x\in[-(2C_1+2),\,2C_1+2].
\tag{40}
$$
Now, we define the composition of $\mu$ with the quasi-indicator function as $\nu_1=\mu\circ I^{\tilde\beta,\delta,K}_{\phi,1}$. By (39) and (40), we obtain the approximation error
$$
\Bigl|\nu_1(x)-\mathbf 1_{(\Omega^{K,1}_{\mathrm{coarse,band}}(2\delta))^c}(x)\Bigr|\leqslant\epsilon_2,\qquad x\in D^{K,1}(\delta),
\tag{41}
$$
and the uniform bound $\|\nu_1\|_{L^\infty([0,1])}\leqslant2C_1+4$. Structurally, $\nu_1$ is a neural network of depth 3, width $O(K)$, and parameter norm $O(\max\{\tilde\beta,1/\epsilon_2\})=O(K/(\delta\epsilon_2))$.

Subsequently, for sufficiently small $\epsilon_3\in(0,1)$, by Lemma B.4 there exists a neural network $\eta$ of depth 2, width 4, and parameter norm $O(1/\epsilon_3^2)$ such that
$$
|\eta(x,y)-xy|\leqslant\epsilon_3,\qquad |x|,|y|\leqslant2C_1+4.
\tag{42}
$$
We define the final approximator as $g_1(x):=\eta(\psi_1(x),\nu_1(x))$. We proceed to bound the global approximation error between $g_1(x)$ and $w^{\beta,\delta,K}_{\phi,1}(x)$ by partitioning the domain $[0,1]$ into three regions.
We choose the parameters $\epsilon_1=\epsilon_2=\frac{\epsilon}{6C_1+15}$ and $\epsilon_3=\frac{\epsilon}{3}$.

• For $x\in(\Omega^{K,1}_{\mathrm{coarse,band}}(2\delta))^c$, the approximation error is bounded by
$$
\begin{aligned}
|g_1(x)-w^{\beta,\delta,K}_{\phi,1}(x)|
&\leqslant|\eta(\psi_1(x),\nu_1(x))-\psi_1(x)\nu_1(x)|+\bigl|\psi_1(x)\nu_1(x)-w^{\beta,\delta,K}_{\phi,1}(x)\bigr|\\
&\leqslant|\eta(\psi_1(x),\nu_1(x))-\psi_1(x)\nu_1(x)|+|\nu_1(x)|\,|\psi_1(x)-w^{\beta,\delta,K}_{\phi,1}(x)|+|w^{\beta,\delta,K}_{\phi,1}(x)|\,\Bigl|\nu_1(x)-\mathbf 1_{(\Omega^{K,1}_{\mathrm{coarse,band}}(2\delta))^c}(x)\Bigr|\\
&\overset{(a)}{\leqslant}\epsilon_3+(2C_1+4)\epsilon_1+(2C_1+3)\epsilon_2<\epsilon,
\end{aligned}
$$
where $(a)$ follows from the boundedness of $\nu_1$ and $w^{\beta,\delta,K}_{\phi,1}$ and the approximation guarantees in (38), (41), and (42).

• For $x\in\Omega^{K,1}_{\mathrm{coarse,band}}(\delta)$, the approximator nearly vanishes:
$$
|g_1(x)|\overset{(a)}{\leqslant}|\psi_1(x)|\,|\nu_1(x)|+\epsilon_3\overset{(b)}{\leqslant}(2C_1+4)\epsilon_2+\epsilon_3\leqslant\frac23\epsilon,
$$
where $(a)$ follows from the approximation guarantee in (42), and $(b)$ follows from the approximation guarantee in (41) and the boundedness of $\psi_1$. Moreover, by Proposition B.22, the target univariate function also exhibits near-vanishing behavior, $|w^{\beta,\delta,K}_{\phi,1}(x)|\leqslant\epsilon_1$ for $x\in\Omega^{K,1}_{\mathrm{coarse,band}}(\delta)\subset\Omega^{K,1}_{\mathrm{coarse,band}}(2\delta)$, provided $\beta\geqslant\frac{4K}{\delta\epsilon_1}\max\{C_1,C_2\}$. Thus, the total approximation error is bounded as
$$
|g_1(x)-w^{\beta,\delta,K}_{\phi,1}(x)|\leqslant\epsilon_1+\frac23\epsilon<\epsilon,\qquad x\in\Omega^{K,1}_{\mathrm{coarse,band}}(\delta).
$$

• For $x\in\Omega^{K,1}_{\mathrm{coarse,band}}(2\delta)\setminus\Omega^{K,1}_{\mathrm{coarse,band}}(\delta)$, we obtain
$$
\begin{aligned}
|g_1(x)-w^{\beta,\delta,K}_{\phi,1}(x)|
&\overset{(a)}{\leqslant}|\psi_1(x)|\,|\nu_1(x)|+|w^{\beta,\delta,K}_{\phi,1}(x)|+\epsilon_3\\
&\leqslant|\psi_1(x)-w^{\beta,\delta,K}_{\phi,1}(x)|\,|\nu_1(x)|+(|\nu_1(x)|+1)\,|w^{\beta,\delta,K}_{\phi,1}(x)|+\epsilon_3\\
&\overset{(b)}{\leqslant}\epsilon_1(2C_1+4)+\epsilon_1(2C_1+5)+\epsilon_3\leqslant\epsilon,
\end{aligned}
$$
where $(a)$ follows from the approximation guarantee in (42), and $(b)$ follows from the approximation guarantee in (38), the boundedness of $\nu_1$, and the locally quasi-vanishing property of $w^{\beta,\delta,K}_{\phi,1}$ when $\beta\geqslant\frac{4K}{\delta\epsilon_1}\max\{C_1,C_2\}$.

The uniform boundedness of $g_1$ follows directly from the product bound:
$$
|g_1(x)|\leqslant|\psi_1(x)|\,|\nu_1(x)|+\epsilon_3\leqslant(2C_1+4)^2+\epsilon_3\leqslant(2C_1+5)^2,\qquad x\in[0,1].
$$
Regarding the network architecture, the composition of the constituent sub-networks yields a final architecture for $g_1$ with depth 4, width $O(K)$, and parameter norm bounded by
$$
\|\theta(g_1)\|_\infty\lesssim\max\Bigl\{\frac{K^2\beta^2}{\epsilon_1\delta^2},\,\frac{K}{\delta\epsilon_2^2},\,\frac{1}{\epsilon_3^2}\Bigr\}\lesssim\frac{K^2\beta^2}{\epsilon^2\delta^2},
$$
as stated in (37). Finally, to ensure the validity of the preceding analysis, we require the parameter $\beta$ to satisfy
$$
\beta\geqslant\frac{4K}{\delta\epsilon_1}\max\{C_1,C_2\}=\frac{(24C_1+60)K}{\delta\epsilon}\max\{C_1,C_2\}.
$$
This concludes the proof.

Figure 4 illustrates our constructive approximation procedure. Panel (a) depicts the construction in Lemma B.25, where we approximate the weight function $w_1$ on $[0,1]$ up to an exceptional boundary band of width $\delta$. The resulting approximator is denoted by $\psi_1$. This construction combines a local approximator $\eta_1$ on $[0,K^{-1}]$ with an extractor that maps $x$ to its relative position $x-a_K(x)$. Panel (b) corresponds to Lemma B.28, where we upgrade the approximation to a uniform one over $[0,1]$ and obtain an approximator $g_1$. The key step is to multiply the local approximator $\psi_1$ by an indicator approximator $\nu_1$ that suppresses the approximation error in the boundary band.

Figure 4: Illustration of the constructive approximation for the weight function $w_1$, instantiated with $K=2$. Dependencies on the indices $\phi,\beta,\delta$ are suppressed for clarity. Panels (a) and (b) visualize the approximators constructed in Lemma B.25 and Lemma B.28, respectively. The orange shaded region denotes the domain $\Omega^{K,1}_{\mathrm{coarse,band}}(\delta)$, while the blue region corresponds to the difference set $\Omega^{K,1}_{\mathrm{coarse,band}}(2\delta)\setminus\Omega^{K,1}_{\mathrm{coarse,band}}(\delta)$.
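To make the width-saving mechanism of Lemma B.25 concrete, the following is a small numerical sketch (not part of the proof) of the periodicity trick: rather than tiling all $K^2$ periods directly, one evaluates a local surrogate on the reference cell $[0,K^{-1}]$ at the relative position $x-a_K(x)$. The smooth bump `local_eta` below is only an illustrative stand-in for the network $\eta_1$ built from the activation $\phi$, not the construction used in the paper.

```python
import numpy as np

K = 4          # number of coarse cells; the weight function has K**2 periods on [0, 1]

def a_K(x):
    """Coarse-cell anchor a_K(x) = floor(K*x)/K, the left endpoint of the coarse
    interval [i/K, (i+1)/K) containing x."""
    return np.floor(np.clip(x, 0.0, 1.0 - 1e-12) * K) / K

def local_eta(t):
    """Illustrative stand-in for the local approximator eta_1 on the reference cell
    [0, 1/K]: a smooth 1/K**2-periodic profile (NOT the construction from Lemma B.25)."""
    return 0.5 * (1.0 - np.cos(2 * np.pi * K**2 * t))

def psi(x):
    """Periodic extension via the relative-position map: psi(x) = eta_1(x - a_K(x))."""
    return local_eta(x - a_K(x))

# Sanity check: psi is (up to floating-point error) 1/K**2-periodic on [0, 1),
# even though local_eta only needs to be defined on a single coarse cell.
xs = np.linspace(0.0, 1.0, 2001, endpoint=False)
shift = 1.0 / K**2
period_err = np.max(np.abs(psi(xs) - psi((xs + shift) % 1.0)))
print(f"max deviation from 1/K^2-periodicity: {period_err:.2e}")
```

The point of the composition is exactly this: the outer local approximator needs only $O(K)$ units, and the inner relative-position map needs only $O(K)$ units, whereas a direct construction over all $K^2$ periods would require width $O(K^2)$.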
Building upon the neural approximators constructed in Lemma B.28 for the univariate weight functions on $[0,1]$, we now proceed to construct neural approximators for the multivariate weight functions on $[0,1]^d$.

Lemma B.29. Let $\phi$ satisfy Assumptions 2.1–2.3. Let $d\in\mathbb N_+$, and let $w^{\beta,\delta,K}_{\phi,\boldsymbol v}$ be the $d$-variate weight functions from Definition B.23 with parameters $\beta,\delta,K$. For any sufficiently small $\epsilon\in(0,1)$, sufficiently large $K$, $\delta\in(0,\frac{1}{12K^2})$, and $\beta\geqslant\frac{24d(2C_1+5)^{2d+1}K}{\delta\epsilon}\max\{C_1,C_2\}$, there exist neural networks $\{g_{\boldsymbol v}\}_{\boldsymbol v\in[2]^d}$ satisfying
$$
g_{\boldsymbol v}\in\mathcal H_{\phi,5}(d,1,M_K,B_{\beta,\epsilon,\delta,K}),
\qquad\text{with}\quad M_K\lesssim K,\quad B_{\beta,\epsilon,\delta,K}\lesssim\max\Bigl\{\frac{K^2\beta^2}{\epsilon^2\delta^2},\,\frac{1}{\epsilon^d}\Bigr\},
\tag{43}
$$
such that the following properties hold for $\boldsymbol v\in[2]^d$:

• (Uniform approximation) For all $\boldsymbol x\in[0,1]^d$: $\ |g_{\boldsymbol v}(\boldsymbol x)-w^{\beta,\delta,K}_{\phi,\boldsymbol v}(\boldsymbol x)|\leqslant\epsilon$.

• (Boundedness) $\|g_{\boldsymbol v}\|_{L^\infty([0,1]^d)}\leqslant(2C_1+5)^d$.

Proof. Consider a sufficiently small $\epsilon_1\in(0,1)$. By Lemma B.28, there exist $2d$ neural networks $\{\psi_{l,v}\}_{l\in[d],\,v\in[2]}$ such that for all $\boldsymbol x\in[0,1]^d$
$$
\bigl|\psi_{l,v}(\boldsymbol x)-w^{\beta,\delta,K}_{\phi,v}(x_l)\bigr|\leqslant\epsilon_1,\qquad l=1,\cdots,d,\quad v=1,2,
\tag{44}
$$
provided that $\beta\geqslant\frac{(24C_1+60)K}{\delta\epsilon_1}\max\{C_1,C_2\}$. Moreover, each $\psi_{l,v}$ has depth 4, width $O(K)$, parameter norm $O\bigl(\frac{K^2\beta^2}{\epsilon_1^2\delta^2}\bigr)$, and satisfies the uniform bound $\|\psi_{l,v}\|_{L^\infty}\leqslant(2C_1+5)^2$.

Next, for sufficiently small $\epsilon_2\in(0,1)$, by Lemma B.4 there exists a neural network $\nu$ of depth 2, width $2d$, and parameter norm $O(1/\epsilon_2^d)$ such that
$$
|\nu(x_1,\cdots,x_d)-x_1x_2\cdots x_d|\leqslant\epsilon_2,\qquad 0\leqslant|x_1|,\cdots,|x_d|\leqslant(2C_1+5)^2.
\tag{45}
$$
We now define the final neural approximator as $g_{\boldsymbol v}(\boldsymbol x):=\nu(\psi_{1,v_1}(\boldsymbol x),\cdots,\psi_{d,v_d}(\boldsymbol x))$. The approximation error between $g_{\boldsymbol v}(\boldsymbol x)$ and $w^{\beta,\delta,K}_{\phi,\boldsymbol v}(\boldsymbol x)$ is bounded as:
$$
\begin{aligned}
\bigl|g_{\boldsymbol v}(\boldsymbol x)-w^{\beta,\delta,K}_{\phi,\boldsymbol v}(\boldsymbol x)\bigr|
&\overset{(a)}{\leqslant}\epsilon_2+\Bigl|\prod_{l=1}^d\psi_{l,v_l}(\boldsymbol x)-\prod_{l=1}^dw^{\beta,\delta,K}_{\phi,v_l}(x_l)\Bigr|\\
&\leqslant\epsilon_2+\sum_{k=1}^d\bigl|\psi_{k,v_k}(\boldsymbol x)-w^{\beta,\delta,K}_{\phi,v_k}(x_k)\bigr|\prod_{l=1}^{k-1}\bigl|w^{\beta,\delta,K}_{\phi,v_l}(x_l)\bigr|\prod_{l'=k+1}^{d}\bigl|\psi_{l',v_{l'}}(\boldsymbol x)\bigr|\\
&\overset{(b)}{\leqslant}\epsilon_2+d(2C_1+5)^{2d}\epsilon_1
\overset{(c)}{\leqslant}\epsilon,
\end{aligned}
\tag{46}
$$
where $(a)$ follows from the approximation guarantee in (45), $(b)$ uses the approximation guarantee in (44) along with the boundedness of $\psi_{l,v}$ and $w^{\beta,\delta,K}_{\phi,v}$, and $(c)$ holds by setting $\epsilon_1=\frac{\epsilon}{2d(2C_1+5)^{2d}}$ and $\epsilon_2=\frac{\epsilon}{2}$.

By the boundedness of $w^{\beta,\delta,K}_{\phi,v_l}$ and the approximation guarantee in (46), we further deduce that
$$
|g_{\boldsymbol v}(\boldsymbol x)|\leqslant\epsilon+(2C_1+3)^d\leqslant(2C_1+5)^d.
$$
By the composition of the neural networks involved, we conclude that $g_{\boldsymbol v}$ is a neural network of depth 5, width $O(K)$, and its parameter norm is bounded by
$$
\|\theta(g_{\boldsymbol v})\|_\infty\lesssim\max\Bigl\{\frac{K^2\beta^2}{\epsilon_1^2\delta^2},\,\frac{1}{\epsilon_2^d}\Bigr\}\lesssim\max\Bigl\{\frac{K^2\beta^2}{\epsilon^2\delta^2},\,\frac{1}{\epsilon^d}\Bigr\},
$$
as stated in (43). Finally, to ensure the validity of the above analysis, we require the parameter $\beta$ to satisfy
$$
\beta\geqslant\frac{(24C_1+60)K}{\delta\epsilon_1}\max\{C_1,C_2\}=\frac{24d(2C_1+5)^{2d+1}K}{\delta\epsilon}\max\{C_1,C_2\}.
$$
This concludes the proof.
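Before moving to the $L^\infty$ construction, a small numerical sketch (purely illustrative, using smooth stand-ins for the univariate weights of Definition B.21 rather than their actual construction) checks the tensor-product structure exploited above: if the two univariate weights form a partition of unity, then the $2^d$ multivariate products $w_{\boldsymbol v}$ do as well. This is the property invoked via Proposition B.24 in the proof of Theorem 3.3 below.

```python
import numpy as np
from itertools import product

d, K = 3, 4
rng = np.random.default_rng(0)

def w1(x):
    """Illustrative smooth univariate weight with w1 + w2 = 1 (a stand-in for the
    weight functions of Definition B.21, not their exact form)."""
    return 0.5 * (1.0 + np.cos(2 * np.pi * K * x))

def w2(x):
    return 1.0 - w1(x)

def multivariate_weight(v, x):
    """Tensor-product weight w_v(x) = prod_l w_{v_l}(x_l), with v in {1, 2}^d."""
    factors = [w1(x[l]) if v[l] == 1 else w2(x[l]) for l in range(d)]
    return np.prod(factors)

# Partition of unity: the 2^d tensor-product weights sum to 1 at every point,
# which is exactly what allows the shifted local approximators to be glued together.
x = rng.uniform(0.0, 1.0, size=d)
total = sum(multivariate_weight(v, x) for v in product([1, 2], repeat=d))
print(f"sum over v in [2]^d of w_v(x): {total:.12f}")   # ~ 1.0
```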
B.8 Proof of Theorem 3.3 ($L^\infty$ Approximation)

Building upon the neural approximators constructed in Lemma B.29 for the multivariate weight functions on the global domain $[0,1]^d$, we proceed to construct neural approximators for $f^\star$ with respect to the $L^\infty([0,1]^d)$ norm. To achieve this, we must extend the approximation guarantees of Theorem B.19, established in the interior region, to the shifted interior regions.

Theorem B.30. Suppose $\phi$ and $f^\star$ satisfy the assumptions of Theorem B.19, and let the parameters $\epsilon,K$, and $\delta$ be defined as therein. Then, for any shift index $\boldsymbol v\in[2]^d$, there exists a neural network $g_{\boldsymbol v}\in\mathcal H_{\phi,6}(d,1,M_\epsilon,B_{\epsilon,\delta})$, with the identical $M_\epsilon$ and $B_{\epsilon,\delta}$ specified in (25) of Theorem B.19, such that the following properties hold for $\boldsymbol v\in[2]^d$:

• (Approximation) $\|g_{\boldsymbol v}-f^\star\|_{L^\infty(\Omega^{K,\boldsymbol v}_{\mathrm{int}}(\delta))}\leqslant\epsilon$.

• (Boundedness) For all $\boldsymbol x\in[0,1)^d$, $g_{\boldsymbol v}(\boldsymbol x)$ satisfies the uniform bound established in (26) of Theorem B.19.

Figure 5: Illustration of the $L^\infty([0,1])$ approximation strategy for $f^\star$ detailed in Theorem 3.3. Large approximation errors of $f_i$ on the bands $\Omega^{2,i}_{\mathrm{band}}(\delta)$ are nullified by the vanishing weight functions $w_i(x)$. Since the weights constitute a partition of unity, the global reconstruction $w_1f_1+w_2f_2$ maintains the desired approximation accuracy across the entire domain $[0,1]$.

Proof. The construction proceeds analogously to that of Theorem B.19 and decomposes into two steps. First, we construct polynomial approximations for $f^\star$ locally within each shifted refined cell $\Omega^{K,\boldsymbol v}_{\boldsymbol i,\boldsymbol j}$. Second, we approximate the resulting piecewise polynomial defined over the partition $\{\Omega^{K,\boldsymbol v}_{\boldsymbol i,\boldsymbol j}\}_{\boldsymbol i,\boldsymbol j\in[\tilde K]^d}$ via a neural network. The second step employs the identical approximation scheme established in Lemma B.18, adapted here to the shifted partition of $[0,1]^d$.

We now present the proof of Theorem 3.3.

Proof of Theorem 3.3. We provide the proof for Heaviside-like activation functions; the proof for ReLU-like activations follows similarly.

First, let $\epsilon_1\in(0,1)$ be a sufficiently small parameter to be specified later. Following Theorem B.19, define $K:=\lceil(2c_1(s,d)/\epsilon_1)^{1/(2s)}\rceil$. Throughout the remainder of the proof, fix a parameter $\delta\in(0,\frac{1}{12K^2})$, which will also be chosen later. By Theorem B.30, there exists a family of $2^d$ neural networks $\{\psi_{\boldsymbol v}\}_{\boldsymbol v\in[2]^d}$ such that
$$
|\psi_{\boldsymbol v}(\boldsymbol x)-f^\star(\boldsymbol x)|\leqslant\epsilon_1,\qquad\boldsymbol x\in\Omega^{K,\boldsymbol v}_{\mathrm{int}}(\delta).
\tag{47}
$$
Each network $\psi_{\boldsymbol v}$ has depth 6, width $O(\epsilon_1^{-d/(2s)})$, and parameter norm $B_{\epsilon_1,\delta}$ as specified in (25). Moreover, these networks are uniformly bounded as
$$
\|\psi_{\boldsymbol v}\|_{L^\infty([0,1]^d)}\leqslant\lceil s\rceil^dc_2(s,d)\,4^{d+4}(1+C_1)^{2d}=:M_\psi.
$$
Second, let $\epsilon_2\in(0,1)$ be a sufficiently small parameter to be determined later. Invoking Lemma B.29 (approximating the weight functions) and Corollary B.8 (approximating the identity function), there exist $2^d$ neural networks $\{\nu_{\boldsymbol v}\}_{\boldsymbol v\in[2]^d}$ such that
$$
|\nu_{\boldsymbol v}(\boldsymbol x)-w^{\beta,\delta,K}_{\phi,\boldsymbol v}(\boldsymbol x)|\leqslant\epsilon_2,\qquad\boldsymbol x\in[0,1]^d,
\tag{48}
$$
provided we choose $\beta=\frac{24d(2C_1+5)^{2d+1}K}{\delta\epsilon_2}\max\{C_1,C_2\}$. Each $\nu_{\boldsymbol v}$ has depth 6, width $O(K)$, parameter norm $O\bigl(\max\{K^2\beta^2/(\epsilon_2^2\delta^2),\,\epsilon_2^{-d}\}\bigr)$, and uniform output bound $\|\nu_{\boldsymbol v}\|_{L^\infty([0,1]^d)}\leqslant(2C_1+6)^d=:M_\nu$.
We now bound the approximation error of the weighted combination $\sum_{\boldsymbol v}\psi_{\boldsymbol v}\nu_{\boldsymbol v}$ against $f^\star$. For any $\boldsymbol x\in[0,1]^d$, define the index set of "active" shifted interiors:
$$
V(\boldsymbol x):=\bigl\{\boldsymbol v\in[2]^d:\ \boldsymbol x\in\Omega^{K,\boldsymbol v}_{\mathrm{int}}(\delta)\bigr\}.
$$
For indices $\boldsymbol v\in V(\boldsymbol x)$, the approximation $|\psi_{\boldsymbol v}(\boldsymbol x)-f^\star(\boldsymbol x)|\leqslant\epsilon_1$ holds. Conversely, let $\bar V(\boldsymbol x):=[2]^d\setminus V(\boldsymbol x)$ collect the indices for which $\boldsymbol x$ falls into the shifted band region $\Omega^{K,\boldsymbol v}_{\mathrm{band}}(\delta)$. By Proposition B.24, the weight function nearly vanishes in this shifted band region, i.e., $|w^{\beta,\delta,K}_{\phi,\boldsymbol v}(\boldsymbol x)|\leqslant\epsilon_2$ for $\boldsymbol v\in\bar V(\boldsymbol x)$, since $\beta\geqslant\frac{2(2C_1+3)^{d-1}}{\delta\epsilon_2}\max\{C_1,C_2\}$. Since $\delta<\frac{1}{4K^2}$, every $\boldsymbol x$ must fall into at least one shifted interior region; hence $V(\boldsymbol x)\neq\emptyset$.

Using the partition of unity property $\sum_{\boldsymbol v}w^{\beta,\delta,K}_{\phi,\boldsymbol v}(\boldsymbol x)=1$ from Proposition B.24, we decompose the approximation error as:
$$
\begin{aligned}
\Bigl|\sum_{\boldsymbol v\in[2]^d}\psi_{\boldsymbol v}(\boldsymbol x)\nu_{\boldsymbol v}(\boldsymbol x)-f^\star(\boldsymbol x)\Bigr|
&=\Bigl|\sum_{\boldsymbol v\in[2]^d}\psi_{\boldsymbol v}(\boldsymbol x)\nu_{\boldsymbol v}(\boldsymbol x)-\sum_{\boldsymbol v\in[2]^d}f^\star(\boldsymbol x)\,w^{\beta,\delta,K}_{\phi,\boldsymbol v}(\boldsymbol x)\Bigr|\\
&\leqslant\sum_{\boldsymbol v\in[2]^d}|\psi_{\boldsymbol v}(\boldsymbol x)|\,\bigl|\nu_{\boldsymbol v}(\boldsymbol x)-w^{\beta,\delta,K}_{\phi,\boldsymbol v}(\boldsymbol x)\bigr|
+\sum_{\boldsymbol v\in V(\boldsymbol x)}\bigl|\psi_{\boldsymbol v}(\boldsymbol x)-f^\star(\boldsymbol x)\bigr|\,w^{\beta,\delta,K}_{\phi,\boldsymbol v}(\boldsymbol x)
+\sum_{\boldsymbol v\in\bar V(\boldsymbol x)}\bigl(|f^\star(\boldsymbol x)|+|\nu_{\boldsymbol v}(\boldsymbol x)|\bigr)\bigl|w^{\beta,\delta,K}_{\phi,\boldsymbol v}(\boldsymbol x)\bigr|\\
&\overset{(a)}{\leqslant}2^dM_\psi\epsilon_2+2^d(2C_1+3)\epsilon_1+2^d(1+M_\nu)\epsilon_2
\overset{(b)}{\leqslant}\frac{\epsilon}{2},
\end{aligned}
\tag{49}
$$
where $(a)$ follows from the approximation guarantees in (47) and (48), the boundedness of $\psi_{\boldsymbol v}$, $\nu_{\boldsymbol v}$, and $f^\star$, and the locally quasi-vanishing property of $w^{\beta,\delta,K}_{\phi,\boldsymbol v}$ on $\Omega^{K,\boldsymbol v}_{\mathrm{band}}(\delta)$ from Proposition B.24, and $(b)$ holds by choosing
$$
\epsilon_1=\frac{\epsilon}{2^{d+2}(2C_1+3)},\qquad
\epsilon_2=\frac{\epsilon}{2^{d+2}(M_\psi+M_\nu+1)}.
$$
By Lemma B.4, there exists a neural network $\eta$ of depth 2, width 4, and parameter norm $O(1/\epsilon^2)$, such that
$$
|\eta(x,y)-xy|\leqslant\frac{\epsilon}{2^{d+1}},\qquad 0\leqslant|x|,|y|\leqslant\max\{M_\nu,M_\psi\}.
\tag{50}
$$
We define the final approximator $g(\boldsymbol x):=\sum_{\boldsymbol v\in[2]^d}\eta(\psi_{\boldsymbol v}(\boldsymbol x),\nu_{\boldsymbol v}(\boldsymbol x))$. The total error is bounded by:
$$
|g(\boldsymbol x)-f^\star(\boldsymbol x)|
\leqslant\Bigl|\sum_{\boldsymbol v\in[2]^d}\psi_{\boldsymbol v}(\boldsymbol x)\nu_{\boldsymbol v}(\boldsymbol x)-f^\star(\boldsymbol x)\Bigr|
+\sum_{\boldsymbol v\in[2]^d}\bigl|\eta(\psi_{\boldsymbol v}(\boldsymbol x),\nu_{\boldsymbol v}(\boldsymbol x))-\psi_{\boldsymbol v}(\boldsymbol x)\nu_{\boldsymbol v}(\boldsymbol x)\bigr|
\overset{(a)}{\leqslant}\frac{\epsilon}{2}+2^d\cdot\frac{\epsilon}{2^{d+1}}\leqslant\epsilon,
$$
where $(a)$ follows from the approximation guarantees in (49) and (50).

Finally, regarding the architecture, $g$ is realized by summing the compositions of $\eta$ with the parallel sub-networks $\psi_{\boldsymbol v}$ and $\nu_{\boldsymbol v}$. By construction, the depth of $g$ is 7. Regarding the width, recall that the width of $\psi_{\boldsymbol v}$ scales as $O(\epsilon^{-d/(2s)})$, since $\epsilon_1\asymp\epsilon$. Furthermore, given the scaling $K=O(\epsilon^{-1/(2s)})$ from Theorem B.19, the width of $\nu_{\boldsymbol v}$ scales as $O(K)=O(\epsilon^{-1/(2s)})$. Consequently, the total width of $g$ is dominated by the former term and bounded by $O(\epsilon^{-d/(2s)})$. Noting that $\beta=O(K/(\delta\epsilon_2))$ and $\epsilon_2\asymp\epsilon$, and choosing $\delta=(24K^2)^{-1}$ (so that $\delta^{-1}=24K^2=O((1/\epsilon)^{1/s})$), the parameter norm is bounded by
$$
\max\Bigl\{\Bigl(\frac1\epsilon\Bigr)^{\max\{\frac{d^2}{2s}+d,\ \frac ds+2,\ \lceil s\rceil\}},\ \frac{1}{\delta^2\epsilon_1^{\frac d{2s}+1}},\ \frac{K^2\beta^2}{\delta^2\epsilon_2^2},\ \frac{1}{\epsilon_2^d},\ \frac{1}{\epsilon^d}\Bigr\}
\lesssim\Bigl(\frac1\epsilon\Bigr)^{\max\{\frac{d^2}{2s}+d,\ \frac ds+2,\ \frac{d+4}{2s}+1,\ \lceil s\rceil,\ \frac6s+4\}},
$$
as stated in (4). This concludes the proof.

C Learning Theory: Proofs and Technical Details

Lemma C.1 (Schmidt-Hieber (2020)). Let $n\in\mathbb N_{\geqslant1}$, and let $f^\star$ and $\{(\boldsymbol x_i,y_i)\}_{i=1}^n$ be given in (5). Let $\mathcal F_n$ denote a model class, and let $\hat f_n$ be the estimator defined as
$$
\hat f_n=\operatorname*{argmin}_{f\in\mathcal F_n}\frac1n\sum_{i=1}^n(f(\boldsymbol x_i)-y_i)^2.
\tag{51}
$$
Assume that $\{f^\star\}\cup\mathcal F_n\subset\{f:[0,1]^d\to[-F,F]\}$ for some $F\geqslant1$.
If the covering number $\mathcal N_n:=\mathcal N(\tau,\mathcal F_n,\|\cdot\|_\infty)\geqslant3$, then
$$
\mathbb E\bigl\|\hat f_n-f^\star\bigr\|_{L^2(\rho)}^2\leqslant4\Bigl[\inf_{f\in\mathcal F_n}\|f-f^\star\|_{L^2(\rho)}^2+F^2\,\frac{18\log\mathcal N_n+72}{n}+32\tau F\Bigr],\qquad\text{for all }\tau\in(0,1].
$$

Remark C.2. Lemma C.1 follows directly from Lemma 4 in Schmidt-Hieber (2020) by taking $\epsilon=1$ and $P_X$ as the uniform distribution on $[0,1]^d$, and identifying $\hat f$ with the empirical risk minimization estimator given in (51).

C.1 Covering number bounds

Lemma C.1 characterizes the trade-off between approximation accuracy and the complexity of the model class, as measured by the covering number. We now provide an upper bound for the covering number of the model class defined in Section 2.

Lemma C.3 (Covering number bound). Let $\phi$ satisfy Assumption 2.2. The covering number of $\mathcal H_{\phi,L}(d,1,M,B)$ with input $\boldsymbol x\in[0,1]^d$ can be bounded by
$$
\log\mathcal N\bigl(\tau,\mathcal H_{\phi,L}(d,1,M,B),\|\cdot\|_\infty\bigr)\leqslant2(L+d)M^2\log\Bigl(\frac{4^{L+1}d\,(\max\{\|\phi\|_{\mathrm{Lip}},1\}M)^LB^{L+1}}{\tau}\Bigr),
\tag{52}
$$
for $B\geqslant\max\bigl\{1,\ \frac{|\phi(0)|}{\max\{\|\phi\|_{\mathrm{Lip}},1\}(d+1)}\bigr\}$.

Proof. Without loss of generality, assume $\|\phi\|_{\mathrm{Lip}}\geqslant1$. Suppose two distinct networks $g,\hat g\in\mathcal H_{\phi,L}(d,1,M,B)$ are given by
$$
g(\boldsymbol x)=(W_L\phi(\cdot)+b_L)\circ\cdots\circ(W_1\boldsymbol x+b_1),\qquad
\hat g(\boldsymbol x)=(\widehat W_L\phi(\cdot)+\hat b_L)\circ\cdots\circ(\widehat W_1\boldsymbol x+\hat b_1),
$$
with
$$
\|W_l-\widehat W_l\|_{\infty,\infty}\leqslant\varrho,\qquad\|b_l-\hat b_l\|_\infty\leqslant\varrho,\qquad1\leqslant l\leqslant L.
$$
For the networks $g$ and $\hat g$, we recursively define $\{z_l\}_{l=0}^L$ and $\{\hat z_l\}_{l=0}^L$ as
$$
z_0:=\boldsymbol x\in[0,1]^d,\quad z_1:=W_1z_0+b_1,\quad z_l:=W_l\phi(z_{l-1})+b_l\ \text{ for }2\leqslant l\leqslant L,
$$
$$
\hat z_0:=\boldsymbol x\in[0,1]^d,\quad\hat z_1:=\widehat W_1\hat z_0+\hat b_1,\quad\hat z_l:=\widehat W_l\phi(\hat z_{l-1})+\hat b_l\ \text{ for }2\leqslant l\leqslant L.
$$
The network outputs are then $g(\boldsymbol x)=z_L$ and $\hat g(\boldsymbol x)=\hat z_L$.

First, we prove by induction that
$$
\|\phi(\hat z_l)\|_\infty\leqslant d(4\|\phi\|_{\mathrm{Lip}}B)^lM^{l-1},\qquad1\leqslant l\leqslant L-1.
$$
For the base case $l=1$, we have
$$
\|\phi(\hat z_1)\|_\infty\leqslant\|\phi\|_{\mathrm{Lip}}\|\hat z_1\|_\infty+|\phi(0)|
\leqslant\|\phi\|_{\mathrm{Lip}}\bigl(\|\widehat W_1\|_{1,\infty}\|\boldsymbol x\|_\infty+\|\hat b_1\|_\infty\bigr)+|\phi(0)|
\leqslant\|\phi\|_{\mathrm{Lip}}(d+1)B+|\phi(0)|
\leqslant2\|\phi\|_{\mathrm{Lip}}(d+1)B
\leqslant4\|\phi\|_{\mathrm{Lip}}dB.
$$
For the inductive step, assume that for $k\leqslant l$ the inequality $\|\phi(\hat z_k)\|_\infty\leqslant d(4\|\phi\|_{\mathrm{Lip}}B)^kM^{k-1}$ holds. For $k=l+1$, we have
$$
\begin{aligned}
\|\phi(\hat z_{l+1})\|_\infty
&\leqslant\|\phi\|_{\mathrm{Lip}}\bigl(\|\widehat W_{l+1}\|_{1,\infty}\|\phi(\hat z_l)\|_\infty+\|\hat b_{l+1}\|_\infty\bigr)+|\phi(0)|\\
&\leqslant\|\phi\|_{\mathrm{Lip}}\bigl(d(4\|\phi\|_{\mathrm{Lip}}B)^lM^{l-1}\cdot(MB)+B\bigr)+|\phi(0)|\\
&\leqslant2\|\phi\|_{\mathrm{Lip}}d(4\|\phi\|_{\mathrm{Lip}}M)^lB^{l+1}+B
\leqslant d(4\|\phi\|_{\mathrm{Lip}}B)^{l+1}M^l.
\end{aligned}
$$
Thus, by induction, the bound holds for all $1\leqslant l\leqslant L-1$.

We now bound the difference between the two networks. In particular, we prove by induction that
$$
\|z_l-\hat z_l\|_\infty\leqslant2d(4\|\phi\|_{\mathrm{Lip}}MB)^{l-1}\varrho.
$$
For the base case $l=1$, we have
$$
\|z_1-\hat z_1\|_\infty=\bigl\|(W_1-\widehat W_1)\boldsymbol x+(b_1-\hat b_1)\bigr\|_\infty
\leqslant\|W_1-\widehat W_1\|_{1,\infty}\|\boldsymbol x\|_\infty+\|b_1-\hat b_1\|_\infty
\leqslant(d+1)\varrho\leqslant2d\varrho.
$$
For the inductive step, assume that for $k\leqslant l$ the inequality $\|z_k-\hat z_k\|_\infty\leqslant2d(4\|\phi\|_{\mathrm{Lip}}MB)^{k-1}\varrho$ holds. For $k=l+1$, we have
$$
\begin{aligned}
\|z_{l+1}-\hat z_{l+1}\|_\infty
&=\bigl\|W_{l+1}\phi(z_l)+b_{l+1}-(\widehat W_{l+1}\phi(\hat z_l)+\hat b_{l+1})\bigr\|_\infty\\
&\leqslant\|W_{l+1}\|_{1,\infty}\|\phi(z_l)-\phi(\hat z_l)\|_\infty+\|W_{l+1}-\widehat W_{l+1}\|_{1,\infty}\|\phi(\hat z_l)\|_\infty+\|b_{l+1}-\hat b_{l+1}\|_\infty\\
&\leqslant(MB)\|\phi\|_{\mathrm{Lip}}\|z_l-\hat z_l\|_\infty+(M\varrho)\|\phi(\hat z_l)\|_\infty+\varrho\\
&\leqslant(MB)\|\phi\|_{\mathrm{Lip}}\cdot2d(4\|\phi\|_{\mathrm{Lip}}MB)^{l-1}\varrho+(M\varrho)\cdot d(4\|\phi\|_{\mathrm{Lip}}B)^lM^{l-1}+\varrho\\
&\leqslant2d(4\|\phi\|_{\mathrm{Lip}}MB)^l\varrho.
\end{aligned}
$$
Thus, by induction, the approximation bound holds for all $1\leqslant l\leqslant L$.
Then, by choosing $\varrho=\frac{\tau}{2d(4\|\phi\|_{\mathrm{Lip}}MB)^L}$, we have
$$
\|g-\hat g\|_{L^\infty([0,1]^d)}\leqslant\|z_L-\hat z_L\|_\infty\leqslant2d(4\|\phi\|_{\mathrm{Lip}}MB)^{L-1}\varrho\leqslant\tau.
$$
The total number of parameters of $g$ (and of $\hat g$) is given by
$$
P=(Md+M)+\sum_{l=2}^{L-1}(M^2+M)+(1\cdot M+1)=(L-2)M^2+(L+d)M+1\leqslant2(L+d)M^2.
$$
Therefore, the covering number is bounded by
$$
\mathcal N\bigl(\tau,\mathcal H_{\phi,L}(d,1,M,B),\|\cdot\|_\infty\bigr)\leqslant\Bigl(\frac{2B}{\varrho}\Bigr)^P\leqslant\Bigl(\frac{4^{L+1}d\,(\|\phi\|_{\mathrm{Lip}}M)^LB^{L+1}}{\tau}\Bigr)^{2(L+d)M^2},
$$
which implies that
$$
\log\mathcal N\bigl(\tau,\mathcal H_{\phi,L}(d,1,M,B),\|\cdot\|_\infty\bigr)\leqslant2(L+d)M^2\log\Bigl(\frac{4^{L+1}d\,(\|\phi\|_{\mathrm{Lip}}M)^LB^{L+1}}{\tau}\Bigr),
$$
as stated in (52). This concludes the proof.

C.2 Proof of Theorem 4.1 (Risk Bound)

By Lemma C.1, together with the approximation bound from Theorem 3.1 and the complexity estimate for our model class in Lemma C.3, we prove Theorem 4.1 as follows.

Proof of Theorem 4.1. The estimator $T_F\hat f_n$ obtained by (6) can be interpreted as the empirical risk minimization estimator over $T_F\mathcal H_{\phi,L}(d,1,M_n,B_n)$:
$$
T_F\hat f_n=\operatorname*{argmin}_{h\in T_F\mathcal H_{\phi,L}(d,1,M_n,B_n)}\frac1n\sum_{i=1}^n(y_i-h(\boldsymbol x_i))^2,
$$
where $T_F\mathcal H_{\phi,L}(d,1,M,B):=\{T_Ff:f\in\mathcal H_{\phi,L}(d,1,M,B)\}$.

For sufficiently small $\epsilon>0$, invoking Theorem 3.1 with $F=2$ and recalling the bounded-density assumption $0\leqslant p(\boldsymbol x)\lesssim1$ for the data distribution $\rho$ supported on $[0,1]^d$, we obtain the approximation bound
$$
\inf_{h\in T_F\mathcal H_{\phi,6}(d,1,M_\epsilon,B_\epsilon)}\|h-f^\star\|_{L^2(\rho)}
\lesssim\inf_{h\in T_F\mathcal H_{\phi,6}(d,1,M_\epsilon,B_\epsilon)}\|h-f^\star\|_{L^2([0,1]^d)}\lesssim\epsilon,
\tag{53}
$$
provided that
$$
M_\epsilon\asymp\Bigl(\frac1\epsilon\Bigr)^{\frac d{2s}},\qquad
B_\epsilon\asymp\Bigl(\frac1\epsilon\Bigr)^{\max\{\frac{d^2}{2s}+d,\ \frac ds+2,\ \frac{d+4}{2s}+5,\ \lceil s\rceil\}}.
$$
For $g,\hat g\in\mathcal H_{\phi,L}(d,1,M,B)$, we have the inequality $\|T_Fg-T_F\hat g\|_{L^\infty([0,1]^d)}\leqslant\|g-\hat g\|_{L^\infty([0,1]^d)}$, which implies the covering number bound
$$
\log\mathcal N\bigl(\tau,T_F\mathcal H_{\phi,6}(d,1,M_\epsilon,B_\epsilon),\|\cdot\|_\infty\bigr)
\leqslant\log\mathcal N\bigl(\tau,\mathcal H_{\phi,6}(d,1,M_\epsilon,B_\epsilon),\|\cdot\|_\infty\bigr)
\lesssim\Bigl(\frac1\epsilon\Bigr)^{\frac ds}\Bigl(\log\frac1\epsilon+\log\frac1\tau\Bigr),
\tag{54}
$$
where the "$\lesssim$" is due to the covering number bound established in Lemma C.3. Applying Lemma C.1, we obtain
$$
\mathbb E\bigl\|T_F\hat f_n-f^\star\bigr\|_{L^2(\rho)}^2\lesssim\epsilon^2+\frac1n\Bigl(\frac1\epsilon\Bigr)^{\frac ds}\Bigl(\log\frac1\epsilon+\log\frac1\tau\Bigr)+\tau.
$$
Selecting $\epsilon\asymp n^{-\frac s{2s+d}}$ and $\tau\asymp n^{-\frac{2s}{2s+d}}$, we derive the following convergence rate for the excess risk:
$$
\mathbb E\bigl\|T_F\hat f_n-f^\star\bigr\|_{L^2(\rho)}^2\lesssim n^{-\frac{2s}{2s+d}}\log n,
$$
as stated in (8). Additionally, we obtain the bounds for $M_n$ and $B_n$:
$$
M_n\asymp n^{\frac d{4s+2d}},\qquad
B_n\asymp n^{\max\{\frac d2,\ 1,\ \frac{d+4+10s}{2(2s+d)},\ \frac{s\lceil s\rceil}{2s+d}\}},
$$
as stated in (7). This concludes the proof.

C.3 Optimal Risk Bound under $\ell_2$ Norm Constraints

In this part we establish learning guarantees for ERM over neural networks subject to practically relevant $\ell_2$ parameter norm constraints. We begin by formally defining this hypothesis space, denoted by $\widetilde{\mathcal H}_\phi$, subject to an $\ell_2$ parameter bound:
$$
\begin{aligned}
\widetilde{\mathcal H}_{\phi,L}(d_{\mathrm{in}},d_{\mathrm{out}},M,B)=\Bigl\{\boldsymbol x\mapsto(W_L\phi(\cdot)+b_L)\circ\cdots\circ(W_1\boldsymbol x+b_1):\ 
&W_1\in\mathbb R^{M\times d_{\mathrm{in}}},\ W_L\in\mathbb R^{d_{\mathrm{out}}\times M},\ W_l\in\mathbb R^{M\times M},\ 2\leqslant l\leqslant L-1;\\
&b_L\in\mathbb R^{d_{\mathrm{out}}},\ b_l\in\mathbb R^M,\ 1\leqslant l\leqslant L-1;\ 
\sqrt{\textstyle\sum_l\|W_l\|_F^2+\|b_l\|_2^2}\leqslant B\Bigr\}.
\end{aligned}
\tag{55}
$$
Subsequently, we define the estimator $\tilde f_n$ obtained by ERM over this class as
$$
\tilde f_n=\operatorname*{argmin}_{f\in\widetilde{\mathcal H}_{\phi,L}(d,1,M_n,B_n)}\frac1n\sum_{i=1}^n\bigl(y_i-(T_Ff)(\boldsymbol x_i)\bigr)^2.
\tag{56}
$$
The estimation error of this estimator is characterized by the following theorem.

Theorem C.4. Suppose the assumptions on $\phi$ and $f^\star$ from Theorem 4.1 hold.
For sufficiently large $n\in\mathbb N_+$ (depending only on $\phi,s,d$), if we choose $L=6$,
$$
M_n\asymp n^{\frac d{4s+2d}},\qquad
B_n\asymp n^{\frac d{4s+2d}+\max\{\frac d2,\ 1,\ \frac{d+4+10s}{2(2s+d)},\ \frac{s\lceil s\rceil}{2s+d}\}},\qquad F=2,
$$
and let $\tilde f_n$ be the estimator obtained via (56), then
$$
\mathbb E\bigl\|T_F\tilde f_n-f^\star\bigr\|_{L^2(\rho)}^2\lesssim n^{-\frac{2s}{2s+d}}\log n.
\tag{57}
$$

Proof. We begin by establishing the inclusion relationship between the function classes. Observe that
$$
\mathcal H_{\phi,L}(d,1,M,B)\subset\widetilde{\mathcal H}_{\phi,L}(d,1,M,\sqrt PB)\subset\mathcal H_{\phi,L}(d,1,M,\sqrt PB),
\tag{58}
$$
where $P$ denotes the total number of parameters of the networks in these classes, given by
$$
P=M^2(L-2)+M(L+d)+1=O(M^2).
\tag{59}
$$
By (53), (58), and (59), and recalling the bounded-density assumption $0\leqslant p(\boldsymbol x)\lesssim1$ for the data distribution $\rho$ supported on $[0,1]^d$, we have the following approximation error bound under the $L^2(\rho)$ metric:
$$
\inf_{g\in T_F\widetilde{\mathcal H}_{\phi,6}(d,1,M_\epsilon,B_\epsilon)}\|g-f^\star\|_{L^2(\rho)}
\lesssim\inf_{g\in T_F\widetilde{\mathcal H}_{\phi,6}(d,1,M_\epsilon,B_\epsilon)}\|g-f^\star\|_{L^2([0,1]^d)}\lesssim\epsilon,
\tag{60}
$$
provided that
$$
M_\epsilon\asymp\Bigl(\frac1\epsilon\Bigr)^{\frac d{2s}},\qquad
B_\epsilon\asymp\Bigl(\frac1\epsilon\Bigr)^{\frac d{2s}+\max\{\frac{d^2}{2s}+d,\ \frac ds+2,\ \frac{d+4}{2s}+5,\ \lceil s\rceil\}}.
$$
Using the covering number bound (54) and the inclusion relationship in (58), we have
$$
\log\mathcal N\bigl(\tau,T_F\widetilde{\mathcal H}_{\phi,6}(d,1,M_\epsilon,B_\epsilon),\|\cdot\|_\infty\bigr)
\leqslant\log\mathcal N\bigl(\tau,T_F\mathcal H_{\phi,6}(d,1,M_\epsilon,B_\epsilon),\|\cdot\|_\infty\bigr)
\lesssim\Bigl(\frac1\epsilon\Bigr)^{\frac ds}\Bigl(\log\frac1\epsilon+\log\frac1\tau\Bigr),
$$
when the input space is $[0,1]^d$. Following the same analysis as in the proof of Theorem 4.1 to balance the approximation error and the model complexity, we establish the rate given in (57), provided that
$$
M_n\asymp n^{\frac d{4s+2d}},\qquad
B_n\asymp n^{\frac d{4s+2d}+\max\{\frac d2,\ 1,\ \frac{d+4+10s}{2(2s+d)},\ \frac{s\lceil s\rceil}{2s+d}\}}.
$$
This completes the proof.

D Supplementary Details to Section 5

In this section, we establish the lower bounds on the approximation error for finite-depth neural networks employing non-smooth activation functions. Additionally, we detail the experimental setup used to compare the generalization error of smooth versus non-smooth activation functions.

D.1 Proof of Proposition 5.1 (ReLU Lower Bound)

We prove a lower bound on the $L^2([0,1])$ approximation error over the Sobolev ball $\{f:\|f\|_{W^{s,\infty}([0,1])}\leqslant1\}$ for constant-depth ReLU networks. The argument has two parts: (i) a direct piecewise-linear lower bound using the quadratic example $f^\star(x)=\frac12x^2$, and (ii) a general lower bound for ReLU networks.

Step 1: A local $L^2$ lower bound for linear approximation of $x^2$.

Lemma D.1. Let $I\subset\mathbb R$ be a closed interval of length $l$, and let $h$ be any linear function. Then
$$
\int_I\bigl(x^2-h(x)\bigr)^2\,\mathrm dx\geqslant\frac1{180}l^5.
\tag{61}
$$

Proof. Let $I=[u,v]$ with $l=v-u$ and midpoint $m=(u+v)/2$. Define $e(x)=h(x)-x^2$. With the shift $t=x-m$, since $h$ is linear we may write $e(t+m)=-t^2+at+b$ for some $a,b\in\mathbb R$. Hence
$$
\int_Ie(x)^2\,\mathrm dx=\int_{-l/2}^{l/2}(-t^2+at+b)^2\,\mathrm dt
=2\int_0^{l/2}\bigl(t^4+(a^2-2b)t^2+b^2\bigr)\,\mathrm dt
=\frac{l^5}{80}+\frac{l^3}{12}a^2-\frac{l^3}{6}b+lb^2.
$$
Completing the square in $b$ gives
$$
\int_Ie(x)^2\,\mathrm dx=\frac{l^5}{80}-\frac{l^5}{144}+\frac{l^3}{12}a^2+l\Bigl(b-\frac{l^2}{12}\Bigr)^2\geqslant\frac1{180}l^5,
$$
which proves (61).

Step 2: A lower bound for piecewise-linear approximators.

Lemma D.2 (Lower bound for approximating $x^2$ by ReLU networks). Let $f^\star(x)=\frac12x^2$ on $[0,1]$. For any depth $L\geqslant2$ and width $M\geqslant2$,
$$
\inf_{g\in\mathcal H_{\mathrm{ReLU},L}(1,1,M,\infty)}\|g-f^\star\|_{L^2([0,1])}\geqslant\frac1{12\sqrt5}(M+1)^{-2(L-1)}.
$$

Proof. Any ReLU network $g$ is a continuous piecewise linear function on $[0,1]$.
Therefore there exists a partition $\{I_j\}_{j=1}^K$ of $[0,1]$ into intervals such that $g$ is linear on each $I_j$. In one dimension, each hidden layer of width $M$ can introduce at most $M$ new breakpoints within each interval produced by the previous layers, so the total number of linear pieces satisfies
$$
K\leqslant(M+1)^{L-1}.
\tag{62}
$$
Since $\sum_{j=1}^K|I_j|=1$ and $x\mapsto x^5$ is convex, Jensen's inequality yields
$$
\sum_{j=1}^K|I_j|^5\geqslant K\Bigl(\frac1K\sum_{j=1}^K|I_j|\Bigr)^5=\frac1{K^4}.
\tag{63}
$$
On each interval $I_j$, $g$ is linear, so applying Lemma D.1 and scaling by $(1/2)^2$ gives
$$
\int_{I_j}\bigl(f^\star(x)-g(x)\bigr)^2\,\mathrm dx=\frac14\int_{I_j}\bigl(x^2-\tilde h_j(x)\bigr)^2\,\mathrm dx\geqslant\frac1{720}|I_j|^5,
$$
for some linear function $\tilde h_j$ (namely $\tilde h_j=2g|_{I_j}$). Summing over $j$ and using (62)–(63), we obtain
$$
\|g-f^\star\|_{L^2([0,1])}^2=\sum_{j=1}^K\int_{I_j}\bigl(g(x)-f^\star(x)\bigr)^2\,\mathrm dx
\geqslant\frac1{720}\sum_{j=1}^K|I_j|^5\geqslant\frac1{720}K^{-4}\geqslant\frac1{720}(M+1)^{-4(L-1)}.
$$
Taking square roots proves the claim.

Step 3: A lower bound for ReLU networks. The following is a specialization of (Siegel, 2023, Theorem 3) to $d=1$ and fixed depth $L$.

Lemma D.3 (Siegel (2023)). Fix $L\geqslant2$ and let $s>0$. Then there exists a constant $C_{s,L}>0$ such that for every $M\geqslant2$,
$$
\sup_{\|f^\star\|_{W^{s,\infty}([0,1])}\leqslant1}\ \inf_{g\in\mathcal H_{\mathrm{ReLU},L}(1,1,M,\infty)}\|g-f^\star\|_{L^2([0,1])}\geqslant C_{s,L}\,(M^2\log M)^{-s}.
$$

Proof of Proposition 5.1. By Lemma D.2, choosing $f^\star(x)=\frac12x^2$ (which belongs to the Sobolev unit ball under the standard $W^{s,\infty}$ normalization, up to a constant factor) yields
$$
\sup_{\|f^\star\|_{W^{s,\infty}([0,1])}\leqslant1}\ \inf_{g\in\mathcal H_{\mathrm{ReLU},L}(1,1,M,\infty)}\|g-f^\star\|_{L^2([0,1])}\geqslant c\,(M+1)^{-2(L-1)}
$$
for some $c>0$. On the other hand, Lemma D.3 gives
$$
\sup_{\|f^\star\|_{W^{s,\infty}([0,1])}\leqslant1}\ \inf_{g\in\mathcal H_{\mathrm{ReLU},L}(1,1,M,\infty)}\|g-f^\star\|_{L^2([0,1])}\geqslant C_{s,L}\,(M^2\log M)^{-s}.
$$
Combining the two bounds and using $M+1\leqslant2M$ and $\log M\geqslant\log2$ for $M\geqslant2$, we obtain
$$
\sup_{\|f^\star\|_{W^{s,\infty}([0,1])}\leqslant1}\ \inf_{g\in\mathcal H_{\mathrm{ReLU},L}(1,1,M,\infty)}\|g-f^\star\|_{L^2([0,1])}\geqslant C'_{s,L}\,(M\log M)^{-2\min\{L-1,\,s\}},
$$
after absorbing constants into $C'_{s,L}$.

D.2 Setup for Generalization Experiments

In this section, we describe the experimental setup of the generalization separation in detail.

Data generation and target function. We consider a target $f^\star:[0,1]^d\to\mathbb R$ of the form
$$
f^\star(\boldsymbol x)=\sum_{k=1}^Ka_k\cos(\boldsymbol w_k^\top\boldsymbol x+b_k).
$$
Throughout, we fix the input dimension $d=5$ and the number of random Fourier features $K=50$. The parameters are sampled independently as
• $\boldsymbol w_k\sim\mathcal N(0,I_d)$,
• $b_k\sim\mathcal U(0,2\pi)$,
• $a_k\sim\mathcal N(0,1)$.
For a given sample size $n$, the training dataset $\{(\boldsymbol x_i,y_i)\}_{i=1}^n$ is generated by sampling $\boldsymbol x_i\sim\mathcal U([0,1]^d)$ and setting
$$
y_i=f^\star(\boldsymbol x_i)+\xi_i,\qquad\xi_i\sim\mathcal N(0,\sigma^2),
$$
with $\sigma=0.1$. We evaluate generalization performance across sample sizes $n\in\{1024,1400,2048,2800,4096,5600\}$.

Model architecture. We use a fully connected network with a single hidden layer to compare different activation functions. The hidden width is fixed to $M=6000$ for all experiments. We compare the non-smooth ReLU activation with two smooth activations, namely GELU and tanh.

Training and hyperparameter tuning. We minimize the empirical mean-squared error (MSE)
$$
\frac1n\sum_{i=1}^n\bigl(y_i-f(\boldsymbol x_i)\bigr)^2,
$$
where $f$ denotes the neural network predictor. Optimization is carried out using full-batch Adam for 50,000 epochs with a cosine learning-rate decay schedule.
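A minimal PyTorch sketch of the setup described so far is given below; it is illustrative only. The helper names (`make_dataset`, `make_model`, `train_model`) are ours, the epoch count is reduced from the 50,000 used in the experiments, and the $L^2$ regularization coefficient $\lambda$ is passed as Adam's `weight_decay`, which is one simple way to implement the penalty.

```python
import math
import numpy as np
import torch
import torch.nn as nn

def make_dataset(n, d=5, K=50, sigma=0.1, seed=0):
    """Random-Fourier-feature target f*(x) = sum_k a_k cos(w_k^T x + b_k) with noisy labels."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((K, d))          # w_k ~ N(0, I_d)
    b = rng.uniform(0.0, 2 * math.pi, K)     # b_k ~ U(0, 2*pi)
    a = rng.standard_normal(K)               # a_k ~ N(0, 1)
    f_star = lambda X: np.cos(X @ W.T + b) @ a
    X = rng.uniform(0.0, 1.0, (n, d))        # x_i ~ U([0, 1]^d)
    y = f_star(X) + sigma * rng.standard_normal(n)
    return (torch.tensor(X, dtype=torch.float32),
            torch.tensor(y, dtype=torch.float32).unsqueeze(1),
            f_star)

def make_model(activation, d=5, width=6000):
    """One-hidden-layer network; `activation` is nn.ReLU, nn.GELU, or nn.Tanh."""
    return nn.Sequential(nn.Linear(d, width), activation(), nn.Linear(width, 1))

def train_model(model, X, y, lr, weight_decay, epochs=2000):
    """Full-batch Adam on the MSE loss with cosine learning-rate decay."""
    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
        sched.step()
    return model
```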
For each sample size $n$, we perform a grid search over
• learning rates $\eta\in\{10^{-4},10^{-3},10^{-2}\}$,
• $L^2$ regularization coefficients $\lambda\in\{10^{-5},5\times10^{-5},10^{-4},5\times10^{-4},10^{-3},5\times10^{-3},10^{-2},5\times10^{-2},10^{-1}\}$.
The hyperparameter pair $(\eta,\lambda)$ is selected by the smallest generalization error on a noiseless test set. We repeat the entire procedure over 5 independent runs and report the average of the resulting best generalization errors. This protocol mitigates sensitivity to hyperparameter choices and yields a robust estimate of the empirical convergence behavior.
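Continuing the sketch above (and reusing its hypothetical `make_dataset`, `make_model`, and `train_model` helpers), the grid search and evaluation could look roughly as follows; as in the protocol, the test error is measured against noiseless targets, and a full reproduction would additionally sweep the sample sizes and average over 5 seeds.

```python
import numpy as np
import torch
import torch.nn as nn

def test_error(model, f_star, d=5, n_test=10_000, seed=123):
    """Mean-squared error against noiseless targets on fresh uniform test points."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, (n_test, d))
    with torch.no_grad():
        pred = model(torch.tensor(X, dtype=torch.float32)).squeeze(1).numpy()
    return float(np.mean((pred - f_star(X)) ** 2))

def best_error_for(activation, n,
                   lrs=(1e-4, 1e-3, 1e-2),
                   lambdas=(1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2, 1e-1)):
    """Grid search over (learning rate, L2 coefficient); keep the smallest test error."""
    X, y, f_star = make_dataset(n)
    best = float("inf")
    for lr in lrs:
        for lam in lambdas:
            model = train_model(make_model(activation), X, y, lr=lr, weight_decay=lam)
            best = min(best, test_error(model, f_star))
    return best

# Example: compare activations at a single sample size.
# for act in (nn.ReLU, nn.GELU, nn.Tanh):
#     print(act.__name__, best_error_for(act, n=1024))
```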