Learning Without Training


Authors: Ryan O'Dowd

Learning Without Training

By Ryan O'Dowd

Claremont Graduate University
2026

© Copyright Ryan O'Dowd, 2026. All rights reserved.

Committee Approval

This dissertation has been duly read, reviewed, and critiqued by the committee listed below, which hereby approves the manuscript of Ryan O'Dowd as fulfilling the scope and quality requirements for meriting the degree of Doctor of Philosophy in Mathematics.

Hrushikesh Mhaskar, Chair
Claremont Graduate University
Distinguished Research Professor of Mathematics

Asuman Aksoy
Claremont McKenna College
Crown Professor of Mathematics and George R. Roberts Fellow

Alexander Cloninger
University of California, San Diego
Professor of Mathematics

Allon Percus
Claremont Graduate University
Joseph H. Pengilly Professor of Mathematics

Abstract

We live in an era of big data. Whether it be algorithms designed to help corporations efficiently allocate the use of their resources, systems to block or intercept transmissions in times of war, or the helpful pocket companion known as ChatGPT, machine learning is at the heart of managing the real-world problems associated with massive data. With the success of neural networks on such large-scale problems, more research in machine learning is being conducted now than ever before. This dissertation focuses on three different projects rooted in mathematical theory for machine learning applications. Common themes throughout involve the synthesis of mathematical ideas with problems in machine learning, yielding new theory, algorithms, and directions of study. The first project deals with supervised learning and manifold learning. In theory, one of the main problems in supervised learning is that of function approximation: that is, given some data set D = {(x_j, f(x_j))}_{j=1}^{M}, can one build a model F ≈ f?
At the surface level, classical approximation theory seems readily applicable to such a problem, but under the surface there are technical difficulties, including unknown data domains, extremely high-dimensional feature spaces, and noise. We introduce a method which aims to tackle these difficulties and remedies several of the theoretical shortcomings of the current paradigm. The second project deals with transfer learning: the study of how an approximation process or model learned on one domain can be leveraged to improve the approximation on another domain. This can be viewed as the lifting of a function from one manifold to another. This viewpoint enables us to connect some inverse problems in applied mathematics (such as the inverse Radon transform) with transfer learning. We study such liftings of functions when the data is assumed to be known only on a part of the whole domain. We are interested in determining subsets of the target data space on which the lifting can be defined, and in how the local smoothness of the function and its lifting are related. The third project is concerned with the classification task in machine learning, particularly in the active learning paradigm. Classification has often been treated as an approximation problem as well, but we propose an alternative approach leveraging techniques originally introduced for signal separation problems. The analogue of point sources are the supports of the distributions from which data belonging to each class is sampled. We introduce theory to unify signal separation with classification, and a new algorithm which yields accuracy competitive with other recent active learning algorithms while providing results much faster.

Acknowledgements

My advisor Hrushikesh Mhaskar has my deepest gratitude for introducing me to the world of mathematics research and pushing me to improve and accomplish far beyond what I imagined possible for myself.
His support, advice, kindness, and feedback have been instrumental to the construction of this dissertation and to my development as a mathematician. I will always remember fondly our discussions of mathematics and philosophy, walks along campus, meals shared, and travels abroad together. I am grateful for the advice and kindness that Allon Percus, who has been my academic advisor during my years at CGU, has provided to me over the years. His guidance led me along the path that has culminated in this work. I also would like to thank Asuman Aksoy and Alexander Cloninger for graciously agreeing to sit on my dissertation committee and provide valuable feedback. I would like to acknowledge my fellow students, with whom I have shared the challenges of homework, the cheer of Math Club, and the occasional conversation. The friendly faces always present at CGU make the campus such a joy to be around. I would also like to acknowledge the professors at CGU and the Claremont Colleges, who provide exceptionally well-led courses and have immense care for the success of their students. Beyond CGU, I am thankful for the advice I have received from mathematicians and scientists alike. In particular, Richard Spencer, Raghu Raj, Jerry Kim, and Frank Filbir stand out for the professional guidance and knowledge they have passed to me. I am immensely thankful for my parents, Sean and Allison. The pursuit of my dreams and the culmination of my work in this dissertation have been upheld by their ongoing support and unconditional love throughout my life. I am thankful for my fiancée Diana, and all of the moments she encouraged me forward when doubts held me back. Her love is the reservoir from which I draw my strength. I am thankful for my siblings Tyler, Trevor, and Kelsey, and their uncanny ability to raise my spirits. I am also thankful for Monica, whose warmth and kindness are always inspiring. I love you all.
This dissertation is the product of an accumulation of support more plentiful than I can mention or fully appreciate. To all those who I have known as friends in my life, thank you.

Contents

1 Introduction
  1.1 Machine Learning Background
  1.2 Machine Learning Paradigm for Supervised Learning
    1.2.1 Examples of Hypothesis Spaces
    1.2.2 Empirical Risk, Generalization, and Optimization
  1.3 Approximation Theory
    1.3.1 Universal Approximation
    1.3.2 Degree of Approximation
  1.4 Shortcomings of the Supervised Learning Paradigm
  1.5 Constructive Approximation
  1.6 Manifold Learning
  1.7 Shortcomings of Classical Approximation Theory
  1.8 Organization of the Thesis
2 Approximation on Manifolds
  2.1 Introduction
  2.2 Related Ideas
  2.3 An Overview of the Proof
  2.4 Background
    2.4.1 Spherical Harmonics
    2.4.2 Localized Kernels
    2.4.3 Approximation on Spheres
    2.4.4 Approximation on Equators
  2.5 Function Approximation on Manifolds
  2.6 Numerical Examples
    2.6.1 Piecewise Differentiable Function
    2.6.2 Parameter Estimation in Bi-exponential Sums
    2.6.3 Darcy Flow Problem
  2.7 Proofs
    2.7.1 Integral Reconstruction Operator
    2.7.2 Discretization
3 Local Transfer Learning
  3.1 Introduction
  3.2 Data Spaces
  3.3 Joint Data Spaces
  3.4 Local Approximation in Joint Data Spaces
  3.5 Proofs
4 Classification
  4.1 Introduction
  4.2 Related Works
  4.3 Point Source Signal Separation
  4.4 Relevant Definitions
    4.4.1 Measures
    4.4.2 F-score
  4.5 Main Results
  4.6 MASC Algorithm
    4.6.1 Algorithm Description
    4.6.2 Comparison With CAC and SCALe
  4.7 Numerical Examples
    4.7.1 Circle on Ellipse Data
    4.7.2 Document Data
    4.7.3 Salinas Hyperspectral Data
    4.7.4 Indian Pines Hyperspectral Data
    4.7.5 Comparison With LAND and LEND
    4.7.6 Active Versus Semi-Supervised Learning
  4.8 Proofs
    4.8.1 Measure Support Estimation
    4.8.2 Discretization
    4.8.3 F-score Result Proof
5 Conclusion
Appendices
  A Tauberian Theorem
  B Background on Manifolds
  C Orthogonal Polynomials
  D Network Representation
  E Encoding
Bibliography

Chapter 1

Introduction

Suppose we are given data about the weather and want to build a model to predict future weather occurrences. Our first thought is to visualize the data to understand trends, but the issue is that each data point has so many different parameters: geographic location, humidity, wind speed, wind direction, temperature, climate, human pollution, atmospheric composition, time of day, etc. It is impossible to visualize. The goal is to use all of this data to estimate a weather function for future time data points. Although the fundamental problem is just that of function approximation, the function in this case is described by so much data that constructing a mathematical model also seems infeasible. So where to even begin? Problems of large and changing data like this one are not unique to meteorology. They face scientists and researchers in every field: finance, geology, astronomy, biology, and beyond. Some challenges are unique to each field and even each specific problem, but common ones include large-scale data sets, noisy data, high dimensionality, changing data, and more. Machine learning algorithms have become a popular choice to tackle these large-scale problems, where data visualization and traditional mathematical modeling approaches are simply not feasible.
This chapter serves to review background information on machine learning and to investigate the shortcomings which have motivated the work in this dissertation. We discuss machine learning along with its many paradigms and problems of interest in Section 1.1. We delve into the paradigm of supervised learning in Section 1.2. We point out the traditional role approximation theory plays in supervised learning in Section 1.3. To conclude the discussion of the supervised learning paradigm, we describe some shortcomings of this paradigm in Section 1.4. In Section 1.5, we explore an alternative way in which approximation theory could play a more important role in machine learning. A relatively recent paradigm which motivated part of our research is the area of manifold learning, which we introduce in Section 1.6. We then point out in Section 1.7 the reasons why we think classical approximation theory cannot play a direct role in machine learning. We then conclude this chapter by providing an introduction to our thesis in Section 1.8.

Since it is absolutely essential for clarity in the presentation of our results, we would also like to state now how constants will be treated in this dissertation. The symbols c, c_1, c_2, ... will denote generic positive constants depending upon the fixed quantities in the discussion. Their values may be different at different occurrences, even within a single formula. The notation A ≲ B means A ≤ cB, A ≳ B means B ≲ A, and A ∼ B means A ≲ B ≲ A. In some cases where we believe it may be otherwise unclear, we will clarify which values a constant may depend on; these may appear in the subscript of the above-mentioned symbols. For example, A ≲_d B means there exists c(d) > 0 such that A ≤ c(d)B. In this chapter only, we denote the ℓ^p norms of vectors by ‖·‖_p, and the L^p norms for functions defined on a domain X by ‖·‖_{X,p}.
In later chapters we will adjust our notation for norms to simplify our discussion. In this context, we will always assume that 1 ≤ p ≤ ∞.

1.1 Machine Learning Background

Machine learning is a field at the intersection of mathematics, computer science, and data science concerned with the development of theory and algorithms capable of learning complex tasks. Typically, machine learning models are given some data from which they learn the task at hand. The tasks of interest for machine learning approaches are often those where traditional mathematical modeling leaves something to be desired. As mentioned, this may be due to the large size of the data, changing data, high dimensionality, noise, complicated factors influencing the data, or even the lack of a known mathematical model behind the data to begin with. We cite [52] as an introductory resource for machine learning; much of the content in this section can be learned in further depth there. We start our discussion of the background of machine learning by first introducing several settings for problems that are studied in the field.

• Supervised Learning: The main goal of supervised learning is to generate a model to approximate a function f on unseen data points. Let D = {(x_j, y_j)}_{j=1}^{M} be a given data set, where the (x_j, y_j) are sampled from some unknown joint distribution τ = (X, Y). The goal is to approximate the target function f(x) = E_τ(Y | X = x).

• Unsupervised Learning: Let D = {x_j}_{j=1}^{M} be a given data set, where the x_j are sampled from some unknown distribution ν on a high-dimensional Euclidean space X. The goal of unsupervised learning is to use D to construct a function f that satisfies some desired property.

• Semi-Supervised Learning: The problems considered in semi-supervised learning are broad and encompass the case where the y_j from the supervised setting are given for only some subset of the total data.
In a common setup, one may seek to construct the missing y_j data points and then proceed in a supervised manner to approximate new data points.

• Active Learning: Active learning incorporates ideas from both unsupervised and supervised learning. Let D = {x_j}_{j=1}^{M} be a given data set like that of the unsupervised setting, but in active learning we are also given access to f, called an oracle in this context. We assume that there is some cost to query f, so we start in the unsupervised setting to determine a set of points we would like to query in order to maximize the information we can gain. Once we have queried this set of points, we are in the semi-supervised setting, where the goal is to approximate f at the non-queried points.

We note briefly that there are numerous other settings one may consider in the realm of machine learning, including distributed learning, online learning, and self-supervised learning. Since these topics are not directly relevant to this dissertation, we have omitted discussion of them. There are two major problems or tasks in machine learning which we will discuss.

• Classification: In classification problems, the function f is discrete, taking on only some finite set of values called class labels.

• Regression: In regression problems, the function f may take on any value on a continuum (and is often assumed to be a continuous function). Furthermore, in many cases it is assumed that y_j = f(x_j) + ε_j, where ε_j is sampled from some unknown (often assumed mean-zero) distribution ε.

Unsupervised learning is used mostly for classification problems. In unsupervised classification, one seeks to cluster the data by somehow constructing f to give the same class label to points that are "similar" in some sense. Supervised, semi-supervised, and active learning are each used for both classification and regression problems.
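As a small concrete illustration of the regression data model just described, one can generate a data set D = {(x_j, y_j)} with y_j = f(x_j) + ε_j. The target function, noise level, and sample size below are invented for the example; a supervised learner would see only D, never f or the noise distribution:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative target function and mean-zero noise distribution
# (both are arbitrary choices, not taken from the text).
f = lambda t: np.cos(2.0 * t)
M = 200
x = rng.uniform(-1.0, 1.0, M)    # samples x_j from an (here, uniform) distribution
eps = rng.normal(0.0, 0.05, M)   # noise eps_j sampled from a mean-zero distribution
y = f(x) + eps                   # observed labels y_j = f(x_j) + eps_j

D = list(zip(x, y))              # the data set D = {(x_j, y_j)}_{j=1}^M
```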
We now turn our focus to the general paradigm in machine learning for solving problems in the supervised setting.

1.2 Machine Learning Paradigm for Supervised Learning

In this section, we discuss the current machine learning paradigm for supervised learning. As mentioned above, the fundamental problem of supervised learning is to build a model to approximate a function based on finitely many (noisy) data points from said function. Therefore, we begin with a function approximation framework. We assume that f belongs to some class of functions called the universe of discourse X (typically a Banach space), and then decide on some hypothesis spaces V_n of functions by which to model f (approximants in the language of approximation theory). As in approximation theory, a typical approach is to choose a nested family of hypothesis spaces

    V_1 ⊆ V_2 ⊆ · · · ⊆ V_n ⊆ · · · ,    (1.2.1)

where the value of the subscript often represents some notion of complexity. The idea is that these hypothesis spaces should satisfy a density property: as n → ∞, the space ⋃_{n=1}^{∞} V_n should be dense in X.

1.2.1 Examples of Hypothesis Spaces

As an example, suppose the universe of discourse is chosen to be the continuous, real-valued functions on [−1, 1]^d. Perhaps the simplest choice for a family of hypothesis spaces {V_n} would be the sets of polynomials of degree at most n on [−1, 1]^d. Fitting data to this model choice is known as polynomial regression. From the Stone-Weierstrass theorem, we know that if f ∈ X, then for any ε > 0 there exist n and P ∈ V_n such that ‖f − P‖_{[−1,1]^d, ∞} < ε. That is, the model spaces allow us to approximate f arbitrarily closely as n increases, demonstrating the density property. In the case d = 1, polynomials form a simple model. However, as d grows they may become unwieldy, with the number of parameters growing exponentially with d.
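The density property for polynomial hypothesis spaces can be observed numerically in the d = 1 case: least-squares polynomial fits of increasing degree to a smooth target drive the (sampled) uniform error toward zero. The target function and degrees below are illustrative choices, not from the text:

```python
import numpy as np

# Smooth target on [-1, 1]; an illustrative stand-in for f.
f = np.exp
x = np.linspace(-1.0, 1.0, 400)

errors = []
for n in (1, 3, 5, 7):
    # Least-squares fit by a polynomial of degree at most n (the space V_n).
    coeffs = np.polynomial.polynomial.polyfit(x, f(x), n)
    p = np.polynomial.polynomial.polyval(x, coeffs)
    errors.append(float(np.max(np.abs(f(x) - p))))  # sampled sup-norm error
```

For an analytic target like this one, the recorded errors shrink rapidly as the degree grows, a numerical echo of the Stone-Weierstrass guarantee.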
A second choice of hypothesis space, whose popularity has increased immensely in recent years, is that of neural networks. A shallow neural network is a function approximation model typically taking the form

    F_n(x) = ∑_{k=1}^{n} a_k ϕ(w_k^T x + b_k),    (1.2.2)

where ⋃_{k=1}^{n} {a_k, w_{k,1}, ..., w_{k,d}, b_k} are the learning parameters associated with the model, x is the input value, n is known as the width (also the complexity) of the network, and ϕ is a judiciously chosen activation function. One advantage of a neural network model is that it can be implemented using parallel computation. Whether or not a class of neural networks satisfies the density property is another matter. For instance, it is known that if ϕ(t) = t, then there is no network which can reproduce even the exclusive-OR function. In Section 1.3.1 we will return to this question. Other common hypothesis spaces include radial basis functions and kernel-based approximation based on a variety of kernel functions. Once a space V_n is chosen by which to model the function, the paradigm for supervised learning diverges from approximation theory, instead opting to use ideas from statistics and optimization theory to solve the problem.

1.2.2 Empirical Risk, Generalization, and Optimization

The primary approach to selecting a model from V_n is to introduce the notion of a generalization error, given as a loss functional taking the model as an input and returning some error value. Many choices can be used for such a loss functional, but perhaps the most common example is mean-squared error (MSE), which is given as

    L(P) = ∫ (y − P(x))^2 dτ(x, y).    (1.2.3)

Notice that the generalization error does not depend on the given data; instead it is an expectation over the entire distribution τ, which again is not known.
This means that one cannot expect to attain a minimizer of the generalization error, P̃ := argmin_{P ∈ V_n} L(P), from only finite data. Instead, one typically seeks to find a minimizer of the empirical risk, which is a discretized version of the generalization error based on the data. For example, the MSE empirical risk is given as

    L(D; P) = (1/M) ∑_{j=1}^{M} (y_j − P(x_j))^2 = (1/M) ‖y − P(x)‖_2^2.    (1.2.4)

The minimizer of the empirical risk, P^# := argmin_{P ∈ V_n} L(D; P), is the value that is ultimately sought after in machine learning applications. However, there are two questions with this approach that remain active areas of research:

• How can we attain the minimizer P^#?
• If we can attain it, how closely does it estimate P̃?

Both questions are essential to the performance of machine learning algorithms trained by empirical risk minimization, and in many cases neither question has a simple solution. When given large amounts of high-dimensional data, and especially in the presence of finite noise realizations, analytically deducing P^# may be impossible. So instead, a process known as optimization is used. The field of optimization theory is vast, so we will limit our discussion to a commonly used method known as gradient (or steepest) descent. The idea of optimization is to parameterize the hypothesis space and then leverage the parameterization, with respect to a chosen loss functional, to make successive (typically improving) guesses at an optimal value. For example, if V_n is the space of all neural networks of the form (1.2.2) with n neurons, then setting θ to be a vectorization of ⋃_{k=1}^{n} {a_k, w_{k,1}, ..., w_{k,d}, b_k} may serve as the vector of parameters. Gradient descent starts with some initial guess of the parameters for the model to approximate the target function, known as the initialization.
Then, the guess is moved by some step size in the direction opposite to the gradient of the chosen loss function with respect to those parameters at that point. That is, the guess at iteration j, which we will denote by θ_j, is updated by the following rule:

    θ_{j+1} = θ_j − η ∇L(θ_j),    (1.2.5)

where η is called the learning rate, or step size. For theoretical analysis, it is convenient to express this update as a dynamical process involving θ as a function of a variable t. Then the process is described by

    θ̇ = −η ∇L(θ),    (1.2.6)

where ˙ denotes the derivative with respect to t. Then

    L̇(θ) = ∇L(θ) · θ̇ = −η ‖∇L(θ)‖_2^2.    (1.2.7)

Thus, with the update process (1.2.6), the loss functional is non-increasing, and hence reaches a limit as t → ∞.

We examine the implementation of gradient descent for a shallow neural network with inputs from ℝ, using the MSE empirical risk as in (1.2.4). For simplicity, we will break up the parameters θ into a = [a_1, ..., a_n]^T, w = [w_1, ..., w_n]^T, and b = [b_1, ..., b_n]^T. By defining the M × n matrices

    ϕ = [ϕ(w_k x_j + b_k)]_{1 ≤ j ≤ M, 1 ≤ k ≤ n},    ϕ′ = [ϕ′(w_k x_j + b_k)]_{1 ≤ j ≤ M, 1 ≤ k ≤ n},    (1.2.8)

with rows indexed by the data points x_j and columns by the neurons k, we can set the vector

    z = ϕ a − f(x),    (1.2.9)

in order to simplify our notation of the MSE loss functional:

    L(D; F_n) = (1/M) ‖z‖_2^2.    (1.2.10)

Then we observe that

    dL/dz = (2/M) z,    (1.2.11)

allowing us to compute each of the following by the chain rule:

    ∂L/∂a = (∂z/∂a)^T (dL/dz) = (2/M) ϕ^T z,
    ∂L/∂w = (∂z/∂w)^T (dL/dz) = (2/M) diag(a) (ϕ′)^T diag(x) z,
    ∂L/∂b = (∂z/∂b)^T (dL/dz) = (2/M) diag(a) (ϕ′)^T z,    (1.2.12)

where diag(x) represents the matrix with each diagonal entry (i, i) given by x_i and each off-diagonal entry set to 0 (and similarly for diag(a); the factor diag(a) arises because each w_k and b_k enters the residual z only through the term a_k ϕ(w_k x_j + b_k)). Thus, the gradient descent updates can be done in three parts:

    a_{t+1} = a_t − η_1 (2/M) ϕ^T (ϕ a_t − f(x)),
    w_{t+1} = w_t − η_2 (2/M) diag(a_t) (ϕ′)^T diag(x) (ϕ a_t − f(x)),
    b_{t+1} = b_t − η_3 (2/M) diag(a_t) (ϕ′)^T (ϕ a_t − f(x)).    (1.2.13)

Since the calculation involves the computation of two matrices depending upon ϕ and ϕ′, one significant way to simplify the gradient descent calculation is to use an activation function which satisfies

    ϕ′(x) = g(ϕ(x)),    (1.2.14)

for a relatively simple function g. For instance, the function ϕ(x) = tanh(x) is a sigmoidal activation function satisfying ϕ′(x) = 1 − ϕ(x)^2. Another example is the popular choice of activation function known as the rectified linear unit (ReLU), defined by

    ϕ(x) = x for x > 0,  and  ϕ(x) = 0 for x ≤ 0.    (1.2.15)

Even though this function is not differentiable at 0, it is piecewise differentiable, and at every non-zero point we can write ϕ′(x) = ϕ(x)/x. These simplifications allow the ϕ′ matrix above to be calculated directly from the ϕ matrix, saving computation time.

1.3 Approximation Theory

One major question in supervised learning is how to choose the hypothesis space. The classical intersection of approximation theory with machine learning seeks to provide some theoretical justification for this choice. There are two main questions which are investigated extensively. The first question is whether the sequence of hypothesis spaces is appropriate: does the loss functional for V_n converge to 0 as n → ∞?
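Looking back at Section 1.2.2, the closed-form updates in (1.2.13), combined with the tanh simplification ϕ′ = 1 − ϕ², can be sketched in a few lines of NumPy. The data set, network width, learning rate, and iteration count below are invented for the illustration; this is a minimal sketch of the textbook iteration, not code from this dissertation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D data set: M samples of a target function f.
M = 64
x = np.linspace(-1.0, 1.0, M)
y = np.sin(np.pi * x)

# Shallow tanh network of width n: F(x) = sum_k a_k * tanh(w_k x + b_k).
n = 8
a = rng.normal(0.0, 0.5, n)   # random initialization of all parameters
w = rng.normal(0.0, 1.0, n)
b = rng.normal(0.0, 1.0, n)

eta = 0.05                    # one learning rate shared by all parameter blocks
losses = []
for _ in range(3000):
    phi = np.tanh(np.outer(x, w) + b)   # M x n activation matrix, as in (1.2.8)
    dphi = 1.0 - phi ** 2               # phi' computed from phi: tanh' = 1 - tanh^2
    z = phi @ a - y                     # residual vector z = phi a - f(x)
    losses.append(float(np.mean(z ** 2)))
    # Gradient steps following (1.2.13); note the per-neuron factor a_k
    # multiplying the w and b gradients.
    grad_a = (2.0 / M) * (phi.T @ z)
    grad_w = (2.0 / M) * a * (dphi.T @ (x * z))
    grad_b = (2.0 / M) * a * (dphi.T @ z)
    a -= eta * grad_a
    w -= eta * grad_w
    b -= eta * grad_b
```

The recorded losses decrease substantially over the run (up to minor oscillation from the finite step size), mirroring the continuous-time monotonicity argument in (1.2.7).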
We discuss a couple of examples of early research in this direction in Section 1.3.1. The second question is: how well can the target function be approximated by elements of V_n? In Section 1.3.2, we give some examples of early research in this direction.

1.3.1 Universal Approximation

We say that a sequence of hypothesis spaces {V_n} satisfies a universal approximation property if for every d ≥ 1, every compact set K ⊆ ℝ^d, every f ∈ C(K), and every ε > 0, there exist some N ∈ ℕ and P ∈ V_N such that

    ‖f − P‖_{K,∞} ≤ ε.    (1.3.1)

Any activation function which yields a family of universal approximator neural networks is called a Kolmogorov function. An activation function is called sigmoidal of order k if lim_{x→∞} x^{−k} ϕ(x) = 1 and lim_{x→−∞} x^{−k} ϕ(x) = 0. In [32, Theorem 1 and Lemma 1] and [19, Theorem 2.1], it was shown that shallow neural networks with continuous sigmoidal activation functions of order 0 satisfy the universal approximation property for continuous functions defined on compact sets. In [94, Corollary 3.4], it was shown that a continuous function ϕ satisfying the condition

    sup_{x ∈ ℝ} (1 + x^2)^{−n} |ϕ(x)| < ∞    (1.3.2)

for some integer n is a Kolmogorov function if and only if it is not a polynomial. The work of [67] also showed that the activation function being non-polynomial was sufficient, under different assumptions on ϕ.

1.3.2 Degree of Approximation

While the above discussion on neural networks gives some indication that they may be a suitable hypothesis space for machine learning, a crucial question remains unanswered: how well can a fixed-width network approximate a function? Even if neural networks can approximate any function as n → ∞, this gives us no idea of the rate at which we would converge to f. To answer this question, researchers have often turned to degree of approximation results.
The degree of approximation is defined to be the least possible distance from V_n to f. That is,

    E_n(f) := inf_{P ∈ V_n} ‖f − P‖,    (1.3.3)

where ‖·‖ is a suitably defined norm, typically the L^2 norm with respect to the marginal distribution of τ on x. The best approximation, P* := argmin_{P ∈ V_n} ‖f − P‖, is the model from V_n that attains the degree of approximation. This value indicates a global best that one can hope their hypothesis space (and thereby algorithm) to achieve. In the context of neural networks, significant research has gone into degree of approximation theorems in different settings. We say that f : ℝ^d → ℝ belongs to the Barron space with parameter s > 0, denoted by B_s, if it satisfies the following norm condition:

    ‖f‖_{B_s} := ∫_{ℝ^d} (1 + |x|^2)^{s/2} |f̂(x)| dx < ∞.    (1.3.4)

In [2, Theorem 1], it was shown that if f ∈ B_1, then there exists a network G_n with a sigmoidal activation function and width n such that

    ‖f − G_n‖_{B^d(0,1),2} = O(1/√n).    (1.3.5)

Another degree of approximation result of note is for Sobolev spaces. The Sobolev space with parameters r, p on a set X ⊆ ℝ^d is defined as

    W^d_{r,p}(X) = { f ∈ L^p(X) : D^k f ∈ L^p(X), |k| ≤ r }.

These spaces are associated with the following (respective) semi-norm and norm:

    |f|_{W^d_{r,p}} := ( ∑_{|k| = r} ‖f^{(k)}‖_p^p )^{1/p},    ‖f‖_{W^d_{r,p}} := ( ∑_{|k| ≤ r} ‖f^{(k)}‖_p^p )^{1/p}.    (1.3.6)

If X = [−1, 1]^d, it was shown in [86, Theorem 2.1] that if f ∈ W^d_{r,p}, then one can construct a network G_n with width n such that

    ‖f − G_n‖_{X,p} ≲ n^{−r/d} ‖f‖_{W^d_{r,p}}.    (1.3.7)

The constructive process starts with a polynomial approximation which satisfies the bound in (1.3.7). For example, a shifted average of the partial sums of the Chebyshev expansion of f can be used in the uniform approximation case (p = ∞).
Then each of the monomials in the polynomial can be approximated by divided differences of the function defined by w ↦ ϕ(w · x + b). In [79], degree of approximation results were shown for multilayer neural networks in terms of a variational modulus of smoothness. A survey of some of the main results of the time can be found in [112]. Over the years, many others have studied optimal approximation properties of neural networks in different contexts, including [10, 111, 123]. Degree of approximation results like these are often cited to motivate the use of neural networks in practical applications. In the following section, we discuss why this setup may not be as desirable as it first seems. To summarize this section, we show the current paradigm graphically in Figure 1.1.

Figure 1.1: A depiction of the standard supervised learning paradigm. The universe of discourse X is assumed to contain a target function f, and hypothesis spaces V_n are judiciously chosen based on the algorithm of choice. P^# denotes the empirical risk minimizer, P̃ denotes the minimizer of the generalization error, and P* denotes the best approximation.

1.4 Shortcomings of the Supervised Learning Paradigm

In this section we note some of the shortcomings of the current paradigm for supervised learning from a theoretical point of view, in spite of its tremendous success in practice.

1. Any optimization problem, gradient descent in particular, suffers from such shortcomings as getting stuck in a local minimum, slow convergence or even a lack of convergence, instability, and sensitivity to the initial choice of parameters. For example, in [72] a neural network tasked with approximating |t| = (t)_+ + (−t)_+ yields a constant output when given a poor choice of initialization. This phenomenon has become known as dead on arrival. Another issue is how to decide when to stop iterating.
One may halt the iterations once |θ_j − θ_{j−1}| is sufficiently small, but this runs into the issue of false stabilization: reaching a point where the iterations seem to converge but in actuality will continue changing given enough iterations.

2. The use of the degree of approximation for deciding upon the choice of the model space might be misleading. We elaborate upon this point in further detail in this section.

3. The use of a global loss functional such as the one defined in (1.2.3) is insensitive to local artifacts in the target function. Finding minimizers P̃ (of the loss functional) or even P* (the best approximation) may not preserve the local effects important to the problem at hand. We will elaborate on this aspect more in Section 1.5.

4. In many cases, one can give a direct solution to the fundamental problem of machine learning, so that the approximation at each point is affected only by the smoothness of the target function in the vicinity of that point. Furthermore, this can be done while preserving the order of magnitude of the degree of approximation according to the local smoothness of the function on that neighborhood. We will elaborate on this aspect in Section 1.5.

It has become common to motivate the use of neural networks for machine learning applications by citing degree of approximation results as in Section 1.3.2 or similar. But it is important to recognize that these types of results are often existence-based and ignore the fact that in practice the approximation needs to be constructed from the data. This distinction should not be understated. To give an example, we look at networks with the ReLU activation function t ↦ (t)_+ (noting that |t| = (t)_+ + (−t)_+) used to approximate functions on the d-dimensional sphere S^d.
The functions in question are assumed to have a representation of the form

f(x) = ∫_{S^d} |x · y| D_{|·|}(f)(y) dμ*_d(y),   (1.4.1)

where μ*_d is the volume measure of S^d normalized to be a probability measure and D_{|·|}(f) ∈ C(S^d). We define, in this discussion, ||f||_{W_{|·|}} = ||f||_∞ + ||D_{|·|}(f)||_∞, and W_{|·|} to be the class of functions with ||f||_{W_{|·|}} < ∞. The results in [82, Corollary 4.1 and Remark 4.1] show that for f ∈ W_{|·|}, there exist a_k, y_k depending upon f in an unspecified manner such that

||f − Σ_{k=1}^N a_k |◦ · y_k|||_{S^d} ≲ N^{−(d+3)/(2d)} ||f||_{W_{|·|}}.   (1.4.2)

In contrast, it is proved in [84, Corollary 4.1 and Remark 4.5] that for any choice of y_k which admits a quadrature formula exact for integrating spherical polynomials of a certain degree, and scattered data of the form (ξ_j, f(ξ_j)), ξ_j ∈ S^d, there exist linear functionals a_k (given explicitly in [84]) depending only on this data such that

||f − Σ_{k=1}^N a_k |◦ · y_k|||_{S^d} ≲ N^{−2/d} ||f||_{W_{|·|}}.   (1.4.3)

We note that unlike the bounds for Barron spaces and Sobolev spaces discussed before, both of the bounds above are for the same space of functions. The dimension-independent bound is obtained purely as an existence theorem, while the dimension-dependent bound is based on explicit constructions, essentially solving the problem of supervised learning directly. In particular, utilizing degree of approximation results to motivate the use of a particular model in practice can be misleading.

1.5 Constructive Approximation

The purpose of this section is to introduce a direct and simple way of solving the problem of machine learning in some special cases. In particular, since trigonometric approximation serves as a role model for almost all approximation processes, we elaborate our ideas in this context.
We introduce the approximation of multivariate 2π-periodic functions by trigonometric polynomials. We define T = R/(2πZ) and denote, in this section only, by μ*_d the probability Lebesgue measure on T^d. In this section, we choose the universe of discourse to be the space C(T^d) comprising continuous real functions on T^d, equipped with the uniform norm. We define the hypothesis spaces V_n by

V_n = span{x ↦ exp(i k · x) : |k| < n}.

The Fourier coefficients of a function f ∈ L^1(T^d) are defined by

f̂(k) = ∫_{T^d} f(x) e^{−i k · x} dμ*_d(x).   (1.5.1)

The best approximation in the sense of the global L^2 norm is given by the Fourier projection, defined by

s_n(f)(x) = Σ_{|k|<n} f̂(k) e^{i k · x}.   (1.5.2)

However, s_n(f) is not a good approximation in the uniform norm. Instead, let H : [0, ∞) → [0, 1] be a C^∞ function such that H(t) = 1 for t ∈ [0, 1/2] and H(t) = 0 for t ≥ 1, and define the localized kernel

Φ_n(x) = Σ_k H(|k|/n) e^{i k · x},   (1.5.3)

together with the summability operator

σ_n(f)(x) = Σ_k H(|k|/n) f̂(k) e^{i k · x} = ∫_{T^d} f(y) Φ_n(x − y) dμ*_d(y).   (1.5.4)

Given data D = {(x_j, z_j)}_{j=1}^M with the x_j drawn from μ*_d and E(z | x) = f(x), the integral above can be discretized using the data alone, yielding

σ̃_n(D; x) = (1/M) Σ_{j=1}^M z_j Φ_n(x − x_j).   (1.5.6)

Let γ > 0 and f belong to a smoothness class with parameter γ; i.e.,

||f||_{W_γ} = ||f|| + sup_{n≥0} 2^{nγ} E_{2^n}(f) < ∞.   (1.5.5)

If n ≳ 1 and M ≳ n^{d+2γ} log(n), then with probability going to 1 as M → ∞, we have

||σ̃_n − f||_{T^d,∞} ≲ n^{−γ}.   (1.5.7)

While this construction may seem disparate from neural networks, there is actually a trick to represent the approximation in the form (1.5.6) as a neural network structure. From [102], we have

e^{i k · x} = (1/(2π ϕ̂(1))) ∫_T ϕ(k · x − t) e^{it} dt.   (1.5.8)

This integral can be discretized and substituted into (1.5.3) so that (1.5.6) forms a neural network with activation function ϕ.

Next, we discuss the notion of local approximation using our reconstruction operator in the case d = 1. We look at an example of approximating the function f(θ) = |cos θ|^{1/4}. We note that the function has singularities at −π/2 and π/2.

Figure 1.2: Comparison of recoveries for f(θ) = |cos θ|^{1/4} by the best approximation (black) and a good approximation (red) for degrees n = 63, 127, 255. Figure credit: Hrushikesh Mhaskar.
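The contrast between the projection s_n and a filtered sum σ_n can be reproduced in a few lines. This sketch uses a standard bump-based C^∞ cutoff for H and computes Fourier coefficients by FFT; the exact filter behind Figure 1.2 may differ.

```python
import numpy as np

def fourier_coeffs(f, kmax, m=1 << 18):
    """Fourier coefficients of f for |k| < kmax via the trapezoid rule,
    computed with an FFT on an equispaced grid."""
    theta = 2 * np.pi * np.arange(m) / m
    fhat = np.fft.fft(f(theta)) / m
    ks = np.arange(-kmax + 1, kmax)
    return ks, fhat[ks % m]

def psi(u):
    out = np.zeros_like(u)
    out[u > 0] = np.exp(-1.0 / u[u > 0])
    return out

def H(t):
    """A C-infinity low-pass filter: H = 1 on [0, 1/2] and H = 0 on [1, inf)."""
    s = 2.0 - 2.0 * np.asarray(t, dtype=float)
    return psi(s) / (psi(s) + psi(1.0 - s))

f = lambda th: np.abs(np.cos(th)) ** 0.25
n = 64
ks, c = fourier_coeffs(f, n)
theta = np.linspace(-np.pi, np.pi, 1001)
E = np.exp(1j * np.outer(ks, theta))
s_n = np.real(c @ E)                            # Fourier projection s_n(f)
sigma_n = np.real((c * H(np.abs(ks) / n)) @ E)  # filtered sum sigma_n(f)
away = np.abs(np.abs(theta) - np.pi / 2) > 0.5  # points away from the singularities
print(np.max(np.abs(s_n - f(theta))[away]), np.max(np.abs(sigma_n - f(theta))[away]))
```

Away from ±π/2 the filtered sum is markedly more accurate than the projection of the same degree, which is the behavior seen in Figure 1.2.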
In Figure 1.2, we see the recovery error of f by σ_64, σ_128, and σ_256 compared with the recovery errors by s_64, s_128, and s_256, respectively. We see that the good approximation actually reduces error much faster away from the singularities than the best approximation.

1.6 Manifold Learning

The following section contains excerpts from our papers [90, 89]. The purpose of this section is to introduce a relatively new paradigm of manifold learning. First, we examine a phenomenon known as the curse of dimensionality. It is a critical issue in machine learning which informs our later choice to work with functions defined on manifolds. We can formalize this notion through the lens of nonlinear widths, shown graphically in Figure 1.3. We start with a subset of a Banach space of functions K ⊆ X defined on a domain X ⊆ R^Q, a (continuous) parameterization scheme P : K → R^M, and a (continuous) approximation scheme A : R^M → X. The nonlinear L^p width is defined by

d_M(K, X)_p := inf_{P,A} sup_{f∈K} ||f − A(P(f))||_{X,p}.   (1.6.1)

This value represents the worst possible error resulting from the approximation of any f ∈ K using the best possible (continuous) parameterization and approximation process. In particular, we have the following theorem from [38] demonstrating this curse for the Sobolev spaces as defined in Section 1.3.2.

Theorem 1.6.1 ([38, Theorem 4.2]). Let K^Q_{r,p} = {f ∈ L^p([0,1]^Q) : |f|_{W^Q_{r,p}} ≤ 1}, and 1 ≤ p ≤ q. Then,

d_M(K^Q_{r,p}, W^Q_{r,p})_q ≳_r M^{−r/Q}.   (1.6.2)

Figure 1.3: Depiction of nonlinear width. Here X is a metric space of functions with K some subset. The goal is to understand the error associated with approximating any function f ∈ K, where P is a continuous parameterization of K and A is an approximation scheme mapping into X.
This theorem informs us that if we want an approximation process that estimates every function in K^Q_{r,p} by some corresponding function in W^Q_{r,p} with an L^p error less than ϵ, we need M ≳_r ϵ^{−Q/r}. The curse of dimensionality comes into play when Q is large and ϵ is small, making this lower bound on M so large as to be impractical. The curse of dimensionality is a symptom of the choice of the hypothesis spaces the approximation process belongs to.

A relatively recent idea to alleviate the curse of dimensionality is to assume the values x_j are sampled from an unknown, low-dimensional submanifold of the high-dimensional ambient space and build the approximation scheme based on functions from this lower-dimensional structure. This has become known as the manifold assumption. In theory, this assumption implies that the overall structure of the data can be preserved on a much lower dimensional space. Many methods for such dimensionality reduction have been studied, including Isomap [130], locally linear embedding (LLE) [116], Hessian locally linear embedding (HLLE) [34], diffusion maps (Dmaps) [29], maximum variance unfolding (MVU), also known as semidefinite embedding (SDE) [134], local tangent space alignment (LTSA) [135], Laplacian eigenmaps (Leigs) [6], and randomized anisotropic transform [22]. Chui and Wang have given a survey of these methods in [21]. The special issue [18] of Applied and Computational Harmonic Analysis (2006) provides a great introduction to diffusion geometry. Applications of these methods include, but are by no means limited to, semi-supervised learning [5, 7], document analysis [30], face recognition [22, 57, 69], hyperspectral imaging [20], image processing [9, 39], cataloguing of galaxies [40], and social networking [51].
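Returning to the lower bound M ≳ ϵ^{−Q/r} above, plugging in numbers makes the severity of the curse concrete (constants and the r-dependent factor are dropped, so these are order-of-magnitude figures only):

```python
# Required parameter count M ~ eps**(-Q/r) for uniformly approximating the
# Sobolev ball with error eps (Theorem 1.6.1, constants dropped).
r = 2  # smoothness: two derivatives
for Q in (2, 10, 50):
    for eps in (1e-1, 1e-2):
        print(f"Q = {Q:2d}, eps = {eps:g}: M ~ {eps ** (-Q / r):.1e}")
```

With Q = 50 and ϵ = 10^{−2}, the bound is on the order of 10^{50} parameters; replacing Q by a much smaller intrinsic dimension q is exactly the promise of the manifold assumption just described.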
The term manifold learning has generally become associated with a two-step procedure: first, finding information about the manifold itself, and second, utilizing this information to do function approximation. With the two-step procedure, the estimates obtained in function approximation need to be tempered by the errors accrued in the manifold learning step. In turn, the errors in the manifold learning step may be sensitive to the choice of different parameters used in the process as well as noise in the data.

One approach is to estimate an atlas of the manifold, which thereby allows function approximation to be conducted via local coordinate charts. One such effort is to utilize the underlying parametric structure of the functions to determine the dimension of the manifold and the parameters involved [75]. Approximations utilizing estimated coordinate charts have been implemented, for example, via deep learning [121, 27, 120], moving least-squares [126], local linear regression [17], or using Euclidean distances among the data points [24]. It is also shown in [61, 60] that an atlas on the unknown manifold can be defined in terms of the heat kernel corresponding to the Laplace-Beltrami operator on the manifold. This leads to another approach, which is to look at an eigendecomposition of the Laplace-Beltrami operator. It has been shown that the so-called graph Laplacian (and the corresponding eigendecomposition) constructed from data points converges to the manifold Laplacian and its eigendecomposition [6, 8, 125]. An introduction to the subject is given in [18]. In [30, 81], a multi-resolution analysis is constructed using the heat kernel. Another important tool is the theory of localized kernels based on the eigendecomposition of the heat kernel.
These were introduced in [73] based on certain assumptions on the spectral function and the property of finite speed of wave propagation. In the context of manifolds, this latter property was proved in [124, 43] to be equivalent to the so-called Gaussian upper bounds on the heat kernels. These bounds have been studied in many contexts by many authors, e.g., [55, 54, 35, 64], and recently for a general smooth manifold in [85]. Function approximation on manifolds based on scattered data (i.e., data points x_j whose locations are not prescribed analytically) has been studied in detail in many papers, starting with [73], e.g., [43, 44, 83, 42, 81, 85]. This theory was applied successfully in [99] to construct deep networks for predicting blood sugar levels based on continuous glucose monitoring devices. In [95, Theorem 4.3], localized kernels were constructed based on the Gaussian upper bound on the heat kernel.

1.7 Shortcomings of Classical Approximation Theory

As we discussed, the fundamental problem of supervised learning is that of function approximation, yet approximation theory has played only a marginal role. We point out some reasons.

1. We do not typically know whether the assumptions on the target function involved in the approximation theory bounds are satisfied in practice, or whether the number of parameters is the right criterion to look at in the first place. For example, when one considers approximation by radial basis function (RBF) networks, it is observed in many papers (e.g., [83]) that the minimal separation among the centers is the right criterion rather than the number of parameters. It is shown that if one measures the degree of approximation in terms of the minimal separation, then one can determine the smoothness of the underlying target function by examining the rate at which the degrees of approximation converge to 0.

2.
Most of the approximation theory literature focuses on the question of estimating the degree of approximation under various norms and conditions on f, where the support of the marginal distribution ν is assumed to be a known domain, such as a torus, a cube, the whole Euclidean space, a hypersphere, etc.; equivalently, one assumes that the data points y_j are "dense" on such a domain. This creates a gap between theory, where the domain of ν is known, and practice, where it is not.

3. One consequence of approximating, say, on a cube is the curse of dimensionality as mentioned. Manifold learning seeks to alleviate this problem, but the current methodology in that area is a two-step procedure, introducing extra errors into the problem of function approximation itself, apart from the choice of hyperparameters, the need to find the eigendecomposition of a large matrix, out-of-sample extension, etc.

1.8 Organization of the Thesis

This dissertation introduces three new approaches for problems of machine learning. A unifying theme across the methods is that they are developed from a harmonic analysis and approximation theory viewpoint.

The first project, discussed in Chapter 2, involves the approximation of noisy functions defined on an unknown submanifold of a hypersphere directly, without training and without estimating information about the submanifold. We note that our theory may also work with data from a Euclidean space via a projection to a hypersphere of the same dimension. This project aims to remedy ailments in the current supervised learning paradigm by introducing a method working in a new paradigm, where the results attained from real data are tied directly to those in the theory.

The second project, discussed in Chapter 3, introduces a new approach for localized transfer learning. The idea of transfer learning is to take information learned in one setting to aid with learning in another.
Our theory works with data spaces which are a generalization of manifolds.

The third project, discussed in Chapter 4, tackles the problem of classification in machine learning. The problem of classification in the supervised setting has often been approached by function approximation. This yields several theoretical challenges which were not present when dealing with regression, and much work has been done to overcome them. There is, however, an alternate perspective, which draws inspiration from signal separation problems. We introduce the problem of signal separation and show how the problem of machine learning classification can be viewed as a generalization of this problem. Our results are built from this alternative perspective, which solves the classification problem completely when the classes are well-separated. We also introduce theory and an algorithm to tackle the case where the data classes may not be well-separated.

We note here that the chapters discussing these projects are excerpts from our papers, and therefore have self-contained notation that may not apply across chapters.

Chapter 2

Approximation on Manifolds

The content in this chapter is sourced from our paper published in Neural Networks titled "Learning on manifolds without manifold learning" [90].

2.1 Introduction

Let D = {(y_j, z_j)}_{j=1}^M be our data, drawn randomly from an unknown probability distribution τ, and set f(y) = E_τ(z | y). We recall that the fundamental problem of supervised learning is to approximate f given the data D. The purpose of this chapter is to introduce a direct method of approximation on unknown, compact, q-dimensional submanifolds of R^Q without trying to find out anything about the manifold other than its dimension. The motivation for this approach is outlined in Section 1.6. As discussed in Chapter 1, there are many shortcomings to the current paradigm for supervised learning.
The results in this chapter belong to an alternate machine learning paradigm proposed by H. N. Mhaskar, shown in Figure 2.1. In this paradigm, one is still concerned with assuming that f belongs to some universe of discourse and approximating it by some function in a hypothesis space V_n. The major difference lies in the approximation. Instead of using optimization to minimize empirical risk, the idea is to directly construct a good approximation σ_n ∈ V_n in the approximation theory sense.

More specifically, in [80] it is shown how a specific construction of kernels results in a direct function approximation method when the data space is an unknown, smooth, compact, connected manifold. The approximation procedure is additionally done directly from the data, without the two-step procedure described in Section 1.6. The key to that work was the construction of localized kernels on the Euclidean space in terms of Hermite functions.

Figure 2.1: A depiction of a new machine learning paradigm, where one constructs an approximation σ_n in the space V_n directly from the data. This is done in such a way that one can also measure a direct reconstruction error from the approximation to the target function.

We define

P_{n,q}(x) := ((−1)^n √((2n)!) / (2^n n! π^{1/4})) h_{2n}(x)   if q = 1,

P_{n,q}(x) := (1 / (π^{(2q−1)/4} Γ((q−1)/2))) Σ_{ℓ=0}^n ((−1)^ℓ Γ((q−1)/2 + n − ℓ) √((2ℓ)!) / ((n−ℓ)! 2^ℓ ℓ!)) h_{2ℓ}(x)   otherwise,   (2.1.1)

where h_n is the orthonormalized Hermite function as defined in (C.8). Then the localized kernel is defined by the sum

Φ̃_{n,q}(x) = Σ_{k=0}^{⌊n²/2⌋} H(√(2k)/n) P_{k,q}(x),   (2.1.2)

where H : [0, ∞) → [0, 1] is a C^∞ function such that H(t) = 1 for t ∈ [0, 1/2] and H(t) = 0 for t ≥ 1. Letting D = {(x_j, y_j)}_{j=1}^M, the approximation process for x ∈ R^q is then defined by

F̃_{n,α}(D; x) := (n^{q(1−α)} / M) Σ_{j=1}^M y_j Φ̃_{n,q}(n^{1−α} ||x − x_j||_2).
(2.1.3)

The main theorem of that work is the following.

Theorem 2.1.1 ([80, Theorem 3.1]). Let γ > 0, and let D be a data set sampled from a distribution τ supported on X × Ω, where the marginal distribution of τ restricted to the manifold X (where the x_j's are sampled) is absolutely continuous with respect to μ* and has density f_0 ∈ W_γ(X), where W_γ is a smoothness class analogous to Definition 2.5.1. Let f ∈ W_γ(X), 0 < α < 4/(2+γ), and α ≤ 1. Then for every n ≥ 1 and 0 < δ < 1, if M ≥ n^{q(2−α)+2αγ} log(n/δ), with probability at least 1 − δ we have

||F̃_{n,α}(D; ◦) − f f_0||_{X,∞} ≲ (√(||f_0||_{X,∞} ||Y||_{X×Ω,∞}) + ||f f_0||_{W_γ(X)}) n^{−αγ}.   (2.1.4)

This theorem demonstrates an approximation process for functions defined on an unknown manifold. There are some limitations, however. The first is that the complexity of the approximation grows like n², even though the approximation rate depends on n^{αγ}. The second is that it is based on Hermite functions, which become unstable when implemented too far from the origin, due to the rapidly decaying e^{−x²/2} factor. The third is that the class of approximants used to generate the approximation in (2.1.3) depends explicitly on the choice of x at which one is approximating. In this chapter, we look to alleviate these drawbacks.

The upshot is that this method does not require optimization, since it is built from mathematical theory with guaranteed good approximation properties, and the localization of the kernels allows the approximation to succeed despite the usual pitfalls of approximation theory in machine learning applications. Furthermore, it succeeds in doing the above without utilizing the classical approach of approximating the manifold via an eigendecomposition or atlas estimate as discussed in Section 1.6.
Instead, only the dimension of the manifold is assumed, and in practice it can be treated as a hyperparameter.

In the present work we project the q-dimensional manifold X in question from the ambient space R^Q to a sphere S^Q of the same dimension. We can then use a specially designed, localized, univariate kernel Φ_{n,q} (cf. (2.4.12)) which is a spherical polynomial of degree < n on S^Q, with n and q being tunable hyperparameters. Our construction is very simple; we define

F_n(D; x) := (1/M) Σ_{j=1}^M z_j Φ_{n,q}(x · y_j).   (2.1.5)

We note that F_n(D; ◦) is a function defined on the ambient sphere S^Q. The localization of the kernel allows us to adapt the approximation to the unknown manifold. Our main theorem (cf. Theorem 2.5.1) has the following form:

Theorem 2.1.2 (Informal statement). Let D = {(y_j, z_j)}_{j=1}^M be a set of random samples chosen from a distribution τ. Suppose f belongs to a smoothness class W_γ (detailed in Definition 2.5.1) with associated norm ||◦||_{W_γ}. Then under some additional conditions and with a judicious choice of n, we have with high probability:

||F_n(D; ◦) − f||_X ≤ c (||z|| + ||f||_{W_γ}) (log M / M)^{γ/(q+2γ)},   (2.1.6)

where c is a positive constant independent of f.

We note some mathematical features of our construction and theorem which we find interesting.

1. The usual approach in machine learning is to construct the approximation using an optimization procedure, usually involving a regularization term. The setting up of this optimization problem, especially the regularization term, requires one to assume that the function belongs to some special function class, such as a reproducing kernel Hilbert/Banach space. Thus, the constructions are not explicit or universal. In contrast, our construction (2.1.5) does not require a prior on the function in order to use our model.
Of course, the theorem and its high-probability convergence rates do require various assumptions on τ, the marginal distribution, the dimension of the manifold, the smoothness of the target function, etc. The point is that the construction itself does not require any assumptions.

2. A major problem in manifold learning is that of out-of-sample extension; i.e., extending the approximation to points outside the manifold. A usual procedure for this, in the context of approximation using the eigenstructure of the Laplace-Beltrami operator on the manifold, is the Nyström extension [31]. However, this extension is no longer in terms of any orthogonal system on the ambient space, and hence there is no guarantee of the quality of approximation even if the function is known outside the manifold. In contrast, the point x in (2.1.5) is not restricted to the manifold, but rather freely chosen from S^Q. That is, our construction defines an out-of-sample extension in terms of spherical polynomials on the ambient sphere, whose approximation capabilities are well studied.

3. In terms of M, the estimate in (2.1.6) depends asymptotically on the dimension q of the manifold rather than the dimension Q of the ambient space.

4. We do not need to know anything about the manifold itself (e.g., an eigendecomposition or atlas estimate) apart from its dimension in order to prove our theorem. There are several papers in the literature on estimating the dimension from the data, for example [71, 70, 75]. However, the simplicity of our construction allows us to treat the dimension q as a tunable parameter to be determined by the usual division of the data into training, validation, and test sets.

There are several other approaches superficially similar to our constructions. We comment on some of these in Section 2.2. We describe the main idea behind our proofs in Section 2.3.
The results require an understanding of the approximation properties of spherical polynomials. Accordingly, we describe some background on spherical polynomials, our localized kernels, and their use in approximation theory on subspheres of the ambient sphere in Section 2.4. The main theorems for approximation on the unknown manifold are given in Section 2.5. The theorems are illustrated with three numerical examples in Section 2.6. One of these examples is closely related to an important problem in magnetic resonance relaxometry, in which one seeks to find the proportion of water molecules in the myelin covering in the brain based on a model that involves inversion of the Laplace transform. The proofs of the main theorems are given in Section 2.7.

We would like to thank Dr. Richard Spencer at the National Institute on Aging (NIH) for his helpful comments, especially on Section 2.6.2, verifying that our simulation is consistent with what is used in the discipline of magnetic resonance relaxometry.

2.2 Related Ideas

Since our method is based on a highly localized kernel, it is expected to be comparable to the simple nearest neighbor algorithm. However, rather than specifying the number of neighbors to consider in advance, our method allows the selection of neighbors adaptively for each test point, controlled by the parameter n. Also, rather than taking a simple average, our method is more sophisticated, designed to give an optimal order of magnitude of the approximation error. One of the oldest ideas for data-based function approximation is the so-called Nadaraya-Watson estimator (NWE), given by

NW_h(x) = Σ_{j=1}^M z_j K(|x − y_j|/h) / Σ_{j=1}^M K(|x − y_j|/h),

where K is a kernel with an effectively small support—the Gaussian kernel K(t) = exp(−t²) being a common example—and h is a scaling parameter.
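For concreteness, the estimator above can be sketched in a few lines (the circle test function and bandwidth below are arbitrary choices for illustration):

```python
import numpy as np

def nadaraya_watson(x, ys, zs, h):
    """Nadaraya-Watson estimate with the Gaussian kernel K(t) = exp(-t^2):
    NW_h(x) = sum_j z_j K(|x - y_j| / h) / sum_j K(|x - y_j| / h)."""
    d = np.linalg.norm(ys - x, axis=1)
    w = np.exp(-((d / h) ** 2))
    return np.sum(w * zs) / np.sum(w)

# toy usage on the circle S^1 with f(theta) = sin(theta)
theta = np.linspace(-np.pi, np.pi, 2000, endpoint=False)
ys = np.stack([np.cos(theta), np.sin(theta)], axis=1)
zs = np.sin(theta)
x = np.array([np.cos(0.3), np.sin(0.3)])
print(nadaraya_watson(x, ys, zs, h=0.05))  # close to sin(0.3)
```

The bandwidth h plays the role described next: shrinking h shrinks the effective support of K, at the cost of needing denser data near each test point.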
Another possible choice is a B-spline (including Bernstein polynomials), which has compact support. This construction is designed to work on a Euclidean space by effectively shrinking the support of K using the scaling parameter h → 0, analogously to spline approximation. The degree of approximation of such methods is measured in terms of h. It is well known (e.g., [37]) that the use of a positive kernel K suffers from the so-called saturation phenomenon: the degree of approximation cannot be smaller than O(h²) unless the function is a trivial one in some sense.

Radial basis function (RBF) networks and neural networks are used widely for function approximation, using either interpolation or a least squares fit. Standard RBF networks, such as Gaussian networks or thin plate spline networks, use a fixed, scaled kernel. Typically, the matrices involved in either interpolation or least squares approximation are very ill-conditioned, and the approximation is not highly localized. Restricted to the sphere, both of these notions are represented by a zonal function (ZF) network. A zonal function on a sphere is a function of the form x ↦ g(x · x_0). A ZF network is a linear combination of finitely many zonal functions. One may notice that

g(x · x_0) = g(1 − |x − x_0|²/2),

so a ZF network is also a neural/RBF network. Conversely, a neural/RBF network restricted to the sphere is a ZF network. The same observations about RBF networks hold for ZF networks as well. We note that all the papers we are aware of which deal with approximation by ZF networks actually end up approximating a spherical polynomial by the networks in question. Rather than working with a fixed, scaled kernel, in this chapter we deal with a sequence of highly localized polynomial kernels. We do not need to solve any system of equations or do any optimization to arrive at our construction.
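As a rough numerical illustration of a construction of this shape, the following sketch evaluates a kernel-sum estimator of the form (2.1.5) on S², with a filtered-Legendre zonal kernel standing in for Φ_{n,q}. The filter, the normalization, and the test setup here are all assumptions of this sketch, not the construction analyzed in Section 2.4.

```python
import numpy as np
from numpy.polynomial.legendre import legval

def raised_cos(t):
    """A simple filter: 1 on [0, 1/2], smooth decay to 0 at 1 (standing in
    for the C-infinity cutoff H used in the thesis)."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= 0.5, 1.0,
                    np.where(t >= 1.0, 0.0, 0.5 * (1 + np.cos(np.pi * (2 * t - 1)))))

def localized_kernel(n):
    """Zonal kernel Phi_n(t) = sum_{l<n} H(l/n) (2l+1) P_l(t) on S^2; with
    uniform samples, (1/M) sum_j z_j Phi_n(x . y_j) estimates the filtered
    projection of f at x."""
    l = np.arange(n)
    coeffs = raised_cos(l / n) * (2 * l + 1)
    return lambda t: legval(t, coeffs)

def F_n(x, ys, zs, n):
    """Kernel-sum estimator of the shape F_n(D; x) = (1/M) sum_j z_j Phi(x . y_j)."""
    Phi = localized_kernel(n)
    return np.mean(zs * Phi(ys @ x))

rng = np.random.default_rng(1)
M = 100_000
ys = rng.normal(size=(M, 3))
ys /= np.linalg.norm(ys, axis=1, keepdims=True)   # uniform samples on S^2
zs = ys[:, 2]                                     # f(y) = y_3, a degree-1 harmonic
x = np.array([0.0, 0.0, 1.0])
print(F_n(x, ys, zs, n=8))  # close to f(x) = 1
```

Because f here is a low-degree harmonic and the filter equals 1 on the low frequencies, this particular estimate is unbiased, and the only error is Monte Carlo fluctuation; note that no system of equations is solved and no optimization is run.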
RBF networks and NWE were developed for approximation on Euclidean domains rather than unknown manifolds. Both have a single hyperparameter h and work analogously to spline approximation. In contrast, our method is designed for approximation on unknown manifolds without having to learn anything about the manifold besides its dimension. It has two integer hyperparameters (n and q) and yields a polynomial approximation. If one chooses h small enough relative to a fixed n, then NWE may be able to outperform our method as measured in terms of a global error bound, such as the root mean square (RMS) error. If one instead chooses n large enough relative to a fixed h, then our method may be able to outperform NWE. So in order to give a fair comparison in Example 2.2.1, we force the RMS errors of both methods to be approximately equivalent and investigate the qualitative differences of the errors produced by each method. We additionally show that both methods in the example outperform an interpolatory RBF network.

Example 2.2.1. This example serves to illustrate two points. The first is to compare the performance of our method with NWE and an RBF interpolant. In doing so, we show that the error associated with our method is localized to the singularities of the target function, whereas the other methods do not exhibit this behavior. The second point is that using a global error estimate such as RMS can be misleading. Even if the RMS error of a given method is large, the percentage of test data points at which the error is smaller than a threshold could be substantially higher due to the local effects in the target function. To ensure a fair comparison, we use each of the three methods for approximation on S¹ = {(cos θ, sin θ) : θ ∈ (−π, π]}, where the Gaussian kernel can be expressed in the form of a zonal function as explained above.
We consider the function
\[
f(\theta) = 1 + |\cos\theta|^{7/2} \sin(\cos\theta + \sin\theta)/2, \qquad \theta \in (-\pi, \pi]. \tag{2.2.1}
\]
We note that the function is analytic except at θ = ±π/2, where it has a discontinuity in the fourth order derivative. Our training data consists of 2¹³ equidistantly spaced points y_j along the circle. We set z_j = f(y_j) and examine the resulting error on test data consisting of 2¹¹ points chosen randomly according to the uniform distribution on S¹. We consider three approximation processes: (1) the Nadaraya–Watson estimator NW_h with K_h(t) = exp(−t²/h²); (2) interpolatory approximation by an RBF network of the form Σ_j a_j exp(−|◦ − y_j|²/h²); (3) our method with the kernel Φ_{50,1}. We experimentally determined the optimal h value for the NWE to be ≈ 7.45e−4 (effectively simulating the minimization of the actual generalization error on the test data), resulting in an RMS error of 1.8462e−7. The same value of h was used for interpolation with the Gaussian RBF network, yielding an RMS error of 2.2290e−4. We then chose n so as to yield an RMS error of 1.8594e−7, comparable to that of the NWE (though we note that in this case our method continues to provide a better approximation if n is increased further). The detailed results are summarized in Figure 2.2 below.

Figure 2.2: Error comparison between our method, the Nadaraya–Watson estimator, and an interpolatory RBF network. (Left) Comparison of absolute errors between the methods, with the target function plotted on the right y-axis for the benefit of the viewer. We note that the error from the RBF method is scaled by 10⁻³ so as not to dominate the figure. (Right) Percent point plot of the log absolute error for all three methods.

In the left plot in Figure 2.2, we can see a clear difference between the errors of the three methods.
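The percent point plots used in Figure 2.2 (and in later figures) are straightforward to compute from a vector of test errors; a minimal sketch, with the function name and the toy data in the test being ours:

```python
import numpy as np

def percent_point(errors):
    """Percent point (inverse CDF) data for a vector of absolute errors:
    returns (pct, log_err), where a point (x, y) means that x% of the
    test points had absolute error below 10**y."""
    e = np.sort(np.abs(np.asarray(errors, dtype=float)))
    pct = 100.0 * np.arange(1, e.size + 1) / e.size
    return pct, np.log10(e)
```

Plotting `pct` against `log_err` for each method produces curves of the kind shown in the right panel of Figure 2.2.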
The error from the RBF network (scaled by 10⁻³) jumps throughout the whole domain, signaling the ill-conditioned nature of the matrix. The error from the Nadaraya–Watson estimator exhibits some oscillation across the whole domain as well. The error of our method is localized to the two singularity points of the function. In other words, our method exhibits (1) sensitivity to the singularities of a function and (2) error adapting to the local smoothness of the function. In comparison, RBF networks and the NWE do not always exhibit such behavior. In the right plot of Figure 2.2, we give a percent point plot of the log absolute error for all three methods. There are three curves, corresponding to the three methods being compared. Each point (x, y) along a given curve indicates that the corresponding method approximated x% of the test points with absolute error below 10^y. This plot can also be thought of as the inverse CDF of the random variable given by the log absolute error at a test point sampled uniformly at random. For example, whereas the Nadaraya–Watson estimator yields an error below ≈ 10⁻⁷·⁸⁵ for only about 10% of the test points, our method exhibits the same error or below for about 60% of the test points. Our method has the higher uniform error, but lower error for over 90% of the test points. Although the overall RMS error is roughly the same, our method exhibits a quicker decay from the uniform error. This illustrates, in particular, that measuring performance using a global measure of error, such as the uniform or RMS error, can be misleading. The interpolatory RBF network performs the worst of the three methods, as the right plot of Figure 2.2 clearly shows. ■

There are some efforts [47, 65] to do function approximation on manifolds using tensor product splines or RBF networks defined on an ambient space, by first extending the target function to the ambient space.
A locally adaptive polynomial approach is used in [126] for accomplishing function approximation on manifolds using the data. All these works require that the manifold be known. In [80], we have suggested a direct approach to function approximation on an unknown submanifold of a Euclidean space using a localized kernel based on Hermite polynomials. This construction was used successfully in predicting diabetic sugar episodes [92] and in the recognition of hand gestures [77]. In particular, in [92], we constructed our approximation based on one clinical data set and used it to predict the episodes based on another clinical data set. In order to extend the applicability of such results to wearable devices, it is important that the approximation be encoded by a hopefully small number of real numbers, which can then be hardwired or used for a simpler approximation process [108]. However, the construction in [80] is a linear combination of kernels of the form Ψ(|◦ − y_j|), where Ψ(t) = P(t) exp(−t²/2) is a univariate kernel utilizing a judiciously chosen polynomial P. This means that we get a good approximation, but the space from which the approximation takes place changes with the point at which the approximation is desired. This does not allow us to encode the approximation using finitely many real numbers. In contrast, the method proposed in this chapter allows us to encode the approximation using coefficients of the target function in the spherical harmonic expansion (defined in a distributional sense), computed empirically. Moreover, the degree of the polynomials involved in [80] to obtain the same rate of convergence in terms of the number of samples is O(n²), while the degree of the polynomials involved in this chapter is O(n).
We note that the constructions in both cases involve only univariate polynomials, so that the dimension of the input space enters only linearly in the complexity of the construction.

2.3 An Overview of the Proof

We can think of F_n(D; x), defined in (2.5.5), as an empirical approximation of an expected value with respect to the data distribution τ:
\[
\mathbb{E}_\tau(F_n(D; x)) = \int z\, \Phi_{n,q}(x \cdot y)\, d\tau(y). \tag{2.3.1}
\]

Figure 2.3: Visualization of our approximation approach. Here, X is a submanifold of the sphere S^Q. The map η_x, analogous to the exponential map, allows us to relate the part of the integral in (2.3.3) near x with an integral on the tangent sphere at x via a change of variables (solid curves). The localization of the kernels in our method allows the approximation to be extended over X and the tangent sphere S_x (dotted curves).

Assuming that the marginal distribution of τ on X is absolutely continuous with respect to the Riemannian volume measure μ* on X, i.e., given by f₀ dμ* for some smooth function f₀, we have
\[
\mathbb{E}_\tau(F_n(D; x)) = \int_{X} f(y)\, f_0(y)\, \Phi_{n,q}(x \cdot y)\, d\mu^*(y). \tag{2.3.2}
\]
Accordingly, we define an integral reconstruction operator by
\[
\sigma_n(X, f)(x) := \int_{X} \Phi_{n,q}(x \cdot y)\, f(y)\, d\mu^*(y), \qquad f \in C(X),\ x \in X, \tag{2.3.3}
\]
study the approximation properties of this operator, and use it with f f₀ in place of f. The approximation properties of the operator σ_n in the case when X is the q-dimensional sphere S^q are well known (Proposition 2.4.2), and can easily be transferred to a q-dimensional equator of the ambient sphere S^Q (Section 2.4.4, Theorem 2.4.1). We introduce a local exponential map η_x at x ∈ X between X and the tangent equatorial sphere S_x (i.e., a rotated version of S^q that shares its tangent space with X at x). We give an illustration of this setup in Figure 2.3.
Locally, a change of variables formula and the properties of this map allow us to compare the integral over a small manifold ball with that of its image on S_x (cf. (2.7.10)). We keep track of the errors using the Bernstein inequality for spherical polynomials (cf. (2.4.14)) and standard approximations of geodesic distances and volume elements on the manifold by those on S_x. This constitutes the main part of the proof of the critical Lemma 2.7.1. We use the high localization property of our kernel Φ_{n,q} to lift the rest of the integral in (2.3.3) on X at any point x ∈ X to the rest of S_x with small error (cf. (2.7.11), (2.7.12)). After this, we can use known results from the theory of approximation on the sphere by spherical polynomials (cf. Proposition 2.4.2 and Theorem 2.4.1). A partition of unity argument is used often in the proof. Having obtained the approximation result for the integral reconstruction operator, we then discretize the integral and keep track of the errors using concentration inequalities.

2.4 Background

In this section, we outline some important details about spherical harmonics (Section 2.4.1), which lead to the construction of the kernels of interest in this chapter (Section 2.4.2). We then review some classical approximation results using these kernels on spheres (Section 2.4.3) and on equators of spheres (Section 2.4.4).

2.4.1 Spherical Harmonics

The material in this section is based on [104, 127]. Let 0 ≤ q ≤ Q be integers. We define the q-dimensional sphere embedded in (Q+1)-dimensional space by
\[
S^q := \{(x_1, \dots, x_{q+1}, \underbrace{0, \dots, 0}_{Q-q}) : x_1^2 + \cdots + x_{q+1}^2 = 1\}. \tag{2.4.1}
\]
Observe that S^q is a q-dimensional compact manifold with geodesic distance defined by ρ(x, y) = arccos(x·y). Let μ*_q denote the normalized volume measure on S^q.
By representing a point x ∈ S^q as (x′ sin θ, cos θ) for some x′ ∈ S^{q−1}, one has the recursive formula for measures
\[
\frac{\omega_q}{\omega_{q-1}}\, d\mu^*_q(x) = \sin^{q-1}(\theta)\, d\theta\, d\mu^*_{q-1}(x'), \tag{2.4.2}
\]
where ω_q denotes the surface volume of S^q. One can write ω_q recursively as
\[
\omega_q = \frac{2\pi^{(q+1)/2}}{\Gamma((q+1)/2)} =
\begin{cases}
2\pi, & \text{if } q = 1,\\[4pt]
\dfrac{\sqrt{\pi}\,\Gamma(q/2)}{\Gamma(q/2 + 1/2)}\,\omega_{q-1}, & \text{if } q \ge 2,
\end{cases} \tag{2.4.3}
\]
where Γ denotes the Gamma function. The restriction of a homogeneous harmonic polynomial in q+1 variables to the q-dimensional unit sphere S^q is called a spherical harmonic. The space of all spherical harmonics of degree ℓ in q+1 variables will be denoted by H^q_ℓ. The space of restrictions to S^q of all polynomials of degree < n in q+1 variables will be denoted by Π^q_n. We extend this notation to an arbitrary real value x > 0 by writing Π^q_x := Π^q_{⌊x⌋}. It is known that H^q_ℓ is orthogonal to H^q_j in L²(μ*_q) whenever j ≠ ℓ, and
\[
\Pi^q_n = \bigoplus_{\ell=0}^{n-1} H^q_\ell. \qquad \text{In particular,} \qquad L^2(\mu^*_q) = \bigoplus_{\ell=0}^{\infty} H^q_\ell.
\]
If we let {Y_{ℓ,k}}_{k=1}^{dim(H^q_ℓ)} be an orthonormal basis for H^q_ℓ with respect to μ*_q, we can define
\[
K_{q,\ell}(x, y) := \sum_{k=1}^{\dim(H^q_\ell)} Y_{\ell,k}(x)\, Y_{\ell,k}(y). \tag{2.4.4}
\]
In [104, Theorem 2] and [127, Theorem 2.14], it is shown that
\[
K_{q,\ell}(x, y) = \frac{\omega_q}{\omega_{q-1}}\, p_{q,\ell}(1)\, p_{q,\ell}(x \cdot y), \tag{2.4.5}
\]
where p_{q,ℓ} denotes the orthonormalized ultraspherical polynomial of dimension q and degree ℓ. These ultraspherical polynomials satisfy the following orthogonality condition:
\[
\int_{-1}^{1} (1 - x^2)^{q/2 - 1}\, p_{q,m}(x)\, p_{q,n}(x)\, dx = \delta_{m,n}.
\]
(2.4.6)

Computationally, it is customary to use the following recurrence relation:
\[
\sqrt{\frac{(n+1)(n+q-1)}{(2n+q-1)(2n+q+1)}}\; p_{q,n+1}(x) = x\, p_{q,n}(x) - \sqrt{\frac{n(n+q-2)}{(2n+q-1)(2n+q-3)}}\; p_{q,n-1}(x), \qquad n \ge 1,
\]
\[
p_{q,0}(x) = p_{q,0} = \frac{2^{1/2-q/2}\sqrt{\Gamma(q)}}{\Gamma(q/2)}, \qquad p_{q,1}(x) = \frac{2^{1/2-q/2}\sqrt{(q+1)\Gamma(q)}}{\Gamma(q/2)}\, x. \tag{2.4.7}
\]
We note further that
\[
p_{q,n}(1) = \frac{2^{1/2-q/2}}{\Gamma(q/2)} \sqrt{\frac{\Gamma(n+q-1)(2n+q-1)}{\Gamma(n+1)}}. \tag{2.4.8}
\]
Remark 2.4.1. Many notations have been used for ultraspherical polynomials in the past. For example, [129] uses the notation P^{(λ)}_n for the Gegenbauer polynomials, also commonly denoted by C^{(λ)}_n. It is also usual to use a normalization, which we will denote by R^q_n in this remark, given by R^q_n = p_{q,n}/p_{q,n}(1). Ultraspherical polynomials are also simply a special case of the Jacobi polynomials P^{(α,β)}_n with α = β. Setting
\[
h_{q,n} := \frac{2^{q-1}\,\Gamma(n+q/2)^2}{n!\,\Gamma(n+q-1)(2n+q-1)}, \tag{2.4.9}
\]
we have the following connection between these notations:
\[
p_{q,n}(x) = h_{q,n}^{-1/2}\, P^{(q/2-1,\,q/2-1)}_n(x) = \frac{\Gamma(q-1)}{\Gamma(q/2)}\sqrt{\frac{n!\,(2n+q-1)}{2^{q-1}\,\Gamma(n+q-1)}}\; C^{(q/2-1/2)}_n(x) = \frac{2^{1/2-q/2}}{\Gamma(q/2)}\sqrt{\frac{\Gamma(n+q-1)(2n+q-1)}{\Gamma(n+1)}}\; R^q_n. \tag{2.4.10}
\]
■

Furthermore, the ultraspherical polynomials for the sphere of dimension d₁ can be represented by those for the sphere of dimension d₂ in the following manner:
\[
p_{d_1,n} = \sum_{\ell=0}^{n} C_{d_2,d_1}(\ell, n)\, p_{d_2,\ell}. \tag{2.4.11}
\]
The coefficients C have been studied, and explicit formulas are given in [1, Equation 7.34] and [129, Equation 4.10.27].

2.4.2 Localized Kernels

Let h be an infinitely differentiable function supported on [0, 1] with h(x) = 1 on [0, 1/2]. This function will be fixed for the rest of this chapter, and its mention will be omitted from the notation.
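The three-term recurrence (2.4.7) can be implemented directly; a minimal sketch for q ≥ 2, assuming the starting values above (written using Γ(q) = (q−1)Γ(q−1)); the function name is ours:

```python
import numpy as np
from math import gamma, sqrt

def ultraspherical(q, n, x):
    """Evaluate the orthonormalized ultraspherical polynomials
    p_{q,0}, ..., p_{q,n} at the points x via the three-term
    recurrence (2.4.7); returns an array of shape (n+1, len(x))."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    P = np.zeros((n + 1, x.size))
    P[0] = 2.0 ** (0.5 - q / 2) * sqrt(gamma(q)) / gamma(q / 2)  # p_{q,0}
    if n >= 1:
        P[1] = sqrt(q + 1.0) * P[0] * x                          # p_{q,1}
    for k in range(1, n):
        a = sqrt((k + 1) * (k + q - 1) / ((2 * k + q - 1) * (2 * k + q + 1)))
        b = sqrt(k * (k + q - 2) / ((2 * k + q - 1) * (2 * k + q - 3)))
        P[k + 1] = (x * P[k] - b * P[k - 1]) / a
    return P
```

For q = 2 the weight in (2.4.6) is identically 1, so these are the orthonormalized Legendre polynomials, which gives a convenient sanity check against (2.4.8).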
Then we define the following univariate kernel for t ∈ [−1, 1]:
\[
\Phi_{n,q}(t) := \Phi_{n,q}(h; t) = \sum_{\ell=0}^{n} h\!\left(\frac{\ell}{n}\right) K_{q,\ell}(t) = \frac{\omega_q}{\omega_{q-1}} \sum_{\ell=0}^{n} h\!\left(\frac{\ell}{n}\right) p_{q,\ell}(1)\, p_{q,\ell}(t). \tag{2.4.12}
\]
The following proposition lists some technical properties of these kernels, which we will use often, sometimes without explicit mention.

Proposition 2.4.1. Let x, y ∈ S^Q. For any S > 0, the kernel Φ_{n,q}(x·y) satisfies the localization bound
\[
|\Phi_{n,q}(x \cdot y)| \lesssim \frac{n^q}{\max(1,\, n \arccos(x \cdot y))^S}, \tag{2.4.13}
\]
where the constant involved may depend upon S. Further, we have the Lipschitz condition
\[
|\Phi_{n,q}(x \cdot y) - \Phi_{n,q}(x \cdot y')| \lesssim n^{q+1}\, |\arccos(x \cdot y) - \arccos(x \cdot y')|, \qquad y' \in S^Q. \tag{2.4.14}
\]
Proof. The estimate (2.4.13) is proved in [97, Lemma 4.9]. Since θ ↦ Φ_{n,q}(cos θ) is a trigonometric polynomial of degree < n, the Bernstein inequality for the derivatives of trigonometric polynomials implies that
\[
|\Phi_{n,q}(\cos\theta) - \Phi_{n,q}(\cos\varphi)| \le n\, \|\Phi_{n,q}\|_\infty\, |\theta - \varphi| \lesssim n^{q+1}\, |\theta - \varphi|.
\]
This leads easily to (2.4.14). ■

2.4.3 Approximation on Spheres

Methods of approximating functions on S^q have been studied in, for example, [96, 118], and some details are summarized in Proposition 2.4.2. For a compact set A, let C(A) denote the space of continuous functions on A, equipped with the supremum norm ||f||_A = max_{x∈A} |f(x)|. We define the degree of approximation of a function f ∈ C(S^q) to be
\[
E_n(f) := \inf_{P \in \Pi^q_n} \|f - P\|_{S^q}. \tag{2.4.15}
\]
Let W_γ(S^q) be the class of all f ∈ C(S^q) such that
\[
\|f\|_{W_\gamma(S^q)} := \|f\|_{S^q} + \sup_{n \ge 0} 2^{n\gamma} E_{2^n}(f) < \infty. \tag{2.4.16}
\]
We note that an alternative notion of smoothness, characterized in terms of constructive properties of f, has been explored by many authors; some examples are given in [33]. We define the approximation operator for S^q by
\[
\sigma_n(f)(x) := \sigma_n(S^q, f)(x) = \int_{S^q} \Phi_{n,q}(x \cdot u)\, f(u)\, d\mu^*_q(u).
\]
(2.4.17)

With this setup, we now review some bounds on how well σ_n(f) approximates f.

Proposition 2.4.2 ([96, Proposition 4.1]). Let n ≥ 1.
(a) For all P ∈ Π^q_{n/2}, we have σ_n(P) = P.
(b) For any f ∈ C(S^q), we have
\[
E_n(f) \le \|f - \sigma_n(f)\|_{S^q} \lesssim E_{n/2}(f). \tag{2.4.18}
\]
(c) In particular, if γ > 0, then f ∈ W_γ(S^q) if and only if
\[
\|f - \sigma_n(f)\|_{S^q} \lesssim \|f\|_{W_\gamma(S^q)}\, n^{-\gamma}. \tag{2.4.19}
\]
Remark 2.4.2. Part (a) is known as a reproduction property, which shows that polynomials of degree < n/2 are unchanged when passed through the operator σ_n. Part (b) demonstrates that σ_n yields what we term a good approximation, where its approximation error is no more than a constant multiple of the degree of approximation. Part (c) not only gives approximation bounds in terms of the smoothness parameter γ, but also shows that the rate of decrease of the approximation error obtained by σ_n(f) determines the smoothness γ. ■

2.4.4 Approximation on Equators

Let SO(Q+1) denote the group of all orthogonal (Q+1) × (Q+1) matrices with determinant equal to 1. A q-dimensional equator of S^Q is a set of the form Y = {Ru : u ∈ S^q} for some R ∈ SO(Q+1). The goal in the remainder of this section is to give approximation results for equators. Since there are infinitely many choices of R ∈ SO(Q+1) that generate the set Y, we first give a definition of the degree of approximation in terms of spherical polynomials that is invariant to the choice of R. Fix Y to be a given q-dimensional equator of S^Q and let R, S ∈ SO(Q+1) both map S^q onto Y. Observe that if P ∈ Π^q_n, then P(R^T S ◦) ∈ Π^q_n and vice versa. As a result, the functions F_R = f(R◦) and F_S = f(S◦) defined on S^q satisfy
\[
E_n(F_R) = E_n(F_S).
\]
(2.4.20)

Since the degree of approximation in this context is invariant to the choice of R ∈ SO(Q+1), we may simply choose any such matrix that maps S^q onto Y, drop the subscript R from F_R, and define
\[
E_n(Y, f) := E_n(F). \tag{2.4.21}
\]
This allows us to define the space W_γ(Y) as the class of all f ∈ C(Y) such that
\[
\|f\|_{W_\gamma(Y)} := \|f\|_{Y} + \sup_{n \ge 0} 2^{n\gamma} E_{2^n}(Y, f) < \infty. \tag{2.4.22}
\]
We can also define the approximation operator on the set Y by
\[
\sigma_n(Y, f)(x) := \int_{Y} \Phi_{n,q}(x \cdot y)\, f(y)\, d\mu^*_Y(y), \tag{2.4.23}
\]
where μ*_Y is the probability volume measure on Y. Let F_R ∈ C(S^q) satisfy F_R = f ∘ R. We observe that
\[
\sigma_n(Y, f)(x) = \int_{S^q} \Phi_{n,q}(x \cdot R u)\, f(R u)\, d\mu^*_q(u) = \int_{S^q} \Phi_{n,q}(R^T x \cdot u)\, F_R(u)\, d\mu^*_q(u) = \sigma_n(S^q, F_R)(R^T x). \tag{2.4.24}
\]
We now give an analogue of Proposition 2.4.2 for approximation on equators.

Theorem 2.4.1. Let f ∈ C(Y).
(a) We have
\[
E_n(Y, f) \le \|\sigma_n(Y, f) - f\|_{Y} \lesssim E_{n/2}(Y, f). \tag{2.4.25}
\]
(b) If γ > 0, then f ∈ W_γ(Y) if and only if
\[
\|\sigma_n(Y, f) - f\|_{Y} \lesssim n^{-\gamma}\, \|f\|_{W_\gamma(Y)}. \tag{2.4.26}
\]
Proof. Let F(◦) = f(R◦) for some R ∈ SO(Q+1) with Y = {Ru : u ∈ S^q}. To see (2.4.25), we check using Proposition 2.4.2 that
\[
\|\sigma_n(Y, f) - f\|_{Y} = \|\sigma_n(S^q, F)(R^T \circ) - F(R^T \circ)\|_{Y} = \|\sigma_n(S^q, F) - F\|_{S^q} \lesssim E_{n/2}(F) = E_{n/2}(Y, f). \tag{2.4.27}
\]
Additionally, E_n(Y, f) ≤ ||σ_n(Y, f) − f||_Y since σ_n(Y, f) = σ_n(S^q, F)(R^T ◦) ∈ Π^q_n. Part (b) follows from part (a) and the definitions. ■

2.5 Function Approximation on Manifolds

In this section, we develop the notion of smoothness for the target function defined on a manifold and state our main theorem, Theorem 2.5.1. For a brief introduction to manifolds and some results we will be using in this chapter, see Appendix B.
Let Q ≥ q ≥ 1 be integers and let X be a q-dimensional, compact, connected submanifold of S^Q without boundary. Let ρ denote the geodesic distance on X and μ* be the normalized volume measure (that is, μ*(X) = 1). For any x ∈ X, observe that the tangent space T_x(X) is a q-dimensional vector space tangent to S^Q. We define S_x = S_x(X) to be the q-dimensional equator of S^Q passing through x whose own tangent space at x is also T_x(X). As an important note, S_x is also a q-dimensional compact manifold. In this chapter we will consider many spaces, and we need to define balls in each of them; these are listed in Table 2.1 below.

Space          | Description    | Definition
Ambient space  | Euclidean ball | B_{Q+1}(x, r) = {y ∈ ℝ^{Q+1} : ||x − y||₂ ≤ r}
Ambient sphere | Spherical cap  | S_Q(x, r) = {y ∈ S^Q : arccos(x·y) ≤ r}
Tangent space  | Tangent ball   | B_T(x, r) = {y ∈ T_x(X) : ||x − y||₂ ≤ r}
Tangent sphere | Tangent cap    | S_x(r) = {y ∈ S_x : arccos(x·y) ≤ r}
Manifold       | Geodesic ball  | B(x, r) = {y ∈ X : ρ(x, y) ≤ r}

Table 2.1: Definition and description of balls in the different spaces.

We also need to define the smoothness classes we will be considering for functions on X. Let C(X) denote the space of all continuous functions on X, and let C^∞(X) ⊂ C(X) denote the space of all infinitely differentiable functions on X. Let ε_x be the exponential map at x for S_x, and let e_x be the exponential map at x for X. Since both X and S_x are compact, there exist ι₁, ι₂ such that ε_x and e_x are defined on B_T(x, ι₁) and B_T(x, ι₂), respectively, for any x. We write ι* = min{1, ι₁, ι₂} and define η_x : S_x(ι*) → X by η_x := e_x ∘ ε_x⁻¹. Thus,
\[
\rho(x, \eta_x(y)) = \arccos(x \cdot y), \qquad x \in X,\ y \in S_x(\iota^*). \tag{2.5.1}
\]
Definition 2.5.1.
We say that f ∈ C(X) is γ-smooth for some γ > 0, or equivalently that f ∈ W_γ(X), if for every x ∈ X and every ϕ ∈ C^∞(X) supported on B(x, ι*), the function F_{x,ϕ} : S_x → ℝ defined by F_{x,ϕ} := f(η_x(u)) ϕ(η_x(u)) belongs to W_γ(S_x) as outlined in Section 2.4.3 (in particular, Equation (2.4.22)). We also define
\[
\|f\|_{W_\gamma(X)} := \sup_{x \in X,\ \|\phi\|_{W_\gamma(S_x)} \le 1} \|F_{x,\phi}\|_{W_\gamma(S_x)}. \tag{2.5.2}
\]
Our main theorem, describing the approximation of f f₀ (the target function weighted by the density of the data points) by the operator defined in (2.1.5), is the following. We note that approximation of f f₀ includes local approximation on X in the sense that, when the training data is sampled only from a subset of X, this fact can be encoded by f₀ being supported on this subset.

Theorem 2.5.1. We assume that
\[
\sup_{x \in X,\ r > 0} \frac{\mu^*(B(x, r))}{r^q} \lesssim 1. \tag{2.5.3}
\]
Let D = {(y_j, z_j)}_{j=1}^M be a random sample from a joint distribution τ. We assume that the marginal distribution of τ restricted to X is absolutely continuous with respect to μ* with density f₀, and that the random variable z has a bounded range, say z ∈ [−||z||, ||z||]. Let
\[
f(y) := \mathbb{E}_\tau(z \mid y), \tag{2.5.4}
\]
and
\[
F_n(D; x) := \frac{1}{M} \sum_{j=1}^{M} z_j\, \Phi_{n,q}(x \cdot y_j), \qquad x \in S^Q, \tag{2.5.5}
\]
where Φ_{n,q} is defined in (2.4.12). Let 0 < γ < 2 and f f₀ ∈ W_γ(X). Then for every n ≥ 1, 0 < δ < 1/2, and
\[
M \gtrsim n^{q+2\gamma} \log(n/\delta), \tag{2.5.6}
\]
we have with τ-probability (i.e., probability over the distribution τ) ≥ 1 − δ:
\[
\|F_n(D; \circ) - f f_0\|_X \lesssim \frac{\sqrt{\|f_0\|_X}\, \|z\| + \|f f_0\|_{W_\gamma(X)}}{n^\gamma}. \tag{2.5.7}
\]
Equivalently, for integer M ≥ 2 and n satisfying (2.5.6), we have with τ-probability ≥ 1 − δ:
\[
\|F_n(D; \circ) - f f_0\|_X \lesssim \left( \sqrt{\|f_0\|_X}\, \|z\| + \|f f_0\|_{W_\gamma(X)} \right) \left( \frac{\log(M/\delta^{q+2\gamma})}{M} \right)^{\gamma/(q+2\gamma)}.
\]
(2.5.8)

We discuss two corollaries of this theorem, which demonstrate how it can be used both for estimation of the probability density f₀ and for approximation of the function f in the case when the training data is sampled from the volume measure on X. The first corollary is a result on function approximation in the case when the marginal distribution of y is μ*, i.e., f₀ ≡ 1.

Corollary 2.5.1. Assume the setup of Theorem 2.5.1. Suppose also that the marginal distribution of τ restricted to X is uniform. Then we have with τ-probability ≥ 1 − δ:
\[
\|F_n(D; \circ) - f\|_X \lesssim \frac{\|z\| + \|f\|_{W_\gamma(X)}}{n^\gamma}. \tag{2.5.9}
\]
The second corollary is obtained by setting f ≡ 1, to point out that our theorem gives a method for density estimation. In practice, one may not have knowledge of f₀ (or even of the manifold X), so the following corollary can be applied to estimate this critical quantity. We use this fact in our numerical examples in Section 2.6. Typically, a positive kernel is used for the problem of density estimation in order to ensure that the approximation is also a positive measure. It is well known in approximation theory that this results in a saturation of the rate of convergence. Our method does not use positive kernels, and does not suffer from such saturation.

Corollary 2.5.2. Assume the setup of Theorem 2.5.1. Then we have with τ-probability ≥ 1 − δ:
\[
\left\| \frac{1}{M} \sum_{j=1}^{M} \Phi_{n,q}(\circ \cdot y_j) - f_0 \right\|_X \lesssim \frac{\|f_0\|_{W_\gamma(X)}}{n^\gamma}. \tag{2.5.10}
\]

2.6 Numerical Examples

In this section, we illustrate our theory with some numerical experiments. In Section 2.6.1, we consider the approximation of a piecewise differentiable function, and demonstrate how the localization of the kernel leads to a determination of the locations of its singularities. The example in Section 2.6.2 is motivated by magnetic resonance relaxometry.
Since it is relevant to our method for practical applications, we have included some discussion and results about how q affects the approximation in this example. The example in Section 2.6.3 illustrates how our method can be used for inverse problems in the realm of differential equations. In all the examples, we examine how the accuracy of the approximation depends on the maximal degree n of the polynomial, the number M of samples, and the level of noise.

2.6.1 Piecewise Differentiable Function

In this example only, we define the function to be approximated as
\[
f(\theta) := 1 + |\cos\theta|^{1/2} \sin(\cos\theta + \sin\theta)/2, \tag{2.6.1}
\]
defined on the ellipse
\[
E = \{(3\cos\theta, 6\sin\theta) : \theta \in (-\pi, \pi]\}. \tag{2.6.2}
\]
We project E to the sphere S² using an inverse stereographic projection defined by
\[
P(x) = \frac{(x, 1)}{\|(x, 1)\|_2}, \tag{2.6.3}
\]
and call X = P(E). Each x ∈ X is associated with the value θ_x satisfying x = P((3cos θ_x, 6 sin θ_x)), so that f(x) := f(θ_x) is a continuous function on X. We generate our data points by taking y_j = P((3cos θ_j, 6 sin θ_j)), where the θ_j are each sampled uniformly at random from (−π, π]. We then define z_j = f(y_j) + ϵ_j, where the ϵ_j are sampled from some mean-zero normal noise. Our data set is thus D := {(y_j, f(y_j) + ϵ_j)}_{j=1}^M. We measure the magnitude of the noise using the signal-to-noise ratio (SNR), defined by
\[
20 \log_{10}\!\left( \frac{\|(f(y_1), \dots, f(y_M))\|_2}{\|(\epsilon_1, \dots, \epsilon_M)\|_2} \right). \tag{2.6.4}
\]
Since f₀ ≢ 1 in this case, we could calculate f₀ from the projection, or we may estimate it using Corollary 2.5.2; that is,
\[
f_0(x) \approx \frac{1}{M} \sum_{j=1}^{M} \Phi_{n,1}(x \cdot y_j). \tag{2.6.5}
\]
This option may be desirable in cases where f₀ is not feasible to compute (e.g., if the underlying domain of the data is unknown or irregularly shaped).
Our approximation is thus
\[
F_n(D; x) = \frac{\sum_{j=1}^{M} (f(y_j) + \epsilon_j)\, \Phi_{n,1}(x \cdot y_j)}{\sum_{j=1}^{M} \Phi_{n,1}(x \cdot y_j)}. \tag{2.6.6}
\]

Figure 2.4: Left y-axis: plot of the true function f compared with F₃₂ constructed from 2¹³ noiseless training points. Right y-axis: plot of |f − F₃₂|.

Figure 2.5: (Left) Percent point plot of the log absolute error for various n with M = 2¹³ training points and no noise. (Center) Percent point plot of the log absolute error for various choices of M with no noise. (Right) Percent point plot of the log absolute error for various noise levels with M = 2¹³ training points.

Figure 2.4 shows a plot of the true function and our approximation on the left y-axis and the absolute error on the right y-axis. The plot demonstrates that the approximation achieves much lower error than the uniform error bound at points where the function is relatively smooth, and only spikes locally at the singularities of the function (θ = ±π/2). Figure 2.5 displays three percent point plots illustrating how the distribution of log₁₀ |F_n − f| behaves for various choices of n, M, and ϵ. Each point (x, y) on a curve indicates that x% of the test points were approximated by our method with absolute error below 10^y for the n, M, and ϵ values associated with the curve. The first graph shows the trend for various values of n: as n increases, we see a consistent drop in the log error. The second graph shows various values of M: we again see a decrease in the overall log error as M is increased. The third graph shows how the log error decreases as the noise decreases. We can see that the approximation is much worse for low SNR values, but nearly indistinguishable from the noiseless case when the SNR is above 60.
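For q = 1, the kernel (2.4.12) has a particularly simple form: writing t = cos θ and using ω₁/ω₀ = π together with p_{1,0} = 1/√π and p_{1,ℓ}(cos θ) = √(2/π) cos(ℓθ) (the orthonormalized Chebyshev polynomials), one gets Φ_{n,1}(cos θ) = 1 + 2 Σ_{ℓ=1}^n h(ℓ/n) cos(ℓθ), a smoothed Dirichlet-type sum. A minimal sketch of this kernel and of the normalized estimator (2.6.6) on the circle follows; the particular smooth cutoff h, the function names, and the target in the test are illustrative choices of ours:

```python
import numpy as np

def smooth_cutoff(x):
    """An illustrative C-infinity cutoff h: equal to 1 on [0, 1/2] and
    0 on [1, inf), with a smooth exponential transition in between."""
    x = np.asarray(x, dtype=float)
    def g(s):
        out = np.zeros_like(s)
        pos = s > 0
        out[pos] = np.exp(-1.0 / s[pos])
        return out
    a, b = g(1.0 - x), g(x - 0.5)
    mid = a / (a + b + 1e-300)
    return np.where(x <= 0.5, 1.0, np.where(x >= 1.0, 0.0, mid))

def Phi_n1(n, t):
    """Localized kernel (2.4.12) for q = 1:
    Phi_{n,1}(cos theta) = 1 + 2 * sum_{l=1}^{n} h(l/n) cos(l*theta)."""
    theta = np.arccos(np.clip(np.asarray(t, dtype=float), -1.0, 1.0))
    ell = np.arange(1, n + 1)
    hvals = smooth_cutoff(ell / n)
    return 1.0 + 2.0 * np.tensordot(hvals, np.cos(np.multiply.outer(ell, theta)), axes=(0, 0))

def F_n(n, y, z, x):
    """Normalized estimator (2.6.6): a kernel-weighted average of the z_j,
    with the density estimate (2.6.5) in the denominator."""
    W = Phi_n1(n, x @ y.T)            # (num test) x (num train) kernel matrix
    return (W @ z) / W.sum(axis=1)
```

Note that, in contrast to the interpolatory RBF network of Example 2.2.1, no linear system is solved here: the estimator is a single matrix-vector product followed by a normalization.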
2.6.2 Parameter Estimation in Bi-exponential Sums

This example is motivated by magnetic resonance relaxometry, in which the proton nuclei of water are first excited with radio frequency pulses and then exhibit an exponentially decaying electromagnetic signal. When one may assume the presence of two water compartments undergoing slow exchange, with the signal corrupted by additive Gaussian noise, the signal is typically modeled as a bi-exponential decay function (cf. [117]):
\[
F(t) = c_1 \exp(-t/T_{2,1}) + c_2 \exp(-t/T_{2,2}) + E(t),
\]
where E is the noise, T_{2,1}, T_{2,2} > 0, and the time t is typically sampled at equal intervals. The problem is to determine c₁, c₂, T_{2,1}, T_{2,2}. The problem appears also in many other medical applications, such as intravoxel incoherent motion studies in magnetic resonance. An accessible survey of these applications is given in [59]. Writing t = jδ, λ₁ = δ/T_{2,1}, λ₂ = δ/T_{2,2}, we may reformulate the data as
\[
f(j) := f(\lambda_1, \lambda_2, j) = c_1 e^{-\lambda_1 j} + c_2 e^{-\lambda_2 j} + \epsilon(j), \tag{2.6.7}
\]
where the ϵ(j) are samples of mean-zero normal noise. In this example, suggested by Dr. Spencer at the National Institute on Aging (NIH), we consider the case where c₁ = 0.7, c₂ = 0.3 and use our method to determine the values λ₁, λ₂, given data of the form
\[
\tilde{y}(\lambda_1, \lambda_2) := (f(1), f(2), \dots, f(100)). \tag{2.6.8}
\]
We “train” our approximation process with M samples of (λ₁, λ₂) ∈ [0.1, 0.7] × [1.1, 1.7] chosen uniformly at random, plugging those values into (2.6.7) to generate vectors of the form shown in (2.6.8). The dimension of the input data is Q = 100; however, in the noiseless case the data lies on a q = 2 dimensional manifold, so we use Φ_{n,2} to generate our approximations. We note that our method is agnostic to the particular model (2.6.8) used to generate the data.
We treat λ₁, λ₂ as functions of ỹ without any prior knowledge of these functions. In the noisy case, this problem does not perfectly fit the theory studied in this chapter, since the noise is applied to the input values f(t), meaning we cannot assume they lie directly on an unknown manifold anymore. Nevertheless, we see some success with our method. We define the operators
\[
T(\tilde{y}) = 1000\,\tilde{y} - (380, 189, 116, 0, \dots, 0), \qquad P(\circ) = \frac{(\circ, 100)}{\|(\circ, 100)\|_2}, \tag{2.6.9}
\]
and denote y = P(T(ỹ)). In practice, the values used to define T and P need to be treated as hyperparameters of the model; in this example, we did not conduct a rigorous grid search. We use the same density estimation as in Section 2.6.1:
\[
\mathrm{DE}\big(x(\lambda_1, \lambda_2)\big) = \sum_{j=1}^{M} \Phi_{n,2}\big(x(\lambda_1, \lambda_2) \cdot y(\lambda_{1,j}, \lambda_{2,j})\big). \tag{2.6.10}
\]
As a result, our approximation process looks like
\[
\begin{pmatrix} \lambda_1 \\ \lambda_2 \end{pmatrix} \approx F_n\big(x(\lambda_1, \lambda_2)\big) = \frac{\sum_{j=1}^{M} \begin{pmatrix} \lambda_{1,j} \\ \lambda_{2,j} \end{pmatrix} \Phi_{n,2}\big(x(\lambda_1, \lambda_2) \cdot y(\lambda_{1,j}, \lambda_{2,j})\big)}{\mathrm{DE}\big(x(\lambda_1, \lambda_2)\big)}. \tag{2.6.11}
\]
Similarly to Example 2.6.1, we include figures showing how the results are affected as n, M, and ϵ are adjusted. We measure noise using the signal-to-noise ratio (SNR) defined by
\[
20 \log_{10}\!\left( \frac{\|\tilde{y}\|_2}{\|(\epsilon(1), \dots, \epsilon(100))\|_2} \right). \tag{2.6.12}
\]
Unlike Example 2.6.1, we now consider percent approximation error instead of uniform error, as it is more relevant in this problem. We define the combined error to be
\[
\sum_{j=1}^{2} \frac{|\lambda_{j,\mathrm{true}} - \lambda_{j,\mathrm{approx}}|}{\lambda_{j,\mathrm{true}}}. \tag{2.6.13}
\]

Figure 2.6: (Left) Percent point plot of the log combined error for various n with M = 2¹³ training points and no noise. (Center) Percent point plot of the log combined error for fixed n = 32, various choices of M, and no noise. (Right) Percent point plot of the log combined error for fixed n = 32, fixed M = 2¹³ training points, and various noise levels.
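The data model (2.6.7)–(2.6.8) is easy to reproduce; a minimal sketch of the training-data generation, with function names and the noiseless setting being our illustrative choices:

```python
import numpy as np

def biexp_features(lam1, lam2, J=100, c1=0.7, c2=0.3, noise=None):
    """Feature vector (f(1), ..., f(J)) from the model (2.6.7) with
    c1 = 0.7, c2 = 0.3; `noise` is an optional vector of eps(j) samples."""
    j = np.arange(1, J + 1)
    y = c1 * np.exp(-lam1 * j) + c2 * np.exp(-lam2 * j)
    return y if noise is None else y + noise

def training_data(M, rng):
    """M parameter pairs drawn uniformly from [.1,.7] x [1.1,1.7] and
    their (noiseless) feature vectors, as in the experiment setup."""
    lam = np.column_stack([rng.uniform(0.1, 0.7, M), rng.uniform(1.1, 1.7, M)])
    Y = np.array([biexp_features(l1, l2) for l1, l2 in lam])
    return lam, Y
```

Although each feature vector lives in ℝ¹⁰⁰, the noiseless vectors depend on only the two parameters (λ₁, λ₂), which is the sense in which the data lies on a q = 2 dimensional manifold.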
Figure 2.7: Plot of RMS error for approximation by $F_{32}$ for various $q$ values with $M = 2^{13}$ and no noise.

Figure 2.6 depicts three percent point plots showing the distribution of sorted $\log_{10}(\text{Combined Error})$ points for various $n, M, \epsilon$. Each point $(x, y)$ on a curve indicates that $x\%$ of test points were approximated by our method with combined error below $10^y$ for the $n$, $M$, and $\epsilon$ associated with the curve. In the first graph, we see the error distribution for various choices of $n$. As $n$ increases, the overall log error decreases. An interesting phenomenon occurs in the $n = 128$ case, where the uniform error is actually higher than in the $n = 64$ case. This is likely due to the fact that overfitting can occur if $n$ gets too large relative to a fixed $M$. The second graph illustrates how the approximation improves as $M$ is increased. As expected, we see the log error decay as we include more and more training points. In the third graph, we see that the approximation improves up to a limit as the noise decreases. There is very little noticeable difference between the noiseless case and any case where SNR > 50. Another question that may arise when utilizing our method on various data is what value of $q$ to use. While the theory predicts that $q$ should be associated with the intrinsic dimension of the manifold underlying the data, in practice this can only be estimated, and so $q$ should be treated as a hyperparameter. In Figure 2.7, we explore how changing $q$ affects the approximation in this example. In this case, the intrinsic dimension is 2, and when $q = 2, 3$ the approximation does well. If $q$ is chosen too high or too low, the approximation yields a greater error.

2.6.3 Darcy Flow Problem

In this section we look at a numerical example from the realm of PDE inverse problems. Steady-state Darcy flow is given by the following PDE (see, for example, [114, Eq.
(4.7)]):
\[
-\nabla \cdot (a \nabla y) = f, \quad (2.6.14)
\]
defined on a domain $D$ with the property that $y|_{\partial D} = 0$. The problem is to predict the diffusion coefficient $a$ and forcing term $f$ given some noisy samples of $y$ on $D$. In this chapter we consider a 1-dimensional version and suppose that $a = e^{-st}$ and $f = pe^{-st}$ for some $p, s$. We take noisy samples of $y(p, s; \circ) = y$ satisfying the following boundary value problem:
\[
-\frac{d}{dt}\big(e^{-st} y'(t)\big) = pe^{-st}, \qquad y(1) = 0, \quad y(0) = 1. \quad (2.6.15)
\]
In this example, we take a similar approach to that of Example 2.6.2 by "training" our model with a data set of the form $\{y_j, (p_j, s_j)\}_{j=1}^M$, where $(p_j, s_j) \in [0.1, 0.25] \times [1.5, 2.5]$ are sampled uniformly at random for each $j$. Letting $y_j$ denote the $y$ satisfying (2.6.15) with $p = p_j$, $s = s_j$, we set $y_j = P(y_j(t_1), y_j(t_2), \dots, y_j(t_{100}))$, where $t_1, t_2, \dots, t_{100}$ are sampled uniformly from $[0, 1]$ and $P$ is the projection to the sphere. In this example, the projection first consists of finding the center $C$ and maximum spread over a single feature $r$ of the data. That is,
\[
C = \tfrac{1}{2}\Big( \max_j y_j(t_1) + \min_j y_j(t_1), \dots, \max_j y_j(t_{100}) + \min_j y_j(t_{100}) \Big), \qquad r = \max_{t_i}\Big( \max_j y_j(t_i) - \min_j y_j(t_i) \Big). \quad (2.6.16)
\]
Then, we define
\[
P(\circ) = \frac{(\circ - C, r)}{\|(\circ - C, r)\|_2}. \quad (2.6.17)
\]
Our approximation process then looks like:
\[
\begin{pmatrix} p \\ s \end{pmatrix} \approx F_n(y) = \sum_{j=1}^M \begin{pmatrix} p_j \\ s_j \end{pmatrix} \Phi_{n,2}(y \cdot y_j) \Big/ \mathrm{DE}(y), \quad (2.6.18)
\]
where
\[
\mathrm{DE}(y) = \sum_{j=1}^M \Phi_{n,2}(y \cdot y_j). \quad (2.6.19)
\]
Also similar to Example 2.6.2, we use the same notion of SNR and evaluate the success of our model using a combined error, now defined to be
\[
\Big| \frac{p_{\mathrm{true}} - p_{\mathrm{approx}}}{p_{\mathrm{true}}} \Big| + \Big| \frac{s_{\mathrm{true}} - s_{\mathrm{approx}}}{s_{\mathrm{true}}} \Big|. \quad (2.6.20)
\]
In Figure 2.8, we provide some percent point plots from using our method on this data.
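Generating the training pairs $\{y_j, (p_j, s_j)\}$ reduces to solving (2.6.15), which in this one-dimensional setting has a closed form: the ODE rewrites as $y'' - sy' = -p$, with general solution $y(t) = A + Be^{st} + (p/s)t$, and the boundary conditions fix $A$ and $B$. The helper below is our own sketch of this, not the experiments' code.

```python
import numpy as np

def darcy_solution(p, s, t):
    """Closed-form solution of the BVP (2.6.15).

    -d/dt(e^{-st} y') = p e^{-st} reduces to y'' - s y' = -p, so
    y(t) = A + B e^{st} + (p/s) t; y(0) = 1 and y(1) = 0 give
    B = -(1 + p/s)/(e^s - 1) and A = 1 - B.
    """
    t = np.asarray(t, dtype=float)
    B = -(1 + p / s) / np.expm1(s)
    A = 1 - B
    return A + B * np.exp(s * t) + (p / s) * t

# sample y at 100 uniformly random points of [0, 1], as in the text
rng = np.random.default_rng(0)
t = rng.uniform(0, 1, 100)
y = darcy_solution(0.2, 2.0, t)
```

Projecting such vectors via (2.6.16)-(2.6.17) then yields the training inputs $y_j$.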
Each point $(x, y)$ on a curve indicates that $x\%$ of test points were approximated by our method with combined error below $10^y$ for the $n$, $M$, and $\epsilon$ associated with the curve.

Figure 2.8: (Left) Percent point plot of log combined error for various $n$, fixed $M = 2^{13}$, and no noise. (Center) Percent point plot of log combined error for fixed $n = 64$, various $M$, and no noise. (Right) Percent point plot of log combined error for fixed $n = 64$, fixed $M = 2^{13}$, and various noise levels.

We see in the left-most plot that as we increase $n$, the error tends to decrease. In contrast to previous examples, the middle plot does not show much improvement from increasing $M$. This may be an indication that we have chosen a tight parameter space in this example (as compared to Example 2.6.2), so that not many samples are needed to sufficiently cover the space. In the right-most plot, we see a decrease in error with the decrease of noise, as expected, with convergence appearing to occur around the SNR = 70 mark, as indicated by the green and light-blue lines being so close together.

2.7 Proofs

The purpose of this section is to prove Theorem 2.5.1. In Section 2.7.1, we study the approximation properties of the integral reconstruction operator defined in (2.3.3) (Theorem 2.7.1). In Section 2.7.2, we use this theorem with $ff_0$ in place of $f$, and use the Bernstein concentration inequality (Proposition 2.7.1) to discretize the integral expression in (2.3.3) and complete the proof of Theorem 2.5.1.

2.7.1 Integral Reconstruction Operator

In this section, we prove the following theorem, which is an integral analogue of Theorem 2.5.1.

Theorem 2.7.1. Let $0 < \gamma < 2$, $f \in W_\gamma(\mathbb{X})$, and $\sigma_n$ be as defined in (2.3.3). Then for $n \geq 1$, we have
\[
\|f - \sigma_n(\mathbb{X}, f)\|_{\mathbb{X}} \lesssim n^{-\gamma} \|f\|_{W_\gamma(\mathbb{X})}. \quad (2.7.1)
\]
In order to prove this theorem, we will use a covering of $\mathbb{X}$ by balls of radius $\iota^*$, and a corresponding partition of unity.
A key lemma to facilitate the details here is the following.

Lemma 2.7.1. Let $x \in \mathbb{X}$ and let $g \in C(\mathbb{X})$ be supported on $B(x, \iota^*)$. If $G(u) = g(\eta_x(u))$ and $0 < \gamma < 2$, then
\[
\Big| \int_{\mathbb{X}} \Phi_{n,q}(x \cdot y) g(y)\, d\mu^*(y) - \int_{S_x} \Phi_{n,q}(x \cdot u) G(u)\, d\mu^*_{S_x}(u) \Big| \lesssim n^{-\gamma} \|g\|_{\mathbb{X}}. \quad (2.7.2)
\]
If $\phi \in C^\infty(\mathbb{X})$ is supported on $B(x, \iota^*)$, then we may apply this lemma with $g = f\phi$, thereby providing locally a lifting of the integral on $\mathbb{X}$ to the tangent equator $S_x$, with the function corresponding to $g$ on this tangent equator. Naturally, the first step in this proof is to show that the Lebesgue constant for the kernel $\Phi_{n,q}$ is bounded independently of $n$ (cf. (2.7.4)). Moreover, one can even leverage the localization of the kernel to improve on this bound when the integral is taken away from the point $x$ (cf. (2.7.3)). These are both done in the following lemma.

Lemma 2.7.2. Let $r > 0$ and $n \geq 1/r$. If $\Phi_{n,q}$ is given as in (2.4.12) with $S > q$, then
\[
\sup_{x \in \mathbb{X}} \int_{\mathbb{X} \setminus B(x,r)} |\Phi_{n,q}(x \cdot y)|\, d\mu^*(y) \lesssim \max(1, nr)^{q-S}. \quad (2.7.3)
\]
Additionally,
\[
\sup_{x \in \mathbb{X}} \int_{\mathbb{X}} |\Phi_{n,q}(x \cdot y)|\, d\mu^*(y) \lesssim 1. \quad (2.7.4)
\]
Proof. Recall from Proposition B.1 that $\rho(x, y) \sim \arccos(x \cdot y)$, so (2.4.13) implies
\[
|\Phi_{n,q}(x \cdot y)| \lesssim \frac{n^q}{\max(1, n \arccos(x \cdot y))^S} \lesssim \frac{n^q}{\max(1, n\rho(x, y))^S}. \quad (2.7.5)
\]
In this proof only, we fix $x \in \mathbb{X}$. Let $A_0 = B(x, r)$ and $A_k = B(x, 2^k r) \setminus B(x, 2^{k-1} r)$, $k \geq 1$. Then $\mu^*(A_k) \lesssim 2^{kq} r^q$, and for any $y \in A_k$, $2^{k-1} r \leq \rho(x, y) \leq 2^k r$. First, let $nr \geq 1$. In view of (2.5.3) and (2.7.5), it follows that
\[
\int_{\mathbb{X} \setminus B(x,r)} |\Phi_{n,q}(x \cdot y)|\, d\mu^*(y) = \sum_{k=1}^{\infty} \int_{A_k} |\Phi_{n,q}(x \cdot y)|\, d\mu^*(y) \lesssim \sum_{k=1}^{\infty} \mu^*(A_k) \frac{n^q}{(n 2^{k-1} r)^S} \lesssim (nr)^{q-S} \sum_{k=0}^{\infty} 2^{k(q-S)} \lesssim (nr)^{q-S}.
\]
(2.7.6)

Using this estimate with $r = 1/n$ and the condition (2.5.3) on the measures of balls, we see that
\[
\int_{\mathbb{X}} |\Phi_{n,q}(x \cdot y)|\, d\mu^*(y) = \int_{A_0} |\Phi_{n,q}(x \cdot y)|\, d\mu^*(y) + \int_{\mathbb{X} \setminus B(x,r)} |\Phi_{n,q}(x \cdot y)|\, d\mu^*(y) \lesssim 1 + (nr)^{q-S} \sim 1.
\]
Since the choice of $x$ was arbitrary, we have proven (2.7.4). Then (2.7.4) and (2.7.6) combined give the bounds for (2.7.3). ■

Next, we prove Lemma 2.7.1.

Proof of Lemma 2.7.1. Since $\gamma < 2$, we may choose (for sufficiently large $n$)
\[
\delta = n^{-(\gamma+q+1)/(q+3)}, \qquad n\delta = n^{(2-\gamma)/(q+3)} > 1, \qquad S > \frac{2q + 3\gamma}{2 - \gamma}. \quad (2.7.7)
\]
We may assume further that $\delta < \iota^*$. Then, by using (2.4.14) and Proposition B.1, we see that
\[
\big| \Phi_{n,q}(x \cdot \eta_x(u)) - \Phi_{n,q}(x \cdot u) \big| \lesssim n^{q+1} |\arccos(x \cdot \eta_x(u)) - \arccos(x \cdot u)| = n^{q+1} |\arccos(x \cdot \eta_x(u)) - \rho(x, \eta_x(u))| \lesssim n^{q+1} \rho(x, \eta_x(u))^3 \lesssim n^{q+1} \delta^3, \quad (2.7.8)
\]
for any $u \in S_x(\delta)$. Let $g_1, g_2$ be the metric tensors associated with the exponential maps $\varepsilon_x : T_x(\mathbb{X}) \to \mathbb{X}$ and $T_x(\mathbb{X}) \to S_x$, respectively. Then we have the following change of variables formulas (cf. Table 2.1):
\[
\int_{B(x,\delta)} d\mu^*(\varepsilon_x(v)) = \int_{B_T(x,\delta)} \sqrt{|g_1|}\, dv, \qquad \int_{S_x(\delta)} d\mu^*_q(u) = \int_{B_T(x,\delta)} \sqrt{|g_2|}\, d\varepsilon_x^{-1}(u). \quad (2.7.9)
\]
We set $v = \varepsilon_x^{-1}(u)$ and use the fact (cf. (B.5)) that on $B_T(x, \delta)$, $\big|\sqrt{|g_1|} - 1\big| \lesssim \delta^2$ and $\big|\sqrt{|g_2|} - 1\big| \lesssim \delta^2$. Then, by applying Equations (2.7.8), (2.7.9), (2.4.13), and (2.4.14), we can deduce
\[
\Big| \int_{B(x,\delta)} \Phi_{n,q}(x \cdot y) g(y)\, d\mu^*(y) - \int_{S_x(\delta)} \Phi_{n,q}(x \cdot u) G(u)\, d\mu^*_{S_x}(u) \Big| \leq \Big| \int_{B_T(x,\delta)} \Phi_{n,q}(x \cdot \varepsilon_x(v)) g(\varepsilon_x(v)) \big(\sqrt{|g_1|} - \sqrt{|g_2|}\big)\, dv \Big| + \Big| \int_{S_x(\delta)} \big(\Phi_{n,q}(x \cdot \eta_x(u)) - \Phi_{n,q}(x \cdot u)\big) G(u)\, d\mu^*_{S_x}(u) \Big| \lesssim \|g\|_{\mathbb{X}} \big( \delta^{q+2} n^q + \delta^{q+3} n^{q+1} \big) = \delta^{q+3} n^{q+1} \|g\|_{\mathbb{X}} \big( 1/(n\delta) + 1 \big) \lesssim \|g\|_{\mathbb{X}}\, n^{-\gamma}.
\]
(2.7.10)

Now it only remains to examine the terms away from $S_x(\delta)$ and $B(x, \delta)$. Utilizing Lemma 2.7.2 and the fact that $S \geq \frac{2q + 3\gamma}{2 - \gamma}$, we have
\[
\Big| \int_{\mathbb{X} \setminus B(x,\delta)} \Phi_{n,q}(x \cdot y) g(y)\, d\mu^*(y) \Big| \lesssim \|g\|_{\mathbb{X}} (n\delta)^{q-S} = \|g\|_{\mathbb{X}}\, n^{(q-S)(2-\gamma)/(q+3)} \lesssim \|g\|_{\mathbb{X}}\, n^{-\gamma}. \quad (2.7.11)
\]
Similarly, again using Lemma 2.7.2 (with $S_x$ as the manifold) and observing $\|g\|_{\mathbb{X}} = \|G\|_{S_x}$, we can conclude
\[
\Big| \int_{S_x \setminus S_x(\delta)} \Phi_{n,q}(x \cdot u) G(u)\, d\mu^*_{S_x}(u) \Big| \lesssim \|G\|_{S_x} (n\delta)^{q-S} \lesssim \|g\|_{\mathbb{X}}\, n^{-\gamma}, \quad (2.7.12)
\]
completing the proof. ■

We are now in a position to complete the proof of Theorem 2.7.1.

Proof of Theorem 2.7.1. Let $x \in \mathbb{X}$. Choose $\phi \in C^\infty$ such that $0 \leq \phi(y) \leq 1$ for all $y \in \mathbb{X}$, $\phi(y) = 1$ on $B(x, \iota^*/2)$, and $\phi(y) = 0$ on $\mathbb{X} \setminus B(x, \iota^*)$. Then $f\phi$ is supported on $B(x, \iota^*)$, and $F(u) := \phi(\eta_x(u)) f(\eta_x(u))$ belongs to $W_\gamma(S_x)$. We observe that $\|f\|_{\mathbb{X}} \lesssim \|f\|_{W_\gamma(\mathbb{X})}$. By Lemma 2.7.2,
\[
\Big| \int_{\mathbb{X}} \Phi_{n,q}(x \cdot y) f(y)(1 - \phi(y))\, d\mu^*(y) \Big| \leq \|f\|_{\mathbb{X}} \int_{\mathbb{X} \setminus B(x,\iota^*/2)} |\Phi_{n,q}(x \cdot y)|\, d\mu^*(y) \lesssim n^{-\gamma} \|f\|_{W_\gamma(\mathbb{X})}. \quad (2.7.13)
\]
Note that the constant above is chosen to account for the case where $n < 2/\iota^*$. By Lemma 2.7.1,
\[
\Big| \int_{\mathbb{X}} \Phi_{n,q}(x \cdot y) f(y)\phi(y)\, d\mu^*(y) - \sigma_n(S_x, F)(x) \Big| \lesssim n^{-\gamma} \|f\phi\|_{\mathbb{X}} \lesssim n^{-\gamma} \|f\|_{W_\gamma(\mathbb{X})}. \quad (2.7.14)
\]
Observe that since $f(x) = F(x)$ and $\|F\|_{W_\gamma(S_x)} \leq \|f\|_{W_\gamma(\mathbb{X})}$,
\[
|f(x) - \sigma_n(\mathbb{X}, f)(x)| \leq |f(x) - F(x)| + |F(x) - \sigma_n(S_x, F)(x)| + |\sigma_n(S_x, F)(x) - \sigma_n(\mathbb{X}, f)(x)| \lesssim n^{-\gamma} \|F\|_{W_\gamma(S_x)} + \Big| \sigma_n(S_x, F)(x) - \int_{\mathbb{X}} \Phi_{n,q}(x \cdot y) f(y)\phi(y)\, d\mu^*(y) \Big| + \Big| \int_{\mathbb{X}} \Phi_{n,q}(x \cdot y) f(y)(1 - \phi(y))\, d\mu^*(y) \Big| \lesssim n^{-\gamma} \|f\|_{W_\gamma(\mathbb{X})}. \quad (2.7.15)
\]
Since this bound is independent of $x$, the proof is completed.
■

2.7.2 Discretization

In order to complete the proof of Theorem 2.5.1, we need to discretize the integral operator in Theorem 2.7.1 while keeping track of the error. If the manifold were known and we could use the eigendecomposition of the Laplace-Beltrami operator, we could carry out this discretization without losing accuracy, using quadrature formulas (cf., e.g., [85]). In our current setup, it is more natural to use concentration inequalities. We will use the inequality summarized in Proposition 2.7.1 below (cf. [12, Section 2.8]).

Proposition 2.7.1 (Bernstein concentration inequality). Let $Z_1, \dots, Z_M$ be independent real-valued random variables such that for each $j = 1, \dots, M$, $|Z_j| \leq R$ and $\mathbb{E}(Z_j^2) \leq V$. Then for any $t > 0$,
\[
\mathrm{Prob}\Big( \Big| \frac{1}{M} \sum_{j=1}^M \big(Z_j - \mathbb{E}(Z_j)\big) \Big| \geq t \Big) \leq 2 \exp\Big( -\frac{M t^2}{2(V + Rt/3)} \Big). \quad (2.7.16)
\]
In the following, we will set $Z_j(x) = z_j \Phi_{n,q}(x \cdot y_j)$, where $(y_j, z_j)$ are sampled from $\tau$. The following lemma estimates the variance of $Z_j$.

Lemma 2.7.3. With the setup from Theorem 2.5.1, we have
\[
\sup_{x \in \mathbb{S}^Q} \int |z\, \Phi_{n,q}(x \cdot y)|^2\, d\tau(y, z) \lesssim n^q \|z\|^2 \|f_0\|_{\mathbb{X}}. \quad (2.7.17)
\]
Proof. We observe that (2.4.13) and Lemma 2.7.2 imply that
\[
\sup_{x \in \mathbb{X}} \int_{\mathbb{X}} \Phi_{n,q}(x \cdot y)^2\, d\mu^*(y) \lesssim n^q \sup_{x \in \mathbb{X}} \int_{\mathbb{X}} |\Phi_{n,q}(x \cdot y)|\, d\mu^*(y) \lesssim n^q. \quad (2.7.18)
\]
Hence,
\[
\sup_{x \in \mathbb{X}} \int |z(y, \epsilon)\, \Phi_{n,q}(x \cdot y)|^2\, d\tau(y, z) \leq \|z\|^2 \|f_0\|_{\mathbb{X}} \sup_{x \in \mathbb{X}} \int_{\mathbb{X}} \Phi_{n,q}(x \cdot y)^2\, d\mu^*(y) \lesssim n^q \|z\|^2 \|f_0\|_{\mathbb{X}}. \quad (2.7.19)
\]
■

A limitation of the Bernstein concentration inequality is that it only considers a single value of $x$. Since we are interested in supremum-norm bounds, we must first relate the supremum norm of $Z_j$ over all $x \in \mathbb{S}^Q$ to a finite set of points. We set up the connection in the following lemma.

Lemma 2.7.4. Let $\nu$ be any (bounded variation) measure on $\mathbb{X}$.
Then there exists a finite set $\mathcal{C}$ of size $|\mathcal{C}| \sim n^Q$ such that
\[
\Big\| \int_{\mathbb{X}} \Phi_{n,q}(\circ \cdot y)\, d\nu(y) \Big\|_{\mathbb{S}^Q} \leq 2 \max_{x \in \mathcal{C}} \Big| \int_{\mathbb{X}} \Phi_{n,q}(x \cdot y)\, d\nu(y) \Big|. \quad (2.7.20)
\]
Proof. In view of the Bernstein inequality for the derivatives of spherical polynomials, we see that
\[
|P(x) - P(y)| \leq c n \|x - y\|_\infty \|P\|_\infty, \qquad P \in \Pi^Q_n. \quad (2.7.21)
\]
We can see by construction that $\int_{\mathbb{X}} \Phi_{n,q}(t \cdot y)\, d\nu(y)$ is a polynomial of degree $< n$ in the variable $t$. Since $\mathbb{S}^Q$ is a compact space and polynomials of degree $< n$ are continuous functions, there exists some $x^* \in \mathbb{S}^Q$ such that
\[
\Big\| \int_{\mathbb{X}} \Phi_{n,q}(\circ \cdot y)\, d\nu(y) \Big\|_{\mathbb{S}^Q} = \Big| \int_{\mathbb{X}} \Phi_{n,q}(x^* \cdot y)\, d\nu(y) \Big|. \quad (2.7.22)
\]
Let $c$ be the same as in (2.7.21) and $\mathcal{C}$ be a finite set satisfying
\[
\max_{x \in \mathbb{S}^Q} \min_{y \in \mathcal{C}} \|x - y\|_\infty \leq \frac{1}{2cn}. \quad (2.7.23)
\]
Since $\mathbb{S}^Q$ is a compact $Q$-dimensional space, the set $\mathcal{C}$ needs no more than $\sim n^Q$ points. Then there exists some $z^* \in \mathcal{C}$ such that
\[
\Big| \int_{\mathbb{X}} \big(\Phi_{n,q}(x^* \cdot y) - \Phi_{n,q}(z^* \cdot y)\big)\, d\nu(y) \Big| \lesssim n \|x^* - z^*\|_\infty \Big| \int_{\mathbb{X}} \Phi_{n,q}(x^* \cdot y)\, d\nu(y) \Big|, \quad (2.7.24)
\]
which implies (2.7.20). ■

With this preparation, we now state the following theorem, which gives a bound on the difference between our discrete approximation $F_n$ and continuous approximation $\sigma_n$ with high probability.

Theorem 2.7.2. Assume the setup of Theorem 2.5.1. Then for every $n \geq 1$ and $M \gtrsim n^{q+2\gamma} \log(n/\delta)$, we have
\[
\mathrm{Prob}_\tau\Big( \|F_n(\mathcal{D}; \circ) - \sigma_n(\mathbb{X}, ff_0)\|_{\mathbb{S}^Q} \geq c \|z\|\, n^{-\gamma} \sqrt{\|f_0\|_{\mathbb{X}}} \Big) \leq \delta. \quad (2.7.25)
\]
Proof. In this proof only, constants $c, c_1, c_2, \dots$ will maintain their value once used. Let $Z_j(x) = z_j \Phi_{n,q}(x \cdot y_j)$. Since $z$ is integrable with respect to $\tau$, one has the following for any $x \in \mathbb{S}^Q$:
\[
\mathbb{E}_\tau(Z_j(x)) = \int_{\mathbb{X}} \mathbb{E}_\tau(z \mid y)\, \Phi_{n,q}(x \cdot y)\, d\nu^*(y) = \int_{\mathbb{X}} f(y)\, \Phi_{n,q}(x \cdot y)\, f_0(y)\, d\mu^*(y) = \sigma_n(\mathbb{X}, ff_0)(x).
\]
(2.7.26)

We have from (2.4.13) that $|Z_j| \lesssim n^q \|z\|$. Lemma 2.7.3 informs us that $\mathbb{E}_\tau(Z_j^2) \lesssim n^q \|z\|^2 \|f_0\|_{\mathbb{X}}$. Assume $0 < r \leq 1$ and set $t = r \|z\| \|f_0\|_{\mathbb{X}}$. From Proposition 2.7.1, we see
\[
\mathrm{Prob}_\tau\Big( \Big| \frac{1}{M} \sum_{j=1}^M Z_j(x) - \sigma_n(\mathbb{X}, ff_0)(x) \Big| \geq t \Big) \leq 2 \exp\Big( -\frac{c_1 M t^2}{n^q \|z\|^2 \|f_0\|_{\mathbb{X}} + n^q \|z\| t / 3} \Big) \leq 2 \exp\Big( -\frac{c_2 M \|f_0\|_{\mathbb{X}} r^2}{n^q} \Big). \quad (2.7.27)
\]
Let $\delta \in (0, 1/2)$, let $\mathcal{C}$ be a finite set satisfying (2.7.23) with $|\mathcal{C}| \leq c_3 n^Q$ (without loss of generality we assume $c_3 \geq 1$), let
\[
c_4 \geq \frac{\max\big( \log_2(c_3) + 1,\ Q \big)}{c_2}, \quad (2.7.28)
\]
and
\[
M \geq c_4 n^{q+2\gamma} \log(n/\delta). \quad (2.7.29)
\]
We now fix
\[
r \equiv \sqrt{ \frac{c_4 n^q \log(n/\delta)}{M \|f_0\|_{\mathbb{X}}} }. \quad (2.7.30)
\]
Notice that since $\|f_0\|_{\mathbb{X}} \geq 1$, our assumption (2.7.29) on $M$ implies
\[
r \leq \frac{n^{-\gamma}}{\sqrt{\|f_0\|_{\mathbb{X}}}} \leq 1, \quad (2.7.31)
\]
so our choice of $r$ may be substituted into (2.7.27). Further,
\[
r \|z\| \|f_0\|_{\mathbb{X}} \leq c \|z\|\, n^{-\gamma} \sqrt{\|f_0\|_{\mathbb{X}}}. \quad (2.7.32)
\]
With this preparation, we can conclude
\[
\mathrm{Prob}_\tau\Big( \|F_n(\mathcal{D}; \circ) - \sigma_n(\mathbb{X}, ff_0)\|_{\mathbb{S}^Q} \geq c \|z\|\, n^{-\gamma} \sqrt{\|f_0\|_{\mathbb{X}}} \Big) \leq \mathrm{Prob}_\tau\Big( \Big\| \frac{1}{M} \sum_{j=1}^M Z_j - \sigma_n(\mathbb{X}, ff_0) \Big\|_{\mathbb{S}^Q} \geq r \|z\| \|f_0\|_{\mathbb{X}} \Big) \quad \text{(from (2.7.32))}
\]
\[
\leq \mathrm{Prob}_\tau\Big( \max_{x_k \in \mathcal{C}} \Big| \frac{1}{M} \sum_{j=1}^M Z_j(x_k) - \sigma_n(\mathbb{X}, ff_0)(x_k) \Big| \geq t \Big) \quad \text{(by Lemma 2.7.4)}
\]
\[
\leq \sum_{k=1}^{|\mathcal{C}|} \mathrm{Prob}_\tau\Big( \Big| \frac{1}{M} \sum_{j=1}^M Z_j(x_k) - \sigma_n(\mathbb{X}, ff_0)(x_k) \Big| \geq t \Big) \leq |\mathcal{C}| \exp\Big( -\frac{c_2 M \|f_0\|_{\mathbb{X}} r^2}{n^q} \Big) \quad \text{(from (2.7.27))}
\]
\[
\leq c_3 n^{Q - c_2 c_4}\, \delta^{c_2 c_4} \quad \text{(from (2.7.30))} \ \leq\ c_3 \big(\tfrac{1}{2}\big)^{c_2 c_4 - 1} \delta \ \leq\ \delta \quad \text{(from (2.7.28) and } \delta < 1/2\text{)}. \quad (2.7.33)
\]
■

We are now ready for the proof of Theorem 2.5.1.

Proof of Theorem 2.5.1 (and Corollaries 2.5.1 and 2.5.2). Since $f, f_0 \in W_\gamma(\mathbb{X})$, we can determine that $ff_0 \in W_\gamma(\mathbb{X})$ as well.
Utilizing Theorem 2.7.1 with $ff_0$ and Theorem 2.7.2, we obtain with probability at least $1 - \delta$ that
\[
\|F_n(\mathcal{D}; \circ) - ff_0\|_{\mathbb{X}} \leq \|F_n(\mathcal{D}; \circ) - \sigma_n(\mathbb{X}; ff_0)\|_{\mathbb{X}} + \|\sigma_n(\mathbb{X}; ff_0) - ff_0\|_{\mathbb{X}} \lesssim \frac{ \sqrt{\|f_0\|_{\mathbb{X}}}\, \|z\| + \|ff_0\|_{W_\gamma(\mathbb{X})} }{ n^{\gamma} }. \quad (2.7.34)
\]
Corollary 2.5.1 follows immediately by setting $f_0 = 1$. Corollary 2.5.2 follows from setting $z = 1$ and then observing that $f = 1$ and $\sqrt{\|f_0\|_{\mathbb{X}}} \lesssim \|f_0\|_{W_\gamma(\mathbb{X})}$. ■

Chapter 3

Local Transfer Learning

The content in this part is sourced from our paper published in Inverse Problems, Regularization Methods and Related Topics: A Volume in Honour of Thamban Nair, titled "Local transfer learning from one data space to another" [89].

3.1 Introduction

The problem of transfer learning is to learn the parameters of an approximation process based on one data set and leverage this information to aid in the determination of an approximation process on another data set [108, 76, 78]. An overview of transfer learning can be found in [58]. Since training large (by today's standards) neural networks requires computing power simply unavailable to most researchers, the idea of transfer learning may be appealing to many wishing to use neural network models on new problems. While training a neural network from scratch on a new problem is not feasible in many cases, tuning a pre-trained network for a problem similar to the one it was trained on is doable and often yields better results anyway. Of course, this leads to some major questions, such as:

• How does one identify which parameters for one problem are important for another?
• What if the new problem has a feature (or several) unlike any from the pre-trained model?
• How do we interpret the outputs of the model after it has been tuned for the new problem?

Since these questions are very broad and remain open in many settings, transfer learning is an exciting and active area of research.
In the context of manifold learning, a data set (point cloud) determines a manifold, so that different data sets would correspond to different manifolds. In the context of data spaces, we can therefore interpret transfer learning as "lifting" a function from one data space (the base data space) to another (the target data space). This viewpoint allows us to unify the topic of transfer learning with the study of some inverse problems in image/signal processing. For example, the problem of synthetic aperture radar (SAR) imaging can be described in terms of an inverse Radon transform [109, 16, 105]. The domain and range of the Radon transform are different, and hence the problem amounts to approximating the actual image on one domain based on observations of its Radon transform, which are located on a different domain. Another application is in analyzing hyperspectral images changing with time [28]. A similar problem arises in analyzing the progress of Alzheimer's disease from MRI images of the brain taken over time, where one is interested in the development of the cortical thickness as a function on the surface of the brain, a manifold which is changing over time [62]. Motivated by these applications and the paper [28] of Coifman and Hirn, the question of lifting a function from one data space to another, when certain landmarks from one data space are identified with those on the other data space, was studied in [95]. For example, it is known [66] that in spite of the changing brain, one can think of each brain as being parametrized by an inner sphere, and the cortical thickness at certain standard points based on this parametrization is important in the prognosis of the disease. In [95] we investigated certain conditions on the two data spaces which allow the lifting of a function from one to the other, and analyzed the effect on the smoothness of the function as it is lifted.
In many applications, the data about the function is available only on a part of the base data space. The novel part of this chapter is to investigate the following questions of interest: (1) determine on what subsets of the target data space the lifting is defined, and (2) determine how the local smoothness on the base data space translates into the local smoothness of the lifted function. In limited angle tomography, one observes the Radon transform on a limited part of a cylinder and needs to reconstruct the image as a function on a ball from this data. A rudimentary introduction to the subject is given in the book of Natterer [107]. We do not aim to solve the limited angle tomography problem itself, but we will study in detail an example motivated by the singular value decomposition of the Radon transform, which involves two different systems of orthogonal polynomials on the interval $[-1, 1]$. We are given the coefficients in the expansion of a function $f$ on $[-1, 1]$ in terms of Jacobi polynomials with certain parameters (the base space expansion in our language), and use them as the coefficients in an expansion in terms of Jacobi polynomials with respect to a different set of parameters (the target space in our language). Under what conditions on $f$ and the parameters of the two Jacobi polynomial systems will the expansion in the target space converge, and in which $L^p$ spaces? We will illustrate our general theory by obtaining a localized transplantation theorem for uniform approximation. In Section 3.2, we review certain important results in the context of a single data space (our abstraction of a manifold). In particular, we present a characterization of local approximation of functions on such spaces. In Section 3.3, we review the notion of joint spaces (introduced under a different name in [95]).
The main new result of our chapter is to study the lifting of a function from a subset (typically, a ball) of one data space to another. These results are discussed in Section 3.4. The proofs are given in Section 3.5. An essential ingredient in our constructions is the notion of localized kernels which, in turn, depend upon a Tauberian theorem. For the convenience of the reader, this theorem is presented in Appendix A. Appendix C lists some important properties of Jacobi polynomials which are required in our examples.

3.2 Data Spaces

As mentioned in the introduction, a good deal of research on manifold learning is devoted to the question of learning the geometry of the manifold. For the purpose of harmonic analysis and approximation theory on the manifold, we do not need the full strength of the differentiability structure on the manifold. Our own understanding of the correct hypotheses required to study these questions has evolved, resulting in a plethora of terminology such as data defined manifolds, admissible systems, data defined spaces, etc., culminating in our current understanding with the definition of a data space given in [85]. For the sake of simplicity, we will restrict our attention in this chapter to the case of compact spaces. We do not expect any serious problems in extending the theory to the general case, except for a great deal of technical detail. Thus, the setup is the following. We consider a compact metric measure space $\mathbb{X}$ with metric $d$ and a probability measure $\mu^*$. We take $\{\lambda_k\}_{k=0}^\infty$ to be a non-decreasing sequence of real numbers with $\lambda_0 = 0$ and $\lambda_k \to \infty$ as $k \to \infty$, and $\{\phi_k\}_{k=0}^\infty$ to be an orthonormal set in $L^2(\mu^*)$. We assume that each $\phi_k$ is continuous. The elements of the space
\[
\Pi_n = \mathrm{span}\{\phi_k : \lambda_k < n\} \quad (3.2.1)
\]
are called diffusion polynomials (of order $< n$). We write $\Pi_\infty = \bigcup_{n > 0} \Pi_n$. We introduce the following notation.
\[
B(x, r) = \{y \in \mathbb{X} : d(x, y) \leq r\}, \qquad x \in \mathbb{X},\ r > 0. \quad (3.2.2)
\]
If $A \subseteq \mathbb{X}$, we define
\[
B(A, r) = \bigcup_{x \in A} B(x, r). \quad (3.2.3)
\]
With this setup, the definition of a compact data space is the following.

Definition 3.2.1. The tuple $\Xi = (\mathbb{X}, d, \mu^*, \{\lambda_k\}_{k=0}^\infty, \{\phi_k\}_{k=0}^\infty)$ is called a (compact) data space if each of the following conditions is satisfied.

1. For each $x \in \mathbb{X}$ and $r > 0$, $B(x, r)$ is compact.
2. (Ball measure condition) There exist $q \geq 1$ and $\kappa > 0$ with the following property: for each $x \in \mathbb{X}$, $r > 0$,
\[
\mu^*(B(x, r)) = \mu^*(\{y \in \mathbb{X} : d(x, y) < r\}) \leq \kappa r^q. \quad (3.2.4)
\]
(In particular, $\mu^*(\{y \in \mathbb{X} : d(x, y) = r\}) = 0$.)
3. (Gaussian upper bound) There exist $\kappa_1, \kappa_2 > 0$ such that for all $x, y \in \mathbb{X}$, $0 < t \leq 1$,
\[
\Big| \sum_{k=0}^\infty \exp(-\lambda_k^2 t)\, \phi_k(x) \phi_k(y) \Big| \leq \kappa_1 t^{-q/2} \exp\Big( -\kappa_2 \frac{d(x, y)^2}{t} \Big). \quad (3.2.5)
\]
We refer to $q$ as the exponent for $\Xi$.

The primary example of a data space is, of course, a Riemannian manifold.

Example 3.2.1. Let $\mathbb{X}$ be a smooth, compact, connected Riemannian manifold (without boundary), $d$ be the geodesic distance on $\mathbb{X}$, $\mu^*$ be the Riemannian volume measure normalized to be a probability measure, $\{\lambda_k\}$ be the sequence of eigenvalues of the (negative) Laplace-Beltrami operator on $\mathbb{X}$, and $\phi_k$ be the eigenfunction corresponding to the eigenvalue $\lambda_k$; in particular, $\phi_0 \equiv 1$. It has been proven in [85, Appendix A] that the Gaussian upper bound is satisfied. Therefore, if the condition in Equation (3.2.4) is satisfied, then $(\mathbb{X}, d, \mu^*, \{\lambda_k\}_{k=0}^\infty, \{\phi_k\}_{k=0}^\infty)$ is a data space with exponent equal to the dimension of the manifold. ■

Remark 3.2.1. In [46], Friedman and Tillich give a construction for an orthonormal system on a graph which leads to a finite speed of wave propagation. It is shown in [43] that this, in turn, implies the Gaussian upper bound.
Therefore, it is an interesting question whether appropriate definitions of measures and distances can be given on a graph so as to satisfy the assumptions of a data space. ■

Example 3.2.2. In this example, we let $\mathbb{X} = [0, \pi]$ and for $\theta_1, \theta_2 \in \mathbb{X}$ we simply define the distance as
\[
d(\theta_1, \theta_2) = |\theta_1 - \theta_2|. \quad (3.2.6)
\]
We will consider the so-called trigonometric functions:
\[
\phi^{(\alpha,\beta)}_n(\theta) = (1 - \cos\theta)^{\alpha/2 + 1/4} (1 + \cos\theta)^{\beta/2 + 1/4}\, p^{(\alpha,\beta)}_n(\cos\theta), \quad (3.2.7)
\]
where the $p^{(\alpha,\beta)}_n$ are orthonormalized Jacobi polynomials defined as in Appendix C and $\alpha, \beta \geq -1/2$. We define
\[
d\mu^*(\theta) = \frac{1}{\pi}\, d\theta. \quad (3.2.8)
\]
We see that a change of variables $x = \cos\theta$ in Equation (C.11) results in the following orthogonality condition:
\[
\int_0^\pi \phi^{(\alpha,\beta)}_n(\theta)\, \phi^{(\alpha,\beta)}_m(\theta)\, d\theta = \delta_{n,m}. \quad (3.2.9)
\]
So our orthonormal set of functions with respect to $\mu^*$ will be $\{\sqrt{\pi}\, \phi^{(\alpha,\beta)}_n\}$. It was proven in [110, Theorem A, in view of Equation (3)] that with
\[
\lambda_n = n + \frac{\alpha + \beta + 1}{2}, \quad (3.2.10)
\]
we have
\[
\pi \sum_{n=0}^\infty \exp(-\lambda_n^2 t)\, \phi^{(\alpha,\beta)}_n(\theta_1)\, \phi^{(\alpha,\beta)}_n(\theta_2) \lesssim t^{-1/2} \exp\Big( -c \frac{d(\theta_1, \theta_2)^2}{t} \Big), \qquad \theta_1, \theta_2 \in \mathbb{X}. \quad (3.2.11)
\]
In conclusion,
\[
\Xi = \big(\mathbb{X}, d, \mu^*, \{\lambda_n\}, \{\sqrt{\pi}\, \phi^{(\alpha,\beta)}_n\}\big) \quad (3.2.12)
\]
is a data space with exponent 1. ■

The following example illustrates how a manifold with boundary can be transformed into a closed manifold as in Example 3.2.1. We will use the notation and facts from Appendix C without always referring to them explicitly. We adopt the notation
\[
\mathbb{S}^q = \{x \in \mathbb{R}^{q+1} : |x| = 1\}, \qquad \mathbb{S}^q_+ = \{x \in \mathbb{S}^q : x_{q+1} \geq 0\}. \quad (3.2.13)
\]
Example 3.2.3. Let $\mu^*_q$ denote the volume measure of $\mathbb{S}^q$, normalized to be a probability measure. Let $H^q_n$ be the space of restrictions to $\mathbb{S}^q$ of homogeneous harmonic polynomials of degree $n$ in $q + 1$ variables, and $\{Y_{n,k}\}_k$ be an orthonormal (with respect to $\mu^*_q$) basis for $H^q_n$.
The polynomials $Y_{n,k}$ are eigenfunctions of the Laplace-Beltrami operator on the manifold $\mathbb{S}^q$ with eigenvalues $n(n + q - 1)$. The geodesic distance between $\xi, \eta \in \mathbb{S}^q$ is $\arccos(\xi \cdot \eta)$, so the Gaussian upper bound for manifolds takes the form
\[
\sum_{n,k} \exp(-n(n + q - 1)t)\, Y_{n,k}(x) Y_{n,k}(y) \lesssim t^{-q/2} \exp\Big( -c \frac{(\arccos(x \cdot y))^2}{t} \Big). \quad (3.2.14)
\]
As a result, $(\mathbb{S}^q, \arccos(\circ \cdot \circ), \mu^*_q, \{\lambda_n\}_n, \{Y_{n,k}\}_{n,k})$ is a data space with exponent $q$.

Now we consider $\mathbb{X} = \mathbb{B}^q = \{x \in \mathbb{R}^q : |x| \leq 1\}$. We can identify $\mathbb{B}^q$ with $\mathbb{S}^q_+$ as follows. Any point $x \in \mathbb{B}^q$ has the form $x = \omega \sin\theta$ for some $\omega \in \mathbb{S}^{q-1}$, $\theta \in [0, \pi/2]$. We write $\hat{x} = (\omega \sin\theta, \cos\theta) \in \mathbb{S}^q_+$. With this identification, $\mathbb{S}^q_+$ is parameterized by $\mathbb{B}^q$, and we define
\[
d\mu^*(x) = d\mu^*_q(\hat{x}) = \frac{\mathrm{Vol}(\mathbb{B}^q)}{\mathrm{Vol}(\mathbb{S}^q_+)} (1 - |x|^2)^{-1/2}\, dm^*(x) = \frac{\Gamma((q+1)/2)}{\sqrt{\pi}\, \Gamma(q/2 + 1)} (1 - |x|^2)^{-1/2}\, dm^*(x), \quad (3.2.15)
\]
where $\mu^*_q$ is the probability volume measure on $\mathbb{S}^q_+$, and $m^*$ is the probability volume measure on $\mathbb{B}^q$. It is also convenient to define the distance on $\mathbb{B}^q$ by
\[
d(x_1, x_2) = \arccos(\hat{x}_1 \cdot \hat{x}_2) = \arccos\Big( x_1 \cdot x_2 + \sqrt{1 - |x_1|^2}\, \sqrt{1 - |x_2|^2} \Big). \quad (3.2.16)
\]
All spherical harmonics of degree $2n$ are even functions on $\mathbb{S}^q$. So, with the identification of measures as above, one can represent the even spherical harmonics as an orthonormal system of functions on $\mathbb{B}^q$. That is, by defining
\[
P_{2n,k}(x) = \sqrt{2}\, Y_{2n,k}(\hat{x}), \quad (3.2.17)
\]
we have
\[
\int_{\mathbb{B}^q} P_{2n,k}(x) P_{2n',k'}(x)\, d\mu^*(x) = 2 \int_{\mathbb{S}^q_+} Y_{2n,k}(\hat{x}) Y_{2n',k'}(\hat{x})\, d\mu^*_q(\hat{x}) = \int_{\mathbb{S}^q} Y_{2n,k}(\xi) Y_{2n',k'}(\xi)\, d\mu^*_q(\xi) = \delta_{(n,k),(n',k')}.
\]
(3.2.18)

To show the Gaussian upper bound for the $P_{2n,k}$ on $\mathbb{B}^q$, we first see that, in view of the summation formula (2.4.5) (from Chapter 2) and (C.13), we have
\[
\sum_{k=1}^{\dim(H^q_{2n})} P_{2n,k}(x) P_{2n,k}(y) = \sum_{k=1}^{\dim(H^q_{2n})} Y_{2n,k}(\hat{x}) Y_{2n,k}(\hat{y}) = \frac{\omega_q}{\omega_{q-1}}\, p^{(q/2-1,\, q/2-1)}_{2n}(1)\, p^{(q/2-1,\, q/2-1)}_{2n}(\hat{x} \cdot \hat{y}) = \frac{\omega_q}{\omega_{q-1}}\, 2^{(q-1)/2}\, p^{(q/2-1,\, -1/2)}_n(1)\, p^{(q/2-1,\, -1/2)}_n\big(\cos(2\arccos(\hat{x} \cdot \hat{y}))\big). \quad (3.2.19)
\]
In light of Equation (C.10), we define
\[
\lambda_n = \sqrt{n(n + q/2 - 1/2)}, \quad (3.2.20)
\]
which is conveniently not dependent upon $k$. Using (C.15), we see that for $t > 0$,
\[
\sum_{n=0}^\infty \sum_{k=1}^{\dim(H^q_{2n})} \exp(-\lambda_n^2 t)\, P_{2n,k}(x) P_{2n,k}(y) \sim \sum_{n=0}^\infty \exp(-n(n + q/2 - 1/2)t)\, p^{(q/2-1,\, -1/2)}_n(1)\, p^{(q/2-1,\, -1/2)}_n\big(\cos(2\arccos(\hat{x} \cdot \hat{y}))\big) \lesssim t^{-q/2} \exp\Big( -4c \frac{\arccos(\hat{x} \cdot \hat{y})^2}{t} \Big). \quad (3.2.21)
\]
Therefore, $(\mathbb{B}^q, d, \mu^*, \{\lambda_n\}_n, \{P_{2n,k}\}_{n,k})$ is a data space with exponent $q$. ■

In this section, we will assume $\Xi$ to be a fixed data space and omit its mention from the notation. We will mention it later in other parts of the chapter in order to avoid confusion. Next, we define smoothness classes of functions on $\mathbb{X}$. In the absence of any differentiability structure, we do this in a manner that is customary in approximation theory. We first define the degree of approximation of a function $f \in L^p(\mu^*)$ by
\[
E_n(p, f) = \min_{P \in \Pi_n} \|f - P\|_{p,\mu^*}, \qquad n > 0,\ 1 \leq p \leq \infty,\ f \in L^p(\mu^*). \quad (3.2.22)
\]
We find it convenient to denote by $X^p$ the space $\{f \in L^p(\mu^*) : \lim_{n \to \infty} E_n(p, f) = 0\}$; e.g., in the manifold case, $X^p = L^p(\mu^*)$ if $1 \leq p < \infty$ and $X^\infty = C(\mathbb{X})$. In the case of Example 3.2.3, we need to restrict ourselves to even functions.

Definition 3.2.2. Let $1 \leq p \leq \infty$, $\gamma > 0$.
(a) For $f\in X^p$, we define
\[
\|f\|_{W_{\gamma,p}}=\|f\|_{p,\mu^*}+\sup_{n>0}n^{\gamma}E_n(p,f), \tag{3.2.23}
\]
and note that
\[
\|f\|_{W_{\gamma,p}}\sim\|f\|_{p,\mu^*}+\sup_{n\in\mathbb{Z}_+}2^{n\gamma}E_{2^n}(p,f). \tag{3.2.24}
\]
The space $W_{\gamma,p}$ comprises all $f$ for which $\|f\|_{W_{\gamma,p}}<\infty$.

(b) We write $C^\infty=\bigcap_{\gamma>0}W_{\gamma,\infty}$. If $B$ is a ball in $X$, $C^\infty(B)$ comprises functions $f\in C^\infty$ which are supported on $B$.

(c) If $x_0\in X$, the space $W_{\gamma,p}(x_0)$ comprises functions $f$ such that there exists $r>0$ with the property that for every $\phi\in C^\infty(B(x_0,r))$, $\phi f\in W_{\gamma,p}$. If $A\subset X$, the space $W_{\gamma,p}(A)=\bigcap_{x_0\in A}W_{\gamma,p}(x_0)$; i.e., $W_{\gamma,p}(A)$ comprises functions which are in $W_{\gamma,p}(x_0)$ for each $x_0\in A$.

A central theme in approximation theory is to characterize the smoothness spaces $W_{\gamma,p}$ in terms of the degree of approximation from some spaces; in our case we consider the $\Pi_n$'s. For this purpose, we define some localized kernels and operators. The kernels are defined by
\[
\Phi_n(H;x,y)=\sum_{m=0}^{\infty}H\!\left(\frac{\lambda_m}{n}\right)\phi_m(x)\phi_m(y), \tag{3.2.25}
\]
where $H:\mathbb{R}\to\mathbb{R}$ is a compactly supported function. The operators corresponding to the kernels $\Phi_n$ are defined by
\[
\sigma_n(H;f,x)=\int_X\Phi_n(H;x,y)f(y)\,d\mu^*(y)=\sum_{k:\lambda_k<n}H\!\left(\frac{\lambda_k}{n}\right)\hat{f}(k)\phi_k(x), \tag{3.2.26}
\]
where $\hat{f}(k)=\int_X f(y)\phi_k(y)\,d\mu^*(y)$.

These kernels are strongly localized. If $S>q+1$ is an integer and $H:\mathbb{R}\to\mathbb{R}$ is an even, $S$ times continuously differentiable, compactly supported function, then for every $x,y\in X$ and $N>0$,
\[
|\Phi_N(H;x,y)|\lesssim\frac{N^q}{\max(1,(Nd(x,y))^S)}, \tag{3.2.28}
\]
where the constant may depend upon $H$ and $S$, but not on $N$, $x$, or $y$.

In the remainder of this chapter, we fix a filter $h$; i.e., an infinitely differentiable function $h:[0,\infty)\to[0,1]$ such that $h(t)=1$ for $0\le t\le 1/2$ and $h(t)=0$ for $t\ge 1$. The domain of the filter $h$ can be extended to $\mathbb{R}$ by setting $h(-t)=h(t)$. Since $h$ is fixed, its mention will be omitted from the notation unless we feel that this would cause confusion.
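To make (3.2.25) concrete, the sketch below evaluates a localized kernel in the simplest trigonometric setting: the circle with $\phi_k(x)=e^{ikx}/\sqrt{2\pi}$ and $\lambda_k=|k|$ (our own illustrative choice of data space, not one fixed by the text; the smooth filter implemented here is one convenient bump-function construction satisfying the stated support properties).

```python
import numpy as np

def h(t):
    # An infinitely differentiable filter: h(t) = 1 on [0, 1/2], h(t) = 0 on [1, inf),
    # with a standard bump-function transition in between.
    t = abs(t)
    if t <= 0.5:
        return 1.0
    if t >= 1.0:
        return 0.0
    psi = lambda u: np.exp(-1.0 / u)   # C-infinity, vanishes to all orders at 0+
    s = 2.0 * (t - 0.5)                # rescale (1/2, 1) to (0, 1)
    return psi(1.0 - s) / (psi(1.0 - s) + psi(s))

def Phi(n, x, y):
    # Localized kernel (3.2.25) on the circle: phi_k = e^{ikx}/sqrt(2*pi), lambda_k = |k|,
    # so Phi_n(h; x, y) = (1/(2*pi)) * sum_{|k| < n} h(|k|/n) e^{ik(x - y)}.
    ks = np.arange(-n + 1, n)
    total = sum(h(abs(k) / n) * np.exp(1j * k * (x - y)) for k in ks)
    return float(np.real(total)) / (2.0 * np.pi)

# Localization: the kernel is large on the diagonal and tiny far from it,
# consistent with the fast decay in (3.2.28).
on_diag, off_diag = Phi(64, 0.0, 0.0), Phi(64, 0.0, 2.0)
assert on_diag > 100.0 * abs(off_diag)
```

Because this $h$ is $C^\infty$, the off-diagonal values decay faster than any fixed power of $Nd(x,y)$, which is the content of the bound (3.2.28).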
The following theorem gives a crucial property of the operators, proved in several papers in different contexts; see [85, Theorem 5.1] for a recent proof.

Theorem 3.2.1 ([85, Theorem 5.1]). Let $n>0$. If $P\in\Pi_{n/2}$, then $\sigma_n(P)=P$. Also, for any $p$ with $1\le p\le\infty$,
\[
\|\sigma_n(f)\|_p\lesssim\|f\|_p,\qquad f\in L^p. \tag{3.2.29}
\]
If $1\le p\le\infty$ and $f\in L^p(X)$, then
\[
E_n(p,f)\le\|f-\sigma_n(f)\|_{p,\mu^*}\lesssim E_{n/2}(p,f). \tag{3.2.30}
\]

While Theorem 3.2.1 gives, in particular, a characterization of the global smoothness spaces $W_{\gamma,p}$, the characterization of local smoothness requires two more assumptions: the partition of unity and the product assumption.

Definition 3.2.3 (Partition of unity). We say that a set $X$ has a partition of unity if for every $r>0$ there exists a countable family $\mathcal{F}_r=\{\psi_{k,r}\}_{k=0}^{\infty}$ of $C^\infty$ functions with the following properties:
1. Each $\psi_{k,r}\in\mathcal{F}_r$ is supported on $B(x_k,r)$ for some $x_k\in X$.
2. For every $\psi_{k,r}\in\mathcal{F}_r$ and $x\in X$, $0\le\psi_{k,r}(x)\le 1$.
3. For every $x\in X$ there exists a finite subset $\mathcal{F}_r(x)\subseteq\mathcal{F}_r$ (with cardinality bounded independently of $x$) such that for all $y\in B(x,r)$,
\[
\sum_{\psi_{k,r}\in\mathcal{F}_r(x)}\psi_{k,r}(y)=1. \tag{3.2.31}
\]

Definition 3.2.4 (Product assumption). We say that a data space $\Xi$ satisfies the product assumption if there exist $A^*\ge 1$ and a family $\{R_{j,k,n}\in\Pi_{A^*n}\}$ such that for every $S>0$,
\[
\lim_{n\to\infty}n^S\max_{\lambda_k,\lambda_j<n}\|\phi_k\phi_j-R_{j,k,n}\|_\infty=0. \tag{3.2.32}
\]
If for every $n>0$ and $P,Q\in\Pi_n$ we have $PQ\in\Pi_{A^*n}$, then we say that $\Xi$ satisfies the strong product assumption.

In the most important manifold case, the partition of unity assumption is always satisfied [14, Chapter 0, Theorem 5.6]. It is shown in [50, 44] that the strong product assumption is satisfied if the $\phi_k$'s are eigenfunctions of certain differential equations on a Riemannian manifold and the $\lambda_k$'s are the corresponding eigenvalues. We do not know of any example where this property does not hold, yet cannot prove that it holds in general.
Hence, we have listed it as an assumption. Our characterization of local smoothness ([42, 81, 85]) is the following.

Theorem 3.2.2. Let $1\le p\le\infty$, $\gamma>0$, $f\in X^p$, $x_0\in X$. We assume the partition of unity and the product assumption. Then the following are equivalent.
(a) $f\in W_{\gamma,p}(x_0)$.
(b) There exists a ball $B$ centered at $x_0$ such that
\[
\sup_{n\ge 0}2^{n\gamma}\|f-\sigma_{2^n}(f)\|_{p,\mu^*,B}<\infty. \tag{3.2.33}
\]

A direct corollary is the following.

Corollary 3.2.1. Let $1\le p\le\infty$, $\gamma>0$, $f\in X^p$, and let $A$ be a compact subset of $X$. We assume the partition of unity and the product assumption. Then the following are equivalent.
(a) $f\in W_{\gamma,p}(A)$.
(b) There exists $r>0$ such that
\[
\sup_{n\ge 0}2^{n\gamma}\|f-\sigma_{2^n}(f)\|_{p,\mu^*,B(A,r)}<\infty. \tag{3.2.34}
\]

3.3 Joint Data Spaces

In order to motivate our definitions in this section, we first consider a couple of examples.

Example 3.3.1. Let $\Xi_j=(X_j,d_j,\mu_j^*,\{\lambda_{j,k}\}_{k=0}^{\infty},\{\phi_{j,k}\}_{k=0}^{\infty})$, $j=1,2$, be two data spaces with exponent $q$. We denote the heat kernel in each case by
\[
K_{j,t}(x,y)=\sum_{k=0}^{\infty}\exp(-\lambda_{j,k}^2t)\,\phi_{j,k}(x)\phi_{j,k}(y),\qquad j=1,2,\ x,y\in X,\ t>0.
\]
In the paper [28], Coifman and Hirn assumed that $X_1=X_2=X$, $\mu_1^*=\mu_2^*=\mu^*$, and proposed the diffusion distance between points $x_1,x_2$ to be the square root of
\[
K_{1,2t}(x_1,x_1)+K_{2,2t}(x_2,x_2)-2\int_X K_{1,t}(x_1,y)K_{2,t}(y,x_2)\,d\mu^*(y).
\]
Writing, in this example only,
\[
A_{j,k}=\int_X\phi_{1,j}(y)\phi_{2,k}(y)\,d\mu^*(y), \tag{3.3.1}
\]
we get
\[
\int_X K_{1,t}(x_1,y)K_{2,t}(y,x_2)\,d\mu^*(y)=\sum_{j,k}\exp\!\left(-(\lambda_{1,j}^2+\lambda_{2,k}^2)t\right)A_{j,k}\,\phi_{1,j}(x_1)\phi_{2,k}(x_2). \tag{3.3.2}
\]
Furthermore, the Gaussian upper bound conditions imply that
\[
\int_X K_{1,t}(x_1,y)K_{2,t}(y,x_2)\,d\mu^*(y)\lesssim t^{-q}\int_X\exp\!\left(-c\,\frac{d_1(x_1,y)^2+d_2(y,x_2)^2}{t}\right)d\mu^*(y)\lesssim t^{-q}\exp\!\left(-c\,\frac{\left(\min_{y\in X}(d_1(x_1,y)+d_2(y,x_2))\right)^2}{t}\right). \tag{3.3.3}
\]
Writing, in this example only,
\[
d_{1,2}(x_1,x_2)=\min_{y\in X}\left(d_1(x_1,y)+d_2(y,x_2)\right)=d_{2,1}(x_2,x_1),
\]
we observe that for any $x_1,x_1',x_2,x_2'\in X$,
\[
d_{1,2}(x_1,x_2)\le d_{1,2}(x_1',x_2)+d_1(x_1,x_1'),\qquad d_{1,2}(x_1,x_2)\le d_{1,2}(x_1,x_2')+d_2(x_2,x_2').
\]
■

Example 3.3.2. In this example we let $\alpha_i,\beta_i\ge -1/2$ for $i=1,2$ and assume that $a=|\alpha_1-\alpha_2|/2$ and $b=|\beta_1-\beta_2|/2$ are in $\mathbb{N}$. Then we select the following two data spaces as defined in Example 3.2.2:
\[
\Xi_i=\left([0,\pi],d_i,\tfrac{1}{\pi}\,d\theta,\{\lambda_{i,n}\},\{\sqrt{\pi}\,\phi^{(\alpha_i,\beta_i)}_n\}\right). \tag{3.3.4}
\]
Since both spaces already have the same distance, we define the joint distance for the systems accordingly:
\[
d_{1,2}(\theta_1,\theta_2)=d_1(\theta_1,\theta_2)=d_2(\theta_1,\theta_2)=|\theta_1-\theta_2|. \tag{3.3.5}
\]
Similarly to Example 3.3.1 above, we are considering two data spaces with the same underlying space and measure. However, we now proceed in a different manner. Let us denote
\[
\Omega(\theta)=(1-\cos\theta)^a(1+\cos\theta)^b. \tag{3.3.6}
\]
Let $\alpha=\max(\alpha_1,\alpha_2)$ and $\beta=\max(\beta_1,\beta_2)$. Then we define
\[
A_{m,n}=\int_0^{\pi}\phi^{(\alpha_1,\beta_1)}_m(\theta)\,\phi^{(\alpha_2,\beta_2)}_n(\theta)\,\Omega(\theta)\,d\theta=\int_0^{\pi}p^{(\alpha_1,\beta_1)}_m(\cos\theta)\,p^{(\alpha_2,\beta_2)}_n(\cos\theta)\,(1-\cos\theta)^{\alpha+1/2}(1+\cos\theta)^{\beta+1/2}\,d\theta. \tag{3.3.7}
\]
The orthogonality of the Jacobi polynomials tells us that $A_{m,n}=0$ at least when $m>n+2a+2b$ or $n>m+2a+2b$. Furthermore, we have the following two sums:
\[
\sum_n A_{m,n}\,\phi^{(\alpha_2,\beta_2)}_n(\theta)=\Omega(\theta)\,\phi^{(\alpha_1,\beta_1)}_m(\theta),\qquad\sum_m A_{m,n}\,\phi^{(\alpha_1,\beta_1)}_m(\theta)=\Omega(\theta)\,\phi^{(\alpha_2,\beta_2)}_n(\theta). \tag{3.3.8}
\]
We define $\ell_{m,n}=\sqrt{\lambda_{1,m}^2+\lambda_{2,n}^2}$ and utilize the Gaussian upper bound property for the $\Xi_i$ and Equation (3.3.8) to deduce, as in Example 3.3.1, that
\[
\left|\pi\sum_{m,n}\exp(-\ell_{m,n}^2t)\,A_{m,n}\,\phi^{(\alpha_1,\beta_1)}_m(\theta_1)\,\phi^{(\alpha_2,\beta_2)}_n(\theta_2)\right|=\left|\int_0^{\pi}K_{1,t}(\theta_1,\varphi)K_{2,t}(\varphi,\theta_2)\,\Omega(\varphi)\,d\varphi\right|\lesssim t^{-1}\exp\!\left(-c\,\frac{d_{1,2}(\theta_1,\theta_2)^2}{t}\right). \tag{3.3.9}
\]
We note (cf. [85, Lemma 5.2]) that corresponding bounds hold for the partial sums $\pi\sum_{m,n:\,\ell_{m,n}<N}\cdots$. ■

With these examples in mind, we now set up some notation. For $r>0$, $x_1\in X_1$, $x_2\in X_2$, $A_1\subset X_1$, and $A_2\subset X_2$, we define
\[
B_1(x_1,r)=\{z\in X_1:d_1(x_1,z)\le r\},\qquad B_2(x_2,r)=\{z\in X_2:d_2(x_2,z)\le r\},
\]
\[
B_{1,2}(x_1,r)=\{z\in X_2:d_{1,2}(x_1,z)\le r\},\qquad B_{2,1}(x_2,r)=\{z\in X_1:d_{2,1}(x_2,z)\le r\},
\]
\[
d_{1,2}(A_1,x_2)=\inf_{x\in A_1\subseteq X_1}d_{1,2}(x,x_2),\qquad d_{1,2}(x_1,A_2)=d_{2,1}(A_2,x_1)=\inf_{y\in A_2\subseteq X_2}d_{2,1}(y,x_1). \tag{3.3.12}
\]
We recall here that an infimum over an empty set is defined to be $\infty$.

Definition 3.3.2. Let $\mathbf{A}=(A_{j,k})_{j,k=0}^{\infty}$ (connection coefficients) and $\mathbf{L}=(\ell_{j,k})_{j,k=0}^{\infty}$ (joint eigenvalues) be bi-infinite matrices. For $x_1\in X_1$, $x_2\in X_2$, $t>0$, the joint heat kernel is defined formally by
\[
K_t(\Xi_1,\Xi_2;x_1,x_2)=K_t(\Xi_1,\Xi_2;\mathbf{A},\mathbf{L};x_1,x_2)=\sum_{j,k=0}^{\infty}\exp(-\ell_{j,k}^2t)\,A_{j,k}\,\phi_{1,j}(x_1)\phi_{2,k}(x_2)=\lim_{n\to\infty}\sum_{j,k:\,\ell_{j,k}<n}\exp(-\ell_{j,k}^2t)\,A_{j,k}\,\phi_{1,j}(x_1)\phi_{2,k}(x_2). \tag{3.3.13}
\]

Definition 3.3.3 (Joint data space). We assume that for each $u>0$, the set $\{(j,k):\ell_{j,k}<u\}$ is finite. A joint (compact) data space $\Xi$ is a tuple $(\Xi_1,\Xi_2,d_{1,2},\mathbf{A},\mathbf{L})$, where each of the following conditions is satisfied for some $Q>0$:

1. (Joint regularity) There exist $q_1,q_2>0$ such that
\[
\mu_1^*(B_{2,1}(x_2,r))\le cr^{q_1},\qquad\mu_2^*(B_{1,2}(x_1,r))\le cr^{q_2},\qquad x_1\in X_1,\ x_2\in X_2,\ r>0. \tag{3.3.14}
\]

2. (Variation bound) For each $n>0$, $\displaystyle\sum_{j,k:\,\ell_{j,k}<n}\cdots$

In analogy with (3.2.25) and (3.2.26), the localized kernels and operators associated with a joint data space are
\[
\Phi_n(H;\Xi_1,\Xi_2;x_1,x_2)=\sum_{j,k}H\!\left(\frac{\ell_{j,k}}{n}\right)A_{j,k}\,\phi_{1,j}(x_1)\phi_{2,k}(x_2),
\]
\[
\sigma_n(\Xi_1,\Xi_2;f,x_1)=\int_{X_2}\Phi_n(h;\Xi_1,\Xi_2;x_1,x_2)\,f(x_2)\,d\mu_2^*(x_2). \tag{3.3.18}
\]

Proposition 3.3.1. Let $S>Q+1$ be an integer, and let $H:\mathbb{R}\to\mathbb{R}$ be an even, $S$ times continuously differentiable, compactly supported function.
Then for every $x_1\in X_1$, $x_2\in X_2$, and $N>0$,
\[
|\Phi_N(H;\Xi_1,\Xi_2;x_1,x_2)|\lesssim\frac{N^Q}{\max(1,(Nd_{1,2}(x_1,x_2))^S)}, \tag{3.3.19}
\]
where the constant involved may depend upon $H$ and $S$, but not on $N$, $x_1$, or $x_2$.

In the sequel, we will fix $H$ to be the filter $h$ introduced in Section 3.2 and omit its mention from all notation. Also, we take $S>\max(Q,q_1,q_2)+1$ to be fixed, although we may put additional conditions on $S$ as needed. As before, all constants may depend upon $h$ and $S$. In the remainder of this chapter, we will take $p=\infty$, work only with continuous functions on $X_1$ or $X_2$, and use $\|f\|_K$ to denote the supremum norm of $f$ on a set $K$. Accordingly, we will omit the index $p$ from the notation for the smoothness classes; e.g., we will write $W_{\gamma}(\Xi_1;B)$ instead of $W_{\gamma,\infty}(\Xi_1;B)$. The results in the sequel are similar in the case where $p<\infty$ due to the Riesz–Thorin interpolation theorem, but notationally more exhausting without adding any apparent new insights.

We end the section with a condition on the operator defined in Equation (3.3.18) that is useful for our purposes.

Definition 3.3.4 (Polynomial preservation condition). Let $(\Xi_1,\Xi_2,d_{1,2},\mathbf{A},\mathbf{L})$ be a joint data space. We say the polynomial preservation condition is satisfied if there exists some $c^*>0$ with the property that if $P_n\in\Pi_n(\Xi_2)$, then $\sigma_m(\Xi_1,\Xi_2;P_n)=\sigma_{c^*n}(\Xi_1,\Xi_2;P_n)$ for all $m\ge c^*n$.

Remark 3.3.1. The polynomial preservation condition is satisfied if, for any $n>0$, we have the following inclusion:
\[
\{(i,j):A_{i,j}\neq 0,\ \lambda_{2,j}<n\}\subseteq\{(i,j):\ell_{i,j}\le c^*n,\ \lambda_{1,i}<c^*n\}. \tag{3.3.20}
\]

Example 3.3.3. We utilize the same notation as in Examples 3.2.2 and 3.3.2. We now see, in light of Definition 3.3.3, that $(\Xi_1,\Xi_2,d_{1,2},\mathbf{A},\mathbf{L})$ is a joint data space with exponents $(1,1,1)$.
It is clear that both the partition of unity and the strong product assumption hold in these spaces. One may also recall that $A_{m,n}=0$ at least whenever $m>n+2a+2b$, so there exists $c^*$ such that Equation (3.3.20) is satisfied. As a result, we conclude that the polynomial preservation condition holds. ■

3.4 Local Approximation in Joint Data Spaces

In this section, we assume a fixed joint data space as in Section 3.3. We are interested in the following questions. Suppose $f\in C(X_2)$, and we have information about $f$ only in a neighborhood of a compact set $A\subseteq X_2$. Under what conditions on $f$ and a subset $B\subseteq X_1$ can $f$ be lifted to a function $\mathcal{E}(f)$ on $B$? Moreover, how does the local smoothness of $\mathcal{E}(f)$ on $B$ depend upon the local smoothness of $f$ on $A$? We now give the definitions of $\mathcal{E}(f)$, $A$, and $B$ for which we have considered these questions.

Definition 3.4.1. Given $f\in C(X_2)$, we define the lifted function $\mathcal{E}(f)$ to be the limit
\[
\mathcal{E}(f)=\lim_{n\to\infty}\sigma_n(\Xi_1,\Xi_2;f), \tag{3.4.1}
\]
if the limit exists.

Definition 3.4.2. Let $r,s>0$ and let $A\subseteq X_2$ be a compact subset with the property that there exists a compact subset $B^-\subset X_1$ such that
\[
B^-\subseteq\{x_1:d_{1,2}(x_1,X_2\setminus A)\ge s+r\}. \tag{3.4.2}
\]
We then define the image set of $A$ by
\[
I(r,s;A)=B_1(B^-,s)=\{x_1:d_1(x_1,B^-)\le s\}. \tag{3.4.3}
\]
If the set $B^-$ does not exist, then we define $I(r,s;A)=\emptyset$.

Remark 3.4.1. In the sequel we fix $r,s>0$ and a compact subset $A\subseteq X_2$ such that $B^-$ defined in Equation (3.4.2) is nonempty. We write $B=I(r,s;A)$. We note that, due to the generalized triangle inequality (3.3.11), we have the important property
\[
B^-\subsetneq B\subseteq\{x_1:d_{1,2}(x_1,X_2\setminus A)\ge r\}. \tag{3.4.4}
\]
■

We now state our main theorem. Although there is no explicit mention of $B^-$ in the statement of the theorem, Remark 3.4.2 and Example 3.4.1 clarify the benefit of such a construction.

Theorem 3.4.1.
Let $(\Xi_1,\Xi_2,d_{1,2},\mathbf{A},\mathbf{L})$ be a joint data space with exponents $(Q,q_1,q_2)$. We assume that the polynomial preservation condition holds with parameter $c^*$. Suppose $X_2$ has a partition of unity.

(a) Let $f\in C(X_2)$ satisfy
\[
\sum_{m=0}^{\infty}2^{m(Q-q_2)}\|\sigma_{2^{m+1}}(\Xi_2;f)-\sigma_{2^m}(\Xi_2;f)\|_A<\infty. \tag{3.4.5}
\]
Then $\mathcal{E}(f)$ as defined in Definition 3.4.1 exists on $B$, and for $c^*r2^n\ge 1$ we have
\[
\|\mathcal{E}(f)-\sigma_{c^*2^n}(\Xi_1,\Xi_2;f)\|_B\lesssim 2^{n(Q-q_2)}\|f-\sigma_{2^n}(\Xi_2;f)\|_A+\|f\|_{X_2}2^{n(Q-S)}r^{q_2-S}+\sum_{m=n}^{\infty}2^{m(Q-q_2)}\|\sigma_{2^{m+1}}(\Xi_2;f)-\sigma_{2^m}(\Xi_2;f)\|_A. \tag{3.4.6}
\]
In particular, if $\Xi_1$ satisfies the strong product assumption, $X_1$ has a partition of unity, and $\alpha>0$ is given such that $\alpha\ell_{j,k}\ge\lambda_{1,j}$ for all $j,k\in\mathbb{N}$, then $\sigma_n(\Xi_1,\Xi_2;f)\in\Pi_{\alpha n}(\Xi_1)$.

(b) If, additionally, $f\in W_{\gamma}(\Xi_2;A)$ with $Q-q_2<\gamma<S-q_2$, then $\mathcal{E}(f)$ is continuous on $B$, and for $\phi\in C^\infty(B)$ we have $\phi\,\mathcal{E}(f)\in W_{\gamma-Q+q_2}(\Xi_1)$.

Remark 3.4.2. Given the assumptions of Theorem 3.4.1, $\mathcal{E}(f)$ is not guaranteed to be continuous on the entirety of $X_1$ (or even defined outside of $B$). As a result, in the setting of Theorem 3.4.1(b), we cannot say that $\mathcal{E}(f)$ belongs to any of the smoothness classes defined in this chapter. However, we can still say, for instance, that
\[
\inf_{P\in\Pi_{2^n}(\Xi_1)}\|\mathcal{E}(f)-P\|_{B^-}\lesssim 2^{-n(\gamma-Q+q_2)} \tag{3.4.7}
\]
(this can be seen directly by taking $\phi\in C^\infty(\Xi_1)$ such that $\phi(x)=1$ when $x\in B^-$ and $\phi(x)=0$ when $x\in X_1\setminus B$). Consequently, if it happens that $\mathcal{E}(f)\in C(\Xi_1)$, then $\mathcal{E}(f)\in W_{\gamma-Q+q_2}(\Xi_1;B^-)$. ■

Example 3.4.1. We now conclude the running examples from 3.2.2, 3.3.2, and 3.3.3 by demonstrating how one may utilize Theorem 3.4.1. We assume the notation given in each of the prior examples listed. First, we find the image set for $A=B_2(\theta_0,r_0)$, given some $\theta_0\in[0,\pi]$ and $r_0>0$.
We let $r=s=r_0/8$, in correspondence with Definition 3.4.2, and define
\[
B^-=B_1\!\left(\theta_0,\tfrac{3r_0}{4}\right)=\left\{\theta_1\in[0,\pi]:d_1(\theta_1,[0,\pi]\setminus B_1(\theta_0,r_0))\ge\tfrac{r_0}{4}\right\}=\{\theta_1\in[0,\pi]:d_{1,2}(\theta_1,[0,\pi]\setminus A)\ge r+s\}. \tag{3.4.8}
\]
Then we can let $B=B_1\!\left(\theta_0,\tfrac{7r_0}{8}\right)=B_1(B^-,r)$. By Theorem 3.4.1(a), $f\in C([0,\pi])$ can be lifted to $B_1(\theta_0,7r_0/8)$ (where we note that Equation (3.4.5) is automatically satisfied due to $Q=q_2=1$). Since $\ell_{m,n}=\lambda_{1,n}$, we have $\sigma_n(\Xi_1,\Xi_2;f)\in\Pi_n(\Xi_1)$. If we suppose $f\in W_{\gamma}(\Xi_2;A)$ for some $\gamma>0$ (with $h$ chosen so that $S$ is sufficiently large), then Theorem 3.4.1(b) informs us that $\phi\,\mathcal{E}(f)\in W_{\gamma}(\Xi_1)$ for $\phi\in C^\infty(B)$. Lastly, as a result of Equation (3.4.7), we can conclude that
\[
\inf_{P\in\Pi_{2^n}(\Xi_1)}\|\mathcal{E}(f)-P\|_{B_1(\theta_0,3r_0/4)}\lesssim 2^{-n\gamma}. \tag{3.4.9}
\]
■

3.5 Proofs

In this section, we give a proof of Theorem 3.4.1 after proving some preparatory results. We assume that $(\Xi_1,\Xi_2,d_{1,2},\mathbf{A},\mathbf{L})$ is a joint data space with exponents $(Q,q_1,q_2)$.

Lemma 3.5.1. Let $x_1\in X_1$, $r>0$. We have
\[
\int_{X_2\setminus B_{1,2}(x_1,r)}|\Phi_n(\Xi_1,\Xi_2;x_1,x_2)|\,d\mu_2^*(x_2)\lesssim n^{Q-q_2}(\max(1,nr))^{q_2-S}. \tag{3.5.1}
\]
In particular,
\[
\int_{X_2}|\Phi_n(\Xi_1,\Xi_2;x_1,x_2)|\,d\mu_2^*(x_2)\lesssim n^{Q-q_2}. \tag{3.5.2}
\]

Proof. In this proof only, define
\[
A_0=B_{1,2}(x_1,r),\qquad A_m=B_{1,2}(x_1,r2^m)\setminus B_{1,2}(x_1,r2^{m-1})\quad\text{for all }m\in\mathbb{N}. \tag{3.5.3}
\]
Then the joint regularity condition (3.3.14) implies $\mu_2^*(A_m)\lesssim(r2^m)^{q_2}$ for each $m$. We can also see by definition that when $x\in A_m$, then $d_{1,2}(x_1,x)>r2^{m-1}$. Since $S>q_2$, we deduce that for $rn\ge 1$,
\[
\int_{X_2\setminus A_0}|\Phi_n(\Xi_1,\Xi_2;x_1,x_2)|\,d\mu_2^*(x_2)\lesssim\sum_{m=1}^{\infty}\frac{n^Q\mu_2^*(A_m)}{(rn2^{m-1})^S}\lesssim r^{q_2-S}n^{Q-S}\sum_{m=1}^{\infty}2^{m(q_2-S)}\lesssim n^{Q-q_2}(nr)^{q_2-S}. \tag{3.5.4}
\]
This completes the proof of (3.5.1) when $nr\ge 1$.
The joint regularity condition and Proposition 3.3.1 show further that
\[
\int_{A_0}|\Phi_n(\Xi_1,\Xi_2;x_1,x_2)|\,d\mu_2^*(x_2)\lesssim n^Q\mu_2^*(A_0)\lesssim n^Qr^{q_2}=n^{Q-q_2}(nr)^{q_2}. \tag{3.5.5}
\]
We use $r=1/n$ in the estimates (3.5.4) and (3.5.5) and add the estimates to arrive at both (3.5.2) and the case $r\le 1/n$ of (3.5.1). ■

The next lemma gives a local bound on the operators $\sigma_n$ defined in (3.3.18).

Lemma 3.5.2. Let $A$ and $B$ be as defined in Remark 3.4.1. For a continuous $f:A\to\mathbb{R}$, we have
\[
\|\sigma_n(\Xi_1,\Xi_2;f)\|_B\lesssim n^{Q-q_2}\left(\|f\|_A+\|f\|_{X_2}(\max(1,nr))^{q_2-S}\right). \tag{3.5.6}
\]

Proof. Let $x_1\in B$. In view of the joint triangle inequality (3.3.11), we have $d_{1,2}(x_1,x_2)\ge r$ for all $x_2\in X_2\setminus A$. Therefore, Lemma 3.5.1 shows that
\[
|\sigma_n(\Xi_1,\Xi_2;f)(x_1)|\le\int_{X_2}|f(x_2)\Phi_n(\Xi_1,\Xi_2;x_1,x_2)|\,d\mu_2^*(x_2)=\int_A|f(x_2)\Phi_n(\Xi_1,\Xi_2;x_1,x_2)|\,d\mu_2^*(x_2)+\int_{X_2\setminus A}|f(x_2)\Phi_n(\Xi_1,\Xi_2;x_1,x_2)|\,d\mu_2^*(x_2)\lesssim n^{Q-q_2}\|f\|_A+\|f\|_{X_2}\int_{X_2\setminus B_{1,2}(x_1,r)}|\Phi_n(\Xi_1,\Xi_2;x_1,x_2)|\,d\mu_2^*(x_2)\lesssim n^{Q-q_2}\left(\|f\|_A+\|f\|_{X_2}(\max(1,nr))^{q_2-S}\right). \tag{3.5.7}
\]
■

Lemma 3.5.3. We assume the polynomial preservation condition with parameter $c^*$. Let $f\in C(X_2)$ satisfy (3.4.5). Then
\[
\mathcal{E}(f)=\lim_{n\to\infty}\sigma_{2^n}(\Xi_1,\Xi_2;f) \tag{3.5.8}
\]
exists on $B$. Furthermore, when $c^*2^n>1/r$, we have
\[
\|\mathcal{E}(f)-\mathcal{E}(\sigma_{2^n}(\Xi_2;f))\|_B\lesssim\sum_{m=n}^{\infty}2^{m(Q-q_2)}\|f-\sigma_{2^m}(\Xi_2;f)\|_A+\|f\|_{X_2}2^{n(Q-S)}r^{q_2-S}. \tag{3.5.9}
\]

Proof. In this proof only, we denote $P_n=\sigma_n(\Xi_2;f)$. Since $P_n\in\Pi_n(\Xi_2)$, the condition (3.3.20) implies that
\[
\mathcal{E}(P_n)=\sigma_{c^*n}(\Xi_1,\Xi_2;P_n)=\lim_{k\to\infty}\sigma_k(\Xi_1,\Xi_2;P_n) \tag{3.5.10}
\]
is defined on $X_1$.
Theorem 3.2.1 and Lemma 3.5.2 then imply that
\[
\|\mathcal{E}(P_{2^{m+1}})-\mathcal{E}(P_{2^m})\|_B=\|\sigma_{c^*2^{m+1}}(\Xi_1,\Xi_2;P_{2^{m+1}})-\sigma_{c^*2^{m+1}}(\Xi_1,\Xi_2;P_{2^m})\|_B\lesssim 2^{m(Q-q_2)}\left(\|P_{2^{m+1}}-P_{2^m}\|_A+\|f\|_{X_2}(\max(1,2^mr))^{q_2-S}\right). \tag{3.5.11}
\]
We conclude that
\[
\|\mathcal{E}(P_1)\|_B+\sum_{m=0}^{\infty}\|\mathcal{E}(P_{2^{m+1}})-\mathcal{E}(P_{2^m})\|_B\lesssim\|P_1\|+\sum_{m=0}^{\infty}2^{m(Q-q_2)}\|P_{2^{m+1}}-P_{2^m}\|_A+\|f\|_{X_2}\left(\sum_{c^*2^m\le 1/r}2^{m(Q-q_2)}+r^{q_2-S}\sum_{c^*2^m>1/r}2^{m(Q-S)}\right)<\infty. \tag{3.5.12}
\]
Thus,
\[
\mathcal{E}(f)=\mathcal{E}(P_1)+\sum_{m=0}^{\infty}\left(\mathcal{E}(P_{2^{m+1}})-\mathcal{E}(P_{2^m})\right) \tag{3.5.13}
\]
is defined on $B$. In particular, when $c^*2^n\ge 1/r$, it follows that
\[
\|\mathcal{E}(f)-\mathcal{E}(P_{2^n})\|_B\lesssim\sum_{m=n}^{\infty}2^{m(Q-q_2)}\|P_{2^{m+1}}-P_{2^m}\|_A+\|f\|_{X_2}2^{n(Q-S)}r^{q_2-S}. \tag{3.5.14}
\]
■

Now we give the proof of Theorem 3.4.1.

Proof. In this proof only, denote $P_n=\sigma_n(\Xi_2;f)\in\Pi_n(\Xi_2)$. We can deduce from Theorem 3.2.1 and Lemma 3.5.2 that for $c^*r2^n\ge 1$,
\[
\|\sigma_{c^*2^n}(\Xi_1,\Xi_2;f)-\sigma_{c^*2^n}(\Xi_1,\Xi_2;P_{2^n})\|_B\lesssim 2^{n(Q-q_2)}\left(\|f-P_{2^n}\|_A+\|f-P_{2^n}\|_{X_2}2^{n(q_2-S)}r^{q_2-S}\right)\lesssim 2^{n(Q-q_2)}\left(\|f-P_{2^n}\|_A+\|f\|_{X_2}2^{n(q_2-S)}r^{q_2-S}\right). \tag{3.5.15}
\]
The polynomial preservation condition (Definition 3.3.4) gives us
\[
\|\sigma_{c^*2^n}(\Xi_1,\Xi_2;P_{2^n})-\mathcal{E}(P_{2^n})\|_B=0. \tag{3.5.16}
\]
Then, utilizing Equation (3.5.15) and Lemma 3.5.3, we see that
\[
\|\sigma_{c^*2^n}(\Xi_1,\Xi_2;f)-\mathcal{E}(f)\|_B\le\|\sigma_{c^*2^n}(\Xi_1,\Xi_2;f)-\sigma_{c^*2^n}(\Xi_1,\Xi_2;P_{2^n})\|_B+\|\sigma_{c^*2^n}(\Xi_1,\Xi_2;P_{2^n})-\mathcal{E}(P_{2^n})\|_B+\|\mathcal{E}(P_{2^n})-\mathcal{E}(f)\|_B\lesssim 2^{n(Q-q_2)}\|f-P_{2^n}\|_A+\sum_{m=n}^{\infty}2^{m(Q-q_2)}\|P_{2^{m+1}}-P_{2^m}\|_A+\|f\|_{X_2}2^{n(Q-S)}r^{q_2-S}. \tag{3.5.17}
\]
This proves Equation (3.4.6).
In particular, when $\alpha\ell_{j,k}\ge\lambda_{1,j}$ and $\alpha>0$, the only $\phi_{1,j}(x_1)$ with non-zero coefficients in Equation (3.3.18) are those where $\ell_{j,k}<n$, which implies $\lambda_{1,j}<\alpha n$, and further that $\sigma_n(\Xi_1,\Xi_2;f)\in\Pi_{\alpha n}(\Xi_1)$. This completes the proof of part (a).

In the proof of part (b), we may assume without loss of generality that $\|f\|_{W_{\gamma}(\Xi_2;A)}+\|f\|_{X_2}=1$. We can see from Corollary 3.2.1 that for each $m$,
\[
\|P_{2^{m+1}}-P_{2^m}\|_A\le\|P_{2^{m+1}}-f\|_A+\|P_{2^m}-f\|_A\lesssim 2^{-m\gamma}, \tag{3.5.18}
\]
which implies that whenever $Q-q_2<\gamma$ we have
\[
\sum_{m=n}^{\infty}2^{m(Q-q_2)}\|P_{2^{m+1}}-P_{2^m}\|_A\lesssim 2^{n(Q-q_2-\gamma)}. \tag{3.5.19}
\]
Further, the assumption that $\gamma<S-q_2$ gives us
\[
2^{n(Q-S)}\lesssim 2^{n(Q-q_2-\gamma)}. \tag{3.5.20}
\]
Since $f\in W_{\gamma}(\Xi_2;A)$, we have from Corollary 3.2.1 that
\[
\|f-P_{2^n}\|_A\lesssim 2^{-n\gamma}. \tag{3.5.21}
\]
Using Equation (3.4.6) from part (a), we see that
\[
\|\mathcal{E}(f)-\sigma_{c^*2^n}(\Xi_1,\Xi_2;f)\|_B\lesssim(1+r^{q_2-S}\|f\|_{X_2})\,2^{n(Q-q_2-\gamma)}. \tag{3.5.22}
\]
Thus, $\{\sigma_{c^*2^n}(\Xi_1,\Xi_2;f)\}$ is a sequence of continuous functions converging uniformly to $\mathcal{E}(f)$ on $B$, so $\mathcal{E}(f)$ itself is continuous on $B$.

Let us define $R_{c^*\alpha 2^n}\in\Pi_{c^*\alpha 2^n}$ for each $n$ such that $\|R_{c^*\alpha 2^n}-\phi\|_{X_1}\lesssim 2^{-n\gamma}$. Theorem 3.2.1 and the strong product assumption (Definition 3.2.4) allow us to write
\[
\sigma_{c^*A^*\alpha 2^{n+1}}\!\left(\Xi_1;R_{c^*\alpha 2^n}\,\sigma_{c^*2^n}(\Xi_1,\Xi_2;f)\right)=R_{c^*\alpha 2^n}\,\sigma_{c^*2^n}(\Xi_1,\Xi_2;f). \tag{3.5.23}
\]
Using Equations (3.5.2) and (3.5.23), Theorem 3.2.1, and the fact that $\phi$ is supported on $B$, we can deduce that
\[
\left\|\sigma_{c^*A^*\alpha 2^{n+1}}\!\left(\Xi_1;R_{c^*\alpha 2^n}\,\sigma_{c^*2^n}(\Xi_1,\Xi_2;f)-\phi\,\mathcal{E}(f)\right)\right\|_{X_1}\lesssim\|R_{c^*\alpha 2^n}\,\sigma_{c^*2^n}(\Xi_1,\Xi_2;f)-\phi\,\mathcal{E}(f)\|_{X_1}\lesssim\|\phi\,\mathcal{E}(f)-\phi\,\sigma_{c^*2^n}(\Xi_1,\Xi_2;f)\|_{X_1}+\|R_{c^*\alpha 2^n}-\phi\|_{X_1}\|\sigma_{c^*2^n}(\Xi_1,\Xi_2;f)\|_{X_1}\lesssim\|\mathcal{E}(f)-\sigma_{c^*2^n}(\Xi_1,\Xi_2;f)\|_B+2^{n(Q-q_2-\gamma)}\|f\|_{X_2}. \tag{3.5.24}
\]
In view of Equations (3.5.22) and (3.5.24), we can conclude that
\[
E_{c^*A^*\alpha 2^{n+1}}(\Xi_1,\phi\,\mathcal{E}(f))\lesssim\|\phi\,\mathcal{E}(f)-\sigma_{c^*A^*\alpha 2^{n+1}}(\Xi_1,\phi\,\mathcal{E}(f))\|_{X_1}\le\|\phi\,\mathcal{E}(f)-R_{c^*\alpha 2^n}\,\sigma_{c^*2^n}(\Xi_1,\Xi_2;f)\|_{X_1}+\left\|\sigma_{c^*A^*\alpha 2^{n+1}}\!\left(\Xi_1;R_{c^*\alpha 2^n}\,\sigma_{c^*2^n}(\Xi_1,\Xi_2;f)-\phi\,\mathcal{E}(f)\right)\right\|_{X_1}\lesssim\|\mathcal{E}(f)-\sigma_{c^*2^n}(\Xi_1,\Xi_2;f)\|_B+\|f\|_{X_2}2^{n(Q-q_2-\gamma)}\lesssim(1+\|f\|_{X_2}(1+r^{q_2-S}))\,2^{n(Q-q_2-\gamma)}. \tag{3.5.25}
\]
Thus, $\phi\,\mathcal{E}(f)\in W_{\gamma-Q+q_2}(\Xi_1)$, completing the proof of part (b). ■

Chapter 4

Classification

The content in this chapter is sourced from our paper pending publication titled "A signal separation view of classification" [88].

4.1 Introduction

A fundamental problem in machine learning is the following. Let $\{(x_j,y_j)\}_{j=1}^M$ be random samples from an unknown probability distribution $\tau$. The problem is to approximate the conditional expectation $f(x)=\mathbb{E}_\tau(y|x)$ as a function of $x$. Naturally, there is a huge amount of literature studying function approximation by commonly used tools in machine learning, such as neural and kernel-based networks. For example, the universal approximation theorem gives conditions under which a neural network can approximate an arbitrary continuous function on an arbitrary compact subset of the ambient Euclidean space. The estimation of the complexity of the approximation process typically assumes some smoothness conditions on $f$, examples of which include the number of derivatives and membership in various classes such as Besov spaces, Barron spaces, variation spaces, etc.

A very important problem is that of classification. Here the values of $y_j$ can take only finitely many (say $K$) values, known as the class labels. In this case, it is fruitful to approximate the classification function, defined by $f(x)=\operatorname{argmax}_k\operatorname{Prob}(k|x)$ [106].
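As a toy illustration of the discussion that follows (our own example, not from the text): for a two-class Gaussian mixture on the line, $f(x)=\operatorname{argmax}_k\operatorname{Prob}(k|x)$ is piecewise constant, jumping where the weighted class-conditional densities cross.

```python
import numpy as np

def classify(x, priors, means, sigmas):
    # f(x) = argmax_k Prob(k | x); by Bayes' rule, Prob(k | x) is proportional
    # to prior_k times the class-conditional Gaussian density p_k(x).
    dens = [p * np.exp(-(x - m) ** 2 / (2 * s * s)) / (s * np.sqrt(2 * np.pi))
            for p, m, s in zip(priors, means, sigmas)]
    return int(np.argmax(dens))

priors, means, sigmas = [0.5, 0.5], [-1.0, 1.0], [0.7, 0.7]
labels = [classify(x, priors, means, sigmas) for x in np.linspace(-3.0, 3.0, 121)]
# Piecewise constant with a single jump at the symmetric decision boundary x = 0:
assert labels[:60] == [0] * 60 and labels[61:] == [1] * 60
```

The jump at the decision boundary is exactly the kind of discontinuity that prevents a direct appeal to the universal approximation theorem.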
Obviously, this function is only piecewise continuous, so the universal approximation theorem does not apply directly. In the case when the classes are supported on well-separated sets, one may refer to extension theorems, such as the Stein extension theorem [128, Chapter 6], in order to justify the application of the various approximation theorems to this problem. While these arguments are sufficient for pure existence theorems, they also create difficulties in an actual implementation, in particular because these extensions are not easy to construct. In fact, this would be impossible if the classes are not well-separated, and they might even overlap. Even if the classes are well-separated and each class represents a Euclidean domain, any lack of smoothness in the boundaries of these domains is a problem. Some recent efforts, for example by Petersen and Voigtländer [111], deal with the question of accuracy in approximation when the class boundaries are not smooth. However, a popular assumption in the last twenty years or so is that the data is distributed according to a probability measure supported on a low-dimensional manifold of a high-dimensional ambient Euclidean space. In this case, the classes have boundaries of measure 0 with respect to the Lebesgue measure on the ambient space. Finally, approximation algorithms, especially with deep networks, utilize a great deal of labeled data.

In this chapter, we propose a different approach, as advocated in [26]. Thus, we do not assume that $\operatorname{Prob}(k|x)$ is a function, but assume instead that the points $x_j$ in class $k$ comprise the support of a probability measure $\mu_k$. The marginal distribution $\mu$ of $\tau$ along $x$ is then a convex combination of the measures $\mu_k$. The fundamental idea is to determine the supports of the measures $\mu_k$ rather than approximating the $\mu_k$'s themselves. This is done in an unsupervised manner, based only on the $x_j$'s, with no label information.
Having done so, we may then query an oracle for the label of one point in the support of each measure, which is then necessarily the label for every other point in that support. Thus, we aim to achieve, in theory, a perfect classification using a minimal amount of judiciously chosen labeled data.

In order to address the problem of overlapping classes, we take a hierarchical multiscale approach motivated by a paper [15] of Chaudhuri and Dasgupta. Thus, for each value $\eta$ of the minimal separation among classes, we assume that the support of $\mu$ is a disjoint union of $K_\eta$ subsets, each representing one of $K_\eta$ classes, leaving an extra set representing the overlapping region. When we decrease $\eta$, we may eventually capture all the classes, leaving only a negligible overlapping region (ideally with $\mu$-probability 0).

In [26], it is seen that the problem is analogous to the problem of point source signal separation. If each $\mu_k$ were a Dirac delta measure supported at, say, $\omega_k$, the point source signal separation problem would be to find these point sources from finitely many observations of the Fourier transform of $\mu$. In the classification problem, we do not have point sources, and the information comprises samples from $\mu$ rather than its Fourier transform. Nevertheless, we observed in [26] that the techniques developed for the point source signal separation problem can be adapted to the classification problem, viewed as the problem of separating the supports of the $\mu_k$. In that paper, it was assumed only that the data is supported on a compact subset of a Euclidean space, and a specially designed localized kernel based on Hermite polynomials [23] was used for this purpose. Since Hermite polynomials are intrinsically defined on the whole Euclidean space, this creates both numerical and theoretical difficulties.
In this chapter, we allow the data to come from an arbitrary compact metric space (if $\nu$ is a positive measure on a metric space $\mathbb{M}$, we define the support of $\nu$ by $\mathrm{supp}(\nu)=\{x\in\mathbb{M}:\nu(B(x,r))>0\text{ for all }r>0\}$, where $B(x,r)$ is the ball of radius $r$ centered at $x$) and use localized trigonometric polynomial kernels instead. We feel that this leads to a more satisfactory theory, although one of the accomplishments of this chapter is to resolve the technical difficulties required to achieve this generalization. To summarize, the main accomplishments of this chapter are:

• We provide a unified approach to signal separation problems and classification problems.

• We deal with the classification of data coming from an arbitrary metric space with no further structure, such as a manifold structure.

• Our results suggest a multiscale approach which does not assume any constraints on class boundaries, including that the classes not overlap.

• In theory, the number of classes at each scale is an output of the theorem rather than a prior assumption.

• We develop an algorithm to illustrate the theory, especially in the context of active learning on hyperspectral imaging data.

Our work belongs to the general theory of active learning. In Section 4.2, we review some literature in this area which is somewhat related to the present work. In Section 4.3, we give a brief discussion of the point source signal separation problem and the use of localized trigonometric polynomial kernels to solve it. In Section 4.4, we describe the background needed to formulate our theorems, which are given in Section 4.5. The algorithm MASC to implement these results in practice is given in Section 4.6 and illustrated in the context of a simulated circle-and-ellipse data set, a document dataset, and two hyperspectral datasets. The proofs of the results in Section 4.5 are given in Section 4.8.
4.2 Related works

Perhaps the most relevant work to this chapter is [26]. That paper also outlines the theory and an algorithm for a classification procedure using a thresholding set based on a localized kernel. There are three major improvements we have made in this chapter relative to that work:

1. We have constructed the kernel in this chapter in terms of trigonometric functions, whereas in [26] the kernel was constructed from Hermite polynomials. The trigonometric kernel is much faster in implementations for two reasons: 1) each individual polynomial is extremely quick to compute, and 2) the trigonometric kernel deals only with trigonometric polynomials up to degree $n$, whereas the Hermite-polynomial-based kernel needs polynomials up to degree $n^2$ to achieve the same support estimation bounds.

2. This chapter deals with arbitrary compact metric spaces (allowing for a rescaling of the data so that the maximum distance between values is $\le\pi$), whereas [26] dealt with compact subsets of Euclidean space and had a requirement on the degree of the kernel dependent upon the diameter of the data in terms of the Euclidean distance.

3. In [26], an algorithm known as Cautious Active Clustering (CAC) was developed. In this chapter, we present a new algorithm with several implementation advantages over CAC. We discuss this topic in more depth in Section 4.6.2.

Another pair of related works is that of the Learning by Active Nonlinear Diffusion (LAND) and Learning by Evolving Nonlinear Diffusion (LEND) algorithms [74, 133]. Like the present work, these algorithms use a kernel-based density estimation when deciding which points to query. However, LAND and LEND both use a Gaussian kernel applied on $k$ neighbors for the density estimation and weight it by a diffusion value. Then, the queried points are simply those with the highest combined weights.
The diffusion value corresponds to a minimal diffusion distance among points with a higher density estimation. For the point with the maximal density estimation, a maximal diffusion distance among other data points is taken as the weight. This extra weighting procedure is absent from our theory and algorithm, which uses an estimation approach based purely on a localized kernel to decide on points to query. In our algorithm, we take a multiscale approach and decide on query points at each level instead of forming a global ranking of the data points.

In [136], an active learning approach using neural networks is developed. That work focuses on binary classification and on developing models within a neural network framework such that a sufficient number of queries will achieve a desired accuracy. A study of two types of uncertainty in active learning problems is discussed in [122]. The two critical types of uncertainty are: 1) a data point is likely to belong to multiple labels; 2) a data point is not likely to belong to any label. Our work also seeks to distinguish data points which are uncertain in the second sense, using a graph construction approach and potentially also a thresholding set for high-density points. When our algorithm encounters points which are uncertain in the first sense, it elects not to assign a label right away, instead coming back to them once the "confident" points have been classified.

Our method and algorithm are meant to work on general data sampled from compact metric spaces. One difficulty that algorithms may face is the presence of highly imbalanced data (i.e., where some class labels dominate over others in a data set). The problem of tackling this difficulty is studied in [131], where an approach to querying imbalanced data is employed that balances two principles: exploration and exploitation.
During the exploration phase, the algorithm seeks out points to query in regions that have been little explored so far. During the exploitation phase, it seeks to query points in the most critical explored regions. Our algorithm works in a different fashion, by querying points which we believe to be in high-density portions of a label's support and extending the label to nearby points until it "bumps" against points which may belong to another label. We list the survey by Tharwat and Schenck [132] as a resourceful survey of recent developments in active learning.

4.3 Point source signal separation

The problem of signal separation goes back to early work of de Prony [36], and can be stated as: estimate the coefficients a_k and locations ω_k constituting µ = Σ_{k=1}^{K} a_k δ_{ω_k}, from observations of the form

µ̂(x) = Σ_k a_k e^{−iω_k x},  x ∈ ℝ. (4.3.1)

There is much literature on methods to approach this problem, and we cite [113] as a text one can use to become familiar with the topic. If we assume ω_k = k∆ for some ∆ ∈ ℝ₊ and allow measurements for any x ∈ [−Ω, Ω] for some Ω ∈ ℝ₊, then recovery is possible so long as we are above the Rayleigh threshold, i.e., Ω ≥ π/∆ [41]. The case where this threshold is not satisfied is known as super-resolution. Much further research has investigated the super-resolution problem, such as [3, 13, 68].

We now introduce a particular method of interest for signal separation, introduced in [93] and further developed in [87]. The method takes the following approach to estimate the coefficients and locations of µ, without the assumption that the ω_k's should be at grid points, but with the additional restriction that only finitely many integer values of x are allowed. We start with the trigonometric moments of µ:

µ̂(ℓ) = Σ_k a_k e^{−iω_k ℓ},  |ℓ| < n,

where n ≥ 1 is an integer.
Clearly, the quantities µ̂(ℓ) remain the same if any ω_k is replaced by ω_k plus an integer multiple of 2π. Therefore, this problem is properly treated as the recuperation of a periodic measure µ from its Fourier coefficients rather than the recuperation of a measure defined on ℝ from its Fourier transform. Accordingly, we define the quotient space 𝕋 = ℝ/(2πℤ), and denote in this context |x − y| = |(x − y) mod 2π|. Here and in the rest of this chapter, we consider a smooth band-pass filter h; i.e., an even function h ∈ C^∞(ℝ) such that h(u) = 1 for |u| ≤ 1/2 and h(u) = 0 for |u| ≥ 1. We then define

σ_n(µ)(x) := Σ_{|ℓ|<n} h(|ℓ|/n) µ̂(ℓ) e^{iℓx} = Σ_k a_k Φ_n(x − ω_k),  where Φ_n(x) := Σ_{|ℓ|<n} h(|ℓ|/n) e^{iℓx}.

The kernel Φ_n is localized: for any S > 0,

|Φ_n(x)| ≤ c n / max(1, (n|x|)^S),  x ∈ 𝕋,

where c > 0 is a constant depending only on h and S.

In what follows, M denotes the ambient compact metric space, with metric ρ. For any point x ∈ M, and any sets A, B ⊆ M, we define the following notation for balls and neighborhoods:

dist(x, A) = inf_{y∈A} ρ(x, y),  B(x, r) = {y ∈ M : ρ(x, y) ≤ r},
dist(A, B) = inf_{y∈A} dist(y, B),  B(A, r) = {x ∈ M : dist(x, A) ≤ r}. (4.4.3)

For any A ⊆ M, we define diam(A) = sup_{x,y∈A} ρ(x, y).

4.4.1 Measures

Let µ be a positive, Borel, probability measure on M (i.e., ∫_M dµ(y) = 1). We denote X := supp(µ). Much of this chapter focuses on X. However, we wish to treat X as an unknown subset of a known ambient space M rather than treating it as a metric space in its own right. In particular, this emphasizes the fact that the data measure µ may not have a density, and may not be supported on the entire ambient space. In the case of signal separation, we have seen that if the minimal amplitude of a certain point source is sufficiently small, we may not be able to detect that point source. Likewise, if the measure µ is too small on parts of X, we may not be able to detect those parts. For this reason, we make some assumptions on the measure µ as in [26].
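As a numerical illustration of the localized estimator σ_n (a sketch only, not the chapter's implementation: the cosine-taper filter below is a merely continuous stand-in for the C^∞ band-pass filter h), σ_n(µ) can be evaluated directly from the point sources, and its magnitude peaks near the locations ω_k:

```python
import numpy as np

def h(u):
    """Stand-in band-pass filter: 1 on [0, 1/2], 0 on [1, oo).
    The theory requires a C-infinity transition; a cosine taper
    is used here purely for illustration."""
    u = np.abs(np.asarray(u, dtype=float))
    out = np.zeros_like(u)
    out[u <= 0.5] = 1.0
    mid = (u > 0.5) & (u < 1.0)
    out[mid] = 0.5 * (1.0 + np.cos(2.0 * np.pi * (u[mid] - 0.5)))
    return out

def sigma_n(x, omegas, amps, n):
    """sigma_n(mu)(x) = sum_{|l|<n} h(|l|/n) mu_hat(l) e^{i l x},
    where mu_hat(l) = sum_k a_k e^{-i omega_k l}."""
    ells = np.arange(-(n - 1), n)                    # |l| < n
    mu_hat = np.exp(-1j * np.outer(ells, omegas)) @ amps
    weights = h(ells / n) * mu_hat
    return np.exp(1j * np.outer(x, ells)) @ weights  # complex values

# Two point sources; |sigma_n| is large near each omega_k and
# small between them, reflecting the localization of Phi_n.
omegas, amps = np.array([1.0, 2.5]), np.array([1.0, 0.7])
grid = np.array([1.0, 1.75, 2.5])
vals = np.abs(sigma_n(grid, omegas, amps, 64))
```

Evaluating `vals` above shows a sharp contrast between the source locations (1.0 and 2.5) and the midpoint 1.75, which is the behavior the localization estimate for Φ_n guarantees.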
The first property, detectability, determines the rate of growth of the measure locally around each point in the support. The second property, fine structure, relates the measure to the classification problem by equipping the support with a well-separated (except maybe for some subset of relatively small measure) partition which may correspond to the different class labels in the data.

Definition 4.4.1. We say a measure µ on M is detectable if there exist α ≥ 0, κ₁, κ₂ > 0 such that

µ(B(x, r)) ≤ κ₁ r^α,  x ∈ M, r > 0, (4.4.4)

and there exists r₀ > 0 such that

µ(B(x, r)) ≥ κ₂ r^α,  x ∈ X, 0 < r ≤ r₀. (4.4.5)

Definition 4.4.2. We say a measure µ has a fine structure if there exists an η₀ such that for every η ∈ (0, η₀] there is an integer K_η and a partition S_η := {S_{k,η}}_{k=1}^{K_η+1} of X for which both of the following are satisfied.

1. (Cluster Minimal Separation) For any j, k = 1, 2, ..., K_η with j ≠ k we have

dist(S_{j,η}, S_{k,η}) ≥ 2η. (4.4.6)

2. (Exhaustion Condition) We have

lim_{η→0⁺} µ(S_{K_η+1,η}) = 0. (4.4.7)

We will say that µ has a fine structure in the classical sense if µ = Σ_{k=1}^{K} a_k µ_k for some probability measures µ_k, where the a_k's are positive with Σ_k a_k = 1, and the compact subsets S_k := supp(µ_k) are disjoint. In this case η is the minimal separation among the supports and there is no overlap.

Remark 4.4.1. It is possible to require condition (4.4.5) only on a subset of X having measure converging to 0 with r. This would add some difficulties in our proof of (4.8.5) and Lemma 4.8.4. However, in the case when µ has a fine structure, this exceptional set can be absorbed into S_{K_η+1} with appropriate assumptions. We do not find it worthwhile to explore this further in this chapter. ■

Example 4.4.1.
Supposing that µ = Σ_{k=1}^{K} a_k δ_{ω_k} (as in the signal separation problem), we see that µ is detectable with α = 0, κ₁ = max_k |a_k|, κ₂ = min_k |a_k|. It has fine structure in the classical sense whenever η < min_{j≠k} |ω_j − ω_k|. In this sense, the theory presented in this chapter is a generalization of the results for signal separation in this regime. ■

Example 4.4.2. If X is an α-dimensional, compact, connected Riemannian manifold, then the normalized Riemannian volume measure is detectable with parameter α. ■

4.4.2 F-score

We will give results on the theoretical performance of our measure-estimation procedure by giving an asymptotic result involving the so-called F-score. The F-score for binary (true/false) classification problems is a measure of classification accuracy taking the form of the harmonic mean between precision and recall. In a predictive model, precision is defined as the fraction of true positive outputs over all the positive outputs of the model. Recall is the fraction of true positive outputs over all the actual positives. In a multi-class problem, we extend this definition as follows (cf. [119]). If {C₁, ..., C_N} is a partition of {x_j}_{j=1}^{M} indicating the predicted output labels of a model and {L₁, ..., L_K} is the ground-truth partition of the data, then one can define the precision of C_j against the true label L_k by |C_j ∩ L_k| / |C_j| and the corresponding recall by |C_j ∩ L_k| / |L_k|. Taking the maximum of the harmonic means of the precisions and recalls with respect to all the ground-truth labels leads to

F(C_j) = 2 max_{k∈{1,...,K}} |C_j ∩ L_k| / (|C_j| + |L_k|). (4.4.8)

Then the F-score is given by

F({C_j}_{j=1}^{N}) = Σ_{j=1}^{N} |C_j| F(C_j) / Σ_{j=1}^{N} |C_j|. (4.4.9)

Since we are treating the data as samples from a measure µ, we replace cardinality in the above formulas with measure.
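For concreteness, the F-score of (4.4.8)–(4.4.9) in the sampled-data (cardinality) setting can be computed as follows; `f_score` is a hypothetical helper written for this sketch, taking the predicted and ground-truth partitions as lists of sets:

```python
def f_score(predicted, truth):
    """Weighted F-score of (4.4.9); predicted and truth are lists of
    sets partitioning the same finite collection of data points."""
    def f_cluster(c):
        # (4.4.8): best harmonic mean of precision and recall over labels
        return max(2.0 * len(c & l) / (len(c) + len(l)) for l in truth)
    total = sum(len(c) for c in predicted)
    return sum(len(c) * f_cluster(c) for c in predicted) / total

# A perfect partition attains the ideal score of 1; mixing points
# across the true labels strictly lowers the score.
truth = [{1, 2, 3}, {4, 5}]
perfect = f_score([{1, 2, 3}, {4, 5}], truth)
mixed = f_score([{1, 2, 4}, {3, 5}], truth)
```

Here `perfect` equals 1, while `mixed` is strictly smaller, illustrating the remark on the symmetric difference below (4.4.11).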
Our fine structure condition gives us the true supports as {S_{k,η}}_{k=1}^{K_η} for any valid η, so we can define the F-score for the support estimation clusters {C_{j,η}}_{j=1}^{N} by

F_η(C_{j,η}) = 2 max_{k∈{1,...,K}} µ(C_{j,η} ∩ S_{k,η}) / (µ(C_{j,η}) + µ(S_{k,η})), (4.4.10)

and

F_η({C_{j,η}}_{j=1}^{N}) = Σ_{j=1}^{N} µ(C_{j,η}) F_η(C_{j,η}) / µ(∪_{j=1}^{N} C_{j,η}). (4.4.11)

Remark 4.4.2. We observe that

1 − 2µ(C_{j,η} ∩ S_{k,η}) / (µ(C_{j,η}) + µ(S_{k,η})) = µ(C_{j,η} ∆ S_{k,η}) / (µ(C_{j,η}) + µ(S_{k,η})),

where in this remark only, ∆ denotes the symmetric difference. It follows that 0 ≤ F_η ≤ 1. If we estimate each support perfectly, so that C_{j,η} = S_{j,η} for all j and each C_{j,η} is η-separated from any other, then F_η({C_{j,η}}_{j=1}^{N}) = 1. Otherwise, we attain an F-score strictly lower than 1. ■

4.5 Main Results

In this section we introduce the main theorems of this chapter, which involve the recovery of the supports of a measure from finitely many samples. Theorem 4.5.1 pertains to the case where we only assume the detectability of the measure. Theorem 4.5.2 pertains to the case where we additionally assume the fine structure condition. Before stating the results, we must introduce our discrete measure support estimator and support estimation sets. We define our data-based measure support estimator by

F_n(x) := (1/M) Σ_{j=1}^{M} Ψ_n(x, x_j). (4.5.1)

This definition is then used directly in the construction of our data-based support estimation sets, given by

G_n(Θ) := {x ∈ M : F_n(x) ≥ Θ max_{1≤k≤M} F_n(x_k)}. (4.5.2)

Our first theorem gives bounds on how well G_n(Θ) approximates the entire support X under the detectability assumption.

Theorem 4.5.1. Let µ be detectable and suppose M ≳ n^α log(n). Let {x₁, x₂, ..., x_M} be independent samples from µ.
There exists a constant C > 0 such that if Θ < C < 1, then there exists r(Θ) ∼ Θ^{−1/(S−α)} such that with probability at least 1 − c₁/M^{c₂} we have

X ⊆ G_n(Θ) ⊆ B(X, r(Θ)/n). (4.5.3)

Our second theorem additionally assumes the fine structure condition on the measure, and gives conditions so that for any satisfactory η, the support estimation set G_n(Θ) splits into K_η subsets, each with separation η, thus solving the machine learning classification problem in theory.

Theorem 4.5.2. Suppose, in addition to the assumptions of Theorem 4.5.1, that µ has a fine structure, n ≳ 1/(ηΘ^{1/(S−α)}), and µ(S_{K_η+1,η}) ≲ Θn^{−α}. Define

G_{k,η,n}(Θ) := G_n(Θ) ∩ B(S_{k,η}, r(Θ)/n). (4.5.4)

Then, with probability at least 1 − c₁/M^{c₂}, {G_{k,η,n}(Θ)}_{k=1}^{K_η} is a partition of G_n(Θ) such that

dist(G_{j,η,n}(Θ), G_{k,η,n}(Θ)) ≥ η,  j ≠ k, (4.5.5)

and in this case, there exists c < 1 such that

X ∩ B(S_{k,η}, c r(Θ)/n) ⊆ G_{k,η,n}(Θ) ⊆ B(S_{k,η}, r(Θ)/n). (4.5.6)

Remark 4.5.1. If C = {z₁, ..., z_M} is a random sample from µ, n ≥ 1, and M ≳ n^α log n, then it can be shown (cf. [85, Lemma 7.1]) that for any point x ∈ X, there exists some z ∈ C such that ρ(x, z) ≤ 1/n. Hence, the Hausdorff distance between X and C is ≤ 1/n. If µ has a fine structure in the classical sense, and n ≳ η^{−1}, then this implies that a correct clustering of C would give rise to a correct classification of every point in X. This justifies our decision to construct the algorithm in Section 4.6 to classify only the points in C. On the other hand, the use of the localized kernel as in the theorems above guides us in the choice of the points at which to query labels. ■

In Figure 4.3 we illustrate Theorem 4.5.2 applied to a simple two-moons data set.
We see that the support estimation set, shown in yellow, covers the data points as well as their nearby area, estimating the support of the distribution from which the data came. Furthermore, the figure shows a motivating idea: by querying a single point in each component for its class label, we can extend the label to the other points in order to classify the whole data set. This is how we utilize the active learning paradigm in our algorithm, discussed in Section 4.6.

Figure 4.3: Demonstration of the support estimation set G₃₂(0.15) (yellow) applied to a simple two-moons data set from [48] (blue and red). By querying one point from each component of the support estimation set and extending the label to the other points in the same component, we can classify the entire data set with 100% accuracy.

Our final result examines the fidelity of our classification scheme in terms of the asymptotics of the F-score associated with our support estimation theorems as η → 0. We show that our support estimation setup asymptotically approaches the ideal F-score of 1.

Theorem 4.5.3. Suppose the assumptions of Theorem 4.5.2 are satisfied and that

lim_{η→0⁺} max_{0≤k≤K_η} ( µ(S_{K_η+1,η}) / µ(S_{k,η}) ) = 0. (4.5.7)

Then, with probability at least 1 − c₁/M^{c₂}, we have

lim_{η→0⁺} F_η({G_{k,η,n}(Θ)}_{k=1}^{K_η}) = 1, (4.5.8)

where F_η is the F-score with respect to S_η.

4.6 MASC Algorithm

4.6.1 Algorithm Description

In the following paragraphs we describe the motivation and intuition behind the algorithm MASC (Algorithm 1). Throughout this section we will refer to line numbers associated with Algorithm 1. One obvious way to embed a data set into a metric space with diameter ≤ π is simply to rescale it. If the data is a compact subset of an ambient Euclidean space ℝ^q, we may project the data onto the unit sphere S^q ⊂ ℝ^{q+1} by a suitable inverse stereographic projection.
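One standard way to realize such an embedding can be sketched as follows; this particular normalization of the inverse stereographic projection is a choice made for illustration, not necessarily the one used in the chapter's implementation:

```python
import numpy as np

def inverse_stereographic(X):
    """Map points of R^q onto the unit sphere S^q in R^{q+1}.
    X: (M, q) array; returns an (M, q+1) array of unit vectors.
    The origin maps to the south pole; far-away points approach
    the north pole, so all geodesic distances stay below pi."""
    s = np.sum(X ** 2, axis=1, keepdims=True)
    return np.hstack([2.0 * X, s - 1.0]) / (s + 1.0)

def geodesic_distances(U, V):
    """Geodesic distance arccos(<u, v>) between unit vectors;
    the diameter of the sphere in this metric is pi."""
    return np.arccos(np.clip(U @ V.T, -1.0, 1.0))
```

One would typically rescale the data before projecting, since the projection distorts distances nonuniformly; the point of the construction is only to land in a compact metric space of diameter at most π.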
The metric space S^q, equipped with the geodesic distance ρ(x, y) = arccos(⟨x, y⟩), has diameter π by construction.

One of the main obstacles we must overcome in an implementation of our theory is the following. In practice, we often do not know the minimal separation η of the data classes beforehand, nor do we know optimal values for Θ, n. Taking a machine learning perspective, we develop a multiscale approach to remedy these technical challenges: treat n, Θ as hyperparameters of the model and increment η.

Firstly, MASC thresholds out any data points not belonging to G_n(Θ) (line 2). For each value of η (the while loop initialized in line 4), we construct an (unweighted) graph where an edge joins two points x_i, x_j if and only if ρ(x_i, x_j) < η (line 5). At this point, we have a method for unsupervised clustering by simply examining graph components (line 6; see below for a discussion of p). The idea to implement active learning is then to query a modal point of each graph component (line 11), also referred to in this section as a cluster, with respect to Ψ_n, and extend that label to the rest of the cluster (line 13). A trade-off associated with this idea is the following: if we initialize η too small (respectively, n too large), then each point in the data set will be its own cluster and we will simply query the whole data set, whereas if we initialize η too large (respectively, n too small), then the whole data set will belong to a single cluster, destroying any classification accuracy. Therefore, we initialize η small and introduce a minimum cluster size threshold p to avoid this issue. Any cluster of size < p will be removed from consideration (line 6), so we will not query any points until η is large enough to produce a cluster of size p or greater.
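The graph-and-query step just described can be sketched as follows. This is an illustrative, hypothetical helper rather than the chapter's implementation: the kernel values F_n are assumed precomputed, and `oracle` stands in for the labeling function f:

```python
import numpy as np

def masc_level(D, F, eta, p, queried, labels, oracle):
    """One scale of the MASC loop: build the eta-graph, take its
    connected components, query the modal point (w.r.t. F_n) of each
    sufficiently large unqueried component, and extend labels.
    D: (M, M) pairwise distances of the thresholded points;
    F: values of the estimator F_n at those points;
    labels: integer array, -1 meaning "no predicted label yet"."""
    M = len(F)
    comp = -np.ones(M, dtype=int)          # component id of each point
    c = 0
    for s in range(M):                      # DFS over the eta-graph
        if comp[s] >= 0:
            continue
        stack, comp[s] = [s], c
        while stack:
            i = stack.pop()
            nbrs = np.where((D[i] < eta) & (comp < 0))[0]
            comp[nbrs] = c
            stack.extend(nbrs.tolist())
        c += 1
    for k in range(c):
        idx = np.where(comp == k)[0]
        if len(idx) < p:
            continue                        # too small: do not query
        prev = [i for i in idx if i in queried]
        if not prev:
            i_star = idx[np.argmax(F[idx])]     # modal point w.r.t. F_n
            queried.add(int(i_star))
            labels[idx] = oracle(i_star)        # query and extend label
        elif len({int(labels[i]) for i in prev}) == 1:
            labels[idx] = labels[prev[0]]       # consistent: extend
        # conflicting queried labels: extension halted for this cluster
    return comp
```

Running this helper once per value of η, with `queried` and `labels` carried between calls, mimics the multiscale loop; the full algorithm (Algorithm 1) additionally handles the stopping rule and the final extension to uncertain points.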
After the label extension is done in each cluster of size ≥ p, we keep track of which points we queried (line 12), increment η (line 16), and repeat (line 4). Sometime after the first incrementation of η, we will see the merging of clusters which were previously disconnected. When this occurs, we check whether all of the previously queried points in the new cluster have the same label (line 14). If so, then we extend it to the new cluster (line 15). Otherwise, we halt the extension of labels for all points in that cluster. In this way, the method proceeds by a cautious clustering, to avoid labeling points that are either 1) in a too-low-density region, or 2) within a cluster where we have queried multiple points with contradicting labels. Once η is large enough that the data set belongs entirely to a single cluster, we will not gain any new information by incrementing η further, and hence MASC halts the iterations over η (lines 7 and 8).

The final step is to implement a method for estimating the labels of points that did not receive a predicted label in the first part, either because they belonged to a low-density region and were thresholded out, or because they belonged to a cluster with conflicting queried points. The remaining task is equivalent to the semi-supervised regime of classification, and we acknowledge that there is a vast variety of semi-supervised learning methods to choose from. In MASC, we have elected to use a traditional k-nearest neighbors approach. For a data point x_j, we denote by A_{j,k} the set of its k nearest neighbors which already have labels ŷ estimated from MASC. The k-nearest neighbors rule to estimate the label of x_j is then given by:

argmax_{c∈[K]} |{x_i ∈ A_{j,k} : ŷ(x_i) = c}|, (4.6.1)

with some way to decide on the choice of label in the event of a tie.
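A minimal sketch of this extension rule follows; `knn_extend` is a hypothetical helper, and the tie-breaking convention here (smallest tied label) is one of the conventions discussed in the text:

```python
import numpy as np
from collections import Counter

def knn_extend(D, labels, k):
    """Extend labels to unlabeled points by the rule (4.6.1): a
    majority vote among the k nearest already-labeled neighbors.
    D: (M, M) pairwise distances; labels: integer array with -1
    marking unlabeled points. Ties resolve to the smallest label."""
    out = labels.copy()
    labeled = np.where(labels >= 0)[0]
    for j in np.where(labels < 0)[0]:
        nearest = labeled[np.argsort(D[j, labeled])][:k]   # A_{j,k}
        votes = Counter(int(labels[i]) for i in nearest)
        top = max(votes.values())
        out[j] = min(c for c, v in votes.items() if v == top)
    return out
```

Note that only already-labeled points vote, matching the definition of A_{j,k} above.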
In binary classification tasks, the value of k can be chosen odd to prevent ties. Otherwise, a tie can be broken by choosing the label of the nearest point with a tied label, by a hierarchical ordering of the labels, at random, etc. In our Python implementation of the algorithm used to produce the figures in this chapter, we use the scipy.stats.mode function, which upon such a tie returns the first label in the list of tied labels. MASC collects all points which do not yet have predicted labels (line 17) and applies the nearest-neighbors approach described above to each of these points (lines 19 and 20). At this point, every element in the data set has a predicted label, so the algorithm returns the list of labels (line 21).

In MASC, we require a starting η and a step size η_step. Once the matrix with entries Ψ_n(x_i, x_j) is calculated, one may search with relative ease for the range of η values which give non-trivial clusters of size ≥ p. If η is too small, no cluster will contain a sufficient number of points, and if η is too large, every point will belong to the same cluster; we consider both of these cases "trivial". Then η_step may be chosen to achieve some total number of iterations across this range. The values n, Θ, p, k are considered hyperparameters.

Algorithm 1: Multiscale Active Super-resolution Classification (MASC)
Input: Data set X, kernel degree n, threshold parameter Θ, η initialization, step size η_step > 0, cluster size minimum p, oracle f, neighbor parameter k.
Output: Predicted labels ŷ for all points in X.
1  A ← ∅  (initialize queried point set)
2  V ← {x_i ∈ X : x_i ∈ G_n(Θ)}  (prune data to the threshold set (4.5.2))
3  STOP ← FALSE
4  while STOP = FALSE do
5      E ← {(x_i, x_j) ∈ V × V : ρ(x_i, x_j) < η, x_i ≠ x_j}  (edge set consisting of points within η distance of each other)
6      {C_{η,ℓ}}_{ℓ=1}^{K_η} ← connected components of G = (V, E) with size ≥ p
7      if |C_{η,1}| = |V| then
8          STOP ← TRUE  (end while loop once G is connected)
9      for ℓ = 1 to K_η do
10         if C_{η,ℓ} ∩ A = ∅ then
11             x_i ← argmax_{x∈C_{η,ℓ}} Σ_{j=1}^{M} Ψ_n(x, x_j)  (locate maximizer of F_n (cf. (4.5.1)) in C_{η,ℓ}, which contains no queried points)
12             A ← A ∪ {x_i}  (append maximizer to queried point set)
13             ŷ(x_j) ← f(x_i) for all x_j ∈ C_{η,ℓ}  (query point and extend label to all of C_{η,ℓ})
14         else if ∀ x_i, x_j ∈ C_{η,ℓ} ∩ A, f(x_i) = f(x_j) =: c_{η,ℓ} then
15             ŷ(x_j) ← c_{η,ℓ} for all x_j ∈ C_{η,ℓ}  (if all queried points in the component share a label, extend it to the entire component)
16     η ← η + η_step
17 C_uncertain ← {x_j ∈ X : ŷ(x_j) is undefined}  (set of points without a predicted label)
18 for x_j ∈ C_uncertain do
19     A_{j,k} ← {x ∈ X \ C_uncertain : x is the k-th closest element to x_j or closer}
20     ŷ(x_j) ← argmax_{c∈[K]} |{x_i ∈ A_{j,k} : ŷ(x_i) = c}|  (k-nearest neighbors estimate for uncertain points)
21 return ŷ

4.6.2 Comparison With CAC and SCALe

In [26], a theoretical approach similar to this chapter's, except on Euclidean space, was developed, and an algorithm we will call "Cautious Active Clustering" (CAC) was introduced. MASC and CAC are both multiscale algorithms using G_n(Θ) to threshold the data set, then constructing graphs to query points and extend labels. The main difference between the algorithms is the following. In CAC, η, Θ are considered hyperparameters while n is incremented, whereas in MASC, n, Θ are considered hyperparameters while η is incremented.
This adjustment serves three purposes:

1. It connects the algorithm more closely to the theory, which states that a single pair of values n, Θ will suffice for the right value of η. We do not know η in advance, but by incrementing η until all of the data belongs to a single cluster, we will attain a value close to the true one at some step. At that step, we will query points belonging roughly to the "true" clusters, and that information is carried onward to the subsequent steps.

2. Consistency in the query procedure: we use the same function to decide which points to query at each level, rather than having it change as the algorithm progresses.

3. It improves computation times, since computing the Ψ_n matrix for varying values of n tends to take more time than incrementing η and checking graph components.

In MASC, we have the additional parameter p specifying the minimum size a graph component must have to allow a query. While this is new compared to CAC, its main purpose is to reduce the total number of queries to just those that carry more information. One could implement such a change to CAC as well, for a similar effect. A further difference is that CAC uses a localized summability kernel approach to classify uncertain samples, whereas MASC uses a nearest-neighbors approach.

SCALe, introduced in [98], is an even more similar algorithm to MASC. The main difference between MASC and SCALe is the final step: in the present method we use a nearest-neighbors approach to extend labels to uncertain points, while SCALe uses a function approximation technique developed in [90]. Both methods have their pros and cons. Compared to SCALe, the nearest-neighbors approach of MASC:

1. works in arbitrary metric spaces, without requiring a summability kernel as in SCALe;

2. extends labels to uncertain points (sometimes much) faster, reducing computation time while usually providing comparable or better results with sufficiently many queries; but

3. loses accuracy in extremely sparse query settings, where the function estimation method with the manifold assumption empirically seems to extend labels more consistently.

4.7 Numerical Examples

In this section, we look at the performance of the MASC algorithm applied to 1) a synthetic data set with overlapping class supports (Section 4.7.1), 2) a document data set (Section 4.7.2), and 3) two different hyperspectral imaging data sets: Salinas (Section 4.7.3) and Indian Pines (Section 4.7.4). On the hyperspectral data sets, we compare our method with two other algorithms for active learning: LAND and LEND (Section 4.7.5).

For hyperparameter selection on our model, as well as on the comparison methods, we have not done any validation but rather optimized the hyperparameters for each model on the data itself. So the results should be interpreted as being near-best-possible for the models applied to the data sets in question, rather than a demonstration of generalization capabilities. While this approach is non-traditional for unsupervised/supervised learning, it has been done in other active learning research ([133], for example), so we have elected to follow the same procedure in this chapter. Further, an exhaustive grid search was not conducted; rather, local minima among grid values were selected for each hyperparameter. For MASC, we looked at n in powers of 2 and k values in multiples of 5. For LAND we looked at K, t in increments of 10, and with LEND we used the same parameters from LAND and looked at integer J values and α values in increments of 0.1. For Θ, we tried values less rigorously, meaning that better Θ values may exist than the ones chosen.
Due to the nature of the algorithm, increasing Θ will increase the number of samples whose labels the nearest-neighbors approach has to estimate, while reducing the number of labeled neighbors it has for doing so. However, increasing Θ can also reduce the number of queries used, sometimes without deterioration in accuracy. So there may be some trade-off, but we generally see the best results when Θ is chosen to threshold out a small portion of the initial data (outlier removal). In Table 4.2, we summarize the choice of parameters for each of the data sets in the subsequent sections.

Dataset                        | Θ    | η range (step size)    | p  | k
Circle+Ellipse (Section 4.7.1) | 0.12 | [0.006, 0.036] (0.005) | 15 | 5
Document (Section 4.7.2)       | 0.51 | [0.08, 0.15] (0.002)   | 3  | 25
Salinas (Section 4.7.3)        | 0.32 | [0.21, 0.27] (0.005)   | 3  | 25
Indian Pines (Section 4.7.4)   | 0.08 | [0.03, 0.13] (0.005)   | 5  | 15

Table 4.2: Selected hyperparameter values for our MASC algorithm applied to the data sets in the subsequent sections.

4.7.1 Circle on Ellipse Data

Although the theory in this chapter focuses on the case where the supports of the classes are separated (or at least satisfy a fine structure condition), our MASC algorithm still performs well at classifying data with overlapping supports in the regions without overlap. To illustrate this, we generated a synthetic data set of 1000 points sampled along the arclength of a circle and another 1000 sampled along the arclength of an ellipse with eccentricity 0.79. For each data point, normal noise with standard deviation 0.05 was applied additively and independently to both components. Figure 4.4 shows the true class label for each of the points (left) and the estimated class labels (right).

(a) True labels of the circle and ellipse data. (b) Predicted labels using MASC with 35 queries, achieving 83% accuracy.
Figure 4.4: This figure illustrates the result of applying MASC to a synthetic circle and ellipse data set. On the left are the true labels of the given data, and on the right is the estimation attained by MASC.

We can see that the misclassifications are mostly localized to the area where the supports of the two measures overlap. Near the intersection points of the circle and ellipse, the classification problem becomes extremely difficult due to the high probability that a data point there could have been sampled from either the circle or the ellipse.

4.7.2 Document Data

This numerical example uses the document data set provided by Jensen Baxter through Kaggle [4]. The data set contains 1000 documents in total, 100 each belonging to one of the categories: business, entertainment, food, graphics, historical, medical, politics, space, sport, and technology. For preprocessing, we run the data through the Python sklearn package's TfidfVectorizer function to convert the documents into vectors of length 1684. Then we apply MASC.

In Figure 4.5 we see the results of applying MASC to the document data in two steps. On the left we see the classification task by MASC paused at line 17 of Algorithm 1, before labels have been extended via the nearest-neighbors portion at the end of the algorithm. On the right we see the result of this extension. In Figure 4.6 we see on the left a confusion matrix for the result shown in Figure 4.5, allowing us to see which classes were classified most accurately versus which ones gave more trouble. The largest misclassifications involved documents that were truly "entertainment" but got classified as either "sport" or "technology", and documents which were actually "graphics" but got classified as "medical". On the right of Figure 4.6 we have a plot indicating the resulting accuracy versus
the number of queries which MASC was allowed to make. Naturally, as the number of queries approaches 1000, this plot gradually increases to 100% accuracy. Lastly, in Figure 4.7 we see a side-by-side comparison of the true labels for the document data set versus the predicted labels.

(a) Classification of certain points in the MASC algorithm (before the density estimation extension). (b) Classification of the remaining points using the density estimation extension.
Figure 4.5: This figure illustrates the classification process undergone by MASC on the document data set at two points. On the left, we see the classification of points before the k-nearest neighbors extension. On the right, we see the result after the k-nearest neighbors extension.

(a) Confusion matrix. (b) Plot of MASC accuracy vs. number of allowed query points.
Figure 4.6: Further details on the classification results for the document data set. (Left) Confusion matrix for a single run of the MASC algorithm. (Right) Accuracy of the MASC algorithm vs. the number of queries used.

(a) True labels. (b) Predicted labels.
Figure 4.7: Visual comparison of the true labels (left) versus the predicted labels output by the model (right) for the document data set.

4.7.3 Salinas Hyperspectral Data

This numerical example is done on a subset of the Salinas hyperspectral image data set from [53]. Our subset of the Salinas data set consists of 20034 data vectors of length 204 belonging to 10 of the 16 original classes. Specifically, we took half of the data points at random from each of the first 10 classes of the original data set. For preprocessing, we ran PCA and kept the first 50 components. Then we applied MASC.

In Figure 4.8 we see the results of applying MASC to the Salinas data in two steps. On the left we see the classification task by MASC paused at line 17 of Algorithm 1, before labels have been extended via the nearest-neighbors portion at the end of the algorithm.
At this stage, our algorithm has classified 1518 points with 99.60% accuracy using 261 queries. On the right we see the result of the k-nearest-neighbors extension, where all 20034 points have been classified with 97.11% accuracy. In Figure 4.9 we see a confusion matrix for the result shown in Figure 4.8, allowing us to see which classes were classified the most accurately versus which ones had more trouble. The largest misclassification involved our predicted class 5, which included points from several other classes. Lastly, in Figure 4.10 we see a side-by-side comparison of the true labels for the Salinas data set versus the predicted labels.

4.7.4 Indian Pines Hyperspectral Data

This numerical example is done on a 5-class subset of the Indian Pines hyperspectral image data set from [53]. Our subset of the Indian Pines data set consists of 5971 data vectors of length 200 belonging to classes number 2, 6, 11, 14, and 16 of the 16 original classes. For preprocessing we normalized each vector. Then we implemented MASC.

(a) Classification of certain points in the MASC algorithm (before the k-nearest-neighbors extension). (b) Classification of the remaining points using the k-nearest-neighbors extension.
Figure 4.8: This figure illustrates the classification process undergone by MASC at two points on the Salinas hyperspectral data set. On the left, we see the classification of points before the k-nearest-neighbors extension. On the right, we see the result after the k-nearest-neighbors extension.

Figure 4.9: Confusion matrix for a single run of the MASC algorithm on Salinas.

In Figure 4.11 we see the results of applying MASC to the Indian Pines data in two steps. On the left we see the classification task by MASC paused at line 17 of Algorithm 1, before labels have been extended via the nearest-neighbor portion at the end of the algorithm. On the right we see the result of the k-nearest-neighbors extension.
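The nearest-neighbor extension step mentioned above (the final portion of Algorithm 1) can be sketched with scikit-learn's k-NN classifier; the two-cluster data and the choice k = 3 are illustrative, not the settings used in the experiments.

```python
# Hedged sketch of the k-nearest-neighbors label extension: a k-NN model
# fit on the points already labeled (by queries) extends labels to the
# remainder of the data set. Cluster locations, sizes, and k are
# illustrative choices.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
centers = np.repeat([[0.0, 0.0], [5.0, 5.0]], 20, axis=0)
labeled = centers + rng.standard_normal((40, 2))   # points with known labels
labels = np.repeat([0, 1], 20)
unlabeled = rng.standard_normal((10, 2)) + 5.0     # remainder, near class 1

knn = KNeighborsClassifier(n_neighbors=3).fit(labeled, labels)
extended = knn.predict(unlabeled)
print(extended.tolist())  # all points near (5, 5) receive label 1
```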
In Figure 4.12 we see a confusion matrix for the result shown in Figure 4.11, allowing us to see which classes were classified the most accurately versus which ones had more trouble. As we can see from the confusion matrix, the largest error comes from distinguishing class 2 from class 11 and vice versa. These classes correspond to portions of the images belonging to corn-notill and soybean-mintill. Lastly, in Figure 4.13 we see a side-by-side comparison of the true labels for the Indian Pines data set versus the predicted labels.

(a) True labels. (b) Predicted labels.
Figure 4.10: Visual comparison of the true labels (left) versus the predicted labels output by the model (right) for the Salinas hyperspectral data set.

(a) Classification of certain points in the MASC algorithm (before the k-nearest-neighbors extension). (b) Classification of the remaining points using the k-nearest-neighbors extension.
Figure 4.11: This figure illustrates the classification process undergone by MASC at two points on the Indian Pines hyperspectral data set. On the left, we see the classification of points before the k-nearest-neighbors extension. On the right, we see the result after the k-nearest-neighbors extension.

Figure 4.12: Confusion matrix for the result of MASC applied to Indian Pines.

(a) True labels. (b) Predicted labels.
Figure 4.13: Visual comparison of the true labels (left) versus the predicted labels output by the model (right) for the Indian Pines hyperspectral data set.

4.7.5 Comparison With LAND and LEND

We compare our method with the LAND [74] algorithm and its boosted variant, LEND [133]. In Figure 4.14, we see the resulting accuracy that each algorithm achieves on both Salinas and Indian Pines for various query budgets. On the left, we observe that our method achieves an accuracy comparable to both LAND and LEND at around 50 queries, then gradually surpasses the accuracy of LAND as the number of queries exceeds around 200.
On the right, our method achieves a lower accuracy for a small number of queries, but then outperforms both LAND and LEND after the budget exceeds about 60 queries. The query budgets were determined by how many queries were used at various η levels of the while loop in the MASC Algorithm 1. We then forced the nearest-neighbors portion of the MASC algorithm to extend labels to the remainder of the data set at each such level, which is what is shown in the plot.

A separate aspect of comparison involves the run-time of the algorithms. In Table 4.3, we see that while LEND has the highest accuracy on the Salinas data set with 261 queries, it takes significantly longer than the other two methods to attain this result. Of the three methods, MASC has the quickest run-time at 110.8s, achieving a better accuracy than LAND in less time. In Table 4.4, we see that MASC produces both the best result and the fastest run-time for the case of 211 queries on the Indian Pines data set.

Comparison of MASC with LAND and LEND on the Salinas subset:

Salinas     MASC     LAND     LEND
Accuracy    97.1%    95.7%    99.2%
Run-time    110.8s   190.0s   669.1s

Table 4.3: Comparison between MASC, LAND, and LEND on the Salinas data set using 261 queries.

Comparison of MASC with LAND and LEND on the Indian Pines subset:

Indian Pines   MASC    LAND    LEND
Accuracy       84.4%   79.5%   82.8%
Run-time       15.5s   19.6s   97.6s

Table 4.4: Comparison between MASC, LAND, and LEND on the Indian Pines data set using 211 queries.

When deciding which algorithm to use for an active learning classification task, one has to consider the trade-off between query budget/cost, computation time, and accuracy. Our initial results indicate that if the query cost is not so high compared to the run-time of the algorithm, then one may elect to use MASC, with its lower run-time, and simply query more points. However, if the query cost is high compared to the run-time, then one may instead elect to use an algorithm like LEND.
The comparison results in this section are not meant to give an exhaustive depiction of which algorithm to use in every case, only to illustrate that on two data sets of interest, MASC performs competitively with the existing methods in terms of accuracy, run-time, or both.

(a) Plot of accuracy vs. number of query points for Salinas. (b) Plot of accuracy vs. number of query points for Indian Pines.
Figure 4.14: Plots indicating the accuracy of MASC, LAND, and LEND for different query budgets, for both Salinas (left) and Indian Pines (right).

Comparison of MASC with semi-supervised learning on randomly queried points:

                                             Document Data   Salinas Data
Number of queries                            51              261
MASC accuracy (%)                            87.3            97.11
Mean accuracy of the random trials (%)       70.2            87.5
Standard deviation of the random trials (%)  3.72            1.06
Standard deviations below MASC accuracy      4.60            9.03

Table 4.5: Comparison results between MASC and 10 kNN semi-supervised learning trials with randomly queried data on the Document and Salinas data sets. The mean and standard deviation of the 10 trials in each case are shown, as well as how many standard deviations above these means MASC achieves in the active learning setting.

4.7.6 Active Versus Semi-Supervised Learning

The bulk of the MASC algorithm serves to decide how to cluster the data and query points effectively. An important question to ask is: do we ultimately query useful points? To analyze this question, we compare the results using MASC to those using the same semi-supervised approach drawn from lines 17-21 of Algorithm 1. We look at semi-supervised learning on the document data set with 51 random queries and on the Salinas data set with 261 random queries. In Table 4.5 we can see the results from running this experiment 10 times on each data set.
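The "standard deviations below" figures in Table 4.5 amount to a z-score computation, which can be sketched as follows; the inputs are the (rounded) Salinas-column entries of the table, so the result differs slightly from the reported 9.03, which comes from the unrounded trial statistics.

```python
# Z-score sketch for Table 4.5: how many standard deviations the mean
# random-query accuracy lies below the MASC accuracy (Salinas column).
masc_accuracy = 97.11
random_mean = 87.5     # mean of 10 random-query trials
random_std = 1.06      # standard deviation of those trials

z = (masc_accuracy - random_mean) / random_std
print(round(z, 2))  # ~9.07 from these rounded inputs; the text reports 9.03
```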
On average, the results based on using random queries for lines 17-21 of Algorithm 1 performed 4.60 and 9.03 standard deviations worse than MASC for the document and Salinas data sets respectively. Both cases show that proceeding by querying points randomly will, with high probability, achieve an accuracy much lower than MASC.

4.8 Proofs

In this section we give proofs for our main results in Section 4.5. We assume that $X := \mathrm{supp}(\mu) \subseteq M$ and $n \ge 1$ is given. Essential to our theory is the construction of an integral support estimator:

\[\sigma_n(x) := \int_X \Psi_n(x, y)\, d\mu(y). \tag{4.8.1}\]

We also define the following two associated values, which will be important:

\[I_n := \max_{x \in M} |\sigma_n(x)|, \qquad J_n := \min_{x \in X} |\sigma_n(x)|. \tag{4.8.2}\]

Informally, we expect the evaluation of $\sigma_n(x)/I_n$ to give us an estimate of whether or not the point $x$ belongs to $X$. We encode this intuition by setting a thresholding (hyper)parameter $\theta > 0$ in a support estimation set:

\[S_n(\theta) := \{x \in M : \sigma_n(x) \ge 4\theta I_n\}. \tag{4.8.3}\]

When the measure $\mu$ is detectable, we show that $S_n(\theta)$ is an estimate of the support of $\mu$ (Theorem 4.8.1). When the measure $\mu$ has a fine structure, we show that $S_n(\theta)$ is partitioned exactly into $K_\eta$ separated components, and each component estimates the support of the corresponding partition $S_{k,\eta}$ (Theorem 4.8.2). These results then give us the ability to estimate the classification ability in the discrete setting via probabilistic results, as we investigate in Section 4.8.2.

4.8.1 Measure Support Estimation

In this section we develop key results to estimate the supports of measures defined on a continuum. We first start with a useful lemma giving upper and lower bounds on $I_n$ and $J_n$ respectively. Additionally, for any given $x \in M$, we determine a bound for the integral of $\Psi_n$ taken over points away from $x$.

Lemma 4.8.1. Let $n \ge 1$ and $S > \alpha$.
Then there exist $C_1, C_2 > 0$ (depending on $\alpha, S, h$) such that

\[I_n = \max_{x \in M} |\sigma_n(x)| \le C_1 n^{2-\alpha} \tag{4.8.4}\]

and

\[J_n = \min_{x \in X} |\sigma_n(x)| \ge C_2 n^{2-\alpha}. \tag{4.8.5}\]

In particular, $C_1 \ge C_2$. For $d > 0$ and any $x \in M$,

\[\int_{M \setminus B(x,d)} \Psi_n(x, y)\, d\mu(y) \le \frac{C_1 n^{2-\alpha}}{\max(1, (nd)^{S-\alpha})}. \tag{4.8.6}\]

In order to prove this lemma, we first recall a consequence of the Bernstein inequality for trigonometric polynomials ([103], Chapter III, Section 3, Theorem 1 and Lemma 5).

Lemma 4.8.2. Let $T$ be a trigonometric polynomial of order $< 2n$. Then

\[\|T'\| = \max_{x \in \mathbb{T}} |T'(x)| \le 2n \max_{x \in \mathbb{T}} |T(x)| = 2n \|T\|. \tag{4.8.7}\]

Moreover, if $|T(x_0)| = \|T\|$, then

\[|T(x)| \ge \|T\| \cos(2n(x - x_0)), \qquad |(x - x_0) \bmod 2\pi| \le \pi/(2n). \tag{4.8.8}\]

The following corollary gives a consequence of this lemma for the kernel $\Psi_n$, which will be used often in this chapter.

Corollary 4.8.1. Let $x, y, z, w \in M$ and $n \ge 1$. Then there are constants $c, C_0$ such that

\[c n^2 \le \Psi_n(x, x) \le C_0 n^2. \tag{4.8.9}\]

Moreover,

\[\Psi_n(x, y) \le \Psi_n(x, x) \sim n^2, \tag{4.8.10}\]

\[|\Psi_n(x, y) - \Psi_n(z, w)| \lesssim n^3 \{\rho(x, z) + \rho(y, w)\}, \tag{4.8.11}\]

and

\[|\Psi_n(x, y)| \gtrsim n^2 \qquad \text{for } \rho(x, y) \le \pi/(6n). \tag{4.8.12}\]

Proof. The estimate (4.8.9) follows from the fact that

\[\Psi_n(x, x) = \Phi_n(0)^2 = \Big(\sum_\ell h(\ell/n)\Big)^2 \sim n^2,\]

where the last estimate is easy to see using Riemann sums for $\int h(t)\, dt$. We observe that $\Phi_n^2$ is a trigonometric polynomial, and it is clear that $|\Phi_n(t)|^2 \le \Phi_n(0)^2$. Consequently, (4.8.10) follows from the definition of $\Psi_n$. The estimate (4.8.11) is easy to deduce from the fact that $\|(\Phi_n^2)'\| \lesssim n^3$, so that

\[\begin{aligned} |\Psi_n(x, y) - \Psi_n(z, w)| &\le |\Phi_n^2(\rho(x, y)) - \Phi_n^2(\rho(z, w))| \\ &\le |\Phi_n^2(\rho(x, y)) - \Phi_n^2(\rho(z, y))| + |\Phi_n^2(\rho(z, y)) - \Phi_n^2(\rho(z, w))| \\ &\lesssim n^3 \{|\rho(x, y) - \rho(z, y)| + |\rho(z, y) - \rho(z, w)|\} \\ &\lesssim n^3 \{\rho(x, z) + \rho(y, w)\}. \end{aligned}\]
(4.8.13)

The estimate (4.8.12) follows from (4.8.8) and the definition of $\Psi_n$. ■

Proof of Lemma 4.8.1. We proceed by examining concentric annuli. Let $x \in M$ be fixed, and set $A_0 = B(x, d)$ and $A_k = B(x, 2^k d) \setminus B(x, 2^{k-1} d)$ for every $k \ge 1$. First suppose that $nd \ge 1$. Then by (4.4.2) and (4.4.4), we deduce

\[\int_{M \setminus B(x,d)} \Psi_n(x, y)\, d\mu(y) = \sum_{k=1}^\infty \int_{A_k} \Psi_n(x, y)\, d\mu(y) \lesssim \sum_{k=1}^\infty \frac{\mu(A_k)\, n^2}{\max(1, 2^{k-1} dn)^S} \lesssim \sum_{k=1}^\infty \frac{2^{k\alpha} d^\alpha n^2}{2^{S(k-1)} (dn)^S} \lesssim n^{2-\alpha} (nd)^{\alpha-S} \sum_{k=1}^\infty 2^{k(\alpha-S)} \lesssim n^{2-\alpha} (nd)^{\alpha-S}. \tag{4.8.14}\]

If $nd = 1$, we observe

\[\int_{A_0} \Phi_n(\rho(x, y))^2\, d\mu(y) \lesssim \mu(A_0)\, n^2 \lesssim d^\alpha n^2 = n^{2-\alpha}. \tag{4.8.15}\]

Combining (4.8.14) and (4.8.15) when $nd = 1$ yields (4.8.4). When $dn \le 1$, we see

\[\int_{M \setminus B(x,d)} \Psi_n(x, y)\, d\mu(y) \le I_n \lesssim n^{2-\alpha}. \tag{4.8.16}\]

Together with (4.8.14), this completes the proof of (4.8.6). There is no loss of generality in using the same constant $C_1$ in both of these estimates. We see, in view of (4.8.10), (4.8.12), and the detectability of $\mu$, that if $x \in X$ it follows that

\[\int_X \Psi_n(x, y)\, d\mu(y) \gtrsim \int_{B(x, \pi/(6n))} n^2\, d\mu(y) \gtrsim n^{2-\alpha}, \tag{4.8.17}\]

demonstrating (4.8.5) and completing the proof. ■

Theorem 4.8.1. Let $\mu$ be detectable and $S > \alpha$. If $\theta \le C_2/(4C_1)$, then by setting

\[d(\theta) = \Big(\frac{C_1}{C_2 \theta}\Big)^{1/(S-\alpha)}, \tag{4.8.18}\]

it follows that (cf. (4.8.3))

\[X \subseteq S_n(\theta) \subseteq B(X, d(\theta)/n). \tag{4.8.19}\]

Proof. From (4.8.4) and (4.8.5), we see that for any $x \in X$,

\[\sigma_n(x) \ge J_n \ge \frac{C_2 I_n}{C_1}. \tag{4.8.20}\]

With our assumption $\theta \le C_2/(4C_1)$, this proves the inclusion

\[X \subseteq S_n(\theta). \tag{4.8.21}\]

Note that $C_1^2/C_2^2 \ge 1 > 1/4$, so that $\theta \le C_2/(4C_1) < C_1/C_2$, and hence $d(\theta) > 1$. Then, for any $x \in M$ such that $\mathrm{dist}(x, X) \ge d(\theta)/n$, we have by (4.8.6) that

\[\sigma_n(x) \le \int_{M \setminus B(x, d(\theta)/n)} \Psi_n(x, y)\, d\mu(y) \le C_1 n^{2-\alpha} / d(\theta)^{S-\alpha} \le \theta C_2 n^{2-\alpha} \le \theta I_n.\]
(4.8.22)

This demonstrates the inclusion

\[S_n(\theta) \subseteq B(X, d(\theta)/n), \tag{4.8.23}\]

completing the proof. ■

Theorem 4.8.2. Assume the setup of Theorem 4.8.1 and suppose $\mu$ has a fine structure. Define

\[S_{k,\eta,n}(\theta) := S_n(\theta) \cap B(S_{k,\eta}, d(\theta)/n). \tag{4.8.24}\]

Let $n \ge 2 d(\theta)/\eta$, $\mu(S_{K_\eta+1,\eta}) \le \frac{C_2}{C_0}\, \theta\, n^{-\alpha}$, and $j, k = 1, \ldots, K_\eta$ with $j \ne k$. Then

\[S_n(\theta) = \bigcup_{k=1}^{K_\eta} S_{k,\eta,n}(\theta) \tag{4.8.25}\]

and

\[\mathrm{dist}(S_{j,\eta,n}(\theta), S_{k,\eta,n}(\theta)) \ge \eta. \tag{4.8.26}\]

Furthermore,

\[X \cap B(S_{k,\eta}, d(\theta)/n) \subseteq S_{k,\eta,n}(\theta) \subseteq B(S_{k,\eta}, d(\theta)/n). \tag{4.8.27}\]

Proof. The first inclusion in (4.8.27) is satisfied by (4.8.19), and the second is satisfied by the definition of $S_{k,\eta,n}$. In view of the assumption $\eta \ge 2 d(\theta)/n$ and Definition 4.4.2, we see that

\[\mathrm{dist}\big(B(S_{j,\eta}, d(\theta)/n), B(S_{k,\eta}, d(\theta)/n)\big) \ge \eta \tag{4.8.28}\]

for any $j \ne k$. Since $S_{k,\eta,n}(\theta) \subseteq B(S_{k,\eta}, d(\theta)/n)$, it follows that the separation condition (4.8.26) must also be satisfied. Now it remains to show (4.8.25). Let us define, in this proof only,

\[S = \bigcup_{k=1}^{K_\eta} S_{k,\eta}. \tag{4.8.29}\]

It is clear from (4.8.24) that $\bigcup_{k=1}^{K_\eta} S_{k,\eta,n}(\theta) \subseteq S_n(\theta)$. We note that for any $x \in M \setminus B(S, d(\theta)/n)$, we have $\mathrm{dist}(x, S) \ge d(\theta)/n$, and as a result

\[\begin{aligned} \sigma_n(x) &= \int_{S_{K_\eta+1}} \Psi_n(x, y)\, d\mu(y) + \int_S \Psi_n(x, y)\, d\mu(y) \\ &\le C_0 n^2 \mu(S_{K_\eta+1}) + \int_{S \setminus B(x, d(\theta)/n)} \Psi_n(x, y)\, d\mu(y) && \text{(by (4.8.10))} \\ &\le C_2 n^{2-\alpha} \theta + C_1 n^{2-\alpha} d(\theta)^{\alpha-S} && \text{(by the assumption on } \mu(S_{K_\eta+1}) \text{ and (4.8.6))} \\ &\le 2 C_2 n^{2-\alpha} \theta \le 2 J_n \theta && \text{(by (4.8.18))} \\ &\le 2\theta I_n. \end{aligned} \tag{4.8.30}\]

Thus $x \notin S_n(\theta)$ and, equivalently, $S_n(\theta) \subseteq B(S, d(\theta)/n)$. Therefore $S_n(\theta) = S_n(\theta) \cap B(S, d(\theta)/n) = \bigcup_{k=1}^{K_\eta} S_{k,\eta,n}(\theta)$, completing the proof. ■

4.8.2 Discretization

In this section we relate the continuous support estimator and estimation sets to the discrete analogues based on randomly sampled data.
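The passage to sampled data can be previewed with a toy computation: a kernel sum over samples stands in for the integral estimator $\sigma_n$ of (4.8.1), and thresholding against its maximum, as in (4.8.3), recovers the support. The Gaussian kernel, the circle data, and the threshold $\theta$ below are illustrative stand-ins, not the chapter's kernel $\Psi_n$ or its constants.

```python
# Toy numerical version of the support estimator (4.8.1) and the
# threshold set (4.8.3), with samples replacing the integral. A Gaussian
# kernel is an illustrative stand-in for the localized kernel Psi_n,
# and theta is an arbitrary illustrative choice.
import numpy as np

rng = np.random.default_rng(2)
angles = rng.uniform(0.0, 2.0 * np.pi, 400)
samples = np.column_stack([np.cos(angles), np.sin(angles)])  # supp(mu): unit circle

def sigma_hat(x, n=20.0):
    # discrete surrogate for sigma_n(x) = integral of Psi_n(x, .) d mu
    d2 = ((samples - x) ** 2).sum(axis=1)
    return float(np.exp(-n * d2).mean())

on_support = sigma_hat(np.array([1.0, 0.0]))   # point on the circle
off_support = sigma_hat(np.array([0.0, 0.0]))  # circle center, off the support
I_n = max(on_support, off_support)
theta = 0.1
flags = [on_support >= theta * I_n, off_support >= theta * I_n]
print(flags)  # [True, False]: only the on-circle point passes the threshold
```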
The conclusion of this section will be the proofs of the theorems from Section 4.5. To aid us in this process, we first state a consequence of the Bernstein concentration inequality as a proposition.

Proposition 4.8.1. Let $X_1, \ldots, X_M$ be independent real-valued random variables such that for each $j = 1, \ldots, M$, $|X_j| \le R$ and $\mathbb{E}(X_j^2) \le V$. Then for any $t > 0$,

\[\mathrm{Prob}\left(\Big|\frac{1}{M} \sum_{j=1}^M (X_j - \mathbb{E}(X_j))\Big| \ge V t / R\right) \le 2 \exp\left(-\frac{M V t^2}{2 R^2 (1 + t)}\right). \tag{4.8.31}\]

Information on, and a derivation of, the Bernstein concentration inequality are standard in many probability texts; we list [12, Sections 2.1, 2.7] as a reference. It is natural to use Proposition 4.8.1 with $X_j = \Psi_n(x, x_j)$. This would yield the desired bound for any single value of $x \in M$. However, to get an estimate on the supremum norm of the difference $F_n - \sigma_n$, we need to find an appropriate net for $M$ (and estimate its size) so that the point where this supremum is attained is within the right ball around one of the points of the net. Usually, this is done via a Bernstein inequality for the gradients of the objects involved. In the absence of any differentiability structure on $M$, we need a more elaborate argument. The following proposition is a consequence of [85, Theorem 7.2], and asserts the existence of a partition of $X$ satisfying properties which will be helpful in proving our main results.

Proposition 4.8.2. Let $\delta > 0$. There exists a partition $\{Y_k\}_{k=1}^N$ of $X$ such that for each $k$, $\mathrm{diam}(Y_k) \le 36\delta$ and $\mu(Y_k) \sim \delta^\alpha$. In particular, $N \lesssim \delta^{-\alpha}$.

Recall that we denote our data by $D = \{x_j\}_{j=1}^M$, where each $x_j$ is sampled at random from $\mu$. In the sequel, we let $D_k = D \cap Y_k$, $k = 1, \ldots, N$. The following lemma gives an estimate for $|D_k|$.

Lemma 4.8.3. Let $0 < \delta, \epsilon, t < 1$.
If

\[M \gtrsim t^{-2} \delta^{-\alpha} \log(c/(\epsilon \delta^\alpha)), \tag{4.8.32}\]

then

\[\mathrm{Prob}\left(\max_{1 \le k \le N} \Big|\frac{\mu(Y_k) M}{|D_k|} - 1\Big| \ge t\right) \lesssim \epsilon. \tag{4.8.33}\]

Proof. Let $k$ be fixed and, in this proof only, let $\chi_k$ denote the characteristic function of $Y_k$. Thought of as a random variable, it is clear that $|\chi_k| \le 1$ and $\int \chi_k(z)\, d\mu(z) = \int \chi_k(z)^2\, d\mu(z) = \mu(Y_k)$. Moreover, $\sum_{j=1}^M \chi_k(x_j) = |D_k|$. So we may apply Proposition 4.8.1, and recall that $\mu(Y_k) \sim \delta^\alpha$, to conclude that

\[\mathrm{Prob}\left(\Big|\frac{|D_k|}{M} - \mu(Y_k)\Big| \ge \frac{\mu(Y_k)\, t}{1 + t}\right) = \mathrm{Prob}\left(\Big|\frac{|D_k|}{M \mu(Y_k)} - 1\Big| \ge \frac{t}{1 + t}\right) \le 2 \exp\left(-\frac{M \mu(Y_k)\, t^2}{(1 + t)(1 + 2t)}\right) \le 2 \exp(-c M \delta^\alpha t^2). \tag{4.8.34}\]

(In the last estimate, we have used the fact that for $0 < t < 1$, $(1 + t)(1 + 2t) \sim 1$.) Next, we observe that

\[\Big|\frac{M \mu(Y_k)}{|D_k|} - 1\Big| = \frac{M \mu(Y_k)}{|D_k|} \Big|\frac{|D_k|}{M \mu(Y_k)} - 1\Big|.\]

So, if $\big| |D_k|/(M \mu(Y_k)) - 1 \big| < t/(1+t)$, then $|D_k|/(M \mu(Y_k)) \ge 1/(1+t)$, and hence $\big| M \mu(Y_k)/|D_k| - 1 \big| < t$. Thus, for every $k$,

\[\mathrm{Prob}\left(\Big|\frac{M \mu(Y_k)}{|D_k|} - 1\Big| \ge t\right) \le \mathrm{Prob}\left(\Big|\frac{|D_k|}{M \mu(Y_k)} - 1\Big| \ge \frac{t}{1 + t}\right) \le 2 \exp(-c M \delta^\alpha t^2). \tag{4.8.35}\]

Since the number of elements $Y_k$ in the partition is $\lesssim \delta^{-\alpha}$, we conclude that

\[\mathrm{Prob}\left(\max_{k \in [N]} \Big|\frac{M \mu(Y_k)}{|D_k|} - 1\Big| \ge t\right) \lesssim \delta^{-\alpha} \exp(-c M \delta^\alpha t^2). \tag{4.8.36}\]

We set the right-hand side of the above inequality equal to $\epsilon$ and solve for $M$ to complete the proof. ■

In order to prove the bounds we want in Lemma 4.8.5, we rely on a function which estimates both our discrete and continuous measure support estimators $F_n$ and $\sigma_n$. We define this function as

\[H_n(x) := \sum_{k=1}^N \frac{\mu(Y_k)}{|D_k|} \sum_{x_j \in D_k} \Psi_n(x, x_j). \tag{4.8.37}\]

The following lemma relates this function to our continuous measure support estimator.

Lemma 4.8.4. Let $0 < \gamma < 2$ and $n \ge 2$.
There exists a constant $c(\gamma)$ with the following property. Suppose $0 < \delta \le c(\gamma)/n$, let $\{Y_k\}$ be a partition of $X$ as in Proposition 4.8.2, and continue the notation above. We have

\[\max_{x \in M} |H_n(x) - \sigma_n(x)| \le (\gamma/2)\, I_n. \tag{4.8.38}\]

Proof. In this proof, all the constants denoted by $c_1, c_2, \ldots$ will retain their values. Let $x \in M$. We will fix $\delta$, to be chosen later. Also, let $r \ge \delta$ be a parameter to be chosen later,

\[\mathcal{N} = \{k : \mathrm{dist}(x, Y_k) < r\}, \qquad \mathcal{L} = \{k : \mathrm{dist}(x, Y_k) \ge r\},\]

and, for $j = 0, 1, \ldots$,

\[\mathcal{L}_j = \{k : 2^j r \le \mathrm{dist}(x, Y_k) < 2^{j+1} r\}.\]

In view of (4.8.11), we have for $k \in \mathcal{N}$ and $x_j \in D_k$,

\[\Big|\mu(Y_k) \Psi_n(x, x_j) - \int_{Y_k} \Psi_n(x, y)\, d\mu(y)\Big| \le \int_{Y_k} |\Psi_n(x, x_j) - \Psi_n(x, y)|\, d\mu(y) \lesssim n^3 \int_{Y_k} \rho(x_j, y)\, d\mu(y) \le c_1 n^3\, \mathrm{diam}(Y_k)\, \mu(Y_k). \tag{4.8.39}\]

Consequently, for $k \in \mathcal{N}$,

\[\Big|\frac{\mu(Y_k)}{|D_k|} \sum_{x_j \in D_k} \Psi_n(x, x_j) - \int_{Y_k} \Psi_n(x, y)\, d\mu(y)\Big| \le c_1 n^3\, \mathrm{diam}(Y_k)\, \mu(Y_k). \tag{4.8.40}\]

Since $\bigcup_{k \in \mathcal{N}} Y_k \subseteq B(x, r)$, we have

\[\sum_{k \in \mathcal{N}} \mu(Y_k) = \mu\Big(\bigcup_{k \in \mathcal{N}} Y_k\Big) \le \mu(B(x, r)) \lesssim r^\alpha. \tag{4.8.41}\]

We deduce from (4.8.40), (4.8.41), and the fact that $\mathrm{diam}(Y_k) \lesssim \delta$, that

\[\Big|\sum_{k \in \mathcal{N}} \Big(\frac{\mu(Y_k)}{|D_k|} \sum_{x_j \in D_k} \Psi_n(x, x_j) - \int_{Y_k} \Psi_n(x, y)\, d\mu(y)\Big)\Big| \le c_3 n^3 \delta r^\alpha = c_3 (n\delta)(nr)^\alpha n^{2-\alpha}. \tag{4.8.42}\]

Next, let $k \in \mathcal{L}_j$ for some $j \ge 0$. Then the localization estimate (4.4.2) shows that for any $x_j \in D_k$,

\[\Big|\mu(Y_k) \Psi_n(x, x_j) - \int_{Y_k} \Psi_n(x, y)\, d\mu(y)\Big| \le c_4 n^2 (2^j n r)^{-S} \mu(Y_k),\]

so that

\[\Big|\frac{\mu(Y_k)}{|D_k|} \sum_{x_j \in D_k} \Psi_n(x, x_j) - \int_{Y_k} \Psi_n(x, y)\, d\mu(y)\Big| \le c_4 n^2 \mu(Y_k) (2^j n r)^{-S}. \tag{4.8.43}\]

Arguing as in the derivation of (4.8.41), we deduce that $\mu\big(\bigcup_{k \in \mathcal{L}_j} Y_k\big) \lesssim (2^j r)^\alpha$.
Since $S > \alpha$, we deduce that if $r \ge 1/n$, then

\[\Big|\sum_{k \in \mathcal{L}} \Big(\frac{\mu(Y_k)}{|D_k|} \sum_{x_j \in D_k} \Psi_n(x, x_j) - \int_{Y_k} \Psi_n(x, y)\, d\mu(y)\Big)\Big| \le \sum_{j=0}^\infty \sum_{k \in \mathcal{L}_j} \Big|\frac{\mu(Y_k)}{|D_k|} \sum_{x_j \in D_k} \Psi_n(x, x_j) - \int_{Y_k} \Psi_n(x, y)\, d\mu(y)\Big| \le c_4 n^{2-\alpha} (nr)^{\alpha-S} \sum_{j=0}^\infty 2^{j(\alpha-S)} \le c_5 n^{2-\alpha} (nr)^{\alpha-S}. \tag{4.8.44}\]

Since $S > \alpha$, we may choose $r \sim 1/n$, with $nr$ a sufficiently large constant depending on $\gamma$, such that $c_5 (nr)^{\alpha-S} \le \gamma/4$, and then require $\delta \le \min(r, c_6(\gamma)/n)$ so that in (4.8.42), $c_3 (nr)^\alpha n\delta \le \gamma/4$. Then, recalling that (cf. (4.8.4)) $I_n \sim n^{2-\alpha}$, (4.8.44) and (4.8.42) lead to (4.8.38). ■

In the following lemma, we establish a connection between the sum $F_n$ as defined in (4.5.1) and the value $I_n$ from Section 4.8.1. Since we have already established the bound between $H_n$ and $\sigma_n$, we focus on the bound between $H_n$ and $F_n$ in this lemma.

Lemma 4.8.5. Let $n \ge 2$, $0 < \beta < 2$, $M \gtrsim_\beta n^\alpha \log n$, and let $D = \{x_j\}_{j=1}^M$ be independent random samples from a detectable measure $\mu$. Then with probability $\ge 1 - 1/n$, we have, for all $x \in M$,

\[|F_n(x) - \sigma_n(x)| \le \beta I_n. \tag{4.8.45}\]

Consequently,

\[(1 - \beta) I_n \le \max_{x \in M} F_n(x) \le (1 + \beta) I_n. \tag{4.8.46}\]

Proof. Let $\gamma \in (0, 1)$, to be chosen later, and let $\{Y_k\}$ be a partition as in Lemma 4.8.4. In view of Lemma 4.8.4, we see that

\[(1 - \gamma/2) I_n \le \max_{x \in M} H_n(x) \le (1 + \gamma/2) I_n. \tag{4.8.47}\]

In view of Lemma 4.8.3, we see that for $M \ge c(\gamma) n^\alpha \log n$, we have, with probability $\ge 1 - 1/n$,

\[\max_k \Big|\frac{\mu(Y_k) M}{|D_k|} - 1\Big| \le \gamma/2. \tag{4.8.48}\]

In particular,

\[1 - \gamma/2 \le \frac{\mu(Y_k) M}{|D_k|} \le 1 + \gamma/2. \tag{4.8.49}\]

Hence, (4.8.47) leads to

\[\max_{x \in M} F_n(x) \le \frac{2 + \gamma}{2 - \gamma}\, I_n. \tag{4.8.50}\]

Using (4.8.48) again, we see that

\[|F_n(x) - H_n(x)| \le (\gamma/2) \max_{x \in M} F_n(x) \le (\gamma/2)\, \frac{2 + \gamma}{2 - \gamma}\, I_n.\]
(4.8.51)

Together with (4.8.38), this implies that

\[|F_n(x) - \sigma_n(x)| \le (\gamma/2)\Big(1 + \frac{2 + \gamma}{2 - \gamma}\Big) I_n = \frac{2\gamma}{2 - \gamma}\, I_n. \tag{4.8.52}\]

We now choose $\gamma = 2\beta/(2 + \beta)$, so that the right-hand side of (4.8.52) is $\beta I_n$. We can verify that $0 < \gamma < 1$ whenever $0 < \beta < 2$. ■

The following lemma gives bounds on $\max_{k \in [M]} F_n(x_k)$, which is a crucial component of the proof involving our finite-data support estimation set $G_n(\Theta)$. We note briefly that the critical difference between Lemma 4.8.5 and Lemma 4.8.6 is that the former considers the maximum of $F_n$ over the entire metric space $M$, while the latter considers the maximum over the finite set of data points sampled from the measure $\mu$.

Lemma 4.8.6. Let $D = \{x_j\}_{j=1}^M$ be independent random samples from a detectable measure $\mu$. If $0 < \beta < C_2/C_1$ and $M \gtrsim_\beta n^\alpha \log n$, then with probability $\ge 1 - 1/n$ we have

\[\Big(\frac{C_2}{C_1} - \beta\Big) I_n \le \max_{k \in [M]} F_n(x_k) \le (1 + \beta) I_n. \tag{4.8.53}\]

Proof. Necessarily, $D \subseteq X$. Using (4.8.5) and (4.8.45), we deduce that

\[\max_{k \in [M]} F_n(x_k) \ge \max_{k \in [M]} \sigma_n(x_k) - \beta I_n \ge J_n - \beta I_n \ge C_2 n^{2-\alpha} - \beta I_n \ge \Big(\frac{C_2}{C_1} - \beta\Big) I_n. \tag{4.8.54}\]

This proves the first inequality in (4.8.53). The second inequality follows from (4.8.46), since $\max_{k \in [M]} F_n(x_k) \le \max_{x \in M} F_n(x)$. ■

Now, with the prepared bounds on $\max_{k \in [M]} F_n(x_k)$, we give a theorem from which Theorems 4.5.1 and 4.5.2 are direct consequences. In the sequel, we will denote $C_2/C_1$ by $C^*$.

Theorem 4.8.3. Let $n \ge 2$, and let $D = \{x_j\}_{j=1}^M$ be sampled from the detectable probability measure $\mu$. Let $0 < \Theta < 1$. If $M \gtrsim n^\alpha \log n$, then with probability at least $1 - c_1 M^{-c_2}$ we have

\[S_n\Big(\frac{(1 + C^*)\Theta}{4}\Big) \subseteq G_n(\Theta, D) \subseteq S_n(C^*\Theta/8). \tag{4.8.55}\]

Proof. In this proof, we will invoke Lemmas 4.8.5 and 4.8.6 with $\beta = C^*\Theta/2$. For this proof only, define

\[t = \frac{2\Theta + C^*\Theta(1 + \Theta)}{8} = \frac{(1 + \beta)\Theta + \beta}{4}, \tag{4.8.56}\]

and suppose $x \in S_n(t)$; i.e., $\sigma_n(x) \ge 4 t I_n$.
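The algebra behind the definition (4.8.56), together with the identity $4t - \beta = (1 + \beta)\Theta$ that drives the subsequent chain of inequalities, can be checked mechanically; the values of $C^*$ and $\Theta$ below are illustrative.

```python
# Sanity check of (4.8.56): with beta = C* Theta / 2, the two expressions
# for t coincide, and 4t - beta = (1 + beta) * Theta. The values of
# C_star and Theta are illustrative.
for C_star in (0.25, 0.5, 1.0):
    for Theta in (0.1, 0.5, 0.9):
        beta = C_star * Theta / 2
        t1 = (2 * Theta + C_star * Theta * (1 + Theta)) / 8
        t2 = ((1 + beta) * Theta + beta) / 4
        assert abs(t1 - t2) < 1e-12
        assert abs((4 * t2 - beta) - (1 + beta) * Theta) < 1e-12
print("ok")
```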
Then with probability $\ge 1 - c_1 M^{-c_2}$ we have

\[\begin{aligned} \Theta \max_{k \in [M]} F_n(x_k) = \frac{4t - \beta}{1 + \beta} \max_{k \in [M]} F_n(x_k) &\le (4t - \beta) I_n && \text{(from (4.8.53))} \\ &\le \sigma_n(x) - \beta I_n && \text{(from (4.8.3))} \\ &\le F_n(x). && \text{(from (4.8.45))} \end{aligned} \tag{4.8.57}\]

This results in the first inclusion in (4.8.55), because $\Theta \le 1$ implies that

\[\frac{(1 + C^*)\Theta}{4} \ge \frac{2\Theta + C^*\Theta(1 + \Theta)}{8},\]

and hence

\[S_n\Big(\frac{(1 + C^*)\Theta}{4}\Big) \subseteq S_n\Big(\frac{2\Theta + C^*\Theta(1 + \Theta)}{8}\Big) \subseteq G_n(\Theta, D). \tag{4.8.58}\]

Now suppose $x \in G_n(\Theta, D)$. Then

\[\begin{aligned} 4 (C^*\Theta/8) I_n = (C^*\Theta - \beta) I_n &\le \Theta \max_{k \in [M]} F_n(x_k) - \beta I_n && \text{(from (4.8.53))} \\ &\le F_n(x) - \beta I_n && \text{(from (4.5.2))} \\ &\le \sigma_n(x). && \text{(from (4.8.45))} \end{aligned} \tag{4.8.59}\]

This gives us the second inclusion in (4.8.55), completing the proof. ■

We are now prepared to give the proofs of our main results from Section 4.5. A key element of both proofs is to utilize Theorem 4.8.3 in conjunction with the corresponding results on the continuous support estimation set $S_n(\theta)$ (Theorems 4.8.1 and 4.8.2).

Proof of Theorem 4.5.1. By (4.8.19) and (4.8.55) with $\theta_1 = \frac{(1 + C^*)\Theta}{4}$, we have

\[X \subseteq S_n(\theta_1) \subseteq G_n(\Theta). \tag{4.8.60}\]

Similarly, with $\theta_2 = \frac{C^*\Theta}{8}$ and the definition $d(\theta_2) = \big(\frac{C_1}{C_2 \theta_2}\big)^{1/(S-\alpha)}$ (from (4.8.18)), we see

\[G_n(\Theta) \subseteq S_n(\theta_2) \subseteq B\Big(X, \Big(\frac{C_1}{C_2 \theta_2}\Big)^{1/(S-\alpha)} \Big/ n\Big) = B\Big(X, \Big(\frac{8 C_1}{C^* C_2 \Theta}\Big)^{1/(S-\alpha)} \Big/ n\Big). \tag{4.8.61}\]

The choice of $\theta$ in each case satisfies the conditions of Theorem 4.8.1, because

\[\theta_2 = \frac{C^*\Theta}{8} \le \theta_1 = \frac{(1 + C^*)\Theta}{4} < \frac{C_2}{4 C_1}. \tag{4.8.62}\]

Setting

\[r(\Theta) := \Big(\frac{8 C_1}{C^* C_2 \Theta}\Big)^{1/(S-\alpha)}, \tag{4.8.63}\]

(4.8.61) demonstrates (4.5.3). ■

Proof of Theorem 4.5.2. We note that the inclusion $G_{k,\eta,n}(\Theta) \subseteq B(S_{k,\eta}, r(\Theta)/n)$ is already satisfied by (4.5.4). Let $\theta_1 = \frac{(1 + C^*)\Theta}{4}$, and observe that

\[d(\theta_1) = \Big(\frac{4 C_1}{C_2 (1 + C^*)\Theta}\Big)^{1/(S-\alpha)} = \Big(\frac{C^*}{2(1 + C^*)}\Big)^{1/(S-\alpha)} r(\Theta), \tag{4.8.64}\]

with $r(\Theta)$ as defined in (4.8.63). We set $c = \big(\frac{C^*}{2(1 + C^*)}\big)^{1/(S-\alpha)}$ and note that $c < 1$.
Therefore,

\[\begin{aligned} X \cap B(S_{k,\eta}, c\, r(\Theta)/n) = X \cap B(S_{k,\eta}, d(\theta_1)/n) &\subseteq S_n(\theta_1) \cap B(S_{k,\eta}, d(\theta_1)/n) && \text{(from (4.8.27))} \\ &\subseteq S_n(\theta_1) \cap B(S_{k,\eta}, r(\Theta)/n) && \text{(since } d(\theta_1) < r(\Theta)) \\ &\subseteq G_n(\Theta) \cap B(S_{k,\eta}, r(\Theta)/n) && \text{(from (4.8.55))} \\ &= G_{k,\eta,n}(\Theta), \end{aligned} \tag{4.8.65}\]

completing the proof of (4.5.6). Setting $\theta_2 = \frac{C^*\Theta}{8}$, so that $r(\Theta) = d(\theta_2)$, (4.8.55) gives us the inclusion

\[G_{k,\eta,n}(\Theta) = G_n(\Theta) \cap B(S_{k,\eta}, r(\Theta)/n) \subseteq S_n(\theta_2) \cap B(S_{k,\eta}, d(\theta_2)/n) = S_{k,\eta,n}(\theta_2). \tag{4.8.66}\]

Then, (4.8.26) implies (4.5.5). ■

4.8.3 F-score Result Proof

In this section we give a proof of Theorem 4.5.3.

Proof. Observe that under the conditions of Theorem 4.5.2 we have

\[\mu(S_{k,\eta}) \le \mu(G_{k,\eta,n}) \le \mu(S_{k,\eta}) + \mu(S_{K_\eta+1,\eta}). \tag{4.8.67}\]

Therefore,

\[F_\eta(G_{k,\eta,n}) \ge \frac{2 \mu(S_{k,\eta})}{2 \mu(S_{k,\eta}) + \mu(S_{K_\eta+1,\eta})} \ge 1 - \frac{\mu(S_{K_\eta+1,\eta})}{2 \mu(S_{k,\eta}) + \mu(S_{K_\eta+1,\eta})}. \tag{4.8.68}\]

Also, $\{G_{k,\eta,n}\}_{k=1}^{K_\eta}$ is a partition of $G_n(\Theta) \supseteq X$. Then,

\[\sum_{k=1}^{K_\eta} \mu(G_{k,\eta,n})\, F_\eta(G_{k,\eta,n}) \ge 1 - \mu(S_{K_\eta+1,\eta}) \sum_{k=1}^{K_\eta} \frac{\mu(G_{k,\eta,n})}{2 \mu(S_{k,\eta}) + \mu(S_{K_\eta+1,\eta})}, \tag{4.8.69}\]

further implying

\[1 \ge F_\eta\big(\{G_{k,\eta,n}\}_{k=1}^{K_\eta}\big) \ge 1 - \frac{\mu(S_{K_\eta+1,\eta})}{2 \min_{k \in \{1, \ldots, K_\eta\}} \mu(S_{k,\eta})}. \tag{4.8.70}\]

Thus, by our assumption,

\[\lim_{\eta \to 0^+} F_\eta\big(\{G_{k,\eta,n}\}_{k=1}^{K_\eta}\big) \ge \lim_{\eta \to 0^+} \Big(1 - \frac{\mu(S_{K_\eta+1,\eta})}{2 \min_{k \in \{1, \ldots, K_\eta\}} \mu(S_{k,\eta})}\Big) = 1,\]

and together with the upper bound in (4.8.70), the limit equals 1. ■ (4.8.71)

Chapter 5

Conclusion

This chapter contains excerpts from [90, 88]. Firstly, in Chapter 2 we have discussed a central problem of machine learning, namely to approximate an unknown target function based only on data drawn from an unknown probability distribution. While the prevalent paradigm to solve this problem in general is to minimize a loss functional, we have initiated a new paradigm in which we can do the approximation directly from the data, under the manifold assumption.
This is a substantial theoretical improvement over the classical manifold learning technology, which involves a two-step procedure: first obtain some information about the manifold, and then do the approximation. Instead, our method learns on the manifold without manifold learning. Our construction in itself does not require any prior on the target function. We derive uniform error bounds holding with high probability regardless of the nature of the distribution, provided we know the dimension of the unknown manifold. The theorems are illustrated with some numerical examples. One of these examples is closely related to an important problem in magnetic resonance relaxometry, in which one seeks to find the proportion of water molecules in the myelin covering in the brain, based on a model that involves the inversion of the Laplace transform. We view this work as the beginning of a new direction. As such, there are plenty of future research projects, some of which we plan to undertake ourselves.

• Find alternative methods that improve upon the error estimates on unknown manifolds, and more general compact sets. The encoding described in Appendix E gives a representation of a function on an unknown manifold. Such an encoding is useful in the emerging area of approximation of operators. We plan to develop this theme further in the context of approximation of operators defined on different function spaces.

• Explore real-life applications other than the examples which we have discussed in this chapter.

• We feel that our method will work best if we are working in the right feature space. One of the vexing problems in machine learning is to identify the right features in the data. Deep networks are supposed to be doing this task automatically.
However, there is no clear explanation of whether they do so in every problem; alternatively, one must develop a theory of what "features" should mean and of how deep networks can extract them automatically.

Secondly, in Chapter 3 we have provided theory for the lifting of a function from one data space to another. In particular, our theory addresses the case where there is data only in some region of the base space. We answered the question of where we can leverage this knowledge on the target space via the construction of a generalized distance between the spaces and the careful use of a kernel localized with respect to this distance. We also examined the resulting smoothness of the lifted function, and found that the smoothness of the original function, as well as parameters associated with each data space, play a role in this value. Our work here has applications to local transfer learning, where one has a function approximation technique which works well on one domain (the base space) and wishes to apply the learned knowledge to a new domain (the target space). Under the framework we developed, this can be done without needing the global data from the original domain; instead, one can look at an area of interest in the target domain to decide which data is needed from the original domain. For large models defined on massive data, this development could save considerable computation time and even open doors to new results which were computationally infeasible before due to requiring data on the entire space. Future work in this direction also includes the field of image recovery, where one is given data of the transformation of an image and wishes to recover the original image. When a singular value decomposition of the transformation is available, our theory may be directly applicable. In particular, we are working to understand the Radon transform in this manner.
Thirdly, in Chapter 4 we have introduced a new approach for active learning in the context of machine learning. We provide theory for measure support estimation based on finitely many samples from an unknown probability measure supported on a compact metric space. With an additional assumption on the measure, known as the fine structure, we then relate this theory to the classification problem, which can be viewed as estimating the supports of a measure of the form $\mu = \sum_{k=1}^K c_k \mu_k$. We have shown that this setup is a generalization of the signal separation problem. Therefore our theory unifies ideas from signal separation with machine learning classification. Since the measures we are considering may be supported on a continuum, our theory additionally relates to the super-resolution regime of the signal separation problem.

We also give some empirical analysis of the performance of our new algorithm MASC, which was originally introduced in a varied form in [98]. The key focus of the algorithm is on querying high-information points whose labels can be extended to others belonging to the same class with high probability. This is done in a multiscale manner, with the intention that it be applied to data sets where the minimal separation between supports of the measures for different classes may be unknown or even zero. We applied MASC to a document data set as well as two hyperspectral data sets, namely subsets of the Indian Pines and Salinas hyperspectral imaging data sets. In the process of these experiments, we demonstrated empirically that MASC selects high-information points to query and that it gives competitive performance compared to two other recent active learning methods: LAND and LEND. Specifically, MASC consistently outperforms these algorithms in terms of computation time and exhibits competitive accuracy on Indian Pines for a broad range of query budgets.
These three projects embody a new approach to machine learning in general, where the models are developed through ideas in approximation theory and harmonic analysis and justified by proven, constructive convergence guarantees. This contrasts with the current methodology in machine learning, where the models are justified by existence results for the degree of approximation on their underlying hypothesis spaces, and then trained via optimization procedures, often without knowledge of whether or not the constructed model will attain such approximation power. In this way, we view the notion of learning without training as a powerful shift in methodology for machine learning problems which may yield further improvements to the field in the future.

Appendices

A Tauberian Theorem

The following Tauberian theorem comes from [95, Theorem 4.3]. This section is copied from the appendix of [89]. We recall that if $\mu$ is an extended complex valued Borel measure on $\mathbb{R}$, then its total variation measure is defined for a Borel set $B$ by
$$|\mu|(B) = \sup \sum_k |\mu(B_k)|,$$
where the sum is over a partition $\{B_k\}$ of $B$ comprising Borel sets, and the supremum is over all such partitions. A measure $\mu$ on $\mathbb{R}$ is called an even measure if $\mu((-u,u)) = 2\mu([0,u))$ for all $u > 0$, and $\mu(\{0\}) = 0$. If $\mu$ is an extended complex valued measure on $[0,\infty)$, and $\mu(\{0\}) = 0$, we define a measure $\mu_e$ on $\mathbb{R}$ by
$$\mu_e(B) = \mu(\{|x| : x \in B\}),$$
and observe that $\mu_e$ is an even measure such that $\mu_e(B) = \mu(B)$ for $B \subset [0,\infty)$. In the sequel, we will assume that all measures on $[0,\infty)$ which do not associate a nonzero mass with the point $0$ are extended in this way, and will abuse the notation $\mu$ also to denote the measure $\mu_e$. In the sequel, the phrase "measure on $\mathbb{R}$" will refer to an extended complex valued Borel measure having bounded total variation on compact intervals in $\mathbb{R}$, and similarly for measures on $[0,\infty)$.
Our main Tauberian theorem is the following.

Theorem A.1. Let $\mu$ be an extended complex valued measure on $[0,\infty)$, with $\mu(\{0\}) = 0$. We assume that there exist $Q, r > 0$ such that each of the following conditions is satisfied.

1. $$|||\mu|||_Q := \sup_{u \in [0,\infty)} \frac{|\mu|([0,u))}{(u+2)^Q} < \infty. \qquad \text{(A.1)}$$

2. There are constants $c, C > 0$ such that
$$\left| \int_{\mathbb{R}} \exp(-u^2 t)\, d\mu(u) \right| \le c_1 t^{-C} \exp(-r^2/t)\, |||\mu|||_Q, \qquad 0 < t \le 1. \qquad \text{(A.2)}$$

Let $H : [0,\infty) \to \mathbb{R}$, let $S > Q+1$ be an integer, and suppose that there exists a measure $H^{[S]}$ such that
$$H(u) = \int_0^\infty (v^2 - u^2)_+^S\, dH^{[S]}(v), \qquad u \in \mathbb{R}, \qquad \text{(A.3)}$$
and
$$V_{Q,S}(H) = \max\left( \int_0^\infty (v+2)^Q v^{2S}\, d|H^{[S]}|(v),\ \int_0^\infty (v+2)^Q v^S\, d|H^{[S]}|(v) \right) < \infty. \qquad \text{(A.4)}$$
Then for $n \ge 1$,
$$\left| \int_0^\infty H(u/n)\, d\mu(u) \right| \le c\, \frac{n^Q}{\max(1, (nr)^S)}\, V_{Q,S}(H)\, |||\mu|||_Q. \qquad \text{(A.5)}$$

Proposition 3.2.1 is proved using this theorem with $\mu(u) = \mu_{x,y}(u) = \sum_{k\,:\,\lambda_k < u} \cdots$

B Riemannian Manifolds

If $\mathbb{X}$ is a differentiable manifold, $x \in \mathbb{X}$, $\epsilon > 0$, and $\gamma : (-\epsilon, \epsilon) \to \mathbb{X}$ is a curve with $x = \gamma(0)$, then we can define the tangent vector to $\gamma$ at $\gamma(t_0)$ as a functional $\gamma'(t_0)$ acting on the class of differentiable functions $f : \mathbb{X} \to \mathbb{R}$ by
$$\gamma'(t_0) f := \frac{d(f \circ \gamma)}{dt}(t_0). \qquad \text{(B.2)}$$
The tangent space of $\mathbb{X}$ at a point $x \in \mathbb{X}$, denoted by $T_x(\mathbb{X})$, is the set of all such functionals $\gamma'(0)$. A Riemannian manifold is a differentiable manifold with a family of inner products $\{\langle \circ, \circ \rangle_x\}_{x \in \mathbb{X}}$ such that for any $X, Y \in T_x(\mathbb{X})$, the function $\varphi : \mathbb{X} \to \mathbb{C}$ given by $x \mapsto \langle X(x), Y(x) \rangle_x$ is differentiable. We can define an associated norm $\|X\| = \sqrt{\langle X(x), X(x) \rangle_x}$. The length $L(\gamma)$ of a curve $\gamma$ defined on $[a,b]$ is defined to be
$$L(\gamma) := \int_a^b \|\gamma'(t)\|\, dt. \qquad \text{(B.3)}$$
We will call a curve $\gamma : [a,b] \to \mathbb{X}$ a geodesic if $L(\gamma) = \inf\{L(r) : r : [a,b] \to \mathbb{X} \text{ is a curve with } r(a) = \gamma(a),\ r(b) = \gamma(b)\}$. It is well known that if $\gamma$ is a geodesic, then $\gamma'(t) \cdot \gamma''(t) = 0$ for any $t \in [a,b]$.
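The definitions above can be illustrated numerically on a toy manifold of our own choosing (the unit circle with its arc-length parametrization; this example is an assumption for illustration, not from the text): the tangent functional (B.2) is just the derivative along the curve, and the unit-speed circle exhibits the geodesic property $\gamma'(t) \cdot \gamma''(t) = 0$.

```python
import numpy as np

# Toy example (our own): X is the unit circle in R^2 with the arc-length
# parametrization gamma(t) = (cos t, sin t).
gamma = lambda t: np.array([np.cos(t), np.sin(t)])
dgamma = lambda t: np.array([-np.sin(t), np.cos(t)])    # gamma'(t)
ddgamma = lambda t: np.array([-np.cos(t), -np.sin(t)])  # gamma''(t)

# (B.2): the tangent functional gamma'(t0) applied to f(p) = p_x * p_y is
# d(f o gamma)/dt at t0, checked here against a central difference.
f = lambda p: p[0] * p[1]
t0, h = 0.7, 1e-6
num = (f(gamma(t0 + h)) - f(gamma(t0 - h))) / (2 * h)
exact = np.cos(2 * t0)  # d/dt [cos(t) sin(t)] = cos(2t)
assert abs(num - exact) < 1e-8

# The unit-speed parametrization is a geodesic of the circle, and indeed
# gamma'(t) . gamma''(t) = 0 pointwise.
assert abs(dgamma(t0) @ ddgamma(t0)) < 1e-12
```

The second assertion is exactly the orthogonality of velocity and acceleration quoted above for arc-length-parametrized geodesics.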
In the sequel, we assume that $\mathbb{X}$ is a compact, connected Riemannian manifold. Then for every $x, y \in \mathbb{X}$ there exists a geodesic $\gamma : [a,b] \to \mathbb{X}$ such that $\gamma(a) = x$ and $\gamma(b) = y$. The quantity $\rho(x,y) = L(\gamma)$ defines a metric on $\mathbb{X}$ such that the corresponding metric topology is consistent with the topology defined by any atlas on $\mathbb{X}$. For any $x \in \mathbb{X}$, there exist a neighborhood $V \subset \mathbb{X}$ of $x$, a number $\delta = \delta(x) > 0$, and a mapping $E : (-2,2) \times U \to \mathbb{X}$, where $U = \{(y, v) : y \in V,\ v \in T_y\mathbb{X},\ \|v\|_2 < \delta\}$, such that $t \mapsto E(t, y, v)$ is the unique geodesic of $\mathbb{X}$ which, at $t = 0$, passes through $y$ and has the property that $\partial E/\partial t = v$, for each $(y, v) \in U$. As a result, we can define the exponential map at $x$ to be the function $\varepsilon_x : B_T(x, \delta(x)) \subset T_x(\mathbb{X}) \to \mathbb{X}$ given by $\varepsilon_x(v) = E(1, x, v)$. Intuitively, the line joining $x$ and $v$ in $T_x(\mathbb{X})$ is mapped to the geodesic joining $x$ with $\varepsilon_x(v)$. We call the supremum of all $\delta(x)$ for which the exponential map is so defined the injectivity radius at $x$, denoted by $\iota(x)$, and we call $\iota^* = \inf_{x \in \mathbb{X}} \iota(x)$ the global injectivity radius of $\mathbb{X}$. Since $x \mapsto \iota(x)$ is a continuous function of $x$, and $\iota(x) > 0$ for each $x$, it follows that $\iota^* > 0$ when $\mathbb{X}$ is compact. Correspondingly, on compact manifolds, one can conclude that for $y \in B_T(x, \iota^*)$, $\rho(x, \varepsilon_x(y)) = \|x - y\|$.

Next, we discuss the metric tensor and volume element on $\mathbb{X}$. Let $(U, \mathbf{x})$ be a coordinate chart with $0 \in U$, $\mathbf{x}(0) = x \in \mathbb{X}$, and let $\partial_j(x)$ be the tangent vector at $x$ to the coordinate curve $t \mapsto \mathbf{x}((0, \ldots, 0, t, 0, \ldots, 0))$, with $t$ in the $j$-th slot. Then we can define the metric tensor $g$ to be the matrix with entries $g_{ij} = \langle \partial_i(x), \partial_j(x) \rangle_x$. When one expands the metric tensor $g$ as a Taylor series in local coordinates on $B(x, \iota^*)$, it can be shown [115, Page 21] that for any $\delta < \iota^*$, on the ball $B(x, \delta)$ we have
$$|g| = 1 + O(\delta^2).$$
(B.4)

In turn, this implies
$$\sqrt{|g|} - 1 \lesssim \delta^2. \qquad \text{(B.5)}$$

The following proposition lists some important properties relating the geodesic distance $\rho$ on an unknown submanifold of $\mathbb{S}^Q$ to the geodesic distance on $\mathbb{S}^Q$ as well as the Euclidean distance on $\mathbb{R}^{Q+1}$.

Proposition B.1. Let $\eta_x$ be defined as in Section 2.5.
(a) For every $\eta_x(u) \in B(x, \iota^*)$,
$$|\arccos(x \cdot \eta_x(u)) - \rho(x, \eta_x(u))| \lesssim \rho(x, \eta_x(u))^3. \qquad \text{(B.6)}$$
(b) For any $x, y \in \mathbb{X}$,
$$\rho(x, y) \sim \arccos(x \cdot y). \qquad \text{(B.7)}$$

Proof. First, we observe that $\|x - y\|_2 \sim \arccos(x \cdot y)$, because $\|x - y\|_2/2 = \sin(\arccos(x \cdot y)/2)$ and $\theta/\pi \le \sin(\theta/2) \le \theta/2$ for all $\theta \in [0, \pi]$. Fix $x \in \mathbb{X}$ and let $\gamma$ be a geodesic on $\mathbb{X}$ parametrized by arc length $t$ from $x$. In particular, we then have $\|\gamma'(0)\|_2 = 1$ and $\gamma'(0) \cdot \gamma''(0) = 0$. Taking a Taylor expansion of $\gamma(t)$ with $|t| < \iota^*$ (we recall that $\iota^* \le 1$), we can see
$$\gamma'(0) \cdot (\gamma(t) - \gamma(0)) = \gamma'(0) \cdot \left( \gamma'(0)\, t + \tfrac{1}{2} \gamma''(0)\, t^2 + O(t^3) \right) = \|\gamma'(0)\|_2^2\, t + \tfrac{1}{2}\, \gamma'(0) \cdot \gamma''(0)\, t^2 + O(t^3) = t + O(t^3). \qquad \text{(B.8)}$$
For any $y \in B(x, \iota^*)$, there exists a unique $u \in S_x(\iota^*)$ such that $y = \eta_x(u)$. We can write $y = \gamma(t)$ for some geodesic $\gamma$. We know
$$t = \rho(x, y) \ge \arccos(x \cdot y) \ge \|x - y\|_2 = \|\gamma(t) - \gamma(0)\|_2.$$
Using the Cauchy–Schwarz inequality, we see
$$0 \le t - \|x - y\|_2 \le t - \gamma'(0) \cdot (\gamma(t) - \gamma(0)) \lesssim t^3. \qquad \text{(B.9)}$$
As a result, we can conclude
$$\rho(x, \eta_x(u)) - \arccos(x \cdot \eta_x(u)) \le \rho(x, \eta_x(u)) - \|\eta_x(u) - x\|_2 \lesssim \rho(x, \eta_x(u))^3, \qquad \text{(B.10)}$$
showing (B.6). Letting $c$ be the constant built into the notation of (B.6), if we fix $x \in \mathbb{X}$ and let $y \in B\left(x, \sqrt{1/(2c)}\right)$, we have
$$\tfrac{1}{2}\rho(x, y) \le \rho(x, y) - c\,\rho(x, y)^3 \le \arccos(x \cdot y).$$
(B.11)

Furthermore, since $A = \mathbb{X} \setminus B\left(x, \sqrt{1/(2c)}\right)$ is a compact set and $g_x(y) = \arccos(x \cdot y)/\rho(x, y)$ is a continuous function of $y$ defined on $A$, we can conclude that $g_x$ attains a minimum on $A$. Therefore,
$$\rho(x, y) \sim \arccos(x \cdot y) \qquad \text{(B.12)}$$
for every $y \in \mathbb{X}$. We note that the constants involved in this proof vary continuously with respect to the choice of $x$, so in the theorem we may simply use the supremum over all such constants, which must be finite since $\mathbb{X}$ is compact. ■

C Orthogonal Polynomials

Many of the results discussed in this section can be found, with corresponding proofs, in [91, Chapter 7]. As mentioned above, polynomials are a popular choice of approximants in many settings. On the real line, polynomials are ideal approximants for a wide variety of measures because the space of polynomials can be decomposed into a unique orthonormal basis. To illustrate, we give the following theorem.

Theorem C.1. Let $\mu$ be a measure on $\mathbb{R}$ satisfying both

1. $\int_{\mathbb{R}} |x|^n\, d\mu(x) < \infty$ for every $n \in \mathbb{N}$.

2. There exists $A \subseteq \mathbb{R}$ with $|A| = \infty$ such that for any $t \in A$ and $h > 0$ we have
$$\int_t^{t+h} d\mu(x) > 0. \qquad \text{(C.1)}$$

Then there exists a unique sequence of polynomials $\{P_n\}_{n=0}^\infty$, where each $P_k$ has degree $k$ and a positive leading coefficient $\gamma_k$, such that
$$\int_{\mathbb{R}} P_n(x) P_m(x)\, d\mu(x) = \delta_{m,n}. \qquad \text{(C.2)}$$

Using the same notation, each of these polynomials also satisfies the following three-term recurrence property:
$$x P_n(x) = \frac{\gamma_n}{\gamma_{n+1}} P_{n+1}(x) + \left( \int_{\mathbb{R}} t P_n^2(t)\, d\mu(t) \right) P_n(x) + \frac{\gamma_{n-1}}{\gamma_n} P_{n-1}(x). \qquad \text{(C.3)}$$

Given $n \in \mathbb{N}$, if one knows the moments
$$u_k = \int_{\mathbb{R}} x^k\, d\mu(x) \qquad \text{(C.4)}$$
for $k = 0, \ldots, 2n-1$, then the first $n$ orthogonal polynomials (not necessarily orthonormal) can be constructed by
$$P_n(x) = \begin{vmatrix} u_0 & u_1 & u_2 & \cdots & u_{n-1} & 1 \\ u_1 & u_2 & u_3 & \cdots & u_n & x \\ u_2 & u_3 & u_4 & \cdots & u_{n+1} & x^2 \\ \vdots & \vdots & \vdots & & \vdots & \vdots \\ u_{n-1} & u_n & u_{n+1} & \cdots & u_{2n-2} & x^{n-1} \\ u_n & u_{n+1} & u_{n+2} & \cdots & u_{2n-1} & x^n \end{vmatrix}. \qquad \text{(C.5)}$$
We note that column $k$ enforces orthogonality against $x^k$.

One of the primary examples of orthogonal polynomials on the real line is the family of orthonormalized Hermite polynomials, defined by the Rodrigues-type formula
$$H_n(x) = \frac{(-1)^n e^{x^2}}{\sqrt{2^n\, n!\, \sqrt{\pi}}} \left( \frac{d}{dx} \right)^n e^{-x^2}. \qquad \text{(C.6)}$$
These polynomials satisfy the orthonormality property with respect to a Gaussian weight function; i.e.,
$$\int_{\mathbb{R}} H_n(x) H_m(x) e^{-x^2}\, dx = \delta_{m,n}. \qquad \text{(C.7)}$$
These polynomials are also used to construct the Hermite functions, defined by
$$h_n(x) = e^{-x^2/2} H_n(x), \qquad \text{(C.8)}$$
which are orthonormal with respect to the standard Lebesgue measure on the real line.

The main orthogonal polynomials on the real line of interest in this dissertation are the Jacobi polynomials. We include the following excerpt about Jacobi polynomials from [89]. For $\alpha, \beta > -1$, $x \in (-1, 1)$, and integer $\ell \ge 0$, the Jacobi polynomials $p_\ell^{(\alpha,\beta)}$ are defined by the Rodrigues formula [129, Formulas (4.3.1), (4.3.4)]
$$(1-x)^\alpha (1+x)^\beta\, p_\ell^{(\alpha,\beta)}(x) = \left( \frac{(2\ell+\alpha+\beta+1)\, \ell!\, (\ell+\alpha+\beta)!}{2^{\alpha+\beta+1}\, (\ell+\alpha)!\, (\ell+\beta)!} \right)^{1/2} \frac{(-1)^\ell}{2^\ell\, \ell!}\, \frac{d^\ell}{dx^\ell}\left\{ (1-x)^{\ell+\alpha} (1+x)^{\ell+\beta} \right\}, \qquad \text{(C.9)}$$
where $z!$ denotes $\Gamma(z+1)$. The Jacobi polynomials satisfy the following well-known differential equation:
$$(1-x^2)\, {p_n^{(\alpha,\beta)}}''(x) + (\beta - \alpha - (\alpha+\beta+2)x)\, {p_n^{(\alpha,\beta)}}'(x) = -n(n+\alpha+\beta+1)\, p_n^{(\alpha,\beta)}(x). \qquad \text{(C.10)}$$
Each $p_\ell^{(\alpha,\beta)}$ is a polynomial of degree $\ell$ with positive leading coefficient, satisfying the orthogonality relation
$$\int_{-1}^1 p_\ell^{(\alpha,\beta)}(x)\, p_j^{(\alpha,\beta)}(x)\, (1-x)^\alpha (1+x)^\beta\, dx = \delta_{\ell,j}, \qquad \text{(C.11)}$$
and
$$p_\ell^{(\alpha,\beta)}(1) = \left( \frac{(2\ell+\alpha+\beta+1)\, \ell!\, (\ell+\alpha+\beta)!}{2^{\alpha+\beta+1}\, (\ell+\alpha)!\, (\ell+\beta)!} \right)^{1/2} \frac{(\ell+\alpha)!}{\alpha!\, \ell!} \sim \ell^{\alpha+1/2}. \qquad \text{(C.12)}$$
It follows that $p_\ell^{(\alpha,\beta)}(-x) = (-1)^\ell p_\ell^{(\beta,\alpha)}(x)$.
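As a quick numerical sanity check on (C.6)–(C.7), the following sketch (our own, assuming NumPy is available) applies the normalization constant $\sqrt{2^n\, n!\, \sqrt{\pi}}$ to NumPy's "physicists'" Hermite polynomials and verifies the orthonormality relation by Gauss–Hermite quadrature, which integrates polynomials against the weight $e^{-x^2}$ exactly up to degree $2k-1$ for $k$ nodes:

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss, hermval
from math import factorial, pi, sqrt

# Orthonormalized Hermite polynomial: H_n / sqrt(2^n n! sqrt(pi)),
# built on NumPy's physicists' Hermite basis.
def H(n, x):
    c = np.zeros(n + 1)
    c[n] = 1.0
    return hermval(x, c) / sqrt(2**n * factorial(n) * sqrt(pi))

# 20 Gauss-Hermite nodes integrate p(x) e^{-x^2} exactly for deg(p) <= 39;
# H_n * H_m here has degree at most 10, so the quadrature is exact.
nodes, weights = hermgauss(20)
gram = np.array([[np.sum(weights * H(n, nodes) * H(m, nodes))
                  for m in range(6)] for n in range(6)])
assert np.allclose(gram, np.eye(6), atol=1e-10)  # recovers (C.7)
```

The sign $(-1)^n$ in (C.6) does not affect the Gram matrix, which is why the check passes regardless of the sign convention used for $H_n$.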
In particular, $p_{2\ell}^{(\alpha,\alpha)}$ is an even polynomial, and $p_{2\ell+1}^{(\alpha,\alpha)}$ is an odd polynomial. We note (cf. [129, Theorem 4.1]) that
$$p_{2\ell}^{(\alpha,\alpha)}(x) = 2^{\alpha/2+1/4}\, p_\ell^{(\alpha,-1/2)}(2x^2-1) = 2^{\alpha/2+1/4} (-1)^\ell\, p_\ell^{(-1/2,\alpha)}(1-2x^2),$$
$$p_{2\ell+1}^{(\alpha,\alpha)}(x) = 2^{\alpha/2+1/2}\, x\, p_\ell^{(\alpha,1/2)}(2x^2-1) = 2^{\alpha/2+1/2} (-1)^\ell\, x\, p_\ell^{(1/2,\alpha)}(1-2x^2). \qquad \text{(C.13)}$$
It is known [110, Theorem A] that for $\alpha, \beta \ge -1/2$ and $\theta, \varphi \in [0, \pi]$,
$$\sum_{j=0}^\infty \exp(-j(j+\alpha+\beta+1)t)\, p_j^{(\alpha,\beta)}(\cos\theta)\, p_j^{(\alpha,\beta)}(\cos\varphi) \lesssim (t+\theta\varphi)^{-\alpha-1/2}\, (t+(\pi-\theta)(\pi-\varphi))^{-\beta-1/2}\, t^{-1/2} \exp\left( -c\,\frac{(\theta-\varphi)^2}{t} \right). \qquad \text{(C.14)}$$
We note that when $\beta = -1/2$, this yields
$$\sum_{j=0}^\infty \exp(-j(j+\alpha+1/2)t)\, p_j^{(\alpha,-1/2)}(\cos\theta)\, p_j^{(\alpha,-1/2)}(\cos\varphi) \lesssim t^{-\alpha-1} \exp\left( -c\,\frac{(\theta-\varphi)^2}{t} \right). \qquad \text{(C.15)}$$

D Network Representation

Algorithm 2: Clenshaw algorithm to compute $\sum_{k=0}^{n-1} C_k p_k$, where $p_k(x) = a_k x p_{k-1}(x) + b_k p_{k-2}(x)$, $k = 1, 2, \ldots, n-1$, $b_1 = 0$.
Input: $p_0$, $C_0, \ldots, C_{n-1}$, $x$, $a_{n+1}, \ldots, a_1$, $b_{n+1}, \ldots, b_1$.
Output: The value of $\sum_{k=0}^{n-1} C_k p_k$.
1: out1 ← 0, out2 ← 0, $C_{-1}$ ← 0, $C_n$ ← 0
2: for $k = n+1, n, n-1, \ldots, 1$ do
3:   temp ← $a_k \cdot$ out1 $\cdot\, x$ + out2
4:   out2 ← $b_k \cdot$ out1 + $C_{k-2}$
5:   out1 ← temp
6: return out1 $\cdot\, p_0$

This appendix comes from [90]. Let $\{p_k\}$ be a system of orthonormal polynomials satisfying a recurrence relation
$$p_k(x) = a_k x p_{k-1}(x) + b_k p_{k-2}(x), \qquad k = 1, 2, \ldots, \quad b_1 = 0. \qquad \text{(D.1)}$$
The Clenshaw algorithm is a modification of the classical Horner method for computing polynomials expressed in the monomial basis; it evaluates a polynomial expressed in terms of the orthonormalized polynomials $\{p_k\}$ [25, 49]. To understand the method, let $P = \sum_{k=0}^{n-1} C_k p_k$. It is convenient to write $C_k = 0$ if $k \ge n$ or $k < 0$.
The recurrence (D.1) shows that
$$C_k p_k(x) + C_{k-1} p_{k-1}(x) + C_{k-2} p_{k-2}(x) = (a_k C_k x + C_{k-1})\, p_{k-1}(x) + (b_k C_k + C_{k-2})\, p_{k-2}(x). \qquad \text{(D.2)}$$
This leads to Algorithm 2.

Figure 5.1: The implementation of a linear combination as a ReLU network. Here all operations are pointwise. The symbol ⊙ represents the Hadamard product of matrices; ⊕ is the sum of matrices.

Figure 5.2: One step of the Clenshaw algorithm, using two circuits of the form LC (4 neurons) as in Figure 5.1. The circuit diagram is shown in general with four input pins and two output pins.

Figure 5.3: Unrolling the Clenshaw algorithm as a cascade of the circuits of the form CL($k$) as in Figure 5.2.

By algorithm unrolling, we may express this algorithm in terms of a deep neural network evaluating a ReLU activation function. The network is a cascade of different circuits. The most fundamental is the implementation of a linear combination as a ReLU network (see Figure 5.1):
$$ax + b = (ax+b)_+ - (-ax-b)_+.$$
Using the circuits LC in Figure 5.1, we next construct a circuit to implement the recursive reduction (D.2). This is illustrated in Figure 5.2. Finally, we unroll the Clenshaw algorithm by cascading the circuits CL($k$) from Figure 5.2 for $k = n+1$ down to $k = 1$, with different inputs and outputs, as shown in Figure 5.3. We use this in order to compute $\Phi_{n,q}(x \cdot y_j)$ via the recursive formula for ultraspherical polynomials (2.4.7), in the following way. We set
$$C_k = \frac{\omega_q}{\omega_{q-1}}\, h(k/n)\, p_{q,k}(1), \qquad a_k = \begin{cases} \sqrt{\dfrac{\Gamma(q)\Gamma(q+1)}{\Gamma(q-1)}} & k = 1, \\[2mm] \sqrt{\dfrac{(2k+q-3)(2k+q-1)}{k(n+q-2)}} & k \ge 2, \end{cases} \qquad b_k = \sqrt{\frac{(k-1)(k+q-3)(2k+q-1)}{k(k+q-2)(2k+q-5)}}. \qquad \text{(D.3)}$$
For the matrix $X$ shown in Figure 5.3, we consider the $(Q+1) \times N$ test data matrix $S$, where each column represents one test data point $x$, and a $(Q+1) \times M$ training data matrix $R$, where column $j$ represents the data point $y_j$.
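A minimal Python transcription of Algorithm 2 may look as follows (our own sketch; the variable names follow the pseudocode, and the Chebyshev recurrence used for the check is an illustrative assumption, not the ultraspherical coefficients of (D.3)):

```python
import numpy as np

def clenshaw(p0, C, x, a, b):
    """Algorithm 2: evaluate sum_{k=0}^{n-1} C[k] * p_k(x), where the p_k
    satisfy p_k = a[k] * x * p_{k-1} + b[k] * p_{k-2} (b[1] = 0), without
    forming any p_k explicitly. a and b are indexed 1..n+1 (index 0 unused)."""
    n = len(C)
    Cq = lambda k: C[k] if 0 <= k < n else 0.0  # C_k = 0 for k >= n or k < 0
    out1, out2 = 0.0, 0.0
    for k in range(n + 1, 0, -1):
        temp = a[k] * out1 * x + out2
        out2 = b[k] * out1 + Cq(k - 2)
        out1 = temp
    return out1 * p0

# Check against direct summation using the Chebyshev recurrence
# T_1 = x T_0, T_k = 2x T_{k-1} - T_{k-2}, i.e. a[1] = 1, a[k] = 2, b[k] = -1.
n, x = 6, 0.37
a = [0.0, 1.0] + [2.0] * n   # a[1..n+1]
b = [0.0, 0.0] + [-1.0] * n  # b[1..n+1], b[1] = 0
C = np.linspace(1.0, 2.0, n)
T = [1.0, x]
for k in range(2, n):
    T.append(2 * x * T[k - 1] - T[k - 2])
direct = sum(C[k] * T[k] for k in range(n))
assert abs(clenshaw(1.0, C, x, a, b) - direct) < 1e-12
```

The point of the backward loop is exactly the reduction (D.2): at each step, the coefficient of $p_{k-1}$ absorbs $a_k C_k x$ and the coefficient of $p_{k-2}$ absorbs $b_k C_k$, so only two running values need to be stored — the same two "pins" carried between the CL($k$) circuits in Figure 5.3.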
Then we set $X = S^T R$. In this way, we would return $\Phi_{n,q}(S^T R)$ from running Algorithm 2, with a time complexity of $O(NMn)$.

E Encoding

This appendix comes from [90]. Our construction in (2.5.5) allows us to encode the target function in terms of finitely many real numbers. For each integer $\ell \ge 0$, let $\{Y_{Q,\ell,k}\}_{k=1}^{\dim(\mathbb{H}_\ell^Q)}$ be an orthonormal basis for $\mathbb{H}_\ell^Q$ on $\mathbb{S}^Q$. We define the encoding of $f$ by
$$\hat{z}(\ell, k) := \frac{1}{M} \sum_{j=1}^M z_j\, Y_{Q,\ell,k}(y_j). \qquad \text{(E.1)}$$
Given this encoding, the decoding algorithm is given in the following proposition.

Proposition E.1. Assume $\Phi_{n,q}$ is given as in (2.4.12). Given the encoding of $f$ as in (E.1), one can rewrite the approximation in (2.5.5) as
$$F_n(D; x) = \sum_{\ell=0}^n \Gamma_{\ell,n} \sum_{k=1}^{\dim(\mathbb{H}_\ell^Q)} \hat{z}(\ell, k)\, Y_{Q,\ell,k}(x), \qquad x \in \mathbb{S}^Q, \qquad \text{(E.2)}$$
where
$$\Gamma_{\ell,n} := \frac{\omega_q\, \omega_{Q-1}}{\omega_Q\, \omega_{q-1}} \sum_{i=\ell}^n h\left( \frac{i}{n} \right) \frac{p_{q,i}(1)}{p_{Q,\ell}(1)}\, C_{Q,q}(\ell, i), \qquad \text{(E.3)}$$
and $C_{Q,q}(\ell, i)$ is defined in (2.4.11).

Proof. The proof follows from writing out
$$F_n(D; x) = \frac{\omega_q}{M\, \omega_{q-1}} \sum_{j=1}^M z_j \sum_{i=1}^n h\left( \frac{i}{n} \right) p_{q,i}(1)\, p_{q,i}(x \cdot y_j), \qquad \text{(E.4)}$$
making substitutions using (2.4.11), (2.4.4), and (2.4.5), then collecting terms. ■

Remark E.1. The encoding (E.1) is not parsimonious. Since the basis functions $\{Y_{Q,\ell,k}\}_{\ell=0,k=1}^{n,\, \dim(\mathbb{H}_\ell^Q)}$ are not necessarily independent on $\mathbb{X}$, the encoding can be made parsimonious by exploiting linear relationships in this system. Given a reparametrization of the functions as $\{Y_j\}_{j=1}^{\sum_{\ell=0}^n \dim(\mathbb{H}_\ell^Q)}$, we form the discrete Gram matrix $G$ with entries
$$G_{i,j} := \frac{1}{M} \sum_{k=1}^M Y_i(y_k)\, Y_j(y_k) \approx \int_{\mathbb{X}} Y_i(y)\, Y_j(y)\, f_0\, d\mu^*(y). \qquad \text{(E.5)}$$
In practice, one may formulate a QR decomposition by fixing some first basis vector and proceeding by the Gram–Schmidt process until a basis is formed, then setting some threshold on the eigenvalues to get the desired dependencies among the $Y_j$'s. ■

Bibliography

[1] Richard Askey.
Orthogonal Polynomials and Special Functions. Society for Industrial and Applied Mathematics, 1975.

[2] A. R. Barron. "Universal approximation bounds for superpositions of a sigmoidal function". In: IEEE Trans. Inf. Theor. 39.3 (May 1993), pp. 930–945. issn: 0018-9448. doi: 10.1109/18.256500. url: https://doi.org/10.1109/18.256500.

[3] Dmitry Batenkov and Nuha Diab. "Super-resolution of generalized spikes and spectra of confluent Vandermonde matrices". In: Applied and Computational Harmonic Analysis 65 (2023), pp. 181–208. issn: 1063-5203. doi: https://doi.org/10.1016/j.acha.2023.03.002. url: https://www.sciencedirect.com/science/article/pii/S1063520323000234.

[4] Jensen Baxter. (10)Dataset Text Document Classification. https://www.kaggle.com/datasets/jensenbaxter/10dataset-text-document-classification. Kaggle. 2020.

[5] M. Belkin, I. Matveeva, and P. Niyogi. "Regularization and semi-supervised learning on large graphs". In: Learning theory. Springer, 2004, pp. 624–638.

[6] M. Belkin and P. Niyogi. "Laplacian eigenmaps for dimensionality reduction and data representation". In: Neural computation 15.6 (2003), pp. 1373–1396.

[7] M. Belkin and P. Niyogi. "Semi-supervised learning on Riemannian manifolds". In: Machine learning 56.1-3 (2004), pp. 209–239.

[8] M. Belkin and P. Niyogi. "Towards a theoretical foundation for Laplacian-based manifold methods". In: Journal of Computer and System Sciences 74.8 (2008), pp. 1289–1308.

[9] A. L. Bertozzi and A. Flenner. "Diffuse interface models on graphs for classification of high dimensional data". In: Multiscale Modeling & Simulation 10.3 (2012), pp. 1090–1118.

[10] Helmut Bölcskei et al. "Optimal Approximation with Sparsely Connected Deep Neural Networks". In: SIAM Journal on Mathematics of Data Science 1.1 (2019), pp. 8–45. doi: 10.1137/18M118709X. eprint: https://doi.org/10.1137/18M118709X.
url: https://doi.org/10.1137/18M118709X.

[11] W. M. Boothby. An Introduction to Differentiable Manifolds and Riemannian Geometry. ISSN. Elsevier Science, 1975. isbn: 9780080873794.

[12] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

[13] Emmanuel J. Candès and Carlos Fernandez-Granda. "Towards a Mathematical Theory of Super-Resolution". In: Communications on Pure and Applied Mathematics 67.6 (2013), pp. 906–956.

[14] Manfredo P. do Carmo. Riemannian Geometry. Birkhäuser, 1992. isbn: 9780817634902.

[15] K. Chaudhuri and S. Dasgupta. "Rates of convergence for the cluster tree". In: Advances in Neural Information Processing Systems. 2010, pp. 343–351.

[16] Margaret Cheney and Brett Borden. Fundamentals of radar imaging. SIAM, 2009.

[17] Ming-Yen Cheng and Hau-Tieng Wu. "Local Linear Regression on Manifolds and Its Geometric Interpretation". In: Journal of the American Statistical Association 108.504 (2013), pp. 1421–1434. issn: 01621459.

[18] C. K. Chui and D. L. Donoho. "Special issue: Diffusion maps and wavelets". In: Appl. and Comput. Harm. Anal. 21.1 (2006).

[19] C. K. Chui and X. Li. "Approximation by ridge functions and neural networks with one hidden layer". In: Journal of Approximation Theory 70.2 (1992), pp. 131–141.

[20] C. K. Chui and J. Wang. "Dimensionality reduction of hyperspectral imagery data for feature classification". In: Handbook of Geomathematics. Springer, 2010, pp. 1005–1047.

[21] C. K. Chui and J. Wang. "Nonlinear methods for dimensionality reduction". In: Handbook of Geomathematics. Springer, 2015, pp. 1–46.

[22] C. K. Chui and J. Wang. "Randomized anisotropic transform for nonlinear dimensionality reduction".
In: GEM-International Journal on Geomathematics 1.1 (2010), pp. 23–50.

[23] Charles K. Chui and Hrushikesh N. Mhaskar. "A unified method for super-resolution recovery and real exponential-sum separation". In: Applied Computational Harmonic Analysis 46.2 (March 2019), pp. 431–451. doi: 10.1016/j.acha.2017.12.007.

[24] Charles K. Chui and Hrushikesh N. Mhaskar. "Deep Nets for Local Manifold Learning". In: Frontiers in Applied Mathematics and Statistics 4 (2018), p. 12. issn: 2297-4687. doi: 10.3389/fams.2018.00012. url: https://www.frontiersin.org/article/10.3389/fams.2018.00012.

[25] C. W. Clenshaw. "A note on the summation of Chebyshev series". In: Mathematics of Computation 9 (1955), pp. 118–120.

[26] A. Cloninger and H. N. Mhaskar. "Cautious active clustering". In: Applied and Computational Harmonic Analysis 54 (2021), pp. 44–74. issn: 1063-5203. doi: https://doi.org/10.1016/j.acha.2021.02.002. url: https://www.sciencedirect.com/science/article/pii/S1063520321000154.

[27] Alexander Cloninger et al. "Bigeometric organization of deep nets". In: Applied and Computational Harmonic Analysis 44.3 (2018), pp. 774–785.

[28] R. R. Coifman and M. J. Hirn. "Diffusion maps for changing data". In: Applied and Computational Harmonic Analysis 36.1 (2014), pp. 79–107.

[29] R. R. Coifman and S. Lafon. "Diffusion maps". In: Applied and computational harmonic analysis 21.1 (2006), pp. 5–30.

[30] R. R. Coifman and M. Maggioni. "Diffusion wavelets". In: Applied and Computational Harmonic Analysis 21.1 (2006), pp. 53–94.

[31] Ronald R. Coifman and Stéphane Lafon. "Geometric harmonics: a novel tool for multiscale out-of-sample extension of empirical functions". In: Applied and Computational Harmonic Analysis 21.1 (2006), pp. 31–52.

[32] G. Cybenko. "Approximation by superpositions of a sigmoidal function". In: Mathematics of Control, Signals, and Systems 2 (1989), pp. 303–314.
doi: https://doi.org/10.1007/BF02551274.

[33] F. Dai and Y. Xu. Approximation Theory and Harmonic Analysis on Spheres and Balls. Springer Monographs in Mathematics. Springer New York, 2013. isbn: 9781461466604.

[34] L. D. David and G. Carrie. Hessian eigenmaps: new locally linear embedding techniques for high dimensional data. TR2003-08, Dept. of Statistics. 2003.

[35] E. B. Davies. Heat kernels and spectral theory. Vol. 92. Cambridge University Press, 1990.

[36] Baron Gaspard Riche de Prony. "Essai expérimental et analytique: sur les lois de la dilatabilité de fluides élastique et sur celles de la force expansive de la vapeur de l'alkool, à différentes températures". In: Journal de l'école polytechnique 1.22 (1795), pp. 24–76.

[37] Ronald A. DeVore. The approximation of continuous functions by positive linear operators. Vol. 293. Springer, 2006.

[38] Ronald A. DeVore, Ralph Howard, and Charles Micchelli. "Optimal nonlinear approximation". In: Manuscripta Math 63 (1989), pp. 469–478.

[39] D. L. Donoho and C. Grimes. "Image manifolds which are isometric to Euclidean space". In: Journal of mathematical imaging and vision 23.1 (2005), pp. 5–24.

[40] D. L. Donoho et al. "Multiscale geometric analysis for 3d catalogs". In: Astronomical Telescopes and Instrumentation. International Society for Optics and Photonics. 2002, pp. 101–111.

[41] David L. Donoho. "Superresolution via Sparsity Constraints". In: SIAM Journal on Mathematical Analysis 23.5 (1992), pp. 1309–1331. doi: 10.1137/0523074.

[42] M. Ehler, F. Filbir, and H. N. Mhaskar. "Locally Learning Biomedical Data Using Diffusion Frames". In: Journal of Computational Biology 19.11 (2012), pp. 1251–1264.

[43] F. Filbir and H. N. Mhaskar. "A quadrature formula for diffusion polynomials corresponding to a generalized heat kernel". In: Journal of Fourier Analysis and Applications 16.5 (2010), pp. 629–657.

[44] F. Filbir and H. N.
Mhaskar. "Marcinkiewicz–Zygmund measures on manifolds". In: Journal of Complexity 27.6 (2011), pp. 568–596.

[45] Frank Filbir, Hrushikesh Narhar Mhaskar, and Jürgen Prestin. "On the Problem of Parameter Estimation in Exponential Sums". In: Constructive Approximation 35.3 (June 2012), pp. 323–343. doi: 10.1007/s00365-011-9136-9.

[46] J. Friedman and J.-P. Tillich. "Wave equations for graphs and the edge-based Laplacian". In: Pacific Journal of Mathematics 216.2 (2004), pp. 229–266.

[47] Edward Fuselier and Grady B. Wright. "Scattered data interpolation on embedded submanifolds with restricted positive definite kernels: Sobolev error estimates". In: SIAM Journal on Numerical Analysis 50.3 (2012), pp. 1753–1776.

[48] Yaroslav Ganin et al. "Domain-Adversarial Training of Neural Networks". In: Journal of Machine Learning Research 17.59 (2016), pp. 1–35.

[49] W. Gautschi. Orthogonal polynomials: computation and approximation. Oxford University Press on Demand, 2004.

[50] Daryl Geller and Isaac Z. Pesenson. "Band-limited localized Parseval frames and Besov spaces on compact homogeneous manifolds". In: Journal of Geometric Analysis 21.2 (2011), pp. 334–371.

[51] Y. van Gennip et al. "Community detection using spectral clustering on sparse geosocial data". In: SIAM Journal on Applied Mathematics 73.1 (2013), pp. 67–83.

[52] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http://www.deeplearningbook.org. MIT Press, 2016.

[53] M. Graña, M. A. Veganzons, and B. Ayerdi. Hyperspectral Remote Sensing Scenes. https://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes. Grupo De Inteligencia Computacional. 2021.

[54] A. Grigor'yan. "Gaussian upper bounds for the heat kernel on arbitrary manifolds". In: J. Diff. Geom. 45 (1997), pp. 33–52.

[55] A. Grigor'yan. "Upper bounds of derivatives of the heat kernel on an arbitrary complete manifold".
In: Journal of Functional Analysis 127.2 (1995), pp. 363–389.

[56] V. Guillemin and A. Pollack. Differential Topology. AMS Chelsea Pub., 2010. isbn: 9780821851937.

[57] X. He et al. "Face recognition using Laplacianfaces". In: Pattern Analysis and Machine Intelligence, IEEE Transactions on 27.3 (2005), pp. 328–340.

[58] Asmaul Hosna et al. "Transfer learning: a friendly introduction". In: Journal of Big Data 9 (Oct. 2022). doi: 10.1186/s40537-022-00652-w.

[59] Andrei A. Istratov and Oleg F. Vyvenko. "Exponential analysis in physical phenomena". In: Review of Scientific Instruments 70.2 (1999), pp. 1233–1257. issn: 0034-6748. doi: 10.1063/1.1149581. url: https://doi.org/10.1063/1.1149581.

[60] P. W. Jones, M. Maggioni, and R. Schul. "Manifold parametrizations by eigenfunctions of the Laplacian and heat kernels". In: Proceedings of the National Academy of Sciences 105.6 (2008), pp. 1803–1808.

[61] P. W. Jones, M. Maggioni, and R. Schul. "Universal local parametrizations via heat kernels and eigenfunctions of the Laplacian". In: Ann. Acad. Sci. Fenn. Math. 35 (2010), pp. 131–174.

[62] W. H. Kim et al. "Multi-resolutional shape features via non-Euclidean wavelets: Applications to statistical analysis of cortical thickness". In: NeuroImage 93 (2014), pp. 107–123.

[63] Sippanon Kitimoon. "Localized Kernel Methods for Signal Processing". PhD thesis. The Claremont Graduate University, 2025.

[64] Y. A. Kordyukov. "$L^p$-theory of elliptic differential operators on manifolds of bounded geometry". In: Acta Applicandae Mathematica 23.3 (1991), pp. 223–260.

[65] N. Lehmann et al. "Ambient approximation on hypersurfaces". In: Constructive Approximation 49 (2019), pp. 175–190.

[66] Jason P. Lerch et al. "Focal decline of cortical thickness in Alzheimer's disease identified by computational neuroanatomy". In: Cerebral cortex 15.7 (2005), pp. 995–1001.

[67] Moshe Leshno et al.
"Multilayer feedforward networks with a nonpolynomial activation function can approximate any function". In: Neural Networks 6.6 (1993), pp. 861–867. issn: 0893-6080. doi: https://doi.org/10.1016/S0893-6080(05)80131-5. url: https://www.sciencedirect.com/science/article/pii/S0893608005801315.

[68] Weilin Li, Wenjing Liao, and Albert Fannjiang. "Super-Resolution Limit of the ESPRIT Algorithm". In: IEEE Transactions on Information Theory 66.7 (2020), pp. 4593–4608. doi: 10.1109/TIT.2020.2974174.

[69] Z. Li, U. Park, and A. K. Jain. "A discriminative model for age invariant face recognition". In: Information Forensics and Security, IEEE Transactions on 6.3 (2011), pp. 1028–1037.

[70] Wenjing Liao and Mauro Maggioni. "Adaptive Geometric Multiscale Approximations for Intrinsically Low-dimensional Data". In: arXiv preprint arXiv:1611.01179 (2016).

[71] Wenjing Liao, Mauro Maggioni, and Stefano Vigogna. "Learning adaptive multiscale approximations to data and functions near low-dimensional sets". In: Information Theory Workshop (ITW), 2016 IEEE. IEEE. 2016, pp. 226–230.

[72] Lu Lu et al. "Dying ReLU and initialization: Theory and numerical examples". In: arXiv preprint arXiv:1903.06733 (2019).

[73] M. Maggioni and H. N. Mhaskar. "Diffusion polynomial frames on metric measure spaces". In: Applied and Computational Harmonic Analysis 24.3 (2008), pp. 329–353. issn: 1063-5203. doi: https://doi.org/10.1016/j.acha.2007.07.001. url: https://www.sciencedirect.com/science/article/pii/S106352030700070X.

[74] Mauro Maggioni and James M. Murphy. "Learning by active nonlinear diffusion". In: Foundations of Data Science 1.3 (2019), pp. 271–291. doi: 10.3934/fods.2019012. url: https://www.aimsciences.org/article/id/6f8fefb2-e464-48ea-b2de-f37686725966.

[75] Lorenzo Manoni, Claudio Turchetti, and Laura Falaschetti.
“An effective manifold learning approach to parametrize data for generative mo deling of biosignals”. In: IEEE A c c ess 8 (2020), pp. 207112– 207133. [76] Sohir Mask ey, Ron Levie, and Gitta Kutyniok. “T ransferability of graph neural net works: an extended graphon approac h”. In: Applie d and Computational Harmonic Analysis 63 (2023), pp. 48–83. 128 [77] Eric Mason, H. N. Mhask ar, and Adam Guo. “A manifold learning approach for gesture identifi- cation from micro-Doppler radar measurements”. In: Neur al Networks 152 (2022). arXiv preprint arXiv:2110.01670, 2021, pp. 353–369. [78] Andreas Maurer, Massi Pon til, and Bernardino Romera-P aredes. “Sparse co ding for multitask and transfer learning”. In: International c onfer enc e on machine le arning . PMLR. 2013, pp. 343–351. [79] H. Mhask ar. “Approximation prop erties of a m ultilay ered feedforward artificial neural netw ork”. In: A dvanc es in Computational Mathematics 1 (1993), pp. 61–80. doi : 10.1007/BF02070821 . [80] H. N. Mhask ar. “A direct approac h for function appro ximation on data defined manifolds”. In: Neur al Networks 132 (2020), pp. 253–268. issn : 0893-6080. [81] H. N. Mhask ar. “A generalized diffusion frame for parsimonious representation of functions on data defined manifolds”. In: Neur al Networks 24.4 (2011), pp. 345–359. [82] H. N. Mhask ar. “Dimension indep endent b ounds for general shallow netw orks”. In: Neur al Networks 123 (2020), pp. 142–152. [83] H. N. Mhask ar. “Eignets for function approximation on manifolds”. In: Applie d and Computational Harmonic Analysis 29.1 (2010), pp. 63–87. [84] H. N. Mhask ar. “F unction approximation with zonal function net works with activ ation functions analogous to the rectified linear unit functions”. In: Journal of Complexity 51 (April 2019), pp. 1–19. [85] H. N. Mhask ar. “Kernel-Based Analysis of Mass iv e Data”. In: F r ontiers in Applie d Mathematics and Statistics 6 (Oct. 2020), p. 30. [86] H. N. Mhask ar. 
“Neural Netw orks for Optimal Approximation of Smooth and Analytic F unctions”. In: Neur al Computation 8.1 (1996), pp. 164–177. doi : 10.1162/neco.1996.8.1.164 . [87] H. N. Mhask ar, S. Kitimo on, and Raghu G. Ra j. R obust and tr actable multidimensional exp onential analysis . 2025. arXiv: 2404.11004 [eess.SP] . url : . [88] H. N. Mhask ar and Ryan O’Do wd. A signal sep ar ation view of classific ation . 2025. arXiv: 2509.24140 [cs.LG] . url : https://arxiv.org/abs/2509.24140 . [89] H. N. Mhask ar and Ry an O’Dowd. “Lo cal T ransfer Learning from One Data Space to Another”. In: Inverse Pr oblems, R e gularization Metho ds and R elate d T opics: A V olume in Honour of Thamb an Nair . Ed. by Sergei V. P ereverzy ev, R. Radha, and S. Siv anan than. Singap ore: Springer Nature Singap ore, 2025, pp. 183–209. isbn : 978-981-97-7989-5. doi : 10. 1007 / 978 - 981 - 97- 7989 - 5 _ 9 . url : https://doi.org/10.1007/978- 981- 97- 7989- 5_9 . 129 [90] H. N. Mhask ar and Ry an O’Dowd. “Learning on manifolds without manifold learning”. In: Neur al Networks 181 (2025), p. 106759. issn : 0893-6080. doi : https://doi.org/10.1016/j.neunet.2024. 106759 . url : https://www.sciencedirect.com/science/article/pii/S089360802400683X . [91] H. N. Mhask ar and D. V. P ai. F undamentals of Appr oximation The ory . New Dehli, India: Narosa Publishing House, 2000. [92] H. N. Mhask ar, S. V. Perev erzyev, and M. D. v an der W alt. “A F unction Approximation Approach to the Prediction of Blo o d Glucose Lev els”. In: F r ontiers in Applie d Mathematics and Statistics 7 (2021), p. 53. issn : 2297-4687. [93] H. N. Mhask ar and J. Prestin. “On local smoothness classes of p erio dic functions”. In: Journal of F ourier Analysis and Applic ations 11.3 (2005), pp. 353–373. [94] H.N Mhask ar and Charles A Micc helli. “Approximation by sup erp osition of sigmoidal and radial basis functions”. In: A dvanc es in Applie d Mathematics 13.3 (1992), pp. 350–373. issn : 0196-8858. doi : https : / / doi . 
org / 10 . 1016 / 0196 - 8858(92 ) 90016 - P . url : https : / / www . sciencedirect . com/science/article/pii/019688589290016P . [95] H.N. Mhask ar. “A unified framew ork for harmonic analysis of functions on directed graphs and c hang- ing data”. In: Applie d and Computational Harmonic Analysis 44.3 (2018), pp. 611–644. issn : 1063- 5203. doi : https: / /doi .org / 10. 1016/ j .acha .2016 . 06. 007 . url : https :/ /www . sciencedirect. com/science/article/pii/S1063520316300343 . [96] H.N. Mhask ar. “On the represen tation of smo oth functions on the sphere using finitely many bits”. In: Applie d and Computational Harmonic Analysis 18.3 (2005), pp. 215–233. issn : 1063-5203. doi : https : / / doi . org / 10 . 1016 / j . acha . 2004 . 11 . 004 . url : https : / / www . sciencedirect . com / science/article/pii/S1063520304000831 . [97] H.N. Mhask ar. “P olynomial op erators and lo cal smo othness classes on the unit interv al”. In: Journal of Appr oximation The ory 131.2 (2004), pp. 243–267. issn : 0021-9045. doi : https : / / doi . org / 10 . 1016 / j . jat . 2004 . 10 . 002 . url : https : / / www . sciencedirect . com / science / article / pii / S0021904504001698 . [98] Hrushikesh Mhask ar, Ryan O’Dowd, and Efstratios Tsouk anis. “Active learning classification from a signal separation p ersp ective”. In: 2025 International Confer enc e on Sampling The ory and Applic a- tions (SampT A) . IEEE. 2025, pp. 1–5. 130 [99] Hrushikesh N. Mhask ar, Sergei V. Perev erzyev, and Maria D. v an der W alt. “A Deep Learning Ap- proac h to Diab etic Blo o d Glucose Prediction”. In: F r ontiers in Applie d Mathematics and Statis- tics 3 (2017), p. 14. issn : 2297-4687. doi : 10 . 3389 / fams . 2017 . 00014 . url : http : / / journal . frontiersin.org/article/10.3389/fams.2017.00014 . [100] Hrushikesh N. Mhask ar, Efstratios Tsouk anis, and Amey a D. Jagtap. A n Appr oximation The ory Persp e ctive on Machine L e arning . 2025. arXiv: 2506 . 02168 [cs.LG] . 
url : https : / / arxiv . org / abs/2506.02168 . [101] Hrushikesh Narhar Mhask ar. “Sup er-resolution meets mac hine learning: appro ximation of measures”. In: Journal of F ourier Analysis and Applic ations 25.6 (2019), pp. 3104–3122. [102] Hrushikesh Narhar Mhask ar and Charles A. Micchelli. “Degree of Approximation b y Neural and T ranslation Netw orks with a Single Hidden Lay er”. In: A dvanc es in Applie d Mathematics 16 (1995), pp. 151–183. url : https://api.semanticscholar.org/CorpusID:119768191 . [103] Hrushikesh Narhar Mhask ar and Devidas V Pai. F undamentals of appr oximation the ory . CRC Press, 2000. [104] Claus M ¨ uller. Spheric al harmonics . V ol. 17. Springer, 2006. [105] David C Munson, James D O’Brien, and W Kenneth Jenkins. “A tomographic formulation of sp otligh t- mo de syn thetic ap erture radar”. In: Pr o c e e dings of the IEEE 71.8 (1983), pp. 917–925. [106] Kevin P Murphy. Pr ob abilistic machine le arning: an intr o duction . MIT press, 2022. [107] F rank Natterer. The mathematics of c omputerize d tomo gr aphy . SIAM, 2001. [108] V. Naumov a et al. Meta-le arning b ase d Blo o d Gluc ose Pr e dictor for Diab etic Smartphone App . 2014. [109] Clifford J Nolan and Margaret Cheney. “Synthetic ap erture in version”. In: Inverse Pr oblems 18.1 (2002), p. 221. [110] Adam Now ak and Peter Sj¨ ogren. “Sharp estimates of the Jacobi heat kernel”. In: arXiv pr eprint arXiv:1111.3145 (2011). [111] Philipp P etersen and F elix V oigtlaender. “Optimal approximation of piecewise smooth functions using deep ReLU neural netw orks”. In: Neur al Networks 108 (2018), pp. 296–330. issn : 0893-6080. doi : https : / / doi . org / 10 . 1016 / j . neunet . 2018 . 08 . 019 . url : https : / / www . sciencedirect . com / science/article/pii/S0893608018302454 . [112] Allan Pinkus. “Approximation theory of the MLP mo del in neural netw orks”. In: A cta Numeric a 8 (1999), pp. 143–195. url : https://api.semanticscholar.org/CorpusID:16800260 . 131 [113] G. 
Plonk a et al. Numeric al F ourier Analysis . Applied and Numerical Harmonic Analysis. Springer In ternational Publishing, 2023. isbn : 9783031350054. url : https :// books. google.com /books ?id= ZVXiEAAAQBAJ . [114] Bogdan Raonic et al. “Con volutional neural op erators for robust and accurate learning of PDEs”. In: A dvanc es in Neur al Information Pr o c essing Systems 36 (2024). [115] John Ro e. El liptic op er ators, top olo gy and asymptotic metho ds . 2nd ed. Addison W esley Longman Inc., 1998. isbn : 0582325021. [116] S. T. Ro weis and L. K. Saul. “Nonlinear dimensionalit y reduction by lo cally linear em b edding”. In: Scienc e 290.5500 (2000), pp. 2323–2326. [117] Michael Rozo wski et al. “Input lay er regularization for magnetic resonance relaxometry biexp onential parameter estimation”. In: Magnetic R esonanc e in Chemistry 60(11) (2022), pp. 1076–1086. doi : https://doi.org/10.1002/mrc.5289 . [118] Kh P Rustamov. “ON APPRO XIMA TION OF FUNCTIONS ON THE SPHERE”. In: Izvestiya: Mathematics 43.2 (Apr. 1994), p. 311. [119] V enu Satuluri and Sriniv asan P arthasarathy. “Symmetrizations for clustering directed graphs”. In: Pr o c e e dings of the 14th International Confer enc e on Extending Datab ase T e chnolo gy . ACM. 2011, pp. 343–354. [120] Johannes Schmidt-Hieber. “Deep ReLU net w ork appro ximation of functions on a manifold”. In: arXiv pr eprint arXiv:1908.00695 (2019). [121] Uri Shaham, Alexander Cloninger, and Ronald R. Coifman. “Prov able approximation prop erties for deep neural netw orks”. In: Applie d and Computational Harmonic A nalysis 44.3 (2018), pp. 537– 557. issn : 1063-5203. doi : https : / / doi . org / 10 . 1016 / j . acha . 2016 . 04 . 003 . url : https : //www.sciencedirect.com/science/article/pii/S1063520316300033 . [122] Manali Sharma and Mustafa Bilgic. “Evidence-based uncertain ty sampling for activ e learning”. In: Data Mining and Know le dge Disc overy 31 (2016), pp. 164–202. 
[123] Zuow ei Shen, Haizhao Y ang, and Shijun Zhang. “Optimal appro ximation rate of ReLU netw orks in terms of width and depth”. In: Journal de Math ´ ematiques Pur es et Appliqu ´ ees 157 (2022), pp. 101– 135. issn : 0021-7824. doi : https : / / doi . org / 10 . 1016 / j . matpur . 2021 . 07 . 009 . url : https : //www.sciencedirect.com/science/article/pii/S0021782421001124 . [124] A. Sikora. “Riesz transform, Gaussian b ounds and the metho d of wa v e equation”. In: Mathematische Zeitschrift 247.3 (2004), pp. 643–662. 132 [125] A. Singer. “F rom graph to manifold Laplacian: The con v ergence rate”. In: Applie d and Computational Harmonic Analysis 21.1 (2006). Sp ecial Issue: Diffusion Maps and W a v elets, pp. 128–134. issn : 1063- 5203. doi : https: / /doi .org / 10. 1016/ j .acha .2006 . 03. 004 . url : https :/ /www . sciencedirect. com/science/article/pii/S1063520306000510 . [126] Barak Sober, Y ariv Aizenbud, and David Levin. “Appro ximation of functions ov er manifolds: A Mo ving Least-Squares approac h”. In: Journal of Computational and Applie d Mathematics 383 (2021), p. 113140. issn : 0377-0427. doi : https : / / doi . org / 10 . 1016 / j . cam . 2020 . 113140 . url : https : //www.sciencedirect.com/science/article/pii/S0377042720304313 . [127] Elias M Stein and Guido W eiss. Intr o duction to F ourier analysis on Euclide an sp ac es (PMS-32) . V ol. 32. Princeton universit y press, 2016. [128] Elias M. Stein. Singular Inte gr als and Differ entiability Pr op erties of F unctions (PMS-30) . Princeton Univ ersity Press, 1970. isbn : 9780691080796. url : http : / / www . jstor . org / stable / j . ctt1bpmb07 (visited on 05/09/2025). [129] G. Szeg¨ o. Ortho gonal Polynomials . American Math. So c: Collo quium publ. American Mathematical So ciet y, 1975. isbn : 9780821810231. url : https://books.google.com/books?id=ZOhmnsXlcY0C . [130] J. B. T enenbaum, V. De Silv a, and J. C. Langford. “A global geometric framework for nonlinear dimensionalit y reduction”. 
In: Scienc e 290.5500 (2000), pp. 2319–2323. [131] Alaa Tharw at and W olfram Sc henck. “A No v el Low-Query-Budget Active Learner with Pseudo-Lab els for Imbalanced Data”. In: Mathematics 10.7 (2022). issn : 2227-7390. doi : 10 . 3390 / math10071068 . url : https://www.mdpi.com/2227- 7390/10/7/1068 . [132] Alaa Tharwat and W olfram Sc henc k. “A Survey on Activ e Learning: State-of-the-Art, Practical Challenges and Research Directions”. In: Mathematics 11.4 (2023). issn : 2227-7390. url : https : //www.mdpi.com/2227- 7390/11/4/820 . [133] Kabir T ripathi and James M. Murphy. “Learning By Evolving Nonlinear Diffusion for Activ e Learning on Hyp ersp ectral Images”. In: 2024 14th Workshop on Hyp ersp e ctr al Imaging and Signal Pr o c essing: Evolution in R emote Sensing (WHISPERS) . 2024, pp. 1–5. [134] K. Q. W einberger, B. D. P ack er, and L. K. Saul. “Nonlinear dimensionalit y reduction by semidefinite programming and k ernel matrix factorization”. In: Pr o c e e dings of the tenth international workshop on artificial intel ligenc e and statistics . Citeseer. 2005, pp. 381–388. [135] Z.-Y. Zhang and H.-Y. Zha. “Principal manifolds and nonlinear dimensionality reduction via tangen t space alignmen t”. In: Journal of Shanghai University (English Edition) 8.4 (2004), pp. 406–424. 133 [136] Yinglun Zhu and Robert Now ak. “Activ e Learning with Neural Net works: Insigh ts from Nonparamet- ric Statistics”. In: A dvanc es in Neur al Information Pr o c essing Systems . Ed. b y S. Koy ejo et al. V ol. 35. Curran Asso ciates, Inc., 2022, pp. 142–155. url : https://proceedings.neurips.cc/paper_files/ paper/2022/file/01025a4e79355bb37a10ba39605944b5- Paper- Conference.pdf . 134
