Estimating the Average of a Lipschitz-Continuous Function from One Sample
Authors: Abhimanyu Das, David Kempe
November 13, 2018

Abstract

We study the problem of estimating the average of a Lipschitz-continuous function $f$ defined over a metric space, by querying $f$ at only a single point. More specifically, we explore the role of randomness in drawing this sample. Our goal is to find a distribution minimizing the expected estimation error against an adversarially chosen Lipschitz-continuous function. Our work falls into the broad class of estimating aggregate statistics of a function from a small number of carefully chosen samples. The general problem has a wide range of practical applications in areas as diverse as sensor networks, social sciences, and numerical analysis. However, traditional work in numerical analysis has focused on asymptotic bounds, whereas we are interested in the best algorithm. For arbitrary discrete metric spaces of bounded doubling dimension, we obtain a PTAS for this problem. In the special case when the points lie on a line, the running time improves to an FPTAS. Both algorithms are based on approximately solving a linear program with an infinite set of constraints, by using an approximate separation oracle. For Lipschitz-continuous functions over $[0,1]$, we calculate the precise achievable error as $1 - \frac{\sqrt{3}}{2} \approx 0.134$, which improves upon the $\frac{1}{4}$ which is best possible for deterministic algorithms.

1 Introduction

One of the most fundamental problems in data-driven sciences is to estimate some aggregate statistic of a real-valued function $f$ by sampling $f$ in few places. Frequently, obtaining samples incurs a cost in terms of human labor, computation, energy, or time.
Thus, researchers face an inherent tradeoff between the accuracy of estimating the aggregate statistic and the number of samples required. With samples a scarce resource, it becomes an important computational problem to determine where to sample $f$, and how to post-process the samples. Naturally, there are many mathematical formulations of this estimation problem, depending on the aggregate statistic that we wish to estimate (such as the average, median, or maximum value), the error objective that we wish to minimize (such as worst-case absolute error, average-case squared error, etc.), and on the conditions imposed on the function. In this paper, we study algorithms optimizing a worst-case error objective, i.e., we assume that $f$ is chosen adversarially. Motivated by the applications described below, we use Lipschitz-continuity to impose a "smoothness" condition on $f$. (Note that without any smoothness conditions on $f$, we cannot hope to approximate any aggregate function in an adversarial setting without learning all function values.) That is, we assume that the domain of $f$ is a metric space, and that $f$ is Lipschitz-continuous over its domain. Thus, nearby points are guaranteed to have similar function values.

∗ Supported in part by NSF CAREER award 0545855, and NSF grant DDDAS-TMRP 0540420.

Here, we focus on perhaps the simplest aggregation function: the average $\bar{f}$. Despite its simplicity, it has many natural applications, such as:

1. In sensor networks covering a geographical area, the average of a natural phenomenon (such as temperature or pressure) is frequently one of the most interesting quantities. Here, nearby locations tend to yield similar measurements. Since energy is a scarce resource, it is desirable to sample only a few of the deployed sensors.

2. In population surveys, researchers are frequently interested in the average of quantities such as income or education level. A metric on the population may be based on job similarity, which would have strong predictive value for these quantities. Interviewing a subject is time-consuming, and thus sample sizes tend to be much smaller than the entire population.

3. In numerical analysis, one of the most fundamental problems is numerical integration of a function. If the domain is continuous, this corresponds precisely to computing the average. If the function to be integrated is costly to evaluate, then again, it is desirable to sample a small number of points.

If $f$ is to be evaluated at $k$ points, chosen deterministically and non-adaptively, then previous work [4] shows that the optimum sampling locations for estimating the average of $f$ form a $k$-median of the metric space. However, the problem becomes significantly more complex when the algorithm gets to randomize its choices of sampling locations. In fact, even the seemingly trivial case of $k = 1$ turns out to be highly non-trivial, and is the focus of this paper. Addressing this case is an important step toward the ultimate goal of understanding the tradeoffs between the number of samples and the estimation error.

Formally, we thus study the following question: Given a metric space $\mathcal{M}$, a randomized sampling algorithm is described by (1) a method for sampling a location $x \in \mathcal{M}$ from a distribution $p$; (2) a function $g$ for predicting the average $\bar{f}$ of the function $f$ over $\mathcal{M}$, using the sample $(x, f(x))$. The expected estimation error is then $E(p, g, f) = \sum_{x \in \mathcal{M}} p_x \cdot |g(x, f(x)) - \bar{f}|$. (The sum is replaced by an integral, and $p$ by a density, if $\mathcal{M}$ is continuous.)
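In the finite case, this expected error is just a weighted sum. The following small sketch (our own illustration; the function name `expected_error` and the three-point instance are ours, not from the paper) evaluates $E(p, \mathrm{id}, f)$ for a metric on the line with identity post-processing:

```python
# Toy sketch of the expected estimation error E(p, g, f) for a finite
# metric space, with identity post-processing g(x, v) = v.

def expected_error(points, p, f):
    """E(p, id, f) = sum_x p_x * |f(x) - fbar| over a finite point set."""
    fbar = sum(f(x) for x in points) / len(points)  # the average of f
    return sum(px * abs(f(x) - fbar) for x, px in zip(points, p))

# Three points on [0, 1]; both test functions are 1-Lipschitz.
points = [0.0, 0.5, 1.0]
p = [0.0, 1.0, 0.0]                    # always sample the median point 0.5
print(expected_error(points, p, lambda x: x))             # error 0: f(0.5) = fbar
print(expected_error(points, p, lambda x: abs(x - 0.5)))  # error 1/3
```

Note how the error depends on the adversary's choice of $f$: sampling the median is perfect against $f(x) = x$ but pays $1/3$ against $f(x) = |x - 1/2|$, which is exactly the tension the worst-case objective below captures.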
The worst-case error is $E_w(p, g) = \sup_{f \in \mathcal{L}} E(p, g, f)$, where $\mathcal{L}$ is the set of all 1-Lipschitz-continuous functions defined on $\mathcal{M}$. Our goal is to find a randomized sampling algorithm (i.e., a distribution $p$ and function $g$, computable in polynomial time) that (approximately) minimizes $E_w(p, g)$.

In this paper, we provide a PTAS for this problem of minimizing $E_w(p, g)$, for any discrete metric space $\mathcal{M}$ with constant doubling dimension. (This includes constant-dimensional Euclidean metric spaces.) For discrete metric spaces $\mathcal{M}$ embedded on a line, we improve this result to an FPTAS. Both of these algorithms are based on a linear program with infinitely many constraints, for which an approximate separation oracle is obtained.

We next study the perhaps simplest variant of this problem, in which the metric space is the interval $[0, 1]$. While the worst-case error of any deterministic algorithm is obviously $\frac{1}{4}$ in this case, we show that for a randomized algorithm, the bound improves to $1 - \frac{\sqrt{3}}{2}$. We prove this by providing an explicit distribution, and obtaining a matching lower bound using Yao's Minimax Principle. Our result can also be interpreted as showing how "close" a collection of Lipschitz-continuous functions on $[0, 1]$ must be.

1.1 Related Work

Estimating the integral of a smooth function $f$ using its values at a discrete set of points is one of the core problems in numerical analysis. The tradeoffs between the number of samples needed and the estimation error bounds have been investigated in detail under the name of Information-Based Complexity (IBC) [10, 11]. More generally, IBC studies the problem of computing approximations to an operator $S(f)$ on functions $f$ from a set $F$ (with certain "smoothness" properties) using a finite set of samples $N(f) = [L_1(f), L_2(f), \ldots, L_n(f)]$. The $L_i$ are functionals.
For a given algorithm $U$, its error is $E(U) = \sup_{f \in F} \|S(f) - U(f)\|$. The goal in IBC is to find an $\epsilon$-approximation $U$ (i.e., ensuring that $E(U) \leq \epsilon$) with least information cost $c(U) = n$. One of the common problems in IBC is multivariate integration of real-valued functions with a smoothness parameter $r$ over $d$-dimensional unit balls. For such problems, Bakhvalov [2] designed a randomized algorithm providing an $\epsilon$-approximation with cost $\Theta\big((\frac{1}{\epsilon})^{2d/(d+2r)}\big)$. Bakhvalov [2] and Novak [9] also show that this cost is asymptotically optimal. The papers by Novak [9] and Mathe [7] show that if $r = 0$, then simple Monte-Carlo integration algorithms (which sample from the uniform distribution) have an asymptotically optimal cost of $\frac{1}{\epsilon^2}$.

In [13, 14], Wozniakowski studied the average-case complexity of linear multivariate IBC problems, and derived conditions under which the problems are tractable, i.e., have cost polynomial in $\frac{1}{\epsilon}$ and $d$. Wojtaszczyk [12] proved that the multivariate integration problem is not strongly tractable (polynomial in $\frac{1}{\epsilon}$ and independent of $d$).

In [3], Baran et al. study the IBC problem in the univariate integration model for Lipschitz-continuous functions, and formulate approximation bounds in an adaptive setting. That is, the sampling strategy can change adaptively based on the previously sampled values. They provide deterministic and randomized $\epsilon$-approximation algorithms which, for any problem instance $P$, use $O(\log(\frac{1}{\epsilon \cdot \mathrm{OPT}}) \cdot \mathrm{OPT})$ samples for the deterministic case and $O(\mathrm{OPT}^{4/3} + \mathrm{OPT} \cdot \log(\frac{1}{\epsilon}))$ samples for the randomized case. Here, $\mathrm{OPT}$ is the optimal number of samples for the problem instance $P$. They prove that their algorithms are asymptotically optimal, compared to any other adaptive algorithm.
There are two main differences between the results in IBC and our work: first, IBC treats the target approximation as given and the number of samples as the quantity to be minimized. Our goal is to minimize the expected worst-case error with a fixed number of samples (one). More importantly, results in IBC are traditionally asymptotic, ignoring constants. For a single sample, this would trivialize the problem: it is implicit in our proofs that sampling at the metric space's median is a constant-factor approximation to the best randomized algorithm.

The deterministic version of our problem was studied previously in [4]. There, it was shown that the best sampling locations for reading $k$ values non-adaptively constitute the optimal $k$-median of the metric space. Thus, the algorithm of Arya et al. [1] gives a polynomial-time $(3 + \epsilon)$-approximation algorithm to identify the best $k$ values to read.

2 Preliminaries

We are interested in real-valued Lipschitz-continuous functions over metric spaces of constant doubling dimension (e.g., [6]). Let $(\mathcal{M}, d)$ be a compact metric space with distances $d(x, y)$ between pairs of points. W.l.o.g., we assume that $\max_{x,y \in \mathcal{M}} d(x, y) = 1$. We require $(\mathcal{M}, d)$ to have constant doubling dimension $\beta$, i.e., for every $\delta$, each ball of diameter $\delta$ can be covered by at most $c^\beta$ balls of diameter $\delta/c$, for any $c \geq 2$.

A real-valued function $f$ is Lipschitz-continuous (with constant 1) if $|f(x) - f(y)| \leq d(x, y)$ for all points $x, y$. We define $\mathcal{L}$ to be the set of all such Lipschitz-continuous functions $f$, i.e., $\mathcal{L} = \{f \mid |f(x) - f(y)| \leq d(x, y) \text{ for all } x, y\}$. Since we will frequently want to bound the function values, we also define $\mathcal{L}_c = \{f \in \mathcal{L} \mid |\int_x f(x)\,dx| \leq c\}$. Notice that $\mathcal{L}_c$ is a compact set. We wish to predict the average $\bar{f} = \int_x f(x)\,dx$ of all the function values.
When $\mathcal{M}$ is finite of size $n$, the average is of course $\bar{f} = \frac{1}{n} \cdot \sum_x f(x)$ instead. The algorithm first gets to choose a single point $x$ according to a (polynomial-time computable) density function $p$; it then learns the value $f(x)$, and may post-process it with a prediction function $g(x, f(x))$ to produce its estimate of the average $\bar{f}$. The goal is to minimize the expected estimation error of the average, assuming $f$ is chosen adversarially from $\mathcal{L}$ with knowledge of the algorithm, but not its random choices. Formally, the goal is to minimize $E_w(p, g) = \sup_{f \in \mathcal{L}} \big(\int_x p_x \cdot |\bar{f} - g(x, f(x))|\,dx\big)$. If $\mathcal{M}$ is finite, then $p$ will be a probability distribution instead of a density, and the error can now be written as $E_w(p, g) = \sup_{f \in \mathcal{L}} \big(\sum_x p_x \cdot |\bar{f} - g(x, f(x))|\big)$.

Formally, we consider an algorithm to be the pair $(p, g)$ of the distribution and prediction function. Let $\mathcal{A}$ denote the set of all such pairs, and $\mathcal{D}$ the set of all deterministic algorithms, i.e., algorithms for which $p$ has all its density on a single point. Our analysis will make heavy use of Yao's Minimax Principle [8]. To state it, we define $\mathbb{L}$ to be the set of all probability distributions over $\mathcal{L}$. We also define the estimation error $\Delta(f, A) = \int_x p_x \cdot |\bar{f} - g(x, f(x))|\,dx$, where $A$ corresponds to the pair $(p, g)$.

Theorem 2.1 (Yao's Minimax Principle [8]) $\sup_{q \in \mathbb{L}} \inf_{A \in \mathcal{D}} \mathbb{E}_{f \sim q}[\Delta(f, A)] = \inf_{A \in \mathcal{A}} \sup_{f \in \mathcal{L}} \Delta(f, A)$.

We first show that, without loss of generality, we can focus on algorithms whose post-processing is just to output the observed value, i.e., algorithms $(p, \mathrm{id})$ with $\mathrm{id}(x, y) = y$ for all $x, y$. When $g$ is the identity function, we simply write $\Delta(f, p) = \int_x p_x \cdot |\bar{f} - f(x)|\,dx$ for the error incurred by using the distribution $p$.

Theorem 2.2 Let $A^* = (p^*, g^*)$ be the optimum randomized algorithm.
Then, for every $\epsilon > 0$, there is a randomized algorithm $A = (p, \mathrm{id})$ such that $E_w(A) \leq E_w(A^*) + \epsilon$.

Proof. Let $\mathcal{A}_I$ denote the set of all (randomized) algorithms using the identity function for post-processing, i.e., $\mathcal{A}_I = \{A = (p, \mathrm{id}) \mid p \text{ is a distribution over } \mathcal{M}\}$. For the analysis, we are interested in equivalence classes of functions; we say that $f, f'$ are equivalent if either (1) there exists a constant $c$ such that $f(x) = c + f'(x)$ for all $x$, or (2) there is a constant $c$ such that $f(x) = c - f'(x)$ for all $x$. Let $\tilde{f}$ denote the equivalence class of $f$, i.e., the set of all vertical translations of $f$ and its vertical reflection.

Given $\epsilon$, we let $r$ be a large enough constant defined below. A distribution $q$ over $\mathcal{L}_r$ is called equivalence-uniform if the distribution, restricted to any equivalence class, is uniform.¹ That is, for any $f' \in \tilde{f}$ with $f', f \in \mathcal{L}_r$, we have $q(f') = q(f)$. Let $\mathcal{U}_r$ denote the set of all equivalence-uniform distributions over $\mathcal{L}_r$. We will show two facts:

1. If $q \in \mathcal{U}_r$, then for any deterministic algorithm $A \in \mathcal{D}$, there is a deterministic algorithm $A' \in \mathcal{D} \cap \mathcal{A}_I$, which outputs simply the value it sees, such that
$$\mathbb{E}_{f \sim q}[\Delta(f, A')] \leq \mathbb{E}_{f \sim q}[\Delta(f, A)] + \epsilon/2. \quad (1)$$

2. For any distribution $q \in \mathbb{L}$ of Lipschitz-continuous functions, there is an equivalence-uniform distribution $q' \in \mathcal{U}_r$ (where $r$ may depend on $q$) such that
$$\inf_{A \in \mathcal{D} \cap \mathcal{A}_I} \mathbb{E}_{f \sim q}[\Delta(f, A)] \leq \inf_{A \in \mathcal{D} \cap \mathcal{A}_I} \mathbb{E}_{f \sim q'}[\Delta(f, A)] + \epsilon/2. \quad (2)$$

Using these two inequalities, and applying Yao's Minimax Theorem twice, then completes the proof as follows:
$$\inf_{A \in \mathcal{A}_I} \sup_{f \in \mathcal{L}} \Delta(f, A) = \sup_{q \in \mathbb{L}} \inf_{A \in \mathcal{D} \cap \mathcal{A}_I} \mathbb{E}_{f \sim q}[\Delta(f, A)] \overset{(2)}{\leq} \sup_{q \in \mathcal{U}_r} \inf_{A \in \mathcal{D} \cap \mathcal{A}_I} \mathbb{E}_{f \sim q}[\Delta(f, A)] + \epsilon/2 \overset{(1)}{\leq} \sup_{q \in \mathcal{U}_r} \inf_{A \in \mathcal{D}} \mathbb{E}_{f \sim q}[\Delta(f, A)] + \epsilon \leq \sup_{q \in \mathbb{L}} \inf_{A \in \mathcal{D}} \mathbb{E}_{f \sim q}[\Delta(f, A)] + \epsilon = \inf_{A \in \mathcal{A}} \sup_{f \in \mathcal{L}} \Delta(f, A) + \epsilon.$$
It thus remains to show the two inequalities. We begin with Inequality (1). Let $x$ be the point at which $A$ samples the function. For any function $f \in \mathcal{L}_r$, let $\hat{f}$ be the "flipped" function around $x$, defined by $\hat{f}(y) = 2f(x) - f(y)$ for all $y$. Let $\mathcal{L}'$ be the set of all functions $f$ such that both $f$ and $\hat{f}$ are in $\mathcal{L}_r$. Because $\bar{\hat{f}} = 2f(x) - \bar{f}$ and $|f(x) - \bar{f}| \leq \frac{1}{2}$ for all $x$, we obtain that $\mathcal{L}' \supseteq \mathcal{L}_{r-1}$. Also, because $\hat{f} \in \tilde{f}$ and $q \in \mathcal{U}_r$, we have $q(\hat{f}) = q(f)$ whenever $f \in \mathcal{L}'$. We thus obtain that
$$\mathbb{E}_{f \sim q}[\Delta(f, A)] = \int_{\mathcal{L}'} |g(x, f(x)) - \bar{f}|\, q(f)\, df + \int_{\mathcal{L}_r \setminus \mathcal{L}'} |g(x, f(x)) - \bar{f}|\, q(f)\, df \geq \frac{1}{2} \int_{\mathcal{L}'} \big(|g(x, f(x)) - \bar{f}| + |g(x, \hat{f}(x)) - \bar{\hat{f}}|\big)\, q(f)\, df = \frac{1}{2} \int_{\mathcal{L}'} \big(|g(x, f(x)) - \bar{f}| + |g(x, f(x)) - \bar{\hat{f}}|\big)\, q(f)\, df \geq \frac{1}{2} \int_{\mathcal{L}'} |\bar{f} - \bar{\hat{f}}|\, q(f)\, df.$$
For the first inequality, we dropped the second integral, and used the symmetry of the distribution to write the first integral twice and then regroup. The second step used that, by definition, $\hat{f}(x) = f(x)$, and the third the inverse triangle inequality. By definition, $f(x)$ lies between $\bar{f}$ and $\bar{\hat{f}}$; therefore, $|\bar{f} - \bar{\hat{f}}| = |\bar{f} - f(x)| + |f(x) - \bar{\hat{f}}|$, and we can further bound
$$\mathbb{E}_{f \sim q}[\Delta(f, A)] \geq \frac{1}{2} \int_{\mathcal{L}'} \big(|\bar{f} - f(x)| + |f(x) - \bar{\hat{f}}|\big)\, q(f)\, df = \int_{\mathcal{L}'} |\bar{f} - f(x)|\, q(f)\, df \geq \int_{\mathcal{L}_r} |\bar{f} - f(x)|\, q(f)\, df - \epsilon/2,$$
because symmetry of $q$ implies that $\mathrm{Prob}[f \notin \mathcal{L}'] \leq 1/r \leq \epsilon/2$ (we will set $r \geq 2/\epsilon$), and Lipschitz continuity implies that $|\bar{f} - f(x)| \leq 1$ for all $x$.

¹ Unfortunately, this definition does not extend to $\mathcal{L}$, since $\tilde{f}$ is not bounded, and a uniform distribution is thus not defined. This issue causes the $\epsilon$ terms in the theorem.

Next, we prove Inequality (2). Let $q$ be an arbitrary distribution, and $r$ large enough such that $\mathrm{Prob}_{f \sim q}[f \notin \mathcal{L}_r] \leq \epsilon/2$.
First, we truncate $q$ to a distribution $q''$ over $\mathcal{L}_r$: we set $q''(f) = 0$ for all $f \notin \mathcal{L}_r$, and renormalize by setting $q''(f) = \frac{1}{\mathrm{Prob}_{f \sim q}[f \in \mathcal{L}_r]} \cdot q(f)$ for $f \in \mathcal{L}_r$. Next, we define a distribution $\tilde{q}$ over equivalence classes $\tilde{f}$ as $\tilde{q}(\tilde{f}) = \int_{f' \in \tilde{f}} q''(f')\, df'$; finally, let $q'$ be defined by choosing an equivalence class $\tilde{f}$ according to $\tilde{q}$, and subsequently choosing a member of $\tilde{f} \cap \mathcal{L}_r$ uniformly at random; clearly, $q'$ is equivalence-uniform.

Let $A \in \mathrm{argmin}_{A \in \mathcal{D} \cap \mathcal{A}_I} \mathbb{E}_{f \sim q'}[\Delta(f, A)]$ be a deterministic algorithm with identity post-processing function minimizing the expected estimation error for $q'$; let $x$ be the point at which $A$ samples the function. Since the algorithm always samples at $x$ and outputs $f'(x)$, the estimation error $|f'(x) - \bar{f'}|$ is the same for all $f' \in \tilde{f}$, because all these $f'$ are simply shifted or mirrored from each other. If used instead on the initial distribution $q$, $A$ has expected estimation error
$$\int_{\mathcal{L}} |f(x) - \bar{f}|\, q(f)\, df \leq \mathrm{Prob}_{f \sim q}[f \notin \mathcal{L}_r] + \int_{\mathcal{L}_r} |f(x) - \bar{f}|\, q''(f)\, df \leq \epsilon/2 + \int \int_{f' \in \tilde{f}} |f'(x) - \bar{f'}|\, q''(f')\, df'\, d\tilde{f} = \epsilon/2 + \int |f(x) - \bar{f}|\, \tilde{q}(\tilde{f})\, d\tilde{f} = \epsilon/2 + \int \int_{f' \in \tilde{f}} |f'(x) - \bar{f'}|\, q'(f')\, df'\, d\tilde{f} = \epsilon/2 + \int_{\mathcal{L}_r} |f(x) - \bar{f}|\, q'(f)\, df.$$
The inequality in the first step came from upper-bounding the estimation error outside $\mathcal{L}_r$ by 1, and using that $q''(f) \geq q(f)$.

3 Discrete Metric Spaces

In this section, we focus on finite metric spaces, consisting of $n$ points. Thus, instead of integrals and densities, we will be considering sums and probability distributions. The characterization of using the identity function for post-processing from Theorem 2.2 holds in this case as well; hence, without loss of generality, we assume that all algorithms simply output the value they observe.
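As a concrete warm-up for the optimization over sampling distributions, the following brute-force sketch (our own code; the three-point instance, the quarter-step value grid, and all names are illustrative assumptions, not from the paper) searches for a good distribution on a tiny line metric, taking the worst case over a discretized family of 1-Lipschitz functions:

```python
from itertools import product

points = [0.0, 0.5, 1.0]
d = lambda x, y: abs(x - y)               # line metric
grid = [k * 0.25 for k in range(-4, 5)]   # candidate function values in [-1, 1]

# All grid-valued functions satisfying the 1-Lipschitz condition.
funcs = [vals for vals in product(grid, repeat=3)
         if all(abs(vals[i] - vals[j]) <= d(points[i], points[j])
                for i in range(3) for j in range(3))]

def worst_error(p):
    """max over the candidate adversaries f of sum_x p_x * |f(x) - fbar|."""
    best = 0.0
    for vals in funcs:
        fbar = sum(vals) / 3
        best = max(best, sum(px * abs(v - fbar) for px, v in zip(p, vals)))
    return best

# Search distributions p on a coarse simplex grid.
step = 0.1
dists = [(a * step, b * step, 1 - (a + b) * step)
         for a in range(11) for b in range(11 - a)]
best_p = min(dists, key=worst_error)
print(best_p, worst_error(best_p))
```

Against this adversary family, the deterministic median sampler $p = (0, 1, 0)$ suffers worst-case error $1/3$, and the grid search can only do at least as well; the linear program formulated next solves exactly this min-max problem, with the full class $\mathcal{L}$ in place of the finite adversary family.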
The problem of finding the best probability distribution for a single sample can be expressed as a linear program, with variables $p_x$ for the sampling probabilities at each of the $n$ points $x$, and a variable $Z$ for the estimation error.

Minimize $Z$ subject to
(i) $\sum_x p_x = 1$
(ii) $\sum_x p_x \cdot |\bar{f} - f(x)| \leq Z$ for all $f \in \mathcal{L}$
(iii) $0 \leq p_x \leq 1$ for all points $x$
(3)

Since this LP (which we refer to as the "exact LP") has infinitely many constraints, our approach is to replace the set $\mathcal{L}$ in the second constraint with a set $Q_\delta$. We will choose $Q_\delta$ carefully to ensure that it "approximates" $\mathcal{L}$ well, and such that the resulting LP below (which we refer to as the "discretized LP") can be solved efficiently.

Minimize $\hat{Z}$ subject to
(i) $\sum_x p_x = 1$
(ii) $\sum_x p_x \cdot |\bar{f} - f(x)| \leq \hat{Z}$ for all $f \in Q_\delta$
(iii) $0 \leq p_x \leq 1$ for all points $x$
(4)

To define the notion of approximation formally, let $o$ be a 1-median of the metric space, i.e., a point minimizing $\sum_x d(o, x)$. Let $m = \frac{1}{n} \sum_x d(o, x)$ be the average distance of all points from $o$. Because we assumed w.l.o.g. that $\max_{x,y \in \mathcal{M}} d(x, y) = 1$, at least one point has distance at least $\frac{1}{2}$ from $o$, and therefore, $m \geq \frac{1}{2n}$. The median value $m$ forms a lower bound for randomized algorithms in the following sense.

Lemma 3.1 The worst-case expected error for any randomized algorithm is at least $\frac{1}{4 \cdot 6^\beta} \cdot m$, where $\beta$ is the doubling dimension of the metric space.

Proof. Consider any randomized algorithm with probability distribution $p$; w.l.o.g., the algorithm outputs the value it observes. Let $R = \{x \mid \frac{m}{2} \leq d(x, o) \leq \frac{3m}{2}\}$ be the ring of points at distance between $\frac{m}{2}$ and $\frac{3m}{2}$ from $o$. We distinguish two cases:

1. If $\sum_{x \in R} p_x \leq \frac{1}{2}$, then consider the Lipschitz-continuous function defined by $f(x) = d(x, o)$. This function has average $\bar{f} = m$.
With probability at least $\frac{1}{2}$, the algorithm samples a point outside $R$, and thus outputs a value outside the interval $[\frac{m}{2}, \frac{3m}{2}]$, which incurs error at least $\frac{m}{2}$. Thus, the expected error is at least $\frac{m}{4}$.

2. If $\sum_{x \in R} p_x > \frac{1}{2}$, then consider a collection of balls $B_1, \ldots, B_k$ of diameter $\frac{m}{2}$ covering all points in $R$. Because $R$ is contained in a ball of diameter $3m$, the doubling constraint implies that $k \leq 6^\beta$ balls are sufficient. At least one of these balls, say $B_1$, has $\sum_{x \in B_1} p_x \geq \frac{1}{2k}$. Fix an arbitrary point $y \in B_1$, and define the Lipschitz-continuous function $f$ as $f(x) = d(x, y)$. Because $o$ was a 1-median, we get that $\bar{f} \geq m$. With probability at least $\frac{1}{2k}$, the algorithm will choose a point inside $B_1$ and output a value of at most $\frac{m}{2}$, thus incurring an error of at least $\frac{m}{2}$. Hence, the expected error is at least $\frac{1}{2k} \cdot \frac{m}{2} \geq \frac{1}{4 \cdot 6^\beta} \cdot m$.

We now formalize our notion for a set of functions $Q_\delta$ to be a good approximation.

Definition 3.2 ($\delta$-approximating function classes) For any sampling distribution $p$, define $E_1(p) = \max_{f \in \mathcal{L}} \Delta(f, p)$ and $E_2(p) = \max_{f \in Q_\delta} \Delta(f, p)$ to be the maximum error of sampling according to $p$ against a worst-case function from $\mathcal{L}$ and $Q_\delta$, respectively. The class $Q_\delta$ is said to $\delta$-approximate $\mathcal{L}$ if the following two conditions hold:

1. For each $f \in \mathcal{L}$, there is a function $f' \in Q_\delta$ such that $|\Delta(f', p) - \Delta(f, p)| \leq \frac{\delta}{2} \cdot E_1(p)$, for all distributions $p$.

2. For each $f \in Q_\delta$, there is a function $f' \in \mathcal{L}$ such that $|\Delta(f', p) - \Delta(f, p)| \leq \frac{\delta}{2} \cdot E_1(p)$, for all distributions $p$.

Theorem 3.3 Assume that for every $\delta$, $Q_\delta$ is a class of functions $\delta$-approximating $\mathcal{L}$, such that the following problem can be solved in polynomial time (for fixed $\delta$): Given $p$, find a function $f \in Q_\delta$ maximizing $\Delta(f, p)$.
Then, solving the discretized LP (4) instead of the exact LP (3) gives a PTAS for the problem of finding a sampling distribution that minimizes the worst-case expected error.

Proof. First, an algorithm to find a function $f \in Q_\delta$ maximizing $\sum_x p_x \cdot |\bar{f} - f(x)|$ gives a separation oracle for the discretized LP. Thus, using the Ellipsoid Method (e.g., [5]), an optimal solution to the discretized LP can be found in polynomial time, for any fixed $\delta$.

Let $p, q$ be optimal solutions to the exact and discretized LPs, respectively. Let $f_1 \in \mathcal{L}$ maximize $\sum_x q_x \cdot |\bar{f} - f(x)|$ over $f \in \mathcal{L}$, and $f_2 \in Q_\delta$ maximize $\sum_x p_x \cdot |\bar{f} - f(x)|$ over $f \in Q_\delta$. Thus, $\Delta(f_1, q) = E_1(q)$ and $\Delta(f_2, p) = E_2(p)$. Now, applying the first property from Definition 3.2 to $f_1 \in \mathcal{L}$ gives us a function $f'_1 \in Q_\delta$ such that $|\Delta(f'_1, q) - E_1(q)| \leq \frac{\delta}{2} E_1(q)$. Since $E_2(q) \geq \Delta(f'_1, q)$, we obtain that $E_2(q) \geq E_1(q)(1 - \frac{\delta}{2})$. Similarly, applying the second property from Definition 3.2 to $f_2 \in Q_\delta$ gives us a function $f'_2 \in \mathcal{L}$ with $|\Delta(f'_2, p) - E_2(p)| \leq \frac{\delta}{2} E_1(p)$. Since $E_1(p) \geq \Delta(f'_2, p)$, we have that $E_1(p) \geq E_2(p) - \frac{\delta}{2} E_1(p)$, or $E_1(p) \geq \frac{E_2(p)}{1 + \delta/2}$. Also, by optimality of $q$ in $Q_\delta$, $E_2(q) \leq E_2(p)$. Thus, we obtain that
$$E_1(q) \leq \frac{E_2(q)}{1 - \delta/2} \leq \frac{E_2(p)}{1 - \delta/2} \leq \frac{E_1(p)(1 + \delta/2)}{1 - \delta/2} \leq E_1(p)(1 + 2\delta).$$

In light of Theorem 3.3, it suffices to exhibit classes of functions $\delta$-approximating $\mathcal{L}$ for which the corresponding optimization problem can be solved efficiently. We do so for metric spaces of bounded doubling dimension and metric spaces that are contained on the line.

3.1 A PTAS for Arbitrary Metric Spaces

We first observe that since the error for any translation of a function $f$ is the same as for $f$, we can assume w.l.o.g. that $f(o) = 0$ for all functions $f$ considered in this section.
Thus, in this section, we implicitly restrict $\mathcal{L}$ to functions with $f(o) = 0$. We next describe a set $Q_\delta$ of functions which $\delta$-approximate $\mathcal{L}$. Roughly, we will discretize function values to different multiples of $\gamma$, and consider distance scales that are different multiples of $\gamma$. We later set $\gamma = \frac{\delta}{48 \cdot 6^{\beta+6}}$. We then show in Lemma 3.4 that $Q_\delta$ has size $n^{\log(2(1+\gamma)/\gamma) \cdot (2/\gamma)^\beta} = n^{O(1)}$ for constant $\delta$; the discretized LP can therefore be solved in time $O(\mathrm{poly}(n) \cdot n^{\log(2(1+\gamma)/\gamma) \cdot (2/\gamma)^\beta})$ (using exhaustive search for the separation oracle), and we obtain a PTAS for finding the optimum distribution for arbitrary metric spaces.

We let $k = \log_2 \frac{1}{2m}$, and define a sequence of $k$ rings of exponentially decreasing diameter around $o$, dividing the space into $k + 1$ regions $R_1, \ldots, R_{k+1}$. Specifically, $R_{k+1} = \{x \mid d(x, o) \leq 2m\}$, and $R_i = \{x \mid 2^{-i} < d(x, o) \leq 2^{-(i-1)}\}$ for $i = 1, \ldots, k$. Notice that because $m \geq \frac{1}{2n}$, we have that $k \leq \log n$ suffices to obtain a disjoint cover. Since the metric space has doubling dimension $\beta$, each region $R_i$ can be covered with at most $(2/\gamma)^\beta$ balls of diameter $2\gamma \cdot 2^{-i}$. Let $B_{i,j}$ denote the $j$-th ball from the cover of $R_i$; without loss of generality, each $B_{i,j}$ is non-empty and contained in $R_i$ (otherwise, consider its intersection with $R_i$ instead). We call $B_{i,j}$ the $j$-th grid ball for region $i$. Thus, the grid balls cover all points, and there are at most $(2/\gamma)^\beta \cdot \log n$ grid balls. See Figure 1 for an illustration of this cover. For each grid ball $B_{i,j}$, let $o_{i,j} \in B_{i,j}$ be an arbitrary, but fixed, representative of $B_{i,j}$. The exception is that for the grid ball containing $o$, $o$ must be chosen as the representative. We now define the class $Q_\delta$ of functions $f$ as follows:

1. For each $i, j$, $f(o_{i,j})$ is a multiple of $\gamma \cdot 2^{-i}$.

2. For all $(i, j), (i', j')$, the function values satisfy the following relaxed Lipschitz condition: $|f(o_{i,j}) - f(o_{i',j'})| \leq d(o_{i,j}, o_{i',j'}) + \gamma \cdot (2^{-i} + 2^{-i'})$.

3. All points in $B_{i,j}$ have the same function value, i.e., $f(x) = f(o_{i,j})$ for all $x \in B_{i,j}$.

Figure 1: Covering with grid balls

We first show that the size of $Q_\delta$ is polynomial in $n$.

Lemma 3.4 The size of $Q_\delta$ is at most $n^{\log(2(1+\gamma)/\gamma) \cdot (2/\gamma)^\beta}$.

Proof. Because of the first and second constraints in the definition of $Q_\delta$, each point $o_{i,j}$ can take on at most $\frac{d(o_{i,j}, o) + 2\gamma \cdot 2^{-i}}{\gamma \cdot 2^{-i}} \leq \frac{2(1+\gamma) \cdot 2^{-i}}{\gamma \cdot 2^{-i}} = \frac{2(1+\gamma)}{\gamma}$ distinct values. Setting the values for all $o_{i,j}$ uniquely determines the function $f$; the relaxed Lipschitz condition will result in some of these functions not being in $Q_\delta$, and thus only decreases the number of possible functions. Because there are at most $(2/\gamma)^\beta \cdot \log n$ grid balls, there are at most $(2(1+\gamma)/\gamma)^{(2/\gamma)^\beta \cdot \log n} = n^{\log(2(1+\gamma)/\gamma) \cdot (2/\gamma)^\beta}$ functions in $Q_\delta$.

We need to prove that $Q_\delta$ approximates $\mathcal{L}$ well, by verifying that for each function $f \in \mathcal{L}$, there is a "close" function in $Q_\delta$, and vice versa. We first show that for any function satisfying the relaxed Lipschitz condition, we can change the function values slightly and obtain a Lipschitz-continuous function.

Lemma 3.5 For each $x \in \mathcal{M}$, let $s_x$ be some non-negative number. Assume that $f$ satisfies the "relaxed Lipschitz condition" $|f(x) - f(y)| \leq d(x, y) + s_x + s_y$ for all $x, y$. Then, there is a Lipschitz-continuous function $f' \in \mathcal{L}$ such that $|f(x) - f'(x)| \leq s_x$ for all $x$.

Proof. We describe an algorithm which runs in iterations $\ell$, and sets the value of one point $x$ per iteration. $S_\ell$ denotes the set of $x$ such that $f'(x)$ has been set. We maintain the following two invariants after the $\ell$-th iteration:

1. $f'$ satisfies the Lipschitz condition for all pairs of points in $S_\ell$, and $|f'(x) - f(x)| \leq s_x$ for all $x \in S_\ell$.

2. For every function $f''$ satisfying the previous condition, $f'(x) \leq f''(x)$ for all $x \in S_\ell$.

Initially, this clearly holds for $S_0 = \emptyset$. And clearly, if it holds after iteration $n$, the function $f'$ satisfies the claim of the lemma. We now describe iteration $\ell$. For each $x \notin S_{\ell-1}$, let $t_x = \max_{y \in S_{\ell-1}} (f'(y) - d(x, y))$. We show below that for all $x$, we have $t_x \leq f(x) + s_x$. Let $x \notin S_{\ell-1}$ be a point maximizing $\max(f(x) - s_x, t_x)$, and set $f'(x) = \max(f(x) - s_x, t_x)$. It is easy to verify that this definition satisfies both parts of the invariant.

It remains to show that $t_x \leq f(x) + s_x$ for all points $x \notin S_{\ell-1}$. Assume that $t_x > f(x) + s_x$ for some point $x$. Let $x_1$ be the point in $S_{\ell-1}$ for which $t_x = f'(x_1) - d(x, x_1)$. By definition, either $f'(x_1) = f(x_1) - s_{x_1}$, or there is an $x_2$ such that $f'(x_1) = t_{x_1} = f'(x_2) - d(x_1, x_2)$. In this way, we obtain a chain $x_1, \ldots, x_r$ such that $f'(x_i) = f'(x_{i+1}) - d(x_i, x_{i+1})$ for all $i < r$, and $f'(x_r) = f(x_r) - s_{x_r}$. Rearranging as $f'(x_{i+1}) - f'(x_i) = d(x_i, x_{i+1})$, and adding all these equalities for $i = 1, \ldots, r$ gives us that $f(x_r) - f'(x_1) = s_{x_r} + \sum_{i=1}^{r-1} d(x_i, x_{i+1})$. By assumption, we have $f'(x_1) - d(x, x_1) = t_x > f(x) + s_x$. Substituting the previous equality, rearranging, and applying the triangle inequality gives us that
$$f(x_r) - f(x) > s_x + s_{x_r} + d(x, x_1) + \sum_{i=1}^{r-1} d(x_i, x_{i+1}) \geq s_x + s_{x_r} + d(x, x_r),$$
which contradicts the relaxed Lipschitz condition for the pair $x, x_r$.

We now use Lemma 3.5 to obtain, for any given $f \in Q_\delta$, a function $f' \in \mathcal{L}$ close to $f$.

Lemma 3.6 Let $f \in Q_\delta$.
There exists an $f' \in \mathcal{L}$ such that $|\Delta(f, p) - \Delta(f', p)| \leq \frac{\delta}{2} \cdot E_1(p)$, for all distributions $p$.

Proof. Because $f \in Q_\delta$, it must satisfy, for all $(i, j)$ and $(i', j')$, the relaxed Lipschitz condition $|f(o_{i,j}) - f(o_{i',j'})| \leq d(o_{i,j}, o_{i',j'}) + \gamma \cdot (2^{-i} + 2^{-i'})$. Now, we apply Lemma 3.5 with $s_{o_{i,j}} = \gamma \cdot 2^{-i}$ to get function values $f'(o_{i,j})$ for all $i, j$ that satisfy the Lipschitz condition and the condition that $|f'(o_{i,j}) - f(o_{i,j})| \leq \gamma \cdot 2^{-i}$. For any other point $x$, let $L_{\max}(x, f) = \min_{i,j} (f'(o_{i,j}) + d(x, o_{i,j}))$ and $L_{\min}(x, f) = \max_{i,j} (f'(o_{i,j}) - d(x, o_{i,j}))$, and set $f'(x) = \frac{1}{2} \cdot (L_{\max}(x, f) + L_{\min}(x, f))$. It is easy to see that $L_{\min}(x, f) \leq L_{\max}(x, f)$ for all $x$, and that this definition gives a Lipschitz-continuous function $f'$.

For a point $x \in B_{i,j}$, the triangle inequality, the above construction, and the fact that $B_{i,j}$ has diameter at most $2\gamma \cdot 2^{-i}$ imply that
$$|f'(x) - f(x)| \leq |f'(x) - f'(o_{i,j})| + |f'(o_{i,j}) - f(o_{i,j})| + |f(o_{i,j}) - f(x)| \leq 2\gamma \cdot 2^{-i} + \gamma \cdot 2^{-i} + 0 = 3\gamma \cdot 2^{-i}.$$
For each point $x$, let $\iota(x)$ be the index of the region $i$ such that $x \in R_i$. Now, using the triangle inequality and Lemma 3.1, we can bound
$$|\bar{f'} - \bar{f}| \leq \frac{1}{n} \cdot \sum_x |f'(x) - f(x)| \leq \frac{1}{n} \cdot \sum_x 3\gamma \cdot 2^{-\iota(x)} \leq \frac{1}{n} \cdot \Big(\sum_{x \notin R_{k+1}} 3\gamma \cdot d(x, o) + \sum_{x \in R_{k+1}} 3\gamma \cdot m\Big) \leq \frac{1}{n} \cdot (3\gamma n m + 3\gamma n m) \leq 24 \cdot 6^\beta \cdot \gamma \cdot E_1(p).$$
Similarly, we can bound
$$\sum_x p_x \cdot |f'(x) - f(x)| \leq 3\gamma \cdot \Big(\sum_{x \notin R_{k+1}} p_x \cdot d(x, o) + \sum_{x \in R_{k+1}} p_x \cdot m\Big) \leq 3\gamma \cdot \Big(m + \sum_{x \notin R_{k+1}} p_x \cdot d(x, o)\Big).$$
Let $f''$ be defined as $f''(x) = d(x, o)$.
Clearly, $f'' \in \mathcal{L}$, $\overline{f''} = m$, and the estimation error for $p$ when the input is $f''$ is
$\Delta(f'', p) = \sum_x p_x \cdot |f''(x) - m| \ge \sum_{x \notin R_{k+1}} p_x \cdot |d(x, o) - m| \ge \Big( \sum_{x \notin R_{k+1}} p_x \cdot d(x, o) \Big) - m.$
Combining these observations, and using Lemma 3.1 and the fact that $\Delta(f'', p) \le E_1(p)$, we get $\sum_x p_x \cdot |f'(x) - f(x)| \le 6\gamma \cdot m + 3\gamma \cdot \Delta(f'', p) \le (8 \cdot 6^\beta + 1) \cdot 3\gamma \cdot E_1(p)$. Now, by using the fact that $|\Delta(f, p) - \Delta(f', p)| \le |\overline{f'} - \overline{f}| + \sum_x p_x \cdot |f'(x) - f(x)|$, and setting $\gamma = \frac{\delta}{48 \cdot 6^\beta + 6}$, we obtain the desired bound.

Finally, we need to analyze the converse direction.

Lemma 3.7 Let $f \in \mathcal{L}$. There exists an $f' \in Q_\delta$ such that $|\Delta(f, p) - \Delta(f', p)| \le \frac{\delta}{2} \cdot E_1(p)$, for all distributions $p$.

Proof. The proof is similar to that of Lemma 3.6. First, for each grid ball representative $o_{i,j}$, we let $f'(o_{i,j})$ be $f(o_{i,j})$, rounded down (up for negative numbers) to the nearest multiple of $\gamma \cdot 2^{-i}$. Then, for all points $x \in B_{i,j}$, we set $f'(x) = f'(o_{i,j})$. Clearly, the resulting function $f'$ is in $Q_\delta$. By a similar argument as before, $|f'(x) - f(x)| \le 3\gamma \cdot 2^{-\iota(x)}$ for all points $x$. Thus, we immediately get $|\overline{f'} - \overline{f}| \le 24 \cdot 6^\beta \cdot \gamma \cdot E_1(p)$ as well. Define the function $f''$ exactly as in the proof of Lemma 3.6. Then, exactly the same bounds as in that proof apply, and give us the claim.

3.2 An FPTAS for points on a line

In this section, we show that if the metric consists of a discrete point set on the line, then the general PTAS of the previous section can be improved to an FPTAS. Since we assumed the maximum distance to be 1, we can assume w.l.o.g. that the points are $0 = x_1 \le x_2 \le \cdots \le x_n = 1$. Also, because w.l.o.g. the post-processing is the identity function, we only need to consider functions $f \in \mathcal{L}_0$, i.e., such that $\sum_i f(x_i) = 0$.
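The repair step of Lemma 3.5, which the proof of Lemma 3.8 below reuses, can also be written in closed form: one can check that $f'(x) = \max_y (f(y) - s_y - d(x, y))$ is the pointwise-minimal function meeting both guarantees of the lemma. The sketch below (illustrative Python with ad-hoc names, not from the paper) verifies the two guarantees on random points of a line metric:

```python
import random

def lipschitz_repair(points, d, f, s):
    """Given values f[x] satisfying the relaxed Lipschitz condition
    |f[x] - f[y]| <= d(x, y) + s[x] + s[y], return values g[x] that
    satisfy the exact Lipschitz condition |g[x] - g[y]| <= d(x, y)
    and differ from f[x] by at most s[x] at each point x.
    (Closed form of the pointwise-minimal function built in Lemma 3.5.)"""
    return {x: max(f[y] - s[y] - d(x, y) for y in points) for x in points}

# Toy check on random points on a line (d = absolute difference).
random.seed(0)
pts = [random.random() for _ in range(20)]
d = lambda x, y: abs(x - y)
s = {x: 0.05 for x in pts}
# perturbing a 1-Lipschitz function by at most s[x] keeps the relaxed condition
f = {x: x + random.uniform(-0.05, 0.05) for x in pts}

g = lipschitz_repair(pts, d, f, s)
assert all(abs(g[x] - g[y]) <= d(x, y) + 1e-9 for x in pts for y in pts)
assert all(abs(g[x] - f[x]) <= s[x] + 1e-9 for x in pts)
```

The paper builds $f'$ iteratively to prove the invariant; the closed form is simply the envelope that the iteration converges to.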
We define $\gamma = \frac{\delta}{144n}$, and the class $Q_\delta$ to contain the following functions $f$:

1. For each $i$, $f(x_i)$ is a multiple of $\gamma$.
2. The function values satisfy the relaxed Lipschitz condition $|f(x_i) - f(x_j)| \le d(x_i, x_j) + \gamma$ for all $i, j$.
3. The sum is "close to 0", in the sense that $|\sum_i f(x_i)| \le n\gamma$.

We first establish that, given a probability distribution $p$, a function $f \in Q_\delta$ maximizing $\sum_i p_{x_i} \cdot |f(x_i)|$ can be found in polynomial time using dynamic programming. To set up the recurrence, let $a[j, t, s]$ be the maximum expected error that can be achieved with function values at $x_1, \ldots, x_j$, under the constraints that $f(x_j) = t$ and $\sum_{i \le j} f(x_i) = s$. Then, we obtain the recurrence

$a[1, t, s] = \begin{cases} p_{x_1} \cdot |t| & \text{if } s = t \\ -\infty & \text{otherwise} \end{cases}$

$a[j+1, t, s] = \max_{y \in [t - (x_{j+1} - x_j),\, t + (x_{j+1} - x_j)],\; \gamma \mid y} \big( p_{x_{j+1}} \cdot |t| + a[j, y, s - t] \big).$

The maximizing value is then $\max_{s \in [-n\gamma, n\gamma],\, \gamma \mid s,\; t \in [-1, 1],\, \gamma \mid t} a[n, t, s]$. The total number of entries is $O(n \cdot \frac{1}{\gamma^2})$, and each entry requires time $O(\frac{1}{\gamma})$ to compute. The overall running time is thus $O(n \cdot \frac{1}{\gamma^3}) = O(\frac{n^4}{\delta^3})$, giving us an FPTAS.

All we need now is to show that $Q_\delta$ $\delta$-approximates $\mathcal{L}$. We use the following lemma:

Lemma 3.8 For each $f \in \mathcal{L}$, there is a function $f' \in Q_\delta$ such that $|\Delta(f, p) - \Delta(f', p)| \le 2\gamma$ for all distributions $p$. Also, for each $f \in Q_\delta$, there is a function $f' \in \mathcal{L}$ such that $|\Delta(f, p) - \Delta(f', p)| \le 6\gamma$ for all distributions $p$.

Proof. For the first part, define $f'$ by rounding each $f(x_i)$ down (toward 0 for negative values) to the nearest multiple of $\gamma$. Clearly, $f' \in Q_\delta$. Furthermore, the average changes by at most $\gamma$, and $\sum_x p_x \cdot |f'(x) - f(x)| \le \sum_x p_x \cdot \gamma = \gamma$.
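Stepping back to the dynamic program above, the recurrence can be instantiated directly. The toy sketch below (ad-hoc names, not from the paper) tabulates the states $(t, s)$ over multiples of $\gamma$ and cross-checks the answer against brute-force enumeration over the same constraint set (adjacent-point Lipschitz steps and final sum bounded by $n\gamma$ in absolute value):

```python
import itertools

def dp_max_error(xs, p, gamma):
    """Tabulate a[j, t, s]: maximize sum_i p_i * |f(x_i)| over functions whose
    values are multiples of gamma, change by at most the gap between adjacent
    points, and whose total sum ends up in [-n*gamma, n*gamma]."""
    n = len(xs)
    vals = [k * gamma for k in range(-int(round(1 / gamma)), int(round(1 / gamma)) + 1)]
    # state: (t, s) -> best error over x_1..x_j with f(x_j) = t, partial sum s
    a = {(t, round(t, 9)): p[0] * abs(t) for t in vals}
    for j in range(1, n):
        gap = xs[j] - xs[j - 1]
        b = {}
        for t in vals:
            for (y, s_prev), v in a.items():
                if abs(t - y) <= gap + 1e-9:  # adjacent Lipschitz step
                    key, cand = (t, round(s_prev + t, 9)), p[j] * abs(t) + v
                    if cand > b.get(key, float("-inf")):
                        b[key] = cand
        a = b
    return max(v for (t, s), v in a.items() if abs(s) <= n * gamma + 1e-9)

# Cross-check against brute force on a tiny instance.
xs, p, gamma = [0.0, 0.5, 1.0], [1 / 3] * 3, 0.25
vals = [k * gamma for k in range(-4, 5)]
brute = max(sum(pi * abs(fi) for pi, fi in zip(p, f))
            for f in itertools.product(vals, repeat=3)
            if all(abs(f[i + 1] - f[i]) <= xs[i + 1] - xs[i] + 1e-9 for i in range(2))
            and abs(sum(f)) <= 3 * gamma + 1e-9)
assert abs(dp_max_error(xs, p, gamma) - brute) < 1e-9
```

As in the paper's analysis, the table has one layer per point and one state per $(t, s)$ pair on the $\gamma$-grid, so the work per layer is polynomial in $n$ and $1/\gamma$.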
For the second part, first create a Lipschitz continuous function $f''$ from $f$ according to Lemma 3.5; then define $f'(x) = f''(x) - \overline{f''}$ for all $x$. The first step changes each function value by at most $\gamma$, and because $|\overline{f''}| \le \gamma + |\overline{f}| \le 2\gamma$, we have that $|f'(x) - f(x)| \le 3\gamma$ for all $x$. Thus, $|\overline{f'} - \overline{f}| + \sum_x p_x \cdot |f'(x) - f(x)| \le 6\gamma$.

By Lemma 3.1, applied with $\beta = 1$, any randomized algorithm must have expected error at least $\frac{1}{12n}$. In particular, substituting the definition of $\gamma = \frac{\delta}{144n}$ gives us that $6\gamma \le \frac{\delta}{2} \cdot E_1(p)$ for all distributions $p$. Thus, $Q_\delta$ approximates $\mathcal{L}$ well.

4 Sampling in the Interval [0, 1]

In this section, we focus on what is probably the most basic version of the problem: the metric space is the interval $[0, 1]$. In this continuous case, we can explicitly characterize the optimum sampling distribution and estimation error. It is easy to see (and follows from a more general result in [4]) that the best deterministic algorithm samples the function at $\frac12$ and outputs the value read. The worst-case error of this algorithm is $\frac14$. We prove that randomization can lead to the following improvement.

Theorem 4.1 An optimal distribution that minimizes the worst-case expected estimation error is to sample uniformly from the interval $[2 - \sqrt{3}, \sqrt{3} - 1]$. This sampling gives a worst-case error of $1 - \frac{\sqrt{3}}{2} \approx 0.134$.

Following the discussion in Section 2, we restrict our analysis w.l.o.g. to functions $f \in \mathcal{L}_0$, i.e., we assume that $\int_0^1 f(x)\,dx = 0$. Then, the expected error of a distribution $p$ against input $f$ is $\Delta(f, p) = \int_0^1 p_x |f(x)|\,dx$. The key part of the proof of Theorem 4.1 is to show that when the algorithm samples uniformly over an interval $[c, 1-c]$, then with a loss of only an arbitrarily small $\epsilon$, we can focus on functions consisting of just two line segments.
Theorem 4.2 For any $b$, define $f_b(x) = \frac12 + b^2 - b - |b - x|$. If $p$ is uniform over $[c, 1-c]$ where $c = 2 - \sqrt{3}$, then for every $\epsilon > 0$, there exists some $b = b(\epsilon)$ such that for all functions $f \in \mathcal{L}_0$, we have $\Delta(f_b, p) \ge \Delta(f, p) - \epsilon$.

All of Section 4.1 is devoted to the proof of Theorem 4.2. Here, we show how to use Theorem 4.2 to prove the upper bound from Theorem 4.1. Let $c = 2 - \sqrt{3}$, so that the algorithm samples uniformly from $[c, 1-c]$. Let $\epsilon > 0$ be arbitrary; we later let $\epsilon \to 0$. Let $b = b(\epsilon)$ be the value whose existence is guaranteed by Theorem 4.2. We distinguish two cases:

1. If $b \le c$, then
$\Delta(f_b, p) = \frac{1}{1-2c} \cdot \int_c^{1-c} \Big| \frac12 + b^2 - b - |b - x| \Big|\,dx = \frac{1}{1-2c} \cdot \Big( \frac12 \big( b^2 + \frac12 - c \big)^2 + \frac12 \big( \frac12 - c - b^2 \big)^2 \Big) = \frac{1}{1-2c} \cdot \Big( b^4 + \big( \frac12 - c \big)^2 \Big).$

2. If $b \ge c$, then
$\Delta(f_b, p) = \frac{1}{1-2c} \cdot \int_c^{1-c} \Big| \frac12 + b^2 - b - |b - x| \Big|\,dx = \frac{1}{1-2c} \cdot \big( 2b f_b(b) + 2cb + f_b(b)^2 - c - b^2 \big) = \frac{1}{1-2c} \cdot \Big( b^4 - b^2 + 2cb + \frac14 - c \Big) = \frac{1}{1-2c} \cdot \Big( b^4 - (b - c)^2 + \big( \frac12 - c \big)^2 \Big).$

The first formula is increasing in $b$, and thus maximized at $b = c$; at $b = c$, its value equals that of the second formula, so the maximization must happen for $b \ge c$. A derivative test shows that the second formula is maximized for $b = \frac{\sqrt{3} - 1}{2}$, giving an error of $1 - \frac{\sqrt{3}}{2}$. By Theorem 4.2, for any function $f$, the error is at most $1 - \frac{\sqrt{3}}{2} + \epsilon$, and letting $\epsilon \to 0$ now proves an upper bound of $1 - \frac{\sqrt{3}}{2}$ on the error of the given distribution.

Next, we prove optimality of the uniform distribution over $[2 - \sqrt{3}, \sqrt{3} - 1]$, by providing a lower bound on all randomized sampling distributions. Again, by Theorem 2.2, we focus only on algorithms which output the value $f(x)$ after sampling at $x$, incurring an error $\epsilon > 0$ that can be made arbitrarily small.
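The two-case calculation above is easy to sanity-check numerically. The sketch below (ad-hoc names, purely illustrative) integrates $\Delta(f_b, p)$ by the midpoint rule, confirms that the closed form for the case $b \ge c$ equals $1 - \frac{\sqrt{3}}{2}$ exactly at $b = \frac{\sqrt{3}-1}{2}$, and checks that no other choice of $b$ does better:

```python
from math import sqrt

c = 2 - sqrt(3)

def delta(b, steps=20000):
    """Midpoint-rule estimate of the expected error of the uniform
    distribution on [c, 1-c] against f_b(x) = 1/2 + b^2 - b - |b - x|."""
    lo, hi = c, 1 - c
    h = (hi - lo) / steps
    total = sum(abs(0.5 + b * b - b - abs(b - (lo + (i + 0.5) * h)))
                for i in range(steps)) * h
    return total / (1 - 2 * c)

def closed_form(b):
    # Case b >= c from the text: (b^4 - (b - c)^2 + (1/2 - c)^2) / (1 - 2c)
    return (b**4 - (b - c)**2 + (0.5 - c)**2) / (1 - 2 * c)

b_star = (sqrt(3) - 1) / 2
err = 1 - sqrt(3) / 2
assert abs(closed_form(b_star) - err) < 1e-9   # closed form hits the bound exactly
assert abs(delta(b_star) - err) < 1e-5         # numeric integration agrees
assert all(delta(k / 50) <= delta(b_star) + 1e-9 for k in range(51))  # b_star is best
```

The last assertion scans $b$ over a grid of $[0, 1]$; consistent with the theorem, the worst case over the family $f_b$ is attained at $b = \frac{\sqrt{3}-1}{2}$ (and, by symmetry, its mirror image).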
Our proof is based on Yao's Minimax Principle: we explicitly prescribe a distribution $q$ over $\mathcal{L}_0$ such that for any deterministic algorithm using the identity function, the expected estimation error is at least $1 - \frac{\sqrt{3}}{2}$. Since a deterministic algorithm is characterized completely by its sampling location $x$, this is equivalent to showing that $\mathbb{E}_{f \sim q}[|f(x)|] \ge 1 - \frac{\sqrt{3}}{2}$ for all $x$. We let $b = \frac{\sqrt{3} - 1}{2}$, and define two functions $f, f'$ as $f(x) = \frac12 + b^2 - b - |x - b|$ and $f'(x) = f(1 - x)$. The distribution $q$ is then simply to choose each of $f$ and $f'$ with probability $\frac12$.

Fix a sampling location $x$; by symmetry, we can restrict ourselves to $x \le \frac12$. Because $\overline{f} = \overline{f'} = 0$, the expected estimation error is

$\frac12 \big( |f(x)| + |f'(x)| \big) = \frac12 \Big( \big| \tfrac12 + b^2 - b - |x - b| \big| + \big| \tfrac12 + b^2 - b - |1 - x - b| \big| \Big) = \begin{cases} \frac12 - b & \text{if } x \le b \\ \frac12 - x & \text{if } b \le x \le \frac12 - b^2 \\ b^2 & \text{if } \frac12 - b^2 \le x \le \frac12. \end{cases}$

This function is clearly non-increasing in $x$, and thus minimized at $x = \frac12$, where its value is $b^2 = 1 - \frac{\sqrt{3}}{2}$. Thus, even at the best sampling location $x = \frac12$, the error cannot be less than $1 - \frac{\sqrt{3}}{2}$. This completes the proof of Theorem 4.1.

Notice that the proof of Theorem 4.1 has an interesting alternative interpretation. For a (finite) multiset $S \subset \mathcal{L}_0$ of $n$ Lipschitz continuous functions $f$ with $\int_0^1 f(x)\,dx = 0$, we say that $S$ is $\delta$-close if there exist $x, y$ such that $\frac1n \cdot \sum_{f \in S} |f(x) - y| \le \delta$. In other words, the average distance of the function values at $x$ from a carefully chosen reference point is at most $\delta$. Then, the proof of Theorem 4.1 implies:

Theorem 4.3 Every set $S \subseteq \mathcal{L}_0$ is $(1 - \frac{\sqrt{3}}{2})$-close, and this is tight.

4.1 Proof of Theorem 4.2

We begin with the following lemma, which guarantees that we can focus on functions $f$ with finitely many zeroes.
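The case analysis in the lower-bound argument above also lends itself to a quick numeric check. The sketch below (ad-hoc names, illustrative only) evaluates the expected error of the two-function distribution $q$ on a grid of sampling locations:

```python
from math import sqrt

b = (sqrt(3) - 1) / 2   # the b used in the lower-bound construction
lb = 1 - sqrt(3) / 2    # the claimed lower bound, equal to b^2 for this b

def f(x):
    """The first hard instance; the second is its mirror image f(1 - x)."""
    return 0.5 + b * b - b - abs(x - b)

def expected_error(x):
    # q draws f or its mirror with probability 1/2 each; the algorithm
    # outputs the sampled value, so its error is |f(x)| resp. |f(1 - x)|.
    return 0.5 * (abs(f(x)) + abs(f(1 - x)))

# The expected error is at least 1 - sqrt(3)/2 at every sampling location,
# with equality at the best location x = 1/2.
assert all(expected_error(k / 1000) >= lb - 1e-12 for k in range(1001))
assert abs(expected_error(0.5) - lb) < 1e-12
```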
Lemma 4.4 For any $\epsilon > 0$ and any function $f$, there exists a function $f'$ such that there are at most $O(1/\epsilon)$ points $x$ with $f'(x) = 0$, and $\Delta(f', p) \ge \Delta(f, p) - \epsilon$, for all distributions $p$.

Proof. Let $\epsilon > 0$ be arbitrary. We prove the lemma by modifying $f$ to ensure that it meets the requirements, and showing that its estimation error decreases by at most $\epsilon$ in the process. We replace $f$ with a function $f'$ with the following properties: (1) $f'$ is Lipschitz continuous, (2) $\int_0^1 f(x)\,dx = \int_0^1 f'(x)\,dx$, (3) $|f(x) - f'(x)| \le \epsilon$ for all $x$, and (4) for each $j = 1, \ldots, 1/\epsilon$, the set $Z^{f'}_j = \{ x \in [(j-1)\epsilon, j\epsilon] \mid f'(x) = 0 \}$ contains at most three points. The error can change by at most $\epsilon$ due to the third condition, and the fourth condition ensures the bound on the number of zeroes.

To describe the construction, first focus on one interval $[(j-1)\epsilon, j\epsilon]$, and define $x^- = \inf Z^f_j$, $x^+ = \sup Z^f_j$, and $\delta = x^+ - x^-$. Now let $\alpha = \frac{\delta^2 + 4 \int_{x^-}^{x^+} f(x)\,dx}{4\delta}$, and define the function $f'$ such that

$f'(x) = \begin{cases} \alpha - |x^- + \alpha - x| & \text{if } x \in [x^-, x^- + 2\alpha] \\ \alpha - \delta/2 + |x^+ + \alpha - \delta/2 - x| & \text{if } x \in [x^- + 2\alpha, x^+] \\ f(x) & \text{if } x \in [(j-1)\epsilon, j\epsilon] \setminus [x^-, x^+]. \end{cases}$

Intuitively, this replaces the function on the interval by a zigzag shape with the same integral that has the same leftmost and rightmost zero. Do this for each $j$. By the careful choice of $\alpha$, the integral remains unchanged. Because each function value changes by at most $\delta \le \epsilon$, the third condition is satisfied; the fourth condition holds directly by construction, and Lipschitz continuity is obvious.

Next, we show a series of lemmas restricting the functions $f$ under consideration.
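The choice of $\alpha$ in the zigzag construction above can be verified mechanically: the zigzag contributes $\alpha\delta - \delta^2/4$ to the integral, which matches $\int_{x^-}^{x^+} f(x)\,dx$ precisely for the stated $\alpha$. A short sketch (hypothetical names; the target integral must be achievable by a 1-Lipschitz function vanishing at both ends, i.e., at most $(x^+ - x^-)^2/4$ in absolute value):

```python
def zigzag(xm, xp, target):
    """Lemma 4.4's replacement on [xm, xp]: a 1-Lipschitz zigzag with
    zeros at xm and xp whose integral over [xm, xp] equals `target`."""
    d = xp - xm
    a = (d * d + 4 * target) / (4 * d)   # the alpha from the construction
    def fp(x):
        if x <= xm + 2 * a:
            return a - abs(xm + a - x)
        return a - d / 2 + abs(xp + a - d / 2 - x)
    return fp

xm, xp, target = 0.2, 0.6, 0.01
fp = zigzag(xm, xp, target)
n = 100000
h = (xp - xm) / n
integral = sum(fp(xm + (i + 0.5) * h) for i in range(n)) * h
assert abs(fp(xm)) < 1e-12 and abs(fp(xp)) < 1e-12   # zeros at the endpoints
assert abs(integral - target) < 1e-6                  # integral is preserved
```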
When we say that $f$ has a certain property without loss of generality, we mean that changing $f$ to an $f'$ with that property can be accomplished while ensuring that $\Delta(f', p) \ge \Delta(f, p)$ for all uniform distributions $p$ over intervals $[c, 1-c]$. Since our goal is to characterize the functions that make the algorithm's error large, this restriction is indeed without loss of generality.

We focus on points $x \in (c, 1-c)$ with $f(x) = 0$. Let $c \le z_1 \le \cdots \le z_k \le 1-c$ be all such points. For ease of notation, we write $z_0 = c$ and $z_{k+1} = 1-c$. By continuity, $f(x)$ has the same sign for all $x \in (z_i, z_{i+1})$, for $i = 0, \ldots, k$. We show that w.l.o.g., $f$ is as large as possible over areas of the same sign.

Lemma 4.5 Assume w.l.o.g. that $f(x) \ge 0$ for all $x \in [z_i, z_j]$, with $j > i$. Then, w.l.o.g., $f$ maximizes the area over $[z_i, z_j]$ subject to the Lipschitz constraint and the function values at $z_i$ and $z_j$. More formally, w.l.o.g., $f$ satisfies:

1. If $1 \le i < j \le k$, then $f(x) = \min(x - z_i, z_j - x)$ for all $x \in [z_i, z_j]$.
2. If $i = 0$, then $f(x) = \min(f(c) + (x - c), z_1 - x)$ for all $x \in [c, z_1]$; and if $i = k$, then $f(x) = \min(f(1-c) + (1-c) - x, x - z_k)$ for all $x \in [z_k, 1-c]$.

Proof. We prove the first part here (the proof of the second part is analogous). Define a function $f'$ as $f'(x) = \min(x - z_i, z_j - x)$ for $x \in [z_i, z_j]$, and $f'(x) = f(x)$ otherwise. Let $f'' = f' - \overline{f'}$, so that $f''$ is renormalized to have integral 0. Since $f'(x) \ge f(x)$ for all $x$, and $\overline{f} = 0$, we have that $\overline{f'} \ge 0$.
Then

$\int_c^{1-c} |f''(x)| - |f(x)|\,dx = \int_{z_i}^{z_j} |f''(x)| - |f(x)|\,dx + \int_c^{z_i} |f(x) - \overline{f'}| - |f(x)|\,dx + \int_{z_j}^{1-c} |f(x) - \overline{f'}| - |f(x)|\,dx$
$\ge \int_{z_i}^{z_j} \big( |f'(x) - \overline{f'}| - |f'(x)| \big) + \big( |f'(x)| - |f(x)| \big)\,dx - \big( 1 - 2c - (z_j - z_i) \big) \cdot \overline{f'}$
$\ge \int_{z_i}^{z_j} |f'(x)| - |f(x)|\,dx - \int_{z_i}^{z_j} \overline{f'}\,dx - \big( 1 - 2c - (z_j - z_i) \big) \cdot \overline{f'}$
$= \int_{z_i}^{z_j} f'(x) - f(x)\,dx - (1 - 2c) \cdot \overline{f'} = 2c \cdot \overline{f'} \ge 0,$

where the last equality uses that $\int_{z_i}^{z_j} f'(x) - f(x)\,dx = \int_0^1 f'(x) - f(x)\,dx = \overline{f'}$, since $f' = f$ outside $[z_i, z_j]$ and $\overline{f} = 0$. Thus, the estimation error of $f''$ is at least as large as the one for $f$, so w.l.o.g., $f$ satisfies the statement of the lemma.

Lemma 4.6 W.l.o.g., $k \le 2$, i.e., there are at most two points $x \in (c, 1-c)$ such that $f(x) = 0$.

Proof. Assume that $f(z_1) = f(z_2) = f(z_3) = 0$. Consider mirroring the function on the interval $[z_1, z_3]$. Formally, we define $f'(x) = f(z_3 - (x - z_1))$ if $x \in [z_1, z_3]$, and $f'(x) = f(x)$ otherwise. Clearly, $f'$ is Lipschitz continuous and has the same average and the same expected estimation error as $f$. However, the signs of $f'$ on the intervals $[c, z_1]$ and $[z_1, z_1 + z_3 - z_2]$ are now the same; similarly for the intervals $[z_1 + z_3 - z_2, z_3]$ and $[z_3, 1-c]$. Thus, applying Lemma 4.5, we can further reduce the number of points $x$ with $f(x) = 0$, without decreasing the estimation error.

Hence, it suffices to focus on functions $f$ that have at most two points $z \in (c, 1-c)$ with $f(z) = 0$. We distinguish three cases accordingly:

1. If there is no point $z \in (c, 1-c)$ with $f(z) = 0$, then $f(c)$ and $f(1-c)$ have the same sign; without loss of generality, $f$ is negative over $(c, 1-c)$. Then, the expected error is maximized when $\int_0^c f(x)\,dx$ and $\int_{1-c}^1 f(x)\,dx$ are as positive as possible, subject to the Lipschitz condition and the constraint that $\int_0^1 f(x)\,dx = 0$.
Otherwise, we could increase the value of $\int_0^c f(x)\,dx$ and $\int_{1-c}^1 f(x)\,dx$, and then lower the function to restore the integral to 0. By doing this, the expected estimation error cannot decrease. Thus, by Lemma 4.5, $f$ is of the form $f(x) = |x - b| + f(b)$, where $b = \operatorname{argmin}_{x \in (c, 1-c)} f(x)$.

2. If there is exactly one point $z \in (c, 1-c)$ with $f(z) = 0$, then $f(c)$ and $f(1-c)$ have opposite signs. Without loss of generality, assume that $f(c) > 0 > f(1-c)$ and that $z \le \frac12$. (Otherwise, we could consider $f'(x) = f(1-x)$ instead.) The expected error is maximized when $f(c)$ is as large as possible, and $\int_z^{1-c} f(x)\,dx$ is as negative as possible, subject to the Lipschitz condition and the constraint that $\int_0^1 f(x)\,dx = 0$. Because $z \le \frac12$, and the integral of the function $f'(x) = z - x$ is thus negative, by starting from $f'$, then raising the function in the interval $[1-c, 1]$ and, if necessary, increasing $f'(1-c)$, it is always possible to ensure that $f(x) = z - x$ for all $x \in [0, z]$. Then, $\int_z^{1-c} f(x)\,dx$ is as negative as possible if, for some value $b$, $f$ is of the following form: $f(x) = -(x - z)$ for $x \le b$, and $f(x) = -(b - z) + (x - b) = z + x - 2b$ for $x \ge b$. Thus, $f$ overall is of the form $f(x) = |x - b| - (b - z)$.

3. If there are two points $z_1 < z_2 \in (c, 1-c)$ with $f(z_1) = f(z_2) = 0$, then $f(c)$ and $f(1-c)$ have the same sign; w.l.o.g., they are both positive. We distinguish two subcases:

• If $z_2 - z_1 \ge (1 - c - z_2) + (z_1 - c)$, then the expected error is maximized when $\int_0^{z_1} f(x)\,dx$ and $\int_{z_2}^1 f(x)\,dx$ are as positive as possible, subject to the Lipschitz condition and the constraint that $\int_0^1 f(x)\,dx = 0$.
Otherwise, we could increase the value of $\int_0^{z_1} f(x)\,dx$ and $\int_{z_2}^1 f(x)\,dx$, and then lower the function by some small resulting $\delta$ to restore the integral to 0. If the function is thus lowered by $\delta$, then on the interval $(z_1, z_2)$, the error increases by $\delta$, while on the intervals $[c, z_1]$ and $[z_2, 1-c]$, it decreases by at most $\delta$. By the condition $z_2 - z_1 \ge (1 - c - z_2) + (z_1 - c)$, the lowering would overall increase the error. Now, applying Lemma 4.5 gives us that w.l.o.g., $f(x) = |x - \frac{z_1 + z_2}{2}| - \frac{z_2 - z_1}{2}$.

• If $z_2 - z_1 < (1 - c - z_2) + (z_1 - c)$, then the expected error is maximized when $\int_0^c f(x)\,dx$ and $\int_{1-c}^1 f(x)\,dx$ are as negative as possible, subject to the Lipschitz condition and the constraint that $\int_0^1 f(x)\,dx = 0$. Otherwise, we could decrease the value of $\int_0^c f(x)\,dx$ and $\int_{1-c}^1 f(x)\,dx$, and then raise the function to restore the integral to 0. An argument just as in the previous case shows that the error cannot decrease. Hence, w.l.o.g., $f(x) = f(c) - (c - x)$ for $x \in [0, c]$ and $f(x) = f(1-c) - (x - (1-c))$ for $x \in [1-c, 1]$.

We next claim that there must be at least one point $z \in [0, c) \cup (1-c, 1]$ such that $f(z) = 0$. For contradiction, assume that $f$ is positive on $[0, c) \cup (1-c, 1]$. Then $f(0), f(1) > 0$, and therefore $f(c), f(1-c) > c$. Because $f(z_1) = f(z_2) = 0$, this implies that $z_1 > 2c$ and $z_2 < 1 - 2c$. But with our choice of $c = 2 - \sqrt{3}$, we have $1 - 2c < 2c$, so this implies that $z_2 < z_1$, a contradiction.

Without loss of generality, assume that the interval $(1-c, 1]$ contains such a point $z$; define $z_3 = \min\{ z \in (1-c, 1] \mid f(z) = 0 \}$. Further, assume that we have applied Lemma 4.5 to $f$, such that $f$ maximizes the area in the intervals $[c, z_1]$, $[z_1, z_2]$, and $[z_2, 1-c]$. Consider mirroring the function $f$ on the interval $[z_1, z_3]$.
Formally, we define $f'(x) = f(z_3 - (x - z_1))$ if $x \in [z_1, z_3]$, and $f'(x) = f(x)$ otherwise. Clearly, $f'$ is Lipschitz continuous and has the same integral (namely, zero) as $f$. Next, we define a new function $f''$ by modifying $f'$ so that it is as negative as possible in the interval $[z_1 + z_3 - z_2, 1]$. Formally, we define $f''(x) = z_1 + z_3 - z_2 - x$ if $x \in [z_1 + z_3 - z_2, 1]$, and $f''(x) = f'(x)$ otherwise. (See Figure 2 for an illustration of this mirroring, and the resulting shapes of $f'$ and $f''$.) Notice that $f''$ is not normalized to have an integral of 0, since $\int_{z_1 + z_3 - z_2}^1 f''(x)\,dx < \int_{z_1 + z_3 - z_2}^1 f'(x)\,dx$. However, since (by the assumption of the current subcase) $z_2 - z_1 < (1 - c - z_2) + (z_1 - c)$, raising $f''$ to restore the integral to 0 can only increase the resulting estimation error, by an argument similar to the previous case.

Figure 2: Mirroring $f$ in $[z_1, z_3]$.

The remainder of the proof for this case is as follows: We will first prove that the estimation error of $f''$ is at least as large as the estimation error of $f$. This implies that even after normalizing $f''$, its estimation error remains at least as large as that of $f$. Finally, we can use Lemma 4.5 on the normalized version of $f''$ to reduce the number of points $x \in (c, 1-c)$ with $f''(x) = 0$ down to either one or zero, without decreasing the estimation error, and thus reduce this subcase to one of the previous two cases.

We now compare the estimation error of $f$ against that of $f''$. Simply by the definition of $f''$, we have that $\int_{z_2}^{1-c} |f(x)|\,dx = \int_{z_1}^{z_1 + (1 - c - z_2)} |f''(x)|\,dx$, and $\int_c^{z_1} |f(x)|\,dx = \int_c^{z_1} |f''(x)|\,dx$. Furthermore, we have that $\int_{z_1}^{z_2} |f(x)|\,dx \le \int_{z_1 + (1 - c - z_2)}^{1-c} |f''(x)|\,dx$.
This follows since, for any values $p, q$ such that $0 < p < q$, we have $\int_0^q \big| \frac{q}{2} - |x - \frac{q}{2}| \big|\,dx \le \int_0^q |p - x|\,dx$. Hence, $\int_c^{1-c} |f''(x)|\,dx \ge \int_c^{1-c} |f(x)|\,dx$.

In all three cases, we have thus shown that w.l.o.g., $f(x) = |x - b| - t$, for some values $b, t$. Finally, the normalization $\int_0^1 f(x)\,dx = 0$ implies that $t = \frac12 + b^2 - b$, completing the proof of Theorem 4.2.

5 Future Work

Our work is a first step toward obtaining optimal (as opposed to asymptotically optimal) randomized algorithms for choosing $k$ sample locations to estimate an aggregate quantity of a function $f$. The most obvious extension is to extend our results to the case of estimating the average using $k$ samples. It would be interesting whether approximation guarantees for the $k$-median problem (the deterministic counterpart) can be exceeded using a randomized strategy.

Also, our precise characterization of the optimal sampling distribution for functions on the interval $[0, 1]$ should be extended to higher-dimensional continuous metric spaces. Another natural direction is to consider other aggregation goals, such as predicting the function's maximum, minimum, or median. For predicting the maximum from $k$ deterministic samples, a 2-approximation algorithm was given in [4], which is best possible unless P = NP. However, it is not clear whether equally good approximations can be achieved in the randomized case. For the median, even the deterministic case is open.

On a technical note, it would be interesting whether finding the best sampling distribution for the single-sample case is NP-hard. While we presented a PTAS in this paper, no hardness result is currently known.

Acknowledgments

We would like to thank David Eppstein, Bobby Kleinberg, and Alex Slivkins for helpful discussions and pointers, and anonymous referees for useful feedback on previous versions.
References

[1] V. Arya, N. Garg, R. Khandekar, A. Meyerson, K. Munagala, and V. Pandit. Local search heuristics for k-median and facility location problems. In Proc. ACM Symposium on Theory of Computing, 2001.

[2] N. S. Bakhvalov. On approximate calculation of integrals. Vestnik MGU, Ser. Mat. Mekh. Astron. Fiz. Khim., 4:3–18, 1959.

[3] I. Baran, E. Demaine, and D. Katz. Optimally adaptive integration of univariate Lipschitz functions. Algorithmica, 50(2):255–278, 2008.

[4] A. Das and D. Kempe. Sensor selection for minimizing worst-case prediction error. In Proc. International Conference on Information Processing in Sensor Networks (IPSN), 2008.

[5] M. Grötschel, L. Lovász, and A. Schrijver. The ellipsoid method and its consequences in combinatorial optimization. Combinatorica, 1:169–197, 1981.

[6] A. Gupta, R. Krauthgamer, and J. R. Lee. Bounded geometries, fractals, and low-distortion embeddings. In Proc. IEEE Symposium on Foundations of Computer Science, 2003.

[7] P. Mathé. The optimal error of Monte Carlo integration. Journal of Complexity, 11(4):394–415, 1995.

[8] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1990.

[9] E. Novak. Stochastic properties of quadrature formulas. Numer. Math., 53(5):609–620, 1988.

[10] J. F. Traub, G. W. Wasilkowski, and H. Woźniakowski. Information-Based Complexity. Academic Press, New York, 1988.

[11] J. F. Traub and A. G. Werschulz. Complexity and Information. Cambridge University Press, Cambridge, 1998.

[12] J. Wojtaszczyk. Multivariate integration in $C^\infty([0,1]^d)$ is not strongly tractable. Journal of Complexity, 19(5):638–643, 2003.

[13] H. Woźniakowski. Average case complexity of linear multivariate problems part 1: Theory. Journal of Complexity, 8(4):337–372, 1992.

[14] H. Woźniakowski.
Average case complexity of linear multivariate problems part 2: Applications. Journal of Complexity, 8(4):373–392, 1992.