Universal Behavior in Large-scale Aggregation of Independent Noisy Observations
Tatsuto Murayama∗ and Peter Davis†
NTT Communication Science Laboratories, NTT Corporation, 2-4, Hikaridai, Seika, Kyoto 619-0237, Japan
(Dated: November 9, 2018)

Abstract

Aggregation of noisy observations involves a difficult tradeoff between observation quality, which can be increased by increasing the number of observations, and aggregation quality, which decreases if the number of observations is too large. We clarify this behavior for a prototypical system in which arbitrarily large numbers of observations exceeding the system capacity can be aggregated using lossy data compression. We show the existence of a scaling relation between the collective error and the system capacity, and show that large-scale lossy aggregation can outperform lossless aggregation above a critical level of observation noise. Further, we show that universal results for the scaling and critical value of noise which are independent of system capacity can be obtained by considering asymptotic behavior when the system capacity increases toward infinity.

PACS numbers: 89.70.-a, 64.60.-i, 02.50.Cw, 75.10.Nr

∗ Electronic address: murayama@cslab.kecl.ntt.co.jp
† Electronic address: davis@cslab.kecl.ntt.co.jp

This letter presents results which give a new perspective on the growing field of sensory data aggregation by clarifying fundamental principles of large-scale aggregation. Examples of large-scale aggregation of observations include astronomical observations [1], biological sensing [2], early detection of natural disasters such as earthquakes, tidal waves and floods [3], and wireless sensor networks [4]. Errors in observations can be reduced by collecting observation data from more sensors.
However, collecting data from many sensors usually involves some cost in terms of network resources, resulting in fundamental tradeoffs [5]. The theoretical understanding of these tradeoffs in natural and engineered systems is now a high priority. An important fundamental problem in this field is the problem of aggregating independent observations of the same phenomenon with a resource constraint. Previous works have analyzed the tradeoff behavior between aggregate data rate and sensing error from the fundamental view of information theory. The analysis has been extended to include the situation where arbitrarily large numbers of samples can be collected by reducing the data aggregated from each sample using lossy data compression. However, so far results have only been obtained for the fundamental information-theoretic bounds with infinitely many sensors [6, 7], or for specific situations in which the number of sensors is fixed [8]. The previous works do not include the situation where the number of observations can be varied, and thus the results are not sufficient to support our understanding and design of real-world systems. In this paper we introduce a modification of the common basic model for data aggregation with compression which makes it more tractable and amenable to analysis when the number of sensors can vary. Specifically, we consider independent decompression of each observation in a discrete version of the CEO problem [6]. We show that this model reveals a new property, the existence of a noise threshold beyond which large-scale aggregation is superior to lossless aggregation with no compression. This can be seen as a manifestation of "more is different" in sensor networks [9].
Moreover, we show that universal results for the scaling behavior of the collective estimation error can be obtained by considering asymptotic behavior when the system capacity diverges to infinity.

Suppose that we have L independent sensors which each independently observe an M-bit state X, with bits X_µ for µ = 1, ..., M, of a common, uniform binary source, and obtain an M-bit observation Y(a) (a = 1, ..., L), where each bit Y_µ(a) has common probability p of error, i.e. of differing from the corresponding source bit X_µ. The value of p specifies the level of observation noise. Now the sensors independently compress their M-bit observations into shorter N-bit codewords, Z(a), and send them to the aggregator. The condition 'independent' excludes the possibility of mutual communications between sensors. We assume the rate R = N/M is common to all the sensors. In addition, we suppose that the sum total of the rate, the system capacity λ, is fixed, with

λ = LR.    (1)

The aggregator then decodes every N-bit codeword independently to obtain L separate M-bit reproductions Ŷ(a) (a = 1, ..., L). Finally, the Ŷ(a) are used to obtain a single collective estimator X̂. We analyze the behavior of the bit error probability, denoted p_e(p, R; λ), in the collective estimate. The theoretical lower bound of average distortion for a given rate R is given by the distortion-rate function, or simply the Shannon bound [10]. Though we know that the bound could be achieved asymptotically by using Shannon's random codes, the exponential encoding complexity prohibits us from using them in practice. For uniform binary sources, however, an alternative approach has recently been developed based on linear codes with iterative, or message-passing, encoding, achieving close to the theoretical limit [11, 12, 13].
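The model above is easy to sketch in code. The following minimal Monte Carlo illustration (our own sketch, not from the paper; the function name and parameters are illustrative) covers the lossless case R = 1, where each reproduction equals the raw observation: L sensors see each source bit through independent binary symmetric channels with flip probability p, and the aggregator takes a per-bit majority vote.

```python
import random

def majority_vote_error(p, L, M=1000, seed=0):
    """Monte Carlo estimate of the collective bit error rate when L
    independent sensors (L odd) observe M source bits through binary
    symmetric channels with flip probability p, and the aggregator
    takes a per-bit majority vote (lossless case, R = 1)."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(M):
        x = rng.randint(0, 1)  # uniform binary source bit
        # each sensor's observation flips x independently with probability p
        ones = sum(x ^ (rng.random() < p) for _ in range(L))
        x_hat = 1 if 2 * ones > L else 0  # majority vote (L odd)
        errors += (x_hat != x)
    return errors / M
```

Increasing L at fixed p drives the collective error down, which is one side of the tradeoff; the opposing compression penalty enters once the capacity λ = LR is held fixed.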
Applying these new results allows us to obtain numerical results for arbitrary data reduction. FIG. 1 shows typical results from a numerical experiment for the average values of per-bit error probability, p_e(p, R; λ), obtained using a linear code with an iterative encoder [11]. The linear codes are defined by a class of sparse matrices having K ones ('1') per row and C ones per column, respectively, where K/C = N/M. Therefore we may write R = K/C [14]. For ease of comparison, the values of error probability p_e(p, R; λ) for noise p and rate R = N/M are divided by a reference level p_e(p, 1; λ) for R = 1 under the same system capacity λ [15].

FIG. 1: Semilog plots for average error probability in noisy data aggregation using linear codes with K = 2. The values of error probability p_e(p, R; λ) for noise p and rate R = K/C are divided by a reference level p_e(p, 1; λ) under the same system capacity λ. Here parameters are chosen to be λ = 500 with C = 3 (pluses, right scale) and λ = 1000 with C = 3 and 12 (circles and squares, left scale), respectively.

The example in FIG. 1 demonstrates the following two points: (1) There exists a threshold value of noise where lossy large-scale aggregation becomes superior to lossless aggregation. Lossless aggregation with R = 1 outperforms the lossy aggregation with R smaller than 1 at lower noise levels. However, at higher noise levels the alternative strategy with lossy data compression becomes superior. (2) There exists a scaling relation with respect to system capacity. The error curves have a universal shape in the sense that plots for different λ overlap with appropriate re-scaling, as shown by the example for λ = 500 using the scale on the right side. This observation implies a scaling law for the data aggregation with respect to λ. Introducing the coefficient β, we can write the empirical scaling relation as follows:

log[ p_e(p, R; βλ) / p_e(p, 1; βλ) ] = β log[ p_e(p, R; λ) / p_e(p, 1; λ) ].    (2)

Using the base-10 logarithm, the scaling in FIG. 1 is well defined by the scaling factor β = 2. In this letter, we present a theoretical analysis which explains these empirical results, and presents them in a universal form. First, we assume that the error due to lossy compression is independent of µ and a, and denoted by D; that is, ⟨δ(Y_µ(a), −Ŷ_µ(a))⟩ = D. Here we used Kronecker's delta δ, and the brackets denote averaging over random variables. This includes the standard exchangeable sensor ansatz for our model [6, 7], which means that all sensors have the same rate R and distortion D. The possible value of the distortion D depends on R, so we explicitly denote D as D(R). The combined error probability for Ŷ_µ(a), independent of µ and a, is obtained as

ρ = (1 − 2p) D(R) + p.    (3)

The combined error probability ρ is a function of both p and R. In particular, equation (3) implies that ρ is a decreasing function of R, since D(R) should be a decreasing function of R. Since we assume Bernoulli statistics, the best estimate from the set of aggregated values can be obtained by the simple majority-vote operation:

X̂_µ = sgn( Σ_{a=1}^{L} Ŷ_µ(a) ).

Then, the error probability for the final estimate is given, in terms of ρ and L, by

p_e(p, R; λ) = Σ_{l=(L+1)/2}^{L} Q_ρ(l; L),

which is just the probability of getting more than L/2 errors out of L Bernoulli trials. We assume for simplicity that only odd values of L are taken. Here Q_ρ(l; L) = C(L, l) ρ^l (1 − ρ)^(L−l) is the binomial distribution.
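Equation (3) and the majority-vote error can be evaluated exactly for finite L. The sketch below (helper names are ours, not the paper's) computes the combined per-reproduction error ρ = (1 − 2p)D(R) + p and the binomial upper tail that gives p_e:

```python
from math import comb

def combined_error(p, D):
    """Eq. (3): per-bit error of one reproduction, combining channel
    noise p with compression distortion D (two errors can cancel)."""
    return (1.0 - 2.0 * p) * D + p

def collective_error(rho, L):
    """Probability that a majority of the L reproductions (L odd) are
    wrong: the binomial upper tail from l = (L + 1) / 2 to L."""
    return sum(comb(L, l) * rho**l * (1.0 - rho)**(L - l)
               for l in range((L + 1) // 2, L + 1))
```

For fixed ρ < 1/2 the tail shrinks as L grows, while the constraint λ = LR pushes ρ up through D(R); the competition between these two effects is exactly what the asymptotic analysis quantifies.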
It is obvious that p_e(p, R; λ) is a decreasing function of L if ρ is fixed. However, due to the constraint (1) and the decrease of the distortion D(R) with increasing R, ρ actually increases with an increase of L, resulting in contrary effects on p_e(p, R; λ). Therefore the challenge here is to incorporate consideration of the distortion D in a way which clarifies the interplay between the contrary effects induced by the constraint (1). In the following, we consider the asymptotic analysis in the limit of large λ, for which we can obtain explicit results. For sufficiently large L, the binomial distribution Q_ρ(l; L) is well approximated by the Gaussian distribution N(Lρ, Lρ(1 − ρ)) with mean Lρ and variance Lρ(1 − ρ). Now we examine the asymptotic behavior for large λ. Write α(p, R) = (1 − 2p)(1 − 2D(R)) and define, for simplicity,

ν = α(p, R) √λ / √( R (1 − α(p, R)) (1 + α(p, R)) ).

Then, in the limit λ → ∞, the asymptotic expansion of the cumulative Gaussian distribution gives

p_e(p, R; λ) ∼ (1/2) erfc( ν / √2 ),

where erfc(x) is the complementary error function [16]. By analogy with large deviation theory [17], we can define and calculate the exponential rate of decay as follows:

I_p(R) = − lim_{λ→∞} (1/λ) ln p_e(p, R; λ) = α(p, R)² / ( 2R (1 − α(p, R)) (1 + α(p, R)) )    (0 < R ≤ 1).    (4)

Notice that the above formula holds for any function D(R). Indeed this universal property well describes the exponential scaling (2). In particular, the smallest average distortion D(R) is obtained in the limit of M → ∞, and is called the distortion-rate function [10]. In our model, its inverse function, the rate-distortion function [10], can be analytically given by

R(D) = 1 + D log₂ D + (1 − D) log₂(1 − D).    (5)

We may use either the distortion-rate function or the rate-distortion function to describe the optimal boundary, since the two descriptions are equivalent in the large M limit. Now assume hereafter that the distortion-rate function D(R) is the specific case implicitly given by the inverse formula (5) for R(D). Then the asymptotics of R(D) enable us to obtain the large-scale decay rate as

I_p(0) = − lim_{λ→∞} (1/λ) lim_{R→0} ln p_e(p, R; λ) = (1 − 2p)² ln 2.

Now we can see that if we compare just the two aggregation strategies, R = 1 or R → 0, the threshold value of noise p₁ corresponding to the switch of the superior aggregation can be determined by solving the equation

(1 − 2p₁)² ln 2 = (1 − 2p₁)² / ( 2 (1 − (1 − 2p₁)²) ).

The analytical solution p₁ = 0.236 gives the threshold beyond which the large-scale aggregation with R → 0 outperforms the R = 1 strategy. Next we numerically examine the value of R which maximizes I_p(R) for a given p. The optimal value R∗ is plotted in FIG. 2 as a function of p. We find that the optimal rate vanishes, i.e. R∗ = 0, for noise levels larger than a critical point p₀ = 0.295. In contrast, we can always find non-zero optimal values of R below this point. In particular, if the noise level is near zero, then R = 1 is optimal. The change in the value of the optimal R∗ with respect to the noise level p is continuous at p₀, as in a second-order phase transition. We note that the analytical results presented here using (4) and (5) are consistent with the results of the numerical simulations with linear codes. That is, the exponential rate of decay (4) well describes the scaling law (2). Moreover, they add more specific and fundamental conditions to our first observation on FIG. 1 that aggregation with R smaller than 1 is superior for larger noise.
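The quantities in (4) and (5) are easy to reproduce numerically. The sketch below (our own code, with illustrative names) inverts the rate-distortion function (5) by bisection, evaluates the decay rate (4) at the Shannon bound, and recovers the two thresholds quoted above: the closed-form crossing point p₁ ≈ 0.236 of the two pure strategies, and the critical point p₀ ≈ 0.295 beyond which a grid search finds R∗ = 0.

```python
import math

def rate_distortion(D):
    """Eq. (5): R(D) = 1 + D log2 D + (1 - D) log2 (1 - D)."""
    if D <= 0.0:
        return 1.0
    if D >= 0.5:
        return 0.0
    return 1.0 + D * math.log2(D) + (1.0 - D) * math.log2(1.0 - D)

def distortion_rate(R, tol=1e-12):
    """Invert R(D) on [0, 1/2] by bisection (R is decreasing in D)."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if rate_distortion(mid) > R:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def decay_rate(p, R):
    """Eq. (4) with the Shannon-bound distortion D(R):
    I_p(R) = alpha^2 / (2 R (1 - alpha^2)), alpha = (1-2p)(1-2D(R))."""
    alpha = (1.0 - 2.0 * p) * (1.0 - 2.0 * distortion_rate(R))
    return alpha**2 / (2.0 * R * (1.0 - alpha**2))

def optimal_rate(p, grid=1000):
    """Grid search for R* maximizing I_p(R); the R -> 0 limit of the
    decay rate is (1 - 2p)^2 ln 2."""
    best_R, best_I = 0.0, (1.0 - 2.0 * p)**2 * math.log(2.0)
    for k in range(1, grid + 1):
        R = k / grid
        I = decay_rate(p, R)
        if I > best_I:
            best_R, best_I = R, I
    return best_R

# Closed-form crossing of the two pure strategies (R = 1 vs R -> 0),
# from (1 - 2 p1)^2 ln 2 = (1 - 2 p1)^2 / (2 (1 - (1 - 2 p1)^2)):
p1 = 0.5 * (1.0 - math.sqrt(1.0 - 1.0 / (2.0 * math.log(2.0))))
```

Here p1 evaluates to ≈ 0.236, and optimal_rate returns 1.0 at low noise and 0.0 above roughly p ≈ 0.295, matching the continuous vanishing of R∗ described in the text.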
The critical point beyond which the strategy with R = 1 is not optimal in FIG. 2 indicates the lowest bound for such a threshold, and is obviously consistent with the numerical simulations.

FIG. 2: Optimal rate R∗ for lossy aggregation of observations from independent sensors in a noisy environment with noise level p. R∗ is the optimal value of R, the aggregation rate per sensor, which maximizes the asymptotic decay rate I_p(R) of error probability with increase of system capacity λ. I_p(R) is defined in (4). Distortion due to lossy compression is given implicitly by (5). For comparison, R† is the pessimistic value of R which minimizes I_p(R).

Now let us consider the value of R which minimizes I_p(R), say R†. In contrast with the continuous change in the behavior of the optimal R∗, the pessimistic R† shows an abrupt change with respect to the noise p. Our numerical analysis indicates that there are only two cases for the worst solution: R† = 0 and R† = 1, so the threshold value of noise p₁ corresponds to the switch of the R†. We note that in the intermediate range of p the optimal R∗ is a finite value between R = 1 and R = 0. It is natural to ask how much the estimates obtained with these intermediate values of R∗ differ from the estimates obtained using the extreme values of R = 1 or R = 0. FIG. 3 shows the noise dependence of the decay rates I_p(R) with R = 0, 1, and R∗, respectively. The size of the differences I_p(R∗) − I_p(1) and I_p(0) − I_p(1) is shown in the inset of FIG. 3. For comparison with these results, which were obtained using the Shannon limit, the rate-distortion function in (5), we also show the result obtained for the linear code with K = 2, corresponding to FIG. 1. This result for K = 2 was obtained using the replica method for diluted spin systems [18, 19].
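The two pure strategies can be compared directly, since both decay rates have closed forms: I_p(1) = (1 − 2p)²/(2(1 − (1 − 2p)²)) (no compression, D = 0) and I_p(0) = (1 − 2p)² ln 2. A small sketch (our code, not the paper's) makes the abrupt switch of the better pure strategy explicit:

```python
import math

def decay_lossless(p):
    """I_p(1): lossless aggregation, D = 0, so alpha = 1 - 2p."""
    a2 = (1.0 - 2.0 * p) ** 2
    return a2 / (2.0 * (1.0 - a2))

def decay_large_scale(p):
    """I_p(0): the R -> 0 limit under the Shannon bound, (1-2p)^2 ln 2."""
    return (1.0 - 2.0 * p) ** 2 * math.log(2.0)

def better_pure_strategy(p):
    """Which pure strategy has the faster error decay at noise level p?"""
    return "R = 1" if decay_lossless(p) > decay_large_scale(p) else "R -> 0"
```

The returned label flips from "R = 1" to "R -> 0" as p crosses p₁ ≈ 0.236, the same point at which the pessimistic rate R† jumps between its two extreme values.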
First we note that in the case of compression using R(D), expression (5), the combination strategy of using only either R = 1 or R → 0, switching at the threshold point p₁, well approximates the optimal performance given by R∗. Next, we focus on the behavior of the difference I_p(R∗) − I_p(1) with respect to the noise p (solid line in inset). The largest gain is achieved at p∗ = 0.305 (indicated in the figure by a vertical dotted line), which differs slightly from the value for p₀, which was p₀ = 0.295. Finally, we consider the result for the linear code with K = 2. It shows a similar threshold behavior: the value of I_p(R) for R = 0 becomes greater than the value for R = 1 when the noise p exceeds a threshold value [20]. However, the gain is less than that obtained for the rate-distortion function, which shows that there is still room for improvement by using alternative techniques [21, 22].

Our results show that the optimal aggregation for a system of sensors with constrained system capacity exhibits a kind of threshold behavior with respect to the observation noise level. If we imagine the system autonomously switching to the optimal aggregation method, then it would appear to be a phase transition behavior. This result is significant for understanding the principles of large-scale aggregation in sensing systems, natural or engineered. We described the behavior of the optimal aggregation rate per sensor R = λ/L, the ratio of the system capacity λ to the number of sensors L. The analysis shows that in the high-noise region beyond a critical value of noise, the rate R should approach zero in order to reduce the collective estimation error. This means that very many sensors with L ≫ λ should be used. In contrast, if the noise level is lower than the critical point, the ratio R should take a positive value.
In this case, the number of sensors scales as L = O(λ).

This work has been supported in part by a Grant-in-Aid for Scientific Research on Priority Areas, Ministry of Education, Culture, Sports, Science and Technology (MEXT), Japan, No. 18079015.

FIG. 3: Error decay rates. I_p(R∗) is the error decay rate with lossy data compression at the optimal aggregation rate R∗, while I_p(1) is the error decay rate without data compression. I_p(0) corresponds to the error decay rate for the large system limit when R → 0. Inset: Information gain. R(D) corresponds to the Shannon limit, while K = 2 indicates performance of the linear codes when C → ∞.

[1] M. Ryle and A. Hewish, Monthly Notices of the Royal Astronomical Society 120, 220 (1960).
[2] N. Franceschini, Photoreceptor Optics, pp. 98–125 (1975).
[3] J. Zschau and A. Küppers, Early Warning Systems for Natural Disaster Reduction (Springer, 2003).
[4] J. Kahn, R. Katz, and K. Pister, in Proceedings of the 5th Annual ACM/IEEE International Conference on Mobile Computing and Networking (ACM Press, New York, NY, USA, 1999), pp. 271–278.
[5] I. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci, IEEE Communications Magazine 40, 102 (2002).
[6] T. Berger, Z. Zhang, and H. Viswanathan, IEEE Transactions on Information Theory 42, 887 (1996).
[7] Y. Oohama, IEEE Transactions on Information Theory 44, 1057 (1998).
[8] M. Gastpar, IEEE Transactions on Information Theory 54, 5247 (2008).
[9] P. Anderson, Science 177, 393 (1972).
[10] T. Cover and J. Thomas, Elements of Information Theory (Wiley, New York, 1991).
[11] T. Murayama, Physical Review E 69, 35105 (2004).
[12] M. Wainwright and E. Maneva, in Proceedings of the IEEE International Symposium on Information Theory (2005), pp. 1493–1497.
[13] S. Ciliberti, M. Mézard, and R. Zecchina, Physical Review Letters 95, 38701 (2005).
[14] R. Gallager, IEEE Transactions on Information Theory 8, 21 (1962).
[15] T. Murayama and P. Davis, Advances in Neural Information Processing Systems 18 (NIPS'05), pp. 931–938 (2006).
[16] E. Copson, Asymptotic Expansions (Cambridge University Press, 2004).
[17] R. Ellis, Entropy, Large Deviations and Statistical Mechanics (Springer, 1985).
[18] K. Wong and D. Sherrington, Journal of Physics A: Mathematical and General 20, L793 (1987).
[19] Y. Kabashima and D. Saad, Europhysics Letters 45, 97 (1999).
[20] T. Murayama and M. Okada, Advances in Neural Information Processing Systems 15 (NIPS'02), pp. 423–430 (2003).
[21] T. Hosaka, Y. Kabashima, and H. Nishimori, Physical Review E 66, 66126 (2002).
[22] M. Opper and O. Winther, Physical Review Letters 86, 3695 (2001).