Distributed Estimation of Gaussian Correlations
Authors: Uri Hadar, Ofer Shayevitz
June 26, 2018

Abstract

We study a distributed estimation problem in which two remotely located parties, Alice and Bob, observe an unlimited number of i.i.d. samples corresponding to two different parts of a random vector. Alice can send $k$ bits on average to Bob, who in turn wants to estimate the cross-correlation matrix between the two parts of the vector. In the case where the parties observe jointly Gaussian scalar random variables with an unknown correlation $\rho$, we obtain two constructive and simple unbiased estimators attaining a variance of $(1-\rho^2)/(2k\ln 2)$, which coincides with a known but non-constructive random coding result of Zhang and Berger. We extend our approach to the vector Gaussian case, which has not been treated before, and construct an estimator that is uniformly better than the scalar estimator applied separately to each of the correlations. We then show that the Gaussian performance can essentially be attained even when the distribution is completely unknown. This in particular implies that in the general problem of distributed correlation estimation, the variance can decay at least as $O(1/k)$ with the number of transmitted bits. This behavior, however, is not tight: we give an example of a rich family of distributions for which local samples reveal essentially nothing about the correlations, and where a slightly modified estimator attains a variance of $2^{-\Omega(k)}$.

1 Introduction and Main Results

Estimating the parameters of an unknown distribution from its samples is a basic task in many scientific problems. The vast majority of research in this field has been dedicated to the centralized setup, where a number of independent samples are being observed by the estimating entity [1].
However, in many cases the data for the estimation task might be collected by remote terminals, who then need to communicate information regarding their observations in order to perform (or improve) estimation. When the budget for communication is limited, the parties must judiciously encode their observations and send a compressed version that is as useful as possible, creating a tension between communication and estimation.

∗The authors are with the Department of EE–Systems, Tel Aviv University, Tel Aviv, Israel (emails: urihadar@mail.tau.ac.il, ofersha@eng.tau.ac.il). This work was supported by an ERC grant no. 639573. The first author would like to acknowledge the generous support of The Yitzhak and Chaya Weinstein Research Institute for Signal Processing.

In this paper, we study the following distributed estimation setup. Let $X$ and $Y$ be a pair of jointly distributed random vectors taking values in Euclidean spaces of dimensions $d_X$ and $d_Y$ respectively. Assume the distribution of the pair is only known to belong to a given family of distributions, but is otherwise arbitrary. Two remotely located parties, Alice and Bob, draw i.i.d. samples $\{(X_i, Y_i)\}$ from this distribution, where the $X$ component is observed only by Alice and the $Y$ component is observed only by Bob. The parties are interested in estimating the set of correlations between the entries of $X$ and $Y$ using their local samples and limited communication. Specifically, we focus on the regime where the number of samples locally available to each party is essentially unlimited, but only a fixed number of $k$ bits can be transmitted on average from (say) Alice to Bob.
In this extremal regime there is no coupling between data collection and communication (typically captured by the notion of rate, of communication bits per data sample), and the only constraint in the system stems from its distributive nature. Moreover, we restrict attention to cases where the correlations cannot be estimated locally (e.g., Gaussian marginals do not depend on the cross-correlation parameters), which further distills the distributive aspect of the problem.

In what follows we focus mainly on the Gaussian case, i.e., where $X$ and $Y$ are jointly Gaussian random vectors. We begin our discussion with the scalar $d_X = d_Y = 1$ case, where our goal is to estimate the correlation coefficient $\rho$. The only work we are aware of that deals with distributed estimation of the bivariate normal correlation under communication constraints is by Zhang and Berger [2], who studied the problem as an application of a more general result. Using random coding techniques, they proved the existence of an asymptotically unbiased estimator whose variance they provided as a function of the number of samples and the rate $R$ of communication bits per sample. Specializing to our setup by plugging in $k/R$ as the number of samples, the Zhang–Berger variance is given by

$$\operatorname{Var}\hat\rho_{ZB} = \frac{R}{k}\left(1+\rho^2+\frac{1-\rho^2}{2^{2R}-1}\right) + o(1). \qquad (1)$$

Since we do not impose a rate constraint in our setup, we can minimize the variance over $R$ to obtain

$$\inf_{R>0}\operatorname{Var}\hat\rho_{ZB} = \frac{1}{k}\left(\frac{1-\rho^2}{2\ln 2} + o(1)\right), \qquad (2)$$

which is attained (not surprisingly) in the zero-rate limit as $R \to 0$. It should be noted that this estimator was not claimed to be optimal in any sense. Furthermore, as the authors themselves indicate, the results in [2] apply only to the single scalar parameter case, and it is not clear how to extend this approach to the vector case.
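The minimization leading to (2) is easy to probe numerically. The following Python sketch (a sanity check, not part of the paper; the choices $k = 1000$ and $\rho = 0.5$ are arbitrary) evaluates the Zhang–Berger bound (1) without its $o(1)$ term over a grid of rates and compares the minimum against the closed-form zero-rate limit:

```python
import math

def zb_variance(R, k, rho):
    """Leading term of the Zhang-Berger bound (1)."""
    return (R / k) * (1 + rho**2 + (1 - rho**2) / (2**(2 * R) - 1))

k, rho = 1000.0, 0.5                            # arbitrary illustrative choices
target = (1 - rho**2) / (2 * k * math.log(2))   # the zero-rate limit (2)

# sweep rates toward R -> 0; the minimum over the grid approaches `target`
grid = [10.0**(-e) for e in range(1, 7)] + [0.5, 1.0, 2.0]
best = min(zb_variance(R, k, rho) for R in grid)
print(best, target)
```

On such a grid the minimum sits at the smallest rate and is already within a small relative error of $(1-\rho^2)/(2k\ln 2)$, while any fixed positive rate (e.g. $R = 1$) is markedly worse, consistent with the zero-rate limit being the infimum.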
In this Gaussian scalar setup, addressed in Section 2, we introduce the following constructive scheme: Alice sends to Bob the index $J$ of the largest sample among her first $2^k$ samples, and Bob computes the unbiased estimator

$$\hat\rho_{\max} = \frac{Y_J}{\mathbb{E}X_J} \approx \frac{Y_J}{\sqrt{2k\ln 2}}. \qquad (3)$$

In Theorem 1, we show that this simple estimator attains the same variance as the non-constructive Zhang–Berger estimator (2), i.e.,

$$\operatorname{Var}\hat\rho_{\max} = \frac{1}{k}\left(\frac{1-\rho^2}{2\ln 2} + o(1)\right). \qquad (4)$$

Then, in preparation for the vector case, we describe a simple variation of this estimator: Alice scans her samples sequentially and finds the index $J$ of the first sample to pass a suitably chosen threshold. She then compresses this index using an optimal lossless variable-length code and sends the encoded version to Bob, who computes an estimator using his corresponding $Y$ sample, in a way similar to the maximum estimator above. This threshold estimator is unbiased, and also attains the Zhang–Berger variance. We note that the maximal/threshold-passing value of a scalar i.i.d. Gaussian sequence has been employed before in problems of writing on dirty paper [3], [4], and Gaussian lossy source coding [5].

We proceed to consider the vector Gaussian setup (Section 3). Without loss of generality, we assume that both parties know the distribution of Alice's vector, since she can estimate it arbitrarily well from her local samples and send a sufficiently accurate quantization to Bob with what can be shown to be a negligible cost in communication. In the case where $d_X = 1$ and $d_Y > 1$ we can trivially extend the scalar estimator by having Alice perform the same encoding (maximal or threshold) and have Bob apply the same type of estimation to each of the entries of $Y$ using the single index obtained from Alice. The case of $d_X > 1$ and $d_Y = 1$ is more interesting.
Of course, one could simply estimate each one of the correlations $\rho_\ell$ between $(X)_\ell$ and $Y$ separately by repeating the scalar method. A worthy goal is therefore to find an estimator that dominates the scalar approach, uniformly for all correlation values. In Proposition 3, we show that performing general linear operations (e.g., whitening the signal) before applying the scalar estimator does not dominate the scalar approach. We then describe a multidimensional estimator that does dominate the scalar approach, by generalizing the scalar threshold to an appropriately chosen $d_X$-dimensional stopping set. We show that the resulting (constructive) estimator $\hat\rho$ attains a total mean squared error that is a function of the highest correlation only, and is given by

$$\mathbb{E}\|\hat\rho - \rho\|^2 \le \frac{1}{k}\left(\frac{d_X^2}{2\ln 2}\min_{\ell\in[d]}\{1-\rho_\ell^2\} + o(1)\right). \qquad (5)$$

This is proved in Theorem 4. We note that the case of $d_X, d_Y > 1$ is again a trivial extension of the $d_X > 1$, $d_Y = 1$ case.

Returning to the general non-Gaussian setup (Section 4), we provide two additional results. In Section 4.1 we show how our estimators above can essentially be used to obtain the same variance guarantees when $(X, Y)$ are arbitrarily distributed, subject only to uniform integrability fourth moment conditions. This in particular means that one can always get a $O(1/k)$ variance in distributed correlation estimation with $k$ transmitted bits on average. Recall that in centralized estimation problems, when the family of distributions is sufficiently smooth in the parameter of interest, the Cramér–Rao lower bound implies that the optimal estimation variance is $\Theta(1/n)$, where $n$ is the number of samples.
Thus, the centralized number of samples required to achieve the same variance as in the distributed case is at least linear in the number of communication bits, i.e., each communication bit is worth at least a constant number of samples. It is perhaps tempting to guess that this relation is fundamental, i.e., that a bit is equivalent to a constant number of samples, hence that the variance cannot decrease faster than $\Omega(1/k)$, assuming that the family of distributions is such that Bob cannot estimate the correlations from his local samples. While we conjecture this is true in the Gaussian case, it does not hold in general: in Subsection 4.2 we give an example of a rich family of distributions for which local samples reveal essentially nothing about the correlations, and where the variance of our (slightly modified) estimator is $2^{-\Omega(k)}$.

1.1 Related Work

The problem of distributed estimation under communication constraints has been studied in the last couple of decades by several authors. Zhang and Berger [2] used random coding techniques to establish the existence of an asymptotically unbiased estimator whose variance is upper bounded by a single-letter expression. Their results are limited to a certain family of joint distributions (that must satisfy an additivity condition) that depend on a one-dimensional parameter. Ahlswede and Burnashev [6] gave a multi-letter lower bound on the minimax estimation variance in the one-dimensional case. Han and Amari [7] (see also the survey paper [8]) suggested a rate constrained encoding scheme, and obtained the likelihood equation based on the decoded statistic. They also showed that the estimation variance asymptotically achieves the inverse of the Fisher information of that statistic. Their results only apply to finite alphabets.
Amari [9] discussed optimal compression in the specific setting of estimating the correlation between two binary sources. He showed that under linear-threshold encoding, there does not exist a single scheme that is uniformly optimal for all correlation values. A similar setup was discussed by Haim and Kochman [10] in the context of hypothesis testing between two correlation values. Zhang et al. [11] provided minimax lower bounds for a distributed estimation setting in which all terminals observe samples from the same distribution. El Gamal and Lai [12] showed that Slepian–Wolf rates are not necessary for distributed estimation over finite alphabets.

There is a rich literature addressing other aspects of the distributed estimation problem. Xiao et al. [13] and Lou [14] considered distributed estimation of a location parameter under energy and bandwidth constraints. Gubner [15] considered a Bayesian distributed estimation setting and suggested a local quantization algorithm. Xu and Raginsky [16] provided lower bounds on the risk in a distributed Bayesian estimation setting with noisy channels between the data collection terminals and the estimation entity. Braverman et al. [17] provided lower bounds for some high dimensional distributed estimation problems, again when the samples of all terminals are from the same distribution, e.g. for distributed estimation of the multivariate Gaussian mean when it is known to be sparse. The authors of [18], [19], [20] and [21] addressed various distributed estimation setups where the measurements across the sensors are assumed to be independent.

1.2 Notations and preliminaries

The standard normal density is denoted by $\phi(x) = e^{-x^2/2}/\sqrt{2\pi}$, and the tail probability by $Q(x) \triangleq \int_x^\infty \phi(t)\,dt$. For $Z \sim \mathcal{N}(0,1)$ the inverse Mills ratio is denoted by

$$s(x) \triangleq \mathbb{E}(Z \mid Z > x) = \frac{\phi(x)}{Q(x)}. \qquad (6)$$

We write $\log$ and $\ln$ for the base 2 and natural logarithm, respectively. The entropy of the geometric distribution with parameter $p$ is given by $h_g(p) \triangleq h(p)/p$, where $h(p) = -p\log p - (1-p)\log(1-p)$ is the binary entropy function. Note that $h_g(p) = -\log(p)(1+o(1))$ as $p \to 0$. Recall also that any discrete random variable (e.g. in our case, a geometric r.v.) can be losslessly encoded using a prefix-free code with expected length exceeding its entropy by at most one bit [22]. Since in the setups we consider the entropy grows large, this excess one bit has a vanishing effect on our results, hence for the sake of readability we disregard it throughout.

For any natural $n$ we denote $[n] \triangleq \{1,\dots,n\}$. For a vector $v$, the $i$-th coordinate is denoted by $(v)_i$. Similarly, $(M)_{ij}$ denotes the $ij$-th entry of the matrix $M$. The $d\times d$ identity matrix is denoted by $I_d$. We use the standard order notation; in the following, $f$ and $g$ are positive functions with discrete or continuous domain. We write $f = o(g)$ to indicate that $\lim f/g = 0$, and $f = O(g)$ to indicate that $\limsup f/g < \infty$, where the arguments and implied limits should be clear from the context. Writing $f = \Omega(g)$ means that $g = O(f)$, and $f = \Theta(g)$ means that both $f = O(g)$ and $f = \Omega(g)$.

Given a statistic $T$, and a scalar parameter $\theta$ we wish to estimate, the Fisher information of estimating $\theta$ from $T$ (see e.g. [1]) is given by

$$I_T(\theta) \triangleq \mathbb{E}\left[\left(\frac{\partial \log f(T\mid\theta)}{\partial\theta}\right)^2\right], \qquad (7)$$

where $f(t\mid\theta)$ is the p.d.f. of $T$ for the given value of $\theta$. The Cramér–Rao lower bound (CRLB) states that, under some regularity conditions (see e.g. [1]) that are trivially satisfied in our setups, any unbiased estimator $\hat\theta = \hat\theta(T)$ of $\theta$ satisfies

$$\operatorname{Var}\hat\theta \ge 1/I_T(\theta). \qquad (8)$$

An estimator $\hat\theta$ that satisfies (8) with equality is said to be efficient.
We emphasize that the efficiency is with respect to the statistic $T$ by saying it is efficient given $T$. The estimators and statistics in this paper depend on the number of communicated bits, $k$. We call an estimator $\hat\theta$ asymptotically efficient given $T$ if $\mathbb{E}\hat\theta \to \theta$ and $I_T(\theta)\cdot\operatorname{Var}\hat\theta \to 1$ as $k \to \infty$. The estimated parameter may be vector valued, in which case $I_T(\theta)$ is a matrix given by

$$I_T(\theta) \triangleq \mathbb{E}\left[\left(\frac{\partial \log f(T\mid\theta)}{\partial\theta}\right)^T \frac{\partial \log f(T\mid\theta)}{\partial\theta}\right], \qquad (9)$$

and the CRLB reads $\operatorname{Cov}\hat\theta \ge I_T^{-1}$, where the inequality is in the positive semidefinite sense. In the vector case we say that an estimator $\hat\theta(T)$ is asymptotically efficient if the estimator $v^T\hat\theta(T)$ of $v^T\theta$ is asymptotically efficient for any $v \in \mathbb{R}^{\dim(\theta)}$. We note that since the aforementioned regularity conditions are satisfied in our Gaussian setups, (asymptotic) efficiency of an estimator implies that it is (asymptotically) minimum variance unbiased.

The Fisher information matrix of a Gaussian vector with mean $\mu$ and covariance matrix $\Sigma$, where both are functions of a parameter vector $\theta$, is given by (see e.g. [23])

$$(I)_{ij} = \frac{\partial\mu^T}{\partial(\theta)_i}\Sigma^{-1}\frac{\partial\mu}{\partial(\theta)_j} + \frac{1}{2}\operatorname{tr}\left(\Sigma^{-1}\frac{\partial\Sigma}{\partial(\theta)_i}\Sigma^{-1}\frac{\partial\Sigma}{\partial(\theta)_j}\right). \qquad (10)$$

A common setup throughout is where a parameter is estimated from $(X, Y)$ where $Y \mid X$ is Gaussian, and the distribution of $X$ does not depend on the parameter. In this case we have

$$I_{X,Y} = \mathbb{E}_X\, I_{Y\mid X}, \qquad (11)$$

where $I_{Y\mid X}$ is obtained via (10).

2 Estimating a single correlation

In this section, we consider the case where $X$ and $Y$ are both scalar, jointly Gaussian r.v.s, with unknown parameters satisfying only $\mathbb{E}X^2, \mathbb{E}Y^2 < u$ for some known $u$. Since the number of local samples available to Alice and Bob is unlimited, they can both estimate their own mean and variance arbitrarily well (taking $u$ into account) and normalize their samples accordingly.
Hence, without loss of generality we can assume that $X, Y \sim \mathcal{N}(0,1)$, and that the only unknown parameter is their correlation coefficient $\rho$. This model can be written as

$$Y = \rho X + \sqrt{1-\rho^2}\, Z \qquad (12)$$

where $Z \sim \mathcal{N}(0,1)$ is statistically independent of $X$. Alice, who observes the i.i.d. samples $\{X_i\}$, can transmit $k$ bits on average to Bob, who observes the corresponding $\{Y_i\}$ samples and would like to obtain a good estimate of $\rho$ in the mean squared error sense. We note that the conditional Fisher information of $\rho$ from $Y$, given that $X = x$, is

$$I_{Y\mid X=x}(\rho) = \frac{(1-\rho^2)x^2 + 2\rho^2}{(1-\rho^2)^2}, \qquad (13)$$

which is linear in $x^2$. This motivates using an estimator based on a measurement for which $|x|$ is as large as possible. The same can also be intuitively deduced from (12), since if one controls $X$, then picking it as large as possible would "maximize the SNR". For simplicity, we look at large positive values of $x$ rather than large values of $|x|$. Our derivations can be easily modified to hold in the latter case (with one extra bit describing the sign) without affecting the results.

2.1 Max estimator

Following the heuristic discussion above, consider the following scheme. Given the constraint $k$ on the expected number of communication bits, Alice looks at her first $2^k$ samples, finds the maximal one, and sends its index

$$J = \operatorname*{argmax}_{i\in[2^k]} X_i \qquad (14)$$

to Bob, using exactly $k$ bits. Bob now looks at $Y_J$, his sample that corresponds to the same index, which we refer to as the co-max.¹ If Bob were in possession of $X_J$ as well, then observing the model (12) again, a natural estimator for $\rho$ he could have used is $Y_J/X_J$. In fact, it can be shown that this estimator is an approximate solution to the maximum likelihood equation, which is a third-degree polynomial in this case (see Appendix A.7).
However, since $X_J$ is not available, Bob uses the estimator

$$\hat\rho_{\max} = \frac{Y_J}{\mathbb{E}X_J} \qquad (15)$$

that depends only on $J$ (communicated by Alice) and on his own samples. The following theorem shows that this simple estimator attains the same variance as the non-constructive Zhang–Berger estimator (2), and also that knowing the value of $X_J$ does not help.

Theorem 1. The estimator $\hat\rho_{\max}$ is unbiased with

$$\operatorname{Var}\hat\rho_{\max} = \frac{1}{k}\left(\frac{1-\rho^2}{2\ln 2} + o(1)\right) \qquad (16)$$

where $k$ is the number of transmitted bits. Furthermore, $\hat\rho_{\max}$ is asymptotically efficient given $(X_J, Y_J)$.

¹This is also known in the literature as the max concomitant; see e.g. [24].

Proof. It is easy to check that $\hat\rho_{\max}$ is unbiased. In order to compute its variance, we need to compute the mean and variance of $X_J$, which is the maximum of $2^k$ i.i.d. standard normal r.v.s. From extreme value theory (see e.g. [24]) applied to the normal distribution case, we obtain:

$$\mathbb{E}X_J = \sqrt{2\ln(2^k)}\,(1+o(1)) \qquad (17)$$
$$\mathbb{E}X_J^2 = 2\ln(2^k)(1+o(1)) \qquad (18)$$
$$\operatorname{Var}X_J = O\!\left(\frac{1}{\ln(2^k)}\right). \qquad (19)$$

Therefore, we have that

$$\operatorname{Var}\hat\rho_{\max} = \frac{1}{(\mathbb{E}X_J)^2}\operatorname{Var}\left(\rho X_J + \sqrt{1-\rho^2}\,Z\right) \qquad (20)$$
$$= \frac{1}{(\mathbb{E}X_J)^2}\left(\rho^2\operatorname{Var}X_J + 1 - \rho^2\right) \qquad (21)$$
$$= \frac{1}{2k\ln 2}\left(1-\rho^2+o(1)\right). \qquad (22)$$

Now, recalling (13), the Fisher information of $\rho$ from $(X_J, Y_J)$ is given by

$$I_{X_J Y_J}(\rho) = \frac{(1-\rho^2)\mathbb{E}X_J^2 + 2\rho^2}{(1-\rho^2)^2} \qquad (23)$$
$$= 2k\ln 2\left(\frac{1}{1-\rho^2} + o(1)\right), \qquad (24)$$

and hence $\hat\rho_{\max}$ is asymptotically efficient given $(X_J, Y_J)$.

Theorem 1 suggests that using a better estimator of $X_J$ in lieu of its expectation (by having Alice send some quantization $\hat X_J$ of $X_J$ and having Bob compute $\hat\rho = Y_J/\hat X_J$) would not improve the performance asymptotically, as $\hat\rho_{\max}$ is optimal among all unbiased estimators that use both the max and co-max. In Section 4.2, we will see that this observation does not extend to some other additive models.
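Theorem 1 lends itself to a quick Monte Carlo check. The sketch below (illustrative only, not the authors' code; $k = 10$, $\rho = 0.6$, and the number of trials are arbitrary choices) runs the max scheme many times, standing in for $\mathbb{E}X_J$ with its empirical mean, which Bob could in principle compute since the sampling distribution is known. At this moderate $k$ the $o(1)$ term in (16) is still visible, so the empirical variance is only expected to match the leading term up to a modest constant factor:

```python
import numpy as np

rng = np.random.default_rng(0)
k, rho = 10, 0.6                       # arbitrary illustrative choices
n, trials = 2**k, 4000

# Alice's samples and Bob's, following the model (12)
X = rng.standard_normal((trials, n))
Z = rng.standard_normal((trials, n))
Y = rho * X + np.sqrt(1 - rho**2) * Z

J = np.argmax(X, axis=1)               # the index Alice sends, eq. (14)
co_max = Y[np.arange(trials), J]       # Bob's co-max samples Y_J
E_XJ = X[np.arange(trials), J].mean()  # empirical stand-in for E[X_J]

rho_hat = co_max / E_XJ                # estimator (15)
asym_var = (1 - rho**2) / (2 * k * np.log(2))   # leading term of (16)

print(rho_hat.mean(), rho_hat.var(), asym_var)
```

The sample mean lands on $\rho$ to within simulation noise, while the sample variance exceeds the asymptotic value by a constant factor at $k = 10$, shrinking as $k$ grows.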
W e note that the rando m co ding Zhang-Ber ger estimator only deals with the scalar case, and as the authors themselves indicate [2], it remains unclea r whether it could b e extended to the the case of multiple co rrela tions. In contrast, our co nstructive approach ca n a lso b e naturally extended to the multidimen- sional case. T o that end, it is instr uctive to first describ e a simple v ar iation of our scalar estimator. 2.2 Threshold estimator W e now intro duce a simple mo dification to max estimator that w ill b e us e ful in the sequel. Instead of ta k ing the maximum of a fixed num ber of measur e ments, Alice sequentially s c a ns her samples until she finds a s ample that exceeds some 8 fixed threshold, to b e determined later. She then s ends the index of this sample to Bo b, w ho pro ceeds similarly to the max metho d. The main difference is that using the max metho d Alice sends a fixed num ber of bits, whereas using the threshold metho d she sends a ra ndom n um ber of bits. In this s ubsection, we in tro duce and analyze the thres ho ld es timato r and demo nstrate that it is asymptotically eq uiv alent to the max estimator , in terms of how the estimation v aria nce is related to the exp ected n um ber of bits trans mitted. As men tioned ab ov e, the main motiv a tion for studying the threshold es timator is that in co n- trast to the max es timator, it can b e natura lly extended to the multidim ensional case. Precisely , let J = min { i : X i > t } , (25) and co nsider the estimator ˆ ρ th = Y J E X J . (26) Note that the index J is distributed geometrically w ith parameter p = P r( X > t ) = Q ( t ) . Alice can therefor e repr esent J using a pr efix-free co de (e.g., Huff- man) with at mos t h g ( p ) + 1 bits on av erage, w her e h g ( p ) is the entrop y of this geometric distr ibution [22]. 
For brevity of exposition, we assume that the expected number of bits is exactly $k = h_g(p)$, as this does not affect the asymptotic behavior. Therefore, to satisfy the communication constraint the threshold must be set to

$$t = Q^{-1}\left(h_g^{-1}(k)\right). \qquad (27)$$

We later show that $t = \sqrt{2k\ln 2}\,(1+o(1))$ as $k$ grows large. The following theorem shows that like the max estimator, the threshold estimator attains the same variance as the non-constructive Zhang–Berger estimator (2), and also that knowing the value of $X_J$ again does not help.

Theorem 2. The estimator $\hat\rho_{th}$ is unbiased with

$$\operatorname{Var}\hat\rho_{th} = \frac{1}{k}\left(\frac{1-\rho^2}{2\ln 2} + o(1)\right) \qquad (28)$$

where $k$ is the expected number of transmitted bits. Furthermore, $\hat\rho_{th}$ is asymptotically efficient given $(X_J, Y_J)$.

Proof. It is immediate to verify that $\hat\rho_{th}$ is unbiased. We have from (6) that $\mathbb{E}X_J = s(t)$, and straightforward calculations give $\mathbb{E}X_J^2 = 1 + ts(t)$. Also it is known that $t \le s(t) \le t + t^{-1}$, and that (see e.g. [25])

$$\frac{1}{s(t)} = \frac{1}{t} - \frac{1}{t^3} + \frac{3}{t^5} + O\!\left(\frac{1}{t^7}\right). \qquad (29)$$

Combining the above yields

$$\mathbb{E}X_J^2 = t^2(1+o(1)), \qquad \operatorname{Var}X_J = 1/t^2 + O(1/t^4). \qquad (30)$$

Let us now express the threshold $t$ in terms of $k$. We have $h_g(p) = -\log(p)(1+o(1))$ as $p \to 0$, and also that $-\ln Q(t) = \frac{t^2}{2}(1+o(1))$. Therefore the expected number of bits sent by Alice is

$$k = h_g(Q(t)) \qquad (31)$$
$$= -\log(Q(t))(1+o(1)) \qquad (32)$$
$$= t^2\left(\frac{1}{2\ln 2} + o(1)\right), \qquad (33)$$

which yields $t = \sqrt{2k\ln 2}/(1+o(1))$. Combining this with (30) and recalling the model (12), we obtain

$$\operatorname{Var}\hat\rho_{th} = \frac{1}{s^2}\left(\rho^2\operatorname{Var}X_J + 1 - \rho^2\right) \qquad (34)$$
$$= \frac{1-\rho^2}{t^2}(1+o(1)) \qquad (35)$$
$$= \frac{1}{k}\left(\frac{1-\rho^2}{2\ln 2} + o(1)\right). \qquad (36)$$

Recalling (13), the Fisher information is given by

$$I_{X_J Y_J} = \frac{t^2}{1-\rho^2}(1+o(1)) = \frac{2k\ln 2}{1-\rho^2}(1+o(1)), \qquad (37)$$

concluding the proof.
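Computing the threshold in (27) requires inverting $h_g \circ Q$, which has no closed form. The following stdlib-only Python sketch (not from the paper; $k = 100$ and $\rho = 0.6$ are arbitrary) does this by bisection, and checks both the asymptotics $t = \sqrt{2k\ln 2}\,(1+o(1))$ and the variance assembled from the quantities in the proof ($\mathbb{E}X_J = s(t)$, $\operatorname{Var}X_J = 1 + ts(t) - s^2(t)$) against the asymptotic value (36):

```python
import math

def Q(x):
    """Standard normal tail probability."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def h_g(p):
    """Entropy h(p)/p of a geometric distribution with parameter p;
    log1p keeps the (1-p)log(1-p) term accurate for very small p."""
    return -math.log2(p) - (1 - p) * math.log1p(-p) / (p * math.log(2))

def threshold(k, lo=0.1, hi=20.0):
    """Solve h_g(Q(t)) = k for t, eq. (27); the map is increasing in t."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if h_g(Q(mid)) < k:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

k, rho = 100.0, 0.6                    # arbitrary illustrative choices
t = threshold(k)
s = math.exp(-t * t / 2) / math.sqrt(2 * math.pi) / Q(t)   # s(t), eq. (6)

var_XJ = 1 + t * s - s * s             # variance of X_J conditioned on X > t
exact = (rho**2 * var_XJ + 1 - rho**2) / s**2   # exact variance, cf. (34)
asym = (1 - rho**2) / (2 * k * math.log(2))     # asymptotic value (36)
print(t, math.sqrt(2 * k * math.log(2)), exact, asym)
```

At $k = 100$ the computed threshold sits a few percent below $\sqrt{2k\ln 2}$ (the lower-order terms of $h_g$ and $Q$ have not yet vanished), and the exact variance is within a few percent of the asymptotic expression.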
Note that unlike the maximum estimator, the threshold estimator's variance admits an exact non-asymptotic expression:

$$\operatorname{Var}\hat\rho_{th} = \frac{1}{s^2(t)}\left(1 - \rho^2\,(s(t)-t)\,s(t)\right), \qquad (38)$$

where $t = Q^{-1}(h_g^{-1}(k))$.

3 Estimating multiple correlations

We proceed to address the more challenging multidimensional case where $X, Y$ are jointly Gaussian random vectors with unknown parameters. As in the scalar case, we only assume that the variances of all the entries of both $X$ and $Y$ are bounded by some known constant, hence Alice and Bob can compute the means and variances of their samples, and normalize them accordingly. Thus, without loss of generality we can assume that all the entries of $X$ and $Y$ have zero mean and unit variance. In fact, for the same reasons we can assume that Alice knows $\operatorname{Cov}X$ and Bob knows $\operatorname{Cov}Y$. As before, Alice observes the i.i.d. samples $\{X_i\}$ and can transmit $k$ bits on average to Bob, who observes the corresponding $\{Y_i\}$ samples and would like to obtain a good estimate of $\mathbb{E}YX^T$, the collection of all the correlations between the different entries of $X$ and $Y$. For simplicity, our performance measure will be the expected sum of squared estimation errors across all such correlations.

Below we discuss the two extremal setups: the case where $X$ is a scalar and $Y$ is a vector, and the opposite case where $X$ is a vector and $Y$ is a scalar. This is sufficient since estimators for the general setup where both $X, Y$ are vectors are straightforward to construct by combining the two extremal setups, hence discussing this more general setup adds no useful insight. Clearly, the scalar methods suggested in Section 2 can be directly applied to the multidimensional case, by allocating the bits between the tasks of estimating each correlation separately.
It is therefore interesting to try and find a truly multidimensional scheme that dominates the scalar method, i.e., performs at least as well uniformly for all possible values of the correlations.

3.1 X is a scalar, Y is a vector

Suppose $(X, Y)$ are jointly Gaussian, where $X \sim \mathcal{N}(0,1)$, $Y \sim \mathcal{N}(0, \Sigma_Y)$ is a $d$-dimensional (column) vector, and $\Sigma_Y$ has an all-ones diagonal and is known to Bob, who is interested in estimating the column correlation vector

$$\rho = \mathbb{E}YX = [\rho_1, \dots, \rho_d]^T. \qquad (39)$$

The natural extension of the two scalar methods of Section 2 to this case is obvious. Here we analyze the threshold method, yet the max method is as simple and would yield the same results. Alice waits until $X_i$ passes a threshold $t > 0$ and transmits the resulting index

$$J = \min\{i : X_i > t\} \qquad (40)$$

to Bob, where $t = Q^{-1}(h_g^{-1}(k))$. The estimator is then

$$\hat\rho = \frac{1}{\mathbb{E}X_J}Y_J = \frac{1}{s(t)}Y_J, \qquad (41)$$

which is an unbiased approximation of the maximum likelihood estimator (see Appendix A.7).

Theorem 3. The estimator $\hat\rho$ in (41) is unbiased with

$$\operatorname{tr}\operatorname{Cov}\hat\rho = \frac{1}{k}\left(\frac{1}{2\ln 2}\sum_{\ell=1}^d(1-\rho_\ell^2) + o(1)\right) \qquad (42)$$

where $k$ is the expected number of transmitted bits. Furthermore, $\hat\rho$ is asymptotically efficient given $(X_J, Y_J)$.

Proof. This is a simple consequence of Theorem 2, except for asymptotic efficiency, which we prove in Appendix A.1.

This method (trivially) dominates the scalar method applied separately to each of the correlations, as the latter would yield $\sum\operatorname{Var}\hat\rho_i = \frac{1}{k}\left(\frac{d}{2\ln 2}\sum_{\ell=1}^d(1-\rho_\ell^2)+o(1)\right)$.

3.2 X is a vector, Y is a scalar

Consider the setup where $(X, Y)$ are jointly Gaussian, where $Y \sim \mathcal{N}(0,1)$ and $X \sim \mathcal{N}(0,\Sigma_X)$ is a $d$-dimensional (column) vector, where $\Sigma_X$ is known to Alice and has an all-ones diagonal. Alice observes $\{X_i\}$ and transmits $k$ bits to Bob on average, who observes $\{Y_i\}$ and wishes to estimate the row vector

$$\rho = \mathbb{E}YX^T = [\rho_1, \dots, \rho_d]. \qquad (43)$$

The model can be written as (see e.g. [26])

$$Y = \rho\Sigma_X^{-1}X + \sigma Z \qquad (44)$$

where $Z \sim \mathcal{N}(0,1)$ is independent of $X$, and $\sigma^2 = 1 - \rho\Sigma_X^{-1}\rho^T$. A naive extension of the scalar method to this setup would be to allocate the bits between the correlations and apply the scalar (max or threshold) scheme $d$ times, using the fact that the model can also be written as

$$Y = \rho_\ell(X)_\ell + \sqrt{1-\rho_\ell^2}\,Z \qquad (45)$$

for any $\ell \in [d]$. One could suggest to improve performance by having Alice locally perform some general linear operation on $X$ before applying the scalar method, then having Bob perform the inverse operation. While this can indeed help for certain correlation values, it cannot improve the performance uniformly, even if the linear operation can depend on $\Sigma_X$ (hence can e.g. whiten $X$). See Appendix A.2 for details.

We now introduce an estimator that does dominate the scalar method. In fact, the mean squared error attained by this estimator is dictated by the single "best" entry of $\rho$, namely by the highest correlation only. Our method is based on replacing the scalar one-dimensional threshold by $d$-dimensional stopping sets $A_1, \dots, A_d \subset \mathbb{R}^d$. Similarly to the scalar case, Alice waits until $X_i \in A_1$ for the first time, then again until $X_i \in A_2$, and so on,² until $X_i \in A_d$. Alice then describes the resulting indices $J_1, \dots, J_d$ to Bob using an optimal variable-rate prefix-free code of expected length equal to the entropy of the associated geometric distribution (again, we neglect the excess one bit). Defining Alice's corresponding sample matrix $X_J = [X_{J_1}, \dots, X_{J_d}] \in \mathbb{R}^{d\times d}$, Alice creates some quantization $\hat X_J$ of $X_J$, as further discussed below.

²The communication cost can be slightly improved if Alice first seeks $X_i$ that lies in the union of all sets, then $X_i$ that lies in the union of the remaining sets, and so on. The difference is negligible for small $\Pr(A_1), \dots, \Pr(A_d)$.

Writing $Y_J = [Y_{J_1}, \dots, Y_{J_d}] \in \mathbb{R}^d$ for the corresponding sample vector on Bob's side, we consider the estimator

$$\hat\rho = Y_J\hat X_J^{-1}\Sigma_X. \qquad (46)$$

Note that in order to compute this estimator, Bob needs to know Alice's covariance matrix $\Sigma_X$. Recall however that we have assumed without loss of generality that this is in fact a correlation matrix, hence all its entries have absolute value at most 1. Using a uniform quantizer of $[-1,1]$ with (say) $\sqrt{k}$ bits, each entry of this matrix can be described to Bob with a resolution of roughly $2^{-\sqrt{k}}$, using only $d^2\sqrt{k}$ bits overall. It is simple to check that this results in a negligible cost both in communication and in the mean squared error, and hence we disregard this issue below.

The general task is the following. Given a specified average number of bits $k$, find some quantization scheme $X_J \to \hat X_J$ using $k_q$ bits per entry, and sets $A_1, \dots, A_d \subset \mathbb{R}^d$, that minimize

$$\mathbb{E}\left\|Y_J\hat X_J^{-1}\Sigma_X - \rho\right\|^2 \qquad (47)$$

subject to

$$\sum_{\ell=1}^d h_g(\Pr(X\in A_\ell)) + d^2\cdot k_q = k. \qquad (48)$$

Since the model (44) is linear with $d$ parameters, it is clear that we need at least $d$ different samples in order to obtain an estimator with a vanishing mean squared error. Furthermore, since Alice is given some control over the choice of $X$ via her ability to pick samples from a large random set, it makes sense to try and make the problem as "well-posed" as possible, e.g., by striving to make the matrix $X_J$ have the smallest possible condition number while satisfying the communication constraints, which essentially dictate the number of samples we can choose from. A reasonable choice is therefore to try and make $X_J$ as diagonal as possible, by waiting each time for one coordinate to be strong and the others weak.
To make the problem tractable, we apply the rationale above to a whitened version of X, which allows us to directly compute the stopping probability. Let

W = Σ_X^{-1/2} X ∼ N(0, I_d)   (49)

be the whitened version of X, and {W_i = Σ_X^{-1/2} X_i}_i the associated whitened samples. We define the stopping sets

A_ℓ^w = { w ∈ R^d : |w_ℓ| > a, |w_j| < b ∀ j ≠ ℓ },   (50)

and the corresponding time indices

J_ℓ^w = min{ i > J_{ℓ−1}^w : W_i ∈ A_ℓ^w }   (51)

for ℓ ∈ [d], with J_0^w = 0 by definition. Note that by construction, Pr(W ∈ A_ℓ^w) = 2Q(a)(1 − 2Q(b))^{d−1} for any ℓ. Alice then creates the matrix

W_J = [W_{J_1^w}, ..., W_{J_d^w}] ∈ R^{d×d},   (52)

and transmits to Bob the indices J_1^w, ..., J_d^w using

k_l = h_g( 2Q(a)(1 − 2Q(b))^{d−1} )   (53)

bits per index on average, along with Ŵ_J, a quantized version of W_J. Note that in this method, in contrast to the ones considered thus far, Alice transmits to Bob some information regarding the actual values of her observations, rather than their locations alone. The reason is that the variances of the off-diagonal entries of W_J do not vanish as k gets large. Nevertheless, we will show that a very simple quantizer using only a negligible number of bits is enough to represent W_J with sufficient accuracy for our purposes. Precisely, Alice quantizes W_J using exactly k_q bits per entry, as follows: the diagonal entries are truncated to a maximal absolute value of c = √3·a, and the double segment [−c, −a] ∪ [a, c] is uniformly quantized into 2^{k_q} levels; the off-diagonal entries, which all lie in the segment [−b, b], are uniformly quantized into 2^{k_q} levels. Given a communication constraint of k bits on average, we need to choose k_l, k_q that satisfy

d · k_l + d² · k_q = k,   (54)

and thresholds a, b that satisfy (53).
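To make the construction concrete, the stopping rule (50)-(51) can be sketched in a few lines of Python. This is our own illustration rather than code from the paper, and the values of d, a, b below are arbitrary:

```python
import numpy as np
from math import erfc

def Q(x):
    """Gaussian tail function Q(x) = Pr(N(0,1) > x)."""
    return 0.5 * erfc(x / 2**0.5)

def build_WJ(rng, d, a, b):
    """Wait for W_i in A_1^w, then A_2^w, ...; return W_J = [W_{J_1^w}, ..., W_{J_d^w}]."""
    cols, ell = [], 0
    while ell < d:
        w = rng.standard_normal(d)
        # A_ell^w: coordinate ell is large, all other coordinates are small
        others_small = all(abs(w[j]) < b for j in range(d) if j != ell)
        if abs(w[ell]) > a and others_small:
            cols.append(w)
            ell += 1
    return np.column_stack(cols)

rng = np.random.default_rng(0)
d, a, b = 3, 2.0, 0.5
WJ = build_WJ(rng, d, a, b)

# By construction W_J has a dominant diagonal: |diagonal| > a, |off-diagonal| < b.
assert all(abs(WJ[l, l]) > a for l in range(d))

# Empirical stopping probability vs. Pr(W in A_ell^w) = 2Q(a)(1 - 2Q(b))^(d-1).
N = 300_000
W = rng.standard_normal((N, d))
in_A1 = (np.abs(W[:, 0]) > a) & np.all(np.abs(W[:, 1:]) < b, axis=1)
p_emp, p_theory = in_A1.mean(), 2 * Q(a) * (1 - 2 * Q(b)) ** (d - 1)
```

With these (arbitrary) parameters the empirical and theoretical stopping probabilities agree closely, and W_J is strongly diagonally dominant, which is exactly the structure the singular-value bound of Lemma 3 exploits.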
Furthermore, for reasons explained in the proof of Theorem 4, we need both a² and (a − b)² to increase with k_l, and k_q, k_l to satisfy k_l = k(1/d − o(1)) and k_l 2^{−k_q} → 0. One such choice is

k_l = (√(k+1) − 1)² / d,  k_q = √(4 k_l / d³),   (55)

and

a = Q^{-1}( h_g^{-1}(k_l) / (2(1 − 2Q(b_0))^{d−1}) ),  b = b_0,   (56)

for some small fixed b_0. After receiving J_1^w, ..., J_d^w and Ŵ_J, Bob creates the vector

Y_J = [Y_{J_1^w}, ..., Y_{J_d^w}]   (57)

and performs estimation. The model (44) can be written as

Y = ρ Σ_X^{-1/2} W + σZ,   (58)

and thus the estimator is

ρ̂ = Y_J Ŵ_J^{-1} Σ_X^{1/2}.   (59)

Theorem 4. The estimator ρ̂ in (59) satisfies

E‖ρ̂ − ρ‖² ≤ (1/k) · (d²/(2 ln 2)) · min_{ℓ∈[d]} {1 − ρ_ℓ²} + o(1),   (60)

where k is the expected number of transmitted bits. Furthermore, ρ̂ is asymptotically efficient given (W_J, Y_J).

We prove this theorem in the next subsection.

Corollary 1. ρ̂ in (59) dominates the scalar estimator.

Proof. Allocating k/d bits per correlation and using the scalar estimator (max or threshold) results in a sum of variances

(1/k) · (d²/(2 ln 2)) · (1/d) Σ_{ℓ∈[d]} (1 − ρ_ℓ²) + o(1)   (61)

which is greater than (60) for all values of ρ_1, ..., ρ_d (except when they are all equal). One could also use a nonuniform bit allocation for the scalar estimation, in which case the average in (61) would be replaced by a weighted average, which is also always greater than the minimum.

Remark 1. Theorem 4 implies in particular that when (say) |ρ_1| = 1, the variance of our estimator decays faster than Ω(1/k). This is intuitively reasonable, since in this case Y is equal to ±X_1, hence Σ_X itself provides all the information about ρ, which can be locally computed by Alice and communicated to Bob with a variance of 2^{−Ω(k)}.
Note however that Alice cannot know that |ρ_1| = 1, and neither can Bob (though he may have good reason to suspect so), hence it is still somewhat surprising that our estimator allows this situation to nevertheless be exploited.

Remark 2. It is interesting to compare the performance of the estimator discussed in this subsection to that of the estimator in the other extremal setup of Subsection 3.1, where X is a scalar and Y is a vector. While both dominate the naive scheme of applying the scalar method d times, neither dominates the other. The difference between them, essentially, is the difference between Σ_ℓ (1 − ρ_ℓ²)/d and d · min_ℓ {1 − ρ_ℓ²}. For example, the former outperforms the latter if all the correlations are equal, whereas the latter outperforms the former if any of the correlations is ±1.

3.3 Proof of Theorem 4

Consider the estimator

ρ̂_0 = Y_J W_J^{-1} Σ_X^{1/2}.   (62)

Note that this estimator uses the non-quantized W_J, which cannot be described to Bob with a finite number of bits, and hence is unrealizable. Nevertheless, as the following lemma shows, the loss incurred by employing ρ̂ instead, which uses the quantized W_J, is small.

Lemma 1. For any a, b such that a > d(b + 1),

E‖ρ̂ − ρ‖² ≤ E‖ρ̂_0 − ρ‖² + (2d)^6 ( e^{−a²/2} + 2^{−k_q} )   (63)

where k_q bits are used to represent each entry in Ŵ_J.

Proof. See Appendix A.3.

The estimator (62) can be written as

ρ̂_0 = ρ + σ Z_J W_J^{-1} Σ_X^{1/2},   (64)

where Z_J = [Z_{J_1^w}, ..., Z_{J_d^w}] ∼ N(0, I_d) is independent of W_J. It follows that ρ̂_0 is unbiased with

Cov ρ̂_0 = σ² Σ_X^{1/2} E[(W_J W_J^T)^{-1}] Σ_X^{1/2}.   (65)

In view of Lemma 1, it is sufficient to analyze the performance of the unrealizable estimator ρ̂_0. For the purpose of analyzing ρ̂_0 only, we can assume that Bob is given the value of W_J for free, and the only cost is in transmitting the indices J_1^w, ..., J_d^w.
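Before turning to the analysis, it is instructive to check the unbiasedness of ρ̂_0 numerically. The following Monte Carlo sketch is our own illustration (the values of d, Σ_X, ρ, a, b are arbitrary choices): it draws W_J via the stopping rule, generates Bob's samples according to the model Y = ρ Σ_X^{-1/2} W + σZ, and averages ρ̂_0 = Y_J W_J^{-1} Σ_X^{1/2} over many trials.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 2
Sigma_X = np.array([[1.0, 0.2],
                    [0.2, 1.0]])           # a correlation matrix, as assumed in the text
rho = np.array([0.5, 0.3])                 # true correlation vector
evals, evecs = np.linalg.eigh(Sigma_X)
Sx_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T   # symmetric square root of Sigma_X
sigma2 = 1.0 - rho @ np.linalg.inv(Sigma_X) @ rho     # noise variance sigma^2 from (44)
a, b = 2.0, 0.5

def draw_WJ():
    """One run of the stopping rule: columns are the samples W_{J_1^w}, ..., W_{J_d^w}."""
    cols, ell = [], 0
    while ell < d:
        w = rng.standard_normal(d)
        if abs(w[ell]) > a and all(abs(w[j]) < b for j in range(d) if j != ell):
            cols.append(w)
            ell += 1
    return np.column_stack(cols)

trials, acc = 3000, np.zeros(d)
Sx_half_inv = np.linalg.inv(Sx_half)
for _ in range(trials):
    WJ = draw_WJ()
    # Bob's corresponding samples: Y = rho Sigma_X^{-1/2} W + sigma Z
    YJ = rho @ Sx_half_inv @ WJ + np.sqrt(sigma2) * rng.standard_normal(d)
    acc += YJ @ np.linalg.inv(WJ) @ Sx_half
rho_hat_mean = acc / trials
```

The averaged estimate lands within Monte Carlo error of ρ, consistent with the decomposition ρ̂_0 = ρ + σ Z_J W_J^{-1} Σ_X^{1/2} having mean ρ.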
Given k_l bits for the representation of each of the locations, our general goal is to find a, b that minimize

tr Cov ρ̂_0   (66)

subject to

h_g( 2Q(a)(1 − 2Q(b))^{d−1} ) = k_l.   (67)

Before proceeding to the analysis of ρ̂_0, we need the following two technical lemmas.

Lemma 2. Let M be a square random matrix with independent entries, where the diagonal entries are i.i.d. with one distribution, and the off-diagonal entries are i.i.d. with another, symmetric distribution. Then E[M], E[MM^T] and E[(MM^T)^{-1}] are scalar multiples of the identity matrix.

Proof. The claim for E[M] and E[MM^T] is trivial. For E[(MM^T)^{-1}] see Appendix A.4.

Lemma 3 (Johnson [27]). For any n-by-m matrix B = (b_ij), n ≤ m, the smallest singular value is bounded below by

min_{i∈[n]} { |b_ii| − (1/2)( Σ_{j∈[n]\i} |b_ij| + Σ_{j∈[n]\i} |b_ji| ) }   (68)

The following lemma provides a simplified expression and bounds for E[(W_J W_J^T)^{-1}] that will aid in proving Proposition 1 below.

Lemma 4. The following claims hold for

α = d^{-1} tr E[W_J W_J^T],  β = d^{-1} tr E[(W_J W_J^T)^{-1}].   (69)

(i) E[W_J W_J^T] = α I_d
(ii) E[(W_J W_J^T)^{-1}] = β I_d
(iii) For any a, b such that a > (d−1)b,

(a² + d + 1)^{-1} ≤ α^{-1} ≤ β ≤ (a − (d−1)b)^{-2}.   (70)

Proof. Recall that the vectors W_i are i.i.d. across the time index i, and that the entries of each one are i.i.d. with a symmetric distribution. Taking into account the rectangular structure of the stopping sets A_ℓ^w, we see that W_J has independent entries, where the diagonal elements have one distribution and the off-diagonal elements follow another, symmetric distribution. Thus, the matrix W_J satisfies the conditions of Lemma 2. This proves claim (i) (which also holds trivially by construction) and claim (ii). We proceed to prove claim (iii). Denoting the singular values of W_J by √λ_1 ≥ ...
≥ √λ_d, we have that

tr (W_J W_J^T)^{-1} = Σ_{ℓ∈[d]} λ_ℓ^{-1} ≤ d λ_d^{-1}.   (71)

By construction, the diagonal entries of W_J are larger than a in absolute value, and the off-diagonal entries are smaller than b in absolute value. Therefore Lemma 3 yields

√λ_d ≥ a − (d−1)b.   (72)

We thus have

βd = tr(β I_d) = tr E[(W_J W_J^T)^{-1}] ≤ d (a − (d−1)b)^{-2},   (73)

which establishes the rightmost inequality in claim (iii). The middle inequality holds since

βd = Σ_{ℓ∈[d]} E[1/λ_ℓ] ≥ Σ_{ℓ∈[d]} 1/E[λ_ℓ] ≥ d² / Σ_{ℓ∈[d]} E[λ_ℓ]   (74)
= d² / tr E[W_J W_J^T] = d² / (dα),   (75)

where the two inequalities follow from Jensen's inequality applied to the function 1/x. Note that the rows and columns of W_J have the same distribution. Therefore

α = E‖W_{J_1^w}‖²   (76)
= E[(W)_1² | |(W)_1| > a] + (d−1) E[(W)_2² | |(W)_2| < b]   (77)
= 1 + a s(a) + (d−1) E[(W)_2² | |(W)_2| < b]   (78)
≤ 1 + a(a + a^{-1}) + (d−1) · 1   (79)
= a² + 1 + d,   (80)

which completes the proof.

Lemma 4 and (65) imply that the optimization problem (66)-(67) can be written as

minimize β   (81)
subject to h_g( 2Q(a)(1 − 2Q(b))^{d−1} ) = k_l.   (82)

Note that both Q(·) and h_g(·) are monotonically decreasing. Therefore from (82) it is clear that increasing k_l means increasing a and/or decreasing b. From (70) we get that β decreases as a increases and gets farther away from b. We conclude therefore that a reasonable approximation to the solution of the optimization problem above, for large k_l, is as given in (56). Note that the proposed approximate solution satisfies the constraint exactly.

Proposition 1. The estimator (62) is unbiased and, for the choice of a, b given in (56), it satisfies

tr Cov ρ̂_0 ≤ (1/k_l) · (d/(2 ln 2)) · min_{ℓ∈[d]} {1 − ρ_ℓ²} + o(1)   (83)

where k_l is the expected number of bits used to describe each of the locations J_1^w, ..., J_d^w.
Furthermore, ρ̂_0 is asymptotically efficient given (W_J, Y_J).

Proof. In light of Lemma 4, (65) can be written as

Cov ρ̂_0 = β σ² Σ_X.   (84)

Using (10)-(11) with µ = ρ Σ_X^{-1/2} W_J and Σ = σ² I_d, we get that the Fisher information matrix of (W_J, Y_J) is

I_{W_J Y_J} = (1/σ²) Σ_X^{-1/2} E[W_J W_J^T] Σ_X^{-1/2} + (2d/σ⁴) Σ_X^{-1} ρ^T ρ Σ_X^{-1}   (85)
= (α/σ²) Σ_X^{-1} + (2d/σ⁴) Σ_X^{-1} ρ^T ρ Σ_X^{-1},   (86)

and using the Sherman–Morrison formula (e.g. [28]) we get

I_{W_J Y_J}^{-1} = (σ²/α) ( Σ_X − (2d / (ασ² + 2d(1 − σ²))) ρ^T ρ ).   (87)

We take a, b of (56). Note that b is fixed and that a increases with k_l. From Lemma 4 we have that

α^{-1} = a^{-2}(1 + o(1)),  β = a^{-2}(1 + o(1)),   (88)

which implies

Cov ρ̂_0 = (σ²/a²)(1 + o(1)) Σ_X   (89)
I_{W_J Y_J}^{-1} = (σ²/a²)(1 + o(1)) Σ_X   (90)

and thus ρ̂_0 is asymptotically efficient. For large k_l we have

k_l = h_g( 2Q(a)(1 − 2Q(b))^{d−1} )   (91)
= −log( 2Q(a)(1 − 2Q(b))^{d−1} )(1 + o(1))   (92)
= −log( Q(a) )(1 + o(1))   (93)
= (a²/(2 ln 2))(1 + o(1)),   (94)

and thus

tr Cov ρ̂_0 = dβσ² = (dσ²/a²)(1 + o(1))   (96)
= (1/k_l)(dσ²/(2 ln 2)) + o(1).   (97)

It remains to show that

σ² ≤ min_ℓ {1 − ρ_ℓ²}.   (98)

Note that σ² = Var(Y|X) is the MMSE of estimating Y from X (see e.g. [23]). Therefore it is not greater than 1 − ρ_ℓ² = Var(Y|(X)_ℓ), which is the MMSE of estimating Y from the ℓ-th coordinate only.

Theorem 4 now follows from Lemma 1 and Proposition 1.

4 Non-Gaussian Families

In this section, we move beyond the Gaussian setup and consider the problem of distributed correlation estimation in more general families of distributions, based on our Gaussian constructions. For brevity of exposition, we limit our discussion to the scalar case; the results can be extended in an obvious way to the vector case.
We note that, in contrast to the Gaussian setting, the marginal distribution of X or Y in other families of distributions may depend on the correlation, in which case Alice or Bob could use their (unlimited) local measurements to improve their inference (and in some cases even learn ρ exactly without any communication). For example, if X is uniformly distributed over the interval [−√3, √3], and Y = ρX + √(1−ρ²) Z where Z is uniformly distributed over the discrete set {−1, 1}, then it is clear that the distribution of Y, which can be determined with arbitrary accuracy by Bob, determines ρ up to its sign, reducing our problem to a binary hypothesis testing one. Such scenarios render our method useless, or, at the very least, degenerate.

Our interest, therefore, is in families of distributions where the marginals reveal little or nothing about the correlation. Specifically, we say that a family F of distributions on (X, Y) is correlation-hiding if each pair of marginals can be associated with an infinite number of possible correlations; namely, for any two marginals p_X and p_Y that are possible for some member of F, there exists a countably infinite set F′ ⊆ F of joint distributions with marginals p_X and p_Y, and with correlation coefficients that are all distinct.

Below, we discuss two types of correlation-hiding families. The first is the family of all possible distributions (subject only to mild moment constraints), which is obviously correlation-hiding. We show that for this family, the Gaussian performance can be uniformly attained. The idea is very simple: we perform "Gaussianization" of the samples using the Central Limit Theorem (CLT), and then apply the Gaussian estimators; showing that this indeed works, however, is somewhat technically involved.
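As a quick illustration of the Gaussianization step (our own sketch; the choice of marginals, ρ, the block size m, and the number of blocks are all arbitrary), block-averaging preserves the correlation exactly while driving the pair toward joint Gaussianity:

```python
import numpy as np

rng = np.random.default_rng(2)
rho, m, n = 0.6, 64, 20_000        # correlation, block size, number of blocks
# A deliberately non-Gaussian pair with correlation rho:
# X uniform on [-sqrt(3), sqrt(3)], Z a fair random sign; both have zero mean, unit variance.
X = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(n, m))
Z = rng.choice([-1.0, 1.0], size=(n, m))
Y = rho * X + np.sqrt(1.0 - rho**2) * Z

# Block averages: Xbar_i = (1/sqrt(m)) sum_{j in S_i} X_j, likewise for Y.
# They inherit the correlation of (X, Y) exactly.
Xbar = X.sum(axis=1) / np.sqrt(m)
Ybar = Y.sum(axis=1) / np.sqrt(m)

rho_emp = np.mean(Xbar * Ybar)     # zero means and unit variances by construction
kurt = np.mean(Xbar**4)            # ~3 for a Gaussian; uniform X alone would give 1.8
```

The empirical correlation of the averaged pairs matches ρ, and the fourth moment of X̄ is already close to the Gaussian value 3 at m = 64; the threshold scheme is then applied to the averaged sequence.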
The second type of families that we consider are ones where p_X is known, and where Y = αX + Z for some unknown coefficient α and unknown independent noise Z. We show that such families are correlation-hiding, and that we can sometimes (depending on p_X) obtain a variance that decays much faster with k than the Gaussian one.

4.1 Unknown Distributions

In this subsection, we consider the case where the joint distribution of X and Y is completely unknown, subject only to mild moment conditions. We show how the threshold method of Subsection 2.2 can be extended to this setup, using the CLT, to yield the same performance guarantees. The basic idea is to use the unlimited number of samples in order to create Gaussian r.v.s with the same correlation, by averaging over blocks of samples. Due to the CLT, it is intuitively clear that this approach works if Alice and Bob use infinite-sized blocks. This is however impractical, and the main technical challenge is to show that using finite but large enough blocks, i.e., changing the order of limits, still works.

Let (X, Y) be drawn from the family

F = { p_XY : E[X²], E[Y²] < u, E[Y⁴] < ∞ },   (99)

where u is some known constant. Again, since we assume that local measurements are essentially unlimited, and the second moments have known upper bounds, we can assume without loss of generality that E[X] = E[Y] = 0 and E[X²] = E[Y²] = 1. The following claim is immediate from the fact that F contains in particular the Gaussian distributions.

Corollary 2. The family F in (99) is correlation-hiding.

Let us now proceed to describe our estimator. Alice and Bob first locally sum over their measurements to create the new i.i.d. sequences {X̄_i}_i, {Ȳ_i}_i, given by

X̄_i = (1/√m) Σ_{j∈S_i} X_j,  Ȳ_i = (1/√m) Σ_{j∈S_i} Y_j,   (100)

where the S_i's are disjoint index sets of size m. For brevity, we suppress the dependence of these new r.v.s on m.
The sequence of pairs {(X̄_i, Ȳ_i)}_i is clearly i.i.d. Denoting by (X̄, Ȳ) a generic pair in this sequence, the correlation between X̄ and Ȳ is clearly the same as the correlation between X and Y. Alice and Bob can therefore apply the threshold method to the sequence {(X̄_i, Ȳ_i)}_i in order to estimate the original ρ. We now show that the performance of this estimator approaches the Gaussian performance as m → ∞. Given a communication constraint of k bits, the threshold t is chosen (as in the Gaussian case) such that h_g(Q(t)) = k. We denote

J̄ = min{ i : X̄_i > t }   (101)

and the estimator

ρ̂_th^(m) = Ȳ_J̄ / s(t),   (102)

where s(t) is given in (6). Note that we cannot normalize by E[X̄_J̄] to get a strictly unbiased estimator, since we assume unknown distributions and thus E[X̄_J̄] is not known for finite m. The expected number of bits needed to describe J̄ is

k^(m) = h_g( Pr(X̄ > t) ).   (103)

Remark 3. Note that practical scenarios would require the choice of some fixed m. Therefore, in cases where the support of X is finite, we might get Pr(X̄ > t) = 0, which means that Alice waits forever and the estimator is undefined. Therefore, while the distribution of (X, Y) need not be known in general, such a practical scenario requires some knowledge regarding the support of X, in the form of a number x such that Pr(X_i > x) > 0 (which must exist since E[X] = 0). Then we can take m > t²/x² to ensure Pr(X̄ > t) > 0.

Theorem 5. Let t = Q^{-1}(h_g^{-1}(k)). Then for the family F in (99) it holds that lim_{m→∞} k^(m) = k and

lim_{m→∞} E(ρ̂_th^(m) − ρ)² = (1/k) (1 − ρ²)/(2 ln 2) + o(1)   (104)

Proof. Due to the CLT we have for any fixed t > 0 that

lim_{m→∞} Pr(X̄ > t) = Q(t)   (105)

and thus, since h_g is smooth, the communication constraint is asymptotically satisfied.
We have

E(ρ̂_th^(m) − ρ)² = E[Ȳ_J̄²]/s²(t) − 2ρ E[Ȳ_J̄]/s(t) + ρ²,   (106)

and thus it suffices to show that the first two moments of Ȳ_J̄ converge to their values under the Gaussian distribution. Denoting by (X^N, Y^N) and Y_J^N the associated r.v.s under a Gaussian distribution, it is enough to show that Ȳ_J̄ converges in distribution to Y_J^N as m → ∞, and that Ȳ_J̄² is uniformly integrable [29]. To show convergence in distribution, observe that

lim_{m→∞} Pr(Ȳ_J̄ > y) = lim_{m→∞} Pr(Ȳ > y | X̄ > t)   (107)
= lim_{m→∞} Pr(Ȳ > y, X̄ > t) / Pr(X̄ > t)   (108)
= lim_{m→∞} Pr(Ȳ > y, X̄ > t) / lim_{m→∞} Pr(X̄ > t)   (109)
= Pr(Y^N > y, X^N > t) / Pr(X^N > t)   (110)
= Pr(Y^N > y | X^N > t)   (111)
= Pr(Y_J^N > y)   (112)

where (109) holds since the denominator is not zero, and (110) holds by virtue of the CLT. It follows from (105) that there exist some m_0 and c > 0 (e.g., c = Q(t)/2) such that

Pr(X̄ > t) > c  ∀ m ≥ m_0,   (113)

and therefore we assume without loss of generality that m ≥ m_0. To prove uniform integrability of Ȳ_J̄², it suffices to show that sup_m E|Ȳ_J̄|^γ < ∞ for some γ > 2 [29]. For simplicity, we set γ = 4:

E|Ȳ_J̄|⁴ = E( |Ȳ|⁴ | X̄ > t )   (114)
≤ E|Ȳ|⁴ / Pr(X̄ > t)   (115)
= E[ ((1/√m) Σ_j Y_j)⁴ ] / Pr(X̄ > t)   (116)
= ( (1/m) E[Y⁴] + 3 ((m−1)/m) (E[Y²])² ) / Pr(X̄ > t)   (117)
< (1/c)( E[Y⁴]/m + 3 ),   (118)

which is finite since E[Y⁴] < ∞.

Example 1 (Doubly symmetric binary r.v.s). Consider the family of distributions where X ∼ Bernoulli(1/2) and Y = X ⊕ Z, where Z ∼ Bernoulli(p) is independent of X, p ∈ [0, 1] is unknown, and ⊕ is the binary XOR operation. The associated normalized versions of these r.v.s (after removing the mean) are the zero-mean, unit-variance r.v.s X̄ and Ȳ, with correlation ρ = 1 − 2p.
Our unbiased estimator can therefore attain a variance of

(1 − (1−2p)²)/(2k ln 2) = 2p(1−p)/(k ln 2)

for the estimation of ρ, which corresponds to a variance of p(1−p)/(2k ln 2) for the estimation of p. This can be juxtaposed with the straightforward approach of simply sending X_1, ..., X_k to Bob and applying the (efficient) estimator p̂ = (1/k) Σ_{j=1}^k X_j ⊕ Y_j. This unbiased estimator has a variance of p(1−p)/k, which is, interestingly, slightly worse than what we got using the Gaussian approach. It may be possible to improve the former by using lossy compression, but we do not explore this direction here.

Remark 4. Estimating the joint probability mass function of general discrete distributions on X, Y can similarly be cast as a correlation estimation problem. However, the gain observed in the binary case above does not carry over to the general case. This is however not unexpected, since our estimator does not assume any bound on the cardinality of X and Y.

4.2 Additive Noise Families

In this subsection, we consider a more restricted model where the distribution p_X of X is fixed (but not necessarily Gaussian) and has bounded variance, and where

Y = αX + Z   (119)

for some unknown bounded constant α, where Z is an arbitrary r.v. with bounded variance that is independent³ of X. Let us denote this family of distributions by F(p_X). First, we note:

Corollary 3. F(p_X) is correlation-hiding for any p_X.

Proof. See Appendix A.6.

We now show that the threshold estimator proposed for the Gaussian case applies to F(p_X) as well, and that its performance can be better or worse, depending on p_X.
Specifically, we show that the O(1/k) decay of the variance with the number of bits is not fundamental, as for some (heavier-tailed) choices of p_X we obtain a behavior of O(1/k²) using the same threshold estimator, and 2^{−Ω(k)} using a slightly modified estimator. The latter is essentially the best possible using our approach, since we utilize O(2^k) samples (with high probability), which corresponds to a variance of Ω(2^{−k}) even in the centralized case.

As in the previous sections, Alice and Bob can normalize their measurements locally. Therefore, we can assume without loss of generality that (119) can be written as

Y = ρX + √(1 − ρ²) Z,   (120)

where X and Z are independent, zero-mean unit-variance r.v.s, and the correlation is ρ = E[XY]. We assume that p_Z is arbitrary and unknown, and that p_X is arbitrary but known.

³It is in fact sufficient for our purposes to assume only that E(Z|X) and Var(Z|X) do not depend on X.

Applying the threshold method of Subsection 2.2 to this non-Gaussian setup, we denote as usual by J = min{i : X_i > t} the first index to pass the threshold t, where t is chosen such that h_g(Pr(X > t)) = k. Our estimator is

ρ̂_th = Y_J / E[X_J].   (121)

The following claim is immediate.

Corollary 4. ρ̂_th is unbiased, and

Var ρ̂_th = ( ρ² Var(X|X>t) + 1 − ρ² ) / ( E(X|X>t) )².   (122)

Let us compute (122) for some specific choices of p_X.

Example 2 (Laplace Distribution). Let p_X be a zero-mean, unit-variance Laplace distribution, hence Pr(X > x) = ½ e^{−√2 x} for x > 0. Thus,

E(X|X>t) = t + 1/√2,  Var(X|X>t) = 1/2,   (123)

and

k = h_g( ½ e^{−√2 t} )   (124)
= −log( ½ e^{−√2 t} )(1 + o(1))   (125)
= √2 t log(e) (1 + o(1)).   (126)

Therefore (122) becomes

Var ρ̂_th = (1/k²)( (2 − ρ²)/(ln 2)² + o(1) ),   (127)

which yields a variance of O(1/k²), in contrast to the slower O(1/k) attained in the Gaussian case.
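The conditional moments in (123) are just the memorylessness of the exponential tail of the Laplace law: given X > t, the overshoot X − t is exponential with mean 1/√2 and variance 1/2. A quick numerical check (our own; the threshold t is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
t = 1.5
# Zero-mean, unit-variance Laplace: scale 1/sqrt(2), so Pr(X > x) = 0.5 exp(-sqrt(2) x).
X = rng.laplace(loc=0.0, scale=1.0 / np.sqrt(2.0), size=2_000_000)
tail = X[X > t]                                # condition on the threshold event {X > t}
m_emp, v_emp = tail.mean(), tail.var()
m_th, v_th = t + 1.0 / np.sqrt(2.0), 0.5       # eq. (123)
```

Because E(X | X > t) grows like t while Var(X | X > t) stays constant, the ratio in (122) decays like 1/t², which is what produces the O(1/k²) behavior above.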
Example 3 (Pareto Distribution). Motivated by the Laplace example, which indicates that a heavier tail of X may yield a faster decay of Var ρ̂_th, we investigate the heaviest tail possible with finite variance. Suppose that p_X is the (double-sided, zero-mean) Pareto distribution, i.e.,

Pr(X > x) = Pr(X < −x) = ½ (x_0/x)^α   (128)

for any x > x_0, where α > 2 and x_0 > 0 is set such that Var X = 1. Then for any t > x_0,

E(X|X>t) = αt/(α−1),  Var(X|X>t) = αt²/((α−1)²(α−2)),   (129)

and (122) becomes

Var ρ̂_th = ρ²/(α(α−2)) + O(1/t²).   (130)

Thus, the variance of our threshold estimator does not vanish with the number of bits. This flaw can nevertheless be fixed in a very strong way, as we show next. Before we proceed, we note that for p_X with a tail of the form Pr(X > x) ∝ e^{−x^{1/m}}, i.e., in between Pareto and Laplace, the threshold estimator yields Var ρ̂_th = O(1/k²) for any natural m. Also, tails that decay faster than Gaussian may yield worse performance, e.g., the tail e^{−x⁴} yields Var ρ̂_th = O(1/√k).

Getting back to the double-sided Pareto distribution, recall that in the Gaussian case it was shown that describing the value of X_J does not improve estimation performance. This was due to the fact that for the Gaussian family, Var X_J = Var(X|X>t) → 0. This is however not true in general; in fact, the Pareto distribution is an extreme case in which Var(X|X>t) → ∞. Therefore, providing some information regarding the value of X_J, at the expense of the number of bits used to describe the index J, might improve performance. With that in mind, we consider the estimator

ρ̂_th-q = Y_J / X̂_J   (131)

that allocates k_l bits to describe J, and k_q bits to describe the value X̂_J, where k_l + k_q = k. We apply the following simple quantizer.
For some u > t we divide the region [t, u] into 2^{k_q} equal segments of length ∆ = 2^{−k_q}(u − t). For x > u we set x̂ = u. In the following, we show that this estimator attains a variance that decays exponentially fast with k.

Proposition 2. Consider the family F(p_X) where p_X is the double-sided Pareto distribution. Then the estimator ρ̂_th-q in (131) satisfies

E(ρ̂_th-q − ρ)² ≤ (1 + ρ²) · 2^{−(2/α)·((α−2)/(α−1))·k(1 − o(1))},   (132)

where k is the average number of transmitted bits.

Proof. See Appendix A.5.

5 Conclusions

We have discussed the problem of estimating the correlations between remotely observed random vectors with unlimited local samples, under one-way communication constraints. For the case where the vectors are jointly Gaussian, we provided simple constructive unbiased estimators for the correlations; our estimators attain the best known non-constructive Zhang-Berger upper bound on the variance in the scalar case, and use the local correlations to uniformly improve performance in the vector case, where the Zhang-Berger approach seems inapplicable. Loosely speaking, our approach is based on Alice scanning her local observations and sending the indices of suitably "large" samples that induce a good signal-to-noise ratio for the estimation at Bob, who uses the corresponding samples on his end. We then showed that, using the CLT, this approach can be applied to the case of estimating correlations for completely unknown distributions, with the exact same variance guarantees.
While the Gaussian approach yields a variance that is inversely proportional to the expected number of transmitted bits, we showed that for joint distributions generated via unknown fading channels with unknown additive noise, whose correlations cannot be estimated locally, a slightly modified estimator attains a variance decaying exponentially fast with the expected number of transmitted bits. It remains interesting to try and obtain lower bounds on the variance as a function of the number of bits and the richness of the family of distributions under consideration. We conjecture that the inversely proportional behavior of our Gaussian estimator is order-wise optimal in the Gaussian case, hence also for the case of unknown distributions.

A Appendix

A.1 Proof of Theorem 3

The model can be written as (see e.g. [26])

Y = ρ X + Σ^{1/2} Z,   (133)

where Z ∼ N(0, I_d) is independent of X, and Σ = Σ_Y − ρρ^T. We have ρ̂ = (ρ X_J + Σ^{1/2} Z_J)/E[X_J]. Therefore E[ρ̂] = ρ and

Cov ρ̂ = (1/(E X_J)²) ( Σ + Var(X_J) ρρ^T ).   (134)

Using (10)-(11) with µ = ρ X_J and Σ = Σ_Y − ρρ^T, we get that the Fisher information matrix pertaining to (X_J, Y_J) is

I_{X_J Y_J} = ( E[X_J²] + ρ^T Σ^{-1} ρ ) Σ^{-1} + Σ^{-1} ρ ρ^T Σ^{-1},   (135)

and applying the Sherman–Morrison formula (e.g. [28]) yields

I_{X_J Y_J}^{-1} = (1/(E[X_J²] + ρ^T Σ^{-1} ρ)) ( Σ − ρρ^T / (E[X_J²] + 2ρ^T Σ^{-1} ρ) ).   (136)

Using the arguments of Theorem 2 yields that both Cov ρ̂ and I_{X_J Y_J}^{-1} are (1/t²)(1 + o(1))Σ, and thus ρ̂ is asymptotically efficient. Theorem 2 also implies that k = (t²/(2 ln 2))(1 + o(1)), and noting that tr Σ = tr(Σ_Y − ρρ^T) = d − ‖ρ‖² concludes the proof.

A.2 The scalar method with linear transformations

In this subsection we show that any method based on d scalar transmissions cannot uniformly beat the scheme of applying the basic scalar method d times.
In this sense, the joint method proposed in Theorem 4 is superior, because it does uniformly beat the simple scalar scheme. Specifically, let M be some invertible d × d matrix known to both Alice and Bob, and let X̃ = MX. Suppose Alice and Bob apply the scalar method separately to obtain an estimator ρ̂_M for the correlation vector ρ_M = E[Y X̃^T], and then use the estimator M^{-1} ρ̂_M to estimate ρ. As it turns out, this family of estimators does not dominate the naive approach of estimating each correlation separately (i.e., M = I_d).

Proposition 3. For any two invertible d × d matrices M_1, M_2 (that can arbitrarily depend on Σ_X, and are known to both Alice and Bob), M_1^{-1} ρ̂_{M_1} does not dominate M_2^{-1} ρ̂_{M_2}.

Proof. We need to show that any linear transformation applied to X, followed by the scalar method, cannot be uniformly better than the scalar method itself. It suffices to show this for the two-dimensional case. Alice creates the following two scalar sequences:

U_i = [a_1, b_1] X_i, for i = 1, ..., n_1, and   (137)
V_i = [b_2, a_2] X_i, for i = n_1 + 1, ..., n_1 + n_2,   (138)

and allocates k_1 bits for U and k_2 bits for V (note that we can use either the max or the threshold method, and that n_1, n_2 can be arbitrarily large). One special case of the above is the "successive refinement" approach described in the introduction (for b_1 = 0), and another special case is the naive scalar method (for a_1 = a_2 = 1, b_1 = b_2 = 0 and k_1 = k_2 = k/2). Without loss of generality, we assume a_1, b_1 are such that E[U²] = 1, and a_2, b_2 are such that E[V²] = 1. We denote

α_1 = E[Y_i U_i] = a_1 ρ_1 + b_1 ρ_2,   (139)
α_2 = E[Y_i V_i] = b_2 ρ_1 + a_2 ρ_2,   (140)

and α = [α_1, α_2]^T. We also denote ρ = [ρ_1, ρ_2]^T and

M = [ a_1  b_1 ; b_2  a_2 ],   (141)

and therefore we have α = Mρ.
The best Bob can do (recall that U, V are independent) is to estimate α_1 using U and α_2 using V, to obtain

Var α̂_1 = (1/k_1)(1 − α_1²)/(2 ln 2) + o(1)   (142)
Var α̂_2 = (1/k_2)(1 − α_2²)/(2 ln 2) + o(1)   (143)

and then take ρ̂_trn = M^{-1} α̂. The resulting sum of variances (note Cov(α̂_1, α̂_2) = 0) is

tr Cov ρ̂_trn = tr( M^{-1} Cov(α̂) M^{-T} )   (144)
= tr( M^{-1} diag(Var α̂_1, Var α̂_2) M^{-T} )   (145)
= ( (a_2² + b_2²) Var α̂_1 + (a_1² + b_1²) Var α̂_2 ) / (a_1 a_2 − b_1 b_2)²   (146)

Applying the simple scalar method twice yields

tr Cov ρ̂_scl = (1/k)( (k/k'_1)(1 − ρ_1²)/(2 ln 2) + (k/k'_2)(1 − ρ_2²)/(2 ln 2) ) + o(1),   (147)

with k'_1 + k'_2 = k_1 + k_2 = k. We want to show that tr Cov ρ̂_trn cannot be uniformly better than tr Cov ρ̂_scl, namely, that for any choice of a_1, b_1, a_2, b_2, k_1, k_2 (that do not depend on ρ_1, ρ_2) we can find ρ_1, ρ_2 such that tr Cov ρ̂_scl < tr Cov ρ̂_trn. This is easy, because we can always take ρ_1, ρ_2 ∈ {−1, 1} (or arbitrarily close to ±1), which makes tr Cov ρ̂_scl ≈ 0 and tr Cov ρ̂_trn ≠ 0. If tr Cov ρ̂_trn = 0 (i.e., α_1² = α_2² = 1), we can flip the sign of ρ_2 to obtain either α_1² ≠ 1 or α_2² ≠ 1.

A.3 Proof of Lemma 1

Writing W = W_J and Ŵ = Ŵ_J, we have

ρ̂_0 = Y_J W^{-1} Σ_X^{1/2} = ρ + σ Z_J W^{-1} Σ_X^{1/2}   (148)
ρ̂ = Y_J Ŵ^{-1} Σ_X^{1/2} = ρ Σ_X^{-1/2} W Ŵ^{-1} Σ_X^{1/2} + σ Z_J Ŵ^{-1} Σ_X^{1/2}   (149)
ρ̂_0 − ρ = σ Z_J W^{-1} Σ_X^{1/2}   (150)
ρ̂ − ρ̂_0 = ρ Σ_X^{-1/2}( W Ŵ^{-1} − I ) Σ_X^{1/2} + σ Z_J ( Ŵ^{-1} − W^{-1} ) Σ_X^{1/2}.   (151)

Recall that Z_J is a row vector ∼ N(0, I_d), independent of W. It follows that

E‖ρ̂ − ρ̂_0‖² = E‖ρ Σ_X^{-1/2}( W Ŵ^{-1} − I ) Σ_X^{1/2}‖²   (152)
+ σ² E tr( (Ŵ^{-1} − W^{-1}) Σ_X (Ŵ^{-T} − W^{-T}) ),   (153)

and

E[(ρ̂ − ρ̂_0)(ρ̂_0 − ρ)^T] = σ² E tr( (Ŵ^{-1} − W^{-1}) Σ_X W^{-T} ).
(154) Therefore E k ˆ ρ − ρ k 2 = E k ( ˆ ρ − ˆ ρ 0 ) + ( ˆ ρ 0 − ρ ) k 2 (155) = E k ˆ ρ 0 − ρ k 2 + E k ˆ ρ − ˆ ρ 0 k 2 + 2 E ( ˆ ρ − ˆ ρ 0 )( ˆ ρ 0 − ρ ) T (156) = E k ˆ ρ 0 − ρ k 2 + E k ρ Σ − 1 2 X ( W ˆ W − 1 − I )Σ 1 2 X k 2 (157) 28 + σ 2 E tr ( ˆ W − 1 − W − 1 )Σ X ( ˆ W − T + W − T ) , (158) and thus E k ˆ ρ − ρ k 2 − E k ˆ ρ 0 − ρ k 2 (159) = E k ρ Σ − 1 2 X ( W − ˆ W ) ˆ W − 1 Σ 1 2 X k 2 (160) + σ 2 E tr ( W − ˆ W ) ˆ W − 1 Σ X ( ˆ W − T + W − T ) W − 1 . (161) Let us upp er bound the t wo terms separately . Recall that by (73) we ha ve k W − 1 k 2 F ≤ d/ ( a − ( d − 1) b ) 2 , which also holds for ˆ W − 1 . F urthermor e, the assumption that a > d ( b + 1) implies that a − ( d − 1) b is low er bounded by either a/ d o r d . First term (16 0): The F r ob enius norm is sub-multiplicativ e (see e.g. [3 0]), and ther e fo re E k ρ Σ − 1 2 X ( W − ˆ W ) ˆ W − 1 Σ 1 2 X k 2 F (162) ≤ k ρ Σ − 1 2 X k 2 F k Σ 1 2 X k 2 F E k W − ˆ W k 2 F k ˆ W − 1 k 2 F (163) ≤ d k ρ Σ − 1 2 X k 2 F k Σ 1 2 X k 2 F ( a − ( d − 1) b ) 2 E k W − ˆ W k 2 F (164) ≤ d k ρ Σ − 1 2 X k 2 F k Σ 1 2 X k 2 F ( a/d ) 2 E k W − ˆ W k 2 F (165) = d 4 (1 − σ 2 ) a 2 E k W − ˆ W k 2 F (166) where for (166) we use d the fact that k ρ Σ − 1 2 X k 2 = ρ Σ − 1 X ρ T = 1 − σ 2 , and k Σ 1 2 X k 2 F = tr Σ X = d . Second term (161): F or a ny tw o d × d matrices A, B , it ca n b e easily shown that tr AB T ≤ d 2 k A k F k B k F . Therefor e, σ 2 E tr ( W − ˆ W ) ˆ W − 1 Σ X ( ˆ W − T + W − T ) W − 1 (167) ≤ σ 2 d 2 E k W − ˆ W k F k ˆ W − 1 Σ X ( ˆ W − T + W − T ) W − 1 k F (168) ≤ σ 2 d 2 q E k W − ˆ W k 2 F q E k ˆ W − 1 Σ X ( ˆ W − T + W − T ) W − 1 k 2 F (169) ≤ σ 2 d 2 √ 2 d 3 2 k Σ X k F ( a − ( d − 1) b ) 3 q E k W − ˆ W k 2 F (170) ≤ √ 2 d 4 . 5 σ 2 ( a/d ) d 2 q E k W − ˆ W k 2 F (171) where (169) is due to the Cauch y–Sch w arz inequality . F or (171) note that k Σ X k 2 F ≤ d 2 bec a use all the en tries o f Σ X are less than or equa l to one. 
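The two matrix-norm facts used above (sub-multiplicativity of the Frobenius norm, and the trace bound, which already follows from the sharper Cauchy–Schwarz form $\mathrm{tr}\,AB^T \le \|A\|_F\|B\|_F$) are easy to check numerically; a quick sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
for _ in range(100):
    A = rng.standard_normal((d, d))
    B = rng.standard_normal((d, d))
    fA = np.linalg.norm(A, 'fro')
    fB = np.linalg.norm(B, 'fro')
    # Sub-multiplicativity: ||AB||_F <= ||A||_F ||B||_F
    assert np.linalg.norm(A @ B, 'fro') <= fA * fB + 1e-9
    # Cauchy-Schwarz for the Frobenius inner product: tr(A B^T) <= ||A||_F ||B||_F,
    # which implies the (d/2) bound used in the proof for d >= 2.
    assert np.trace(A @ B.T) <= fA * fB + 1e-9
```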
We now proceed to upper bound $\mathbb{E}\|W - \hat{W}\|_F^2$. Consider the following uniform quantizer. The diagonal entries are truncated at some $c > a$. The double segment $\pm[a, c]$ is divided into $l_1$ regions of width $\epsilon_1 = 2(c-a)/l_1$ each. For $|w| > c$ we take $\hat{w} = \mathrm{sign}(w)\,c$. Therefore
$$\mathbb{E}(W_{11} - \hat{W}_{11})^2 \tag{172}$$
$$= \frac{Q(c)}{Q(a)}\,\mathbb{E}\left[(W_{11} - \hat{W}_{11})^2 \,\middle|\, |W_{11}| > c\right] \tag{173}$$
$$\quad + \left(1 - \frac{Q(c)}{Q(a)}\right)\mathbb{E}\left[(W_{11} - \hat{W}_{11})^2 \,\middle|\, |W_{11}| < c\right] \tag{174}$$
$$\le \frac{Q(c)}{Q(a)}\,\mathbb{E}\left[(W_{11} - c)^2 \,\middle|\, W_{11} > c\right] + \epsilon_1^2 \tag{175}$$
$$= \frac{Q(c)}{Q(a)}\left(1 + c\,s(c) + c^2\right) + \epsilon_1^2 \tag{176}$$
$$\le \frac{a + a^{-1}}{c}\,e^{-\frac{c^2 - a^2}{2}}\cdot 2(1 + c^2) + \epsilon_1^2 \tag{177}$$
$$\le 8c^2\,e^{-\frac{c^2 - a^2}{2}} + \epsilon_1^2, \tag{178}$$
where (177) is obtained with some manipulations on $t \le s(t) \le t + t^{-1}$. For the off-diagonal entries, the segment $[-b, b]$ is divided into $l_2$ regions of width $\epsilon_2 = 2b/l_2$ each. Therefore
$$\mathbb{E}(W_{12} - \hat{W}_{12})^2 \le \epsilon_2^2. \tag{179}$$
It follows that
$$\mathbb{E}\|W - \hat{W}\|_F^2 \tag{180}$$
$$= d\,\mathbb{E}(W_{11} - \hat{W}_{11})^2 + (d^2 - d)\,\mathbb{E}(W_{12} - \hat{W}_{12})^2 \tag{181}$$
$$\le 8dc^2\,e^{-\frac{c^2 - a^2}{2}} + d\epsilon_1^2 + d(d-1)\epsilon_2^2 \tag{182}$$
$$\le 8dc^2\,e^{-\frac{c^2 - a^2}{2}} + d^2(\epsilon_1 + \epsilon_2)^2. \tag{183}$$
We take $c = \sqrt{3}\,a$ and $l_1 = l_2$, and thus $\epsilon_1 + \epsilon_2 = 2(c - a + b)/l_1 \le 4a/l_1$. The number of bits used for quantization is $k_q = \log l_1$, and therefore $\epsilon_1 + \epsilon_2 \le 4a\cdot 2^{-k_q}$. Now,
$$\sqrt{\mathbb{E}\|W - \hat{W}\|_F^2} \tag{184}$$
$$\le \sqrt{24da^2 e^{-a^2} + 16d^2a^2\,2^{-2k_q}} \tag{185}$$
$$\le \sqrt{(5ad)^2\left(e^{-a^2} + 2^{-2k_q}\right)} \tag{186}$$
$$\le 5ad\left(e^{-\frac{a^2}{2}} + 2^{-k_q}\right), \tag{187}$$
and finally, combining (187) with (166) and (171) yields
$$\mathbb{E}\|\hat{\rho} - \rho\|^2 - \mathbb{E}\|\hat{\rho}_0 - \rho\|^2 \tag{188}$$
$$\le 25d^6(1-\sigma^2)\left(e^{-\frac{a^2}{2}} + 2^{-k_q}\right)^2 + 5\sqrt{2}\,d^{4.5}\sigma^2\left(e^{-\frac{a^2}{2}} + 2^{-k_q}\right) \tag{189}$$
$$\le 25d^6\left(e^{-\frac{a^2}{2}} + 2^{-k_q}\right), \tag{190}$$
which completes the proof.

A.4 Proof of Lemma 2

Denote by $\mathcal{P}$ the set of all $d\times d$ signed permutation matrices, i.e., matrices with exactly one nonzero entry in every row and every column, taking values in $\{-1, 1\}$.
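For small $d$, the set $\mathcal{P}$ can be enumerated exhaustively; a small sketch (for $d = 2$, so $|\mathcal{P}| = 8$, with an arbitrary fixed matrix standing in for $\mathbb{E}N$) illustrating that averaging $PNP^T$ over $\mathcal{P}$ leaves only a scalar multiple of the identity:

```python
import itertools
import numpy as np

d = 2
# Every signed permutation matrix is a permutation combined with a sign pattern.
P_all = []
for perm in itertools.permutations(range(d)):
    for s in itertools.product([-1, 1], repeat=d):
        P = np.zeros((d, d))
        for i, j in enumerate(perm):
            P[i, j] = s[i]
        P_all.append(P)

N = np.array([[3.0, 1.5],
              [-0.7, 5.0]])   # arbitrary fixed matrix standing in for E[N]

avg = sum(P @ N @ P.T for P in P_all) / len(P_all)

# Off-diagonal entries cancel under sign flips; diagonal entries average out,
# so the symmetrized matrix is a scalar multiple of the identity.
assert np.allclose(avg, avg[0, 0] * np.eye(d))
```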
For any $d\times d$ matrix $B$ and any $P \in \mathcal{P}$, the matrix $PBP^T$ is obtained from $B$ by performing the same permutation on the rows and columns of $B$, with possible sign changes. Specifically, the diagonal of $PBP^T$ is a permutation of the diagonal of $B$, and the off-diagonal of $PBP^T$ is a permutation of the off-diagonal entries of $B$ with possible sign changes. Suppose that the random matrix $N$ has the same distribution as $PNP^T$ for any $P \in \mathcal{P}$. It follows that $\mathbb{E}N$ must be a scalar multiple of the identity matrix, since for any $i \neq j$ there exist two matrices $P_1, P_2 \in \mathcal{P}$ such that

1. for some $i' \neq j'$,
$$(P_1 N P_1^T)_{i'j'} = (N)_{ij}, \tag{191}$$
$$(P_2 N P_2^T)_{i'j'} = -(N)_{ij}, \tag{192}$$
and thus $(\mathbb{E}N)_{ij} = -(\mathbb{E}N)_{ij}$;

2. for some $i'$,
$$(P_1 N P_1^T)_{i'i'} = (N)_{ii}, \tag{193}$$
$$(P_2 N P_2^T)_{i'i'} = (N)_{jj}, \tag{194}$$
and thus $(\mathbb{E}N)_{ii} = (\mathbb{E}N)_{jj}$.

The assumptions in the lemma imply that $M$ and $PMP^T$ have the same distribution for any $P \in \mathcal{P}$, thus $MM^T$ and $(PMP^T)(PMP^T)^T$ have the same distribution. Hence
$$(PMP^T)(PMP^T)^T = PMM^TP^T, \tag{195}$$
and thus $PMM^TP^T$ has the same distribution as $MM^T$. This implies that $(PMM^TP^T)^{-1}$ has the same distribution as $(MM^T)^{-1}$, and since
$$(PMM^TP^T)^{-1} = P(MM^T)^{-1}P^T, \tag{196}$$
we have that $P(MM^T)^{-1}P^T$ has the same distribution as $(MM^T)^{-1}$. Therefore $\mathbb{E}(MM^T)^{-1}$ is a scalar multiple of the identity matrix.

A.5 Proof of Proposition 2

For any distribution on $X$, and for any $u, \Delta$,
$$\mathbb{E}(\hat{\rho}_{\mathrm{th\text{-}q}} - \rho)^2 = \rho^2\,\mathbb{E}\left(\frac{X_J - \hat{X}_J}{\hat{X}_J}\right)^2 + (1-\rho^2)\,\mathbb{E}\frac{1}{\hat{X}_J^2} \tag{197}$$
$$\le \frac{\rho^2}{t^2}\,\mathbb{E}\left(X_J - \hat{X}_J\right)^2 + \frac{1-\rho^2}{t^2} \tag{198}$$
$$= \frac{\rho^2}{t^2}\,\Pr(X_J < u)\,\mathbb{E}\left[\left(X_J - \hat{X}_J\right)^2 \,\middle|\, X_J < u\right] \tag{199}$$
$$\quad + \frac{\rho^2}{t^2}\,\Pr(X_J > u)\,\mathbb{E}\left[(X_J - u)^2 \,\middle|\, X_J > u\right] + \frac{1-\rho^2}{t^2}$$
$$\le \frac{\rho^2}{t^2}\,\Delta^2 + \frac{\rho^2}{t^2}\cdot\frac{\Pr(X > u)}{\Pr(X > t)}\,\mathbb{E}\left[(X - u)^2 \,\middle|\, X > u\right] + \frac{1-\rho^2}{t^2}. \tag{200}$$
In this Pareto example we have
$$\mathbb{E}\left[(X - u)^2 \,\middle|\, X > u\right] = cu^2, \tag{201}$$
where $c = 2/((\alpha-1)^2(\alpha-2))$, and thus
$$\mathbb{E}(\hat{\rho}_{\mathrm{th\text{-}q}} - \rho)^2 \le \frac{1 - \rho^2 + \rho^2\Delta^2}{t^2} + c\rho^2\left(\frac{t}{u}\right)^{\alpha-2}. \tag{202}$$
We take $u = t^{\frac{\alpha}{\alpha-2}}$, and thus (202) becomes
$$\mathbb{E}(\hat{\rho}_{\mathrm{th\text{-}q}} - \rho)^2 \le \frac{1 - \rho^2 + \rho^2\Delta^2 + c\rho^2}{t^2}. \tag{203}$$
The bits are allocated by
$$k_q = \frac{1}{\alpha-1}\,k, \qquad k_l = \frac{\alpha-2}{\alpha-1}\,k, \tag{204}$$
and the threshold $t$ is determined by $k_l$ from the solution of $k_l = h_g(\Pr(X > t))$, which yields $t = 2^{\frac{k_l}{\alpha}(1-o(1))}$. We have $\Delta \le 1$, since
$$\Delta = 2^{-k_q}(u - t) = 2^{-k_q}\left(t^{\frac{\alpha}{\alpha-2}} - t\right) = 2^{-k_q}\,t^{\frac{\alpha}{\alpha-2}}\left(1 - t^{-\frac{2}{\alpha-2}}\right) = 2^{-\frac{k}{\alpha-1}}\,2^{\frac{k}{\alpha-1}(1-o(1))}\left(1 - t^{-\frac{2}{\alpha-2}}\right) = 2^{-o(k)}\left(1 - t^{-\frac{2}{\alpha-2}}\right).$$
Note that for $\alpha > 3$ we have $c < 1$, and thus
$$\mathbb{E}(\hat{\rho}_{\mathrm{th\text{-}q}} - \rho)^2 \le \frac{1 + \rho^2}{t^2} \le (1 + \rho^2)\,2^{-\frac{2}{\alpha}\cdot\frac{\alpha-2}{\alpha-1}\,k(1-o(1))}. \tag{205}$$

A.6 Proof of Corollary 3

Set any real-valued sequence $\{\alpha_k\}_{k=1}^\infty$ such that all the elements are distinct and $\sum \alpha_k^2 = 1$. Pick $Z = \sum_k \alpha_k Z_k$, where $Z_k \sim p_X$ are i.i.d. and mutually independent of $X$. Then $Y = \alpha X + \sum_k \alpha_k Z_k$ is a weighted sum of i.i.d. random variables; hence, knowing $p_X$ and $p_Y$, or even knowing all the weights $\alpha, \{\alpha_k\}$, there is no way to distinguish between the case where $X$ has coefficient $\alpha$ and the case where $X$ has coefficient $\alpha_k$ for some $k$. Thus, there is an infinite number of possible correlations.

A.7 Maximum likelihood approximation

In this section we provide further justification for the estimator $\hat{\rho} = Y_J/\mathbb{E}X_J$ (of which the scalar setup is a special case), by showing that $Y_J/X_J$ is an approximation of the maximum likelihood estimator. The model is
$$Y_J = \rho X_J + \Sigma^{1/2} Z_J, \tag{206}$$
where either $J = \arg\max_i\{X_i\}$ if we use the max method, or $J = \min\{i : X_i > t\}$ if we use the threshold method. We wish to maximize $f_{X_J Y_J}$, which is equivalent to maximizing $f_{Y_J|X_J}$, since $f_{X_J}$ does not depend on $\rho$.
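In the scalar special case, the estimator $\hat{\rho} = Y_J/\mathbb{E}X_J$ can be checked to be unbiased by direct simulation of the threshold method; a minimal Monte Carlo sketch (the values of $\rho$, $t$ and the sample size are illustrative, not from the paper):

```python
import math
import numpy as np

rng = np.random.default_rng(1)
rho, t, n = 0.5, 2.0, 50_000   # illustrative correlation, threshold, sample size

# E[X_J] = E[X | X > t] = phi(t)/Q(t) for a standard normal X.
phi = math.exp(-t * t / 2) / math.sqrt(2 * math.pi)
Q = 0.5 * math.erfc(t / math.sqrt(2))
EXJ = phi / Q

# Sample X_J from the conditional law X | X > t by rejection sampling.
samples = []
while len(samples) < n:
    x = rng.standard_normal(500_000)
    samples.extend(x[x > t].tolist())
XJ = np.array(samples[:n])

# Y_J = rho X_J + sqrt(1 - rho^2) Z, with Z independent standard normal.
YJ = rho * XJ + math.sqrt(1 - rho**2) * rng.standard_normal(n)

rho_hat = YJ / EXJ             # E[Y_J] = rho E[X_J], so this is unbiased
assert abs(rho_hat.mean() - rho) < 0.02
```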
It follows that the actual distribution of $X_J$ is irrelevant. We have $Y_J \,|\, X_J \sim \mathcal{N}(\rho X_J, \Sigma)$ with $\Sigma = \Sigma_Y - \rho\rho^T$, and thus
$$\frac{\partial}{\partial\rho}\ln f_{X_J Y_J} = -\frac{1}{2}\,\frac{\partial}{\partial\rho}\left[\ln\det\Sigma + (Y_J - \rho X_J)^T\Sigma^{-1}(Y_J - \rho X_J)\right] \tag{207}$$
$$= \Sigma^{-1}\left[\rho - (Y_J - \rho X_J)\left((Y_J - \rho X_J)^T\Sigma^{-1}\rho - X_J\right)\right],$$
meaning we want to solve
$$\rho = (Y_J - \rho X_J)\left[(Y_J - \rho X_J)^T\Sigma^{-1}\rho - X_J\right]. \tag{208}$$
Note that the rightmost term (in square brackets) is a scalar, and thus the solution must be of the form $\hat{\rho}_{\mathrm{ML}} = CY_J$, where $C$ is a scalar that depends on $X_J$ and $Y_J$. Plugging this in yields that $C$ is obtained as a solution of the third-degree polynomial
$$Y_J^T\Sigma_Y^{-1}Y_J\,C^2(X_J - C) - \left(X_J^2 - 1 + Y_J^T\Sigma_Y^{-1}Y_J\right)C + X_J = 0.$$
In our setups $X_J$ takes large values. This implies in general that the entries of $Y_J$ are also large, and thus $C$ should be small, as we expect $\hat{\rho}_{\mathrm{ML}} = CY_J$ to produce moderate values. Therefore we can assume that $X_J - C \approx X_J$ and that $X_J^2 - 1 \approx X_J^2$, which results in a quadratic equation in $C$ whose solutions are $X_J/(Y_J^T\Sigma_Y^{-1}Y_J)$ and $1/X_J$. Note that (with either max or threshold) $\mathrm{Var}\,X_J$ approaches zero as the number of bits increases, and therefore the loss in replacing $X_J$ with $\mathbb{E}X_J$ is negligible (this is also evident in the optimality claims throughout, where it is shown that the estimators, which do not use the actual value of $X_J$, achieve the CRLB that assumes $X_J$ is known).

References

[1] E. L. Lehmann and G. Casella, Theory of Point Estimation. Springer Science & Business Media, 2006.
[2] Z. Zhang and T. Berger, "Estimation via compressed information," IEEE Transactions on Information Theory, vol. 34, no. 2, pp. 198–211, 1988.
[3] T. Liu and P. Viswanath, "Opportunistic orthogonal writing on dirty paper," IEEE Transactions on Information Theory, vol. 52, no. 5, pp. 1828–1846, 2006.
[4] S. Borade and L. Zheng, "Writing on fading paper and causal transmitter CSI," in Proc. IEEE International Symposium on Information Theory (ISIT), pp. 744–748, 2006.
[5] A. No and T. Weissman, "Rateless lossy compression via the extremes," IEEE Transactions on Information Theory, vol. 62, no. 10, pp. 5484–5495, 2016.
[6] R. Ahlswede and M. Burnashev, "On minimax estimation in the presence of side information about remote data," The Annals of Statistics, pp. 141–171, 1990.
[7] T. S. Han and S.-i. Amari, "Parameter estimation with multiterminal data compression," IEEE Transactions on Information Theory, vol. 41, no. 6, pp. 1802–1833, 1995.
[8] S.-i. Amari et al., "Statistical inference under multiterminal data compression," IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2300–2324, 1998.
[9] S.-i. Amari, "On optimal data compression in multiterminal statistical inference," IEEE Transactions on Information Theory, vol. 57, no. 9, pp. 5577–5587, 2011.
[10] E. Haim and Y. Kochman, "Binary distributed hypothesis testing via Körner–Marton coding," in Proc. IEEE Information Theory Workshop (ITW), pp. 146–150, 2016.
[11] Y. Zhang, J. Duchi, M. I. Jordan, and M. J. Wainwright, "Information-theoretic lower bounds for distributed statistical estimation with communication constraints," in Advances in Neural Information Processing Systems, pp. 2328–2336, 2013.
[12] M. El Gamal and L. Lai, "Are Slepian–Wolf rates necessary for distributed parameter estimation?," in Proc. 53rd Annual Allerton Conference on Communication, Control, and Computing, pp. 1249–1255, 2015.
[13] J.-J. Xiao, S. Cui, Z.-Q. Luo, and A. J. Goldsmith, "Joint estimation in sensor networks under energy constraints," in Proc. First Annual IEEE Communications Society Conference on Sensor and Ad Hoc Communications and Networks (SECON), pp. 264–271, 2004.
[14] Z.-Q. Luo, "Universal decentralized estimation in a bandwidth constrained sensor network," IEEE Transactions on Information Theory, vol. 51, no. 6, pp. 2210–2219, 2005.
[15] J. A. Gubner, "Distributed estimation and quantization," IEEE Transactions on Information Theory, vol. 39, no. 4, pp. 1456–1459, 1993.
[16] A. Xu and M. Raginsky, "Information-theoretic lower bounds on Bayes risk in decentralized estimation," IEEE Transactions on Information Theory, vol. 63, no. 3, pp. 1580–1600, 2017.
[17] M. Braverman, A. Garg, T. Ma, H. L. Nguyen, and D. P. Woodruff, "Communication lower bounds for statistical estimation problems via a distributed data processing inequality," in Proc. 48th Annual ACM Symposium on Theory of Computing (STOC), pp. 1011–1020, 2016.
[18] I. D. Schizas, A. Ribeiro, and G. B. Giannakis, "Consensus in ad hoc WSNs with noisy links — Part I: Distributed estimation of deterministic signals," IEEE Transactions on Signal Processing, vol. 56, no. 1, pp. 350–364, 2008.
[19] J.-J. Xiao, A. Ribeiro, Z.-Q. Luo, and G. B. Giannakis, "Distributed compression-estimation using wireless sensor networks," IEEE Signal Processing Magazine, vol. 23, no. 4, pp. 27–41, 2006.
[20] A. Ribeiro and G. B. Giannakis, "Bandwidth-constrained distributed estimation for wireless sensor networks — Part I: Gaussian case," IEEE Transactions on Signal Processing, vol. 54, no. 3, pp. 1131–1143, 2006.
[21] P. Venkitasubramaniam, L. Tong, and A. Swami, "Quantization for maximin ARE in distributed estimation," IEEE Transactions on Signal Processing, vol. 55, no. 7, pp. 3596–3605, 2007.
[22] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2006.
[23] S. M. Kay, Fundamentals of Statistical Signal Processing, Volume I: Estimation Theory, 1993.
[24] H. David and H. Nagaraja, Order Statistics. Wiley Series in Probability and Statistics, Wiley, 2004.
[25] C. G. Small, Expansions and Asymptotics for Statistics. CRC Press, 2010.
[26] A. C. Rencher, Methods of Multivariate Analysis, vol. 492. John Wiley & Sons, 2003.
[27] C. R. Johnson, "A Gershgorin-type lower bound for the smallest singular value," Linear Algebra and its Applications, vol. 112, pp. 1–7, 1989.
[28] W. W. Hager, "Updating the inverse of a matrix," SIAM Review, vol. 31, no. 2, pp. 221–239, 1989.
[29] P. Billingsley, Probability and Measure. Wiley Series in Probability and Statistics, Wiley, 1995.
[30] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 1990.
[31] U. Hadar and O. Shayevitz, "Distributed estimation of Gaussian correlations," in Proc. IEEE International Symposium on Information Theory (ISIT), pp. 511–515, 2018.