Comparing Apples and Oranges: Two Examples of the Limits of Statistical Inference, With an Application to Google Advertising Markets

John Mount (http://www.mzlabs.com/), Nina Zumel (http://www.quimba.com/)

July 6, 2007

This work is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

1 Overview

Bad experimental situations are often a source of great statistical puzzles. We are going to describe an example of this sort of situation using what one author observed while watching a few different companies using the Google AdSense and AdWords products. The points we argue will be obvious to statisticians; in fact, they are actually elementary exercises. We will show that the measurements allowed in the Google AdSense markets are insufficient to allow accurate tracking of a large number of different revenue sources.

Our goal is to explain a well known limit on inference to a larger non-specialist audience. This is a bit of a challenge as most mathematical papers can only be read by people who could have written the paper themselves. By "non-specialist audience" we mean analytically minded people who may not have seen this sort of math before, or those who have seen the theory but are interested in seeing a complete application. We will include in this writeup the notes, intents, side-thoughts and calculations that mathematicians produce to understand even their own work but, as Gian-Carlo Rota wrote, we are compelled to delete for fear our presentation and understanding won't appear as deep as everyone else's.[4]

The counter-intuitive points that we wish to emphasize are:

• The difficulty of estimating the variance of individuals from a small number of aggregated measurements.
• The difficulty of estimating the averages of many groups from a small number of aggregated measurements.

These points will be motivated as they apply in the Google markets and we will try to examine their consequences in a simplified setting.

Contents

1 Overview
2 The Google Markets
  2.1 Introduction
  2.2 Information Limits
  2.3 Channel Identifiers
3 The Statistics
  3.1 The Variance is Not Measurable
    3.1.1 The Mean
    3.1.2 Trying to Estimate the Variance
    3.1.3 Cramer-Rao: Why we can not estimate the variance of individual Apples
  3.2 Trying to Undo a Mixture
    3.2.1 Cramer-Rao: Why we can't separate Apples from Oranges
4 Other Solution Methods
5 Conclusion
6 Appendix
  6.1 Derivation That a Single Mean is Easy to Estimate
  6.2 Fisher Information and the Cramer-Rao Inequality
    6.2.1 Discussion
    6.2.2 Calculating Cramer-Rao on the Variance of Variance Estimate
    6.2.3 Calculating Cramer-Rao Inequality on Multiple Mean Estimates
    6.2.4 Cramer-Rao Inequality Holds in General
2 The Google Markets

2.1 Introduction

Google both buys and sells a large number of textual advertisements through programs called Google AdSense and Google AdWords.[2] What is actually purchased and sold is "clicks." Web sites that agree to display Google AdSense are paid when users click on these ads, and advertisers who place advertisements into Google AdWords pay Google when their advertisements are clicked on. The key item in these markets is the "search term" that the advertiser chooses to bid on advertising clicks for. "Search terms" are short phrases for which an advertiser is willing to pay, in order to get a visit from a web surfer who has performed a search on that phrase. For instance a company like Panasonic might consider clicks on the search term "rugged laptop" (and the attention of the underlying web surfer) to be worth $2 to them.

Because Google both buys and sells advertisements they are essentially making a market. There are some unique aspects to this market in that it is not the advertisements or even page-views that are being traded, but clicks. Both Google and its affiliates serve the advertisements for free and then exchange payment only when a web surfer clicks on an advertisement. A website can "resell" advertisements by simultaneously placing ads through AdWords, and serving ads through AdSense. When a user clicks into the website via an advertisement, this costs the web site money; if, however, the user is then shown a number of other advertisements, he or she may then click out on one of them of their own free will, recouping money or perhaps even making a profit for the site. There is significant uncertainty in attempting resale and arbitrage in these advertisement markets, as the user who must be behind all the clicks can just "evaporate" during an attempted resale.
Direct reselling of clicks (such as redirecting a web surfer from one advertisement to another) would require a method called "automatic redirection" to move the surfer from one advertisement to a replacement advertisement. Automatic redirection is not allowed by Google's terms of service.

An interesting issue is that each click on a given search term is a unique event with a unique cost. One click for "rugged laptop" may cost $1 and another may cost $0.50. The differing costs are determined by the advertiser's bid, available placements for the key phrase, what other advertisers are bidding in the market, how many web surfers are available, and Google's sorting of bids. The sorting of bids by Google depends on the rank of the advertiser's bid times an adjustment factor managed by Google. The hopeful assumption is that all of the potential viewers and clickers for the same search term are essentially exchangeable in that they all have a similar (unknown) cost and similar probabilities of later actions, such as buying something from a web site. The concept of exchangeability is what allows information collected on one set of unique events to inform predictions about new unique events (drawn from the same exchangeable population).

Whatever the details are, these large advertisement markets have given Google an income of $12 billion, $3.5 billion in profit and 70% year to year growth in 2006.[5] This scale of profit is due in part to the dominant position of Google in forming markets for on-line advertising. The reasons for Google's market domination are various and include the superior quality of the Google matching and bidding service, missteps by competitors and the network effects found in a good market: the situation whereby sellers attract buyers and buyers attract sellers.
The costs of switching markets (implementation, information handling and staffing multiple relationships) are also significant factors. In our opinion, Google's profit margins are also helped by the limits on information available to most of the other market participants. In the next section, we will discuss some of the information limits or barriers to transparency in the Google market.

2.2 Information Limits

Google deals are typically set up as revenue sharing arrangements in which Google agrees to pay a negotiated portion of the revenues received by Google to the AdSense hosting web site. As noted above, advertisement click-through values vary from as little as $0.05 to over $40.00 per click. It is obvious that web site operators who receive a commission to serve advertisements on behalf of the Google AdSense program need detailed information about which advertisements are paying at what rate. This is necessary both to verify that Google is sharing the correct amount on valuable advertisements and to adjust and optimize the web site hosting the advertisements.

However, Google does not provide AdSense participants with a complete breakdown of revenues paid. There are a number of possible legitimate reasons for this. First, there is a concern that allowing web sites complete detailed reconciliation data would allow them to over-optimize or perform so-called "keyword arbitrage" where sites buy precisely the keywords they can profitably serve advertisements on instead of buying keywords for which the site actually has useful information or services. In addition, the quantity of data is very large, so there are some technical challenges in providing a detailed timely reconciliation. There can also be reasons favorable to Google.
2.3 Channel Identifiers

Google's current solution to the conflicting informational needs defines the nature of the market and is in itself quite interesting. Google allows the AdSense customer a number of measurements called "channels." The channels come with identifiers and the AdSense customer is allowed to attach a number of identifiers to every advertisement clicked-out on. Google in turn reports not the detailed revenue for every click-out but instead just the sum of revenue received on clicks-out containing each channel identifier. For example: if a web site operator wanted to know the revenue from a particular search term (say "head cold") they could attach a single channel identifier to all click-outs associated with "head cold" and to no other search term. Under this scheme, Google would then be reporting the revenue for the search term as a channel summary.

This simple scheme uses up an entire channel-id for a single search term. This would not be a problem except that an AdSense partner is typically limited (by Google) to a few hundred channel identifiers and is often attempting to track tens of thousands of search terms (and other conditions such as traffic source and time of day). It is obvious to any statistician that this limited number of channels is not sufficient to eliminate many degrees of uncertainty in the revenue attribution problem.

Google does allow each click-out to have multiple channel identifiers attached to it. At first this seems promising; for instance one can easily come up with schemes where 30 channel ids would be sufficient to give over a billion unique search terms each a unique pattern of channel identifiers. However, Google does not report revenue for each pattern of channel identifiers; in this case they would only report the total for each of the 30 channels.
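The reporting scheme just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `channel_totals`, the three-channel toy case and the revenue figures (in cents) are all made up here, and nothing below reflects an actual Google interface. It shows that on/off channel patterns are plentiful (2^30 exceeds a billion), yet two quite different revenue breakdowns can produce exactly the same per-channel report.

```python
from itertools import product

def channel_totals(revenue_by_pattern, num_channels):
    """Return the per-channel sums that a report of this kind would contain.

    revenue_by_pattern maps a tuple of 0/1 flags (one flag per channel) to
    the revenue, in cents, of click-outs tagged with that set of channels.
    """
    totals = [0] * num_channels
    for pattern, cents in revenue_by_pattern.items():
        for channel, tagged in enumerate(pattern):
            if tagged:
                # A click-out is counted once for every channel attached to it.
                totals[channel] += cents
    return totals

# 30 on/off channel flags give 2**30 distinct patterns: over a billion.
patterns_with_30_channels = 2 ** 30

# Toy case: 3 channels, so 7 non-empty patterns (one per hypothetical search term).
num_channels = 3
patterns = [p for p in product((0, 1), repeat=num_channels) if any(p)]

# Two different (made-up) revenue assignments, in cents per pattern.
revenue_a = {p: 0 for p in patterns}
revenue_a[(1, 0, 0)] = 100
revenue_a[(0, 1, 0)] = 100

revenue_b = {p: 0 for p in patterns}
revenue_b[(1, 1, 0)] = 100

totals_a = channel_totals(revenue_a, num_channels)
totals_b = channel_totals(revenue_b, num_channels)
print(totals_a, totals_b)
```

In the toy case seven unknowns are summarized by three totals, so the report cannot distinguish the two assignments; the same counting argument applies to 30 channels and a billion patterns.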
Each channel total would be the sum of all revenue given for all clicks-out that included the given channel-id. Under this scheme we would have a lot of double counting in that any click-out with multiple channel identifiers attached is necessarily simultaneously contributing to multiple totals. Anyone familiar with statistics or linear algebra will quickly recognize that 30 channels can really only reliably measure about 30 facts about an ad campaign. There is provably no super clever scheme capable of decoding these confounded measurements into a larger number of reliable outcomes.

Let us go back to the points that we promised to discuss at the beginning of this paper:

• The difficulty of estimating the variance of individuals from a small number of aggregated measurements. In terms of Google AdSense, this means that we can tell the average (mean) value of a click in a given channel, but we cannot tell how widely the click values in the channel vary from this average value.
• The difficulty of estimating the averages of many groups from a small number of aggregated measurements. This means that if we assign multiple search terms into each of our available channels, we cannot separate out the values of each individual search term using only the aggregate channel measurements.

It is an interesting exercise to touch on the theory of why these facts are true.

3 The Statistics

One thing the last section should have made obvious is that even describing the problem is detailed and tedious. It may be better to work in analogy to avoid real-world details and non-essential complications. Let's replace advertisement clicks-out with fruit, and channels with weighings of baskets. Suppose we are dealing with apples and our business depends on knowing the typical weight of each fruit.
We assume that all apples are exchangeable: they may each have a different weight (and value) but they all are coming from a single source. We further assume that we have a limited number of times that we are allowed to place our apples into a basket and weigh them on a scale.

3.1 The Variance is Not Measurable

3.1.1 The Mean

The first example, the happy one, is when we have a single basket filled with many different items of one type of fruit. For instance suppose we had a single basket with 5 apples in it and we were told the basket contents have a total weight of 1.3 pounds. The fact that we were given only a single measurement for the entire basket (instead of being allowed to weigh each apple independently) does not interfere in any way with accurately deducing that the average (or mean) of this type of apple weighs a little more than 1/4 pound. If we had $n$ apples in the basket, and we called the total weight of the contents of the basket $T$, we could estimate the average or mean weight of individual apples as being $T/n$. If we use $a_w$ to denote the (unknown universal) average weight of individual apples we would denote our estimate of this average as $\hat{a}_w$, and we have just said that our estimate is $\hat{a}_w = T/n$.

However, we are missing the opportunity to learn at least one important thing: how much does the weight of these apples vary? This could be an important fact needed to run our business (apples below a given weight may be unsellable, or other weight considerations may apply). We may need to know how inaccurate it is to use the mean or average weight of the apples in place of individual weights. If we were allowed 5 basket weighings we could put one apple in each basket and directly see how much the typical variation in weight is for the type of apples we have. Let's call this Experiment-A.
Suppose in this case we find the 5 apples to weigh 0.25 lb, 0.3 lb, 0.27 lb, 0.23 lb and 0.25 lb respectively. This detailed set of measurements helps inform us on how this type of apple varies in weight. One of the simplest methods to summarize information about variation is a statistical notion called "variance." Variance is defined as the expected squared distance of a random individual from the population average. Variance is written as $E[(x - a_w)^2]$ where $x$ is a "random variable" denoting the weight of a single apple drawn uniformly and independently at random (from the unknown larger population) and the $E[\cdot]$ notation denotes "expectation." $E[(x - a_w)^2]$ is the value that somebody who knew the value of $a_w$ would say is the average value of $(x - a_w)^2$ over very many repetitions of drawing a single apple and recording its individual weight as $x$. For example if all apples had the exact same weight the variance would be zero. For the basket above, $E[(x - \hat{a}_w)^2]$ is calculated as:

$$\frac{(0.25 - 0.26)^2 + (0.3 - 0.26)^2 + (0.27 - 0.26)^2 + (0.23 - 0.26)^2 + (0.25 - 0.26)^2}{5} = 0.00056$$

(the 0.26 itself being the average of the 5 apples' weights). The interpretation is that for a similar apple with unknown weight $x$ we would expect $(x - 0.26)^2$ to be at most about $4 \times 0.00056$, or for $x$ to not be too far outside the interval 0.213 to 0.307 (applying the common rule of thumb of "2 standard deviations", which in squared terms is 4 variances). As we see, all of the original 5 apples fell in this interval.

Now the 5 apple weights we know are not actually all the possible apples in the world, they are merely the apples in our sample. There are some subtleties about using the variance found in a sample to estimate the variance of the total population, but for this discussion we will use the naive assumption that they are nearly the same.
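The Experiment-A arithmetic is easy to reproduce. The short Python sketch below recomputes the mean, the variance (with divisor 5, as in the text) and the "2 standard deviations" interval from the five weights; the variable names are chosen here purely for illustration.

```python
import math

weights = [0.25, 0.30, 0.27, 0.23, 0.25]  # Experiment-A weighings, in pounds

n = len(weights)
mean = sum(weights) / n
# Variance of the sample with divisor n, matching the calculation in the text.
variance = sum((w - mean) ** 2 for w in weights) / n
std_dev = math.sqrt(variance)

# Rule-of-thumb interval: mean plus or minus two standard deviations.
low, high = mean - 2 * std_dev, mean + 2 * std_dev
inside = [low <= w <= high for w in weights]

print(f"mean = {mean:.2f} lb, variance = {variance:.5f}")
print(f"2-standard-deviation interval: {low:.3f} to {high:.3f}")
print("all five apples inside:", all(inside))
```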
If we use the symbol $v_a$ to denote the (unknown) true variance of individual apple weights (so $v_a = E[(x - a_w)^2]$) we can use it to express the fact that $\hat{a}_w$ is actually an excellent estimate of $a_w$. Specifically: if we were to repeat the experiment of taking a basket of randomly selected apples ($n$ apples in the basket) over and over again, estimating the mean apple weight $\hat{a}_w$ each time, then $E[(\hat{a}_w - a_w)^2]$, the expected squared error between our estimate of the average apple weight and the true average apple weight, will go to zero as the sample size $n$ is increased. In fact, we can show $E[(\hat{a}_w - a_w)^2] = v_a/n$, which means that our estimate of the mean gets more precise as $n$ is increased. This fact that large samples are very good estimates of unknown means is basic, but for completeness we include its derivation in the appendix.

3.1.2 Trying to Estimate the Variance

We introduced the variance of individual apples (denoted by $v_a$) as an unknown quantity that aided reasoning. We know that even with only one measurement of the total weight of all $n$ apples, $\hat{a}_w$ is an estimate of the mean whose error goes to zero as $n$ (the number of apples or the sample size) gets large. However, the variance of individual apples $v_a$ is so useful that we would like to have an actual estimate ($\hat{v}_a$) of it. It would be very useful to know if $v_a$ is near zero (all apples have nearly identical weight) or if $v_a$ is large (apples vary wildly in weight). If we were allowed to weigh each apple as in Experiment-A (i.e. if we had an unlimited number of basket weighings or channels), we could estimate the variance by the calculations in the last section.
If we were allowed only one measurement we would really have almost no information about the variance, as we have only seen one aggregated measurement; we have no idea how individual apple weights vary. The next question is: can we create a good estimate $\hat{v}_a$ when we are allowed only two measurements but the sample size ($n$) is allowed to grow?

Let's consider Experiment-B: if we have a total of $2n$ apples ($n$ in each basket) and $T_1$ is the total weight of the first basket and $T_2$ is the total weight of the second basket, then some algebra would tell us that

$$\hat{v}_a = \frac{(T_1 - T_2)^2}{2n}$$

is an unbiased estimate of $v_a$ (the variance in weight of individual apples) [footnote 1]. It turns out, however, that $\hat{v}_a$ is actually a bad estimate of the variance. That is, the expected squared distance of $\hat{v}_a$ from the unknown true value of the variance $v_a$ (written $E[(v_a - \hat{v}_a)^2]$) does not shrink beyond a certain bound as the number of apples in each basket ($n$) is increased. This "variance of variance estimate" result is in stark contrast to the nice behavior we just saw in estimating the average $a_w$. With some additional assumptions and algebra (not shown here) we can show that for our estimate $\hat{v}_a = (T_1 - T_2)^2 / (2n)$ we have

$$\lim_{n \to \infty} E[(\hat{v}_a - v_a)^2] = 2 v_a^2.$$

There is a general reason this is happening, and we will discuss this in the next section.

Footnote 1: "Unbiased" simply means that $E[\hat{v}_a - v_a] = 0$, which can also be written as $E[\hat{v}_a] = v_a$. This means our estimate of variance doesn't tend to be more over than under (or more under than over).

3.1.3 Cramer-Rao: Why we can not estimate the variance of individual Apples

Of course showing that one particular calculation fails is not the same as showing that the variance of individual apples cannot be estimated from the two total weighings $T_1$ and $T_2$. There could be other, better, estimates [footnote 2].
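The behavior of the specific Experiment-B estimate can be checked empirically before turning to the general theory. The Python sketch below is a minimal simulation under an added assumption: each basket total is drawn directly from a normal distribution with mean $n a_w$ and variance $n v_a$ (which is what normally distributed apple weights would give). The values of `A_W` and `V_A`, the seed and the trial count are illustrative choices, not figures from the paper.

```python
import random

def simulate(n, trials, mean_weight, variance, rng):
    """Monte Carlo check of the mean estimate and the Experiment-B variance estimate.

    Assumption: basket totals are drawn directly as normal with mean
    n * mean_weight and variance n * variance.
    Returns (average of v_hat, squared error of a_hat, squared error of v_hat).
    """
    total_sd = (n * variance) ** 0.5
    sum_v_hat = 0.0
    mse_mean = 0.0
    mse_var = 0.0
    for _ in range(trials):
        t1 = rng.gauss(n * mean_weight, total_sd)
        t2 = rng.gauss(n * mean_weight, total_sd)
        a_hat = t1 / n                      # mean estimate from one basket
        v_hat = (t1 - t2) ** 2 / (2 * n)    # Experiment-B variance estimate
        sum_v_hat += v_hat
        mse_mean += (a_hat - mean_weight) ** 2
        mse_var += (v_hat - variance) ** 2
    return sum_v_hat / trials, mse_mean / trials, mse_var / trials

rng = random.Random(2007)        # fixed seed so runs are repeatable
A_W, V_A = 0.26, 0.00056         # illustrative values borrowed from Experiment-A
results = {}
for n in (10, 1000, 100000):
    results[n] = simulate(n, 40000, A_W, V_A, rng)
    avg_v_hat, mse_mean, mse_var = results[n]
    # Ratios against v_a, v_a / n and 2 * v_a**2 respectively.
    print(n, avg_v_hat / V_A, mse_mean / (V_A / n), mse_var / (2 * V_A ** 2))
```

If the claims of the previous section hold, all three printed ratios should stay close to 1 for every $n$: the estimate is unbiased, the squared error of the mean falls like $v_a/n$, and the squared error of the variance estimate stays near $2 v_a^2$ no matter how many apples are in each basket.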
There is a well known statistical law that states that no unbiased estimator works well in this situation. The law is called the Cramer-Rao inequality.[1] The Cramer-Rao inequality is a tool for identifying situations where all unbiased estimators have large variance. The Cramer-Rao inequality is typically a calculation, so we will add a few more (not necessarily realistic) assumptions to ease calculation. We assume apple weights are distributed normally with mean $a_w$ and variance $v_a$ [footnote 3]. There is a quantity depending only on the experimental set up that reads off how difficult estimation is. By "depending only on the experimental set up" we mean that the quantity does not depend on any specific outcomes of $T_1$, $T_2$ and does not depend on any specific estimation procedure or formula. This quantity is called "Fisher Information" and is denoted as $J(v_a)$. The Cramer-Rao inequality[1] says that for any unbiased estimator $\hat{v}$, the variance of $\hat{v}$ is at least $1/J(v_a)$. Written in formulas the conclusion of the Cramer-Rao inequality is:

$$E[(v_a - \hat{v})^2] \ge 1/J(v_a).$$

Since we have now assumed a model for the weight distribution of apples, we can derive (see appendix) the following:

$$J(v_a) = \frac{2}{v_a^2}.$$

Applying the Cramer-Rao inequality lets us immediately say:

$$E[(v_a - \hat{v})^2] \ge \frac{v_a^2}{2}.$$

This means that there is no unbiased estimation procedure for which we can expect the squared error to shrink below $v_a^2/2$, even as the number of items in each basket ($n$) is increased. So not only does our proposed variance estimate fail to have the (expected) good behavior we saw when estimating the mean, but in fact no unbiased estimating scheme will work. In general we can show that the quality of the variance estimate is essentially a function of the number of measurements we are allowed [footnote 4], so any scheme using a constant number of measurements will fail.

Footnote 2: As an aside, some of the value in proposing a specific estimate (even though the theory says there is no good one) is that it allows one to investigate the failure of the estimate without resorting to the larger theory. For example, in this day of friendly computer languages and ubiquitous computers one can easily confirm the failure empirically (by setting up a simulation experiment as suggested by Metropolis and Ulam[3]). One can check that our estimate is unbiased (by averaging many applications of it) and that it is not good (by observing the substantial error on each individual application even when $n$ is enormous). There is no rule that one should not get an empirical feel (or even an empirical confirmation) of a mathematical statement (presentation of math is subject to errors), and in this day there are likely many more readers who could quickly confirm or disprove the claims of this section by simulation than there are readers who would be inclined to check many lines of tedious algebra for a subtle error.

Footnote 3: "Normal" is a statistical term for the distribution associated with the Bell curve. Many quantities in nature have a nearly normal distribution.

3.2 Trying to Undo a Mixture

Suppose we are willing to give up on estimating the variance (a dangerous concession). We are still blinded by the limited number of channels if we attempt to estimate more than one individual mean. In our analogy let's introduce a second fruit (oranges) to the problem. Call an assignment of fruit to baskets a "channel design." For example if we were allowed two basket measurements and wanted to know the mean weight of apples and the mean weight of oranges we could assign all apples to one basket and all oranges to the other.
This "design" would give us very good estimates of both the mean weight of apples and the mean weight of oranges.

Let's consider a simple situation where due to the limited number of channels we are attempting to measure something that was not considered in the original channel design. This is very likely because the number of simultaneous independent measurements is limited to the number of channels and it is very likely that one will have important questions that were not in any given experimental design. For example (going back to AdSense), suppose we had 26 channels and we used them all to group our search phrases by first letter of the English alphabet and we later wanted to break down older data by length of phrase [footnote 5]. We would consider ourselves lucky if the first-letter design was even as good as random assignment of channel ids in measuring the effect of search term length.

To work this example we continue to ignore most of the details and suppose we really are trying to estimate the mean weight of apples and the mean weight of oranges at the same time. Due to the kind of bad luck described above we have data from an experiment that was not designed for this purpose. Let's try the so-called easy case where we have a random experiment. For Experiment-C let's suppose we have two baskets of fruit and each basket was filled with $n$ items of fruit by repeating the process of flipping a fair coin and placing an apple if the coin came up heads and an orange if the coin came up tails. This admittedly silly process is simulating the situation where we are forced to use measurements that potentially could solve our problem, but were not designed to solve it [footnote 6]. We can measure the total weight of the contents of each basket.
So the information at our disposal this time is $a_1, o_1, T_1$ (the number of apples in the first basket, the number of oranges in the first basket and the total weight of the first basket) and $a_2, o_2, T_2$ (the number of apples in the second basket, the number of oranges in the second basket and the total weight of the second basket). What we want to estimate are $a_w$ and $o_w$, the unknown mean weights of the types of apples and types of oranges we are dealing with.

Footnote 4: And perhaps surprisingly not a function of the sample size.

Footnote 5: These examples are deliberately trivial.

Footnote 6: This is one of the nasty differences between prospective studies, where the experimental layout is tailored to expose the quantities of interest, and retrospective studies, where we hope to infer new quantities from experiments that have relevant (but not specifically organized) data.

To simplify things a bit let's treat the number of apples and oranges in each basket, $a_1, o_1, a_2, o_2$, as known constants set at "typical values" that we would expect from the coin flipping procedure. It turns out the following values of $a_1, o_1, a_2, o_2$ are typical:

$$a_1 = n/2 + \sqrt{n}, \quad o_1 = n/2 - \sqrt{n}, \quad a_2 = n/2 - \sqrt{n}, \quad o_2 = n/2 + \sqrt{n}.$$

We call these values typical because in any experiment where the distribution of $n$ items in a collection is chosen by fair coin flips we expect to see a nearly even distribution (due to the fairness of the coin) but not too even (due to the randomness). In fact we really do expect any one of these values to be at least $\sqrt{n}/2$ away from $n/2$ most of the time and closer than $2\sqrt{n}$ most of the time. So these are typical values, good but not too good.

We illustrate how to produce an unbiased (though in the end unfortunately unusable) estimate for $a_w$ and $o_w$.
The general theory says the estimate will be unreliable, but there is some value in seeing how an estimate is formed and having a specific estimate to experiment with. The fact that we know the count of each fruit in each basket, and each basket's weight, gives us a simultaneous system of equations:

$$E_{a_1,o_1,a_2,o_2}[T_1] = a_1 a_w + o_1 o_w$$
$$E_{a_1,o_1,a_2,o_2}[T_2] = a_2 a_w + o_2 o_w$$

$E_{a_1,o_1,a_2,o_2}[T_1]$ represents the average value of $T_1$ over imagined repeated experiments where $a_1$ apples and $o_1$ oranges are placed in a basket and weighed (similarly for $E_{a_1,o_1,a_2,o_2}[T_2]$). The subscripts indicate that we are only considering experiments where the numbers of apples and oranges are known to be exactly $a_1, o_1, a_2, o_2$. We do not actually know $E_{a_1,o_1,a_2,o_2}[T_1]$ and $E_{a_1,o_1,a_2,o_2}[T_2]$, but we can use the specific basket total weights $T_1, T_2$ we saw in our single experiment as stand-ins. In other words, $T_1$ may not equal $E_{a_1,o_1,a_2,o_2}[T_1]$ but $T_1$ is an unbiased estimator of $E_{a_1,o_1,a_2,o_2}[T_1]$ (this is a variation on the old "typical family with 2.5 children" joke). So we rewrite the previous system as estimates:

$$T_1 \approx a_1 a_w + o_1 o_w$$
$$T_2 \approx a_2 a_w + o_2 o_w.$$

We can then rewrite this system into a "solved form":

$$a_w \approx \frac{o_2 T_1 - o_1 T_2}{a_1 o_2 - a_2 o_1}, \qquad o_w \approx \frac{-a_2 T_1 + a_1 T_2}{a_1 o_2 - a_2 o_1}.$$

And this gives us the tempting estimates $\hat{a}_w$ and $\hat{o}_w$:

$$\hat{a}_w = \frac{o_2 T_1 - o_1 T_2}{a_1 o_2 - a_2 o_1}, \qquad \hat{o}_w = \frac{-a_2 T_1 + a_1 T_2}{a_1 o_2 - a_2 o_1}.$$

$\hat{a}_w$ and $\hat{o}_w$ are indeed unbiased estimates of $a_w$ and $o_w$. The problem is: even though these are unbiased estimates, they are not good estimates. With some calculation one can show that as $n$ (the number of pieces of fruit in each basket) increases, $E_{a_1,o_1,a_2,o_2}[(\hat{a}_w - a_w)^2]$ and $E_{a_1,o_1,a_2,o_2}[(\hat{o}_w - o_w)^2]$ do not approach zero.
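The claim that these squared errors do not shrink can be explored by simulation. The Python sketch below is a minimal version under added assumptions: the fruit counts are fixed at the "typical" values (with $n$ an even perfect square so that they are whole numbers), apples and oranges share one variance `V`, and each basket total is drawn directly from a normal distribution with variance $nv$. The means `A_W` and `O_W`, the variance, the seed and the trial count are hypothetical values chosen for illustration.

```python
import math
import random

def estimate_means(a1, o1, a2, o2, t1, t2):
    """Solve the two-basket system for the apple and orange mean weights."""
    det = a1 * o2 - a2 * o1
    a_hat = (o2 * t1 - o1 * t2) / det
    o_hat = (-a2 * t1 + a1 * t2) / det
    return a_hat, o_hat

def mean_squared_errors(n, trials, a_w, o_w, v, rng):
    """Monte Carlo squared error of both estimates at the 'typical' counts."""
    root = math.isqrt(n)                 # n is chosen to be a perfect square
    a1, o1 = n // 2 + root, n // 2 - root
    a2, o2 = n // 2 - root, n // 2 + root
    sd = math.sqrt(n * v)                # each basket holds n fruit of variance v
    err_a = err_o = 0.0
    for _ in range(trials):
        t1 = rng.gauss(a1 * a_w + o1 * o_w, sd)
        t2 = rng.gauss(a2 * a_w + o2 * o_w, sd)
        a_hat, o_hat = estimate_means(a1, o1, a2, o2, t1, t2)
        err_a += (a_hat - a_w) ** 2
        err_o += (o_hat - o_w) ** 2
    return err_a / trials, err_o / trials

rng = random.Random(7)                   # fixed seed so runs are repeatable
A_W, O_W, V = 0.26, 0.40, 0.00056        # hypothetical means and common variance
results = {}
for n in (100, 10000, 1000000):
    results[n] = mean_squared_errors(n, 20000, A_W, O_W, V, rng)
    # Squared errors expressed as a fraction of the per-fruit variance.
    print(n, results[n][0] / V, results[n][1] / V)
```

If the argument above is right, the printed fractions should level off at a positive constant as $n$ grows by factors of 100, rather than falling toward zero the way the error of a single mean does.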
Our estimates have a certain built-in error bound that does not shrink even as the sample size is increased.

3.2.1 Cramer-Rao: Why we can't separate Apples from Oranges

What is making estimation difficult has been the same in all experiments: most of what we want to measure is being obscured. As we mentioned earlier, in a typical case all of $a_1, o_1, a_2, o_2$ will be relatively near a common value. Any estimation procedure is going to depend on separations among these values, which are unfortunately not that big. This is what makes estimation difficult.

Let us assume apple weights are distributed normally with mean $a_w$ and variance $v_a = v$ and orange weights are distributed normally with mean $o_w$ and variance $v_o = v$. Since we have now assumed a model for the weight distribution of apples and oranges we can derive (calculating as shown in [1]) the following:

$$J(a_w, o_w) = \frac{1}{nv} \begin{pmatrix} a_1^2 + a_2^2 & a_1 o_1 + a_2 o_2 \\ a_1 o_1 + a_2 o_2 & o_1^2 + o_2^2 \end{pmatrix}.$$

What we are really interested in is the inverse of $J(a_w, o_w)$, which (for our typical values of $a_1, o_1, a_2, o_2$) is:

$$J^{-1}(a_w, o_w) = \frac{v}{8} \begin{pmatrix} 1 + 4/n & -1 + 4/n \\ -1 + 4/n & 1 + 4/n \end{pmatrix}.$$

The theory says that the diagonal entries of this matrix are essentially lower bounds on the squared error in the estimates of the apple and orange weights, respectively. The off-diagonal terms describe how an error in the estimate of the mean apple weight affects the estimate of the mean orange weight, and vice-versa. So what we would like is for all the entries of $J^{-1}(a_w, o_w)$ to approach zero as $n$ increases.

In our case, however, the entries of $J^{-1}(a_w, o_w)$ all tend to constants of magnitude $v/8$ as $n$ grows ($+v/8$ on the diagonal, $-v/8$ off it), meaning that the errors in the estimates are also bounded away from zero and stop improving as the sample size increases.
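The inverse above can be verified with exact arithmetic. The Python sketch below builds $J(a_w, o_w)$ from the formula in the text, inverts the 2x2 matrix in closed form using rational numbers, and evaluates it at the typical counts; the helper names `inverse_fisher` and `typical_counts` are introduced here only for illustration, and $n$ is restricted to even perfect squares so that the counts are whole numbers.

```python
from fractions import Fraction
from math import isqrt

def inverse_fisher(a1, o1, a2, o2, v):
    """Inverse of J(a_w, o_w) for two baskets of n = a1 + o1 = a2 + o2 fruit."""
    n = a1 + o1
    scale = Fraction(1) / (n * v)
    j11 = scale * (a1 * a1 + a2 * a2)
    j12 = scale * (a1 * o1 + a2 * o2)
    j22 = scale * (o1 * o1 + o2 * o2)
    det = j11 * j22 - j12 * j12
    # Closed-form inverse of a symmetric 2x2 matrix.
    return [[j22 / det, -j12 / det], [-j12 / det, j11 / det]]

def typical_counts(n):
    """Return (a1, o1, a2, o2) at the 'typical' coin-flip values."""
    root = isqrt(n)
    return n // 2 + root, n // 2 - root, n // 2 - root, n // 2 + root

v = Fraction(1)   # work in units of the common variance v
for n in (100, 10000, 1000000):
    inv = inverse_fisher(*typical_counts(n), v)
    print(n, float(inv[0][0]), float(inv[0][1]))
```

Because the arithmetic is exact, the diagonal entry can be compared directly with $(v/8)(1 + 4/n)$ and the off-diagonal entry with $(v/8)(-1 + 4/n)$ for each $n$.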
The above discussion assumes that the distribution of apples and oranges in each basket is the same (in this case, random and uniform). If there is some constructive bias in the process forming $a_1, o_1, a_2, o_2$, such as apples being a bit more likely in the first basket and oranges a bit more likely in the second basket, then the demonstrated estimate is good (with error decreasing as $n$ grows) and is actually useful. But the degree of utility of the estimate depends on how much useful bias we have: if there is not much useful bias then the errors shrink very slowly and we need a lot more data than one would first expect to get a good measurement. Finally, we would like to remind the reader that it is impossible for a channel design with a limited number of channels to simultaneously have an independent large useful bias on very many measurements.

As an example of the application of useful bias, suppose that our coin has probability $p$ of coming up heads, and that the first basket is filled by placing an apple every time the coin is heads, and an orange every time the coin is tails. The second basket is filled the opposite way: apple for tails, orange for heads. Again, let's treat the numbers of apples and oranges in each basket, $a_1, o_1, a_2, o_2$, as known constants set at "typical values" that we would expect from the coin flipping procedure:

$$a_1 = np, \quad o_1 = n(1-p), \quad a_2 = n(1-p), \quad o_2 = np$$

(as long as $p \neq \frac{1}{2}$ the $\sqrt{n}$ terms are dominated by the bias and can be ignored). If $p = 1$ (the coin always comes up heads) then the first basket is only apples, and the second basket is only oranges, and obviously we can find good estimates of $a_w$ and $o_w$ by the arguments in Section 3.1.1. If $p = \frac{1}{2}$, then we are in the situation that we already discussed, with approximately equal numbers of apples and oranges in each basket.
But suppose $p$ were some other value besides $1$ or $\frac{1}{2}$, say, $p = \frac{1}{4}$. In that case, the first basket would be primarily oranges, and the second one primarily apples, and we can show that

$$J^{-1}(a_w, o_w) = \frac{v}{2n}\begin{pmatrix} 5 & -3 \\ -3 & 5 \end{pmatrix},$$

and all of the entries of $J^{-1}(a_w, o_w)$ do go to zero as $n$ gets larger. This can be shown to be true in general, for any $p \neq \frac{1}{2}$. This means the Cramer-Rao bound does not prevent estimation. Another calculation (not shown here) confirms that our proposed estimate does indeed have shrinking error (as $n$ increases).

4 Other Solution Methods

We did not discuss solution methods that involve more data, such as repeated experiments, or significantly deeper knowledge, such as factor models. What we discussed were the limits of the basic modeling step, which itself would be a component of the more sophisticated solutions. Here, however, we will briefly touch on other procedures that could be used to try to improve the situation discussed above.

Repeated measurements could be implemented by taking data over many days, reassigning the channel identifiers so that each search term participates in different combinations of channel identifiers over the course of the measurements. Essentially, this is setting up a much larger system of simultaneous equations, from which a larger number of variables can be estimated. There are mathematical procedures for this sort of iterative estimation (such as the famous Kalman filter), but the number of quantities a web site would wish to estimate is so much larger than the number of measurements available that the procedure will require many reconciliation rounds to converge. In addition, this model assumes that the values of the variables being measured do not change over time (or change very slowly).
This is not an assumption that is necessarily true in the AdWords domain, due to seasonality and other effects.

A factor model is a model where one has researched a small number of causes or factors that explain the expected value of search phrases in a very simple manner. For example, it would be nice if the value of a search phrase were the sum of a value determined by the first letter plus an independent value determined by the second letter. In such a case we would only need 2 * 26 = 52 channels (to track the factors) and we would then be able to apply our model to many different search phrases. Factor models are a good solution, and are commonly used in other industries, such as finance, but one needs to invest in developing factors much better than the example factors we just mentioned.

5 Conclusion

The last section brings us to the point of this writeup. Having data from a limited number of channels is a fundamental limit on information in the Google click-out market. You can not get around it by mere calculation. You need other information sources or aggregation schemes which may or may not be available. The points we have touched on are:

• You can not estimate the variance of individuals from a constant number of aggregated measurements. This is bad because this interferes with detailed estimates of risk.

• You can not always undo bad channel assignments by calculation after the fact. This is bad because this interferes with detailed assignments and management of value.

In a market information is money. To the extent you buy or sell in ignorance you leak money to any counter-parties that know the things that you do not. Even if there are no such informed counter-parties there are distinct disadvantages in not being able to un-bundle mixed measurements. This means it is difficult to un-bundle mixed sales.
For example, we may be making a profit on a combination purchase of advertisements and not be able to quickly determine which advertisements in the combination are profitable and which are unprofitable.⁷

The capital markets (stocks, bonds, index funds, ...) have evolved and progressed from initial disorganized arrangements to open outcry markets and then to detailed information environments. The demands and expectations of these modern markets include a number of features, such as:

• Complete reconciliation and publicly available detailed records of the past.

• Transparent "books" or listings of all current bids and bidders.

Not all of these are appropriate for a non-capital market, and Google's on-line advertising markets are just that: Google's. It is interesting that before 2007 Yahoo/Overture offered a research interface that did expose the bidding book. It will be interesting to see how the on-line advertising markets evolve and if this feature survives in the newer "more like Google" Overture market.

The actual lesson we learned in watching others work with on-line advertising markets is the following. It is not necessary to be able to perform any of the calculations mentioned here to run a successful business. It is important, however, to have a statistician's intuition as to what is risky, what can be estimated and what can not be estimated. The surprise to the first author was that his initial intuition was wrong, even though he considers himself a mathematician. It wasn't until we removed the non-essential details from the problem and found the appropriate statistical references that we were finally able to fully convince ourselves that these estimation problems are in fact difficult.⁸

References

[1] Cover, T. M., and Thomas, J. A. Elements of Information Theory. John Wiley & Sons, 1991.

[2] Google. Google advertising programs.
http://www.google.com/ads/.

[3] Metropolis, N., and Ulam, S. The Monte Carlo method. 335–341.

[4] Rota, G. Indiscrete Thoughts. Birkhäuser, Boston, 1997.

[5] Yahoo! Google key statistics. http://finance.yahoo.com/q/ks?s=GOOG.

⁷ By "quickly determine" we mean determine from past data we already have. What we have shown is that we often can not determine what we need to know from past data, but must return to the market with new experiments that cost both time and money.

⁸ This initial optimism of ours is perhaps a side-effect of a "can do" attitude.

6 Appendix

6.1 Derivation That a Single Mean is Easy to Estimate

To show $E[(\hat{a}_w - a_w)^2] = v_a/n$ we introduce the symbols $x_i$ to denote the random variables representing the $n$ apples in our basket and work forward. To calculate we will need to use some of the theory of the expectation notation $E[\cdot]$. Simple facts about the $E[\cdot]$ notation are used to reduce complicated expressions into known quantities. For example, if $x$ is a random variable and $c$ is a constant then $E[cx] = cE[x]$. If $y$ is a random variable that is independent of $x$ then $E[xy] = E[x]E[y]$. And we have for any quantities $x$, $y$ that $E[x + y] = E[x] + E[y]$ (even when they are not independent).⁹

Starting our calculation:

\begin{align}
E[(\hat{a}_w - a_w)^2] &= E\left[\left(\Big(\sum_{i=1}^{n} x_i\Big)/n - a_w\right)^2\right] \tag{1} \\
&= E\left[\left(\sum_{i=1}^{n} (x_i - a_w)/n\right)^2\right] \tag{2} \\
&= E\left[\sum_{i=1}^{n} (x_i - a_w) \sum_{j=1}^{n} (x_j - a_w)\right]/n^2 \tag{3} \\
&= E\left[\sum_{i,j} (x_i - a_w)(x_j - a_w)\right]/n^2 \tag{4} \\
&= E\left[\sum_{i=1}^{n} (x_i - a_w)^2\right]/n^2 \tag{5} \\
&= E\left[n (x - a_w)^2\right]/n^2 \tag{6} \\
&= E\left[(x - a_w)^2\right]/n \tag{7} \\
&= v_a/n. \tag{8}
\end{align}

Most of the lines of the derivation are just substitutions or uses of definition (for example the last substitution on line 8 is of $E[(x - a_w)^2] \to v_a$). A few of the lines use some cute facts about statistics.
For example, line 4 → line 5 is using the fact that $E[x_i - a_w] = 0$, which under our independent drawing assumption is enough to show $E[(x_i - a_w)(x_j - a_w)] = 0$ when $i \neq j$ (hence all these terms can be ignored). The line 5 → line 6 substitution uses the fact that each of the $n$ apples was drawn using an identical process, so we expect the same amount of error in each trial (and there are $n$ trials in total).

⁹ It is funny that in statistics we spend so much time reminding ourselves that $E[xy]$ is not always equal to $E[x]E[y]$ that we actually sometimes find it surprising that $E[x + y] = E[x] + E[y]$ is generally true.

The conclusion of the derivation is that the expected squared error $E[(\hat{a}_w - a_w)^2]$ is a factor of $n$ smaller than $v_a = E[(x - a_w)^2]$. This means our estimate $\hat{a}_w$ is getting better and better (closer to the true $a_w$) as we increase the sample size $n$.

6.2 Fisher Information and the Cramer-Rao Inequality

6.2.1 Discussion

What is Fisher information? Is it like the other mathematical quantities that go by the name of information? There are a lot of odd quantities related to information, each with its own deep theoretical framework. For example there are Clausius entropy, Shannon information and Kolmogorov-Chaitin complexity. Each of these has useful applications, precise mathematics and deep meaning. They also have somewhat confused and incorrect pseudo-philosophical popularizations.

Fisher information is not really famous outside of statistics. Textbooks motivate it in different ways and often introduce an auxiliary function called "score" that quickly makes the calculations work out. The definition of "score" uses the fact that

$$\frac{\partial}{\partial\theta} \ln(f(\theta)) = \left(\frac{\partial}{\partial\theta} f(\theta)\right) / f(\theta)$$

to switch from likelihoods to relative likelihoods.
The entries of the Fisher information matrix are terms of the form

$$J_{i,j}(\theta) = \int_x f(x;\theta) \left(\frac{\partial}{\partial\theta_i} \ln f(x;\theta)\right) \left(\frac{\partial}{\partial\theta_j} \ln f(x;\theta)\right) dx$$

where $\theta$ is our vector of parameters (set at their unknown true values that we are trying to estimate), $x$ ranges over all possible measurements and $f(x;\theta)$ reads off the likelihood of observing the measurement $x$ given the parameter $\theta$.

Fisher information is actually a simpler concept than the other forms of information. The entries in the Fisher information matrix are merely the expected values of the effect of each pair of parameters on the relative likelihood of different observations. In this case, it is showing how alterations in the unknown parameters would change the relative likelihood of different observed outcomes. It is then fairly clever (but not too surprising) that its inverse can then read off how changes in observed outcome influence estimates of the unknown parameters. The Cramer-Rao inequality is using Fisher information to describe properties of an inverse (recovering parameters from observed data) without needing to know the specific inversion process (how we performed the estimate).

6.2.2 Calculating Cramer-Rao on the Variance of Variance Estimate

When attempting to measure the variance of individual apples (Experiment-B) our data was two sums of random variables (each $x_i$ or $y_i$ representing a single apple):

$$T_1 = \sum_{i=1}^{n_1} x_i \qquad T_2 = \sum_{i=1}^{n_2} y_i$$

$n_1, n_2$ can be any positive integers. Under our assumption that the weight of apples is normally distributed with mean weight $a_w$ and variance $v_a$ we can write down the probability density for any pair of measurements $T_1, T_2$ as:

$$f(T_1, T_2; v_a) = \frac{1}{2\pi v_a \sqrt{n_1 n_2}}\, e^{-(T_1 - n_1 a_w)^2/(2 n_1 v_a) - (T_2 - n_2 a_w)^2/(2 n_2 v_a)}.$$
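Before working through the bound, the setup above can be simulated directly. The sketch below is an illustration under stated assumptions rather than the procedure of Section 3.1.2: it takes $n_1 = n_2 = n$, treats the mean weight $a_w$ as known (as the density above does), and uses a hypothetical estimator that averages the two squared deviations of the basket totals. The parameter values are arbitrary.

```python
import numpy as np

def variance_estimate_mse(n, a_w=150.0, v_a=100.0, trials=20_000, seed=0):
    """Monte Carlo mean squared error of a simple variance estimate.

    Each trial draws two basket totals T1, T2 (n apples each) and forms
    v_hat = ((T1 - n*a_w)^2 / n + (T2 - n*a_w)^2 / n) / 2,
    which is unbiased for v_a when a_w is known.
    """
    rng = np.random.default_rng(seed)
    sd = np.sqrt(n * v_a)  # a sum of n weights has variance n * v_a
    T1 = rng.normal(n * a_w, sd, size=trials)
    T2 = rng.normal(n * a_w, sd, size=trials)
    v_hat = ((T1 - n * a_w) ** 2 / n + (T2 - n * a_w) ** 2 / n) / 2
    return float(np.mean((v_hat - v_a) ** 2))

if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        # The error does not fall as the baskets get larger.
        print(n, variance_estimate_mse(n))
```

For this particular estimator the mean squared error stays near $v_a^2$ however large $n$ is, because the two totals supply only two effective observations of the spread no matter how many apples each basket holds.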
To apply the Cramer-Rao inequality we need the Fisher information of this distribution, which is defined as:

$$J(v_a) = \int_{T_1,T_2} f(T_1, T_2; v_a) \left(\frac{\partial}{\partial v_a} \ln f(T_1, T_2; v_a)\right)^2 dT_1\, dT_2.$$

The first step is to differentiate $\ln f$ with respect to $v_a$. Since

$$\ln f(T_1, T_2; v_a) = -\ln\left(2\pi v_a \sqrt{n_1 n_2}\right) - \frac{(T_1 - n_1 a_w)^2}{2 n_1 v_a} - \frac{(T_2 - n_2 a_w)^2}{2 n_2 v_a}$$

we have

$$\frac{\partial}{\partial v_a} \ln f(T_1, T_2; v_a) = A_1 + A_2 - \frac{1}{v_a}, \qquad A_i = \frac{(T_i - n_i a_w)^2}{2 n_i v_a^2}.$$

We first compute the expected value of $(A_1 + A_2)^2$:

$$\begin{aligned}
\int_{T_1,T_2} f\,(A_1 + A_2)^2\, dT_1\, dT_2
&= \frac{1}{4 v_a^4} \int_{T_1,T_2} f(T_1, T_2; v_a)\,(T_1 - n_1 a_w)^4 / n_1^2\; dT_1\, dT_2 \\
&\quad + \frac{1}{4 v_a^4} \int_{T_1,T_2} f(T_1, T_2; v_a)\, 2 (T_1 - n_1 a_w)^2 (T_2 - n_2 a_w)^2 / (n_1 n_2)\; dT_1\, dT_2 \\
&\quad + \frac{1}{4 v_a^4} \int_{T_1,T_2} f(T_1, T_2; v_a)\,(T_2 - n_2 a_w)^4 / n_2^2\; dT_1\, dT_2 \\
&= \frac{1}{4 v_a^4 n_1^2} \int \Phi_{\sqrt{n_1 v_a}}(x; n_1 a_w)(x - n_1 a_w)^4\, dx \\
&\quad + \frac{2}{4 v_a^4 n_1 n_2} \left(\int \Phi_{\sqrt{n_1 v_a}}(x; n_1 a_w)(x - n_1 a_w)^2\, dx\right) \left(\int \Phi_{\sqrt{n_2 v_a}}(x; n_2 a_w)(x - n_2 a_w)^2\, dx\right) \\
&\quad + \frac{1}{4 v_a^4 n_2^2} \int \Phi_{\sqrt{n_2 v_a}}(x; n_2 a_w)(x - n_2 a_w)^4\, dx
\end{aligned}$$

where $\Phi()$ is the standard single variable normal density:

$$\Phi_\sigma(x; \mu) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu)^2/(2\sigma^2)}.$$

The first term involves the 4th moment of the normal, and it is known that:

$$\int \Phi_\sigma(x; \mu)(x - \mu)^4\, dx = 3 \left(\int \Phi_\sigma(x; \mu)(x - \mu)^2\, dx\right)^2.$$

It is also a standard fact about the normal density that

$$\int \Phi_\sigma(x; \mu)(x - \mu)^2\, dx = \sigma^2.$$

So we have

$$\int_{T_1,T_2} f\,(A_1 + A_2)^2\, dT_1\, dT_2 = \frac{1}{4 v_a^4} \left(\frac{3 n_1^2 v_a^2}{n_1^2} + 2\left(\frac{n_1 v_a}{n_1}\right)\left(\frac{n_2 v_a}{n_2}\right) + \frac{3 n_2^2 v_a^2}{n_2^2}\right) = \frac{2}{v_a^2}.$$

The same variance fact gives the expected value of $A_1 + A_2$ as $\frac{1}{2 v_a} + \frac{1}{2 v_a} = \frac{1}{v_a}$, so including the $-1/v_a$ term (which comes from the normalizing factor of the density) yields

$$J(v_a) = \frac{2}{v_a^2} - \frac{2}{v_a} \cdot \frac{1}{v_a} + \frac{1}{v_a^2} = \frac{1}{v_a^2}.$$

Finally we have the Fisher information $J(v_a) = \frac{1}{v_a^2}$. We can then apply the Cramer-Rao inequality, which says that $E[(v_a - \hat{v})^2] \geq 1/J(v_a)$ for any unbiased estimator $\hat{v}$ of $v_a$ (unbiased meaning $E[v_a - \hat{v}] = 0$), no matter how we choose $n_1$ and $n_2$. The theory is telling us that the unknown parameter $v_a$ has such a sloppy contribution to the likelihood of our observations that it is in fact difficult to pin down the value from any one set of observations. In our case we have just shown that $E[(v_a - \hat{v})^2] \geq v_a^2$, which means no estimation procedure that uses just a single instance of the totals $T_1, T_2$ can reliably estimate the variance $v_a$ of individual apple weights.

6.2.3 Calculating Cramer-Rao Inequality on Multiple Mean Estimates

In Experiment-C we again have two baskets of fruit, but they contain apples and oranges in the proportions given by $a_1, o_1, a_2, o_2$. Our assumption that the individual fruit weights are normally distributed with means $a_w, o_w$ and common variance $v$ lets us write the joint probability density of the total measurements $T_1, T_2$ in terms of the normal density ($\Phi()$). For our problem, where the variables are the sums $T_1, T_2$ and we have two parameters (the two unknown means $a_w, o_w$) and a single per-fruit variance $v$, we will use the two dimensional normal density:

$$\Phi_{\sqrt{nv}}(T_1, T_2; a_w, o_w) = \frac{1}{2\pi n v}\, e^{\left(-(T_1 - a_1 a_w - o_1 o_w)^2 - (T_2 - a_2 a_w - o_2 o_w)^2\right)/(2nv)}.$$

We concentrate on the variables $T_1, T_2$ and will abbreviate this density (leaving implicit the important parameters $a_w, o_w, v$) as $\Phi(T_1, T_2)$. From this we can read off the difficulty in estimating individual apple weight:

$$\begin{aligned}
J_{1,1}(a_w, o_w) &= \int_{T_1,T_2} \Phi(T_1, T_2) \left(\frac{\partial}{\partial a_w} \ln \Phi(T_1, T_2)\right) \left(\frac{\partial}{\partial a_w} \ln \Phi(T_1, T_2)\right) dT_1\, dT_2 \\
&= \int_{T_1,T_2} \Phi(T_1, T_2)\, \frac{\left(2 a_1 (T_1 - a_1 a_w - o_1 o_w) + 2 a_2 (T_2 - a_2 a_w - o_2 o_w)\right)^2}{4 n^2 v^2}\, dT_1\, dT_2 \\
&= \frac{a_1^2 + a_2^2}{nv}.
\end{aligned}$$

The first step is using the fact that $\frac{\partial}{\partial x} \ln e^{-f(x)^2} = -2 f(x) \frac{\partial}{\partial x} f(x)$. The last step is using a number of fundamental facts about the normal density:

$$\int_x \Phi_\sigma(x; \mu)\, dx = 1$$
$$\int_x \Phi_\sigma(x; \mu)(x - \mu)\, dx = 0$$
$$\int_x \Phi_\sigma(x; \mu)(x - \mu)^2\, dx = \sigma^2.$$
These facts allow us to say that the so-called "cross terms" (like $(T_1 - a_1 a_w - o_1 o_w)(T_2 - a_2 a_w - o_2 o_w)$) integrate to zero and the square terms read off the variance. One of the reasons to assume a common distribution (such as the normal) is that almost any complicated calculation involving such distributions (differentiating, integrating) can usually be reduced to looking up a few well known facts about the so-called "moments" of the distribution, as we have done here. Of course, picking a distribution that accurately models reality takes precedence over picking one that eases calculation.

The other entries of the Fisher information matrix can be read off as easily and we derive:

$$J(a_w, o_w) = \frac{1}{nv}\begin{pmatrix} a_1^2 + a_2^2 & a_1 o_1 + a_2 o_2 \\ a_1 o_1 + a_2 o_2 & o_1^2 + o_2^2 \end{pmatrix}.$$

Substituting our "typical" values of $a_1, o_1, a_2, o_2$ from Section 3.2 we have

$$J(a_w, o_w) = \frac{1}{2v}\begin{pmatrix} n + 4 & n - 4 \\ n - 4 & n + 4 \end{pmatrix}.$$

At first things look good. The $J(a_w, o_w)$ entries are growing with $n$, so we might expect the entries of $J^{-1}(a_w, o_w)$ to shrink as $n$ increases. However, the entries are all nearly identical, so the matrix is ill-conditioned and we see larger than expected entries in the inverse. In fact in this case we have:

$$J^{-1}(a_w, o_w) = \frac{v}{8}\begin{pmatrix} 1 + 4/n & -1 + 4/n \\ -1 + 4/n & 1 + 4/n \end{pmatrix}$$

and these entries are not tending to zero, establishing (by the Cramer-Rao inequality) the difficulty of estimation.

6.2.4 Cramer-Rao Inequality Holds in General

By inspecting our last series of arguments, we can actually say a bit more. The difficulty in estimation was not due to our specific assumed values of $a_1, o_1, a_2, o_2$, but rather to the fact that the coin-flipping process we described earlier will nearly
always land us in about as bad a situation for large $n$. We can see that the larger the differences $|a_1 - o_1|$ and $|a_2 - o_2|$, the better things are for estimation. For a fair coin, however, the law of large numbers (and, more precisely, the central limit theorem) tells us that as $n$ increases the differences $|a_1 - o_1|$ and $|a_2 - o_2|$ are, with high probability, only of order $\sqrt{n}$, which is the size we used in our "typical case." This means that it would be very rare (for large $n$) to see differences among $a_1, o_1, a_2, o_2$ much larger than we saw in our "typical case." This lets us conclude that if there is no constructive bias then for large $n$ estimation is almost always as difficult as the example we worked out.

Now if there were any constructive bias in the experiment (such as apples being a bit more likely in the first basket and oranges a bit more likely in the second basket) then the entries of $J^{-1}()$ would be forced to zero and the explicit estimate we gave earlier would in fact have shrinking error as $n$ grew large. However, only the fraction of the data we can attribute to the bias is really helping us (so if it was, say, a 1/10th bias, only about 1/10th of the data is useful to us) and we would need a lot of data to experience lowered error (but at least the error would be falling). The point is that the evenly distributed portion of the data is essentially not useful for inference, and that is why it is so important to be inferring things that the experiment was designed to measure (and why the limit on channel identifiers is bad, since it limits the number of things we can simultaneously design for).
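The contrast between the unbiased and biased cases can be sketched with a small simulation. This is an illustration only, under assumptions not fixed by the text: the fruit counts are held at assumed typical values ($n/2 \pm \sqrt{n}$ for a fair coin, $np$ and $n(1-p)$ for a biased one), basket totals are drawn from the normal model with per-fruit variance $v$, and the function names and parameter values are hypothetical.

```python
import numpy as np

def simulated_apple_mse(a1, o1, a2, o2, a_w=150.0, o_w=200.0, v=100.0,
                        trials=20_000, seed=0):
    """Monte Carlo squared error of the solved-form estimate of a_w.

    The counts a1, o1, a2, o2 are held fixed; only the basket totals
    T1, T2 vary from trial to trial.
    """
    rng = np.random.default_rng(seed)
    sd = np.sqrt((a1 + o1) * v)  # each basket total has variance n * v
    T1 = rng.normal(a1 * a_w + o1 * o_w, sd, size=trials)
    T2 = rng.normal(a2 * a_w + o2 * o_w, sd, size=trials)
    a_hat = (o2 * T1 - o1 * T2) / (a1 * o2 - a2 * o1)
    return float(np.mean((a_hat - a_w) ** 2))

def fair_coin_counts(n):
    """Assumed typical counts with no constructive bias."""
    s = np.sqrt(n)
    return n / 2 + s, n / 2 - s, n / 2 - s, n / 2 + s

def biased_coin_counts(n, p):
    """Typical counts when basket 1 gets an apple with probability p."""
    return n * p, n * (1 - p), n * (1 - p), n * p

if __name__ == "__main__":
    for n in (400, 10_000, 1_000_000):
        fair = simulated_apple_mse(*fair_coin_counts(n))
        biased = simulated_apple_mse(*biased_coin_counts(n, 0.25))
        # fair stays near v/8; biased falls roughly like 5v/(2n)
        print(n, fair, biased)
```

With these settings the fair-coin error hovers around $v/8$ at every $n$, while the $p = \frac{1}{4}$ error falls in proportion to $1/n$, matching the two inverse matrices derived in the text.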
