Monte Carlo Null Models for Genomic Data

Statistic al Scienc e 2015, V ol. 30, No. 1, 59–7 1 DOI: 10.1214 /14-STS484 c  Institute of Mathematical Statisti cs , 2015 Monte Ca rlo Null Mo dels fo r Genomic Data Egil F erkingstad, La rs Holden and Geir Kjetil Sandve Abstr act. As increasingly complex h yp othesis-t esting scenarios are considered in man y scient iﬁc ﬁelds , analytic deriv ation of n ull distribu - tions is often out of reac h. T o the rescue comes Mon te Carlo testing, whic h ma y app ear dec eptive ly simple: as long as y ou can sample test statistics u nder th e null hyp othesis, the p -v alue is ju st the prop ortio n of sampled test statistics that exceed the observed test statistic. Sampling test st atistics is often simple on ce y ou ha ve a Mo nte Carlo null mo del for your data, and deﬁning some f orm of rand omization pr o cedure is also, in man y cases, relativ ely straigh tforw ard. Ho wev er, th ere ma y b e sev eral p ossible c hoices of a randomization null mo d el for the d ata and no clear-cut criteria for c ho osing among them. Ob viously , diﬀerent null mo dels ma y lead to v ery diﬀeren t p -v alues, and a v ery lo w p -v alue ma y th us o ccur due to the inadequ acy of the c hosen null mo d el. It is pr efer- able to use assumptions ab out the un derlying rand om data generatio n pro cess to guide selection of a n u ll mo del. In man y cases, w e ma y order the n ull mo d els b y increasing preserv ation of the data c haracteristics, and w e argu e in this pap er that th is ord ering in most cases give s in- creasing p -v alues, that is, lo wer signiﬁcance. W e denote this as the nul l c omplexity principle . The principle giv es a b ett er understandin g of the diﬀeren t n ull mo dels and ma y guide in the c hoice b et ween th e diﬀerent mo dels. Key wor ds and phr ases: Mon te Carlo methods, h yp othesis testing, genomics. Egil F erkingstad is R ese ar ch S cholar, Scienc e Institute, University of Ic eland, Dunhaga 5, 107 R eykjavik , Ic eland, and Norwe gian Computing Center, Gaustadal le en 23B , 0373 Oslo, Norway e-mail: e gil@hi.is . L ars Holden is Managing Dir e ctor, Norwe gian Computing Center, Gaustadal le en 23B, 0373 Oslo, N orway e-mail: L ars.Holden@nr.no . Geir Kjetil Sandve is Asso ciate Pr ofessor, Dep artment of Informatics, University of Oslo, Gaustadal le en 23B, 0373 O s lo, Norway e-mail: geirksa@iﬁ.uio.no . This is an electronic repr int of the or iginal article published by the Ins titute of Mathematical Statistics in Statistic al Scienc e , 2015, V ol. 30 , No. 1, 59–71 . This reprint diﬀers fro m the orig inal in pagination and t yp ogr aphic detail. 1. INTRODUCTION Increasingly , Mon te Carlo metho ds are needed to pro vide answers to imp ortan t s cientiﬁc ques- tions, particularly in the rapidly ad v ancing ﬁeld of genomics. F or b etter or wo rse, these questions are often framed within the formalism of statisti- cal hyp othesis testing. In man y cases, Monte C arlo h yp othesis testing tec hniqu es such as p erm utation testing are the only options. Conceptually , these methods share an a pp ealing clarit y: As long as y ou can sample test statistics un- der the n ull h yp othesis, the p -v alue is ju st the pro- p ortion of sampled test statistics that exceed the observ ed test statistic. O ne of our main aims is to sho w that the apparen t s implicit y of randomization h yp othesis testing can b e ve ry deceptiv e. I n th e fol- lo w in g, we use nul l mo del as a general term for the 1 2 E. FERKI NGST AD, L. H OLDEN A ND G. K. SANDVE distribution of the r esampled data (e.g. , us ing ran- dom p ermutatio ns), and we use nul l distribution to denote the distribution of the test statistic un der the n ull mo del. Even though th er e is a highly de- v elop ed theory of classical hyp othesis testing (e.g., Lehmann and Romano ( 2005 )), n ew practical and metho dological p roblems app ear when w e need to resort to Mon te Carlo testing: • The question of interest ma y b e una vo idably v ague, so that it is not obvio us how to translate it in to a precise mathematical formulat ion. • There ma y b e sev eral p ossible choice s of a ran - domization n ull mo del and no clear-cut criteria for c ho osing among them (exce pt p ossibly co nser- v ativ eness argument s for c ho osing the null mo del giving the largest p -v alues). • A full sp eciﬁcatio n of the null hyp othesis consists of b oth the null mo del and the question of in- terest. This complicates the interpretati on o f a rejection of the n ull hyp othesis—the question of in terest ma y not really hav e b een answ ered if the n ull mo del is inadequate. • There may b e sev eral p ossible c h oices of test statistic and no clear-cut criteria f or choosing one (except p ossibly p ow er considerations). If unresolv ed, th ese problems ma y degrade the re- pro du cibilit y and tran s parency of in vesti gations, as w ell as le ad t o false researc h ﬁnd ings. There has lately b een an in creasing fo cus on h ow to make science more r epro du cible, esp ecially in the ﬁeld of computational b iology (Ioannidis et al. ( 2008 ); Noseda and McLean ( 2008 ); Mesirov ( 2010 ); S andve et al. ( 2013b )). Also, due to th e increased prev a- lence of data-driv en science (Kell and Olive r ( 20 04 )) through increased a v ailabilit y of pub lic data and more accessible and eﬃcien t analytical to ols, there has also b een a heated discu s sion on whether a large prop ortion of p u blished researc h ﬁnd ings are false (Ioannid is ( 2005 ); Goo dman and Greenland ( 2007 )). W e discuss this topic further in the re- mainder of this pap er. Our main application of in- terest is genomics and the Genomic Hyp erBro wser ( Sandve et al. , 2010 , 2013a ) where choosing the cor- rect n ull mo del is a ma jor issu e. W e ha ve discussed n ull mo dels in ecology in a companion rep ort, F er- kingstad, Holden and Sand v e ( 2013 ). Sev eral ex- amples sh ow that the choice of a null mo del can strongly aﬀect the resulting p -v alues. W e state that ordering the n ull mod els according to increasing preserv ation ma y imply an ordering of the statis- tical signiﬁcance. F urther, if the null mo d els are n ot able to capture the essentia l stru ctur al pr op erties of data, this ma y lea d to false ﬁ ndings. W e pro cee d as follo w s : Section 2 discusses gen- eral pr oblems of randomization n ull mo dels. Sec- tion 3 present s null mo d el p reserv ation hierarchies and signiﬁcance ord erings. Sections 4 – 6 illustrate sev eral diﬀerent n ull mo d els within genomics: Sec- tion 4 considers null mo d els for the location of tr an- scription factor bind ing sites, Section 5 sho ws th at genetic p rop erties ha v e a tend ency to cluster along the genome, w hile Section 6 illustrates th at we ma y get false rejections with to o simple null mo dels usin g sim ulated data of p oints and segments in genomic trac ks. Finally , Section 7 p ro vides a general discus - sion and s ome concluding remarks and r ecommen- dations. F or the genomics case studies describ ed in Sec- tions 4 – 6 w e hav e used q -v alues (Storey ( 2002 )) to correct for multiple testing. Assume that w e test m h yp otheses where p (1) ≤ p (2) ≤ · · · ≤ p ( m ) are the or- dered, observ ed p -v alues, R is th e n um b er of rejected n ull hyp otheses, and V is the (unkn o wn) n umb er of falsely rejected n ull hyp otheses. The false disco v- ery rate (FDR) (B enjamini and Ho c hberg ( 1995 )) is then deﬁned as FDR = E( V /R ). F or eac h test, the corresp ond ing q -v alue is deﬁn ed as the minimum FDR at which the test is called signiﬁcant . Let π 0 b e the pr op ortion o f tests th at are truly n ull (Lan- gaas, Lindqvist and F erkin gstad ( 2005 )) and q ( i ) the q -v alue for th e test with p -v alue p i . Th en, we ma y estimate q ( i ) b y ˆ q ( i ) = min i ≤ j ≤ m m ∗ ˆ π 0 ∗ p ( j ) /j, where ˆ π 0 is an estimate of π 0 . Thus, the main in- puts to this m ultiple testing m etho d are the ob- serv ed p -v alues together with an estimate of π 0 . T o estimate π 0 , w e h a v e used the robus t estimator of P ounds and Chen g ( 2006 ), since this is v ery com- putationally eﬃcien t and can b e sho wn to b e con- serv ativ e in man y realistic settings. F or a general discussion of m ultiple-testing issues in Mon te Carlo settings, see also Sandve, F erkingstad and Nyg ˚ ard ( 2011 ). All calculations were p erf ormed using the R programming language (R Dev elopmen t Core T eam ( 2011 )) and the Genomic Hyp erBro wser. A Galaxy P ages (Go ec ks et al. ( 2010 )) do cumen t allo wing for replication of the results is a v ailable at http s:// hyperbro wser. uio.no/suppnu llmodels . MONTE CAR LO NU LL MODELS 3 2. RANDOMIZA TION NULL MODELS Consider a h yp othesis test based on d ata X and a test statist ic T = T ( X ). Wi thout loss of generali t y , w e ma y assume that large v alues of T constitute ev- idence against H 0 . Then, for an observ ed test stat is- tic T = t , the decision to accept or reject H 0 can b e based on the p -v alue p = F 0 ( T ≥ t ), where w e reject H 0 if p < α for some thresh old α and where F 0 is the distribution of T und er the n ull mo del P 0 . If P 0 is false, T h as distrib ution F 1 . In the classic textb o ok setting, the n ull mo del is known and can b e describ ed explicitly , so w e can d ir ectly compu te the p -v alue. Increasingly , b oth data and m o dels are to o complex for this to b e done. In suc h case s w e must r esort to some type of Mon te Carlo randomization test: we g enerate s am- ples T i = t i , i = 1 , . . . , n of the test statistic T under the n ull model and estimate the emp irical p -v alue from the data set X by ˆ p X,e ( t ) = 1 n n X i =1 I ( t i ≥ t ) , (1) where t is the observed test statistic and I ( · ) denotes the ind icator function, equal to one if its argumen t is tru e or zero if false. The idea of randomizat ion testing has b een around at least since the p ioneering w ork of Fisher ( 1935 ), bu t has only b ecome p ractical with the adv ent of electronic computers. F or a r ecen t o verview of Mon te C arlo m etho ds, see Manly ( 2007 ). The ran d omization null mo del is arguably the most crucial comp onen t of th e Mon te Carlo testing setup. Often, the researc h question and ev en the test statistics m a y b e clear, but ho w should one sp ecify the null m o del? Sandve et al. ( 2010 ) int ro d uce the idea of n ull mo del preserv ation hierarc hies and note that “a cru cial asp ect of an inv estigation is the pr e- cise formalization of the n ull mo del, whic h sh ould reﬂect the com bination of sto c hastic and sele ctiv e ev en ts that constitutes the ev olution b ehind th e ob- serv ed genomic feature. [. . . ] Unrealistica lly simple n ull mo dels ma y [. . . ] lead to false p ositiv es.” Here, w e build f urther on these ideas and provide a con- ceptual framew ork to aid the c hoice of null mod el. In the statistics literature, the m ost directly rele- v an t previous pap ers on n ull mo d els are Efron ( 2004 ) and Bic k el et al. ( 2010 ). Efron ( 2004 ) estimate s the n ull mo del f rom data in multiple-te sting prob lems, giving an “empirical null.” This is v ery useful for some m ultiple-testing settings, but n ot d irectly ap- plicable to the problems w e study here. Bic k el et al. ( 2010 ) pr op ose sub sampling metho ds based on a piecewise stationary mo del for genome sequ en ces, a p otent ially useful appr oac h for our case stud y in Section 4 , but w hic h w e f eel would b e b eyond the scop e of this pap er. There is also relev an t work fr om other disciplines. P articularly , n ull mod els ha ve b een a very conteste d issue within ecology , as further discussed in F er- kingstad, Holden and S andve ( 2013 ). F or exam- ple, Gotelli ( 2000 ) p oin ts out th at “the analysis of presence–absence matrices with null mo del r an d om- ization tests has b een a ma jor source of con tro ve rsy in communit y ecolog y for o v er tw o decades.” See also the b o ok b y Gotelli and Gra v es ( 1996 ) and Manly [( 2007 ), Chapter 14], who notes that “one of the in teresting asp ects of this [sp ecies comp etitio n problem] is the d iﬃcult y in deﬁn ing the approp r i- ate mo d el o f randomness” (page 348) . F ortin and Jacquez ( 2000 ) discuss r andomization tests f or sp a- tially autocorr elated data. As discussed elsewhere in this p ap er, genomics is another area where th e prob- lem of choosing the righ t n ull mod el is ve ry ur gen t (Sandve et al. ( 2010 )). Bic kel et al. ( 2010 ) note that “a common question ask ed in many applicatio ns is the follo w in g: Giv en the p osition v ectors of t wo fea- tures in the genome [. . . ] and a measure of relat- edness b et w een features [. . . ] ho w signiﬁcan t is the observ ed v alue of the measure? How do es it compare with that w hic h migh t b e observ ed ‘at random?’ The essen tial c hallenge in the statistical formulatio n of this problem is the appropr iate mo delling of ran- domness of th e genome, sin ce w e obs er ve only one of the m ultitudes of p ossible genomes that ev olution migh t h a v e p ro du ced for our an d other sp ecie s.” See Kallio et al. ( 2011 ) for a general d iscussion of the im- p ortance of n ull mo dels within bioinformatics. R e- lated w ork has also b een done within the ﬁeld o f data mining; see Gionis et al. ( 2007 ), Hanhij¨ arvi, Garriga and Pu olam¨ aki ( 2009 ). Lijﬃjt et al. ( 2014 ) consider th e related problem of estimating the leve l of pr eserv ation n eeded to attain a pr esp eciﬁed sig- niﬁcance lev el α (for example, α = 0 . 05). 3. PRESERV A T ION AND SIGNIFICANCE ORDERINGS By assumption, the data set X is take n as giv en, that is, it is not considered to b e a random sample from some p opulation. In order to test our h yp oth- esis, we need to randomize X from a null mo d el P 0 . In man y cases some sp eciﬁc features of X will 4 E. FERKI NGST AD, L. H OLDEN A ND G. K. SANDVE need to b e preserve d. In a sp eciﬁc problem, it ma y b e v ery diﬃcult to d ecide what features are funda- men tal and whic h are not. I f w e attempt to con- serv e all p ossible f eatures of th e observ ed X , we are left with X itself and n o basis for p erforming the h yp othesis test. If w e conserv e to o little, we gener- ate realizations that violate basic p rop erties of the phenomenon u nder study . Diﬀeren t null mo dels ma y preserve diﬀeren t pr op erties of X , for example, n ull mo del P 0 preserve s p rop erties Q and R and n ull mo del P 1 preserve s prop erties R and S . But qu ite often w e ma y order the n ull mo dels according to increasing preserv ation of the prop erties of X . W e describ e tw o diﬀerent alte rn ative descrip tions of or- dering of preserv ation of the null mo dels: A. Let P 0 denote the state s pace obtained b y a set of resamplings (for example, permutations) that are allo wed u nder a giv en n ull mo del. T hat is, the state space is the set of all p ossible com bin ations of v alues of v ariables in the s to c hastic mo del. W e deﬁne a preserv ation hierarch y if the f ollo wing criteria are s atisﬁed: P (1) 0 ⊂ P (2) 0 ⊂ · · · ⊂ P ( n ) 0 . W e then state that P ( i ) 0 preserve s more than P ( i +1) 0 for i = 1 , 2 , . . . , n − 1 of the prop erties of the orig- inal data set X and hence is m ore restricted. As w e will discuss fu rther b elo w, a more r estricted n ull mo del will in most cases giv e less signiﬁ can t results, that is, p -v alues from P ( i ) 0 will tend to b e larger than p -v alues f r om P ( j ) 0 if P ( i ) 0 ⊂ P ( j ) 0 . Note that w e only consider Monte Carlo n u ll models, that is, n ull mo dels that are generated b y resam- pling from the observed data (as in p ermutation testing), and that the P ( i ) 0 are sets of allo we d r e- samplings und er H 0 —they are not sets of allo wed parameter v alues. B. Let X = ( X 1 , X 2 , . . . , X n ) denote a state in the state space and let the n ull mod el b e deﬁn ed b y a s et of allo wed p ermutations of the X i ’s. Deﬁne X i = 1 for a certain prop erty in base pair i and otherwise X i = 0 . Assu m e further that th e test statistic T is giv en b y T = 1 n X i y i X i (2) for a ﬁxed v ector y = ( y 1 , y 2 , . . . , y n ). W e trivially ha v e E( T ) = 1 n X i y i E( X i ) and V ar( T ) = 1 n 2 X i X j y i y j Co v ( X i , X j ) . W e assume the stationary criteria E( X i ) = λ and V ar ( X i ) = σ 2 are indep enden t of i . Assume Co v ( X i , X j ) is p ositiv e for | i − j | small and decreases w ith incr easing distance | i − j | , sa y , Co v ( X i , X j ) = σ 2 ρ ( | i − j | ) , for some decreasing, p ositiv e correlation function ρ . The co v ariance is smaller in n u ll mo del P (1) than P (2) if the cor- resp ond ing correlation functions satisfy ρ (1) ( d ) ≥ ρ (2) ( d ) f or all d > 0. This imp lies that the more the p erm utation pr eserv es of Co v ( X i , X j ) for | i − j | small, the larger is V ar( T ). Here w e ma y deﬁne a sequence of null mo dels with decreas- ing Co v ( X i , X j ) for all distances | i − j | , implying larger v alues of V ar( T ) . In m ost cases it is rea- sonable to also assume that E( T ) is the same f or all the n ull mo dels. Cases A and B ma y b oth b e satisﬁed at the same time. In Section 5 we argue that it is t yp ical for ge- nomic data of certain typ es to satisfy the crite ria in case B, that is, Co v( X i , X j ) is p ositi ve for | i − j | small and decreases with increasing distance | i − j | . In this case, we make assu mptions directly on the test statistic T wh ic h in dicate larger empirical p - v alues [see deﬁnition ( 1 )] the more we preserve of the original d ata X . By assumption, large v alues of T in d icate e vidence against the h yp othesis H 0 . A larger v alue of V ar( T ) implies under quite general statistica l assu m ptions th at a larger fraction of the realizatio ns ha v e a test statistic T i > T (pro vided the n umb er of r ealizat ions are su ﬃcien tly large), lead- ing to larger p -v alues. Also, in case A, an increasing state space will in most cases lead to an increase in V ar( T ). The relationship b et wee n preserv ation and signiﬁ- cance is the same observ ation as in Hanhij¨ arv i, Gar- riga and Puolam¨ aki ( 2009 ), “ob viously , the m ore restricted the n ull hyp othesis [. . . ] the less signiﬁ- can t the r esu lts of a data min ing algorithm tend to b e.” W e will call this obs er v ation the nul l c omplexity principle . The null complexity p r inciple ma y b e an aid in c h o osing the correct level of p reserv ation in the null mo del, as w ell as in inte rpr etation of the r esults. Since the null complexit y principle do es not alw ays hold, it is necessary to demonstrate it f or the prob- lem un der stu dy . If this p rop erty is pr o ved f or the MONTE CAR LO NU LL MODELS 5 n ull mo dels applied, then this is ve ry u seful infor- mation when choosing a null mo del. F or exa mple, a scien tist w ishing to b e conserv ativ e ma y c h o ose the n ull mo del known a priori to giv e the largest p - v alues. Also, some Mon te Carlo null mod els ma y b e considerably more computationally demanding than others. Then, w e ma y ﬁrst test a null mo d el h a ving lo w computational cost. If w e reject th e hypothesis using this mo d el, we will also reject the hypothesis for less conserv ativ e (a nd more computationall y in- tensiv e) n u ll mo d els. The ord ering of the p -v alues imply th at to o simple null mo d els ma y lead to false p ositiv es, as conjectured in S andve et al. ( 2010 ). Our concepts of n ull mo d els and preserv ation may b e illustrated b y the follo wing simple example. As- sume w e ha ve to ssed a coin N ≫ 100 times and we question whether the observ ed prop ortion of heads in the b eginning of the sequence is signiﬁcantl y larger than 0 . 5. W e w an t to allo w for the p ossib ilit y of c oins tosses b eing correlated. W e use the num- b er of heads in the ﬁrst 100 coin tosses as the test statistic. W e use tw o diﬀerent null mo d els. In null mo del 1 w e assume that the coins are indep enden t of eac h other and ha v e a 50% pr obabilit y for heads, so we can p ermute the observed coin tosses freely to sample from the n ull mo d el. F or n ull mo d el 2 we p er- m ute eac h sequence of 2 observ ations from the ob- serv ed N coins in order to main tain a p ossible corre- lation b et w een consecutiv e coins. T he seco nd mo del is more r estrictive and according to the n u ll com- plexit y prin ciple giv es larger p -v alues. If there is p os- itiv e correlation b et w een consecutiv e coins, this in- creases the v ariabilit y of the test statistics and hence increases the p -v alue. How ev er, if th ere is n egativ e correlation b et we en consecutiv e coins, this decreases the v ariabilit y of the test statistics and hence de- creases the p -v alue. The example also illustrates that the n ull complexit y principle often assumes p ositiv e correlations b etw een te rms in the test statistic. F or test statistics deﬁned on p oin t pro cesses (su ch as the examples in Section 4 ), this t ypically corresp onds to attraction b et ween p oin ts (correlation b et wee n consecutiv e int er-p oin t distances) . Int uitiv ely , it is easier to en vision mec hanisms lea ding to attraction than repu lsion (although these for sur e also exist). Our exp erience is that p ositiv e correlati ons (includ- ing attracti on in p oint pro cesses) are muc h more common than negativ e correlations (includ ing r e- pulsion) i n real data sets, whic h w e also show for a num b er of genomic data sets, represent ing sev eral classes of features, in Section 5 . 3.1 Ho w to Measure Clustering of P oints As we ha v e seen in case B ab ov e, in some cases it is imp ortant to p reserv e clustering of p oints, since this has imp ortant imp lications for the sizes of the resulting p -v alues. F ollo win g the notatio n deﬁned in the previous section, we may use the Ripley’s K - function (Ripley ( 1976 )) as a measure for clustering. This is deﬁned relativ e to a d istance t as K ( t ) = λ − 1 E(n umber of extra p oin ts within distance t of a randomly c h osen p oint) . T o simplify the notation, disr egard edge eﬀects b y assu m ing that there exist X − t − 1 , . . . , X 0 and X n +1 , . . . , X n + t from th e same pro cess as X 1 , . . . , X n . Then K ( t ) = ( nλ ) − 1 n X i =1 i + t X j = i − t j 6 = i P( X j = 1 | X i = 1) for in teger t . W e m ay w rite K ( t ) in terms of the correlation function ρ , as follo ws: K ( t ) = ( nλ ) − 1 n X i =1 i + t X j = i − t j 6 = i ( λ + λ − 1 Co v ( X i , X j )) = 2 t + σ 2 n − 1 λ − 2 n X i =1 i + t X j = i − t j 6 = i ρ ( | i − j | ) = 2 t + 2 σ 2 n − 1 λ − 2 n X i =1 t X j =1 ρ ( j ) = 2 t + 2 σ 2 λ − 2 t X j =1 ρ ( j ) . Using our earlier deﬁnition of clustering [ ρ (1) ( d ) ≥ ρ (2) ( d ) for all d > 0], this means that increased clus- tering implies increased K ( t ) for eac h t . Note that if X i and X j are indep enden t for i 6 = j , then K ( t ) = ( nλ ) − 1 n X i =1 i + t X j = i − t j 6 = i λ = ( nλ ) − 1 n X i =1 (2 tλ ) = 2 t. Therefore, w e ma y deﬁn e a scaled K -function, L ( t ), as follo ws: L ( t ) = K ( t ) / (2 t ) . 6 E. FERKI NGST AD, L. H OLDEN A ND G. K. SANDVE Then, L ( t ) < 1 corresp onds to repulsion b et w een p oints, L ( t ) = 1 to indep end en t p oin ts, while L ( t ) > 1 corresp onds to attraction b et wee n p oin ts. Assume that we hav e ob s erv ed X i = x i , i = 1 , . . . , n and wish to estimate ˆ L ( t ). T o simplify nota- tion, let x i = 0 for i < 1 and i > n . Then, we choose some v alue t = τ and estimate K ( τ ) by ˆ K ( τ ) = n − 1 ˆ λ − 2 n X i =1 i + τ X j = i − τ j 6 = i w − 1 ij x i x j , where ˆ λ = n − 1 n X i =1 x i and w ij = min(max( i, j ) , n ) − m ax(min( i, j ) , 1) max( i, j ) − min( i, j ) are w eigh ts that correct for edge eﬀects. Finally , L ( τ ) is estimated by ˆ L ( τ ) = ˆ K ( τ ) / (2 τ ) . 4. NULL MODELS F OR GENOMIC LOCA TIONS In this section we will sho w ho w to choose a null mo del when w e wan t to test whether the p oints in a p oint track are indep enden t of segments in a seg- men t trac k. Several null mo dels that h a v e preserv a- tion orderin gs according to b oth cases A and B in Section 3 are present ed. T h e resu lts are as exp ected, with more preserv ation giving larger p -v alues. A fully extended human c hromosome w ould b e ab out one meter long, consisting of ab out 3 billion base pairs. T he prop erties v ary along the genome and w e often divide the genome in to bins and p er- form s ep arate te sts for eac h bin. There are ab out 30,000 genes, represented as in terv als of base p airs or segmen ts in the terminology of Sandv e et al. ( 2010 ). T ranscription factors (TF) regulat e th e expression of genes b y bindin g to DNA in the spatial pro xim- it y of the genes th ey regulate, inte racting with the complex of proteins that transcrib es DNA to RNA (the transcriptional mac hinery ). As the DNA may form lo ops, spatial pro ximit y is not n ecessarily the same as pro ximit y a long th e s equence. A TF that binds to DNA ma y ther efore r egulate the expres- sion of a gene that is millions of base pairs a wa y from the bin ding site, and ma y ev en regulate genes on diﬀerent c h r omosomes (Visel, Rub in and Pen- nacc h io ( 2009 ); Ruf et al. ( 20 11 )). In higher organ- isms, s u c h as humans, transcription factor bindin g sites are organized in to mo dular u nits, often referred to as cis-regulato ry mo du les (CRM). These CRM usually comprise a few h un d red base pairs and are c h aracterized by a high lo cal frequency of bindin g for one or sev eral TFs (Berman et al. ( 2002 ); Zhou and W ong ( 2004 )). TFs that interact with the tran- scriptional mac h inery to increase the expr ession of genes at some distance from where the TFs bind to DNA are often r eferred to as enhancers, and the r e- gions o f DNA con taining suc h TF bindin g sites are often referred to a s enhancer regions. The TF are also segments of base pairs, bu t since th ese segmen ts usually are shorter than the genes, these are often represent ed as unmark ed p oin ts in the terminology of Sandv e et al. ( 2010 ). 4.1 Sp ecifying Details of Hyp othesis T ests: T ranscription F acto r Binding Relative to Genes In this section we will discu ss t wo null mo dels that ha v e a preserv ation ordering according to b oth cases A and B of Sectio n 3 . The results are as expected: more p reserv ation giv es larger p -v alues. W e only get rejection of the null hyp othesis when w e ha ve little preserv ation. This ma y b e due to a to o simple n u ll mo del. A v ery basic question relate d to th e p ositioning of transcription f actor binding sites (TFBS) is whether the binding sites of a giv en TF fall preferen tially in- side or ou tsid e genes. As a concrete example, w e con- sider binding sites for the transcription factor MitF (Strub et al. ( 2011 )) in relation to Ensem bl gene re- gions (F licek et al. ( 2012 )). W e aske d this qu estion lo cally along the genome, dividing the genome into bins and p erf orming on e s eparate test p er bin. As bins we used chromosome bands, whic h r epresent a common partition of c h romosomes int o regions of a few megabases. T o ensure a reasonable amount of data for th e tests, w e only considered chromosome bands con taining at least one gene and ﬁve TFBS, resulting in 73 b ins. S eparate tests w ere p erformed for eac h bin. Ho w can a hyp othesis test b e sp eciﬁed for this problem? Clearly , a natur al test statistic is the num- b er T of TFBS falling inside genes. F u r thermore, let n b e the to tal num b er of TFBS in the b in and p the prop ortion of the bin co ve red by ge nes. A nat- ural n ull mo del is th at TFBS are unif orm ly and MONTE CAR LO NU LL MODELS 7 indep en d en tly lo cated w ith in eac h bin. It is then easily seen th at the d istribution of th e test statis- tic is T ∼ Binomial( n, p ). There are other al terna- tiv es. F or instance, one might assum e that the TFBS are P oisson distributed within the bin . This wo uld preserve the underlying probabilit y of observing a TFBS instead of the exact coun t of observed TFBS, th us giving rise to a (sligh tly) diﬀerent null distri- bution. I n our opinion, when realiz ations are based on Mon te Carlo analysis, it is necessary to carefully study the prop erties of the n ull mo del. Mistak es are easily made if one directly w rites do wn the null dis- tribution of the test statistic. P erforming the b inomial test as describ ed ab ov e yields the conclusion th at there is preferentia l lo- cation insid e genes for 9 out of the 73 bins after m ultiple testing correctio n (at a 1 0% false d isco v - ery rate). T his could be taken as an indicati on of lo cal v ariation of an un derlying (mec hanistic) ten- dency of TFBS for the transcription factor MitF to b e lo cated inside gene regions. The TFBS may form clusters, denoted CRM, with t ypical length of a few hundred b ase pairs. This is a m uc h sm aller scale than the gene regions, which t yp- ically are sev eral thousand base p airs. The clustering of TFBS app ears to b e an in trinsic prop erty of the TFBS themselve s, and not a part of the TFBS–ge ne relation that is b eing tested. T h is suggests that at least some asp ects of clustering should b e preserved in the null mo del. T his is an example of case B of Section 3 , as can b e seen by letting X i = 1 for a TFBS in base pair i . Most of the clusters are ei- ther completely inside or completely outside a seg- men t, meaning that Co v ( X i , X j ) is larger for i and j close. If w e main tain this p ositiv e correlat ion in the n ull model, this giv es higher p -v alues. This is tested b y using tw o d iﬀeren t n ull mo dels. The ﬁrst mo del is th e n ull mod el describ ed ab o ve, where we only preserv e the tota l num b er of TFBS. In the sec- ond mo del the empir ical in ter-TFBS distances are preserve d in the n ull mo del b y only p ermuting these distances. This second mo del preserve s more of p osi- tiv e correlation in Co v ( X i , X j ). T hese tw o null mo d- els are in fact also an example of case A in Section 3 , since b oth null mo d els giv e a ﬁ nite state sp ace with equally lik ely s tates and the second n ull mo del is a subset of the ﬁrst one. The p -v alues from the t wo n ull mo dels are illustr ated in Figure 1 . W e see clearly that p reserving the emp irical inter-TFBS distances in the null mo d el giv es larger p -v alues. Some bins sho w v ery diﬀerent results b et wee n null mod els, for Fig. 1. Sc atter pl ot of p -values for the same test under two diﬀer ent nul l mo dels. example, at c h romosome band q25.1, where ind e- p end ent lo cation giv es a p -v alue less than 0.0 005, while p r eserv ation of in ter-TFBS d istance giv es a p -v alue of 0.1. T his is probably du e to strong corre- lation Co v ( X i , X j ) in this bin . When the empirical distribution of inter-TFBS distances is preserved, the null hypothesis is not rejecte d in any bin at 10% FDR, suggesting that the s igniﬁcan t ﬁnd ings u n der the un iformit y assumption ma y sim p ly b e due to inadequacy of the n ull mo del. 4.2 Deciding What Should Be Preserved in the Null Mo del: Randomizing Gene s Instead of T ranscription F acto rs Binding Sites In t his section there are t wo p airs of null m o d- els with p reserv ation ord ering according to case A in S ection 3 . T he p -v alues are ordered as exp ected: more preserv ation giv es larger p -v alues. In the ab o ve d iscussion, w e h a v e implicitly as- sumed that the T FBS distr ibution should b e sto c has- tic in the n ull mo d el, wh ile genes a re preserv ed ex- actly at their genomic locations. This seems reason- able from a biological standp oint , as the lo ca tion of binding sites can generally b e assumed to follo w the lo cation of genes chronolog ically through evol ution (although there ma y b e exceptions, s u c h as co d- ing regions copied int o genomic regions that already ha v e an established regulatory mac hinery). Ho w- ev er, o ne should also co nsid er whic h of t he trac ks ha v e the more complex stru cture. Th is structure should b e preserved in the null mo del, and one w ould 8 E. FERKI NGST AD, L. H OLDEN A ND G. K. SANDVE prefer to randomize th e trac k with the simplest structure. Although the lo cation of genes is clea rly not uniform, it can b e argued th at the TFBS h as an ev en more complex structure. T h e reason is that individual TFBS fall as clusters with sp eciﬁc in tra- cluster str ucture insid e regulatory r egions, with reg- ulatory regions again ha ving a certain structur e in relation to genes. Indeed, as can b e seen from Fig- ure 2 , the p -v alues are somewhat h igher when ran- domizing genes as opp osed to TFBS. In the ﬁgur e w e compare the t w o null mo dels describ ed ab ov e randomizing TFBS-p ositions and t wo n ull mo dels where we randomize the gene lo cations with ran- dom p ositioning and preserving in ter-gene distances. These t wo n u ll mo dels, randomizing the ge ne lo ca- tions, are also examples of case A in Section 3 . As ex- p ected, the second null mo del giv es larger p -v alues. The t wo mod els with gene r an d omization also giv e larger p -v alues than the t w o mod els with TFBS ran - domization, ind icating that the mo dels with gene randomization pr eserv e more of th e complex inter- action b etw een genes and TFBS than the t wo other mo dels. Note also that the diﬀerence b et we en the t w o mo dels rand omizing genes is small er than b e- t w een the t wo mo dels randomizing T FBS. This in - Fig. 2. Empiric al cumul ative distribution of p -values under four diﬀer ent nul l m o dels. The di ﬀer ent nul l mo dels c orr e- sp ond to whether TFBS or genes ar e r andomize d, and whether the empiric al inter-e lement distanc es ar e pr eserve d or not. dicates that preserving int er-distances is more im- p ortant for TFBS than for genes. 5. SIGNIFICANCE ORDERING F OR D A T A THA T DISPLA Y INTERNAL CLUST ERING: TRANSCRIPTION F A CTOR BINDING AND CHROMA TIN ST A TES In this section w e will sho w that clustering is present in a large amoun t of genomic trac ks. Clus- tering leads to the pr eserv ation ord ering shown in case B of Section 3 . Again, the p -v alues are ordered , with more preserv ation giving larger p -v alues. The DNA h as to b e highly compacted in order to ﬁt int o a cell. A t the same time, it has to b e acces- sible, for example, to the binding o f transcription factors in order to allo w eﬃcien t gene r egulation. T o ac hieve controlle d compactness and accessibilit y , DNA is pac ke d in a structured manner at m u ltiple lev els. The ﬁrst su c h organizatio nal la yer consists of the DNA doub le h elix, at the order of 100 base pairs, w ound aroun d small protein complexes called nucle- osomes (Korn b erg and Lorc h ( 1999 )). These nucle- osomes can b e mo diﬁed through the attac hment of other molecules to the pr oteins of the n ucleosomes, whic h are called histones. This is referred to as hi- stone mo d iﬁcation, and serves a regulatory role in itself (Cairns ( 2009 )). Recen tly , it has b ecome p ossi- ble to create genome-wide maps of histone mo diﬁca- tions through the u se of high-throughput sequencing proto cols (W ang et al. ( 2008 )). It has b een suggested that com b inations of such histone mo diﬁcations in a given regi on, referred to as c h romatin states, ca n b e u sed as a mark of th e fu nctional role of the re- gion (Ernst et al. ( 2011 )). One of the pr op osed c h ro- matin states, the “5-enhancer” (shortened to “SE” in part of the follo wing text), is suggested to corre- sp ond to regions that p la y a r ole in gene regulation b y p ro viding accessible binding sites to several tran- scription factors. It is th us in teresting to see whether diﬀeren t TFs ind eed sho ws a higher than exp ected densit y of exp erimenta lly d etermined bind ing ev en ts inside th ese regi ons. T o in v estigate this, w e consid- ered a collecti on of 82 trac ks of exp erimenta lly de- termined TF b inding ev ents in blo o d cells (cell t yp e gm12878 ) generated through the ENC ODE pro j ect. The tr acks are originall y of type Segmen ts, co rre- sp ond ing to ca lled sig nal p eaks of ChIP-seq exp eri- men ts (Kim et al. ( 2005 )). These p ea k segmen ts are around 100 bp s long, reﬂecting exp erimen tal inaccu- racy in the determination of b inding sites that are MONTE CAR LO NU LL MODELS 9 themselv es around 5–25 bp long (Wingender et al. ( 1996 )). The real bindin g sites are often, but not alw a ys , located around the cen ter of these p eak re- gions. In our analyses, w e used the midp oin ts of the p eak regions as bindin g site lo cations. F or eac h TF, w e then tested whether the binding lo cations o c- curred inside regio ns in the “5-e nhan cer” c hromatin state more than exp ected by c hance. An analysis of th e direct relation b et ween T F binding lo cations and c h romatin states might b e strongly confounded b y a common relatio n to gene lo cations. T o reduce this p oten tially confoun ding factor, w e fo cus ed the study of th e relatio n b et ween TF b in ding and enhancer states on only con tiguous regions of size > 100 kb, that are more than 100 kbps a wa y f r om the nearest gene. P arts of these regions are lo cated in cen tromeres, where neither T F bind- ing ev ents nor c hr omatin states can b e m app ed. T o a void an y b ias d ue to this, w e constrained the analy- sis regions to only part of the regions b eing lo cated in the c hromosome arms. There is a total of 580 suc h regions in the human genome (using the En- sem bl gene deﬁnition for computing distance from genes), r anging in size from 10 0 kbp to 2.6 Mbp and co vering a total of 151 Mbps. As can b e seen from F igure 3 , ENCODE trac ks displa y a strong clustering tend en cy across diﬀer- en t scales f or a large n umb er of trac ks of diﬀerent t yp es. The scaled Ripley K v alues are describ ed in Section 3.1 . All the collections sh o w a t yp ical clus- tering tendency w ell b ey ond the neutral v alue of 1. Based on these r esults, w e claim that clustering is t ypical for genomic data of this t yp e. W e observ e v ery few data sets wh ere we ﬁnd r epulsion. Case B in Section 3 sho ws that clustering ma y give increas- ing p -v alues for n ull mo dels: if w e reduce or remo ve the clustering in the sto chastic mo d el, that is, re- duce the p reserv ation, then the p -v alues decrease. Hence, the p -v alue from the n u ll mo dels are ordered according to in creasing pr eserv ation of th e cluster- ing. When testing the clusterin g it is im p ortan t to apply a scale that is a dapted to the length o f the observ ed p r op erty , for example, T Fs. The ordering of the p -v alues dep end s on the scale of clustering relativ e to the length of th e pr op erties (e.g., genes) in the other trac ks used in th e te st. F urthermore, w e tested w hether ChI P-seq p eaks for 47 diﬀeren t TFs transcription factors w ere lo- cated more than exp ecte d in side regions of c hro- matin s tate 5-Strong Enhancer. p -v alues for tw o diﬀeren t n ull mo dels with random location of the CHIP-seq p eaks or preserving the in ter-distances from the original trac ks are sho wn in Figure 4 . A to- tal of 81 trac ks of the T F ChIP-seq p eak region for the ce ll t yp e gm12878 w ere retriev ed from the EN- CODE data collection and analyzed ag ainst Strong Enhancer inside regions of size > 100 kb that we re more than 100 kbps a wa y from th e nearest gene. F or 34 of th ese trac ks, there w ere less than a total of 20 p eaks across all analysis regions, and they w ere remo v ed from the analysis. The p -v alues w ere com- puted based on Mon te Carlo, using 10,000 samples, Fig. 3. Box plot of sc ale d Ripley ’ s K values for sever al c ol le ctions of ENCODE and R o adMap Epigenomics tr acks. The two left b oxes ar e b ase d on 81 TF ChIP-se q tr acks with genome wide data, fol lowe d by two b oxes with the same data but r estricte d to sele cte d r e gions of size > 100 kb that ar e mor e than 100 kbps away fr om the ne ar est gene, fol lowe d by a b ox b ase d on 147 tr acks of DHS f or diﬀer ent c el l typ es and ﬁnal ly a b ox with elements of chr omatin state “ 5-Str ong Enhanc er ” in nine diﬀer ent c el l typ es. The clustering is analyze d f or two diﬀer ent sc ales for the two ﬁrst data typ es. 10 E. FERKI NGST AD, L. H OLDEN A ND G. K. SANDVE Fig. 4. p -values for hyp othesis testing of whether mi dp oints of ChIP-se q p e aks for 47 diﬀer ent TF wer e lo c ate d m or e than exp e cte d inside r e gions of the chr omatin state 5-Str ong Enhanc er. p -values wer e c om pute d for two diﬀer ent nul l mo dels: r andom lo c ation of the midp oints or pr eserving the inter-p oint distanc es. The TFs on the x -axis wer e sorte d ac c or ding to the p -value achieve d when pr eserving inter-p oint distanc es i n the nul l mo del. th us giving a min im um ac hiev able p -v alue of 1E − 4. F or some TFs, this minimum p -v alue was ac hiev ed using either null mo del. F or other TFs, either null mo del resulted in a p -v alue of 1. In all cases w here the t w o null mo d els r esulted in diﬀerent p -v alues, the null mo del that preserves in ter-p oint distances ga ve the highest p -v alue. As w e can see from Figure 4 , v ery lo w p -v alues are reac hed for man y of the tests, conﬁr ming that the 5-Strong En h ancer c h romatin state captur es his- tone mo diﬁcation patterns indicativ e of TF bind ing. Indeed, when considering the union of binding loca- tions across all TFs, the relatio n b et we en T Fs and SE is highly signiﬁcan t ( p < 0 . 000 01) for either n u ll mo del. Our int erpretation of th e results is that it clearly app ears to b e a relation of TF bind ing and the 5-Strong En h ancer c h romatin state, bu t that the data limitatio n du e to only considering regions that meets the strict criteria ab ov e d o es not allo w a con- clusion to b e drawn regarding this relation for al l TFs, when consid ering only the b ehavio r in these regions. T he systematic diﬀerence b et we en p -v alues ac h ieved using the tw o n ull mo dels then reﬂ ects that the n ull mo del preserving inter-point distances more accurately p ortra ys the p ossibilit y of conclud- ing on the TF–SE r elation, wh ile the null mo del dis- regarding the clusterin g of TF p oin ts (inter-point distances) giv es p -v alues that are lo wer than the de- gree of ce rtaint y th at can really b e assig ned to the TF–SE relation in the considered analysis regions. 6. F ALSE REJECTIONS OF NULL MODELS USING SIMULA TED D A T A In this section we p erform h yp othesis te sts b ased on simulated data w ith a clustering r epresen tativ e for genomic data. One test has syn thetic trac ks for p oints and segmen ts and another test uses real TF trac ks and simulated segmen t trac ks. W e g enerate the trac ks indep enden tly from eac h other, so the n ull h yp othesis of indep endence should not b e r ejected for any of the tests. In b oth case s, w e get man y false rejections if w e assume uniform lo cations of p oin ts, but go o d results wh en w e preserve inte r-p oin t d is- tances. The p reviously p resen ted g enomic cases conﬁrm that a null mo d el with a higher lev el of preserv ation t ypically giv es h igher p -v alues on real data . Ho w- ev er, they do not tell us whic h null mod el should b e MONTE CAR LO NU LL MODELS 11 preferred. As the simp le n ull mo dels will typica lly b e easier to implement , will often allo w computation- ally fast anal ytical solutions and will typicall y giv e more signiﬁcance, they m a y b e a tempting c hoice for a p ractitioner. Ho wev er, when their assumptions are not met, there is a severe risk of false p ositiv e ﬁnd - ings, du e to the failure of the null mo del to accoun t for in trinsic c haracteristics of th e data, unrelated to the n ull and alternativ e h yp otheses that are on trial. In order to study t he p otentia l sev erit y of false p ositiv e ﬁnd ings due to u nrealistic n ull models, w e p erformed a simula tion stud y . Tw o trac ks were generated indep endently , but with v arious intrinsic clustering-related prop erties. They w er e then tested for a relation under diﬀeren t null mo dels. The re- sults are sho wn in T able 1 . The syn thetic trac ks w ere generated acco rdin g to the approac h describ ed in Sandve et al. ( 2010 ). Indep enden t p oin ts w ere generated according to a P oisson distr ib ution with λ = 0 . 01. C lustered p oints were generated under an in tra-cluster Poisson distribution with λ = 0 . 1 and in ter-cluster P oisson with λ = 0 . 01, with eac h p oin t ha ving a p r obabilit y 0 . 3 of f orming a new cluster. Segmen ts w ere ge nerated similarly to points, w ith distance b et wee n consecutive segment s follo wing a P oisson distribu tion with λ = 0 . 01. Lengths of seg- men ts w ere distributed uniformly betw een 10 and 100 base p airs. F or eac h combinatio n, 100 separate tests w ere p erformed and the n um b er of false r ejec- tions rep orted afte r m ultiple testing correction at 20% FDR. W e notice that inappropr iate assump- tions could lead to u p to 19% of n ull hyp otheses b eing falsely rejected after correction for multiple testing. Preserving more of the i nd ivid ual prop erties (in ter-elemen t distances) was the safe choice , es- sen tially a voiding false rejections, w hile assu ming uniform p oint lo cations resulted in a high degree of false r ejections, whether the test w as resolv ed an- alyticall y or b y Mon te C arlo s imulation. F or this particular test, using a to o sim p le assumption on segmen t lo cation (assumin g uniform lo cation for segmen ts that were in real it y clustered) presen ted less of a pr oblem. The r eason for this is that the auto correlation b etw een v alues of X i , as discussed in Section 3 , w ould b e relativ ely lo w, and thus not lead to an y strong underestimation of p -v alues. It was sho wn in the p revious t wo sections that using simp le null mo dels led to lo wer p -v alues and more r ejections when testing the relation of TF binding to genes to certain chromatin s tates. Al- though it would b e tempting to consider the h igher signiﬁcance as a sign of b etter p o wer of the testing setup, the assum ption of un iform TF bind ing lo ca- tion is problematic and could lead to p -v alues b eing underestimated. W e ha v e also sho wn on pur ely sim- ulated data that too simp le and un realistic assump- tions can lead to a h igh degree of false rejec tions. Here, we com bin e the r eal data of TF bind in g with sim ulated segmen t data ha ving the same c h aracter- istics as g enes and c hromatin state s, but where the sim ulated d ata is generated indep enden tly from TF binding lo cations and chromatin states. The null h y- p othesis sh ould then n ot b e rejected in an y test after m ultiple testing correction. F or eac h c hromosome band with at least 5 MitF binding sites, we tested wh ether these b inding sites o ccur diﬀerently than exp ected inside simulat ed seg- men ts. Th is resulted in H 0 b eing r ejected in 1 out of 73 b in s at 10% FDR when assuming un if orm MitF lo cations. Ho w ev er, when p erformin g the tests only on 14 bins with a more satisfacto ry amount of data (at least 10 MitF binding sites), the n ull hyp othesis T able 1 Numb er of falsely r eje cte d nul l hyp otheses under di ﬀer ent c ombinations of data gener ation pr o c e dur es and testing assumptions. Two tr acks of p oints and se gments, r esp e ctively, ar e gener ate d indep endently, and then teste d for signiﬁc ant r elation. T he di ﬀer ent c olumns c orr esp ond to whether p oints or se gments ar e gener ate d uniformly (Poisson) or with a tendency for clustering. The diﬀer ent r ows c orr esp ond to whether p oints or se gments ar e assume d to b e r andom in the nul l mo del, as wel l as whether l o c ation is assume d to b e uniformly di stribute d or ac c or ding to a pr eserve d empi ric al distribution of inter-element distanc es Generation Assumption Uniform Clustered p oints Clustered segments Uniform p oint lo cation (analytic) 0/100 17/100 0/100 Uniform p oint lo cation (MC) 0/100 19/100 0/100 Preserving inter-p oin t distances 0/100 0/100 0/100 Uniform segment lo cation (MC) 0/100 0/100 0/100 12 E. FERKI NGST AD, L. H OLDEN A ND G. K. SANDVE is rejected in 4 out of 14 bin s (still at 10% FDR). This high rate of f alse rejectio ns suggests that part of the signiﬁcance observ ed for MitF v ersus genes or c h romatin states un der the assumption of uniform lo cation is lik ely d ue to u nderestimation of p -v alues due to th e inadequacy of this n ull mo del. Con v ersely , preserving the empirical d istribution of in ter-MitF distances leads to no rejections of H 0 at 10% FDR, either when testing in all 73 b ins or in the 14 bins with most MitF binding sites. This suggests that the preserv ation of in ter-p oin t d istances is able to cap- ture the intrinsic stru cture of the MitF trac k in an appropriate manner. In summary , w e ﬁ nd that the choic e of null mo d el strongly inﬂuences the results. Mainly , the diﬀerence is th at a null mo del preserving more of the observ ed data yields higher p -v alues. T ests on simulate d data sho w that an ov erly simple null m o del, preserving to o little of the observe d data, can lead to a large n umb er of false rejections, ev en after correcting for m ultiple testing. 7. DISCUSSION In this pap er we ha ve s tu died the choice of Monte Carlo n ull mo dels. W e hav e deﬁ ned the Mon te Carlo state space as the (ﬁnite) set of all o we d r esamplings of the observed data, and deﬁned a Mon te C arlo null mo del preserv ation h ierarch y . W e h a v e discussed the n ull complexit y p rinciple, namely , that an ordering of preserv ation may imply a corresp onding order- ing of statistical signiﬁcance (i.e., of estimated p - v alues), and illustrated the use of our resu lt on real data sets of general in terest. The choice of null m o del is very app lication d e- p end ent, so it is diﬃcu lt to giv e general guidelines. Ho wev er, t wo general approac hes are as follo ws: (1) to b e conserv ativ e an d choose the largest p -v alue and (2) u se the most r estricted null mo del (which, ho w ev er, should still ha ve suﬃcien t freedom of v ari- abilit y to p ro vide an eﬃcien t test), so that we are “close to the tr u th,” that is, faithful to restriction giv en by the ph enomenon u nder study . Because of the n ull complexit y pr inciple, approac hes (1) and (2) will usually coincide. A fund amen tal feature of the Mon te Carlo ap- proac h to statistical in ference is that conclusions ma y only b e d ra wn regarding the actual observ ed data. In other w ords, there is no prosp ect for gener- alizati ons to any (hyp othetical or real) p opulation. While some ma y see this as a serious dra wbac k of Mon te Carlo metho d s, we feel that th is line of ob- jection to randomization methodology is often quite misguided. Obvio usly , the idea of random sampling from a popu lation is b oth useful and extremel y en- trenc hed in classical statistics. Ho wev er, often is it v ery hard to even conceiv e of th e “p op u lation” in whic h random sampling is supp osed to take place. Genomics and DNA sequences are go o d exa mples of this. In man y cases, the Mon te Carlo metho d is simply a more natural app r oac h: w e d o n ot wish to dra w conclusions fr om a samp le to a p opu lation, it is really t he (single , unique) sample itself th at w e are gen uinely in terested in. In th is pap er we hav e fo cused on examples from genetics since this is our main in terest and motiv ation for the pap er. But sim- ilar problems are encoun tered in other areas such as ecolog y , as do cumented in a separate rep ort; s ee F er- kingstad, Holden and Sandve ( 2013 ). An interesti ng topic for future wo rk w ould b e to study the implications f or the multi ple h y- p othesis testing setting. F or a discussion of some computational and conceptual c hallenges of Mon te Carlo multiple testing, see Sandv e, F erkingstad and Nyg ˚ ard ( 2011 ). The multiple testing p roblem is par- ticularly imp ortan t in genomics, bu t it also ap- p ears in ecolog y; see, for example, Gotelli and Ulric h ( 2010 ). Our main fo cus has b een a v oiding false p ositiv es due to too simple null mo dels. Of course, false nega- tiv es also o ccur, and the eﬀec t of diﬀering null mo d - els on the p ow er of tests should b e fu rther stu died. In order t o a vo id underp o wered tests, a v ery gen- eral adv ice is th e follo wing. Most te st statistics in the pap er a re based on counting, hence, th e v ari- ance of test statistics decreases as 1 / N where N is the num b er of samp les. But observ ations ma y b e correlated, reducing p o we r. W e m ay ha ve very h igh correlation b et wee n a large n umber of observ ations. It is imp ortan t to b e a wa re of this and try to ﬁn d test statistics where the correlati on b etw een obser- v ations is as small as p ossible. Finally , w e ha v e a lso considered a third t yp e of n ull mo del preserv ation, where the data is a s e- quence of categorica l v ariables, for example, . . . A CGT. . . for a DNA sequence. The distrib ution for eac h v ariable d ep ends on the v alue of the previous n v ariables. In this mo del, it is p ossible to h a v e the same probabilit y distribution for s equences of length n as in th e observ ed data. Then, increasing n implies preserve ring more of the pr obabilistic stru cture of the original data. W e omitted this material to mak e MONTE CAR LO NU LL MODELS 13 the pap er shorter and more f o cused. A separate pa- p er on this topic is in preparation. A CKNO WLEDGMENTS W e th ank Knut Liestøl, Marit Holden and Arnoldo F rigessi, as wel l as th e editor, an asso ciate editor and t w o anon ymous referees, for very helpful comments and suggestions. REFERENCES Benjami ni, Y. and Hochberg, Y. (1995). Con t rolling the false disco very rate: A practical and pow erful approach to multiple testing. J. R oy. Statist. So c. Ser. B 57 289–300. MR1325392 Berman, B. P. , Nibu, Y. , Pfeiffer, B . D. , Toman- cak, P. , Celni ker, S. E. , Levine , M. , Rubin, G . M. and Eisen, M. B. (2002). Exploiting transcription factor binding site clustering to identify cis-regulatory mo dules inv olved in pattern formation in the Dr osophila genome. Pr o c. Natl. A c ad. Sci. U SA 99 757–762. Bickel, P. J. , Boley, N. , Bro wn, J. B. , Huang, H. and Zhang, N. R. (2010). Su bsampling method s for genomic inference. Ann. Appl. Stat. 4 1660– 1697. MR2829932 Cairns, B. R. (2009). The logic of c hromatin architecture and remo delling at promoters. Natur e 461 193–198. Efr on, B. (2004). L arge-scale simultaneous hyp othesis test- ing: The choice of a null hyp othesis. J. A mer. Statist. As- so c. 99 96–104. MR2054 289 Ernst, J. , Kheradpour, P . , Mikkelsen, T. S. , Shoresh, N. , W ard, L. D. , Epstein, C. B. , Zh a ng, X. , W ang, L. , Issner, R. , Coyne, M. , Ku, M. , Durham, T. , Kellis, M. and B ernstein, B . E. (201 1). Mapp ing and analysis of chromatin state dynamics in nine human cell types. Natur e 473 43–49 . Ferkingst ad, E. , Holden, L. and Sandve, G . K. (2013). Monte Carlo null mod els in ecology . T ec hni- cal R ep ort SA MBA/20/13 , Norwegia n Computing C en- ter. Av ailable at http://publications. nr.no/1370000051 / NullModelsEcology-F erkingstad.p df . Fisher, R. A. (1935). The Design of Exp eriments . Oliver & Bo yd, London. Flicek, P. , A mode, M. R. , Barrell, D. , B eal, K. , Brent, S. , Car v alho-Sil v a , D. , Clapham, P. , Co a tes, G. , F airley, S. , Fitzgerald, S. , Gil, L. , Gordon, L. , Hendrix, M. , Hourlier, T. , Johnson, N. , K ¨ ah ¨ ari, A. K. , Ke efe, D. , Keena n, S. , Kinsella, R. , Ko moro wska, M. , K oscielny, G. , Kulesha, E. , Lars- son, P. , Longden, I. , McLare n, W . , Muff a to, M. , Overduin, B . , Pigna telli, M. , Pritchard, B. , Ria t, H. S. , Ritchi e, G. R. S . , Ruffier, M. , Schus- ter, M. , Sobral, D. , T ang, Y. A . , T a ylor, K. , Trev anion, S . , V andro vco v a, J. , White, S. , W il- son, M. , W ilder, S. P. , Aken, B. L. , Birne y, E. , Cunningham , F. , Dunham, I. , Durbin, R. , Fern ´ andez- Suarez, X. M. , Harro w, J. , Herrero, J. , Hub- bard, T. J. P. , P arker, A. , Proctor, G. , Spudich, G. , Vo gel, J. , Y a tes, A. , Zadissa, A. and Searle, S. M. J. (2012). Ensembl 2012. Nucleic A cids R es. 40 D84–D90. F or tin, M . J. and Ja cquez , G. M. (200 0). Randomiza- tion tests and sp atially auto-correlated data. Bul letin of the Ec olo gic al So ciety of Americ a 81 201–205 . Gionis, A. , Mannila, H. , Mie lik ¨ ainen, T. and Tsap aras, P. (2007). Assessing data mining results via sw ap randomization. ACM T r ansactions on Know le dge Disc overy f r om Data 1 14. Goecks, J. , N ekrutenk o, A. , T a ylor, J. and Galaxy Team (2010). Galaxy: A comprehensive approach for sup- p orting accessible, reprod ucible, and transparent computa- tional research in the life sciences. Genome Biol. 11 R86. Goodman, S. and Gre enland, S. (2007). Assessing the un- reliabilit y of the med ical litera ture: A respon se to “Why most published research ﬁn dings are false.” W orking P ap er 135, Dept. Biosta tistics, Johns Hopkins Univ ., Baltimore, MD. Ava ilable at http://www.bepress.com/jhubiostat/ paper135 . Gotelli, N. J. (2000). Null mo del analysis of sp ecies co- occurren ce patterns. Ec olo gy 81 2606–2621. Gotelli, N. J. and Gra ves, G. R. (1996). Nul l Mo dels i n Ec olo gy . Smithsonian In stitution, W ashington, D C. Gotelli, N. J. and Ulrich, W. (2010). The empirical Bay es approac h as a t ool t o identi fy non-rand om species asso cia- tions. Oe c olo gia 162 463–477. Hanhij ¨ ar vi , S. , Ga rriga, G. and Puolam ¨ aki, K. (2009). Randomization tec h n iques for gra phs. In Pr o c e e dings of the 9th SIAM International Confer enc e on Data M i ning (SDM ’ 09) 780–791. S I AM, Ph iladelphia, P A. Io annidi s, J. P. A. (2005). Why most p ublished research ﬁndings are false. PL oS Me d. 2 e124. Io annidi s, J. P. A. , Allison, D. B. , Ball, C. A. , Coulibal y, I. , Cui, X. , Culhane , A . C. , F a lchi, M. , Furlanello, C. , Game, L. , Jurman, G. , Mangion, J. , Meht a, T. , Nitzberg, M. , P age, G. P. , Petretto, E. and v an Noor t, V. (2008). Rep eatability of published mi- croarra y gene expression analyses. Nat. Genet. 41 149–155. Kallio, A. , V uokk o, N. , Oj ala, M. , Hai minen, N. and Mannila, H. (20 11). Randomization t ec hniq u es for as- sessing th e signiﬁcance of gene p erio dicity results. BMC Bioinformatics 12 330. Kell, D. B . and Oliver, S. G. (2004). H ere is t he evi- dence, n o w what is the hyp othesis? The complementary roles of inductive and hyp othesis-driven science in the p ost- genomic era. Bio essays 26 99–105. Kim, T. H. , Barrera, L. O. , Zheng, M. , Qu , C. , S i nger, M. A . , Richmond, T. A. , Wu, Y. , Gre en, R. D. and Ren, B. (2005). A high- resolution map of active promoters in the human genome. Natur e 436 876–880. Ko rnberg, R. D. and Lor ch, Y . (1999). Tw enty-ﬁve years of the n ucleosome, fund amental particle of t he eu kary ote chromo some. Cel l 98 285–294. Langaas, M. , Lindqvist, B. H . and Ferkingst ad, E. (2005). Estimating th e prop ortion of true null hyp otheses, with application to D NA microarra y data. J. R oy. Statist. So c. Ser. B 67 555–572. MR2168204 Lehmann, E. L. and Romano, J. P. ( 2005). T esting Statis- tic al Hyp otheses , 3rd ed. Springer, New Y ork. MR2135927 Lijffijt, J. , P ap apetrou, P. and Puolam ¨ aki, K. (2014). A statistical signiﬁcance testing app roac h to mining the most 14 E. FERKI NGST AD, L. H OLDEN A ND G. K. SANDVE informativ e set of patterns. Data Min. Know l. Disc ov. 28 238–263 . Manl y, B . F. J. (2007). R andomization, Bo otstr ap and Monte Carlo Metho ds in Bi olo gy , 3rd ed . Chapman & Hall/CR C, Boca Raton, FL. MR2257066 Mesiro v, J. P. (2010). Computer science. Accessible repro- ducible research. Scienc e 327 415–416 . Noseda, M. and McLean, G. R. (2008). Where did the sci- entiﬁc meth od go? Nat. Biote chnol. 26 28–29. Pounds, S. and Chen g, C. (2006). Robust estimation of the false discov ery rate. Bioinformatics 22 1979–1987 . R Developme nt Core Team (2011). R: A L anguage and Envir onment f or Statist ic al Computing . Vienna, Austria. Ripley, B . D. (1976). The second-order analysis of stationary p oint pro cesses. J. Appl. Pr ob ab. 13 255–266. MR0402918 Ruf, S. , Symmons, O. , Uslu, V . V. , Dolle, D. , Hot, C. , Ettwiller, L. and Sp itz, F. (2011). Large-scal e analysis of the regulatory architecture of t h e mouse genome with a transp oson-associated sensor. Nat. Genet . 43 379–386. Sandve, G. K. , Ferkingst ad, E. and N yg ˚ ard, S. (2011). Sequential Mon te Carlo multiple testing. Bioinf ormatics 27 3235–32 41. Sandve, G. K. , Gu ndersen, S. , R ydbeck, H. , G lad, I. K. , Holden, L. , Holden, M. , Liestøl, K. , Clancy, T. , Ferkingst ad, E. , Johansen, M. , Nygaard, V. , Tøstesen, E. , Frigessi, A. and H ovig, E. (2010). The Genomic Hyp erBrowser: In ferential genomics at the se- quence level. Genome Biol. 11 Article ID R121. Sandve, G . K. , Gundersen, S. , Johansen, M. , Glad, I. K. , Guna thasan, K. , Holden, L. , Holden, M. , Liestøl, K. , N yg ˚ ard, S. , Nygaa rd , V. , P aulsen, J. , R ydbeck, H. , Trengerei d, K. , Clancy, T. , Drabløs, F. , Ferkingst ad, E. , Kalas, M . , Lie n, T. , R ye, M. B. , Frigessi, A. and Hovig, E. (2013a). The Genomic Hyp erBrowser: An analysis web serv er for genome-scale d ata. Nucleic A cids Re s. 41 W133–W141. Sandve, G . K. , Nekrutenk o, A. , T a ylor, J. and Ho vig, E. (2013b). T en simple rules for repro ducible com- putational research. PL oS Comput. Biol. 9 e1003285. Storey, J. D. (2002). A direct approach to false discov ery rates. J. R oy. Stat ist. So c. Ser. B 64 479–498. MR1924302 Strub, T. , Gi u liano, S. , Ye, T. , B onet, C. , Keime , C. , Ko bi, D. , G ras, S. L. , Cormont, M. , Ballotti, R . , Ber tolotto, C. and Da vidson, I. (2011). Essential role of microphthalmia tran scription factor for DNA replica- tion, mitosis and genomic stability in melanoma. Onc o gene 30 2319–2332. Visel, A. , Rub in, E. M. and Pennacchio, L. A . (2009). Ge- nomic views of distant-acting enhancers. Natur e 461 199– 205. W ang, Z. , Zang, C. , Rosenfeld, J. A. , S chones, D. E. , Barski, A. , Cud dap ah, S. , Cui, K. , Roh, T.-Y. , Peng, W. , Zhang, M. Q. and Zhao , K. (2008). Com bi- natorial patterns of histone acetylatio ns and methylations in the human genome. Nat. Genet. 40 897–903 . Wingend e r, E. , Dietze , P. , Kar a s, H. and Kn ¨ uppel, R. (1996). TRANSF AC: A database on transcription fa ctors and their DNA binding sites. Nucleic Ac ids R es. 24 238– 241. Zhou, Q. and Wong, W. H. (2004). CisModule: De no vo disco very of cis-regulatory mo dules by hierarc hical mixture mod eling. Pr o c. Natl. A c ad. Sci. USA 101 1211 4–12119.

Monte Carlo Null Models for Genomic Data

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment