Understanding better (some) astronomical data using Bayesian methods


Authors: S. Andreon (INAF-OABrera)

S. Andreon∗
INAF–Osservatorio Astronomico di Brera, Milano, Italy
November 27, 2024

Abstract

Current analyses of astronomical data are confronted with the daunting task of modeling the awkward features of astronomical data, among which heteroscedastic (point-dependent) errors, intrinsic scatter, non-ignorable data collection (selection effects), data structure, non-uniform populations (often called Malmquist bias), non-Gaussian data, and upper/lower limits. This chapter shows, by examples, how to model all these features using Bayesian methods. In short, one just needs to formalize, using maths, the logical link between the involved quantities: how the data arise and what we already know about the quantities we want to study. The posterior probability distribution summarizes what we know about the studied quantities after seeing the data, and we should not be afraid of its actual numerical computation, because it is left to (special) Monte Carlo programs such as JAGS. As examples, we show how to predict the mass of a new object given a calibrating sample, how to constrain cosmological parameters from supernovae data, and how to check whether the fitted data are in tension with the adopted fitting model. Examples are given with their coding. These examples can easily be used as templates for completely different analyses, on totally unrelated astronomical objects, requiring the modeling of the same awkward data features.

Keywords: Astrophysics; Cosmology; Bayesian statistics; Regression; Scaling relations; Prediction; Model testing

1 Introduction

Astronomical data present a number of quite common awkward features (see [1] for a review):

• heteroscedastic errors: error sizes vary from point to point.
• non-Gaussian data: the likelihood is asymmetric and thus errors are, e.g., 3.4 +2.5 −1.2.
Upper/lower limits, such as 2.3 at 90% probability, are (perhaps extreme) examples of asymmetric likelihoods.
• non-uniform populations or data structure: the number of objects per unit parameter(s) is non-uniform. This is the source of the Malmquist- or Eddington-like bias that affects most astronomical quantities, such as parallaxes, star and galaxy counts, mass and luminosity, galaxy cluster velocity dispersions, supernovae luminosity corrections, etc.
• intrinsic scatter: data often scatter more than allowed by the errors. The extra scatter can be due to unidentified sources of error, often called systematic errors, or it indicates an intrinsic spread of the population under study, i.e. the fact that astronomical objects are not all identical.
• noisy estimates of the errors: as with every measurement, errors are known with a finite degree of precision. This is even more true when one is measuring complex, and somewhat model-dependent, quantities like mass.
• non-random sampling: in simple terms, the objects in the sample are not a random sampling of those in the Universe. On some rare occasions in astronomy, sampling is planned to be non-random on purpose, but most of the time non-random sampling is due to selection effects: harder-to-observe objects are very often missed in samples.
• mixtures: very often, large samples include the population of interest, but also contaminating objects. Mixtures also arise when one measures the flux of a source in the presence of a background (i.e. always).
• prior: we often know from past data or from theory that some values of the parameters are more likely than others. In other terms, we have prior knowledge about the parameter being investigated.

∗ stefano.andreon@brera.inaf.it
If we do not know anything, not even the order of magnitude, about a parameter, it is difficult even to choose which instrument, or sample, should be used to measure the parameter.
• non-linear: laws of Nature can be more complicated than y = ax + b.

Bayesian methods allow us to deal with these features (and also others), even all at the same time, as we illustrate in Sec. 3 and 4 with two research examples: it is just a matter of stating in mathematical terms our wordy statements about the nature of the measurement and of the objects being measured. The probabilistic (Bayesian) approach returns the whole (posterior) probability distribution of the parameters, very often in the form of a Monte Carlo sampling of it. In this paper we make an attempt to be clear at the cost of being non-rigorous. We refer the reader looking for rigour to general textbooks, such as [2], and to [3] for our first research example.

2 Parameter estimation in Bayesian inference

Before addressing a research example, let us consider an idealized applied problem to explain the basics of the Bayesian approach. Suppose one is interested in estimating the (log) mass of a galaxy cluster, lgM. In advance of collecting any data, we may have certain beliefs and expectations about the values of lgM. In fact, these thoughts are often used in deciding which instrument will be used to gather data and how this instrument may be configured. For example, if we plan to measure the mass of a poor cluster via the virial theorem, we will select a spectroscopic set-up with adequate resolution, in order to avoid velocity errors comparable to, or larger than, the likely low velocity dispersion of poor clusters.
Crystallizing these thoughts in the form of a probability distribution for lgM provides the prior p(lgM), on which relies the feasibility section of the telescope time proposal, where instrument, configuration and exposure time are set. For example, one may believe (e.g. from the cluster being somewhat poor) that the log of the cluster mass is probably not far from 13, plus or minus 1; this might be modeled by saying that the (prior) probability distribution of the log mass, here denoted lgM, is a Gaussian centred on 13 with σ, the standard deviation, equal to 0.5, i.e. lgM ∼ N(13, 0.5²).

Once the appropriate instrument and its set-up have been selected, data can be collected. In our example, this means we record a measurement of log mass, say obslgM200, via, for example, a virial theorem analysis, i.e. measuring distances and velocities. The likelihood describes how the noisy observation obslgM200 arises given a value of lgM. For example, we may find that the measurement technique allows us to measure masses in an unbiased way but with a standard error of 0.1, and that the error structure is Gaussian, i.e. obslgM200 ∼ N(lgM, 0.1²), where the tilde symbol reads "is drawn from" or "is distributed as". If we observe obslgM200 = 13.3, we usually summarise the above by writing lgM = 13.3 ± 0.1.

How do we update our beliefs about the unobserved log mass lgM in light of the observed measurement obslgM200? Expressing this probabilistically: what is the posterior distribution of lgM given obslgM200, i.e. p(lgM | obslgM200)? Bayes' theorem ([4], [5]) tells us that

p(lgM | obslgM200) ∝ p(obslgM200 | lgM) p(lgM)

i.e. the posterior (the left-hand side) is equal to the product of likelihood and prior (the right-hand side) times a proportionality constant of no importance in parameter estimation.
Simple algebra shows that in our example the posterior distribution of lgM | obslgM200 is Gaussian, with mean

μ = (13.0/0.5² + 13.3/0.1²) / (1/0.5² + 1/0.1²) = 13.29

and

σ² = 1 / (1/0.5² + 1/0.1²) = 0.0096.

μ is just the usual weighted average of two "input" values, the prior and the observation, with weights given by the prior and observation σ's.

From a computational point of view, only simple examples such as the one described above can generally be tackled analytically; realistic analyses should instead be tackled numerically by special (Markov Chain) Monte Carlo methods. These are included in BUGS-like programs ([6]) such as JAGS ([7]), allowing scientists to focus on formalizing in mathematical terms our wordy statements about the quantities under investigation without worrying about the numerical implementation. In the idealized example, we just need to write in an ascii file the symbolic expression of the prior, lgM ∼ N(13, 0.5²), and of the likelihood, obslgM200 ∼ N(lgM, 0.1²), to get the posterior in the form of samplings. From the Monte Carlo sampling one may directly derive mean values, standard deviations, and confidence regions. For example, for a 90% interval it is sufficient to pick the interval that contains 90% of the samplings.

3 First example: predicting mass from a mass proxy

Mass estimates are one of the holy grails of astronomy. Since masses are observationally expensive to measure, or even unmeasurable with existing facilities, astronomers use mass proxies, far less expensive to acquire: from a (usually small) sample of objects, the researcher measures masses, y, and the mass proxy, x. Then, he regresses x vs y and infers y for those objects having only x.
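The weighted-average formula above can be verified with a few lines of Python (a sketch using numpy-free arithmetic; the numbers are those of the idealized example):

```python
# Prior: lgM ~ N(13, 0.5^2); likelihood: obslgM200 ~ N(lgM, 0.1^2)
mu0, sigma0 = 13.0, 0.5      # prior mean and standard deviation
obs, sigma = 13.3, 0.1       # observed value and measurement error

# Conjugate Gaussian update: precision-weighted average of prior and data
prec0, prec = 1.0 / sigma0**2, 1.0 / sigma**2
mu_post = (mu0 * prec0 + obs * prec) / (prec0 + prec)
var_post = 1.0 / (prec0 + prec)

print(round(mu_post, 2), round(var_post, 4))  # 13.29 0.0096
```

Because the observation is 25 times more precise than the prior, the posterior mean sits almost on top of the observed value.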
This is the way galaxy cluster masses are most often estimated, for example using the X-ray luminosity, X-ray temperature, Y_X, Y_SZ, the cluster richness or the total optical luminosity. Here we use the cluster richness, i.e. the number of member galaxies, but with minor changes this example can be adapted to other cases. [1] shows that y values predicted using the Bayesian approach illustrated here are more precise than those of any other method and that the Bayesian approach does not show the large systematics of other approaches. This means, in the case of masses, that more precise masses can be derived for the same input data, i.e. at the same telescope time cost.

3.1 Step 1: put in formulae what you know

3.1.1 Heteroscedasticity

Clusters have widely different richnesses, and thus widely different errors. Some clusters have better-determined masses than others. Heteroscedasticity means that errors have an index i, because they differ from point to point.

3.1.2 Contamination (mixtures), non-Gaussian data and upper limits

Galaxies in the cluster direction are both cluster members and galaxies on the line of sight (fore/background). The contamination may be estimated by observing a reference line of sight (fore/background), perhaps with a C_i times larger solid angle (to improve the quality of the determination). The mathematical translation of our words, when counts are modeled as Poisson, is:

obsbkg_i ∼ P(nbkg_i)                # Poisson with intensity nbkg_i                (1)
obstot_i ∼ P(nbkg_i/C_i + n200_i)   # Poisson with intensity nbkg_i/C_i + n200_i   (2)

The variables n200_i and nbkg_i represent the true richness and the true background galaxy counts in the studied solid angles, whereas we add the prefix "obs" to indicate the observed values. Upper limits are automatically accounted for.
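To see how eqs. 1 and 2 constrain a poorly detected cluster, one can evaluate the posterior of a single n200_i on a grid. The following Python sketch (not the paper's code; the counts obstot = 5, obsbkg = 4, C = 1 are illustrative) uses uniform priors, so the posterior is proportional to the likelihood, and marginalizes over the unknown background:

```python
import numpy as np
from scipy.stats import poisson

obstot, obsbkg, C = 5, 4, 1.0
n200 = np.linspace(0.0, 30.0, 601)     # grid for the true richness
nbkg = np.linspace(0.01, 30.0, 600)    # grid for the true background

# joint likelihood of the two counts on the (n200, nbkg) grid (eqs. 1-2)
N200, NBKG = np.meshgrid(n200, nbkg, indexing="ij")
like = poisson.pmf(obsbkg, NBKG) * poisson.pmf(obstot, NBKG / C + N200)

post = like.sum(axis=1)                # marginalize over the background
post /= post.sum()

cdf = np.cumsum(post)
upper90 = n200[np.searchsorted(cdf, 0.90)]   # 90% upper limit on n200
print(upper90)
```

The posterior peaks near zero net galaxies, so the data only support an upper limit of a few, as stated below for the toy example.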
Suppose, for simplicity of exposition, that we observed 5 galaxies, obstot_i = 5, in the cluster direction and that in the control field direction (with C_i = 1, again for simplicity) we observe four background galaxies, obsbkg_i = 4. With one net galaxy and Poisson fluctuations of a few, n200_i is poorly determined at best, and the conscientious researcher would probably report an upper limit of a few. To use the information contained in the upper limit in our regression analysis we only need to list in the data file the raw measurements (C_i = 1, obstot_i = 5, obsbkg_i = 4), as for the other clusters. These data will be treated independently of whether an astronomer decides to report a measurement or an upper limit, because Nature doesn't care about astronomers' decisions.

3.1.3 Non-linearity and extra-scatter

The relation between mass, M200, and proxy, n200 (richness), is usually parametrized as a power law:

M200_i ∝ n200_i^β

Allowing for a Gaussian intrinsic scatter σ_scat (clusters of galaxies of a given richness may not all have the very same mass) and taking the log, the previous equation becomes:

lgM200_i ∼ N(α + β log(n200_i), σ²_scat)   # Gaussian scatter around M200_i ∝ n200_i^β   (3)

where the intercept is α and the slope is β.

3.1.4 Noisy errors

Once logged, mass has Gaussian errors. In formulae:

obslgM200_i ∼ N(lgM200_i, σ²_i)   # Gaussian errors on lg mass   (4)

However, errors (like everything) are measured with a finite degree of precision. We assume that the measured error, obserrlgM200_i, is not biased (i.e. it is not systematically larger or smaller than the true error σ_i) but somewhat noisy. If a χ² distribution is adopted, it satisfies both our requests of unbiasedness and noisiness. In formulae:

obserrlgM200²_i ∼ σ²_i χ²_ν/ν   # Unbiased errors   (5)

where the parameter ν regulates the width of the distribution, i.e.
how precise the measured errors are. Since we are 95% confident that quoted errors are correct up to a factor of 2,

ν = 6   # 95% confident within a factor of 2   (6)

3.1.5 Prior knowledge and population structure

The data used in this investigation are of good enough quality to determine all parameters but one to a sufficient degree of accuracy that we should not care about priors, and we can safely take weak (almost uniform) priors, zeroed for unphysical values of the parameters (to avoid, for example, negative richnesses). The exception is the prior on the errors (i.e. σ_i), for which there is only one measurement per datum. The adopted prior (eq. 11) is supported by statistical considerations (see [3] for details). The same prior is also used for the intrinsic scatter term, although any weak prior would return the same result, because this term is well determined by the data.

α ∼ N(0.0, 10⁴)        # Almost uniform prior on intercept        (7)
β ∼ t₁                 # Uniform prior on angle                   (8)
n200_i ∼ U(0, ∞)       # Uniform, but positive, cluster richness  (9)
nbkg_i ∼ U(0, ∞)       # Uniform, but positive, background rate   (10)
1/σ²_i ∼ Γ(ε, ε)       # Weak prior on error                      (11)
1/σ²_scat ∼ Γ(ε, ε)    # Weak prior on intrinsic scatter          (12)

Richer clusters are rarer. Therefore, the prior on the cluster richness is, for sure, not uniform, contrary to our assumption (eq. 9). Modeling the population structure is unnecessary for the data used in [3] and here, but it is essential if noisier richnesses were used. Indeed, [3] shows that a previously published richness-mass calibration, which uses richnesses as low as obsn200 = 3 and neglects the n200 structure, has a slope biased by five times the quoted uncertainty. Therefore, the population structure cannot be overlooked in general.
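Equation 5 and the calibration of ν can be checked by simulation. The Python sketch below (an illustration, not part of the original analysis) draws many noisy error estimates for an assumed true error σ = 0.1 and verifies that they are unbiased on average and, for ν = 6, correct to within a factor of 2 about 95% of the time:

```python
import numpy as np

nu = 6
sigma = 0.1                      # assumed true error on lg mass
rng = np.random.default_rng(1)

# eq. 5: obserr^2 ~ sigma^2 * chi2_nu / nu
obserr2 = sigma**2 * rng.chisquare(nu, size=200_000) / nu

print(obserr2.mean())            # close to sigma^2 = 0.01: unbiased on average

ratio = np.sqrt(obserr2) / sigma
frac = np.mean((ratio > 0.5) & (ratio < 2.0))
print(frac)                      # fraction of quoted errors within a factor 2
```

With ν = 6 the printed fraction is close to 0.95, matching the stated degree of confidence; a larger ν would correspond to more precisely known errors.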
3.2 Step 2: remove TeXing, perform stochastic computations and publish

At this point, we have described, using mathematical symbols, the link between the quantities that matter for our problem, and we only need to compute the posterior probability distribution of the parameters by some sort of sampling (readers with exquisite mathematical skills may instead attempt an analytical computation). Just Another Gibbs Sampler (JAGS¹, [7]) can return it at the minor cost of de-TeXing equations 1 to 12 (compare them to the JAGS code below). Poisson, Normal and Uniform distributions become dpois, dnorm, dunif, respectively. JAGS, following BUGS ([6]), uses precisions, prec = 1/σ², in place of variances σ². Furthermore, it uses natural logarithms instead of decimal ones. Eq. 5 has been rewritten using the property that the χ² is a particular form of the Gamma distribution. Eq. 3 is split into two JAGS lines for better readability. The arrow symbol reads "take the value of". obsvarlgM200 is the square of obserrlgM200. For computational advantages, log(n200) is centred at an average value of 1.5 and α is centred at −14.5. Finally, we replaced infinity with a large number. The model above, when inserted in JAGS, reads:

model {
  for (i in 1:length(obstot)) {
    obsbkg[i] ~ dpois(nbkg[i])                          # eq 1
    obstot[i] ~ dpois(nbkg[i]/C[i]+n200[i])             # eq 2
    n200[i] ~ dunif(0,3000)                             # eq 9
    nbkg[i] ~ dunif(0,3000)                             # eq 10
    precy[i] ~ dgamma(1.0E-5,1.0E-5)                    # eq 11
    obslgM200[i] ~ dnorm(lgM200[i], precy[i])           # eq 4
    obsvarlgM200[i] ~ dgamma(0.5*nu, 0.5*nu*precy[i])   # eq 5
    z[i] <- alpha+14.5+beta*(log(n200[i])/2.30258-1.5)  # eq 3
    lgM200[i] ~ dnorm(z[i], prec.intrscat)              # eq 3
  }
  intrscat <- 1/sqrt(prec.intrscat)
  prec.intrscat ~ dgamma(1.0E-5,1.0E-5)                 # eq 12
  alpha ~ dnorm(0.0,1.0E-4)                             # eq 7
  beta ~ dt(0,1,1)                                      # eq 8
  nu <- 6                                               # eq 6
}

¹ http://calvin.iarc.fr/~martyn/software/jags/

Figure 1: Richness-mass scaling. The solid line marks the mean fitted regression line, while the dashed lines show this mean plus or minus the intrinsic scatter σ_scat. The shaded region marks the 68% highest posterior credible interval for the regression. Error bars on the data points represent observed errors for both variables. The distances between the data and the regression line are due in part to the measurement errors and in part to the intrinsic scatter. From [3], reproduced with permission.

JAGS samples the posterior distribution of all quantities of interest, such as intercept, slope and intrinsic scatter, by Gibbs sampling (a sort of Monte Carlo). From these samplings, it is straightforward to compute (posterior) means and standard deviations (by computing the average and the standard deviation!), to plot posterior marginals (by ignoring the values of the other parameters) and confidence contours, data and mean model, etc. Therefore, our effort is over; we only need to produce nice plots and summaries of our results.

Figure 1 shows the data used in this analysis (see [3] for details), the mean scaling (solid line) and its 68% uncertainty (shaded yellow region), and the mean intrinsic scatter (dashed lines) around the mean relation. The ±1 intrinsic scatter band is not expected to contain 68% of the data points, because of the presence of measurement errors. Figure 2 shows the posterior marginals for the intercept, slope and intrinsic scatter σ_scat. These marginals are reasonably well approximated by Gaussians. The intrinsic mass scatter at a given richness, σ_scat = σ_lgM200|log n200, is small, 0.19 ± 0.03. (Unless otherwise stated, results of the statistical computations are quoted in the form x ± y, where x is the posterior mean and y is the posterior standard deviation.)
The found relation is:

lgM200 = (0.96 ± 0.15)(log n200 − 1.5) + 14.36 ± 0.04   (13)

Figure 2: Posterior probability distribution for the parameters of the richness-mass scaling. The black jagged histogram shows the posterior as computed by MCMC, marginalised over the other parameters. The red curve is a Gaussian approximation of it. The shaded (yellow) range shows the 95% highest posterior credible interval. From [3], reproduced with permission.

3.3 Predicting masses

As mentioned, one of the reasons why astronomers regress a quantity x against another one, y, is to predict the latter when a direct measurement is missing (usually because it is observationally expensive to acquire). It is clear that the uncertainty on the predicted y, called ỹ hereafter, should account for: a) the intrinsic scatter between y and x (a larger scatter implies a lower-quality ỹ estimate); b) the uncertainty on x (the larger it is, the noisier the ỹ); c) the quality of the calibration between y and x (better-determined relations should return more precise estimates of ỹ); and d) extrapolation errors, i.e. it should penalize attempts to infer ỹ values corresponding to x values absent from the calibrating sample (e.g. outside the range sampled by it). All these requirements are satisfied by using the posterior predictive distribution,

p(ỹ) = ∫ p(ỹ | θ) p(θ | y) dθ   (14)

where θ are the regression parameters (intercept, slope, intrinsic scatter). This apparently ugly expression is easy to understand: one should combine (multiply) the uncertainties of the calibrating relation, p(θ | y), with the uncertainty of predicting new data if the calibrating relation were perfectly known, p(ỹ | θ). Since we are now interested in predicted values only, we get rid of the uninteresting parameters (θ) by marginalization (integration).
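In sampling form, eq. 14 amounts to one draw of ỹ per stored chain element. The Python sketch below illustrates this with a synthetic stand-in chain (independent Gaussians loosely inspired by the summaries quoted above; a real analysis would use the correlated MCMC samplings themselves):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical posterior samplings of (alpha, beta, sigma_scat),
# standing in for a real MCMC chain
n = 20_000
alpha = rng.normal(14.36, 0.04, n)
beta = rng.normal(0.96, 0.15, n)
sigma_scat = rng.normal(0.19, 0.03, n).clip(0.01)

# posterior predictive of lgM200 at log(n200) - 1.5 = 0.3:
# one draw of y~ per chain element (eq. 14 by Monte Carlo)
x = 0.3
y_pred = rng.normal(alpha + beta * x, sigma_scat)

lo, med, hi = np.percentile(y_pred, [16, 50, 84])
print(med, hi - lo)   # point estimate and 68% interval width
```

The predictive spread automatically combines the intrinsic scatter with the calibration uncertainty, and it widens at x values far from where the relation is well constrained, as required by points a) to d) above.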
Posterior predictive distributions are so basic that they are introduced on page 8 of the >700-page "Bayesian Data Analysis" book ([2]) and offered as a standard output of JAGS. Of course, we need to list in the data file the x values (richnesses) and errors of the clusters for which we want to infer ỹ (the predicted mass), listing for example obsbkg = 12, obstot = 32, C = 5, and mass obslgM200 = NA ("not available"), to indicate that this quantity should be estimated using the regression computed from the points with available masses and galaxy counts. JAGS returns p(ỹ) in the form of samplings and therefore, as for any other parameter, a point estimate may be obtained by taking the average, and a 68% (credible) interval can be derived by taking the interval including 68% of the samplings. Returned values behave as expected, and indeed have large errors when masses are estimated for clusters with richnesses outside the range where the calibration has been derived ([8]).

4 Second example: cosmological parameters from SNIa

Supernovae (SNIa) are very bright objects with very similar luminosities. The luminosity spread can be made even smaller by accounting for the correlation with the colour and stretch parameters (the latter is a measurement of how slowly SNIa fade), as illustrated in Figure 3 for the sample in [9]. These features make SNIa very useful for cosmology: they can be observed far away, and the relation between flux (basically the rate of photons received) and luminosity (the rate of photons emitted) is modulated by the luminosity distance (to the square), which in turn is a function of the cosmological parameters.

Figure 3: Apparent magnitude vs redshift of the SNIa sample (upper panels), and their residuals from a Ω_M = 0.3, Ω_Λ = 0.7 cosmological model (bottom panels), before (left panels) and after (right panels) correcting for the stretch and colour parameters.
Therefore, measuring SNIa fluxes (and redshifts) allows us to put constraints on cosmological parameters. The only minor complication is that SNIa luminosities are functions of their colour and stretch parameters, and these parameters have an intrinsic scatter too, which in turn has to be determined from the data at the same time as the other parameters. [10] shows that the Bayesian approach delivers tighter statistical constraints on the cosmological parameters over 90% of the time, that it reduces the statistical bias typically by a factor ∼ 2-3, and that it has better coverage properties than the usual chi-squared approach.

In this second example we can proceed a bit faster in illustrating this non-linear regression with heteroscedastic errors, non-uniform data structure and intrinsic scatter. In this example, we also briefly discuss the prior sensitivity, i.e. how much the results are affected by the chosen prior, and we also check the quality of the model fit.

Figure 4: Observed values of the stretch parameter, obsx_i, and of the colour parameter, obsc_i, ranked by error. Points scatter more than the error bars (see the left side of the figure). The dashed lines indicate the size of the intrinsic scatter as determined by our analysis.

4.1 Step 1: put in formulae what you know

We observe SNIa magnitudes obsm_i (= −2.5 log(flux) + c) with Gaussian errors σ_m,i, i.e.

obsm_i ∼ N(m_i, σ²_m,i)   (15)

These m_i are related to the distance modulus distmod_i via m_i = M + distmod_i − α x_i + β c_i, with a Gaussian intrinsic scatter σ_scat. More precisely:

m_i ∼ N(M + distmod_i − α x_i + β c_i, σ²_scat)   (16)

where M is the (unknown) mean absolute magnitude of SNIa, and α and β allow us to reduce the SNIa luminosity scatter by accounting for the correlation with the stretch and colour parameters.
Similarly to [10], the M, α, β and log σ_scat priors are taken uniform over a wide range:

log₁₀ σ_scat ∼ U(−3, 0)   (17)
α ∼ U(−2, 2)              (18)
β ∼ U(−4, 4)              (19)
M ∼ U(−20.3, −18.3)       (20)

x_i and c_i are the true values of the stretch and colour parameters, of which we observe (the noisy) obsx_i and obsc_i with errors σ_x,i and σ_c,i. In formulae:

obsx_i ∼ N(x_i, σ²_x,i)   (21)
obsc_i ∼ N(c_i, σ²_c,i)   (22)

The key point of the modeling is that the obsx_i and obsc_i values scatter more than their errors, but not immensely so, see Fig. 4. The presence of a non-uniform distribution induces a Malmquist-like bias if not accounted for (e.g. large obsx_i values are more likely low x_i values scattered to large values than vice versa, because of the larger abundance of low x_i values). Therefore, we model, as [10] do, the individual x_i and c_i as drawn from independent normal distributions centered on xm and cm with standard deviations R_x and R_c respectively. In formulae:

x_i ∼ N(xm, R²_x)   (23)
c_i ∼ N(cm, R²_c)   (24)

We take uniform priors for xm and cm, and uniform priors on log R_x and log R_c, between the indicated boundaries:

xm ∼ U(−10, +10)        (25)
cm ∼ U(−3, +3)          (26)
log₁₀ R_x ∼ U(−5, +2)   (27)
log₁₀ R_c ∼ U(−5, +2)   (28)

That's almost all: we need to remember the definition of the distance modulus:

distmod_i = 25 + 5 log₁₀ dl_i   (29)

where the luminosity distance, dl, is a complicated expression, involving integrals, of the redshift z_i and the cosmological parameters Ω_Λ, Ω_M, w, H₀ (see any recent cosmology textbook for the mathematical expression).
Redshifts, in the considered sample, have heteroscedastic Gaussian errors σ_z,i:

obsz_i ∼ N(z_i, σ²_z,i)   (30)

The redshift prior is taken uniform:

z_i ∼ U(0, 2)   (31)

Supernovae alone do not allow us to determine all cosmological parameters, so we need external priors on them, notably on H₀, taken from [11] to be

H₀ ∼ N(72, 8²)   (32)

At this point, we may decide which sets of cosmological models we want to investigate using SNIa, for example a flat universe with a possible w ≠ −1, with the following priors:

Ω_M ∼ U(0, 1)   (33)
Ω_k = 0         (34)
w ∼ U(−4, 0)    (35)

or a curved Universe with w = −1:

Ω_M ∼ U(0, 1)    (36)
Ω_k ∼ U(−1, 1)   (37)
w = −1           (38)

or any other set. Both considered cosmologies have:

Ω_k = 1 − Ω_M − Ω_Λ   (39)

Finally, one may want to use some data. As briefly mentioned, we use the compilation of 288 SNIa in [9].

4.2 Step 2: remove TeXing, perform stochastic computations and publish

Most of the distributions above are Normal, and the posterior distribution can be almost completely analytically computed ([10]). However, numerical evaluation of the stochastic part of the model on an (obsolete) laptop takes about one minute; therefore there is no need for a speed-up. Instead, the evaluation of the luminosity distance is CPU intensive (it takes ≈ 10³ times longer, unless approximate analytic formulae for the luminosity distance are used), because an integral has to be evaluated a number of times equal to the number of supernovae times the number of target posterior samplings, i.e. about four million times in our numerical computation. The JAGS implementation of the luminosity distance integral is a sum over a tightly packed grid in redshift. As in the previous example, eqs. 15 to 32 can be de-TeXed and used in JAGS, adding one of the two sets of priors, 33-35 or 36-38, depending on which problem one is interested in.
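To make the integral hidden in dl concrete, here is a minimal Python sketch of the luminosity distance for a flat ΛCDM universe (w = −1; Ω_M = 0.3 and H₀ = 70 are illustrative values, not the fitted ones). The grid-sum mirrors the JAGS implementation described above:

```python
import numpy as np

def lum_dist(z, omegam=0.3, H0=70.0):
    """Luminosity distance in Mpc for a flat LambdaCDM universe (w = -1)."""
    c = 299792.458                     # speed of light, km/s
    zg = np.linspace(0.0, z, 2001)     # redshift integration grid
    Ez = np.sqrt(omegam * (1 + zg)**3 + (1 - omegam))
    f = 1.0 / Ez
    # trapezoidal rule for the comoving-distance integral of 1/E(z)
    dz = zg[1] - zg[0]
    integral = dz * (f.sum() - 0.5 * (f[0] + f[-1]))
    return (1 + z) * c / H0 * integral

# distance modulus at z = 1 (eq. 29)
distmod = 25 + 5 * np.log10(lum_dist(1.0))
print(round(distmod, 2))
```

Evaluating this integral once is cheap; the cost quoted above comes from repeating it for every supernova at every posterior sampling, which is why the JAGS code precomputes a fixed redshift grid.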
data {
  # JAGS-like precisions
  precmag <- 1/errmag/errmag
  precobsc <- 1/errobsc/errobsc
  precobsx <- 1/errobsx/errobsx
  precz <- 1/errz/errz
  # grid for distance modulus integral evaluation
  for (k in 1:1500){ grid.z[k] <- (k-0.5)/1000. }
  step.grid.z <- grid.z[2]-grid.z[1]
}
model {
  for (i in 1:length(obsz)) {
    obsm[i] ~ dnorm(m[i],precmag[i])                                 # eq 15
    m[i] ~ dnorm(Mm+distmod[i]-alpha*x[i]+beta*c[i], precM)          # eq 16
    obsc[i] ~ dnorm(c[i], precobsc[i])                               # eq 22
    c[i] ~ dnorm(cm,precC)                                           # eq 24
    obsx[i] ~ dnorm(x[i], precobsx[i])                               # eq 21
    x[i] ~ dnorm(xm, precx)                                          # eq 23
    # distmod definition & H0 term
    distmod[i] <- 25 + 5/2.3026*log(dl[i]) - 5/2.3026*log(H0/300000)
    z[i] ~ dunif(0,2)                                                # eq 31
    obsz[i] ~ dnorm(z[i],precz[i])                                   # eq 30
    ######### dl computation (slow and tedious)
    tmp2[i] <- sum(step(z[i]-grid.z)*(1+w)/(1+grid.z))*step.grid.z
    omegade[i] <- omegal*exp(3*tmp2[i])
    xx[i] <- sum(pow((1+grid.z)^3*omegam + omegade[i] + (1+grid.z)^2*omegak,-0.5)
             *step.grid.z*step(z[i]-grid.z))
    # implementing an if, to avoid dividing by 0, 1e-7 added to omegak
    zz[1,i] <- sin(xx[i]*sqrt(abs(omegak)))*(1+z[i])/sqrt(abs(omegak+1e-7))
    zz[2,i] <- xx[i]*(1+z[i])
    zz[3,i] <- (exp(xx[i]*sqrt(abs(omegak)))-exp(-xx[i]*sqrt(abs(omegak))))/2
             *(1+z[i])/sqrt(abs(omegak+1e-7))
    dl[i] <- zz[b,i]
  }
  b <- 1 + (omegak==0) + 2*(omegak > 0)
  ########## end dl computation
  # JAGS uses precisions
  precM <- 1/intrscatM/intrscatM
  precC <- 1/intrscatC/intrscatC
  precx <- 1/intrscatx/intrscatx
  # priors
  Mm ~ dunif(-20.3,-18.3)   # eq 20
  alpha ~ dunif(-2,2.0)     # eq 18
  beta ~ dunif(-4,4.0)      # eq 19
  cm ~ dunif(-3,3)          # eq 26
  xm ~ dunif(-10,10)        # eq 25
  # uniform prior on logged quantities
  intrscatM <- pow(10,lgintrscatM)   # eq 17
  lgintrscatM ~ dunif(-3,0)          # eq 17
  intrscatx <- pow(10,lgintrscatx)   # eq 27
  lgintrscatx ~ dunif(-5,2)          # eq 27
  intrscatC <- pow(10,lgintrscatC)   # eq 28
  lgintrscatC ~ dunif(-5,2)          # eq 28
  # cosmo priors
  H0 ~ dnorm(72,1/8./8.)
  # eq 32
  omegal <- 1-omegam-omegak   # eq 39
  # cosmo priors, 1st set: LCDM
  #omegam ~ dunif(0,1)        # eq 36
  #omegak ~ dunif(-1,1)       # eq 37
  #w <- -1                    # eq 38
  # cosmo priors, 2nd set: wCDM
  omegam ~ dunif(0,1)         # eq 33
  omegak <- 0                 # eq 34
  w ~ dunif(-4,0)             # eq 35
}

Figure 5: Prior and posterior probability distributions for the three intrinsic scatter terms in the SNIa problem. The black jagged histogram shows the posterior as computed by MCMC, marginalised over the other parameters. The red curve is a Gaussian approximation of it. The shaded (yellow) range shows the 95% highest posterior credible interval. The adopted priors are indicated by the blue dotted curve.

Figure 5 shows the prior (dashed blue line) and posterior (histogram) probability distributions for the three intrinsic scatter terms present in the cosmological parameter estimation: the scatter in absolute luminosity after colour and stretch corrections (σ_scat), and the intrinsic scatter in the distributions of the colour and stretch terms (R_c and R_x). This plot shows that the posterior probability at intrinsic scatters near zero is approximately zero, and thus that the three intrinsic scatter terms are necessary parameters for the modeling of SNIa, not useless complications. The three posteriors are dominated by the data, the prior being quite flat in the range where the posterior is appreciably non-zero (Figure 5). Therefore, any other prior choice, as long as smooth and shallow over the shown parameter range, would have returned indistinguishable results.

Not only do SNIa have luminosities that depend on the colour and stretch terms, but these in turn have their own probability distributions (taken Gaussian for simplicity) with well-determined widths.
Figure 6 depicts the Malmquist-like bias one would incur if the spread of the distributions of the colour and stretch parameters were ignored: it reports the observed values (as in Fig 4), obsx_i and obsc_i, as well as the true values x_i and c_i (posterior means). The effect of equations 23 and 24 is to pull values toward the mean, and more so those with large errors, to compensate the systematic shift (Malmquist-like bias) toward larger observed values.

Figure 6: Effect of the population structure for the stretch and the colour parameters. Each tick goes from the observed value to the posterior mean. The population modeling attempts to counterbalance the increased spread (Malmquist-like) by pulling values toward the mean, especially those with larger errors (on the right in the panels).

Figure 7 shows the probability distributions of the colour and stretch slopes: α = 0.12 ± 0.02 and β = 2.70 ± 0.14, respectively. As for the intrinsic scatter terms, the posterior is dominated by the data, and therefore any other smooth and shallow prior would have returned indistinguishable results.

Figure 7: Prior and posterior probability distributions for the colour and stretch slopes of the SNIa problem. The black jagged histogram shows the posterior as computed by MCMC, marginalised over the other parameters. The red curve is a Gaussian approximation of it. The shaded (yellow) range shows the 95% highest posterior credible interval. The adopted (uniform) priors are indicated by the blue dotted curve.

Finally, Fig 8 reports perhaps the most wanted result: contours of equal probability for the cosmological parameters Ω_M and w. For the one-dimensional marginals we found Ω_M = 0.40 ± 0.10 and w = −1.2 ± 0.2, but with non-Gaussian probability distributions.

Figure 8: Constraints on the cosmological parameters Ω_M and w. The two contours delimit the 68% and 95% constraints.

4.3 Model checking

The work of the careful researcher does not end with finding the parameter set that best describes the data; (s)he also checks whether the adopted model is a good description of the data, or whether it is misspecified, i.e. in tension with the fitted data. In the non-Bayesian paradigm this is often achieved by computing a p-value, i.e. the probability of obtaining data more discrepant than those in hand once the parameters are taken at their best-fit values. The Bayesian version of the concept (e.g. [2]) acknowledges that the parameters are not perfectly known, and therefore one should also explore, in addition to the best-fit values, other values of the parameters. Therefore, the discrepancy becomes a vector of dimension j, instead of a scalar, measuring the distance between the data and j models, one per set of parameters considered. Of course, more probable models should occur more frequently in the list, to account for the fact that discrepancy from an unlikely model is less detrimental than discrepancy from a likely model. In practice, if the parameters are explored by sampling, it is just a matter of computing the discrepancy of the data in hand for each set j of parameters stored in the chain, instead of relying on a single set of parameters (say, the one that maximizes the likelihood). Then, one repeats the computation for fake data generated from the model, and counts how many times the fake data are more extreme than the real data.

For example, if we want to test the modeling of the observed spread of magnitudes (i.e. equations 15 and 16), let us define:

   mcor_i = M + distmod_i − α x_i + β c_i                        (40)

We generate fake supernova magnitudes,

   m.fake_i ∼ N(mcor_i, σ²_scat)                                 (41)

and fake observed values of them,

   obsm.fake_i ∼ N(m.fake_i, σ²_m,i)                            (42)

Then, we adopt a modified χ² to quantify discrepancy (or, its contrary, agreement).
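The fake-data generation of eqs. 40–42, together with a χ²-type discrepancy computed for every set of parameters stored in the chain, can be sketched outside JAGS. The following Python toy is our own illustration on a deliberately simplified Gaussian problem (a stand-in, not the actual SNIa chain; all numbers below are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(1)

# For each parameter set j in a (mock) chain we compute the chi2 of the real
# data and of a fake data set drawn from the model, then quote the fraction
# of simulations in which chi2_fake exceeds chi2_real (the Bayesian p-value).

n = 288                                    # same size as the SNIa sample
err = rng.uniform(0.1, 0.3, size=n)        # heteroscedastic measurement errors
sig_true = 0.3                             # intrinsic scatter of the toy model
# deliberately "typical" data: every standardized residual equals +-1
obs = np.where(np.arange(n) % 2 == 0, 1.0, -1.0) * np.sqrt(err**2 + sig_true**2)

# stand-in for an MCMC chain: draws of (mean mu_j, intrinsic scatter sig_j)
nchain = 4000
mu_chain = rng.normal(0.0, 0.02, size=nchain)
sig_chain = np.abs(rng.normal(sig_true, 0.02, size=nchain))

more_extreme = 0
for mu_j, sig_j in zip(mu_chain, sig_chain):
    var = err**2 + sig_j**2                # expected variance (quadrature sum)
    chi2_real = np.sum((obs - mu_j)**2 / var)
    fake = rng.normal(mu_j, np.sqrt(var))  # fake data drawn from the model
    chi2_fake = np.sum((fake - mu_j)**2 / var)
    more_extreme += chi2_fake > chi2_real

p_value = more_extreme / nchain            # non-extreme => no evidence of misfit
```

Because the toy data were generated to be typical of the model, the resulting p-value is non-extreme, the signature of a well-specified model.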
For the real data set we have:

   χ²_real,j = Σ_i (obsm_i − mcor_i,j)² / (σ²_m,i + E²(σ_scat))        (43)

where the summation is over the data and j refers to the index in the sampling chain. Apart from the j index, eq. 43 is just the usual χ²: the difference between the observed, obsm_i, and true, mcor_i, values, weighted by the expected variance, computed as the quadrature sum of the errors, σ_m,i, and the supernova magnitude intrinsic scatter, σ_scat. The χ² of the j-th fake data set, χ²_fake,j, is:

   χ²_fake,j = Σ_i (obsm.fake_i,j − mcor_i,j)² / (σ²_m,i + E²(σ_scat))  (44)

Figure 9: Standardized residuals (histogram) and their expected distribution (blue curve).

At this point, we only need to compute for which fraction of the simulations χ²_fake,j > χ²_real,j and quote the result. If the modeling is appropriate, then the computed fraction (p-value) is not extreme (far from zero or one). If it is, our statistical modeling needs to be revised, because the data are in disagreement with the model.

We performed 15000 simulations, each one generating 288 fake SNIa measurements. In practice, we added the following three JAGS lines:

   mcor[i] <- Mm+distmod[i]-alpha*x[i]+beta*c[i]    # eq 40
   m.fake[i] ~ dnorm(mcor[i], precM)                # eq 41
   obsm.fake[i] ~ dnorm(m.fake[i], precmag[i])      # eq 42

and we can simplify eq 16 to:

   m[i] ~ dnorm(mcor[i], precM)                     # eq 16

We found a p-value of 45%, i.e. the discrepancy of the data in hand is quite typical (similar to that of the fake data). Therefore, the real data are quite common, and the tested part of the model shows no evidence of misspecification. The careful researcher should then move to the other parts of the model, whose detailed exploration is left as an exercise. In such an exploration of possible model misfits, it is very useful to visually inspect several data summaries, to guide the choice of which discrepancy measure one should adopt (eq.
43 or something else?) and, if the adopted model turns out to be unsatisfactory, to guide how to revise the modeling of the tested part of the model. For example, a possible (and common) data summary is the distribution of normalized residuals, which for obsx_i reads:

   stdresobsx_i = (obsx_i − E(x_i)) / √(σ²_x,i + E²(R_x))        (45)

i.e. the observed minus the expected value of x_i, divided by their expected spread (the sum in quadrature of the errors and the intrinsic spread). A similar summary may be built for obsc_i too. To first order (at least), the standardized residuals should be normally distributed with standard deviation one (by construction). Fig 9 shows the distribution of the normalized residuals of both obsx_i and obsc_i, with a Gaussian centered on 0 and with standard deviation one superposed (in blue). Both distributions show a possible hint of asymmetry. At this point, the careful researcher may want to use a discrepancy measure sensitive to asymmetries, such as the skewness index, in addition to the χ², during model testing. While leaving the actual computation to the reader (skilled readers may note that we are dealing, by and large, with Gaussian distributions, and may attempt an analytic computation), we emphasize that if an extreme Bayesian p-value is found (on obsx_i, for simplicity of exposition), then one may replace its modeling (eq. 23 in the case of obsx_i) with a distribution allowing a non-zero asymmetry. This is easily performed in a Bayesian approach, and easily implemented in JAGS: it is just a matter of replacing the adopted Gaussian with an asymmetric distribution. If instead the data exploration gives a hint of a double-bumped distribution (again on obsx_i for simplicity of exposition), and an extreme Bayesian p-value is found when a measure of discrepancy sensitive to double-bumped distributions is adopted, then one may adopt a mixture of Gaussians, replacing (eq.
23 in the case of obsx_i) with:

   obsx_i ∼ λ N(x_i, σ²_x,i) + (1−λ) N(xx_i, σ²_xx,i)        (46)

Even simpler is the (hypothetical) case of a distribution (again of obsx_i, for simplicity of exposition) with fat tails: one may adopt a Cauchy distribution, and coding it in JAGS is just a matter of replacing dnorm with dt in the JAGS implementation. And so on.

In summary, model checking consists in updating the model until it produces data similar to those in hand. One should start by carefully and attentively inspecting the data and their summaries. This inspection should suggest a discrepancy measure to be used to quantify the model misfit and, if one is found, to guide the model updating. The procedure should be iterated until the model fully reproduces the data.

5 Summary and conclusions

These two analyses offer a template for modeling the common awkward features of astronomical data, namely heteroscedastic errors, non-Gaussian likelihoods (inclusive of upper/lower limits), non-uniform populations or data structure, intrinsic scatter (due either to unidentified sources of error or to population spreads), noisy estimates of the errors, mixtures, and prior knowledge. In a Bayesian framework, learning from the data and the prior is just a matter of formalizing in mathematical terms our wordy statements about the quantities under investigation and about how the data arise. The actual numerical computation of the posterior probability distribution of the parameters is left to (special) Monte Carlo programs, of which we need be no more afraid than of the (Monte Carlo) numerical methods used to compute the integral of a function.
The great advantage of Bayesian modeling is its high flexibility: if the data (or theory) call for a more complex modeling, or for distributions different from those initially taken, it is just a matter of replacing them in the model, because there are no simplifications forced by the need of reaching the finishing line, as is instead the case in other modeling approaches. Furthermore, if one is interested in constraining another cosmological model, for example one with a redshift-dependent dark energy equation of state w = w_0 + w_1 (1+z)/(1+z_p), one should just replace w in the JAGS code with the above expression. This flexible, hierarchical modeling is native to Bayesian methods. In both our examples, the Bayesian approach performs, unsurprisingly, better than non-Bayesian methods obliged to discard part of the available information in order to reach the finishing line: Bayesian methods just use all the provided information.

As a final note, we recall that the careful researcher, whether using Bayesian modeling or not, should check, before publishing, that the numerical computation is adequate for the purpose and that the model is appropriate for the data. Therefore, if you, gentle reader, are using our examples as templates, remember to include in your work a sensitivity analysis, checking that your assumptions (both likelihood and prior) are reasonable. Some prior sensitivity analysis has been performed in both examples, to emphasize the importance of showing how much the conclusions (the posterior) rely on the assumptions (the prior). For what concerns the likelihood, i.e. misfitting, the key point consists in updating the model formulation until it produces data similar to those in hand, as we have shown in great detail for the second example.
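As an illustration of such a replacement, only the integrand of the dark-energy density term changes when w becomes redshift dependent. The following Python sketch is hypothetical (parameter values for w_0, w_1 and z_p are arbitrary examples, not fitted quantities) and mirrors the tmp2/omegade lines of the JAGS model:

```python
import numpy as np

# Hypothetical sketch: dark-energy density for w(z) = w0 + w1*(1+z)/(1+zp),
# computed on the same midpoint redshift grid as the JAGS model.

def omegade_evol(z, omegal=0.7, w0=-1.0, w1=0.0, zp=0.5, nz=1500):
    """omegal * exp(3 * int_0^z (1+w(z'))/(1+z') dz'), via a Riemann sum."""
    grid_z = (np.arange(1, nz + 1) - 0.5) / 1000.0
    step = grid_z[1] - grid_z[0]
    inside = grid_z < z
    w_of_z = w0 + w1 * (1.0 + grid_z) / (1.0 + zp)   # the replaced equation of state
    integral = np.sum(inside * (1.0 + w_of_z) / (1.0 + grid_z)) * step
    return omegal * np.exp(3.0 * integral)

# with w1 = 0 this reduces to the constant-w case used in the model above,
# for which omegade = omegal * (1+z)**(3*(1+w)) analytically
```

With w_1 = 0 and w_0 = −1 the dark-energy density stays constant, as it must for a cosmological constant.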
More in general, we hope that the template modeling shown in these two examples may be useful for any analysis confronted with modeling the awkward features of astronomical data, among which heteroscedastic (point-dependent) errors, intrinsic scatter, data structure, non-uniform populations (often called Malmquist bias), and non-Gaussian data, inclusive of upper/lower limits.

Acknowledgements: The first part of this chapter is largely based on papers written in collaboration with Merrilee Hurn. It is a pleasure to thank Merrilee for the fruitful collaboration and her wise suggestions along the years, and for comments on an early draft of this chapter. Errors and inconsistencies remain my own.

References

[1] S. Andreon and M.A. Hurn. Statistical aspects of regressions and scaling relations in astrophysics. Statistical Analysis and Data Mining, submitted, 2011.
[2] A. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 2004.
[3] S. Andreon and M.A. Hurn. The scaling relation between richness and mass of galaxy clusters: a Bayesian approach. Monthly Notices of the Royal Astronomical Society, 404(4):1922–1937, 2010.
[4] T. Bayes. Essay Towards Solving a Problem in the Doctrine of Chances. Philosophical Transactions of the Royal Society of London, 1764.
[5] P.-S. Laplace. Théorie Analytique des Probabilités. 1812.
[6] D. Lunn, D. Spiegelhalter, A. Thomas, and N. Best. The BUGS project: Evolution, critique and future directions. Statistics in Medicine, 28(25):3049–3067, 2009.
[7] M. Plummer. JAGS Version 2.2.0 user manual, 2010.
[8] S. Andreon and A. Moretti. Do X-ray dark, or underluminous, galaxy clusters exist? Astronomy & Astrophysics, arXiv:1109.4031, in press, 2011.
[9] R. Kessler et al.
First-Year Sloan Digital Sky Survey-II Supernova Results: Hubble Diagram and Cosmological Parameters. The Astrophysical Journal Supplement, 185:32–84, 2009.
[10] M.C. March, R. Trotta, P. Berkes, G.D. Starkman, and P.M. Vaudrevange. Improved constraints on cosmological parameters from SNIa data. arXiv:1102.3237, 2011.
[11] W.L. Freedman, B.F. Madore, B.K. Gibson, L. Ferrarese, D.D. Kelson, S. Sakai, J.R. Mould, R.C. Kennicutt, Jr., H.C. Ford, J.A. Graham, J.P. Huchra, S.M.G. Hughes, G.D. Illingworth, L.M. Macri, and P.B. Stetson. Final Results from the Hubble Space Telescope Key Project to Measure the Hubble Constant. The Astrophysical Journal, 553:47–72, May 2001.
