Densities mixture unfolding for data obtained from detectors with finite resolution and limited acceptance
A procedure based on a Mixture Density Model for correcting experimental data for distortions due to finite resolution and limited detector acceptance is presented. Addressing the case that the solution is known to be non-negative, in the approach pr…
Authors: Nikolai Gagunashvili
Densities mixture unfolding for data obtained from detectors with finite resolution and limited acceptance N.D. Gagunash vili ∗ ,1 University of A kur eyri, B or gir, v/Nor dursl´ od, IS-600 A kur eyri, Ic ela nd Abstract A pr o cedure based on a Mixture Densit y Mo del for correcting exp erimen tal data for distortions due to finite resolution and limited detector a cceptance is presen ted. Addressing the case that the solution is kno wn to b e non- negativ e, in the appro a c h presen ted here, the true distribution is estimated b y a w eighte d sum of pro babilit y densit y functions with p ositive w eigh ts and with the width of the densities acting as a regularisation pa rameter respo n- sible f or the sm o othness of the result. T o obtain b etter smo othing in less p opulated regions, the width parameter is c hosen inv ersely prop ortional to the square ro ot of the estimated densit y . F urt hermore, the non- nega tiv e gar- rote metho d is used to find the most economic represen tation of the solution. Cross-v alidation is emplo ye d to determine the o ptimal v alues o f the resolu- tion and garrote parameters. The prop osed approa c h is directly applicable to m ultidimensional problems. Numerical examples in one and t wo dimensions are presen ted to illustrate the pro cedure. Key wor ds: decon v olution, mixture densities, adaptiv e algorithm, in v erse problem, single sided strongly v arying sp ectra, regularisatio n P ACS: 02.30.Zz, 07.05.Kf, 07 .05.Fb ∗ T el.: +354-4 608505 ; fax: +35 4-4608 998 Email add r ess: ni kolai@u nak.is (N.D. Gagunashvili) 1 Present addres s: Max- Planck-Institut f¨ ur Kernphysik, PO Box 103 980, 69029 Heidelber g, Ger man y Pr eprint submitte d t o Elsevier August 6, 2018 1. In t ro duction The probabilit y densit y function (PDF) P ( x ′ ) of an experimen tally mea- sured c haracteristic x ′ , in general, differs from the true ph ysical PDF p ( x ) b ecause o f the limited a cceptance (probability) A ( x ) to register an ev en t with true c haracteristic x , finite resolution and bias in the resp onse function R ( x ′ | x ), whic h describes the probability to observ e x ′ for a giv en true v alue x . F ormally the r elation b etw een P ( x ′ ) and p ( x ) is giv en b y P ( x ′ ) ∝ Z Ω p ( x ) A ( x ) R ( x ′ | x ) d x . (1) The in tegra tion in ( 1 ) is carried out ov er the domain Ω of the v ariable x . In practical applications the exp erimen tal distribution is usually discretised b y using a histogram represen tation, obtained by inte grating P ( x ′ ) ov er n finite sized bins P j = Z c j c j − 1 P ( x ′ ) dx ′ j = 1 , . . . , n (2) with c j − 1 , c j the limits o f bin j . If a parametric (theoretical) mo del p ( x, a 1 , a 2 , . . . , a l ) for the true PDF is know n, then the unfo lding can b e done b y determining the parameters. F or example, by a least squares fit to the binned data [1–3]. Here the mo del, whic h allows to describ e the t rue distribution b y a finite num b er of parameter v alues, constitutes a priori information whic h is needed to correct f or the distortions by the exp erimen tal setup, In con trast, mo del indep enden t unfolding, as considered e.g. in [4–14], is an ill-p osed problem, and ev ery appro a c h to solv e it requires a priori infor- mation ab o ut the solution. Metho ds differ, directly or indirectly , in the w ay a priori informa t io n is incorp orated in the result. 2. Description of t he unfolding metho d T o solv e the unfolding problem (1), a represen tation of the true distribu- tion has to b e c hosen. This represen tation should b e as flexible as p o ssible and allow in tro ducing a priori information. Classical ke rnel statistics is an example that approx imates the true distribution by putting a 1 / N -w eighed cop y of a k ernel PDF at the lo cation of each of N observ ed data p oints and adding them up (see e.g. [15]). With enough data, this comes arbitrarily close to any PDF. There exist metho ds tha t use a k ernel represen tation of 2 the true distribution to solv e also the inv erse problem [16]. One dra wbac k of this approac h is that one has to store all t he data p o in ts, a nother is that the kno wn k ernel based algorithms exp ect the resp onse function of a set-up in analytical fo rm, i.e. computer mo delling cannot b e used. In this pap er the use of a Mixture Densit y Mo del (MDM) [17, 19] to describe t he true distribution p ( x ) is prop osed, p ( x ) = s X i =1 w i K i ( x ; a 1 i , ..., a li ) , (3) where the K i ( x ; a 1 i , ..., a li ) is the i th Proba bility Densit y F unction in Mixture (PDFM) with parameters a 1 i , ..., a li and the we igh t w i the fraction o f the i th PDFM. The MDM lies b et we en the cases of the pa r a metric represen tation of the true densit y on one hand, i.e. the case when there is o nly one distribution in the sum (3), a nd the k ernel statistics approach where the n umber of terms in the sum (3) is equal to the n um b er of observ ations N . The MDM ha s a limited num b er of parameters fo r represen ting a PDF and computer mo d- elling can b e used to calculate the resp onse of the system. The MDM is also conv enien t for t aking in to accoun t differen t t yp e of a priori information, suc h as kno wledge a b o ut t he t yp e o f distributions, constraints on parameters, smo othness of the distributions and so on. Ideas and a c hiev emen ts of regres- sion a nalysis as well as classical k ernel statistics can b e used in applications of a MDM for estimating the densities. Using Eq. ( 3 ) t o parameterise the solution p ( x ) reduces the unfolding problem from finding a solution in the infinite-dimensional space of a ll func- tions to finding a solution in a finite dimensional space. This w a y a n approx- imation of the true densit y is p erformed which, in contrast to e.g. a discreti- sation b y a histogram, has the a dv antage to in tr o duce negligible quan tisation errors for sufficien tly smo oth distributions. Without loss of generality tw o-parametric PDFMs will b e used through- out the pap er. The fir st parameter, x i , defines the mean v alue (lo cation) of term i and the second one, λ i , represen ts the standard deviation. Differen t smo oth PDFMs commonly emplo y ed by k ernel statistics, suc h as biw eight, triw eight, tricub e, cosine, Cauch y , B- spline a nd other k ernels can b e used. Rather p opular is the Gaussian Mixture Mo del (GMM) [1 8] with PDFMs K i ( x ; x i , λ i ) = 1 λ i √ 2 π exp − ( x − x i ) 2 2 λ 2 i , (4) 3 whic h pro vides a rather flexible mo del in t he approximation of a wide class of statistical distributions. The standard deviation λ i acts as a regularisation parameter, whic h a llo ws to adjust the smo othness of the result. W eigh ts, p ositions x i and standard deviations λ i are determine d b y the unfolding pro cedure describ ed b elow. Substituting p ( x ) as represen ted b y Eq (3 ) in to the basic Eq. (1 ) yields P ( x ′ ) = s X i =1 w i Z Ω K i ( x ; x i , λ i ) A ( x ) R ( x ′ | x ) d x , (5) and taking statistical fluctuations into account, the relation betw een the w eights w i and the histogr a m of the o bserv ed distribution b ecomes a set of linear equations P = Q w + ǫ , (6) where P is the n -comp onent column v ector of the exp erimen tally measured histogram, w = ( w 1 , w 2 , ..., w s ) t is the s -comp onen t v ector of w eigh ts and Q is an n × s matrix with elemen ts Q j i = Z c j c j − 1 K i ( x ; x i , λ i ) A ( x ) R ( x ′ | x ) dx j = 1 , . . . , n ; i = 1 , . . . , s . (7 ) The v ector ǫ is an n - comp o nen t v ector of random deviates with exp ectation v alue E [ ǫ ] = 0 and cov ariance matrix C , the diago na l elemen t s of whic h b eing V ar[ ǫ ] = diag ( σ 2 1 , σ 2 2 , · · · , σ 2 n ), where σ j is the statistical error of the measured distribution for t he j th bin. Each column of the matrix Q is the resp onse of the system to one of the PDFM in the mixture mo del for the true distribution. Numerically the calculation of the column v ectors can b e done b y w eigh t ing ev ents of a Mon te Carlo sample suc h that they follo w the corresp onding PDFM, see Ref. [20], and taking the histogram of the observ ed distribution o btained with the w eighted entries . By a non- negativ e least-squares fit, the w eight v ector w in Eq. (6) for a giv en set of PDFMs is determined suc h that it minimizes X 2 = ( P − Q ˆ w ) t C − 1 ( P − Q ˆ w ) (8) under the constrain ts w i ≥ 0 i = 1 , ..., s . (9) F ollow ing r eference [21], if an unconstrained solutio n satisfies Eq. (9) then ˆ w solv es the constrained problem. Otherwise, the solution to the constrained 4 problem m ust b e a b oundary p oin t o f [0 , + ∞ ) s and t herefore a t least one w i = 0. It follow s tha t after p erforming all p ossible regressions with one or more w i in Eq. (9) se t to zero, t he non-negativ e problem is solv ed by pic king the subset o f w i satisfying Eq. (9) suc h that X 2 as defined in Eq.(8) is smallest. The n umerical algorithm and computer program for solving this minimisation problem has b een dev elop ed in references [21 , 22]. Here, first the subset o f comp onen ts equal to zero is determined iterativ ely , and the v ector of the remaining indices ˆ w is found b y simple linear regression ˆ w = ( Q t C − 1 Q ) − 1 ( Q t C − 1 ) P , (10) where Q is the submatrix of Q t ha t corresp onds to the subset of indices of p ositiv e comp onen ts of the solution. The result of the fit is an estimate of the unfolded distribution ˆ p ( x ), defined by a subset of par a meters x i , λ i , i = 1 , . . . , k whic h are summed with po sitiv e w eigh ts ˆ w i , i = 1 , . . . , k to yield ˆ p ( x ) = k X i =1 ˆ w i K i ( x ; x i , λ i ) . (11) The c hoices of the optimal t yp e of PDFMs and t he v alues of pa r a meters (mean v alues and the standard deviations for the GMM mo del) are driv en b y the accuracy and the complexit y of the mo del. The goal is a simple, and at the same t ime, accurate solution of the problem. A figure of merit for the accuracy is the Prediction Erro r ( P E ) [23], defined as the exp ectation v alue of the a v erage squared normalised residual when using the predictor Q ˆ w to describe an indep enden t exp erimen tally measured histogram P new dra wn from the same parent distribution as the original, P E ( Q ˆ w ) = E [ 1 n ( P new − Q ˆ w ) t C − 1 ( P new − Q ˆ w )] . (12) The exp ectation is tak en o v er P new . In the fo llo wing w e will denote the predictor Q ˆ w as ˆ P and call it the fitting histogram. F ollow ing reference [2 3], V -f old Cross-V alidation allows to estimate P E ( Q ˆ w ). Here the giv en dat a set U is split into V subsets U 1 , ..., U V with equal num b er of ev en ts. The complemen tary sets are denoted by U ( v ) = U − U v . Applying the minimisation pro cedure to U ( v ) and forming the predictors Q ˆ w ( v ) , the Cross-V alidation error ( C V ) is defined b y C V = 1 n V X v =1 ( P v − Q ˆ w ( v ) ) t C − 1 ( P v − Q ˆ w ( v ) ) , (13) 5 where P v is the v ector o f histogra m con ten ts for the subset of the data U v . The Cross-V alidatio n error is the estimate of the Prediction Error C V = d P E ( Q ˆ w ) . (14) In order to hav e sufficien t sampling of the configura tion space, the num b er of folders used in the Cross-V alidation pro cedure should not b e to o small. On the other hand, for statistically meaningful results, it should not b e to o large either. Practice sho ws that taking V in the range b et we en 5 and 10 usually giv es satisfactory results, and t ha t the p erformance is not sensitiv e to the exact c hoice. The prop osed unfolding pro cedure consists of three steps: 2.1. First step The p ositions { x i } of the PDFMs are dra wn randomly from a uniform distribution on the allo we d range of x and with a n umber of PDFMs suc h that t he a verage distance b et w een individual cente rs is significan tly smaller than the width of the PDF Ms. In order t o minimise the loss of information due to binning, the n umber of bins for the measured histogram P should b e as large as p ossible. On the other hand, in order to hav e meaningful error estimates for the least squares fits that determine ˆ w , the n umber of entries in a single bin should no t b e less than 25. Binning with approximately equal num b er of ev en ts in eac h bin is preferable. In this first step the width for a ll PDF Ms in the mixture is take n to b e the same, λ i = λ . Differen t v alues λ are tried and the ˆ λ with the smallest Cross-V alidation error ˆ λ = argmin λ C V ( λ ) (15) is selected. 2.2. Se c ond step Exploiting the information gained so f ar, the pro cedure is rep eated with p ositions { x i } of the PDFMs randomly dra wn according to the estimate of the true densit y ˆ p ( x ) (11) obtained in the first step. In addition, the widths 6 of t he PDFMs { ˆ λ i } are ta k en to be in v ersely pro p ortional to the square ro ot of the result from the first step at the p osition { x i } ˆ λ i = ˆ λ p ˆ p ( x i ) (16) with the v alue of ˆ λ a gain determined b y means of Cross-V a lida tion. This second step is motiv ated by the results o f reference [24], where it is show n that by this w ay the bias of a kerne l estimation of a PDF can b e decreased. The approach ba la nces b etter smo othing in less densely p opulated regions against the p ossibilit y to resolv e finer structures in regions with a higher sampling. It is plausible that for the unfolding case the bias on the shap e of the densit y estimate will decrease also. Finally it has to b e noted t ha t this second step can b e iterated sev eral times, ev en though practical examples sho w that the gain is small. It is recommended to use the same n um b er of PDFMs for the second step as in the first step or more. 2.3. Thir d step Since the n um b er of terms obtained b y the previous t w o steps can still b e large, with not all PDFMs con tr ibuting indep enden t informa t io n, a third step is added to select the most relev an t subset. T o reduce the n umber of the PDFMs, the non-negative garrote method [23] is used. It amounts to taking the set of non-zero w eigh ts { ˆ w j } obtained in the second step and finding co efficien ts { c j } that minimise n X i =1 ( P i − s X j =1 Q ij c j ˆ w j ) 2 /σ 2 i (17) under the constrain ts c j ≥ 0 and s X j =1 c j ≤ r . (18) F or r ≥ P ˆ w i the solution fro m the previous step is not t o uc hed. F or smaller v alues the garrote eliminates some of we igh ts and mo difies others, such that ˜ w j ( r ) = c j ˆ w j are the new v alues of t he w eigh ts for the PDFMs of the estimate the unfo lded distribution. Cross-v alidation is used to choose the optimal garrote parameter r . T o reduce a p oten tia l bia s in tro duced b y the garr ote, the w eights of the PDFMs are a gain determined b y a non-negativ e leas t squares fit on the r emaining terms. 7 2.4. Quality assessm ent and err or pr op agation The qualit y of the fit can b e asses sed with common to o ls used in regression analysis [25]: 1. p -v alue of the fit is defined b y P r ( X ≥ P n i =1 ( P i − ˆ P i ) 2 /σ 2 i ) | χ 2 n − s ), where P r stands f o r probability 2. analysis of the normalised residuals Res i = ( P i − ˆ P i ) /σ i , i = 1 , ..., n (a) a s a function of the estimated v alue ˆ P (b) as a function o f the observ ed v alue x ′ 3. Q-Q plot: (data quan tile) i = (num b er of residuals ≤ Res i ) /n ve rsus (theoretical quan tile) i = P r ( X ≤ ( Res i |N (0 , 1)), i = 1 , ..., n Since the unfo lding pro cedure describ ed ab ov e is not analytically defined, the b o otstrap approac h [26] is the metho d of c hoice to estimate the sta- tistical uncertain ties of the unfolding result. Keeping the normalisation of the observ ed histogram constant, replications are generated according to the m ultinomial distribution N ! N 1 ! N 2 ! . . . N n ! P N 1 1 . . . P N n n with P i = P s j =1 Q ij ˆ w j P n i =1 P s j =1 Q ij ˆ w j , (19) where the set of p o sitiv e w eights obtained in the final step is used. A histogram represen tation ˆ p fo r t he unfolded distribution ˆ p ( x ) with m bins integrating ov er the x -in terv als [ b i − 1 , b i ] , i = 1 , . . . , m is obtained by ˆ p = K ˆ w , (20) where K is an m × k ) matrix with elemen ts K ij = Z b i b i − 1 K j ( x ; x j , λ j ) d x . (21) The unfolding metho d describ ed ab ov e assumes that the matrix Q relat- ing the w eigh t v ector ˆ w to the measuremen ts P is know n exactly . Therefore, when Q is determined b y means of a Monte Carlo sim ula t ion, the Mon te Carlo sample should b e significan tly lar ger than the data sample. In an extension of the metho d whic h is applicable a lso in cases where the Mon te Carlo statistics is of the same or der or less than t he data statistics is obtained b y using a mo dified matrix of errors C whic h includes statistical errors for the elemen ts of matrix Q [27], and the Cross-V alidation statistics substituted b y the go o dness-of-fit statistics for the comparing unw eigh ted P v and w eigh ted Q ˆ w ( v ) histograms giv en in references [28, 2 9]. 8 3. Numerical examples Three t yp es of n umerical examples are discusse d to illustrate the unfold- ing pro cedure. The fir st is the classic example of a double p eak structure prop osed b y V. Blob el [5]. The second is a strongly v arying one-sided distri- bution, and the third one a t w o- dimensional case. 3.1. Double p e ak structur e The metho d describ ed ab ov e is illustrated using the example prop osed in reference [5 ]. The true distribution, defined on the range x ∈ [0 , 2] is described by a sum o f three Breit-Wigner functions p ( x ) ∝ 4 ( x − 0 . 4) 2 + 4 + 0 . 4 ( x − 0 . 8) 2 + 0 . 04 + 0 . 2 ( x − 1 . 5) 2 + 0 . 04 (22) from whic h the experimentally measured distribution is obtained b y P ( x ′ ) ∝ Z 2 0 p ( x ) A ( x ) R ( x ′ | x ) dx, (23) with an acceptance function A ( x ) A ( x ) = 1 − ( x − 1) 2 2 (24) and a resp onse function describing a biased measuremen t with gaussian smearing R ( x ′ | x ) = 1 √ 2 π σ exp − ( x ′ − x + 0 . 05 x 2 ) 2 2 σ 2 with σ = 0 . 1 . (25) The a cceptance and resolution functions are sho wn in Fig. 1. Also sho wn is an example for the measured distribution obtained b y sim ulating a sam- ple of N = 5 00 0 ev ents. A histogram with n um b er of bins n = 87 and 9 appro ximately equal n um b er of ev ents in each bin w as used. 0 1 2 3 4 0 0.5 1 1.5 2 x R(x,0.5) R(x,1) R(x,1.5) A(x) 0 2000 4000 6000 0 1 2 x´ P Figure 1: Acceptance function A ( x ) and resolution function R ( x ′ | x ) for x = 0 . 5 , 1 . 0 and 1 . 5 (left) and histogra m o f the measured distribution P based on a sample of 5 000 event s generated for the true distribution (right). The bin conten ts of the histogr am are no r- malised to the bin width. The true distribution p ( x ) is shown b y the curve (right). The PDFMs we re defined in the form K i ( x ; x i , λ i ) ∝ 1 λ i √ 2 π exp − ( x − x i ) 2 2 λ 2 i + 1 λ i √ 2 π exp − ( x + x i ) 2 2 λ 2 i + 1 λ i √ 2 π exp − ( x − 4 + x i ) 2 2 λ 2 i I { x ∈ [0;2] } , with the indicator function I {··· } = 1 if the condition give n in the curly brac k ets is satisfied 0 otherwise . The functional form is c hosen in accordance with the recommendation for- m ulated in reference [15] for PDFs defined on the r estricted interv a l. An initial set of 400 PDFMs was used with p ositions x i uniformly distributed o ver the in terv al [0 , 2]. F or t he determination o f the matrix Q a sample of 500 00 0 Mon te Carlo ev en ts w a s sim ulated. Here a uniform true distribution w as tak en and the resp onses of the individual PDFMs w ere calculated by w eighting the Monte Carlo ev en ts with w eights pro p o rtional to the v alue of 10 the resp ectiv e PDFM [20]. 1.055 1.06 1.065 1.07 1.075 1.08 1.085 1.09 1.095 1.1 0.14 0.16 0.18 0.2 0.22 0.24 0.26 1st step 2nd step λ CV error 1.035 1.04 1.045 1.05 1.055 1.06 1.065 1.07 1.075 4 4.5 5 5.5 6 6.5 7 r CV error Figure 2 : Cr o ss-V a lidation errors o f the first and seco nd unfolding step a s a function o f λ (top), and Cross- V alidation err or in the third s tep for ˆ λ = 0 . 19 as a function of the garrote parameter r (bottom). The top plot of Fig. 2 sho ws ho w the Cross-V alidation error fo r 5- f old Cross-V alidation in the first and the second step b eha ves as a function of λ . One observ es that the b est v alue in the second step is slightly smaller in step 2, and also that the Cross-V alidatio n error is reduced. This sho ws that adapting the widths of the PDFMs in the mo del a ccording to the estimated densit y not only pro vides b etter smo othing in regions of small statistics, but also improv es the quality of the unfolding result. The b est v alue is found as ˆ λ = 0 . 19. This v alue is used in the thir d step, where the non-negativ e garrote is emplo ye d to reduce the num b er of the PDFMs. The b ottom frame of Fig . 2 11 sho ws the Cross-V alidation erro r a s a function of the garro t e para meter r . Again a significan t improv emen t is f ound for r = 4 . 2. F or the statistics used in this example, the final estimate of the true distribution contains only three terms. The qualit y of the unfolding result is illustrated b y Fig. 3. It sho ws ho w the folded estimate of true distribution ˆ P , with three compo nen ts, approx- imates the measured distribution, together with plot s of the residuals and the quan tile-quan t ile plot. No structure in either of the con trol plots is ob- serv ed. The p -v alue fro m the test comparing the histog ram of the measured distribution P and the fitting histogram ˆ P is p = 0 . 6 . 0 2000 4000 6000 0 1 2 x´ P ˆ a 0 2000 4000 6000 -2 0 2 Res P ˆ b -2 0 2 0 1 2 x´ Res c -2 0 2 -2 0 2 Theoretical quantile Data quantile d Figure 3: Illustr ation of the quality of the unfolding result. (a) folded estimate of the true distribution ˆ P (solid histogr am), with tree comp onent s, co mpared to the measured distribution P ; (b) no r malised r esiduals o f the fit as a function of ˆ P ; (c) normalized residuals as a function o f x ′ ; (d) quantile-quan tile plot for the normalized r esiduals. The estimate of the true distribution obtained by the unfolding pro ce- 12 dure is presen ted in Figs. 4 and 5. The comp onen ts of the unfolding results together with the estimate ˆ p ( x ) in Fig. 4. Also sho wn are the t wo standard deviation bands ± 2 δ ( x ) compared to the true distribution p ( x ). Histogram presen tations of the unfo lded distribution are sho wn in Fig. 5 for n = 12 bins as in reference [5] a nd fo r n = 40 bins as in reference [9]. Standard deviation bands a nd bin-b y-bin uncertain ties for the histograms we re estimated by the b o otstrap metho d. 0 2000 4000 6000 0 0.5 1 1.5 x p(x) ˆ 0 0.5 1 1.5 2 x Figure 4: Comp onents of the unfolded dis tribution and the unfolded distribution ˆ p ( x ) given by the sum of the comp onents with ± 2 δ ( x ) interv al (left) and the tw o standa rd deviation band ov er laid with the true distr ibution p ( x ) (r igh t). 0 2000 4000 6000 0 0.5 1 1.5 x p ˆ 0 0.5 1 1.5 2 x Figure 5: Binned representation of the unfolding res ult ˆ p i for n = 12 (left) and n = 40 (right) bins . The points with er ror bars ar e the estimate obtained b y the unfolding pro cedure, the histo gram shows the true bin conten ts p i . F or the illustration of the prop osed algorithm, the unfolded distributions 13 for steps 1 and 2 o n Fig. 6 are presen ted as w ell as T able 1 with v alues of parameters x i , ˆ λ i , ˆ w i of t he comp onen ts fo r the three steps of the pro cedure. 0 2000 4000 6000 0 0.5 1 1.5 x p(x) ˆ 0 0.5 1 1.5 2 x Figure 6: Comp onents of the unfolded dis tribution and the unfolded distribution ˆ p ( x ) given by the sum of the comp onents for the step 1 (left) and for the step 2 (rig ht). T able 1: V alues of par ameters x i , ˆ λ i , ˆ w i of the PDFMs represented unfolded distributions for three steps of pro cedure i 1 2 3 4 5 6 7 x i 0.275 1 .475 1.992 1.487 0.798 0.812 0.284 ˆ λ i 0.210 0 .210 0.210 0.210 0.210 0.210 0.210 Step 1 ˆ w i 0.087 0 .103 0.028 0.214 0.325 0.211 0.009 x i 0.814 1 .485 0.451 1.475 0.445 1.796 0.807 ˆ λ i 0.187 0 .249 0.308 0.249 0.312 0.356 0.187 Step 2 ˆ w i 0.286 0 .278 0.117 0 .0882 0.06 28 0.004 0.1 41 x i 0.814 1 .485 0.451 – – – – ˆ λ i 0.187 0 .249 0.308 – – – – Step 3 ˆ w i 0.426 0 .369 0.183 – – – – The whole n umerical exp erimen t and unfolding w as r ep eated ten times. The n umber of obtained final comp onen ts v aried b et we en t hr ee and six. The obtained p -v alues sho w no clear deviation from an ev enly distribution b e- t wee n zero and unit y , indicating that the measured distributions are t ypically reasonably w ell describ ed b y t he folded estimates of the true distribution – supp orting the v alidity of the unfolding approach. 14 3.2. Str ongly varying o ne-side d distribution In this example the ab o v e metho d is a pplied to unfold a strong ly v arying one-sided PDF . The true distribution, defined in the range [0 , + ∞ ), is p ( x ) ∝ xe − 5 x . (26) Let us represen t the true v alue x as a function o f tw o v ariables u and v , x = √ u 2 + v 2 , with u = x cos( φ ) a nd v = x sin( φ ), with the angular v ariable φ uniformly distributed in [0 , 2 π ). The reconstructed v alue x ′ = √ u ′ 2 + v ′ 2 is obtained from u ′ and v ′ , defined as indep enden t random v ariables with normal distributions N ( u, (0 . 5 u ) 2 ) and N ( v , (0 . 5 v ) 2 ) resp ectiv ely . Here w e do not presen t an analytical formula fo r t he resolution function R ( x ′ | x ), but notice that it is a generalisation of the Rice distribution. An example for the measured distribution obtained by sim ulat ing a sample of N = 10 000 ev ents is presen ted in Fig. 7. 0 5000 10000 15000 20000 0 1 2 x´ P Figure 7: The histogram of the measur e d distributio n P based o n a sample o f 10 000 even ts generated for the true distribution. The true distribution p ( x ) is shown by the curve. In general the c hoice of PDFMs has should b e adapted to the problem at hand, i.e. symmetric gaussian PDFMs as used in the previous example are not directly suitable for this kind of unfolding problem. The GMM fit 15 mo del can, how ev er, b e used after transforming the problem suc h that t he true distribution b ecomes a ppro ximately ga ussian in shap e. The Bo x- Cox transformation [30] x ⌈ µ ⌋ = ( x µ − 1) /µ for µ 6 = 0 ln x for µ = 0 (27) is appropriate for this case. After tra nsforming the unfolding result back to the original v ariables, the en tire pro cedure is equiv alen t to using a PDFMs of the form K ( x ; x i , λ, µ ) = 1 λ √ 2 π exp − ( x ⌈ µ ⌋ − x ⌈ µ ⌋ i ) 2 2 λ 2 ! x ( µ − 1) . (28) F or the determination o f the ma t r ix Q a sample of 1 000 000 Mon t e Carlo ev en ts w as sim ulated. The true distribution was taken to b e unifo rm a nd the resp onse of the PDF M w as calculated b y w eigh ting the Mon te Carlo ev en ts with w eights prop o rtional to the v alue of resp ectiv e PD F Ms [2 0]. In the first step an initial set of 400 PD F Ms w as used with p o sitions x i uniformly distributed ov er the in terv al [0 , 2]. The transformatio n parameter µ = 0 . 25 w as used, whic h leads to a transformed PDF P ( x ′⌈ µ ⌋ ) with sk ewness close to 0. As sho wn in Fig. 8, the constan t width pa r a meter ˆ λ i = ˆ λ = 0 . 5 prov ides the minimum v alue for the Cross-V alida t io n error C V ( λ ). In the second step ˆ λ i is not constan t but in verse ly prop ortional to the square ro ot o f the unfolded densit y obtained in the first step in order t o hav e b etter smoot hing in regions of lo w statistics. Here one finds a preferred v alue o f ˆ λ = 0 . 39. In this case t he Cross-V alidation error do es not improv e. In the thir d step finally a b est v alue for the garrote parameter r = 2 . 8 is found for ˆ λ = 0 . 39. The minim un, ho w ev er, is not v ery pronounced. These tw o parameters are used for the final calculation of the unfolded distribution. Only three terms are retained for the estimate of the true distribution. 16 1.095 1.1 1.105 1.11 1.115 1.12 1.125 1.13 0.3 0.35 0.4 0.45 0.5 0.55 0.6 1st step 2nd step λ CV error 1.1 1.12 1.14 1.16 1.18 1.2 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3 r CV error Figure 8: Cross -V alidation errors for differen t v a lues of λ (top) a nd C r oss-V alidation error s for different v alues of r ( λ = 0 . 39) (bo ttom). Figure 9 illustrates the qualit y of t he fit. No structure in either of the con trol plots is observ ed. The p - v alue from the test for the comparison of the histogram of the measured distribution P and the fitting histogram ˆ P , Fig. 9(a), is p = 0 . 43. The comp onents of the unfolding results are sho wn together with the estimate ˆ p ( x ) in Fig . 10. Also shown are the standard deviation bands ± 2 δ ( x ) compared to the true distribution p ( x ). The binned presen tation of the unfolded distribution sho wn in F ig. 11 w as done with n = 21 bins. 17 0 5000 10000 15000 20000 0 1 2 x´ P ˆ a 0 5000 10000 15000 20000 -2 0 2 Res P ˆ b -2 0 2 0 1 2 x´ Res c -2 0 2 -2 0 2 Theoretical quantile Data quantile d Figure 9: Illustr ation of the q ua lit y of the unfolding result: (a) folded es timate of the true distribution ˆ P (so lid histog ram) compared to the measur ed distribution P ; (b) normalised residuals of the fit as a function of ˆ P ; (c) nor malised residuals as a function o f x ′ ; (d) quantile-quan tile- plot for the normalis ed residuals. 18 0 5000 10000 15000 20000 0 0.5 1 1.5 x p(x) ˆ 0 0.5 1 1.5 2 x Figure 10: Components of the unfolded distribution and the unfolded distribution ˆ p ( x ) given by the sum of the comp onents with ± 2 δ ( x ) interv al (left) and the tw o standa rd deviations band ov er laid with the true distribution p ( x ) (right). 10 2 10 3 10 4 0 0.5 1 1.5 2 x p ˆ Figure 11: Binned representation of the unfolding r esults ˆ p i for m = 21 bins. The v ertical error ba r s deno te the standa rd deviations δ i . The histogram shows the true bin conten ts p i . 19 3.3. Two-dimension al unfolding The metho d presen ted in this pap er is also directly applicable to multidi- mensional cases . Here a tw o-dimensional example is give n. The true distribu- tion defined on the t w o-dimensional domain x, y ∈ [10 , 40] × [10 , 40] and rep- resen ted b y a sum of three biv ariate ga ussian PDFs p g ( x, y ; µ x , µ y , σ x , σ y , ρ ), with µ x , µ y the exp ectation v alues, σ x , σ y the standard deviations and ρ the correlations co efficien t. The actual densit y is giv en b y p ( x, y ) ∝ 4 p g ( x, y ; 20 . 5 , 2 5 . 5 , 4 . 0 , 4 . 0 , 0 . 5) + p g ( x, y ; 30 . 5 , 2 5 . 5 , 3 . 0 , 3 . 0 , 0 . 0) + 5 p g ( x, y ; 35 . 0 , 2 3 . 0 , 20 . 0 , 30 . 0 , 0 . 0) . (29) The exp erimen tally measured distribution is obtained by P ( x ′ , y ′ ) ∝ Z 40 10 Z 40 10 p ( x, y ) R ( x ′ , y ′ | x, y ) dxdy , (30) with a resolution function R ( x ′ , y ′ | x, y ) ∝ p g ( x ′ , y ′ ; x, y , 1 . 0 , 1 . 0 , 0 . 0) + 0 . 5 p g ( x ′ , y ′ ; x, y , 2 . 5 , 2 . 5 , 0 . 0) + 0 . 05 p g ( x ′ , y ′ ; x, y , 5 . 0 , 5 . 0 , 0 . 0) . (31) An example of a measured distribution is obtained b y sim ulating a sam- ple of N = 10 0 0 0 ev ents. In Figure 12 the true densit y and the measured distribution ar e sho wn. While the dominant comp onen t ( fir st gaussian) is still clearly visible in the observ ed distribution, the weak comp onen t (second gaussian) is barely discernible. 20 10 15 20 25 30 35 40 10 15 20 25 30 35 40 0 5 10 15 20 25 30 35 40 45 50 x y 10 15 20 25 30 35 40 10 15 20 25 30 35 40 0 5 10 15 20 25 30 35 40 45 50 x y Figure 12: The true distribution p ( x, y ) (left) ) and histogram of the measured distribution P ba sed on a sample of 10 000 events (right). T o use the existing soft ware, a one-dimensional histogram w as created from the t wo-dimens ional distribution b y copy ing first the top ro w from left to right then the 2nd row fro m righ t to left and so on. Adjacen t bins of the o ne- dimensional histogr am with a low n umber o f eve n ts w ere merged to ha ve at least 2 5 ev ents p er bin. The v ector of the histog r am conten ts P finally used in the unfolding pro cedure has n = 196 comp onen ts. F or the determination of the matrix Q a sample of 1 000 00 0 Mon te Carlo ev ents was sim ulated. PDFMs w ere defined as circular symmetric gaussian probabilit y densit y f unctions with three para meters, the exp ectation v alues x i , y i and the standard deviation λ i . In the first step, a set of 4 00 PDFMs was used with p ositions x i , y i uniformly distributed ov er the domain x, y ∈ [10 , 40] × [10 , 40]. As sho wn in Fig. 13, using 5-f old Cross-V alidatio n, an optimal v alue ˆ λ i = ˆ λ = 3 . 0 is found. F or the second step, with an adaptiv e width ˆ λ i in ve rsely prop ortional to the square ro ot of the unfolded densit y obtained in the first step, the v alue ˆ λ = 0 . 1 6 minimises the Cross-V alidation error C V ( λ ). In the third step t he optimal garrote para meter for ˆ λ = 0 . 16 is f ound to b e r = 2 1 . 2. These tw o parameters are used for t he final calculatio n of the unfolded distribution. 21 1.07 1.08 1.09 1.1 1.11 1.12 2.5 2.6 2.7 2.8 2.9 3 3.1 3.2 3.3 3.4 3.5 λ CV error 1st step 1.07 1.08 1.09 1.1 1.11 1.12 0.1 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 λ CV error 2nd step 1.065 1.066 1.067 1.068 1.069 1.07 1.071 1.072 18 19 20 21 22 23 r CV error Figure 1 3 : Cro ss-V alida tion as a function of λ in step 1 (top) a nd step 2 (middle), and as a function of the ga rrote parameter r for λ = 0 . 16 from the second step (b ottom). 22 0 10 20 30 40 50 0 50 100 150 bin number P ˆ a 0 10 20 30 40 50 -2 0 2 Res P ˆ b -2 0 2 0 50 100 150 bin number Res c -2 0 2 -2 0 2 Theoretical quantile Data quantile d Figure 14: Illustra tio n of the quality of the unfolding result. (a) folded estimate of the true distribution ˆ P (so lid histog ram) compared to the measur ed distribution P ; (b) normalised residuals of the fit as a function of ˆ P ; (c) normalised residuals as a function of bin n umber; (d) quantile-quan tile plot for the nor malised r esiduals. The qualit y of the unfolding result is illustrated by Fig . 14. It show s the mixture of the folded gaussian PDFMs, whic h a ppro ximates the measured distribution, together with the analysis of the residual and the quan tile- quan tile plot. No structure in either of the con trol plots is observ ed. The p -v alue fo r the comparison of the histogram of the measured distribution P and the fitting histogram ˆ P is p = 0 . 1 9. The unfo lded distribution ˆ p ( x, y ) is presen ted in Fig. 15 together with the difference b et wee n the unfolded and the true distribution ˆ p ( x, y ) − p ( x, y ). One observ es that the true distribution, including its w eak compo nen t, is 23 10 15 20 25 30 35 40 10 15 20 25 30 35 40 0 5 10 15 20 25 30 35 40 45 50 x y 10 15 20 25 30 35 40 10 15 20 25 30 35 40 -4 -3 -2 -1 0 1 2 3 x y Figure 15 : The unfolded distribution ˆ p ( x, y ) (left) and the differe nce b et ween unfolded and tr ue distribution ˆ p ( x, y ) − p ( x, y ) (right). w ell repro duced by the unfolding result, with rather small deviations o f the unfolded distribution from the true one. 4. Summary and conclusions A new metho d for unfolding the true distribution from data obtained from detectors with finite resolution and limited acceptance is presen ted. The metho d ensures smo othness a nd p ositivit y of the result b y represen ting the true distribution as a w eighted sum of smo oth PDFs ( Mixture Densities Mo del). The standard deviation of the PD Fs a cts as a regularisation parame- ter whic h determines the smo othness of the result. The amount of smo othing is adj usted to the lo cal stat istical precision of the da ta b y scaling the width parameter inv ersely prop o rtional to square ro ot of the estimated densit y and the non-negativ e garr o te metho d is used to eliminate insignifican t terms in the solutio n. Cross-V alidation is used to determine o pt ima l v alues of the regularisation and ga rrote parameters. The metho d a voids discretisation of the true densit y en tering the in tegral equation, thereb y a v oiding quan tisation errors for the true distribution. The prop osed pro cedure is directly applica- ble to multidim ensional unfolding problems. Numerical examples co vering 24 the problems of unfolding a simple double-p eak structure, a strongly v ary- ing one-side distribution and a tw o-dimensional density w ere presen ted to illustrate a nd to v alidate t he pro cedure Ac kno wledgemen ts The author w o uld lik e to expres s v ery great appreciation to Mic hael Sc hmelling for constan t inte rest to this w ork and man y discuss ions stim ulating the de- v elopmen t of t he metho d. The author is also grat eful to Markw ard Britsc h for useful discussions as w ell as for careful reading of the manus cript. Sp ecial thanks are extended to the Univ ersit y of Akureyri and the MPI for Nuclear Ph ysics for supp ort in carrying out the r esearc h. References [1] N. D. G agunash vili, Nucl. Instr. Meth. A 635 (2011) 86. [2] V. V. Ammosov, Z. U. Usub ov , V. P . Zhigunov, Nucl. Instr. Meth. A 295 (1990) 22 4 . [3] G. Bohm, G. Zec h, Introduction to Statistics and Data Analysis for Ph ysi- cists, c h. 6, V erlag Deutsc hes Elektronen-Sync hrotron, 2010. [4] V. P . Zhigunov , Nucl. Instr. Meth. 216 (1983) 183. [5] V. Blob el, CERN 85- 02, 1985. [6] V. B. Anik eev, A. A. Spiridonov , V. P . Z higuno v, Nucl. Instr. Meth. A 322 (1992 ) 280. [7] N. Gagunash vili, Nucl. Instr. Meth. A 343 (1993) 606. [8] M. Sc hmelling, Nucl. Instr. Meth. A 3 40 (199 4 ) 400. [9] A. H¨ oc ker, V. Kartve lish vili, Nucl. Instr. Meth. A 3 72 (1996 ) 469. [10] L . Lindemann, G . Zec h, Nucl. Instr. Meth. A 354 (1995) 516. [11] G . D’Agostini, Nucl. Instr. Meth. A 362 (1995) 4 87. 25 [12] V. Blob el, An unfo lding metho d for high-energy ph ysics exp erimen ts, in: Pro ceedings of the Conference on Statistical Problems in P article Ph ysics, Durham, England, 18–22 Marc h 2002, 258–267. [13] N. Gagunashv ili, Unfolding with system iden tification, in: Pro ceedings of the Conferen ce on Statistical Problems in P article Ph ysics, Astro- ph ysics and Cosmology , 12 –15 Septem b er, 2 005, Oxford, Imp erial College Press, Lo ndo n, 2006, pp. 267–210. [14] J. Alb ert et al., Nucl. Instr. Meth. A 58 3 (2007) 494. [15] B. W. Silv erman, Densit y Esitimations fo r Statistics and D a ta Analysis, Chapman a nd Hall, 1986. [16] M. L. Hazelton, B. A. T urla ch, Stat. Comput. 19 (2 0 09) 217 . [17] G . McLac hlan, ed.: Mixture Mo dels. Marcel Dekk er, 1988. [18] D . Reynolds, Gaussian Mixture Mo dels, in: Encyclop edia of Biometrics, pp 659– 6 63, Springer, 2009. [19] C. R. Shalizi, Adv anced Data Ana lysis from an Elemen tary P oint of View, c h.19, http://www.st at.cmu.edu/~ cshalizi/ADAfaEPoV/ , 20 13. [20] I. M. Sob ol, Numerical Mon te Carlo Metho ds, ch. 5, Nauk a, Mosco w, 1973. [21] C. L. Lawson and R. J. Hanson, Solving Least Squares Problems, ch. 23, sec. 3, Pren tice-Hall, Englew o o d Cliffs, 197 4; also a v ailable as Classics in Applied Mathematics, V ol. 15, SIAM, Philadelphia, 199 5. [22] http://www.netlib.org/la wson-hanson/all , subroutine NNLS. [23] L . Breiman, T ec hnometrics, 37 (1995) 373. [24] I. S. Abramson, Annals of Sta tistics, 10 (1982) 1217. [25] G . A. F. Seb er and A. J. Lee, Linear Regression Analysis, John Wiley & Sons, 2003. [26] B. Efron, Annals of Sta tistics, 7 (1979) 1. [27] V. V. F edorov, Biometrica, 61 (1 974) 49 . 26 [28] N. D. Gagunashv ili, Nucl. Instr. Meth. A 614 (2 010) 287 . [29] N. D. Gagunashv ili, Comp. Ph ys. Comm. 183 (2012) 19 3 . [30] R . M. Sakia, The Stat istician, 41 (1992 ) 169. 27
Original Paper
Loading high-quality paper...
Comments & Academic Discussion
Loading comments...
Leave a Comment