Solution to Banff 2 Challenge Based on Likelihood Ratio Test

Reading time: 5 minutes

📝 Original Info

  • Title: Solution to Banff 2 Challenge Based on Likelihood Ratio Test
  • ArXiv ID: 1107.0458
  • Date: 2011-07-05
  • Authors: Wolfgang A. Rolke

📝 Abstract

We describe our solution to the Banff 2 challenge problems as well as the outcomes.

📄 Full Content

In July of 2010 a conference was held at the Banff International Research Station in Banff, Alberta, Canada, on the statistical issues relevant to the significance of discovery claims at the LHC. After many discussions it was decided to hold a competition to see which methods would perform best. One of the participants, Thomas Junk, would create a large number of data sets, some with a signal and some without. There were two main parts to the competition:

Problem 1 was essentially designed to see whether the methods could cope with the “look-elsewhere” effect, the issue of searching through a mass spectrum for a possible signal.

Problem 2 was concerned with the situation where there are no known distributions for either the backgrounds or the signal, so that they have to be estimated via Monte Carlo.

For a detailed description of the problems, as well as the data sets and a discussion of the results, see Tom Junk's CDF web page at http://www-cdf.fnal.gov/~trj/. In this paper we will present a solution based on the likelihood ratio test, and discuss the performance of this method in the challenge.

Our solution for both problems is based on the likelihood ratio test statistic

$$\lambda(x) = 2\left[\max\{\log L(\theta \mid x) : \theta \in \Theta\} - \max\{\log L(\theta \mid x) : \theta \in \Theta_0\}\right]$$

where Θ is the full parameter space and Θ₀ is the parameter space under the null hypothesis.

According to standard theorems in statistics, λ(X) often has a χ² distribution, with the number of degrees of freedom equal to the difference between the number of free parameters in the full model and the number of free parameters under the null hypothesis. This turns out to be true for problem 2 but not for problem 1; in the latter case the null distribution has to be found via simulation.

Here we have:

$$\lambda(x) = 2\left[\max_{\alpha, E} \log L(\alpha, E \mid x) - \log L(0, 0 \mid x)\right]$$

Now max{log L(α, E|x)} is the log-likelihood evaluated at the maximum likelihood estimator, and max{log L(α, E|x) : θ ∈ Θ₀} = log L(0, 0|x). Note that if α = 0, any choice of E yields the same value of the likelihood function.

In the following figure we have the histogram of 100,000 values of λ(x) from a simulation with n = 500 and α = 0, together with the densities of the χ² distribution with degrees of freedom from 1 to 5. Clearly none of these yields an acceptable fit. Instead we use the simulated data to find the 99% quantile, shown as the vertical line in the graph, and reject the null hypothesis if λ(x) is larger than that.
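
A minimal sketch of this step in R, assuming the simulated values of the test statistic have already been collected in a vector lambda:

```r
# Compare the simulated null distribution of lambda to chi-squared
# densities and extract the 99% quantile used as the critical value.
hist(lambda, breaks = 100, freq = FALSE, main = "null distribution of lambda")
for (df in 1:5) curve(dchisq(x, df), add = TRUE, lty = df)
crit <- quantile(lambda, 0.99)  # the simulation-based critical value
abline(v = crit)
```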

In general the critical value will depend on the sample size, but for the sample sizes in the challenge (500 to 1500) it is always about 11.5.

If it were decided to require a 5σ significance for a discovery, the critical value can be found using importance sampling. Recently Eilam Gross and Ofer Vitells have developed an analytic upper bound for the tail probabilities of the null distribution; see "Trial factors for the look elsewhere effect in high energy physics", Eur. Phys. J. C 70:525-530, 2010. Their result agrees with our simulations.
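
For reference, the one-sided tail probability corresponding to 5σ is

$$p_{5\sigma} = 1 - \Phi(5) \approx 2.87 \times 10^{-7},$$

so estimating the corresponding critical value by plain Monte Carlo would require well over 10⁸ simulated data sets, which is why importance sampling (or the analytic bound) is needed.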

Finding the MLE is a non-trivial exercise because the log-likelihood has many local maxima. The next figure shows the log-likelihood as a function of E, with α fixed at 0.05, for four cases.

To find the MLE we used a two-step procedure: first, a fine grid search over values of E from -0.015 to 1 in steps of 0.005, where at each value of E the value of α that maximizes the log-likelihood is found; second, starting at the best point found in the grid search, Newton-Raphson is used to find the overall MLE. A sketch of this procedure is given below.
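
A minimal sketch of the two-step search, assuming a hypothetical mixture model f(x) = (1 - α)b(x) + αs(x; E) with background density b and signal density s centered at E (both stand-ins here, since the challenge densities are not reproduced in this text):

```r
# Negative log-likelihood for the assumed mixture model
nll <- function(alpha, E, x, b, s)
  -sum(log((1 - alpha) * b(x) + alpha * s(x, E)))

fit_mle <- function(x, b, s) {
  # Step 1: fine grid over E, profiling out alpha at each grid point
  Egrid <- seq(-0.015, 1, by = 0.005)
  prof  <- sapply(Egrid, function(E)
    optimize(nll, c(0, 1), E = E, x = x, b = b, s = s)$objective)
  E0 <- Egrid[which.min(prof)]
  a0 <- optimize(nll, c(0, 1), E = E0, x = x, b = b, s = s)$minimum
  # Step 2: start at the best grid point; nlm is R's Newton-type
  # optimizer, standing in for the Newton-Raphson step
  nlm(function(p) nll(p[1], p[2], x, b, s), c(a0, E0))$estimate
}
```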

Again we want to use:

$$\lambda(x) = 2\left[\max_{\alpha, \beta} \log L(\alpha, \beta \mid x) - \max_{\beta} \log L(0, \beta \mid x)\right]$$

Now max{log L(α, β|x)} is the log-likelihood evaluated at the maximum likelihood estimator, and max{log L(α, β|x) : θ ∈ Θ₀} = log L(0, β̂₀|x), where β̂₀ is the maximum likelihood estimate of β under the null hypothesis α = 0.

The difficulty is of course that we don't know f₁, f₂, or g. We have used three different ways to find them:

The first approach is parametric fitting: one tries to find a parametric density that gives a reasonable fit to the data. For the data in the challenge this turns out to be very easy; in all three cases a Beta density gives a very good fit. A sketch of such a fit follows.
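
A minimal sketch of such a parametric fit, using fitdistr from the R package MASS; the starting values for the shape parameters are arbitrary assumptions, and x is one of the Monte Carlo samples on (0, 1):

```r
library(MASS)

# Fit a Beta density by maximum likelihood
fit <- fitdistr(x, "beta", start = list(shape1 = 1, shape2 = 1))

# The fitted density, usable later as f1, f2, or g
f_hat <- function(t) dbeta(t, fit$estimate["shape1"], fit$estimate["shape2"])
```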

The second approach is non-parametric density estimation, for which a variety of methods are known in statistics. The difficulty with the data in the challenge is that it is bounded on a finite interval, a very common feature in HEP data. Moreover, the slope of the density of Background 1 at 0 is infinite. I checked a number of methods and eventually ended up using the following. For Background 2, the Signal, and the right half of Background 1, I bin the data (250 bins), find the counts, and scale them to integrate to unity; then I smooth them with the non-parametric smoother loess from R with the default span (smoothing parameter). This works well except on the left side of Background 1, where the infinite slope of the density would require a smoothing parameter that goes to 0. Instead I transform the data with log(x/(1-x)). The transformed data has a density without boundary, which I estimate using the routine density from R, again with the default bandwidth, and then back-transform to the (0, 1) scale. This works well for the left side but not the right one, and so I "splice" the two densities together in the middle. The resulting densities are shown here; a sketch of the procedure follows.
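
A sketch of this procedure for Background 1; the bin count and the splice point in the middle follow the text, everything else (function names, interpolation details) is an assumption:

```r
np_density <- function(x, nbins = 250) {
  # Bin the data (250 bins); h$density is scaled to integrate to one
  h <- hist(x, breaks = seq(0, 1, length.out = nbins + 1), plot = FALSE)
  d <- data.frame(mids = h$mids, dens = h$density)

  # Smooth the binned density with loess, default span
  fit <- loess(dens ~ mids, data = d)
  f_right <- function(t) predict(fit, data.frame(mids = t))

  # Left side: the logit transform y = log(x / (1 - x)) removes the
  # boundary, so a standard kernel estimate with default bandwidth works
  kde <- density(log(x / (1 - x)))
  f_left <- function(t) {
    # Back-transform with the Jacobian 1 / (t * (1 - t))
    approx(kde$x, kde$y, xout = log(t / (1 - t)))$y / (t * (1 - t))
  }

  # Splice the two estimates together in the middle
  function(t) ifelse(t < 0.5, f_left(t), f_right(t))
}
```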

The third possibility is to combine the two approaches above: fit some of the data parametrically and the rest non-parametrically.
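
However the densities f₁, f₂, and g are obtained, they are then plugged into the likelihood ratio statistic for problem 2. A minimal sketch, assuming the mixture parameterization f = αg + βf₁ + (1 - α - β)f₂ (this parameterization is my assumption, not spelled out in the text):

```r
# Log-likelihood for the assumed mixture of signal g and backgrounds f1, f2
loglik <- function(par, x, f1, f2, g) {
  alpha <- par[1]; beta <- par[2]
  sum(log(alpha * g(x) + beta * f1(x) + (1 - alpha - beta) * f2(x)))
}

lambda2 <- function(x, f1, f2, g) {
  # Full model: maximize over (alpha, beta)
  full <- optim(c(0.1, 0.4), function(p) -loglik(p, x, f1, f2, g))
  # Null model (alpha = 0): maximize over beta alone
  null <- optimize(function(b) -loglik(c(0, b), x, f1, f2, g),
                   interval = c(0, 1))
  # Under the null this is approximately chi-squared with 1 df,
  # as noted above for problem 2
  2 * (null$objective - full$value)
}
```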

