A non-parametric peak finder algorithm and its application in searches for new physics
We have developed an algorithm for non-parametric fitting and extraction of statistically significant peaks in the presence of statistical and systematic uncertainties. Applications of this algorithm for analysis of high-energy collision data are discussed. In particular, we illustrate how to use this algorithm in general searches for new physics in invariant-mass spectra using pp Monte Carlo simulations.
💡 Research Summary
The paper introduces a non‑parametric peak‑finding algorithm, named NPFinder, designed for high‑energy physics applications where one must search for statistically significant excesses (bumps) in invariant‑mass spectra or other binned distributions. Traditional bump‑hunting methods rely on smoothing techniques (moving average, LOWESS, splines) or on fitting analytic background functions with many free parameters. Those approaches struggle when the background spans many orders of magnitude, when systematic uncertainties are asymmetric, or when one wishes to automate the search across many channels without manual tuning.
NPFinder proceeds bin‑by‑bin through a histogram. For each bin i it computes a first‑order derivative α_i using the neighboring bin i+1, but crucially incorporates the appropriate error bar: if y_{i+1} > y_i the lower error of y_{i+1} is used, otherwise the upper error is taken. This “conservative side” choice ensures that the derivative is not artificially inflated by statistical fluctuations. The derivative values are then averaged over a sliding window of N bins, yielding ᾱ_N. A peak is declared when two consecutive derivative values exceed the local average by more than a single user‑defined positive parameter Δ:
δα_{N+1} = α_{N+1} − ᾱ_N > Δ
δα_{N+2} = α_{N+2} − ᾱ_N > Δ.
Δ is the only free parameter of the method; decreasing Δ increases sensitivity but also the rate of false positives. Once the start of a candidate peak is identified, the algorithm continues forward until the next two points fall below the local average, marking the peak maximum. The algorithm assumes approximate symmetry of the peak, adding an equal number of bins to the right of the centre as to the left. This assumption is reasonable for many resonance‑type signals but can lead to a modest under‑estimation of significance for very asymmetric structures.
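The onset test and the symmetric peak region described above can be sketched in a few lines of Python. This is a minimal illustration, not the NPFinder source: the function names, the way the error bar enters the slope (the neighboring bin content is shifted toward the current bin by its error), and the use of the N preceding slopes as the sliding window are all assumptions made for this example.

```python
from typing import List, Optional, Sequence, Tuple


def conservative_slopes(x: Sequence[float], y: Sequence[float],
                        err_lo: Sequence[float], err_hi: Sequence[float]) -> List[float]:
    """Slope between bin i and i+1, with y[i+1] pulled toward y[i] by its error."""
    slopes = []
    for i in range(len(y) - 1):
        if y[i + 1] > y[i]:
            y_next = y[i + 1] - err_lo[i + 1]  # rising: use the lower error
        else:
            y_next = y[i + 1] + err_hi[i + 1]  # falling or flat: use the upper error
        slopes.append((y_next - y[i]) / (x[i + 1] - x[i]))
    return slopes


def find_peak_start(slopes: Sequence[float], window: int,
                    delta: float) -> Optional[Tuple[int, float]]:
    """Return (index, local mean) of the first slope where two consecutive
    slopes exceed the mean of the preceding `window` slopes by more than delta."""
    for k in range(window, len(slopes) - 1):
        mean = sum(slopes[k - window:k]) / window
        if slopes[k] - mean > delta and slopes[k + 1] - mean > delta:
            return k, mean
    return None


def peak_region(slopes: Sequence[float], start: int, mean: float,
                n_bins: int) -> Tuple[int, int, int]:
    """Walk forward until two consecutive slopes drop below the local mean
    (taken as the peak maximum), then mirror the left-hand extent to the right."""
    k = start
    while k + 1 < len(slopes) and not (slopes[k] < mean and slopes[k + 1] < mean):
        k += 1
    centre = k
    end = min(centre + (centre - start), n_bins - 1)
    return start, centre, end


# Toy histogram (hypothetical numbers): a falling spectrum with one bump.
x = list(range(10))
y = [100, 90, 80, 70, 60, 50, 80, 120, 80, 40]
err = [1.0] * 10
slopes = conservative_slopes(x, y, err, err)
onset = find_peak_start(slopes, window=3, delta=1.0)
if onset is not None:
    start, mean = onset
    print(peak_region(slopes, start, mean, n_bins=len(y)))  # (5, 7, 9)
```

In this toy case the region runs from bin 5 to bin 9 with its maximum at bin 7. Lowering `delta` makes the onset test fire on smaller slope changes, which is the sensitivity versus false-positive trade-off noted above.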
After a peak region is defined, NPFinder estimates the underlying background by a simple linear interpolation between the first and last bins of the region. The linear parameters m and b are calculated using the bin contents plus their uncertainties, again on the conservative side. The significance σ of the peak is then obtained by summing the residuals r_i = y_i – (mx_i + b) over all bins belonging to the peak and forming
σ = Σ r_i / √(Σ (Δr_i)²),
where Δr_i denotes the uncertainty on the residual r_i (unrelated to the threshold Δ).
A peak whose σ exceeds a threshold chosen in the range 5–7 is considered statistically noteworthy. The method deliberately avoids any χ² minimisation or multi‑parameter fitting, which makes it fast and robust against over‑fitting.
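The background subtraction and significance estimate can be illustrated as follows. This is a sketch under stated assumptions rather than the authors' code: the function names are hypothetical, the "conservative side" is taken to mean raising both end points of the peak region by their upper errors before drawing the line, and the noise term in the denominator is assumed to be the quadrature sum of the per-bin uncertainties on the residuals.

```python
import math
from typing import Sequence, Tuple


def linear_background(x: Sequence[float], y: Sequence[float], err_hi: Sequence[float],
                      first: int, last: int) -> Tuple[float, float]:
    """Straight line through the first and last bins of the peak region.
    Assumption: both end points are raised by their upper errors."""
    y0 = y[first] + err_hi[first]
    y1 = y[last] + err_hi[last]
    m = (y1 - y0) / (x[last] - x[first])
    b = y0 - m * x[first]
    return m, b


def peak_significance(x: Sequence[float], y: Sequence[float], err: Sequence[float],
                      first: int, last: int) -> float:
    """Sum of residuals above the line divided by the combined noise.
    Assumption: the noise is the quadrature sum of the bin uncertainties."""
    m, b = linear_background(x, y, err, first, last)
    bins = range(first, last + 1)
    residuals = [y[i] - (m * x[i] + b) for i in bins]
    noise = math.sqrt(sum(err[i] ** 2 for i in bins))
    return sum(residuals) / noise


# Toy peak region (hypothetical numbers) with unit errors on every bin.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 40.0, 20.0, 10.0]
err = [1.0] * 5
print(peak_significance(x, y, err, first=0, last=4))
```

Because the line is anchored above the end points, the two edge bins contribute small negative residuals, so the estimate leans toward a lower significance.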
The authors validate the algorithm with Monte‑Carlo data generated by PYTHIA for pp collisions at √s = 7 TeV, using an integrated luminosity of 200 pb⁻¹. Jets are reconstructed with the anti‑k_T algorithm (R = 0.6) and a p_T > 100 GeV cut, and the dijet invariant‑mass spectrum is built. With Δ = 1, the pure QCD background yields no peaks above the 5σ threshold, demonstrating a low false‑positive rate. To test detection capability, three artificial Gaussian signals are superimposed on the background at 1000 GeV (width 20 GeV, 2 × 10⁵ events), 1500 GeV (width 50 GeV, 3 × 10⁵ events) and 2800 GeV (width 40 GeV, 1.2 × 10⁴ events). NPFinder successfully recovers all three, providing accurate estimates of each peak's position, width, and statistical significance (e.g., a significance of σ≈104 for the largest peak and σ≈8 for the smallest).
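A signal-injection test of this kind can be mimicked with a toy spectrum. The sketch below is only an illustration: the exponential background, its normalisation and slope, and the 20 GeV binning are hypothetical stand-ins for the PYTHIA dijet spectrum, while the three Gaussian means, widths, and event counts are the ones quoted above. Expected bin contents are computed analytically, without random sampling.

```python
import math
from typing import List, Sequence


def gaussian_bin_counts(edges: Sequence[float], mean: float, width: float,
                        n_events: float) -> List[float]:
    """Expected events per bin for a Gaussian signal, from its cumulative distribution."""
    def cdf(v: float) -> float:
        return 0.5 * (1.0 + math.erf((v - mean) / (width * math.sqrt(2.0))))
    return [n_events * (cdf(hi) - cdf(lo)) for lo, hi in zip(edges[:-1], edges[1:])]


def falling_background(edges: Sequence[float], norm: float, slope: float) -> List[float]:
    """Hypothetical exponential stand-in for a steeply falling mass spectrum,
    evaluated at the bin centres."""
    return [norm * math.exp(-slope * 0.5 * (lo + hi))
            for lo, hi in zip(edges[:-1], edges[1:])]


# 20 GeV bins from 500 to 3500 GeV (binning chosen for this example only).
edges = [500.0 + 20.0 * k for k in range(151)]
spectrum = falling_background(edges, norm=1.0e7, slope=0.004)

# Signals quoted in the text: (mean in GeV, width in GeV, number of events).
signals = [(1000.0, 20.0, 2.0e5), (1500.0, 50.0, 3.0e5), (2800.0, 40.0, 1.2e4)]
for mean, width, n_events in signals:
    injected = gaussian_bin_counts(edges, mean, width, n_events)
    spectrum = [b + s for b, s in zip(spectrum, injected)]
```

The resulting `spectrum` list, with for example √n taken as the uncertainty on each bin, could then be passed to a peak finder to check that all three injected peaks are recovered.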
For comparison, the ROOT TSpectrum package (based on a smoothing algorithm originally developed for γ‑ray spectroscopy) is applied to the same data. TSpectrum can also locate the injected peaks, but it requires manual tuning of two free parameters (effective σ of the searched peaks and expected amplitude) and an additional analytic fit to obtain a significance estimate. This makes a fully automated search cumbersome. Moreover, TSpectrum’s smoothing is less suited for the steeply falling QCD spectra typical of dijet masses.
The paper discusses limitations: the assumption of peak symmetry may not hold for some new‑physics signatures; the linear background approximation can be inaccurate in regions where the true background curvature is large; and the significance metric, being a simple residual‑to‑noise ratio, is generally more conservative than a full likelihood‑ratio test. The authors note an empirical correlation between Δ and the width of detectable peaks – broader peaks demand smaller Δ (as low as 0.2). Nevertheless, the method’s simplicity (single tunable parameter), ability to incorporate asymmetric systematic uncertainties, and speed make it attractive for large‑scale, blind searches across many final states.
Implementation details are provided: the algorithm is written in Python, with optional graphical output via ROOT (C++) or SCaVis (Java). The source code is publicly available for download. The authors conclude that NPFinder offers a practical, non‑parametric tool for the early‑stage identification of statistically significant structures in high‑energy physics data, complementing more sophisticated model‑dependent analyses that follow.