Simulation Assisted Likelihood-free Anomaly Detection

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv paper.

Given the lack of evidence for new particle discoveries at the Large Hadron Collider (LHC), it is critical to broaden the search program. A variety of model-independent searches have been proposed, adding sensitivity to unexpected signals. There are generally two types of such searches: those that rely heavily on simulations and those that are entirely based on (unlabeled) data. This paper introduces a hybrid method that makes the best of both approaches. For potential signals that are resonant in one known feature, this new method first learns a parameterized reweighting function to morph a given simulation to match the data in sidebands. This function is then interpolated into the signal region and then the reweighted background-only simulation can be used for supervised learning as well as for background estimation. The background estimation from the reweighted simulation allows for non-trivial correlations between features used for classification and the resonant feature. A dijet search with jet substructure is used to illustrate the new method. Future applications of Simulation Assisted Likelihood-free Anomaly Detection (SALAD) include a variety of final states and potential combinations with other model-independent approaches.


💡 Research Summary

The paper “Simulation Assisted Likelihood-free Anomaly Detection” by Andreassen, Nachman, and Shih addresses a critical challenge in high-energy physics: broadening the search for new particles at the Large Hadron Collider (LHC) in the absence of clear discovery evidence. It introduces a novel hybrid framework, SALAD, designed to combine the strengths of two prevalent model-independent search paradigms: those heavily reliant on simulations and those based entirely on unlabeled data.

The core problem with simulation-dependent searches is their sensitivity to inaccuracies in the background model, which can hide real signals or produce spurious ones. Purely data-driven methods, while robust to simulation mismodeling, often struggle with background estimation and may need extra constraints, such as requiring the classification features to be decorrelated from the resonant feature. SALAD bridges this gap for resonant anomaly searches, where a signal is localized in one known feature (e.g., an invariant mass m).

The SALAD method operates in three steps. First, in the sideband region (m outside the signal region), where signal is assumed to be negligible, a parameterized classifier f(x, m) is trained to distinguish real data from background-only simulation. With a loss such as binary cross-entropy, the ratio f/(1 − f) of the network output asymptotically approximates the data-to-simulation likelihood ratio, which serves as the reweighting function w(x|m). Applying these weights morphs the simulation to match the data distribution in the sidebands. Because the network takes m as an input, the learned reweighting is itself a function of the resonant feature.
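The conversion from classifier output to event weight can be illustrated with a minimal sketch. It assumes a one-dimensional Gaussian toy in which the optimal classifier is known in closed form, so no training is needed; the distributions, sample size, and function names are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def weights_from_classifier(f, eps=1e-6):
    """Convert classifier outputs f = P(data | x) into weights f / (1 - f)."""
    f = np.clip(f, eps, 1.0 - eps)  # guard against division by zero
    return f / (1.0 - f)

def gaussian_pdf(x, mu):
    """Unit-width Gaussian density (toy assumption)."""
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)

def optimal_classifier(x, mu_data=0.5, mu_sim=0.0):
    """Closed-form optimum for equal-sized data and simulation samples."""
    p_data = gaussian_pdf(x, mu_data)
    p_sim = gaussian_pdf(x, mu_sim)
    return p_data / (p_data + p_sim)

rng = np.random.default_rng(0)
x_sim = rng.normal(0.0, 1.0, size=200_000)  # toy "simulation", mean 0

# Reweight the simulation towards the toy "data" distribution, mean 0.5
w = weights_from_classifier(optimal_classifier(x_sim))
reweighted_mean = np.average(x_sim, weights=w)
```

In this toy the reweighted mean of the simulated sample moves from about 0 towards the data mean of 0.5, and the weights average to about 1. In practice f would be a trained network and would only approximate the optimum.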

Second, the trained reweighting function is interpolated into the signal region (SR). The network is evaluated at m values inside the SR that were excluded from training, which relies on w(x|m) varying smoothly with m. The result is a reweighted version of the background simulation within the SR.
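The first two steps can be sketched end to end on a toy problem. The example below is an assumption-laden stand-in, not the paper's setup: it replaces the neural network with a logistic regression on a small polynomial basis in (x, m), fitted by Newton's method, and uses Gaussian toy samples in which the data mean of x drifts linearly with m. The classifier is fitted on the sidebands only and then evaluated inside the signal region.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50_000
A, B = 0.2, 0.6  # toy mismodelling: data mean of x is A + B * m (assumption)

m_sim = rng.uniform(0.0, 1.0, N)
x_sim = rng.normal(0.0, 1.0, N)            # toy "simulation"
m_dat = rng.uniform(0.0, 1.0, N)
x_dat = rng.normal(A + B * m_dat, 1.0)     # toy background-only "data"

def features(x, m):
    # Polynomial basis standing in for a neural network f(x, m)
    return np.column_stack([np.ones_like(x), x, m, x * m, m ** 2])

def in_sideband(m, lo=0.4, hi=0.6):
    # Signal region is lo <= m <= hi; everything else is sideband
    return (m < lo) | (m > hi)

def fit_logistic(X, y, n_iter=30):
    """Logistic regression (binary cross-entropy) via Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (p - y)
        hess = (X.T * (p * (1.0 - p))) @ X + 1e-8 * np.eye(X.shape[1])
        beta -= np.linalg.solve(hess, grad)
    return beta

# Step 1: train on sidebands only (label 1 = data, 0 = simulation)
sb_sim, sb_dat = in_sideband(m_sim), in_sideband(m_dat)
X = np.vstack([features(x_sim[sb_sim], m_sim[sb_sim]),
               features(x_dat[sb_dat], m_dat[sb_dat])])
y = np.concatenate([np.zeros(sb_sim.sum()), np.ones(sb_dat.sum())])
beta = fit_logistic(X, y)

# Step 2: interpolate into the signal region; f / (1 - f) = exp(logit)
sr_sim, sr_dat = ~sb_sim, ~sb_dat
w_sr = np.exp(features(x_sim[sr_sim], m_sim[sr_sim]) @ beta)

mean_raw = x_sim[sr_sim].mean()                    # unweighted simulation
mean_rw = np.average(x_sim[sr_sim], weights=w_sr)  # reweighted simulation
mean_dat = x_dat[sr_dat].mean()                    # "data" in the SR
```

Comparing `mean_raw`, `mean_rw`, and `mean_dat` gives a simple closure check inside the signal region: the reweighted simulation should sit close to the data even though no signal-region events entered the fit. The basis here happens to contain the exact log likelihood ratio of the toy, which a real network would only approximate.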

Third, the reweighted simulation serves two purposes at once:
1) Enhanced signal sensitivity: a second classifier g(x) is trained to distinguish the SR data from the reweighted SR simulation. Since the reweighted simulation now closely matches the background component of the data, g(x) approximates the optimal classifier for separating signal from background.
2) Background estimation: the same reweighted simulation directly provides a background estimate for any selection based on g(x), via a weighted sum over simulated events (Eq. 2.4). A major advantage is that g(x) may have non-trivial correlations with the resonant feature m, which methods relying on sideband fits generally do not allow.
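The weighted-sum background estimate can be sketched as follows. This is an illustrative assumption about its form: the expected background after a cut on g(x) is taken as the sum of the weights of simulated SR events passing the cut, with the usual square root of the sum of squared weights as a statistical uncertainty. The function name, the threshold cut, and any overall normalization are hypothetical; the paper's Eq. 2.4 gives the exact expression.

```python
import numpy as np

def background_estimate(g_sim, w_sim, threshold):
    """Weighted count of simulated events with g(x) > threshold.

    g_sim: classifier scores g(x) for simulated signal-region events
    w_sim: interpolated weights w(x|m) for the same events
    Returns (estimate, statistical uncertainty from the weights).
    """
    g = np.asarray(g_sim, dtype=float)
    w = np.asarray(w_sim, dtype=float)
    passed = g > threshold
    estimate = w[passed].sum()
    uncertainty = np.sqrt((w[passed] ** 2).sum())
    return estimate, uncertainty

# Four toy events: the second and third pass the cut at 0.5
est, err = background_estimate(
    g_sim=[0.1, 0.7, 0.9, 0.4],
    w_sim=[1.0, 0.5, 2.0, 1.5],
    threshold=0.5,
)
```

The estimate is then compared with the number of data events passing the same cut. Because each weight depends on both x and m, the estimate follows whatever correlation g(x) has with m.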

The authors demonstrate SALAD using a dijet resonance search scenario from the LHC Olympics 2020 dataset. They treat Pythia QCD events as “data” and Herwig++ QCD events as imperfect “simulation,” with a benchmark signal from a W' boson decay. Features include jet masses and N-subjettiness ratios. The results show good closure: the reweighting, based on the DCTR technique (Deep neural networks using Classification for Tuning and Reweighting), aligns the Herwig simulation with the Pythia “data” in the sideband (Figs. 4 & 5). This alignment also holds when the reweighting is interpolated into the signal region (Figs. 6 & 7), which supports the interpolation step. The classifier g(x) trained on reweighted simulation approaches the performance of an ideal supervised classifier trained with perfect knowledge (Fig. 3).

In conclusion, SALAD provides a likelihood-free framework that reduces dependence on the accuracy of the background simulation while still using its high-dimensional information. It improves signal sensitivity and provides a background estimate that accounts for correlations with m, all in a single procedure. The method generalizes to a variety of final states and could be combined with other model-independent techniques in the LHC’s broad search program.

