HEP ML Lab: An end-to-end framework for applying machine learning into phenomenology studies

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Recent years have seen the development and growth of machine learning in high energy physics, and continued effort will be needed to explore its full potential. To make it easier for researchers to apply existing algorithms and neural networks, and to improve the reproducibility of analyses, we developed HEP ML Lab (hml), a Python-based, end-to-end framework for phenomenology studies. It covers the complete workflow from event generation to performance evaluation and provides a consistent style of use across different approaches. We propose an observable naming convention to streamline the data extraction and conversion processes. In the Keras style, we provide the traditional cut-and-count approach and boosted decision trees alongside neural networks. Taking $W^+$ tagging as an example, we evaluate all built-in approaches with the metrics of significance and background rejection. With its modular design, HEP ML Lab is easy to extend and customize, and serves as a tool for both beginners and experienced researchers.


💡 Research Summary

The paper presents HEP ML Lab (hml), a Python‑based, end‑to‑end framework designed to streamline the application of machine‑learning techniques to phenomenological studies in high‑energy physics. The authors identify four stages in a typical HEP‑ML workflow—event generation, dataset construction, model training, and performance evaluation—and note that each stage traditionally relies on a different software package (MadGraph5_aMC for hard‑process generation, Pythia8 for parton showering, Delphes for fast detector simulation, ROOT/UPROOT for data handling, and deep‑learning libraries such as PyTorch or TensorFlow for model development). Switching between these tools often requires manual file conversions, intricate configuration files, and ad‑hoc scripting, which hampers reproducibility and raises the barrier for newcomers.

HEP ML Lab addresses these issues by providing a unified Python API that wraps the most common HEP tools and integrates them with Keras‑style machine‑learning modules. The “generator” module supplies a Madgraph5 class that encapsulates the core functionalities of MadGraph5_aMC: model import, particle definition, process generation, diagram visualization, and event launch. Users configure runs through keyword arguments (e.g., settings={"nevents":1000,"run_tag":"mass=200"}, seed=42, multi_run=2) and can preview the exact shell commands with a dry=True flag before execution. After a run finishes, a summary() method prints a table containing collider information, cross‑section, statistical error, number of events, and random seed, and the Madgraph5Run object enables direct access to the generated ROOT files via uproot. This design guarantees that all meta‑information is captured automatically, facilitating exact replication of results.
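The run-configuration pattern described above can be illustrated with a small, self-contained toy. Note that `Madgraph5Toy` and its internals are hypothetical stand-ins written for this summary, not the actual hml API; the sketch only shows how keyword settings, a seed, and a `dry=True` flag could be turned into a reproducible command preview before anything is executed.

```python
# Toy stand-in (hypothetical names, not the real hml API) illustrating
# the dry-run pattern: settings become recorded "set" commands, so the
# exact generator invocation can be previewed and replicated.
class Madgraph5Toy:
    def __init__(self, processes, output="run_output"):
        self.processes = processes
        self.output = output

    def launch(self, settings=None, seed=None, multi_run=1, dry=False):
        settings = dict(settings or {})
        if seed is not None:
            settings["iseed"] = seed  # capture the seed for exact replication
        opts = [f"set {k} {v}" for k, v in sorted(settings.items())]
        commands = ["\n".join([f"launch {self.output}"] + opts)
                    for _ in range(multi_run)]
        if dry:
            return commands  # preview only; nothing is executed
        raise NotImplementedError("real generation needs MadGraph5_aMC")

gen = Madgraph5Toy(processes=["p p > w+ z"])
preview = gen.launch(settings={"nevents": 1000, "run_tag": "mass=200"},
                     seed=42, multi_run=2, dry=True)
print(preview[0])
```

Because every parameter, including the random seed, ends up in the emitted commands, two users running the same script obtain byte-identical generator configurations, which is the reproducibility property the summary describes.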

A second major contribution is the “observable naming convention”. Physical objects (jets, electrons, muons, etc.) are referenced by a string of the form <physics object>.<observable>. The framework distinguishes four object types: single objects (e.g., jet0), collective objects (e.g., jet:10 for the first ten jets), nested objects (e.g., jet0.constituents:100), and multiple objects combined with commas (e.g., jet0,jet1). The parser parse_physics_object translates these strings into branch names and slice objects, which are then fed to observable classes. Observables are defined in a separate module and include kinematic quantities (mass, pT), substructure variables (N‑subjettiness ratios, angular distances), and size‑related measures. Aliases and case‑insensitivity reduce the cognitive load on users. Internally, the framework relies on the awkward library to handle jagged arrays, returning empty lists for non‑existent objects so that cuts automatically ignore missing entries without raising errors.
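A minimal sketch of such a naming-convention parser is shown below. This is an illustration written for this summary, not hml's actual `parse_physics_object`; the function names and the exact grammar handled (single index, `:<stop>` slices, dots, commas) are simplifying assumptions.

```python
import re

# Illustrative parser for the <physics object> naming scheme (a sketch,
# not hml's parse_physics_object):
#   "jet0"                 -> single object: branch "jet", index 0
#   "jet:10"               -> collective: branch "jet", first ten entries
#   "jet0.constituents:100" -> nested object, dot separated
#   "jet0,jet1"            -> multiple objects, comma separated
_PATTERN = re.compile(r"^([a-zA-Z_]+)(\d+)?(?::(\d+))?$")

def parse_object(token):
    """Translate one token into (branch name, index or slice)."""
    m = _PATTERN.match(token)
    if m is None:
        raise ValueError(f"cannot parse {token!r}")
    branch, index, stop = m.groups()
    if index is not None:
        return branch, int(index)            # single object, e.g. jet0
    if stop is not None:
        return branch, slice(0, int(stop))   # collective, e.g. jet:10
    return branch, slice(None)               # whole collection

def parse_physics_objects(name):
    """Handle comma-separated alternatives and dot-nested components."""
    return [[parse_object(t) for t in part.split(".")]
            for part in name.split(",")]

print(parse_physics_objects("jet0,jet1"))
print(parse_physics_objects("jet0.constituents:100"))
```

The returned branch names and slice objects are exactly the two ingredients needed to index a jagged array library such as awkward, which is why empty results for out-of-range objects can fall through silently rather than raising errors.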

On the machine‑learning side, HEP ML Lab ships with two lightweight neural‑network architectures implemented in Keras: a simple multi‑layer perceptron (MLP) and a modest convolutional neural network (CNN), each containing fewer than ten thousand trainable parameters. These models serve as baselines for classification tasks and can be replaced or extended without modifying the surrounding pipeline. In addition, the authors integrate two traditional approaches—cut‑and‑count and Gradient Boosted Decision Trees (GBDT)—wrapped to share the same fit, predict, and evaluate interface as the neural networks. After training, the framework computes physics‑oriented performance metrics: signal significance (often expressed as Z‑score) and background rejection rate at a fixed signal efficiency (e.g., 50%). This dual evaluation (statistical and machine‑learning) enables a direct comparison between classic cut‑based analyses and modern data‑driven methods.
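The two metrics above can be sketched under their common approximate definitions (the paper may use refined variants): a significance of $Z = S/\sqrt{S+B}$, and the background rejection $1/\varepsilon_b$ at the score cut that retains a chosen fraction of signal.

```python
import numpy as np

def significance(n_signal, n_background):
    """Approximate significance Z = S / sqrt(S + B)."""
    return n_signal / np.sqrt(n_signal + n_background)

def background_rejection(sig_scores, bkg_scores, signal_eff=0.5):
    """1 / (background efficiency) at the cut giving `signal_eff`."""
    cut = np.quantile(sig_scores, 1.0 - signal_eff)  # keep top fraction of signal
    eps_b = np.mean(bkg_scores >= cut)               # background passing the cut
    return np.inf if eps_b == 0 else 1.0 / eps_b

rng = np.random.default_rng(0)
sig = rng.normal(1.0, 0.5, 10_000)  # toy classifier outputs for signal
bkg = rng.normal(0.0, 0.5, 10_000)  # and for background
print(significance(100, 400))       # 100 / sqrt(500) ≈ 4.47
print(background_rejection(sig, bkg))
```

Because cut-and-count, GBDT, and the neural networks all emit a per-event score, the same two functions can score every approach, which is what makes the side-by-side comparison in the paper possible.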

The authors demonstrate the utility of the framework through a case study on W⁺ boson tagging. Simulated proton‑proton collisions at 13 TeV are generated with MadGraph5_aMC → Pythia8 → Delphes. From the resulting events, a set of observables (jet mass, τ₂₁, angular distances, etc.) is extracted using the naming convention, and three analysis strategies are applied: (1) a traditional cut‑and‑count selection, (2) a GBDT classifier, and (3) the provided CNN trained on jet images. ROC curves and significance scans show that both the GBDT and CNN outperform the cut‑based method, with the CNN achieving slightly higher background rejection at the same signal efficiency, illustrating the advantage of image‑based deep learning for jet substructure. Importantly, the entire workflow—from event generation to final significance plot—was implemented in a compact Python script, highlighting the framework’s ability to reduce boilerplate code and accelerate prototyping.
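A ROC-style comparison like the one above can be reproduced in miniature with synthetic scores. The two "classifiers" below are toy stand-ins for the GBDT and CNN (no real training happens here); only the AUC rank formula is standard.

```python
import numpy as np

def roc_auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney U) formulation."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)   # rank 1 = lowest score
    n_sig = int(labels.sum())
    n_bkg = len(labels) - n_sig
    sig_rank_sum = ranks[labels == 1].sum()
    return (sig_rank_sum - n_sig * (n_sig + 1) / 2) / (n_sig * n_bkg)

rng = np.random.default_rng(1)
labels = np.concatenate([np.ones(5000), np.zeros(5000)]).astype(int)
# Two synthetic score distributions: the "stronger" classifier separates
# signal from background more cleanly, so its AUC is higher.
weak   = np.concatenate([rng.normal(0.6, 0.5, 5000), rng.normal(0.0, 0.5, 5000)])
strong = np.concatenate([rng.normal(1.2, 0.5, 5000), rng.normal(0.0, 0.5, 5000)])
print(roc_auc(weak, labels), roc_auc(strong, labels))
```

Scanning the score cut in the same way yields the significance curves the paper uses to rank the three strategies.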

Beyond the concrete example, the paper emphasizes the modularity and extensibility of HEP ML Lab. New physics objects, custom observables, or advanced neural‑network architectures can be added as plug‑ins, and the open‑source nature of the project encourages community contributions. The built‑in seed handling, logging, and metadata storage ensure that analyses are fully reproducible, addressing a long‑standing concern in the HEP community.

In summary, HEP ML Lab offers a comprehensive, user‑friendly, and reproducible environment that bridges traditional HEP simulation tools with modern machine‑learning libraries. By standardizing data extraction through a clear naming scheme and providing ready‑to‑use ML models alongside classic techniques, it lowers the entry barrier for physicists wishing to explore ML‑driven phenomenology, while also supplying experienced researchers with a flexible platform for rapid development and systematic performance studies.

