Consistent Feature Construction with Constrained Genetic Programming for Experimental Physics
A good feature representation is a determining factor in achieving high classification performance for many machine learning algorithms. This is especially true for techniques that do not build complex internal representations of data (e.g., decision trees, in contrast to deep neural networks). Feature construction techniques transform the feature space by building new high-level features from the original ones. Among these techniques, Genetic Programming is a good candidate for providing the interpretable features required for data analysis in high-energy physics. Classically, original features, or higher-level features derived from physics first principles, are used as inputs for training. However, physicists would benefit from automatic, interpretable feature construction for the classification of particle-collision events. Our main contribution is to combine different aspects of Genetic Programming and apply them to feature construction for experimental physics. In particular, to be applicable to physics, dimensional consistency is enforced using grammars. Experiments on three physics datasets show that the constructed features can bring a significant gain in classification accuracy. To the best of our knowledge, this is the first time a method has been proposed for interpretable feature construction with units of measurement, and the first time experts in high-energy physics have validated both the overall approach and the interpretability of the built features.
💡 Research Summary
The paper addresses the need for interpretable, high‑performance feature representations in high‑energy physics (HEP) classification tasks, where physicists must understand and trust the models applied to real collision data. While deep neural networks can automatically learn complex representations, they are opaque, and decision‑tree‑based methods, which are more explainable, suffer when the supplied features are poorly chosen. Traditionally, physicists manually engineer high‑level variables (e.g., invariant masses, angular separations) based on conservation laws and domain intuition. Automating this process while preserving physical meaning is the central goal of the work.
To achieve this, the authors propose a constrained Genetic Programming (GP) framework that enforces dimensional consistency through a Context‑Free Grammar (CFG). The grammar defines a small set of physical types—Energy (E), Angle (A), Float (F), Energy‑squared (E2), etc.—and specifies which operators (+, -, *, /, sqrt, sin, cos, tan, square) may combine which types, ensuring that any generated expression respects unit compatibility (e.g., you can add two energies but not an energy and a length). This approach is equivalent to Strongly‑Typed GP, but the CFG makes the constraints explicit and easily extensible.
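A typed grammar of this kind can be sketched in a few lines. The sketch below is illustrative, not the paper's exact CFG: the type set, production rules, and variable names (`E1`, `E2`, `theta`, `phi`) are assumptions for demonstration. Each production maps a result type to the operators and argument types that can produce it, so any randomly derived expression is dimensionally consistent by construction.

```python
import random

# Hypothetical typed grammar (illustrative, not the paper's exact CFG).
# Each rule: (operator, argument_type, ...) or ("var", variable_type).
GRAMMAR = {
    "Energy":  [("var", "Energy"),            # a raw energy variable
                ("+", "Energy", "Energy"),    # energy + energy -> energy
                ("sqrt", "Energy2")],         # sqrt(energy^2)   -> energy
    "Energy2": [("*", "Energy", "Energy"),    # energy * energy  -> energy^2
                ("square", "Energy")],        # energy^2
    "Angle":   [("var", "Angle")],
    "Float":   [("cos", "Angle"),             # trig of an angle is unitless
                ("/", "Energy", "Energy")],   # a ratio of energies is unitless
}
VARS = {"Energy": ["E1", "E2"], "Angle": ["theta", "phi"]}

def generate(t, depth=0, max_depth=4):
    """Randomly derive a string expression of type t from the grammar."""
    rules = GRAMMAR[t]
    if depth >= max_depth:
        # Near the depth limit, prefer terminal rules when the type has any.
        terminals = [r for r in rules if r[0] == "var"]
        rules = terminals or rules
    rule = random.choice(rules)
    op = rule[0]
    if op == "var":
        return random.choice(VARS[rule[1]])
    args = [generate(a, depth + 1, max_depth) for a in rule[1:]]
    if len(args) == 1:
        return f"{op}({args[0]})"
    return f"({args[0]} {op} {args[1]})"
```

Because "energy + length" simply has no production, ill-typed expressions can never be derived; extending the grammar to a new unit means adding a type and its rules rather than changing the search algorithm.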
Beyond syntactic typing, the authors introduce a probabilistic guidance mechanism: a manually crafted transition matrix that encodes the typical ordering of operators in HEP formulae. For instance, a square‑root is often followed by a sum of squares, so the matrix assigns a high probability to selecting “+” after “sqrt”. An initial distribution (P_init) determines the first operator. These probabilities remain fixed during evolution, preventing premature convergence to a narrow subspace while still biasing the search toward physically plausible structures.
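The operator-selection bias can be sketched as sampling from a fixed transition row. The probabilities below are invented for illustration (the paper's matrix is hand-crafted by experts); the structure simply shows a root distribution `P_INIT` and per-operator rows favoring typical HEP orderings such as "+" after "sqrt".

```python
import random

# Illustrative operator-transition probabilities (values are made up).
P_INIT = {"sqrt": 0.4, "+": 0.3, "*": 0.2, "cos": 0.1}

# P_TRANS[prev][next]: probability of choosing `next` right after `prev`.
# E.g. a sum is likely under a square root (sums of squares).
P_TRANS = {
    "sqrt":   {"+": 0.6, "*": 0.3, "cos": 0.1},
    "+":      {"square": 0.5, "*": 0.3, "sqrt": 0.2},
    "*":      {"cos": 0.4, "+": 0.3, "sqrt": 0.3},
    "cos":    {"+": 0.5, "*": 0.5},
    "square": {"+": 0.6, "*": 0.4},
}

def sample_operator(prev=None):
    """Draw the next operator: from P_INIT at the root, else the row of prev."""
    dist = P_INIT if prev is None else P_TRANS[prev]
    ops, probs = zip(*dist.items())
    return random.choices(ops, weights=probs, k=1)[0]
```

In a full generator, this sampler would replace the uniform `random.choice` over grammar rules, combining the type constraints (which rules are legal) with the transition bias (which legal rule is likely).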
The evolutionary loop follows standard GP: random initialization (respecting the grammar), fitness evaluation, selection, crossover, and mutation. Crossover and mutation are constrained so that offspring always remain derivable from the grammar. Fitness is measured in a wrapper fashion: each candidate feature set is added to the original variables and fed to a downstream classifier (e.g., Random Forest, XGBoost). The classifier’s validation accuracy serves as the fitness score. The authors also experiment with filter‑based fitness (information gain) for comparison.
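The wrapper fitness step can be sketched in pure Python. The nearest-centroid classifier below is a stand-in for the Random Forest/XGBoost used in the paper, and the candidate feature and toy data are invented; the point is only the mechanism: augment the raw columns with the candidate feature, train the downstream classifier, and use validation accuracy as fitness.

```python
import math

# Stand-in downstream classifier (the paper uses Random Forest / XGBoost).
class NearestCentroid:
    def fit(self, X, y):
        self.centroids = {}
        for c in set(y):
            rows = [x for x, lab in zip(X, y) if lab == c]
            self.centroids[c] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def score(self, X, y):
        def dist(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b))
        preds = [min(self.centroids, key=lambda c: dist(x, self.centroids[c]))
                 for x in X]
        return sum(p == t for p, t in zip(preds, y)) / len(y)

def fitness(feature_fn, X_tr, y_tr, X_val, y_val):
    """Wrapper fitness: append the candidate feature to the raw columns,
    train the downstream classifier, return its validation accuracy."""
    augment = lambda X: [row + [feature_fn(row)] for row in X]
    clf = NearestCentroid().fit(augment(X_tr), y_tr)
    return clf.score(augment(X_val), y_val)

# Toy usage: candidate feature sqrt(E1^2 + E2^2) on fake two-column events.
candidate = lambda row: math.sqrt(row[0] ** 2 + row[1] ** 2)
X_tr = [[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]]
y_tr = [0, 0, 1, 1]
X_val = [[1.5, 1.5], [8.5, 8.5]]
y_val = [0, 1]
acc = fitness(candidate, X_tr, y_tr, X_val, y_val)
```

A filter-based variant would replace the `fitness` body with a cheap statistic such as information gain on the augmented column, avoiding repeated classifier training at the cost of a less direct performance signal.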
Experiments are conducted on three HEP datasets, namely the Higgs boson dataset (signal vs. background), a top‑quark pair (tt̄) dataset, and an additional collision‑event set. For each dataset, three baselines are compared: (1) using only the raw features, (2) augmenting raw features with manually engineered high‑level variables (the usual practice), and (3) augmenting raw features with automatically generated GP features. Results show that the GP‑augmented models consistently outperform the raw‑only baseline, achieving accuracy improvements of roughly 3–7 percentage points. Moreover, the automatically discovered features are interpretable: examples include expressions like √(E₁² + E₂²) (a transverse energy‑like quantity) or (E₁ – E₂)·cos(θ), which align with known physics constructs. Domain experts evaluated a subset of the generated features and confirmed that they respect unit consistency and often correspond to meaningful physical concepts, sometimes revealing novel combinations not previously considered.
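As a concrete illustration of how such discovered expressions evaluate on event data, the snippet below computes the two example features for one event. The kinematic values are invented for demonstration.

```python
import math

# Toy event (values invented): two energies in GeV and an angle in radians.
E1, E2, theta = 45.2, 30.7, 0.8

# Two GP-discovered feature forms mentioned in the summary:
f1 = math.sqrt(E1 ** 2 + E2 ** 2)   # transverse-energy-like quantity
f2 = (E1 - E2) * math.cos(theta)    # energy difference projected via an angle
```

Both expressions carry consistent units (f1 is an energy; f2 is an energy scaled by a dimensionless factor), which is exactly the property the typed grammar guarantees.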
Ablation studies demonstrate that the physics‑inspired transition matrix yields better performance than a uniform random operator selection, confirming the benefit of embedding domain knowledge into the search bias.
The paper acknowledges limitations: the grammar and transition probabilities rely on expert input, so transferring the method to a different subfield would require re‑design; the operator set is deliberately small, which may restrict discovery of highly non‑linear relationships; and the approach currently optimizes a single objective (classification accuracy) without explicitly balancing interpretability versus performance. Future work is suggested in automatic grammar expansion, learning the transition matrix via Bayesian optimization, and multi‑objective GP that simultaneously maximizes accuracy and minimizes expression complexity.
In conclusion, this work presents the first method that automatically constructs dimensionally consistent, human‑readable features for HEP classification, validated both quantitatively (accuracy gains) and qualitatively (expert interpretability). The integration of CFG‑based type safety with probabilistic operator guidance offers a promising template for other scientific domains where physical laws constrain feasible feature forms.