Economical Jet Taggers -- Equivariant, Slim, and Quantized
Modern machine learning is transforming jet tagging at the LHC, but the leading transformer architectures are large, not particularly fast, and training-intensive. We present a slim version of the L-GATr tagger, reduce the number of parameters of jet-tagging transformers, and quantize them. We compare different quantization methods for standard and Lorentz-equivariant transformers and estimate their gains in resource efficiency. We find an order-of-magnitude reduction in energy cost for a moderate performance decrease, down to 1000-parameter taggers. This might be a step towards trigger-level jet tagging with small and quantized versions of the leading equivariant transformer architectures.
💡 Research Summary
The paper addresses the growing computational demands of modern jet‑tagging models at the Large Hadron Collider (LHC) by proposing two complementary strategies: a slim, Lorentz‑equivariant transformer architecture (L‑GATr‑slim) and aggressive low‑precision quantization.
Slim Transformer Design
The original L‑GATr transformer processes five types of multivector components (scalar, pseudoscalar, vector, axial‑vector, and rank‑2 antisymmetric tensor) to preserve Lorentz symmetry. The authors observe that most LHC tasks require only scalars and four‑vectors, so they drop the other components. The resulting L‑GATr‑slim uses a scalar‑vector linear layer where the same learnable scalar multiplies all vector components, guaranteeing equivariance. Non‑linearity is implemented by extending the Gated Linear Unit (GLU): the inner product of two vectors is passed through GELU and multiplied by the vector output. RMSNorm is adapted to use the absolute value of the Minkowski inner product for vector normalization, avoiding sign issues. Attention combines scalar‑scalar and vector‑vector inner products, weighted by learned scalars, before applying Softmax. This design reduces the number of parameters by roughly 30 % while keeping the same depth and overall architecture (stacked attention‑MLP blocks).
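The building blocks described above can be sketched in a few lines. The snippet below is a minimal NumPy illustration, not the authors' implementation: it assumes the (+, −, −, −) metric signature and a tanh approximation of GELU, and all function names are hypothetical. The key point it demonstrates is that layers built from scalar-weighted sums of four-vectors and Minkowski inner products commute with Lorentz transformations.

```python
import numpy as np

# Minkowski metric, assuming the (+, -, -, -) signature
ETA = np.diag([1.0, -1.0, -1.0, -1.0])

def minkowski_ip(u, v):
    """Lorentz-invariant inner product <u, v> = u^T eta v (last axis)."""
    return np.einsum("...i,ij,...j->...", u, ETA, v)

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def equivariant_linear(vectors, weights):
    """Scalar-vector linear layer: each output channel is a learned scalar
    combination of the input four-vectors, so the layer commutes with any
    Lorentz transform Lambda.  vectors: (..., C_in, 4); weights: (C_out, C_in)."""
    return np.einsum("oc,...ci->...oi", weights, vectors)

def gated_vector_nonlinearity(v_gate, v_out):
    """GLU-style nonlinearity: the invariant inner product of two vectors
    is passed through GELU and multiplies the vector output."""
    return gelu(minkowski_ip(v_gate, v_out))[..., None] * v_out

def vector_rmsnorm(vectors, eps=1e-6):
    """RMSNorm over vector channels using |<v, v>|; the absolute value
    avoids sign issues for spacelike vectors."""
    norms = np.abs(minkowski_ip(vectors, vectors))            # (..., C)
    rms = np.sqrt(norms.mean(axis=-1, keepdims=True) + eps)   # (..., 1)
    return vectors / rms[..., None]

def attention_logits(s_q, s_k, v_q, v_k, w_s, w_v):
    """Pre-softmax attention: scalar-scalar and vector-vector inner
    products, mixed with learned scalar weights w_s and w_v."""
    return w_s * (s_q @ s_k.T) + w_v * minkowski_ip(v_q[:, None], v_k[None, :])
```

A quick sanity check of the equivariance claim: applying a boost to the input four-vectors and then the linear layer gives the same result as applying the layer first and the boost afterwards, because the Minkowski inner product is invariant under the boost.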
Quantization Techniques
Two quantization schemes are evaluated: (1) the standard Straight‑Through Estimator (STE), which uses the quantized weights in the forward pass but passes the gradient through to the full‑precision weights unchanged during back‑propagation, and (2) a novel Piecewise‑Affine Regularized Quantization (PARQ) based on proximal‑gradient optimization. PARQ adds a regularization term that pulls weights toward predefined quantization levels, smoothing the otherwise abrupt transitions between them. Both methods are applied to the linear layers, with bit‑widths of 8, 4, and 2 explored.
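The two schemes can be sketched as follows. This is an illustrative NumPy sketch built around a symmetric uniform quantizer; the exact grids, scales, and regularizer used in the paper may differ, and the function names are hypothetical. The `parq_prox_step` shows the spirit of a proximal update that attracts weights to the grid rather than hard-rounding them.

```python
import numpy as np

def uniform_quantize(w, bits=4):
    """Symmetric uniform quantizer: map weights onto a 2^bits-level grid.
    Returns the dequantized weights and the grid scale."""
    qmax = 2 ** (bits - 1) - 1
    wmax = np.max(np.abs(w))
    if wmax == 0.0:
        return np.zeros_like(w), 1.0
    scale = wmax / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, scale

def ste_backward(grad_out):
    """Straight-through estimator: the quantizer is treated as the identity
    in the backward pass, so the full-precision weights receive the
    upstream gradient unmodified."""
    return grad_out

def parq_prox_step(w, scale, lam, bits=4):
    """Illustrative proximal step in the spirit of PARQ: pull each weight
    toward its nearest quantization level with strength lam, smoothing
    the transition between levels instead of rounding abruptly."""
    qmax = 2 ** (bits - 1) - 1
    nearest = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
    return (w + lam * nearest) / (1.0 + lam)
```

With `lam = 1.0`, one proximal step moves each weight halfway to its nearest level; iterating (or increasing `lam` over training) drives the weights onto the grid, which is the smooth alternative to the hard rounding that STE applies in every forward pass.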
Experimental Results
The authors benchmark three representative tasks: (i) top‑quark tagging on the standard benchmark, (ii) multi‑class jet classification on the large JetClass dataset (10 M training jets, ten classes), and (iii) supervised regression of Z + n‑gluon scattering amplitudes (n = 1…4).
Top‑tagging: L‑GATr‑slim matches the full L‑GATr and LLoCa transformers with an AUC of ≈0.942 and background rejection rates within statistical uncertainties, despite having only 2 M parameters (versus 5 M+ for the full models).
JetClass: The slim model achieves 86.6 % accuracy and an average AUC of 0.9885, comparable to state‑of‑the‑art architectures. FLOPs per inference drop from ~1.3 GFLOPs (full transformer) to ~0.8 GFLOPs, and memory usage falls from ~4 GB to ~2 GB.
Amplitude regression: For the hardest case (Z + 4 g), L‑GATr‑slim reaches a mean‑squared error of 1.8 × 10⁻⁵, on par with the full transformer. Quantized versions retain similar performance: 8‑bit quantization adds <0.2 % degradation, 4‑bit adds <1 %, and even 2‑bit quantization stays within 2‑3 % loss.
Quantization yields substantial efficiency gains: 8‑bit models reduce FLOPs by ~30 % and memory by ~35 %; 4‑bit models cut both by ~50 %; 2‑bit models achieve >60 % reductions. Power measurements on an NVIDIA H100 GPU show an order‑of‑magnitude lower energy consumption for the 2‑bit quantized slim model compared to the full‑precision baseline.
Implications for Trigger‑Level Deployment
The combination of a 1 000‑parameter ultra‑mini model and 2‑bit quantization results in a network that can be executed within the strict latency and power budgets of LHC trigger hardware (e.g., FPGA or ASIC). The authors argue that such a model could be directly embedded in the Level‑1 trigger system of the High‑Luminosity LHC, enabling real‑time jet‑flavor classification that was previously impossible due to resource constraints.
Conclusions
By stripping unnecessary multivector components and applying sophisticated low‑precision quantization, the authors demonstrate that Lorentz‑equivariant transformers can be made dramatically more resource‑efficient without sacrificing physics performance. This work opens a clear path toward deploying advanced machine‑learning jet taggers at the trigger level, potentially improving the physics reach of future LHC runs.