Comparison of Image Processing Models in Quark Gluon Jet Classification
We present a comprehensive comparison of convolutional and transformer-based models for distinguishing quark and gluon jets using simulated jet images from Pythia 8. By encoding jet substructure into a three-channel representation of particle kinematics, we evaluate the performance of convolutional neural networks (CNNs), Vision Transformers (ViTs), and Swin Transformers (Swin-Tiny) under both supervised and self-supervised learning setups. Our results show that fine-tuning only the final two transformer blocks of the Swin-Tiny model achieves the best trade-off between efficiency and accuracy, reaching 81.4% accuracy and an AUC (area under the ROC curve) of 0.889. Self-supervised pretraining with Momentum Contrast (MoCo) further enhances feature robustness and reduces the number of trainable parameters. These findings highlight the potential of hierarchical attention-based models for jet substructure studies and for domain transfer to real collision data.
💡 Research Summary
The paper presents a systematic study of modern image‑based deep learning techniques applied to the classification of quark and gluon jets, a long‑standing problem in high‑energy physics. Using simulated proton‑proton collisions at √s = 14 TeV generated with Pythia 8, the authors select light‑flavor quark and gluon jets in the transverse momentum range 500–550 GeV and rapidity |y| < 1. Each jet is converted into a 72 × 72 pixel image in the (η, ϕ) plane with three physically motivated channels: (R) charged‑particle transverse momentum, (G) neutral‑particle transverse momentum, and (B) charged‑particle multiplicity. This representation preserves both momentum and multiplicity information while providing a standard 2‑D image format suitable for computer‑vision models.
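The pixelization described above can be illustrated with a minimal NumPy sketch. This is not the authors' code; the function name, argument conventions, and the ±1 half-width of the grid are illustrative assumptions, but the channel assignments follow the description in the text:

```python
import numpy as np

def jet_image(eta, phi, pt, charge, n_pixels=72, half_width=1.0):
    """Build a 3-channel (R, G, B) jet image on an n_pixels x n_pixels
    grid in the (eta, phi) plane. Inputs are per-particle arrays;
    grid half-width is an illustrative assumption."""
    eta = np.asarray(eta, float); phi = np.asarray(phi, float)
    pt = np.asarray(pt, float); charge = np.asarray(charge)
    bins = np.linspace(-half_width, half_width, n_pixels + 1)
    charged = charge != 0
    img = np.zeros((3, n_pixels, n_pixels))
    # R: charged-particle transverse momentum
    img[0], _, _ = np.histogram2d(eta[charged], phi[charged],
                                  bins=(bins, bins), weights=pt[charged])
    # G: neutral-particle transverse momentum
    img[1], _, _ = np.histogram2d(eta[~charged], phi[~charged],
                                  bins=(bins, bins), weights=pt[~charged])
    # B: charged-particle multiplicity (unweighted counts)
    img[2], _, _ = np.histogram2d(eta[charged], phi[charged],
                                  bins=(bins, bins))
    return img
```

Each channel is just a 2-D histogram over the same grid, so the three channels stay spatially aligned pixel by pixel.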
Three families of models are evaluated: (i) conventional convolutional neural networks (CNNs), (ii) Vision Transformers (ViTs), and (iii) Swin Transformers, specifically the lightweight Swin‑Tiny variant. CNNs exploit local convolutions and are strong at capturing short‑range spatial correlations, but their receptive field grows only gradually with depth, potentially missing long‑range jet‑wide patterns. ViTs split the image into fixed‑size patches, embed them, and apply global self‑attention, enabling direct interaction between any two patches and thus capturing long‑range dependencies such as correlations between the jet core and its periphery. However, global attention incurs high computational cost and lacks the strong inductive bias for locality that benefits jet images.
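The patch-splitting step that precedes a ViT's learned linear embedding can be sketched as follows (toy sizes; the function name is an assumption, and the learned projection and positional encodings are omitted):

```python
import numpy as np

def patchify(img, patch):
    """Split a (C, H, W) image into flattened patch tokens of shape
    (num_patches, C * patch * patch) -- the input to a ViT's learned
    linear patch embedding. Assumes H and W are divisible by patch."""
    C, H, W = img.shape
    x = img.reshape(C, H // patch, patch, W // patch, patch)
    x = x.transpose(1, 3, 0, 2, 4)  # (Hp, Wp, C, patch, patch)
    return x.reshape(-1, C * patch * patch)
```

After this step, global self-attention lets every token attend to every other token, which is what enables the long-range core-to-periphery correlations mentioned above, at quadratic cost in the number of patches.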
Swin‑Tiny addresses these issues by performing self‑attention within non‑overlapping windows, then shifting windows in successive layers to allow cross‑window communication, and finally merging patches to build a hierarchical multi‑scale representation. This design combines the locality bias of CNNs with the global modeling capacity of Transformers, making it especially suitable for jet images that contain both fine‑grained radiation patterns and broader shape information.
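The window-partition and shifted-window steps can be sketched in NumPy under simplifying assumptions (a single feature map, attention itself not shown; function names are illustrative):

```python
import numpy as np

def window_partition(x, window):
    """Split an (H, W, C) feature map into non-overlapping
    (window x window) tiles: returns (num_windows, window, window, C).
    Self-attention is then computed independently inside each tile."""
    H, W, C = x.shape
    x = x.reshape(H // window, window, W // window, window, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window, window, C)

def shift(x, window):
    """Cyclically roll the map by half a window, so the next layer's
    windows straddle the previous layer's window boundaries."""
    return np.roll(x, shift=(-window // 2, -window // 2), axis=(0, 1))
```

Alternating plain and shifted partitions is what lets information propagate across the whole image while each attention operation stays cheap and local.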
Training is conducted on an NVIDIA RTX 4060 GPU with a fixed random seed (42) for reproducibility, using a batch size of 128 for 80 epochs. The authors explore both fully supervised training and a self‑supervised pre‑training stage based on Momentum Contrast (MoCo). MoCo constructs query–key pairs from different augmentations of the same jet image and contrasts them against a memory bank of negative keys, encouraging the encoder to learn discriminative representations even when quark and gluon jets appear visually similar. Importantly, augmentations respect the (η, ϕ) geometry by avoiding rotations or reflections that would destroy physical meaning.
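A minimal sketch of the MoCo contrastive objective and momentum update, assuming L2-normalized embeddings and a fixed queue of negative keys (the names and the single-query formulation are illustrative, not the paper's implementation):

```python
import numpy as np

def info_nce_loss(q, k_pos, queue, temperature=0.07):
    """InfoNCE loss for one L2-normalized query q against its positive
    key k_pos and a queue of negative keys (rows of `queue`)."""
    l_pos = q @ k_pos                      # similarity to the positive
    l_neg = queue @ q                      # similarities to negatives
    logits = np.concatenate(([l_pos], l_neg)) / temperature
    logits -= logits.max()                 # numerical stability
    # cross-entropy with the positive at index 0
    return -logits[0] + np.log(np.exp(logits).sum())

def momentum_update(key_params, query_params, m=0.999):
    """EMA update of the key encoder from the query encoder."""
    return m * key_params + (1.0 - m) * query_params
```

The loss is small when the query is close to its augmented partner and far from the queue, which is exactly the pressure that forces the encoder to find discriminative jet features.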
The key empirical findings are: (1) Swin‑Tiny fine‑tuned on only its last two transformer blocks achieves the best trade‑off between efficiency and performance, reaching 81.4% classification accuracy and an area under the ROC curve (AUC) of 0.889. This configuration reduces the number of trainable parameters dramatically compared with full fine‑tuning of all layers. (2) MoCo pre‑training further improves robustness: a Swin‑Tiny model initialized with MoCo weights matches or exceeds the performance of a fully supervised, fully fine‑tuned Swin‑Tiny while using fewer trainable parameters and less training time. (3) Compared with baseline CNNs, both ViT and Swin‑Tiny deliver higher AUCs, but Swin‑Tiny's hierarchical attention yields superior parameter efficiency, especially when only a subset of layers is updated.
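Freezing all but the last blocks of a backbone follows a standard PyTorch pattern. The toy sequential model below merely stands in for Swin-Tiny, whose actual module names and structure differ; this is a sketch of the freezing mechanics, assuming PyTorch is available:

```python
import torch.nn as nn

# Toy stand-in for a transformer backbone: a stack of "blocks"
# followed by a classification head. In a real Swin-Tiny one would
# freeze everything except the final two blocks (and the head).
model = nn.Sequential(*[nn.Linear(16, 16) for _ in range(6)],
                      nn.Linear(16, 2))          # last layer = head

blocks = list(model.children())
for module in blocks[:-3]:                        # freeze all but the
    for p in module.parameters():                 # last two blocks + head
        p.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
```

Only parameters with `requires_grad=True` receive gradients, so the optimizer updates just the unfrozen tail, which is the source of the parameter-efficiency gains reported above.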
The authors also discuss practical considerations: the custom three‑channel jet image preprocessing (centering, grid discretization, channel‑wise normalization) preserves physics information and enables direct transfer of computer‑vision architectures. The study demonstrates that transformer‑based models, traditionally trained on massive natural‑image datasets, can be successfully adapted to the high‑energy physics domain through domain‑specific pre‑training and careful fine‑tuning strategies.
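The centering and channel-wise normalization steps can be sketched as follows (ignoring the 2π periodicity of φ for simplicity; function names and the unit-sum normalization choice are assumptions, not taken from the paper):

```python
import numpy as np

def center_jet(eta, phi, pt):
    """Translate particles so the pT-weighted centroid of the jet sits
    at the origin of the (eta, phi) grid. Translations in this plane
    preserve the physics; phi wraparound is ignored here."""
    eta = np.asarray(eta, float); phi = np.asarray(phi, float)
    pt = np.asarray(pt, float)
    eta = eta - np.average(eta, weights=pt)
    phi = phi - np.average(phi, weights=pt)
    return eta, phi

def normalize_channels(img, eps=1e-8):
    """Scale each channel of a (C, H, W) image to unit sum, so models
    see relative rather than absolute pT and multiplicity patterns."""
    sums = img.sum(axis=(1, 2), keepdims=True)
    return img / (sums + eps)
```

Centering removes the jet's overall position in the detector, and per-channel normalization removes overall scale, leaving the substructure pattern that discriminates quarks from gluons.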
In conclusion, the work shows that hierarchical attention models such as Swin‑Transformer, when combined with self‑supervised MoCo pre‑training, provide a powerful and computationally efficient tool for quark–gluon jet discrimination. The methodology is readily extensible to real LHC data and to other HEP data modalities (e.g., calorimeter towers, tracking hit maps), opening avenues for broader applications of modern deep learning in particle physics.