PhyGHT: Physics-Guided HyperGraph Transformer for Signal Purification at the HL-LHC
Mohammed Rakib, Department of Computer Science, Oklahoma State University, USA, mohammed.rakib@okstate.edu
Luke Vaughan, Department of Physics, Oklahoma State University, USA, luke.vaughan@okstate.edu
Shivang Patel, Department of Physics, Oklahoma State University, USA, shivang.patel@okstate.edu
Flera Rizatdinova, Department of Physics, Oklahoma State University, USA, flera.rizatdinova@okstate.edu
Alexander Khanov, Department of Physics, Oklahoma State University, USA, alexander.khanov@okstate.edu
Atriya Sen, Department of Computer Science, Oklahoma State University, USA, atriya.sen@okstate.edu

Figure 1: Signal Collision (Blue) covered in Pileup Collisions (Red) in the HL-LHC Environment

Abstract
The High-Luminosity Large Hadron Collider (HL-LHC) at CERN will produce unprecedented datasets capable of revealing fundamental properties of the universe. However, realizing its discovery potential faces a significant challenge: extracting small signal fractions from overwhelming backgrounds dominated by approximately 200 simultaneous pileup collisions. This extreme noise severely distorts the physical observables required for accurate reconstruction. To address this, we introduce the Physics-Guided Hypergraph Transformer (PhyGHT), a hybrid architecture that combines distance-aware local graph attention with global self-attention to mirror the physical topology of particle showers formed in proton-proton collisions. Crucially, we integrate a Pileup Suppression Gate (PSG), an interpretable, physics-constrained mechanism that explicitly learns to filter soft noise prior to hypergraph aggregation. To validate our approach, we release a novel simulated dataset of top-quark pair production to model extreme pileup conditions. PhyGHT outperforms state-of-the-art baselines from the ATLAS and CMS experiments in predicting the signal's energy and mass correction factors.
By accurately reconstructing the top quark's invariant mass, we demonstrate how machine learning innovation and interdisciplinary collaboration can directly advance scientific discovery at the frontiers of experimental physics and enhance the HL-LHC's discovery potential. The dataset and code are available at https://github.com/rAIson-Lab/PhyGHT.

CCS Concepts
• Computing methodologies → Neural networks; Multi-task learning; Supervised learning; Rare-event simulation; Simulation evaluation; • Applied computing → Physics.

Keywords
PileUp Mitigation, Graph Neural Networks, Physics-Informed Machine Learning, High-Energy Physics, AI4Science

ACM Reference Format:
Mohammed Rakib, Luke Vaughan, Shivang Patel, Flera Rizatdinova, Alexander Khanov, and Atriya Sen. 2026. PhyGHT: Physics-Guided HyperGraph Transformer for Signal Purification at the HL-LHC. In Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '26). ACM, New York, NY, USA, 12 pages. https://doi.org/XXXXXXX.XXXXXXX

1 Introduction
Starting in 2030, the Large Hadron Collider (LHC) at CERN will begin upgrades to the High-Luminosity phase (HL-LHC) [3].
High-energy physics experiments, such as ATLAS [17] and CMS [18], will record vast amounts of data by colliding bunches of protons to better understand the fundamental structure of nature. Since it is not feasible to collect data from all collisions, only "interesting" signal interactions that satisfy online preselection criteria will be recorded. Interactions recorded by the detector data acquisition system, called events, will include particles due to both the signal interaction and additional spurious interactions within the same bunch crossing, referred to as pileup. The average number of interactions per bunch crossing, defined as ⟨μ⟩, will shift from 60 at the LHC to 200 at the HL-LHC, which will significantly increase the pileup background, as shown in Fig. 2.

Due to the nature of physics processes at the LHC, particles created during the proton collisions often form collimated streams called jets, which are observed as compact clusters of energy in the detector. In the presence of pileup, the energy and momentum of signal jets become significantly distorted compared to their true values, which hinders physics analysis. The task of pileup mitigation is to suppress the pileup background and restore the physical quantities from the signal collision.

KDD '26, August 09–13, 2026, Jeju, Korea. Rakib et al.

Figure 2: A Phoenix Event Display [19] depicting an event in the ATLAS inner tracking system in the HL-LHC conditions at ⟨μ⟩ = 60 (left) and ⟨μ⟩ = 200 (right) for signal (blue) and background (red) particles with p_T > 1.0 GeV.

While the HL-LHC will maximize the potential for discoveries in particle physics, the unprecedented levels of pileup will pose a major challenge for data analysis. Finding the signal of a rare physics process in this environment is akin to finding a needle in a haystack [23, 31], where machine learning (ML) models must be robust enough to denoise complex event data.
In recent years, there has been rising interest in ML algorithms for mitigating pileup effects [13, 24–27]. Current non-ML production algorithms deployed in LHC experiments, typically optimized for lower pileup conditions, have adopted two distinct approaches: mitigating pileup either at the jet level [1] or at the particle level [8]. These single-modality strategies often face a trade-off: jet-level corrections can obscure internal substructure, while particle-level filtering may lack the global context required to estimate event-wide noise density. In our approach, we exploit the full information from both particles and jets in each event by leveraging a combination of graph architectures [2, 22, 26, 29] and transformer encoders [25, 30, 32] to directly provide energy and mass correction factors for each jet, enabling more precise pileup mitigation.

In this work, we propose the Physics-Guided Hypergraph Transformer (PhyGHT), a hierarchical architecture that fuses Distance-Aware Graph Attention (DA-GAT) for local substructure encoding with a Global Transformer for event-level energy flow analysis. This hybrid approach resolves the trade-off between local geometric precision and global context awareness required for high-pileup environments. Inspired by PUPPI [8], we introduce the Pileup Suppression Gate (PSG), a learnable and differentiable mechanism designed to enhance interpretability. PSG explicitly predicts a per-particle signal probability, enabling the model to perform soft-masking of pileup prior to aggregation with jets. We formulate the jet purification task as a hypergraph aggregation problem, treating jets as hyperedges that connect variable-sized sets of tracks. This approach employs a bipartite attention mechanism to dynamically weight constituent tracks, overcoming the information loss associated with fixed-size pooling [21] and enabling precise reconstruction of physical observables such as the top quark mass.
To rigorously evaluate these capabilities, we release a simulated dataset of top quarks under extreme pileup conditions (⟨μ⟩ = 200) that closely mimics real-world detector data.

2 Dataset and Simulation
We present a novel, open-source dataset that fills a critical need at the intersection of machine learning and high-energy physics. While functionally similar to proprietary datasets used by the ATLAS collaboration [14], our dataset offers three key advantages. First, it provides truth labels for pileup characterization. This enables precise separation of signal and background noise under HL-LHC conditions. Second, it is fully public. This provides open access to realistic particle physics data that is typically restricted to collaboration members. Third, it supports reproducibility and scalability. This allows the broader computer science community to comprehensively evaluate state-of-the-art machine learning architectures on complex particle physics analysis tasks. By simulating the extreme pileup conditions (⟨μ⟩ = 200) expected at the HL-LHC, we provide a benchmark that allows researchers to build and test systems ready for the next generation of particle physics experiments. The dataset is available at Zenodo.

2.1 Data Representation
The signal process is chosen to be top quark pair production decaying semi-leptonically, ($pp \to t\bar{t}$, $t \to q\bar{q}'b$, $\bar{t} \to \ell\nu\bar{b}$), and is generated using MadGraph5_aMC@NLO [4]. Pythia 8 [9] with the ATLAS A14 central tune [7, 16] is used for parton showering, which results in stable, final-state particles that can be observed by the detectors. To mimic both standard LHC conditions and the high-pileup ones of the HL-LHC, soft Quantum Chromodynamics (QCD) pileup interactions were overlaid by sampling from a Poisson distribution with mean ⟨μ⟩ = 60 and ⟨μ⟩ = 200, respectively. The primary and pileup vertices were spatially smeared using Gaussian distributions with widths σ_xy = 0.3 mm and σ_z = 50 mm. Stable, final-state particles are clustered using FastJet [11] with the anti-k_t algorithm [10] using a cone size parameter of R = 0.4 and a minimum transverse momentum threshold $p_T^{min} > 25$ GeV. Lastly, to model detector acceptance, we removed neutral particles and charged particles with p_T < 400 MeV.

Particles in the detector are described using a standard 3D coordinate system defined by their transverse momentum (p_T), pseudorapidity (η), and azimuthal angle (φ).¹ Each charged particle track is represented by a feature vector $\mathbf{x}^{track} = [p_T, \eta, \phi, q, d_0, z_0]$. Here, q is the charge, and d_0 and z_0 denote the transverse and longitudinal impact parameters, respectively. These impact parameters are calculated by extrapolating the track to the beam line [15]. Each jet is described by a vector $\mathbf{x}^{jet} = [p_T, \eta, \phi, m]$, where m is the mass. These features represent the aggregated kinematics of the charged and neutral constituents of each clustered set of particles.

¹ φ is the azimuthal angle, $\eta = -\ln[\tan(\theta/2)]$ where θ is the polar angle, and $p_T = |\vec{p}|\sin\theta$ in the spherical coordinate system where the z axis is directed along the beam.

2.2 Truth Label Definition
Each jet and track is described as a Lorentz 4-vector defined by energy and momentum: $(E, \vec{p})$. The mass of an object can be calculated using the relativistic energy-momentum relation as $m = \sqrt{E^2 - |\vec{p}|^2}$. Each jet is constructed by summing the 4-vectors over a set of tracks. Each track is assigned a binary label $y^{label} \in \{0, 1\}$. We assign a value of 1 to tracks originating from the signal vertex and 0 to those originating from pileup vertices; these labels are used for an auxiliary task during training.
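To make this construction concrete, the following minimal numpy sketch sums toy track 4-vectors into a jet, computes the invariant mass via $m = \sqrt{E^2 - |\vec{p}|^2}$, and uses the binary labels to form the hard-scatter-to-raw ratios that define the correction-factor targets of Eq. (1). All numbers are invented for illustration and are not taken from the released dataset.

```python
import numpy as np

# Hypothetical toy tracks: columns are (E, px, py, pz) in GeV.
# Labels mark hard-scatter (1) vs. pileup (0) origin, as in Section 2.2.
tracks = np.array([
    [50.0, 10.0, 20.0, 40.0],   # hard-scatter track
    [30.0,  5.0, 15.0, 25.0],   # hard-scatter track
    [10.0,  3.0,  4.0,  8.0],   # pileup track
])
labels = np.array([1, 1, 0])

def four_vector_mass(p4):
    """Invariant mass m = sqrt(E^2 - |p|^2) of a summed 4-vector (E, px, py, pz)."""
    E, px, py, pz = p4
    return np.sqrt(max(E**2 - (px**2 + py**2 + pz**2), 0.0))

# A jet is the 4-vector sum of its constituent tracks.
p4_raw = tracks.sum(axis=0)                # all constituents (signal + pileup)
p4_hs  = tracks[labels == 1].sum(axis=0)   # hard-scatter constituents only

E_raw, E_hs = p4_raw[0], p4_hs[0]
m_raw, m_hs = four_vector_mass(p4_raw), four_vector_mass(p4_hs)

# Truth correction factors: HS-to-raw energy and mass ratios.
y_E = E_hs / E_raw
y_M = m_hs / m_raw
```

For real jets, the same reduction runs over each jet's constituent list taken from the clustering history.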
For each jet, we can calculate the truth-level energy correction factor $y_{E,k}$ and mass correction factor $y_{M,k}$ for the k-th jet as the ratio of the contributions from hard-scatter (HS) tracks to the total contributions from all tracks:

$$y_{E,k} = \frac{E_{HS,k}}{E_{raw,k}}, \qquad y_{M,k} = \frac{M_{HS,k}}{M_{raw,k}} \tag{1}$$

where $E_{raw,k}$ and $M_{raw,k}$ represent the total jet energy and mass clustered from both the hard-scatter signal and the pileup background.

3 Methodology
Figure 3 illustrates our PhyGHT architecture, a hierarchical graph neural network designed to mitigate pileup-induced distortion in physical jet observables such as mass and energy. PhyGHT takes as input raw event-level information, specifically jets and tracks. It then employs a four-stage sequential feature refinement strategy to purify jets from pileup. First, the Local Geometric Block uses a Distance-Aware Graph Attention Network (DA-GAT) to encode spatial correlations, effectively capturing the local topology of signal particles against stochastic pileup. Then the Global Context Block employs a Transformer encoder to model event-wide constraints like pileup density and momentum conservation. Next, the Pileup Suppression Gate (PSG) applies a learnable, differentiable mask to down-weight pileup tracks, serving as an analogue to traditional algorithms such as PUPPI. Finally, the Hypergraph Attention Block dynamically aggregates purified tracks and fuses them with raw jet features via bipartite message passing, enabling precise regression of energy and mass correction factors.

3.1 Problem Formulation
We formulate pileup mitigation as a regression task on a heterogeneous graph, aiming to predict correction factors that recover hard-scatter observables from pileup-contaminated events. Let a collision event be represented as a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = \mathcal{V}_{track} \cup \mathcal{V}_{jet}$ consists of $N_{track}$ track nodes and $N_{jet}$ jet nodes. We denote track indices by $i, j \in \{1, \ldots, N_{track}\}$ and jet indices by $k \in \{1, \ldots, N_{jet}\}$.

3.1.1 Node Features. Each track node $t_i \in \mathcal{V}_{track}$ is initialized with a feature vector $\mathbf{x}^{track}_i \in \mathbb{R}^6$ containing the kinematic and vertexing variables defined in Section 2. Similarly, each jet node $j_k \in \mathcal{V}_{jet}$ acts as a hypernode initialized with the aggregate feature vector $\mathbf{x}^{jet}_k \in \mathbb{R}^4$. These features represent the raw input state.

3.1.2 Graph Connectivity. The edge set $\mathcal{E}$ encodes both local geometric and hierarchical relationships through two distinct subsets. First, the Local Edge Set $\mathcal{E}_{local}$ connects track nodes to their spatial neighbors; an edge $(t_i, t_j) \in \mathcal{E}_{local}$ exists if track $t_j$ is among the K-nearest neighbors of track $t_i$ in the (η, φ) metric space, defined by the Euclidean distance $\Delta R_{ij} = \sqrt{(\Delta\eta_{ij})^2 + (\Delta\phi_{ij})^2}$. Second, the Hypergraph Edge Set $\mathcal{E}_{hyper}$ connects tracks to jets based on the clustering history, where a directed edge $(t_i, j_k)$ exists if track $t_i$ is a physical constituent of jet $j_k$.

3.1.3 Learning Objective. The goal of PhyGHT is to learn a mapping $f_\theta(\mathcal{G}) \to \hat{\mathbf{Y}}$ that regresses the hard-scatter contributions. For each jet $j_k$, the model predicts an estimated energy correction factor $\hat{y}_{E,k} \in [0, 1]$ and mass correction factor $\hat{y}_{M,k} \in [0, 1]$ corresponding to the ground-truth ratios in Eq. 1. Targeting these bounded coefficients rather than absolute values stabilizes the regression against the large dynamic range of jet kinematics. The final physical quantities are reconstructed via:

$$E_{corr,k} = \hat{y}_{E,k} \cdot E_{raw,k}, \qquad M_{corr,k} = \hat{y}_{M,k} \cdot M_{raw,k} \tag{2}$$

where $E_{corr,k}$ and $M_{corr,k}$ represent the corrected energy and mass.

3.2 Local Geometric Encoding
The primary objective of the Local Geometric block is to encode the local context of particle showers represented by jets.
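As a concrete sketch of the local edge set $\mathcal{E}_{local}$ from Section 3.1.2, the snippet below builds K-nearest-neighbor edges in the (η, φ) metric with numpy. The paper does not prescribe an implementation; in particular, the periodic wrapping of Δφ is our assumption, reflecting the fact that the azimuthal angle is defined on a circle.

```python
import numpy as np

def delta_r(eta, phi):
    """Pairwise ΔR_ij = sqrt(Δη² + Δφ²), with Δφ wrapped into (-π, π]."""
    d_eta = eta[:, None] - eta[None, :]
    d_phi = phi[:, None] - phi[None, :]
    d_phi = (d_phi + np.pi) % (2 * np.pi) - np.pi   # periodic azimuth
    return np.sqrt(d_eta**2 + d_phi**2)

def knn_local_edges(eta, phi, k):
    """Directed edges (i -> j) where j is among the k nearest tracks to i."""
    dist = delta_r(eta, phi)
    np.fill_diagonal(dist, np.inf)          # exclude self-loops
    nbrs = np.argsort(dist, axis=1)[:, :k]  # k nearest neighbors per track
    src = np.repeat(np.arange(len(eta)), k)
    return np.stack([src, nbrs.ravel()])    # shape (2, N * k)

# Toy tracks: tracks 0 and 1 are close only across the φ = ±π boundary.
eta = np.array([0.0, 0.1, 0.1, 2.0, -1.5])
phi = np.array([3.1, -3.1, 0.0, 1.0, 2.0])
edges = knn_local_edges(eta, phi, k=2)
```

Because the graph is built once from fixed detector coordinates, this cost is paid a single time per event, which is the property the efficiency analysis in Section 4.2.3 relies on.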
In high-energy physics, signal particles from a hard scatter typically exhibit collinearity, clustering tightly in the (η, φ) space, whereas particles from 200 superimposed pileup collisions create a random distribution of tracks. To capture this, we employ a Distance-Aware Graph Attention Network (DA-GAT) that biases the aggregation of neighbor features based on their spatial proximity in the detector.

3.2.1 Feature Embedding. First, the raw input features $\mathbf{x}^{track}_i$ of each track $t_i$ are projected into a high-dimensional latent space to enable the learning of non-linear kinematic correlations. We apply a linear transformation followed by layer normalization and a GELU activation to get the initial track embedding $\mathbf{h}^{(0)}_i \in \mathbb{R}^D$:

$$\mathbf{h}^{(0)}_i = \mathrm{GELU}\left(\mathrm{LayerNorm}\left(\mathbf{W}_T \mathbf{x}^{track}_i + \mathbf{b}_T\right)\right) \tag{3}$$

where $\mathbf{W}_T \in \mathbb{R}^{D \times 6}$, $\mathbf{b}_T \in \mathbb{R}^D$, and D is the hidden dimension.

3.2.2 Distance-Aware Graph Attention (DA-GAT). Standard Graph Attention Networks (GATs) [28] compute attention coefficients solely based on node features, potentially assigning high weights to distant pileup tracks that share similar kinematic properties with the signal. To mitigate this, we inject a structural bias into the attention mechanism. For every track $t_i$, we consider its local neighborhood $\mathcal{N}_K(i)$ defined by k-NN in the $\mathcal{E}_{local}$ edge set. We compute the normalized attention coefficients $\alpha_{ij}$ directly via:

$$\alpha_{ij} = \mathcal{S}\left(\mathrm{LeakyReLU}\left(\mathbf{a}^T\left[\mathbf{W}\mathbf{h}^{(0)}_i \,\|\, \mathbf{W}\mathbf{h}^{(0)}_j \,\|\, w_d \cdot (\Delta R_{ij})^2\right]\right)\right) \tag{4}$$

Here, ∥ denotes concatenation, $\mathbf{W} \in \mathbb{R}^{D \times D}$ is a shared weight matrix, $\mathbf{a} \in \mathbb{R}^{2D+1}$ is the attention vector, $\Delta R_{ij}$ is the Euclidean distance in the (η, φ) plane, and $\mathcal{S}$ denotes the softmax function applied across the neighborhood $\mathcal{N}_K(i)$. Crucially, $w_d$ is a learnable scalar parameter. This allows the network to learn a spatial decay function analogous to a Gaussian kernel that penalizes information flow from physically distant tracks, effectively enforcing a soft cone size for information aggregation.

Figure 3: Overview of PhyGHT. It takes a heterogeneous graph of Tracks and Jets as input. It processes tracks through the Local Geometric (DA-GAT) block, followed by the Global Context block. The fused representations are filtered by the Pileup Suppression Gate (PSG) before being aggregated into the Jet representation via Hypergraph Attention for final regression.

3.2.3 Aggregation. The local representation for track $t_i$ is updated by aggregating the neighbor features weighted by $\alpha_{ij}$:

$$\mathbf{h}^{(agg)}_i = \sum_{j \in \mathcal{N}_K(i)} \alpha_{ij} \mathbf{W}\mathbf{h}^{(0)}_j \tag{5}$$

To preserve gradient flow and stabilize training, we employ a residual connection and layer normalization:

$$\mathbf{h}^{(local)}_i = \mathrm{LayerNorm}\left(\mathbf{h}^{(0)}_i + \mathrm{Dropout}\left(\mathrm{GELU}\left(\mathbf{h}^{(agg)}_i\right)\right)\right) \tag{6}$$

The resulting vector $\mathbf{h}^{(local)}_i$ encodes the track's kinematic state contextually enriched by its immediate geometric surroundings.

3.3 Global Contextualization
While DA-GAT captures the fine-grained local context of jets, it is inherently blind to long-range dependencies such as global momentum conservation and event-wide pileup density fluctuations. To address this, we process the locally encoded features $\mathbf{h}^{(local)}_i$ through a Global Contextualization block based on the Transformer encoder architecture [5].

3.3.1 Global Self-Attention. We treat the event as a fully connected graph where every track attends to every other track. For each track $t_i$, we compute Query ($\mathbf{q}_i$), Key ($\mathbf{k}_i$), and Value ($\mathbf{v}_i$) vectors via linear projections below, where $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{R}^{D \times D}$:

$$\mathbf{q}_i = \mathbf{W}_Q \mathbf{h}^{(local)}_i, \quad \mathbf{k}_i = \mathbf{W}_K \mathbf{h}^{(local)}_i, \quad \mathbf{v}_i = \mathbf{W}_V \mathbf{h}^{(local)}_i \tag{7}$$

The pairwise attention coefficients $A_{ij}$, representing the relevance of track $t_j$ to track $t_i$, are computed via the scaled dot-product:

$$A_{ij} = \mathcal{S}\left(\frac{\mathbf{q}_i^T \mathbf{k}_j}{\sqrt{D}}\right) \tag{8}$$

where $\mathcal{S}$ denotes the softmax function applied across all tracks $j \in \mathcal{V}_{track}$. The global context vector $\mathbf{h}^{(global)}_i \in \mathbb{R}^D$ is obtained by aggregating the values weighted by the attention coefficients, followed by a Feed-Forward Network (FFN), residual connection, and layer normalization:

$$\mathbf{h}^{(global)}_i = \mathrm{LayerNorm}\left(\mathbf{h}^{(local)}_i + \mathrm{FFN}\left(\sum_{j=1}^{N_{track}} A_{ij}\mathbf{v}_j\right)\right) \tag{9}$$

3.3.2 Glocal Fusion. Finally, to preserve the strong geometric gradients learned by the DA-GAT, we fuse the global and local representations through summation:

$$\mathbf{z}_i = \mathbf{h}^{(local)}_i + \mathbf{h}^{(global)}_i \tag{10}$$

The resulting vector $\mathbf{z}_i \in \mathbb{R}^D$ serves as the input to the gating mechanism, encoding both the dense collinear structure of the jet and the global event context.

3.4 Pileup Suppression Gate (PSG)
We introduce the Pileup Suppression Gate (PSG), a learnable mechanism inspired by the phenomenological PUPPI algorithm [8]. While standard attention mechanisms implicitly down-weight noisy features, they do not explicitly remove them. The PSG acts as a differentiable soft-mask filter, explicitly predicting the probability that a track originates from the hard-scatter vertex.

3.4.1 Signal Probability Estimation. The fused feature vector $\mathbf{z}_i$ is passed through a Multi-Layer Perceptron (MLP) to compute a scalar signal probability score $\hat{s}_i \in [0, 1]$:

$$\hat{s}_i = \sigma\left(\mathbf{w}_{gate}^T \cdot \mathrm{Dropout}\left(\mathrm{ReLU}\left(\mathbf{W}_{gate}\mathbf{z}_i + \mathbf{b}_{gate}\right)\right)\right) \tag{11}$$

where $\mathbf{W}_{gate} \in \mathbb{R}^{d_{gate} \times D}$, $\mathbf{w}_{gate} \in \mathbb{R}^{d_{gate}}$, $\mathbf{b}_{gate} \in \mathbb{R}^{d_{gate}}$, and σ is the sigmoid function. This score $\hat{s}_i$ serves as a track-level confidence metric, providing interpretability: values near 1 indicate signal-like kinematics, while values near 0 indicate pileup.

3.4.2 Differentiable Filtering.
We apply this score to the feature vector via element-wise multiplication, effectively suppressing the influence of identified pileup tracks before aggregation:

$$\tilde{\mathbf{z}}_i = \hat{s}_i \cdot \mathbf{z}_i \tag{12}$$

This operation retains the feature direction for signal tracks while shrinking pileup vectors toward zero, effectively mitigating pileup.

3.5 Hypergraph Attention Aggregation
We purify jets by aggregating the filtered constituent tracks via a hypergraph attention mechanism. This allows the model to dynamically weight tracks based on their relevance to the specific jet cluster, rather than relying on fixed global pooling.

3.5.1 Feature Projection. We project the raw features of jets and the filtered features of tracks into a shared latent space $\mathbb{R}^D$. For each jet node $j_k$, we compute a query embedding $\mathbf{h}^J_k$ below from its initial features $\mathbf{x}^{jet}_k \in \mathbb{R}^4$, where $\mathbf{W}_J \in \mathbb{R}^{D \times 4}$ and $\mathbf{b}_J \in \mathbb{R}^D$:

$$\mathbf{h}^J_k = \mathrm{GELU}\left(\mathrm{LayerNorm}\left(\mathbf{W}_J \mathbf{x}^{jet}_k + \mathbf{b}_J\right)\right) \tag{13}$$

Simultaneously, for each track $t_i$, we project its filtered feature vector $\tilde{\mathbf{z}}_i \in \mathbb{R}^D$ to obtain a key embedding $\mathbf{h}^T_i$ below, where $\mathbf{W}_{T'} \in \mathbb{R}^{D \times D}$ and $\mathbf{b}_{T'} \in \mathbb{R}^D$:

$$\mathbf{h}^T_i = \mathrm{GELU}\left(\mathbf{W}_{T'} \tilde{\mathbf{z}}_i + \mathbf{b}_{T'}\right) \tag{14}$$

3.5.2 Bipartite Attention Mechanism. We model aggregation as message passing on a bipartite graph where edges flow from constituent tracks to their parent jets. For a given jet $j_k$ and its constituent set $\mathcal{N}(j_k) = \{t_i \mid (t_i, j_k) \in \mathcal{E}_{hyper}\}$, we compute the attention coefficients $\beta_{ki}$ using a dynamic graph attention mechanism:

$$\beta_{ki} = \mathcal{S}\left(\mathrm{LeakyReLU}\left(\mathbf{a}^T\left[\mathbf{h}^J_k \,\|\, \mathbf{h}^T_i\right]\right)\right) \tag{15}$$

where $\mathbf{a} \in \mathbb{R}^{2D}$ is the attention vector, and $\mathcal{S}$ is the softmax function that normalizes scores across the constituent set $t_i \in \mathcal{N}(j_k)$. The aggregated jet representation is then computed as the weighted sum of the constituent embeddings:

$$\mathbf{h}^{agg}_k = \sum_{t_i \in \mathcal{N}(j_k)} \beta_{ki} \cdot \mathbf{h}^T_i \tag{16}$$

3.5.3 Final Jet Embedding.
To stabilize the regression, we fuse the cleaned information $\mathbf{h}^{agg}_k$ with the original raw jet embedding $\mathbf{h}^J_k$, which serves as a stable prior for the total energy scale. The final jet representation $\mathbf{h}^{final}_k$ is obtained via concatenation and non-linear projection below, where $\mathbf{W}_{fuse} \in \mathbb{R}^{D \times 2D}$:

$$\mathbf{h}^{final}_k = \mathrm{Dropout}\left(\mathrm{ReLU}\left(\mathbf{W}_{fuse}\left[\mathbf{h}^J_k \,\|\, \mathbf{h}^{agg}_k\right]\right)\right) \tag{17}$$

3.6 Task-Specific Prediction Heads
The final jet embedding $\mathbf{h}^{final}_k$ encodes both the stable global energy scale and the cleaned local jet context. To decouple the tasks of energy and mass reconstruction, we pass this shared representation through two separate Multi-Layer Perceptrons (MLPs), denoted as $\Phi_E$ and $\Phi_M$. Each head projects the latent vector to a scalar correction factor, constrained to the range [0, 1] via the sigmoid (σ):

$$\hat{y}_{E,k} = \sigma\left(\Phi_E(\mathbf{h}^{final}_k)\right), \qquad \hat{y}_{M,k} = \sigma\left(\Phi_M(\mathbf{h}^{final}_k)\right) \tag{18}$$

$\hat{y}_{E,k}$ and $\hat{y}_{M,k}$ represent the estimated fractions of the raw jet's energy and mass attributable to the hard-scatter interaction, respectively.

3.7 Joint Learning Objective
To enforce both accurate jet purification and accurate track classification, we train PhyGHT with a multi-task objective. The total loss $\mathcal{L}_{total}$ is a weighted sum of the primary regression loss and an auxiliary classification loss:

$$\mathcal{L}_{total} = \mathcal{L}_{reg} + \lambda \mathcal{L}_{aux} \tag{19}$$

where λ controls the influence of the physics-guided supervision.

3.7.1 Regression Loss. The primary objective is to minimize the error in the predicted correction factors for all jets in the event. We employ the Mean Squared Error (MSE) between the predicted fractions $(\hat{y}_{E,k}, \hat{y}_{M,k})$ and the ground-truth ratios $(y_{E,k}, y_{M,k})$:

$$\mathcal{L}_{reg} = \frac{1}{N_{jet}} \sum_{k=1}^{N_{jet}} \left[(\hat{y}_{E,k} - y_{E,k})^2 + (\hat{y}_{M,k} - y_{M,k})^2\right] \tag{20}$$

3.7.2 Auxiliary Classification Loss. To ensure the PSG learns to correctly identify signal particles, we use a Binary Cross-Entropy loss on the track-level scores $\hat{s}_i$.
This forces the latent soft-mask mechanism to align with the true physical vertex association $y^{label}_i \in \{0, 1\}$:

$$\mathcal{L}_{aux} = -\frac{1}{N_{track}} \sum_{i=1}^{N_{track}} \left[y^{label}_i \log(\hat{s}_i) + (1 - y^{label}_i)\log(1 - \hat{s}_i)\right] \tag{21}$$

By jointly optimizing $\mathcal{L}_{aux}$ and $\mathcal{L}_{reg}$, the model learns to filter pileup explicitly while refining the aggregated jet properties, resulting in physical corrections to jet energy and mass.

4 Experiments and Results
We evaluate the proposed PhyGHT framework on the basis of the following research questions. Q1: How does PhyGHT compare to state-of-the-art baselines in recovering signal observables under extreme pileup conditions? Q2: How does the computational efficiency of PhyGHT compare to baseline models for offline reconstruction? Q3: What is the contribution of each architectural component to the overall model performance and robustness? Q4: Does the interpretable PSG gate effectively distinguish signal from background compared to existing physics algorithms? Q5: Can PhyGHT restore precision for downstream physics tasks, such as the invariant mass resolution of the top quark?

4.1 Experimental Setup
4.1.1 Dataset. Using the $t\bar{t}$ collision dataset detailed in Section 2, we evaluate performance under two distinct pileup scenarios: standard LHC (⟨μ⟩ = 60) and extreme HL-LHC (⟨μ⟩ = 200). For each scenario, the dataset of 10k events is split into 80/10/10 proportions for training, validation, and testing, respectively.

Table 1: Performance comparison on the test set. We report the Coefficient of Determination (R²) for the energy (ŷ_E) and mass (ŷ_M) correction factors across two pileup scenarios: ⟨μ⟩ ∈ {60, 200}.

| ⟨μ⟩ | Target | Transformer | GNN | HGNN | GAT | HGAT | PUPPI | ParticleNet | PUMINet | PhyGHT (Ours) |
|-----|--------|-------------|-------|-------|-------|-------|-------|-------------|---------|---------------|
| 60  | Energy | 0.812 | 0.841 | 0.821 | 0.839 | 0.792 | 0.769 | 0.869 | 0.934 | 0.943 |
| 60  | Mass   | 0.634 | 0.694 | 0.662 | 0.683 | 0.621 | 0.549 | 0.748 | 0.838 | 0.869 |
| 200 | Energy | 0.778 | 0.837 | 0.798 | 0.792 | 0.756 | 0.348 | 0.853 | 0.926 | 0.932 |
| 200 | Mass   | 0.567 | 0.643 | 0.587 | 0.591 | 0.595 | 0.114 | 0.693 | 0.805 | 0.836 |

Figure 4: 1D distribution and 2D correlation plots for Energy (ŷ_E) and Mass (ŷ_M) at ⟨μ⟩ = 60 (top row) and ⟨μ⟩ = 200 (bottom row).

4.1.2 Baselines. We benchmark PhyGHT against the standard physics algorithm PUPPI [8]. To assess geometric deep learning performance, we compare against GNN [21], GAT [28], HGNN [20], and HGAT [6]. We also evaluate the sequential Transformer [5] to isolate the impact of pure attention mechanisms. Finally, we benchmark against specialized models for high-energy physics, specifically ParticleNet [29] and PUMINet [32]. Please refer to Appendix A for detailed baseline descriptions.

4.1.3 Implementation Details. All experiments were conducted on a single NVIDIA A10 GPU. To ensure a fair comparison, all models were trained for 200 epochs using the AdamW optimizer with a learning rate of 3 × 10⁻⁴. We used a batch size of 16 for ⟨μ⟩ = 60 and 4 for ⟨μ⟩ = 200. For PhyGHT, we set the nearest-neighbor count k = 8 and the auxiliary loss weight λ_aux = 0.1, based on the hyperparameter analysis in Section 4.3.2. Additional hyperparameter details are provided in Appendix B.

4.2 Results
We evaluate the model using the Coefficient of Determination (R²). Results with additional metrics (MAE, MSE, RMSE) are available in Appendix C.
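For reference, the R² metric reported in all tables can be computed as follows. This is the standard definition, shown only to fix conventions; the variable names are illustrative.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

# Toy check: perfect predictions give R² = 1; predicting the mean gives 0.
y = np.array([0.9, 0.7, 0.95, 0.6])
assert r2_score(y, y) == 1.0
```

R² = 1 indicates perfect recovery of the correction factors, while a model no better than predicting the mean scores 0, which is why the PUPPI entries near 0.1 in Table 1 indicate a near-total failure under ⟨μ⟩ = 200.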
In all reported results, bold indicates the best performance, while underlined values denote the second-best results.

4.2.1 Reconstruction Accuracy. As shown in Table 1, PhyGHT consistently outperforms all baselines across both standard (⟨μ⟩ = 60) and extreme (⟨μ⟩ = 200) pileup scenarios. PhyGHT exhibits exceptional precision in predicting the mass correction factor. This gain is driven by the distance-aware graph attention of the local block, which explicitly weights particle interactions based on spatial proximity and preserves the angular correlations critical for accurate signal purification. The regression plots in Figure 4 further demonstrate the superiority of PhyGHT in pileup mitigation. The 1D distributions (Figs. 4a, 4c, 4e, 4g) confirm that PhyGHT closely tracks the ground-truth spectrum across the entire jet frequency spectrum, while the 2D correlation densities (Figs. 4b, 4d, 4f, 4h) show a tight diagonal alignment, demonstrating robust per-jet reconstruction even in the high-energy tails.

4.2.2 Resolution Analysis. To quantify the precision of our reconstruction, we analyze the resolution distributions defined by the relative error $(y_{pred} - y_{true})/y_{true}$. Figure 5 compares PhyGHT against the top-performing baselines, PUMINet and ParticleNet.

Figure 5: Comparison of Energy and Mass Resolutions of PhyGHT with baselines at ⟨μ⟩ = 60 and ⟨μ⟩ = 200.

PhyGHT (red curve) exhibits the sharpest peak centered at zero for both energy and mass, indicating minimal bias and the lowest variance among all methods. This high resolution confirms the effectiveness of our model not only in recovering the hard-scatter energy and mass of each jet, but also in effectively filtering pileup tracks.

4.2.3 Computational Efficiency. Table 2 benchmarks model performance against computational cost.
Despite a marginal increase in parameter count, PhyGHT achieves the lowest latency, delivering a 1.9x and 8.7x speedup over PUMINet and ParticleNet, respectively, at ⟨μ⟩ = 200. This efficiency stems from fundamental structural differences. Baselines suffer from compounding overheads: ParticleNet dynamically recomputes neighbor graphs at every layer, while PUMINet repeats expensive quadratic attention three times in sequence. In contrast, PhyGHT computes the local k-NN graph only once and restricts dense global operations to a single block. By handling local and jet-level aggregation with efficient sparse graph layers, our model avoids recalculating the global event context multiple times. This ensures that latency scales favorably with event complexity, enabling high throughput for offline pileup mitigation.

Table 2: Comparison of performance, model size, and inference latency across a single event for 1000 runs.

| ⟨μ⟩ | Model | Energy (R²) ↑ | Mass (R²) ↑ | Params (M) ↓ | Latency (ms) ↓ |
|-----|-------------|-------|-------|------|-------|
| 60  | ParticleNet | 0.869 | 0.748 | 0.17 | 130.1 |
| 60  | PUMINet     | 0.934 | 0.838 | 0.85 | 12.4  |
| 60  | PhyGHT      | 0.943 | 0.869 | 0.95 | 10.0  |
| 200 | ParticleNet | 0.853 | 0.693 | 0.17 | 353.1 |
| 200 | PUMINet     | 0.926 | 0.805 | 0.85 | 78.2  |
| 200 | PhyGHT      | 0.932 | 0.836 | 0.95 | 40.4  |

Table 3: Impact of removing key components from PhyGHT.

| Metric (R²) | PhyGHT | w/o Local | w/o Global | w/o PSG | w/o Hypergraph |
|--------|-------|-------|-------|-------|-------|
| Energy | 0.943 | 0.903 | 0.848 | 0.915 | 0.891 |
| Mass   | 0.869 | 0.813 | 0.715 | 0.837 | 0.824 |

Table 4: Impact of Local DA-GAT neighborhood size (k) on regression performance. For k = 0, the local block is removed.

| Metric (R²) | k=0 | k=2 | k=4 | k=8 | k=16 | k=20 | k=32 |
|--------|-------|-------|-------|-------|-------|-------|-------|
| Energy | 0.903 | 0.932 | 0.938 | 0.943 | 0.934 | 0.928 | 0.921 |
| Mass   | 0.813 | 0.851 | 0.856 | 0.869 | 0.858 | 0.853 | 0.849 |

Table 5: Impact of the auxiliary classification loss weight (λ_aux) on overall regression performance.

| Metric (R²) | λ=0.0 | λ=0.1 | λ=0.25 | λ=0.5 | λ=0.75 | λ=1.0 |
|--------|-------|-------|-------|-------|-------|-------|
| Energy | 0.915 | 0.943 | 0.940 | 0.938 | 0.935 | 0.931 |
| Mass   | 0.837 | 0.869 | 0.865 | 0.862 | 0.859 | 0.855 |

4.3 Ablation Study
4.3.1 Impact of Architectural Components.
Table 3 highlights the contribution of each module to the model's predictive power by removing key components from PhyGHT. The Global Context block is the most critical component, as its removal isolates the model from the event-wide context required to estimate background pileup density. Among the structural components, the Hypergraph Aggregation is the most influential for energy accuracy, enabling the network to selectively gather signal-dominant tracks while ignoring background fluctuations. In contrast, the Local Geometric block is indispensable for mass recovery, as preserving local angular correlations is essential for accurately reconstructing the jet's invariant mass. Finally, PSG provides a crucial layer of refinement by explicitly filtering noisy constituents before they reach the aggregation stage. Implementation details are in Appendix F.

4.3.2 Sensitivity to Hyperparameters. We examine the model's sensitivity to key hyperparameters in Tables 4 and 5. The local neighborhood size exhibits a clear optimum at k = 8, representing the ideal trade-off where the model captures necessary structural correlations without integrating distant pileup noise. For the gating mechanism, a low auxiliary weight of λ_aux = 0.1 proves most effective, as it scales the auxiliary classification loss to match the magnitude of the primary regression objective. This prevents the gating task from dominating the optimization process.

Figure 6: Mass Reconstruction after PhyGHT correction

4.4 Physics Study

4.4.1 Top Quark Mass Reconstruction. Figures 6a and 6b illustrate the reconstruction of the top quark invariant mass, a standard benchmark for practical physics analysis. Heavily distorted by pileup, the uncorrected response (red) shows a mass resonance that is shifted, with substantial broadening.
PhyGHT successfully mitigates pileup contamination (blue) and nearly matches the ground-truth mass resonance (green). Since the top quark mass resonance obtained from PhyGHT's predictions agrees well with the ground truth, we demonstrate that the model can be used in real-world physics analysis. Additionally, we have reduced the pileup mitigation task to two simple correction factors, which simplifies the overall pileup mitigation workflow compared to existing methods. The detailed methodology for selecting candidate jets and reconstructing the invariant mass is provided in Appendix D.

4.4.2 Pileup Suppression Efficacy. Figure 7 evaluates track-level classification performance under high pileup conditions (⟨μ⟩ = 200). PhyGHT achieves a near-perfect ROC curve, demonstrating effective discrimination between signal and pileup. To contextualize this, we benchmark against PUPPI [8], a standard statistical technique that reweights particles based on their likelihood of originating from a hard scatter. We also compare against SoftKiller [12], a geometric approach that applies a median-based p_T cut within grid patches to remove background. As shown in the plot, PhyGHT significantly outperforms both baselines: it maintains high signal efficiency where PUPPI struggles to separate classes, and it avoids the signal loss inherent to SoftKiller's hard fixed cuts. PUPPI's inability to cleanly separate signal from noise directly leads to the poor energy and mass reconstruction seen in Table 1. This confirms that our physics-guided architecture offers significant improvements in pileup mitigation accuracy.

5 Discussion

The experimental results demonstrate that explicitly modeling the hierarchical topology of particle collisions improves pileup mitigation. The Global Context block is crucial because it enables the model to analyze the entire event to estimate background density.
The Distance-Aware GAT significantly improves jet purification by preserving essential angular correlations within the jet's local context. PSG effectively filters noise, ensuring the model processes only physically relevant information. This is complemented by the Hypergraph Aggregation mechanism, which allows the model to dynamically attend to signal-dominant tracks while ignoring pileup fluctuations. This highlights a fundamental difference in motivation from the original hypergraph neural networks, which were designed to model multi-modal connectivity (e.g., in social media) [20]; instead, we leverage this topology to enforce the physical hierarchy of particle interactions. Beyond high-energy physics, this framework offers a generalizable solution for any domain that requires separating dense, local signal clusters from global environmental noise. This architecture may be applicable to any heterogeneous graph problem where signal and background are topologically distinct, such as denoising 3D point clouds in autonomous driving or detecting anomalous communities in large-scale social networks.

Figure 7: ROC Curve for PSG Track Classification at ⟨μ⟩ = 200

6 Conclusion

In this work, we proposed PhyGHT, a physics-guided hypergraph transformer designed to recover hard-scatter observables at the HL-LHC. By leveraging a hierarchical architecture that synergizes global context with local geometry, the model surpasses existing state-of-the-art baselines in pileup suppression. Beyond predictive performance, the model achieves significantly lower inference latency, demonstrating its suitability for practical physics analysis, such as jet reconstruction. Additionally, we introduced a novel simulated dataset of top quark pair production in high-pileup environments to rigorously benchmark these capabilities.
We open-source our data to bridge a crucial gap between machine learning and high-energy physics, fostering interdisciplinary collaboration in order to deliver AI solutions to the frontier of science.

7 Limitations and Ethical Considerations

This study establishes PhyGHT's performance using simulations that closely mimic the HL-LHC environment. Our next step is to integrate the model into the ATLAS software framework to process real collision data with full detector effects. We also plan to extend this architecture to other complex physics signatures, such as di-Higgs production and W/Z boson decays. Regarding ethical considerations, this work relies solely on simulated particle physics data and involves no human subjects or conceivable societal risks.

Acknowledgments

This work is supported by the U.S. Department of Energy (DoE) grant DE-SC0024669.

References

[1] 2014. Tagging and suppression of pileup jets with the ATLAS detector. Technical Report. CERN, Geneva. https://cds.cern.ch/record/1700870
[2] 2022. Graph Neural Network Jet Flavour Tagging with the ATLAS Detector. Technical Report. CERN, Geneva. https://cds.cern.ch/record/2811135 All figures including auxiliary figures are available at https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PUBNOTES/ATL-PHYS-PUB-2022-027.
[3] O. Aberle, C. Adorisio, A. Adraktas, M. Ady, J. Albertone, L. Alberty, M. Alcaide Leon, A. Alekou, D. Alesini, B. Almeida Ferreira, et al. 2020. High-luminosity large hadron collider (HL-LHC): Technical design report. (2020).
[4] Johan Alwall, Rikkert Frederix, Stefano Frixione, Valentin Hirschi, Fabio Maltoni, Olivier Mattelaer, Hua-Sheng Shao, Tim Stelzer, Paolo Torrielli, and Marco Zaro. 2014. The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations. Journal of High Energy Physics 2014 (2014).
https://api.semanticscholar.org/CorpusID:256012920
[5] Ashish Vaswani et al. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
[6] Song Bai, Feihu Zhang, and Philip H. S. Torr. 2021. Hypergraph convolution and hypergraph attention. Pattern Recognition 110 (2021), 107637.
[7] Richard D. Ball, Valerio Bertone, Stefano Carrazza, Christopher S. Deans, Luigi Del Debbio, Stefano Forte, Alberto Guffanti, Nathan P. Hartland, José I. Latorre, Juan Rojo, et al. 2013. Parton distributions with LHC data. Nuclear Physics B 867, 2 (2013), 244–289.
[8] Daniele Bertolini, Philip Harris, Matthew Low, and Nhan Tran. 2014. Pileup per particle identification. Journal of High Energy Physics 2014, 10 (2014). doi:10.1007/JHEP10(2014)059
[9] Christian Bierlich, Smita Chakraborty, Nishita Desai, Leif Gellersen, Ilkka Helenius, Philip Ilten, Leif Lönnblad, Stephen Mrenna, Stefan Prestel, Christian Tobias Preuss, et al. 2022. A comprehensive guide to the physics and usage of PYTHIA 8.3. SciPost Physics Codebases (2022), 008.
[10] Matteo Cacciari, Gavin P. Salam, and Gregory Soyez. 2008. The anti-kt jet clustering algorithm. Journal of High Energy Physics 2008, 04 (2008), 063.
[11] Matteo Cacciari, Gavin P. Salam, and Gregory Soyez. 2012. FastJet user manual (for version 3.0.2). The European Physical Journal C 72, 3 (2012), 1896.
[12] Matteo Cacciari, Gavin P. Salam, and Gregory Soyez. 2015. SoftKiller, a particle-level pileup removal method. The European Physical Journal C 75, 2 (2015), 59.
[13] Benjamin T. Carlson, Stephen T. Roche, Michael Hemmett, and Tae Min Hong. 2025. Ring-based ML calibration with in situ pileup correction for real-time jet triggers. arXiv:2507.16686 [hep-ph] https://arxiv.org/abs/2507.16686
[14] ATLAS Collaboration. 2026. Expected Transformer-Network-based jet flavour tagging performance with the ATLAS Inner Tracker Detector at the High-Luminosity LHC. In preparation.
[15] ATLAS Collaboration et al.
2021. Expected tracking and related performance with the updated ATLAS Inner Tracker layout at the High-Luminosity LHC. Technical Report. LHC/ATLAS Experiment.
[16] The ATLAS Collaboration. 2014. ATLAS Pythia 8 tunes to 7 TeV data. Technical Report ATL-PHYS-PUB-2014-021. CERN. https://cds.cern.ch/record/1966419
[17] The ATLAS Collaboration and G. Aad et al. 2008. The ATLAS Experiment at the CERN Large Hadron Collider. Journal of Instrumentation 3, 08 (Aug. 2008), S08003. doi:10.1088/1748-0221/3/08/S08003
[18] The CMS Collaboration and S. Chatrchyan et al. 2008. The CMS experiment at the CERN LHC. Journal of Instrumentation 3, 08 (Aug. 2008), S08004. doi:10.1088/1748-0221/3/08/S08004
[19] Fawad Ali et al. 2024. HSF/phoenix: v3.0.3. doi:10.5281/zenodo.14203030
[20] Yifan Feng, Haoxuan You, Zizhao Zhang, Rongrong Ji, and Yue Gao. 2019. Hypergraph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 3558–3565.
[21] Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. Advances in Neural Information Processing Systems 30 (2017).
[22] Nilotpal Kakati, Etienne Dreyer, Anna Ivina, Francesco Armando Di Bello, Lukas Heinrich, Marumi Kado, and Eilam Gross. 2025. HGPflow: Extending Hypergraph Particle Flow to Collider Event Reconstruction. arXiv:2410.23236 [hep-ex] https://arxiv.org/abs/2410.23236
[23] Gregor Kasieczka, Benjamin Nachman, David Shih, Oz Amram, Anders Andreassen, Kees Benkendorfer, Blaz Bortolato, Gustaaf Brooijmans, Florencia Canelli, Jack H. Collins, et al. 2021. The LHC Olympics 2020: a community challenge for anomaly detection in high energy physics. Reports on Progress in Physics 84, 12 (2021), 124201.
[24] Patrick T. Komiske, Eric M. Metodiev, Benjamin Nachman, and Matthew D. Schwartz. 2017. Pileup Mitigation with Machine Learning (PUMML). Journal of High Energy Physics 2017, 12 (Dec. 2017).
doi:10.1007/jhep12(2017)051
[25] B. Maier, S. M. Narayanan, G. de Castro, M. Goncharov, Ch. Paus, and M. Schott. 2022. Pile-up mitigation using attention. Machine Learning: Science and Technology 3, 2 (June 2022), 025012. doi:10.1088/2632-2153/ac7198
[26] Jesus Arjona Martinez, Olmo Cerri, Maurizio Pierini, Maria Spiropulu, and Jean-Roch Vlimant. 2019. Pileup mitigation at the Large Hadron Collider with Graph Neural Networks. arXiv:1810.07988 [hep-ph] https://arxiv.org/abs/1810.07988
[27] V. Mikuni and F. Canelli. 2020. ABCNet: an attention-based method for particle tagging. The European Physical Journal Plus 135, 6 (June 2020). doi:10.1140/epjp/s13360-020-00497-3
[28] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. In International Conference on Learning Representations.
[29] Huilin Qu and Loukas Gouskos. 2020. Jet tagging via particle clouds. Physical Review D 101, 5 (2020), 056019.
[30] Huilin Qu, Congqiao Li, and Sitian Qian. 2024. Particle Transformer for Jet Tagging. arXiv:2202.03772 [hep-ph] https://arxiv.org/abs/2202.03772
[31] David Rousseau, Sabrina Amrouche, Paolo Calafiura, Victor Estrade, Steven Farrell, Cecile Germain, Vladimir Gligorov, Tobias Golling, Heather Gray, Isabelle Guyon, et al. 2018. The TrackML Particle Tracking Challenge. (2018).
[32] Luke Vaughan, Mohammed Rakib, Shivang Patel, Flera Rizatdinova, Alexander Khanov, and Arunkumar Bagavathi. 2025. PileUp Mitigation at the HL-LHC Using Attention for Event-Wide Context. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 342–353.

Appendix A Baseline Descriptions

A.1 Graph Neural Network (GNN) [21]

To evaluate the efficacy of standard message-passing frameworks, we implement a baseline based on the GraphSAGE architecture.
We model the collision event as a heterogeneous graph containing two distinct node types: tracks and jets. The graph connectivity is defined by three specific edge sets: (1) local geometric edges connecting tracks to their spatial neighbors (ΔR < 0.4) to encode local particle density; (2) hierarchical edges linking tracks to their constituent jets based on clustering history, enabling bidirectional information flow; and (3) global context edges connecting jets to other nearby jets (ΔR < 0.8). We employ SAGEConv layers to perform mean aggregation across these relations, progressively updating the jet node embeddings to regress the final energy and mass correction factors.

A.2 Graph Attention Network (GAT) [28]

To assess the impact of dynamic feature weighting versus static aggregation, we implement a baseline based on the GATv2 architecture. This model employs the same heterogeneous graph topology as the GNN baseline, using the same node features and edge sets. In contrast to the static mean pooling of GraphSAGE, we employ GATv2Conv layers to compute learnable attention coefficients for every edge. This allows the network to assign adaptive weights to local neighbors, constituent tracks, and surrounding jets, dynamically prioritizing informative connections during the message-passing phase before regressing the final correction factors.

A.3 Hypergraph Neural Network (HGNN) [20]

To determine if simply grouping tracks into jets is sufficient without auxiliary spatial connections, we implement a baseline based on the HGNN architecture. In this model, we treat jets as hyperedges that connect variable-sized sets of track nodes. Unlike the GNN and GAT baselines, we discard all pairwise geometric edges, relying entirely on the bipartite structure defined by the jet clustering history. The network employs HypergraphConv layers to propagate information using fixed weights determined simply by the number of connected tracks, rather than learnable attention.
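In its simplest form, this fixed-weight propagation reduces to degree-normalized averaging over each jet hyperedge. A minimal sketch (toy features; all names and values are our own, not the paper's implementation):

```python
# Toy event: five track feature vectors and two jets (hyperedges) given as
# index sets from the clustering history.
tracks = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
jets = {"jet0": [0, 1, 2], "jet1": [3, 4]}

def hyperedge_mean(tracks, members):
    """Fixed-weight aggregation: each track contributes 1/|jet| (the inverse
    hyperedge degree), i.e. a plain average over the jet's constituents."""
    dim = len(tracks[0])
    return [sum(tracks[i][d] for i in members) / len(members) for d in range(dim)]

jet_embeddings = {name: hyperedge_mean(tracks, m) for name, m in jets.items()}
print(jet_embeddings)
```

Every constituent carries the same weight here, which is exactly what the attention-enabled variants (HGAT, and PhyGHT's own aggregation) replace with learned per-track scores.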
After this exchange, the updated track features are averaged and concatenated with the original raw jet features to form the final embedding used for regression.

A.4 Hypergraph Attention Network (HGAT) [6]

To evaluate the benefit of dynamic feature weighting within the bipartite structure, we implement a baseline based on the HGAT architecture. Similar to the HGNN, this model represents the event strictly as tracks connected to jet hyperedges, ignoring spatial neighbor connections. However, instead of the fixed averaging used in HGNN, we employ attention-enabled HypergraphConv layers. By explicitly utilizing the jet features as context, the network calculates a learnable attention score for every track-jet pair. This allows the model to dynamically down-weight pileup tracks and prioritize signal constituents during the aggregation. Finally, these refined track representations are concatenated with the original raw jet features to form the final embedding used for regression.

A.5 ParticleNet [29]

To benchmark against a standard model in high-energy physics, we implement ParticleNet. This architecture was originally introduced for jet tagging tasks (identifying the type of particle that initiated a jet) by treating the jet as an unordered cloud of particles. It uses Dynamic Graph CNNs, where the model continually finds new neighbors for each track based on learned features rather than fixed physical positions. To adapt this model for our pileup mitigation task, we modified the final output stage. Instead of combining all tracks into a single classification score for the whole event, we group the refined track features back into their specific parent jets. We then average these features to predict the energy and mass correction factors for each individual jet.

A.6 Transformers [5]

To evaluate the capability of pure self-attention mechanisms to learn event topology without explicit structural bias, we implement a baseline based on the standard Transformer architecture.
In this approach, we flatten the hierarchical event structure into a single sequence. We project both tracks and jets into a shared latent space, distinguishing them via learnable type embeddings. These tokens are concatenated to form a unified input sequence containing all jets and tracks in the event. We employ a standard Transformer Encoder to process this sequence, enabling every token to attend to every other token via global self-attention. Finally, we slice out the updated embeddings corresponding to the jet tokens to regress the energy and mass correction factors.

A.7 PUMINet [32]

To compare against a state-of-the-art model specifically designed for pileup mitigation, we implement PUMINet. Unlike the standard Transformer, which flattens the event, PUMINet preserves the structure where tracks belong to specific jets. The model processes the event through stacked blocks, each performing a specific sequence of updates. First, tracks use local self-attention to exchange information only with other tracks inside the same jet, effectively learning the jet's internal shape. These refined track features are then averaged and combined with the jet's features. Next, the tracks use global self-attention to look at every other track in the event, capturing the overall pileup density. Finally, the jets use cross-attention to look at this global track context, allowing them to adjust their energy and mass predictions based on the surrounding event noise.

A.8 PUPPI [8]

The PUPPI algorithm is implemented by initializing track 4-vectors and applying cuts on particles with transverse momentum p_T < 1 GeV, |η| > 4.0, and on neutral particles. Tracks are paired within specific ΔR ranges (0.02 to 0.3) to calculate local shape parameters α_i based on the momentum-weighted distances of neighboring tracks.
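This local shape computation can be sketched with the stdlib alone, following the α_i = log(Σ_j p_T(T_j)/ΔR(T_i, T_j)) form spelled out in Appendix E; the toy kinematics below are illustrative:

```python
import math

def delta_r(t1, t2):
    """Angular distance in (eta, phi), with dphi wrapped into [-pi, pi]."""
    deta = t1["eta"] - t2["eta"]
    dphi = (t1["phi"] - t2["phi"] + math.pi) % (2 * math.pi) - math.pi
    return math.hypot(deta, dphi)

def alpha(i, tracks, r_min=0.02, r_max=0.3):
    """Local shape parameter: log of the pT-over-distance sum of neighbors
    within the annulus r_min < dR < r_max around track i."""
    s = 0.0
    for j, t in enumerate(tracks):
        if j == i:
            continue
        dr = delta_r(tracks[i], t)
        if r_min < dr < r_max:
            s += t["pt"] / dr
    return math.log(s) if s > 0 else float("-inf")

# Toy tracks: a hard core, a nearby softer neighbor, and a distant track.
tracks = [
    {"pt": 30.0, "eta": 0.00, "phi": 0.00},
    {"pt": 5.0,  "eta": 0.10, "phi": 0.05},
    {"pt": 1.5,  "eta": 2.00, "phi": 1.00},  # outside the annulus, ignored
]
print(f"alpha_0 = {alpha(0, tracks):.3f}")
```

Hard-scatter tracks sitting in dense, high-p_T neighborhoods yield large α values, which is what the subsequent χ² step exploits.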
Using truth labels to identify pileup contributions, we construct a χ² metric from the median and RMS of pileup α values, which is then converted to PUPPI weights via the cumulative χ² distribution function. These weights are applied to reweight jet constituents, allowing calculation of predicted energy and mass fractions that can be validated against true pileup labels through R² scores and ROC curves. Performance degrades significantly at ⟨μ⟩ = 200 compared to the original PUPPI paper at ⟨μ⟩ = 80, likely because our hard-scatter events lack generator-level filtering and appear more pileup-like. See validation plots in Appendix E.

Table 6: Detailed regression metrics for Energy (ŷ_E) and Mass (ŷ_M) correction factors across standard (⟨μ⟩ = 60) and high-pileup (⟨μ⟩ = 200) scenarios. Best results are bolded, second-best are underlined.

⟨μ⟩   Model           | Energy (ŷ_E)                      | Mass (ŷ_M)
                      | MSE↓    RMSE↓   MAE↓    R²↑       | MSE↓    RMSE↓   MAE↓    R²↑
60    ParticleNet     | 0.0044  0.0664  0.0429  0.8692    | 0.0030  0.0549  0.0366  0.7483
60    PUMINet         | 0.0022  0.0472  0.0308  0.9341    | 0.0019  0.0440  0.0289  0.8389
60    PhyGHT (Ours)   | 0.0019  0.0436  0.0281  0.9436    | 0.0016  0.0396  0.0257  0.8692
200   ParticleNet     | 0.0009  0.0305  0.0174  0.8536    | 0.0005  0.0212  0.0125  0.6935
200   PUMINet         | 0.0005  0.0216  0.0124  0.9265    | 0.0003  0.0169  0.0097  0.8052
200   PhyGHT (Ours)   | 0.0004  0.0207  0.0122  0.9322    | 0.0002  0.0155  0.0093  0.8364

B Hyperparameter Settings

To ensure a fair comparison, we utilized a consistent set of hyperparameters across all baselines and the proposed PhyGHT model. All experiments were conducted on a single NVIDIA A10 GPU using the AdamW optimizer with a fixed learning rate of 3 × 10⁻⁴ and a random seed of 42 for reproducibility.
We trained all models for 200 epochs, using a batch size of 16 for the standard pileup scenario (⟨μ⟩ = 60) and 4 for the high-pileup scenario (⟨μ⟩ = 200) to accommodate memory constraints. For all architectures, we set the hidden dimension to d = 128, the number of attention heads to 4, the network depth to L = 3 layers, and the dropout rate to 0.1. For the PhyGHT model specifically, we set the nearest-neighbor count for local graph construction to k = 8 and the auxiliary loss weight for the gating mechanism to λ_aux = 0.1.

C Extended Performance Metrics

Table 6 presents a comprehensive evaluation with MSE, RMSE, MAE, and R² for the proposed PhyGHT architecture compared to the two most competitive baselines: ParticleNet and PUMINet. Results are reported for both the standard (⟨μ⟩ = 60) and high-luminosity (⟨μ⟩ = 200) pileup scenarios. We observe that PhyGHT outperforms all baselines across all metrics in both pileup scenarios.

D Top Quark Mass Reconstruction Process

To reconstruct the top quark mass, we first apply the predicted corrections to each jet by rescaling the mass and energy according to the predicted fractions. We then define a corrected jet vector with a new momentum magnitude |p⃗| = √(E² − m²) and transverse momentum p_T = |p⃗| sech(η). This results in a corrected Lorentz 4-vector where the direction remains unchanged, but the kinematics are scaled to reflect only the hard-scatter contributions. Using truth labels, we trace the parton shower history via a depth-first search algorithm to identify final-state particles originating from the b-quark and W-boson of the top quark decay. We then select three candidate jets: the two containing the highest fraction of tracks from the W-boson are designated as W₁ and W₂, while the jet with the most particles from the b-quark is designated as B.
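The kinematic rescaling step at the start of this appendix can be sketched as follows; the function and values are illustrative, assuming only the direction-preserving correction described above:

```python
import math

def corrected_jet(E, m, eta, phi):
    """Rebuild the jet kinematics after the predicted corrections are applied:
    |p| = sqrt(E^2 - m^2) and pT = |p| * sech(eta) = |p| / cosh(eta).
    The direction (eta, phi) is left unchanged."""
    p = math.sqrt(E * E - m * m)
    pt = p / math.cosh(eta)
    return {"E": E, "m": m, "pt": pt, "eta": eta, "phi": phi}

# Toy corrected values: E = 150 GeV, m = 20 GeV at eta = 1.0.
jet = corrected_jet(150.0, 20.0, 1.0, 0.3)
print(f"pT = {jet['pt']:.2f} GeV")
```

Because only |p⃗| and the energy change, the jet axis used for the later candidate matching is unaffected by the correction.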
The top quark 4-vector is reconstructed by summing these candidates, T⃗ = W⃗₁ + W⃗₂ + B⃗, and the final invariant mass is calculated using the energy-momentum relation m = √(E² − |p⃗|²).

E PUPPI Validation

To implement the PUPPI algorithm on our dataset, we first initialize the Lorentz 4-vector of each track using [p_T, η, φ, m], where m ≈ 0, and each track carries charge q. Tracks with p_T < 1 GeV, |η| > 4.0, or q = 0 are cut from the dataset. Using the awkward library in Python, we find all possible pairs of tracks, [T_i, T_j], and cut all pairs with ΔR(T_i, T_j) > 0.3 or ΔR(T_i, T_j) < 0.02. For each T_i in all passing pairs, an ε parameter is calculated as ε_ij = p_T(T_j)/ΔR(T_i, T_j). Then, for each T_i, we calculate the local shape parameter α_i over the pairs that pass the ΔR cut:

α_i = log( Σ_j ε_ij )    (22)

Then we select the α_i originating from pileup using truth labels, and calculate their median, α_median^PU, and RMS, σ²_PU. We then construct a χ² metric using the following equation, where H is the Heaviside function:

χ²_i = H(α_i − α_median^PU) · (α_i − α_median^PU)² / σ²_PU    (23)

A PUPPI weight is then constructed using F_χ², the cumulative distribution function of the χ² distribution with a single degree of freedom:

w_i = F_{χ², NDF=1}(χ²_i)    (24)

After each track is assigned a PUPPI weight, we reweight each constituent of each jet accordingly. We then sum over the weighted 4-vectors of the set of tracks to calculate the predicted energy and mass fraction of each jet according to PUPPI weights. Since some particles with p_T < 1 GeV were cut from the dataset, we also recalculate the energy and mass fractions using true pileup labels for the remaining constituents. From these recalculated values, we can derive an R² score and ROC curve.
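The χ²-to-weight mapping of Eqs. (23)–(24) needs only the stdlib, since the one-degree-of-freedom χ² CDF is F(x) = erf(√(x/2)); the α population below is illustrative:

```python
import math
from statistics import median

def puppi_weight(alpha_i, alpha_med_pu, sigma2_pu):
    """chi2 = H(alpha_i - alpha_med) * (alpha_i - alpha_med)^2 / sigma2,
    mapped to a weight via the chi2(1 dof) CDF: F(x) = erf(sqrt(x/2))."""
    d = alpha_i - alpha_med_pu
    chi2 = (d * d / sigma2_pu) if d > 0 else 0.0  # Heaviside step
    return math.erf(math.sqrt(chi2 / 2.0))

# Illustrative pileup alpha population -> its median and mean-square spread.
alpha_pu = [0.5, 0.8, 1.0, 1.2, 1.5]
alpha_med = median(alpha_pu)
sigma2 = sum((a - alpha_med) ** 2 for a in alpha_pu) / len(alpha_pu)

print(puppi_weight(3.0, alpha_med, sigma2))  # hard-scatter-like: weight near 1
print(puppi_weight(0.9, alpha_med, sigma2))  # below the pileup median: weight 0
```

Tracks whose α sits far above the pileup median receive weights near one and survive the reweighting, while pileup-like tracks are suppressed toward zero.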
Note: at ⟨μ⟩ = 200 the same cuts were used for the PUPPI weights, but the performance sharply dropped, as shown in Figure 8. Since we do not apply a generator-level filter to hard-scatter events, our hard scatter appears more pileup-like than in the original PUPPI paper at ⟨μ⟩ = 80.

Figure 8: Validation Plots for implementation of the PUPPI algorithm

F Ablation Implementation Details

To understand how much each part of our model contributes to the final performance, we trained four different versions of PhyGHT. In each version, we removed or replaced exactly one component while keeping everything else the same. The specific changes are described below.

F.1 w/o Global Context

Here, we removed the Global Transformer block entirely. In the original model, this block allows tracks to attend to every other track in the event to estimate the overall pileup noise. By removing it, the model is forced to rely solely on local information for each track, without knowing what is happening in the rest of the event.

F.2 w/o Hypergraph Aggregation

Here, we replaced our specialized Jet Attention mechanism with a simple average. In the full model, the Hypergraph allows the jet to assign different importance weights to its tracks (e.g., focusing on high-energy signal tracks). In this ablation, the model simply takes the average feature of all tracks in the jet, treating them all as equally important.

F.3 w/o Local Geometric

In this version, we skipped the Distance-Aware GAT layer at the very beginning of the network. Normally, this layer helps tracks understand their immediate neighbors within the jet's dense core. By removing it, the raw input features are passed directly to the global block, preventing the model from explicitly learning local spatial relationships between nearby tracks.

F.4 w/o PSG (Pileup Suppression Gate)

We removed the soft gating network, which serves as a filter before the final aggregation.
In the full model, this component calculates a signal probability for each track and suppresses those that appear to be noise. For this ablation, we set this probability to 1.0 for every track, effectively turning off the filter and forcing the model to use all tracks, signal and pileup alike, to calculate the jet properties.
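The gate and its ablation can be sketched in a few lines; a sigmoid gate on toy logits, with the ablation pinning every probability to 1.0 (names and values are our own):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_features(features, gate_logits, ablate=False):
    """Scale each track's feature vector by its signal probability p_i.
    With ablate=True every p_i is forced to 1.0, disabling the filter."""
    probs = [1.0 if ablate else sigmoid(z) for z in gate_logits]
    return [[p * f for f in feats] for p, feats in zip(probs, features)]

tracks = [[2.0, 1.0], [0.5, 0.2]]   # toy track features
logits = [4.0, -4.0]                # signal-like vs pileup-like gate scores

print(gate_features(tracks, logits))               # pileup track suppressed
print(gate_features(tracks, logits, ablate=True))  # identical to the input
```

With the gate active, the pileup-like track is scaled to near zero before aggregation; with `ablate=True` the features pass through unchanged, reproducing the "w/o PSG" configuration.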