UCI-TR-2019-16, SLAC-PUB-17461

MadMiner: Machine learning–based inference for particle physics

Johann Brehmer,1,∗ Felix Kling,2,3,† Irina Espejo,1,‡ and Kyle Cranmer1,§

1 Center for Data Science and Center for Cosmology and Particle Physics, New York University, New York, NY 10003, USA
2 Department of Physics and Astronomy, University of California, Irvine, CA 92697, USA
3 SLAC National Accelerator Laboratory, 2575 Sand Hill Road, Menlo Park, CA 94025, USA

∗ johann.brehmer@nyu.edu
† felixk@slac.stanford.edu
‡ iem244@nyu.edu
§ kyle.cranmer@nyu.edu

Precision measurements at the LHC often require analyzing high-dimensional event data for subtle kinematic signatures, which is challenging for established analysis methods. Recently, a powerful family of multivariate inference techniques that leverage both matrix element information and machine learning has been developed. This approach neither requires the reduction of high-dimensional data to summary statistics nor any simplifications to the underlying physics or detector response. In this paper we introduce MadMiner, a Python module that streamlines the steps involved in this procedure. Wrapping around MadGraph5_aMC and Pythia 8, it supports almost any physics process and model. To aid phenomenological studies, the tool also wraps around Delphes 3, though it is extendable to a full Geant4-based detector simulation. We demonstrate the use of MadMiner in an example analysis of dimension-six operators in ttH production, finding that the new techniques substantially increase the sensitivity to new physics.

CONTENTS

I. Introduction
II. Inference techniques
   A. LHC measurements as a likelihood-free inference problem
   B. Learning the likelihood function
   C. Learning locally optimal observables
   D. The Fisher information
   E. Practical analysis aspects
   F. Recommendations for getting started
III. Using MadMiner
   A. Analysis specification and event generation
   B. Detector effects and observables
   C. Sample unweighting and augmentation
   D. Machine learning
   E. Inference
   F. Fisher information
IV. Physics example
   A. Illustration of analysis techniques
   B. Validation at parton level
   C. Realistic physics analysis
V. Conclusions
Acknowledgments
A. Frequently asked questions
References

I. INTRODUCTION

Precision measurements at the Large Hadron Collider (LHC) experiments search for direct and indirect signals of physics beyond the Standard Model. Statistically, this requires constraining a typically high-dimensional parameter space, for instance the Wilson coefficients in an effective field theory (EFT) or the couplings and masses in a supersymmetric model. The data going into these analyses consist of a large number of observables, many of which can carry information on the parameters of interest. The relation between model parameters and observables is typically best described by a suite of computer simulation tools for the hard interaction, parton shower, hadronization, and detector response. These tools take as input assumed parameters of the physics model, for instance a particular value for the Wilson coefficients of an EFT, and use Monte-Carlo methods to sample hypothetical observations. Unfortunately, they do not directly let us solve the inverse problem: given a set of observed events, it is not possible to explicitly calculate the likelihood of such a measurement as a function of the theory parameters. This intractability of the likelihood function is a major challenge for particle physics measurements.

Particle physicists have developed a range of techniques for this problem of likelihood-free inference. These can be roughly grouped into three categories [1]:

1. Traditionally, analyses are restricted to a small number of hand-picked observables.
The likelihood function for these low-dimensional summary statistics can then be estimated with explicit parametric functions, histograms, kernel density estimation techniques, or Gaussian Processes [2–4]. Relatedly, Approximate Bayesian Computation [5–8] is a family of Bayesian techniques that allow sampling from an approximate version of the posterior in the space of the summary statistics. Coming up with the newest and greatest kinematic observables is a popular pastime among phenomenologists. But limiting the analysis to a few summary statistics discards the information in all other directions in phase space. Even well-motivated variables often do not come close to the power of an analysis of the fully differential cross section [9, 10].

2. Another approach aims to estimate the likelihood function of high-dimensional observables by approximating the effect of shower, hadronization, and detector response with simple transfer functions (or neglecting them altogether). In this approximation, the likelihood becomes tractable. This category includes the Matrix Element Method [11–26], Optimal Observables [27–29], and Shower and Event Deconstruction [30–33]. These methods make maximal use of the knowledge about the physics underlying the simulations. While they do not require picking summary statistics, the approximation of the detector response can lead to suboptimal results, the treatment of additional jet radiation is a challenge, and the evaluation of each event requires the calculation of a numerically expensive integral.

3. Over the last years, methods based on machine learning have become increasingly popular. The industry standard in particle physics is to train a classifier (often a boosted decision tree or neural network) to classify events as coming from different sources (e.g. signal vs. background). Its output is used to define acceptance regions; accepted events are then usually analyzed with a traditional histogram-based measurement strategy. While this strategy is great at suppressing background events, it does not necessarily lead to the most precise parameter measurements when kinematic distributions change over the parameter space [9]. Only recently has there been an increased interest in using machine learning to estimate the likelihood, likelihood ratio, or (in a Bayesian setting) the posterior [34–58]. These approaches have in common that they only require access to samples generated for different model parameter values. They can handle high-dimensional observables and do not require a choice of summary statistics. They also work natively with the output of the simulator, so they do not require any simplifications to the underlying physics or detector response. The estimate of the likelihood provided by these algorithms typically becomes exact in the limit of infinite training samples (assuming sufficient capacity and efficient training), but often a large number of simulations is required before a good performance is reached.

A new machine-learning-based approach that directly leverages matrix element information has been introduced in Refs. [59–61] and since been further developed in Refs. [1, 62]. Like the other multivariate approaches, these techniques support high-dimensional observables without the restriction to summary statistics. Similar to the Matrix Element Method and Optimal Observables, these techniques leverage our physics insight in the form of the matrix elements efficiently. But unlike those methods, they support state-of-the-art simulations of the parton shower and detector response. In addition, after an upfront simulation and training phase, they provide a function that estimates the likelihood and can be evaluated in microseconds.
These new techniques require extracting matrix-element information from the Monte-Carlo simulation, keeping track of and manipulating these weights in specific ways, and then training machine learning models on this data. Without proper software support, these steps are cumbersome and error-prone, posing a technological hurdle to a wider adoption of these methods. Reference [59] describes this approach with the analogy of "mining gold" from Monte-Carlo simulations: while the additional information from the simulations is very valuable, it can require some effort to extract and process. But the gold does not have to be hard to mine!

In this paper we introduce MadMiner, a Python module that automates all steps necessary for these modern multivariate inference techniques. It supports all elements of a typical analysis, including the simulation of events with MadGraph5_aMC [63], Pythia 8 [64], detector simulation, reducible and irreducible backgrounds, and systematic uncertainties. For phenomenological studies, the tool supports the simulation of the detector response with Delphes 3 [65], though it is extendable to a full detector simulation based on Geant4 [66].

We review the supported analysis techniques in Sec. II and describe their implementation in MadMiner in Sec. III. In Sec. IV, the new tool is demonstrated in an example analysis of Higgs production in association with a top pair at the high-luminosity run of the LHC. We give our conclusions in Sec. V. In the appendix we answer frequently asked questions.

II. INFERENCE TECHNIQUES

A. LHC measurements as a likelihood-free inference problem

The ultimate goals of most measurements are best-fit points and exclusion regions in a (high-dimensional) parameter space.
In particle physics experiments, best-fit points are typically defined as maximum likelihood estimators, while exclusion regions are based on hypothesis tests that use the (profile) likelihood ratio as test statistic [67].¹ Both are based on the same central object, the likelihood function p_full({x}|θ). It quantifies the probability of observing a set of events, where each event is characterized by a vector x of observables such as reconstructed energies, momenta, and angles of all final-state particles, as a function of a vector of model parameters θ, e.g. the Wilson coefficients of an effective field theory. In particle physics measurements, the likelihood function usually has the form

    p_full({x}|θ) = Pois(n | L σ(θ)) ∏_i p(x_i|θ) .                                (1)

Here n is the observed number of events, L is the integrated luminosity, σ(θ) is the cross section as a function of the model parameters, Pois(n|λ) = λⁿ e^{−λ}/n! is the probability mass function of the Poisson distribution, and

    p(x|θ) = (1/σ(θ)) dᵈσ(x|θ)/dxᵈ                                                (2)

is the likelihood function for a single event: the probability density of the d-dimensional vector of observables x as a function of the model parameters θ. Up to the normalization, this kinematic likelihood function is identical to the fully differential cross section dᵈσ(x|θ)/dxᵈ.

The Poisson or rate term in Eq. (1) is comparably simple, even though it is based on the cross section after efficiency and acceptance effects, which can be complicated to calculate in realistic problems. But the remaining terms, which quantify the kinematic information, typically cannot be explicitly computed at all. This is because the most accurate model of the kinematic distributions is usually given by a complicated chain of Monte-Carlo simulators.
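To make the structure of Eq. (1) concrete, here is a toy sketch of the full log-likelihood in which the intractable kinematic density p(x|θ) is replaced by a one-dimensional exponential density; both the density and the cross-section function `xsec` are illustrative assumptions, not MadMiner code.

```python
import math

def log_likelihood(events, theta, lumi, xsec):
    """Toy version of Eq. (1): log Pois(n | L*sigma(theta)) + sum_i log p(x_i|theta).

    As a stand-in for the intractable kinematic likelihood, we use the
    exponential density p(x|theta) = theta * exp(-theta * x); xsec(theta)
    is a toy cross-section function."""
    n = len(events)
    lam = lumi * xsec(theta)                                   # expected count L * sigma(theta)
    log_pois = n * math.log(lam) - lam - math.lgamma(n + 1)    # Poisson rate term
    log_kin = sum(math.log(theta) - theta * x for x in events) # kinematic term
    return log_pois + log_kin
```

In a real analysis, the kinematic term is exactly the part that cannot be written down explicitly; the techniques reviewed below replace it with a learned estimator.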
The kinematic likelihood they implicitly define can be written symbolically as

    p(x|θ) = ∫ dz_d ∫ dz_s ∫ dz_p  p(x|z_d) p(z_d|z_s) p(z_s|z_p) p(z_p|θ) ,       (3)

where the integrand is the joint likelihood p(x,z|θ), z_p comprises the four-momenta, charges, and helicities of the parton-level particles, z_s is the entire history of the parton shower, and z_d describes the interactions of the particles with the detector. A state-of-the-art simulation can easily involve billions of such latent variables. Explicitly calculating the integral over this huge space is clearly impossible: given a set of events {x} and a parameter point θ, we hence cannot compute the likelihood function (it is intractable). This is a major challenge for analyzing LHC data. The same structural problem appears in many other fields that use computer simulations to model complicated processes, including cosmology, systems biology, and epidemiology, giving rise to the development of different likelihood-free inference techniques.

In particle physics, common analysis techniques address the intractability of the likelihood function in different ways. The traditional approach restricts the observables x to one or two summary statistics v(x), for instance the invariant mass of the decay products of a searched resonance or the transverse momentum of the hardest particle in an EFT analysis. Then the density p(v|θ) can be calculated with histograms and used in lieu of the full likelihood p(x|θ). On the other hand, the Matrix Element Method and Optimal Observable approaches simplify the integral in Eq. (3) by replacing the shower and detector response with simple smearing or transfer functions; in this approximation it also becomes tractable. For a discussion and comparison of these different methods see Ref. [1].
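The traditional strategy can be sketched in a few lines: estimate p(v|θ) from simulated values of a one-dimensional summary statistic v(x) with a histogram. This is a generic illustration, not part of MadMiner.

```python
def histogram_density(samples, edges):
    """Histogram estimate of the density p(v) of a 1D summary statistic v,
    normalized so that the bin values integrate to one over the binned range."""
    n = len(samples)
    counts = [0] * (len(edges) - 1)
    for v in samples:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    widths = [edges[i + 1] - edges[i] for i in range(len(edges) - 1)]
    return [c / (n * w) for c, w in zip(counts, widths)]  # density per bin
```

Filling such histograms for several parameter points θ yields a tractable stand-in p(v|θ) for the full p(x|θ), at the price of discarding the information in all other directions in phase space.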
¹ The problem of likelihood-free inference, the inference techniques discussed here, and MadMiner apply just as well in a Bayesian setting; see for instance Ref. [56].

B. Learning the likelihood function

A first class of inference techniques in MadMiner tackles the intractability of the likelihood function head-on: a neural network is trained to estimate the kinematic likelihood p(x|θ) or, equivalently, the likelihood ratio

    r(x|θ₀, θ₁) = p(x|θ₀) / p(x|θ₁)                                               (4)

using data available from the simulator. To be more specific, MadMiner differentiates between three different functions that the neural network can learn:

Likelihood estimators: In this case, a neural network takes as input event data x as well as a model parameter point θ and returns the estimated likelihood p̂(x|θ),

    NN: (x, θ) ↦ p̂(x|θ) .                                                         (5)

Likelihood ratio estimators: Alternatively, the network can model the likelihood ratio including its dependence on the data x and on the parameter point θ in the numerator of the ratio,

    NN: (x, θ) ↦ r̂(x|θ) ≈ p(x|θ) / p_ref(x) .                                     (6)

There are different options for the denominator distribution p_ref(x). In MadMiner we set it to the distribution from a reference parameter point, p_ref(x) = p(x|θ_ref). Alternatively, it could be given by a marginal model p_ref(x) = ∫ dθ′ p(x|θ′) p(θ′), or even be an entirely unphysical reference distribution.

Doubly parameterized likelihood ratio estimators: The last option is to model the likelihood ratio as a function of not only the event data x and the numerator parameter point θ₀, but also including its dependence on the denominator model,

    NN: (x, θ₀, θ₁) ↦ r̂(x|θ₀, θ₁) ≈ p(x|θ₀) / p(x|θ₁) .                           (7)

Note that in all three cases, the network is parameterized in terms of the theory parameters θ [37, 68]: rather than training separate networks for different points on a grid of parameter points, one neural network models the likelihood function for the whole parameter space. The network learns to interpolate in parameter space and can "borrow" statistical power from close parameter points, leading to a significantly better sample efficiency than a point-by-point approach [61].

But how do we train a neural network to learn any of these three functions? More specifically, which loss function can we minimize so that a neural network will converge to the right function? There are a number of different answers, which can be grouped into two categories. First, some methods have been developed that just use samples of events {x} generated from different parameter points θ. This includes neural density estimation (NDE) techniques, for instance masked autoregressive flows [47], in which the network learns the likelihood function. Another approach is the Carl method [37], which trains the network to estimate the likelihood ratio.

While both NDE and Carl are implemented in MadMiner, its major feature is the support for a new, potentially more powerful paradigm for likelihood or likelihood ratio estimation [59–61]. The key idea is that additional information can be extracted from the Monte-Carlo simulations, and that this additional information can be used to train more precise estimators of the likelihood or likelihood ratio with less training data.

Method      Run simulation at        r(x,z)   t(x,z)   Asympt. exact       Generative   Ref.

Likelihood estimators
NDE         θ ∼ π(θ)                                    X                   X            [47]
Scandal     θ ∼ π(θ)                          X         X                   X            [59]

Likelihood ratio estimators
Carl        θ ∼ π(θ), θ_ref                             X                                [37]
Rolr        θ ∼ π(θ), θ_ref           X                 X                                [61]
Alice       θ ∼ π(θ), θ_ref           X                 X                                [62]
Cascal      θ ∼ π(θ), θ_ref                   X         X                                [61]
Rascal      θ ∼ π(θ), θ_ref           X       X         X                                [61]
Alices      θ ∼ π(θ), θ_ref           X       X         X                                [62]

Doubly parameterized likelihood ratio estimators
Carl        θ₀ ∼ π(θ), θ₁ ∼ π(θ)                        X                                [37]
Rolr        θ₀ ∼ π(θ), θ₁ ∼ π(θ)      X                 X                                [61]
Alice       θ₀ ∼ π(θ), θ₁ ∼ π(θ)      X                 X                                [62]
Cascal      θ₀ ∼ π(θ), θ₁ ∼ π(θ)              X         X                                [61]
Rascal      θ₀ ∼ π(θ), θ₁ ∼ π(θ)      X       X         X                                [61]
Alices      θ₀ ∼ π(θ), θ₁ ∼ π(θ)      X       X         X                                [62]

Score estimators
Sally       θ_ref                             X         in local approx.                 [61]
Sallino     θ_ref                             X         in local approx.                 [61]

Table I. Inference techniques implemented in MadMiner. We separate them into four groups, depending on which quantity is estimated by the neural network; see the text for more details. We give the parameter values for which the Monte-Carlo samples have to be generated and list whether the augmented data (joint likelihood ratio r(x,z) and joint score t(x,z)) are used. "Asymptotically exact" describes methods that should give theoretically optimal results in the limit of sufficient network capacity, perfect optimization, and enough training data. Methods that also allow for the fast generation of event data from the neural network are marked as "generative". Finally, for each method we give the reference that provides the clearest explanation (and spells out the acronym).
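For intuition on the Carl method: a classifier trained to distinguish events generated at θ (label 1) from events generated at θ_ref (label 0) converges to s(x) = p(x|θ) / (p(x|θ) + p_ref(x)), so the likelihood ratio can be read off from its output. A minimal sketch of this identity, not the MadMiner implementation:

```python
def ratio_from_classifier(s):
    """Invert the optimal classifier output s = p / (p + p_ref) to obtain the
    likelihood ratio r = p / p_ref = s / (1 - s)."""
    return s / (1.0 - s)

# Sanity check with known densities: if p(x|theta) = 0.2 and p_ref(x) = 0.1,
# the optimal classifier outputs s = 2/3, and the recovered ratio is 2.
```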
More specifically, for each simulated event it is possible to calculate the joint likelihood ratio

    r(x,z|θ₀, θ₁) ≡ p(x,z|θ₀) / p(x,z|θ₁)
                  = [ p(x|z_d) p(z_d|z_s) p(z_s|z_p) p(z_p|θ₀) ] / [ p(x|z_d) p(z_d|z_s) p(z_s|z_p) p(z_p|θ₁) ]
                  = [ dσ(z_p|θ₀) / dσ(z_p|θ₁) ] · [ σ(θ₁) / σ(θ₀) ]                (8)

and the joint score

    t(x,z|θ) ≡ ∇_θ log p(x,z|θ)
             = [ p(x|z_d) p(z_d|z_s) p(z_s|z_p) ∇_θ p(z_p|θ) ] / [ p(x|z_d) p(z_d|z_s) p(z_s|z_p) p(z_p|θ) ]
             = ∇_θ dσ(z_p|θ) / dσ(z_p|θ) − ∇_θ σ(θ) / σ(θ) .                       (9)

Here σ(θ) is the total cross section as a function of the model parameters θ, and dσ(z_p|θ) are the parton-level event weights. At a hadron collider such as the LHC these can be written as [15]

    dσ(z_p|θ) = (2π)⁴ [ f₁(x₁, Q²) f₂(x₂, Q²) / (2 x₁ x₂ s) ] |M|²(z_p|θ) dΦ(z_p) .   (10)

They depend on the momentum fractions x_i carried by the initial-state partons, the squared center-of-mass energy s, the momentum transfer Q, the corresponding values of the parton density functions f_i(x_i, Q²), and the Lorentz-invariant phase-space element dΦ(z_p). Finally, z_p is the entire phase-space point of a simulated event (including the parton four-momenta, helicities, and charges), and |M|²(z_p|θ) is the squared matrix element.

Both the joint likelihood ratio and the joint score thus depend on the parton-level momenta z_p and are directly related to the squared matrix element describing the underlying process. The main insight of Refs. [59–61] is that the joint likelihood ratio and joint score can be used to define loss functions that, when minimized with respect to a test function depending only on x and θ, yield estimators that converge to the likelihood function p(x|θ) or the likelihood ratio.² There are several variations of this idea.
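The cancellations in Eqs. (8) and (9) mean that both augmented quantities follow from parton-level weights and total cross sections alone. The sketch below illustrates this for a single parameter, with a hypothetical interface `dsigma(theta)` returning an event's parton-level weight and the total cross section; the gradient in the joint score is taken numerically.

```python
def joint_likelihood_ratio(dsigma, theta0, theta1):
    """Eq. (8): the shower and detector factors cancel, leaving
    r(x,z) = [dsigma(z_p|theta0)/dsigma(z_p|theta1)] * [sigma(theta1)/sigma(theta0)]."""
    w0, s0 = dsigma(theta0)
    w1, s1 = dsigma(theta1)
    return (w0 / w1) * (s1 / s0)

def joint_score(dsigma, theta, eps=1e-6):
    """Eq. (9) for a scalar parameter:
    t(x,z) = d log dsigma(z_p|theta)/d theta - d log sigma(theta)/d theta,
    with central finite differences as a stand-in for the analytic gradient."""
    wp, sp = dsigma(theta + eps)
    wm, sm = dsigma(theta - eps)
    w0, s0 = dsigma(theta)
    return (wp - wm) / (2 * eps) / w0 - (sp - sm) / (2 * eps) / s0
```

In MadMiner these quantities are extracted from the matrix-element code rather than computed from a user-supplied function; the sketch only shows their structure.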
The main difference between them is the exact form of the loss function used. We label them with a set of acronyms: Scandal is an improved version of NDE techniques that uses the joint score to train likelihood estimators more efficiently; Cascal is a similarly improved version of the Carl method; Rolr and Alice use the joint likelihood ratio to efficiently train a likelihood ratio estimator; and finally the Rascal and Alices techniques use both the joint likelihood ratio and the joint score, maximizing the use of information from the simulator. In Tbl. I we provide an overview and give references to detailed explanations of all methods.

Once a neural network has been trained with one of these methods, it can calculate an estimated value of the likelihood or likelihood ratio for any event and any parameter point. Established statistical tools can then be used to calculate best-fit points and exclusion limits in the parameter space. For the calculation of frequentist confidence regions, there are generally two strategies. The first is simulating a large number of toy experiments to calculate the p-value for each parameter point that is tested. This approach can be computationally expensive, but guarantees statistically correct results: even if the neural network has not learned the likelihood function accurately, this approach will not lead to too tight limits. The second strategy uses the asymptotic properties of the likelihood ratio function [69–71] to directly translate values of the likelihood ratio into p-values. While this method is extremely efficient, it relies on correctly trained neural networks.

C. Learning locally optimal observables

MadMiner also implements a second class of methods: rather than trying to reconstruct the full likelihood function, a neural network is trained to provide the most powerful observables for a given measurement problem.
The central quantity of this approach is the score

    t(x) = ∇_θ log p(x|θ) |_{θ=θ_ref} ,                                           (11)

evaluated at a fixed reference parameter point θ_ref, for instance the SM. This vector has one component per parameter. For a given event x, its components are just numbers (unlike the likelihood and the likelihood ratio, which are also functions of the parameters θ). In other words, the score is a vector of observables.

The relevance of these observables is most obvious in a local approximation of the likelihood function [7, 60, 72]: in the neighborhood of the parameter point θ_ref, the score components are the sufficient statistics. That means that for the purpose of measuring θ, knowing t(x) is just as powerful as knowing the full likelihood p(x|θ) (which, since it depends on θ, is a much more complicated object). In this sense, the score defines the most powerful observables for the measurement of θ.³

² Note that this approach is similar in spirit to the Matrix Element Method, which also uses parton-level likelihoods and aims to estimate r(x|θ₀, θ₁) by calculating approximate versions of the integral in Eq. (3). But unlike the Matrix Element Method, our machine-learning-based approach supports realistic shower and detector simulations and can be evaluated very efficiently.

This motivates a fourth function for a neural network to estimate:

Score estimator: A neural network takes as input event data x and returns the estimated score at a reference parameter point,

    NN: x ↦ t̂(x) ≈ ∇_θ log p(x|θ) |_{θ=θ_ref} .                                   (12)

How does a neural network learn to estimate the score? Again, extracting additional information from the simulator proves useful. The Sally and Sallino methods introduced in Refs. [59–61] define a loss function that involves the joint score t(x,z).
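A minimal sketch of such a loss: the mean squared error between the network output t̂(x) and the joint score t(x,z), whose minimizer over functions of x alone is the true score t(x). This only illustrates the shape of the loss for a scalar parameter, not MadMiner's training code.

```python
def score_loss(estimates, joint_scores):
    """Mean squared error between estimated scores t_hat(x_i) and joint scores
    t(x_i, z_i) for a batch of simulated events (scalar-parameter case)."""
    n = len(estimates)
    return sum((e - t) ** 2 for e, t in zip(estimates, joint_scores)) / n
```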
Minimizing this loss function will train a neural network to converge to the true score t(x) [59]. After training, such a score estimator can be used like any other set of observables. In particular, we can fill multivariate histograms of the score and use them for inference. This approach, named Sally, requires only a minimal modification of established analysis workflows. A similar method called Sallino constructs one-dimensional histograms of particular projections of the estimated score; see Ref. [61] for details.

As long as parameter points close to the reference point, for instance the SM, are analyzed, and assuming that the neural network was trained efficiently and with enough training data, the Sally and Sallino methods will lead to statistically optimal limits. Further away from the reference point, the score components might no longer be optimal, and this approach might lose some power compared to the techniques discussed in the previous section. The size of the parameter region in which the score components are the sufficient statistics depends on the size of higher derivatives of the (log) likelihood with respect to the parameters and is not known a priori; we will illustrate this with an example in Sec. IV C.

D. The Fisher information

The final results of actual measurements are best-fit points and exclusion limits. However, for quickly evaluating the sensitivity of a measurement, comparing different channels, or optimizing an analysis, a different object is often more practical: the Fisher information matrix. It is closely connected to the score discussed in the previous section and summarizes the sensitivity of an analysis in a compact, interpretable, and powerful way [9, 10]. It is defined as the expectation value

    I_ij(θ) = E_θ[ (∂ log p_full({x}|θ) / ∂θ_i) (∂ log p_full({x}|θ) / ∂θ_j) ]     (13)

with the full likelihood function p_full({x}|θ) from Eq. (1).
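Given the matrix I_ij(θ) of Eq. (13), local sensitivity estimates reduce to simple linear algebra. The sketch below computes the quadratic form Δθᵀ I Δθ, which approximates the expected log likelihood ratio for small Δθ, and a single-parameter Monte-Carlo estimate of the rate-plus-kinematic decomposition of the Fisher information; both are toy illustrations, not MadMiner's API.

```python
def fisher_distance(info, dtheta):
    """sqrt(dtheta^T I dtheta): quadratic approximation of the expected value
    of -2 log likelihood ratio between theta and theta + dtheta (small dtheta)."""
    dim = len(dtheta)
    d2 = sum(info[i][j] * dtheta[i] * dtheta[j]
             for i in range(dim) for j in range(dim))
    return d2 ** 0.5

def fisher_information_1d(lumi, sigma, dsigma_dtheta, scores):
    """Single-parameter Monte-Carlo estimate of the Fisher information:
    rate term L (dsigma/dtheta)^2 / sigma plus kinematic term
    L * sigma * <t(x)^2>, averaged over events drawn from p(x|theta)."""
    kinematic = lumi * sigma * sum(t * t for t in scores) / len(scores)
    return lumi * dsigma_dtheta ** 2 / sigma + kinematic
```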
To see why this matrix is useful, consider an expansion of the expected log likelihood ratio between θ + Δθ and θ around the minimum:

    −2 E_θ[ log ( p_full({x}|θ+Δθ) / p_full({x}|θ) ) ]
        = − E_θ[ ∂² log p_full({x}|θ) / ∂θ_i ∂θ_j ] Δθ_i Δθ_j + O(Δθ³)
        = E_θ[ (∂ log p_full({x}|θ) / ∂θ_i) (∂ log p_full({x}|θ) / ∂θ_j) ] Δθ_i Δθ_j + O(Δθ³)
        = I_ij(θ) Δθ_i Δθ_j + O(Δθ³)
        = d(θ, θ+Δθ)² + O(Δθ³) .                                                  (14)

In the last step we have introduced the local Fisher distance

    d(θ, θ+Δθ) = √( I_ij(θ) Δθ_i Δθ_j ) ,                                         (15)

which is a convenient approximation of the log likelihood ratio as long as Δθ is small.⁴ Moreover, according to the Cramér-Rao bound [76, 77], the inverse of the Fisher information is the minimal covariance of any unbiased estimator θ̂. The larger the Fisher information, the more precisely a parameter can be measured.

This approach shines when it comes to ease of use and interpretability. The Fisher information matrix is invariant under reparameterizations of the observables x, transforms covariantly under reparameterizations of the parameters θ, and is additive over phase-space regions. The last property means we can define the distribution of the differential information over phase space, which quantifies where in phase space the power of an analysis comes from [9]. The formalism also easily accommodates nuisance parameters, and profiling over them is a simple matrix operation [9, 78].

³ In fact, the score vector is a generalization of the concept of Optimal Observables [27–29] from the parton level to the full statistical model including shower and detector simulation.

In particle physics processes described by Eq. (1), the Fisher information turns out to be [9]

    I_ij(θ) = L ∂_i σ(θ) ∂_j σ(θ) / σ(θ) + L σ(θ) ∫ dx p(x|θ) t_i(x|θ) t_j(x|θ)
            ≈ L ∂_i σ(θ) ∂_j σ(θ) / σ(θ) + (L σ(θ) / n) Σ_{x ∼ p(x|θ)} t_i(x|θ) t_j(x|θ) ,   (16)

where L is the integrated luminosity, σ the cross section, ∂_i denotes derivatives with respect to θ_i, n is the number of events x generated for the parameter point θ, and t_i is the i-th component of the score vector introduced in Eq. (11). The first term describes the information in the overall rate, while the second term quantifies the power in the kinematic distributions. A neural score estimator t̂(x) as in Eq. (12), together with a set of events, thus lets us calculate the (a priori intractable) Fisher information.

E. Practical analysis aspects

Let us now link these abstract inference techniques to specific aspects of typical analyses in high-energy physics and summarize some features and limitations of MadMiner.

High-energy process: MadMiner supports almost all processes of perturbative high-energy physics that can be run in MadGraph5_aMC [63]. This includes any high-energy physics model specified in the UFO format [79]. The inference techniques only require that the model is parameterized by a finite number of model parameters θ and that it is possible to calculate the parton-level event weights of Eq. (10) for arbitrary values of the model parameters, i.e. to "reweight" the events to different parameter points [80]. The approach is not fundamentally restricted to leading order, though one has to be careful that negative event weights, which can appear in certain subtraction schemes, do not lead to numerical instabilities.

It is often beneficial to define the parameters θ such that they span a similar order of magnitude. In practice, this may require some rescaling.
For instance, if an analysis aims to measure two Wilson coefficients f₀ and f₁ and the range of interest of f₁ is 1000 times larger than that of f₀, consider defining the parameters internally as θ = (f₀, f₁/1000).

⁴ The Fisher information defines a metric on the parameter space, giving rise to the field of information geometry [9, 73, 74]. In that formalism we can also define "global" distances measured along geodesics, which are equivalent to the expected log likelihood ratio even beyond the local approximation of small Δθ [75].

Morphing: In an important class of models, the squared matrix elements (or parton-level event weights) can be factorized into a sum over n_c components, each consisting of an analytical function of the theory parameters times a function of phase space:

    |M|²(z_p|θ) = Σ_c w_c(θ) f_c(z_p) .                                           (17)

This is often the case in effective field theories, or when indirect effects of new physics are parameterized through form factors. For instance, consider the simple case in which we are trying to measure a single BSM parameter θ and the process is described by a SM contribution, an interference term, and a squared BSM amplitude:

    |M|²(z_p|θ) = |M_SM|²(z_p) + θ · 2 Re[ M†_SM(z_p) M_BSM(z_p) ] + θ² |M_BSM|²(z_p) ,   (18)

i.e. the components of Eq. (17) are w₀(θ) = 1, w₁(θ) = θ, w₂(θ) = θ² with f₀ = |M_SM|², f₁ = 2 Re[M†_SM M_BSM], and f₂ = |M_BSM|². More generally, the dependency on the model parameters is often a combination of different polynomials. Note that the contributions f_c(z_p) are not necessarily distributions: they can be negative, or integrate to zero, for instance for interference terms. Nevertheless, the sum of all components is always a physical distribution, i.e. it is non-negative everywhere and integrates to the total cross section.

When a process factorizes according to Eq.
(17), a "morphing technique" [61, 81] allows us to calculate event weights anywhere in parameter space precisely and very fast. First, the squared matrix element is evaluated at n_c different points in the parameter space. The structure of Eq. (17) together with some linear algebra is then used to exactly interpolate to any other parameter point. This process is described in detail in Ref. [61].

MadMiner implements this morphing technique and leverages it extensively. The user only has to specify the maximal powers with which each model parameter contributes to the squared matrix element. MadMiner then automates the necessary linear algebra internally.

A practical question is at which n_c benchmark points the matrix elements should be evaluated originally. This set of parameter points is called the morphing basis. While the physical predictions for a given parameter point are independent of this basis, the morphing procedure involves matrix inversions and cancellations between potentially large terms that depend on the choice of basis. Some morphing basis choices can thus lead to floating-point precision issues, while others are numerically more stable. MadMiner can automatically pick or complete a morphing basis that avoids or minimizes numerical precision issues. This optimization consists of randomly drawing a number of basis configurations over a user-specified parameter region, calculating morphing weights for each basis, and choosing the basis that minimizes the sum of squared morphing weights.

Note that MadMiner is not restricted to problems that factorize according to Eq. (17). Much of the core functionality is available for almost any model of new physics. But some features are currently only implemented in the morphing case, and for others the morphing setup can reduce the computational cost substantially.
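For the single-parameter example of Eq. (18), the linear algebra behind the morphing is compact enough to sketch in full. The benchmark points and toy component functions below are illustrative choices, not MadMiner's internal defaults:

```python
import numpy as np

basis = np.array([-1.0, 0.0, 1.0])    # n_c = 3 benchmark points theta_b

def morphing_weights(theta):
    # Rows of A encode the polynomial structure (1, theta_b, theta_b^2)
    # of Eq. (18) at each benchmark.
    A = np.vander(basis, N=3, increasing=True)
    # Solve A^T c = (1, theta, theta^2) so that
    # sum_b c_b * |M|^2(theta_b) = |M|^2(theta) exactly.
    return np.linalg.solve(A.T, np.array([1.0, theta, theta**2]))

# Toy squared matrix element with components f0, f1, f2 = 2.0, -0.5, 0.3.
f = np.array([2.0, -0.5, 0.3])
def me2(theta):
    return f @ np.array([1.0, theta, theta**2])

theta = 0.7
interpolated = morphing_weights(theta) @ np.array([me2(b) for b in basis])
```

Because the θ dependence is an exact polynomial, the interpolation reproduces me2(θ) to machine precision for any θ, not just near the benchmarks.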
Parton shower: Parton shower and hadronization can be simulated with Pythia 8 [64], including matching and merging of different final-state jet multiplicities. This part of the event evolution should not directly depend on the new physics parameters of interest. Other shower simulators can be interfaced with little effort.

Detector simulation: Out of the box, MadMiner includes a fast phenomenological detector simulation with Delphes 3 [65], as well as an alternative approximate detector simulation through smearing functions based on the parton-level final state. MadMiner is designed modularly so that it can be interfaced to more realistic detector simulations used by the experimental collaborations such as Geant4 [66]. Such an extension will mostly require careful book-keeping of event weights and observables.

Observables: The observed data for each event needs to be parameterized as a fixed-length vector of observables x. These can include basic characteristics like energies, transverse momenta, and angular directions of reconstructed particles, as well as higher-level features such as invariant masses or angular correlations between particles. For Delphes-level analyses, MadMiner allows the definition of these observables as arbitrary functions of the objects in the Delphes output file, while for parton-level analyses arbitrary functions of the smeared parton-level four-momenta are supported. It is possible to extend MadMiner with interfaces to any external code that calculates observables from generated events.

Backgrounds: Different signal and background processes, with no limitations on the parton-level final states, can be combined in the same analysis. Background processes are allowed to depend on the model parameters θ.
In the case of reducible backgrounds that are not affected by θ, the joint log likelihood ratio and joint score of all background events are zero, up to an x-independent constant that is related to the dependence of the overall signal cross section on θ. We will discuss and illustrate this case in Sec. IV A. While fully data-driven backgrounds are not supported, a data-driven normalization of MC event samples is possible.

Systematic uncertainties: All imperfections in the description of the physics process with the simulation chain are modeled with nuisance parameters ν. For most of the analysis chain, they play the same role as the physics parameters of interest θ: their true value is unknown and they affect the likelihood of simulation outcomes. For the inference techniques presented in the previous section, every occurrence of θ then has to be replaced with (θ, ν): the neural networks estimate the likelihood p(x|θ, ν), the likelihood ratio r(x|θ_0, ν_0; θ_1, ν_1), or the score t(x|θ, ν), where the latter now has more components, corresponding to both the gradient with respect to θ and the gradient with respect to ν. At the final limit-setting stage, one then picks a constraint term (or, in a Bayesian setting, a prior) for the nuisance parameters and profiles (or marginalizes) over them, following established statistical procedures [9, 71, 78, 82].

MadMiner currently supports nuisance parameters that model systematic uncertainties from scale and PDF choices. The effect of the nuisance parameters on an event weight is parameterized as

\mathrm{d}\sigma(z_p | \theta, \nu) = \mathrm{d}\sigma(z_p | \theta, 0) \times \exp\left[ \sum_i a_i(z_p) \, \nu_i + b_i(z_p) \, \nu_i^2 \right] ,   (19)

similar to HistFactory [3] and PyHF [83]. ν_i = 0 corresponds to the nominal value of the i-th nuisance parameter.
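The parameterization of Eq. (19) can be evaluated per event in a few lines. The coefficient values below are hypothetical stand-ins for the automatically fitted a_i(z_p) and b_i(z_p):

```python
import numpy as np

def vary_weight(nominal_weight, a, b, nu):
    """Eq. (19): dsigma(z_p|theta,nu) = dsigma(z_p|theta,0) * exp(sum_i a_i nu_i + b_i nu_i^2)."""
    log_factor = np.sum(a * nu + b * nu**2)
    return nominal_weight * np.exp(log_factor)

a = np.array([0.05, -0.02])   # toy linear response to two nuisance parameters
b = np.array([0.01, 0.00])    # toy quadratic response
w_nominal = vary_weight(1.3, a, b, nu=np.zeros(2))       # nu = 0: nominal weight
w_up = vary_weight(1.3, a, b, nu=np.array([1.0, 0.0]))   # first nuisance at +1
```

The exponential form guarantees a positive weight for any ν, which is the point of this particular parameterization.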
For each varied scale, ν_i = ±1 correspond to the scale variations (typically half and twice the nominal scale choice). PDF uncertainties are described by one nuisance parameter per eigenvector in a Hessian PDF set, and ν_i = 1 corresponds to the event weight along a unit step of an eigenvector. The factors a_i(z_p) and b_i(z_p) are automatically calculated for each event based on a reweighting procedure [84]. The exponential form of Eq. (19) ensures non-negative event weights.

(Fundamentally, the presented inference techniques also support new physics effects that affect e.g. the probabilities of shower splittings, but this is currently not supported in MadMiner.)

Neural network architectures and training: At the heart of MadMiner's analysis techniques are neural networks that take event data x (and, depending on the method, a parameter point (θ, ν)) as input and return the likelihood, likelihood ratio, or score. The optimal architecture of these networks depends on the problem. MadMiner currently supports fully connected feed-forward neural networks with a variable number of layers and hidden units and different activation functions, implemented in PyTorch [85]. The loss functions are mostly fixed by the inference methods given in Tbl. I. The Scandal, Rascal, Alices, and Cascal techniques have a free hyperparameter α that weights the joint score term in the loss function relative to another term. These loss functions are minimized by stochastic gradient descent with or without momentum [86], the Adam optimizer [87], or the AMSGrad optimizer [88]; other options include batching, learning rate decay, and early stopping.

Uncertainty estimation: An individual neural estimator merely provides a point estimate for the likelihood, likelihood ratio, or score.
By training an ensemble of different estimators with different random seeds, we can use the ensemble variance as a diagnostic tool to check whether the global minimum of the loss functional has been found [89]. Taking this idea one step further, we can train each network on resampled data. With this nonparametric bootstrap method, the ensemble variance represents the uncertainty in the neural network output from the finite training sample size. While this approach may serve as a useful indicator of the epistemic uncertainty of the network predictions (i.e. the uncertainty on the parameters of the neural network), there is no guarantee that it covers all relevant sources of bias and variance.

F. Recommendations for getting started

The large number of different inference methods, analysis aspects, and hyperparameters outlined above and described in detail in a total of six publications [37, 47, 59-62] might seem a little overwhelming. That is why we here provide a few suggestions for new users of MadMiner, largely based on the comprehensive comparison in Refs. [61, 62]. Rather than being a one-size-fits-all solution, this should be seen as a starting point for the exploration of the space of possible analysis methods.

The main question is whether to use one of the methods that reconstruct the entire likelihood or likelihood ratio function, as described in Sec. II B, or whether the analysis merely aims to find (locally) optimal observables, as described in Sec. II C. The former approach is potentially more powerful: given enough data, expressive enough networks, and a training of the neural network that reaches the global minimum of the loss function, it will lead to the best possible limits. But it is also more ambitious, may require more training data and hyperparameter experiments, and represents a bigger change to a typical data analysis pipeline.
The latter strategy, on the other hand, is simpler, scales better to high-dimensional parameter spaces, and requires fewer training samples. Since it essentially defines a new set of observables, it requires only minimal modifications to existing analysis pipelines. The Fisher information formulation makes it very easy to summarize the sensitivity of a measurement. The catch is that this approach is only optimal as long as the dominant signatures enter at linear order in the model parameters; otherwise it loses statistical power and may lead to worse limits.

If this last condition is satisfied, i.e. if the dominant new physics effects are expected at linear order in the parameters, we consider the Sally strategy an ideal starting point. A typical example for this is a precision measurement of effective operators. On the other hand, if nonlinear contributions from the model parameters dominate, we instead suggest using the Alices technique. Its hyperparameter α should initially be chosen such that the two terms in the loss function contribute approximately equally to the training, but it is worth scanning this parameter over a few orders of magnitude.

III. USING MADMINER

We will now describe the implementation of these techniques in the new Python package MadMiner. MadMiner is open source and its code is available at Ref. [90]. That repository also contains interactive tutorials with step-by-step comments. A detailed documentation of the API is available online at Ref. [91]. We also provide a Docker container with a working environment of all required tools at Ref. [92], and reusable workflows based on Reana [93] at Ref. [94].

To get started, the minimal requirements are working installations of MadGraph5_aMC and MadMiner. The latter can be installed with a simple pip install madminer.
Shower and detector simulations in addition require installations of Pythia 8, the automatic MadGraph-Pythia interface, and Delphes 3. To model PDF uncertainties, LHAPDF has to be installed, including its Python interface. All these additional dependencies can easily be installed from the MadGraph5_aMC command line interface. Detailed instructions for the installation can be found at Ref. [91].

In the following we will go through the typical steps of a MadMiner analysis that uses the inference techniques discussed in the last section. Figure 1 visualizes the workflow of such an analysis, and we will generally follow this figure.

A. Analysis specification and event generation

The first phase of a MadMiner analysis consists of specifying the problem and generating events. First, the necessary files ("cards") that define the analyzed process and theoretical model should be collected. This includes the UFO model files as well as the run card, the parameter card, the Pythia card, and the Delphes card, all in the standard format used by MadGraph5_aMC.

The measurement problem is specified with an instance of the MadMiner class. The parameter space is defined by repeatedly calling its add_parameter() function. Each model parameter is specified by its LHA block and LHA ID in the UFO model.

Next, the user chooses benchmarks: parameter points at which the event weights are evaluated. Benchmarks can be specified manually with add_benchmark(). Additionally, a morphing technique based on Eq. (17) can be activated by calling set_morphing(). If fewer benchmarks have been manually specified than required for a morphing basis, additional benchmarks will be chosen automatically, minimizing the expected size of the morphing weights |w_c(θ)|.

Systematic uncertainties (from PDF and scale variations) can be specified with a call to set_systematics().
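The automatic basis completion, i.e. drawing random candidate bases and keeping the one with the smallest morphing weights, can be sketched for a single parameter with quadratic dependence. The candidate-drawing scheme and helper names here are illustrative, not MadMiner's internals:

```python
import numpy as np

rng = np.random.default_rng(7)

def morphing_weights(basis, theta):
    # Solve A^T c = (1, theta, theta^2), where the rows of A encode the
    # polynomial structure (1, theta_b, theta_b^2) at each benchmark theta_b.
    A = np.vander(basis, N=3, increasing=True)
    return np.linalg.solve(A.T, np.array([1.0, theta, theta**2]))

def basis_score(basis, thetas):
    # Mean squared size of the morphing weights over the region of interest:
    # large weights signal cancellations and potential precision loss.
    return np.mean([np.sum(morphing_weights(basis, t) ** 2) for t in thetas])

thetas = np.linspace(-1.0, 1.0, 21)    # user-specified parameter region
candidates = [np.sort(rng.uniform(-1.0, 1.0, size=3)) for _ in range(100)]
best_basis = min(candidates, key=lambda b: basis_score(b, thetas))
```

Bases with nearly degenerate benchmark points score badly (their morphing weights blow up), so the selection naturally favors well-separated, numerically stable configurations.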
Once the parameter space, benchmarks, morphing, and systematic uncertainties are set up, save() saves these settings in the MadMiner file, which is based on the HDF5 standard [95]. Finally, events can be generated by calling run() or run_multiple() (the difference is that the former starts one event generation run, while the latter generates multiple sets with different run cards or sampled from different benchmarks). MadMiner will set up MadGraph's reweighting feature to evaluate the event weights for all events at all benchmarks, which are stored in the LHE event files together with the parton-level information. Pythia 8 will automatically be called to shower and hadronize the partons; the results are stored in a standard HepMC event file [96].

[Figure 1 shows the workflow: 1. event generation (MadMiner class; MadGraph cards (.dat) in, parton-level events (.lhe) and hadron-level events (.hepmc) out), 2. observables (DelphesReader; detector-level events (.root)), 3. sampling (SampleAugmenter; training data (.npy)), 4. ML (LikelihoodEstimator, RatioEstimator, ScoreEstimator, Ensemble; trained ML model (.json, .pt, .npy)), 5A. Fisher info (FisherInformation), 5B. inference (AsymptoticLimits; best fit, p-values), with the MadMiner file (.h5) and the external simulators MadGraph, Pythia, and Delphes connecting the steps.] Figure 1. Example workflow, with classes in red, external simulations in blue, and files in green.

B. Detector effects and observables

In the second phase, all relevant information has to be extracted from the event samples, including observables as well as event weights for the different benchmarks.
There are currently two implementations for this step: the LHEReader class realizes a simple parton-level analysis, in which the effects of shower and detector are approximated with transfer functions, while the DelphesReader class implements a detector-level analysis in which the shower is modeled with Pythia 8 and the detector with Delphes 3. The API of both classes is very similar; here we focus on the DelphesReader option.

After creating a DelphesReader instance and pointing it to the MadMiner file, the user has to list the HepMC event samples that should be analyzed by calling the function add_sample(). The detector simulation with Delphes can either be run externally or through MadMiner by calling run_delphes().

In a next step, the user defines a set of observables that will be calculated for each event. These can be provided either as Python functions with add_observable_from_function() or as parse-able strings with add_observable(). In both cases, reconstructed objects are accessible as MadMiner Particle objects, which inherit all functions of scikit-hep's LorentzVector class [97]. This makes observable definitions very easy: for instance, the transverse momentum of the hardest lepton can simply be defined as add_observable("lepton_pt", "l[0].pt"), while the azimuthal angle between the two hardest jets can be defined as add_observable("delta_phi", "j[0].deltaphi(j[1])"). Cuts can be added similarly with add_cuts().

Once all samples are added, Delphes has been called, and all observables and cuts are defined, a call to analyse_delphes_samples() parses the observables for the simulated events, applies the cuts, and extracts the relevant event weights. With save() this data is stored in the MadMiner file.
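The parse-able strings above map onto ordinary four-vector arithmetic. A standalone sketch of what "l[0].pt" and "j[0].deltaphi(j[1])" compute, with a minimal four-momentum helper in place of the scikit-hep LorentzVector class (the numerical four-momenta are made up):

```python
import numpy as np

def pt(p):
    """Transverse momentum of a four-momentum (E, px, py, pz)."""
    return np.hypot(p[1], p[2])

def deltaphi(p1, p2):
    """Azimuthal separation, wrapped into (-pi, pi]."""
    dphi = np.arctan2(p1[2], p1[1]) - np.arctan2(p2[2], p2[1])
    return (dphi + np.pi) % (2 * np.pi) - np.pi

jet1 = np.array([100.0, 30.0, 40.0, 80.0])    # hypothetical jets (E, px, py, pz)
jet2 = np.array([80.0, -25.0, -10.0, 70.0])
lead_pt = pt(jet1)             # analogue of the string "j[0].pt"
dphi_jj = deltaphi(jet1, jet2) # analogue of "j[0].deltaphi(j[1])"
```

In an actual MadMiner analysis these expressions are evaluated on the reconstructed Delphes objects rather than on hand-written arrays.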
C. Sample unweighting and augmentation

If multiple different samples were created, for instance for different processes or phase-space regions, they should now be combined into a single MadMiner file and shuffled by calling the combine_and_shuffle() function.

In the third step of the analysis workflow, the event information in the MadMiner file is processed into training data for the different algorithms described in the previous section. This consists of two aspects: first, the event data needs to be reweighted to the parameter points θ (and/or θ_0, θ_1, θ_ref) that make up the training data and then unweighted. Second, the joint likelihood ratio r(x, z) and the joint score t(x, z) need to be calculated for each unweighted event.

This is implemented in the SampleAugmenter class. It provides a set of six high-level functions that generate and augment the data for the different types of inference techniques. For instance, sample_train_local() generates training samples for score estimators (the Sally and Sallino techniques), while sample_train_ratio() prepares training data for likelihood ratio estimators. The output of all these functions is a set of plain NumPy [98] arrays. The rows of these two-dimensional arrays are the events; the columns correspond to the observables that characterize the event data (in the order in which the observables were defined in the DelphesReader or LHEReader classes), the parameter points according to which they are sampled, and the components of the joint score, respectively.

D. Machine learning

It is finally time to train neural networks to estimate the likelihood, likelihood ratio, or score, as discussed in Sec. II. This is implemented in the classes LikelihoodEstimator, ParameterizedRatioEstimator, DoubleParameterizedRatioEstimator, and ScoreEstimator.
This training is independent of the external Monte Carlo simulations and even of the MadMiner file, which makes it easy to run it on an external system with GPU support.

During initialization of any of these classes, the network architecture is chosen. Currently, MadMiner supports fully connected feed-forward networks with a variable number of layers, hidden units, and activation functions, implemented in PyTorch [85]. A call to train() starts the training; keywords specify which loss function to use, the location of the training data generated in the previous step, the optimizer, the learning rate schedule, the batch size, and whether early stopping is used. After training, save() saves the neural network to files. The estimators are evaluated for arbitrary parameter points and observables with evaluate_log_likelihood(), evaluate_log_likelihood_ratio(), or evaluate_score(). For many users, the estimates returned by these functions will be the final output of MadMiner, and the statistical analysis will be performed externally.

We also provide the Ensemble class, a convenient wrapper that allows training an ensemble of multiple neural networks. The different instances can have identical or different architectures, and the training can be performed on the same or resampled training data. Such an ensemble is useful for consistency checks and uncertainty estimation as discussed in Sec. II E.

E. Inference

MadMiner provides a barebones framework for the statistical analysis: the AsymptoticLimits class. After initializing it with the MadMiner file, the two high-level functions expected_limits() and observed_limits() calculate expected and observed p-values over a grid in parameter space. expected_limits() takes as input the parameter point that is assumed to be true and internally generates a so-called "Asimov" data set [71], a large simulated set of events.
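Both limit-setting functions ultimately convert a likelihood-ratio test statistic into p-values via its asymptotic distribution. For a single parameter of interest, that conversion can be sketched in a standalone one-liner using the χ² survival function with one degree of freedom:

```python
import math

# Wilks' theorem: with one parameter of interest, q = -2 log(likelihood ratio)
# asymptotically follows a chi^2 distribution with one degree of freedom,
# whose survival function P(chi^2_1 > q) equals erfc(sqrt(q / 2)).
def pvalue_one_parameter(q):
    return math.erfc(math.sqrt(q / 2.0))
```

For example, the familiar one-parameter threshold q = 3.841 yields a p-value of 0.05, i.e. exclusion at the 95% confidence level.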
observed_limits(), on the other hand, is directly based on a list of events, which the user can take from simulations or from actual measured data. Both methods can estimate the kinematic likelihood either through histograms of kinematic variables, through histograms of the estimated score from a trained ScoreEstimator instance, or through a trained likelihood (ratio) estimator. p-values are calculated with a likelihood ratio test, using the asymptotic distribution of the likelihood ratio as described by Wilks' theorem [69-71]. The AsymptoticLimits class currently does not support systematic uncertainties. We are planning to interface MadMiner with existing software packages that implement profile likelihood ratio tests.

F. Fisher information

As discussed in Sec. II C, a convenient and powerful summary of the sensitivity of a measurement is the Fisher information matrix. Its calculation is implemented in the FisherInformation class. Most importantly, full_information() calculates the Fisher information based on a ScoreEstimator instance as given in Eq. (16). Several other functions allow the user to calculate the Fisher information in the cross section only (i.e. the first term of Eq. (16)), the Fisher information in the histogram of one or two kinematic variables, and finally the truth-level Fisher information, which treats all properties of the parton-level particles as observable. Finally, the function histogram_of_information() allows the user to calculate the distribution of the Fisher information over phase space, as introduced in Ref. [9].

In the presence of systematic uncertainties and in a frequentist setup, nuisance parameters can either be neglected ("projected out") or conservatively taken into account ("profiled out"). These operations are implemented in the functions project_information() and profile_information().
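Both operations are standard matrix manipulations of the information matrix: projecting keeps only the physics block, while profiling takes the Schur complement with respect to the nuisance block. A self-contained sketch with a toy 2×2 information matrix (the function names mirror MadMiner's, but the implementation here is illustrative):

```python
import numpy as np

def project_information(info, n_phys):
    """Neglect nuisance parameters: keep only the physics-parameter block."""
    return info[:n_phys, :n_phys]

def profile_information(info, n_phys):
    """Profile out nuisance parameters: Schur complement of the nuisance block."""
    i_tt = info[:n_phys, :n_phys]
    i_tn = info[:n_phys, n_phys:]
    i_nn = info[n_phys:, n_phys:]
    return i_tt - i_tn @ np.linalg.inv(i_nn) @ i_tn.T

# Toy information: one physics parameter (first row/column), one nuisance parameter.
info = np.array([[4.0, 1.0],
                 [1.0, 2.0]])
projected = project_information(info, 1)   # [[4.0]]
profiled = profile_information(info, 1)    # [[4.0 - 1^2/2.0]] = [[3.5]]
```

Profiling never increases the information (here 3.5 versus 4.0), reflecting the fact that it accounts for nuisance parameters conservatively.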
IV. PHYSICS EXAMPLE

We demonstrate the use of MadMiner in the measurement of dimension-six operators in tth production at the high-luminosity run of the LHC. We choose to analyze fully leptonic top decays and a Higgs decay into two photons,

p p \to t \bar{t} h \to (b \ell^+) \, (\bar{b} \ell^-) \, (\gamma\gamma) \, E_T^{miss}   (20)

with ℓ = e, μ. While this particular signature is not expected to be the most sensitive channel, for example when compared to either semi-leptonic tth production or Higgs production in gluon fusion, it provides a high-dimensional final state with a non-trivial missing energy signature, illustrating the features and challenges that MadMiner can address. We consider three different scenarios:

Illustration: We first illustrate the mechanism behind the inference techniques in MadMiner in a one-dimensional version of the problem, restricting the analysis to one parameter and one observable.

Validation: MadMiner is then validated in a parton-level toy scenario. By not letting the W bosons decay and ignoring the effect of shower and detector on observables, we can calculate the true likelihood function and compare the output of the neural networks to a ground truth.

Physics analysis: Finally, we perform a realistic phenomenological analysis, including the effects of parton shower and detector and considering a three-dimensional parameter space and high-dimensional event data.

All three analyses are performed with MadMiner v0.4 following the workflow outlined in the previous section. Events are generated with MadGraph5_aMC@NLO at leading order for √s = 14 TeV using the PDF4LHC15_nlo parton distribution function [99]. We normalize the rates to the NLO predictions [100] with a phase-space-independent k-factor. We consider the Standard Model Lagrangian supplemented with dimension-six operators in the SILH basis [101], as implemented in the HEL FeynRules model [102].
Otherwise, the simulation setup is different for each of the three scenarios. We summarize the main settings in Tbl. II and discuss them in each of the following sections.

A. Illustration of analysis techniques

Our first analysis aims to illustrate how MadMiner calculates the likelihood function in a simplified one-dimensional version of the problem. For this we restrict ourselves to a single dimension-six operator,

\mathcal{L} = \mathcal{L}_{SM} + c_G \mathcal{O}_G   (21)

with

\mathcal{O}_G = \frac{g_s^2}{m_W^2} (H^\dagger H) \, G_{\mu\nu}^a G_a^{\mu\nu} .   (22)

                          Illustration               Validation          Physics analysis
Operators                 O_G                        O_u, O_G, O_uG      O_u, O_G, O_uG
Initial states            pp                         gg                  pp
Final state               (bl+)(bl-)(yy) ETmiss      (bW+)(bW-)(yy)      (bl+)(bl-)(yy) ETmiss
Background                yes                        --                  yes
Shower simulation         Pythia                     --                  Pythia
Detector simulation       Delphes                    --                  Delphes
Observables               1: pT,yy                   80                  48
Systematic uncertainties  --                         --                  PDF, scale

Table II. The three scenarios in which we analyze the tth process.

This operator induces an additional contribution to the effective Higgs-gluon coupling, g_ggh → g_ggh (1 + 192π²/g² × c_G), and therefore affects the kinematic distributions [103]. We define the theory parameter as θ = 100 c_G, which is dimensionless and of order unity over the parameter range of interest. θ = 0 then corresponds to the SM, and any deviation from zero to a new physics effect. The squared matrix element consists of an SM contribution, an interference term linear in θ, and a squared dimension-six amplitude proportional to θ², and we can use a morphing technique to interpolate event weights and cross sections from three benchmarks (or morphing basis points) to any point in parameter space.

In this illustration setup we also restrict the analysis to a single observable x = p_T,γγ, the transverse momentum of the di-photon system. All other observables are treated as if they were unobservable.
Together with physically unobservable degrees of freedom (such as neutrino energies) as well as random variables in the simulation of the shower, hadronization, and detector, they form the set of latent variables z. This setup is similar to a histogram-based analysis of c_G using only the p_T,γγ histogram.

We generate events with MadGraph5_aMC@NLO as described above. They are then showered and hadronized through Pythia 8. The detector response is simulated with Delphes 3 using the HL-LHC card suggested by the HL/HE-LHC working group [104].

1. Signal only

In the sampling and data augmentation step (the third box in Fig. 1), MadMiner creates training samples where each simulated event is characterized by values of the observable x = p_T,γγ and the (unobservable) latent variables z. Additionally, for each event MadMiner calculates the joint likelihood ratio r(x, z | θ) between the parameter point θ and a reference point θ_ref, which we take to be the SM. It also calculates the joint score t(x, z | θ) evaluated at the parameter point θ.

This is illustrated in Fig. 2. The blue dots and orange triangles in the left panel show the joint log likelihood ratio log r(x, z | θ) with its dependence on the observable x = p_T,γγ. The blue dots show tth events sampled according to the SM (with θ_ref = 0), while the orange triangles are sampled from a BSM hypothesis with θ = 1 (or c_G = 0.01). We can see that there are more high-p_T events for the BSM model than for the SM, and hence the joint likelihood ratio is higher. The large vertical scatter in the joint likelihood ratio is caused by the presence of the latent variables z, which affect the joint likelihood ratio but are unobservable.
Figure 2. Illustration of the analysis techniques in a one-dimensional problem. Left: Joint log likelihood ratio as a function of the observable p_T,γγ for tth signal events sampled according to the SM (blue dots) and a BSM theory with θ = 100 c_G = 1 (orange triangles). The solid line shows the estimated log likelihood ratio from an Alices model trained only on p_T,γγ as input observable. Right: Joint log likelihood ratio (arrow position) and joint score (arrow slope) as a function of the model parameter θ = 100 c_G, for tth signal events in the range p_T,γγ = (300 ± 2.5) GeV. The solid line shows the estimated log likelihood ratio from an Alices model trained only on p_T,γγ as input observable and evaluated at p_T,γγ = 300 GeV.

In the right panel of the same figure, the arrows show the joint log likelihood ratio log r(x, z | θ) (arrow position) and the joint score t(x, z | θ) (arrow slope) with their dependence on the theory parameter θ. Here the observable is constrained to the range p_T,γγ = (300 ± 2.5) GeV to suppress the observable dependence.

Estimating the likelihood ratio with the methods described in Sec. II B (and in more detail in Ref. [61]) essentially means fitting a function r̂(x | θ) to the joint likelihood ratio r(x, z | θ) by numerically minimizing a suitable loss function. In this process, the unobservable latent variables z are effectively integrated out.
This is the gist of the machine learning step of the MadMiner workflow (box four in Fig. 1). The result of this step, the estimated log likelihood ratio r̂(x | θ) based on the Alices method, is shown as the solid black lines in Fig. 2: the left panel illustrates the x dependence for fixed θ = 1, the right panel the θ dependence for fixed x. While it is possible to estimate the likelihood ratio using only the joint likelihood ratio as input, the gradient information in the joint score provides additional guidance, which often allows the fit to converge with less data.

2. Adding backgrounds

So far we have only considered the tth signal process. How does this picture change when we include backgrounds? We answer this question in the left panel of Fig. 3, where in addition to the signal we now include the dominant background, continuum t t̄ γγ production with leptonically decaying tops. As before, the circles show the joint log likelihood ratio log r(x, z | θ) and the line denotes the estimated log likelihood ratio function log r̂(x | θ).

Since signal and background populate different phase-space regions, the interference between them is negligible and we can consistently simulate them separately from each other. This means that every simulated event is labeled either as a signal or a background event, which plays the role of a discrete variable in the set of latent variables z.
Figure 3. Illustration of the analysis techniques in a one-dimensional problem. Left: Joint log likelihood ratio as a function of the observable p_T,γγ for tth signal events (green dots) and ttγγ background events (red triangles) sampled according to the SM and a BSM theory with θ = 100 c_G = 1. The background events cluster at a constant value of −0.78, as explained in the text. The lines show the estimated log likelihood ratio based on the Alices method trained only on p_T,γγ (black solid) and a p_T,γγ histogram with 5 (orange solid) and 100 (blue dashed) bins, respectively. Right: Joint score evaluated at the SM for tth signal (green dots) and ttγγ background events (red triangles). The background events cluster at a constant value of −0.58, as explained in the text. The lines show the estimated score obtained using a Sally method trained only on p_T,γγ (black solid) and a p_T,γγ histogram with 5 (orange solid) and 100 (blue dashed) bins, respectively.

The background event weights are unaffected by the EFT operator O_G, so the joint likelihood ratio for these events is independent of x and z:

r(x,z|\theta)_{\mathrm{background}} = \frac{p(x,z|\theta)}{p(x,z|\theta_{\mathrm{ref}})} = \frac{\mathrm{d}\sigma(z_p|\theta)}{\mathrm{d}\sigma(z_p|\theta_{\mathrm{ref}})} \, \frac{\sigma(\theta_{\mathrm{ref}})}{\sigma(\theta)} = \frac{\sigma(\theta_{\mathrm{ref}})}{\sigma(\theta)} ,   (23)

which in our case turns out to be

\log r(x,z|\theta)_{\mathrm{background}} = -0.78 =: \log r^* .   (24)
(24)

This is clearly visible in the left panel of Fig. 3, where the tt̄γγ events show up as a horizontal line at this value. While the presence of backgrounds does not affect the fundamental validity of the inference technique, it increases the variance of the joint likelihood ratio around the true likelihood ratio, so that more training events are required before the neural network converges on the true likelihood ratio function.

In this simple example with one-dimensional observations x, we can validate the Alices predictions with histograms. The histogram approximation for the likelihood ratio is ˆr(x | θ) = [σ_bin(θ)/σ(θ)] / [σ_bin(θ_ref)/σ(θ_ref)], where σ_bin(θ) is the cross section in the bin corresponding to x. In the left panel of Fig. 3, the log likelihood ratio based on a histogram with 5 (100) equally sized bins is shown as a solid orange (dashed blue) line. It generally agrees excellently with the Alices prediction. The two histogram lines show the trade-off in the number of bins: while too few bins lead to large binning effects, a large number of bins can lead to large fluctuations due to limited Monte-Carlo statistics. In contrast, the MadMiner techniques based on neural networks learn the correct continuum limit, equivalent to an infinite number of histogram bins, without suffering from large fluctuations.

In Sec. II C we described an alternative approach in which MadMiner calculates the score, a vector of summary statistics that are statistically optimal close to a reference parameter point such as the SM. We illustrate this Sally technique in the right panel of Fig. 3. The green circles show the joint score t(x, z) at the SM reference point, corresponding to the change of the log likelihood when infinitesimally increasing the sole theory parameter θ = 100 c_G.
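The histogram approximation described above is straightforward to code; a minimal numpy sketch with made-up event samples standing in for the weighted Monte-Carlo events (distributions and bin edges are illustrative, not the analysis data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy event samples for two hypotheses (stand-ins for MC samples).
x_theta = rng.exponential(scale=120.0, size=100_000)  # "theta" sample
x_ref = rng.exponential(scale=100.0, size=100_000)    # reference sample

# sigma_bin / sigma is the fraction of the cross section in each bin.
bins = np.linspace(0.0, 500.0, 6)  # 5 equal-width bins
counts_theta, _ = np.histogram(x_theta, bins=bins)
counts_ref, _ = np.histogram(x_ref, bins=bins)
frac_theta = counts_theta / len(x_theta)
frac_ref = counts_ref / len(x_ref)

# Histogram estimate of the likelihood ratio per bin:
# r_hat(x | theta) = [sigma_bin(theta)/sigma(theta)] / [sigma_bin(ref)/sigma(ref)]
r_hat = frac_theta / frac_ref

def r_hat_of_x(x):
    """Evaluate the binned likelihood-ratio estimate at an observable value."""
    idx = np.clip(np.digitize(x, bins) - 1, 0, len(r_hat) - 1)
    return r_hat[idx]
```

Increasing the number of bins in this sketch reproduces the trade-off discussed in the text: finer binning reduces the binning bias but amplifies the statistical fluctuations in each bin.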
In analogy to the log likelihood ratio, the red points clustering at a horizontal line

    t(x, z)|_background = −0.58 =: t*   (25)

correspond to the tt̄γγ background events. Estimating the score function conceptually corresponds to fitting a function ˆt(x) to the joint score data t(x, z) by numerically minimizing an appropriate loss function. The resulting score estimator ˆt(x) is shown as a solid black line. Again, in this one-dimensional case we can compare the result to the score estimated through a histogram, which is shown as a solid red (dashed green) line for a histogram with 5 (100) bins. We find excellent agreement between the Sally prediction and the histogram results.

B. Validation at parton level

Next, we validate MadMiner in a setup in which we can calculate a ground truth for the output of the algorithms. This is not trivial because the ground truth (the true likelihood, likelihood ratio, or score) is intractable in realistic situations. In the last section we showed how we can use histograms to check the algorithms, but only when limiting the analysis to one or two observables. We now turn to another approximation in which we can access the true likelihood ratio and score even though both observables and model parameters are high-dimensional: following Refs. [60, 61], we consider a truth-level scenario in which all latent variables are also observable, x = z. In this case the likelihood ratio r(x) is equal to the joint likelihood ratio r(x, z) and the score t(x) is equal to the joint score t(x, z). We can thus compare the predictions of a neural network trained to estimate either of these quantities to a ground truth.

For this validation we choose the parton-level process

    gg → tt̄h → (b W⁺)(b̄ W⁻)(γγ).
(26)

We do not let the W bosons decay and assume that the four-momenta and flavors of all initial-state and final-state particles can be measured, i.e. we do not simulate the effect of parton shower and detector response. These truth-level approximations are not necessary for the inference techniques in MadMiner, but they allow us to calculate a ground truth for the likelihood ratio and score, which is not possible for any realistic treatment of neutrinos or modeling of parton shower and detector response.

Following Ref. [103], we consider three dimension-six operators affecting the top and Higgs couplings in tth production:

    L = L_SM + c_u O_u + c_G O_G + c_uG O_uG,   (27)

where the operators are defined as

    O_u  = − (1/v²) (H†H)(H† Q̄_L) u_R,
    O_G  = (g_s²/m_W²) (H†H) G^a_{μν} G_a^{μν},
    O_uG = − (4 g_s/m_W²) y_u (H† Q̄_L) σ^{μν} T^a u_R G^a_{μν}.   (28)

Figure 4. Validation of the analysis techniques in a parton-level analysis, treating the momenta and flavours of all initial-state and final-state partons as observable. Left: Validation of score estimation with the Sally method: estimated vs. true score component t_2(x) evaluated at the SM. Right: Validation of likelihood ratio estimation with the Alices technique: estimated vs. true log likelihood ratio log r(x | θ). The numerator parameter points θ are drawn from a multivariate Gaussian with mean (0, 0, 0) and covariance matrix diag(0.2², 0.2², 0.2²) as an example of a relevant region of parameter space.

The O_u operator effectively rescales the top Yukawa coupling as y_t → y_t × (1 + 3/2 × c_u), essentially rescaling the overall rate of the tth process.
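Since O_u only rescales the top Yukawa coupling, its leading effect on the rate can be checked with one line of arithmetic. In a toy model where the cross section scales with the square of the rescaled coupling (an illustrative simplification that ignores other contributions), c_u = 0 and c_u = −4/3 predict the same total rate, which is the rate-only degeneracy encountered later in the analysis:

```python
# Toy check of the rate degeneracy induced by O_u: if the tth cross
# section scaled exactly with the square of the rescaled top Yukawa
# coupling (an illustrative assumption), then c_u = 0 and c_u = -4/3
# would be indistinguishable from the total rate alone.

def yukawa_rescaling(c_u):
    """y_t -> y_t * (1 + 3/2 * c_u), as quoted in the text."""
    return 1.0 + 1.5 * c_u

def toy_rate_ratio(c_u):
    """sigma(c_u) / sigma(SM) under the naive |y_t|^2 scaling assumption."""
    return yukawa_rescaling(c_u) ** 2

sm_rate = toy_rate_ratio(0.0)              # 1.0
flipped_rate = toy_rate_ratio(-4.0 / 3.0)  # also 1.0: sign-flipped Yukawa
```

The sign flip at c_u = −4/3 leaves the squared coupling, and hence this toy rate, unchanged; only kinematic information can break the degeneracy.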
As discussed in the previous section, the O_G operator induces an additional contribution to the effective Higgs–gluon coupling, g_ggh → g_ggh (1 + 192π²/g² × c_G), and thus changes the kinematic distributions. Finally, the O_uG operator corresponds to a top-quark chromo-dipole moment, which modifies the gtt vertex. It also induces new effective ggtt, gtth, and ggtth couplings, promising new kinematic features. As theory parameters we define the vector

    θ = (θ_1, θ_2, θ_3)ᵀ = (c_u, 100 c_G, 100 c_uG)ᵀ.   (29)

Two of the Wilson coefficients are rescaled by a factor of 100 to ensure that typical values of the three parameters are of the same size. As in most EFT analyses, the squared matrix element factorizes as described in Eq. (17), and we can use a morphing technique to interpolate event weights and cross sections from nine benchmarks (or morphing basis points) to any point in parameter space.

Based on a sample of 1.25 · 10⁶ events, we train a likelihood ratio estimator with the Alices technique and a score estimator with the Sally method. We show the results in Fig. 4. The left panel shows the correlation between the true and estimated score based on the Sally technique, focusing on the score component t_2(x) that corresponds to the theory parameter θ_2 = 100 c_G (with similar results for the other components). In the right panel we compare the estimated likelihood ratio based on the Alices method to the ground truth, with parameter points drawn from a region of parameter space that could be of interest in a typical analysis. In both cases we find that the predictions of the neural network are very close to the true values, confirming that the MadMiner algorithms work correctly in this truth-level scenario.

C.
Realistic physics analysis

Finally, we analyse the new-physics reach of the tth process in a realistic setup with high-dimensional event data and theory parameters. We consider the three dimension-six operators given in Eqs. (27) and (28) and define the theory parameter space as in Eq. (29). In addition to the tth signal we again include the dominant background, continuum tt̄γγ production with leptonically decaying tops. We take into account that this process is sensitive to the theory parameter c_uG through the modified gtt and ggtt vertices, while being independent of c_u and c_G. We neglect subleading backgrounds, in particular those with fake photons or fake leptons.

The event generation follows the discussion in Sec. IV A; we simulate the parton shower with Pythia 8 and the detector response with Delphes 3 using the HL-LHC detector setup. We now also take into account PDF and scale uncertainties, using the 30 eigenvectors of the PDF4LHC15_nlo_30 PDF set and independently varying the renormalization and factorization scales by a factor of 2.

The event data is described by 48 observables, which include the four-momenta of all reconstructed final-state objects (photons, leptons, and jets) and the missing energy, as well as derived quantities such as the reconstructed transverse momentum of the di-photon system p_T,γγ. We require the events to pass a di-photon mass cut 115 GeV < m_γγ < 135 GeV and to pass one of four triggers, which were adopted from the Delphes default trigger card: the mono-photon trigger (p_T,γ > 80 GeV), the di-photon trigger (p_T,γ1 > 40 GeV and p_T,γ2 > 20 GeV), the mono-lepton trigger (p_T,ℓ > 29 GeV), or the di-lepton trigger (p_T,ℓ1 > 17 GeV and p_T,ℓ2 > 17 GeV). For an anticipated integrated luminosity of L = 3 ab⁻¹ at the HL-LHC, we expect 24.5 tth SM signal and 33.6 tt̄γγ background events to pass these acceptance and selection cuts.

We simulate 1.
5 · 10⁶ signal and 10⁶ background events (after all cuts) and extract training samples with 10⁷ unweighted events. We then train neural networks to estimate the score or likelihood ratio by minimizing the Sally and Alices loss functions, the latter with a hyperparameter α = 0.1. We use fully connected neural networks with three hidden layers of 100 units and tanh activation functions. We minimize the loss functions with the Adam optimizer over 50 epochs with a batch size of 128 and a learning rate that decays exponentially from 10⁻³ to 10⁻⁵, and use early stopping to avoid overtraining. These hyperparameters are the result of a coarse hyperparameter scan, though we did not perform an exhaustive optimization. In the final step, we calculate expected exclusion limits and Fisher information matrices. We compare the results of the new methods to a baseline histogram analysis of the transverse momentum of the di-photon system p_T,γγ, and to an analysis of the total cross section alone.

1. Fisher information

Following our recommendations from Sec. II F, we start our physics analysis by using the Sally technique, training a neural network to estimate the score at the SM. We then use it to calculate the SM Fisher information I_ij as described in Sec. II D, finding

    I_ij = ( 140.5   68.1  170.6 )
           (  68.1   47.1  105.7 )
           ( 179.5  105.7  283.3 ).   (30)

This simple matrix summarizes the sensitivity of the measurement to all three operators. In particular, it allows us to calculate the squared Fisher distance d²(θ, θ_ref) = I_ij(θ_ref)(θ − θ_ref)_i (θ − θ_ref)_j. As long as θ is sufficiently close to θ_ref, d² approximates (−2) times the expected log likelihood ratio between θ and θ_ref. That, in turn, can be directly translated into an expected p-value with which θ can be excluded if θ_ref is true, using the asymptotic properties of the likelihood ratio [69–71].
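Given the Fisher information, expected distances in parameter space reduce to linear algebra. A minimal sketch using the matrix quoted above (symmetrized to absorb rounding differences between the printed off-diagonal entries; the displaced test point is an arbitrary example):

```python
import numpy as np

# Fisher information at the SM, as quoted in Eq. (30); symmetrize to
# absorb rounding differences in the printed off-diagonal entries.
info = np.array([
    [140.5, 68.1, 170.6],
    [68.1, 47.1, 105.7],
    [179.5, 105.7, 283.3],
])
info = 0.5 * (info + info.T)

def fisher_distance(theta, theta_ref, information):
    """d(theta, theta_ref) = sqrt(delta^T I delta), valid near theta_ref."""
    delta = np.asarray(theta) - np.asarray(theta_ref)
    return float(np.sqrt(delta @ information @ delta))

# Example: a point displaced only in c_u.
d = fisher_distance([0.1, 0.0, 0.0], [0.0, 0.0, 0.0], info)
# d^2 = 140.5 * 0.1^2 = 1.405, i.e. d ~ 1.19, inside the two-parameter
# 68% CL contour at d = 1.509 quoted in the text.
```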
Figure 5. Realistic physics analysis. Left: Expected 68% CL limits in the c_G–c_u plane based on the Fisher information in the rate (gray), the kinematic information (dashed blue), and their combination (solid blue). The kinematic information is calculated based on the Sally technique. The shaded error bands show the ensemble variance of a set of 10 independently trained neural networks. We set c_uG to zero. Right: Expected 68% CL limits in the c_uG–c_G plane based on the Fisher information for the rate (gray), a p_T,γγ histogram (green), and the full multivariate information based on Sally (blue). The dashed (solid) lines show the reach without (with) systematic uncertainties. The shaded error bands show the ensemble variance of a set of 10 independently trained neural networks. c_u is set to zero.

In the following we use the Fisher information to calculate expected limits on a combination of two theory parameters, while fixing the remaining theory parameter to its SM value. In this case, the 68% confidence level contours correspond to a local Fisher distance d = 1.509 (95% CL corresponds to d = 2.447, 99% CL to d = 3.034). We show the resulting expected 68% CL contours in the c_G–c_u plane as the solid blue line in the left panel of Fig. 5.

The Fisher information formalism makes it easy to dissect these results a little. First, Eq. (16) shows that we can separate the full Fisher information into a rate term and kinematic information. We show this separation in the left panel of Fig. 5 by separately plotting the expected limits corresponding to the rate information (gray), the kinematic information (dashed blue), and their combination (solid blue).
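The quoted contour levels follow from the asymptotic χ² distribution of the likelihood ratio with two degrees of freedom, for which the quantile has a closed form. A quick stdlib-only check:

```python
import math

def fisher_distance_threshold(confidence_level, dof=2):
    """Fisher distance d at which a contour for `dof` fitted parameters
    reaches the given confidence level. For dof = 2 the chi^2 quantile
    is exactly -2 ln(1 - CL), so d = sqrt(-2 ln(1 - CL))."""
    if dof != 2:
        raise NotImplementedError("closed form used here is for dof = 2")
    return math.sqrt(-2.0 * math.log(1.0 - confidence_level))

# Reproduces the values quoted in the text:
d68 = fisher_distance_threshold(0.68)  # ~ 1.51
d95 = fisher_distance_threshold(0.95)  # ~ 2.45
d99 = fisher_distance_threshold(0.99)  # ~ 3.03
```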
We find that kinematic information is crucial for this channel. Since a rate measurement only provides a single number, at the level of the Fisher information it can only constrain one direction in theory space and is blind in the remaining direction. This degeneracy is broken once additional information from the kinematic distributions is included. Indeed, the kinematic information can constrain the rate-sensitive direction in theory space almost as well as the rate itself.

Another aspect that can be conveniently discussed in the Fisher information framework is systematic uncertainties. MadMiner can take PDF and scale uncertainties into account by parameterizing them with nuisance parameters and then profiling over them, which at the level of the Fisher information is a simple matrix operation [9, 78]. In the right panel of Fig. 5 we analyze the impact of these uncertainties. The dashed lines show the expected limits neglecting systematic uncertainties, while the solid lines show results that take systematics into account by profiling over nuisance parameters. We again distinguish between the Fisher information in the rate (gray), the Fisher information in a p_T,γγ histogram (green), and the full information based on a neural score estimator (blue). We can see that the presence of systematic uncertainties, which are dominated

Figure 6. Realistic physics analysis. Left: Differential cross section (shaded grey) and distribution of the Fisher information components (lines) over p_T,γγ.
Right: Score component t_2(x) corresponding to the Wilson coefficient c_G as a function of the di-photon mass m_γγ and the di-photon transverse momentum p_T,γγ. Note that events in background-dominated regions cluster at the value t* = −0.58, as discussed in Sec. IV A.

by the scale uncertainty, mainly reduces the sensitivity in the rate-sensitive direction. The effect of systematic uncertainties is more pronounced for the information in the total rate and in the p_T,γγ histogram. The full, multivariate information is reduced mostly in the rate-sensitive direction in parameter space, while the information in the orthogonal direction (to which the rate analysis is blind) is affected only slightly.

The results in both panels of Fig. 5 do not just include central predictions for each Fisher information or contour, but also shaded error bands. These bands visualize the variance of an ensemble of 10 score estimator instances, each trained on resampled training samples with independent random seeds. The bands show 2σ variations, where σ is the ensemble standard deviation for a prediction. The small width of these bands signals a passed sanity check; a larger width would be an indicator of numerical issues during training or insufficient training data.

In the discussion so far we have focused on the total Fisher information integrated over phase space, which is related to the expected exclusion limits. There is another useful aspect of the Fisher information: we can analyse its distribution over kinematic variables to identify the important phase-space regions for a measurement [9]. This knowledge can then be used to design and optimize the event selection.⁶ As an example we consider the distribution of information over the di-photon transverse momentum p_T,γγ, which is shown in the left panel of Fig. 6.
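Returning briefly to the profiling of nuisance parameters discussed above: in the Gaussian approximation, the profiled Fisher information in the physics parameters is the Schur complement of the nuisance block. A hedged sketch with an invented two-parameter, one-nuisance example (this is the standard formula, not MadMiner's internal code):

```python
import numpy as np

# Joint Fisher information for (theta_1, theta_2, nu), where nu is a
# nuisance parameter; the numbers are invented for illustration.
info_full = np.array([
    [10.0, 2.0, 3.0],
    [2.0, 8.0, 1.0],
    [3.0, 1.0, 5.0],
])

n_physics = 2
i_tt = info_full[:n_physics, :n_physics]  # physics block
i_tn = info_full[:n_physics, n_physics:]  # physics-nuisance block
i_nn = info_full[n_physics:, n_physics:]  # nuisance block

# Profiling over nu: Schur complement of the nuisance block.
info_profiled = i_tt - i_tn @ np.linalg.inv(i_nn) @ i_tn.T

# Profiling can only lose information: i_tt - info_profiled is
# positive semi-definite, so its eigenvalues are non-negative.
eigenvalues = np.linalg.eigvalsh(i_tt - info_profiled)
```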
The shaded grey areas show the differential cross sections for the tt̄γγ background and the SM tth signal. The three colored lines show the normalized distributions of the diagonal elements of the Fisher information. We find that the information on O_u, the operator that just rescales the overall tth rate, peaks at 100 GeV, marking the optimal compromise between a good signal-to-background ratio and a large rate. For O_uG and in particular O_G, the information is shifted further towards the high-energy tail of the distribution, where the kinematic effects of these operators are large.

In the right panel of Fig. 6 we illustrate the relation between the score and kinematic variables and show how the score itself can be used to identify the most sensitive region of phase space. We show the score component t_2(x), corresponding to the Wilson coefficient c_G, as a function of the di-photon mass m_γγ and the di-photon transverse momentum p_T,γγ. While the m_γγ distribution for the

⁶ Similarly, important phase-space regions can also be identified using the log likelihood ratio directly [105–107].

Figure 7. Realistic physics analysis. Left: Comparison of the expected limits in the c_G–c_u plane at 68% CL. We show the limits based on the full likelihood ratio estimated with the Sally (solid blue) and Alices (solid red) methods as well as approximate limits based on the Fisher information calculated with Sally (dashed blue). The dotted black lines indicate where the dimension-six squared terms contribute 1% and 10% of the total cross section.
Right: Comparison of the expected 68% CL exclusion limits in the c_uG–c_G plane using the rate (gray), a p_T,γγ histogram with 20 bins (green), the Sally method trained with only p_T,γγ as input (blue), and an Alices likelihood ratio estimator trained with only p_T,γγ as input (red). The dashed limits only use kinematic distributions, while the solid curves include the rate measurement.

signal process does not depend on the Wilson coefficients, this variable is important in telling apart signal and background contributions. As discussed in Sec. IV A, background events are generated with a constant joint score t_2(x, z) = t* = −0.58. This is why in kinematic regions dominated by the background, for instance away from the Higgs mass peak, the estimated score approaches a constant value ˆt_2(x) ≈ t* = −0.58. Clusters of positive (negative) values of the score component correspond to phase-space regions that are enhanced (suppressed) when increasing c_G. The largest scores are observed for events around the Higgs peak with high p_T,γγ ≳ 100 GeV, showing the increased sensitivity of this high-energy region to the Wilson coefficient c_G. Note that while the score component is clearly related to the two variables shown here, it is not a simple function of m_γγ and p_T,γγ; the neural network instead learned a non-trivial function of the high-dimensional observable space.

2. Exclusion limits

So far we have calculated limits in a local approximation, in which non-linear effects of the theory parameters on the likelihood function are neglected and in which the Fisher information fully characterizes the expected log likelihood ratio as given in Eq. (14). Let us now go beyond this approximation and calculate exclusion limits based on the full likelihood function, including any non-linear effects.
In an analysis of effective dimension-six operators, the approach of the previous section corresponds to an analysis of the interference between the SM contribution and dimension-six effects, while in this section we also take into account the squared dimension-six amplitudes. We can thus draw conclusions about the relevance of the dimension-six squared terms by comparing the limits obtained using the Fisher information with those obtained using the full

Figure 8. Realistic physics analysis. Expected exclusion limits on the Wilson coefficients c_G vs. c_u with c_uG set to zero (left), and on c_uG vs. c_G with c_u set to zero (right). We show best-fit points and 68% CL limits based on the rate only (gray), a p_T,γγ histogram with 20 bins (green), the Sally technique (blue), and the Alices method (red).

likelihood ratio function. In the left panel of Fig. 7 we show the expected 68% CL contours for the parameter plane spanned by c_G and c_u. The solid red line shows the limits obtained using the Alices method, which directly estimates the likelihood ratio function. The Sally method (solid blue line) estimates the SM score vector; the components corresponding to c_G and c_u are used as observables, and the likelihood is calculated with two-dimensional histograms. Finally, the limits based on the local Fisher distance are shown as a dashed blue line. We can see that the limits obtained using the three methods do not fully agree, indicating the relevance of the dimension-six squared terms. Indeed, in the region of parameter space probed at 68% CL, these terms contribute between 1% and 10% of the total rate, as shown by the dotted black lines, and substantially more in the relevant high-energy region of phase space.
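The size of the squared terms can be read off directly from the polynomial dependence of the rate on a Wilson coefficient. A toy sketch with invented interference and squared-term coefficients (in the actual analysis these come from the morphing setup, not from the values below):

```python
# Toy parameterization of the total rate in one Wilson coefficient c:
#   sigma(c) / sigma_SM = 1 + a * c + b * c^2,
# where a is the SM-interference term and b the dimension-six squared
# term. The coefficients below are invented for illustration.
a, b = 0.8, 2.0

def squared_term_fraction(c):
    """Fraction of the total rate contributed by the dim-6 squared term."""
    total = 1.0 + a * c + b * c**2
    return (b * c**2) / total

# For small c the squared term is negligible and the interference-only
# (Fisher) limits agree with the full-likelihood limits ...
small = squared_term_fraction(0.05)
# ... while at larger c it reaches the percent level and beyond, and
# the two sets of limits start to differ.
large = squared_term_fraction(0.3)
```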
These multivariate results are compared to limits based on the analysis of just a single summary statistic in the right panel of Fig. 7. We analyse the p_T,γγ distribution with three methods: a histogram with 20 bins (pastel green), an Alices likelihood ratio estimator trained only on p_T,γγ as observable input (forest green), and a Sally estimator of the score trained only on p_T,γγ as input (turquoise). We also show limits based only on the total cross section (grey) and, for each of the three methods, based only on kinematic information (dashed lines). We find that the shape information in the p_T,γγ distribution (dashed green) is complementary to the rate information, and hence removes the blind directions of the pure rate measurement. In addition, the results from the three methods agree very well, providing a non-trivial cross-check of the three different approaches.

Finally, we collect the expected limits on the Wilson coefficients based on the different methods in Fig. 8. The left panel shows the c_G–c_u plane, the right panel the c_uG–c_G plane, while the parameter not shown is set to zero in both cases. In grey we show limits based on a cut-and-count analysis of the total rate. This approach only constrains one direction in theory space and is blind in the remaining directions. In particular, this rate-only analysis cannot distinguish between multiple disjoint best-fit regions, for instance between c_u = 0 and c_u = −4/3, which corresponds to a sign-flipped top Yukawa coupling and predicts the same total cross section. This degeneracy is broken once kinematic information is included. Even the simplest case, the histogram-based analysis of a single variable such as the di-photon transverse momentum p_T,γγ (green line), can substantially improve the sensitivity of the analysis. The blue and red lines show the expected limits from the new, machine-learning-based methods implemented in MadMiner.
In blue we show the sensitivity of the Sally technique, which uses the estimated score as a vector of locally optimal observables. We find clearly stronger limits: the score components are indeed more powerful observables than p_T,γγ. Finally, the red line shows the limits from the Alices method, in which a neural network learns the full likelihood ratio function throughout the entire theory parameter space. In contrast to Sally, it also guarantees optimal sensitivity further away from the SM reference point, provided that the network was trained successfully; indeed, the Alices technique leads to the strongest expected limits on the Wilson coefficients.

V. CONCLUSIONS

In this paper we introduced MadMiner, a Python package that implements a range of modern multivariate inference techniques for particle physics processes. These inference methods require running Monte-Carlo simulations and extracting additional information related to the matrix elements, using this information to train neural networks to precisely estimate the likelihood function, and constraining physics parameters based on this likelihood function with established statistical methods. MadMiner implements all steps in this analysis chain.

These inference techniques are designed for high-dimensional event data and do not require a choice of low-dimensional summary statistics. Unlike, for instance, the matrix element method, they model the effect of a realistic shower and detector simulation without requiring any approximations of the underlying physics. After an upfront training phase, events can be evaluated extremely fast, which can substantially reduce the computational cost compared to other methods. Finally, the efficient use of matrix element information reduces the number of simulated samples required for a successful training of the neural networks compared to other, physics-agnostic machine learning methods.
MadMiner currently provides interfaces to the simulators MadGraph5_aMC and Pythia 8 and to the fast detector simulation Delphes 3, which together form a state-of-the-art toolbox for phenomenological analyses. It supports almost any LHC process, arbitrary theory models, reducible and irreducible backgrounds, and systematic uncertainties based on PDF and scale variations. In the future, we are planning to extend MadMiner to support detector simulations based on Geant4 as well as new types of systematic uncertainties.

After discussing the inference techniques and their implementation, we provided a step-by-step guide through an analysis workflow with MadMiner. We then demonstrated the tool in an example analysis of three effective operators in tth production at the high-luminosity run of the LHC. The mechanism behind the inference techniques was illustrated in a one-dimensional case, and the methods were validated in a simplified parton-level setup where the true likelihood is tractable. We demonstrated how MadMiner lets us isolate the important phase-space regions and define optimal observables. Finally, we showed that, compared to analyses of the total rate and standard histograms, the machine-learning-based techniques lead to stronger expected limits on the effective operators. These results demonstrate that the techniques implemented in MadMiner have the potential to clearly improve the sensitivity of the LHC legacy measurements.

ACKNOWLEDGMENTS

We would like to thank Zubair Bhatti, Lukas Heinrich, Alexander Held, and Samuel Homiller for their important contributions to the development of MadMiner. We are grateful to Joakim Olsson for his help with the tth data generation. We also thank Pablo de Castro, Sally Dawson, Gilles Louppe, Olivier Mattelaer, Duccio Pappadopulo, Michael Peskin, Tilman Plehn, Josh Ruderman, and Leonora Vesterbacka for fruitful discussions.
Last but not least, we are grateful to the authors and maintainers of many open-source software packages, including Delphes 3 [65], Docker [108], Jupyter notebooks [109], MadGraph5_aMC [63], Matplotlib [110], NumPy [98], pylhe [111], Pythia 8 [112], Python [113], PyTorch [85], REANA [93], scikit-hep [114], scikit-learn [115], uproot [116], and yadage [117]. This work was supported by the U.S. National Science Foundation (NSF) under the awards ACI-1450310, OAC-1836650, and OAC-1841471. It was also supported through the NYU IT High Performance Computing resources, services, and staff expertise. JB and KC are grateful for the support of the Moore-Sloan data science environment at NYU. KC is also supported through the NSF grant PHY-1505463, while FK is supported by NSF grant PHY-1620638 and U.S. Department of Energy grant DE-AC02-76SF00515.

Appendix A: Frequently asked questions

Here we collect questions that are asked often, hoping to avoid misconceptions:

• Does the whole event history not change when I change parameters?
No. In probabilistic processes such as those at the LHC, any given event history is typically compatible with different values of the theory parameters, but might be more or less likely. By "event history" we mean the entire evolution of a simulated particle collision, ranging from the initial-state and final-state elementary particles through the parton shower and detector interactions to observables. The joint likelihood ratio and joint score quantify how much more or less likely one particular such evolution of a simulated event becomes when the theory parameters are varied.

• If the network is trained on parton-level matrix element information, how does it learn about the effect of shower and detector?
It is true that the "labels" that the networks are trained on, the joint likelihood ratio and joint score, are based on parton-level information.
But the inputs into the neural network are observables based on the full simulation chain, after parton shower, detector effects, and the reconstruction of observables. It was shown in Refs. [59–61] that the joint likelihood ratio and joint score are unbiased, but noisy, estimators of the true likelihood ratio and true score (including shower and detector effects). A network trained in the right way will therefore learn the effect of shower and detector. We illustrate this mechanism in Sec. IV A in a one-dimensional problem.

• Can this approach be used for signal-background classification?
Yes. In the simplest case, where the signal and background hypotheses do not depend on any additional parameters, the Carl, Rolr, or Alice techniques can be used to learn the probability of an individual event being signal or background. If there are parameters of interest such as a signal strength or the mass of a resonance, the score becomes useful and techniques such as Sally, Rascal, Cascal, and Alices can be more powerful. The techniques that use the joint likelihood ratio or score require less training data when the signal and background processes populate the same phase-space regions. If this is not the case, these methods still apply, but will not offer an advantage over the traditional training of binary classifiers.

• What if the simulations do not describe the physics accurately?
No simulator is perfect, but many of the techniques used for incorporating systematic uncertainties from mismodeling in the case of multivariate classifiers can also be used in this setting. For instance, often the effect of mismodeling can be corrected with simple scale factors and the residual uncertainty incorporated with nuisance parameters. MadMiner can handle such systematic uncertainties as discussed above.
If only particular phase-space regions are problematic, for instance those with low-energy jets, we recommend excluding these regions with suitable selection cuts. If the kinematic distributions are trusted, but the overall normalization is less well known, a data-driven normalization can be used. Of course, there is no silver bullet, and if the simulation code is not trustworthy at all for a particular process and the uncertainty cannot be quantified with nuisance parameters, these methods (and many more traditional analysis methods) will not provide accurate results.

• Is the neural network a black box? Neural networks are often criticized for their lack of explainability. It is true that the internal structure of the network is not directly interpretable, but in MadMiner the interpretation of what the network is trying to learn is clearly connected to the matrix element. In practical terms, one of the challenges is to verify whether a network has been trained successfully. For that purpose, many cross-checks and diagnostic tools are available:

– checking the loss function on a separate validation sample;
– training multiple network instances with independent random seeds, as discussed above;
– checking the expectation values of the score and likelihood ratio against their known true values, see Ref. [61];
– varying the reference hypothesis in the likelihood ratio, see Ref. [61];
– training classifiers between data reweighted with the estimated likelihood ratio and original data from a new parameter point, see Ref. [61];
– validating the inference techniques in low-dimensional problems with histograms, see Sec. IV A;
– validating the inference techniques on a parton-level scenario with a tractable likelihood function, see Sec. IV B; and
– checking the asymptotic distribution of the likelihood ratio against Wilks’ theorem [69–71].
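The expectation-value checks in this list follow from two identities: the exact likelihood ratio r(x) = p(x|θ)/p(x|θ_ref) averages to one over events drawn from the reference hypothesis, and the exact score averages to zero at the true parameter point. The following sketch illustrates both on a hypothetical one-dimensional Gaussian toy model (not MadMiner code; the model and variable names are assumptions for illustration):

```python
import numpy as np

# Hypothetical toy model: p(x | mu) is a unit-width Gaussian with mean mu.
# Here the likelihood ratio r(x) = p(x|mu) / p(x|mu_ref) and the score
# t(x, mu) = d log p(x|mu) / d mu are known in closed form, so we can check
# the identities that a well-trained estimator r_hat or t_hat should satisfy.
rng = np.random.default_rng(0)
mu, mu_ref = 0.5, 0.0

def log_p(x, m):
    """Log density of a unit-width Gaussian with mean m."""
    return -0.5 * (x - m) ** 2 - 0.5 * np.log(2.0 * np.pi)

# Check 1: E[r(x)] = 1 when x is drawn from the reference hypothesis,
# since the ratio integrates p(x|mu) over x.
x_ref = rng.normal(mu_ref, 1.0, size=1_000_000)
mean_r = np.mean(np.exp(log_p(x_ref, mu) - log_p(x_ref, mu_ref)))

# Check 2: E[t(x, mu)] = 0 when x is drawn from p(x|mu) itself
# (the standard vanishing expectation of the score).
x_true = rng.normal(mu, 1.0, size=1_000_000)
mean_score = np.mean(x_true - mu)  # d/dmu log p = (x - mu) for this model

print(f"E[r]     = {mean_r:.4f}  (should be close to 1)")
print(f"E[score] = {mean_score:.4f}  (should be close to 0)")
```

The same Monte Carlo expectations can be evaluated with a network’s estimated ratio and score in place of the closed-form expressions; a significant deviation from 1 or 0 signals a miscalibrated estimator.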
Finally, when limits are set based on the Neyman construction with toy experiments (rather than using the asymptotic properties of the likelihood ratio), there is a coverage guarantee: the exclusion contours constructed in this way will not exclude the true point more often than the confidence level. No matter how wrong the likelihood, likelihood ratio, or score function estimated by the neural network is, the final limits might lose statistical power, but will never be too optimistic.

• Are you trying to replace PhD students with a machine? As a preemptive safety measure against scientists being made redundant by automated inference algorithms, we have implemented a number of bugs in MadMiner. It will take skilled physicists to find them, ensuring safe jobs for a while. More seriously, just as MadGraph automated the process of generating events for an arbitrary hard scattering process, MadMiner aims to contribute to the automation of several steps in the inference chain. Both developments enhance the productivity of physicists.

[1] J. Brehmer, K. Cranmer, I. Espejo, F. Kling, G. Louppe, and J. Pavez: ‘Effective LHC measurements with matrix elements and machine learning’, 2019.
[2] K. S. Cranmer: ‘Kernel estimation in high-energy physics’. Comput. Phys. Commun. 136, p. 198, 2001. arXiv:hep-ex/0011057.
[3] K. Cranmer, G. Lewis, L. Moneta, A. Shibata, and W. Verkerke (ROOT): ‘HistFactory: A tool for creating statistical models for use with RooFit and RooStats’, 2012.
[4] M. Frate, K. Cranmer, S. Kalia, A. Vandenberg-Rodes, and D. Whiteson: ‘Modeling Smooth Backgrounds and Generic Localized Signals with Gaussian Processes’, 2017.
[5] D. B. Rubin: ‘Bayesianly justifiable and relevant frequency calculations for the applied statistician’. Ann. Statist. 12 (4), p. 1151, 1984.
[6] M. A. Beaumont, W. Zhang, and D. J. Balding: ‘Approximate Bayesian computation in population genetics’.
Genetics 162 (4), p. 2025, 2002.
[7] J. Alsing, B. Wandelt, and S. Feeney: ‘Massive optimal data compression and density estimation for scalable, likelihood-free inference in cosmology’, 2018.
[8] T. Charnock, G. Lavaux, and B. D. Wandelt: ‘Automatic physical inference with information maximizing neural networks’. Phys. Rev. D97 (8), p. 083004, 2018.
[9] J. Brehmer, K. Cranmer, F. Kling, and T. Plehn: ‘Better Higgs boson measurements through information geometry’. Phys. Rev. D95 (7), p. 073002, 2017.
[10] J. Brehmer, F. Kling, T. Plehn, and T. M. P. Tait: ‘Better Higgs-CP Tests Through Information Geometry’. Phys. Rev. D97 (9), p. 095017, 2018.
[11] K. Kondo: ‘Dynamical Likelihood Method for Reconstruction of Events With Missing Momentum. 1: Method and Toy Models’. J. Phys. Soc. Jap. 57, p. 4126, 1988.
[12] V. M. Abazov et al. (D0): ‘A precision measurement of the mass of the top quark’. Nature 429, p. 638, 2004. arXiv:hep-ex/0406031.
[13] P. Artoisenet and O. Mattelaer: ‘MadWeight: Automatic event reweighting with matrix elements’. PoS CHARGED2008, p. 025, 2008.
[14] Y. Gao, A. V. Gritsan, Z. Guo, K. Melnikov, M. Schulze, and N. V. Tran: ‘Spin determination of single-produced resonances at hadron colliders’. Phys. Rev. D81, p. 075022, 2010.
[15] J. Alwall, A. Freitas, and O. Mattelaer: ‘The Matrix Element Method and QCD Radiation’. Phys. Rev. D83, p. 074010, 2011.
[16] S. Bolognesi, Y. Gao, A. V. Gritsan, et al.: ‘On the spin and parity of a single-produced resonance at the LHC’. Phys. Rev. D86, p. 095031, 2012.
[17] P. Avery et al.: ‘Precision studies of the Higgs boson decay channel H → ZZ → 4l with MEKD’. Phys. Rev. D87 (5), p. 055006, 2013.
[18] J. R. Andersen, C. Englert, and M. Spannowsky: ‘Extracting precise Higgs couplings by using the matrix element method’. Phys. Rev. D87 (1), p. 015019, 2013.
[19] J. M. Campbell, R. K. Ellis, W. T. Giele, and C.
Williams: ‘Finding the Higgs boson in decays to Zγ using the matrix element method at Next-to-Leading Order’. Phys. Rev. D87 (7), p. 073005, 2013. arXiv:1301.7086.
[20] P. Artoisenet, P. de Aquino, F. Maltoni, and O. Mattelaer: ‘Unravelling tth via the Matrix Element Method’. Phys. Rev. Lett. 111 (9), p. 091802, 2013.
[21] J. S. Gainer, J. Lykken, K. T. Matchev, S. Mrenna, and M. Park: ‘The Matrix Element Method: Past, Present, and Future’. In ‘Proceedings, 2013 Community Summer Study on the Future of U.S. Particle Physics: Snowmass on the Mississippi (CSS2013): Minneapolis, MN, USA, July 29-August 6, 2013’, 2013.
[22] D. Schouten, A. DeAbreu, and B. Stelzer: ‘Accelerated Matrix Element Method with Parallel Computing’. Comput. Phys. Commun. 192, p. 54, 2015.
[23] T. Martini and P. Uwer: ‘Extending the Matrix Element Method beyond the Born approximation: Calculating event weights at next-to-leading order accuracy’. JHEP 09, p. 083, 2015.
[24] A. V. Gritsan, R. Röntsch, M. Schulze, and M. Xiao: ‘Constraining anomalous Higgs boson couplings to the heavy flavor fermions using matrix element techniques’. Phys. Rev. D94 (5), p. 055023, 2016. arXiv:1606.03107.
[25] T. Martini and P. Uwer: ‘The Matrix Element Method at next-to-leading order QCD for hadronic collisions: Single top-quark production at the LHC as an example application’, 2017.
[26] M. Kraus, T. Martini, and P. Uwer: ‘Predicting event weights at next-to-leading order QCD for jet events defined by 2 → 1 jet algorithms’, 2019.
[27] D. Atwood and A. Soni: ‘Analysis for magnetic moment and electric dipole moment form-factors of the top quark via e+e− → tt̄’. Phys. Rev. D45, p. 2405, 1992.
[28] M. Davier, L. Duflot, F. Le Diberder, and A. Rouge: ‘The Optimal method for the measurement of tau polarization’. Phys. Lett. B306, p. 411, 1993.
[29] M. Diehl and O.
Nachtmann: ‘Optimal observables for the measurement of three gauge boson couplings in e+e− → W+W−’. Z. Phys. C62, p. 397, 1994.
[30] D. E. Soper and M. Spannowsky: ‘Finding physics signals with shower deconstruction’. Phys. Rev. D84, p. 074002, 2011.
[31] D. E. Soper and M. Spannowsky: ‘Finding top quarks with shower deconstruction’. Phys. Rev. D87, p. 054012, 2013.
[32] D. E. Soper and M. Spannowsky: ‘Finding physics signals with event deconstruction’. Phys. Rev. D89 (9), p. 094005, 2014.
[33] C. Englert, O. Mattelaer, and M. Spannowsky: ‘Measuring the Higgs-bottom coupling in weak boson fusion’. Phys. Lett. B756, p. 103, 2016.
[34] Y. Fan, D. J. Nott, and S. A. Sisson: ‘Approximate Bayesian Computation via Regression Density Estimation’. ArXiv e-prints, 2012.
[35] L. Dinh, D. Krueger, and Y. Bengio: ‘NICE: Non-linear Independent Components Estimation’. ArXiv e-prints, 2014.
[36] M. Germain, K. Gregor, I. Murray, and H. Larochelle: ‘MADE: Masked Autoencoder for Distribution Estimation’. ArXiv e-prints, 2015.
[37] K. Cranmer, J. Pavez, and G. Louppe: ‘Approximating Likelihood Ratios with Calibrated Discriminative Classifiers’, 2015.
[38] K. Cranmer and G. Louppe: ‘Unifying generative models and exact likelihood-free inference with conditional bijections’. J. Brief Ideas, 2016.
[39] G. Louppe, K. Cranmer, and J. Pavez: ‘carl: a likelihood-free inference toolbox’. J. Open Source Softw., 2016.
[40] L. Dinh, J. Sohl-Dickstein, and S. Bengio: ‘Density estimation using Real NVP’. ArXiv e-prints, 2016. arXiv:1605.08803.
[41] G. Papamakarios and I. Murray: ‘Fast ε-free Inference of Simulation Models with Bayesian Conditional Density Estimation’. arXiv e-prints arXiv:1605.06376, 2016.
[42] R. Dutta, J. Corander, S. Kaski, and M. U. Gutmann: ‘Likelihood-free inference by ratio estimation’. ArXiv e-prints, 2016.
[43] B. Uria, M.-A. Côté, K. Gregor, I. Murray, and H.
Larochelle: ‘Neural Autoregressive Distribution Estimation’. ArXiv e-prints, 2016.
[44] M. U. Gutmann, R. Dutta, S. Kaski, and J. Corander: ‘Likelihood-free inference via classification’. Statistics and Computing, p. 1–15, 2017.
[45] D. Tran, R. Ranganath, and D. M. Blei: ‘Hierarchical Implicit Models and Likelihood-Free Variational Inference’. ArXiv e-prints, 2017.
[46] G. Louppe and K. Cranmer: ‘Adversarial Variational Optimization of Non-Differentiable Simulators’. ArXiv e-prints, 2017.
[47] G. Papamakarios, T. Pavlakou, and I. Murray: ‘Masked Autoregressive Flow for Density Estimation’. ArXiv e-prints, 2017.
[48] J.-M. Lueckmann, P. J. Goncalves, G. Bassetto, K. Öcal, M. Nonnenmacher, and J. H. Macke: ‘Flexible statistical inference for mechanistic models of neural dynamics’. arXiv e-prints arXiv:1711.01861, 2017.
[49] C.-W. Huang, D. Krueger, A. Lacoste, and A. Courville: ‘Neural Autoregressive Flows’. ArXiv e-prints, 2018.
[50] G. Papamakarios, D. C. Sterratt, and I. Murray: ‘Sequential Neural Likelihood: Fast Likelihood-free Inference with Autoregressive Flows’. ArXiv e-prints, 2018.
[51] J.-M. Lueckmann, G. Bassetto, T. Karaletsos, and J. H. Macke: ‘Likelihood-free inference with emulator networks’. arXiv e-prints arXiv:1805.09294, 2018.
[52] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud: ‘Neural ordinary differential equations’. CoRR abs/1806.07366, 2018.
[53] D. P. Kingma and P. Dhariwal: ‘Glow: Generative Flow with Invertible 1x1 Convolutions’. arXiv e-prints arXiv:1807.03039, 2018.
[54] W. Grathwohl, R. T. Q. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud: ‘FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models’. ArXiv e-prints, 2018. arXiv:1810.01367.
[55] T. Dinev and M. U. Gutmann: ‘Dynamic Likelihood-free Inference via Ratio Estimation (DIRE)’. arXiv e-prints arXiv:1810.09899, 2018.
[56] J. Hermans, V.
Begy, and G. Louppe: ‘Likelihood-free MCMC with Approximate Likelihood Ratios’, 2019.
[57] J. Alsing, T. Charnock, S. Feeney, and B. Wandelt: ‘Fast likelihood-free cosmology with neural density estimators and active learning’, 2019.
[58] D. S. Greenberg, M. Nonnenmacher, and J. H. Macke: ‘Automatic Posterior Transformation for Likelihood-Free Inference’. arXiv e-prints arXiv:1905.07488, 2019.
[59] J. Brehmer, G. Louppe, J. Pavez, and K. Cranmer: ‘Mining gold from implicit models to improve likelihood-free inference’, 2018.
[60] J. Brehmer, K. Cranmer, G. Louppe, and J. Pavez: ‘Constraining Effective Field Theories with Machine Learning’. Phys. Rev. Lett. 121 (11), p. 111801, 2018.
[61] J. Brehmer, K. Cranmer, G. Louppe, and J. Pavez: ‘A Guide to Constraining Effective Field Theories with Machine Learning’. Phys. Rev. D98 (5), p. 052004, 2018.
[62] M. Stoye, J. Brehmer, G. Louppe, J. Pavez, and K. Cranmer: ‘Likelihood-free inference with an improved cross-entropy estimator’, 2018.
[63] J. Alwall, R. Frederix, S. Frixione, et al.: ‘The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations’. JHEP 07, p. 079, 2014.
[64] T. Sjöstrand, S. Mrenna, and P. Z. Skands: ‘A Brief Introduction to PYTHIA 8.1’. Comput. Phys. Commun. 178, p. 852, 2008.
[65] J. de Favereau, C. Delaere, P. Demin, et al. (DELPHES 3): ‘DELPHES 3, A modular framework for fast simulation of a generic collider experiment’. JHEP 02, p. 057, 2014.
[66] S. Agostinelli et al. (GEANT4): ‘GEANT4: A Simulation toolkit’. Nucl. Instrum. Meth. A506, p. 250, 2003.
[67] K. Cranmer: ‘Practical Statistics for the LHC’. In ‘Proceedings, 2011 European School of High-Energy Physics (ESHEP 2011): Cheile Gradistei, Romania, September 7-20, 2011’, p. 267–308, 2015.
[68] P. Baldi, K. Cranmer, T. Faucett, P. Sadowski, and D.
Whiteson: ‘Parameterized neural networks for high-energy physics’. Eur. Phys. J. C76 (5), p. 235, 2016.
[69] S. S. Wilks: ‘The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses’. Annals Math. Statist. 9 (1), p. 60, 1938.
[70] A. Wald: ‘Tests of statistical hypotheses concerning several parameters when the number of observations is large’. Transactions of the American Mathematical Society 54 (3), p. 426, 1943.
[71] G. Cowan, K. Cranmer, E. Gross, and O. Vitells: ‘Asymptotic formulae for likelihood-based tests of new physics’. Eur. Phys. J. C71, p. 1554, 2011. [Erratum: Eur. Phys. J. C73, p. 2501, 2013]. arXiv:1007.1727.
[72] J. Alsing and B. Wandelt: ‘Generalized massive optimal data compression’. Mon. Not. Roy. Astron. Soc. 476 (1), p. L60, 2018.
[73] B. Efron: ‘Defining the curvature of a statistical problem (with applications to second order efficiency)’. Ann. Statist. 3 (6), p. 1189, 1975.
[74] S.-I. Amari: ‘Differential geometry of curved exponential families-curvatures and information loss’. Ann. Statist. 10 (2), p. 357, 1982.
[75] J. Brehmer: ‘New Ideas for Effective Higgs Measurements’. Ph.D. thesis, U. Heidelberg (main), 2017. URL http://www.thphys.uni-heidelberg.de/~plehn/includes/theses/brehmer_d.pdf
[76] C. Radhakrishna Rao: ‘Information and the accuracy attainable in the estimation of statistical parameters’. Bull. Calcutta Math. Soc. 37, p. 81, 1945.
[77] H. Cramér: ‘Mathematical Methods of Statistics’. Princeton University Press, 1946, ISBN 0691080046.
[78] T. D. P. Edwards and C. Weniger: ‘A Fresh Approach to Forecasting in Astroparticle Physics and Dark Matter Searches’. JCAP 1802 (02), p. 021, 2018.
[79] C. Degrande, C. Duhr, B. Fuks, D. Grellscheid, O. Mattelaer, and T. Reiter: ‘UFO - The Universal FeynRules Output’. Comput. Phys. Commun. 183, p. 1201, 2012.
[80] O. Mattelaer: ‘On the maximal use of Monte Carlo samples: re-weighting events at NLO accuracy’.
Eur. Phys. J. C76 (12), p. 674, 2016.
[81] G. Aad et al. (ATLAS): ‘A morphing technique for signal modelling in a multidimensional space of coupling parameters’, 2015. Physics note ATL-PHYS-PUB-2015-047. URL http://cds.cern.ch/record/2066980
[82] J. Alsing and B. Wandelt: ‘Nuisance hardened data compression for fast likelihood-free inference’, 2019.
[83] Lukas, M. Feickert, G. Stark, R. Turra, and J. Forde: ‘diana-hep/pyhf v0.0.15’, 2018. URL https://doi.org/10.5281/zenodo.1464139
[84] R. Frederix, S. Frixione, V. Hirschi, F. Maltoni, R. Pittau, and P. Torrielli: ‘Four-lepton production at hadron colliders: aMC@NLO predictions with theoretical uncertainties’. JHEP 02, p. 099, 2012. arXiv:1110.4738.
[85] A. Paszke, S. Gross, S. Chintala, et al.: ‘Automatic differentiation in PyTorch’. In ‘NIPS-W’, 2017.
[86] N. Qian: ‘On the momentum term in gradient descent learning algorithms’. Neural Netw. 12 (1), p. 145, 1999.
[87] D. P. Kingma and J. Ba: ‘Adam: A Method for Stochastic Optimization’. arXiv e-prints arXiv:1412.6980, 2014.
[88] S. J. Reddi, S. Kale, and S. Kumar: ‘On the convergence of Adam and beyond’. In ‘International Conference on Learning Representations’, 2018.
[89] B. Lakshminarayanan, A. Pritzel, and C. Blundell: ‘Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles’. arXiv e-prints arXiv:1612.01474, 2016.
[90] J. Brehmer, F. Kling, I. Espejo, and K. Cranmer: ‘MadMiner code repository’. DOI:10.5281/zenodo.1489147, URL https://github.com/diana-hep/madminer
[91] J. Brehmer, F. Kling, I. Espejo, and K. Cranmer: ‘MadMiner technical documentation’. URL https://madminer.readthedocs.io/en/latest/
[92] I. Espejo, J. Brehmer, and K. Cranmer: ‘MadMiner Docker repositories’. URL https://hub.docker.com/u/madminertool
[93] T. Šimko, L. Heinrich, H. Hirvonsalo, D. Kousidis, and D. Rodríguez: ‘REANA: A System for Reusable Research Data Analyses’.
Technical Report CERN-IT-2018-003, CERN, Geneva, 2018. URL https://cds.cern.ch/record/2652340
[94] I. Espejo, J. Brehmer, F. Kling, and K. Cranmer: ‘MadMiner Reana deployment’. URL https://github.com/irinaespejo/workflow-madminer
[95] The HDF Group: ‘Hierarchical data format version 5’, 2000–2010. URL http://www.hdfgroup.org/HDF5
[96] M. Dobbs and J. B. Hansen: ‘The HepMC C++ Monte Carlo event record for High Energy Physics’. Comput. Phys. Commun. 134, p. 41, 2001.
[97] E. Rodrigues, M. Marinangeli, B. Pollack, et al.: ‘scikit-hep/scikit-hep: scikit-hep-0.5.1’, 2019. URL https://doi.org/10.5281/zenodo.3234683
[98] T. Oliphant: ‘NumPy: A guide to NumPy’. USA: Trelgol Publishing, 2006–. URL http://www.numpy.org/
[99] J. Butterworth et al.: ‘PDF4LHC recommendations for LHC Run II’. J. Phys. G43, p. 023001, 2016. arXiv:1510.03865.
[100] D. de Florian et al. (LHC Higgs Cross Section Working Group): ‘Handbook of LHC Higgs Cross Sections: 4. Deciphering the Nature of the Higgs Sector’, 2016.
[101] G. F. Giudice, C. Grojean, A. Pomarol, and R. Rattazzi: ‘The Strongly-Interacting Light Higgs’. JHEP 06, p. 045, 2007. arXiv:hep-ph/0703164.
[102] A. Alloul, B. Fuks, and V. Sanz: ‘Phenomenology of the Higgs Effective Lagrangian via FEYNRULES’. JHEP 04, p. 110, 2014.
[103] F. Maltoni, E. Vryonidou, and C. Zhang: ‘Higgs production in association with a top-antitop pair in the Standard Model Effective Field Theory at NLO in QCD’. JHEP 10, p. 123, 2016.
[104] M. Cepeda et al. (Physics of the HL-LHC Working Group): ‘Higgs Physics at the HL-LHC and HE-LHC’, 2019.
[105] T. Plehn, P. Schichtel, and D. Wiegand: ‘Where boosted significances come from’. Phys. Rev. D89 (5), p. 054002, 2014.
[106] F. Kling, T. Plehn, and P. Schichtel: ‘Maximizing the significance in Higgs boson pair analyses’. Phys. Rev. D95 (3), p. 035026, 2017.
[107] D. Gonçalves, T. Han, F. Kling, T. Plehn, and M.
Takeuchi: ‘Higgs boson pair production at future hadron colliders: From kinematics to dynamics’. Phys. Rev. D97 (11), p. 113004, 2018.
[108] D. Merkel: ‘Docker: Lightweight linux containers for consistent development and deployment’. Linux J. 2014 (239), 2014. URL http://dl.acm.org/citation.cfm?id=2600239.2600241
[109] T. Kluyver, B. Ragan-Kelley, F. Pérez, et al.: ‘Jupyter notebooks - a publishing format for reproducible computational workflows’. In ‘ELPUB’, 2016.
[110] J. D. Hunter: ‘Matplotlib: A 2D graphics environment’. Computing in Science & Engineering 9 (3), p. 90, 2007.
[111] Lukas: ‘lukasheinrich/pylhe v0.0.4’, 2018. URL https://doi.org/10.5281/zenodo.1217032
[112] T. Sjöstrand, S. Ask, J. R. Christiansen, et al.: ‘An Introduction to PYTHIA 8.2’. Comput. Phys. Commun. 191, p. 159, 2015.
[113] G. Van Rossum and F. L. Drake Jr: ‘Python tutorial’. Centrum voor Wiskunde en Informatica, Amsterdam, The Netherlands, 1995.
[114] E. Rodrigues: ‘The Scikit-HEP Project’. In ‘23rd International Conference on Computing in High Energy and Nuclear Physics (CHEP 2018), Sofia, Bulgaria, July 9-13, 2018’, 2019.
[115] F. Pedregosa, G. Varoquaux, A. Gramfort, et al.: ‘Scikit-learn: Machine learning in Python’. Journal of Machine Learning Research 12, p. 2825, 2011.
[116] J. Pivarski, P. Das, D. Smirnov, et al.: ‘scikit-hep/uproot: 3.7.2’, 2019. URL https://doi.org/10.5281/zenodo.3256257
[117] L. Heinrich and K. Cranmer: ‘diana-hep/yadage v0.12.13’, 2017. URL https://doi.org/10.5281/zenodo.1001816