MoreFit: A More Optimised, Rapid and Efficient Fit


Parameter estimation via unbinned maximum likelihood fits is a central technique in particle physics. This article introduces MoreFit, which aims to provide a more optimised, rapid and efficient fitting solution for unbinned maximum likelihood fits. MoreFit is developed with a focus on parallelism and relies on computation graphs that are compiled just-in-time. Several novel automatic optimisation techniques are employed on the computation graphs that significantly increase performance compared to conventional approaches. MoreFit can make efficient use of a wide range of heterogeneous platforms through its compute backends that rely on open standards. It provides an OpenCL backend for execution on GPUs of all major vendors, and a backend based on LLVM and Clang for single- or multithreaded execution on CPUs, which in addition allows for SIMD vectorisation. MoreFit is benchmarked against several other fitting frameworks and shows very promising performance, illustrating the power of the approach.


💡 Research Summary

The paper presents MoreFit, a new fitting library designed to accelerate unbinned maximum‑likelihood (UML) fits, which are a cornerstone of parameter estimation in particle physics. Traditional UML fits require repeated evaluation of the log‑likelihood, its gradient, and often the Hessian for many parameter points; when data sets contain millions of events and the model has dozens to hundreds of free parameters, the computational cost becomes prohibitive. Existing frameworks such as RooFit, zfit, and GooFit provide GPU acceleration mainly through CUDA, which ties the implementation to NVIDIA hardware and adds considerable compilation and linking complexity.

MoreFit addresses these limitations by adopting a computation‑graph representation of probability density functions (PDFs). A PDF is expressed as a directed acyclic graph whose nodes are elementary operations, functions, variables, and constants. From this graph the library automatically generates Just‑In‑Time (JIT) kernels for the target accelerator. The same graph is differentiated symbolically using the chain rule, yielding exact analytic expressions for the gradient and the Hessian. This symbolic differentiation avoids the numerical noise and extra cost associated with finite‑difference methods.
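The graph representation and chain-rule differentiation described above can be illustrated with a toy sketch. The node classes below are hypothetical, not the MoreFit API; they only show how a directed acyclic graph of elementary operations yields exact analytic derivatives instead of finite differences:

```python
# Toy sketch (hypothetical classes, not the MoreFit API) of a computation
# graph whose nodes are constants, variables, and elementary operations,
# differentiated symbolically with the chain rule.

class Const:
    def __init__(self, v): self.v = v
    def eval(self, env): return self.v
    def diff(self, var): return Const(0.0)

class Var:
    def __init__(self, name): self.name = name
    def eval(self, env): return env[self.name]
    def diff(self, var): return Const(1.0 if var == self.name else 0.0)

class Add:
    def __init__(self, a, b): self.a, self.b = a, b
    def eval(self, env): return self.a.eval(env) + self.b.eval(env)
    def diff(self, var): return Add(self.a.diff(var), self.b.diff(var))

class Mul:
    def __init__(self, a, b): self.a, self.b = a, b
    def eval(self, env): return self.a.eval(env) * self.b.eval(env)
    def diff(self, var):  # product rule
        return Add(Mul(self.a.diff(var), self.b), Mul(self.a, self.b.diff(var)))

# f(x, mu) = (x + mu) * x, so df/dmu = x.
f = Mul(Add(Var("x"), Var("mu")), Var("x"))
g = f.diff("mu")
print(f.eval({"x": 2.0, "mu": 3.0}))  # 10.0
print(g.eval({"x": 2.0, "mu": 3.0}))  # 2.0
```

Because `diff` returns another graph, the derivative can be JIT-compiled to a kernel exactly like the PDF itself.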

Two heterogeneous back‑ends are provided:

  1. OpenCL GPU back‑end – OpenCL is an open, vendor‑independent standard, allowing MoreFit to run on GPUs from NVIDIA, AMD, Intel, and others. The event data are transferred as a structure‑of‑arrays, padded to the chosen work‑group size. Each work‑item computes the log‑PDF for a single event. After all work‑items finish, a Kahan‑summation reduction is performed on the device to obtain the total log‑likelihood while minimising numerical error and host‑device traffic. Parameter‑only sub‑expressions (e.g., normalisation integrals) are pre‑computed on the host and passed as additional kernel arguments, dramatically reducing per‑event work.

  2. LLVM/Clang CPU back‑end – For CPU execution, the graph is translated into LLVM IR and compiled JIT with Clang. Auto‑vectorisation (SIMD) and a thread‑pool are employed to exploit modern multi‑core CPUs (AVX‑2, AVX‑512, etc.). The same Kahan reduction is applied, optionally vectorised. Benchmarks show speed‑ups of up to a factor of two compared with naïve scalar code.
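The compensated (Kahan) summation used in both reductions can be sketched in a few lines; the snippet below is plain scalar Python for clarity, not the vectorised device code:

```python
# Sketch of compensated (Kahan) summation: a second accumulator recovers the
# low-order bits that are rounded away when many per-event log-likelihood
# terms are added into a single running total.

def kahan_sum(values):
    total = 0.0
    c = 0.0  # running compensation for lost low-order bits
    for v in values:
        y = v - c            # apply the error carried from the last step
        t = total + y        # low-order bits of y may be lost here...
        c = (t - total) - y  # ...and are recovered into the compensation
        total = t
    return total

# One large term followed by many small ones: the naive sum drops the
# small contributions, the compensated sum retains them.
vals = [1.0] + [1e-16] * 100
print(sum(vals), kahan_sum(vals))
```

Kahan's error bound is independent of the number of terms, which is what makes a single-precision-friendly, massively parallel reduction numerically safe.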

MoreFit integrates with Minuit2, the standalone minimiser library that provides the classic Migrad algorithm. While the minimisation loop runs on the host, the heavy log‑likelihood evaluation runs on the accelerator. Each iteration triggers a re‑evaluation of the computation graph; however, MoreFit analyses the graph to identify parameter‑only nodes (e.g., normalisation constants) that do not depend on the event variables. Together with common sub‑expression elimination (CSE), these nodes are evaluated once per parameter update on the host and cached. The cached values are then supplied to the kernel, dramatically reducing the per‑event arithmetic.
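The effect of hoisting parameter-only terms out of the event loop can be shown with a minimal sketch (hypothetical code, not the MoreFit API) for a normalised Gaussian PDF:

```python
import math

# Sketch of host-side caching of parameter-only sub-expressions: the terms
# depending only on (mu, sigma) are computed once per minimiser iteration,
# so the per-event work shrinks to a subtract, a square, and an accumulate.

def nll_gauss(data, mu, sigma):
    # Parameter-only sub-expressions: evaluated once per parameter update.
    inv_2s2 = 1.0 / (2.0 * sigma * sigma)
    log_norm = math.log(sigma * math.sqrt(2.0 * math.pi))
    # Per-event loop: no logs, exponentials, or divisions per event.
    nll = 0.0
    for x in data:
        nll += (x - mu) ** 2 * inv_2s2 + log_norm
    return nll

data = [0.3, 1.2, -0.7, 2.1]
print(nll_gauss(data, 0.5, 1.0))
```

In MoreFit the same split happens automatically on the graph, with the cached values passed to the kernel as extra arguments.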

The paper details several optimisation layers:

  • Trivial simplifications (constant folding) during graph construction.
  • CSE and host‑side buffering of parameter‑only terms.
  • Kernel‑level CSE performed by the LLVM optimiser for the CPU back‑end.
  • Configurable reduction factor for the Kahan summation to balance kernel launch overhead and reduction depth.
  • Thread‑pool reuse to avoid repeated thread creation/destruction overhead.
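The first of these layers, constant folding at graph-construction time, can be sketched as follows (hypothetical node classes, not MoreFit's):

```python
# Sketch of constant folding during graph construction: an operation on two
# constants collapses to a single constant node, and multiplicative
# identities are simplified away, so the generated kernel never
# re-evaluates them per event.

class Const:
    def __init__(self, v): self.v = v

class Var:
    def __init__(self, name): self.name = name

def make_mul(a, b):
    if isinstance(a, Const) and isinstance(b, Const):
        return Const(a.v * b.v)           # fold const * const
    if isinstance(a, Const):
        if a.v == 1.0: return b           # 1 * x -> x
        if a.v == 0.0: return Const(0.0)  # 0 * x -> 0
    if isinstance(b, Const):
        if b.v == 1.0: return a           # x * 1 -> x
        if b.v == 0.0: return Const(0.0)  # x * 0 -> 0
    return ("mul", a, b)                  # generic graph node otherwise

node = make_mul(Const(2.0), Const(3.0))
print(type(node).__name__, node.v)  # Const 6.0
```

Performing such simplifications before code generation keeps the JIT-compiled kernels as small as possible.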

Benchmark studies involve two toy models: a Gaussian‑plus‑exponential mixture and a pure exponential. Compared against RooFit, zfit, and GooFit, MoreFit achieves:

  • GPU: 5–10× faster log‑likelihood and gradient evaluation, with the host‑side pre‑computation of normalisation integrals accounting for 30–40 % of total runtime savings.
  • CPU: 2–3× speed‑up thanks to SIMD auto‑vectorisation and multithreading.
  • For small data sets (≤10⁴ events), where kernel compilation overhead is non‑negligible, the one‑time JIT cost is amortised over thousands of pseudo‑experiment repetitions, so the approach remains advantageous.
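A Gaussian-plus-exponential mixture of the kind used in such toy studies can be sketched as below; the parameter values are illustrative, not those of the paper:

```python
import math

# Sketch of a Gaussian-plus-exponential toy model: signal fraction f,
# Gaussian signal (mu, sigma), exponential background with slope lam,
# each normalised on the fit window [lo, hi].

def mixture_pdf(x, f, mu, sigma, lam, lo, hi):
    gauss = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
    # Exponential normalised to unit integral over [lo, hi].
    expo = lam * math.exp(-lam * (x - lo)) / (1.0 - math.exp(-lam * (hi - lo)))
    return f * gauss + (1.0 - f) * expo

def nll(data, f, mu, sigma, lam, lo, hi):
    # Negative log-likelihood minimised in the fit.
    return -sum(math.log(mixture_pdf(x, f, mu, sigma, lam, lo, hi)) for x in data)

print(nll([0.0, 1.0, -0.5], 0.5, 0.0, 1.0, 0.3, -10.0, 10.0))
```

In a benchmark, this NLL would be evaluated thousands of times per pseudo-experiment, which is exactly the workload the JIT-compiled kernels accelerate.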

The authors discuss future directions, including dynamic workload balancing between CPU and GPU (heterogeneous scheduling), support for more complex PDFs (multivariate, efficiency functions, non‑linear transformations), and tighter integration with ROOT for histogram handling and visualisation. They argue that the combination of graph‑based automatic differentiation, heterogeneous JIT‑compiled back‑ends, and runtime optimisation positions MoreFit as a scalable solution for the ever‑growing data volumes of LHC and future experiments.

In summary, MoreFit provides a modern, vendor‑agnostic, and highly optimised framework for unbinned maximum‑likelihood fitting. By moving the computationally intensive parts to JIT‑compiled kernels, exploiting both GPU and CPU vectorisation, and automatically eliminating redundant calculations, it delivers order‑of‑magnitude performance gains over existing tools, thereby enabling more ambitious analyses and faster turnaround times in high‑energy physics.

