DISCOVER: A Physics-Informed, GPU-Accelerated Symbolic Regression Framework

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, refer to the original arXiv paper.

Symbolic Regression (SR) enables the discovery of interpretable mathematical relationships from experimental and simulation data. These relationships are often called descriptors: fundamental materials properties that correlate directly with a desired or undesired functional property of the material. Although established approaches such as the Sure Independence Screening and Sparsifying Operator (SISSO) have successfully identified low-dimensional descriptors within large feature spaces, many existing SR tools integrate poorly with modern Python workflows, offer limited control over the symbolic search space, or struggle with the computational demands of large-scale studies. This paper introduces DISCOVER (Data-Informed Symbolic Combination of Operators for Variable Equation Regression), an open-source symbolic regression package developed to address these challenges through a modular, physics-motivated design. DISCOVER allows users to guide the symbolic search using domain knowledge, constrain the feature space explicitly, and take advantage of optional GPU acceleration to improve computational efficiency in data-intensive workflows, enabling reproducible and scalable SR studies. The software is intended for applications in computational physics, computational chemistry, and materials science, where interpretability, physical consistency, and execution time are especially important. It complements general-purpose SR frameworks by emphasizing the discovery of physically meaningful models.


💡 Research Summary

The paper introduces DISCOVER (Data‑Informed Symbolic Combination of Operators for Variable Equation Regression), an open‑source Python‑native symbolic regression framework designed to meet the growing demand for interpretable, physics‑consistent models in computational physics, chemistry, and materials science. While existing tools such as SISSO excel at sparse descriptor discovery, they lack fine‑grained user control over the search space, seamless integration with modern scientific Python ecosystems, and built‑in hardware acceleration. DISCOVER addresses these gaps through four pillars: (1) a declarative configuration system that lets users impose physical constraints (dimensional consistency, allowed operators, maximum expression complexity, custom variable‑combination rules) without modifying source code; (2) a modular architecture that plugs into standard libraries (NumPy, SciPy, pandas) and supports multiple sparsity‑driven search strategies; (3) optional GPU acceleration via CUDA on NVIDIA GPUs and Metal Performance Shaders on Apple Silicon, which dramatically speeds up feature generation and model evaluation; and (4) a suite of search algorithms, including Orthogonal Matching Pursuit (OMP), Mixed‑Integer Quadratic Programming (MIQP), and Simulated Annealing, each targeting the same L0‑constrained least‑squares objective: minimize ‖y − Φβ‖² subject to ‖β‖₀ ≤ D, where Φ is the candidate feature matrix, y the target vector, β the coefficient vector, and D the maximum number of nonzero coefficients.

The workflow begins with user‑provided raw features and an operator library. Candidate symbolic expressions are generated recursively, and at each step the configured physics‑informed constraints are applied. Dimensional consistency is enforced through the pint library, which tracks physical units throughout the expression tree and discards any candidate that violates unit rules, thereby pruning the search space early and ensuring physically plausible models.
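The pruning idea in the paragraph above can be illustrated with a minimal sketch. The summary states that DISCOVER relies on the pint library for unit tracking; the sketch below instead uses plain tuples of exponents over (length, mass, time) as a stand-in so that it runs with the standard library alone. The feature names, the operator set, and the helper functions `combine` and `expand` are hypothetical and are not taken from DISCOVER.

```python
from itertools import combinations

# Hypothetical primary features with dimension exponents (length, mass, time).
PRIMARY = {
    "r_A": (1, 0, 0),  # a radius (length)
    "r_B": (1, 0, 0),  # another radius (length)
    "m_A": (0, 1, 0),  # a mass
}

def combine(name_a, dim_a, name_b, dim_b):
    """Yield (expression, dimension) pairs that satisfy the unit rules."""
    # Addition and subtraction are only valid for identical dimensions.
    if dim_a == dim_b:
        yield f"({name_a} + {name_b})", dim_a
        yield f"({name_a} - {name_b})", dim_a
    # Products and quotients are always valid; exponents add or subtract.
    yield f"({name_a} * {name_b})", tuple(x + y for x, y in zip(dim_a, dim_b))
    yield f"({name_a} / {name_b})", tuple(x - y for x, y in zip(dim_a, dim_b))

def expand(features, depth):
    """Apply the binary operators `depth` times, keeping only valid candidates."""
    current = dict(features)
    for _ in range(depth):
        new = dict(current)
        # Each unordered pair is combined once (a / b is generated, b / a is not).
        for (na, da), (nb, db) in combinations(current.items(), 2):
            for expr, dim in combine(na, da, nb, db):
                new.setdefault(expr, dim)
        current = new
    return current

candidates = expand(PRIMARY, depth=1)
for expr, dim in candidates.items():
    print(expr, dim)
```

Because invalid sums such as a radius plus a mass are never generated, they cannot seed further invalid expressions at deeper recursion levels, which is where early pruning saves the most work.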

Search strategies are interchangeable: OMP offers rapid greedy selection for an initial sparse model; MIQP formulates the problem as a mixed‑integer program to approach a global optimum; Simulated Annealing explores non‑linear operator combinations and can escape local minima. Users can switch among them based on dataset size, desired accuracy, and computational budget.
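As an illustration of the greedy strategy, here is a minimal textbook sketch of Orthogonal Matching Pursuit for the objective of minimizing ‖y − Φβ‖² with at most D nonzero coefficients. It uses NumPy only and is not DISCOVER's implementation; the function name `omp` and the synthetic data are assumptions made for this example.

```python
import numpy as np

def omp(Phi, y, D):
    """Greedy approximation to: minimize ||y - Phi @ beta||^2 with at most D nonzeros."""
    n_features = Phi.shape[1]
    # Normalize columns so correlations are comparable across features.
    norms = np.linalg.norm(Phi, axis=0)
    norms[norms == 0] = 1.0
    Phi_n = Phi / norms
    residual = y.astype(float).copy()
    support = []
    coef = np.zeros(0)
    for _ in range(D):
        corr = np.abs(Phi_n.T @ residual)
        if support:
            corr[support] = -np.inf  # never select the same column twice
        support.append(int(np.argmax(corr)))
        # Refit all selected columns jointly (the "orthogonal" step).
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef
    beta = np.zeros(n_features)
    beta[support] = coef
    return beta, support

# Synthetic check: the target depends on columns 1 and 5 only.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 8))
y = 2.0 * Phi[:, 1] - 3.0 * Phi[:, 5]
beta, support = omp(Phi, y, D=2)
print(sorted(support), beta.round(3))
```

Each iteration costs one matrix-vector product and one small least-squares solve, which is why a greedy pass is a natural first step before more expensive searches such as MIQP.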

GPU acceleration is employed primarily for the computationally intensive steps of constructing the feature matrix Φ and evaluating candidate models against the target vector y. CUDA kernels parallelize matrix‑vector products as well as logarithmic and exponential operations, while Metal kernels provide comparable performance on Apple hardware. An automatic device‑selection layer routes small workloads to the CPU, where GPU launch and transfer overhead would dominate, and large workloads to the GPU, achieving 3–7× speed‑ups in benchmark experiments without sacrificing numerical stability.
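The routing logic described above could look roughly like the following sketch. The summary does not specify how DISCOVER decides between devices, so the function name `pick_device`, the size measure (number of matrix elements), and the threshold value are all hypothetical.

```python
def pick_device(n_samples, n_candidates, gpu_available, min_gpu_elements=1_000_000):
    """Return "gpu" only when the feature matrix is large enough to amortize overhead.

    The threshold is an illustrative assumption, not a value from DISCOVER.
    """
    if gpu_available and n_samples * n_candidates >= min_gpu_elements:
        return "gpu"
    return "cpu"

print(pick_device(100, 50, gpu_available=True))      # small workload
print(pick_device(1000, 5000, gpu_available=True))   # large workload
print(pick_device(1000, 5000, gpu_available=False))  # no GPU present
```

In practice such a threshold would be tuned per machine, since the break-even point depends on the GPU, the transfer bandwidth, and the operators being evaluated.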

DISCOVER’s applicability is demonstrated on several materials‑science problems: (i) low‑dimensional descriptors for crystal‑structure stability, (ii) ion‑mobility descriptors for energy‑storage materials, (iii) magnetic‑structure discrimination, and (iv) battery‑cathode candidate screening. In each case, the inclusion of domain‑specific constraints led to dimensionally valid, parsimonious formulas that matched or exceeded the performance of SISSO‑derived descriptors while offering greater interpretability and reproducibility.

The authors acknowledge limitations: the quality of discovered models heavily depends on the initial feature set; overly restrictive constraints may eliminate viable solutions; and for extremely large feature spaces (hundreds of thousands of candidates) DISCOVER’s memory and runtime requirements can exceed those of highly specialized SISSO implementations. Ongoing development aims to expand the operator library, introduce automatic unit inference, and support distributed GPU clusters to further improve scalability.

A brief statement on AI assistance notes that large language models were used to refactor code, generate documentation, and standardize function names, illustrating a modern development workflow. Funding acknowledgments cite the German Research Foundation (DFG), the NFDI FAIRmat consortium, and related excellence clusters.

In summary, DISCOVER provides a flexible, physics‑aware, and GPU‑accelerated platform for symbolic regression that bridges the gap between fully automated descriptor discovery and user‑guided model building, enabling reproducible, efficient, and physically meaningful scientific discovery.

