Towards Scaling Laws for Symbolic Regression
Symbolic regression (SR) aims to discover the underlying mathematical expressions that explain observed data. This holds promise both for gaining scientific insight and for producing inherently interpretable and generalizable models for tabular data. Deep learning-based SR has recently become competitive with genetic programming approaches, but the role of scale has remained largely unexplored. Inspired by scaling laws in language modeling, we present the first systematic investigation of scaling in SR, using a scalable end-to-end transformer pipeline and carefully generated training data. Across five different model sizes and spanning three orders of magnitude in compute, we find that both validation loss and solved rate follow clear power-law trends with compute. We further identify compute-optimal hyperparameter scaling: optimal batch size and learning rate grow with model size, and a token-to-parameter ratio of $\approx$15 is optimal in our regime, with a slight upward trend as compute increases. These results demonstrate that SR performance is largely predictable from compute and offer important insights for training the next generation of SR models.
💡 Research Summary
This paper presents the first systematic investigation of scaling laws for symbolic regression (SR) using transformer models. Motivated by the transformative impact of scaling laws in large language models, the authors ask whether similar predictable relationships between compute and performance exist for SR. To answer this, they construct a highly controlled end‑to‑end pipeline that consists of (1) a synthetic data generation process with strict control over the distribution of expressions, and (2) a transformer architecture that incorporates recent advances from tabular foundation models.
Data generation proceeds in two steps. First, a base set of expressions E is built recursively by starting from variables and applying a predefined collection of binary and unary operators. Duplicate and mathematically equivalent expressions are filtered out, yielding a clean, unbiased corpus. Second, for each base expression the authors insert random integer constants and sample a 64‑point dataset from a Gaussian‑mixture distribution, producing up to 3,600 expression‑dataset pairs per base expression. In total, 100,000 distinct base expressions are generated, and separate validation and test splits contain 1,000 unseen expressions each, with fresh constants and data points.
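The two-step generation process can be sketched as follows. This is a minimal illustration: the operator sets, constant range, and mixture parameters below are assumptions for the sketch, not the paper's exact choices, and the paper's equivalence filtering (removing mathematically identical expressions) is only noted in a comment here.

```python
import math
import random

# Hypothetical operator sets; the paper's exact collection may differ.
UNARY = ("sin", "exp")
BINARY = ("+", "*")

def grow_expressions(variables, depth):
    """Recursively build a base set of expression strings.

    The paper additionally filters duplicates and mathematically
    equivalent expressions (e.g. via simplification); the set() here
    only removes exact string duplicates.
    """
    exprs = set(variables)
    for _ in range(depth):
        new = set()
        for e in exprs:
            for op in UNARY:
                new.add(f"{op}({e})")
            for f in exprs:
                for op in BINARY:
                    new.add(f"({e} {op} {f})")
        exprs |= new
    return exprs

def sample_dataset(expr, n_points=64, n_components=3):
    """Insert a random integer constant c and sample inputs from a Gaussian mixture."""
    c = random.randint(1, 5)  # assumed constant range, for illustration
    centers = [random.uniform(-3, 3) for _ in range(n_components)]
    pts = []
    for _ in range(n_points):
        x0 = random.gauss(random.choice(centers), 1.0)
        env = {"x0": x0, "c": c, "sin": math.sin, "exp": math.exp}
        pts.append((x0, eval(expr, {"__builtins__": {}}, env)))
    return pts

base = grow_expressions(["x0", "x1"], depth=1)
data = sample_dataset("(x0 * c)")
```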
Tokenization follows the common practice of encoding numeric values in base‑10 floating‑point notation. Mantissa and exponent are embedded separately, projected to the model dimension, and summed to obtain a cell‑wise embedding. This design mirrors recent tabular models that treat rows and columns symmetrically. The encoder is a standard transformer but each layer applies both row‑wise and column‑wise attention, allowing the model to capture interactions across variables and across data points simultaneously. The decoder follows the classic sequence‑to‑sequence transformer and cross‑attends to the updated cell embeddings. Training uses a cross‑entropy loss on token sequences, the AdamCPR optimizer, a 5 % linear warm‑up, and cosine annealing thereafter.
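The mantissa/exponent embedding scheme can be illustrated with a toy version. The dimension and lookup tables below are invented for illustration (a real model learns these parameters, and its exact factorization of sign, mantissa, and exponent may differ):

```python
import math
import random

D = 8  # toy embedding dimension; real models use hundreds

random.seed(0)
# Hypothetical learned parameters: one vector per exponent value,
# plus a linear projection for the (signed) mantissa.
exp_table = {e: [random.gauss(0, 1) for _ in range(D)] for e in range(-10, 11)}
mant_proj = [random.gauss(0, 1) for _ in range(D)]

def cell_embedding(x):
    """Embed mantissa and exponent separately, then sum them into one cell vector."""
    exp = 0 if x == 0 else math.floor(math.log10(abs(x)))
    mant = 0.0 if x == 0 else x / 10 ** exp  # signed mantissa in +/-[1, 10)
    return [mant * mant_proj[i] + exp_table[exp][i] for i in range(D)]

v = cell_embedding(-273.15)  # exponent 2, mantissa -2.7315
```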
The experimental study spans five model sizes ranging from 6.5 M to 93 M parameters. Compute budgets cover three orders of magnitude (≈1 × 10¹⁷ to 1 × 10¹⁹ FLOPs). For each size the authors sweep batch size and learning rate, initially using a token‑to‑parameter ratio of 20 (the value found optimal for language models). After identifying the configuration with the lowest validation loss, they retrain with token‑to‑parameter ratios from 5 to 80 to locate the compute‑optimal trade‑off. Performance is measured by two metrics on the test set: (i) Acc solved – the fraction of predictions that exactly match the ground‑truth symbolic expression, and (ii) Acc R²>0.99 – the fraction of predictions whose R² score exceeds 0.99. Each metric is averaged over three random seeds.
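The two metrics might be computed along these lines. This is a simplified sketch: in particular, the exact-match criterion presumably compares canonicalized expressions, whereas the version below compares raw strings.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def evaluate(preds, targets, datasets):
    """preds/targets: expression strings; datasets: (xs, y_true) per task.

    Returns (Acc_solved, Acc_R2>0.99). Exact match is a string comparison
    here; the paper likely matches canonicalized symbolic forms.
    """
    solved = sum(p == t for p, t in zip(preds, targets)) / len(preds)
    fits = 0
    for p, (xs, y_true) in zip(preds, datasets):
        y_pred = [eval(p, {"x": x}) for x in xs]
        fits += r2_score(y_true, y_pred) > 0.99
    return solved, fits / len(preds)

# A prediction that misses the exact string but fits the data perfectly:
solved, fit_rate = evaluate(["(x * 2)"], ["(2 * x)"], [([1.0, 2.0], [2.0, 4.0])])
```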
Key findings are:
- Power-law scaling of loss and solved rate. Both Acc solved and validation loss follow clear power-law relationships with total FLOPs. Acc solved rises from ~0.03 at the smallest compute budget to ~0.60 at the largest, and extrapolation of the fitted law predicts a 0.8 solved rate at ≈3.8 × 10²¹ FLOPs. The R²-based metric improves even faster, highlighting that achieving exact symbolic matches is substantially harder than attaining high predictive fidelity. No saturation is observed within the explored range.
- Compute-optimal hyperparameters grow with model size. The optimal batch size and learning rate increase monotonically as the model scales. This trend contrasts with the decreasing learning-rate scaling reported for large language models, suggesting that SR training dynamics differ due to the nature of the input (tabular numeric data) and the output (structured symbolic strings). The authors note that variance remains high and more seeds would be needed for precise quantification.
- Token-to-parameter ratio ≈ 15 is optimal. Across all models, a ratio of about 15 tokens per parameter yields the lowest validation loss for a given compute budget. The ratio shows a slight upward drift with larger compute, implying that data size should scale marginally faster than model size for SR. This finding aligns with the intuition that symbolic regression benefits from richer datasets to disambiguate expression structures.
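The power-law fits behind these findings amount to linear regression in log-log space. The sketch below uses illustrative numbers, not the paper's measurements, so the extrapolated compute it produces will not match the paper's ≈3.8 × 10²¹ FLOPs figure:

```python
import math

def fit_power_law(compute, metric):
    """Least-squares fit of metric ~ a * compute**b in log-log space."""
    xs = [math.log10(c) for c in compute]
    ys = [math.log10(m) for m in metric]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    log_a = my - b * mx
    return 10 ** log_a, b

# Illustrative (solved rate, FLOPs) pairs only, not the paper's data:
flops = [1e17, 1e18, 1e19]
acc = [0.03, 0.14, 0.60]
a, b = fit_power_law(flops, acc)

# Extrapolate the compute needed for a 0.8 solved rate: C = (0.8 / a)**(1 / b)
c_needed = (0.8 / a) ** (1 / b)
```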
The paper also discusses several limitations: the synthetic benchmark restricts expressions to at most two variables and integer constants, which may not reflect real-world scientific problems involving many variables and floating-point parameters; only a small number of seeds per configuration were run, leaving run-to-run variance only coarsely quantified; the compute range is limited, so extrapolations beyond 10¹⁹ FLOPs remain speculative; and no direct comparison with state-of-the-art genetic programming or other deep SR methods is provided, since the focus is on scaling behavior rather than absolute performance.
Despite these constraints, the work convincingly demonstrates that symbolic regression with transformers obeys predictable scaling laws. The identified power‑law trends, the upward scaling of batch size and learning rate, and the ≈15 token‑to‑parameter sweet spot together furnish practical heuristics for designing future, larger SR models. The authors suggest extending the analysis to more complex expressions (more variables, floating‑point constants), broader compute regimes, and incorporating multi‑seed robustness studies. Ultimately, the paper argues that systematic scaling, rather than ad‑hoc architectural tweaks, may be the most efficient path toward SR models that surpass existing genetic programming baselines and become useful tools for scientific discovery.