LLM Priors for ERM over Programs

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

We study program-learning methods that are efficient in both samples and computation. Classical learning theory suggests that when the target admits a short program description (for example, a short piece of Python code), it can be learned from relatively few examples by performing ERM over the program class. However, this approach relies on enumerating candidate programs, which is typically exponential in the description length. In contrast, gradient-based training avoids explicit search, but for some families of short programs it can require exponentially many samples to succeed. We propose LLM-PV, a propose-and-verify recipe that enables ERM-style selection over a discrete program class without exhaustive enumeration. A pretrained LLM induces a proposal distribution over candidate programs; each proposal is executed, scored on a held-out validation set, and the best program is selected. The method uses no gradient updates and does not use validation feedback to adapt the sampling distribution. Across algorithmic tasks including parity variants, pattern matching, and primality testing, LLM-PV often recovers the exact underlying rule from a small labeled set and generalizes far beyond the training sequence lengths. In the same regimes, SGD-trained transformers and standard adaptation baselines (fine-tuning and in-context learning), as well as classical ML baselines, can fit the training data yet fail to generalize reliably. Together, these results suggest that pretrained LLM priors can serve as effective search biases for ERM, narrowing the gap between statistical and computational efficiency. The code is available at https://github.com/DLFundamentals/LLM_PV.


💡 Research Summary

The paper tackles the classic tension between sample efficiency and computational tractability in learning short programs. Classical learning theory tells us that if a target function can be expressed by a program of length L over a finite token alphabet Σ, then empirical risk minimization (ERM) over the finite hypothesis class of all such programs requires only O(L·log|Σ|) labeled examples to guarantee low generalization error. However, implementing ERM by exhaustive length‑first enumeration is computationally infeasible because the number of candidates grows as |Σ|^L, which is exponential in L. On the other hand, modern deep learning sidesteps explicit search by training high‑capacity models (e.g., transformers) with stochastic gradient descent (SGD). While SGD is computationally cheap per step, the authors show—using the statistical query (SQ) framework—that for many algorithmic families (parities, cryptographic‑like functions) the SQ dimension is large, forcing SGD to require an exponential number of samples to achieve non‑trivial error. This creates a regime where both naïve enumeration and gradient‑based training break down.
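To see the blowup concretely, here is a toy sketch of length-first enumeration over a small token alphabet; the alphabet and the counting function are invented for illustration (the paper's actual program class is Python source code), but the |Σ|^L growth is the point:

```python
import itertools

# Hypothetical toy alphabet; the real class in the paper is Python programs.
SIGMA = ["x", "not", "and", "or", "(", ")"]

def enumerate_programs(max_len):
    """Yield every token string up to max_len: |SIGMA|**L candidates at length L."""
    for L in range(1, max_len + 1):
        for prog in itertools.product(SIGMA, repeat=L):
            yield prog

def candidate_count(max_len):
    """Total number of candidates a length-first ERM search must score."""
    return sum(len(SIGMA) ** L for L in range(1, max_len + 1))

# Even at length 8 over a 6-token alphabet there are ~2 million candidates,
# and each extra token multiplies the count by |SIGMA|.
print(candidate_count(8))  # prints 2015538
```

This is why ERM's O(L·log|Σ|) sample bound is cheap to state but expensive to realize: the statistics are favorable while the search is exponential.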

To bridge this gap, the authors introduce LLM‑PV (LLM‑Propose‑Verify), a simple yet powerful “propose‑and‑verify” pipeline. A pretrained large language model (LLM) is used solely as a proposal engine: conditioned on the training set S = {(x_i, y_i)} it samples candidate programs (or program edits) from a data‑dependent distribution. Each candidate is compiled and executed on a held‑out validation set; its validation error is measured, and the program with the lowest validation error is selected. Crucially, the LLM never receives gradient updates, and validation feedback does not alter the proposal distribution. Thus the LLM acts purely as a search prior, while the selection criterion remains the classic ERM objective (minimum validation error).
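The selection loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the LLM proposal call is stubbed with a small fixed candidate pool (a loud assumption) so the sketch runs without a model, and the target is parity with validation inputs 4x longer than training inputs, mirroring the length-generalization setting:

```python
import random

def propose_program(train_set, step):
    # In LLM-PV this step samples candidate source code from a pretrained LLM
    # conditioned on the training examples; we cycle through a fixed pool of
    # hypothetical candidates purely for illustration.
    pool = [
        lambda x: sum(x) % 2,      # parity of all bits (the true rule)
        lambda x: x[0],            # first bit only
        lambda x: 1 - sum(x) % 2,  # negated parity
    ]
    return pool[step % len(pool)]

def llm_pv(train_set, val_set, n_proposals=9):
    best_prog, best_err = None, float("inf")
    for step in range(n_proposals):
        # Propose: no gradient updates, and no feedback into the proposer.
        prog = propose_program(train_set, step)
        # Verify: execute the candidate and score it on held-out data.
        err = sum(prog(x) != y for x, y in val_set) / len(val_set)
        if err < best_err:  # ERM-style selection by minimum validation error
            best_prog, best_err = prog, err
    return best_prog, best_err

rng = random.Random(1)

def make_data(n, length):
    xs = [[rng.randint(0, 1) for _ in range(length)] for _ in range(n)]
    return [(x, sum(x) % 2) for x in xs]  # label = parity of the bits

train = make_data(50, 8)   # short training inputs
val = make_data(50, 32)    # longer validation inputs
prog, err = llm_pv(train, val)
print(err)  # prints 0.0: the parity candidate is selected
```

Note that the selected hypothesis is itself a program, so it applies to inputs of any length; the proposer is the only learned component, and it is frozen throughout.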

The authors evaluate LLM‑PV on a suite of algorithmic tasks that are known to be SQ‑hard: various parity variants, pattern‑matching, palindrome detection, Dyck‑2 language recognition, and several primality‑testing variants. Across these tasks, LLM‑PV recovers the exact underlying rule from as few as 200 labeled examples, often producing compact, length‑invariant programs that generalize far beyond the training input lengths (e.g., handling numbers orders of magnitude larger than seen during training). In contrast, baseline methods—including fine‑tuned transformers, in‑context learning, SVMs, and XGBoost—typically achieve perfect training accuracy but fail to generalize when input size grows, confirming the theoretical SQ‑hardness predictions.
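For intuition about what "compact, length-invariant" means here, the recovered programs might resemble the following hypothetical sketches (illustrative guesses, not taken from the paper's actual outputs): each is a few lines, references no fixed input size, and therefore extrapolates to arbitrarily long or large inputs by construction:

```python
def parity(bits):
    """XOR of all bits -- valid for any input length, not just training lengths."""
    acc = 0
    for b in bits:
        acc ^= b
    return acc

def is_prime(n):
    """Trial division up to sqrt(n) -- handles numbers far beyond training range."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

print(parity([1, 0, 1, 1]), is_prime(524287))  # prints: 1 True
```

A transformer trained on length-8 parities has no comparable mechanism forcing length invariance, which is one way to read the baselines' in-distribution success and out-of-distribution failure.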

Beyond performance, LLM‑PV offers interpretability and auditability. The final hypothesis is executable, human‑readable code, and the entire propose‑verify trace (proposed programs, execution outcomes, validation diagnostics) is logged, enabling systematic debugging and mechanistic analysis of successes and failures. The paper also provides a formal analysis: it restates the ERM sample bound for short programs, proves the exponential runtime of length‑first enumeration, and, via Proposition 1, quantifies the exponential sample requirement for finite‑precision mini‑batch SGD on planted k‑parity families using SQ dimension arguments.

In summary, the work demonstrates that pretrained LLMs can serve as effective, data‑driven search priors for discrete program spaces, allowing ERM‑style learning to retain its statistical optimality while dramatically reducing computational cost. The approach closes a substantial gap between statistical and computational efficiency in program learning, and the released codebase establishes a solid benchmark for future research at the intersection of program synthesis, LLM‑guided search, and theoretical learning guarantees.

