RefineStat: Efficient Exploration for Probabilistic Program Synthesis
Probabilistic programming offers a powerful framework for modeling uncertainty, yet statistical model discovery in this domain entails navigating an immense search space under strict domain-specific constraints. When small language models are tasked with generating probabilistic programs, they frequently produce outputs that suffer from both syntactic and semantic errors, such as flawed inference constructs. Motivated by probabilistic programmers’ domain expertise and debugging strategies, we introduce RefineStat, a language model–driven framework that enforces semantic constraints ensuring synthesized programs contain valid distributions and well-formed parameters, and then applies diagnostic-aware refinement by resampling prior or likelihood components whenever reliability checks fail. We evaluate RefineStat on multiple probabilistic-programming code-generation tasks using small language models (SLMs) and find that it produces programs that are both syntactically sound and statistically reliable, often matching or surpassing those from closed-source large language models (e.g., OpenAI o3).
💡 Research Summary
RefineStat tackles the problem of automatically synthesizing reliable probabilistic programs using small language models (SLMs). Existing approaches that query large language models (LLMs) often produce code that is either syntactically invalid or semantically flawed—e.g., using a variance where a standard deviation is required, or misspelling parameter names—leading to failed inference, divergent MCMC chains, and misleading predictive metrics. RefineStat introduces a two‑phase pipeline that first enforces semantic‑constrained decoding and then performs diagnostic‑aware refinement.
During decoding, the system maintains a context‑free grammar for the target probabilistic programming language (e.g., PyMC) and a set of six validation predicates: parse‑ability, distribution existence, parameter specification compliance, variable dependency ordering, support‑range validity, and type correctness. Tokens are sampled from the SLM, and any violation triggers immediate local rejection sampling that resamples only the offending token(s) or backtracks to the nearest safe point. This ensures that every generated fragment respects the language’s strict semantics without incurring a prohibitive computational cost.
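The validation predicates can be illustrated with a minimal sketch. The snippet below is a simplified, hypothetical checker (not the paper's implementation): it applies three of the six predicates — parse-ability, distribution existence, and parameter-name compliance — to a single candidate PyMC statement, using a tiny hand-written registry where the real system relies on a full grammar for the language:

```python
import re

# Hypothetical, simplified registry mapping distributions to their legal
# parameter names; the paper's grammar covers PyMC's full API.
DIST_PARAMS = {
    "Normal": {"mu", "sigma"},
    "HalfNormal": {"sigma"},
    "Exponential": {"lam"},
}

def check_line(line, defined_vars):
    """Apply three validation predicates to one candidate statement.

    Returns (ok, reason): reason names the first predicate that failed,
    or "ok" if the line passes all checks applied here.
    """
    m = re.match(r'(\w+)\s*=\s*pm\.(\w+)\("(\w+)",\s*(.*)\)\s*$', line)
    if not m:
        return False, "parse"            # parse-ability predicate fails
    var, dist, _name, args = m.groups()
    if dist not in DIST_PARAMS:
        return False, "distribution"     # unknown distribution
    kwargs = set(re.findall(r"(\w+)\s*=", args))
    if not kwargs <= DIST_PARAMS[dist]:
        return False, "parameters"       # e.g. misspelled 'sd' for 'sigma'
    defined_vars.add(var)                # track names for dependency checks
    return True, "ok"
```

A statement like `y = pm.Normal("y", mu=0, sd=1)` would fail the parameter-compliance predicate, triggering local rejection sampling of just that fragment rather than restarting the whole program.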
After a program passes the syntactic/semantic filter, RefineStat runs a full Bayesian workflow: posterior inference (via MCMC or variational methods) followed by a battery of seven diagnostics — split‑R̂, bulk effective sample size, tail effective sample size, Bayesian Fraction of Missing Information (BFMI), number of divergences, Pareto‑k shape parameters from PSIS‑LOO, and the estimated expected log predictive density (ELPD‑LOO). Each diagnostic is binarized against pre‑defined thresholds, and the sum yields a reliability score B(M). Models scoring at least ζ = 5 (out of 7) are considered “valid”. If a model fails any check, RefineStat identifies whether the prior or likelihood component is responsible and selectively resamples that fragment, preserving the rest of the program. This diagnostic‑aware refinement loop iterates until a model meets the reliability criteria, at which point the final ELPD‑LOO is reported.
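The reliability score can be sketched as a sum of binarized checks over summary diagnostics. The thresholds below are common Bayesian-workflow rules of thumb (R̂ < 1.01, ESS > 400, Pareto-k < 0.7), not necessarily the paper's exact values, and the `diag` dictionary keys are illustrative names for quantities one would extract from an inference library such as ArviZ:

```python
import math

ZETA = 5  # minimum reliability score (out of 7) for a model to count as "valid"

def reliability_score(diag):
    """Binarize seven diagnostics and sum them into B(M).

    Thresholds are illustrative rules of thumb; the paper uses its own
    pre-defined cutoffs.
    """
    checks = [
        diag["rhat_max"] < 1.01,          # split-R-hat convergence
        diag["ess_bulk_min"] > 400,       # bulk effective sample size
        diag["ess_tail_min"] > 400,       # tail effective sample size
        diag["bfmi_min"] > 0.3,           # energy diagnostic (BFMI)
        diag["n_divergences"] == 0,       # HMC divergences
        diag["pareto_k_max"] < 0.7,       # PSIS-LOO Pareto-k shapes
        math.isfinite(diag["elpd_loo"]),  # ELPD-LOO computed successfully
    ]
    return sum(checks)

def is_valid(diag):
    """True when the model meets the reliability criterion B(M) >= zeta."""
    return reliability_score(diag) >= ZETA
```

A model failing, say, R̂, BFMI, and the divergence check would score 4 and re-enter the refinement loop, with the failing diagnostics pointing at whether the prior or the likelihood fragment should be resampled.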
The authors evaluate RefineStat on five representative Bayesian datasets (regression, classification, time‑series) using five open‑weight SLMs ranging from 2B to 8B parameters. Baselines include (i) unconstrained LLM generation, (ii) a syntax‑only constrained decoder (Syncode), (iii) a state‑of‑the‑art GPT‑4‑based two‑stage system (BoxLM), and (iv) closed‑source GPT‑4 directly. Results show that RefineStat reduces syntactic errors by >90% and semantic errors by >85% relative to raw SLM outputs. Diagnostic pass rates climb to an average of 6.2 out of 7, with BFMI and divergence checks nearly perfect. Importantly, the final ELPD‑LOO scores are statistically indistinguishable from those obtained by GPT‑4 and BoxLM, despite using a single small model and making roughly half as many API calls. In terms of runtime and cost, RefineStat is roughly 3–4× faster than unconstrained generation and cuts cloud‑API expenses by ~45%.
Key contributions are: (1) a language‑model‑agnostic semantic constrained decoding framework tailored to probabilistic programming, (2) a diagnostic‑driven refinement loop that leverages Bayesian quality metrics to guide selective resampling, (3) extensive empirical evidence that small open‑weight models can match or exceed large proprietary LLMs in producing statistically sound probabilistic programs, and (4) a modular design that can be extended to other PPLs (Stan, Edward) or even to non‑probabilistic domains such as symbolic physics discovery.
The paper also discusses limitations: the validation predicates are currently hand‑crafted for PyMC, requiring adaptation for other languages; fixed diagnostic thresholds may be overly conservative for some applications; and the refinement focuses on priors and likelihoods, leaving more complex hierarchical structures as future work. The authors suggest future directions including learning validation rules automatically, integrating ensemble model search, and applying RefineStat to broader scientific modeling tasks. Overall, RefineStat demonstrates that careful integration of semantic constraints and Bayesian diagnostics can dramatically improve the reliability and efficiency of probabilistic program synthesis with modest computational resources.