Large Language Models as Formalizers on Constraint Satisfaction Problems


An emerging line of recent work advocates for using large language models (LLMs) as formalizers instead of as end-to-end solvers for various types of problems. Instead of generating the solution, the LLM generates a formal program that derives a solution via an external solver. We thoroughly investigate the formalization capability of LLMs on real-life constraint satisfaction problems. On 4 domains, we systematically evaluate 6 LLMs, including 4 large reasoning models with inference-time scaling, paired with 5 pipelines, including 2 types of formalism. We show that in zero-shot settings, LLM-as-formalizer performs on par with the mainstream LLM-as-solver while offering verifiability, interpretability, and robustness. We also observe excessive reasoning tokens and hard-coded solutions scaling with problem complexity, which demonstrates that even the state-of-the-art LLMs have limited ability to generate solutions or formal programs. We present our detailed analysis and actionable remedies to drive future research that improves LLM-as-formalizer.


💡 Research Summary

This paper conducts a systematic reality check of using large language models (LLMs) as “formalizers” rather than as end‑to‑end solvers on real‑world constraint satisfaction problems (CSPs). The authors focus on four representative CSP domains—calendar scheduling, trip planning, meeting planning, and the Zebra logic‑grid puzzle—each of which requires assigning values to variables under a set of constraints. For each domain they randomly sample 100 instances (out of a larger pool of 1,000) and manually annotate every constraint, so that a solution is considered correct only if it satisfies all formal constraints, not merely matches a reference answer.
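The all-constraints correctness criterion can be sketched in plain Python. The instance encoding, slot representation, and predicates below are illustrative assumptions for a toy calendar-scheduling case, not the authors' actual annotation format:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A slot is (start_hour, end_hour) in decimal hours, e.g. (9.5, 10.0) = 9:30-10:00.
Slot = Tuple[float, float]

@dataclass
class Instance:
    candidate_slots: List[Slot]
    constraints: List[Callable[[Slot], bool]]  # one predicate per annotated constraint

def is_correct(solution: Slot, instance: Instance) -> bool:
    """A solution counts as correct only if ALL annotated constraints hold,
    not merely if it matches a single reference answer."""
    return all(c(solution) for c in instance.constraints)

# Toy instance: a 30-minute meeting between 9:00 and 12:00, avoiding 10:00-10:30.
inst = Instance(
    candidate_slots=[(h / 2, h / 2 + 0.5) for h in range(18, 24)],
    constraints=[
        lambda s: s[0] >= 9.0,                        # not before 9:00
        lambda s: s[1] <= 12.0,                       # done by noon
        lambda s: not (s[0] < 10.5 and s[1] > 10.0),  # attendee busy 10:00-10:30
    ],
)
valid = [s for s in inst.candidate_slots if is_correct(s, inst)]
```

Grading against the full constraint set, rather than a gold answer, matters because many CSP instances admit multiple equally valid solutions.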

Six state‑of‑the‑art LLMs are evaluated: four large‑reasoning models (LRMs) that support inference‑time scaling—DeepSeek‑R1, Qwen‑3‑32B, o3‑mini‑high, and GPT‑5—and two non‑reasoning models (DeepSeek‑V3, Qwen2.5‑32B). Each model is tested in two roles. In the LLM‑as‑solver setting, the model receives the natural‑language problem description and directly outputs the answer, optionally generating a chain‑of‑thought but without any explicit prompting to “think step‑by‑step”. In the LLM‑as‑formalizer setting, the model receives the same description and is asked to generate executable code that encodes the problem; two code styles are considered: (1) free‑form Python that includes both the declarative constraints and a search algorithm, and (2) a concise SMT (Z3) wrapper that only declares constraints and calls a pre‑built solver. After code generation, the program is executed; if it crashes or the solver reports “no plan”, the model is allowed up to five revision attempts (the “revision‑by‑error” loop).
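The revision-by-error loop described above can be sketched as follows. Here `generate` stands in for a hypothetical LLM call, and the `"no plan"` string check is an assumed convention for detecting infeasibility, not the paper's exact protocol:

```python
import subprocess
import sys
import tempfile

MAX_REVISIONS = 5  # the paper allows up to five revision attempts

def run_generated_program(code: str) -> tuple:
    """Execute model-generated code in a subprocess and report the outcome."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run([sys.executable, path],
                          capture_output=True, text=True, timeout=60)
    if proc.returncode != 0:
        return "error", proc.stderr       # syntax or runtime failure
    if "no plan" in proc.stdout.lower():
        return "no_plan", proc.stdout     # solver reported infeasibility
    return "ok", proc.stdout

def formalize_with_revisions(generate):
    """`generate(feedback)` is a stand-in for the LLM call; on failure,
    the error output is fed back as feedback for the next attempt."""
    code = generate(feedback=None)
    for _ in range(MAX_REVISIONS):
        status, output = run_generated_program(code)
        if status == "ok":
            return output
        code = generate(feedback=output)
    return None  # give up after exhausting the revision budget
```

The key design point is that only execution-level signals (crashes and "no plan") trigger revision; a program that runs and prints a wrong plan is not caught by this loop, which is exactly the dominant failure mode the paper reports.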

The central findings are:

  1. Overall performance – Across 24 model‑dataset combinations, the formalizer approach underperforms the solver approach in 15 cases. The gap is especially pronounced on the more complex Trip Planning domain, where every model yields higher accuracy as a solver. In the relatively easier domains (Calendar Scheduling, Meeting Planning, Zebra Logic) Python formalization sometimes matches or slightly exceeds the solver, but never consistently outperforms it.

  2. Effect of problem complexity – The authors stratify instances by the number of constraints (a proxy for difficulty). Both solvers and formalizers degrade sharply as the constraint count grows. In 10 of 12 model‑dataset pairs, even the best‑performing Python formalizer achieves less than half of its accuracy on the easiest 20% of instances. This contradicts the common expectation that formalization should be more robust: a formalizer's output grows only linearly with input length, whereas a direct solver must explore an exponentially growing search space.

  3. Error taxonomy – Generated programs are classified into four outcomes: (a) Error (syntax or runtime failures), (b) No plan (solver reports infeasibility), (c) Wrong plan (program runs but violates at least one annotated constraint), and (d) Correct plan. Syntax/runtime errors are rare and largely eliminated by the revision loop. “No plan” errors are also largely fixed by revisions. The dominant failure mode is “Wrong plan”, which the authors break down further into missing constraints, incorrectly encoded constraints, and faulty algorithmic reasoning. Manual inspection of 80 examples (5 per domain, per code style, per model) confirms that these semantic errors stem from the model’s limited ability to faithfully translate natural‑language constraints into precise logical expressions.

  4. Impact of model type and code style – LRMs consume substantially more reasoning tokens than non‑reasoning models, with usage growing with problem complexity, yet they do not exhibit a clear advantage in formalization quality. Python code generally outperforms SMT code (15/24 cases) despite SMT being a more natural fit for CSPs, suggesting that richer pre‑training exposure to Python helps the model generate more reliable programs.

  5. Observations on “solver‑like” tokens – The authors note an abundance of tokens that resemble internal solver reasoning (e.g., explicit search loops) even when the model is only asked to produce declarative constraints. They hypothesize that pre‑training on large code corpora may cause LLMs to over‑inject procedural logic, which can lead to hard‑coded or brittle solutions.
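The four-way outcome taxonomy from point 3 can be expressed as a small classifier. The `program`/`constraints` interfaces below are illustrative assumptions (a callable returning a plan dict or `None`), not the paper's harness:

```python
from typing import Callable, List, Optional

def classify_outcome(program: Callable[[], Optional[dict]],
                     constraints: List[Callable[[dict], bool]]) -> str:
    """Map one generated program's run onto the paper's four outcomes."""
    try:
        plan = program()
    except Exception:
        return "Error"          # (a) syntax or runtime failure
    if plan is None:
        return "No plan"        # (b) solver reported infeasibility
    if not all(c(plan) for c in constraints):
        return "Wrong plan"     # (c) runs, but violates an annotated constraint
    return "Correct plan"       # (d) satisfies every constraint
```

Note that only outcomes (a) and (b) are observable from execution alone; detecting (c) requires the manually annotated constraints, which is why "Wrong plan" errors survive the revision loop.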

Based on these analyses, the paper proposes several research directions: (i) richer prompt engineering and data augmentation focused on constraint translation; (ii) stronger static analysis or automated verification steps before execution to catch semantic mismatches; (iii) tighter integration of LLMs with CSP solvers that can feed back detailed unsatisfied‑constraint information for iterative refinement; and (iv) dedicated pre‑training or fine‑tuning regimes that emphasize logical reasoning rather than procedural code generation.

In summary, while LLM‑as‑formalizer retains attractive properties (verifiability, interpretability, and the promise of decoupling problem description from search), the empirical evidence in this work shows that current LLMs, reasoning and non‑reasoning alike, still lag behind the direct LLM‑as‑solver approach on realistic CSPs, especially as problem complexity grows. The paper's thorough error analysis and scaling experiments provide a valuable benchmark for future neuro‑symbolic systems aiming to combine the flexibility of LLMs with the rigor of formal solvers.

