ASP-Bench: From Natural Language to Logic Programs


Automating the translation of natural-language specifications into logic programs is a challenging task central to neurosymbolic engineering. We present ASP-Bench, a benchmark of 128 natural-language problem instances (64 base problems, each with an easy and a hard variant). It evaluates systems that translate natural-language problems into Answer Set Programs (ASP), a prominent form of logic programming, and provides systematic coverage of ASP features, including choice rules, aggregates, and optimization. Each problem includes a reference validator that checks whether a solution satisfies the problem specification. We characterize problems along seven largely independent reasoning aspects (optimization, temporal reasoning, default logic, resource allocation, recursion, spatial reasoning, and quantitative complexity), providing a multidimensional view of modeling difficulty. We test the benchmark with an agentic approach based on the ReAct (Reason and Act) framework, which achieves full saturation, demonstrating that iterative refinement driven by solver feedback is a reliable and robust approach to modeling natural language in ASP. Our analysis across multiple agent runs yields insights into what determines a problem's modeling hardness.


💡 Research Summary

The paper introduces ASP‑Bench, a comprehensive benchmark designed to evaluate systems that translate natural‑language (NL) problem specifications into Answer Set Programs (ASP). The authors note that existing NL‑2‑ASP benchmarks suffer from two major drawbacks: evaluation based on exact string matching, which penalizes semantically correct but syntactically different solutions, and a lack of systematic coverage of ASP language constructs. To address these issues, ASP‑Bench comprises 128 problem instances derived from 64 base problems, each provided in an “easy” and a “hard” variant. The easy variants are relatively small and straightforward, while the hard variants increase difficulty by scaling instance size, adding more constraints, introducing multi‑objective optimization, and embedding richer reasoning patterns.
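The ASP constructs the benchmark targets can be illustrated with a small scheduling fragment (a hypothetical example for this summary, not taken from the benchmark) that combines a choice rule, an aggregate, and an optimization statement:

```prolog
% Hypothetical fragment illustrating the ASP features ASP-Bench covers.
% Choice rule: each task is assigned exactly one machine.
1 { assign(T, M) : machine(M) } 1 :- task(T).

% Aggregate: the summed duration on a machine must not exceed its capacity.
:- machine(M), capacity(M, C),
   #sum { D, T : assign(T, M), duration(T, D) } > C.

% Optimization: minimize the total cost of the chosen assignments.
#minimize { W, T, M : assign(T, M), cost(T, M, W) }.
```

Exact string matching would reject any semantically equivalent reformulation of such a program, which is precisely the brittleness the benchmark's validators avoid.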

Each problem is accompanied by a reference validator implemented as a Python script. The validator receives a JSON solution, checks that all constraints are satisfied, and for optimization problems also verifies optimality against a known optimum. This semantic verification allows multiple correct solutions and avoids the brittleness of string‑based comparison.
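A reference validator of this kind might look as follows. This is a minimal sketch for a hypothetical graph-coloring problem, not code from the benchmark; the JSON schema, instance data, and function name are all assumptions:

```python
import json

# Hypothetical instance data: edges of a graph and the allowed colors.
EDGES = [(1, 2), (2, 3), (1, 3)]
COLORS = {"red", "green", "blue"}

def validate(solution_json: str) -> bool:
    """Check a JSON solution of the form {"color": {"1": "red", ...}}."""
    solution = json.loads(solution_json)
    coloring = solution.get("color", {})
    # Every node incident to an edge must receive an allowed color.
    nodes = {n for edge in EDGES for n in edge}
    if not all(coloring.get(str(n)) in COLORS for n in nodes):
        return False
    # No two adjacent nodes may share a color.
    return all(coloring[str(a)] != coloring[str(b)] for a, b in EDGES)
```

Because the check is semantic, any of the six proper 3-colorings of this triangle would be accepted, regardless of which one the solver happens to emit.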

A key contribution of the benchmark is the taxonomy of seven largely independent reasoning aspects that characterize modeling difficulty:

  1. OPT – optimization statements (minimize/maximize).
  2. TEMP – temporal or sequential reasoning requiring ordering constraints.
  3. DEFAULT – soft constraints or preferences.
  4. RESOURCE – resource allocation with capacity limits.
  5. RECURSIVE – fixed‑point or cycle‑detection reasoning.
  6. SPATIAL – grid‑based or neighborhood logic.
  7. QUANT – high quantitative complexity (many distinct constraints).

Hard problems may combine several of these aspects, enabling a multidimensional analysis of what makes a problem hard for an NL‑2‑ASP system. Figure 1 in the paper visualizes the distribution of aspects across the 64 hard instances and correlates them with a novel difficulty metric: the average number of python_exec calls made by the agent during the solution process.

To demonstrate the benchmark and provide a strong baseline, the authors implement an agentic system based on the ReAct (Reason and Act) framework. The agent uses a single‑prompt approach with Claude Sonnet 4.5 (accessed via OpenRouter) as the underlying large language model (LLM). It interacts with the clingo Python API, maintaining a persistent IPython kernel so that state (e.g., generated rules, intermediate solutions) persists across multiple reasoning‑action cycles. The agent’s workflow follows a five‑step plan: (1) planning, (2) analysis & modeling, (3) implementation, (4) solving & extraction, and (5) formatting & verification. A detailed project prompt (clingo.md) supplies the agent with mandatory ASP syntax rules, safety requirements, common modeling patterns, and a catalog of anti‑patterns (e.g., unsafe variables, grounding explosions). The agent iteratively calls python_exec to generate, test, and debug ASP code, using clingo error messages and failed validator runs as feedback. When a solution passes the validator, the agent invokes save_code to write the final program and solution to disk.
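The feedback loop at the heart of this workflow can be sketched as follows. This is a simplified illustration with stubbed components, not the authors' implementation: `propose_program` stands in for the LLM call and `run_solver` for the clingo invocation, and both names are hypothetical.

```python
# Sketch of ReAct-style iterative refinement: propose a program, test it,
# and feed solver/validator errors back until a solution is accepted.

def propose_program(feedback):
    # A real agent would prompt the LLM here, including any feedback text.
    return "ok" if feedback else "buggy"

def run_solver(program):
    # A real agent would ground and solve the program with clingo here.
    if program == "buggy":
        return None, "error: unsafe variable in rule body"
    return {"answer": 42}, None

def validate(solution):
    # Stand-in for the benchmark's reference validator.
    return solution is not None

def solve(max_iters=5):
    feedback = None
    for step in range(1, max_iters + 1):
        program = propose_program(feedback)
        solution, error = run_solver(program)
        if error is None and validate(solution):
            return program, solution, step  # success after `step` iterations
        feedback = error or "validator rejected the solution"
    raise RuntimeError("no valid solution within the iteration budget")
```

Under these stubs the loop succeeds on the second iteration, mirroring how the real agent turns clingo error messages and failed validator runs into the next prompt.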

Experimental setup: easy problems received a 600‑second timeout for the entire agent process, hard problems a 1200‑second timeout; each clingo call was limited to 20 seconds. The authors ran each easy problem twice and each hard problem three times to assess consistency. All runs completed well within the limits.

Results: The ReAct‑based ASP‑Agent achieved full saturation—it solved every instance correctly, both easy and hard, across all runs. The average number of python_exec calls per hard problem was 7.7, with a range from 3.7 (simplest problems such as Queens domination, Zebra puzzle) to 26.0 (most demanding problems like DNA sequence assembly, Metroidvania level generation, “Who is the killer?” logic puzzle, Nonogram solver, and a historical counterfactual scenario). The authors provide a full table of execution counts and token usage.

To understand what drives difficulty, the authors correlated the python_exec count with the presence of reasoning aspects. Problems that simultaneously involve OPT, QUANT, and RECURSIVE tend to require many more iterations, while those dominated by a single aspect (e.g., only TEMP) are solved with fewer calls. This analysis validates the aspect taxonomy as a useful predictor of modeling hardness.
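An analysis of this kind can be reproduced from any run log. The sketch below uses made-up aspect tags and call counts (illustrative only, not the paper's data) to compare mean python_exec counts for problems with and without a given aspect:

```python
from statistics import mean

# Hypothetical run log: per-problem aspect tags and python_exec counts.
# Values are invented for illustration, not taken from the paper.
runs = [
    ({"TEMP"}, 4),
    ({"SPATIAL"}, 5),
    ({"OPT", "QUANT"}, 12),
    ({"OPT", "QUANT", "RECURSIVE"}, 24),
    ({"DEFAULT", "RESOURCE"}, 8),
]

def mean_calls_by_aspect(aspect):
    """Mean python_exec count for problems with vs. without `aspect`."""
    with_aspect = [count for aspects, count in runs if aspect in aspects]
    without = [count for aspects, count in runs if aspect not in aspects]
    return mean(with_aspect), mean(without)
```

On this toy log, problems tagged OPT average far more calls than the rest, which is the shape of the effect the authors report for OPT, QUANT, and RECURSIVE.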

The paper discusses several implications. First, semantic validation combined with feedback‑driven iterative refinement proves far more robust than one‑shot NL‑2‑ASP generation, which prior work reported as insufficient for many graph‑based or puzzle problems. Second, the python_exec count offers a quantitative proxy for the cognitive effort an autonomous agent must expend, enabling systematic benchmarking of future NL‑2‑ASP approaches. Third, the success of a single LLM (Claude Sonnet 4.5) suggests that current foundation models, when equipped with structured prompts and tool use, can handle sophisticated logical modeling tasks without hand‑crafted pipelines.

Limitations are acknowledged. The agent is tightly coupled to Claude Sonnet 4.5; generalization to other LLMs (e.g., GPT‑4, Gemini) remains untested. The benchmark's solver timeouts (≤ 20 seconds per call) may not reflect real‑world large‑scale ASP applications that require longer grounding or solving phases. Moreover, the benchmark currently focuses on problems expressible within the seven chosen aspects; future extensions could incorporate probabilistic reasoning, dynamic environments, or richer ontological vocabularies.

Future work outlined includes (a) evaluating a broader set of LLMs and multimodal models, (b) exploring automated prompt optimization (e.g., meta‑learning of prompt templates), (c) extending ASP‑Bench with additional domains such as planning under uncertainty or hybrid symbolic‑numeric tasks, and (d) integrating human‑in‑the‑loop debugging interfaces to study collaborative agent‑human problem solving.

In summary, ASP‑Bench fills a critical gap in the neuro‑symbolic community by providing a rigorously constructed, semantically verified benchmark that systematically covers core ASP features and diverse reasoning patterns. The demonstrated success of a ReAct‑based autonomous agent establishes a strong baseline and showcases the power of iterative, feedback‑driven modeling for translating natural language into declarative logic programs.

