We present a novel neurosymbolic framework for RDF-to-text generation, in which the model is "trained" through collaborative interactions among multiple LLM agents rather than traditional backpropagation. The LLM agents produce rule-based Python code for a text generator for the given domain, based on RDF triples alone, with no in-domain human reference texts. The resulting system is fully interpretable, requires no supervised training data, and generates text nearly instantaneously using only a single CPU. Our experiments on the WebNLG and OpenDialKG datasets show that outputs produced by our approach reduce hallucination, with only slight fluency penalties compared to fine-tuned or prompted language models.
RDF-to-text is a popular task in natural language generation (NLG) that involves converting a subset of a knowledge graph, represented as RDF triples, into coherent natural language text (Castro Ferreira et al., 2020; Agarwal et al., 2021; Kasner and Dusek, 2022; Li et al., 2024). For instance, one possible verbalization of the following RDF triples: (Chopin, birthplace, Poland), (Chopin, birth year, 1810) is "Chopin was born in 1810 in Poland."
RDF-to-text systems are typically built using either rule-based or neural approaches (Gatt and Krahmer, 2018). Rule-based methods (Lavoie and Rambow, 1997; White and Baldridge, 2003) use predefined templates and linguistic rules for precise, controlled output. In contrast, neural approaches rely on supervised learning from human data (Ke et al., 2021; Chen et al., 2020) or in-context learning with large language models (LLMs) (Axelsson and Skantze, 2023; Mille et al., 2024) to generate more fluent and varied text, yet their adoption in industrial applications faces significant challenges. Despite impressive benchmark performance, neural NLG systems generally lack interpretability and controllability, suffer from hallucinations, and require substantial computational resources (Zhang et al., 2021; Ji et al., 2023).
In this work, we introduce a novel paradigm for building interpretable RDF-to-text systems that, instead of relying on supervised data, leverages the coding capabilities of large language models (LLMs) to develop a full NLG system from scratch in pure Python. Our approach involves a training stage where several LLM agents collaborate to iteratively design, implement, test, and refine a rule-based NLG model for a given domain using only unsupervised data (in-domain RDF triples, with no human references). Once training is complete, the system operates independently of any LLMs or neural components.
Experiments conducted on five datasets demonstrate that the proposed approach outperforms nontrivial neural baselines on reference-based metrics while offering full interpretability and controllability, producing fewer hallucinations, and providing remarkably fast inference times on a single CPU.
Our approach to training an NLG system relies on five LLM agents. The Software Architect (SA) comes up with a design of the NLG system, making high-level decisions about the code structure. The Software Engineer (SE) iterates over the particular functions of the designed code structure and implements each one. The Evaluator is a Python execution engine that runs the automatically written NLG system and then uses an LLM to assess the textual outputs produced. Unit tests for evaluation are supplied by the Test Engineer (TE), embracing the test-driven development (TDD) paradigm. Finally, the Code Analyst (CA) analyses the NLG system implementation and any failing unit tests, determining whether the issues can be resolved by rewriting specific functions or whether a full redesign of the system is needed. Depending on the CA's decision, the training process returns to either the SE or the SA agent, which then revises the selected parts of the NLG system accordingly. The approach is illustrated in Fig. 1 and in Appendix D.
Figure 1: Overview of the presented approach. LLM agents (boxes with green border) interact with each other to write an entire NLG system in pure Python during the training phase. The final system is fully interpretable, easy to edit by a human, and does not need any LLM during inference.
The input to the training process is a knowledge graph, parts of which will later be verbalised by the constructed NLG system. Note that no reference texts or annotated examples are used. The output of the training is a single Python file containing the implementation of the NLG system. At inference time, the system is able to generalise to unseen data, provided it adheres to the same schema, i.e. that predicates are defined consistently with those in the training graph. We provide a more detailed description of each LLM agent involved in the training process below.
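To make this contract concrete, the sketch below shows the interface every trained system is expected to expose; the RDFTriple and NLGSystem names follow the prompt in Appendix A, while the naive method body is purely illustrative and is not the code produced by our agents.

```python
from dataclasses import dataclass


@dataclass
class RDFTriple:
    """A single edge of the knowledge graph."""
    subject: str
    predicate: str
    object: str


class NLGSystem:
    """Illustrative stand-in for the Python file produced by training.

    The real implementation is written by the LLM agents; only the entry
    point below is fixed in advance.
    """

    def verbalize_set_of_triples(self, triples: list[RDFTriple]) -> str:
        # Purely illustrative fallback: the generated code applies rules
        # tailored to the domain's predicates instead of this naive template.
        return " ".join(
            f"The {t.predicate} of {t.subject} is {t.object}." for t in triples
        )


# Inference-time usage: no LLM or GPU is involved.
triples = [
    RDFTriple("Chopin", "birthplace", "Poland"),
    RDFTriple("Chopin", "birth year", "1810"),
]
print(NLGSystem().verbalize_set_of_triples(triples))
```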
Test Engineer begins by extracting a list of all predicates present in the knowledge graph (KG). To provide the model with contextual understanding of each predicate, a random triple containing the predicate in question is selected from the graph. The LLM is then instructed to generate 50 input-output example pairs for a data-to-text system using these predicates. Any examples containing predicates not found in the KG are discarded, and the remaining examples are added to the set of unit tests. This process is repeated until each predicate is covered by at least three unit tests. The exact prompts of TE and other agents are provided in Appendix A.
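A minimal sketch of this filtering-and-coverage loop is given below; the function names are our own, and generate_examples stands in for the LLM call that returns candidate (input triples, expected text) pairs.

```python
import random
from collections import Counter


def build_unit_tests(kg_triples, generate_examples, min_tests_per_predicate=3):
    """Illustrative sketch of the Test Engineer loop (names are ours).

    kg_triples: list of (subject, predicate, object) tuples.
    generate_examples(contexts): placeholder for the LLM call that returns
        candidate (input_triples, expected_text) pairs, given one sample
        triple per predicate as context.
    """
    known_predicates = {p for _, p, _ in kg_triples}
    unit_tests, coverage = [], Counter()

    # Repeat until every predicate is covered by enough unit tests.
    while any(coverage[p] < min_tests_per_predicate for p in known_predicates):
        # One random triple per predicate gives the LLM context for its meaning.
        contexts = {
            p: random.choice([t for t in kg_triples if t[1] == p])
            for p in known_predicates
        }
        for input_triples, expected_text in generate_examples(contexts):
            # Discard examples that mention predicates absent from the KG.
            if any(p not in known_predicates for _, p, _ in input_triples):
                continue
            unit_tests.append((input_triples, expected_text))
            coverage.update(p for _, p, _ in input_triples)
    return unit_tests
```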
Software Architect is given a list of all predicates found in the KG, along with an instruction to produce the high-level design of a rule-based NLG system. SA’s output defines the code structure by specifying a list of required functions, their responsibilities, input arguments, and interactions. The only hardcoded requirement is the main entry point class and function.
Software Engineer iterates over the SA-produced list of functions and implements them one by one, given a description of the design, the code implemented so far, and the signature of the function to be implemented. In the later stages of training, the SE is also given feedback from the Code Analyst and a list of failed unit tests.
Evaluator executes the NLG system code for each unit test within a Python interpreter, running each instance in a separate process with a predefined timeout, marking errors or timeouts as failures. Successful outputs are sent to an LLM, which answers a yes/no question on whether the generated verbalization correctly reflects the given input. To speed up evaluation, the process is terminated as soon as five failed unit tests are detected. If the constructed program passes all unit tests, the training process is terminated.
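The sandboxed execution can be implemented with the standard library alone; the sketch below is a simplified version, where the function names, the timeout value, and the judge placeholder for the LLM-based yes/no check are our own illustrative choices.

```python
import multiprocessing as mp


def _run_single_test(code_path, input_triples, queue):
    """Load the agent-written module and verbalize one unit-test input."""
    namespace = {}
    with open(code_path) as f:
        exec(f.read(), namespace)          # load the generated NLG system
    system = namespace["NLGSystem"]()
    queue.put(system.verbalize_set_of_triples(input_triples))


def evaluate(code_path, unit_tests, judge, timeout=5, max_failures=5):
    """Illustrative sketch of the Evaluator.

    judge(input_triples, output_text) is a placeholder for the LLM-based
    yes/no check of whether the output reflects the input.
    """
    failures = []
    for input_triples, _expected in unit_tests:
        queue = mp.Queue()
        proc = mp.Process(target=_run_single_test, args=(code_path, input_triples, queue))
        proc.start()
        proc.join(timeout)
        if proc.is_alive():                # timeout: kill the process, count as failure
            proc.terminate()
            proc.join()
            failures.append((input_triples, "timeout"))
        elif proc.exitcode != 0 or queue.empty():
            failures.append((input_triples, "runtime error"))
        elif not judge(input_triples, queue.get()):
            failures.append((input_triples, "incorrect output"))
        if len(failures) >= max_failures:  # early stop to speed up evaluation
            break
    return failures
```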
Code Analyst receives the evaluation results and analyses both the system design and its current implementation to determine the root causes of the failed tests. Based on this analysis, CA decides whether the issues stem from flaws in the overall design or from specific functions in the implementation. If a full redesign is needed, the CA’s textual feedback is passed back to SA, which produces a new design. If only certain functions require revision, CA supplies a list of these functions to SE to reimplement.
The interaction between the LLM agents, i.e. the system training process, terminates either when the constructed NLG system passes all unit tests, or when the maximum iteration limit is reached.
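Putting the agents together, the overall loop can be sketched as follows; this is a simplified rendering of Alg. 1 in Appendix D, where the agents object and its methods are placeholders for the LLM calls described above and max_iterations is the iteration budget.

```python
def predicates_of(kg_triples):
    """All predicates appearing in the knowledge graph."""
    return sorted({p for _, p, _ in kg_triples})


def train_nlg_system(kg_triples, agents, max_iterations=25):
    """Simplified sketch of the agent interaction loop (cf. Alg. 1, Appendix D)."""
    unit_tests = agents.test_engineer.build_unit_tests(kg_triples)
    design = agents.software_architect.design(predicates_of(kg_triples))
    code = agents.software_engineer.implement(design)

    for _ in range(max_iterations):
        failures = agents.evaluator.run(code, unit_tests)
        if not failures:
            return code                      # all unit tests pass: training is done
        verdict = agents.code_analyst.analyse(design, code, failures)
        if verdict.needs_redesign:           # full redesign: back to the SA
            design = agents.software_architect.design(
                predicates_of(kg_triples), feedback=verdict.feedback)
            code = agents.software_engineer.implement(design)
        else:                                # targeted fixes: back to the SE
            code = agents.software_engineer.reimplement(
                code, verdict.functions_to_fix, feedback=verdict.feedback)
    return code                              # best effort once the budget is exhausted
```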
Baselines We compare the results of our rule generation approach with two baselines: a fine-tuned BART model (Lewis et al., 2020) and a prompted Llama 3.3 70B LLM.
Datasets We experiment on two domains, with five datasets in total. First, the models were trained on the popular WebNLG domain (Gardent et al., 2017), which contains data expressed as RDF triples alongside their corresponding text references. For evaluation, we used four test sets: the standard WebNLG test set and three datasets from the GEM 2024 shared task (Mille et al., 2024). The GEM datasets were specifically designed to test system robustness by including RDF triples that are: (1) factual, containing factually correct information; (2) counterfactual, i.e. data from the factual dataset with switched entity names; (3) fictional, where the triples contain fictional entities.
Second, we trained and evaluated the models on the OpenDialKG dataset (Moon et al., 2019), which contains dialogues annotated with RDF triples representing the information expressed in each utterance. We use this dataset for the RDF-to-text task, treating the utterances as textualisations of the data without taking dialogue history into account.
During training, our rule-based approach relied solely on the knowledge graph induced by the RDF triples from the dataset, but the fine-tuned neural baseline was trained using reference texts from the training set, with early stopping based on performance on the development set.
Our approach We tested our approach with four different LLMs: one proprietary model (GPT-4.1; OpenAI, 2025) and three open-source models: Qwen 3 235B (Yang et al., 2024), Qwen 2.5 72B (Yang et al., 2024) and Llama 3.3 70B (Touvron et al., 2024). The open-source models were used in 4-bit quantisation through the ollama library. Training was run with a maximum of 25 iterations (10 for GPT) and repeated three times. The best model was selected based on the number of unit tests passed. We use structured outputs to obtain easy-to-process responses from SA and CA. As the entire WebNLG graph is substantial, we trained our system separately for each WebNLG thematic category. As different LLMs are not equally strict when assessing the produced outputs, the Evaluator agent always used the Llama 3.3 model for better comparability. The constructed programs are available in the code repository.
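The structured outputs of SA and CA can be thought of as simple records; the dataclasses below are an illustrative schema whose field names are our own, chosen to match the placeholders in the training-loop sketch above, and may differ from the exact schemas used in the experiments.

```python
from dataclasses import dataclass, field


@dataclass
class FunctionSpec:
    """One entry of the SA's design: a function to be implemented by the SE."""
    name: str
    signature: str
    responsibility: str
    inputs: list[str] = field(default_factory=list)
    outputs: str = ""


@dataclass
class SystemDesign:
    """Structured output of the Software Architect."""
    architecture_summary: str
    functions: list[FunctionSpec] = field(default_factory=list)


@dataclass
class AnalystVerdict:
    """Structured output of the Code Analyst."""
    needs_redesign: bool
    functions_to_fix: list[str] = field(default_factory=list)
    feedback: str = ""
```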
We evaluate the quality of the generated outputs using several widely adopted reference-based metrics: BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), BERTScore (Zhang et al., 2020), and BLEURT (Sellam et al., 2020). This evaluation was not conducted on the GEM datasets, as they do not include reference texts.
The WebNLG test set results in Table 1 reveal that our model trained by GPT-4.1 agents achieved the highest scores on METEOR and BLEURT metrics. Although a fine-tuned neural model outperformed ours on BLEU and BERTScore overall, our system still achieved better scores on these metrics within a more challenging subset of out-of-domain examples. Our model also outperformed prompted Llama 3.3 70B. Note that neither of these systems was trained on human-written reference texts.
There is relatively little difference in performance between our rule-based systems produced by GPT 4.1 and those produced by the largest open-source model, Qwen 3 235B. While GPT 4.1 achieved better results on METEOR, BERTScore and BLEURT, Qwen 3 performed slightly better on BLEU and on out-of-domain examples.
NLG systems trained using smaller open-source LLMs were less successful, indicating that more powerful LLMs may be necessary for implementing complete NLG systems. Nonetheless, these models retain certain advantages over purely neural models, as they provide full transparency of the generation process and can potentially be manually improved by skilled developers. The inference time comparison in Table 3 shows another advantage of our models: they achieve a 35x speedup on CPU compared to the BART model running on GPU, and a 272x speedup when both models run on CPU (the reported times do not include loading the models into memory and were measured on a machine with an Nvidia A40 48 GB GPU and an AMD EPYC 7313 CPU).
The results obtained on OpenDialKG are presented in Table 2. Here, the fine-tuned model clearly obtained the highest results on reference-based metrics, indicating the importance of using the original training data to produce the expected sentence structures. Nevertheless, all of our models outperformed the prompted LLM on all metrics.
We perform a reference-less evaluation on all test sets using the LLM-as-a-Judge approach (Zheng et al., 2023; Gu et al., 2025). The selected LLM (Llama 3.3 70B) provides binary judgments on three aspects: grammatical correctness of the generated text (Gram.), presence of unsupported facts (Add.), and omission of input triples in the output (Om.). The exact prompts are provided in Appendix B.
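The judging can be orchestrated with a few lines of glue code; in the sketch below, the question wordings are simplified stand-ins for the actual prompts in Appendix B, and ask_llm is a placeholder for a call to the judge model.

```python
# Simplified stand-ins for the judge prompts (the real wordings are in Appendix B).
QUESTIONS = {
    "Gram.": "Is the following text grammatically correct? Answer yes or no.",
    "Add.": "Does the text mention any fact not supported by the input triples? Answer yes or no.",
    "Om.": "Does the text omit any of the input triples? Answer yes or no.",
}


def judge_output(ask_llm, input_triples, output_text):
    """Collect one binary judgment per aspect from the judge LLM."""
    verdicts = {}
    for aspect, question in QUESTIONS.items():
        prompt = f"{question}\n\nInput: {input_triples}\nOutput: {output_text}"
        verdicts[aspect] = ask_llm(prompt).strip().lower().startswith("yes")
    return verdicts
```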
The results of systems trained on the WebNLG dataset, shown in Table 4, reveal that the outputs of our rule-based system trained with GPT-4.1 are more grammatically correct and contain fewer hallucinations than the outputs of fine-tuned BART on all four test sets. The outputs produced by an LLM (Llama 3.3) achieve the highest grammatical correctness. On three out of four test sets, our model trained with Qwen 2.5 reduces the number of additions compared to the LLM's output, sometimes by nearly fourfold, while maintaining a comparable or lower number of omissions.
The results for the OpenDialKG dataset in Table 5 show that our GPT-4.1-trained system produced significantly fewer additions and omissions (t-test, α = 5%) than both fine-tuned BART and Llama 3.3. Our model also achieved better grammatical correctness than BART while scoring slightly worse than Llama 3.3.
We performed two ablation experiments: 1) we replaced the SA agent with a static system design produced by a human (Abl. 1); 2) we used training examples from the WebNLG training set instead of generated unit tests (Abl. 2). The results of the ablations are in Table 6. Using a static design of the system has a highly negative impact, which is especially visible in trainable metrics such as BLEURT. Evaluating using the original WebNLG training set examples instead of automatically generated unit tests also yields slightly worse results, demonstrating the utility of our approach.
We conducted a small-scale in-house human evaluation for 100 randomly selected instances from the WebNLG test set. Outputs of our system (with GPT-4.1) and both baselines (BART, Llama 3.3) were annotated by six NLP experts who answered binary questions about the presence of minor hallucinations (e.g. typos in named entity names), major hallucinations (output containing facts not supported by the data), omissions (missing information), disfluencies (grammar errors or difficult-to-read text) and repetitions (information mentioned twice). In total, 300 system outputs were annotated. The inter-annotator agreement, measured by Cohen's Kappa and averaged over all questions, was 0.8288. The results are presented in Table 7. The annotators did not detect any hallucinations in the outputs of our system, indicating that it hallucinates very rarely. Although our system occasionally omits facts from the input, its omission rate is comparable to that of a prompted LLM. It also received the lowest number of disfluency annotations, and was tied with the prompted LLM for the smallest number of repetitions.
Since the result of training our rule-based NLG approach is Python code, it should be possible to understand how the text was produced and even modify the system if needed. We asked two experienced Python software engineers (SEs) to familiarise themselves with the implementation of our NLG system produced by GPT-4.1 and perform two tasks:
• Interpretability task: we provided 25 examples of input triples and outputs produced by the system. In the output text, one word was randomly highlighted and the SEs were asked to provide the line number containing the code that produced that word. If removing the indicated line from the code resulted in a text that did not contain the highlighted word, the test was considered passed.
• Modification task: we took all outputs of our system involving omissions, as indicated by human evaluators in Sec. 3.5, and asked the SEs to modify the NLG system code to produce output without omissions. During this test, the SEs could use an IDE of their choice, with the possibility of using a Python interpreter for testing, but no AI code assistants such as GitHub Copilot. The outputs of the corrected systems were assessed by a human evaluator to determine whether the generated text still contained omissions.
All tests related to both tasks were successfully passed by the SEs. The average time taken to successfully complete the interpretability task for a single instance was 9.6 seconds. According to the SEs, the code was fully understandable, but it contained some unused parts and could be refactored to improve its clarity. The modification task required more time for code editing, but in almost all cases, this did not exceed five minutes.
On average, a program generated by our approach contains 168 lines of code. A typical NLG system groups RDF triples by subject, processes each group by adding modifiers to the subject, converts the group into a clause and then refines it into a sentence. To improve fluency, modifier ordering is often applied. Different LLMs exhibit varying coding styles, e.g. Qwen 2.5 tends to produce Python code with type annotations. The generated code frequently imports standard Python modules such as datetime or collections (e.g. defaultdict), but occasionally also relies on less common libraries like inflect, num2words or even nltk. While no runtime errors were observed when testing on the WebNLG dataset, evaluation on the GEM datasets produced some errors as the generated programs were not robust enough to handle differences in date formatting between the datasets. This resulted in reduced performance on these sets.
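The excerpt below imitates that typical structure; it is our own illustrative reconstruction, not actual agent output, and assumes triple objects with subject, predicate and object attributes as in the prompt of Appendix A.

```python
from collections import defaultdict


def group_by_subject(triples):
    """Group the input triples by their subject entity."""
    groups = defaultdict(list)
    for t in triples:
        groups[t.subject].append(t)
    return groups


def clause_for(predicate, obj):
    """Map one predicate to a verb phrase; real programs have one rule per domain predicate."""
    rules = {
        "birthplace": f"was born in {obj}",
        "birth year": f"was born in {obj}",
        "academic staff size": f"has an academic staff of {obj}",
    }
    return rules.get(predicate, f"has {predicate} {obj}")


def verbalize(triples):
    """Turn each subject group into one sentence and join the sentences."""
    sentences = []
    for subject, group in group_by_subject(triples).items():
        clauses = [clause_for(t.predicate, t.object) for t in group]
        if len(clauses) == 1:
            body = clauses[0]
        else:
            body = ", ".join(clauses[:-1]) + " and " + clauses[-1]
        sentences.append(f"{subject} {body}.")
    return " ".join(sentences)
```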
Program Synthesis is the task of automatically generating programs from specifications, traditionally using formal methods (Gulwani et al., 2017) or evolutionary search (Koza, 1994), and increasingly leveraging neural networks (Wyrwiński and Krawiec, 2024). Modern approaches synthesize programs from natural language, input-output examples, and partial sketches.
LLMs for Coding Recently, Large Language Models trained on large corpora of code and natural language have exhibited remarkable code generation capabilities, enabling them to perform tasks such as code completion, code synthesis from natural language prompts, and bug fixing (Chen et al., 2021; Li et al., 2023). Beyond single-pass generation, reflective approaches like Reflexion (Shinn et al., 2023) and Self-Refine (Madaan et al., 2023) introduced iterative frameworks that equip models with the ability to critique and revise their own outputs to improve the constructed programs. These techniques are typically only employed to generate a single function for algorithmic tasks. Drawing inspiration from evolutionary program search, Novikov et al. (2025) recently presented the AlphaEvolve framework, which uses an LLM ensemble to evolve more complex programs. To the best of our knowledge, however, these approaches have not previously been applied to NLG system construction or, more generally, to the implementation of programs involving language processing.
LLMs for NLG template construction Recently, Warczyński et al. (2024) proposed rule-based NLG systems that use LLM-written templates tailored to specific combinations of the input triples' predicates. These systems rely on a hardcoded engine that splits input triples into known combinations, applies the corresponding templates, and merges the results into a single output text. Unlike our approach, this method requires a dataset with reference texts and does not generalize to out-of-domain examples. While technically interpretable, the method's interpretability is limited by the high number of templates it generates (over 113,000 for the WebNLG dataset), which also makes the produced systems difficult to maintain. We include a comparison with this approach in Appendix F.
This paper presents a new approach to building RDF-to-text systems that uses neural LLMs to train a rule-based system written entirely in Python. The resulting natural language generation (NLG) system is fully interpretable, enabling human intervention to modify its behaviour. The system generates text in a non-autoregressive manner, offering a significant improvement in speed over neural models. Experimental results demonstrate that, although neural models excel at fluency, our approach is often competitive and reduces hallucinations.
Although the presented approach reduces the number of hallucinated texts, it may still generate nonfactual outputs. The NLG system should undergo thorough testing before deployment.
For fine-tuning the BART baseline, we used a scheduler of the learning rate η with a warmup equal to 10% of the optimization steps. The training was scheduled for 20 epochs with early stopping on validation loss (patience of 10 epochs). We used a batch size of 8 and label smoothing with a 0.1 smoothing factor.
The pseudocode of the proposed approach is presented in Alg. 1.
All of the annotators are aged between 20 and 40, hold at least a Master’s degree in Computer Science, and have expertise in NLG systems. Four of the annotators were European and two were Indian. The annotators were not paid specifically for performing the annotations, but were hired by our institution.
We provide a comparison with the most closely related approach (Warczyński et al., 2024), which also uses an LLM to construct templates for RDF-to-text generation. The results are presented in Table 8. The approach uses reference texts during training and is not able to work on out-of-domain examples.
The approach generates over 113,000 rules to handle different cases in RDF-to-text generation. To handle the same dataset, our approach generates only 16 programs (one for each domain), providing better interpretability.
You are an experienced software architect specializing in rule-based Natural Language Generation (NLG) systems implemented in Python. Your task is to provide high-level design guidance. You do not write implementation code. Instead, you define the structure of the system by specifying functions and their responsibilities.
When given a task, respond with:
- A concise description of the overall architecture.
- A list of functions (or classes, if needed), each with:
  - A clear signature.
  - A short description of its purpose.
  - Expected inputs and outputs.
- Optionally, a sketch of how components interact (e.g. as a sequence or flowchart).
- Do not write any implementation code. Your focus is on the design and structure of the system.
Your task is as follows.
Write a rule-based NLG system in Python for data-to-text task. Specifically, write a NLGSystem class with a function verbalize_set_of_triples(triples) that converts a list of RDF triples into plain text. Each RDF triple is an object containing the following properties: triple.subject, triple.predicate and triple.object. The possible values of triple.predicate are: {possible_predicates}
Example:
```
triple1 = RDFTriple(subject="School of Business", predicate="academic staff size", object="737")
triple2 = RDFTriple(subject="School of Business", predicate="birth country", object="Denmark")
triples = [triple1, triple2]
nlg = NLGSystem()
output = nlg.verbalize_set_of_triples(triples)
# output should be e.g. "Denmark's School of Business has an academic staff size of 737 people."
```
Note that the subject of all RDF triples will not always be the same, and the list of triples may be shorter or longer than in this example. In some inputs, the subject of one triple may be the object of another, and so on. Make sure that your code generalizes well to all these cases. The generated text should contain all the information expressed in the triples while being fluent.
The current implementation of the system is as follows:
You are a careful evaluator of NLG systems. Given a set of input RDF triples and an output of a data-to-text system, you evaluate whether the output is a correct verbalization of the input. The system output is correct if all facts expressed in the input triples are verbalized and no additional or incorrect information is mentioned. The output should be fluent and not repetitive. You must answer strictly with 'correct' or 'incorrect'.
Input: {sample.data}
System output: {output}
Is the system output correct?
Table 2: Reference-based evaluation on the OpenDialKG dataset (MET. = METEOR, BERT. = BERTScore).
While our approach does not use generated pseudo-references during training, as the whole process is reference-less, we find that instructing the model to generate sets of input triples alongside pseudo-references results in more plausible examples.