GSI Agent: Domain Knowledge Enhancement for Large Language Models in Green Stormwater Infrastructure


Authors: Shaohuang Wang

Abstract

Green Stormwater Infrastructure (GSI) systems, such as permeable pavement, rain gardens, and bioretention facilities, require continuous inspection and maintenance to ensure long-term performance. However, domain knowledge about GSI is often scattered across municipal manuals, regulatory documents, and inspection forms. As a result, non-expert users and maintenance staff may struggle to obtain reliable and actionable guidance from field observations. Although Large Language Models (LLMs) have demonstrated strong general reasoning and language generation capabilities, they often lack domain-specific knowledge and may produce inaccurate or hallucinated answers in engineering scenarios. This limitation restricts their direct application to professional infrastructure tasks. In this paper, we propose GSI Agent, a domain-enhanced LLM framework designed to improve performance in GSI-related tasks. Our approach integrates three complementary strategies: (1) supervised fine-tuning (SFT) on a curated GSI instruction dataset, (2) retrieval-augmented generation (RAG) over an internal GSI knowledge base constructed from municipal documents, and (3) an agent-based reasoning pipeline that coordinates retrieval, context integration, and structured response generation. We also construct a new GSI Dataset aligned with real-world GSI inspection and maintenance scenarios. Experimental results show that our framework significantly improves domain-specific performance while maintaining general knowledge capability. On the GSI dataset, BLEU-4 improves from 0.090 to 0.307, while performance on the common knowledge dataset remains stable (0.304 vs. 0.305). These results demonstrate that systematic domain knowledge enhancement can effectively adapt general-purpose LLMs to professional infrastructure applications.
1 Introduction

Green Stormwater Infrastructure (GSI) is widely adopted to reduce urban flooding, improve water quality, and support sustainable city development. Typical GSI facilities include permeable pavement, rain gardens, bioretention basins, and infiltration systems. However, the performance of these systems depends heavily on regular inspection and proper maintenance. Field staff, engineers, and community members often need clear and consistent guidance when identifying issues such as clogging, sediment accumulation, vegetation overgrowth, or structural damage.

In practice, GSI knowledge is distributed across multiple sources, including municipal design manuals, maintenance guidelines, regulatory documents, and inspection forms. These documents contain detailed procedural rules, technical constraints, and compliance requirements. However, the information is not centralized, and it can be difficult for non-expert users to quickly retrieve and interpret relevant guidance for specific field scenarios.

Large Language Models (LLMs) have shown strong performance in general question answering and reasoning tasks. However, when applied to specialized engineering domains, they often suffer from two major limitations. First, they lack up-to-date and document-grounded domain knowledge. Second, they may generate plausible but incorrect responses, especially when specific regulations or procedures are required. Therefore, directly applying a general-purpose LLM to GSI tasks may lead to unreliable outputs. To address this gap, we focus on improving the domain capability of LLMs for GSI applications through a systematic knowledge enhancement framework.
Instead of designing a new model architecture, we enhance a general LLM using three complementary components:

- Supervised Fine-Tuning (SFT): We construct a domain-specific instruction dataset and fine-tune the base LLM to learn GSI terminology, reasoning patterns, and structured response formats.
- Retrieval-Augmented Generation (RAG): We build an internal GSI knowledge base from municipal manuals and regulatory documents. During inference, relevant passages are retrieved and provided as additional context to improve factual grounding.
- Agent-Based Coordination: We design an agent workflow that organizes retrieval, context integration, and response generation into a structured reasoning pipeline, improving consistency and reliability.

To support supervised training and evaluation, we construct the GSI Dataset, which contains instruction-style samples covering question answering, verification, information extraction, and procedural generation tasks in GSI scenarios. We evaluate our framework on both the domain dataset and a general common knowledge dataset to examine two key aspects: (1) domain performance improvement and (2) general knowledge retention. Experimental results show that our approach substantially improves performance on GSI tasks without degrading general capability. In particular, BLEU-4 on the GSI dataset increases from 0.090 to 0.307 after applying our domain enhancement framework, while performance on the common knowledge dataset remains stable.

In summary, this work demonstrates that combining supervised fine-tuning, retrieval augmentation, and agent-based coordination provides an effective and practical solution for adapting general-purpose LLMs to professional infrastructure domains. Our framework offers a systematic approach to domain knowledge enhancement that can be extended to other engineering and regulatory applications.
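The way the three components compose at inference time can be sketched abstractly. The function and variable names below are placeholders for illustration, not our actual implementation; a real deployment would plug in the RAG index and the LoRA-adapted model:

```python
def answer_gsi_query(query, knowledge_base, fine_tuned_llm, retriever, k=5):
    """Conceptual inference flow: retrieve -> integrate context -> generate.

    `retriever` and `fine_tuned_llm` are stand-ins for the vector index
    and the fine-tuned model; only the control flow is real here.
    """
    passages = retriever(query, knowledge_base, k)           # RAG step
    context = "\n".join(passages)                            # context integration
    prompt = f"Context:\n{context}\n\nQuestion: {query}"     # agent-assembled prompt
    return fine_tuned_llm(prompt)                            # SFT-adapted generation


# Toy stand-ins to demonstrate the control flow only.
toy_retriever = lambda q, kb, k: kb[:k]
toy_llm = lambda prompt: "grounded answer"
result = answer_gsi_query(
    "When should rain gardens be inspected?",
    ["Inspect rain gardens after major storm events."],
    toy_llm, toy_retriever, k=1,
)
```

The key design point is that generation always happens after retrieval and context integration, so factual grounding precedes the model's free-form output.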
Category            | Typical method | Key idea and trade-off
Dynamic injection   | RAG            | Retrieve external documents at inference time; easy to update, but depends on retrieval quality.
Static injection    | SFT            | Put domain patterns into model parameters; strong for style/tasks, but harder to update.
Model adapter       | LoRA           | Train small rank adapters; efficient and low-cost, but capacity is limited.
Prompt optimization | Prompt         | Control behavior without changing weights; fast but may be brittle.

Table 1: Knowledge injection taxonomy used in this paper.

2 Related Work

2.1 Domain-Specific Knowledge Injection

Injecting domain-specific knowledge into large language models has become a critical research area for improving task-specific performance. Song et al. [5] provide a comprehensive survey categorizing knowledge injection methods into dynamic integration, static fine-tuning, parameter-efficient adaptation, and prompt-based guidance. Dynamic integration connects the model to external knowledge sources during inference, enabling access to up-to-date documents without retraining, which has been effective in domains such as legal and biomedical text. Static fine-tuning, or supervised fine-tuning (SFT), incorporates domain knowledge directly into model parameters by training on task-specific corpora; prior studies have shown that SFT significantly improves reasoning and accuracy in specialized domains, but updating the knowledge requires retraining. Parameter-efficient methods, such as LoRA [1], reduce computational costs by training only additional low-rank matrices while freezing the original weights, achieving adaptation with minimal resources. Prompt-based guidance adjusts model behavior through structured instructions or prompts, which is lightweight but less effective for complex professional tasks.
In GSI and infrastructure-related domains, where the knowledge is detailed and highly structured, combining SFT, retrieval grounding, and agent orchestration is essential for achieving both accuracy and adaptability.

2.2 Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) has been proposed to address the limitations of fixed-parameter models in knowledge-intensive tasks by combining LLMs with external information retrieval [2]. In this framework, relevant documents or passages are retrieved from a structured corpus based on input queries, and the retrieved content is incorporated into the model context to improve factual correctness and reduce hallucination. Subsequent research has refined retrieval strategies using dense vector representations, contrastive learning, and multi-hop reasoning, significantly improving performance in open-domain and domain-specific applications. RAG is particularly valuable in professional domains where regulations, manuals, and standards are lengthy and constantly updated, as it allows the model to access authoritative knowledge without full retraining. In infrastructure management, including GSI inspection and maintenance, retrieval grounding ensures that model outputs reflect current municipal guidelines, safety protocols, and operational procedures, which are difficult to embed fully in model parameters.

2.3 Domain-Specific LLM Agents

Large language model agents extend the reasoning capabilities of conventional LLMs by integrating planning, tool invocation, and structured task execution. Early studies, such as chain-of-thought prompting [6], demonstrated that decomposing reasoning into explicit intermediate steps improves problem-solving quality. Building on this idea, the ReAct framework [8] combines reasoning and acting loops, allowing models to plan, query external tools, and revise actions iteratively.
Recent surveys [3] highlight that domain-specific agents often incorporate memory, workflow orchestration, and modular tool interactions, and they have been applied in scientific discovery [4] and biomedical assistance [7] to manage complex, multi-step tasks. Evaluation of agent performance emphasizes planning quality, tool-use accuracy, and long-horizon reasoning reliability [9]. Unlike general-purpose agent studies, our approach focuses on infrastructure tasks, designing a GSI-specific LLM agent that integrates fine-tuned reasoning and retrieval-grounded knowledge into a structured generation pipeline suitable for inspection, maintenance, and compliance workflows.

3 Methodology

In this section, we present our approach to enhancing large language models (LLMs) for the Green Stormwater Infrastructure (GSI) domain. Our method integrates three complementary strategies: (i) domain-specific fine-tuning, (ii) retrieval-augmented generation (RAG), and (iii) domain-specific LLM agents. Figure 1 illustrates the overall system architecture. The system is designed to provide professional, reliable, and context-aware responses to GSI-related queries by grounding outputs in verified domain knowledge. While the model can optionally process field images to support descriptive tasks, the core focus remains on textual reasoning and domain expertise.

Figure 1: Architecture of the proposed GSI knowledge-enhanced LLM system integrating domain-specific fine-tuning, retrieval-augmented generation, and agent-based reasoning.

3.1 Domain-Specific Fine-Tuning

To achieve domain expertise, we apply domain-specific fine-tuning on a general LLM using our curated GSI dataset (GSI Dataset), following the knowledge-injection methods summarized in the survey of Song et al. [5]. We adopt parameter-efficient fine-tuning techniques (e.g., LoRA) to reduce computational cost while enabling the model to learn GSI-specific terminology, reasoning patterns, and regulatory context.
This fine-tuning allows the LLM to interpret user queries accurately and generate responses aligned with engineering and planning practices. Optionally, the model can summarize field images to support descriptive assessments, such as identifying GSI types or visible maintenance issues, but this capability is secondary.

3.2 Retrieval-Augmented Generation

To improve factual accuracy and compliance with official guidelines, we incorporate a retrieval-augmented generation (RAG) pipeline. All GSI-related documents, including municipal manuals, inspection forms, and planning documents, are segmented into passages and embedded into a vector index. For each user query, the system retrieves the top-k relevant passages and provides them as context to the fine-tuned LLM. This reduces hallucination, reinforces domain knowledge, and supports dynamic updates of the knowledge base without retraining the model. When field images are available, retrieval queries can optionally combine textual input and image summaries to enhance contextual relevance.

3.3 Domain-Specific LLM Agents

Finally, we implement domain-specific LLM agents to enable flexible, task-oriented reasoning. The agent combines the fine-tuned LLM with the RAG module and lightweight prompt control to support diverse GSI tasks, such as planning, inspection, and maintenance guidance. Rather than enforcing a rigid output format, the agent applies soft constraints: (i) utilize retrieved passages when relevant, (ii) avoid inventing regulations or technical standards, and (iii) ask concise clarification questions if critical information is missing. This design allows the system to adapt its responses to different user types, including engineers, planners, maintenance staff, and residents, while maintaining professional domain correctness.
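A minimal sketch of the retrieval and soft-constraint steps described above is given below. All names and passages are hypothetical, and the bag-of-words vectors stand in for the dense embeddings a real system would use; the point is only the retrieve-then-constrain structure:

```python
import math
import re
from collections import Counter


def bow_vector(text):
    """Toy bag-of-words 'embedding'; a real pipeline would use a dense encoder."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_top_k(query, passages, k=2):
    """Rank knowledge-base passages by similarity to the query and keep the top k."""
    q = bow_vector(query)
    ranked = sorted(passages, key=lambda p: cosine(q, bow_vector(p)), reverse=True)
    return ranked[:k]


def build_agent_prompt(query, passages):
    """Assemble a context-grounded prompt that encodes the agent's soft constraints."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Use the retrieved passages when relevant; do not invent regulations; "
        "ask a clarification question if critical information is missing.\n"
        f"Context:\n{context}\nQuestion: {query}"
    )


# Illustrative mini knowledge base (invented passages, not from the real corpus).
kb = [
    "Permeable pavement requires vacuum sweeping to prevent surface clogging.",
    "Rain gardens should be inspected for sediment accumulation after storms.",
    "Bioretention basins use engineered soil media to filter runoff.",
]
hits = retrieve_top_k("How do I prevent clogging of permeable pavement?", kb, k=1)
prompt = build_agent_prompt("How do I prevent clogging of permeable pavement?", hits)
```

Because the soft constraints are carried in the prompt rather than hard-coded in the decoder, the same pipeline adapts to different user types without a rigid output schema.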
4 Experimental Setup

In this section, we describe the datasets, evaluation metrics, baselines, and implementation details used to assess the effectiveness of our knowledge-enhanced LLM for GSI tasks. We aim to provide a comprehensive view of both domain-specific and general capabilities, supported by statistical summaries and visualizations.

4.1 Datasets

We evaluate our approach on two complementary datasets: a domain-specific GSI dataset (GSI Dataset) and a general-purpose benchmark (Common Knowledge Dataset). These datasets allow us to measure improvements in specialized GSI reasoning while monitoring the retention of broad LLM knowledge.

4.1.1 GSI Dataset

GSI Dataset is a curated, instruction-style dataset designed for supervised fine-tuning of GSI reasoning. It contains document-grounded examples covering diverse tasks such as question answering, verification, procedural guidance, and reasoning over regulatory standards. Each record includes explicit context, optional additional input, and a reference output grounded in official documents or field manuals (Table 2).

Field           | Description
id              | Unique identifier (UUIDv4).
source          | Source document (PDF) providing reference information.
source location | Geographical tag (e.g., "Philadelphia, PA") or empty if not location-specific.
task type       | One of nine predefined task families.
deployment type | Intended usage: fine-tuning or rag.
created at      | Timestamp of record creation (UTC, RFC 3339).
instruction     | Self-contained instruction with context for the model.
input           | Optional supplemental context.
output          | Reference answer grounded in official documents.

Table 2: Schema of GSI Dataset for instruction-based fine-tuning.

The dataset contains 10,955 examples, with 54.2% having a specific source location (e.g., Philadelphia) and 45.8% location-agnostic. Deployment types are distributed as 73.3% fine-tuning and 26.7% retrieval-augmented generation (RAG) samples (Tables 3 and 4).
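As an illustration of the Table 2 schema, a single record might look like the following. All field values are invented for illustration, and rendering the field names with underscores is an assumption about the serialized form:

```python
import json
import uuid

# Hypothetical record following the Table 2 schema; every value is illustrative only.
record = {
    "id": str(uuid.uuid4()),
    "source": "example-gsi-manual.pdf",
    "source_location": "Philadelphia, PA",
    "task_type": "question_answering",
    "deployment_type": "fine-tuning",
    "created_at": "2025-01-01T00:00:00Z",
    "instruction": (
        "Describe how often permeable pavement should be vacuum swept under "
        "municipal green stormwater infrastructure maintenance guidance."
    ),
    "input": "",
    "output": "A document-grounded maintenance answer would appear here.",
}
line = json.dumps(record)  # one JSON object per record, e.g., for a JSONL file
```

Storing one object per line keeps the corpus streamable during both SFT data loading and RAG index construction.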
The task family distribution reflects diverse reasoning capabilities, with the top three categories, question answering (31.2%), verification/judgment (30.9%), and generation/composition (15.2%), covering 77.3% of the dataset (Table 5). These distributions indicate a balanced mix of knowledge-intensive and procedural reasoning tasks.

Table 3: Source location distribution in GSI Dataset.

Location         | Count | Percentage
Philadelphia, PA | 5219  | 54.2%
None             | 4791  | 45.8%

Table 4: Deployment type distribution in GSI Dataset.

Type        | Count | Percentage
Fine-tuning | 7460  | 73.3%
RAG         | 3495  | 26.7%

Table 5: Task type distribution in GSI Dataset.

Task Type                  | Count | Percentage
Question Answering         | 5300  | 31.2%
Verification / Judgment    | 5241  | 30.9%
Generation / Composition   | 2573  | 15.2%
Information Extraction     | 1724  | 10.1%
Classification             | 1225  | 7.2%
Reasoning / Math / Logic   | 758   | 4.4%
Dialogue Interaction       | 100   | 0.6%
Rewriting / Transformation | 41    | 0.2%
Code / Program Execution   | 0     | 0%

Figure 2 visualizes the task type distribution, highlighting the dominance of question-answering and verification tasks while maintaining coverage of procedural and reasoning-intensive examples.

Figure 2: Task type distribution in GSI Dataset.

4.1.2 Common Knowledge Dataset

Common Knowledge Dataset is a general benchmark used to measure knowledge retention outside the GSI domain. For our experiments, we sample 5,000 examples from publicly available LLM evaluation datasets such as MMMU/MMBench. This dataset includes question answering, classification, and reasoning tasks in diverse domains. Table 6 summarizes its characteristics.
Task-type distributions can be visualized similarly to the GSI Dataset for comparison.

Task Type          | Count | Percentage
Question Answering | 400   | 40%
Classification     | 300   | 30%
Reasoning / Logic  | 300   | 30%

Table 6: Summary of Common Knowledge Dataset used to evaluate general knowledge retention.

By analyzing both datasets, we ensure that our LLM enhancements improve domain-specific reasoning without compromising general-purpose capabilities. Visual summaries and task statistics provide transparency and facilitate reproducibility.

4.2 Metrics

We evaluate outputs at multiple levels: lexical overlap, semantic similarity, judge-based quality, and human expert review. Table 7 summarizes all metrics, and we give formal definitions below.

Metric             | Level          | What it measures
BLEU-4             | Lexical        | N-gram precision with brevity penalty; good for short factual text.
ROUGE-1/2/L        | Lexical        | Recall-style overlap; captures coverage for summary-like answers.
Micro-F1           | Label          | Aggregated F1 for classification-style tasks (e.g., issue type).
Sentence-BERT      | Semantic       | Embedding cosine similarity; complements lexical overlap.
G-Eval (LLM Judge) | Semantic/Logic | LLM-based scoring for correctness and coherence.
Human Expert       | Real           | Expert rating on usefulness and correctness (small sample).

Table 7: Evaluation metrics and their roles.

BLEU-4. BLEU-4 measures 1- to 4-gram precision with a brevity penalty to avoid overly short outputs:

\[ \text{BLEU-4} = \mathrm{BP} \cdot \exp\!\left( \frac{1}{4} \sum_{n=1}^{4} \log p_n \right), \tag{1} \]

where \( p_n \) is the modified n-gram precision and \( \mathrm{BP} \) penalizes too-short candidates.

ROUGE-1/2/L. ROUGE measures how much reference content is covered by the candidate (we report ROUGE-1, ROUGE-2, and ROUGE-L):

\[ \text{ROUGE-}n = \frac{\sum_{g \in \mathcal{G}_n(r)} \min\!\big(\mathrm{count}_c(g), \mathrm{count}_r(g)\big)}{\sum_{g \in \mathcal{G}_n(r)} \mathrm{count}_r(g)}, \tag{2} \]

where \( \mathcal{G}_n(\cdot) \) is the multiset of n-grams of the reference \( r \) (ROUGE-L is an LCS-based variant).

Micro-F1.
For classification-style evaluation, we compute Micro-F1 by aggregating errors across all classes:

\[ \text{Micro-F1} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \tag{3} \]

where TP, FP, and FN are total true positives, false positives, and false negatives.

Sentence-BERT similarity. We compute semantic similarity using cosine similarity between sentence embeddings:

\[ \mathrm{SBERT}(c, r) = \frac{e_c^\top e_r}{\lVert e_c \rVert_2 \, \lVert e_r \rVert_2}, \tag{4} \]

where \( e_c \) and \( e_r \) are embeddings of the candidate and reference answers.

G-Eval (LLM as a Judge). We ask an LLM judge to score each answer with a rubric-based score \( s_i \) (e.g., 1-5) and report the mean:

\[ \text{G-Eval} = \frac{1}{N} \sum_{i=1}^{N} s_i. \tag{5} \]

This metric helps evaluate correctness and coherence beyond surface similarity.

Human Expert. Similarly, human experts rate each answer (usefulness/correctness) and we report the average score:

\[ \text{HumanScore} = \frac{1}{N} \sum_{i=1}^{N} h_i, \tag{6} \]

where \( h_i \) is the expert score for sample \( i \).

4.3 Baselines

In this section, we compare our proposed system against three baselines to quantify the effect of knowledge injection:

Baseline             | RAG | SFT | Agent | Notes
Base LLM             | ×   | ×   | ×     | Direct prompting on base model.
Base LLM + RAG       | ✓   | ×   | ×     | Retrieval improves factuality; no parameter updates.
Fine-tuned LLM + RAG | ✓   | ✓   | ✓     | Full system with LoRA-SFT, RAG, and agent reasoning.

Table 8: Baselines used in evaluation.

We use Qwen3-VL-2B-Instruct as our primary base LLM and consider other open-source models (Qwen3-VL, InternVL, MiniCPM-V, Phi-3.5-Vision) mainly for feasibility comparison.

4.4 Implementation Details

In this section, we provide technical details of model fine-tuning. We adopt LoRA for parameter-efficient SFT on the Qwen3-VL instruction-tuned model. Table 9 summarizes the configuration.

Setting         | Value
finetuning type | LoRA
bf16            | true
template        | qwen3 vl
lora alpha      | 16
lora dropout    | 0
lora rank       | 8
lora target     | all

Table 9: LoRA fine-tuning configuration.
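Before turning to results, the Micro-F1 and cosine-similarity metrics from Section 4.2 reduce to a few lines of code. The sketch below uses toy counts and two-dimensional vectors rather than real model outputs:

```python
import math


def micro_f1(tp, fp, fn):
    """Micro-averaged F1 from pooled true positives, false positives,
    and false negatives across all classes (Eq. 3)."""
    return 2 * tp / (2 * tp + fp + fn)


def cosine_similarity(e_c, e_r):
    """Cosine similarity between candidate and reference embeddings (Eq. 4)."""
    dot = sum(c * r for c, r in zip(e_c, e_r))
    norm_c = math.sqrt(sum(c * c for c in e_c))
    norm_r = math.sqrt(sum(r * r for r in e_r))
    return dot / (norm_c * norm_r)


# Toy numbers: 8 pooled true positives, 2 false positives, 2 false negatives.
f1 = micro_f1(tp=8, fp=2, fn=2)                      # 16 / 20 = 0.8
sim = cosine_similarity([1.0, 0.0], [1.0, 0.0])      # identical directions -> 1.0
```

In practice the embeddings come from a sentence encoder rather than hand-written vectors, but the arithmetic is identical.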
5 Experimental Results

In this section, we report the main results and ablation studies, analyzing both general and domain-specific performance.

5.1 Main Results

Table 10 presents evaluation results on Common Knowledge Dataset and GSI Dataset. We observe that our knowledge-enhanced LLM maintains general knowledge performance while achieving substantial improvement on GSI-specific tasks. BLEU-4 increases from 0.090 to 0.307 on GSI Dataset, indicating strong domain adaptation. Sentence-BERT and G-Eval scores also show improved semantic correctness and reasoning quality.

Metric        | Common Knowledge: Base LLM | Common Knowledge: GSI LLM | GSI: Base LLM | GSI: GSI LLM
BLEU-4        | 0.304 | 0.305 | 0.090 | 0.307
ROUGE-1       | 0.352 | 0.351 | 0.157 | 0.204
ROUGE-2       | 0.146 | 0.146 | 0.032 | 0.111
ROUGE-L       | 0.223 | 0.223 | 0.071 | 0.153
Sentence-BERT | 0.861 | 0.869 | 0.544 | 0.742
G-Eval        | 0.82  | 0.84  | 0.57  | 0.79

Table 10: Main results on general and domain datasets.

5.2 Ablation Study

To understand the contribution of each knowledge-injection strategy, we conduct an ablation study using G-Eval as the primary metric. We compare three settings: (i) LLM + RAG, (ii) LLM + Fine-tuning, and (iii) LLM + RAG + Fine-tuning. Table 11 shows that the hybrid approach achieves the best balance between factual grounding and procedural reasoning, confirming that GSI tasks benefit from combining dynamic retrieval and learned domain capabilities.

Method                  | G-Eval Score
LLM + RAG               | 0.51
LLM + Fine-tuning       | 0.63
LLM + RAG + Fine-tuning | 0.72

Table 11: Ablation study of knowledge-injection strategies.

6 Conclusion

We present GSI Agent, a knowledge-enhanced LLM system for GSI tasks, which maintains general knowledge performance while substantially improving domain-specific reasoning. Our experiments demonstrate that combining fine-tuning, retrieval-augmented generation, and agent-based reasoning yields the best overall performance.
Future work includes scaling human expert evaluation, refining retrieval strategies, and performing finer-grained error analysis for real-world GSI maintenance applications.

A Data Sources for RAG Corpus

We list the documents used to build the retrieval corpus. This table is designed to be extended; a "Used?" column can be added if some links are excluded.

#  | Title                                                                          | Year | Link / Notes
1  | Stormwater Management Guidance Manual                                          | 2023 | https://water.phila.gov/wp-content/uploads/files/stormwater-management-guidance-manual.pdf
2  | Pennsylvania Stormwater BMPs Manual                                            | –    | https://greenport.pa.gov/elibrary/GetFolder?FolderID=1368916
3  | City of Philadelphia Green Streets Design Manual                               | 2014 | https://www.phila.gov/media/20160504172218/Green-Streets-Design-Manual-2014.pdf
4  | Green Stormwater Infrastructure Maintenance Manual                             | 2016 | https://water.phila.gov/wp-content/uploads/GSI-Maintenance-Manual_v2_2016.pdf
5  | Green City, Clean Waters Plan                                                  | 2009 | https://www.phila.gov/media/20160421133948/green-city-clean-waters.pdf
6  | Green City, Clean Waters Partnership Agreement                                 | 2012 | https://water.phila.gov/wp-content/uploads/files/EPA_Partnership_Agreement.pdf
7  | Green City Clean Waters – Long-Term Control Plan                               | –    | https://water.phila.gov/wp-content/uploads/files/LTCPU_Complete.pdf
8  | GSI Planning & Design Manual                                                   | –    | https://water.phila.gov/wp-content/uploads/files/gsi-planning-and-design-manual.pdf
9  | GSI As-built Survey and Drafting Manual                                        | –    | https://water.phila.gov/wp-content/uploads/files/gsi-as-built-survey-and-drafting-manual.pdf
10 | SMP Inspection Forms (Cisterns, Roofs, Ponds, Porous Surface, Basins, Filters) | –    | https://water.phila.gov/wp-content/uploads/files/smp-porous-surface-inspection-form.pdf
11 | Philadelphia Water Department Regulations                                      | 2024 | https://water.phila.gov/wp-content/uploads/files/pwd-regulations-2024-04-29.pdf
12 | Stormwater Grant Resources (portal)                                            | –    | https://water.phila.gov/stormwater/incentives/grants/
13 | Plan and Report Checklists                                                     | –    | https://water.phila.gov/wp-content/uploads/files/smgm-e-plan-and-report-checklists.pdf
14 | Reported Flood Damages in Philadelphia (map)                                   | 2024 | https://www.phila.gov/media/20241204111812/Reported-Flood-Damages-Map-v4.2-2024.pdf
15 | Sustainable Funding for Green City, Clean Waters                               | 2022 | https://williampennfoundation.org/sites/default/files/2024-05/PHL-GreenCityCleanWaters-Sustain_2022_FINAL.pdf
16 | GCCW Comprehensive Monitoring Plan                                             | –    | https://archive.phillywatersheds.org/ltcpu/GCCW%20Comprehensive%20Monitoring%20Plan%20Sections%201-10.pdf
17 | PWD Regulations Chapter 6 – Stormwater                                         | –    | https://water.phila.gov/wp-content/uploads/files/pwd-regulations-chapter-6.pdf
18 | SMP Maintenance Guidance                                                       | –    | https://water.phila.gov/wp-content/uploads/files/smp-maintenance-guidance.pdf
19 | SMP Maintenance Guide (portal)                                                 | –    | https://water.phila.gov/development/stormwater-plan-review/maintenance
20 | Rain Check Contractor Documents                                                | –    | https://www.pwdraincheck.org/en/contractor-documents
21 | GSI Landscape Design Guidebook                                                 | 2014 | https://www.pwdraincheck.org/images/documents/Landscape_Manual_2014.pdf
22 | Planning & Design Resource Directory (portal)                                  | –    | https://water.phila.gov/gsi/planning-design/resources/

B SFT Prompt Template

SYSTEM:
You are an expert technical writer and dataset engineer for domain-specific LLM fine-tuning. Your task is to read the provided PDF document and extract knowledge that is NOT generic LLM common knowledge, but instead is specific to this document, its policies, rules, responsibilities, procedures, or technical constraints. You must generate a Supervised Fine-Tuning (SFT) dataset in JSON format.
IMPORTANT RULES (MUST FOLLOW):

1. Every QA pair MUST be fully self-contained and understandable WITHOUT access to the PDF.
   - Do NOT reference chapters, sections, figures, tables, or page numbers.
   - Do NOT use phrases like "Chapter 1", "this section", "as described above", or "the following".
   - Do NOT rely on document structure for meaning.
2. Avoid ambiguous pronouns and references.
   - DO NOT use: it, this, that, they, the City, the program, the agreement.
   - INSTEAD, always explicitly name the entity: e.g., "the Philadelphia Water Department", "the Green City, Clean Waters program", "the municipal green stormwater infrastructure guidance".
3. Each instruction must clearly state the domain and context.
   - A reader with no prior exposure to the PDF should still understand the question.
   - Example:
     Wrong: "Describe how pre-development land cover must be represented"
     Right: "Describe how pre-development land cover must be represented in stormwater modeling for municipal green stormwater infrastructure projects"
4. Each output must:
   - Be grounded ONLY in the PDF content
   - Restate key entities and constraints explicitly
   - Provide a clear, concrete, and technically meaningful answer
   - Be as long or as short as needed to fully capture the knowledge (no artificial length limits)
5. Each QA pair should represent ONE independent, reusable knowledge unit suitable for LLM fine-tuning.
6. If the document defines:
   - responsibilities → generate responsibility-focused QA
   - procedures → generate process-focused QA
   - design rules → generate constraint-focused QA
   - evaluation or monitoring → generate metric- or workflow-focused QA

OUTPUT FORMAT (STRICT):

[
  {
    "instruction": "A fully self-contained task or question",
    "input": "",
    "output": "A complete, document-grounded answer with explicit entities and no ambiguous references"
  }
]

Do NOT add explanations, commentary, or markdown. Only output valid JSON.
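Because the system prompt demands a strict JSON array, malformed generations can be filtered before entering the dataset. The check below is a sketch of such a filter, not part of the prompt template itself; the field names follow the OUTPUT FORMAT above, and the sample response is invented:

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}


def validate_sft_batch(raw):
    """Parse a model response and keep only well-formed, non-empty QA records."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    return [
        r for r in data
        if isinstance(r, dict)
        and set(r) == REQUIRED_KEYS
        and r["instruction"]
        and r["output"]
    ]


# Illustrative well-formed response (content is a placeholder, not real guidance).
raw = json.dumps([{
    "instruction": "Explain which agency maintains public rain gardens in Philadelphia.",
    "input": "",
    "output": "An illustrative document-grounded answer would go here.",
}])
records = validate_sft_batch(raw)
```

Rejecting whole batches on any parse failure is deliberate: a partially valid response usually signals the model ignored the format instructions, so regenerating is safer than salvaging.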
USER PROMPT:
Below is the extracted text content from a Green Stormwater Infrastructure (GSI) PDF document.

Your task:
1. Identify document-specific, non-obvious, and operationally relevant knowledge.
2. Convert that knowledge into self-contained instruction-style QA samples suitable for supervised fine-tuning (SFT).
3. Ensure that each question and answer can be fully understood without access to the original document.
4. Avoid vague entity references or pronouns unless the entity is explicitly defined in the instruction.

Generate as many high-quality samples as the content supports. Fewer high-quality samples are preferred over many weak ones.

Return ONLY a valid JSON array. Do not include explanations or markdown.

References

[1] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[2] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In NeurIPS, 2020.
[3] Junyu Luo, Weizhi Zhang, Ye Yuan, et al. Large language model agent: A survey on methodology, applications and challenges. arXiv preprint arXiv:2503.21460, 2025.
[4] Shuo Ren, Pu Jian, et al. Towards scientific intelligence: A survey of LLM-based scientific agents. arXiv preprint arXiv:2503.??? (preprint), 2025.
[5] Zirui Song, Bin Yan, Yuhan Liu, Miao Fang, Mingzhe Li, Rui Yan, and Xiuying Chen. Injecting domain-specific knowledge into large language models: A comprehensive survey. In Findings of the Association for Computational Linguistics: EMNLP 2025, 2025.
[6] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, et al. Chain-of-thought prompting elicits reasoning in large language models. 2022.
[7] Xiaoran Xu and Ravi Sankar. Large language model agents for biomedicine: A comprehensive review. Information, 2025.
[8] Shunyu Yao, Jeffrey Zhao, Dian Yu, et al. ReAct: Synergizing reasoning and acting in language models. In ICLR, 2023.
[9] Asaf Yehudai, Lilach Eden, et al. Survey on evaluation of LLM-based agents. arXiv preprint, 2025.
