GSI Agent: Domain Knowledge Enhancement for Large Language Models in Green Stormwater Infrastructure


Authors: Shaohuang Wang

Abstract

Green Stormwater Infrastructure (GSI) systems, such as permeable pavement, rain gardens, and bioretention facilities, require continuous inspection and maintenance to ensure long-term performance. However, domain knowledge about GSI is often scattered across municipal manuals, regulatory documents, and inspection forms. As a result, non-expert users and maintenance staff may struggle to obtain reliable and actionable guidance from field observations. Although Large Language Models (LLMs) have demonstrated strong general reasoning and language generation capabilities, they often lack domain-specific knowledge and may produce inaccurate or hallucinated answers in engineering scenarios. This limitation restricts their direct application to professional infrastructure tasks. In this paper, we propose GSI Agent, a domain-enhanced LLM framework designed to improve performance in GSI-related tasks. Our approach integrates three complementary strategies: (1) supervised fine-tuning (SFT) on a curated GSI instruction dataset, (2) retrieval-augmented generation (RAG) over an internal GSI knowledge base constructed from municipal documents, and (3) an agent-based reasoning pipeline that coordinates retrieval, context integration, and structured response generation. We also construct a new GSI Dataset aligned with real-world GSI inspection and maintenance scenarios. Experimental results show that our framework significantly improves domain-specific performance while maintaining general knowledge capability. On the GSI dataset, BLEU-4 improves from 0.090 to 0.307, while performance on the common knowledge dataset remains stable (0.304 vs. 0.305). These results demonstrate that systematic domain knowledge enhancement can effectively adapt general-purpose LLMs to professional infrastructure applications.
1 Introduction

Green Stormwater Infrastructure (GSI) is widely adopted to reduce urban flooding, improve water quality, and support sustainable city development. Typical GSI facilities include permeable pavement, rain gardens, bioretention basins, and infiltration systems. However, the performance of these systems depends heavily on regular inspection and proper maintenance. Field staff, engineers, and community members often need clear and consistent guidance when identifying issues such as clogging, sediment accumulation, vegetation overgrowth, or structural damage.

In practice, GSI knowledge is distributed across multiple sources, including municipal design manuals, maintenance guidelines, regulatory documents, and inspection forms. These documents contain detailed procedural rules, technical constraints, and compliance requirements. However, the information is not centralized, and it can be difficult for non-expert users to quickly retrieve and interpret relevant guidance for specific field scenarios.

Large Language Models (LLMs) have shown strong performance in general question answering and reasoning tasks. However, when applied to specialized engineering domains, they often suffer from two major limitations. First, they lack up-to-date and document-grounded domain knowledge. Second, they may generate plausible but incorrect responses, especially when specific regulations or procedures are required. Therefore, directly applying a general-purpose LLM to GSI tasks may lead to unreliable outputs. To address this gap, we focus on improving the domain capability of LLMs for GSI applications through a systematic knowledge enhancement framework.
Instead of designing a new model architecture, we enhance a general LLM using three complementary components:

- Supervised Fine-Tuning (SFT): We construct a domain-specific instruction dataset and fine-tune the base LLM to learn GSI terminology, reasoning patterns, and structured response formats.
- Retrieval-Augmented Generation (RAG): We build an internal GSI knowledge base from municipal manuals and regulatory documents. During inference, relevant passages are retrieved and provided as additional context to improve factual grounding.
- Agent-Based Coordination: We design an agent workflow that organizes retrieval, context integration, and response generation into a structured reasoning pipeline, improving consistency and reliability.

To support supervised training and evaluation, we construct the GSI Dataset, which contains instruction-style samples covering question answering, verification, information extraction, and procedural generation tasks in GSI scenarios. We evaluate our framework on both the domain dataset and a general common knowledge dataset to examine two key aspects: (1) domain performance improvement and (2) general knowledge retention. Experimental results show that our approach substantially improves performance on GSI tasks without degrading general capability. In particular, BLEU-4 on the GSI dataset increases from 0.090 to 0.307 after applying our domain enhancement framework, while performance on the common knowledge dataset remains stable.

In summary, this work demonstrates that combining supervised fine-tuning, retrieval augmentation, and agent-based coordination provides an effective and practical solution for adapting general-purpose LLMs to professional infrastructure domains. Our framework offers a systematic approach to domain knowledge enhancement that can be extended to other engineering and regulatory applications.
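The way the three components compose at inference time can be sketched abstractly. The function and variable names below are placeholders for illustration, not our actual implementation; a real deployment would plug in the RAG index and the LoRA-adapted model:

```python
def answer_gsi_query(query, knowledge_base, fine_tuned_llm, retriever, k=5):
    """Conceptual inference flow: retrieve -> integrate context -> generate.

    `retriever` and `fine_tuned_llm` are stand-ins for the vector index
    and the fine-tuned model; only the control flow is real here.
    """
    passages = retriever(query, knowledge_base, k)           # RAG step
    context = "\n".join(passages)                            # context integration
    prompt = f"Context:\n{context}\n\nQuestion: {query}"     # agent-assembled prompt
    return fine_tuned_llm(prompt)                            # SFT-adapted generation


# Toy stand-ins to demonstrate the control flow only.
toy_retriever = lambda q, kb, k: kb[:k]
toy_llm = lambda prompt: "grounded answer"
result = answer_gsi_query(
    "When should rain gardens be inspected?",
    ["Inspect rain gardens after major storm events."],
    toy_llm, toy_retriever, k=1,
)
```

The key design point is that generation always happens after retrieval and context integration, so factual grounding precedes the model's free-form output.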
Category            | Typical method | Key idea and trade-off
Dynamic injection   | RAG            | Retrieve external documents at inference time; easy to update, but depends on retrieval quality.
Static injection    | SFT            | Put domain patterns into model parameters; strong for style/tasks, but harder to update.
Model adapter       | LoRA           | Train small rank adapters; efficient and low-cost, but capacity is limited.
Prompt optimization | Prompt         | Control behavior without changing weights; fast but may be brittle.

Table 1: Knowledge injection taxonomy used in this paper.

2 Related Work

2.1 Domain-Specific Knowledge Injection

Injecting domain-specific knowledge into large language models has become a critical research area for improving task-specific performance. Song et al. [5] provide a comprehensive survey categorizing knowledge injection methods into dynamic integration, static fine-tuning, parameter-efficient adaptation, and prompt-based guidance. Dynamic integration connects the model to external knowledge sources during inference, enabling access to up-to-date documents without retraining, which has been effective in domains such as legal and biomedical text. Static fine-tuning, or supervised fine-tuning (SFT), incorporates domain knowledge directly into model parameters by training on task-specific corpora; prior studies have shown that SFT significantly improves reasoning and accuracy in specialized domains, but updating the knowledge requires retraining. Parameter-efficient methods, such as LoRA [1], reduce computational costs by training only additional low-rank matrices while freezing the original weights, achieving adaptation with minimal resources. Prompt-based guidance adjusts model behavior through structured instructions or prompts, which is lightweight but less effective for complex professional tasks.
In GSI and infrastructure-related domains, where the knowledge is detailed and highly structured, combining SFT, retrieval grounding, and agent orchestration is essential for achieving both accuracy and adaptability.

2.2 Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) has been proposed to address the limitations of fixed-parameter models in knowledge-intensive tasks by combining LLMs with external information retrieval [2]. In this framework, relevant documents or passages are retrieved from a structured corpus based on input queries, and the retrieved content is incorporated into the model context to improve factual correctness and reduce hallucination. Subsequent research has refined retrieval strategies using dense vector representations, contrastive learning, and multi-hop reasoning, significantly improving performance in open-domain and domain-specific applications. RAG is particularly valuable in professional domains where regulations, manuals, and standards are lengthy and constantly updated, as it allows the model to access authoritative knowledge without full retraining. In infrastructure management, including GSI inspection and maintenance, retrieval grounding ensures that model outputs reflect current municipal guidelines, safety protocols, and operational procedures, which are difficult to embed fully in model parameters.

2.3 Domain-Specific LLM Agents

Large language model agents extend the reasoning capabilities of conventional LLMs by integrating planning, tool invocation, and structured task execution. Early studies, such as chain-of-thought prompting [6], demonstrated that decomposing reasoning into explicit intermediate steps improves problem-solving quality. Building on this idea, the ReAct framework [8] combines reasoning and acting loops, allowing models to plan, query external tools, and revise actions iteratively.
Recent surveys [3] highlight that domain-specific agents often incorporate memory, workflow orchestration, and modular tool interactions, and they have been applied in scientific discovery [4] and biomedical assistance [7] to manage complex, multi-step tasks. Evaluation of agent performance emphasizes planning quality, tool-use accuracy, and long-horizon reasoning reliability [9]. Unlike general-purpose agent studies, our approach focuses on infrastructure tasks, designing a GSI-specific LLM agent that integrates fine-tuned reasoning and retrieval-grounded knowledge into a structured generation pipeline suitable for inspection, maintenance, and compliance workflows.

3 Methodology

In this section, we present our approach to enhancing large language models (LLMs) for the Green Stormwater Infrastructure (GSI) domain. Our method integrates three complementary strategies: (i) domain-specific fine-tuning, (ii) retrieval-augmented generation (RAG), and (iii) domain-specific LLM agents. Figure 1 illustrates the overall system architecture. The system is designed to provide professional, reliable, and context-aware responses to GSI-related queries by grounding outputs in verified domain knowledge. While the model can optionally process field images to support descriptive tasks, the core focus remains on textual reasoning and domain expertise.

Figure 1: Architecture of the proposed GSI knowledge-enhanced LLM system integrating domain-specific fine-tuning, retrieval-augmented generation, and agent-based reasoning.

3.1 Domain-Specific Fine-Tuning

To achieve domain expertise, we apply domain-specific fine-tuning on a general LLM using our curated GSI dataset (GSI Dataset), following the knowledge-injection methods summarized in the survey of Song et al. [5]. We adopt parameter-efficient fine-tuning techniques (e.g., LoRA) to reduce computational cost while enabling the model to learn GSI-specific terminology, reasoning patterns, and regulatory context.
This fine-tuning allows the LLM to interpret user queries accurately and generate responses aligned with engineering and planning practices. Optionally, the model can summarize field images to support descriptive assessments, such as identifying GSI types or visible maintenance issues, but this capability is secondary.

3.2 Retrieval-Augmented Generation

To improve factual accuracy and compliance with official guidelines, we incorporate a retrieval-augmented generation (RAG) pipeline. All GSI-related documents, including municipal manuals, inspection forms, and planning documents, are segmented into passages and embedded into a vector index. For each user query, the system retrieves the top-k relevant passages and provides them as context to the fine-tuned LLM. This reduces hallucination, reinforces domain knowledge, and supports dynamic updates of the knowledge base without retraining the model. When field images are available, retrieval queries can optionally combine textual input and image summaries to enhance contextual relevance.

3.3 Domain-Specific LLM Agents

Finally, we implement domain-specific LLM agents to enable flexible, task-oriented reasoning. The agent combines the fine-tuned LLM with the RAG module and lightweight prompt control to support diverse GSI tasks, such as planning, inspection, and maintenance guidance. Rather than enforcing a rigid output format, the agent applies soft constraints: (i) utilize retrieved passages when relevant, (ii) avoid inventing regulations or technical standards, and (iii) ask concise clarification questions if critical information is missing. This design allows the system to adapt its responses to different user types, including engineers, planners, maintenance staff, and residents, while maintaining professional domain correctness.
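A minimal sketch of the retrieval and soft-constraint steps described above is given below. All names and passages are hypothetical, and the bag-of-words vectors stand in for the dense embeddings a real system would use; the point is only the retrieve-then-constrain structure:

```python
import math
import re
from collections import Counter


def bow_vector(text):
    """Toy bag-of-words 'embedding'; a real pipeline would use a dense encoder."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_top_k(query, passages, k=2):
    """Rank knowledge-base passages by similarity to the query and keep the top k."""
    q = bow_vector(query)
    ranked = sorted(passages, key=lambda p: cosine(q, bow_vector(p)), reverse=True)
    return ranked[:k]


def build_agent_prompt(query, passages):
    """Assemble a context-grounded prompt that encodes the agent's soft constraints."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Use the retrieved passages when relevant; do not invent regulations; "
        "ask a clarification question if critical information is missing.\n"
        f"Context:\n{context}\nQuestion: {query}"
    )


# Illustrative mini knowledge base (invented passages, not from the real corpus).
kb = [
    "Permeable pavement requires vacuum sweeping to prevent surface clogging.",
    "Rain gardens should be inspected for sediment accumulation after storms.",
    "Bioretention basins use engineered soil media to filter runoff.",
]
hits = retrieve_top_k("How do I prevent clogging of permeable pavement?", kb, k=1)
prompt = build_agent_prompt("How do I prevent clogging of permeable pavement?", hits)
```

Because the soft constraints are carried in the prompt rather than hard-coded in the decoder, the same pipeline adapts to different user types without a rigid output schema.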
4 Experimental Setup

In this section, we describe the datasets, evaluation metrics, baselines, and implementation details used to assess the effectiveness of our knowledge-enhanced LLM for GSI tasks. We aim to provide a comprehensive view of both domain-specific and general capabilities, supported by statistical summaries and visualizations.

4.1 Datasets

We evaluate our approach on two complementary datasets: a domain-specific GSI dataset (GSI Dataset) and a general-purpose benchmark (Common Knowledge Dataset). These datasets allow us to measure improvements in specialized GSI reasoning while monitoring the retention of broad LLM knowledge.

4.1.1 GSI Dataset

GSI Dataset is a curated, instruction-style dataset designed for supervised fine-tuning of GSI reasoning. It contains document-grounded examples covering diverse tasks such as question answering, verification, procedural guidance, and reasoning over regulatory standards. Each record includes explicit context, optional additional input, and a reference output grounded in official documents or field manuals (Table 2).

Field           | Description
id              | Unique identifier (UUIDv4).
source          | Source document (PDF) providing reference information.
source location | Geographical tag (e.g., "Philadelphia, PA") or empty if not location-specific.
task type       | One of nine predefined task families.
deployment type | Intended usage: fine-tuning or rag.
created at      | Timestamp of record creation (UTC, RFC 3339).
instruction     | Self-contained instruction with context for the model.
input           | Optional supplemental context.
output          | Reference answer grounded in official documents.

Table 2: Schema of GSI Dataset for instruction-based fine-tuning.

The dataset contains 10,955 examples, with 54.2% having a specific source location (e.g., Philadelphia) and 45.8% location-agnostic. Deployment types are distributed as 73.3% fine-tuning and 26.7% retrieval-augmented generation (RAG) samples (Tables 3 and 4).
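As an illustration of the Table 2 schema, a single record might look like the following. All field values are invented for illustration, and rendering the field names with underscores is an assumption about the serialized form:

```python
import json
import uuid

# Hypothetical record following the Table 2 schema; every value is illustrative only.
record = {
    "id": str(uuid.uuid4()),
    "source": "example-gsi-manual.pdf",
    "source_location": "Philadelphia, PA",
    "task_type": "question_answering",
    "deployment_type": "fine-tuning",
    "created_at": "2025-01-01T00:00:00Z",
    "instruction": (
        "Describe how often permeable pavement should be vacuum swept under "
        "municipal green stormwater infrastructure maintenance guidance."
    ),
    "input": "",
    "output": "A document-grounded maintenance answer would appear here.",
}
line = json.dumps(record)  # one JSON object per record, e.g., for a JSONL file
```

Storing one object per line keeps the corpus streamable during both SFT data loading and RAG index construction.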
The task family distribution reflects diverse reasoning capabilities, with the top three categories, question answering (31.2%), verification/judgment (30.9%), and generation/composition (15.2%), covering 77.3% of the dataset (Table 5). These distributions indicate a balanced mix of knowledge-intensive and procedural reasoning tasks.

Table 3: Source location distribution in GSI Dataset.

Location         | Count | Percentage
Philadelphia, PA | 5219  | 54.2%
None             | 4791  | 45.8%

Table 4: Deployment type distribution in GSI Dataset.

Type        | Count | Percentage
Fine-tuning | 7460  | 73.3%
RAG         | 3495  | 26.7%

Table 5: Task type distribution in GSI Dataset.

Task Type                  | Count | Percentage
Question Answering         | 5300  | 31.2%
Verification / Judgment    | 5241  | 30.9%
Generation / Composition   | 2573  | 15.2%
Information Extraction     | 1724  | 10.1%
Classification             | 1225  | 7.2%
Reasoning / Math / Logic   | 758   | 4.4%
Dialogue Interaction       | 100   | 0.6%
Rewriting / Transformation | 41    | 0.2%
Code / Program Execution   | 0     | 0%

Figure 2 visualizes the task type distribution, highlighting the dominance of question-answering and verification tasks while maintaining coverage of procedural and reasoning-intensive examples.

Figure 2: Task type distribution in GSI Dataset.

4.1.2 Common Knowledge Dataset

Common Knowledge Dataset is a general benchmark used to measure knowledge retention outside the GSI domain. For our experiments, we sample 5,000 examples from publicly available LLM evaluation datasets such as MMMU/MMBench. This dataset includes question answering, classification, and reasoning tasks in diverse domains. Table 6 summarizes its characteristics.
Task-type distributions can be visualized similarly to the GSI Dataset for comparison.

Task Type          | Count | Percentage
Question Answering | 400   | 40%
Classification     | 300   | 30%
Reasoning / Logic  | 300   | 30%

Table 6: Summary of Common Knowledge Dataset used to evaluate general knowledge retention.

By analyzing both datasets, we ensure that our LLM enhancements improve domain-specific reasoning without compromising general-purpose capabilities. Visual summaries and task statistics provide transparency and facilitate reproducibility.

4.2 Metrics

We evaluate outputs at multiple levels: lexical overlap, semantic similarity, judge-based quality, and human expert review. Table 7 summarizes all metrics, and we give formal definitions below.

Metric             | Level          | What it measures
BLEU-4             | Lexical        | N-gram precision with brevity penalty; good for short factual text.
ROUGE-1/2/L        | Lexical        | Recall-style overlap; captures coverage for summary-like answers.
Micro-F1           | Label          | Aggregated F1 for classification-style tasks (e.g., issue type).
Sentence-BERT      | Semantic       | Embedding cosine similarity; complements lexical overlap.
G-Eval (LLM Judge) | Semantic/Logic | LLM-based scoring for correctness and coherence.
Human Expert       | Real           | Expert rating on usefulness and correctness (small sample).

Table 7: Evaluation metrics and their roles.

BLEU-4. BLEU-4 measures 1- to 4-gram precision with a brevity penalty to avoid overly short outputs:

\[ \text{BLEU-4} = \mathrm{BP} \cdot \exp\!\left( \frac{1}{4} \sum_{n=1}^{4} \log p_n \right), \tag{1} \]

where \( p_n \) is the modified n-gram precision and \( \mathrm{BP} \) penalizes too-short candidates.

ROUGE-1/2/L. ROUGE measures how much reference content is covered by the candidate (we report ROUGE-1, ROUGE-2, and ROUGE-L):

\[ \text{ROUGE-}n = \frac{\sum_{g \in \mathcal{G}_n(r)} \min\!\big(\mathrm{count}_c(g), \mathrm{count}_r(g)\big)}{\sum_{g \in \mathcal{G}_n(r)} \mathrm{count}_r(g)}, \tag{2} \]

where \( \mathcal{G}_n(\cdot) \) is the multiset of n-grams of the reference \( r \) (ROUGE-L is an LCS-based variant).

Micro-F1.
For classification-style evaluation, we compute Micro-F1 by aggregating errors across all classes:

\[ \text{Micro-F1} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \tag{3} \]

where TP, FP, and FN are total true positives, false positives, and false negatives.

Sentence-BERT similarity. We compute semantic similarity using cosine similarity between sentence embeddings:

\[ \mathrm{SBERT}(c, r) = \frac{e_c^\top e_r}{\lVert e_c \rVert_2 \, \lVert e_r \rVert_2}, \tag{4} \]

where \( e_c \) and \( e_r \) are embeddings of the candidate and reference answers.

G-Eval (LLM as a Judge). We ask an LLM judge to score each answer with a rubric-based score \( s_i \) (e.g., 1-5) and report the mean:

\[ \text{G-Eval} = \frac{1}{N} \sum_{i=1}^{N} s_i. \tag{5} \]

This metric helps evaluate correctness and coherence beyond surface similarity.

Human Expert. Similarly, human experts rate each answer (usefulness/correctness) and we report the average score:

\[ \text{HumanScore} = \frac{1}{N} \sum_{i=1}^{N} h_i, \tag{6} \]

where \( h_i \) is the expert score for sample \( i \).

4.3 Baselines

In this section, we compare our proposed system against three baselines to quantify the effect of knowledge injection:

Baseline             | RAG | SFT | Agent | Notes
Base LLM             | ×   | ×   | ×     | Direct prompting on base model.
Base LLM + RAG       | ✓   | ×   | ×     | Retrieval improves factuality; no parameter updates.
Fine-tuned LLM + RAG | ✓   | ✓   | ✓     | Full system with LoRA-SFT, RAG, and agent reasoning.

Table 8: Baselines used in evaluation.

We use Qwen3-VL-2B-Instruct as our primary base LLM and consider other open-source models (Qwen3-VL, InternVL, MiniCPM-V, Phi-3.5-Vision) mainly for feasibility comparison.

4.4 Implementation Details

In this section, we provide technical details of model fine-tuning. We adopt LoRA for parameter-efficient SFT on the Qwen3-VL instruction-tuned model. Table 9 summarizes the configuration.

Setting         | Value
finetuning type | LoRA
bf16            | true
template        | qwen3 vl
lora alpha      | 16
lora dropout    | 0
lora rank       | 8
lora target     | all

Table 9: LoRA fine-tuning configuration.
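Before turning to results, the Micro-F1 and cosine-similarity metrics from Section 4.2 reduce to a few lines of code. The sketch below uses toy counts and two-dimensional vectors rather than real model outputs:

```python
import math


def micro_f1(tp, fp, fn):
    """Micro-averaged F1 from pooled true positives, false positives,
    and false negatives across all classes (Eq. 3)."""
    return 2 * tp / (2 * tp + fp + fn)


def cosine_similarity(e_c, e_r):
    """Cosine similarity between candidate and reference embeddings (Eq. 4)."""
    dot = sum(c * r for c, r in zip(e_c, e_r))
    norm_c = math.sqrt(sum(c * c for c in e_c))
    norm_r = math.sqrt(sum(r * r for r in e_r))
    return dot / (norm_c * norm_r)


# Toy numbers: 8 pooled true positives, 2 false positives, 2 false negatives.
f1 = micro_f1(tp=8, fp=2, fn=2)                      # 16 / 20 = 0.8
sim = cosine_similarity([1.0, 0.0], [1.0, 0.0])      # identical directions -> 1.0
```

In practice the embeddings come from a sentence encoder rather than hand-written vectors, but the arithmetic is identical.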
5 Experimental Results

In this section, we report the main results and ablation studies, analyzing both general and domain-specific performance.

5.1 Main Results

Table 10 presents evaluation results on Common Knowledge Dataset and GSI Dataset. We observe that our knowledge-enhanced LLM maintains general knowledge performance while achieving substantial improvement on GSI-specific tasks. BLEU-4 increases from 0.090 to 0.307 on GSI Dataset, indicating strong domain adaptation. Sentence-BERT and G-Eval scores also show improved semantic correctness and reasoning quality.

Metric        | Common Knowledge: Base LLM | Common Knowledge: GSI LLM | GSI: Base LLM | GSI: GSI LLM
BLEU-4        | 0.304 | 0.305 | 0.090 | 0.307
ROUGE-1       | 0.352 | 0.351 | 0.157 | 0.204
ROUGE-2       | 0.146 | 0.146 | 0.032 | 0.111
ROUGE-L       | 0.223 | 0.223 | 0.071 | 0.153
Sentence-BERT | 0.861 | 0.869 | 0.544 | 0.742
G-Eval        | 0.82  | 0.84  | 0.57  | 0.79

Table 10: Main results on general and domain datasets.

5.2 Ablation Study

To understand the contribution of each knowledge-injection strategy, we conduct an ablation study using G-Eval as the primary metric. We compare three settings: (i) LLM + RAG, (ii) LLM + Fine-tuning, and (iii) LLM + RAG + Fine-tuning. Table 11 shows that the hybrid approach achieves the best balance between factual grounding and procedural reasoning, confirming that GSI tasks benefit from combining dynamic retrieval and learned domain capabilities.

Method                  | G-Eval Score
LLM + RAG               | 0.51
LLM + Fine-tuning       | 0.63
LLM + RAG + Fine-tuning | 0.72

Table 11: Ablation study of knowledge-injection strategies.

6 Conclusion

We present GSI Agent, a knowledge-enhanced LLM system for GSI tasks, which maintains general knowledge performance while substantially improving domain-specific reasoning. Our experiments demonstrate that combining fine-tuning, retrieval-augmented generation, and agent-based reasoning yields the best overall performance.
Future work includes scaling human expert evaluation, refining retrieval strategies, and performing finer-grained error analysis for real-world GSI maintenance applications.

A Data Sources for RAG Corpus

We list the documents used to build the retrieval corpus. This table is designed to be extended; a "Used?" column can be added if some links are excluded.

#  | Title                                                                          | Year | Link / Notes
1  | Stormwater Management Guidance Manual                                          | 2023 | https://water.phila.gov/wp-content/uploads/files/stormwater-management-guidance-manual.pdf
2  | Pennsylvania Stormwater BMPs Manual                                            | –    | https://greenport.pa.gov/elibrary/GetFolder?FolderID=1368916
3  | City of Philadelphia Green Streets Design Manual                               | 2014 | https://www.phila.gov/media/20160504172218/Green-Streets-Design-Manual-2014.pdf
4  | Green Stormwater Infrastructure Maintenance Manual                             | 2016 | https://water.phila.gov/wp-content/uploads/GSI-Maintenance-Manual_v2_2016.pdf
5  | Green City, Clean Waters Plan                                                  | 2009 | https://www.phila.gov/media/20160421133948/green-city-clean-waters.pdf
6  | Green City, Clean Waters Partnership Agreement                                 | 2012 | https://water.phila.gov/wp-content/uploads/files/EPA_Partnership_Agreement.pdf
7  | Green City Clean Waters – Long-Term Control Plan                               | –    | https://water.phila.gov/wp-content/uploads/files/LTCPU_Complete.pdf
8  | GSI Planning & Design Manual                                                   | –    | https://water.phila.gov/wp-content/uploads/files/gsi-planning-and-design-manual.pdf
9  | GSI As-built Survey and Drafting Manual                                        | –    | https://water.phila.gov/wp-content/uploads/files/gsi-as-built-survey-and-drafting-manual.pdf
10 | SMP Inspection Forms (Cisterns, Roofs, Ponds, Porous Surface, Basins, Filters) | –    | https://water.phila.gov/wp-content/uploads/files/smp-porous-surface-inspection-form.pdf
11 | Philadelphia Water Department Regulations                                      | 2024 | https://water.phila.gov/wp-content/uploads/files/pwd-regulations-2024-04-29.pdf
12 | Stormwater Grant Resources (portal)                                            | –    | https://water.phila.gov/stormwater/incentives/grants/
13 | Plan and Report Checklists                                                     | –    | https://water.phila.gov/wp-content/uploads/files/smgm-e-plan-and-report-checklists.pdf
14 | Reported Flood Damages in Philadelphia (map)                                   | 2024 | https://www.phila.gov/media/20241204111812/Reported-Flood-Damages-Map-v4.2-2024.pdf
15 | Sustainable Funding for Green City, Clean Waters                               | 2022 | https://williampennfoundation.org/sites/default/files/2024-05/PHL-GreenCityCleanWaters-Sustain_2022_FINAL.pdf
16 | GCCW Comprehensive Monitoring Plan                                             | –    | https://archive.phillywatersheds.org/ltcpu/GCCW%20Comprehensive%20Monitoring%20Plan%20Sections%201-10.pdf
17 | PWD Regulations Chapter 6 – Stormwater                                         | –    | https://water.phila.gov/wp-content/uploads/files/pwd-regulations-chapter-6.pdf
18 | SMP Maintenance Guidance                                                       | –    | https://water.phila.gov/wp-content/uploads/files/smp-maintenance-guidance.pdf
19 | SMP Maintenance Guide (portal)                                                 | –    | https://water.phila.gov/development/stormwater-plan-review/maintenance
20 | Rain Check Contractor Documents                                                | –    | https://www.pwdraincheck.org/en/contractor-documents
21 | GSI Landscape Design Guidebook                                                 | 2014 | https://www.pwdraincheck.org/images/documents/Landscape_Manual_2014.pdf
22 | Planning & Design Resource Directory (portal)                                  | –    | https://water.phila.gov/gsi/planning-design/resources/

B SFT Prompt Template

SYSTEM:
You are an expert technical writer and dataset engineer for domain-specific LLM fine-tuning. Your task is to read the provided PDF document and extract knowledge that is NOT generic LLM common knowledge, but instead is specific to this document, its policies, rules, responsibilities, procedures, or technical constraints. You must generate a Supervised Fine-Tuning (SFT) dataset in JSON format.
IMPORTANT RULES (MUST FOLLOW):

1. Every QA pair MUST be fully self-contained and understandable WITHOUT access to the PDF.
   - Do NOT reference chapters, sections, figures, tables, or page numbers.
   - Do NOT use phrases like "Chapter 1", "this section", "as described above", or "the following".
   - Do NOT rely on document structure for meaning.
2. Avoid ambiguous pronouns and references.
   - DO NOT use: it, this, that, they, the City, the program, the agreement.
   - INSTEAD, always explicitly name the entity: e.g., "the Philadelphia Water Department", "the Green City, Clean Waters program", "the municipal green stormwater infrastructure guidance".
3. Each instruction must clearly state the domain and context.
   - A reader with no prior exposure to the PDF should still understand the question.
   - Example:
     Wrong: "Describe how pre-development land cover must be represented"
     Right: "Describe how pre-development land cover must be represented in stormwater modeling for municipal green stormwater infrastructure projects"
4. Each output must:
   - Be grounded ONLY in the PDF content
   - Restate key entities and constraints explicitly
   - Provide a clear, concrete, and technically meaningful answer
   - Be as long or as short as needed to fully capture the knowledge (no artificial length limits)
5. Each QA pair should represent ONE independent, reusable knowledge unit suitable for LLM fine-tuning.
6. If the document defines:
   - responsibilities → generate responsibility-focused QA
   - procedures → generate process-focused QA
   - design rules → generate constraint-focused QA
   - evaluation or monitoring → generate metric- or workflow-focused QA

OUTPUT FORMAT (STRICT):

[
  {
    "instruction": "A fully self-contained task or question",
    "input": "",
    "output": "A complete, document-grounded answer with explicit entities and no ambiguous references"
  }
]

Do NOT add explanations, commentary, or markdown. Only output valid JSON.
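Because the system prompt demands a strict JSON array, malformed generations can be filtered before entering the dataset. The check below is a sketch of such a filter, not part of the prompt template itself; the field names follow the OUTPUT FORMAT above, and the sample response is invented:

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}


def validate_sft_batch(raw):
    """Parse a model response and keep only well-formed, non-empty QA records."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    return [
        r for r in data
        if isinstance(r, dict)
        and set(r) == REQUIRED_KEYS
        and r["instruction"]
        and r["output"]
    ]


# Illustrative well-formed response (content is a placeholder, not real guidance).
raw = json.dumps([{
    "instruction": "Explain which agency maintains public rain gardens in Philadelphia.",
    "input": "",
    "output": "An illustrative document-grounded answer would go here.",
}])
records = validate_sft_batch(raw)
```

Rejecting whole batches on any parse failure is deliberate: a partially valid response usually signals the model ignored the format instructions, so regenerating is safer than salvaging.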
USER PROMPT:
Below is the extracted text content from a Green Stormwater Infrastructure (GSI) PDF document.

Your task:
1. Identify document-specific, non-obvious, and operationally relevant knowledge.
2. Convert that knowledge into self-contained instruction-style QA samples suitable for supervised fine-tuning (SFT).
3. Ensure that each question and answer can be fully understood without access to the original document.
4. Avoid vague entity references or pronouns unless the entity is explicitly defined in the instruction.

Generate as many high-quality samples as the content supports. Fewer high-quality samples are preferred over many weak ones.

Return ONLY a valid JSON array. Do not include explanations or markdown.

References

[1] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[2] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In NeurIPS, 2020.
[3] Junyu Luo, Weizhi Zhang, Ye Yuan, et al. Large language model agent: A survey on methodology, applications and challenges. arXiv preprint arXiv:2503.21460, 2025.
[4] Shuo Ren, Pu Jian, et al. Towards scientific intelligence: A survey of LLM-based scientific agents. arXiv preprint arXiv:2503.??? (preprint), 2025.
[5] Zirui Song, Bin Yan, Yuhan Liu, Miao Fang, Mingzhe Li, Rui Yan, and Xiuying Chen. Injecting domain-specific knowledge into large language models: A comprehensive survey. In Findings of the Association for Computational Linguistics: EMNLP 2025, 2025.
[6] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, et al. Chain-of-thought prompting elicits reasoning in large language models. 2022.
[7] Xiaoran Xu and Ravi Sankar. Large language model agents for biomedicine: A comprehensive review. Information, 2025.
[8] Shunyu Yao, Jeffrey Zhao, Dian Yu, et al. ReAct: Synergizing reasoning and acting in language models. In ICLR, 2023.
[9] Asaf Yehudai, Lilach Eden, et al. Survey on evaluation of LLM-based agents. arXiv preprint, 2025.
