A Comparative Study of Prompting Strategies for Extracting Zeolite Synthesis Procedure Information Using Large Language Models
📝 Abstract
Extracting structured information from zeolite synthesis experimental procedures is critical for materials discovery, yet existing methods have not systematically evaluated Large Language Models (LLMs) for this domain-specific task. This work addresses a fundamental question: what is the efficacy of different prompting strategies when applying LLMs to scientific information extraction? We focus on four key subtasks: event type classification (identifying synthesis steps), trigger text identification (locating event mentions), argument role extraction (recognizing parameter types), and argument text extraction (extracting parameter values). We evaluate four prompting strategies (zero-shot, few-shot, event-specific, and reflection-based) across six state-of-the-art LLMs (Gemma-3-12b-it, GPT-5-mini, O4-mini, Claude-Haiku-3.5, DeepSeek reasoning and non-reasoning) using the ZSEE dataset of 1,530 annotated sentences. Results demonstrate strong performance on event type classification (80-90% F1) but modest performance on fine-grained extraction tasks, particularly argument role and argument text extraction (50-65% F1). GPT-5-mini exhibits extreme prompt sensitivity with 11-79% F1 variation. Notably, advanced prompting strategies provide minimal improvements over zero-shot approaches, revealing fundamental architectural limitations. Error analysis identifies systematic hallucination, over-generalization, and inability to capture synthesis-specific nuances. Our findings demonstrate that while LLMs achieve high-level understanding, precise extraction of experimental parameters requires domain-adapted models, providing quantitative benchmarks for scientific information extraction.
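The four subtasks named in the abstract map naturally onto a nested record per sentence. The following is a minimal Python sketch of one such annotation; the field names and the `spans_grounded` sanity check are illustrative assumptions, not the actual ZSEE schema or scorer.

```python
# Hypothetical record illustrating the four ZSEE subtasks on one sentence.
# Field names are illustrative, not the actual ZSEE annotation schema.
sentence = "The gel was stirred at 60 C for 2 h."

annotation = {
    "event_type": "Stir",                 # subtask 1: event type classification
    "trigger": "stirred",                 # subtask 2: trigger text identification
    "arguments": [                        # subtasks 3 and 4: role + text extraction
        {"role": "temperature", "text": "60 C"},
        {"role": "duration", "text": "2 h"},
    ],
}

def spans_grounded(sent: str, ann: dict) -> bool:
    """Check that every extracted span actually occurs in the sentence,
    a common sanity filter against the span hallucination the paper reports."""
    spans = [ann["trigger"]] + [a["text"] for a in ann["arguments"]]
    return all(s in sent for s in spans)

print(spans_grounded(sentence, annotation))  # True
```

Because trigger and argument spans must be located at the token level, a grounding check like this catches one failure mode (invented spans) but not role confusions, which the paper finds equally common.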
📄 Content
Evaluating LLMs for Zeolite Synthesis Event Extraction (ZSEE): A Systematic Analysis of Prompting Strategies

Charan Prakash Rathore1, Saumi Ray1, Dhruv Kumar1
1Birla Institute of Technology and Science, Pilani, India
Correspondence: dhruv.kumar@pilani.bits-pilani.ac.in

Abstract

Extracting structured information from zeolite synthesis experimental procedures is critical for materials discovery, yet existing methods have not systematically evaluated Large Language Models (LLMs) for this domain-specific task. This work addresses a fundamental question: what is the efficacy of different prompting strategies when applying LLMs to scientific information extraction? We focus on four key subtasks: event type classification (identifying synthesis steps), trigger text identification (locating event mentions), argument role extraction (recognizing parameter types), and argument text extraction (extracting parameter values). We evaluate four prompting strategies (zero-shot, few-shot, event-specific, and reflection-based) across six state-of-the-art LLMs (Gemma-3-12b-it, GPT-5-mini, O4-mini, Claude-Haiku-3.5, DeepSeek reasoning and non-reasoning) using the ZSEE dataset of 1,530 annotated sentences. Results demonstrate strong performance on event type classification (80-90% F1) but modest performance on fine-grained extraction tasks, particularly argument role and argument text extraction (50-65% F1). GPT-5-mini exhibits extreme prompt sensitivity with 11-79% F1 variation. Notably, advanced prompting strategies provide minimal improvements over zero-shot approaches, revealing fundamental architectural limitations. Error analysis identifies systematic hallucination, over-generalization, and inability to capture synthesis-specific nuances.
Our findings demonstrate that while LLMs achieve high-level understanding, precise extraction of experimental parameters requires domain-adapted models, providing quantitative benchmarks for scientific information extraction.

1 Introduction

Zeolites are crucial industrial catalysts whose automated synthesis requires extracting structured, machine-readable data from unstructured experimental procedures (Jensen et al., 2019). Event extraction, a core information extraction task that identifies specific occurrences or actions mentioned in text along with their associated participants and attributes (Ahn, 2006), offers a systematic approach to structuring procedural knowledge. Argument extraction complements this by identifying and classifying the entities, temporal expressions, and other parameters associated with these events (Li et al., 2013; Yang et al., 2019). In the context of zeolite synthesis, event-argument extraction involves identifying synthesis actions (e.g., Add, Stir, Calcine), their textual triggers, and associated arguments such as materials, temperatures, and durations from complex procedural sentences. This task is particularly challenging due to domain-specific terminology, implicit information, complex sentence structures, and the need for precise span identification at the token level.

Traditional approaches to scientific information extraction rely heavily on supervised learning with domain-specific labeled data (Luan et al., 2018; Jain et al., 2020). However, NLP-based information extraction for specialized domains remains limited by scarce annotated datasets and domain-specific complexity. The emergence of Large Language Models (LLMs) pre-trained on vast corpora has demonstrated remarkable capabilities through in-context learning and prompting strategies. This raises a critical question: can general-purpose LLMs effectively perform specialized scientific information extraction without extensive fine-tuning?
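The F1 figures cited throughout are most naturally computed as exact-match micro-F1 over predicted versus gold items (e.g. (event type, trigger) pairs or (role, text) pairs). A minimal sketch follows; exact-tuple matching is an assumption here, and the paper's scorer may differ in matching criteria.

```python
from collections import Counter

def micro_f1(gold: list, pred: list) -> float:
    """Exact-match micro-F1: a predicted item counts as correct only if
    the full tuple (e.g. (event_type, trigger) or (role, text)) matches
    a gold item; duplicates are handled via multiset intersection."""
    gold_c, pred_c = Counter(gold), Counter(pred)
    tp = sum((gold_c & pred_c).values())  # true positives
    if tp == 0:
        return 0.0
    precision = tp / sum(pred_c.values())
    recall = tp / sum(gold_c.values())
    return 2 * precision * recall / (precision + recall)

gold = [("Stir", "stirred"), ("Calcine", "calcined")]
pred = [("Stir", "stirred"), ("Add", "added")]
print(round(micro_f1(gold, pred), 3))  # 0.5
```

Under this metric a span that is almost right (e.g. "60" instead of "60 C") scores zero, which helps explain why argument text extraction lags event type classification so sharply.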
Recent work on scientific event extraction has primarily focused on developing specialized neural architectures with domain adaptation. The ZSEE (Zeolite Synthesis Event Extraction) dataset (He et al., 2024) introduced expert-annotated data for zeolite synthesis procedures and evaluated tailored models like PAIE, while Zeo-Reader (He et al., 2025) improved representation learning through contrastive learning techniques. Other specialized approaches including AMPERE, DEGREE, and EEQA have demonstrated the effectiveness of carefully designed architectures (Du and Cardie, 2021; Hsu et al., 2022; Wei et al., 2021). However, a significant gap exists in understanding how general-purpose LLMs perform on such specialized extraction tasks. While ZSEE noted limitations of GPT-3.5-turbo regarding hallucination and over-generalization, no comprehensive systematic evaluation across multiple state-of-the-art LLMs using varied prompting strategies has been conducted. We present a systematic benchmark study evaluating six contemporary LLMs across four distinct prompting paradigms. Our methodology employs a standardized evaluation framework applied consistently a

(arXiv:2512.15312v1 [cs.CL] 17 Dec 2025)
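The four prompting paradigms differ mainly in how the instruction is assembled around the target sentence. The sketch below contrasts three of them; the prompt wording, the demonstration, and the `build_prompt` helper are illustrative assumptions, not the paper's actual templates.

```python
INSTRUCTION = (
    "Extract synthesis events from the sentence. For each event give "
    "the event type, trigger text, and arguments as JSON."
)

# One worked demonstration reused for few-shot prompting (illustrative).
DEMO = (
    'Sentence: "The mixture was heated to 100 C."\n'
    'Output: {"event_type": "Heat", "trigger": "heated", '
    '"arguments": [{"role": "temperature", "text": "100 C"}]}'
)

def build_prompt(sentence: str, strategy: str = "zero-shot") -> str:
    """Assemble a prompt for one of the strategies compared in the paper
    (event-specific prompting, which swaps in a per-event schema, is omitted)."""
    parts = [INSTRUCTION]
    if strategy == "few-shot":
        parts.append("Example:\n" + DEMO)  # prepend in-context demonstrations
    if strategy == "reflection":
        parts.append(
            "After answering, re-check each extracted span against the "
            "sentence and remove anything not present verbatim."
        )
    parts.append(f'Sentence: "{sentence}"\nOutput:')
    return "\n\n".join(parts)

print(build_prompt("The gel was stirred for 2 h.", "few-shot"))
```

That the paper finds these elaborations add little over the zero-shot variant suggests the bottleneck is span-level grounding in the model itself, not the prompt's framing.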