UW-BioNLP at ChemoTimelines 2025: Thinking, Fine-Tuning, and Dictionary-Enhanced LLM Systems for Chemotherapy Timeline Extraction

Reading time: 5 minutes
...

📝 Original Info

  • Title: UW-BioNLP at ChemoTimelines 2025: Thinking, Fine-Tuning, and Dictionary-Enhanced LLM Systems for Chemotherapy Timeline Extraction
  • ArXiv ID: 2512.04518
  • Date: 2025-12-04
  • Authors: Tianmai M. Zhang*, Zhaoyi Sun*, Sihang Zeng*, Chenxi Li*, Neil F. Abernethy, Barbara D. Lam, Fei Xia, Meliha Yetisgen (University of Washington); *equal contribution

📝 Abstract

The ChemoTimelines shared task benchmarks methods for constructing timelines of systemic anticancer treatment from electronic health records of cancer patients. This paper describes our methods, results, and findings for subtask 2 -- generating patient chemotherapy timelines from raw clinical notes. We evaluated strategies involving chain-of-thought thinking, supervised fine-tuning, direct preference optimization, and dictionary-based lookup to improve timeline extraction. All of our approaches followed a two-step workflow, wherein an LLM first extracted chemotherapy events from individual clinical notes, and then an algorithm normalized and aggregated events into patient-level timelines. Each specific method differed in how the associated LLM was utilized and trained. Multiple approaches yielded competitive performances on the test set leaderboard, with fine-tuned Qwen3-14B achieving the best official score of 0.678. Our results and analyses could provide useful insights for future attempts on this task as well as the design of similar tasks.
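To make the two-step workflow concrete, the sketch below shows one minimal way to wire it up in Python. The function names, the (mention, relation, date) tuple format, and the simple sort-and-deduplicate aggregation are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of the two-step workflow (assumed structure, not the authors' code).
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Assumed event format: (chemo_mention, temporal_relation, normalized_date)
Event = Tuple[str, str, str]

def build_patient_timelines(
    notes_by_patient: Dict[str, List[str]],
    extract_events: Callable[[str], List[Event]],  # step 1: LLM-backed note-level extractor
) -> Dict[str, List[Event]]:
    """Run note-level extraction, then aggregate events into one timeline per patient."""
    timelines: Dict[str, List[Event]] = defaultdict(list)
    for patient_id, notes in notes_by_patient.items():
        # Step 1: the LLM extracts chemotherapy events from each individual note.
        for note in notes:
            timelines[patient_id].extend(extract_events(note))
        # Step 2: deterministic aggregation; here just deduplication plus chronological
        # ordering (assuming ISO-formatted dates), standing in for the paper's algorithm.
        timelines[patient_id] = sorted(set(timelines[patient_id]), key=lambda e: e[2])
    return dict(timelines)
```

Because the pipeline only depends on `extract_events`, swapping in a prompted, chain-of-thought, fine-tuned, or dictionary-enhanced LLM leaves the rest unchanged, which mirrors the paper's statement that its methods differ mainly in how the LLM is utilized and trained.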

💡 Deep Analysis

Figure 1

📄 Full Content

UW-BioNLP at ChemoTimelines 2025: Thinking, Fine-Tuning, and Dictionary-Enhanced LLM Systems for Chemotherapy Timeline Extraction

Tianmai M. Zhang*, Zhaoyi Sun*, Sihang Zeng*, Chenxi Li*, Neil F. Abernethy, Barbara D. Lam, Fei Xia, Meliha Yetisgen
University of Washington
Correspondence: melihay@uw.edu
*These authors contributed equally.

Abstract

The ChemoTimelines shared task benchmarks methods for constructing timelines of systemic anticancer treatment from electronic health records of cancer patients. This paper describes our methods, results, and findings for subtask 2 -- generating patient chemotherapy timelines from raw clinical notes. We evaluated strategies involving chain-of-thought thinking, supervised fine-tuning, direct preference optimization, and dictionary-based lookup to improve timeline extraction. All of our approaches followed a two-step workflow, wherein an LLM first extracted chemotherapy events from individual clinical notes, and then an algorithm normalized and aggregated events into patient-level timelines. Each specific method differed in how the associated LLM was utilized and trained. Multiple approaches yielded competitive performances on the test set leaderboard, with fine-tuned Qwen3-14B achieving the best official score of 0.678. Our results and analyses could provide useful insights for future attempts on this task as well as the design of similar tasks.

1 Introduction

Electronic health records (EHRs) contain rich information on treatment courses, but extracting temporal relationships is challenging due to variability in care and linguistic complexity (Olex and McInnes, 2021; Gholipour et al., 2023). Oncology regimens often deviate from planned schedules through dose changes or delays, with such modifications usually recorded only in unstructured notes that require chronological alignment (Wang et al., 2020). Clinical narratives add further difficulty with relative or vague time expressions and inconsistent date formats (Sun et al., 2013, 2015). Even experts may diverge in interpreting underspecified terms, making accurate normalization and sequencing a persistent challenge for clinical NLP systems.

The ChemoTimelines shared task (https://sites.google.com/view/chemotimelines2025; Yao et al., 2024, 2025) was created to benchmark systems for constructing systemic anticancer treatment (SACT) timelines directly from EHR notes. It consists of two subtasks. In subtask 1, besides the raw EHRs, gold standard annotations of treatment events (EVENTs) and time expressions (TIMEX3s) for each patient EHR note are provided, and the task is to determine temporal relations between them at the patient level. In subtask 2, the task is to extract the patient-level treatment timeline with only the raw EHR notes available. We focus on subtask 2 to provide insights into an end-to-end treatment timeline extraction system.

Large language models (LLMs) demonstrate superior comprehension and information extraction ability, and were widely used in the previous year of the challenge (Haddadan et al., 2024; Zhang et al., 2024). Without dedicated prompt engineering and chain-of-thought reasoning (Wei et al., 2023), zero-shot prompting of LLMs has shown poor performance on the timeline extraction task (Zhang et al., 2024). Domain-adapted fine-tuning has proven effective for SACT timeline extraction, with models like Flan-T5-XXL (Chung et al., 2022) and PubMedBERT (Gu et al., 2021) achieving strong results (Haddadan et al., 2024; Tan et al., 2024).
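The normalization difficulty described earlier in this introduction (relative or vague time expressions, inconsistent date formats) can be illustrated with a small sketch. The patterns handled below are an assumed subset chosen for illustration, not the shared task's full TIMEX3 normalization specification.

```python
# Toy normalizer: resolve a time expression against the note's date (assumed patterns only).
import re
from datetime import date, timedelta
from typing import Optional

def normalize_timex(expression: str, note_date: date) -> Optional[str]:
    """Map a raw time expression to an ISO-8601 date, anchored on the note date."""
    expr = expression.strip().lower()
    if expr == "today":
        return note_date.isoformat()
    if expr == "yesterday":
        return (note_date - timedelta(days=1)).isoformat()
    m = re.fullmatch(r"(\d+) weeks? ago", expr)
    if m:  # relative expression, e.g. "3 weeks ago"
        return (note_date - timedelta(weeks=int(m.group(1)))).isoformat()
    m = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{4})", expr)
    if m:  # US-style explicit date, e.g. "3/14/2011"
        month, day, year = map(int, m.groups())
        return date(year, month, day).isoformat()
    return None  # underspecified expressions ("last fall") may be unresolvable

# Example: normalize_timex("2 weeks ago", date(2011, 3, 14)) -> "2011-02-28"
```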
However, these approaches have predominantly utilized older or smaller-scale architectures, such as BART (Lewis et al., 2019) and Flan-T5-XXL (Chung et al., 2022), and predicted timelines based on sentence-level contexts. Recent studies on scaling laws suggest that leveraging larger, more powerful models with richer context presents a clear opportunity for further improvement (Kaplan et al., 2020). In parallel, pipeline systems, which first extract events with a curated dictionary and then identify relations (Haddadan et al., 2024; Wang et al., 2024), have been developed but typically show inferior performance to end-to-end systems. Despite integrating external knowledge, the pipeline approach may still be suboptimal.

Building on previous efforts, we explore a variety of strategies to fill these gaps. First, to analyze the impact of LLM-based reasoning, we compare a baseline prompting system with a reasoning system. Second, to rethink the impact of external knowledge, we design a dictionary-enhanced extraction approach. Finally, to explore multiple training strategies, we conduct supervised fine-tuning (SFT) and direct preference optimization (DPO) on the latest LLMs. Our fine-tuned Qwen3-14B system wins first place on the challenge leaderboard. We provide several novel insights that may inform future attempts on this task, as well as the design of similar tasks.

2 Problem Form
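As a rough illustration of the dictionary-enhanced extraction approach mentioned in the introduction above, a curated list of chemotherapy agents can be matched against each note to surface candidate mentions for the LLM to verify or anchor on. The dictionary entries and hint format below are hypothetical examples, not the authors' curated resource.

```python
# Sketch of dictionary-based lookup over a note (illustrative drug list, not the real resource).
import re
from typing import Dict, List

CHEMO_DICTIONARY = [
    "cisplatin", "carboplatin", "paclitaxel", "docetaxel",
    "doxorubicin", "cyclophosphamide", "gemcitabine", "folfox",
]

def find_candidate_mentions(note_text: str) -> List[Dict[str, object]]:
    """Return dictionary hits with character offsets, usable as hints in an LLM prompt."""
    hits: List[Dict[str, object]] = []
    for drug in CHEMO_DICTIONARY:
        pattern = rf"\b{re.escape(drug)}\b"
        for match in re.finditer(pattern, note_text, flags=re.IGNORECASE):
            hits.append({"drug": drug, "start": match.start(), "end": match.end()})
    return hits

# Example: find_candidate_mentions("Started FOLFOX 3/14; plan carboplatin next cycle.")
# -> hits for "folfox" and "carboplatin"
```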

📸 Image Gallery

dpo.png

Reference

This content is AI-processed based on open access ArXiv data.
