Title: UW-BioNLP at ChemoTimelines 2025: Thinking, Fine-Tuning, and Dictionary-Enhanced LLM Systems for Chemotherapy Timeline Extraction
ArXiv ID: 2512.04518
Date: 2025-12-04
Authors: Tianmai M. Zhang*, Zhaoyi Sun*, Sihang Zeng*, Chenxi Li*, Neil F. Abernethy, Barbara D. Lam, Fei Xia, Meliha Yetisgen (University of Washington; *equal contribution)
📝 Abstract
The ChemoTimelines shared task benchmarks methods for constructing timelines of systemic anticancer treatment from electronic health records of cancer patients. This paper describes our methods, results, and findings for subtask 2 -- generating patient chemotherapy timelines from raw clinical notes. We evaluated strategies involving chain-of-thought thinking, supervised fine-tuning, direct preference optimization, and dictionary-based lookup to improve timeline extraction. All of our approaches followed a two-step workflow, wherein an LLM first extracted chemotherapy events from individual clinical notes, and then an algorithm normalized and aggregated events into patient-level timelines. Each specific method differed in how the associated LLM was utilized and trained. Multiple approaches yielded competitive performances on the test set leaderboard, with fine-tuned Qwen3-14B achieving the best official score of 0.678. Our results and analyses could provide useful insights for future attempts on this task as well as the design of similar tasks.
📄 Full Content
UW-BioNLP at ChemoTimelines 2025: Thinking, Fine-Tuning, and Dictionary-Enhanced LLM Systems for Chemotherapy Timeline Extraction
Tianmai M. Zhang*, Zhaoyi Sun*, Sihang Zeng*, Chenxi Li*, Neil F. Abernethy, Barbara D. Lam, Fei Xia, Meliha Yetisgen
University of Washington
Correspondence: melihay@uw.edu
Abstract
The ChemoTimelines shared task benchmarks methods for constructing timelines of systemic anticancer treatment from electronic health records of cancer patients. This paper describes our methods, results, and findings for subtask 2—generating patient chemotherapy timelines from raw clinical notes. We evaluated strategies involving chain-of-thought thinking, supervised fine-tuning, direct preference optimization, and dictionary-based lookup to improve timeline extraction. All of our approaches followed a two-step workflow, wherein an LLM first extracted chemotherapy events from individual clinical notes, and then an algorithm normalized and aggregated events into patient-level timelines. Each specific method differed in how the associated LLM was utilized and trained. Multiple approaches yielded competitive performances on the test set leaderboard, with fine-tuned Qwen3-14B achieving the best official score of 0.678. Our results and analyses could provide useful insights for future attempts on this task as well as the design of similar tasks.
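To make the two-step workflow concrete, the following is a minimal Python sketch. The event schema, function names, and the per-note LLM extraction step are illustrative assumptions rather than our actual implementation; the prompts, normalization rules, and aggregation details are described later in the paper.

```python
from collections import defaultdict

# Hypothetical per-note event record (schema assumed for illustration), e.g.:
#   {"chemo": "cisplatin", "relation": "begins-on", "date": "2011-05-15"}

def extract_events(note_text: str) -> list[dict]:
    """Step 1 (placeholder): prompt an LLM to extract chemotherapy events
    from a single clinical note and return them as structured records."""
    raise NotImplementedError("LLM call omitted in this sketch")

def normalize(event: dict) -> tuple:
    """Step 2a (placeholder): normalize drug mentions and time expressions
    (here only lowercasing the drug name) before aggregation."""
    return (event["chemo"].lower(), event["relation"], event["date"])

def build_timelines(notes_by_patient: dict[str, list[str]]) -> dict[str, list[tuple]]:
    """Step 2b: aggregate per-note events into a deduplicated,
    chronologically sorted patient-level timeline."""
    timelines = defaultdict(set)
    for patient_id, notes in notes_by_patient.items():
        for note in notes:
            for event in extract_events(note):
                timelines[patient_id].add(normalize(event))
    return {pid: sorted(events, key=lambda e: e[2]) for pid, events in timelines.items()}
```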
1 Introduction
Electronic health records (EHRs) contain rich information on treatment courses, but extracting temporal relationships is challenging due to variability in care and linguistic complexity (Olex and McInnes, 2021; Gholipour et al., 2023). Oncology regimens often deviate from planned schedules through dose changes or delays, with such modifications usually recorded only in unstructured notes that require chronological alignment (Wang et al., 2020). Clinical narratives add further difficulty with relative or vague time expressions and inconsistent date formats (Sun et al., 2013, 2015). Even experts may diverge in interpreting underspecified terms, making accurate normalization and sequencing a persistent challenge for clinical NLP systems.
*These authors contributed equally.
The ChemoTimelines shared task¹ (Yao et al., 2024, 2025) was created to benchmark systems for constructing systemic anticancer treatment (SACT) timelines directly from EHR notes. It consists of two subtasks. In subtask 1, besides the raw EHRs, gold standard annotations of treatment events (EVENTs) and time expressions (TIMEX3s) for each patient EHR note are provided, and the task is to determine temporal relations between them on the patient level. In subtask 2, the task is to extract the patient-level treatment timeline with only the raw EHR notes available. We focus on subtask 2 to provide insights into an end-to-end treatment timeline extraction system.
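For illustration, a patient-level timeline can be represented as a set of tuples pairing a chemotherapy event with a temporal relation to a normalized time expression. The sketch below follows the tuple format of the earlier edition of the task; the specific relation labels and all values are assumptions for illustration, not gold data.

```python
# Illustrative subtask 2 output: one timeline per patient, where each entry
# links a chemotherapy mention to a normalized time expression via a temporal
# relation. The relation labels (begins-on, ends-on, contains-1) are assumed
# from the previous edition of the task; all values are invented.
example_timelines = {
    "patient_001": [
        ("cisplatin", "begins-on", "2011-05-15"),
        ("cisplatin", "ends-on", "2011-08-02"),
        ("paclitaxel", "contains-1", "2011-06-10"),
    ],
}
```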
Large language models (LLMs) demonstrate superior comprehension and information extraction ability, and were widely used in the previous year of the challenge (Haddadan et al., 2024; Zhang et al., 2024). Without dedicated prompt engineering and chain-of-thought reasoning (Wei et al., 2023), zero-shot prompting on LLMs has shown poor performance (Zhang et al., 2024) in the timeline extraction task. Domain-adapted fine-tuning has proven effective for SACT timeline extraction, with models like Flan-T5-XXL (Chung et al., 2022) and PubMedBERT (Gu et al., 2021) achieving strong results (Haddadan et al., 2024; Tan et al., 2024). However, these approaches have predominantly utilized older or smaller-scale architectures, such as BART (Lewis et al., 2019) and Flan-T5-XXL (Chung et al., 2022), and predicted timelines based on sentence-level contexts. Recent studies on scaling laws suggest that leveraging larger, more powerful models with rich context presents a clear opportunity for further improvement (Kaplan et al., 2020).
In parallel, pipeline systems—which first extract events with a curated dictionary and then identify relations (Haddadan et al., 2024; Wang et al., 2024)—have been developed but typically show inferior performance to end-to-end systems. Despite integrating external knowledge, the pipeline approach may still be suboptimal.
¹ https://sites.google.com/view/chemotimelines2025
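As a point of reference, a dictionary-based event extraction step of the kind used in such pipelines can be sketched as below. The drug list and matching strategy are illustrative assumptions, not the curated dictionaries used in the cited systems or in our approach.

```python
import re

# Tiny illustrative dictionary of chemotherapy agents; real systems would use
# a curated oncology drug lexicon rather than this hand-picked set.
CHEMO_DICT = {"cisplatin", "carboplatin", "paclitaxel", "doxorubicin", "cyclophosphamide"}

def find_candidate_events(note_text: str) -> list[tuple[str, int, int]]:
    """Return (drug, start, end) spans for dictionary terms found in a note."""
    candidates = []
    for drug in CHEMO_DICT:
        for match in re.finditer(rf"\b{re.escape(drug)}\b", note_text, flags=re.IGNORECASE):
            candidates.append((drug, match.start(), match.end()))
    return sorted(candidates, key=lambda span: span[1])
```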
Building on previous efforts, we explore a variety of strategies to fill these gaps. First, to analyze the impact of LLM-based reasoning, we compare a baseline prompting system with a reasoning system. Second, to rethink the impact of external knowledge, we design a dictionary-enhanced extraction approach. Finally, to explore multiple training strategies, we conduct supervised fine-tuning (SFT) and direct preference optimization (DPO) on the latest LLMs. Our fine-tuned Qwen3-14B system wins first place on the challenge leaderboard. We provide several novel insights that may inform future attempts on this task as well as the design of similar tasks.
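As a rough illustration of the training strategies, the sketch below shows how DPO could be run with the Hugging Face TRL library on preference pairs derived from note-level extractions. The model choice, hyperparameters, and pair-construction scheme are assumptions for illustration and do not necessarily reflect the configuration reported later in the paper.

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Preference pairs: for a given note-level extraction prompt, "chosen" is an
# output closer to the gold timeline and "rejected" is a worse model output.
# This pair-construction strategy is an assumption for illustration only.
pairs = Dataset.from_list([
    {
        "prompt": "Extract chemotherapy events from the note below...\n<note text>",
        "chosen": '[{"chemo": "cisplatin", "relation": "begins-on", "date": "2011-05-15"}]',
        "rejected": "[]",
    }
])

model_name = "Qwen/Qwen3-14B"  # best-scoring model per the abstract
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-chemotimelines", beta=0.1, per_device_train_batch_size=1),
    train_dataset=pairs,
    processing_class=tokenizer,  # older TRL versions name this argument `tokenizer`
)
trainer.train()
```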
2 Problem Formulation