NOWJ @BioCreative IX ToxHabits: An Ensemble Deep Learning Approach for Detecting Substance Use and Contextual Information in Clinical Texts


Extracting drug-use information from unstructured electronic health records remains a major challenge in clinical natural language processing. While large language models have demonstrated notable advances, their use in clinical NLP is limited by concerns over trust, control, and efficiency. To address this, we present the NOWJ submission to the ToxHabits Shared Task at BioCreative IX. The task targets the detection of toxic-substance use and contextual attributes in Spanish clinical texts, a domain-specific, low-resource setting. We propose a multi-output ensemble system tackling both Subtask 1 (ToxNER) and Subtask 2 (ToxUse). Our system integrates BETO with a CRF layer for sequence labeling, employs diverse training strategies, and uses sentence filtering to boost precision. Our top run achieved 0.94 F1 and 0.97 precision for trigger detection, and 0.91 F1 for argument detection.


💡 Research Summary

The paper presents a solution to the ToxHabits Shared Task of BioCreative IX, which focuses on detecting toxic substance use and associated contextual information in Spanish clinical case reports. The task is divided into two subtasks: (1) Trigger detection (ToxNER), identifying spans of substances such as tobacco, cannabis, alcohol, or drugs, and (2) Argument detection (ToxUse), extracting six types of contextual attributes (type, method, amount, frequency, duration, history). The dataset consists of 1,499 Spanish clinical documents, with an average of about six triggers and eight arguments per document, making it a low‑resource, domain‑specific challenge.

The authors design a multi‑output architecture built on BETO, the Spanish‑specific BERT model. A shared BETO encoder feeds two separate decoding branches: one for triggers and one for arguments. Each branch consists of a linear projection layer followed by a Conditional Random Field (CRF) decoder, which enforces BIO tag transition constraints and yields globally optimal label sequences via the Viterbi algorithm. This joint model learns shared contextual representations while allowing task‑specific predictions.
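To make the role of the CRF decoder concrete, the sketch below implements Viterbi decoding with hard BIO transition constraints in plain Python. The tag set and emission scores are hypothetical (the paper's actual label inventory for triggers and arguments is larger), and a real CRF layer would also learn transition scores rather than only forbidding illegal ones; this only illustrates why decoding yields a globally valid sequence even when per-token scores favor an illegal tag.

```python
import math

# Hypothetical miniature tag set for illustration only.
TAGS = ["O", "B-SUBST", "I-SUBST"]

def allowed(prev_tag: str, tag: str) -> bool:
    """BIO constraint: I-X may only follow B-X or I-X."""
    if tag.startswith("I-"):
        ent = tag[2:]
        return prev_tag in (f"B-{ent}", f"I-{ent}")
    return True

def viterbi_decode(emissions):
    """Return the best-scoring tag sequence under BIO constraints.

    emissions: list of dicts mapping tag -> log-score for each token.
    """
    n = len(emissions)
    best = [{} for _ in range(n)]  # best[t][tag] = (score, backpointer)
    for tag in TAGS:
        # Treat the virtual start as "O": a sequence cannot open with I-.
        score = emissions[0][tag] if allowed("O", tag) else -math.inf
        best[0][tag] = (score, None)
    for t in range(1, n):
        for tag in TAGS:
            cands = [
                (best[t - 1][p][0] + emissions[t][tag], p)
                for p in TAGS if allowed(p, tag)
            ]
            best[t][tag] = max(cands) if cands else (-math.inf, None)
    # Backtrack from the highest-scoring final tag.
    tag = max(TAGS, key=lambda x: best[-1][x][0])
    path = [tag]
    for t in range(n - 1, 0, -1):
        tag = best[t][tag][1]
        path.append(tag)
    return list(reversed(path))

# Raw scores prefer the illegal start "I-SUBST"; decoding repairs it to B-I.
ems = [
    {"O": 0.1, "B-SUBST": 0.0, "I-SUBST": 2.0},
    {"O": 0.0, "B-SUBST": 0.1, "I-SUBST": 1.5},
]
print(viterbi_decode(ems))  # → ['B-SUBST', 'I-SUBST']
```

The same dynamic program underlies the CRF branch in each decoding head; the shared BETO encoder simply supplies the emission scores.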

To mitigate over‑fitting on the small corpus and address severe class imbalance, the authors employ three complementary training strategies. First, they apply label‑weighted loss, assigning higher loss weights to under‑represented classes. Second, they perform data oversampling, duplicating sentences that contain triggers or arguments to balance the class distribution. Third, they use weighted random sampling during batch construction, increasing the probability that informative sentences appear in each epoch. These strategies are combined with a 5‑fold cross‑validation split of the training data, producing multiple resampled subsets. For each subset, a separate BERT‑CRF model is trained, resulting in ensembles of 6 models (full‑train data) or 19 models (partial‑train data) depending on the experimental condition.
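The first and third strategies can be sketched as follows. The inverse-frequency weighting formula and the `boost` factor are illustrative assumptions, not values reported in the paper; the point is that rare entity tags receive larger loss weights, and sentences containing any entity are drawn into batches more often.

```python
from collections import Counter

def label_weights(tag_sequences):
    """Inverse-frequency loss weights so under-represented BIO tags
    contribute more to the training loss (illustrative scheme)."""
    counts = Counter(tag for seq in tag_sequences for tag in seq)
    total = sum(counts.values())
    return {tag: total / (len(counts) * c) for tag, c in counts.items()}

def sentence_sampling_weights(tag_sequences, boost=3.0):
    """Weighted random sampling: sentences containing any entity tag are
    `boost` times more likely to be selected when building a batch."""
    return [boost if any(t != "O" for t in seq) else 1.0 for seq in tag_sequences]

# Toy corpus: two all-"O" sentences and one containing a trigger span.
corpus = [
    ["O", "O", "O"],
    ["O", "B-SUBST", "I-SUBST"],
    ["O", "O"],
]
print(label_weights(corpus))
print(sentence_sampling_weights(corpus))  # → [1.0, 3.0, 1.0]
```

In a PyTorch pipeline the per-tag weights would be passed to the loss function (or folded into the CRF log-likelihood), and the per-sentence weights to a `WeightedRandomSampler`; the oversampling strategy is the simpler variant of the same idea, duplicating positive sentences in the training list.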

A crucial preprocessing step is sentence filtering. A binary classifier, also based on BETO, is fine‑tuned to predict whether a sentence contains any trigger or argument. During inference, only sentences classified as positive are passed to the main multi‑output model, reducing computational load and improving precision. After token‑level predictions are generated, a post‑processing stage detokenizes sub‑word outputs, normalizes spans, and aggregates predictions across the ensemble via majority voting.
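The ensemble aggregation step can be sketched as below. The paper specifies majority voting but not the tie-breaking rule; requiring a strict majority and falling back to "O" is an assumption here, chosen because it is the conservative option consistent with the system's precision-oriented design.

```python
from collections import Counter

def majority_vote(ensemble_preds):
    """Aggregate per-token BIO tags from several models by majority vote.

    ensemble_preds: one tag sequence per model, all over the same tokens.
    Without a strict majority, the token defaults to "O" (assumed tie-break).
    """
    n_models = len(ensemble_preds)
    n_tokens = len(ensemble_preds[0])
    voted = []
    for i in range(n_tokens):
        counts = Counter(preds[i] for preds in ensemble_preds)
        tag, freq = counts.most_common(1)[0]
        voted.append(tag if freq > n_models // 2 else "O")
    return voted

# Three hypothetical models disagree on the last two tokens:
preds = [
    ["O", "B-SUBST", "I-SUBST"],
    ["O", "B-SUBST", "O"],
    ["O", "O", "O"],
]
print(majority_vote(preds))  # → ['O', 'B-SUBST', 'O']
```

In the full pipeline this voting runs after detokenization and span normalization, so each model contributes one word-level tag sequence per sentence.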

Experimental results show that the ensemble with sentence filtering and hyper‑parameter tuning on the full training set achieves the best performance for trigger detection: precision 0.97, recall 0.92, F1 0.94. The same configuration without filtering yields precision 0.92, recall 0.94, F1 0.94, indicating that filtering mainly boosts precision. For argument detection, the plain full‑train ensemble reaches precision 0.91, recall 0.90, F1 0.91, while adding filtering raises precision to 0.95 but lowers recall, resulting in a slight F1 drop to 0.90. Models trained on the partial dataset perform substantially worse (F1 ≈ 0.84 for triggers, ≈ 0.77 for arguments), confirming the importance of using the full data.

The discussion emphasizes that the multi‑output design enables the model to capture dependencies between triggers and their arguments, and that ensembling across diverse training strategies stabilizes predictions. The authors acknowledge that they did not incorporate large language models (LLMs) such as GPT‑4 or LLaMA, which could further improve contextual understanding and generalization. They also note that CRF decoding, while effective for BIO constraints, may not fully model complex logical relations between triggers and arguments. Future work is proposed to explore LLM integration, possibly via prompting or fine‑tuning, and to investigate more sophisticated decoders (e.g., transformer‑based sequence‑to‑sequence or graph neural networks) to better capture inter‑entity dependencies.

In conclusion, the paper delivers a robust, low‑resource solution for Spanish clinical NER of toxic substance use. By combining BETO‑based multi‑output BERT‑CRF modeling, strategic data resampling, label weighting, and a sentence‑filtering pre‑processor, the system achieves high precision and competitive F1 scores on both subtasks, demonstrating the viability of ensemble deep learning approaches in specialized biomedical NLP settings.

