ELM: A Hybrid Ensemble of Language Models for Automated Tumor Group Classification in Population-Based Cancer Registries

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv source.

Background: Population-based cancer registries (PBCRs) manually extract data from unstructured pathology reports, a labor-intensive process in which assigning reports to tumor groups can consume 900 person-hours annually for approximately 100,000 reports at a medium-sized registry. Current automated rule-based systems fail to handle the linguistic complexity of this classification task. Materials and Methods: We present ELM (Ensemble of Language Models), a novel hybrid approach combining small encoder-only language models and large language models (LLMs). ELM employs an ensemble of six fine-tuned encoder-only models: three analyzing the top portion and three analyzing the bottom portion of each report to maximize text coverage given token limits. A tumor group is assigned when at least five of the six models agree; otherwise, an LLM arbitrates using a carefully curated prompt constrained to likely tumor groups. Results: On a held-out test set of 2,058 pathology reports spanning 19 tumor groups, ELM achieves weighted precision and recall of 0.94, a statistically significant improvement (p<0.001) over encoder-only ensembles (0.91 F1-score) and a substantial improvement over rule-based approaches. ELM shows particular gains for challenging categories, including leukemia (F1: 0.76 to 0.88), lymphoma (0.76 to 0.89), and skin cancer (0.44 to 0.58). Discussion: Deployed in production at the British Columbia Cancer Registry, ELM has reduced manual review requirements by approximately 60-70%, saving an estimated 900 person-hours annually while maintaining data quality standards. Conclusion: ELM represents the first successful deployment of a hybrid encoder-LLM architecture for tumor group classification in a real-world PBCR setting, demonstrating how the strategic combination of language models can achieve both high accuracy and operational efficiency.


💡 Research Summary

The paper introduces ELM (Ensemble of Language Models), a hybrid system designed to automate tumor group classification of pathology reports in population‑based cancer registries (PBCRs). Manual abstraction of these unstructured reports is labor‑intensive, consuming roughly 900 person‑hours annually for a medium‑sized registry handling about 100,000 reports. Existing rule‑based tools such as eMaRC fail to capture the linguistic complexity of pathology narratives, leaving about 40% of reports unassigned and therefore subject to full manual review.

ELM combines six fine‑tuned small encoder‑only transformer models with a large language model (LLM) for arbitration. The six models are split into three that process the first 512 tokens (the “top” of the report) and three that process the last 512 tokens (the “bottom”). This segmentation exploits the typical structure of pathology reports, where diagnostic conclusions appear early and interpretive commentary appears later, while respecting the token limits of BERT‑style architectures. Each model independently predicts one of 19 tumor groups; the majority label is taken, but it is only accepted if at least five of the six models agree. Additionally, a predefined set of historically difficult groups (e.g., skin cancer, cervix, multiple myeloma, primary unknown) is always routed to LLM arbitration regardless of vote count.
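The voting logic described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the group names in `HARD_GROUPS` and the function names are hypothetical.

```python
from collections import Counter

# Illustrative labels for the groups the paper says are always escalated.
HARD_GROUPS = {"skin", "cervix", "multiple_myeloma", "primary_unknown"}

def ensemble_decision(predictions, agreement_threshold=5):
    """Decide a tumor group from six encoder votes, or defer to the LLM.

    predictions: six tumor-group labels (three from the "top" models,
    three from the "bottom" models). Returns the agreed label, or None
    to signal that LLM arbitration is required.
    """
    label, votes = Counter(predictions).most_common(1)[0]
    # Escalate when agreement is below 5/6 or the label is a known-hard group.
    if votes < agreement_threshold or label in HARD_GROUPS:
        return None
    return label
```

For example, five or six matching votes for a non-hard group yield a direct assignment, while a 4-2 split, or any vote for a hard group, defers the report to the LLM.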

When a case does not meet the confidence threshold or belongs to the “hard” set, it is sent to an LLM—specifically Mistral Nemo Instruct‑2407, a 12‑billion‑parameter instruction‑tuned model. The LLM receives the full report together with a carefully engineered prompt that (1) positions the model as a “specialized pathology assistant,” (2) constrains the answer space to a small subset of likely tumor groups derived from the encoder ensemble’s top predictions, and (3) explicitly outlines nuanced distinctions (e.g., leukemia vs. lymphoma, melanoma vs. non‑melanoma skin cancer, cervical vs. other gynecologic cancers). The prompt also requires a JSON response containing both the predicted group and a brief reasoning, providing traceability and explainability.
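A sketch of this arbitration step is below. The wording, function names, and JSON schema are illustrative assumptions; the paper's exact prompt is not reproduced here. Validating the returned label against the candidate set mirrors the answer-space constraint that limits hallucination.

```python
import json

def build_arbitration_prompt(report_text, candidate_groups):
    """Assemble an arbitration prompt (illustrative wording only)."""
    return (
        "You are a specialized pathology assistant.\n"
        "Classify the report into exactly one of: "
        + ", ".join(candidate_groups) + ".\n"
        'Respond with JSON: {"tumor_group": "...", "reasoning": "..."}.\n\n'
        "Report:\n" + report_text
    )

def parse_llm_response(raw, candidate_groups):
    """Parse the JSON reply, rejecting labels outside the allowed set."""
    data = json.loads(raw)
    group = data["tumor_group"]
    if group not in candidate_groups:
        raise ValueError(f"LLM returned out-of-set label: {group}")
    return group, data.get("reasoning", "")
```

The reasoning field in the structured reply is what provides the traceability mentioned above: registrars can audit why a borderline report was assigned to a given group.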

The authors evaluated ELM on a held‑out test set of 2,058 pathology reports from 2023‑2024, each manually labeled at the report level by expert registrars. ELM achieved weighted precision and recall of 0.94 (F1 = 0.94), a statistically significant improvement over the encoder‑only ensemble (F1 = 0.91, p < 0.001) and a large margin over rule‑based approaches. Notably, performance gains were most pronounced for challenging categories: leukemia (F1 0.76 → 0.88), lymphoma (0.76 → 0.89), and skin cancer (0.44 → 0.58).
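For readers unfamiliar with the reported metric, support-weighted F1 averages per-class F1 scores weighted by each class's true-label count, so frequent tumor groups dominate the aggregate. A minimal stdlib-only computation (not the authors' evaluation code) looks like this:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted F1: per-class F1 averaged with weights
    proportional to each class's count in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / n
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += (n / total) * f1
    return score
```

Because the metric is support-weighted, the large jumps in rare classes such as skin cancer (F1 0.44 → 0.58) move the aggregate less than the per-class numbers suggest, which is why the paper reports both.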

From an operational perspective, only about 15‑20 % of reports require LLM processing; the remaining 80‑85 % are resolved by the fast encoder ensemble (≈0.6 s per report). Including LLM arbitration, the average processing time per report is ≈0.85 s, compared with 2‑3 s for a pure LLM solution—a 3‑4× speedup and substantial cost reduction. Deployed in production at the British Columbia Cancer Registry, ELM reduced manual review workload by 60‑70 %, translating to an estimated annual saving of 900 person‑hours while maintaining data quality standards.
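The latency figures above can be sanity-checked with a back-of-envelope expectation. The per-report LLM time below is an assumed value consistent with the quoted 2-3 s range for a pure-LLM pass; only the ≈0.6 s encoder time and the 15-20% escalation rate come from the text.

```python
def expected_latency(p_llm, t_encoder=0.6, t_llm=2.0):
    """Average per-report latency of the hybrid pipeline: a weighted
    mix of the fast encoder path and the slower LLM arbitration path."""
    return (1 - p_llm) * t_encoder + p_llm * t_llm
```

With roughly 18% of reports escalated, this gives 0.82 × 0.6 + 0.18 × 2.0 ≈ 0.85 s, matching the reported average and illustrating why routing only ambiguous cases to the LLM yields the 3-4× speedup.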

Key insights include: (1) leveraging report structure to split documents into top and bottom segments effectively overcomes token‑limit constraints; (2) a high‑agreement voting threshold ensures that only ambiguous or difficult cases invoke the expensive LLM, balancing accuracy and efficiency; (3) prompt engineering that limits the LLM’s answer space and enforces structured output dramatically reduces hallucination risk and provides explainability. The authors argue that this hybrid architecture can be generalized to other clinical NLP tasks where large models are desirable but computational resources and regulatory constraints limit their unrestricted use.

