Small Language Models Can Use Nuanced Reasoning For Health Science Research Classification: A Microbial-Oncogenesis Case Study

Reading time: 5 minutes
...

📝 Original Info

  • Title: Small Language Models Can Use Nuanced Reasoning For Health Science Research Classification: A Microbial-Oncogenesis Case Study
  • ArXiv ID: 2512.06502
  • Date: 2025-12-06
  • Authors:
    - Muhammed Muaaz Dawood * (School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa)
    - Mohammad Zaid Moonsamy * (School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa) – Corresponding author
    - Kaela Kokkas (Department of Clinical Microbiology and Infectious Diseases, University of the Witwatersrand, Johannesburg, South Africa)
    - Hairong Wang (School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa)
    - Robert F. Breiman (Infectious Diseases and Oncology Research Institute, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa)
    - Richard Klein (School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa)
    - Emmanuel K. Sekyi (OncoVectra, London, United Kingdom)
    - Bruce A. Bassett † (Wits MIND Institute and School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa)
    * Equal contribution
    † Also with IDORI (Infectious Diseases and Oncology Research Institute)

📝 Abstract

Artificially intelligent (AI) co-scientists must be able to sift through research literature cost-efficiently while applying nuanced scientific reasoning. We evaluate Small Language Models (SLMs, ≤ 8B parameters) for classifying medical research papers. Using literature on the oncogenic potential of HMTV/MMTV-like viruses in breast cancer as a case study, we assess model performance with both zero-shot and in-context learning (ICL; few-shot prompting) strategies against frontier proprietary Large Language Models (LLMs). Llama 3 and Qwen2.5 outperform GPT-5 (API, low/high effort), Gemini 3 Pro Preview, and Meerkat in zero-shot settings, though they trail Gemini 2.5 Pro. ICL leads to improved performance on a case-by-case basis, allowing Llama 3 and Qwen2.5 to match Gemini 2.5 Pro in binary classification. Systematic lexical-ablation experiments show that SLM decisions are often grounded in valid scientific cues but can be influenced by spurious textual artifacts, underscoring the need for interpretability in high-stakes pipelines. Our results reveal both the promise and the limitations of modern SLMs for scientific triage; pairing SLMs with simple but principled prompting strategies can approach the performance of the strongest LLMs for targeted literature filtering in co-scientist pipelines.
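The two prompting regimes compared in the abstract can be illustrated with a minimal sketch. The prompt wording, label set, and example format below are hypothetical illustrations, not the authors' actual prompts.

```python
# Minimal sketch of zero-shot vs. few-shot (ICL) relevance prompting.
# The prompt text and labels are hypothetical; the paper's actual
# prompts and label taxonomy may differ.

LABELS = ["relevant", "irrelevant"]

def zero_shot_prompt(abstract: str) -> str:
    """Zero-shot: task description only, no worked examples."""
    return (
        "Classify whether this paper is relevant to the question of "
        "HMTV/MMTV-like viral involvement in human breast cancer.\n"
        f"Answer with one of {LABELS}.\n\n"
        f"Abstract: {abstract}\nAnswer:"
    )

def few_shot_prompt(abstract: str, examples: list[tuple[str, str]]) -> str:
    """ICL: prepend labeled (abstract, label) demonstrations, then the query."""
    shots = "\n\n".join(
        f"Abstract: {ex}\nAnswer: {label}" for ex, label in examples
    )
    return shots + "\n\n" + zero_shot_prompt(abstract)
```

The only difference between the two regimes is the prepended demonstrations, which is the mechanism the paper credits for letting the SLMs close the gap with Gemini 2.5 Pro on the binary task.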

💡 Deep Analysis

Figure 1

📄 Full Content

Small Language Models Can Use Nuanced Reasoning For Health Science Research Classification: A Microbial-Oncogenesis Case Study

MUHAMMED MUAAZ DAWOOD∗†, School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa
MOHAMMAD ZAID MOONSAMY∗†‡, School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa
KAELA KOKKAS†, Department of Clinical Microbiology and Infectious Diseases, University of the Witwatersrand, Johannesburg, South Africa
HAIRONG WANG, School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa
ROBERT F. BREIMAN, Infectious Diseases and Oncology Research Institute (IDORI), Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
RICHARD KLEIN, School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa
EMMANUEL K. SEKYI, OncoVectra, London, United Kingdom
BRUCE A. BASSETT†, Wits MIND Institute and School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa

Artificially intelligent (AI) co-scientists must be able to sift through research literature cost-efficiently while applying nuanced scientific reasoning. We evaluate Small Language Models (SLMs, ≤8B parameters) for classifying medical research papers. Using literature on the oncogenic potential of HMTV/MMTV-like viruses in breast cancer as a case study, we assess model performance with both zero-shot and in-context learning (ICL; few-shot prompting) strategies against frontier proprietary Large Language Models (LLMs). Llama 3 and Qwen2.5 outperform GPT-5 (API, low/high effort), Gemini 3 Pro Preview, and Meerkat in zero-shot settings, though trailing Gemini 2.5 Pro. ICL leads to improved performance on a case-by-case basis, allowing Llama 3 and Qwen2.5 to match Gemini 2.5 Pro in binary classification.
Systematic lexical-ablation experiments show that SLM decisions are often grounded in valid scientific cues but can be influenced by spurious textual artifacts, underscoring the need for interpretability in high-stakes pipelines. Our results reveal both the promise and the limitations of modern SLMs for scientific triage; pairing SLMs with simple but principled prompting strategies can approach the performance of the strongest LLMs for targeted literature filtering in co-scientist pipelines.

∗Both authors contributed equally to this research.
†Also with Infectious Diseases and Oncology Research Institute (IDORI), Faculty of Health Sciences, University of the Witwatersrand.
‡Corresponding author.

Authors' Contact Information: Muhammed Muaaz Dawood, 2425639@students.wits.ac.za, School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa; Mohammad Zaid Moonsamy, 2433079@students.wits.ac.za, School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa; Kaela Kokkas, Department of Clinical Microbiology and Infectious Diseases, University of the Witwatersrand, Johannesburg, South Africa; Hairong Wang, School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa; Robert F. Breiman, Infectious Diseases and Oncology Research Institute (IDORI), Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa; Richard Klein, School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa; Emmanuel K. Sekyi, OncoVectra, London, United Kingdom; Bruce A. Bassett, Wits MIND Institute and School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa.

Manuscript submitted to ACM. arXiv:2512.06502v1 [cs.CE] 6 Dec 2025
Additional Key Words and Phrases: Small language models, Biomedical literature screening, Few-shot learning, Prompt optimization, Virus–cancer association, Relevance classification, Systematic review automation

1 Introduction

Artificial Intelligence (AI) co-scientists represent a promising new paradigm in scientific research that could revolutionize how researchers search, process and ideate within the exponentially growing research literature [14], with an estimated more than 2 million new papers added to the corpus of human knowledge each year [48]. In addressing a generic question that requires deep understanding of the existing research, an AI co-scientist must undertake three broad steps: (1) Discover: scan the entire literature for relevant research, (2) Filter: narrow this initial set of results down to a highly relevant subset, and (3) Understand: process this final subset in as much depth as the AI can manage. Current commercial approaches, such as the Deep Research modes of Gemini and GPT-5 [45, 98], can execute these tasks but function as opaque, resource-intensive black boxes. Thei
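The lexical-ablation idea described in the abstract can be sketched as follows: delete a candidate cue term from the input and check whether the model's decision flips. The `classify` callable below stands in for any SLM call and is a hypothetical placeholder, not the paper's actual experimental harness.

```python
# Sketch of a lexical-ablation probe: remove one cue term at a time
# and record whether the predicted label changes. `classify` is a
# placeholder for an SLM classification call.
import re
from typing import Callable

def ablate_term(text: str, term: str) -> str:
    """Delete every occurrence of a cue term (case-insensitive)."""
    return re.sub(re.escape(term), "", text, flags=re.IGNORECASE)

def cue_sensitivity(classify: Callable[[str], str],
                    text: str, terms: list[str]) -> dict[str, bool]:
    """Map each term to True if removing it flips the predicted label."""
    base = classify(text)
    return {t: classify(ablate_term(text, t)) != base for t in terms}
```

A term that flips the label when removed is one the model's decision depends on; the paper's finding is that such terms are sometimes valid scientific cues and sometimes spurious textual artifacts.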

📸 Image Gallery

P26_Llama_Analysis.png P26_Qwen_Analysis.png P91_Qwen_Analysis.png P94_Llama_Analysis.png PMM-cropped.png acm-jdslogo.png prec-sens-all-classes.png prec-sens-graph.png

Reference

This content is AI-processed based on open access ArXiv data.
