Title: Small Language Models Can Use Nuanced Reasoning For Health Science Research Classification: A Microbial-Oncogenesis Case Study
ArXiv ID: 2512.06502
Date: 2025-12-06
Authors:
- Muhammed Muaaz Dawood* (School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa)
- Mohammad Zaid Moonsamy* (School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa), corresponding author
- Kaela Kokkas (Department of Clinical Microbiology and Infectious Diseases, University of the Witwatersrand, Johannesburg, South Africa)
- Hairong Wang (School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa)
- Robert F. Breiman (Infectious Diseases and Oncology Research Institute, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa)
- Richard Klein (School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa)
- Emmanuel K. Sekyi (OncoVectra, London, United Kingdom)
- Bruce A. Bassett† (Wits MIND Institute and School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa)

*Equal contribution. †Also affiliated with IDORI (Infectious Diseases and Oncology Research Institute).
📝 Abstract
Artificially intelligent (AI) co-scientists must be able to sift through research literature cost-efficiently while applying nuanced scientific reasoning. We evaluate Small Language Models (SLMs, ≤8B parameters) for classifying medical research papers. Using literature on the oncogenic potential of HMTV/MMTV-like viruses in breast cancer as a case study, we assess model performance with both zero-shot and in-context learning (ICL; few-shot prompting) strategies against frontier proprietary Large Language Models (LLMs). Llama 3 and Qwen2.5 outperform GPT-5 (API, low/high effort), Gemini 3 Pro Preview, and Meerkat in zero-shot settings, though they trail Gemini 2.5 Pro. ICL improves performance on a case-by-case basis, allowing Llama 3 and Qwen2.5 to match Gemini 2.5 Pro in binary classification. Systematic lexical-ablation experiments show that SLM decisions are often grounded in valid scientific cues but can be influenced by spurious textual artifacts, underscoring the need for interpretability in high-stakes pipelines. Our results reveal both the promise and the limitations of modern SLMs for scientific triage; pairing SLMs with simple but principled prompting strategies can approach the performance of the strongest LLMs for targeted literature filtering in co-scientist pipelines.
📄 Full Content
Small Language Models Can Use Nuanced Reasoning For Health Science
Research Classification: A Microbial-Oncogenesis Case Study
MUHAMMED MUAAZ DAWOOD∗†, School of Computer Science and Applied Mathematics, University of the
Witwatersrand, Johannesburg, South Africa
MOHAMMAD ZAID MOONSAMY∗†‡, School of Computer Science and Applied Mathematics, University of
the Witwatersrand, Johannesburg, South Africa
KAELA KOKKAS†, Department of Clinical Microbiology and Infectious Diseases, University of the Witwatersrand,
Johannesburg, South Africa
HAIRONG WANG, School of Computer Science and Applied Mathematics, University of the Witwatersrand,
Johannesburg, South Africa
ROBERT F. BREIMAN, Infectious Diseases and Oncology Research Institute (IDORI), Faculty of Health Sciences,
University of the Witwatersrand, Johannesburg, South Africa
RICHARD KLEIN, School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johan-
nesburg, South Africa
EMMANUEL K. SEKYI, OncoVectra, London, United Kingdom
BRUCE A. BASSETT†, Wits MIND Institute and School of Computer Science and Applied Mathematics, University
of the Witwatersrand, Johannesburg, South Africa
Artificially intelligent (AI) co-scientists must be able to sift through research literature cost-efficiently while applying nuanced scientific
reasoning. We evaluate Small Language Models (SLMs, ≤8B parameters) for classifying medical research papers. Using literature
on the oncogenic potential of HMTV/MMTV-like viruses in breast cancer as a case study, we assess model performance with both
zero-shot and in-context learning (ICL; few-shot prompting) strategies against frontier proprietary Large Language Models (LLMs).
Llama 3 and Qwen2.5 outperform GPT-5 (API, low/high effort), Gemini 3 Pro Preview, and Meerkat in zero-shot settings, though they trail Gemini 2.5 Pro. ICL improves performance on a case-by-case basis, allowing Llama 3 and Qwen2.5 to match Gemini 2.5 Pro in binary classification. Systematic lexical-ablation experiments show that SLM decisions are often grounded in valid scientific cues but can be influenced by spurious textual artifacts, underscoring the need for interpretability in high-stakes pipelines. Our results reveal both the promise and the limitations of modern SLMs for scientific triage; pairing SLMs with simple but principled prompting strategies can approach the performance of the strongest LLMs for targeted literature filtering in co-scientist pipelines.
∗Both authors contributed equally to this research.
†Also with Infectious Diseases and Oncology Research Institute (IDORI), Faculty of Health Sciences, University of the Witwatersrand
‡Corresponding author
Authors’ Contact Information: Muhammed Muaaz Dawood, 2425639@students.wits.ac.za, School of Computer Science and Applied Mathematics,
University of the Witwatersrand, Johannesburg, South Africa; Mohammad Zaid Moonsamy, School of Computer Science and Applied Mathematics,
University of the Witwatersrand, Johannesburg, South Africa, 2433079@students.wits.ac.za; Kaela Kokkas, Department of Clinical Microbiology and
Infectious Diseases, University of the Witwatersrand, Johannesburg, South Africa; Hairong Wang, School of Computer Science and Applied Mathematics,
University of the Witwatersrand, Johannesburg, South Africa; Robert F. Breiman, Infectious Diseases and Oncology Research Institute (IDORI), Faculty of
Health Sciences, University of the Witwatersrand, Johannesburg, South Africa; Richard Klein, School of Computer Science and Applied Mathematics,
University of the Witwatersrand, Johannesburg, South Africa; Emmanuel K. Sekyi, OncoVectra, London, United Kingdom; Bruce A. Bassett, Wits MIND
Institute and School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa.
Manuscript submitted to ACM
arXiv:2512.06502v1 [cs.CE] 6 Dec 2025
Additional Key Words and Phrases: Small language models, Biomedical literature screening, Few-shot learning, Prompt optimization,
Virus–cancer association, Relevance classification, Systematic review automation
1 Introduction
Artificial Intelligence (AI) co-scientists represent a promising new paradigm in scientific research that could revolutionize how researchers search, process, and ideate within the exponentially growing research literature [14]; an estimated 2 million or more new papers are added to the corpus of human knowledge each year [48]. To address a generic question that requires deep understanding of the existing research, an AI co-scientist must undertake three broad steps:
(1) Discover: scan the entire literature for relevant research, (2) Filter: narrow this initial set of results down to a
highly relevant subset, and (3) Understand: process this final subset in as much depth as the AI can manage.
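The Filter step above can be sketched as a binary relevance classifier over abstracts. The sketch below is illustrative only, not the paper's actual pipeline: `call_model` is a hypothetical stand-in for any local SLM inference call (e.g. a served 8B Llama 3 or Qwen2.5), and the example abstracts are invented. What it shows is the structural difference between zero-shot and few-shot (ICL) prompting for this kind of triage.

```python
# Illustrative Filter step: relevance classification of paper abstracts
# with a small language model, in zero-shot or few-shot (ICL) mode.
# `call_model` is a placeholder for a real SLM inference call.

# Hypothetical labeled examples for few-shot (in-context learning) prompting.
FEW_SHOT_EXAMPLES = [
    ("PCR detection of MMTV-like env sequences in breast tumour tissue.", "RELEVANT"),
    ("Dietary fibre intake and colorectal cancer risk: a cohort study.", "IRRELEVANT"),
]

def build_prompt(abstract: str, examples=None) -> str:
    """Build a zero-shot prompt, or a few-shot one if examples are given."""
    lines = [
        "You are screening papers on HMTV/MMTV-like viruses in breast cancer.",
        "Answer RELEVANT or IRRELEVANT only.",
    ]
    for text, label in (examples or []):
        lines += [f"Abstract: {text}", f"Answer: {label}"]
    lines += [f"Abstract: {abstract}", "Answer:"]
    return "\n".join(lines)

def classify(abstract: str, call_model, examples=None) -> str:
    """Return 'RELEVANT' or 'IRRELEVANT', defaulting to 'IRRELEVANT'
    when the model reply cannot be parsed."""
    reply = call_model(build_prompt(abstract, examples)).strip().upper()
    token = reply.split()[0] if reply else ""
    return "RELEVANT" if token.startswith("RELEVANT") else "IRRELEVANT"
```

Defaulting unparseable replies to IRRELEVANT is one reasonable choice for a high-precision triage filter; a recall-oriented pipeline might instead flag such papers for human review.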
Current commercial approaches, such as the Deep Research modes of Gemini and GPT-5 [45, 98], can execute these
tasks but function as opaque, resource-intensive black boxes. Thei