BioACE: An Automated Framework for Biomedical Answer and Citation Evaluations
With the increasing use of large language models (LLMs) for generating answers to biomedical questions, it is crucial to evaluate both the quality of the generated answers and the references provided to support the facts they state. Evaluating LLM-generated text remains a challenge for question answering, retrieval-augmented generation (RAG), summarization, and many other natural language processing tasks in the biomedical domain, because verifying consistency with the scientific literature and interpreting complex medical terminology typically require expert assessment. In this work, we propose BioACE, an automated framework for evaluating biomedical answers and the citations that support the facts stated in those answers. BioACE evaluates answers along multiple aspects, including completeness, correctness, precision, and recall, measured against ground-truth nuggets. We developed automated approaches for each of these aspects and performed extensive experiments to assess and analyze their correlation with human evaluations. In addition, we considered multiple existing approaches, such as natural language inference (NLI), pre-trained language models, and LLMs, to evaluate the quality of the evidence cited from the biomedical literature to support the generated answers. Based on these experiments and analyses, we identify the best-performing approaches for biomedical answer and citation evaluation and release them as part of the BioACE evaluation package (https://github.com/deepaknlp/BioACE).
💡 Research Summary
The paper introduces BioACE, an automated framework designed to evaluate both the content of biomedical answers generated by large language models (LLMs) and the citations that are meant to support those answers. The authors argue that existing evaluation methods—largely manual or based on lexical overlap—are insufficient for the biomedical domain, where nuanced terminology, paraphrasing, and factual correctness are critical. BioACE therefore decomposes answers into “nuggets,” i.e., atomic factual statements, and assesses four key dimensions: completeness, correctness, precision, and recall, each measured against a set of ground‑truth nuggets.
For the precision‑recall component, the authors compare three sentence‑level embedding models—All‑MiniLM‑L6‑v2, All‑mpnet‑base‑v2, and Sup‑SimCSE‑RoBERTa‑large—using cosine similarity and a calibrated probability threshold. Sup‑SimCSE‑RoBERTa‑large achieves the best balance (precision 44.68 %, recall 58.39 %, F1 50.62 %). This demonstrates that a robust semantic similarity model can capture paraphrastic matches that lexical metrics miss.
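The thresholded matching step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes sentence and nugget embeddings have already been computed (e.g., with Sup‑SimCSE‑RoBERTa‑large), and the threshold value here is illustrative rather than the paper's calibrated one.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nugget_prf(answer_vecs, nugget_vecs, threshold=0.7):
    """Match answer sentences to ground-truth nuggets by cosine similarity.

    A nugget counts as recalled if some answer sentence exceeds the
    threshold; an answer sentence counts toward precision if it matches
    some nugget. Returns (precision, recall, F1).
    """
    matched_nuggets = set()
    matched_sentences = set()
    for i, a in enumerate(answer_vecs):
        for j, n in enumerate(nugget_vecs):
            if cosine(a, n) >= threshold:
                matched_sentences.add(i)
                matched_nuggets.add(j)
    precision = len(matched_sentences) / len(answer_vecs) if answer_vecs else 0.0
    recall = len(matched_nuggets) / len(nugget_vecs) if nugget_vecs else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

In practice the toy vectors below would be replaced by model embeddings; the scoring logic is unchanged.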
Completeness is evaluated in two stages. First, several pre‑trained language models (PLMs) are fine‑tuned on a training split of the BioGen2024 dataset, with hyper‑parameters selected on a validation split. RoBERTa‑Large emerges as the top PLM (weighted F1 75.37 %). Second, a suite of LLMs—including Llama‑3.3‑70B‑Instruct, Llama‑3‑8B‑Instruct, Qwen‑3‑14B, and Mistral‑7B‑Instruct‑v0.3—are tested in zero‑shot and fine‑tuned regimes. Llama‑3.3‑70B‑Instruct attains the highest zero‑shot F1 (76.20 %) and remains competitive after fine‑tuning (78.33 %). Notably, Mistral‑7B‑Instruct‑v0.3 shows a 5.15‑point gain when fine‑tuned, indicating that model size alone does not guarantee superiority; task‑specific adaptation matters.
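For the zero-shot LLM regime, completeness judgment reduces to prompting the model with the question, the answer, and the ground-truth nuggets, then parsing a label from its reply. The sketch below is a hypothetical prompt builder: the exact wording and label set are our assumptions, not the paper's prompts.

```python
def build_completeness_prompt(question, answer, nuggets):
    """Assemble a zero-shot prompt asking an LLM to judge completeness.

    The wording and labels are illustrative; the paper's prompts may differ.
    """
    nugget_lines = "\n".join(f"- {n}" for n in nuggets)
    return (
        "You are assessing a biomedical answer.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Ground-truth facts (nuggets):\n"
        f"{nugget_lines}\n"
        "Does the answer cover all of the nuggets above? "
        "Reply with exactly one label: COMPLETE or INCOMPLETE."
    )

def parse_completeness_label(llm_output):
    """Map raw model output to a binary completeness label.

    Checks for the longer label first, since 'INCOMPLETE' contains
    'COMPLETE' as a substring.
    """
    return "INCOMPLETE" if "INCOMPLETE" in llm_output.upper() else "COMPLETE"
```

The same prompt can serve both zero-shot inference and supervised fine-tuning, where the gold label replaces the model's reply during training.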
Correctness is framed as a binary classification problem (answer sentence correct vs. incorrect). Classical SVM and logistic regression baselines are outperformed dramatically by PLMs: RoBERTa‑Large reaches 97.65 % accuracy and 99.34 % AUC, essentially matching human judgments. In contrast, LLMs in a zero‑shot setting perform poorly (e.g., Llama‑3‑8B‑Instruct precision 28.77 %). Fine‑tuning yields modest improvements for only a few models, underscoring that current LLMs are better at generation than factual verification.
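The reported AUC for this binary correctness task is the standard ROC AUC: the probability that a randomly chosen correct sentence receives a higher classifier score than a randomly chosen incorrect one. A minimal pairwise implementation of that metric (not the paper's evaluation code, which presumably uses a standard library):

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outscores a
    random negative, counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

An AUC near 0.99, as reported for RoBERTa-Large, means the classifier's scores almost perfectly rank correct sentences above incorrect ones.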
The authors also compare embedding‑based similarity with natural language inference (NLI) scoring for document‑answer and evidence‑answer pairs. While NLI assigns low probabilities to negative pairs, cosine similarity often remains high, suggesting that similarity alone cannot reliably discriminate false evidence.
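One practical consequence of this observation is to treat cosine similarity only as a cheap pre-filter and let the NLI entailment probability make the final support decision. The decision rule below is our own sketch with illustrative thresholds, not a rule stated in the paper; it assumes `cosine_sim` and `entail_prob` have been produced upstream by an embedding model and an NLI model.

```python
def evidence_supports(cosine_sim, entail_prob,
                      sim_floor=0.5, entail_threshold=0.7):
    """Gate an evidence-answer pair on NLI entailment, not similarity alone.

    Cosine similarity can stay high even for negative (non-supporting)
    pairs, so it only filters out clearly unrelated evidence; the
    decisive signal is the entailment probability. Thresholds are
    illustrative assumptions.
    """
    if cosine_sim < sim_floor:  # cheap pre-filter for unrelated pairs
        return False
    return entail_prob >= entail_threshold
```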
Citation evaluation extends the framework to assess whether cited PubMed abstracts truly support the answer. Multiple transformer‑based models (Llama‑3.3, FLAN‑T5, FLAN‑UL2) and specialized scoring functions (alignscore, summacconv, summaczs) are evaluated. Llama‑3.3‑Base achieves the most balanced performance (precision ≈ 76 %, recall ≈ 76 %) and modest gains after LoRA‑based fine‑tuning (≈ 78 %/78 %). Summarization‑based scores perform poorly, especially on precision, indicating that simple summarization similarity is insufficient for citation verification.
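The citation precision and recall figures above can be computed once a per-pair support judgment is available. The aggregation sketch below assumes a `supports` callable (e.g., backed by an entailment model over PubMed abstracts; hypothetical here) and mirrors the usual citation-evaluation convention: precision over cited documents, recall over answer statements.

```python
def citation_precision_recall(statement_citations, supports):
    """Aggregate per-pair support judgments into citation metrics.

    statement_citations: list of citation-id lists, one per answer statement.
    supports: callable (statement_index, citation_id) -> bool.

    Precision: fraction of cited documents that support their statement.
    Recall: fraction of statements with at least one supporting citation.
    """
    cited = 0
    supporting = 0
    covered = 0
    for i, cites in enumerate(statement_citations):
        hit = False
        for c in cites:
            cited += 1
            if supports(i, c):
                supporting += 1
                hit = True
        if hit:
            covered += 1
    precision = supporting / cited if cited else 0.0
    recall = covered / len(statement_citations) if statement_citations else 0.0
    return precision, recall
```

Under this convention, over-citing irrelevant abstracts lowers precision, while statements left without any supporting citation lower recall.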
Across all experiments, the best-performing combination—Sup‑SimCSE‑RoBERTa‑large for nugget matching, RoBERTa‑Large for correctness, and Llama‑3.3 for citation matching—exhibits the highest correlation with human annotations. The discussion highlights that recall, when computed in a cluster‑based manner akin to prior TREC QA evaluations, aligns closely with human‑derived recall, reinforcing its utility as an automated metric. The authors note limitations: NLI models struggle with negative examples, fine‑tuning does not uniformly improve LLMs, and prompt engineering remains a critical yet under‑explored factor.
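The cluster-based recall mentioned above can be sketched in a few lines: nuggets expressing the same fact are grouped, and a cluster counts as covered if any of its members was matched. This is a minimal reading of the TREC-style scheme, with the clustering itself assumed to be given.

```python
def cluster_recall(matched_nuggets, nugget_clusters):
    """Cluster-based recall in the spirit of TREC QA nugget evaluation.

    matched_nuggets: set of nugget ids matched in the answer.
    nugget_clusters: iterable of sets of nugget ids, each set holding
    nuggets that express the same underlying fact. A cluster is covered
    if any of its members was matched.
    """
    clusters = list(nugget_clusters)
    covered = sum(1 for cluster in clusters
                  if any(n in matched_nuggets for n in cluster))
    return covered / len(clusters) if clusters else 0.0
```

Grouping paraphrastic nuggets this way prevents an answer from being penalized for stating a fact once when the ground truth lists it in several equivalent forms.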
In conclusion, BioACE provides a comprehensive, modular pipeline for automated evaluation of biomedical QA systems, integrating semantic similarity, PLM‑based factual verification, and transformer‑based citation assessment. The extensive empirical analysis demonstrates that a hybrid approach—leveraging both embedding similarity and high‑capacity language models—outperforms any single method. Future work is suggested in three directions: (1) refined prompt design and hyper‑parameter optimization for LLMs, (2) incorporation of multimodal evidence such as tables and figures, and (3) development of domain‑specific NLI models to better capture biomedical entailment. BioACE’s open‑source release (https://github.com/deepaknlp/BioACE) positions it as a valuable benchmark and tool for the community aiming to build trustworthy, evidence‑grounded biomedical conversational agents.