Comparative Analysis of 47 Context-Based Question Answer Models Across 8 Diverse Datasets

Context-based question answering (CBQA) models provide more accurate and relevant answers by considering contextual information. They effectively extract specific information from a given context, making them useful in applications such as user support, information retrieval, and educational platforms. In this manuscript, we benchmarked the performance of 47 CBQA models from Hugging Face on eight different datasets. This study aims to identify the best-performing model across diverse datasets without additional fine-tuning, which is valuable for practical applications where the need to retrain models for specific datasets is minimized, streamlining the deployment of these models in various contexts. The best-performing models were trained on the SQuAD v2 or SQuAD v1 datasets. The best-performing model overall was ahotrod/electra_large_discriminator_squad2_512, which yielded 43% accuracy across all datasets. We observed that the computation time of all models depends on the context length and the model size. A model's performance usually decreases as the answer length increases. Moreover, performance depends on the context complexity. We also used a genetic algorithm to improve overall accuracy by integrating responses from multiple models. ahotrod/electra_large_discriminator_squad2_512 generated the best results for bioasq10b-factoid (65.92%), biomedical_cpgQA (96.45%), QuAC (11.13%), and the Question Answer Dataset (41.6%). bert-large-uncased-whole-word-masking-finetuned-squad achieved an accuracy of 82% on the IELTS dataset.


💡 Research Summary

This paper presents a comprehensive benchmark of 47 context‑based question answering (CBQA) models available on the Hugging Face hub, evaluated on eight heterogeneous datasets without any additional fine‑tuning. The primary motivation is to assess how well off‑the‑shelf models perform in practical settings where retraining for each target domain is undesirable. The authors selected a diverse set of models covering major Transformer families (BERT, RoBERTa, ELECTRA, DeBERTa, and XLNet, among others), with sizes ranging from roughly 110 million to 340 million parameters. Each model had been fine‑tuned on SQuAD v1, SQuAD v2, or domain‑specific corpora such as biomedical literature, but none were further adapted to the evaluation datasets.

The eight evaluation corpora span general‑purpose reading comprehension (SQuAD v2, QuAC), specialized biomedical fact‑finding (BioASQ 10b‑factoid, biomedical_cpgQA), language‑learning assessment (IELTS), and two additional public QA sets with mixed topic and length characteristics. All inputs were tokenized with the respective model tokenizers, truncated or padded to a maximum context length of 512 tokens, and fed to the models in inference mode on a single GPU. Accuracy and F1 scores were recorded, together with inference latency and GPU memory consumption.
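The accuracy and F1 scoring mentioned above is conventionally done with SQuAD-style string metrics. The paper does not publish its scoring code, so the functions below are a minimal, illustrative reimplementation assuming normalized-string comparison of predicted and gold answers:

```python
import re
from collections import Counter

def normalize(text):
    """Lower-case, drop English articles and punctuation (SQuAD-style)."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """Harmonic mean of token-level precision and recall."""
    p, g = normalize(pred).split(), normalize(gold).split()
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

A prediction like "paris france" against gold "Paris" scores 0 on exact match but partial credit on F1, which is why both metrics are typically reported together.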

Overall, the average accuracy across all datasets and models was 43 %, indicating that raw, pre‑trained models can provide a modest baseline but often fall short of task‑specific requirements. The best‑performing model was ahotrod/electra_large_discriminator_squad2_512, an ELECTRA‑large architecture fine‑tuned on SQuAD v2 with a 512‑token context window. This model achieved the highest mean accuracy (just over 43 %), and it excelled on several datasets: 65.92 % on BioASQ 10b‑factoid, 96.45 % on biomedical_cpgQA, 11.13 % on QuAC, and 41.60 % on the Question Answer Dataset (QAD). Notably, on the IELTS dataset, bert-large-uncased-whole-word-masking-finetuned-squad outperformed all others with an 82 % accuracy, suggesting that the linguistic patterns of IELTS passages align closely with the SQuAD training distribution.

The analysis uncovered three dominant factors influencing performance:

  1. Pre‑training data alignment – Models trained on SQuAD v2 consistently outperformed those trained on SQuAD v1 or other corpora, especially on datasets with similar answer‑ability characteristics.
  2. Context length – When the context exceeded 256 tokens, most models suffered a 7‑point drop in accuracy, reflecting the quadratic attention cost and positional‑encoding limits of standard Transformers.
  3. Answer span length – Answers longer than five tokens reduced accuracy by an average of 12 points, indicating difficulty in correctly predicting extended spans.
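Findings 2 and 3 come from bucketing per-question results by context or answer length and comparing mean accuracy per bucket. A minimal sketch of that analysis, assuming a hypothetical per-question record schema with a 0/1 `correct` flag (not the authors' actual data format):

```python
def accuracy_by_bucket(records, key, edges):
    """Group records into buckets bounded by ascending `edges`
    and return mean accuracy per bucket.

    records: list of dicts, each with the field `key` (a length)
             and a 0/1 "correct" flag.
    edges:   ascending cut points, e.g. [5] splits at 5 tokens.
    """
    buckets = {}
    for r in records:
        # label is the first edge the value falls under, else the overflow bucket
        label = next((f"<= {e}" for e in edges if r[key] <= e),
                     f"> {edges[-1]}")
        buckets.setdefault(label, []).append(r["correct"])
    return {lab: sum(v) / len(v) for lab, v in buckets.items()}
```

Running this with `edges=[5]` over per-question results would reproduce the kind of short-answer vs. long-answer comparison behind finding 3.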

Domain complexity also played a crucial role. Biomedical datasets, rich in specialized terminology, penalized models whose vocabularies lacked coverage, despite high overall scores for the ELECTRA‑large model. Conversely, the QuAC conversational benchmark, which requires multi‑turn reasoning, yielded the lowest accuracies (≈11 %) across the board, highlighting the need for models explicitly designed for dialogue contexts.

To mitigate these gaps, the authors experimented with a genetic‑algorithm‑driven ensemble. They initialized a population of candidate ensembles, each assigning different weights to individual model predictions, and evolved them through crossover and mutation to maximize validation accuracy. The resulting meta‑model improved the overall average accuracy by 3‑5 % absolute, with the most pronounced gains on the biomedical tasks (2‑3 % additional improvement). This demonstrates that diverse error patterns among models can be harnessed without further training.
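The ensemble search described above can be sketched as a genetic algorithm over per-model weight vectors. Everything in this sketch is illustrative: the weighted-vote scheme over candidate answer strings, the truncation selection, and the population/mutation settings are assumptions, not the authors' published configuration.

```python
import random

def evolve_ensemble(model_preds, labels, pop_size=20, generations=30, seed=0):
    """Evolve per-model weights maximising accuracy of a weighted vote.

    model_preds: model_preds[m][q] is model m's answer string to question q
                 (hypothetical setup: one candidate answer per model).
    labels:      gold answer strings.
    """
    rng = random.Random(seed)
    n_models, n_questions = len(model_preds), len(labels)

    def fitness(weights):
        correct = 0
        for q in range(n_questions):
            # each model's answer accumulates that model's weight as a score
            scores = {}
            for m in range(n_models):
                ans = model_preds[m][q]
                scores[ans] = scores.get(ans, 0.0) + weights[m]
            correct += (max(scores, key=scores.get) == labels[q])
        return correct / n_questions

    # initial population of random weight vectors
    pop = [[rng.random() for _ in range(n_models)] for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_models)   # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:             # point mutation
                child[rng.randrange(n_models)] = rng.random()
            children.append(child)
        pop = parents + children               # elitism: parents survive
    best = max(pop, key=fitness)
    return best, fitness(best)
```

Because the top half of each generation survives unchanged, the best validation accuracy is non-decreasing over generations, which is what makes this a safe wrapper around fixed, pre-trained models.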

Inference cost analysis revealed a clear trade‑off between model size and latency. Larger models (e.g., DeBERTa‑large) achieved modestly higher accuracies but required up to 12 GB of GPU memory and incurred latency spikes beyond 300 ms per query, which may be prohibitive for real‑time applications. ELECTRA‑large struck a favorable balance, delivering the best accuracy while maintaining sub‑200 ms latency on a single RTX 3090 GPU.
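Per-query latency figures like these are typically gathered by timing repeated inference calls after a few warm-up runs (which absorb one-off costs such as CUDA kernel compilation). A minimal harness, with `answer_fn` standing in for a loaded QA pipeline (hypothetical interface):

```python
import time
import statistics

def benchmark(answer_fn, queries, warmup=2):
    """Time `answer_fn` on each query; return mean and p95 latency in ms."""
    for q in queries[:warmup]:          # warm-up runs, excluded from stats
        answer_fn(q)
    times = []
    for q in queries:
        t0 = time.perf_counter()
        answer_fn(q)
        times.append((time.perf_counter() - t0) * 1000.0)
    times.sort()
    return {
        "mean_ms": statistics.mean(times),
        "p95_ms": times[int(0.95 * (len(times) - 1))],
    }
```

Reporting the p95 alongside the mean is what exposes the latency spikes noted for the larger models, which a mean alone would hide.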

The paper concludes with actionable recommendations for practitioners: select SQuAD v2‑fine‑tuned ELECTRA‑large when a single robust model is needed; consider BERT‑large for English‑language educational contexts; employ lightweight BERT‑base variants when memory is constrained; and explore lightweight ensemble techniques (e.g., genetic ensembles) to squeeze additional performance without full fine‑tuning.

Future work suggested includes integrating long‑context architectures (Longformer, Reformer) to alleviate the context‑length bottleneck, investigating parameter‑efficient adaptation methods (e.g., adapters, LoRA) for domain alignment, and extending the benchmark to multimodal QA scenarios. By providing a transparent, reproducible evaluation pipeline, the study offers a valuable reference point for developers aiming to deploy CBQA systems in production environments with minimal training overhead.

