Evaluating Monolingual and Multilingual Large Language Models for Greek Question Answering: The DemosQA Benchmark
Recent advancements in Natural Language Processing and Deep Learning have enabled the development of Large Language Models (LLMs), which have significantly advanced the state-of-the-art across a wide range of tasks, including Question Answering (QA). Despite these advancements, research on LLMs has primarily targeted high-resourced languages (e.g., English), and only recently has attention shifted toward multilingual models. However, these models demonstrate a training data bias towards a small number of popular languages or rely on transfer learning from high- to under-resourced languages; this may lead to a misrepresentation of social, cultural, and historical aspects. To address this challenge, monolingual LLMs have been developed for under-resourced languages; however, their effectiveness remains less studied when compared to multilingual counterparts on language-specific tasks. In this study, we address this research gap in Greek QA by contributing: (i) DemosQA, a novel dataset, which is constructed using social media user questions and community-reviewed answers to better capture the Greek social and cultural zeitgeist; (ii) a memory-efficient LLM evaluation framework adaptable to diverse QA datasets and languages; and (iii) an extensive evaluation of 11 monolingual and multilingual LLMs on 6 human-curated Greek QA datasets using 3 different prompting strategies. We release our code and data to facilitate reproducibility.
💡 Research Summary
This paper addresses the notable gap in evaluating large language models (LLMs) for Greek question answering (QA), a low‑resource language that has received limited attention compared to English‑centric research. The authors make three primary contributions. First, they introduce DemosQA, a novel Greek QA benchmark derived from the Reddit community r/greece. By programmatically harvesting posts tagged as questions, applying stringent filtering (minimum five up‑votes and five answers, removal of duplicates, images, adult content, and low‑quality posts), and then manually curating over 2,100 candidate pairs, they produce a high‑quality dataset where each question is accompanied by four candidate answers and a reference answer selected as the most up‑voted comment. This design captures real‑world user preferences and reflects contemporary Greek social, cultural, and political discourse, offering a complementary perspective to existing curated datasets that focus on medical, legal, or academic domains.
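The filtering pipeline described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' actual code: the field names (`score`, `num_comments`, `has_image`, `nsfw`) and the deduplication-by-title heuristic are assumptions for the sake of the example.

```python
def filter_posts(posts, min_upvotes=5, min_answers=5):
    """Hypothetical sketch of the DemosQA filtering step: keep question
    posts with at least `min_upvotes` up-votes and `min_answers` answers,
    drop duplicates, images, and adult content, and attach the most
    up-voted comment as the reference answer."""
    seen_titles = set()
    kept = []
    for post in posts:
        if post["score"] < min_upvotes or post["num_comments"] < min_answers:
            continue  # stringent thresholds: >= 5 up-votes, >= 5 answers
        title = post["title"].strip().lower()
        if title in seen_titles:
            continue  # remove duplicate questions
        if post.get("has_image") or post.get("nsfw"):
            continue  # remove images and adult content
        seen_titles.add(title)
        # reference answer: the most up-voted comment on the post
        post["reference_answer"] = max(
            post["comments"], key=lambda c: c["score"]
        )["body"]
        kept.append(post)
    return kept
```

In the paper's pipeline this programmatic stage is followed by manual curation of the surviving candidate pairs, which the sketch does not attempt to model.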
Second, the authors propose a memory‑efficient evaluation framework that leverages 4‑bit quantization (Dettmers & Zettlemoyer, 2023). By quantizing model weights to 4 bits, they dramatically reduce GPU memory requirements while preserving accuracy, enabling the systematic testing of models with 7–12 billion parameters on commodity hardware. The framework includes standardized data loading, quantized model inference, automated prompt generation, and result aggregation, thereby enhancing reproducibility across diverse QA datasets and languages.
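The memory saving from 4-bit quantization is easy to quantify with back-of-the-envelope arithmetic. The helper below is a minimal sketch (weights only; it ignores activations, the KV cache, and quantization bookkeeping overhead), not a description of the authors' framework:

```python
def model_memory_gb(n_params_billion, bits_per_weight):
    """Approximate weight-storage footprint in GiB for a model with
    `n_params_billion` parameters stored at `bits_per_weight` bits each.
    Weights only: activations, KV cache, and quantization metadata
    are deliberately excluded."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 2**30

# For the largest model in the study (Mistral Nemo 12B):
fp16_gb = model_memory_gb(12, 16)  # ~22.4 GiB in fp16
int4_gb = model_memory_gb(12, 4)   # ~5.6 GiB at 4 bits
```

The 4x reduction is what makes a 12B-parameter model fit comfortably on a single commodity GPU.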
Third, they conduct an extensive empirical study of eleven LLMs—both monolingual Greek models (Meltemi 7B, Llama Krikri 8B) and multilingual models that support Greek (Mistral Nemo 12B, Llama 3.1 8B, Gemma 2 9B, Teuken 7B, EuroLLM 9B, Aya Expanse 8B) as well as the proprietary GPT‑4o mini. All models are instruction‑tuned variants with at least 7 billion parameters, allowing direct prompting without additional fine‑tuning. The evaluation spans six human‑curated Greek QA datasets (including medical MCQA, TruthfulQA, BELEBELE, INCLUDE, ASEP MCQA) and the newly released DemosQA.

Three prompting strategies are examined: (1) Zero‑Shot (question only), (2) Few‑Shot (question plus answer choices), and (3) Chain‑of‑Thought (question, choices, and a meta‑instruction indicating that the correct answer is the most up‑voted community response). This systematic variation isolates the impact of additional context and reasoning cues on model performance.
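The three strategies differ only in how much context is packed into the prompt, so a single template builder can cover all of them. The function below is an illustrative sketch; the exact wording the authors use is not reproduced here, and the instruction strings are assumptions:

```python
def build_prompt(question, choices=None, strategy="zero_shot"):
    """Assemble a prompt for one of the three strategies in the study:
    zero_shot (question only), few_shot (question plus answer choices),
    chain_of_thought (question, choices, and the up-vote meta-instruction).
    Instruction wording is illustrative, not the authors' templates."""
    if strategy == "zero_shot":
        return f"Question: {question}\nAnswer:"
    lines = [f"Question: {question}", "Choices:"]
    lines += [f"({chr(65 + i)}) {c}" for i, c in enumerate(choices)]
    if strategy == "chain_of_thought":
        lines.append(
            "The correct answer is the most up-voted community response. "
            "Reason step by step, then answer with the letter of your choice."
        )
    else:  # few_shot, as named in the paper: question plus answer choices
        lines.append("Answer with the letter of the correct choice.")
    return "\n".join(lines)
```

Holding the question and choices fixed while varying only the trailing instruction is what lets the study isolate the effect of the extra context and reasoning cues.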
Key findings include:
- Monolingual superiority – Greek‑specific models outperform multilingual counterparts by an average of 3–5 percentage points, especially on culturally nuanced topics such as history and politics. This underscores the value of language‑specific pre‑training data that captures Greek morphology, syntax, and domain knowledge.
- Competitive multilingual models – Modern multilingual models like Aya Expanse 8B and EuroLLM 9B narrow the performance gap, suggesting that large‑scale multilingual pre‑training can yield respectable results for low‑resource languages when the training corpus is sufficiently diverse.
- Prompt engineering impact – Chain‑of‑Thought prompts consistently achieve the highest accuracy, delivering 6–9 pp gains over Zero‑Shot, particularly for questions requiring inference across answer options. The meta‑instruction about community up‑votes appears to guide the model toward human‑aligned ranking of candidates.
- Open‑weight vs. proprietary – Aya Expanse 8B approaches GPT‑4o mini's performance within 1–2 pp, demonstrating that open‑weight models can serve as cost‑effective alternatives without sacrificing much accuracy.
- Dataset relevance – Models attain higher scores on DemosQA than on the other five curated datasets (average +4 pp), indicating that community‑sourced, socially grounded QA data better reflects real user expectations and may be a more challenging benchmark for future LLM development.
The authors release the DemosQA dataset, the quantized evaluation code, and detailed experiment logs to the public, fostering reproducibility and encouraging further research on Greek and other under‑represented languages. They also discuss limitations, such as potential bias introduced by Reddit’s user base and the reliance on up‑vote counts as a proxy for answer quality, and outline future directions, including expanding DemosQA to additional domains, incorporating multilingual human annotations, and exploring advanced prompting or fine‑tuning strategies to further close the gap between monolingual and multilingual LLMs.
Overall, this work provides a comprehensive benchmark, a practical evaluation toolkit, and insightful empirical evidence on the trade‑offs between monolingual and multilingual LLMs for Greek QA, advancing the broader agenda of equitable NLP research across languages.