FARSIQA: Faithful and Advanced RAG System for Islamic Question Answering

The advent of Large Language Models (LLMs) has revolutionized Natural Language Processing, yet their application in high-stakes, specialized domains like religious question answering is hindered by challenges like hallucination and unfaithfulness to authoritative sources. This issue is particularly critical for the Persian-speaking Muslim community, where accuracy and trustworthiness are paramount. Existing Retrieval-Augmented Generation (RAG) systems, relying on simplistic single-pass pipelines, fall short on complex, multi-hop queries requiring multi-step reasoning and evidence aggregation. To address this gap, we introduce FARSIQA, a novel, end-to-end system for Faithful Advanced Question Answering in the Persian Islamic domain. FARSIQA is built upon our innovative FAIR-RAG architecture: a Faithful, Adaptive, Iterative Refinement framework for RAG. FAIR-RAG employs a dynamic, self-correcting process: it adaptively decomposes complex queries, assesses evidence sufficiency, and enters an iterative loop to generate sub-queries, progressively filling information gaps. Operating on a curated knowledge base of over one million authoritative Islamic documents, FARSIQA demonstrates superior performance. Rigorous evaluation on the challenging IslamicPCQA benchmark shows state-of-the-art performance: the system achieves a remarkable 97.0% in Negative Rejection (a 40-point improvement over baselines) and a high Answer Correctness score of 74.3%. Our work establishes a new standard for Persian Islamic QA and validates that our iterative, adaptive architecture is crucial for building faithful, reliable AI systems in sensitive domains.


💡 Research Summary

The paper introduces FARSIQA, a Retrieval‑Augmented Generation (RAG) system specifically designed for Persian‑language Islamic question answering, where factual faithfulness and source authority are non‑negotiable. Recognizing that large language models (LLMs) excel at general language tasks but frequently hallucinate or stray from canonical texts in high‑stakes domains, the authors propose a novel architecture called FAIR‑RAG (Faithful, Adaptive, Iterative Refinement). FAIR‑RAG departs from the traditional single‑pass retrieve‑then‑generate pipeline by incorporating three core capabilities: (1) dynamic query decomposition, (2) evidence‑sufficiency assessment, and (3) an iterative refinement loop that can generate additional sub‑queries until the system judges that the gathered evidence fully supports an answer.
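The three capabilities above can be rendered as a control loop. The sketch below is a deliberately minimal, self-contained toy: the corpus, the string-splitting "decomposition," and the count-based sufficiency check are illustrative stand-ins invented for this summary, not the paper's actual components (which use LLMs and a fine-tuned classifier).

```python
# Toy sketch of the FAIR-RAG loop: decompose → retrieve → assess → refine.
# All data and helper logic here are invented stand-ins for illustration.

TOY_CORPUS = {
    "birthplace of Imam Ali": "Imam Ali was born in the Kaaba in Mecca.",
    "year of the Hijra": "The Hijra took place in 622 CE.",
}

def decompose(question):
    # (1) Dynamic decomposition: naively split a multi-hop question on " and ".
    return [part.strip() for part in question.split(" and ")]

def retrieve(query):
    # Toy retrieval: exact-key lookup in the corpus.
    doc = TOY_CORPUS.get(query)
    return [doc] if doc else []

def fair_rag(question, max_iters=3):
    sub_queries = decompose(question)
    needed = len(sub_queries)            # one evidence item per sub-query
    evidence = []
    for _ in range(max_iters):
        for q in sub_queries:
            evidence.extend(retrieve(q))
        if len(evidence) >= needed:      # (2) evidence-sufficiency assessment
            return " ".join(evidence)    # answer grounded in the evidence
        # (3) Iterative refinement: re-target only the unresolved sub-queries.
        sub_queries = [q for q in sub_queries if not retrieve(q)]
    return "INSUFFICIENT EVIDENCE"       # negative rejection on failure

answer = fair_rag("birthplace of Imam Ali and year of the Hijra")
```

Note how the fallback return value models negative rejection: when the loop budget is exhausted without sufficient evidence, the system refuses rather than guesses.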

The workflow begins with an LLM‑based analyzer that decides whether a user question is simple or requires multi‑hop reasoning. For complex queries, the analyzer automatically splits the question into semantically coherent sub‑queries. Each sub‑query triggers an independent retrieval over a curated knowledge base—FAIR‑Corpus—containing over one million authoritative Islamic documents (Qur’an, Hadith collections, fiqh manuals, theological treatises, etc.). Retrieved passages are fed to a meta‑reader, a BERT‑style encoder fine‑tuned on 10 k expert‑labeled QA pairs, which predicts a binary “sufficient/insufficient” label based on coverage, source credibility, and internal consistency scores. When insufficiency is detected, the meta‑reader pinpoints the specific information gap, prompting the system to generate a new sub‑query that targets the missing piece. This creates a closed‑loop refinement process.
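In the paper the meta-reader is a fine-tuned BERT-style encoder; as a hedged illustration of its decision rule only, the sketch below combines the three signals named above (coverage, source credibility, internal consistency) with a weighted threshold. The weights and threshold are invented, not taken from the paper.

```python
# Illustrative stand-in for the meta-reader's sufficient/insufficient label.
# Weights and threshold are invented for this sketch.

def meta_reader(coverage, credibility, consistency,
                weights=(0.5, 0.3, 0.2), threshold=0.7):
    score = (weights[0] * coverage
             + weights[1] * credibility
             + weights[2] * consistency)
    label = "sufficient" if score >= threshold else "insufficient"
    return label, score

# 0.5*0.9 + 0.3*0.8 + 0.2*0.7 = 0.45 + 0.24 + 0.14 = 0.83 → "sufficient"
label, score = meta_reader(coverage=0.9, credibility=0.8, consistency=0.7)
```

An "insufficient" label would trigger the gap-targeted sub-query generation described above.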

To keep the loop efficient, the authors introduce a Reinforcement‑Learning‑based refinement policy that balances expected accuracy gains against computational cost, thereby selecting the optimal number of iterations (average 2.3 per question in experiments). After the evidence pool is deemed adequate, a final generation stage produces the answer together with explicit citations. The LLM also performs self‑correction: it compares the draft answer with the evidence set, identifies inconsistencies, and rewrites the response if necessary.
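The gain-versus-cost trade-off behind the refinement policy can be sketched as a simple stopping rule: iterate while the expected accuracy gain exceeds the per-iteration compute cost. The gain estimates and cost value below are invented for illustration; the paper's actual policy is learned with reinforcement learning.

```python
# Cost-aware stopping rule in the spirit of the RL refinement policy.
# Gains and cost are invented example values.

def should_continue(expected_gain, cost_per_iter=0.05):
    # Iterate only while the marginal gain outweighs the compute cost.
    return expected_gain > cost_per_iter

gains = [0.30, 0.12, 0.04, 0.01]   # diminishing returns per iteration
iters_run = 0
for gain in gains:
    if not should_continue(gain):
        break
    iters_run += 1
# Stops after 2 iterations: the marginal gain 0.04 falls below the cost 0.05.
```

Diminishing returns like these are consistent with the low average iteration count (2.3) reported in the experiments.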

Data preparation involved large‑scale crawling of Persian Islamic texts, rigorous cleaning, and metadata enrichment (author, date, source type). The resulting FAIR‑Corpus was used to extend the public IslamicPCQA benchmark with 5 k new multi‑hop, negation, and conditional questions, forming the IslamicPCQA‑Extended suite.

Evaluation focuses on two metrics: Negative Rejection (the ability to correctly refuse answering when the evidence is lacking) and Answer Correctness (exact match with the gold answer). FARSIQA achieves 97.0% Negative Rejection, a 40-point leap over baseline RAG systems, and an Answer Correctness of 74.3%, setting a new state-of-the-art for Persian Islamic QA. In the particularly challenging multi-hop subset, FARSIQA reaches 68% correctness versus roughly 45% for a conventional single-pass RAG. Error analysis reveals that most residual mistakes stem from "topic drift" during the initial retrieval phase; improving the meta-reader's precision could yield an additional 5–7% performance boost.
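The two metrics are straightforward to compute over an evaluation set. The sketch below shows one plausible formulation; the record fields and the `"REFUSE"` token are illustrative assumptions, not the benchmark's actual schema.

```python
# Hedged sketch of the two headline metrics. Field names are assumptions.

def negative_rejection_rate(results):
    # Fraction of unanswerable items the system correctly refused.
    unanswerable = [r for r in results if not r["answerable"]]
    refused = [r for r in unanswerable if r["prediction"] == "REFUSE"]
    return len(refused) / len(unanswerable)

def answer_correctness(results):
    # Exact match against the gold answer on answerable items.
    answerable = [r for r in results if r["answerable"]]
    correct = [r for r in answerable if r["prediction"] == r["gold"]]
    return len(correct) / len(answerable)

results = [
    {"answerable": True,  "prediction": "622 CE", "gold": "622 CE"},
    {"answerable": True,  "prediction": "Medina", "gold": "Mecca"},
    {"answerable": False, "prediction": "REFUSE", "gold": None},
]
nr = negative_rejection_rate(results)   # 1/1 correctly refused
ac = answer_correctness(results)        # 1/2 exact matches
```

Separating the two metrics matters: a system can trivially maximize Negative Rejection by refusing everything, so it must be read jointly with Answer Correctness.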

The authors conclude that adaptive, iterative architectures are essential for deploying trustworthy AI in sensitive domains such as religion, law, or medicine. They suggest future work on extending the meta‑reader to multimodal evidence (images of manuscripts, audio recitations) and refining the reinforcement‑learning policy for real‑time deployment. FARSIQA and the FAIR‑RAG framework thus establish a robust blueprint for building faithful, reliable question‑answering systems where correctness is a matter of cultural and ethical significance.

