Towards AI Evaluation in Domain-Specific RAG Systems: The AgriHubi Case Study

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Large language models show promise for knowledge-intensive domains, yet their use in agriculture is constrained by weak grounding, English-centric training data, and limited real-world evaluation. These issues are amplified for low-resource languages, where high-quality domain documentation exists but remains difficult to access through general-purpose models. This paper presents AgriHubi, a domain-adapted retrieval-augmented generation (RAG) system for Finnish-language agricultural decision support. AgriHubi integrates Finnish agricultural documents with open PORO family models and combines explicit source grounding with user feedback to support iterative refinement. Developed over eight iterations and evaluated through two user studies, the system shows clear gains in answer completeness, linguistic accuracy, and perceived reliability. The results also reveal practical trade-offs between response quality and latency when deploying larger models. This study provides empirical guidance for designing and evaluating domain-specific RAG systems in low-resource language settings.


💡 Research Summary

This paper introduces AgriHubi, a domain‑adapted Retrieval‑Augmented Generation (RAG) system designed to provide decision‑support answers for Finnish‑language agriculture. The authors identify three major obstacles that limit the usefulness of large language models (LLMs) in this sector: weak grounding of generated text, an English‑centric training bias, and a lack of real‑world evaluation, especially for low‑resource languages where high‑quality domain documentation exists but is not readily exploitable by generic models.

AgriHubi’s architecture follows a classic four‑stage RAG pipeline—document ingestion, vector retrieval, generative inference, and user interaction—implemented entirely in Python for modularity and reproducibility. Raw agricultural PDFs (including scanned documents) are first processed with OCR, segmented into coherent text chunks, and enriched with metadata (document ID, section, date). These chunks are embedded using OpenAI’s text‑embedding‑ada‑002 model (8192‑token limit) and indexed with FAISS (L2‑normalized vectors). At query time, the user’s question is embedded, the top‑k most similar chunks (default k = 5) are retrieved, and the concatenated context is fed to a Finnish‑capable LLM from the open‑source PORO family. The system experiments with four model configurations: Llama 3.2 (baseline), PORO‑34B, PORO‑2‑8B, and the largest PORO‑2‑70B. Model calls are routed through the Farmi/GPT‑Lab APIs, eliminating the need for local GPUs.
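The retrieval stage described above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the toy `embed` function (a deterministic bag-of-words hash) stands in for text-embedding-ada-002, and a brute-force inner-product search stands in for the FAISS index; with L2-normalized vectors, inner product equals cosine similarity, which is what the paper's FAISS setup computes. The chunk metadata fields (`doc_id`, `section`) mirror those named in the summary.

```python
import math

def embed(text: str, dim: int = 8) -> list[float]:
    # Placeholder for text-embedding-ada-002: a deterministic toy
    # bag-of-words embedding, L2-normalized as in the paper's FAISS index.
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[sum(ord(ch) for ch in token) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def retrieve(query: str, chunks: list[dict], k: int = 5) -> list[dict]:
    # Brute-force stand-in for FAISS: with L2-normalized vectors,
    # inner product is cosine similarity, so top-k by dot product
    # matches the paper's retrieval behavior.
    q = embed(query)
    scored = [
        (sum(a * b for a, b in zip(q, embed(c["text"]))), c)
        for c in chunks
    ]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [c for _, c in scored[:k]]

# Hypothetical chunk records with the metadata fields from the summary.
chunks = [
    {"doc_id": "guide-01", "section": "2.1",
     "text": "nitrogen fertilization limits for barley"},
    {"doc_id": "guide-02", "section": "4.3",
     "text": "dairy barn ventilation requirements"},
]
top = retrieve("fertilization limits", chunks, k=1)
```

In the real system the embedding call goes out to a hosted API and the index holds the full document corpus; the top-k result (default k = 5) is then concatenated into the LLM context.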

A key design decision is explicit source grounding. Prompt templates force the model to cite the retrieved passages and to append a citation block (document name, page number) to every answer. This makes the system’s output auditable, a crucial requirement for agricultural advisors who must verify recommendations against official regulations.
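The grounding behavior can be illustrated with a prompt builder. The exact template wording is not given in the summary, so the instruction text and field names (`doc`, `page`) below are assumptions; what the sketch preserves is the stated contract: answer only from retrieved passages, cite them, and end with a citation block listing document name and page number.

```python
def build_prompt(question: str, passages: list[dict]) -> str:
    # Hypothetical template in the spirit of the paper's grounding
    # prompts: number each passage and expose its document name and
    # page so the model can cite them in a trailing 'Sources:' block.
    context = "\n\n".join(
        f"[{i + 1}] ({p['doc']}, p. {p['page']}) {p['text']}"
        for i, p in enumerate(passages)
    )
    return (
        "Answer the question using ONLY the passages below. "
        "Cite passages inline as [n] and end with a 'Sources:' block "
        "listing the document name and page for each citation.\n\n"
        f"Passages:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What are the nitrogen limits for barley?",
    [{"doc": "Fertilization Guide", "page": 12,
      "text": "Barley: max 160 kg N/ha."}],
)
```

Because every answer carries the citation block, an advisor can trace each recommendation back to a specific page of an official document.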

The user interface is built with Streamlit and supports Finnish, Swedish, and English. It presents a chat‑style conversation, streams model responses, and collects structured feedback via a five‑point rating plus free‑text comments. All interactions—including the query, retrieved passages, model version, rating, and timestamp—are persisted in a SQLite database, forming the backbone of an iterative feedback loop.
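The feedback loop rests on this interaction log. A minimal sketch of the SQLite side is shown below; the schema is hypothetical (the paper does not publish its table layout), but the columns correspond exactly to the fields the summary says are persisted: query, retrieved passages, model version, rating, comment, and timestamp. An in-memory database is used here for demonstration.

```python
import json
import sqlite3
import time

# Hypothetical schema for the interaction log described in the summary.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS feedback (
        id       INTEGER PRIMARY KEY AUTOINCREMENT,
        query    TEXT NOT NULL,
        passages TEXT NOT NULL,   -- JSON list of retrieved chunk IDs
        model    TEXT NOT NULL,   -- e.g. 'PORO-2-8B'
        rating   INTEGER CHECK (rating BETWEEN 1 AND 5),
        comment  TEXT,
        ts       REAL NOT NULL
    )
""")

def log_interaction(query, passage_ids, model, rating, comment=""):
    # Persist one chat turn plus its five-point rating and free text.
    conn.execute(
        "INSERT INTO feedback (query, passages, model, rating, comment, ts) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (query, json.dumps(passage_ids), model, rating, comment, time.time()),
    )
    conn.commit()

log_interaction("lannoitusrajat?", ["guide-01#2.1"], "PORO-2-8B", 4,
                "selkeä vastaus")  # Finnish: "clear answer"
```

Aggregating this table by model version and rating is what lets the team compare configurations across iterations.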

Development proceeded through eight iterations between January and August 2025. Early iterations (1‑3) established the core pipeline and revealed that Llama 3.2 could not handle Finnish agricultural terminology. Switching to PORO‑34B immediately improved fluency and domain coverage. Iterations 4‑6 introduced the feedback mechanism, conducted the first external evaluation (67 responses), and used the collected ratings to refine chunking logic, similarity thresholds, and prompt wording. Iterations 7‑8 incorporated the largest PORO‑2‑70B model, achieving the highest answer quality at the cost of increased latency.

Two structured user studies were carried out. Quantitative metrics (answer completeness, linguistic accuracy, perceived reliability) and qualitative observations showed clear gains across all model sizes. Compared with the baseline, PORO‑34B raised average scores by roughly 1.2 points on a 5‑point scale; PORO‑2‑70B added another 0.8 points. The proportion of low‑scoring answers (1–2) dropped from 48% in early tests to 12% after the final iteration. Users reported a 30% increase in trust when answers included explicit citations. However, latency grew from an average of 1.8 seconds (PORO‑34B) to 3.7 seconds (PORO‑2‑70B), and API costs roughly doubled, highlighting a practical trade‑off between quality and responsiveness.

The authors discuss three main limitations. First, the retrieval component operates only on plain‑text chunks, ignoring richer document structures such as tables, figures, or multimodal data. Second, the embedding model is English‑centric, which can under‑represent Finnish‑specific lexical nuances. Third, the reliance on external hosted APIs makes large‑scale deployment expensive and potentially vulnerable to service interruptions.

Future work is outlined along four directions: (1) training or fine‑tuning a Finnish‑specific embedding model to improve semantic matching; (2) extending the index to support multimodal artifacts and hierarchical document sections; (3) implementing a cost‑aware routing layer that selects smaller models for low‑complexity queries while reserving the largest model for high‑stakes advice; and (4) collaborating with national agricultural agencies to continuously enrich the document corpus and to co‑design evaluation protocols that reflect real‑world decision‑making workflows.
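The cost-aware routing layer in direction (3) is only proposed, not implemented, so any concrete policy is speculative. One simple realization is a keyword-and-length heuristic: routine, short queries go to a small model, while queries touching high-stakes topics escalate to the largest one. The term list and thresholds below are invented for illustration.

```python
def route_model(query: str, high_stakes_terms: set[str]) -> str:
    # Hypothetical cost-aware router: escalate to the largest PORO
    # model when the query mentions a high-stakes topic or is long
    # (a crude proxy for complexity); otherwise use the cheap model.
    tokens = set(query.lower().split())
    if tokens & high_stakes_terms or len(tokens) > 30:
        return "PORO-2-70B"
    return "PORO-2-8B"

# Invented example term list; a real deployment would curate this
# with domain experts and likely use a learned complexity classifier.
terms = {"pesticide", "regulation", "subsidy"}
```

Given the reported latency gap (1.8 s vs. 3.7 s) and the roughly doubled API cost, even a crude router like this could keep most traffic on the cheaper model while reserving PORO‑2‑70B for advice where errors are costly.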

In summary, AgriHubi provides an empirically validated roadmap for building domain‑specific RAG systems in low‑resource language contexts. It demonstrates how careful integration of open‑source LLMs, robust retrieval pipelines, explicit grounding, and systematic user feedback can together overcome the grounding and bias challenges that plague generic LLM deployments. The paper contributes concrete insights on model selection, prompt engineering, feedback‑driven iteration, and the latency‑quality trade‑off, offering actionable guidance for both researchers and practitioners aiming to deploy trustworthy AI assistants in specialized, multilingual domains.

