Designing Staged Evaluation Workflows for LLMs: Integrating Domain Experts, Lay Users, and Model-Generated Evaluation Criteria
Large Language Models (LLMs) are increasingly utilized for domain-specific tasks, yet evaluating their outputs remains challenging. A common strategy is to apply evaluation criteria to assess alignment with domain-specific standards, but little is understood about how criteria differ across sources or where each type is most useful in the evaluation process. This study investigates criteria developed by domain experts, lay users, and LLMs to identify their complementary roles within an evaluation workflow. Results show that experts produce fact-based criteria with long-term value, lay users emphasize usability with a shorter-term focus, and LLMs target procedural checks for immediate task requirements. We also examine how criteria evolve between a priori and a posteriori phases, noting drift across stages as well as convergence in the a posteriori phase. Based on our observations, we propose design guidelines for a staged evaluation workflow combining the complementary strengths of these sources to balance quality, cost, and scalability.
💡 Research Summary
The paper addresses a critical gap in the evaluation of large language models (LLMs) for domain‑specific tasks: how to generate, select, and apply evaluation criteria that balance quality, cost, and scalability. While prior work has introduced tools that let users define criteria (e.g., EvalAssist, EvalGen) and has highlighted the importance of human‑in‑the‑loop evaluation, little is known about the nature of criteria produced by different sources—domain experts, lay users, and the LLMs themselves—and where each source adds the most value in a multi‑stage workflow.
To answer this, the authors conduct a case study in two distinct domains—nutrition and mathematics education—chosen because both require specialized knowledge but differ in the type of reasoning involved. Participants from each domain (experts and lay users) were asked to create evaluation criteria for a set of tasks in two phases: an a priori phase (before seeing model outputs, based solely on the prompt) and an a posteriori phase (after reviewing model outputs). In parallel, the authors prompted a state‑of‑the‑art LLM (GPT‑4) to generate its own criteria for the same tasks. This design enables a direct comparison of criteria across three sources and across the two temporal phases, allowing the authors to investigate “criteria drift” (changes from a priori to a posteriori) and “criteria convergence” (when different sources end up using similar criteria).
The analysis reveals clear, systematic differences. Domain experts produce fact‑based, instructional criteria that aim to prevent misconceptions and ensure long‑term educational or health value. Their criteria tend to be defined early (a priori) and remain stable, reflecting deep subject‑matter knowledge. Lay users focus on usability, presentation, and perceived relevance to end‑users; they generate many of their criteria only after seeing model outputs (a posteriori), reacting to gaps or confusing language. LLMs generate procedural or surface‑level checks—e.g., “does the response contain the required keywords?” or “is the grammar correct?”—that are useful for immediate task compliance but lack the depth to catch domain‑specific errors. Moreover, LLM‑generated criteria rarely evolve in response to output errors; they are largely static and can inadvertently reinforce hallucinations or biases present in the model’s own output.
A key empirical finding is the presence of criteria drift: in the a priori stage, the three sources diverge sharply in focus (expert vs. usability vs. procedural). However, in the a posteriori stage, all sources converge on criteria that directly reference observable features of the model’s output. This convergence suggests that once an output is visible, human evaluators gravitate toward concrete, surface‑level judgments, while the LLM continues to rely on its original procedural checklist. The drift‑to‑convergence pattern underscores the limited added value of LLM‑only evaluation in later stages and highlights the importance of human feedback for uncovering deeper, domain‑specific issues.
Based on these insights, the authors propose a staged, hybrid evaluation workflow that strategically combines the strengths of each source:
- Expert a priori stage – Experts define a core set of factual and ethical criteria (e.g., “aligns with clinical guidelines”, “avoids known nutrition myths”). These become the backbone of the evaluation and are embedded directly into the prompt.
- LLM a priori stage – The prompt, enriched with expert criteria, also asks the LLM to generate procedural checks (e.g., “list required data fields”, “verify numeric consistency”). This yields an automatic, low‑cost baseline set of criteria.
- Model output generation – The LLM produces responses using the combined prompt.
- Lay‑user a posteriori stage – End‑users review the outputs, adding or refining criteria related to usability, clarity, and perceived relevance. Their feedback captures real‑world concerns that experts may overlook.
- Expert a posteriori validation – Finally, experts review the aggregated criteria and the model outputs, confirming that critical factual and ethical standards are met and adjusting any missed issues.
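The five stages above can be sketched as a minimal pipeline. This is an illustrative sketch, not the authors' implementation: the `Criterion` data model, the function names, and the canned criteria are all assumptions, and the LLM and lay-user stages are stubbed with fixed lists where a real system would call a model API or collect human feedback.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Criterion:
    source: str  # "expert" | "llm" | "lay_user"
    phase: str   # "a_priori" | "a_posteriori"
    text: str

def expert_a_priori() -> list[Criterion]:
    # Stage 1: experts define the factual/ethical backbone.
    return [Criterion("expert", "a_priori", "aligns with clinical guidelines"),
            Criterion("expert", "a_priori", "avoids known nutrition myths")]

def llm_a_priori(expert_criteria: list[Criterion]) -> list[Criterion]:
    # Stage 2: prompt the LLM for procedural checks, conditioned on the
    # expert checklist embedded in the prompt (stubbed here).
    prompt = ("Provide procedural validation items based on the following "
              "expert checklist: " + "; ".join(c.text for c in expert_criteria))
    _ = prompt  # a real system would send this to an LLM API
    return [Criterion("llm", "a_priori", "lists required data fields"),
            Criterion("llm", "a_priori", "numeric values are internally consistent")]

def generate_output(criteria: list[Criterion]) -> str:
    # Stage 3: the model answers using the combined prompt (stubbed).
    return "Sample response produced under the combined prompt."

def lay_user_a_posteriori(output: str) -> list[Criterion]:
    # Stage 4: end-users react to the visible output with usability criteria.
    return [Criterion("lay_user", "a_posteriori",
                      "explanation is readable by non-specialists")]

def expert_a_posteriori(all_criteria: list[Criterion], output: str) -> list[Criterion]:
    # Stage 5: experts validate the aggregated set and add missed issues.
    return all_criteria + [Criterion("expert", "a_posteriori",
                                     "no unsupported health claims in output")]

def run_workflow() -> list[Criterion]:
    experts = expert_a_priori()
    procedural = llm_a_priori(experts)
    output = generate_output(experts + procedural)
    lay = lay_user_a_posteriori(output)
    return expert_a_posteriori(experts + procedural + lay, output)

criteria = run_workflow()
```

Note that expert effort is confined to stages 1 and 5, matching the paper's goal of minimizing expert time while keeping them responsible for the final factual check.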
The workflow minimizes expert time (they intervene only at the start and end), leverages cheap LLM‑generated checks for rapid iteration, and incorporates lay‑user perspectives where they matter most—usability and trust. The authors also provide concrete prompting guidelines (e.g., “Provide five procedural validation items based on the following expert checklist”) and suggest a hierarchical scoring scheme that weights expert criteria higher than procedural or usability items.
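The hierarchical scoring idea can be sketched as a weighted pass rate. The specific weights below are illustrative assumptions (the summary only states that expert criteria are weighted higher than procedural or usability items), and the binary pass/fail judgments would in practice come from human raters or an LLM judge.

```python
# Hierarchical scoring: expert criteria outweigh procedural (llm) and
# usability (lay_user) criteria. Weight values are assumed for illustration.
WEIGHTS = {"expert": 3.0, "llm": 1.0, "lay_user": 1.5}

def score_output(judgments: list[tuple[str, bool]]) -> float:
    """judgments: one (source, passed) pair per criterion.
    Returns the weighted pass rate in [0, 1]."""
    total = sum(WEIGHTS[src] for src, _ in judgments)
    passed = sum(WEIGHTS[src] for src, ok in judgments if ok)
    return passed / total if total else 0.0

# Example: both expert checks pass, one procedural check fails.
example = [("expert", True), ("expert", True), ("llm", False), ("lay_user", True)]
print(round(score_output(example), 3))  # → 0.882
```

Under this scheme a failed expert criterion costs three times as much as a failed procedural one, so factual errors dominate the final score even when surface-level checks pass.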
In addition to the workflow, the paper contributes three broader implications: (1) Human input is most valuable after the model has produced output, especially for detecting domain‑specific errors that LLM‑generated criteria miss; (2) Relying solely on LLM‑generated criteria is risky, as it can propagate hallucinations and bias; (3) Prompt engineering influences the quality of LLM‑generated criteria, indicating that better prompts can partially mitigate the limitations of automated criteria.
Overall, the study offers an empirically grounded, actionable blueprint for designing evaluation pipelines that are both cost‑effective and high‑quality. By systematically mapping where each stakeholder’s expertise is most impactful, the work advances the state of practice for trustworthy AI deployment in complex, high‑stakes domains. The authors suggest that future work test the workflow in additional fields (e.g., law, radiology) and explore methods for automatically detecting and correcting bias in LLM‑generated criteria.