MILE-RefHumEval: A Reference-Free, Multi-Independent LLM Framework for Human-Aligned Evaluation


We introduce MILE-RefHumEval, a reference-free framework for evaluating Large Language Models (LLMs) without ground-truth annotations or evaluator coordination. It leverages an ensemble of independently prompted evaluators guided by a human-aligned schema, supporting both discrete and continuous scoring. With task-specific prompts spanning best-candidate selection, summarization, image captioning, and dialogue, MILE-RefHumEval provides flexible, interpretable, and scalable assessments. Experiments show it aligns closely with human judgments, outperforms prior methods, and reduces computational overhead, offering an efficient, robust, and human-aligned solution for real-world LLM evaluation.


💡 Research Summary

The paper introduces MILE‑RefHumEval, a novel framework for evaluating large language models (LLMs) without relying on reference outputs or coordinated evaluator interactions. Traditional automatic metrics such as BLEU, ROUGE, METEOR, or BERTScore require ground‑truth references and often fail to capture semantic nuance, especially for open‑ended or multi‑modal tasks. Recent “LLM‑as‑judge” approaches mitigate this by using LLMs as evaluators, but they typically involve a single model playing multiple roles, or a collaborative setting where evaluators exchange feedback. These designs introduce consensus bias, dominance bias, and inter‑rater dependence, limiting reliability and interpretability.

MILE‑RefHumEval addresses these issues by deploying multiple independent LLM evaluators, each prompted with the same human‑aligned evaluation schema but without any shared context. The framework consists of four key components:

  1. Task‑specific evaluation dimensions – For each benchmark (e.g., summarization, image captioning, dialogue) a set of criteria (coherence, consistency, fluency, relevance, etc.) is defined, together with clear scoring rubrics (1‑5 or binary). These dimensions directly mirror human annotation guidelines, ensuring that the scores are interpretable and human‑aligned.

  2. Prompt engineering – Each evaluator receives a role‑declaration (“You are Evaluator A”) followed by clearly delimited sections for the input, the candidate response(s), and the evaluation criteria. The prompts are concise yet exhaustive, minimizing ambiguity and guiding the LLM to focus on the intended aspects.

  3. Independent evaluation – Seven diverse LLMs (ranging from 8B to 32B parameters, including DeepSeek‑R1‑Distill‑Llama‑8B, Mistral‑Small‑3.2‑24B‑Instruct, GPT‑4.1‑mini, Llama‑3.1‑8B‑Instruct, Gemma‑3‑12B‑it, Phi‑4, and Qwen‑qwq‑32B) are queried once per sample. No evaluator sees the outputs of another, eliminating cross‑model influence and the associated bias.

  4. Ensemble aggregation – For discrete decisions (e.g., “which answer is better?”) a majority‑vote is taken; for continuous scores (e.g., 1‑5 ratings) the arithmetic mean is computed. This simple aggregation leverages the complementary strengths of the models while smoothing individual noise.
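The aggregation step described above is simple enough to sketch directly. The following is a minimal illustration in Python (the function names are ours, not from the paper; the paper only specifies majority vote for discrete decisions and the arithmetic mean for continuous scores):

```python
from collections import Counter
from statistics import mean

def aggregate_discrete(votes):
    """Majority vote over discrete evaluator decisions (e.g. 'A' vs. 'B').

    Ties are broken in favor of the label that appears earliest in the
    vote list (i.e. the first evaluator's choice among the tied labels).
    """
    counts = Counter(votes)
    top = max(counts.values())
    for v in votes:  # preserve evaluator order for deterministic tie-breaking
        if counts[v] == top:
            return v

def aggregate_continuous(scores):
    """Arithmetic mean over continuous evaluator ratings (e.g. 1-5 scales)."""
    return mean(scores)

# Example: seven independent evaluators judging one pairwise comparison
# and one 1-5 quality rating.
print(aggregate_discrete(["A", "B", "A", "A", "B", "A", "B"]))  # -> A
print(round(aggregate_continuous([4, 5, 4, 3, 4, 5, 4]), 3))    # -> 4.143
```

Because each evaluator is queried independently, the aggregation is embarrassingly parallel: the seven judgments can be collected concurrently and combined in one pass.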

The authors evaluate MILE‑RefHumEval on five publicly available benchmarks:

  • FairEval (80 open‑ended QA samples): MILE‑RefHumEval outperforms the ChatEval baseline across accuracy, F1, Cohen’s Kappa, and Matthews Correlation Coefficient, while reducing the total number of API calls by ~30 %.
  • SummEval (1,600 news summaries): Correlation with human judgments (Spearman 0.78, Kendall 0.71) exceeds that of the recent G‑Eval‑4 (0.71/0.64). The framework also achieves higher agreement on each of the four dimensions (coherence, consistency, fluency, relevance).
  • OID Image Caption (6,182 image‑caption pairs): Framed as a binary “good/bad” classification, MILE‑RefHumEval attains 84.3 % accuracy, F1 0.81, and Kappa 0.73, surpassing CLIP‑Score (Acc 76.5 %).
  • PandaLM (1,000 pairwise response comparisons): The system selects the best response with 87.2 % accuracy, beating JudgeLM (81.5 %) and a task‑specific PandaLM model (83.0 %).
  • Topical Chat (360 dialogue turns): Using six fine‑grained dimensions, the mean squared error against human scores is 0.42, lower than G‑Eval‑4’s 0.58, indicating tighter alignment with human perception.
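Several of the agreement metrics reported above, such as Cohen’s Kappa, measure how well the framework’s verdicts match human labels beyond chance agreement. As a quick reference for how such a number is computed, here is a small pure-Python implementation (our own illustration with toy data, not the paper’s evaluation code):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa between two labelings: observed agreement
    corrected for the agreement expected by chance."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Toy example: framework verdicts vs. human verdicts on 8 samples.
model = ["good", "good", "bad", "good", "bad", "bad", "good", "bad"]
human = ["good", "good", "bad", "bad",  "bad", "bad", "good", "good"]
print(round(cohens_kappa(model, human), 3))  # -> 0.5
```

Here the raters agree on 6 of 8 samples (75 %), but since both label half the samples "good", 50 % agreement is expected by chance, giving kappa = (0.75 − 0.5) / (1 − 0.5) = 0.5.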

Overall, the experiments demonstrate that independent multi‑LLM evaluation can achieve higher reliability, better alignment with human judgments, and lower computational cost than both single‑model judges and collaborative ensembles.

Strengths of the approach include:

  • Reference‑free operation, making it applicable to domains where gold standards are scarce or infeasible.
  • Elimination of evaluator interaction, which removes consensus and dominance biases.
  • Modular prompt‑based adaptation to any task (text‑to‑text, image‑to‑text, etc.) without retraining.
  • Cost efficiency through the use of smaller, complementary models and reduced query counts.

Limitations are acknowledged:

  • The quality of the final score depends on the baseline competence of the constituent LLMs; inclusion of weak models can degrade performance.
  • Prompt and schema design still requires manual effort and domain expertise for each new task.
  • Simple majority/average aggregation may be vulnerable to systematic over‑ or under‑rating by a subset of models.
  • While query count per sample is lower than some baselines, large‑scale deployment still incurs non‑trivial API costs.

Future directions suggested by the authors involve automated schema generation, learning dynamic evaluator weights, extending the framework to richer multimodal evaluators, and integrating the evaluation feedback into LLM fine‑tuning loops.

In sum, MILE‑RefHumEval proposes a reference‑free, human‑aligned, independent‑evaluator ensemble that advances the state of LLM evaluation by delivering more trustworthy, interpretable, and scalable assessments across a wide spectrum of tasks.

