Better Than Their Reputation? On the Reliability of Relevance Assessments with Students

Over the last three years we conducted several information retrieval evaluation series with more than 180 LIS students who made relevance assessments on the outcomes of three specific retrieval services. In this study we do not focus on the retrieval performance of our system but on the relevance assessments and the inter-assessor reliability. To quantify the agreement we apply Fleiss’ Kappa and Krippendorff’s Alpha. Comparing these two statistical measures, Kappa values averaged 0.37 and Alpha values 0.15. We use the two agreement measures to drop overly unreliable assessments from our data set. When computing the differences between the unfiltered and the filtered data set we see a root mean square error between 0.02 and 0.12. We see this as a clear indicator that disagreement affects the reliability of retrieval evaluations. We suggest either not working with unfiltered results or clearly documenting the disagreement rates.


💡 Research Summary

The paper investigates the reliability of relevance assessments performed by students in information‑retrieval (IR) evaluation experiments. Over a three‑year period the authors conducted a series of evaluation campaigns involving more than 180 library‑ and information‑science (LIS) students who judged the relevance of results returned by three distinct retrieval services: a traditional keyword search, an expanded‑query system, and a meta‑search platform. The primary focus is not on the retrieval performance of these services but on the consistency of the students’ judgments and the impact of assessor disagreement on evaluation outcomes.

To quantify inter‑assessor agreement the study employs two widely recognized statistical measures: Fleiss’ Kappa, which assesses agreement among multiple raters for categorical data, and Krippendorff’s Alpha, a more general reliability coefficient that can handle nominal, ordinal, interval, and ratio scales and that accounts for missing data. The authors calculate both metrics for each assessment set. The average Fleiss’ Kappa across all tasks is 0.37, indicating only fair agreement on the commonly used Landis and Koch scale, while the average Krippendorff’s Alpha is 0.15, a value that suggests very low reliability. The discrepancy between the two figures is explained by the fact that Alpha is a more conservative estimator, penalizing random agreement more heavily; consequently, the students’ judgments appear substantially less consistent when evaluated with Alpha.
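
As an illustration of how these two coefficients can be computed in practice, the following minimal sketch uses the statsmodels and krippendorff Python packages on a small, made-up matrix of student judgments; the data, the four-point scale, and all variable names are illustrative, not the paper’s actual assessments.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
import krippendorff

# Rows = assessed documents (units), columns = student assessors.
# Values are grades on a four-point ordinal relevance scale (0-3).
judgments = np.array([
    [3, 2, 3, 3],
    [0, 1, 0, 0],
    [2, 2, 1, 3],
    [1, 0, 0, 1],
    [3, 3, 2, 3],
])

# Fleiss' Kappa expects a units-by-categories count table.
table, _ = aggregate_raters(judgments)
kappa = fleiss_kappa(table, method="fleiss")

# Krippendorff's Alpha expects a raters-by-units matrix, tolerates missing
# values (np.nan), and can respect the ordinal nature of the scale.
alpha = krippendorff.alpha(reliability_data=judgments.T,
                           level_of_measurement="ordinal")

print(f"Fleiss' Kappa:        {kappa:.2f}")
print(f"Krippendorff's Alpha: {alpha:.2f}")
```

The two coefficients will generally differ on the same data because they model expected (chance) agreement differently, which is in line with the gap between 0.37 and 0.15 explained above.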

Recognizing that low‑reliability assessments can distort the perceived effectiveness of retrieval systems, the authors introduce a filtering procedure. They define reliability thresholds (Kappa < 0.4 or Alpha < 0.2) and discard any assessment that falls below these limits. After filtering, they recompute system performance and compare the filtered results with the original, unfiltered data. The root‑mean‑square error (RMSE) between the two sets ranges from 0.02 to 0.12, a modest but non‑trivial deviation that demonstrates how assessor disagreement can affect evaluation metrics.
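
The sketch below illustrates this kind of filtering and comparison step under simple assumptions: each topic carries its own per-topic agreement values and per-system scores, and the thresholds from above (Kappa ≥ 0.4 and Alpha ≥ 0.2) decide which topics are kept. All topic identifiers, scores, and system names here are hypothetical; this is not the authors’ actual code or data.

```python
import math

KAPPA_MIN, ALPHA_MIN = 0.4, 0.2  # reliability thresholds as stated above

# Hypothetical per-topic agreement values and per-system scores
# (e.g. precision) for the three retrieval services.
topics = {
    "t01": {"kappa": 0.52, "alpha": 0.31,
            "scores": {"keyword": 0.60, "expanded": 0.65, "meta": 0.58}},
    "t02": {"kappa": 0.28, "alpha": 0.05,
            "scores": {"keyword": 0.45, "expanded": 0.50, "meta": 0.40}},
    "t03": {"kappa": 0.41, "alpha": 0.22,
            "scores": {"keyword": 0.70, "expanded": 0.68, "meta": 0.72}},
    "t04": {"kappa": 0.35, "alpha": 0.18,
            "scores": {"keyword": 0.55, "expanded": 0.52, "meta": 0.57}},
}

def mean_scores(selected):
    """Average each system's score over the selected topics."""
    sums, counts = {}, {}
    for topic in selected.values():
        for system, score in topic["scores"].items():
            sums[system] = sums.get(system, 0.0) + score
            counts[system] = counts.get(system, 0) + 1
    return {s: sums[s] / counts[s] for s in sums}

# Keep only topics whose judgments meet both reliability thresholds.
reliable = {tid: t for tid, t in topics.items()
            if t["kappa"] >= KAPPA_MIN and t["alpha"] >= ALPHA_MIN}

unfiltered = mean_scores(topics)
filtered = mean_scores(reliable)

# RMSE between the unfiltered and filtered evaluation, taken over all systems.
rmse = math.sqrt(sum((unfiltered[s] - filtered[s]) ** 2
                     for s in unfiltered) / len(unfiltered))
print(f"RMSE between unfiltered and filtered results: {rmse:.3f}")
```

With more topics, systems, or effectiveness metrics, the same RMSE computation would simply be taken over all of those score pairs rather than over three system means.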

The paper discusses several factors that likely contributed to the low agreement scores. First, the participants were students rather than domain experts, so their internal relevance criteria were less stable and less aligned with the intended evaluation guidelines. Second, the assessment task involved both binary (relevant/irrelevant) and four‑point ordinal scales, which may have introduced additional ambiguity. Third, the instructional material provided before the assessment was minimal, leading to divergent interpretations of what constitutes “relevant.” The authors argue that more extensive pre‑assessment training, clearer rubrics, and pilot testing of the assessment instrument could raise both Kappa and Alpha values.

Beyond methodological recommendations, the authors reflect on the trade‑off between data quantity and quality. Filtering out low‑reliability judgments improves the overall trustworthiness of the dataset but simultaneously reduces the number of judgments, potentially affecting statistical power. They suggest that researchers should report the proportion of filtered assessments and the reliability thresholds used, thereby enhancing transparency and reproducibility.

In the broader context, the study underscores a persistent challenge in IR evaluation: the reliance on human relevance judgments that are inherently subjective. While expert assessors typically achieve higher agreement (Kappa > 0.6, Alpha > 0.5), they are costly and time‑consuming. Student assessors offer a cost‑effective alternative, but only if their judgments are rigorously validated and, when necessary, filtered. The authors propose that future work should extend this reliability‑focused approach to other domains (e.g., medical, legal) and to mixed assessor pools (experts combined with novices) to examine the generalizability of the findings.

In conclusion, the paper provides empirical evidence that relevance assessments obtained from students exhibit moderate to low inter‑assessor agreement, and that this disagreement can measurably affect retrieval‑system evaluation outcomes. By applying both Fleiss’ Kappa and Krippendorff’s Alpha, and by systematically removing assessments that fall below predefined reliability thresholds, researchers can mitigate the impact of noisy judgments. The authors recommend that IR studies either employ filtered, high‑reliability assessment sets or, at a minimum, fully disclose disagreement rates and the methods used to handle them, thereby strengthening the credibility of evaluation results.