LeWiDi-2025 at NLPerspectives: Third Edition of the Learning with Disagreements Shared Task
Many researchers have reached the conclusion that AI models should be trained to be aware of the possibility of variation and disagreement in human judgments, and evaluated on their ability to recognize such variation. The LeWiDi series of shared tasks on Learning With Disagreements was established to promote this approach to training and evaluating AI models, by making suitable datasets more accessible and by developing evaluation methods. The third edition of the task builds on this goal by extending the LeWiDi benchmark to four datasets spanning paraphrase identification, irony detection, sarcasm detection, and natural language inference, with labeling schemes that include not only categorical judgments, as in previous editions, but ordinal judgments as well. Another novelty is that we adopt two complementary paradigms to evaluate disagreement-aware systems: the soft-label approach, in which models predict population-level distributions of judgments, and the perspectivist approach, in which models predict the interpretations of individual annotators. Crucially, we moved beyond standard metrics such as cross-entropy and tested new evaluation metrics for the two paradigms. The task attracted diverse participation, and the results provide insights into the strengths and limitations of methods for modeling variation. Together, these contributions strengthen LeWiDi as a framework and provide new resources, benchmarks, and findings to support the development of disagreement-aware technologies.
💡 Research Summary
The paper presents LeWiDi‑2025, the third edition of the “Learning With Disagreements” shared task, co‑located with the NLPerspectives workshop at EMNLP 2025. Building on the first two editions, LeWiDi‑2025 expands the benchmark to four heterogeneous, inherently subjective NLP tasks: conversational sarcasm detection (CSC), multilingual irony detection (MP), natural‑language inference with error‑aware labeling (VEN), and paraphrase detection (Par). The datasets cover English, Arabic, Dutch, French, German, Hindi, Italian, Portuguese, and Spanish, and include rich annotator metadata (gender, age, nationality, ethnicity, occupation). Two complementary evaluation paradigms are introduced.
Task A (soft‑label prediction) requires systems to output a probability distribution over all possible labels for each item, reflecting the population‑level disagreement. Unlike previous editions that relied solely on cross‑entropy, LeWiDi‑2025 adopts distance‑based metrics that are better suited for multiclass, multilabel, and ordinal settings. Specifically, the Average Manhattan Distance is used for binary and multiclass datasets (MP, VEN), while the Average Wasserstein Distance (with a ground‑cost matrix defined by absolute label position differences) evaluates ordinal Likert‑scale datasets (Par, CSC). These metrics capture both the magnitude and the direction of distributional errors, offering a more interpretable measure of how well a model mirrors human disagreement.
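The two Task A metrics can be sketched compactly. The following is a minimal illustration, not the organizers' official scoring code: it assumes predictions and gold soft labels are given as rows of probability distributions, and it uses the standard identity that, for a one-dimensional ordinal scale with ground cost |i − j|, the Wasserstein distance equals the L1 distance between the two cumulative distributions.

```python
import numpy as np

def avg_manhattan(preds, golds):
    """Average Manhattan Distance for binary/multiclass soft labels.

    preds, golds: arrays of shape (n_items, n_labels); each row is a
    probability distribution over the label set.
    """
    return float(np.mean(np.sum(np.abs(preds - golds), axis=1)))

def avg_wasserstein(preds, golds):
    """Average Wasserstein Distance for ordinal (Likert-scale) soft labels.

    With ground cost |i - j| between label positions, the 1-D Wasserstein
    distance reduces to the L1 distance between cumulative distributions.
    """
    cdf_p = np.cumsum(preds, axis=1)
    cdf_g = np.cumsum(golds, axis=1)
    return float(np.mean(np.sum(np.abs(cdf_p - cdf_g), axis=1)))
```

For example, predicting all mass on label 1 when the gold distribution sits on label 3 of a 3-point scale yields a Manhattan distance of 2 but a Wasserstein distance of 2 as well; on longer scales the Wasserstein metric grows with how far apart the distributions sit, which is what makes it suitable for ordinal data.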
Task B (perspectivist prediction) asks participants to model the labeling behavior of individual annotators. For nominal categories (MP, VEN) performance is measured by the Average Error Rate (AER), which computes the proportion of mismatched annotator‑item pairs. For ordinal scales (Par, CSC) the Average Normalized Absolute Distance (ANAD) is employed; it normalizes the absolute label deviation by the scale range and expresses it as a percentage, enabling fair comparison across different Likert ranges.
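The Task B metrics are simpler still. Below is an illustrative sketch based on the definitions above (proportion of mismatched annotator-item pairs for AER; absolute deviation normalized by the scale range, as a percentage, for ANAD); the function names and argument conventions are ours, not the task's reference implementation.

```python
def average_error_rate(preds, golds):
    """AER: fraction of annotator-item pairs whose predicted label
    differs from the label the annotator actually assigned."""
    mismatches = sum(p != g for p, g in zip(preds, golds))
    return mismatches / len(golds)

def anad(preds, golds, scale_min, scale_max):
    """ANAD: mean absolute label deviation, normalized by the Likert
    scale range and expressed as a percentage."""
    scale_range = scale_max - scale_min
    total = sum(abs(p - g) for p, g in zip(preds, golds))
    return 100.0 * total / (scale_range * len(golds))
```

Because ANAD divides by the scale range, a one-point error on a 6-point scale and a one-point error on a 4-point scale are penalized differently, which is what allows comparison across datasets with different Likert ranges.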
The competition ran in three phases: a practice phase with full training and development data (including metadata) and a public leaderboard; an evaluation phase with unseen test data where missing submissions were filled with organizer baselines; and a post‑campaign phase where test data and gold labels were released for long‑term research. Two simple baselines—a random predictor and a most‑frequent‑label predictor—were provided to lower the entry barrier.
Participation was modest but focused: 53 individuals registered, 15 teams submitted predictions, and 9 system papers were accepted. Analysis of the results shows that models leveraging the soft‑label paradigm, especially those that directly optimize distance‑based losses, achieved superior performance on multiclass and ordinal tasks. In the perspectivist setting, approaches that incorporated annotator embeddings or demographic features were able to reduce AER and ANAD substantially, highlighting the value of modeling individual bias.
Overall, LeWiDi‑2025 demonstrates that disagreement should be treated as a signal rather than noise. By providing unified data formats, extensive annotator metadata, and novel evaluation metrics, the task establishes a robust benchmark for disagreement‑aware NLP. The authors argue that these resources will catalyze future research on models that can both predict population‑level label distributions and adapt to specific annotator perspectives, thereby advancing more nuanced, fair, and human‑aligned language technologies.