CORE: Comprehensive Ontological Relation Evaluation for Large Language Models

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Large Language Models (LLMs) perform well on many reasoning benchmarks, yet existing evaluations rarely assess their ability to distinguish between meaningful semantic relations and genuine unrelatedness. We introduce CORE (Comprehensive Ontological Relation Evaluation), a dataset of 225K multiple-choice questions spanning 74 disciplines, together with a general-domain open-source benchmark of 203 rigorously validated questions (Cohen’s Kappa = 1.0) covering 24 semantic relation types with equal representation of unrelated pairs. A human baseline from 1,000+ participants achieves 92.6% accuracy (95.1% on unrelated pairs). In contrast, 29 state-of-the-art LLMs achieve 48.25-70.9% overall accuracy, with near-ceiling performance on related pairs (86.5-100%) but severe degradation on unrelated pairs (0-41.35%), despite assigning similar confidence (92-94%). Expected Calibration Error increases 2-4x on unrelated pairs, and a mean semantic collapse rate of 37.6% indicates systematic generation of spurious relations. On the CORE 225K MCQs dataset, accuracy further drops to approximately 2%, highlighting substantial challenges in domain-specific semantic reasoning. We identify unrelatedness reasoning as a critical, under-evaluated frontier for LLM evaluation and safety.


💡 Research Summary

The paper introduces CORE (Comprehensive Ontological Relation Evaluation), a new benchmark designed to assess large language models' (LLMs) ability to distinguish genuine semantic relations from the absence of any relation. CORE consists of two components: (1) a massive corpus of 225,000 multiple-choice questions covering 74 academic disciplines, each formatted as an analogy "A:B → C:?"; and (2) a rigorously validated subset of 203 questions that evenly balances 24 distinct semantic relation types (e.g., agent-instrument, cause-effect, synonym) with a dedicated "unrelated" class. Human annotation involved over 1,000 participants and achieved perfect inter-annotator agreement (Cohen's κ = 1.0). Human performance on the benchmark is 92.6% overall and 95.1% on unrelated pairs, indicating that people can reliably recognize when no meaningful relation exists.
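To make the item format concrete, here is a minimal sketch of how a CORE-style analogy question might be represented and scored. The field names, the `UNRELATED` sentinel option, and the relation labels are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass


@dataclass
class CoreItem:
    """One hypothetical "A:B -> C:?" analogy item (schema is assumed)."""
    pair: tuple            # the cue pair (A, B)
    stem: str              # the C term
    options: list          # candidate answers, including an "UNRELATED" option
    answer: str            # gold label
    relation: str          # one of the 24 relation types, or "unrelated"


def score(items, predict):
    """Return (overall accuracy, accuracy on the unrelated subset)."""
    correct = sum(predict(it) == it.answer for it in items)
    unrelated = [it for it in items if it.relation == "unrelated"]
    unrelated_correct = sum(predict(it) == it.answer for it in unrelated)
    overall = correct / len(items)
    unrelated_acc = unrelated_correct / len(unrelated) if unrelated else None
    return overall, unrelated_acc
```

Splitting the score by the `relation` field is what exposes the asymmetry the paper reports: a model can look strong overall while failing almost entirely on the unrelated subset.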

The authors evaluate 29 state-of-the-art LLMs (spanning 8B to 405B parameters, various architectures, and training regimes) using a uniform prompting and deterministic inference protocol. Results reveal a striking asymmetry: on related pairs, models achieve near-ceiling accuracy (86%–100%), comparable to humans. However, on unrelated pairs, accuracy collapses to 0%–41%, despite models reporting high confidence scores (≈92%–94%). Expected Calibration Error (ECE) on unrelated pairs is 2–4× higher than on related pairs, confirming severe miscalibration. The authors also introduce a "semantic collapse rate," the proportion of unrelated pairs incorrectly classified as having a relation, which averages 37.6% across models; while this is better than the ≈75% error rate random guessing would produce, it still indicates systematic generation of spurious relations.
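The two diagnostics above can be sketched in a few lines, under simple assumptions: ECE with standard equal-width confidence binning, and a collapse rate computed as the fraction of unrelated items on which the model chose any relational option. The `UNRELATED` label name is an assumption for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-binned ECE: weighted mean |accuracy - confidence| per bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - conf)
    return ece


def semantic_collapse_rate(predictions, unrelated_label="UNRELATED"):
    """Fraction of unrelated-pair predictions that asserted some relation.

    `predictions` holds the model's answers on the unrelated subset only.
    """
    return sum(p != unrelated_label for p in predictions) / len(predictions)
```

A model that answers unrelated items with ~93% confidence but only ~30% accuracy will show exactly the pattern reported: a large per-bin gap between confidence and accuracy, and a high collapse rate.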

When the same models are run on the full 225K CORE dataset, overall accuracy drops to roughly 2%, underscoring that difficulty scales dramatically with domain-specific content. The paper argues that this failure mode is critical for safety-sensitive applications (medical decision support, legal reasoning, financial forecasting), where falsely inferring a relation could lead to harmful outcomes.

To address these shortcomings, the authors propose several research directions: (1) augment training data with a substantial proportion of unrelated examples to teach models explicit “no‑relation” detection; (2) design loss functions or auxiliary objectives that penalize the generation of spurious relational structures; (3) apply post‑hoc calibration techniques (e.g., temperature scaling, Dirichlet calibration) to align confidence with true correctness, especially on negative examples; and (4) develop evaluation pipelines that routinely include unrelatedness reasoning as a core metric.
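Of the proposed directions, temperature scaling (direction 3) is the simplest to illustrate. Below is a minimal post-hoc calibration sketch that fits a single temperature T by grid search to minimize negative log-likelihood on held-out (logits, label) pairs; a real implementation would typically optimize T with gradient descent rather than a grid.

```python
import math


def softmax(logits, T):
    """Temperature-scaled softmax (numerically stabilized)."""
    m = max(z / T for z in logits)
    exps = [math.exp(z / T - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature minimizing NLL on a held-out validation set."""
    grid = grid or [0.25 * k for k in range(1, 41)]  # T in (0, 10]

    def nll(T):
        return -sum(math.log(softmax(row, T)[y])
                    for row, y in zip(logit_rows, labels))

    return min(grid, key=nll)
```

When a model is overconfident, the fitted T exceeds 1, flattening the softmax so that reported confidence comes closer to observed accuracy; note that temperature scaling cannot change which option is ranked highest, so it fixes calibration on unrelated pairs but not accuracy.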

In summary, CORE fills a gap in current LLM evaluation by providing a large‑scale, balanced test of both relational understanding and the ability to recognize the absence of relations. The findings reveal that while modern LLMs excel at identifying existing semantic links, they systematically hallucinate relations where none exist, leading to over‑confidence and poor calibration. This work highlights “unrelatedness reasoning” as an under‑explored yet essential frontier for robust, trustworthy AI.

