Uncovering Autoregressive LLM Knowledge of Thematic Fit in Event Representation


The thematic fit estimation task measures semantic arguments’ compatibility with a specific semantic role for a specific predicate. We investigate whether LLMs have consistent, expressible knowledge of event arguments’ thematic fit by experimenting with various prompt designs, manipulating input context, reasoning, and output forms. We set a new state of the art on thematic fit benchmarks, but show that closed- and open-weight LLMs respond differently to our prompting strategies: closed models achieve better scores overall and benefit from multi-step reasoning, but they perform worse at filtering out generated sentences incompatible with the specified predicate, role, and argument.


💡 Research Summary

The paper investigates whether large autoregressive language models (LLMs) possess consistent, expressible knowledge of thematic fit—the compatibility between a predicate, an argument, and a semantic role—by systematically probing them with a variety of prompt designs. The authors frame the problem along three orthogonal axes: (1) reasoning style (simple single‑prompt versus chain‑of‑thought “step‑by‑step” prompting), (2) input format (raw lemma tuples versus sentences generated by the model), and (3) output format (numeric Likert‑style scores versus categorical labels that are later mapped to numeric values). Crossing the two options on each axis yields eight experiments (Exp 1.1–4.2).
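The 2×2×2 design can be sketched as an enumeration over the three axes. The axis and option names below are illustrative, not the paper's exact terminology:

```python
from itertools import product

# Illustrative options for each of the three orthogonal axes.
REASONING = ["simple", "step_by_step"]          # reasoning style
INPUT_FMT = ["lemma_tuple", "generated_sentence"]  # input format
OUTPUT_FMT = ["numeric_likert", "categorical_label"]  # output format

# Crossing the two options on each axis yields the eight experiments.
experiments = [
    {"reasoning": r, "input": i, "output": o}
    for r, i, o in product(REASONING, INPUT_FMT, OUTPUT_FMT)
]

assert len(experiments) == 8  # Exp 1.1 through 4.2
```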

Two closed‑weight models (GPT‑4.1 and GPT‑4‑Turbo) and two open‑weight models (Llama 3.2 and Qwen 2.5) are evaluated on four human‑rated thematic‑fit datasets: McRae (1,444 items), Padó (414 items), Fer‑Ins (instrument role) and Fer‑Loc (location role). Performance is measured with Spearman’s rank correlation (ρ) between model predictions and human averages.
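Spearman’s ρ is the Pearson correlation computed over the rank vectors of the two score lists, with ties assigned average ranks. A minimal pure-Python sketch of the evaluation metric (not the paper’s code):

```python
def rank(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Perfectly monotone predictions give rho = 1.0
print(spearman([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
```

In practice one would use `scipy.stats.spearmanr`; the hand-rolled version above just makes the metric explicit.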

Key findings: closed‑weight models achieve the highest overall ρ, especially when combining step‑by‑step prompting with generated sentences and categorical output (Exp 4.2). Chain‑of‑thought prompting consistently outperforms simple prompting, indicating that breaking the task into sub‑steps (listing argument properties, required role properties, compatible roles, mismatches, and finally scoring) helps the model reason more accurately. Providing full sentences rather than bare lemmas improves performance for the closed models, whereas open models often generate incoherent sentences; the authors’ four‑stage semantic‑coherence filtering pipeline mitigates this, and the open models prove more precise at filtering out incompatible sentences. Categorical output is more robust than raw numeric output, likely because it aligns better with the model’s textual generation strengths and avoids numerical instability.
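Scoring categorical outputs requires mapping each label back to a numeric value before correlating with human ratings. The label set and numeric values in this sketch are hypothetical, not the paper’s actual scheme:

```python
# Hypothetical five-point mapping from categorical fit labels to
# Likert-style scores; the paper's actual labels and values may differ.
LABEL_TO_SCORE = {
    "very poor": 1,
    "poor": 2,
    "moderate": 3,
    "good": 4,
    "very good": 5,
}

def parse_label(model_output: str) -> int:
    """Normalize a model's categorical answer and map it to a score."""
    key = model_output.strip().lower()
    if key not in LABEL_TO_SCORE:
        raise ValueError(f"unrecognized label: {model_output!r}")
    return LABEL_TO_SCORE[key]
```

Exact-key lookup (rather than substring matching) avoids confusing “poor” with “very poor”; unrecognized outputs raise an error so malformed generations can be counted or filtered rather than silently misscored.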

The study contributes a novel, zero‑shot prompting methodology for thematic‑fit estimation, surpassing previous state‑of‑the‑art distributional and neural baselines that relied on supervised SRL data. It also offers a detailed comparative analysis of closed versus open LLMs, revealing that closed models excel in overall accuracy while open models can be more selective in discarding incompatible generated sentences. The authors release their code and prompt templates, providing a practical blueprint for future work that wishes to leverage LLMs for fine‑grained semantic evaluation without additional fine‑tuning.

