Deconstructing sentence disambiguation by joint latent modeling of reading paradigms: LLM surprisal is not enough
Using temporarily ambiguous garden-path sentences (“While the team trained the striker wondered …”) as a test case, we present a latent-process mixture model of human reading behavior across four different reading paradigms (eye tracking, uni- and bidirectional self-paced reading, Maze). The model distinguishes between garden-path probability, garden-path cost, and reanalysis cost, and yields more realistic processing cost estimates by taking into account trials with inattentive reading. We show that the model is able to reproduce empirical patterns with regard to rereading behavior, comprehension question responses, and grammaticality judgments. Cross-validation reveals that the mixture model also has better predictive fit to human reading patterns and end-of-trial task data than a mixture-free model based on GPT-2-derived surprisal values. We discuss implications for future work.
💡 Research Summary
This paper tackles the longstanding problem of how readers process temporarily ambiguous garden‑path sentences, using the classic example “While the team trained the striker wondered …” (with and without a comma). The authors collect data from four widely used reading paradigms—eye‑tracking, unidirectional self‑paced reading (SPR), bidirectional self‑paced reading (BSPR), and the Maze task—and they ask participants to answer comprehension questions and make grammaticality judgments after each trial. Traditional analyses compare mean reading times or regression probabilities across conditions, implicitly assuming that all observed measures arise from a single deterministic process. The authors argue that this approach ignores two crucial facts: (1) reading time, regressions, and task responses are generated by a shared set of latent cognitive mechanisms, and (2) each trial may follow a different mixture of processes (e.g., inattentive skimming, guessing, covert reanalysis, overt rereading, or successful parsing).
To capture this complexity, they develop a hierarchical multinomial processing tree (MPT) mixture model. The model posits five mixture components for reading times at the critical word and its spillover region: (i) inattentive reading, (ii) attentive reading without garden‑pathing, (iii) garden‑pathing with covert reanalysis, (iv) garden‑pathing with covert reanalysis plus an additional reanalysis cost, and (v) garden‑pathing that triggers an overt regression. Each component follows a log‑normal distribution whose log‑scale mean is the sum of a baseline latency (µ), an attentional cost, a garden‑path cost, and possibly a reanalysis or regression cost; the variance for inattentive trials (σ₁) is constrained to be smaller than that for attentive trials (σ₂). A shift parameter captures non‑decision time (encoding, motor execution) and is allowed to differ across paradigms (smaller for eye‑tracking).
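The additive cost structure above can be sketched as a mixture of shifted lognormals. This is a minimal illustration, not the paper's implementation: the function names, the exact composition of each component's log-scale mean, and all numeric values below are invented for demonstration.

```python
import numpy as np
from scipy.stats import lognorm

def component_logmeans(mu, attn_cost, gp_cost, reanalysis_cost, regression_cost):
    """Log-scale means for the five mixture components described in the text:
    (i) inattentive, (ii) attentive/no garden path, (iii) garden path with
    covert reanalysis, (iv) covert reanalysis plus extra reanalysis cost,
    (v) garden path triggering an overt regression."""
    return np.array([
        mu,                                           # (i) inattentive baseline
        mu + attn_cost,                               # (ii) attentive reading
        mu + attn_cost + gp_cost,                     # (iii) covert reanalysis
        mu + attn_cost + gp_cost + reanalysis_cost,   # (iv) extra reanalysis cost
        mu + attn_cost + gp_cost + regression_cost,   # (v) overt regression
    ])

def mixture_pdf(rt, weights, logmeans, sigma1, sigma2, shift):
    """Density of a shifted five-component lognormal mixture.
    sigma1 (inattentive trials) is constrained to be smaller than
    sigma2 (attentive trials), mirroring the constraint in the text."""
    assert sigma1 < sigma2
    sigmas = np.array([sigma1, sigma2, sigma2, sigma2, sigma2])
    x = rt - shift                           # subtract non-decision time
    dens = lognorm.pdf(x, s=sigmas, scale=np.exp(logmeans))
    return float(np.dot(weights, dens))

# Example: density of a 450 ms reading time under made-up parameters.
w = np.array([0.1, 0.5, 0.2, 0.1, 0.1])     # mixture weights (sum to 1)
lm = component_logmeans(mu=5.3, attn_cost=0.2, gp_cost=0.3,
                        reanalysis_cost=0.2, regression_cost=0.4)
d = mixture_pdf(450.0, w, lm, sigma1=0.2, sigma2=0.5, shift=100.0)
```

In the actual model the weights themselves are functions of the latent process probabilities (attention, garden-pathing, and so on) rather than free parameters, which is what ties the reading-time distribution to the outcome categories described next.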
The latent cognitive processes are encoded by a set of probabilities: p_attentive (attention), p_gp (garden‑path occurrence), p_overt (choice of overt reanalysis), p_postpone (postponing covert reanalysis to the spillover region), p_success_o / p_success_c (success of overt or covert reanalysis), p_infer (pragmatic inference when answering comprehension questions), and p_base_regress (baseline regression unrelated to garden‑pathing). These probabilities vary across participants, items, and conditions (comma vs. no‑comma) via crossed random effects. The model links the mixture proportions of the reading‑time distribution to a multinomial distribution over the four possible outcome categories (accept/no‑regression, accept/regression, reject/no‑regression, reject/regression). For example, the probability of observing a “yes” answer with no regression is derived as a sum of four paths: inattentive + guessing, attentive + no garden‑path + inference, attentive + garden‑path + failed covert reanalysis, and attentive + garden‑path + successful covert reanalysis + inference. This explicit mapping allows simultaneous estimation of process probabilities and cost parameters from the joint data.
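The four-path sum for a "yes" answer with no regression can be written out directly. The function below is one plausible reading of that derivation; the 0.5 guessing probability and the way the baseline-regression term composes with the covert paths are our assumptions, not details taken from the paper.

```python
def p_yes_no_regress(p_att, p_gp, p_overt, p_succ_c, p_infer,
                     p_base_regress=0.0, p_guess_yes=0.5):
    """Sum over the four latent paths that end in (answer = yes, regression = no).

    All covert paths exclude an overt regression (factor 1 - p_overt);
    we additionally assume no baseline regression occurred (1 - p_base_regress).
    """
    path1 = (1 - p_att) * p_guess_yes                          # inattentive + guess "yes"
    path2 = p_att * (1 - p_gp) * p_infer                       # no garden path + inference
    path3 = p_att * p_gp * (1 - p_overt) * (1 - p_succ_c)      # failed covert reanalysis
    path4 = p_att * p_gp * (1 - p_overt) * p_succ_c * p_infer  # successful covert + inference
    return (path1 + path2 + path3 + path4) * (1 - p_base_regress)

# Illustrative values (invented, not estimates from the paper).
p = p_yes_no_regress(p_att=0.9, p_gp=0.7, p_overt=0.3,
                     p_succ_c=0.4, p_infer=0.8)
```

The remaining three outcome cells (accept/regression, reject/no-regression, reject/regression) would be built the same way, and the four cell probabilities must sum to one, which is what makes the multinomial likelihood well defined.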
To evaluate the model, the authors compare it against two surprisal‑based baselines: (1) a simple linear regression of reading times on GPT‑2‑derived word surprisal, and (2) a hybrid model that adds surprisal as a predictor while retaining the mixture structure. Using five‑fold cross‑validation, they compute log‑likelihoods, WAIC, and prediction accuracy for each paradigm. The latent mixture model consistently outperforms both baselines, achieving higher likelihoods and lower prediction error for eye‑tracking, SPR, BSPR, and Maze data. Notably, the mixture model captures paradigm‑specific effects: higher baseline regression probability in eye‑tracking, lower regression cost in BSPR, and distinct variance structures for inattentive versus attentive trials. It also reveals that garden‑path probability is substantially higher in the no‑comma condition, while the success probability of reanalysis remains low, indicating that many readers either give up or rely on pragmatic inference (captured by p_infer) rather than fully reconstructing the correct parse.
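The five-fold cross-validation procedure can be sketched generically: fit each candidate model on four folds, sum the predictive log-density over the held-out fold, and compare totals across models. The single-lognormal "model" below is a stand-in for the actual mixture and surprisal models, and the data are simulated; only the fold-splitting logic reflects the evaluation described above.

```python
import numpy as np

rng = np.random.default_rng(0)
rts = rng.lognormal(mean=5.5, sigma=0.6, size=500)   # simulated reading times (ms)

def kfold_loglik(data, k, fit, logpdf):
    """Total held-out log-likelihood over k folds: higher is better."""
    folds = np.array_split(rng.permutation(data), k)
    total = 0.0
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        total += logpdf(test, fit(train)).sum()
    return total

# Stand-in candidate model: a single lognormal fit by moments of log(rt).
def fit_lognormal(train):
    logs = np.log(train)
    return logs.mean(), logs.std()

def lognormal_logpdf(x, params):
    m, s = params
    return -np.log(x * s * np.sqrt(2 * np.pi)) - (np.log(x) - m) ** 2 / (2 * s ** 2)

ll = kfold_loglik(rts, k=5, fit=fit_lognormal, logpdf=lognormal_logpdf)
```

In the paper's comparison, each candidate (mixture, surprisal-only regression, hybrid) would take the place of `fit_lognormal`/`lognormal_logpdf`, and the model with the higher held-out log-likelihood (or lower WAIC) wins.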
The results demonstrate that GPT‑2 surprisal alone cannot account for the observed processing costs of garden‑path sentences. While surprisal predicts a portion of the variance in reading times, it systematically underestimates the extra slowdown associated with reanalysis and the variability introduced by inattentive or guessing behavior. The mixture model’s ability to separate these sources provides empirical support for classic psycholinguistic theories that posit dedicated reanalysis mechanisms (e.g., Frazier’s garden‑path model, Marcus’s reanalysis hypothesis). Moreover, the inclusion of p_infer shows that participants often answer comprehension questions based on pragmatic expectations rather than on the actual syntactic parse, a nuance missed by pure surprisal models.
In sum, the paper offers a comprehensive, probabilistic framework that unifies reading‑time data, regression behavior, and task responses across multiple experimental paradigms. By modeling latent cognitive states and their mixture, it reveals that human sentence processing is fundamentally non‑deterministic and cannot be reduced to next‑word prediction probabilities derived from large language models. The work opens avenues for applying similar latent mixture models to other linguistic phenomena (e.g., ambiguity resolution, garden‑path effects in different languages) and for integrating richer neural network predictions with cognitively plausible process models.