Extending Minimal Pairs with Ordinal Surprisal Curves and Entropy Across Applied Domains
Authors: Andrew Katz
Andrew Katz
Department of Engineering Education, Virginia Tech
akatz4@vt.edu

Abstract

The minimal pairs paradigm of comparing model probabilities for contrasting completions has proven useful for evaluating linguistic knowledge in language models, yet its application has largely been confined to binary grammaticality judgments over syntactic phenomena. Additionally, standard prompting-based evaluation requires expensive text generation, may elicit post-hoc rationalizations rather than model judgments, and discards information about model uncertainty. We address both limitations by extending surprisal-based evaluation from binary grammaticality contrasts to ordinal-scaled classification and scoring tasks across multiple domains. Rather than asking models to generate answers, we measure the information-theoretic “surprise” (negative log probability) they assign to each position on rating scales (e.g., 1–5 or 1–9), yielding full surprisal curves that reveal both the model’s preferred response and its uncertainty via entropy. We explore this framework across four domains: social-ecological-technological systems classification, causal statement identification (binary and scaled), figurative language detection, and deductive qualitative coding. Across these domains, surprisal curves produce interpretable classification signals with clear minima near expected ordinal scale positions, and entropy over the completion tended to distinguish genuinely ambiguous items from easier items.

1 Introduction

Large language models (LLMs) are increasingly used for tasks such as classification, assessment, and decision-making across diverse domains (Brown et al., 2020; Chowdhery et al., 2022; OpenAI, 2023). There are suggestions that these models demonstrate emergent capabilities (Wei et al., 2022a; Berti et al.
, 2025) and can learn representations of the world (Yildirim and Paul, 2024). Myriad evaluation paradigms exist, yet such approaches face tradeoffs. For example, explicit prompting requires API access to a model and comparison of the generated output with ground-truth answers, but text generation can be expensive, and the generated reasoning (Wei et al., 2022b; Kojima et al., 2022) may constitute post-hoc rationalization rather than genuine understanding (Turpin et al., 2023; Lanham et al., 2023). Alternatively, binary outputs are very simple but lack the uncertainty quantification necessary for high-stakes applications (Guo et al., 2017). If asked directly, a model may be very confident in its answer, but that confidence may not be well calibrated (Geng et al., 2024). Finally, models may encode knowledge in their learned representations that they do not readily articulate through explicit generation. These limitations motivate the search for complementary evaluation approaches that can efficiently access models’ representations and uncertainties in a quantifiable way without depending on text generation alone. Our goal is to probe LLM representations by measuring their surprisal at encountering different tokens, as a means to (a) glean insight into those representations and (b) identify how LLMs could be used in downstream tasks via this mechanism.

2 Background and Related Work

2.1 Building on a Simple Idea: Surprisal as a Window into Models’ Representations

The key idea motivating this work is simple. We tend to form expectations about the world based in part on our internal representations of our surroundings (Hohwy, 2020; Sprevak and Smith, 2023; Pezzulo et al., 2022). When we encounter something unexpected, by definition, that is something to which we assigned a low probability.
Likewise, LLMs may also form internal representations of the world, which then inform their expectations over tokens they would encounter in text. Current autoregressive LLMs (by design) generate probability distributions over tokens conditioned on prior context. Logically, that means some tokens are more likely than others to appear in a generated completion, given the preceding tokens. For example, if a model with sufficient world knowledge is asked to complete the statement “The capital of France is...”, then the next token is more likely to be “Paris” than “Tokyo” because the model has learned that Paris, not Tokyo, is the capital of France (or at least that “Paris” more often ends such statements in its training data). Similarly, if the statement “The capital of France is now” concluded with “Tokyo”, that would be an unlikely completion and more surprising to the model (and to readers!). The basic hypothesis is that LLMs may have internal representations given their training and contexts, and those representations lead the models to be more or less surprised by particular completion tokens. Thus, if we want an indirect way to probe those representations, we can measure the model’s surprise at encountering different tokens. NB: when we say “the model is surprised,” we mean that in the information-theoretic sense, not in an anthropomorphizing sense.

The information-theoretic measure that captures this intuition is surprisal. Mathematically, surprisal is the negative log probability of an event, quantifying its unexpectedness. Lower surprisal corresponds to higher-probability events (or tokens), reflecting what a person (or the model) considers more “natural” or “expected.” In the extreme, for certain events where p(x) = 1, the surprisal is defined as 0.
That should make sense: finding out that something is true when you already know it is true is not surprising. Conversely, if p(x) = 0.001, then the surprisal associated with finding out that x occurred is very large, because an improbable event actually occurring is very surprising. An evaluation framework built on this idea may provide direct access to models’ learned probability distributions and representations without requiring text generation. Thus, rather than asking an LLM a question like “What is the answer?” and analyzing its generated text, we can measure surprisal for alternative completions: “The answer is X” versus “The answer is Y.” This is the premise behind the minimal pairs evaluation paradigm (Zhou et al., 2025).

Surprisal has deep connections to human language processing. Psycholinguistic research suggests that in some cases word surprisal correlates with reading times (Hale, 2001; Levy, 2008; Smith and Levy, 2013), eye-tracking measures (Demberg and Keller, 2008), and neural responses such as the N400 ERP component (Frank et al., 2015; Kuperberg and Jaeger, 2016), though this is still a matter of debate (Slaats and Martin, 2025). Hale (2001) proposed that processing difficulty is proportional to surprisal, formalizing the intuition that unexpected words are harder to process. Levy (2008) extended this to the surprisal theory of sentence processing, which has been studied extensively across languages and constructions (Smith and Levy, 2013). This suggests that some aspects of human language comprehension may involve predictive processing, in which the brain continuously generates expectations about upcoming input (Williams, 2018).
Our work leverages these insights: if surprisal reflects processing difficulty for humans, it may similarly reveal what LLMs find “natural” or “expected,” providing a window into their learned representations.

2.2 Formal Information Theory Foundations

Information theory, introduced by Shannon (1948), provides a mathematical foundation for quantifying information content and uncertainty in random variables, signals, or distributions (Lombardi et al., 2016). In that framework, the surprisal (or self-information) of an event x is defined as:

S(x) = log(1/P(x)) = −log P(x)    (1)

Surprisal quantifies the “unexpectedness” of observing x: rare events have high surprisal, while common events have low surprisal. Following from that definition, the entropy of a discrete probability distribution over X measures average surprisal:

H(X) = −∑_{x∈X} P(x) log P(x)    (2)

Entropy quantifies uncertainty: uniform distributions have higher entropy, while peaked distributions have lower entropy.

In language modeling, surprisal measures how unexpected a token is given its context (Jurafsky and Martin, 2000). For a token t_n following the context window t_{1:n−1}:

S(t_n | t_{1:n−1}) = −log P(t_n | t_{1:n−1})    (3)

Average surprisal across a corpus relates to perplexity, a standard language model quality metric (Jelinek et al., 1977).

2.3 Related Approaches and Prior Work

A prevalent evaluation paradigm involves prompting models to generate answers, often with reasoning chains (Wei et al., 2022b; Kojima et al., 2022). Large-scale benchmarks like HELM (Liang et al., 2022), GPQA (Rein et al., 2024), and BIG-bench (Srivastava et al., 2022) evaluate models across diverse tasks using prompting.
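As a concrete illustration, the quantities in Equations 1–2 translate directly into code. The following is a minimal sketch; the probabilities are invented for illustration, not drawn from any model:

```python
import math

def surprisal(p: float) -> float:
    """Self-information of an event with probability p (Equation 1), in nats."""
    return -math.log(p)

def entropy(dist: list[float]) -> float:
    """Average surprisal of a discrete distribution (Equation 2), in nats."""
    return sum(p * surprisal(p) for p in dist if p > 0)

# A certain event carries zero surprisal; an improbable one carries a lot.
print(surprisal(1.0))    # 0.0
print(surprisal(0.001))  # ~6.9 nats

# Uniform distributions maximize entropy; peaked ones minimize it.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # ~1.386 nats (log 4)
print(entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.168 nats
```

The same two functions suffice for everything that follows: surprisal curves are just surprisal evaluated at each scale position, and the uncertainty measure is the entropy of the induced distribution.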
When the benchmark involves multiple-choice questions, accuracy is easily computed, but if the task is more open-ended, then even another LLM may be used for evaluation (Gera et al., 2025; Jiang et al., 2025; Wang et al., 2025). While effective, prompting-based evaluation has limitations. Turpin et al. (2023) show that generated explanations may mislead humans, with models producing plausible-sounding reasoning that does not reflect their actual decision process. Lanham et al. (2023) found that models can generate convincing arguments for incorrect answers. Additionally, text generation is computationally expensive, limiting evaluation scale.

An alternative to token-output-based evaluation trains probing classifiers on model representations to test whether specific information is encoded (Belinkov et al., 2017; Conneau et al., 2018; Manigrasso et al., 2024). Linear probes on hidden states can detect information about syntax (Hewitt and Manning, 2019), semantics (Tenney et al., 2019), and facts (Petroni et al., 2019). However, probing has its own limitations. Rogers et al. (2020) noted that probes require training data and may not reflect how models actually use encoded information. Pimentel et al. (2020) further argued that probe performance depends on representation geometry, not just information content.

2.3.1 Prior Work on Surprisal-Based Analysis

The minimal pairs methodology entails comparing model behavior on carefully constructed sentence pairs that differ in a single linguistic feature. Marvin and Linzen (2018) introduced this approach for systematic language model evaluation, testing models on subject-verb agreement and other syntactic phenomena. The standard paradigm measures whether a model assigns higher probability to the grammatical sentence than to its ungrammatical counterpart.
This approach has also proven valuable for testing humans on reading tasks (Sprouse et al., 2013). Wilcox et al. (2018, 2019) extended minimal pairs evaluation to study how models represent syntactic structure, finding that surprisal patterns can reveal hierarchical representations even in models with recurrent architectures. Warstadt et al. (2020) further scaled the approach by creating BLiMP (Benchmark of Linguistic Minimal Pairs), a benchmark comprising 67 datasets with 67,000 minimal pairs targeting grammaticality judgments across morphological, syntactic, and semantic phenomena. More recently, Hu et al. (2024) demonstrated that LLM probability judgments align with human grammaticality intuitions across a range of constructions.

Beyond minimal pairs, surprisal has been used to evaluate masked language models (Salazar et al., 2020) and to analyze syntactic knowledge (Misra et al., 2020; Wilcox et al., 2020). These works demonstrated that probability-based evaluation is methodologically promising for assessing aspects of linguistic knowledge. However, they focused primarily on grammatical phenomena such as syntax, morphology, and basic semantics. Our work extends this paradigm in two directions: first, from binary grammaticality judgments to ordinal scales that capture degrees of confidence; second, from linguistic phenomena to classification tasks across diverse applied domains.

Another critical methodological consideration for probability-based evaluation is surface form competition. This phenomenon refers to the observation that token probabilities are influenced by surface-level features independent of semantic content. Holtzman et al. (2021) demonstrated that the highest-probability answer is not always semantically correct because probability mass is distributed across surface variants (e.g., “Paris”, “ Paris”, “paris”, “PARIS”).
This means that naive probability comparisons can yield misleading results. We address surface form competition through careful prompt design: ensuring consistent formatting, using leading spaces to match natural tokenization, and testing multiple response formats as robustness checks. Additionally, Zhao et al. (2021) proposed contextual calibration for few-shot learning, estimating and correcting for content-free prior biases. While we do not apply explicit calibration in the present work, the principle of being aware of surface-level biases informs our experimental design.

2.3.2 LLM Calibration and Confidence

Recent work has also examined whether LLMs can reliably express confidence in their outputs. Kadavath et al. (2022) found that models can partially self-assess their knowledge, producing higher confidence for correct answers, but this self-assessment is imperfect. Geng et al. (2024) provided a comprehensive survey of confidence estimation and calibration in LLMs, distinguishing among verbalized confidence (asking models to state their confidence), consistency-based methods (measuring agreement across samples), and probability-based methods (our approach). These studies suggest that LLMs still tend to be poorly calibrated when reporting their own confidence. As such, probability-based confidence measures have a theoretical advantage over an LLM’s reported confidence: they directly access the model’s learned distribution instead of relying on the model’s ability to accurately report its own uncertainty. Xiong et al. (2023) contrasted verbalized uncertainty with actual probability distributions, finding that expressed confidence often diverges from underlying probabilities. Our entropy-based uncertainty quantification offers a principled alternative, aligned with those findings, that does not require models to articulate their confidence.
2.4 Contributions of This Work

Despite the substantial literature at the intersection of LLM evaluation and capability assessment, a gap remains: very few unified frameworks exist for applying surprisal-based evaluation to classification tasks beyond grammaticality. Existing work has explored surprisal-based LLM evaluation via minimal pairs analysis (Leivada et al., 2025; Pistotti et al., 2025; Sinha et al., 2023; Park et al., 2021); the calibration literature emphasizes verbalized confidence or sampling-based methods; and few, if any, systematic studies have examined how context affects surprisal-based classification across diverse domains. To be sure, log-probability scoring for multiple-choice tasks is already used in practice: Brown et al. (2020) selected answers based on completion likelihood in GPT-3 evaluations, benchmarks such as MMLU (Hendrycks et al., 2020) offer log-prob scoring as an evaluation mode, and the commonly used lm-evaluation-harness framework (Gao et al., 2024) implements log-likelihood-based scoring as a standard evaluation method across hundreds of benchmarks. However, these applications tend to use log-probabilities as a scoring convenience rather than as a formalized evaluation framework. Moreover, they tend not to extend beyond binary or multiple-choice answer selection to ordinal scales or to systematic uncertainty quantification through entropy.

We address this gap by extending the minimal pairs paradigm in two directions. First, we move from binary grammaticality judgments to ordinal scales (1–5, 1–9) that capture degrees of confidence and enable richer uncertainty quantification through entropy. Second, we apply this extended paradigm to practical classification tasks such as causal reasoning, figurative language detection, entity classification, and qualitative data labeling.
Doing so helps demonstrate the approach’s generality beyond linguistic acceptability. As such, this framework may be particularly valuable for large-scale evaluation, uncertainty-aware applications, and understanding implicit model knowledge.

3 Methodology: Surprisal-Based Evaluation Framework

3.1 Core Formulation

3.1.1 Surprisal for Classification and Scoring

Consider a classification task in which you are observing someone else being asked to identify whether a “vegetable” is an instance of an animal or a plant. You would likely be more surprised to hear that person say that “vegetable” is an animal than to hear that it is a plant. That is because you have existing knowledge about the world that tempers your expectations of what you will hear. That “tempering” is just conditioning a probability distribution on information about the world. In other words, you are computing p(“vegetable” is a plant | world knowledge) and p(“vegetable” is an animal | world knowledge). Based on those probabilities, you form expectations of what you will hear, leading to the inverse relationship between surprisal and probability.

Most current LLMs produce probability distributions over tokens. We can leverage this property to evaluate their performance on a wide range of tasks. A typical minimal pairs experiment would compare the surprisal of two options, yielding a classification of which option is more likely (or less surprising) given the context. Such comparisons are commonly performed in grammar tasks (He et al., 2024; Hu et al., 2024), but we extend the idea to classification tasks: instead of asking which of two competing completions is more or less surprising, we ask which of a range of options on a scale is more or less surprising.
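To make the two-option case concrete, suppose we already have log-probabilities for each candidate completion from some model-scoring interface. A minimal sketch, with invented values standing in for real model logits:

```python
import math

# Hypothetical log-probabilities for completing
# '"Vegetable" is an instance of a ...' -- the numbers are made up
# for illustration, not produced by any actual model.
log_probs = {"plant": math.log(0.82), "animal": math.log(0.03)}

def surprisal_of(option: str) -> float:
    """Surprisal of a completion: its negative log probability."""
    return -log_probs[option]

# Minimal-pairs decision rule: pick the less surprising completion.
prediction = min(log_probs, key=surprisal_of)

# The surprisal gap between the options is a rough confidence signal.
margin = abs(surprisal_of("plant") - surprisal_of("animal"))

print(prediction)  # plant
```

The scaled version described next simply replaces the two-element dictionary with one entry per scale position and takes the same argmin.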
For our approach, we focus on classification tasks where we want to determine a model’s preference among alternative completions of a statement {a_1, a_2, ..., a_n} given context c. Traditional prompting might ask the model to generate an answer. Instead, we construct prompts whose contexts end just before the target token and measure surprisal for each alternative, often over a 5-point or 9-point scale.

Extending this to ordinal scales, consider the scenario in which you are taking a survey and are presented with the question: “On a scale from 1–5, where 1 means ‘strongly disagree’ and 5 means ‘strongly agree’, how strongly do you agree with the following statement: Sunsets are beautiful.” You might be asked to respond with a number from 1 to 5, and you might be surprised to see someone respond with a “1” given what you know about many people enjoying sunsets. Moreover, you might be less surprised to see a rating of “2” than a “1”, and so on. You have learned something about the world and human preferences that leads you to expect people to respond with a higher number on that 1–5 preference scale. We leverage the same idea here and frame the task as a survey question for the LLM to respond to. The prompt sets up the task and rating scale as part of the context and ends just before the target token, which in this case is the number on that scale. We then measure surprisal for each scale position (e.g., “1”, “2”, “3”, “4”, “5”). In theory, the position with minimum surprisal represents the model’s most “natural” or “expected” response. Additionally, by measuring the surprisal for each position, we can quantify the model’s uncertainty or confidence in its response. A steeper drop in surprisal across the scale positions might indicate lower uncertainty (viz., lower entropy), while a more gradual drop indicates higher uncertainty (i.e., higher entropy).
Visually, this corresponds to a spikier or a smoother surprisal curve, respectively. Mathematically, this is expressed as follows.

Definition 1 (Completion-Based Surprisal). Given context c and a set of alternative completions A = {a_1, ..., a_n}, the surprisal of alternative a_i is:

S(a_i | c) = −log P(a_i | c)    (4)

where P(a_i | c) is the model’s probability of generating token a_i given context c.

The alternative completion with minimum surprisal, a*, represents the model’s most “natural” or “expected” completion:

a* = argmin_{a_i ∈ A} S(a_i | c)    (5)

The central idea is that identifying a* for a model can reveal something about the model’s learned representations and understanding of the task. This applies across task types ranging from binary classification to ordinal scoring, as described below.

3.1.2 Task-Specific Applications

Binary Classification. For binary tasks (e.g., identifying a statement as expressing a causal vs. a non-causal relationship), we compare surprisal for two completions, similar to typical minimal pairs used in psycholinguistics:

Class = Positive if S(“T” | c) < S(“F” | c); Negative otherwise    (6)

where “T” and “F” denote the “True” and “False” completions, respectively. The completion with minimum surprisal is the one the model deems most likely given the context. The surprisal difference ΔS = |S(a_1 | c) − S(a_2 | c)| may also provide a confidence measure, as larger differences indicate stronger model preference, though the extent to which this relationship holds remains an open empirical question.

Ordinal Scoring. Moving beyond binary choices, we can frame tasks on ordinal scales (e.g., 1–5 or 1–9) where scale positions are not independent classes but are related through their ordering. This is our key extension of the minimal pairs paradigm.
For an n-point ordinal scale, we measure surprisal for each possible position and identify the model’s preferred score:

Score* = argmin_{s ∈ {1,...,n}} S(s | c)    (7)

The resulting surprisal curve across all scale positions provides richer information than a single classification decision. Moreover, different anchor wordings yield different task framings. For example, we can apply a bipolar scale (1 = “strongly disagree,” 5 = “strongly agree”) versus a unipolar scale (1 = “very low,” 5 = “very high”), and we can test multiple framings for robustness. The choice of scale length is itself an additional design variable: expanding from 1–5 to 1–9 may allow finer-grained evaluation, though the tradeoff between scale granularity and measurement quality is yet another open question.

3.1.3 Uncertainty Quantification

A key advantage of surprisal-based evaluation (compared with question-and-answer approaches) is uncertainty quantification through entropy. Although we could ask the model to report its own confidence or uncertainty, this may require additional training and calibration, neither of which we can assume is embedded in the model. However, by measuring surprisal over a range of possible ordinal completions, we can derive a probability distribution over the possible completions and use it for an entropy calculation. An important detail is how this conversion is performed. The model produces logits (unnormalized log-probabilities, from which we compute probabilities and thus surprisal) over its full vocabulary V at the target position. We restrict attention to the tokens corresponding to our predefined alternatives A = {a_1, ..., a_n} and renormalize over this restricted set.
We denote these renormalized probabilities P_A to distinguish them from the raw model probabilities P used in Definition 1:

P_A(a_i | c) = exp(logit(a_i | c)) / ∑_{j=1}^{n} exp(logit(a_j | c))    (8)

This renormalization ensures that ∑_{i=1}^{n} P_A(a_i | c) = 1, yielding a valid probability distribution over the alternatives of interest. Note that this discards probability mass assigned to tokens outside A (i.e., V \ A); in other words, we condition on the assumption that the model’s response is one of the predefined alternatives. This is analogous to a forced-choice experimental paradigm. Because renormalization preserves the ordering of probabilities, the classification decisions in Equations 5–7 (which depend only on the argmin) are identical under either P or P_A.

With these normalized probabilities, the entropy of the distribution over alternatives, which quantifies model uncertainty, can be calculated as:

H(A | c) = −∑_{i=1}^{n} P_A(a_i | c) log P_A(a_i | c)    (9)

where P_A(a_i | c) is the renormalized probability of completion a_i given context c, as defined in Equation 8. High entropy indicates that the model is uncertain among alternatives; low entropy indicates a strong preference. This provides a candidate principled confidence measure without requiring calibration on held-out data or model self-report, and it may serve as a useful signal for downstream applications (a claim that requires further empirical validation). A further practical advantage is computational efficiency: surprisal-based evaluation requires only a single forward pass, reading out logits for a small set of tokens. For a binary classification task with short chain-of-thought reasoning, explicit prompting might generate 50–100 tokens of reasoning, while surprisal measurement requires computing logits for just 2 tokens, yielding a substantial speedup in evaluation.
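The forced-choice readout in Equations 8–9 can be sketched in a few lines, assuming we have already extracted the target-position logits for the scale tokens (the logit values below are invented for illustration):

```python
import math

def renormalize(logits: dict[str, float]) -> dict[str, float]:
    """Softmax over only the predefined alternatives (Equation 8),
    discarding probability mass on the rest of the vocabulary."""
    z = sum(math.exp(v) for v in logits.values())
    return {tok: math.exp(v) / z for tok, v in logits.items()}

def entropy_nats(probs: dict[str, float]) -> float:
    """Entropy of the renormalized distribution (Equation 9)."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)

# Invented target-position logits for the scale tokens "1".."5".
logits = {"1": -4.0, "2": -2.5, "3": 0.5, "4": 2.0, "5": 1.0}

probs = renormalize(logits)
# Argmin of surprisal = argmax of probability (ordering is preserved).
score = min(probs, key=lambda t: -math.log(probs[t]))

print(score)                # 4, the least surprising scale position
print(entropy_nats(probs))  # ~0.95 nats; log(5) = ~1.61 for a flat curve
```

Because the softmax is monotone in the logits, the selected score is identical whether one takes the argmax over raw logits, raw probabilities, or the renormalized P_A; only the entropy depends on the renormalization.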
3.1.4 Experimental Design Considerations

The preceding calculations were all conditioned on a fixed context c. This opens research avenues for exploring the impact of context and framing on model surprisal. Our experiments vary context levels to study how information provision affects surprisal patterns (see Section 4 for details). Relevant design principles include prompt structure and context manipulation. For example, each prompt combines relevant context, a task description, and an incomplete statement positioned so that the model completes it with the target token. Likewise, for context manipulation, we can test a gradient of context levels, such as no context, a brief definition, or a comprehensive background of the task, all with the goal of studying how information provision affects surprisal patterns.

With this kind of setup, it is straightforward to employ factorial experimental designs to systematically study multiple factors. The model factor involves testing multiple LLMs for cross-model validation: models within the same family (i.e., models of different sizes) or across different architectures (e.g., Qwen vs. LLaMA vs. Phi). Prompt factors include variations in personas, context levels (e.g., no context vs. a brief task definition vs. comprehensive background), and section delimiters (e.g., XML vs. Markdown vs. none). Varying those prompt factors allows us to study how different prompt designs affect surprisal patterns. Finally, task factors capture domain-specific variables such as statement types and difficulty levels. If tasks can be associated with difficulty ratings from human raters, then we can study how task difficulty affects surprisal patterns. This factorial structure enables analysis of main effects and interactions, providing robust insights into which factors influence surprisal-based performance.
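The factorial crossing described above can be enumerated mechanically. A sketch with hypothetical factor levels (the specific model names and level labels are placeholders, not the paper’s exact conditions):

```python
from itertools import product

# Hypothetical factor levels for a surprisal-evaluation experiment.
models = ["Qwen2.5-3B-Instruct", "Qwen2.5-14B-Instruct"]
context_levels = ["none", "brief_definition", "comprehensive_background"]
delimiters = ["xml", "markdown", "none"]

# Full factorial design: one cell per model x context x delimiter combination.
conditions = [
    {"model": m, "context": c, "delimiter": d}
    for m, c, d in product(models, context_levels, delimiters)
]

print(len(conditions))  # 2 * 3 * 3 = 18 cells
print(conditions[0])
```

Each cell then yields one surprisal curve per item, so main effects and interactions can be analyzed with standard factorial methods.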
For scoring tasks, we can analyze surprisal curves to extract multiple insights. The minimum location identifies the optimal score, that is, the scale position with the lowest surprisal. In theory, the steepness of the curve around that minimum reflects the model’s confidence: steep curves indicate high confidence, whereas flat curves suggest low confidence. Multi-modality, the presence of multiple local minima, can signal genuine ambiguity in the task or competing interpretations within the model. Finally, asymmetry in the curve, where one tail is steeper than the other, may reveal directional biases in the model’s judgments. We briefly validate surprisal-based classifications through comparisons with human judgments in some experiments below. Systematic calibration studies across domains and models remain an important direction for future work.

4 Experiments and Results

We demonstrate our surprisal-based evaluation framework across four domains, each following the methodology in Section 3. While these domains may appear disparate, they share a common origin: our research program on mental models of social-ecological-technological systems (Jones et al., 2011; Rouse and Morris, 1986), which studies how people understand complex systems through the entities they identify, the causal relationships they describe, the figurative language they use, and the thematic codes that characterize their responses. Each domain thus represents a task where LLM-assisted analysis could support qualitative research, and where surprisal-based evaluation offers a principled way to assess LLM performance. Table 1 summarizes the domains and task types.

Table 1: Summary of experimental domains and task types.

Domain               | Task Type       | Scale
SETS Classification  | Score Discovery | 1–9
Causal Statement     | Scaled Classif. | 1–5, 1–9
Figurative Lang.     | Multi-format    | Binary + Scales
Deductive Coding     | Code Applic.    | 1–5, 1–9
4.1 Social-Ecological-Technological Systems Scoring

4.1.1 Task Description

The Social-Ecological-Technological Systems (SETS) framework (McPhearson et al., 2022) analyzes entities across three interconnected dimensions: social, ecological, and technological. In their foundational paper, the authors displayed a ternary plot showing entities arranged within the two-dimensional triangle spanned by the three social-ecological-technological dimensions. In this first task, we tested the extent to which LLMs differentiate entities along their social-ecological-technological dimensions by testing whether the surprisal-based approach can identify appropriate scores for entities on these dimensions.

4.1.2 Methodology

For each entity (e.g., “park,” “current,” “web”) from a set of statements, and for each dimension (i.e., social, ecological, technological), we measure surprisal for score assignments on a 1–9 scale. The statements were crafted so that pairs of sentences with homonyms (e.g., web, current, spring) would have different meanings given the context. A representative prompt is shown below; the full set of prompt variations is described in Appendix A.

    The Social-Ecological-Technological Systems (SETS) framework analyzes entities across three interconnected dimensions:

    Social: human aspects such as community interactions, governance, economic systems, cultural values, and social equity

    Ecological: the natural environment and its components, which are often involved in biophysical processes, including natural resources, ecosystem functions, and environmental conditions

    Technological: human-made systems and engineered infrastructures, including infrastructure, technological tools, and innovations

    The framework can be used to classify entities and concepts based on their alignment with these dimensions. When doing so, it can be helpful to consider not only the entity but the surrounding context in which it was mentioned.
Consider the following context and entity:

Context: The spring was compressed too much
Entity: "spring"

On a scale from 1–9, where 1 corresponds to the entity having no ecological characteristics and 9 corresponds to extremely high ecological characteristics, given the context, the entity "spring" score on the ecological dimension is:

We then measure surprisal for X ∈ {1, 2, 3, ..., 9} and construct surprisal curves such as those shown in Figure 1.

4.1.3 SETS Classification Key Findings

To quantify observations across the full dataset, Table 2 reports the mean absolute error (MAE) between each model's surprisal-optimal scores and expected scores across all 90 entities in the dataset, broken down by SETS dimension. These results use a standardized set of four Qwen2.5 models. Lower MAE indicates better alignment with expected scores. The results show a clear relationship between model size and accuracy: the 14B variants achieve the lowest MAE (1.43–1.45), followed by 7B-Instruct (1.83), while the 3B-Instruct model performs substantially worse (2.95). The 14B base model slightly outperforms the 14B instruction-tuned variant overall, driven primarily by its strong performance on the ecological dimension (MAE = 0.98).

Table 2: SETS classification MAE by model and dimension (n = 90 entities). Lower is better. All models are Qwen2.5.

Model         Soc.   Ecol.   Tech.   Mean
3B-Instruct   2.92   2.99    2.94    2.95
7B-Instruct   2.21   1.51    1.78    1.83
14B-Instruct  1.86   1.17    1.33    1.45
14B           2.04   0.98    1.27    1.43

We note that the expected scores are researcher-assigned values rather than multi-rater consensus scores, so these results should be interpreted as alignment with one researcher's expectations rather than an objective benchmark.

Surprisal curves often showed clear minima corresponding to expected scores, indicating that the LLMs were able to distinguish between homonymous entities.
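Concretely, the curve construction reduces to scoring each candidate completion and negating its log probability. A minimal sketch follows; the log-probability values are illustrative stand-ins for what a model such as Qwen2.5 would return via a forward pass or any API that exposes completion log probabilities:

```python
def surprisal_curve(logprobs_by_position):
    """Surprisal is the negative log probability of each completion."""
    return {pos: -lp for pos, lp in logprobs_by_position.items()}

def optimal_position(curve):
    """The surprisal-optimal score: the position the model finds least surprising."""
    return min(curve, key=curve.get)

# Illustrative log-probs for the completions "1".."9" appended to the prompt.
logprobs = dict(zip([str(i) for i in range(1, 10)],
                    [-6.2, -5.1, -4.0, -3.2, -2.5, -2.1, -1.4, -1.8, -3.0]))
curve = surprisal_curve(logprobs)
print(optimal_position(curve))  # -> "7"
```

The MAE in Table 2 then compares these surprisal-optimal positions against the researcher-assigned expected scores.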
For example, we tested the word "bug" and found that the insect meaning shows minimum surprisal at 7–8 on the ecological dimension but 1 on the technological dimension (Figure 1). Yet, as one would hope, when the entity is a software bug, the pattern flips for most models: the minimum surprisal is at 7 on the technological dimension and 1–2 on the ecological dimension (Figure 2). The figures show results for four Qwen2.5 models (3B-Instruct, 7B-Instruct, 14B-Instruct, and 14B base). Testing multiple models from the same family helps capture possible performance differences associated with model size. In particular, the 3B model assigns a score of 1 across all dimensions for both meanings of "bug," suggesting it cannot reliably disambiguate the entity at this parameter scale. The 7B model performs better on the ecological dimension for the insect meaning (score 7) but still fails on the technological dimension for the software meaning (score 1). The 14B models correctly identify the distinguishing dimensions for both meanings, suggesting that larger models have more robust representations of entity semantics.

Figure 1: Surprisal curves for "bug" (insect meaning) on the social, ecological, and technological dimensions with moderate context ("The garden bug pollinated flowers while feeding on nectar"). The 14B models correctly identify high ecological scores.

Figure 2: Surprisal curves for "bug" (software meaning) on the social, ecological, and technological dimensions with moderate context ("The software bug caused the application to crash unexpectedly"). The 14B models correctly identify high technological scores.

Context of the entity also plays a crucial role in determining the optimal scores. For example, we took three instances of the entity "virus" (as in a computer virus) and found that the technological score changed dramatically depending on the context (Figure 3). With minimal context ("The virus was detected"), all models assign a technological score of 1, and the 14B models instead assign higher ecological scores (5–7), suggesting they interpret "virus" as biological. However, as we add more context making it clearer that we are discussing a computer virus, the pattern flips: with moderate context ("The computer virus corrupted files and spread through email attachments"), the technological score jumps to 9 for the 14B models and 7 for the 7B model, while the ecological score drops to 1–2. The rich context (a detailed description of malware exploiting zero-day vulnerabilities) also displayed this pattern, with technological scores of 8–9 for the 14B models. The 3B model never picked up on the technological dimension regardless of context, assigning a score of 1 across all three context levels.

4.2 Causal Statement Identification

Causal reasoning is fundamental to human cognition and critical for AI systems (Pearl, 2009). As such, there is general interest in exploring the extent to which LLMs can identify or internally represent causal relationships (Kiciman et al., 2023). If they can do so, then that may suggest elements of functional world models. To investigate this, we test how well surprisal measurements can provide insights into an LLM's ability to distinguish causal from non-causal statements. We do this by framing the task either as a binary classification task or an ordinal-scaled rating task.

4.2.1 Binary Task Description

For this task, the model must decide whether or not a given statement expresses a causal relationship.
Under this binary classification task framing, for each statement, we can measure the extent to which the LLM is surprised by a "True" or "False" completion to a prompt such as:

[Causal relationship definition]

Statement: "Smoking causes lung cancer."

This statement expresses a causal relationship:

Figure 3: Surprisal curves for "virus" (computer meaning) across three context levels: (a) minimal context, where models interpret "virus" as biological and assign high ecological scores; (b) moderate context, where 14B models correctly shift to high technological scores; (c) rich context describing malware behavior, indicating context-dependent disambiguation. The 3B model never adjusts its technological score regardless of context.

In theory, that sounds simple. In practice, it is more nuanced. For example, how much context should we provide to the model about what a causal relationship is? And what should the response format be? As an initial proof of concept, we test three context levels (full causal background, minimal definition, no context) and multiple response formats (True/False, Yes/No, binary choice). The full context provides a detailed definition of causality and causatives, the minimal context provides a brief definition, and the no-context condition provides no definition. We test the different binary scales to account for potential variations in how the models might interpret the task and options. This is akin to a robustness analysis to check whether the model's performance is sensitive to the specific wording of the choices. It is also the experiment most aligned with the minimal pairs experiment setup. The more complete set of prompts is provided in Appendix B.

For this set of experiments, we use a synthetic test dataset generated by the researcher to intentionally range across a spectrum from clear-cut to ambiguous cases.
The statements can be categorized into five groups according to the degree of causality expressed: explicit causal (clear causal markers like "causes," "results in," etc.), implicit causal (implied causation like "if-then" statements), correlational (temporal/statistical association), non-causal (descriptive), and ambiguous (borderline cases). The distribution was 37 causal, 45 non-causal, and 18 correlational statements. For binary classification, we treated explicit and implicit causal as positive and all others as negative, yielding 37 positive and 63 negative items. We use that setup to test hypotheses about model performance under different levels of difficulty with the statements. Examples of implicit causal statements include "If it rains, the ground will get wet" and "If I study hard, I will get an A in class." In contrast, associational statements include sentences such as "As temperatures rose, the ice cream sales increased."

4.2.2 Binary Task Key Findings

Using the same four Qwen2.5 models as the SETS experiment, we tested these models with three binary response formats (True/False, Yes/No, binary choice) and three context levels (full causal background, minimal definition, no context). Figure 4 shows results for two clear-cut cases. For the causal statement "The heavy rain caused widespread flooding in the city" (panel a), models demonstrate lower surprisal for the "True" completion across all models and context levels, as shown by lines sloping upward from left to right. The non-causal statement "The meeting was scheduled for 3 PM" (panel b) shows the opposite: lines slope downward, indicating lower surprisal for "False." This suggests that the models are not simply biased toward one classification by the task framing.

More revealing are the ambiguous cases shown in Figure 5.
For the indirect causal statement "If you heat water to 100 degrees Celsius, it will boil" (panel a), most models and context levels still demonstrate lower surprisal for "True," but the lines are less steep than in the clear causal case, indicating greater model uncertainty. This reduced steepness reflects higher entropy: the probability mass is more spread out across completions rather than concentrated on a single answer. For the correlational statement "Students who study more tend to get better grades" (panel b), the pattern is more mixed: while most models still lean toward "True," some context levels produce nearly flat lines. The ground truth label for this statement is "correlational," and the varying slopes reflect genuine ambiguity: the 14B-Instruct model shows relatively consistent upward slopes regardless of context, while the 14B base model shows more variation across context levels.

Table 3 reports aggregate classification accuracy across the full dataset of 100 statements, including explicit causal, implicit causal, correlational, non-causal, and ambiguous categories, broken down by model and context level. The results use a standardized set of four Qwen2.5 models and were consistent with the patterns observed in the individual examples above: larger models achieve higher accuracy, and the effect of context depends on model size. For the 3B-Instruct model, providing full context improves accuracy substantially (68.0% vs. 44.7% with no context), whereas the 14B-Instruct model shows minimal sensitivity to context level (76.7–78.0%).

Table 3: Causal binary classification accuracy (%) by model and context level (n = 100). Accuracy averaged across response formats. All models are Qwen2.5.

Model         Full   Minimal   None   Mean
3B-Instruct   68.0   47.3      44.7   53.3
7B-Instruct   75.7   72.7      69.3   72.6
14B-Instruct  78.0   77.0      76.7   77.2
14B           76.0   68.3      77.0   73.8
The 14B base model shows a noteworthy pattern where the no-context condition (77.0%) outperforms the minimal-context condition (68.3%), suggesting that incomplete context definitions may interfere with the base model's prior representations of causality.

Figure 4: Binary classification surprisal for clear cases: (a) the causal statement "The heavy rain caused widespread flooding in the city," where upward-sloping lines indicate lower surprisal for "True"; (b) the non-causal statement "The meeting was scheduled for 3 PM," where downward-sloping lines indicate lower surprisal for "False." Each panel shows one model with three context levels.

4.2.3 Ordinal-Scaled Task Description

Building on binary classification, one might want a more granular expression of uncertainty, or recognize that there are finer nuances in the way people express causal relationships. To capture that gradation in expression, we tested whether we could extract granular confidence estimates by measuring surprisal across numerical scales. This tests whether surprisal can provide calibrated uncertainty quantification beyond the binary choices tested in the preceding experiment.

4.2.4 Ordinal-Scaled Task Methodology

Expanding to an ordinal scale raises practical design questions. For example, how many points should the scale have: even or odd, five or nine? Additionally, what should the anchors be? Should they be equidistant or more spread out? Should the scale be bipolar (i.e., capturing both causal and non-causal content) or unipolar (i.e., capturing only causal content)? As an initial approach to this area, we tested two of the five framings listed in Appendix B: bipolar causality and causal strength. The others were explored in preliminary work but are not reported here. For those two framings, we test on a five-point scale and a nine-point scale. With each scale and each framing, we specify anchor point values.
For example, for the causal strength scale, the anchor points are 1 = no causal content and 5 = very strong causal content. For each statement and framing, we measure surprisal for every scale position completion (e.g., "1", "2", "3", "4", "5") and consequently extract the full surprisal distribution. We perform this measurement for varying levels of background information on causal statements (context levels), specifically testing three levels (no information, minimal information, and full information), as was also the case with the binary classification task.

For the minimal context variation of the experiment, a representative prompt looks like this:

A causal relationship exists when one event, action, or state brings about, influences, or determines another. Causal relationships can be expressed through explicit markers (because, causes, leads to) or implied through conditional statements and purpose expressions.

How strong is the causal content in this statement: "Monitoring stations indicate that heavy rainfall led to widespread flooding in low-lying areas, according to reports."

Rate from 1 to 5:
1 = No causal content
5 = Very strong causal content

Rating:

For each scale length, we measure surprisal for every position completion (1 through 5 or 1 through 9) and repeat across all scale framings and context levels.

4.2.5 Ordinal-Scaled Task Key Findings

To enable direct comparison with the binary experiment, we revisit the same statements on ordinal scales. For the causal statement "The heavy rainfall caused widespread flooding in the city," the surprisal curves across all model, context, and scale combinations are predominantly monotonically decreasing: models find higher causal ratings less surprising (Figure 10 in Appendix E). This monotonicity indicates internal consistency: models that preferred a rating of 5 also found 4 less surprising than 3, and so on.
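Both diagnostics used here, the monotonicity of a curve and the entropy of the restricted completion distribution, fall out directly from the surprisal values. A minimal sketch; the surprisal values below are illustrative, standing in for curves produced by the measurement described above:

```python
import math

def restricted_distribution(surprisals):
    """Renormalize exp(-surprisal) over just the allowed scale completions."""
    weights = {k: math.exp(-s) for k, s in surprisals.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

def entropy(dist):
    """Shannon entropy (nats) of the restricted distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def is_monotone_decreasing(surprisals):
    """True if higher ratings are never more surprising than lower ones."""
    vals = [surprisals[k] for k in sorted(surprisals)]
    return all(a >= b for a, b in zip(vals, vals[1:]))

# A clear-cut causal statement: surprisal falls steadily toward rating 5.
clear = {1: 6.0, 2: 4.5, 3: 3.0, 4: 2.0, 5: 1.5}
# An ambiguous statement: nearly flat curve, probability mass spread out.
ambiguous = {1: 3.0, 2: 2.8, 3: 2.6, 4: 2.8, 5: 3.1}

print(is_monotone_decreasing(clear))                 # True
print(entropy(restricted_distribution(clear))
      < entropy(restricted_distribution(ambiguous)))  # True
```

Flat curves thus register as high entropy even when a minimum exists, which is what lets entropy separate genuinely ambiguous items from clear-cut ones.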
Different scale framings (causal strength and bipolar causality) showed convergent patterns for this clear-cut case.

For the correlational statement "Students who study more tend to get better grades," the curves shift to more parabolic shapes with minima falling in the middle of the scales rather than at the extremes (Figure 11 in Appendix E). This suggests that the models find moderate ratings less surprising than those indicating the statement definitely was or was not causal, which is consistent with the genuine ambiguity of a statistical association. The 14B base model showed flatter surprisal curves than the instruction-tuned models for this statement, suggesting greater uncertainty. Exploring how fine-tuning affects model uncertainty on ambiguous items is an area for future work.

Figure 5: Binary classification surprisal for ambiguous cases: (a) the indirect causal statement "If you heat water to 100 degrees Celsius, it will boil," where lines slope upward but less steeply than clear cases; (b) the correlational statement "Students who study more tend to get better grades," where mixed slopes reflect genuine ambiguity. Compare with Figure 4.

Beyond these qualitative patterns, we can also quantify overall performance on this task. Table 4 reports directional accuracy (the percentage of statements where the surprisal minimum falls on the correct side of the scale midpoint) across the full dataset of 100 statements, using a standardized set of four Qwen2.5 models. Results are broken down by context level and scale framing (CS = causal strength, BC = bipolar causality). Accuracy is generally high across all conditions (74–92%), suggesting that the ordinal scaling approach preserves the discriminative power observed in binary classification while offering more granular uncertainty information.
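The directional accuracy metric in Table 4 can be computed from the surprisal curves alone. A minimal sketch; the curves and labels are illustrative, and treating an item whose minimum sits exactly at the midpoint as incorrect is our assumption, one reasonable convention among several:

```python
def directional_accuracy(items, midpoint=3.0):
    """Fraction of items whose minimum-surprisal position falls on the
    correct side of the scale midpoint (causal above, non-causal below)."""
    correct = 0
    for curve, is_causal in items:
        best = min(curve, key=curve.get)  # surprisal-optimal position
        if (best > midpoint) if is_causal else (best < midpoint):
            correct += 1
    return correct / len(items)

# Illustrative 1-5 curves for one causal and one non-causal statement.
causal_curve = {1: 5.0, 2: 4.0, 3: 3.0, 4: 1.8, 5: 1.2}
noncausal_curve = {1: 1.1, 2: 2.0, 3: 3.4, 4: 4.2, 5: 5.0}
items = [(causal_curve, True), (noncausal_curve, False)]
print(directional_accuracy(items))  # -> 1.0
```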
The causal strength framing tended to slightly outperform the bipolar causality framing, and larger models maintain their advantage, with the 14B-Instruct model achieving the highest mean accuracy (89.0%).

4.3 Figurative Language Detection

Figurative language provides a signal about how people conceptualize topics (Lakoff and Johnson, 1980). For instance, describing LLMs as "stochastic parrots" (Bender et al., 2021) versus "a gold mine of opportunities" reveals very different mental models. Detecting whether language is figurative or literal is thus a natural task for surprisal-based evaluation and one that directly tests the extended minimal pairs paradigm. We believe this is especially true since figurative and literal statements can share surface-level lexical overlap while differing in meaning.

4.3.1 Task Description

We test whether surprisal can distinguish figurative from literal language using a paired minimal-pairs design. Each pair consists of a figurative statement and a literal counterpart that shares key lexical items but uses them literally. For example:

• Figurative: "The words hung in the air between them. Neither wanted to reach out and grab them, afraid of what might happen if they acknowledged the truth." (metaphor: words as physical objects)

• Literal: "The banner hung in the air between them, suspended by wires. Neither wanted to reach out and grab it, afraid they might tear the delicate fabric." (literal: physical object)

Table 4: Causal scaled directional accuracy (%) by model, context level, and framing (n = 100). CS = causal strength, BC = bipolar causality. All models are from the Qwen2.5 family.
Model         Full/CS  Full/BC  Min./CS  Min./BC  None/CS  None/BC  Mean
3B-Instruct   82.0     82.0     79.0     85.0     81.0     74.0     80.5
7B-Instruct   86.0     83.0     84.0     83.0     87.0     78.0     83.5
14B-Instruct  90.0     91.0     91.0     88.0     88.0     86.0     89.0
14B           88.0     87.0     89.0     86.0     92.0     85.0     87.8

Both statements contain "hung in the air" and "reach out and grab," but only the first uses these phrases figuratively. The dataset contains 30 such pairs (60 statements total), spanning metaphor, simile, personification, and analogy across domains including business, psychology, nature, and technology. This is a small proof-of-concept test set that will be built upon in future experiments.

4.3.2 Methodology

For each statement, we measure surprisal on a metaphor intensity scale (1 = completely literal, 5 = highly metaphorical) and identify the scale position with minimum surprisal. The primary metric is the scale discrimination rate: for each pair, does the figurative statement receive a higher minimum-surprisal position than its literal counterpart? This directly tests whether surprisal measurement can discriminate figurative from literal language within controlled minimal pairs. We tested both 5-point and 9-point scales with the same four Qwen2.5 models and two context levels (minimal and full definitions of figurative language). A representative prompt structure is:

[Figurative language definitions]

On a scale from 1 to 5, rate how metaphorical this statement is: "The words hung in the air between them. Neither wanted to reach out and grab them, afraid of what might happen if they acknowledged the truth."

1 = Completely literal
3 = Somewhat metaphorical
5 = Highly metaphorical

Rating:

The full set of prompts and context levels is described in Appendix C.

4.3.3 Key Findings

Figure 6 illustrates a representative pair on the 5-point metaphor intensity scale.
For the figurative statement (left panel), surprisal curves decrease toward the high end of the scale, indicating that models found high metaphor intensity ratings less surprising. For the literal counterpart (right panel), the pattern reversed: curves were relatively flat or increasing, with minimum surprisal falling at low or middle intensity scores. This contrast within a single minimal pair, where both statements share the phrases "hung in the air" and "reach out and grab," demonstrated that surprisal was sensitive to semantic rather than purely lexical features.

Table 5 reports the scale discrimination rate across all 30 pairs. On the 5-point scale, the 14B base model achieves 95.0% mean discrimination, the highest of any model and scale combination. The 7B-Instruct model achieves 90.0% on the 9-point scale with identical performance across context levels. Discrimination rates are substantially above the 50% chance baseline across most conditions, suggesting that surprisal-based measurement can reliably distinguish figurative from literal language within controlled minimal pairs.

Two patterns in these results are noteworthy. First, the 14B base model's strong performance on the 5-point scale (95.0%) versus the instruction-tuned 14B model (66.7%) suggests that instruction tuning may introduce response biases that distort raw surprisal distributions. The base model's surprisal curves may more faithfully reflect the underlying language model's representation of figurativeness. Second, minimal context often outperforms full context on the 5-point scale (e.g., 96.7% vs. 93.3% for the 14B base model, 80.0% vs. 53.3% for 3B-Instruct), suggesting that additional definitional context can narrow the surprisal distribution in ways that reduce discriminability.

Figure 6: Paired metaphor intensity surprisal curves (5-point scale) for Pair 7.
Left: the figurative statement "The words hung in the air between them...afraid of what might happen if they acknowledged the truth" shows decreasing surprisal toward high intensity. Right: the literal counterpart "The banner hung in the air between them...afraid they might tear the delicate fabric" shows minimum surprisal at low intensity. Both statements share key phrases but differ in figurative status.

Table 5: Figurative language paired scale discrimination rate (%) by model, scale, and context (n = 30 pairs). Rate = percentage of pairs where the figurative statement's minimum-surprisal position exceeds the literal counterpart's. All models are Qwen2.5.

Model         Minimal   Full   Mean
5-point scale
3B-Instruct   80.0      53.3   66.7
7B-Instruct   80.0      80.0   80.0
14B-Instruct  70.0      63.3   66.7
14B           96.7      93.3   95.0
9-point scale
3B-Instruct   46.7      73.3   60.0
7B-Instruct   90.0      90.0   90.0
14B-Instruct  70.0      66.7   68.3
14B           76.7      63.3   70.0

4.4 Deductive Coding of Qualitative Survey Responses

4.4.1 Task Description

Qualitative research often involves applying codes to text data (Saldaña, 2016). Those codes can be either deductive (i.e., a priori: the codes are based on a pre-existing codebook or theoretical framework) or inductive (i.e., the codes are based on the data itself). These codes are labels that researchers apply to segments of qualitative data to describe patterns (e.g., themes) in those data. The codebook used in our evaluation is shown in Table 6. Once a researcher has that codebook, they will typically read the data and make decisions about which code to apply to which segment of data. Thus, one can think of coding as a decision problem where there can be varying degrees of certainty about which code to apply to which segment of data. Here, we test the extent to which surprisal can facilitate deductive coding of survey responses.
4.4.2 Methodology

Given a codebook with codes and definitions for the codes, along with short text to code (i.e., label with the codes, such as open-ended survey responses), we measure surprisal for code applicability scores. We tested both quantitative scales (a 1–9 scale, where 1 = not applicable and 9 = highly applicable) and qualitative scales. For the qualitative scales, we constrained scale labels to single tokens to avoid multi-token averaging. Qualitative scale labels included degree-of-applicability terms (e.g., "not", "somewhat", "very", "extremely"). An example of the prompt and framing used for the deductive coding experiment is shown in Appendix D.

As shown in Appendix D, experiments for this task can systematically vary the scale options, the number of scale options, the scale type (quantitative or qualitative), background information on what qualitative coding is, persona assignments, prompt formatting (i.e., use of XML tags or other headings), whether to give the LLM the code and definition or only the code, and the size of the segment of text to code. By altering these variables, we can also test the robustness of surprisal-based coding and capture degrees of uncertainty in the coding process. Hypothetically, that uncertainty could be a useful signal for human coders to review the LLM's coding decisions.

For the standardized evaluation, we used texts from a survey of faculty members returning to campus toward the end of the pandemic, with a codebook reflecting themes such as work/life balance, career advancement, and public safety. We tested four Qwen2.5 models on a balanced dataset of 40 text–code pairs using a 1–5 code applicability scale (1 = not applicable, 5 = highly applicable), with two persona conditions (no persona and qualitative researcher).

Table 6: Codebook for deductive coding of faculty pandemic survey responses (13 codes used in evaluation).

Code: Back to normal with no changes/admin
Definition: Frustration with administration expecting return to pre-pandemic conditions without implementing improvements

Code: Delays, setbacks in careers/research
Definition: Career advancement delays, research disruptions, publication setbacks, or professional development impacts

Code: Don't want to work in office
Definition: Preference to continue remote work rather than returning to office-based work

Code: Equity
Definition: Concerns about fairness, equal treatment, disparities, or social justice issues related to pandemic recovery

Code: Family, personal priorities, childcare
Definition: Statements about family responsibilities, childcare arrangements, or balancing personal priorities with work

Code: Financial/support
Definition: Financial concerns, economic impacts, funding issues, or need for institutional support

Code: Lack of common purpose
Definition: Absence of shared goals, collective mission, or unified direction in work or institutional settings

Code: Not over, less hope of recovering
Definition: Pessimism about pandemic recovery, feeling that the crisis is ongoing with little hope for improvement

Code: Online teaching challenges
Definition: Difficulties with remote teaching, online education delivery, technology issues, or virtual learning environments

Code: Public safety (masks, vaccines, etc.)
Definition: Statements about public health measures, safety protocols, mask requirements, vaccination policies, and related safety measures

Code: Return to campus, students, teaching
Definition: Statements about returning to in-person campus activities, student interactions, or classroom teaching

Code: Supporting others, employees, students
Definition: Responsibilities for supporting colleagues, employees, students, or others during recovery

Code: Work/life boundaries
Definition: Challenges with maintaining separation between work and personal life, especially during remote work

4.4.3 Key Findings

Below we present figures with two panels (one per persona condition) and four model curves, allowing direct comparison of how model size and persona assignment affect coding judgments.
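For the qualitative scales, the same minimum-surprisal logic applies over single-token label completions instead of digits, and the chosen label then maps to a coding decision. A minimal sketch; the surprisal values are illustrative, and which labels count as "applicable" is our assumption for the example:

```python
def coding_decision(surprisals, applicable_labels):
    """Pick the least-surprising scale label and map it to a binary decision."""
    best = min(surprisals, key=surprisals.get)
    return best, best in applicable_labels

# Single-token degree-of-applicability labels; surprisal values are made up.
scale = {"not": 4.2, "somewhat": 1.8, "very": 2.3, "extremely": 3.9}
label, applies = coding_decision(scale, {"somewhat", "very", "extremely"})
print(label, applies)  # -> somewhat True
```

The full surprisal distribution over the labels, not just the winner, is what carries the uncertainty signal a human coder could use for review.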
Figure 7 shows a positive case where the code "not over, less hope of ever recovering" was tested against the text "That the pandemic is not going to end!" Most models correctly identify this code as applicable, with minimum surprisal falling at scores of 3–4 across most models and persona conditions. The bowl-shaped curves with interior minima indicate that the models found moderate applicability scores less surprising than the extremes of no applicability or high applicability, suggesting belief that the code applies but that it might not be a clear-cut case.

Figure 7: Surprisal-based coding for the code "not over, less hope of ever recovering" applied to a text about the pandemic not ending. Models correctly find minimum surprisal at scores 3–4 (applicable).

To check whether the models are simply biased toward finding all codes applicable, we can examine a negative case. Figure 8 shows the code "family, personal priorities, childcare" tested against text about pandemic media coverage. Here, we see a reversal of the prior pattern: surprisal increases monotonically from score 1 to 5, with all models finding score 1 (not applicable) the least surprising option (with the exception of the 7B model in the no-persona prompt, where the 4 and 5 surprisals do not follow that monotonic trend). This consistency across models and persona conditions provides a good check that the models are not simply following a fixed pattern of surprisal scores.

Figure 9 shows another positive case for the code "not over, less hope of ever recovering" with a different text ("we strive to achieve a new normal: social distancing and mask wearing are not going away anytime soon"). Here, the models show more varied patterns across model sizes. The 14B-Instruct and 14B base models find scores of 2–3 as the minimum, while the smaller models show less consistent patterns.
This variation highlights how model size interacts with coding difficulty: when the textual markers are less direct, larger models show more nuanced discrimination while smaller models may struggle.

To quantify performance more systematically, Tables 7 and 8 report aggregate accuracy and F1 scores across a balanced dataset of 40 text–code pairs, using a standardized set of four Qwen2.5 models. Accuracy is computed by thresholding the surprisal-optimal score at ≥ 3 on the 1–5 scale (i.e., the model considers the code applicable if the minimum-surprisal position is 3 or higher). The 14B-Instruct model achieves the best performance on both metrics (75.0% accuracy, 72.2% F1), consistent with the size-related patterns observed in other experiments. The persona manipulation has inconsistent effects across models: it slightly improves the 3B-Instruct model but slightly hurts the 7B-Instruct model. The discrepancy between accuracy and F1 for the 7B-Instruct model (66.3% accuracy vs. 58.4% F1) suggests an asymmetry in its error patterns, with more false positives than false negatives.

Table 7: Deductive coding accuracy (%) by model and persona (n = 40, threshold ≥ 3). All models are Qwen2.5.

Model         No Pers.   Qual. Res.   Mean
3B-Instruct   65.0       70.0         67.5
7B-Instruct   67.5       65.0         66.3
14B-Instruct  75.0       75.0         75.0
14B           67.5       67.5         67.5

Table 8: Deductive coding F1 score (%) by model and persona (n = 40). All models are Qwen2.5.

Model         No Pers.   Qual. Res.   Mean
3B-Instruct   65.0       64.7         64.9
7B-Instruct   60.6       56.2         58.4
14B-Instruct  72.2       72.2         72.2
14B           66.7       64.9         65.8

These investigations highlight the promise of surprisal-based coding for deductive coding of survey responses. Future work will explore the robustness of this approach, with a particular focus on capturing degrees of uncertainty in the coding process and the extent to which it aligns with human coding decisions.
Such an exploration might have direct applications for qualitative data analysts in addition to providing further indicators of LLM capabilities.

Figure 8: Surprisal-based coding for the code “family, personal priorities, childcare” applied to a text about pandemic media coverage. Models correctly find minimum surprisal at score 1 (not applicable).

5 Discussion

Across four domains, three consistent patterns emerged. First, the surprisal-based approach produced interpretable classification signals, with clear minima at expected scale positions and monotonic curves across ordinal positions in clear-cut cases. Second, entropy over the restricted completion set {a_i} tended to flag genuinely ambiguous items: correlational statements such as “students who study more tend to get better grades,” or deductive codes with partial textual support, tended to have higher entropy, while clear-cut items produced peaked distributions with lower entropy. Third, performance generally scaled with model size, though there were notable exceptions: the 14B base model sometimes outperformed its instruction-tuned variant, and smaller models occasionally performed better than larger ones. These findings suggest that the relationship between fine-tuning, parameter count, and surprisal-based accuracy is not strictly monotonic, and that tuning or scaling may reshape probability distributions in ways that do not always benefit surprisal-based evaluation.

5.1 Model Behavior Across Domains

Context sensitivity emerged as a key differentiator across experiments. In the SETS task, providing disambiguating context for homonyms like “virus” shifted the 14B models’ minimum-surprisal position on the technological dimension from 1 to 9, while the 3B model’s scores did not change (Figure 3).
In the causal binary task, providing full causal definitions improved the 3B model’s accuracy from 44.7% to 68.0% but had minimal effect on the 14B-Instruct model (76.7% to 78.0%; Table 3), and not in the direction one would expect. In the figurative language task, additional definitional context sometimes reduced discrimination rates (e.g., 96.7% to 93.3% for the 14B base model on the 5-point scale; Table 5). These patterns suggest that context provision is not uniformly beneficial: it helps most when models lack sufficient prior knowledge, but can narrow probability distributions in ways that reduce discriminability when models already represent the distinction well. This aligns with Rauba et al. (2024) on the importance of probing context sensitivity and with Pezeshkpour and Hruschka (2024) on how contextual cues can affect LLM performance in non-obvious ways.

5.2 Genuine Uncertainty vs. Errors

A key question for any evaluation method is whether observed uncertainty reflects genuine ambiguity in the task or simply model confusion. Our experiments suggest that entropy-based uncertainty measures may distinguish between these cases. In the causal statement experiments, the statement “Students who study more tend to get better grades” produced notably flat surprisal curves across models and contexts, with minimum surprisal values falling in the middle of the scale (see Figure 11).

Figure 9: Surprisal-based coding for the code “not over, less hope of ever recovering” applied to a text about striving for a new normal. Models show varied patterns, with larger models finding minimum surprisal at moderate applicability scores.

This pattern is exactly what one would expect for a genuinely ambiguous case. The statement expresses a statistical association that could plausibly be interpreted as causal or non-causal, and the models’ flatter curves reflect that ambiguity.
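The entropy diagnostic works by renormalizing the candidate probabilities over the restricted completion set and computing Shannon entropy; flat curves yield values near the maximum, log2 of the number of scale points. A minimal sketch (surprisal values invented for illustration):

```python
import math

def entropy_from_surprisals(surprisals):
    """Renormalize exp(-s) over the restricted completion set and
    return Shannon entropy in bits; flat curves give high entropy."""
    weights = [math.exp(-s) for s in surprisals]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Invented curves: a flat (ambiguous) case and a peaked (clear-cut) case.
flat = [3.0, 3.1, 3.0, 3.2, 3.1]
peaked = [6.0, 5.5, 0.5, 5.0, 6.5]
print(entropy_from_surprisals(flat))    # near the 5-option maximum, log2(5) ~ 2.32 bits
print(entropy_from_surprisals(peaked))  # much lower: mass concentrated on one option
```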
In contrast, when models were simply wrong (e.g., the 3B model misclassifying “bug” in a software context), they showed confident but incorrect responses with low entropy. This asymmetry is encouraging: it suggests that high entropy genuinely signals task difficulty or ambiguity, while low entropy indicates model confidence, though not necessarily correctness. The aggregate accuracy tables suggest that larger models may be better calibrated in this regard: the 14B-Instruct model achieves the highest accuracy across all four domains on average while also producing flatter curves for ambiguous cases. For practitioners, this may imply that entropy measures can serve as useful flags for cases requiring human review in a human-in-the-loop setting (Wu et al., 2022), complementing the primary classification decision. Future work can test for correlations between human judgments of item difficulty and the model’s entropy for each item.

5.3 Framework Design Considerations

5.3.1 Binary vs. Ordinal Framings

Our experiments with causal statement identification tested both binary (True/False, Yes/No) and ordinal (1–5, 1–9) framings of the same underlying task. These different framings may measure different aspects of model judgment. Binary framings force a categorical decision and reveal the model’s threshold for classification. Although the surprisal difference between the two options provides a crude confidence measure, the information is limited: a single delta tells the researcher the direction and magnitude of preference but nothing about the overall shape of model uncertainty.

Ordinal framings yield richer information through the shape of the entire surprisal curve. A monotonically decreasing curve (as in Figure 10) indicates confident classification with internally consistent ordinal reasoning.
A bowl-shaped or parabolic curve with a minimum somewhere in the middle (as in some of the plots in the middle column of Figure 11) reveals genuine ambiguity or moderate confidence. The steepness of the curve provides additional confidence information: steep curves suggest high certainty, while flat curves suggest uncertainty regardless of where the minimum falls. This finding suggests that practitioners should choose framings based on their information needs. Binary framings suffice for simple yes/no classification tasks. Ordinal framings are preferable when uncertainty quantification is important, when the underlying construct admits degrees rather than categorical distinctions, or when the uncertainty serves as an important signal for downstream tasks.

5.3.2 The Tokenization Challenge

Surprisal-based evaluation is inherently sensitive to tokenization. Because we measure the probability of specific tokens, the choice of completion format directly affects results. For example, “ True” (with a leading space) and “True” (without) are different tokens with potentially different probabilities. In our experiments, we consistently used leading spaces to match natural continuation patterns, but this choice requires explicit documentation. Beyond formatting conventions, we observed potential biases in certain completion formats. In preliminary deductive coding experiments with qualitative scales, the Yes/No scale showed unexpected behavior: “No” was consistently selected as least surprising even when the code appeared applicable. This suggests that models may have prior biases toward certain tokens independent of the classification context, a phenomenon related to the surface form competition described by Holtzman et al. (2021). These findings underscore the importance of testing multiple response formats as a robustness check and of caution in interpreting results from any single format.
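A basic robustness check implied by this discussion is to verify that every candidate completion, with its leading space, maps to a single token before running an experiment. The sketch below uses an invented toy vocabulary purely for illustration; a real check would query the model’s actual tokenizer (e.g., via HuggingFace Transformers):

```python
# Minimal sketch: check that each candidate completion is a single
# token. The vocabulary and ids below are invented for illustration;
# a real check would consult the model's tokenizer.

TOY_VOCAB = {
    " 1": 352, " 2": 362, " 3": 372, " 4": 382, " 5": 392,
    "1": 16,                      # no leading space: a different token id
    " True": 2575, "True": 2514,  # leading space changes the id here too
}

def is_single_token(label, vocab=TOY_VOCAB):
    return label in vocab

scale = [f" {i}" for i in range(1, 6)]
print(all(is_single_token(s) for s in scale))   # True: a 1-5 scale is safe
print(is_single_token(" 10"))                   # False: " 10" is absent, i.e. it would split
print(TOY_VOCAB[" True"] == TOY_VOCAB["True"])  # False: format changes what is measured
```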
These concerns also motivate the use of numeric scales (1–5, 1–9), though it is unclear whether numeric scales are less subject to strong prior biases than semantic labels. Moreover, numeric scales do not escape tokenization challenges entirely: different tokenizers handle multi-digit numbers differently, so a 1–10 or 1–100 scale introduces multi-token completions that complicate clean measurement.

5.4 Theoretical Implications

5.4.1 What Does “Model Belief” Mean?

Throughout this paper, we have used language suggesting that surprisal reveals what models “believe” or find “natural.” This framing warrants caution. Strictly speaking, surprisal measures the probability assigned to tokens given context and nothing more. Whether this constitutes “belief” in any meaningful sense is philosophically contentious (Schwitzgebel, 2011; Goldman, 1979). We can say more confidently that surprisal patterns correlate with human processing difficulty in psycholinguistic research (Hale, 2001; Levy, 2008). To the extent that LLMs trained on human-generated text acquire similar statistical regularities, their surprisal patterns may loosely track what humans find expected or unexpected. However, this is a correlational claim, not an assertion about genuine understanding or belief. For practical purposes, we recommend interpreting surprisal-based results as revealing learned statistical associations rather than making strong claims about model cognition. For now, the framework’s value lies in its efficiency and uncertainty quantification, not in resolving deep questions about machine understanding.

Turpin et al. (2023) demonstrate that chain-of-thought explanations can be disconnected from actual model reasoning, implying that explicitly prompted answers may not reflect underlying representations.
A final speculative conjecture is that surprisal-based measures might provide more direct access to implicit model beliefs, though this hypothesis requires targeted investigation. We speculate that the distinction between surprisal-based and prompting-based evaluation may be analogous in some ways to dual-process theories of cognition (Evans, 2003). In these theories, System 1 processing is automatic and intuitive, while System 2 processing is deliberate and analytical. Surprisal-based evaluation may access something analogous to System 1 responses: immediate, reflexive probability assignments based on learned associations. Prompting with explicit reasoning (e.g., chain-of-thought) may analogously engage System 2-like processing, allowing models to deliberate and potentially override initial impressions, for better or worse.

5.5 Future Directions

The four experiments presented above demonstrate the core methodology and illustrate design principles, but they also point to multiple directions for future work, as noted throughout the paper.

5.5.1 Factual Knowledge Evaluation

A natural extension of the framework is factual knowledge assessment. Prior work has shown that language models encode substantial world knowledge in their weights (Petroni et al., 2019), but systematic evaluation of that knowledge remains challenging. Surprisal-based evaluation could address this by measuring surprisal for correct versus incorrect completions of factual statements, e.g., comparing surprisal for “Paris,” “London,” and “Berlin” given the prompt “The capital of France is.” Lower surprisal for the correct answer would indicate that the model has encoded the relevant factual association. Applied at scale, this could enable systematic mapping of factual knowledge across domains and model versions.

5.5.2 Bias and Fairness Analysis

Bias detection is another promising application.
Research has documented that language models can encode and amplify societal biases present in training data (Bolukbasi et al., 2016), yet detecting these biases often requires generating text and subjectively evaluating it. Surprisal-based analysis offers a more direct and quantitative alternative by comparing surprisal patterns across demographic groups. For instance, measuring surprisal for gendered pronouns across occupations (e.g., “The [doctor/nurse] walked into the room. [He/She]”) would reveal whether a model finds certain gender–occupation pairings more “expected” than others, directly quantifying stereotypical associations encoded in the model’s probability distribution. Because this approach accesses implicit associations rather than explicitly generated statements, it may surface subtle biases that are not apparent in generated text.

6 Limitations

While our experiments suggest the promise of surprisal-based evaluation, several limitations constrain its current applicability and interpretation, on top of those noted above, such as the single model family and the synthetic datasets used in some of the experiments.

6.1 Measurement Constraints

The current framework imposes several practical constraints. First, completions are limited to single tokens for clean measurement, which restricts response formats. Multi-token answers like “Strongly agree” cannot be directly compared with single-token alternatives without averaging or modification. For example, rather than using a 1–100 scale, we use a 1–9 scale to avoid the potential multi-token completions associated with two- or three-digit numbers. While averaging across tokens may be a valid approach, as is done when studying the surprisal of a sentence (Rezaii et al., 2023; Huber et al., 2024), it is unclear how well that technique would work with these scale-based experiments.
Second, scale design involves non-trivial choices: odd versus even scales affect central tendency, anchor wording affects interpretation (Chen et al., 2015), and the optimal number of scale points (3, 5, 7, 9) likely varies by task. We tested 5-point and 9-point scales but did not systematically compare their effectiveness. Third, tokenization sensitivity means that results depend on formatting choices (leading spaces, capitalization) that may not be obvious to practitioners. Likewise, what a 1 and a 9 correspond to (as dictated by the prompt and anchors) may also be ambiguous or arbitrary.

6.2 Calibration and Interpretability

A key limitation is that model confidence (hypothetically measured by entropy) does not guarantee accuracy. In our experiments, smaller models sometimes showed high confidence (low entropy) on incorrect answers (e.g., the 3B model on SETS scoring tasks). This means low-entropy responses cannot automatically be trusted without validation. Conversely, high-entropy cases signal uncertainty but do not distinguish between genuine task ambiguity and model confusion. Language models are not necessarily calibrated to generate accurate self-reflection on their own confidence estimates (Geng et al., 2024; Kadavath et al., 2022). We proposed entropy as a proxy for confidence, but it remains an open question whether this relationship generally holds. At a minimum, one would expect the calibration relationship between entropy and accuracy to vary by model, domain, and task type. Establishing reliable calibration would require extensive validation studies, potentially undermining the efficiency gains that motivate the approach. For high-stakes applications, we currently recommend treating surprisal-based results as preliminary assessments requiring human verification rather than final decisions.
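One concrete form such a validation study could take is to bin items by entropy and compare accuracy across bins; under good calibration, accuracy should fall as entropy rises. The sketch below illustrates the shape of that analysis with invented (entropy, correctness) pairs, not experimental data:

```python
# Minimal sketch of an entropy-calibration check: bin items by their
# entropy and compute accuracy per bin. The (entropy, correct) pairs
# are invented; under good calibration, accuracy falls as entropy rises.

def accuracy_by_entropy_bin(items, edges=(0.5, 1.5)):
    """items: (entropy_bits, was_correct) pairs; edges split low/mid/high."""
    bins = {"low": [], "mid": [], "high": []}
    for entropy, correct in items:
        if entropy < edges[0]:
            bins["low"].append(correct)
        elif entropy < edges[1]:
            bins["mid"].append(correct)
        else:
            bins["high"].append(correct)
    return {name: sum(v) / len(v) if v else None for name, v in bins.items()}

items = [(0.2, True), (0.3, True), (0.4, False),   # confident, mostly right
         (1.0, True), (1.2, False),                # middling
         (2.0, False), (2.1, True), (2.3, False)]  # uncertain, mostly wrong
print(accuracy_by_entropy_bin(items))
```

On real data, a flat or inverted profile across bins would argue against using entropy as a confidence proxy for that model and task.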
Another limitation of surprisal-based evaluation is the broader interpretability of the approach. Ultimately, surprisal is just a function of the probability distribution over tokens; it does not explain why the model prefers certain completions. Moreover, the model might respond differently if asked to generate reasoning, as in a test-time compute setup (Snell et al., 2024). Insofar as reasoning traces may provide insight into the model’s reasoning process, surprisal-based evaluation does not provide access to this information. However, some evidence suggests those reasoning traces are only loosely coupled to the model’s actual decision process or internal representations (Arcuschin et al., 2025). Additionally, models have inherent a priori preferences for certain tokens independent of the context, so some of the observed surprisal scores could reflect these biases rather than the context.

6.3 Access to Token-Level Probabilities

Surprisal-based evaluation requires access to token-level log-probabilities (logits), which are available when running open-weight models locally (e.g., via HuggingFace Transformers) but may not be exposed by proprietary API-only models. As of this writing, some commercial APIs (e.g., OpenAI) provide limited log-probability access for top-k tokens, while others provide no access at all. This means the framework in its current form is most directly applicable to open-weight models. However, to the extent that API providers increasingly expose log-probabilities (or that open-weight models continue to close the performance gap with proprietary models), this limitation may diminish over time. Researchers who require evaluation of closed models may need to rely on prompting-based methods or on approximate probability estimates obtained through repeated sampling, which forfeits the efficiency advantages of single-pass surprisal measurement.
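The repeated-sampling fallback mentioned above can be made concrete: estimate P(answer) from sample frequencies over many generation calls, then take the negative log of the estimate. In the sketch below, a seeded random sampler stands in for the repeated API calls; the numbers are illustrative only:

```python
import math
import random

def estimate_surprisal(sample_fn, target, n=1000):
    """Approximate surprisal of `target` in bits from n sampled
    completions; each sample_fn() call stands in for one API request."""
    hits = sum(sample_fn() == target for _ in range(n))
    p_hat = max(hits / n, 1.0 / (n + 1))  # floor the estimate to avoid -log(0)
    return -math.log2(p_hat)

# A seeded sampler simulating a model that answers "3" 60% of the time.
rng = random.Random(0)
def fake_model():
    return rng.choices(["3", "4"], weights=[0.6, 0.4])[0]

print(estimate_surprisal(fake_model, "3"))  # near -log2(0.6) ~ 0.74 bits
print(estimate_surprisal(fake_model, "5"))  # never sampled: large, floored value
```

The contrast with single-pass measurement is stark: one forward pass yields exact surprisal for every candidate, whereas the sampling estimate needs hundreds of calls per item and still cannot resolve low-probability completions below the floor.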
6.4 Absence of Direct Comparison with Prompting-Based Evaluation

A notable gap in the present work is that we do not directly compare surprisal-based classification with standard prompting-based classification on the same tasks. While we argue that the two approaches are complementary, this claim would be substantially strengthened by empirical evidence showing (a) whether they agree or diverge, (b) the conditions in which each approach excels, and (c) whether their combination yields better performance than either alone. Such a comparison is a priority for ongoing work and would involve running both zero-shot prompting and surprisal-based classification on the same statement sets with the same models, comparing accuracy, agreement rates, and confidence estimates across methods.

7 Conclusion

The minimal pairs paradigm has gained adoption as a valuable approach for probing linguistic knowledge in language models, but its application has largely stopped at binary grammaticality contrasts. This paper argues for extending the paradigm to ordinal scales and to tasks beyond grammaticality prediction. Extending to ordinal scales enables measurement not only of the preferred response but of the full surprisal distribution, providing richer evaluation signals suitable for applied classification tasks across domains. Our experiments, which were motivated by ongoing research in mental models, including entity scoring according to the SETS framework, causal reasoning, figurative language detection, and deductive coding, supported this claim. Surprisal curves tended to produce internally consistent classification signals, and entropy over the token completion set promisingly distinguished genuinely ambiguous items from confident errors. To be sure, these results are preliminary, and key gaps remain.
Most notably, there was no direct comparison with prompting-based methods, and there is room for systematic calibration studies. Nonetheless, the framework’s core properties of single-pass efficiency, principled uncertainty quantification, and access to implicit model judgments rather than generated rationalizations position it as a complement to existing evaluation approaches. We hope the demonstrations in this paper serve as a starting point for broader model families and domains.

Acknowledgments

This work was supported in part by a grant from the Virginia Tech Academy of Data Science Discovery Fund and NSF grants EEC 2107008, DUE 2300977, and 2339702.

References

Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. 2025. Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint arXiv:2503.08679.

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do neural machine translation models learn about morphology? arXiv preprint arXiv:1704.03471.

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623.

Leonardo Berti, Flavio Giorgi, and Gjergji Kasneci. 2025. Emergent abilities in large language models: A survey. arXiv preprint arXiv:2503.05788.

Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. Advances in Neural Information Processing Systems, 29.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners.
Advances in Neural Information Processing Systems, 33:1877–1901.

Xinxin Chen, Hongyan Yu, and Fang Yu. 2015. What is the optimal number of response alternatives for rating scales? From an information processing perspective. Journal of Marketing Analytics, 3(2):69–78.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. arXiv preprint arXiv:1805.01070.

Vera Demberg and Frank Keller. 2008. Data from eye-tracking corpora as evidence for theories of syntactic processing complexity. Cognition, 109(2):193–210.

Jonathan St B. T. Evans. 2003. In two minds: Dual-process accounts of reasoning. Trends in Cognitive Sciences, 7(10):454–459.

Stefan L. Frank, Leun J. Otten, Giulia Galli, and Gabriella Vigliocco. 2015. The ERP response to the amount of information conveyed by words in sentences. Brain and Language, 140:1–11.

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2024. A framework for few-shot language model evaluation. https://github.com/EleutherAI/lm-evaluation-harness. Zenodo. doi:10.5281/zenodo.10256836.

Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, and Iryna Gurevych. 2024. A survey of confidence estimation and calibration in large language models.
In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6577–6595.

Ariel Gera, Odellia Boni, Yotam Perlitz, Roy Bar-Haim, Lilach Eden, and Asaf Yehudai. 2025. JuStRank: Benchmarking LLM judges for system ranking. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 682–712.

Alvin I. Goldman. 1979. What is justified belief? In Justification and Knowledge: New Studies in Epistemology, pages 1–23. Springer.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321–1330. PMLR.

John Hale. 2001. A probabilistic Earley parser as a psycholinguistic model. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics, pages 159–166.

Linyang He, Peili Chen, Ercong Nie, Yuanning Li, and Jonathan R. Brennan. 2024. Decoding probing: Revealing internal linguistic structures in neural language models using minimal pairs. arXiv preprint arXiv:2403.17299.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 4129–4138.

Jakob Hohwy. 2020. New directions in predictive processing. Mind & Language, 35(2):209–223.

Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. 2021. Surface form competition: Why the highest probability answer isn’t always right. arXiv preprint arXiv:2104.08315.
Jennifer Hu, Kyle Mahowald, Gary Lupyan, Anna Ivanova, and Roger Levy. 2024. Language models align with human judgments on key grammatical constructions. Proceedings of the National Academy of Sciences, 121(36):e2400917121.

Eva Huber, Sebastian Sauppe, Arrate Isasi-Isasmendi, Ina Bornkessel-Schlesewsky, Paola Merlo, and Balthasar Bickel. 2024. Surprisal from language models can predict ERPs in processing predicate-argument structures only if enriched by an agent preference principle. Neurobiology of Language, 5(1):167–200.

Frederick Jelinek, Robert L. Mercer, Lalit R. Bahl, and James K. Baker. 1977. Perplexity—a measure of the difficulty of speech recognition tasks. The Journal of the Acoustical Society of America, 62(S1):S63–S63.

Hongchao Jiang, Yiming Chen, Yushi Cao, Hung-yi Lee, and Robby T. Tan. 2025. CodeJudgeBench: Benchmarking LLM-as-a-judge for coding tasks. arXiv preprint arXiv:2507.10535.

Natalie A. Jones, Helen Ross, Timothy Lynam, Pascal Perez, and Anne Leitch. 2011. Mental models: An interdisciplinary synthesis of theory and methods. Ecology and Society, 16(1):46.

Daniel Jurafsky and James H. Martin. 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall.

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. 2022. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.

Emre Kiciman, Robert Ness, Amit Sharma, and Chenhao Tan. 2023. Causal reasoning and large language models: Opening a new frontier for causality. Transactions on Machine Learning Research.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213.
Gina R. Kuperberg and T. Florian Jaeger. 2016. What do we mean by prediction in language comprehension? Language, Cognition and Neuroscience, 31(1):32–59.

George Lakoff and Mark Johnson. 1980. Metaphors We Live By. University of Chicago Press.

Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. 2023. Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702.

Evelina Leivada, Raquel Montero, Paolo Morosi, Natalia Moskvina, Tamara Serrano, Marcel Aguilar, and Fritz Guenther. 2025. Large language model probabilities cannot distinguish between possible and impossible language. arXiv preprint arXiv:2509.15114.

Roger Levy. 2008. Expectation-based syntactic comprehension. Cognition, 106(3):1126–1177.

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2022. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.

Olimpia Lombardi, Federico Holik, and Leonardo Vanni. 2016. What is Shannon information? Synthese, 193(7):1983–2012.

Francesco Manigrasso, Stefan Schouten, Lia Morra, and Peter Bloem. 2024. Probing LLMs for logical reasoning. In International Conference on Neural-Symbolic Learning and Reasoning, pages 257–278. Springer.

Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202.

Timon McPhearson, Elizabeth M. Cook, Marta Berbés-Blázquez, Chingwen Cheng, Nancy B. Grimm, Erik Andersson, Olga Barbosa, David G. Chandler, Heejun Chang, Mikhail V. Chester, et al. 2022. A social-ecological-technological systems framework for urban ecosystem services. One Earth, 5(5):505–518.
Kanishka Misra, Allyson Ettinger, and Julia Taylor Rayz. 2020. Exploring BERT’s sensitivity to lexical cues using tests from semantic priming. arXiv preprint arXiv:2010.03010.

OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

Kwonsik Park, Myung-Kwan Park, and Sanghoun Song. 2021. Deep learning can contrast the minimal pairs of syntactic data. Linguistic Research, 38(2):395–424.

Judea Pearl. 2009. Causality. Cambridge University Press.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? arXiv preprint arXiv:1909.01066.

Pouya Pezeshkpour and Estevam Hruschka. 2024. Large language models sensitivity to the order of options in multiple-choice questions. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 2006–2017.

Giovanni Pezzulo, Thomas Parr, and Karl Friston. 2022. The evolution of brain architectures for predictive coding and active inference. Philosophical Transactions of the Royal Society B: Biological Sciences, 377(1844).

Tiago Pimentel, Josef Valvoda, Rowan Hall Maudslay, Ran Zmigrod, Adina Williams, and Ryan Cotterell. 2020. Information-theoretic probing for linguistic structure. arXiv preprint arXiv:2004.03061.

Timothy Pistotti, Jason Brown, and Michael J. Witbrock. 2025. Exploring gaps in the APs: Direct minimal pair analysis in LLM syntactic assessments. In Proceedings of the Second Workshop on the Bridges and Gaps between Formal and Computational Linguistics (BriGap-2), pages 20–25.

Paulius Rauba, Nabeel Seedat, Max Ruiz Luyten, and Mihaela van der Schaar. 2024. Context-aware testing: A new paradigm for model testing with large language models. Advances in Neural Information Processing Systems, 37:112505–112553.
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. 2024. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling.

Neguine Rezaii, James Michaelov, Sylvia Josephy-Hernandez, Boyu Ren, Daisy Hochberg, Megan Quimby, and Bradford C. Dickerson. 2023. Measuring sentence information via surprisal: Theoretical and clinical implications in nonfluent aphasia. Annals of Neurology, 94(4):647–657.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866.

William B. Rouse and Nancy M. Morris. 1986. On looking into the black box: Prospects and limits in the search for mental models. Psychological Bulletin, 100(3):349–363.

Julian Salazar, Davis Liang, Toan Q. Nguyen, and Katrin Kirchhoff. 2020. Masked language model scoring. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2699–2712.

Johnny Saldaña. 2016. The Coding Manual for Qualitative Researchers, 3rd edition. Sage.

Eric Schwitzgebel. 2011. Belief. In The Routledge Companion to Epistemology, pages 14–24. Routledge.

Claude E. Shannon. 1948. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423.

Koustuv Sinha, Jon Gauthier, Aaron Mueller, Kanishka Misra, Keren Fuentes, Roger Levy, and Adina Williams. 2023. Language model acceptability judgements are not always robust to context. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6043–6063.

Sophie Slaats and Andrea E. Martin. 2025. What’s surprising about surprisal. Computational Brain & Behavior, 8(2):233–248.

Nathaniel J. Smith and Roger Levy. 2013. The effect of word predictability on reading time is logarithmic.
Cognition, 128(3):302–319.

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 2024. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314.

Mark Sprevak and Ryan Smith. 2023. An introduction to predictive processing models of perception and decision-making. Topics in Cognitive Science.

Jon Sprouse, Carson T Schütze, and Diogo Almeida. 2013. A comparison of informal and formal acceptability judgments using a random sample from Linguistic Inquiry 2001–2010. Lingua, 134:219–248.

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601.

Miles Turpin, Julian Michael, Ethan Perez, and Samuel R Bowman. 2023. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. arXiv preprint arXiv:2305.04388.

Ruiqi Wang, Jiyu Guo, Cuiyun Gao, Guodong Fan, Chun Yong Chong, and Xin Xia. 2025. Can LLMs replace human evaluators? An empirical study of LLM-as-a-judge in software engineering. Proceedings of the ACM on Software Engineering, 2(ISSTA):1955–1977.

Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, and Samuel R Bowman. 2020. BLiMP: The benchmark of linguistic minimal pairs for English. Transactions of the Association for Computational Linguistics, 8:377–392.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al.
2022a. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022b. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.

Ethan Wilcox, Roger Levy, Takashi Morita, and Richard Futrell. 2018. What do RNN language models learn about filler-gap dependencies? arXiv preprint arXiv:1809.00042.

Ethan Wilcox, Peng Qian, Richard Futrell, Miguel Ballesteros, and Roger Levy. 2019. Structural supervision improves learning of non-local grammatical dependencies. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 3302–3312.

Ethan Gotlieb Wilcox, Jon Gauthier, Jennifer Hu, Peng Qian, and Roger Levy. 2020. On the predictive power of neural language models for human real-time comprehension behavior. arXiv preprint arXiv:2006.01912.

Daniel Williams. 2018. Predictive processing and the representation wars. Minds and Machines, 28(1):141–172.

Xingjiao Wu, Luwei Xiao, Yixuan Sun, Junhang Zhang, Tianlong Ma, and Liang He. 2022. A survey of human-in-the-loop for machine learning. Future Generation Computer Systems, 135:364–381.

Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. 2023. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. arXiv preprint arXiv:2306.13063.

Ilker Yildirim and LA Paul. 2024. From task structures to world models: What do LLMs know? Trends in Cognitive Sciences, 28(5):404–415.

Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In International Conference on Machine Learning, pages 12697–12706. PMLR.
Xinyu Zhou, Delong Chen, Samuel Cahyawijaya, Xufeng Duan, and Zhenguang Cai. 2025. Linguistic minimal pairs elicit linguistic similarity in large language models. In Proceedings of the 31st International Conference on Computational Linguistics, pages 6866–6888.

A SETS Classification Prompts

A.1 SETS Framework Context

The Social-Ecological-Technological Systems (SETS) framework analyzes entities across three interconnected dimensions:

• Social: Human aspects such as community interactions, governance, economic systems, cultural values, and social equity
• Ecological: The natural environment and its components, often involved in biophysical processes, including natural resources, ecosystem functions, and environmental conditions
• Technological: Human-made systems and engineered infrastructures, including infrastructure, technological tools, and innovations

The framework can be used to classify entities and concepts based on their alignment with these dimensions. When doing so, it is helpful to consider not only the entity but also the surrounding context in which it was mentioned.

A.2 Prompt Template Structure

For each entity and dimension, we measure surprisal for scores 1-9 using the following template:

    [SETS Framework Context]

    Consider the following context and entity:
    Context: "[context_sentence]"
    Entity: "[entity]"
    On a scale from 1-9, where 1 corresponds to the entity having no [dimension] characteristics and 9 corresponds to extremely high [dimension] characteristics, the entity "[entity]" score on the [dimension] dimension is:

A.3 Example Prompts

A.3.1 Example 1: Park (Ecological Dimension)

    The Social-Ecological-Technological Systems (SETS) framework analyzes entities across three interconnected dimensions: [...]
    Consider the following context and entity:
    Context: "We went to the neighborhood park."
    Entity: "park"
    On a scale from 1-9, where 1 corresponds to the entity having no ecological characteristics and 9 corresponds to extremely high ecological characteristics, the entity "park" score on the ecological dimension is:

We then measure surprisal for completions "0", "1", ..., "9".

A.3.2 Example 2: Virus (Technological Dimension)

For the homonym "virus," context determines scoring:

Computer virus context:

    Context: "The computer virus corrupted files and spread through email attachments."
    Entity: "virus"
    On a scale from 1-9, [...] the entity "virus" score on the technological dimension is:

Biological virus context:

    Context: "The virus was detected."
    Entity: "virus"
    On a scale from 1-9, [...] the entity "virus" score on the ecological dimension is:

B Causal Statement Classification Prompts

B.1 Context Levels

B.1.1 Minimal Context

    Causality refers to relationships where one event causes or influences another. Look for statements that express how one thing brings about, leads to, or is responsible for another thing.

B.1.2 Full Context

The full context provides detailed guidance on causal linguistic markers. An abbreviated version is shown below; the complete version includes 18 categories (a–r) covering explicit markers, causal verbs, noun phrases, adverbial clauses, implied causation, causal reasoning, causative types, causal chain complexity, temporal ordering, counterfactual causality, causal strength, discourse-level causality, implicit causality, correlation vs. causation distinctions, multiple causation, causal chains, parallel causation, and branching causation.

    Causal relationships can be indicated through various linguistic structures and at different levels of language.
    Be aware of the following indicators:
    a) Explicit markers: because, since, therefore, thus, hence, consequently, so that, in order to
    b) Causal verbs: cause, result, produce, generate, lead, induce, trigger, prompt
    c) Noun phrases: causes, reasons, effects, consequences, outcomes
    d) Adverbial clauses: due to, owing to, as a result of, because of
    e) Implied causation: If-then constructions, resultative constructions
    f) Causal reasoning: expressions of purpose (to, in order to), statements of intention (intend to, aim to)
    g) Types of causatives: Lexical, Morphological, Periphrastic
    h) Causal chain complexity: Simple (A causes B) or complex (A causes B, which causes C)
    i) Temporal ordering: Cause typically precedes effect, but language allows various orderings
    j) Counterfactual causality: "If X hadn't happened, Y wouldn't have occurred"
    k) Causal strength and probability: Language expressing degrees of causal influence
    l) Discourse-level causality: Causal relationships spanning across sentences
    m) Implicit causality: Verbs carrying implicit causal information
    n) Correlation vs. Causation: Be particularly careful to distinguish between:
    - True causal relationships where one event directly influences another
    - Mere temporal correlation or co-occurrence
    - Statistical association without clear causation
    - Sequential events without proven causation
    o) Multiple causation: When multiple distinct causal relationships exist in the same statement
    p) Causal chains: When A causes B, which in turn causes C
    q) Parallel causation: When multiple causes lead to the same effect
    r) Branching causation: When one cause leads to multiple effects

B.2 Scale Framings

We test five different scale framings to explore how question framing affects surprisal patterns:

B.2.1 Framing 1: Bipolar Causality Scale

    [Context]
    On a scale from 1 to 5, rate the causal content of this statement:
    "[statement]"
    1 = Definitely non-causal
    3 = Neutral/uncertain
    5 = Definitely causal
    Rating:

B.2.2 Framing 2: Belief Strength Scale

    [Context]
    The following statement expresses a causal relationship: True or False
    "[statement]"
    On a scale from 1 to 5, how strongly do you believe this answer:
    1 = Definitely False (not causal)
    5 = Definitely True (causal)
    Rating:

B.2.3 Framing 3: Probability Scale

    [Context]
    What is the probability that this statement expresses causality:
    "[statement]"
    Rate from 1 to 5:
    1 = 20% probability (very unlikely to be causal)
    5 = 100% probability (very likely to be causal)
    Rating:

B.2.4 Framing 4: Causal Strength Scale

    [Context]
    How strong is the causal content in this statement:
    "[statement]"
    Rate from 1 to 5:
    1 = No causal content
    5 = Very strong causal content
    Rating:

B.2.5 Framing 5: Dual Classification

    [Context]
    Rate this statement on both dimensions:
    "[statement]"
    A) How causal is this statement? (1 to 5)
    1 = Not causal at all, 5 = Highly causal
    B) How non-causal is this statement?
    (1 to 5)
    1 = Not non-causal at all, 5 = Highly non-causal
    A) Rating:

For the dual classification framing, we measure surprisal for both A and B completions (e.g., "A_1", "A_2", ..., "B_1", "B_2", ...).

B.3 Example with Minimal Context

    Causality refers to relationships where one event causes or influences another. Look for statements that express how one thing brings about, leads to, or is responsible for another thing.

    How strong is the causal content in this statement:
    "Monitoring stations indicate that heavy rainfall led to widespread flooding in low-lying areas, according to reports."
    Rate from 1 to 5:
    1 = No causal content
    5 = Very strong causal content
    Rating:

C Figurative Language Detection Prompts

C.1 Context Levels

C.1.1 Full Context

    Figurative language uses words or expressions with meanings different from their literal interpretation to create vivid imagery, comparisons, or emphasis. Key types include:
    1. METAPHORS: Direct comparisons that state one thing IS another thing (e.g., "Love is a battlefield")
    2. ANALOGIES: Extended comparisons that explain one concept by comparing it to another (e.g., "The economy works like a machine")
    3. SIMILES: Comparisons using "like" or "as" (e.g., "Bright as the sun")
    4. PERSONIFICATION: Giving human characteristics to non-human things (e.g., "The ocean roared with fury")
    Metaphors create implicit comparisons without using comparison words, while analogies typically involve more detailed structural comparisons between different domains. Similes make explicit comparisons using comparison words, and personification attributes human qualities to non-human entities.

C.1.2 Minimal Context

    Figurative language uses non-literal meanings to create comparisons or emphasis.
    Key types:
    • Metaphors: Direct comparisons (X is Y) without "like/as"
    • Analogies: Extended comparisons explaining concepts through structural similarities
    • Similes: Explicit comparisons using "like" or "as"
    • Personification: Giving human qualities to non-human things

C.2 Binary Classification Prompts

C.2.1 True/False Format - Metaphor Detection

    [Context]
    The following statement contains a metaphor: True or False
    "Time is money."
    Answer:

Completion targets: "True", "False"

C.2.2 Yes/No Format - Analogy Detection

    [Context]
    Does the following statement contain an analogy?
    "The brain works like a computer."
    Answer (Yes or No):

Completion targets: "Yes", "No"

C.3 Intensity Scale Prompts

C.3.1 Metaphor Intensity (1-5 Scale)

    [Context]
    On a scale from 1 to 5, rate how metaphorical this statement is:
    "Time is money."
    1 = Completely literal
    3 = Somewhat metaphorical
    5 = Highly metaphorical
    Rating:

Completion targets: "1", "2", "3", "4", "5"

C.3.2 Analogy Strength (1-9 Scale)

    [Context]
    On a scale from 1 to 9, rate the analogical content of this statement:
    "The economy works like a machine."
    1 = No analogy present
    5 = Moderate analogical content
    9 = Strong analogy
    Rating:

Completion targets: "1" through "9"

C.4 Multi-Category Classification

    [Context]
    Does this statement contain a metaphor, analogy, simile, personification, or none of these?
    "The wind whispered through the trees."
    Answer:

Completion targets: "metaphor", "analogy", "simile", "personification", "none"

D Deductive Coding Prompts

D.1 Task Structure

Deductive coding applies predefined codes (with definitions) to unstructured text. Our prompts systematically vary experimental factors to assess their impact on coding accuracy.

D.2 Prompt Components

A deductive coding prompt consists of:
1. Optional Context: Background on deductive coding methodology
2. Optional Persona: Role assignment (qualitative researcher, domain expert)
3. Text Section: Survey response or text to be coded
4. Code Section: Code name and/or definition
5. Task Section: Instructions and scale for rating

D.3 Section Delimiter Styles

D.3.1 XML-Style Delimiters

    Survey Response: "Work/life balance continues to be the biggest challenge, because the amount of work is still elevated and I don't see that changing."
    Code: work/life boundaries
    Definition: Challenges with maintaining separation between work and personal life, especially during remote work

D.3.2 ALL CAPS Delimiters

    TEXT TO CODE:
    Survey Response: "Work/life balance continues to be the biggest challenge, because the amount of work is still elevated and I don't see that changing."
    CODE INFORMATION:
    Code: work/life boundaries
    Definition: Challenges with maintaining separation between work and personal life, especially during remote work

D.4 Numeric Scale Prompts

D.4.1 1-5 Scale with XML Delimiters

    Survey Response: "Work/life balance continues to be the biggest challenge."
    Code: work/life boundaries
    Definition: Challenges with maintaining separation between work and personal life
    Your task is to rate how well the given code applies to the survey response text. On a scale from 1-5, where 1 means "not applicable at all" and 5 means "highly applicable", the score for applying the code in to the text in is:

Completion targets: "1", "2", "3", "4", "5"

D.4.2 Reversed 1-5 Scale

Some conditions test a reversed scale where 1 = highly applicable and 5 = not applicable:

    On a scale from 1-5, where 1 means "highly applicable" and 5 means "not applicable at all", the score for applying this code to this text is:

D.5 Qualitative Scale Prompts

D.5.1 Intensity Scale

    [Text and Code sections]
    Your task is to rate how well the given code applies to the survey response text.
    Use the following scale:
    • none = the code does not describe the text at all
    • weak = the code slightly or minimally describes the text
    • medium = the code moderately describes the text
    • strong = the code strongly describes the text
    • perfect = the code perfectly or completely describes the text
    Using the scale provided above, the intensity of applicability of the code in to the text in is:

Completion targets: "none", "weak", "medium", "strong", "perfect"

D.5.2 Evidence Scale

    Use the following scale:
    • negligible = there is no evidence that the code applies to the text
    • weak = there is weak or minimal evidence that the code applies to the text
    • moderate = there is moderate evidence that the code applies to the text
    • strong = there is strong or compelling evidence that the code applies to the text
    Using the scale provided above, the evidence that this code applies to this text is:

Completion targets: "negligible", "weak", "moderate", "strong"

D.5.3 Binary True/False Scale

    Use the following scale:
    • false = it is false that the code should be applied to the text
    • true = it is true that the code should be applied to the text
    Using the scale provided above, the statement 'given the context, this code should be applied to this text' is:

Completion targets: "false", "true"

D.6 Experimental Factors

The deductive coding experiments systematically vary:
• Personas: None, qualitative researcher, domain expert
• Scale types: Numeric (1-5, 1-9) vs. Qualitative (intensity, evidence, binary)
• Scale anchors: Different endpoint wordings for numeric scales
• Scale direction: Standard (1=low, 5=high) vs. Reversed (1=high, 5=low)
• Code presentation: Code+definition, code only, definition only
• Section delimiters: XML tags vs. ALL CAPS
• Context levels: None, minimal, full deductive coding background

D.7 Complete Example with Persona

    You are an experienced qualitative researcher skilled in systematic coding and analysis of textual data.

    Survey Response: "Public safety concerns remain paramount, particularly regarding mask mandates and vaccine requirements."
    Code: public safety (i.e., masks, vaccines, etc.)
    Definition: References to public health measures, policies, and safety protocols

    Your task is to rate how well the given code applies to the survey response text. On a scale from 1-9, where 1 means "doesn't apply" and 9 means "applies perfectly", the score for applying the code in to the text in is:

Completion targets: "1" through "9"

E Ordinal-Scaled Causal Surprisal Curves

Figures 10 and 11 show the full ordinal-scaled surprisal curves for the causal statement experiments discussed in Section 4. Each figure displays all combinations of model (rows), context level, and scale length (columns), with two scale framings (causal strength and bipolar causality) shown as separate lines in each panel.

Figure 10: Surprisal curves for the causal statement "The heavy rainfall caused widespread flooding in the city" across all model, context, and scale combinations. Each panel shows two scale framings (causal strength and bipolar causality). Monotonically decreasing curves indicate consistent assignment of high causal ratings.

Figure 11: Surprisal curves for the correlational statement "Students who study more tend to get better grades" across all model, context, and scale combinations. Compared to the causal statement in Figure 10, the curves are flatter and more parabolic, reflecting model uncertainty about the causal content.
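The quantities plotted in these curves can be sketched in a few lines of Python. This is an illustrative implementation rather than the authors' code: the function names (surprisal_curve, entropy_over_completions) are ours, and the per-completion log probabilities below are invented example values standing in for a model's next-token scores over the completion targets.

```python
import math

def surprisal_curve(logprobs):
    """Map per-completion log probabilities (natural log) to surprisal:
    s(c) = -log p(c). Lower surprisal = more preferred completion."""
    return {c: -lp for c, lp in logprobs.items()}

def entropy_over_completions(logprobs):
    """Renormalize probability mass over the scale positions only,
    then compute Shannon entropy in bits. Higher entropy suggests
    the model is less decided among the positions."""
    probs = [math.exp(lp) for lp in logprobs.values()]
    z = sum(probs)
    probs = [p / z for p in probs]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Invented log probabilities for completions "1".."5" on a rating scale;
# in practice these come from scoring each completion with the model.
logprobs = {"1": -4.0, "2": -3.0, "3": -2.0, "4": -1.0, "5": -0.5}

curve = surprisal_curve(logprobs)
preferred = min(curve, key=curve.get)  # scale position with minimum surprisal
h = entropy_over_completions(logprobs)
```

Under this sketch, a monotonically decreasing curve like the one above yields a preferred rating at the top of the scale, while a flatter curve yields entropy closer to the log2(5) maximum for a 5-point scale.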