
Measuring What Matters—or What's Convenient?: Robustness of LLM-Based Scoring Systems to Construct-Irrelevant Factors

Cole Walsh [0000-0002-6284-8926] and Rodica Ivan [0000-0001-6031-7200]

Acuity Insights Inc., Toronto, ON, Canada
{cwalsh,rivan}@acuityinsights.com

Abstract. Automated systems have been widely adopted across the educational testing industry for open-response assessment and essay scoring. These systems commonly achieve performance levels comparable or superior to trained human raters, but have frequently been demonstrated to be vulnerable to the influence of construct-irrelevant factors (i.e., features of responses that are unrelated to the construct assessed) and adversarial conditions. Given the rising usage of large language models in automated scoring systems, there is a renewed focus on "hallucinations" and the robustness of these LLM-based automated scoring approaches to construct-irrelevant factors. This study investigates the effects of construct-irrelevant factors on a dual-architecture LLM-based scoring system designed to score short essay-like open-response items in a situational judgment test. It was found that the scoring system was generally robust to padding responses with meaningless text, spelling errors, and writing sophistication. Duplicating large passages of text resulted in lower scores predicted by the system, on average, contradicting results from previous studies of non-LLM-based scoring systems, while off-topic responses were heavily penalized by the scoring system. These results provide encouraging support for the robustness of future LLM-based scoring systems when designed with construct relevance in mind.

Keywords: Automated Scoring · Large Language Models · Predictive Models · Assessment · Situational Judgment Test · AI Robustness

1 Introduction

Automatic evaluation of open-response text, including short-answer responses and essays, is one of the earliest and most widely explored applications of natural language processing and artificial intelligence (AI) in education. Over time, methods for automatically evaluating written work have evolved from using handcrafted features (e.g., type-token ratios, part-of-speech tagging) and simple models [10] to more complex approaches including neural networks and transformer-based models [10, 15]. With each new development, automated scoring systems have consistently demonstrated strong alignment with human raters [10].

Despite their strong overall performance and successful adoptions, studies have demonstrated the susceptibility of many automated scoring systems to construct-irrelevant factors (i.e., features of responses that are unrelated to the construct assessed) regardless of underlying architecture. Chief among these construct-irrelevant factors is text length; early studies involving simple models demonstrated that repeating sentences or entire paragraphs within an essay could artificially inflate scores predicted by these models [14], while more recent studies have demonstrated that this issue persists even in transformer-based scoring systems [15]. Other studies have demonstrated outsized impacts of injecting particular words or phrases (e.g., "from one perspective", "on the other hand", "in the final analysis") [14] and including off-topic text [14, 18].
The fragility of existing automated scoring systems under such adversarial conditions undermines trust in these systems to accurately measure the intended construct rather than exploitable proxies.

In recent years, automated scoring research has begun to focus on systems employing large language models (LLMs) either directly [11] or as part of more complex scoring systems [1, 17]. While these systems have not yet met state-of-the-art performance metrics achieved by transformer-based systems in many cases [1, 7], what they sacrifice in accuracy they make up for in other ways: creating human-interpretable scores and feedback, reducing the amount of required training data, and easing adoption by less technical assessment developers. With this rise in LLM-based scoring solutions, there is a renewed focus on the robustness of these systems under adversarial conditions, especially with widespread public awareness of the limitations in the underlying technology (e.g., "hallucinations" [3]), which was not the case with prior modes of automated scoring.

This study contributes towards our understanding of the effects of construct-irrelevant factors on LLM-based scoring systems by investigating how the following factors influence scores produced by one such system:

1. Meaningless Text,
2. Writing Sophistication, and
3. Off-Topic Texts

Writing sophistication is often considered a construct-relevant factor by assessments of writing or language proficiency [1, 13]. As will be discussed in greater detail in Sec. 2.1, we constructed our LLM-based scoring system for a situational judgment test (SJT) designed to measure personal and professional skills (e.g., communication, teamwork, problem-solving, and critical thinking) [12] where elements of writing sophistication (e.g., spelling, grammar, structure, and organization) were considered construct-irrelevant. This study, then, serves a secondary purpose of investigating the feasibility of constructing automated scoring systems that measure distinctly different constructs from scoring systems employed for measuring writing or language proficiency.

2 Methods

2.1 Source Data

Assessment

We collected data using a 30-item open-response SJT designed to assess and provide feedback on students' personal and professional skills. This instrument, intended primarily for low-stakes formative (as opposed to summative) usage within a program of study, assessed students' skills along four dimensions [8]:

1. Intrapersonal skills: Understanding and regulating oneself by recognizing emotions, biases, and behaviors, staying motivated, adapting to challenges, and committing to lifelong learning for personal growth.
2. Interpersonal skills: Building meaningful connections and working effectively with others by communicating clearly, showing empathy, collaborating toward shared goals, and inspiring others towards positive outcomes.
3. Social and Ethical Responsibility: Recognizing and respecting people's differences, upholding ethical principles, and contributing to the well-being of society through responsible actions.
4. Critical Thinking and Problem Solving: Gathering information, evaluating options, and finding effective solutions to problems while efficiently managing resources and risks.
For each item, students were presented with a hypothetical scenario and asked how they would respond given the situation. Scenarios were presented in one of two formats: text-based scenarios included a short description of a situation, while video-based scenarios were depicted using AI-generated avatars enacting a situation. Below is a summary of one text-based scenario students were shown and an accompanying item (question):

Scenario: Kendra and Alex, two close friends co-leading a major group project, are clashing over Kendra pushing a strict schedule to meet deadlines. Alex wants more group input on task division and worries that Kendra's approach could undermine team trust and strain their friendship.

Item: How could the co-leads, Alex and Kendra, have collaborated more effectively with the group to create a clear and realistic project plan?

This item was meant to assess interpersonal skills and, more specifically, collaboration. To receive a high score on this item, students were expected to demonstrate behaviors consistent with collaboration in their responses, such as suggesting approaches to managing differences and resolving conflict.

Automated Scoring System

We developed an automated scoring system for this assessment that used a dual-architecture LLM-as-a-Judge feature extraction component together with clear-box regression algorithms, similar to those described in Refs. [1, 17]. Aligning with the constructs assessed by the instrument, we designed the scoring system to reward higher-level features related to personal and professional skills (such as the collaboration-related behavior noted above) rather than language proficiency or content mastery. We constructed this system using 26,571 assessment responses (spread across all 30 items) from 910 students representing six schools. These schools comprised a mixture of business, engineering, health sciences, and medicine programs. All responses were first evaluated by trained human raters using a 1-5 Likert rating scale; these ratings were used to train and evaluate the automated scoring system, which achieved performance comparable to the human raters.

2.2 Data Selection

For this study, we selected a subset of 545 responses (spread across all 30 items) from 318 students. We chose to use a subset of responses rather than the full dataset described above to minimize LLM API costs, given that we would be executing a number of experiments that involved re-scoring responses. Through simulations with varying sample sizes, we identified that a set of at least 500 responses would allow us to reliably calculate paired Cohen's d effect sizes with our desired precision (width of 95% confidence interval < 0.2); we developed our sampling strategy with this heuristic in mind. To ensure all assessment items, as well as responses of varying quality, were represented in the sampled dataset, we selected responses from our larger dataset by stratifying by item and predicted score: we binned predicted scores using 10 equal-width bins between the available scoring range of 1-5, then selected up to two responses at random from each scoring bin for each item. This sampling strategy ensured that we selected a diversity of low- and high-quality responses from all 30 assessment items.
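The sketch below illustrates this stratified sampling strategy. It is a minimal reconstruction of the procedure described above, not the code used in our pipeline; it assumes the responses and their predicted scores sit in a pandas DataFrame, and the column names item_id and predicted_score are illustrative.

import numpy as np
import pandas as pd

def stratified_sample(responses: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Select up to two responses per (item, score bin), using 10 equal-width
    bins spanning the 1-5 scoring range (Sec. 2.2). Column names are assumptions."""
    binned = responses.copy()
    binned["score_bin"] = pd.cut(
        binned["predicted_score"],
        bins=np.linspace(1.0, 5.0, 11),  # 10 equal-width bins over [1, 5]
        include_lowest=True,
    )
    sampled = (
        binned.groupby(["item_id", "score_bin"], observed=True, group_keys=False)
        .apply(lambda g: g.sample(n=min(2, len(g)), random_state=seed))
    )
    # At most 30 items x 10 bins x 2 responses = 600; bins with no predicted
    # scores for an item simply contribute nothing.
    return sampled.reset_index(drop=True)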
Our automated scoring system did not produce scores in certain scoring bins for certain items (e.g., the model may not have predicted any scores in the (1.4, 1.8] bin for an item), hence our sampled dataset included fewer than the total of 600 responses that would have been expected had all scoring bins been attainable for all 30 items. Table 1 includes a breakdown of the number of responses selected for each scoring bin. The mean predicted score for sampled responses was 3.11 with a standard deviation of 1.17.

Table 1. Number of responses selected from each scoring bin. Our automated scoring system did not produce scores in certain scoring bins for certain items, hence not all scoring bins included the maximum 60 responses.

Scoring Bin   N
[1, 1.4]      60
(1.4, 1.8]    42
(1.8, 2.2]    48
(2.2, 2.6]    50
(2.6, 3.0]    55
(3.0, 3.4]    56
(3.4, 3.8]    55
(3.8, 4.2]    60
(4.2, 4.6]    60
(4.6, 5.0]    59

Students consented to the use of their assessment and survey data for research purposes by accepting the Terms and Conditions before starting the assessment; students were informed that they may withdraw their consent at any time. Only students who consented to their data being used for research purposes were included in this study. All students who completed the assessment could also complete an optional demographic information survey at the end of the test. Table 2 provides a breakdown of the self-identified demographic characteristics for the students whose responses we selected for this study.

Table 2. Number of students included in this study from various demographic subgroups. To protect individual privacy, any subgroups with fewer than 5 individuals were included in 'Other' subcategories.

Demographic Characteristic                         N     Fraction
English Proficiency
  Other                                            5     1.6%
  Good                                             14    4.4%
  Advanced                                         44    13.8%
  Native/Functionally Native                       255   80.2%
Gender
  Other                                            1     0.3%
  Woman                                            149   46.9%
  Man                                              168   52.8%
Race
  Other                                            4     1.3%
  Southeast Asian                                  12    3.8%
  Middle Eastern or Northern African               13    4.1%
  East Asian                                       20    6.3%
  Black, African, Caribbean, or African American   29    9.1%
  South Asian                                      31    9.7%
  Hispanic, Latinx, or Spanish origin              46    14.5%
  White or European                                163   51.3%

2.3 Experiments

We conducted three experiments investigating the influence of construct-irrelevant factors on scores predicted by our automated scoring system:

1. Adding meaningless text that does not add to the content of the response
2. Varying writing sophistication in terms of spelling errors, vocabulary, and sentence structure
3. Generating off-topic responses that are not aligned with the item

We describe each experiment in greater detail below. In each case, we produced new datasets of 545 responses, with each generated response an altered version of one of the original responses from our base dataset. We re-scored the altered responses using our automated scoring system, then computed statistics over the newly computed scores, including paired Cohen's d effect size estimates using the original predicted scores as a baseline for the paired comparisons.
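For concreteness, one way such paired effect sizes could be computed is sketched below. The paper does not state the exact estimator or confidence-interval method, so the d_z formulation (mean of the paired score differences divided by the standard deviation of those differences) and the percentile bootstrap interval are assumptions.

import numpy as np

def paired_cohens_d(baseline, altered, n_boot=10_000, seed=0):
    """Paired Cohen's d with a percentile bootstrap 95% CI.
    d_z = mean(altered - baseline) / sd(altered - baseline); this is one
    common paired formulation, assumed here rather than taken from the paper."""
    baseline = np.asarray(baseline, dtype=float)
    altered = np.asarray(altered, dtype=float)
    diff = altered - baseline
    d = diff.mean() / diff.std(ddof=1)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))  # resample pairs
    boot = diff[idx]
    boot_d = boot.mean(axis=1) / boot.std(axis=1, ddof=1)
    lo, hi = np.percentile(boot_d, [2.5, 97.5])
    return d, (lo, hi)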
Experiment 1: Meaningless Text

We tested the impact of appending four types of meaningless text to the original response:

A. A copy of the original response
B. A sentence stating what competency was assessed by the item (e.g., "This question is designed to assess collaboration.")
C. A sentence rephrasing what happened in the scenario. See the example in Sec. 2.1.
D. A formulaic sentence: "I would approach the situation in a respectful, non-confrontational, and non-judgmental manner."

While sub-experiments B and C introduced no meaningful additional content to the original responses, sub-experiment A was meant to explore the particularly adversarial condition where the same text is repeated, inflating response length without any additional content. Ravindran & Choi previously identified that this exact alteration led to an average change in predicted scores of 0.93 (on a 1-6 scale) using a transformer-based scoring system [15], so we wanted to explore whether this result translated to LLM-based scoring systems. Sub-experiment D does introduce meaningful text, but is emblematic of the kind of memorized, formulaic phrases often observed in this and other open-response SJTs [9].

Experiment 2: Writing Sophistication

We tested two types of alterations to the original responses affecting the overall written quality and sophistication of the responses:

A. Introducing spelling errors
B. Adjusting the reading level needed to understand the response

For sub-experiment A, we introduced random character-level edits to responses with fixed probabilities ranging from 5-50% in steps of five percentage points (a sketch of this procedure is given at the end of this subsection). We used a fixed distribution of edit types, with 40% of edits being substitutions, 30% being deletions, and 30% being insertions. For substitutions and insertions, inserted characters were selected from all upper- and lowercase English characters with equal weighting. This procedure produced a "worst-case" result as there was no predictable pattern in the generated errors, unlike those typically seen in human writing [6]; the results of this experiment should therefore be interpreted as a lower bound on the robustness of the scoring system to spelling errors. Previous studies of LLM accuracy for more general tasks found disparities in the robustness of LLMs to these types of errors along the lines of model capability, with more advanced models generally showing superior robustness to spelling errors in a prompt [4]. The scoring system we investigated here used GPT-5.X models, so we hypothesized that it would be similarly robust to spelling errors.

For sub-experiment B, we used an LLM (GPT-5 mini) to rephrase each response at a higher or lower grade reading level while maintaining the original meaning of the response. We measured the reading level of the original and altered responses using the Flesch-Kincaid Reading Grade Level (RGL) [16]. This index considers texts with more syllables per word and more words per sentence as more difficult to read and requiring more education to understand. While an imperfect measure of writing sophistication, this index provides a useful heuristic for measuring low-level differences in text structures, allowing us to investigate whether and to what extent more complex words and sentences influence predicted scores.

Features of writing quality, such as these, are construct-relevant for traditional writing and language proficiency assessments (see, for example, Refs. [1, 13]), but not SJTs. This experiment is particularly important, then, for gauging whether our automated scoring system is effectively evaluating the intended constructs and not constructs related to writing proficiency.
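Below is a minimal sketch of the character-level corruption procedure from sub-experiment A, reconstructed from the description above (a per-character edit probability with a fixed 40/30/30 split of substitutions, deletions, and insertions drawn from English letters). Details such as whether an inserted character goes before or after the original character are assumptions.

import random
import string

LETTERS = string.ascii_letters  # upper- and lowercase English characters

def corrupt(text: str, cer: float, seed=None) -> str:
    """Apply a random edit to each character with probability `cer`.
    Edit types follow the 40/30/30 substitution/deletion/insertion split
    described for sub-experiment A; this is a sketch, not the study's code."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if rng.random() >= cer:
            out.append(ch)
            continue
        edit = rng.choices(["sub", "del", "ins"], weights=[0.4, 0.3, 0.3])[0]
        if edit == "sub":
            out.append(rng.choice(LETTERS))
        elif edit == "ins":
            out.append(rng.choice(LETTERS))  # insert before the original character
            out.append(ch)
        # "del": drop the character entirely

    return "".join(out)

Calling corrupt(sentence, 0.30) produces text comparable to the 30% CER illustration shown later in Sec. 3.2.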
Experiment 3: Off-Topic Responses

In this experiment we investigated how off-topic responses (i.e., responses that did not address the scenario and question presented to the student) were interpreted by our automated scoring system. To do this we created random permutations of item-response combinations from our original dataset such that each response was still represented exactly once and was not matched with its original item. We conducted two versions of this experiment:

A. Responses were matched with an item assessing a different competency than the original item
B. Responses were matched with an item assessing the same competency as the original item

Whereas for items assessing the same competency (e.g., collaboration), our automated scoring system was designed to evaluate the same underlying features (though potentially weight them differently in scoring), there was no overlap between features evaluated in items assessing different competencies. We therefore hypothesized that off-topic responses that still exhibited relevant features expected of high-quality responses for the competency would receive higher scores than off-topic responses that exhibited features unrelated to the competency.
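The re-pairing of responses and items is described only at a high level above; the sketch below shows one way such a constrained shuffle could be implemented. The tuple layout, the repair-by-swapping strategy, and the assumption that a valid re-pairing always exists (reasonable for a 30-item test spread over four competencies) are illustrative choices, not the authors' implementation.

import random

def permute_off_topic(pairs, same_competency: bool, seed: int = 0):
    """Re-pair responses with items so each response appears exactly once but
    never with its original item (Experiment 3). `pairs` is a list of
    (response_id, item_id, competency) tuples; names are illustrative.
    same_competency=True gives version B, False gives version A."""
    rng = random.Random(seed)

    def ok(original, assigned):
        if assigned[1] == original[1]:          # never the original item
            return False
        if same_competency:
            return assigned[2] == original[2]   # version B: same competency
        return assigned[2] != original[2]       # version A: different competency

    assigned = pairs[:]                         # the multiset of items to hand out
    rng.shuffle(assigned)
    # Repair any position that violates the constraint by swapping with another
    # position where both ends of the swap remain valid; earlier fixes are preserved.
    for i in range(len(pairs)):
        while not ok(pairs[i], assigned[i]):
            j = rng.randrange(len(pairs))
            if ok(pairs[i], assigned[j]) and ok(pairs[j], assigned[i]):
                assigned[i], assigned[j] = assigned[j], assigned[i]
    return [(pairs[i][0], assigned[i][1]) for i in range(len(pairs))]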
3 Results and Discussion

3.1 Experiment 1: Meaningless Text

Table 3 shows the mean word count of the responses evaluated in each sub-experiment as well as the mean and standard deviation (SD) of scores assigned by our automated scoring system. We also include the baseline (unaltered) dataset for comparison and the paired Cohen's d effect size estimate between scores assigned to the baseline responses and responses in each sub-experiment. Contrary to the findings by Ravindran & Choi [15], we find that simply duplicating a response generally decreases the score assigned by our automated scoring system by a small amount (Cohen's d = -0.24). On the other hand, adding other kinds of meaningless text, like a statement of the competency tested by the item (sub-experiment B) or a rephrasing of the scenario (sub-experiment C), has negligible impact on scores (|d| ≤ 0.01). We did find, however, that adding certain formulaic phrases (sub-experiment D) could positively bias our scoring system to assign higher scores, though (at least in the case examined here) those effects are very small (d = 0.16).

Table 3. Summary of responses evaluated in our baseline dataset and each sub-experiment in Experiment 1, including mean word count of the responses evaluated, the mean and standard deviation (SD) of scores assigned by our automated scoring system, and paired Cohen's d effect size estimates between scores assigned in each sub-experiment and the baseline dataset.

Sub-Experiment   Mean Word Count   Mean Score (SD)   Cohen's d (95% CI)
Baseline         54.0              3.11 (1.17)       —
A                107.4             2.82 (1.12)       -0.24 (-0.33, -0.16)
B                61.6              3.09 (1.15)       -0.01 (-0.10, 0.07)
C                95.7              3.12 (1.14)       0.01 (-0.08, 0.09)
D                66.0              3.28 (1.02)       0.16 (0.08, 0.24)

These results indicate that increasing text verbosity without meaningfully adding to the content of the text is unlikely to positively influence scores assigned by our automated scoring system. By adding a statement about the competency tested by the item or a rephrasing of the scenario presented in the item, we were able to increase mean response length by 14% and 77%, respectively, with effectively zero change in the mean scores assigned. Adding text that detracted from the clarity of the overall response by duplicating the original response even led to negative overall shifts in scores assigned, again pointing to the importance of precise responses over those that are extraneously verbose. Adding text to responses can positively influence scores assigned by our scoring system, but only when that text contains meaningful and new information, as we found in sub-experiment D. The results of that sub-experiment indicate that adding memorized, formulaic phrases may only have very small effects on assigned scores, though more in-depth investigation of the use of such phrases, on their own and in combination, would be needed to understand the full scope of the effects.

3.2 Experiment 2: Writing Sophistication

Spelling Errors

Table 4 shows the mean score assigned to responses in each simulation with different character error rates (CERs). We also include the baseline (unaltered) dataset for comparison and the paired Cohen's d effect size estimate between scores assigned to the baseline responses and responses in each simulation. We find very small or negligible effects of spelling errors up to a CER of about 30%, with small observable effects showing up at a CER of 35%. Large or very large effects appear at CERs of 40% and beyond as the texts begin to become indistinguishable from random strings. Fig. 1 illustrates the average score predicted by our automated scoring system as a function of CER as well, where this drop-off in mean predicted scores is more clearly observable.

Table 4. Summary of simulations introducing spelling errors into responses. The character error rate (CER) denotes the probability of any character in a response being edited (substitution, deletion, insertion). Compared to our baseline dataset with no artificial spelling errors introduced, there are very small or negligible differences in scores assigned by our automated scoring system, on average, up to a CER of about 30%.

Character Error Rate   Mean Predicted Score (SD)   Cohen's d (95% CI)
Baseline               3.11 (1.17)                 —
5%                     3.06 (1.10)                 -0.04 (-0.12, 0.05)
10%                    3.21 (1.13)                 0.09 (0.01, 0.18)
15%                    3.14 (1.15)                 0.03 (-0.05, 0.12)
20%                    3.08 (1.20)                 -0.02 (-0.10, 0.06)
25%                    3.00 (1.21)                 -0.09 (-0.18, -0.01)
30%                    2.98 (1.18)                 -0.11 (-0.19, -0.02)
35%                    2.80 (1.24)                 -0.25 (-0.34, -0.17)
40%                    2.24 (1.29)                 -0.70 (-0.79, -0.60)
45%                    1.98 (1.15)                 -0.97 (-1.07, -0.87)
50%                    1.70 (1.01)                 -1.28 (-1.40, -1.17)

Fig. 1. Mean score predicted by our automated scoring system for simulations of different character error rates (CER) in responses. The effects of spelling errors on the scoring system only become meaningful at CERs of 35% and above. Error bars represent the 95% confidence interval for the mean predicted score. The shaded region represents the 95% confidence interval on the mean score assigned to responses in the baseline dataset.

These results indicate that our scoring system is robust to spelling errors. To illustrate, simulating a 30% CER in the previous sentence produces the following result: "These esflts indicap e thst ofur ecsr ind sytv a ils rnobstOo sp eloiV g errgprs." This altered sentence may be barely legible for a human reader, but the LLMs used as part of our scoring system appear to be able to interpret such responses sufficiently with regard to the provided instructions. This is a desired behavior of our assessment and automated scoring system as it allows students to not focus on spelling while completing the assessment.
Reading Level

Table 5 provides a summary of six simulation studies adjusting the RGL of the original responses to varying degrees. While the word count remains relatively stable across simulations, the average number of words per sentence increases progressively over the simulations as we alter the responses to include fewer, but longer and more complex, sentences. We also observe that the average number of syllables per word increases steadily over the simulations as we replace simpler words with more complex ones. Combined, these alterations produce responses with higher RGLs, on average.

Table 5. Summary of simulations adjusting original responses to reflect different Flesch-Kincaid Reading Grade Levels (RGLs), which account for the average number of words per sentence and syllables per word in a text. Texts with higher RGLs can be interpreted as requiring more education to understand. All columns except d report mean values over the responses in each simulation; the row marked with an asterisk is the baseline dataset, where responses were not altered. We find very small or negligible differences in scores assigned to sets of responses with mean RGLs spanning up to six points.

RGL     N(words)   N(sentences)   Words/sentence   Syllables/word   Score (SD)    d (95% CI)
3.2     42.3       4.2            10.4             1.2              2.89 (1.13)   -0.19 (-0.27, -0.10)
4.7     47.1       3.7            12.9             1.3              3.00 (1.14)   -0.09 (-0.17, 0.00)
6.5     48.9       3.2            15.6             1.3              3.05 (1.14)   -0.05 (-0.14, 0.03)
7.6     48.2       2.9            16.8             1.4              3.10 (1.16)   -0.01 (-0.09, 0.08)
8.6     47.7       2.7            17.8             1.5              3.08 (1.14)   -0.02 (-0.11, 0.06)
10.5*   54.0       2.6            21.7             1.5              3.11 (1.17)   —
13.2    48.9       2.4            20.9             1.8              3.22 (1.15)   0.10 (0.02, 0.19)

We find very small or negligible differences in predicted scores for sets of responses with mean RGLs up to roughly six points apart. For instance, responses in our baseline dataset had a mean RGL of 10.5, while in one simulation we produced altered responses with a mean RGL of 4.7; the paired Cohen's d effect size between scores assigned to these two sets of responses was 0.09. In two other simulations we produced sets of responses with mean RGLs of 7.6 and 13.2; the paired Cohen's d effect size between scores assigned to these two sets of responses was 0.11.
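For reference, the RGL values reported here can be approximated with the standard Flesch-Kincaid grade-level formula, sketched below. The simple vowel-group syllable counter is an assumption for illustration, not the exact tooling used in this study, so computed values may differ slightly from those in Table 5.

import re

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Reading Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59.
    Sentence and syllable counting below are rough heuristics."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    # Count contiguous vowel groups (including y) as syllables, at least one per word.
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return 0.39 * (n_words / sentences) + 11.8 * (syllables / n_words) - 15.59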
We can illustrate how differences in RGLs manifest in responses with an example; below is a sample response to the item from Sec. 2.1 with an RGL of 11.7:

"Alex and Kendra could have combined structure with input by first outlining a rough timeline and then holding a short group meeting to adjust it together. This would keep the project moving while still allowing members to feel involved. Kendra's focus on deadlines and Alex's concern about trust are both valid, and acknowledging that openly could have helped balance efficiency with collaboration."

In one simulation, we edited this response to an RGL of 2.5:

"Alex and Kendra could make a plan and ask the group to help. First they can make a simple plan with dates. Then they can have a short team meeting to change the plan together. This keeps the work moving and lets people help. Kendra cares about finish dates. Alex worries about trust. Both are right. Saying this out loud can help them work fast and work together."

The edited response is slightly longer (68 words compared to 62 in the original), contains simpler sentences, and does away with more complex words (e.g., "outlining", "allowing", "acknowledging", "efficiency", "collaboration"). Despite these changes, the edited response still maintains the tone and content of the original response; our automated scoring system assigned a score of 4.6 to both responses.

These results indicate that low-level writing features like sentence structure and vocabulary are largely inconsequential to our automated scoring system. Taken together with the observed robustness of our scoring system to spelling errors noted above, we have strong evidence that writing quality and sophistication are generally ignored by our system. Given that automated scoring systems to date have predominantly been designed to measure exactly these kinds of features, this result affirms that our system is indeed capturing something different, distinguishing it from other scoring systems measuring writing or language proficiency.

3.3 Experiment 3: Off-Topic Responses

We find that off-topic responses are appropriately penalized by our scoring system. Responses matched with a different, random item assessing a different competency (e.g., a response to an item targeting collaboration was matched with a different random item not targeting collaboration) received an average score of 1.69, compared to the same responses when correctly matched with their original items, which received an average score of 3.11 (Cohen's d = -1.33). Responses matched with a different, random item assessing the same competency (e.g., a response to an item targeting collaboration was matched with a different random item also targeting collaboration) received scores of 2.38, on average (Cohen's d = -0.62 compared to scores assigned to responses correctly matched with their original items).

Our scoring system also returned the minimum possible score of 1.0 more often when the response was not aligned with the item. In our baseline dataset, our scoring system assigned 20 scores of 1.0 (3.7%); this number rises to 72 (13.2%) when responses are matched with different items assessing the same competency and 191 (35.0%) when matched with different items assessing a different competency.

These results provide two main insights. First, an otherwise strong response that is misaligned with the task is considered unfavorably by our scoring system and is likely to result in the lowest attainable score. Second, our scoring system appropriately considers different competencies in responses for different items. If our scoring system rewarded similar features of responses across different items, we would have expected to see similar results in both versions of the experiment conducted here.
That we did not points to our scoring system working as expected: it rewards certain features of responses for certain items and other features for other items.

4 Conclusion

This study investigated the susceptibility of an LLM-based automated scoring system to construct-irrelevant factors. Our results indicated that the scoring system investigated was robust to padding response length with meaningless text, either in the form of stating what competency was assessed by an item or rephrasing the scenario prompt provided. We also found that duplicating text generally had deleterious effects on scores predicted by the scoring system, in contrast to results from previous studies of non-LLM-based scoring systems [14, 15] where duplicated text inflated predicted scores. Given these results, LLM-based scoring solutions may offer an avenue for overcoming limitations of existing systems that over-index on text length.

We also found that the investigated scoring system was relatively robust to writing sophistication. Spelling errors only began to noticeably affect scores predicted by the system at character error rates greater than 30%, a result aligned with previous studies of the robustness of automatic content scoring (not writing or language proficiency) [6]. Vocabulary and response structure, similarly, had very little or no impact on predicted scores; responses employing shorter and simpler words and sentences generally received comparable scores to those using more complex and diverse word and sentence structures. In contrast, previous studies which made use of human raters found large differences in scoring for texts with and without spelling errors [2]. AI-based scoring systems may, then, offer advantages over human rating in ignoring spelling errors and writing sophistication, a particularly important conclusion for SJTs and other assessments of complex constructs where writing sophistication is typically construct-irrelevant.

Lastly, we found that off-topic responses (i.e., responses that did not relate to the content of the question asked) generally received large penalties from the scoring system and were frequently assigned the lowest possible score. Though not the main purpose of this study, we also found support for the claim that the scoring system investigated here considers different competency-related features for different items: the kinds of responses that would receive a high score for one item will not necessarily receive a high score for a different item where different features are evaluated.

4.1 Limitations and Future Work

This study only investigated one LLM-based scoring system for one assessment. We suggest extending this work to other LLM-based scoring systems, particularly those that use LLMs in different capacities. In particular, the system investigated here used a dual-architecture system employing LLMs for feature extraction and maintaining traditional regression techniques for feature weighting. We found, for instance, that this scoring system was robust to the evaluation of responses with spelling errors, a result that aligned with previous research [4, 6]. Those same studies provide evidence, however, that our conclusions may not extend to LLM-based systems employing less capable models (we used GPT-5.X models here) [4].
Extending this research to other systems can further our understanding of the advantages and limitations of various LLM-based scoring techniques. We also suggest that all developers of automated scoring systems (not just those that are LLM-based) adopt a similar approach to the one demonstrated here to investigate and categorize the influence of construct-irrelevant factors on automated scoring systems, to build public trust in those systems.

In this study, we did note one particular construct-irrelevant factor that could influence the behavior of our automated scoring system: using memorized, formulaic phrases. We only investigated one such phrase here and found only a very small influence on predicted scores. We noted previously that these kinds of phrases on their own may contain some construct-relevant information, but when used frequently and superficially may be indicative of a test taker acting in bad faith. Future work should more deeply investigate the use of formulaic phrases for our assessment and scoring system to understand the extent to which these phrases can influence the predictions of the scoring system.

Even when automated scoring systems can be demonstrated to be robust to certain construct-irrelevant factors, there is still value in developing systems to flag "unusual" responses [5, 18]. While our scoring system frequently assigned the lowest available score to off-topic responses, for instance, it did not assign all such responses the lowest available score. So while it would generally not be advantageous to a test taker to provide an off-topic response, we might wish to specifically identify such responses either to ensure they are assigned a specific score or to provide additional context for decision-makers.

Acknowledgments. We would like to acknowledge Colleen Robb, Gill Sitarenios, and Josh Moskowitz for feedback on this paper.

Disclosure of Interests. Both authors are employees of Acuity Insights, which administers the assessment used for data collection in this study.

References

1. Bruno, J.V., Becker, L.: Explainable writing scores via fine-grained, LLM-generated features. In: Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Works in Progress (2025)
2. Choi, I., Cho, Y.: The impact of spelling errors on trained raters' scoring decisions. Language Education & Assessment 1(2), 45–58 (2018)
3. Farquhar, S., Kossen, J., Kuhn, L., Gal, Y.: Detecting hallucinations in large language models using semantic entropy. Nature 630(8017), 625–630 (2024)
4. Gan, E., Zhao, Y., Cheng, L., Yancan, M., Goyal, A., Kawaguchi, K., Kan, M.Y., Shieh, M.: Reasoning robustness of LLMs to adversarial typographical errors. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 10449–10459 (2024)
5. Higgins, D., Burstein, J., Attali, Y.: Identifying off-topic student essays without topic-specific training data. Natural Language Engineering 12(2), 145–159 (2006)
6. Horbach, A., Ding, Y., Zesch, T.: The influence of spelling errors on content scoring performance. In: Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017), pp. 45–53 (2017)
7. Huang, Y., Wilson, J.: Evaluating LLM-based automated essay scoring: Accuracy, fairness, and validity. In: Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Works in Progress (2025)
8. Iqbal, M.Z., Ivan, R., Bynkoski, K., Archbell, K., Richard, C., Petersen, K.H.: Formative SJT: Developing student personal and professional competencies. The Score (2025)
9. Iqbal, M.Z., Ivan, R., Robb, C., Derby, J.: Evaluating factors that impact scoring an open response situational judgment test: a mixed methods approach. Frontiers in Medicine 11, 1525156 (2025)
10. Klebanov, B.B., Madnani, N.: Automated evaluation of writing – 50 years and counting. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7796–7810 (2020)
11. Lee, G.G., Latif, E., Wu, X., Liu, N., Zhai, X.: Applying large language models and chain-of-thought for automatic scoring. Computers and Education: Artificial Intelligence 6, 100213 (2024)
12. McDaniel, M.A., Hartman, N.S., Whetzel, D.L., Grubb III, W.L.: Situational judgment tests, response instructions, and validity: A meta-analysis. Personnel Psychology 60(1), 63–91 (2007)
13. Naismith, B., Cardwell, R., LaFlair, G.T., Nydick, S., Kostromitina, M.: Duolingo English Test: technical manual. Tech. rep., Duolingo, Inc. (2025)
14. Powers, D.E., Burstein, J.C., Chodorow, M., Fowles, M.E., Kukich, K.: Stumping e-rater: challenging the validity of automated essay scoring. Computers in Human Behavior 18(2), 103–134 (2002)
15. Ravindran, R., Choi, I.: Investigating adversarial robustness in LLM-based AES. In: Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Coordinated Session Papers (2025)
16. Thomas, G., Hartley, R.D., Kincaid, J.P.: Test-retest and inter-analyst reliability of the automated readability index, Flesch reading ease score, and the fog count. Journal of Reading Behavior 7(2), 149–154 (1975)
17. Walsh, C., Ivan, R., Iqbal, M.Z., Robb, C.: Using LLMs to identify features of personal and professional skills in an open-response situational judgment test. In: Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers (2025)
18. Zhang, M., Chen, J., Ruan, C.: Evaluating the advisory flags and machine scoring difficulty in the e-rater® automated scoring engine. ETS Research Report Series 2016(2), 1–14 (2016)
