Characteristics of hand and machine-assigned scores to college students' answers to open-ended tasks


Stephen P. Klein
GANSK & Associates

IMS Collections, Probability and Statistics: Essays in Honor of David A. Freedman, Vol. 2 (2008) 76–89. © Institute of Mathematical Statistics, 2008. DOI: 10.1214/193940307000000392

Abstract: Assessment of learning in higher education is a critical concern to policy makers, educators, parents, and students. And, doing so appropriately is likely to require including constructed response tests in the assessment system. We examined whether scoring costs and other concerns with using open-ended measures on a large scale (e.g., turnaround time and inter-reader consistency) could be addressed by machine grading the answers. Analyses with 1359 students from 14 colleges found that two human readers agreed highly with each other in the scores they assigned to the answers to three types of open-ended questions. These reader-assigned scores also agreed highly with those assigned by a computer. The correlations of the machine-assigned scores with SAT scores, college grades, and other measures were comparable to the correlations of these variables with the hand-assigned scores. Machine scoring did not widen differences in mean scores between racial/ethnic or gender groups. Our findings demonstrated that machine scoring can facilitate the use of open-ended questions in large-scale testing programs by providing a fast, accurate, and economical way to grade responses.

AMS 2000 subject classification: 62P99.
Keywords and phrases: constructed response, hand scoring, machine scoring essay answers, open-ended tasks, reasoning tasks.

Acknowledgments: Dr. Laura Hamilton from RAND, Professor Richard Shavelson from Stanford University, and Professor George Kuh from Indiana University provided many helpful suggestions on earlier drafts of this chapter. Dr. Roger Bolus, a consultant to the Council for Aid to Education, ran the statistical analyses presented in this chapter.

Author's address: GANSK & Associates, 120 Ocean Park Blvd., #609, Santa Monica, CA 90405, USA; e-mail: steve@gansk.com

Until the turn of the 21st century, most large-scale K-12 and college testing programs relied almost exclusively on multiple-choice tests. There are several understandable reasons for this. It takes much longer to score the answers to essay and other "constructed response" (also referred to below as "open-ended" or "free-response") questions than it does to have a machine scan multiple-choice answer sheets. Thus, hand scoring of essay answers tends to increase the time required to report results. There also are concerns about subjectivity in grading because human readers do not always agree with each other (or even with themselves over time) in the score they assign to an answer. Scoring costs and logistical problems (such as arranging for readers) are much greater with open-ended tests than they are with multiple-choice exams. In addition, score reliability per hour of testing time is generally greater with multiple-choice tests than it is with open-ended ones (Wainer and Thissen [19] and Klein and Bolus [5]). Nevertheless, there are important skills that can only be assessed (or assessed well) with open-ended measures. This is especially so in higher education. Consequently, college and graduate school admissions tests as well as licensing exams for teachers and other professionals are now likely to contain constructed response questions.
Many educators and practitioners prefer open-ended questions, whereas those who administer large-scale testing programs are responsible for reducing costs and the time needed to report results. The tension between these competing interests stimulated the search for effective ways to economically and quickly score free response answers. Efforts to employ machine scoring for this purpose began about 40 years ago (Daigon [2] and Page [13]). Significant advances in this technology (particularly in computational linguistics) started to appear in the literature in the mid-1990s (e.g., Burstein et al. [1]). Machine scoring of essay answers in operational programs began a few years later (see Kukich [9] for a review of this history).

At least three private companies – Educational Testing Service (ETS), Knowledge Analysis Technologies (KAT), and Vantage Learning – provide machine grading services for a wide range of clients, such as the Army's Officer Training and Doctrine Command, industry training courses, statewide K-12 testing programs, and the National Board of Medical Examiners (Swygert et al. [18]). For example, machines are used to grade about 350,000 answers a year to the Analytic Writing Assessment section of the Graduate Management Admission Test (GMAT). Together, the three companies listed above score well over a million free response answers per year. The computer-based methods these systems use are complex, statistically sophisticated, and proprietary. And, the results with them have been very encouraging. For example, Landauer et al. [10] found a 0.86 correlation between two independent readers on over 2,000 essays across 15 diverse substantive topics from different grade levels. The correlation between the scores assigned by a single reader to these answers and those generated by KAT's latent semantic analysis-based Intelligent Essay Assessor engine was 0.85. Similarly, Powers et al. [14] reported that in a sample of about 1800 essay answers to 40 GRE writing prompts that were graded on a six-point scale, pairs of human readers agreed either exactly or within one point of each other 99 percent of the time in the score they assigned. The corresponding agreement rate between the hand scores and those assigned by ETS's "e-rater" machine grading system was 93 percent. In short, it appears that at least in some applications, machine-assigned grades can come very close to those assigned by hand.
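To make the agreement statistics above concrete, the sketch below (an illustration of ours, not code from the studies cited; the scores are invented) computes exact and within-one-point agreement rates for a pair of readers grading on a six-point scale.

```python
def agreement_rates(scores_a, scores_b):
    """Return (exact, within_one_point) agreement proportions for paired scores."""
    pairs = list(zip(scores_a, scores_b))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    within_one = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
    return exact, within_one

# Hypothetical six-point-scale grades from two readers on the same answers.
reader_1 = [4, 5, 3, 6, 2, 4, 5, 3]
reader_2 = [4, 4, 3, 5, 2, 4, 6, 3]
exact, adjacent = agreement_rates(reader_1, reader_2)
print(f"exact: {exact:.0%}, exact or adjacent: {adjacent:.0%}")
```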
If such encouraging results are replicable, then machine scoring of open-ended responses could be a cost-effective way to include constructed response questions in several types of large-scale testing programs. For example, universities as well as state and community college systems could use machine scoring for placement tests and for assessment programs that emphasize students demonstrating competencies rather than simply satisfying "seat-time" or credit-hour requirements. Machine scoring also is compatible with the needs of distance learning courses in that students can take the tests online and have their scores returned quickly. Some machine scoring algorithms also provide diagnostic data regarding different aspects of answer quality (such as content, style, and mechanics).

1. Purposes

It is within this broader assessment context that we examined the correspondence between the hand and machine scores that were assigned to the answers to three types of college-level open-ended tasks (the issue and argument prompts that are now used on the GREs and a 90-minute critical thinking performance task). These analyses investigated the questions below to explore whether machine-assigned scores can replace hand-assigned scores in certain types of large-scale higher education assessment programs. Our specific focus was on testing conducted for research and policy analysis purposes; i.e., as distinct from college admissions, licensing, or other high-stakes applications.

• Do the hand and machine-assigned scores agree with each other as much as two hand readers agree with each other (and is this agreement rate high enough to trust the scores)?
• Are agreement rates a function of task type? For example, is the degree of agreement between hand and machine-assigned scores on a 30- or 45-minute essay question as high as it is on a 90-minute performance task in which test takers use multiple reference documents?
• Do hand and machine-assigned scores to the answers on this performance task (and to those on more typical open-ended prompts) have similar correlations with other measures (such as SAT scores and college grades)?
• How are these relationships affected when the college (rather than the individual student) is used as the unit of analysis?
• Do machine-assigned scores to one task correlate higher with machine-assigned scores to another task than they do with the hand scores on that other task? In other words, is there a unique effect related to scoring method that generalizes across tasks? This would occur if some students tended to earn higher grades when their answers were hand rather than machine scored and the reverse was true for other students.
• Does machine scoring tend to widen or narrow differences in mean scores between gender or racial/ethnic groups?
• Under what conditions is machine scoring a cost-effective strategy?

2. Procedures

2.1. Sample characteristics

The 1359 students who participated in this research were drawn from 14 colleges and universities that varied in size, selectivity, geographic location, type of funding, and the diversity of their students' background characteristics. Forty-two percent of the students were males. The percentages of freshmen, sophomores, juniors, and seniors were 29, 25, 23, and 23, respectively. Whites, African Americans, Asians, and Hispanics comprised 71, 10, 6, and 3 percent of the students, respectively (2 percent belonged to other groups, 7 percent were multiethnic, and 1 percent did not respond). Students were recruited across a broad spectrum of academic majors and were paid $20 to $25 per hour for their time (the amount varied slightly as a function of local college practices).

2.2. Measures

Critical thinking tasks. We used four of the 90-minute "Tasks in Critical Thinking" that were developed by the New Jersey Department of Higher Education (Ewell [4] and Erwin and Sebrell [3]).
All of these tasks required working with various documents to answer 5 to 10 separately scored open-ended questions. Students had to interpret and critically evaluate the information in each document.

We also administered two new 90-minute critical thinking tasks that were developed for this study, but problems with one of them precluded using it in the analyses below. Figure 1 describes the other new task ("SportsCo"). The new tasks are similar to the New Jersey ones in that students work with various documents to prepare their answers, but unlike the New Jersey tasks, some of these materials are more germane than others to the issues that need to be addressed. Thus, students must decide how much to rely on them in preparing their answers (this corresponds to the "application of strategic knowledge" discussed by Shavelson and Huang [17]). The new tasks also differed from the New Jersey ones in that students wrote a single long answer (in the form of a memo that addressed several issues) rather than responding to a set of separate questions.

Fig 1. SportsCo task. Students are given a memo from a company officer that describes an accident in which a teenager suffered serious injury while wearing a pair of the company's high performance skates. The memo then asks the examinee to address several questions about this incident, such as the most likely reasons for it, the evidence that would support these hypotheses, and the validity of the claim that more than half of the serious skating accidents involve the company's skates. Students also receive a set of documents that they are advised to consider in preparing their memo. These materials include a newspaper account of the accident, information about the company and its market share, an accident report prepared by a custodian, a transcription of an interview with an expert, and a storyboard for the company's TV ad. The scoring rubric assigns points for recognizing the relationship between a company's share of the accidents and its market share and for identifying plausible reasons for the accident and the evidence to support these hypotheses. There also are overall scores for analytic reasoning and communication skills.

GRE analytical writing prompts. The Graduate Record Examination (GRE) now contains two types of essay questions, a 45-minute "issue" task and a 30-minute "argument" task (Powers et al. [14]). The 45-minute "issue" prompt presents students with a point of view about a topic of general interest and asks them to respond to it from any perspective(s) they wish. One of these prompts was: "In our time, specialists of all kinds are highly overrated. We need more generalists – people who can provide broad perspectives." Students are instructed to provide relevant reasons and examples to explain and justify their views. The 30-minute "argument" prompt presents an argument and asks test takers to critique it, discussing how well reasoned they found it, rather than simply agreeing or disagreeing with the author's position (see Figure 2 for an example). According to Powers et al. [14], the issue and argument "tasks are intended to complement one another. One requires test takers to construct their own arguments by making claims and providing evidence to support a position; the other requires the critique of someone else's argument" (p. 4).
Other measures. All the participants completed a survey that included questions about their demographic and background characteristics. They also gave us their permission to obtain their college grade point average (GPA) and SAT or ACT scores from the registrar at their college. All but one of the 14 colleges provided these scores.

Fig 2. Example of a 30-minute GRE "argument" prompt. "The University of Claria is generally considered one of the best universities in the world because of its instructors' reputation, which is based primarily on the extensive research and publishing record of certain faculty members. In addition, several faculty members are internationally renowned as leaders in their fields. For example, many of the faculty from the English department are regularly invited to teach at universities in other countries. Furthermore, two recent graduates of the physics department have gone on to become candidates for the Nobel Prize in Physics. And 75 percent of the students are able to find employment after graduating. Therefore, because of the reputation of its faculty, the University of Claria should be the obvious choice for anyone seeking a quality education."

2.3. Test administration

At eight of the 14 colleges, each student took two of the six 90-minute critical thinking tasks. Students were assigned randomly to pairs of these tasks within schools. All six tasks were administered at each school. At the other six colleges, students were assigned randomly to one of the six 90-minute critical thinking tasks. The students were then assigned randomly to two of the three GRE issue prompts and, in keeping with the GRE's administration procedures, they were instructed to select one to answer. After completing this prompt, a student was assigned randomly to one of the three argument prompts. Thus, examinees had some choice of issue prompts but not of argument prompts. All the critical thinking tasks and GRE prompts were administered at each of these six colleges (see Klein et al. [7] for a more detailed description of the matrix sampling plan).

The critical thinking tasks and GRE prompts were presented to students in hard copy. For the critical thinking tasks, students had the option of handwriting their answers directly in their test booklets or preparing them on a computer. For the GRE prompts, students had to prepare their answers on a computer. The tests were administered to students under standardized proctored conditions, usually in their college's computer lab.

2.4. Hand scoring SportsCo answers

Four graduate students in English from a nationally prominent university were trained to use an analytic scoring guide to evaluate the SportsCo answers. This guide contained 40 separate items (graded 0 or 1) and a 5-point overall communication score. For the latter score, the readers were told to consider whether the answer was well organized, whether it communicated clearly, whether arguments and conclusions were supported with specific reference to the documents provided, and whether the answer used appropriate vocabulary, language, and sentence structure.
Readers were instructed to ignore spelling. A student's total raw score was the sum of the 41 scores. Two of the four readers were picked at random to grade each answer. As a result of this process, every reader was paired about the same number of times with every other reader. Answers prepared on the computer were printed out so that readers evaluated hard copies of all the answers. The reader-assigned scores are hereinafter referred to as the "hand" raw scores so as to distinguish them from the machine-assigned scores.

Table 1 shows each reader's mean and standard deviation as well as each reader's mean correlation with the other readers. The overall average correlation between two readers was 0.85. The small differences in means between readers were not significant.

Table 1. Mean, standard deviation, and mean correlation by SportsCo reader

Reader | Mean raw score | Standard deviation | Mean correlation with other readers
1      | 11.69          | 4.80               | 0.86
2      | 11.29          | 3.86               | 0.83
3      | 11.35          | 3.88               | 0.85
4      | 12.23          | 4.34               | 0.85
Mean   | 11.60          | 4.15               | 0.85

2.5. Machine scoring SportsCo answers

We gave ETS a sample of 323 SportsCo answers and the individual item scores assigned to them by each reader. ETS used these data to construct its machine-grading algorithms. To build the e-rater algorithm for the communication score, ETS randomly divided the 323 students into three sets (labeled A, B, and C). Next, it used the answers and scores for the students in Sets A and B to build a model that was then applied to the answers in Set C, it used Sets A and C to build a model that was then applied to the answers in Set B, and it used Sets B and C to build a model that was then applied to the answers in Set A. The machine-assigned scores created by this process were therefore independent of the scores that were used to build the machine scoring models.
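This three-set rotation is a standard cross-fitting design. The sketch below is ours, not ETS's code: `fit_model` stands in for the proprietary e-rater model-building step, and the `.predict` interface is an assumption for illustration.

```python
import random

def cross_fit_scores(answers, hand_scores, fit_model, n_sets=3, seed=0):
    """Score every answer with a model trained only on the other sets."""
    indices = list(range(len(answers)))
    random.Random(seed).shuffle(indices)
    sets = [indices[i::n_sets] for i in range(n_sets)]  # Sets A, B, C
    machine = [None] * len(answers)
    for held_out, fold in enumerate(sets):
        # Build the model on the two sets that exclude this fold...
        train = [i for j, s in enumerate(sets) if j != held_out for i in s]
        model = fit_model([answers[i] for i in train],
                          [hand_scores[i] for i in train])
        # ...then apply it only to the held-out answers.
        for i in fold:
            machine[i] = model.predict(answers[i])
    return machine
```

Because each answer is scored by a model that never saw that answer's hand scores, the hand-machine agreement estimates are not inflated by overfitting.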
ETS's "e-rater" scoring engine "is designed to identify features in student essay writing that reflect characteristics that are specified in reader scoring guides" (Burstein et al. [1]). It develops a scoring algorithm that is based on the grades assigned by the human readers to a sample of answers and contains modules for identifying the following features that are relevant to the scoring guide criteria: syntax, discourse, topical content, and lexical complexity.

ETS's "c-rater" scoring system (Leacock and Chodorow [11]) was used to create scores for items 1 through 40. This engine is designed for content-laden short answer questions (rather than for evaluating the long memo examinees had to produce for this task). Thus, it had to locate the portion of the student's answer that was relevant to each item in the scoring rubric as well as grade it. The models that were relatively easy to build typically had high agreement between readers, a low "baseline" (i.e., relatively few students received credit for the item), and a well-defined rubric with two to three main ideas that receive credit. For example, on item #4, students received credit for recognizing that SportsCo had double (or "more") the sales of its chief competitor (the AXM Company). There were two generically correct responses to this item, namely: (1) SportsCo sold twice as many skates as AXM and (2) AXM sold half as many skates as SportsCo.

There are many ways in which a student can express these ideas. Hence, the model included several variations of each sentence. By utilizing synonyms and selecting the essential features, the model can identify a variety of paraphrases. For example, the first sentence matches the student who wrote, "The reason for this is most likely because we also sell more than twice as many skates than any other skating manufacturer," and the student who wrote, "We had twice as many injuries due to selling twice as many skates."

Low agreement between hand and c-rater scores can occur as a result of several factors. For example, item #35 gave students credit for saying "the warning was not strong enough." Instead of saying this, many students offered a solution, such as by describing how the warning should be revised. Although it was clear that the intent of these suggestions was to make the warning stronger, c-rater could not recognize this intent.

The training and cross-validation sets for the c-rater data were not created in the same manner as those for e-rater. For c-rater, the scored responses for each prompt were partitioned individually. About one-third of the responses were used for training (to be consulted while manually building the c-rater models), and the rest for blind cross-validation. However, a maximum of 50 responses were kept for each item in the development set. Thus, if there were 168 responses receiving credit, 50 went to training and 118 to cross-validation. In addition, if there were 15 or fewer responses receiving credit in the entire dataset, these were evenly divided between training and cross-validation.
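Our reading of this partition rule, as a sketch (the function name and the rounding in the even split are ours):

```python
import random

def split_item_responses(credited, seed=0):
    """Partition one item's credited responses into (training, validation)."""
    pool = list(credited)
    random.Random(seed).shuffle(pool)
    if len(pool) <= 15:
        n_train = len(pool) // 2           # 15 or fewer: divided evenly
    else:
        n_train = min(len(pool) // 3, 50)  # about one-third, capped at 50
    return pool[:n_train], pool[n_train:]

# The paper's example: 168 credited responses -> 50 training, 118 validation.
train, validate = split_item_responses(list(range(168)))
print(len(train), len(validate))  # 50 118
```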
Finally, the degree of agreement between hand readers sets an upper limit on the degree of agreement between the hand and machine-assigned scores. This is true for both the "e-rater" and "c-rater" scoring engines.

2.6. Hand and machine scoring of answers to other tasks

A two-person team hand graded the answers to the four critical thinking tasks that were developed by New Jersey and to the six GRE prompts. This team had extensive prior experience in scoring responses to these tasks. The responses to a task were assigned randomly to readers. To assess reader agreement, both readers also independently graded a common batch of about 15 percent of the answers to each task. On the four New Jersey critical thinking tasks (Conland, Mosquitoes, Myths, and Women's Lives), the graders used a blend of analytic and holistic scoring rubrics depending on the particular question being scored. The mean correlation between readers on the total scores on these tasks ranged from 0.88 to 0.93 with a mean of 0.90 (which is slightly higher than the 0.85 on SportsCo).

The readers graded the answers to the GRE prompts on a holistic six-point scale; i.e., they read the answers quickly for a total impression that considered syntactic variety, use of grammar, mechanics, and style, organization and development, and vocabulary usage. The mean correlations between two readers on a 45-minute issue prompt and a 30-minute argument prompt were 0.84 and 0.86, respectively.

The "e-rater" scoring engine was used to evaluate the answers to the GRE prompts using algorithms that were developed previously (Schaeffer et al. [16]). Thus, it was not necessary to implement the model building process that was required for SportsCo.

2.7. Scaling

We used a standard conversion table to put ACT scores on the same scale of measurement as SAT scores. The converted scores are hereinafter referred to as SAT scores. We transformed the GPAs within a college to z-scores. To adjust for possible differences in grading standards across colleges, we also scaled the GPAs within a college to a score distribution that had the same mean and standard deviation as its students' SAT scores (these are hereinafter referred to as "adjusted GPAs").

As noted above, we used three GRE issue prompts and three GRE argument prompts. To adjust for possible differences in difficulty among these prompts and to facilitate combining scores across prompts, the reader-assigned "raw" scores on a prompt were converted to a score distribution that had the same mean and standard deviation as the SAT scores of the students who took that prompt. We did the same thing with the machine-assigned scores on each prompt and with the scores on the New Jersey tasks.
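The prompt-level conversion is a linear rescaling. A minimal sketch (ours; all of the numbers below are invented for illustration):

```python
from statistics import mean, stdev

def rescale(raw_scores, target_scores):
    """Linearly map raw_scores onto the mean and SD of target_scores."""
    m_raw, s_raw = mean(raw_scores), stdev(raw_scores)
    m_tgt, s_tgt = mean(target_scores), stdev(target_scores)
    return [m_tgt + (x - m_raw) * s_tgt / s_raw for x in raw_scores]

# E.g., put raw six-point grades on one prompt onto the SAT scale of the
# students who took that prompt (hypothetical values).
raw = [2, 3, 4, 4, 5, 6]
sat = [980, 1050, 1100, 1150, 1220, 1300]
print([round(s) for s in rescale(raw, sat)])
```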
3. Analyses and results

3.1. Relationship between hand and machine assigned scores

There were 323 students who had both hand and machine-assigned scores on their answers to SportsCo, 590 students who had hand and machine-assigned scores on at least one GRE issue prompt and one argument prompt, and 79 students who were in both of these two groups.

The correlation between two hand readers was usually only slightly higher than it was between the hand and computer assigned scores (Table 2). The one exception was on the GRE argument prompt, where the correlation between two hand readers was 0.19 higher than it was between the hand and machine scores.

Table 2. Correlation between two hand readers and between hand and machine-assigned scores, by task

Task     | Score                | Between two hand readers | Between hand and machine scores
SportsCo | Sum of items 1 to 40 | 0.84 | 0.81
         | Communication (41)   | 0.61 | 0.57
         | Total (1 to 41)      | 0.85 | 0.83
GRE      | Issue prompt         | 0.84 | 0.73
         | Argument prompt      | 0.86 | 0.67
         | Mean GRE             |      | 0.78

The hand readers tended to assign slightly higher raw scores to SportsCo items 1-40 than did the c-rater algorithm, whereas they assigned slightly lower raw communication scores than did the e-rater algorithm (Table 3). There was almost no difference between the mean hand and mean machine-assigned GRE writing scores, but that may be a by-product of the scaling described above. Standard deviations also were quite comparable.

Table 3. Means and standard deviations by scoring method and task

Task     | Score type           | Hand mean | Machine mean | Hand SD | Machine SD
SportsCo | Sum of items 1 to 40 | 8.71  | 7.66  | 3.47  | 3.35
         | Communication (41)   | 2.89  | 3.15  | 0.94  | 1.10
         | Total raw (1 to 41)  | 11.60 | 10.80 | 4.15  | 3.97
GRE      | Issue prompt         | 1118  | 1126  | 188.5 | 189.1
         | Argument scale       | 1111  | 1110  | 183.5 | 182.4
         | Mean GRE scale       | 1114  | 1118  | 160.5 | 165.0

There did not appear to be a systematic effect of scoring method across tasks. For instance, SportsCo machine scores correlated higher with GRE hand scores than they did with GRE machine scores (Table 4). Moreover, the correlation between GRE issue and argument prompts was not consistently higher when the same scoring method was used than when different methods were used (Table 5).

Table 4. Correlation between total SportsCo and GRE total scores (N = 79)

                  | SportsCo hand | SportsCo machine
GRE hand score    | 0.53 | 0.58
GRE machine score | 0.36 | 0.43

Table 5. Correlation between GRE prompt types when the same versus different scoring methods are used

Argument | Issue   | Correlation
Hand     | Hand    | 0.49
Machine  | Machine | 0.58
Hand     | Machine | 0.45
Machine  | Hand    | 0.55

3.2. Relationship of hand and machine scores to scores on other tests

SportsCo's hand scores correlated with other indexes of academic ability to about the same degree as its machine scores correlated with those measures. For example, SportsCo hand and machine scores correlated 0.36 and 0.34, respectively, with unadjusted college GPA. Although SportsCo is only a 90-minute task, this relationship was almost as strong as the one between SAT scores and college GPA in our samples (this correlation was 0.36 among students with SportsCo scores and 0.31 among those with GRE scores). GRE hand and machine scores also had comparable relationships with other test scores (Table 6). These findings indicate that the two scoring methods yield very similar results.

Table 6. Correlation of hand and machine scores with scores on other measures

                     | SportsCo hand | SportsCo machine | GRE total hand | GRE total machine
SAT total            | 0.51 | 0.40 | 0.61 | 0.55
College GPA          | 0.36 | 0.34 | 0.25 | 0.22
Adjusted college GPA | 0.53 | 0.48 | 0.52 | 0.48
NJ Women's Lives     | 0.50 | 0.43 | 0.56 | 0.63
NJ Mosquitoes        | 0.50 | 0.49 | 0.60 | 0.60
NJ Conland           |      |      | 0.54 | 0.58
NJ Myths             |      |      | 0.57 | 0.53

Notes: SportsCo was paired with only two of the four New Jersey critical thinking tasks. There were about 90 students who took a given NJ task and also had both a hand and machine GRE total score. The same was true for SportsCo.

3.3. Relationship of scoring method to student characteristics

Males and females had very similar mean scores on all the measures regardless of scoring method. For example, the mean hand and machine total GRE scores for females were 1112 and 1122, respectively; i.e., a difference of 10 points or about 0.06 standard deviation units. The means for males were almost identical, 1121 and 1118, respectively.

As a group, the combination of whites and Asians had significantly higher mean scores on all the measures than did other students, but the size of this difference was similar across scoring methods. For example, the white/Asian hand-scored mean on SportsCo was 12.25 whereas it was 9.64 for all other students combined (i.e., a difference of 2.61 points or about two-thirds of a standard deviation). The mean machine scores of these two groups were 11.35 and 9.16, respectively (i.e., a difference of 2.19 points). Thus, the net effect of scoring method on the difference between these two groups was less than one half of one point (2.61 − 2.19 = 0.42) or about one-tenth of a standard deviation. There were similar results on the GRE. As a group, whites and Asians had a mean of 1147 when their answers were hand scored and a mean of 1146 when they were machine scored. The corresponding means for all other students combined were 1025 and 1041 (i.e., a 16 point difference between scoring methods, which is again about one-tenth of a standard deviation). These data suggest that machine scoring is more likely to ameliorate than exacerbate the differences in mean scores between racial/ethnic groups.
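The arithmetic behind the "one-tenth of a standard deviation" claim can be checked directly. The check below is ours; it uses the overall hand total-score SD from Table 1 as a rough yardstick, whereas the author may have used a pooled within-group SD.

```python
hand_gap = 12.25 - 9.64     # white/Asian mean minus all-others mean, hand scores
machine_gap = 11.35 - 9.16  # the same contrast, machine scores
net = hand_gap - machine_gap
sd = 4.15                   # overall SportsCo hand total-score SD (Table 1)
print(f"hand gap: {hand_gap / sd:.2f} SD")                         # ~0.63, about two-thirds
print(f"net method effect: {net:.2f} points = {net / sd:.2f} SD")  # ~0.42 points, ~0.10 SD
```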
3.4. Unit of analysis effects

Certain types of research and policy analysis studies use the school rather than the student as the unit of analysis (Meyer [12]). These investigations examine differences in mean scores among schools (or other aggregations of students), such as to assess whether the students at a college generally score higher or lower than would be expected given their SAT scores (see Klein et al. [7] for an example of this type of study). When the college is the unit, the correlation between the total GRE hand and machine-assigned scores was 0.99. The school-level correlation between hand and machine scores was 0.95 on SportsCo (compared to 0.85 when the student is the unit). Using the college as the unit also increases the correlation among other tests. This occurs with both the hand and machine scores. For example, when the student is the unit, the hand scores on a New Jersey critical thinking task correlate 0.50 and 0.46 with the hand and machine scores on SportsCo, respectively. When the college is the unit, the corresponding correlations are 0.91 and 0.86. These similarly high correlations suggest that the machine scores alone can be relied on for studies that use the college as the unit. Hand scores would almost be redundant.
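A minimal sketch of this unit-of-analysis shift (ours): average each score within a college, then correlate the college means. Aggregation averages away student-level disagreement, which is why the school-level correlations run higher. (`statistics.correlation` requires Python 3.10+.)

```python
from collections import defaultdict
from statistics import mean, correlation

def college_level_r(colleges, hand, machine):
    """Pearson r between college-mean hand scores and college-mean machine scores."""
    by_college = defaultdict(lambda: ([], []))
    for c, h, m in zip(colleges, hand, machine):
        by_college[c][0].append(h)
        by_college[c][1].append(m)
    hand_means = [mean(h) for h, _ in by_college.values()]
    machine_means = [mean(m) for _, m in by_college.values()]
    return correlation(hand_means, machine_means)
```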
3.5. Cost effectiveness

Assessment of learning in higher education has become a critical concern to policy makers, college accreditation agencies, educators, students, and parents. Yet to do so without submitting to the serious limitations of multiple choice tests is expensive, in part because of the costs of scoring the answers to open-ended questions. Hence, one of the purposes of this study was to examine the utility of using computer scoring of different types of constructed response questions to see if there was now a more efficient way (in terms of costs and scoring time) to help solve this part of the assessment puzzle. We found that the costs of machine scoring depend on several factors.

First, all the answers have to be in machine-readable form. Realistically, this means that students have to key enter their responses, such as by taking the test at one of their college's computer labs. Because of limited space and computer availability, this requirement may lead to administering the measures over several days when a hundred or more students are tested at a college (test security concerns may preclude allowing students to take the tests in their homes or dorm rooms). Efficient procedures are needed to send the students' responses to the scoring service, such as by uploading them over the Web.

Second, machine-scoring costs are a function of the number of questions and the number of answers to them. For instance, one company advertised (in 2003) a setup fee of $75,000 to develop the scoring algorithms for the first ten prompts and $2,500 for each additional prompt. It also advertised a charge of 85 cents per answer. Under these terms, the direct cost of machine scoring (including the cost of hand scoring a sample of 250 answers that the computer needs to "learn" on) is less than that of hand scoring if (1) there are over 4,000 answers to be scored for each of 16 different prompts and (2) it costs $2.50 or more to hand-score an answer once.
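Working through that price quote (a sketch of ours; it assumes the 250-answer hand-scored "learning" sample is needed separately for each prompt, which the text does not state explicitly):

```python
def machine_cost(n_prompts, answers_per_prompt, hand_cost_per_answer):
    setup = 75_000 + max(0, n_prompts - 10) * 2_500    # first ten prompts, then $2,500 each
    training = 250 * n_prompts * hand_cost_per_answer  # hand-scored "learning" sample
    scoring = 0.85 * n_prompts * answers_per_prompt    # 85 cents per answer
    return setup + training + scoring

def hand_cost(n_prompts, answers_per_prompt, hand_cost_per_answer):
    return hand_cost_per_answer * n_prompts * answers_per_prompt

# The paper's break-even case: 16 prompts, 4,000 answers each, $2.50 per
# hand-scored answer.
print(machine_cost(16, 4_000, 2.50))  # 154400.0
print(hand_cost(16, 4_000, 2.50))     # 160000.0
```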
There are, however, several significant indirect costs that also have to be considered. For example, hand scoring often requires training large numbers of readers and the handling, shipping, and keeping track of numerous boxes of answers. This can be a staff-intensive, time-consuming, and complex logistical process that requires lots of staff and space.

Third, machine scoring (which takes less than eight seconds per answer) is much faster than hand scoring (and this can be done by multiple computers working simultaneously that do not need coffee or rest breaks). Scoring time is an important consideration for programs that have to report results promptly. In addition, machine scoring may be better suited to testing programs that are conducted over several months because, unlike hand scoring, it does not require constantly arranging for readers to score relatively small batches of answers.

Because of these and other factors, the Council for Aid to Education's Collegiate Learning Assessment (CLA) program delivers tests to students electronically over the Internet. Examinees answer online and their responses are uploaded for machine scoring. The per-student cost of this system is less than one-fourth of that required for the hardcopy test booklet and hand scoring procedures that were used in the study above (for details, see www.cae.org).

4. Discussion and conclusions

We examined the correspondence between the hand and machine scores that were assigned to the answers to three types of college-level open-ended tasks (a 90-minute performance task and the issue and argument prompts that are now used on the GREs). These analyses found that the machine-assigned scores could replace the hand-assigned scores for at least some types of large-scale testing programs (such as for studies that assessed whether a college's students are generally scoring higher or lower than would be expected on the basis of their SAT scores). The specific findings that support this conclusion are as follows:

• Hand and machine-assigned scores agreed highly with each other. And, the degree of agreement between these scoring methods was about as high as it was between two hand readers. This was true on SportsCo (the 90-minute performance task) and on the GRE prompts. More recent analyses of these measures with much larger samples confirm these findings (Klein, Shavelson, Benjamin and Bolus [8]).

• There is a near perfect correlation (0.95) between the hand and machine scores on the 90-minute SportsCo performance task when the college (rather than the student) is the unit of analysis. It was 0.99 on the GRE.

• Hand and machine-assigned scores have very similar correlations with other constructed response and selected response measures, such as SAT scores, the grades on other critical thinking tasks, and college GPAs. In other words, the hand and machine-assigned scores behave the same way.

• The machine-assigned scores to one task correlate about as well with the machine-assigned scores to another task as they do with the hand scores on that other task. Thus, there is no indication that some students consistently earn higher grades when their answers are hand versus machine scored.

• There is no interaction between gender and scoring method, and if anything, machine scoring tends to very slightly narrow the differences in mean scores between racial/ethnic groups. It certainly does not widen them.

• It appears that the economic benefits of machine scoring can be realized when there are a few thousand or more answers to be scored.

The results above were obtained in a "low-stakes" research study where students were paid to participate. It is not certain whether comparable results would be obtained with a different set of open-ended measures or if there were significant external incentives for students to "beat" the machine (Powers et al. [15]), such as in a college admissions or licensing context. Nevertheless, the research described above used a variety of free response tasks, and the students who participated in it generally reported that these tasks were engaging and that they tried to do their best on them (Klein et al. [6]). The substantial correlations of their scores with SAT scores and college GPAs certainly suggest they took the measures seriously.

Regardless of scoring method, there was about a two-thirds of a standard deviation difference on our open-ended measures between the mean scores of the two clusters of racial/ethnic groups studied (i.e., whites plus Asians versus all others), whereas there was over a one full standard deviation difference between these groups on the SAT. For example, their mean total SAT scores in the GRE sample were 1162 and 968, respectively. While some of this disparity is no doubt due to the SAT scores having a higher reliability than the free response measures, it nevertheless suggests that the kinds of open-ended tasks used in this research might tend to slightly narrow the gap in mean scores among groups.

Finally, it appeared that the study's open-ended tasks captured important abilities that are not fully assessed by more traditional multiple-choice measures. For instance, college GPAs correlated about as well with the scores on our critical thinking measures as they did with SAT scores (the medians of the within-school correlations of college GPA with the scores on our critical thinking tasks and with SAT scores were 0.33 and 0.36, respectively). However, at over half of the participating colleges, combining SAT scores with the scores on open-ended measures yielded a statistically significantly better prediction of college GPAs than did SAT scores alone (at p < 0.05). Taken together, these findings suggest that adding the kinds of tests used in this research to a battery that already contains SAT or ACT scores may improve overall predictive validity while at the same time slightly narrowing differences in mean scores between racial/ethnic groups. Machine scoring the answers to these or other types of open-ended tasks also may be key to making their inclusion a practical option. It will be up to future studies to examine this matter as well as the utility of machine scoring for high-stakes testing programs.
References

[1] Burstein, J., Kaplan, R., Wolff, S. and Lu, C. (1996). Using lexical semantic techniques to classify free responses. In Proceedings of the ACL SIGLEX Workshop on Breadth and Depth of Semantic Lexicons.
[2] Daigon, A. (1966). Computer grading of English composition. English J. 55 46–52.
[3] Erwin, D. and Sebrell, K. (2003). Assessment of critical thinking: ETS's Tasks in Critical Thinking. J. General Education 1 50–70.
[4] Ewell, P. T. (1994). A policy guide for assessment: Making good use of the Tasks in Critical Thinking. Technical report, Educational Testing Service, Princeton.
[5] Klein, S. and Bolus, R. (2003). Factors affecting score reliability on high stakes essay exams. Technical report, American Educational Research Association.
[6] Klein, S., Kuh, G., Chun, M., Hamilton, L. and Shavelson, R. (2003). The search for "value-added": Assessing and validating selected higher education outcomes. Technical report, American Educational Research Association.
[7] Klein, S., Kuh, G., Chun, M., Hamilton, L. and Shavelson, R. (2005). An approach to measuring cognitive outcomes across higher-education institutions. Research in Higher Education 46 251–276.
[8] Klein, S., Shavelson, R., Benjamin, R. and Bolus, R. (2007). The Collegiate Learning Assessment: Facts and fantasies. Evaluation Review 31 415–439.
[9] Kukich, K. (2000). Beyond automated essay scoring. IEEE Intelligent Systems 15 22–27.
[10] Landauer, T. K., Laham, D. and Foltz, P. W. (2003). Automatic essay assessment. Assessment in Education 10 295–308.
[11] Leacock, C. and Chodorow, M. (2003). C-rater: Scoring of short answer questions. Computers and the Humanities 37 389–405.
[12] Meyer, R. (1997). Value-added indicators of school performance: A primer. Economics of Education Review 16 183–301.
[13] Page, E. B. (1966). The imminence of grading essays by computer. Phi Delta Kappan 48 238–243.
[14] Powers, D., Burstein, J., Chodorow, M., Fowles, M. and Kukich, K. (2000a). Comparing the validity of automated and human essay scoring. Technical Report ETS RR-00-10, Educational Testing Service, Princeton, NJ. GRE No. 98-08a.
[15] Powers, D., Burstein, J., Chodorow, M., Fowles, M. and Kukich, K. (2000b). Stumping e-rater: Challenging the validity of automated scoring. Technical Report ETS RR-01-03, Educational Testing Service, Princeton, NJ. GRE No. 98-08Pb.
[16] Schaeffer, G., Briel, J. and Fowles, M. (2001). Psychometric evaluation of the new GRE writing assessment. Technical Report ETS Research Report 01-08, Educational Testing Service, Princeton, NJ. GRE Board Professional Report No. 96-11P.
[17] Shavelson, R. and Huang, L. (2003). Responding responsibly to the frenzy to assess learning in higher education. Change 35 10–19.
[18] Swygert, K., Margolis, M., King, A., Siftar, T., Clyman, S., Hawkins, R. and Clauser, B. (2003). Evaluation of an automated procedure for scoring patient notes as part of a clinical skills examination. Academic Medicine 78 S75–S77.
[19] Wainer, H. and Thissen, D. (1993). Combining multiple choice and constructed response test scores: Toward a Marxist theory of test construction. Applied Measurement in Education 6 103–118.
