Evaluating LLMs for Answering Student Questions in Introductory Programming Courses


Authors: Thomas Van Mullem, Bart Mesuere, Peter Dawyndt

Abstract

The rapid emergence of Large Language Models (LLMs) presents both opportunities and challenges for programming education. While students increasingly use generative AI tools, direct access often hinders the learning process by providing complete solutions rather than pedagogical hints. Concurrently, educators face significant workload and scalability challenges when providing timely, personalized feedback. This study investigates the capabilities of LLMs to safely and effectively assist educators in answering student questions within a CS1 programming course. To achieve this, we established a rigorous, reproducible evaluation process by curating a benchmark dataset of 170 authentic student questions from a learning management system, paired with ground-truth responses authored by subject matter experts. Because traditional text-matching metrics are insufficient for evaluating open-ended educational responses, we developed and validated a custom LLM-as-a-Judge metric optimized for assessing pedagogical accuracy. Our findings demonstrate that models such as Gemini 3 Flash can surpass the quality baseline of typical educator responses, achieving high alignment with expert pedagogical standards. To mitigate persistent risks like hallucination and ensure alignment with course-specific context, we advocate for a "teacher-in-the-loop" implementation. Finally, we abstract our methodology into a task-agnostic evaluation framework, advocating for a shift in the development of educational LLM tools from ad-hoc, post-deployment testing to a quantifiable, pre-deployment validation process.
Keywords: computer science, programming, LLM, LLM-as-a-Judge, Q&A, task-agnostic evaluation framework, pre-deployment evaluation

2 Introduction

The emergence of Large Language Models (LLMs) capable of solving complex programming tasks has disrupted programming education. Recent studies show that current models can solve nearly all standard CS1 and CS2 exercises with high accuracy [5, 6, 20, 22]. Consequently, generative AI tools have become increasingly relevant in students' careers [19], with students utilizing them to generate code, explain concepts, and debug errors. However, while these tools provide quick, detailed responses, they frequently provide complete solutions [8], thereby hindering the learning process instead of supporting it. For instance, [11] showed that the most common approach used by students is a single prompt to generate the solution. This approach resulted in consistently negative student performance on post-test evaluation scores, indicating that unrestricted LLM access is pedagogically undesirable. Conversely, the same authors [11] also highlighted that a hybrid approach, where students manually write parts of the code while using AI to generate other parts, yielded positive trends. This suggests that LLMs can offer valuable learning support, but the risk of student over-reliance underscores the critical importance of good instruction and educator oversight. As these tools are here to stay, it is important to explore their pedagogical opportunities [2, 10]. While recent studies have focused on investigating model capabilities for generating solutions to programming exercises [5, 6, 20, 22], generating personalized assignments [9, 21] and explanations for students [13, 21], students still rely on educators to answer questions, give feedback and solve problems they encounter.
This reliance poses a challenge regarding scalability and educator workload, because providing accurate, comprehensive and personalized responses to student questions takes considerable time. To address this, automated tools and chatbots like CodeHelp [14] and CodeAid [12] have been developed to provide immediate, automated assistance. However, these systems provide LLM-generated answers directly to students, thereby exposing students to potential hallucinations, incorrect information or pedagogically unsound responses (complete solutions instead of hints, use of concepts that students have not grasped, use of terminology or strategies that differ from what teachers apply). If a student receives erroneous information from these institutionally provided tools, we risk eroding student trust. If this trust is lost, students will likely turn towards general-purpose models like ChatGPT or Gemini, bypassing pedagogical guardrails entirely. This shows the need for a way to provide students with scalable, educator-verified answers.

Furthermore, the adoption of LLM-powered educational tools has highlighted a critical methodological gap in current research: many of these educational tools are deployed directly into classroom settings without rigorous pre-evaluation of their generated responses. To ensure these systems are safe and effective, their output should be validated against ground truth data prior to real-world deployment. However, automatically evaluating open-ended text, such as answers to student questions, has historically proven difficult. Recent advancements like LLM-as-a-Judge [26] offer a promising solution. In this approach, an LLM is instructed to assign a score to a text based on provided criteria, allowing us to evaluate elements like factual correctness, similarity and semantic content of a text.
Recent studies have shown that these judges can serve as viable replacements for human evaluation, as they can achieve high agreement rates with humans [24].

In this paper, we address the evaluation gap by leveraging the LLM-as-a-Judge technique as a metric to quantify the performance of LLMs executing the Q&A task. We develop several prompt/model combinations designed to answer student questions in a CS1 programming course and evaluate their performance before classroom deployment. To achieve this, we create a benchmark consisting of authentic student questions as input data, ground truths and (automated) metrics. This benchmark gives us a scalable and reproducible way to decide if LLMs are fit for the task and which prompt/model combination is best. Furthermore, the benchmark provides us with a future-proof way of evaluating new prompts, models and technologies. In the ever-changing landscape of LLMs, having a reliable method to evaluate new models and prompts is indispensable. This leads to the following research questions:

RQ1: To what extent can LLMs generate pedagogically appropriate answers to student questions in a CS1 programming course?

RQ2: How can we establish a reproducible, scientific process for developing and evaluating LLM-based (educational) tools?

RQ3: What reusable workflow or set of principles can be distilled for designing and evaluating similar tools across domains and tasks?

3 Methods

The goal of this work is to assess the answer generation capabilities of LLMs for programming questions asked by students. We want to explore whether current state-of-the-art (SOTA) LLMs are good enough to answer student questions in CS1 courses, prior to their deployment in a classroom environment. In order to evaluate the performance of different prompts and SOTA models, we structured our research into four distinct phases.
The first phase involves collecting and curating a representative input dataset, which comprises authentic student questions, corresponding code submissions, and other system outputs, from a dedicated learning management system. Following this, a reliable ground truth is established for the selected data by having subject matter experts (SMEs) author pedagogically sound and verified answers to the selected student questions. The third phase focuses on identifying success criteria and selecting or creating accompanying metrics. Finally, we create a robust evaluation framework based on the input data, ground truth and metrics. Using this framework, we systematically evaluate the performance of different actors performing the task. These actors can be prompt/model combinations or even outputs generated by educators. Each of these phases is detailed in the subsequent sections.

3.1 Input data

To investigate whether LLMs can effectively answer student questions in an introductory computer science classroom, a dataset was compiled from Dodona [23], a dedicated learning management system (LMS) for learning to code. The input dataset comprises student questions from an introductory Python programming course (2023-2024) and includes both English (EN) and Dutch (NL) items. Students ask questions about the weekly exercises both online through Dodona and in person during on-campus lab sessions. Over the course of one semester, 1140 questions were asked by students through the Q&A module of the LMS, from which 200 questions were randomly sampled across the semester. Of these questions, 30 were left out due to privacy considerations (GDPR; personal information was included that should not be sent to LLM providers) or poor quality (e.g., lack of sufficient context, unrelated questions, questions about grades).
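To make the shape of the curated data concrete, a retained question together with its context could be bundled into a benchmark entry along the following lines. This is an illustrative sketch only: the field names are ours and do not reflect Dodona's actual export schema.

```python
from dataclasses import dataclass, field

@dataclass
class QAEntry:
    """One curated benchmark item (illustrative field names)."""
    question: str                  # the student's question text
    submitted_code: str            # the accompanying code submission
    question_line: int             # line the question is attached to
    language: str                  # preferred natural language: "EN" or "NL"
    assignment: str                # assignment description (Markdown)
    failing_tests: list[str] = field(default_factory=list)
    linting_errors: list[str] = field(default_factory=list)
    programming_language: str = "python"
```

Entries without test failures or linting output simply leave those lists empty, so every sample presents the same structure to the prompt-construction step.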
The final input dataset contains 170 questions, of which 134 have an answer provided by one of the educators of the course. The remaining questions were self-resolved, or marked by the student as no longer relevant. Each entry in the input dataset includes the student's question and submitted code, the line number of the question, their preferred natural language (English or Dutch), the assignment description, details on failing test cases, linting errors and the programming language. A sample entry based on the "radians example" shown in Figure 1 can be found in the supplementary material. The figure shows a student question at the top of the code, recognizable by the purple line in front of it. The educator's response is provided as an annotation at a different line of the code.

Figure 1: Example of a question asked by a student (Ray Walsh) at the top of the code, followed by the answer provided by one of the educators (Tim Hodkiewicz) at line 11 of the code. Names have been pseudonymized.

3.2 Ground truth

To establish a reliable ground truth for evaluating task performance, an expert answered all 170 selected student questions from the input dataset. For questions directly tied to a student's implementation, the expert verified the ground truth by applying the suggested strategy. General answers about coding style, errors or programming concepts adhere to best practices of the programming language (Python). The ground truths do not reveal full algorithms or complete code solutions. Each output consists of an identified issue, which provides a brief summary of the problem, and an 'answer for the student', which guides the student towards the correct solution by highlighting specific areas for reconsideration and specific hints. The ground truth for the radians example can be seen in Figure 2.
Figure 2: Ground truth compiled by a subject matter expert; both the identified issue and an answer for the student are provided. This answer is used as ground truth for the question from Figure 1.

3.3 Metrics

In order to evaluate how different actors perform the Q&A task, we defined several success criteria. Success criteria define what is considered a successful completion of the task at hand. For the Q&A task, pedagogical accuracy is our primary criterion: answers provided by an actor should contain the same teaching points as provided in the ground truth. However, the practical implementation also requires considering the LLM's operational cost-efficiency. This resulted in two metrics for our research: the comparison of the ground truth with the actor's output, and the cost of answering a single question.

Answers to student questions are diverse and open-ended. They can vary in phrasing, structure and level of detail, and they can provide different solutions to the same problem. Because of this variation, selecting a fitting metric is crucial. Traditional NLP metrics such as BLEU [18], ROUGE [15] or exact match are not suitable for evaluating the Q&A task. These metrics heavily depend on surface-level similarity, which is unable to recognize the aforementioned characteristics of answers to students' questions. Given these limitations, we subsequently explored LLM-as-a-Judge metrics. This approach utilizes an LLM to compare the actor's generated answer against the ground truth, scoring the output based on defined criteria like factual correctness and semantic similarity. We first investigated general-purpose LLM-as-a-Judge metrics, including third-party libraries that provide LLM-based evaluations such as context-precision, context-recall and G-Eval [17].
All of these metrics call an LLM with a custom prompt that is injected with the actor's answer, the ground truth and examples that show the LLM how to score certain responses. While this approach initially provided promising results, it lacked explainability and exhibited non-deterministic behavior. Key concepts like "precision" and "recall" lack concrete definitions when applied to the comparison of two texts: the notions of "false positive", "true positive", "false negative" and "true negative" underlying precision and recall are not easily defined when comparing an expert's answer to a student's question with an LLM-generated answer.

As neither traditional metrics nor off-the-shelf LLM-as-a-Judge metrics met our needs, a custom LLM-as-a-Judge was developed. The development of this judge focused on identifying core teaching points rather than exact text matches. To ensure the judge's scores are meaningful, we aligned them with scores of subject matter experts (SMEs). A dataset of 100 LLM-generated answers with ground truth values (scores assigned by an SME) was constructed and used to align the prompt with the experts' scoring. This dataset consists of 50 samples representing overly verbose, "filler-heavy" explanations and 50 samples with concise but incomplete responses. SMEs manually annotated this dataset using a custom scoring rubric (0-5), establishing a benchmark for the LLM-as-a-Judge. These scores were then used to iteratively refine the judge's prompt. By analyzing the scoring distribution through Cohen's weighted kappa [4] and heatmap visualizations, the delta between the model's and the experts' scores was minimized. This alignment process ensured that the judge was steered toward the same evaluation patterns as the SMEs, resulting in a metric that prioritizes pedagogical accuracy and completeness. This provides us with a scalable alternative to manual evaluation.
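Mechanically, such a judge reduces to a prompt builder plus a score parser wrapped around one LLM call. The sketch below is illustrative only: the actual judge prompt is provided in the supplementary material, and the rubric text, tag names and functions here are our own simplifications, not the deployed implementation.

```python
import re

# Simplified stand-in for the real rubric (see supplementary material).
RUBRIC = (
    "Score the candidate answer from 0-5 by how well it covers the core "
    "teaching points of the reference answer. 5 = all points covered, "
    "0 = none. Reply with 'Score: <n>' on the last line."
)

def build_judge_prompt(reference: str, candidate: str) -> str:
    """Assemble the judge prompt with XML-style delimiters."""
    return (
        f"{RUBRIC}\n"
        f"<reference>\n{reference}\n</reference>\n"
        f"<candidate>\n{candidate}\n</candidate>"
    )

def parse_score(judge_reply: str) -> int:
    """Extract the 0-5 score from the judge's reply; raise if absent."""
    match = re.search(r"Score:\s*([0-5])\b", judge_reply)
    if match is None:
        raise ValueError("Judge reply did not contain a score")
    return int(match.group(1))
```

The prompt produced by `build_judge_prompt` would be sent to the judging model; `parse_score` then recovers the ordinal score from its free-text reply, which is what allows runs to be averaged and compared.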
3.4 Actor Evaluation

The final step is actor evaluation. We gather data on different actors (referring to prompt/model combinations, humans, or a hybrid of both) performing the task to make an informed decision on who or what should execute it. The selection depends on the performance of these actors relative to a baseline. Before evaluating different models, we establish this baseline using the original answers given by the educators during the course. We term this the human baseline, which reflects real-world, time-constrained educator performance and is distinct from the expert-authored ground truth. This baseline contains the "Best Available Human" (BAH) responses. We evaluate these responses the same way as LLM-generated answers: by comparing them to the ground truth using the custom LLM-as-a-Judge metric.

After establishing the BAH baseline, a cost-effective, contemporary model is used to perform prompt engineering. During our research, we used Gemini 2.5 Flash to obtain a well-performing prompt, which is described in the next section. The resulting prompt can then be used to compare the performance of different models. Both an intra-family comparison and an inter-family comparison are performed. The intra-family comparison is performed on the Gemini family and shows how different generations of the same model perform. After this analysis, models from different families (OpenAI, Anthropic and Gemini) are tested to find the optimal prompt/model combination. Based on the BAH baseline and the performance of the models, the best fitting actor can be selected.

3.4.1 Prompt Engineering

During prompt engineering, our primary goal was to develop an optimal instruction set for the LLM that performs the Q&A task. This involves crafting a single, detailed prompt that combines all relevant student context (question, code, assignment details, etc.) with explicit instructions for the LLM.
The prompt's design ensures the model generates pedagogically sound responses that align with our ground truth, ultimately maximizing the LLM's performance for the Q&A task and providing a robust benchmark for model comparison. The process involves systematically evaluating prompt variations against the ground truths using the custom LLM-as-a-Judge that was previously described. This is an iterative process in which a wide range of prompts is explored via trial and error. It consists of selecting the right data, optimally formatting the data, wording and rewording desired outcomes, correcting for common errors and exploring new prompting techniques. Various prompts were tried, as small variations in global structure, lexical choices and grammatical structure can significantly influence the model's performance. The most important observations can be grouped into three overarching categories: data selection, data formatting and optimization.

Selecting the right input data is crucial, since this data provides context to the model, which may result in better performance. Introducing too much clutter can "confuse" the model and lead to diminishing results [16]. During prompt engineering, we applied a 'leave-one-out' approach on the available input data. The influence of every piece of data was studied by excluding parts of the input data and observing whether or not the performance dropped significantly.

The next prompt engineering step consists of formatting the input data appropriately. The only way to test this is by trial and error. The representation of the submitted code and student question is straightforward: these are unstructured texts that can be passed to the model as is. Assignments were converted to Markdown to ensure consistency and uniform formatting.
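The 'leave-one-out' ablation described above can be sketched as a simple loop over input fields. This is a minimal illustration: `score_prompt` here is a deliberately fake stub standing in for a full Q&A run plus LLM-as-a-Judge scoring, and the field names are our own labels for the kinds of context the prompt received.

```python
# Hypothetical labels for the pieces of context in the Q&A prompt.
FIELDS = ["question", "code", "assignment", "failing_tests",
          "linting_errors", "language", "line_number"]

def score_prompt(fields: list[str]) -> float:
    # Stub scorer for illustration only: pretends every field except
    # `line_number` contributes to the accuracy score.
    useful = [f for f in fields if f != "line_number"]
    return 1.0 + 0.5 * len(useful)

def leave_one_out(fields: list[str]) -> dict[str, float]:
    """Score the full prompt, then re-score with each field removed;
    return the score drop attributable to each field."""
    baseline = score_prompt(fields)
    deltas = {}
    for excluded in fields:
        reduced = [f for f in fields if f != excluded]
        deltas[excluded] = baseline - score_prompt(reduced)
    return deltas
```

Fields whose removal leaves the score essentially unchanged (in this toy setup, `line_number`) are candidates for exclusion from the final prompt.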
Dodona returns software testing results in a complicated JSON structure with few restrictions, which requires some filtering before this data can be passed to the model. Subsequently, different orderings of the input data were explored and different ways of splitting up the content were evaluated. XML tags proved to be the best delimiters, and the order of the information did not have a significant impact on the performance of the models.

The third prompt engineering step is optimization: different prompts are iteratively tested and adjusted in order to improve the quality, consistency and accuracy of the generated answers. This involves carefully rewording parts of the prompt to get the desired behavior, eliminating unintended behaviors and getting rid of common mistakes. In particular, it was found that using positive tones and goal-oriented instructions works better than using negative constructs such as "do not" or "never use". Such constructs tend to either confuse the model or shift its focus toward what we were trying to avoid. This led to a prompt that focuses on clear, goal-oriented phrasing describing what the model should do, not what it should avoid.

Afterwards, different prompting techniques were explored: zero-shot prompting [3] was used as a baseline, chain of thought (CoT) to encourage reasoning steps [25], and one-shot prompting [3] to try and steer the model in the right direction. Each of the techniques was evaluated using the created metrics, which led to the conclusion that zero-shot prompting and subtle CoT work best for generating answers to student questions. According to our findings, this is because long and complicated prompts tend to confuse the model and lead to unexpected behaviors.

4 Results

First, we validate our custom LLM-as-a-Judge metric that is used to score how well an answer aligns with the teaching points covered in an expert-defined ground truth.
We then use this validated scoring approach, together with a cost-per-request metric, to (i) establish a realistic human baseline, (ii) optimize the prompt, (iii) compare models within a single model family, and (iv) benchmark models across providers.

4.1 Validity of automated scoring (LLM-as-a-Judge)

Because answers to student questions are open-ended and can vary in phrasing and structure, we evaluate responses using a custom LLM-as-a-Judge: given an expert-authored ground truth answer as a reference, the judge assigns an ordinal score from 0-5 to an actor's answer based on coverage of the core teaching points in the reference answer. Before using this score to evaluate prompts and models, we calibrated and validated the judge on a separate set of 100 responses that were independently scored by a subject matter expert (SME) using the same scoring rubric. Averaged across three runs on this calibration set, the judge matched the SME score exactly in 55% of cases, and 91.66% of scores fell within ±1 point on the 0-5 scale. Agreement beyond chance was substantial (linear-weighted Cohen's κ_w = 0.655), and rank/linear associations between judge and SME scores were strong (Pearson r = 0.8185; Spearman ρ = 0.7982; Kendall τ = 0.7067). To visualise the alignment, we plotted the frequency of SME and LLM score pairs as a heatmap (Figure 3). High agreement between expert and automated judgments is indicated by a strong concentration of values along the diagonal. Full details of the rubric, judge prompt, and calibration procedure are provided in the supplementary material (judge prompt) and Appendix A (calibration).

Figure 3: Heatmap showing the difference between SME scores and the scores assigned by the LLM-as-a-Judge, where scores express the alignment between an actor's answer and an expert's reference answer.
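The agreement statistics used in this calibration can be computed directly from the two score lists. The following is a self-contained sketch of the exact-match rate, the within-±1 rate and linear-weighted Cohen's kappa; libraries such as scikit-learn (`cohen_kappa_score` with linear weights) would give equivalent results.

```python
from collections import Counter

def agreement_stats(sme: list[int], judge: list[int], k: int = 6):
    """Exact-match rate, within-one rate and linear-weighted Cohen's
    kappa for two equally long lists of ordinal scores in 0..k-1."""
    n = len(sme)
    exact = sum(s == j for s, j in zip(sme, judge)) / n
    within1 = sum(abs(s - j) <= 1 for s, j in zip(sme, judge)) / n

    # Observed and chance-expected disagreement, weighted by |i - j|
    # (the linear weighting scheme).
    obs = Counter(zip(sme, judge))
    row, col = Counter(sme), Counter(judge)
    observed = sum(abs(i - j) * obs[(i, j)]
                   for i in range(k) for j in range(k))
    expected = sum(abs(i - j) * row[i] * col[j] / n
                   for i in range(k) for j in range(k))
    kappa = 1 - observed / expected
    return exact, within1, kappa
```

Perfect agreement yields kappa 1.0 and maximal systematic disagreement yields negative kappa, which matches the interpretation of the κ_w = 0.655 reported above as substantial agreement beyond chance.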
4.2 Prompt engineering

After establishing the alignment of the LLM-as-a-Judge, we created a suitable prompt that can be used to evaluate the performance of different models. This prompt was engineered by testing different variations on a cheap but well-performing model. In our research, the Gemini 2.5 Flash model was used to compare a large number of different prompt variations. These were easily evaluated on the input data from the programming course using the LLM-as-a-Judge metric. Variations in input data, data format and different prompting techniques were tested to ensure optimal performance. The final prompt uses the student's question and code, the description of the assignment, failing tests and linting errors, the programming language and the student's preferred natural language. Only the position of the student's question (line number) had no impact on the performance of the model (Figure 4). XML tags were used as delimiters and large bodies of text were converted to Markdown [7]. The prompting technique used is zero-shot prompting with chain of thought. The optimization process was highly iterative and exploratory, involving numerous rapid, ad-hoc tests rather than a strict, linear progression. Consequently, a comprehensive performance breakdown for every prompt variation falls outside the scope of this research.

Figure 4: Q&A task accuracy score of the Gemini 2.5 Flash model related to the input data provided to the model. The baseline is represented by the final and best performing prompt containing all input data except for the line number. '-' indicates a piece of data was left out of the baseline prompt, '+' indicates a piece of data was added.

4.3 Intra-family comparison

An LLM's specific version and architecture have a significant impact on its performance. To evaluate this impact, multiple Gemini versions and variants were tested and compared against one another
The original answ ers provided by the educators (BAH) were used as a baseline, serving as a critical p oin t of comparison for LLM p erformance, yielding a mean score of 2.925 out of 5. Among the tested mo dels, Gemma-27B (a 2024 op en-weigh ts mo del that can b e run lo cally) scored the low est, with a mean of 1.82 out of 5, performing b elo w the h uman baseline. Most mo dern mo dels, how ev er, surpass the human baseline, with Gemini 2.5 pro scoring 3.32 out of 5. These results suggest that modern models are capable of pro ducing answers that not only align with expert exp ectations but also exceed the quality of educator resp onses generated in a realistic education con text. Figure 5: Q&A task accuracy score of different mo dels within the Gemini family p erforming the Q&A task. Thinking was disabled on all mo dels. An analysis of the score distributions of the mo dels within the Gemini family also sho ws a correlation b et w een mo del v ersion and obtained scores (Figure 6). Newer mo dels hav e a stronger p ositiv e sk ew to wards high scores compared to the older mo dels. The newest mo del (Gemini 3 flash) obtains a score of 4 out of 5 or higher in 52.35% of the cases. Smaller mo dels, suc h as Gemini 2.0 Flash, only scored 4 out of 5 or higher in 29.4% of the cases. Additionally , 82.35% of the answers generated b y Gemini 3 flash scored 3 out of 5 or higher, compared to 58.8% of the answers generated by Gemini 2.0 Flash. Figure 6 (a) con tains the score distributions for the BAH (answer giv en by educators). Only 134 out of the 170 questions in the dataset hav e a v alid answ er provided by an educator. The scores are somewhere in b et w een those of Gemini 2.0 Flash and Gemini 2.5 Flash: 41.05% of 11 answ ers scored 4 out of 5 or higher and 64.93% scored 3 out of 5 or higher. This indicates that the most recent mo dels p erform at least as w ell as educators. 
Figure 6: Score distribution of (a) the best available human, (b) Gemini 2.0 Flash, (c) Gemini 2.5 Flash without thinking and (d) Gemini 3 Flash with minimal thinking performing the Q&A task. Graph (a) only contains data points that received a valid answer from an educator.

4.4 Inter-family comparison

Leveraging the established prompt and metrics, we extended our evaluation to include models from different providers. Three popular families (Google's Gemini, OpenAI's GPT and Anthropic's Claude) were chosen and their most recent models were tested (Figure 7). Every run used the 170 student questions from the Q&A evaluation dataset, the same prompt, and all other default settings of the model. All runs used temperature 0.2 unless the API imposed a different default; for Gemini 3 Flash preview and Gemini 3 Pro preview, temperature 1 was used due to technical limitations. Gemini models used dynamic thinking with all other default settings. Models from the OpenAI family used the default thinking effort (medium). The Claude models have thinking disabled by default; therefore, these models were run both with thinking enabled and disabled. When thinking was enabled for these models, the reasoning effort was set to medium.

With the exception of GPT 5 mini, all evaluated SOTA models outperformed the human baseline (2.925 out of 5). The models exhibit the same behavior as those within the Gemini family: flagship models consistently perform better. However, performance gains diminish as the benchmark becomes saturated. Newer models easily achieve scores close to four out of five, limiting the room for improvement. Moreover, because the task involves open-ended questions, achieving a perfect score is inherently difficult. As a result, additional model scaling yields smaller improvements, such that even the cheapest models perform very well on this task.
Moving towards the more expensive flagship models also comes with a bigger cost (Figure 8, Supplementary Fig. S.1). The average cost of a request to a model ranges from 0.306 USD cents for the most efficient model to 4.189 USD cents for the most expensive flagship model, a difference of an entire order of magnitude. One of the models with the best trade-off between performance and cost is Gemini 3 Flash, with a price of 0.497 USD cents per request and an accuracy score of 3.335 out of 5. If that model had been used during the 2023-2024 edition of the programming course (1140 questions), the total cost would have been 5.67 USD for one semester. Moreover, newer thinking models are more token-efficient than previous generations (Supplementary Fig. S.2), reducing costs even further. However, if costs should be kept as low as possible, thinking can be disabled or thinking effort can be reduced. This is shown by the performance of the Anthropic models with thinking disabled: their performance is lower than that of the models with thinking enabled, but their cost is also considerably lower. The cost of Claude Opus 4.5, for example, drops 47.41% while performance only drops 6.03%. As a result, most flagship models, with or without thinking, can be used to answer questions from students, as they outperform the human baseline.

Figure 7: Q&A task accuracy score of the tested models from Google, OpenAI and Anthropic. Solid dots represent performance with thinking enabled. Outlined dots represent performance with thinking disabled.

Figure 8: Q&A task accuracy score of models from Google, OpenAI and Anthropic plotted against the average cost per request in USD cents. Models positioned toward the top-left represent the ideal balance of high accuracy and low cost. Solid dots represent performance with thinking enabled. Outlined dots represent performance with thinking disabled.
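The semester cost estimate above is a direct multiplication of the per-request price by the question volume, as the short calculation below shows for the Gemini 3 Flash figures quoted in the text.

```python
# Semester cost estimate for the Gemini 3 Flash figures quoted above.
COST_PER_REQUEST_CENTS = 0.497   # average cost per answered question
QUESTIONS_PER_SEMESTER = 1140    # questions asked during 2023-2024

total_usd = COST_PER_REQUEST_CENTS * QUESTIONS_PER_SEMESTER / 100
print(round(total_usd, 2))  # 5.67 USD for one semester
```

At this scale, even a tenfold price difference between models translates into tens of dollars per semester, which is why accuracy rather than cost dominates the actor-selection decision here.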
5 Task-agnostic evaluation framework

Based on the research steps taken during the evaluation of the Q&A task, we abstracted a task-agnostic evaluation framework (Figure 9). The framework focuses on the pre-deployment evaluation of different actors performing a certain task. These actors can be different prompt/model combinations, humans or agentic workflows. The framework comprises a preliminary feasibility study, followed by three sequential stages: Data Preparation, Metrics Selection and Actor Evaluation.

During the data preparation stage, the evaluation dataset is created, consisting of the input data and ground truth. After compiling the evaluation dataset, the success criteria are defined and matching metrics are selected or created. Given the input data, ground truth and metrics, a selection of actors is evaluated in the final stage. The input data is passed to each actor, resulting in an actor output. This output is combined with the ground truth and passed on to the predefined metrics. Each metric determines a score which can be used to compare or rank the different actors. Based on the gathered insights, the optimal actor can be selected for the given task. In what follows, we discuss these steps in more detail.

5.1 Feasibility Study

Before automating the evaluation process, the feasibility of using an LLM for the targeted task should be evaluated. This incorporates not only the ability of an LLM to perform this task but also the ethical and legal aspects related to the task at hand. For example, are you allowed to use an LLM to perform this task? Is it pedagogically acceptable? After these considerations, the capabilities of the LLMs can be tested. This process begins with curating a small, representative dataset and engineering a basic prompt. The LLM's outputs are then manually evaluated: if the prompt works acceptably, the evaluation process can start.
If not, the prompt is iteratively refined as long as there are improvements. The results of this feasibility study inform the "go/no-go" decision for starting the evaluation process.

Figure 9: General framework to automatically evaluate the performance of different actors (prompt/model combinations, humans or a combination of LLM and human) in performing a given task.

5.2 Data Preparation

For the Q&A task, input data was derived from student questions stored in Dodona. We sampled 200 questions and manually filtered them to ensure GDPR compliance and to remove unrelated questions (e.g. questions about scores on tests). For each of the remaining 170 questions, the ground truth was established by one of the supervising assistants of the course, resulting in the Q&A evaluation dataset.

This process constitutes the first step in the evaluation framework. An evaluation dataset is curated, consisting of the input data (e.g. student questions and other context) and the ground truth (SME outputs for the input data). The input is sampled from raw source data, and is manually reviewed and edited to ensure optimal coverage of the source data. During the review process, crucial insights are gathered and missing samples are identified. Based on these insights, certain samples are added or removed from the input data. After the input data is selected, one or more SMEs start with the annotation of the input data to create the ground truth. The ground truth serves as the single source of truth for the task given the related input data. Ideally, the ground truth is compiled by multiple experts independently, possibly followed by compiling a common ground truth among all experts. The evaluation dataset will be used to evaluate all current prompts, models and future implementations of the task, making this a vital and possibly time-consuming step in the framework.
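The structure of such an evaluation dataset can be captured in a few lines. The sketch below is illustrative only: the field names and the filter flags are assumptions, not the authors' actual schema.

```python
# Minimal sketch of the data-preparation stage described above:
# filter the raw source data, then pair each kept item with an
# SME-authored ground-truth answer. Field names are hypothetical.
from dataclasses import dataclass

@dataclass
class EvalItem:
    question: str      # sampled, GDPR-filtered student question
    context: str       # e.g. submitted code and exercise description
    ground_truth: str  # answer authored by a subject-matter expert

def build_dataset(raw_questions, annotate):
    """raw_questions: list of dicts from the source system.
    annotate: callable that returns the SME ground-truth answer."""
    kept = [q for q in raw_questions if q.get("on_topic") and q.get("gdpr_ok")]
    return [EvalItem(q["question"], q.get("context", ""), annotate(q))
            for q in kept]
```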
5.3 Metrics Selection

For the Q&A task, we identified pedagogical accuracy and cost-efficiency as the success criteria. To successfully complete the Q&A task, the prompt/model combinations had to respond with an answer that aligned with the ground truth while keeping the cost as low as possible. These success criteria were translated into two metrics: a custom LLM-as-a-Judge that compares actor output against the ground truth, and the average cost per request.

The metrics selection stage follows the same pattern: first, the success criteria are defined. These criteria represent the qualities that the generated output should exhibit. Subsequently, appropriate metrics are chosen or created. These metrics will vary based on the success criteria: for example, if the criterion is answers shorter than 100 words, the metric will be word count. However, if the criterion is a short answer that contains all information, the metric will have to take into account both the amount of information and the word count. Regardless of the metric, the output should be closely examined to ensure that the metric aligns with human/expert evaluations. Discrepancies between the chosen metric and human evaluation could indicate that the metric is not fit for the task it needs to perform. For example, traditional machine learning metrics cannot judge sentiment or pedagogical relevance.

Our recommended approach when choosing metrics is as follows: where possible, non-LLM-based metrics should be preferred, as they offer greater interpretability, reproducibility and lower susceptibility to (model-induced) bias. Traditional metrics such as BLEU [18], ROUGE [15], METEOR [1], or exact match provide a reliable starting point. However, these metrics often fall short in tasks involving open-ended generation, where lexical overlap is a poor proxy for quality [24].
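Both points above can be made concrete with a couple of toy metrics. The snippet below is a sketch, not the authors' implementation: a word-count check for a length criterion, and a crude unigram-overlap score that illustrates why lexical overlap is a poor proxy for quality.

```python
# Two illustrative non-LLM metrics. The overlap score is a crude
# BLEU-1-like fraction of candidate words found in the reference: a
# paraphrased but perfectly correct answer can score near zero, which
# is exactly the failure mode described in the text.
def word_count(text: str) -> int:
    return len(text.split())

def meets_length_criterion(answer: str, max_words: int = 100) -> bool:
    """Checks the 'answers shorter than 100 words' success criterion."""
    return word_count(answer) <= max_words

def unigram_overlap(candidate: str, reference: str) -> float:
    """Fraction of candidate words that also occur in the reference."""
    cand = candidate.lower().split()
    ref = set(reference.lower().split())
    return sum(w in ref for w in cand) / max(len(cand), 1)
```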
In cases where traditional metrics do not correlate with human judgment, general-purpose LLM-based evaluators developed and validated by third parties can be used. These models are typically used to assess qualities such as relevance, fluency, helpfulness and factual correctness. However, LLM-as-a-Judge systems rely on prompt engineering and careful observation to produce consistent scores. Despite being flexible and general-purpose, they may introduce bias or lack transparency.

Only when both the traditional and the existing LLM-based evaluators fail should a custom LLM-as-a-Judge be developed. This custom judge should be manually evaluated and compared to expert-level human scoring. Our automated evaluation framework for prompt and model selection can be used to make sure the judge performs as expected by iteratively improving the prompt used by the judge.

5.4 Actor Evaluation and Selection

The final step in our Q&A research was testing the best available human output, evaluating different prompts, performing an intra-family comparison and ending with a comparison of models from different providers. Based on the accuracy scores obtained by all of these evaluations, we were able to conclude that LLMs outperform human educators on the Q&A task, making them suitable for deployment as draft-generating assistants within a teacher-in-the-loop workflow.

This repeated evaluation is the final step in our framework. The input data is given to each actor (a prompt, a model, a human annotator, ...), which results in an actor output. This output and the ground truth linked to the input data are handed to each of the metrics created in the second stage of the framework. This results in multiple scores that can be used to make a ranking, plot or overview. Subsequently, the data can be used to draw conclusions about the different actors performing the task.
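The evaluation loop described above is a simple cross product of actors, inputs and metrics. The sketch below shows the idea; the names and the aggregation by mean are illustrative assumptions, not the authors' code.

```python
# Sketch of the actor-evaluation stage: every actor is applied to every
# input, and every metric scores the (output, ground truth) pair. The
# per-(actor, metric) mean can then be used for ranking or plotting.
def evaluate_actors(dataset, actors, metrics):
    """dataset: list of (input, ground_truth) pairs.
    actors:  dict name -> callable(input) -> output.
    metrics: dict name -> callable(output, ground_truth) -> float.
    Returns a dict mapping (actor, metric) to the mean score."""
    results = {}
    for a_name, actor in actors.items():
        outputs = [(actor(x), gt) for x, gt in dataset]
        for m_name, metric in metrics.items():
            scores = [metric(out, gt) for out, gt in outputs]
            results[(a_name, m_name)] = sum(scores) / len(scores)
    return results
```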
6 Discussion

This study set out to explore the potential of LLMs to support educators in CS1 programming courses by generating answers to student questions, guided by three research questions.

6.1 RQ1: To what extent can LLMs generate pedagogically appropriate answers to student questions in a CS1 programming course?

Our findings indicate that current LLMs are capable of answering student questions in a CS1 programming course, with models like Gemini 3 flash surpassing the quality of typical educator responses. The "best available human" (BAH) baseline, representing the quality of time-constrained educator feedback, achieved a mean score of 2.92 on our 0-to-5 scale. In comparison, most models, such as Gemini 3 flash, surpassed this score with results of 3.335 and higher. This suggests that modern LLMs are capable of generating answers that not only align with expert answers but also exceed the quality of typical educator responses. The superior performance of LLMs might be attributed to their ability to quickly generate detailed and comprehensive explanations, whereas educators may be limited by time constraints, leading to less detailed feedback. However, it is important to acknowledge that some LLM-generated information, while extensive, might be overly detailed or inaccurate, giving away solutions or hallucinating, which can mislead the student and cause a loss of trust.

To mitigate these issues, a Q&A tool was implemented in Dodona following a teacher-in-the-loop approach. An LLM generates a draft answer that is then reviewed and edited by an educator before being sent to the student who asked the question. This human-augmented LLM tool is intended to produce more accurate answers that reach the student faster. The best performing prompt from this research combined with the Gemini 2.5 flash model is currently in use.
Before deployment, the tool was extensively tested on the Q&A benchmark, and future models and improved prompts can be tested on the same benchmark. Future research will investigate how the "teacher-in-the-loop" method affects accuracy, educational impact, and responsiveness.

6.2 RQ2: How can we establish a reproducible, scientific process for developing and evaluating LLM-based (educational) tools?

The evaluation of the Q&A task serves as a proof-of-concept for RQ2, demonstrating that we can establish a reproducible, scientific process for developing (educational) LLM tools. By defining a benchmark and metrics that are specifically representative of the task at hand and the goals we try to accomplish, we established a method to rigorously quantify the performance and cost of LLMs. This approach allowed us to assess the effectiveness of different prompt/model combinations and even the BAH, providing a concrete overview of the current LLM capabilities to perform the (educational) task.

Beyond pre-deployment assessment, this benchmark provides the necessary infrastructure to future-proof the development process. It enables systematic re-evaluation of the tools as the LLM landscape evolves. The influence of new models, updated prompts or the introduction of new technologies is quantifiable through our benchmark. Consequently, this shifts the development of (educational) LLM tools from ad-hoc implementation with post-deployment evaluation to a future-proof, quantifiable, scalable and reproducible pre-deployment process.

6.3 RQ3: What reusable workflow or set of principles can be distilled for designing and evaluating similar tools across domains and tasks?
Based on the design and evaluation of our Q&A task, we were able to abstract an evaluation framework that provides a reliable way of quantifying the performance of different actors (prompt/model combinations, humans, or a hybrid of both) performing a task (Figure 9). This framework moves the development of LLM-based tools from a trial-and-error approach to a structured, task-agnostic pipeline governed by three core principles: (i) gather insights, (ii) automate the evaluation and (iii) future-proof the evaluation.

The first principle, gathering insights, is rooted in the early stages of the framework. During the preliminary feasibility study, data preparation and metric selection, researchers are forced to deeply engage with the source data and task requirements. By curating the evaluation dataset and success criteria upfront, crucial insights into the data's nuances and pitfalls are gained. By defining the goals and identifying potential issues early on, the overall focus of the task is clarified, which reduces errors and workload in later stages.

Building upon the initial insights, the focus shifts towards automating the evaluation. The core strength of the proposed framework lies in its ability to translate success criteria into automated, quantifiable metrics. By thoughtfully selecting or designing automated metrics, the bottleneck of labor-intensive manual evaluation is eliminated. Automation not only reduces time and the required financial resources, but also provides the quantifiable scores that are necessary to reliably evaluate and compare different actors.

With the automatic metrics and evaluation dataset in place, we can create a reusable testing pipeline (actor evaluation phase). As the LLM landscape evolves, new models, prompts or agentic workflows can be used as actors in the evaluation pipeline and evaluated instantly.
There is no need for new datasets, manual annotation or new metrics. LLM tools can be continuously re-evaluated to maintain state-of-the-art performance.

The versatility of the task-agnostic framework is demonstrated by its application to our custom evaluation metric. The custom LLM-as-a-Judge was developed and refined using the same framework and based on the same principles (Appendix A). By treating the judge as an actor performing a specific task (evaluating pedagogical accuracy), we could quantify its alignment with human expert scoring. Furthermore, by building an automated, future-proof pipeline for the evaluation of the LLM-as-a-Judge, the judge can be refined and re-evaluated with new model/prompt combinations.

7 Limitations

While this study shows the capabilities of LLMs in an educational Q&A setting, several limitations must be acknowledged. First, the data used in this study originates from one CS1 programming course taught in Python. Consequently, the performance of the tested models may vary in upper-level courses or courses that use different programming languages and paradigms. Second, the ground-truth answers for the 170 selected student questions were constructed by a single expert. While this approach ensures a pedagogically accurate answer, the response may not encompass all possible solutions or variations in teaching style. Although the LLM-as-a-Judge compensates for this by focusing on the core teaching points, the judge remains an automated proxy for manual expert evaluation. Finally, the performance of the LLMs used in this study is a snapshot in time; given the rapid evolution of generative AI, re-evaluation using our proposed framework will be necessary.

8 Conclusion

Our research demonstrates that LLMs show promising performance in answering CS1 student questions, with models often surpassing time-constrained educator responses.
This suggests that LLMs can serve as effective support tools in educational contexts, reducing workload and improving responsiveness. However, occasional inaccuracies highlight the importance of human oversight to ensure correctness and maintain trust in learning environments. To address this, we implemented the Q&A tool in Dodona with a teacher-in-the-loop approach. This reduces the risk of misinformation, increases answer quality and reduces educator workload.

Additionally, the research on the Q&A task provides us with a reproducible, scientific process for developing educational LLM tools. By creating task-specific benchmarks and metrics before deployment of the tool, we quantified the LLM performance of various prompt/model combinations and the BAH. From this process, we distilled a task-agnostic evaluation framework, shifting the development of educational AI tools from ad-hoc, post-deployment assessment to a quantifiable, scalable, and reproducible validation process. This provides a future-proof pipeline for the systematic evaluation of new models and prompts as the generative AI landscape evolves.

Ultimately, our research highlights the potential of LLMs as powerful educational aids when integrated responsibly. By combining human oversight with pre-deployment evaluation and by providing a framework to perform this evaluation, we pave the way for safer, more effective and scalable adoption of LLMs in education.

Acknowledgements

TVM, BM and PD acknowledge funding by Research Foundation - Flanders (FWO) for ELIXIR Belgium [I002819N].

References

[1] Satanjeev Banerjee and Alon Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments.
In Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare Voss, editors, Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics.

[2] Brett A. Becker, Paul Denny, James Finnie-Ansley, Andrew Luxton-Reilly, James Prather, and Eddie Antonio Santos. Programming Is Hard - Or at Least It Used to Be: Educational Opportunities and Challenges of AI Code Generation. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, SIGCSE 2023, pages 500–506, New York, NY, USA, March 2023. Association for Computing Machinery.

[3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020.

[4] Jacob Cohen. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4):213–220, 1968.

[5] James Finnie-Ansley, Paul Denny, Brett A. Becker, Andrew Luxton-Reilly, and James Prather. The Robots Are Coming: Exploring the Implications of OpenAI Codex on Introductory Programming. In Proceedings of the 24th Australasian Computing Education Conference, ACE '22, pages 10–19, New York, NY, USA, February 2022. Association for Computing Machinery.
[6] James Finnie-Ansley, Paul Denny, Andrew Luxton-Reilly, Eddie Antonio Santos, James Prather, and Brett A. Becker. My AI Wants to Know if This Will Be on the Exam: Testing OpenAI's Codex on CS2 Programming Exercises. In Proceedings of the 25th Australasian Computing Education Conference, ACE '23, pages 97–104, New York, NY, USA, January 2023. Association for Computing Machinery.

[7] John Gruber. Daring Fireball: Markdown.

[8] Arto Hellas, Juho Leinonen, Sami Sarsa, Charles Koutcheme, Lilja Kujanpää, and Juha Sorva. Exploring the Responses of Large Language Models to Beginner Programmers' Help Requests. In Proceedings of the 2023 ACM Conference on International Computing Education Research - Volume 1, ICER '23, pages 93–105, New York, NY, USA, September 2023. Association for Computing Machinery.

[9] Sven Jacobs, Henning Peters, Steffen Jaschke, and Natalie Kiesler. Unlimited Practice Opportunities: Automated Generation of Comprehensive, Personalized Programming Tasks. In Proceedings of the 30th ACM Conference on Innovation and Technology in Computer Science Education V. 1, ITiCSE 2025, pages 319–325, New York, NY, USA, June 2025. Association for Computing Machinery.

[10] Enkelejda Kasneci, Kathrin Sessler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, Stephan Krusche, Gitta Kutyniok, Tilman Michaeli, Claudia Nerdel, Jürgen Pfeffer, Oleksandra Poquet, Michael Sailer, Albrecht Schmidt, Tina Seidel, Matthias Stadler, Jochen Weller, Jochen Kuhn, and Gjergji Kasneci. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103:102274, April 2023.

[11] Majeed Kazemitabaar, Xinying Hou, Austin Henley, Barbara Jane Ericson, David Weintrop, and Tovi Grossman.
How Novices Use LLM-based Code Generators to Solve CS1 Coding Tasks in a Self-Paced Learning Environment. In Proceedings of the 23rd Koli Calling International Conference on Computing Education Research, Koli Calling '23, pages 1–12, New York, NY, USA, February 2024. Association for Computing Machinery.

[12] Majeed Kazemitabaar, Runlong Ye, Xiaoning Wang, Austin Zachary Henley, Paul Denny, Michelle Craig, and Tovi Grossman. CodeAid: Evaluating a Classroom Deployment of an LLM-based Programming Assistant that Balances Student and Educator Needs. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, CHI '24, pages 1–20, New York, NY, USA, May 2024. Association for Computing Machinery.

[13] Juho Leinonen, Arto Hellas, Sami Sarsa, Brent Reeves, Paul Denny, James Prather, and Brett A. Becker. Using Large Language Models to Enhance Programming Error Messages. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, SIGCSE 2023, pages 563–569, New York, NY, USA, March 2023. Association for Computing Machinery.

[14] Mark Liffiton, Brad E Sheese, Jaromir Savelka, and Paul Denny. CodeHelp: Using Large Language Models with Guardrails for Scalable Support in Programming Classes. In Proceedings of the 23rd Koli Calling International Conference on Computing Education Research, Koli Calling '23, pages 1–11, New York, NY, USA, February 2024. Association for Computing Machinery.

[15] Chin-Yew Lin. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics.

[16] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the Middle: How Language Models Use Long Contexts.
Transactions of the Association for Computational Linguistics, 12:157–173, 2024.

[17] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore, December 2023. Association for Computational Linguistics.

[18] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics - ACL '02, page 311, Philadelphia, Pennsylvania, 2001. Association for Computational Linguistics.

[19] James Prather, Paul Denny, Juho Leinonen, Brett A. Becker, Ibrahim Albluwi, Michelle Craig, Hieke Keuning, Natalie Kiesler, Tobias Kohn, Andrew Luxton-Reilly, Stephen MacNeil, Andrew Petersen, Raymond Pettit, Brent N. Reeves, and Jaromir Savelka. The Robots Are Here: Navigating the Generative AI Revolution in Computing Education. In Proceedings of the 2023 Working Group Reports on Innovation and Technology in Computer Science Education, ITiCSE-WGR '23, pages 108–159, New York, NY, USA, December 2023. Association for Computing Machinery.

[20] Brent Reeves, Sami Sarsa, James Prather, Paul Denny, Brett A. Becker, Arto Hellas, Bailey Kimmel, Garrett Powell, and Juho Leinonen. Evaluating the Performance of Code Generation Models for Solving Parsons Problems With Small Prompt Variations. In Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1, ITiCSE 2023, pages 299–305, New York, NY, USA, June 2023. Association for Computing Machinery.

[21] Sami Sarsa, Paul Denny, Arto Hellas, and Juho Leinonen.
Automatic Generation of Programming Exercises and Code Explanations Using Large Language Models. In Proceedings of the 2022 ACM Conference on International Computing Education Research - Volume 1, ICER '22, pages 27–43, New York, NY, USA, August 2022. Association for Computing Machinery.

[22] Jaromir Savelka, Arav Agarwal, Marshall An, Chris Bogart, and Majd Sakr. Thrilled by Your Progress! Large Language Models (GPT-4) No Longer Struggle to Pass Assessments in Higher Education Programming Courses. In Proceedings of the 2023 ACM Conference on International Computing Education Research - Volume 1, ICER '23, pages 78–92, New York, NY, USA, September 2023. Association for Computing Machinery.

[23] Charlotte Van Petegem, Rien Maertens, Niko Strijbol, Jorg Van Renterghem, Felix Van der Jeugt, Bram De Wever, Peter Dawyndt, and Bart Mesuere. Dodona: Learn to code with a virtual co-teacher that supports active learning. SoftwareX, 24:101578, December 2023.

[24] Ruiqi Wang, Jiyu Guo, Cuiyun Gao, Guodong Fan, Chun Yong Chong, and Xin Xia. Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering. Proc. ACM Softw. Eng., 2(ISSTA):ISSTA086:1955–ISSTA086:1977, June 2025.

[25] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems, 35:24824–24837, December 2022.

[26] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, December 2023.
A Design of a custom LLM-as-a-Judge to compare generated answers to a ground truth

The automation of the comparison between SME answers (ground truth) and others (generated answers) is based on the similarity between the ground truth and those other answers. These "other" answers can be generated by an LLM or written by educators. The similarity between two answers is based on the correctness and completeness of the information. In the context of drafting answers to programming questions, correctness and completeness can be interpreted as conveying the same teaching points without adding or missing information.

It is also important to note that there is not a single perfect answer to a programming question. The ground truth answer represents one possible answer to the question asked by the student. This ambiguity makes it difficult to use an existing metric to measure the correctness of an answer compared to a ground truth dataset. To solve this issue, we introduced a custom LLM-as-a-Judge metric. The metric assigns a correctness score to every answer and corresponding ground truth item, reflecting how accurate the answer is relative to the ground truth. To make sure this judge is correctly aligned with the opinion of an SME, we used the evaluation framework. The task is to align the judge with an SME by automatically evaluating the difference in scores between them.

A.1 Feasibility Study

The metric selection phase of the Q&A task showed that it is not feasible to use conventional NLP metrics to determine the similarity of generated answers and ground truth items. To automate the evaluation for the Q&A task, we used LLM-as-a-Judge, which has been proven to be a good replacement for manual evaluation. In our case, initial tests with items from the ground truth dataset and answers from intermediary runs with other metrics showed that LLMs are able to give scores that align with those given by an SME.
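To make the judge concrete, the sketch below shows how such a judge request could be assembled: a rubric plus both answers, asking for a 0-to-5 score. This is an illustrative mock-up, not the authors' actual prompt; the rubric wording and function name are assumptions.

```python
# Hypothetical sketch of a custom LLM-as-a-Judge prompt: the rubric,
# the reference (SME) answer and the generated answer are combined
# into a single request asking for a 0-5 similarity score.
RUBRIC = (
    "5: same core teaching points, nothing essential added or missing.\n"
    "0: the core teaching points completely mismatch.\n"
    "Intermediate scores reflect partial overlap of the core teaching points."
)

def build_judge_prompt(generated: str, ground_truth: str) -> str:
    """Assemble an illustrative judge prompt for one dataset item."""
    return (
        "You compare a generated answer to a reference answer for a "
        "CS1 programming question.\n\n"
        f"Scoring rubric:\n{RUBRIC}\n\n"
        f"Reference answer:\n{ground_truth}\n\n"
        f"Generated answer:\n{generated}\n\n"
        "Respond with a single integer score from 0 to 5 and a short "
        "justification."
    )
```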
The feasibility study uncovered that the challenge in prompting the model for optimal alignment lies in the definition of 'similarity'. Translating the criteria of the SME into a scoring rubric or prompt that can be provided to the model is the main challenge.

A.2 Data Preparation

During the data preparation stage, the benchmark is established to align the custom LLM-as-a-Judge with an SME. Following the insights from the feasibility study, this stage focuses on creating a scoring rubric and on gathering and annotating the necessary data to align the judge. A ground truth dataset is constructed, wherein similarity scores are assigned by an SME. This dataset, used for the alignment, is created by combining data from prior runs utilizing alternative metrics (Q&A task) with the existing ground truth dataset from the Q&A task. Each data point in the consolidated dataset is assigned a score between 0 and 5 by an SME.

A.2.1 Dataset

Based on insights gathered in the feasibility study and previous runs with different metrics during the Q&A task, a ground truth dataset was constructed to align the LLM-as-a-Judge with the scores of an SME. This ground truth dataset is based on 100 samples from two different runs of the Q&A task with other metrics. Fifty of those samples were taken from a run without limitations on the word count: the LLM was not hinted towards short answers, nor was a word count specified. The other fifty samples were taken from a run where the LLM was instructed to limit the answers to 100 words unless more details were essential. The model was also instructed to 'keep the answers short and focused, avoiding unnecessary details or filler'. This dataset represents two relevant scenarios that occurred in previous runs of the Q&A task. The first scenario is characterized by lengthy explanations generated by the LLMs.
These responses, while thorough, often include extra information that is not necessary to help the student. This verbosity can hinder or confuse the student. The second scenario involves the generation of concise short answers that, while efficient, could suffer from incompleteness: the LLM leaves out essential details or skips some of the necessary teaching points. Both scenarios are undesirable and result in a lower score.

Figure A.1: SME view when grading answers for the ground truth dataset of the LLM-as-a-Judge. The SME sees the generated and expected answer and gives a score between 0 and 5 based on a predefined score rubric.

A.2.2 Score Rubric

The LLM-as-a-Judge metric focuses on the 'core teaching points' of the generated answer (LLM answer) and the expected answer (SME answer). The task of the judge is to score generated answers based on the similarity of the core information presented by that answer. A scoring rubric is provided to the model to give accurate descriptions of what each score represents. The metric focuses on the teaching points and on the ability of the model to recognize the core information the student needs, without overly relying on the reference answer. An answer that contains a solution using a different approach but with the same core meaning (it solves the same issue) is considered correct. Complete matches result in a score of 5; a complete mismatch is assigned a score of 0. The SME uses this score rubric to give a score to each item in the ground truth dataset (Figure A.1). This score is later used to align the LLM-as-a-Judge with the expert.

A.3 Metric Selection

To evaluate the scores given by the LLM-as-a-Judge, a simple metric can be used. The difference between the SME and judge scores gives an indication of how well aligned they are. The goal is to minimize this difference for every item in the dataset.
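Alignment between the two score lists can be quantified directly. The sketch below computes the mean absolute score difference and Cohen's quadratically weighted kappa [4], implemented from the standard definition for integer scores on a 0-to-5 scale; it is a minimal illustration, not the authors' code.

```python
# Two alignment measures for paired SME and judge scores (integers 0-5):
# mean absolute difference, and quadratically weighted Cohen's kappa,
# written out from the textbook definition (1 = perfect agreement,
# 0 = chance-level, negative = systematic disagreement).
def mean_abs_diff(sme, judge):
    return sum(abs(s - j) for s, j in zip(sme, judge)) / len(sme)

def weighted_kappa(sme, judge, n_cats=6):
    n = len(sme)
    # observed joint distribution of (SME score, judge score)
    obs = [[0.0] * n_cats for _ in range(n_cats)]
    for s, j in zip(sme, judge):
        obs[s][j] += 1 / n
    # marginal distributions give the chance-agreement baseline
    p_sme = [sum(row) for row in obs]
    p_judge = [sum(obs[i][j] for i in range(n_cats)) for j in range(n_cats)]
    num = den = 0.0
    for i in range(n_cats):
        for j in range(n_cats):
            w = (i - j) ** 2 / (n_cats - 1) ** 2  # quadratic disagreement weight
            num += w * obs[i][j]
            den += w * p_sme[i] * p_judge[j]
    return 1 - num / den
```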
Metrics such as Cohen's (weighted) kappa [4] can be used to indicate the current alignment. Visual representations such as heatmaps can help an expert determine what needs to change and can be generated automatically.

A.4 Actor Evaluation

Lastly, the judge is aligned with the SME scores during the actor evaluation stage. Different prompting strategies are explored, and each of the resulting score distributions is carefully studied to ensure optimal alignment. The wording of the prompt and the scoring rubric are carefully adjusted to steer the judge towards the scores provided by the SME. Heatmaps are constructed and the explanations given by the judge are inspected (Figure A.2). The necessary context is selected based on the highest-scoring alignment; for example, prompts were created that included the student's question or the issue identified by the LLM. This refinement process is repeated until no significant positive changes are detected.

Figure A.2: Output of the LLM-as-a-Judge and the score and justification of the SME for the radians example.

Supplementary Material

Submission example (json)

The following JSON object is a sample entry from the input dataset, corresponding to the "radians example" shown in Fig. 1 of the main article.

{
  "submission_id": 1,
  "line_number": 0,
  "user_language": "English",
  "programming_language": "Python",
  "question": "Hello I don't know what my mistake is. My answer is always pretty close but not correct so i suspected just u mistake in the formula but I don't see it."
  , "code": "import math
#step1: hoeken ingeven
x1 = float(input())
y1 = float(input())
x2 = float(input())
y2 = float(input())
#step2 hoeken omzetten in radialen
math.radians(x1)
math.radians(y1)
math.radians(x2)
math.radians(y2)
#step3: radius of th earth
r = 6371
#step3: calculating the great-circle distance
d = r*math.acos((math.sin(x1)*math.sin(x2))+(math.cos(x1)*math.cos(x2)*math.cos(y1-y2)))
# step 4: show the result on the screen
print(f'The great-circle distance is {int(d)} km.')"
  , "exercise_question": "A great circle of a sphere is the intersection of the sphere and a plane that passes through the center point of the sphere. A great circle is the largest circle that can be drawn on any given sphere. Any diameter of any great circle coincides with a diameter of the sphere, and therefore all great circles have the same circumference and have the same center as the sphere. For most pairs of distinct points on the surface of a sphere, there is a unique great circle through the two points. The exception is a pair of antipodal points, for which there are infinitely many great circles. The shortest distance between two points on the surface of a sphere \u2014 measured along the surface of the sphere \u2014 is always along a great circle of the sphere, and is therefore called the **great-circle distance**. A great circle divides a sphere into two equal hemispheres. A transatlantic ship, for example, does not travel from Southampton, England to New York, USA alongside an east-west latitude \u2014 which appears to be the shortest route on many world maps (e.g. using Mercator projection) \u2014 but over the great circle that runs through both places. This great circle runs relatively high north over the Atlantic Ocean. The map below initially shows the part of the great circle that forms the shortest distance between Southampton and New York. You can drag the end points at will.
The great-circle distance $$d$$ between two points on a sphere can be calculated using the formula: $$d = r \\cdot \\arccos(\\sin(x_1) \\cdot \\sin(x_2) + \\cos(x_1) \\cdot \\cos(x_2) \\cdot \\cos(y_1 - y_2))$$ where $$(x_1, y_1)$$ and $$(x_2, y_2)$$ are the longitude and latitude of both points (given in decimal degrees) and $$r$$ is the radius of the sphere on which the distance is to be calculated. In this assignment we assume that Earth is spherical with radius $$r = 6371\\mbox{ km}$$. Despite the fact that Earth is more like a flattened spheroid than a perfect sphere, the formula for the great-circle distance still gives an approximation that is correct up to $$0.5\\%$$.

### Input

The input consists of 4 numbers $$x_1, y_1, x_2, y_2 \\in \\mathbb{R}$$, each on a separate line. The longitude-latitude pairs $$(x_1, y_1)$$ and $$(x_2, y_2)$$ represent the positions of two points on Earth, where longitudes and latitudes are expressed in degrees.

### Output

A description that indicates the great-circle distance (in kilometers) between the two given points on Earth. The great-circle distance must be rounded to the nearest natural number.

### Example

**Input:**
```
48.87
-2.33
37.80
122.40
```

**Output:**
```
The great-circle distance is 8948 km.
```
"
  , "educator_answer": "These statements have no effect. They only compute the angles in radians from the angles in degrees, but never do something with the result of the computations. You need to reassign the results to the same variables that are then used further down in the program.
```Python
x1 = math.radians(x1)
…
```"
  , "testcases": {
      "accepted": false,
      "status": "wrong",
      "description": "50 tests failed",
      "annotations": [
        {
          "column": 0,
          "externalUrl": "https://pylint.pycqa.org/en/latest/messages/convention/missing-final-newline.html",
          "row": 16,
          "text": "Final newline missing",
          "type": "info"
        }
      ],
      "tabs": [
        {
          "contexts": [
            {
              "accepted": false,
              "testcases": [
                {
                  "accepted": false,
                  "tests": [
                    {
                      "accepted": false,
                      "generated": "The great-circle distance is 9981 km.\n",
                      "expected": "The great-circle distance is 8948 km.\n",
                      "channel": "stdout"
                    }
                  ],
                  "description": "48.87 -2.33 37.80 122.40"
                }
              ]
            }
          ],
          "description": "correctness"
        }
      ]
    }
}

Judge prompt

The following text shows the prompt template that is used to instruct the LLM-as-a-Judge. The prompt consists of general instructions, a scoring rubric, formatting information and specific instructions to perform the task.

You are an alignment judge for a programming course. You will receive two answers:
• Expected Answer: the instructor’s reference solution.
• Generated Answer: AI generated reply to a student.

Your task is to compare these two answers only. You do not see the original student question. Judge whether the Generated Answer conveys the same core teaching points as the Expected Answer. If the same core issue is addressed but a slightly different solution is proposed, that is acceptable.

Important:
- A score of 3 or higher means that the generated answer is acceptable to show to a student. This means that the core teaching points are covered, and the student is able to solve the issue with the provided information without introducing more mistakes (no extra harmful information).
- A score of 2 or lower indicates that either more than one key point is missing, or that the generated answer contains a lot of extra information.
- Extra tips in the reference answer are not essential (not a core issue), but give additional information that may help the student.
- Don't be afraid to give the maximum score of 5 if the answers are very similar.

⸻

Scoring rubric (0–5)

• 5 (Complete alignment): Covers all information in the Expected Answer or suggests an equivalent solution.
• 4 (High alignment): Covers all core ideas in the Expected Answer but a detail is added or missing.
• 3 (Adequate alignment): Main ideas from the Expected Answer are present but the answer adds a few extra details or one significant step/core idea is missing.
• 2 (Low alignment): Partial overlap; answer is too verbose and contains unnecessary details or identifies unrelated core issues or is missing two or more key points.
• 1 (Poor alignment): Very little overlap; most key points are missing, or the answer is mostly irrelevant/misleading or contains incorrect information.
• 0 (Complete misalignment): Entirely unrelated, incorrect, or misleading; contains none of the Expected Answer’s key points.

⸻

Output Format

Always return your evaluation in this YAML schema:

core_points: [ "" ]
missing_points: [ "" ]
extra_points: [ "" ]
justification: "<2–4 sentences explaining summary of comparison>"
Score: <0-5>

⸻

Instructions

1. Identify the core teaching points in the Expected Answer.
2. Compare them against the Generated Answer.
3. Mark what is covered, missing, or unnecessarily added.
4. Assign a score using the rubric above.
5. Fill in the YAML output.

{expected_output}
{output}

Flagship model statistics

Fig. S.1: Total cost for each actor during one run of the Q&A actor evaluation.

Fig. S.2: Average token usage per request of the tested models during one run of the Q&A actor evaluation.
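The judge's structured YAML reply, following the schema in the prompt template above, can be consumed programmatically when aggregating scores. A minimal sketch of extracting and validating the `Score` field; the helper name is hypothetical, and a full YAML parser could of course be used instead of this lightweight regex approach:

```python
import re

def parse_judge_score(reply: str) -> int:
    """Extract the 0-5 score from a judge reply following the YAML schema.

    Hypothetical helper, not part of the study's code: it matches the
    'Score: <0-5>' line required by the judge prompt's output format.
    """
    match = re.search(r"^Score:\s*(\d+)", reply, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'Score:' line found in judge output")
    score = int(match.group(1))
    if not 0 <= score <= 5:
        raise ValueError(f"score {score} is outside the 0-5 rubric range")
    return score
```

A regex on the final `Score:` line is deliberately forgiving: even if the model slightly malforms the preceding YAML lists, the numeric score, which drives the alignment analysis, can still be recovered.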
