Large Language Models in Teaching and Learning: Reflections on Implementing an AI Chatbot in Higher Education

Fiammetta Caccavale 1*, Carina L. Gargalo 1, Julian Kager 1, Magdalena Skowyra 1, Steen Larsen 1, Krist V. Gernaey 1, Ulrich Krühne 1*

1 Department of Chemical and Biochemical Engineering, Technical University of Denmark, Søltofts Plads 228A, Kongens Lyngby, 2800, Denmark.
*Corresponding author(s). E-mail(s): fiacac@dtu.dk; ulkr@kt.dtu.dk

Abstract

The landscape of education is changing rapidly, shaped by emerging pedagogical approaches, technological innovations such as artificial intelligence (AI), and evolving societal expectations, all of which demand thorough evaluation of new educational tools. Although large language models (LLMs) present substantial opportunities, especially in Higher Education, their propensity to generate hallucinations and their limited specialized knowledge may introduce significant risks. This study aims to address these risks by examining the practical implementation of an LLM-enhanced assistant in a university-level course. We implemented a generative AI assistant grounded in a retrieval-augmented generation (RAG) model to replicate a previously teacher-led, time-intensive exercise. To assess the effectiveness of the LLM, we conducted three separate experiments through iterative mixed-methods approaches, including a crossover design. The resulting data address central research questions related to student motivation, perceived differences between engaging with the LLM versus a human teacher, the quality of AI-generated responses, and the impact of the LLM on students' academic performance. The results offer direct insights into students' views and the pedagogical feasibility of embedding LLMs into specialized courses.
Finally, we discuss the main challenges, opportunities and future directions of LLMs in teaching and learning in Higher Education.

Keywords: Artificial Intelligence, Large Language Models, Chatbots in Education

1 Introduction

Education is constantly evolving with new pedagogies (e.g., blended learning, competency-based education), technologies (e.g., artificial intelligence, mixed and virtual reality), and societal demands (e.g., 21st-century skills, sustainability education). Among these technologies, artificial intelligence (AI) is starting to be consistently integrated in Higher Education. In recent years, researchers have been particularly intrigued by large language models (LLMs) and the impact that this technology could have on education. Possible benefits that LLMs could provide include personalized learning, increased teaching efficiency, and tailored student support, which could lead to an enhanced learning experience and overall higher engagement of students (Kasneci et al. 2023). However, there are arguably as many challenges introduced by this technology, such as over-reliance on LLMs (Abd-Alrazaq et al. 2023), which could hinder critical thinking and reasoning skills (Xu et al. 2024). These challenges also include the tendency of LLMs to hallucinate, leading to students receiving incorrect information, and to be too generalized to provide the domain-specific knowledge that the Higher Education curriculum requires. Other issues include ethical concerns, such as ensuring transparency, privacy and accountability (Yan et al. 2024). Therefore, educators need to scrupulously consider the applications and alignment of LLMs with learning objectives, as well as thoroughly validate these models before they are used in education. If no precautions are taken, these issues could cause the introduction of LLMs in education to inhibit rather than facilitate learning.
The purpose of this study is to contribute to the ongoing research and discussion on the introduction of LLMs in Higher Education. We introduce a chatbot previously developed by the authors (Caccavale et al. 2025), which is based on a pre-trained LLM using retrieval-augmented generation (RAG), to a Master's Degree course. The chatbot is created to simulate an interview-like exercise between the students (asking the questions) and a fictional company (originally represented by the teachers of the course). An iterative mixed-methods approach is used to validate the developed chatbot. We provide extensive first-hand opinions from students who tested this technology in three different experiments, including a crossover study. With the collected data and analysis in this contribution we show: (i) the reasons and motivations of students to interact with a chatbot instead of a teacher, (ii) the main differences between interacting with a teacher compared to the chatbot, (iii) the effect on the grading, and (iv) the perceptions of adding speech-to-text and text-to-speech functions on the user experience. Finally, we summarize the opportunities and challenges that LLMs can introduce to education, and in Section 5, we highlight the pedagogical challenges associated with AI and present our considerations on how this technology can be leveraged in collaborative learning.

2 Background

This section introduces the current background for the introduction of AI models in Higher Education. First, the opportunities that LLMs could bring to this field will be discussed, followed by an introduction to their benefits to collaborative learning. Lastly, current challenges and possible drawbacks and limitations will be presented.
2.1 Opportunities introduced by LLMs in education

Current advancements in AI, specifically with regard to large language models (LLMs), offer substantial benefits in Higher Education, including personalized learning (Gan et al. 2023), tailored student support (Rodrigues et al. 2024), teaching efficiency (Xu et al. 2024), greater student engagement (Pelaez-Sanchez et al. 2024), support for research, and increased accessibility and inclusivity (Gan et al. 2023; Xu et al. 2024). These benefits can lead to an enhanced learning experience and overall higher engagement of students (Kasneci et al. 2023), as well as improvements in student comprehension and academic outcomes (Rodrigues et al. 2024). Research indicates that AI can greatly improve learning efficiency by automatically regulating cognitive load, offering personalized teaching, and dynamically adjusting learning pathways (Gkintoni et al. 2025). This research highlights that AI-based strategies are particularly advantageous in high cognitive load domains such as STEM, where they can substantially enhance understanding, memory retention, and problem-solving abilities (Gkintoni et al. 2025). This evidence on the efficacy of LLMs as learning aids suggests that the future of engineering education will include LLMs (Kamalov et al. 2023).

2.2 LLM support in collaborative learning

Research has observed that LLMs can support collaborative learning by acting as intelligent conversational agents, providing real-time answers or feedback, and helping students by generating discussions tailored to group activities (Naik et al. 2024, 2025). For example, LLMs can support group reflection and offer alternative solutions, helping students engage more deeply, both with content and with each other.
They can also enable inclusive collaboration by detecting and modeling group dynamics and offering feedback to improve teamwork and communication skills (Cai et al. 2024; Lewis 2022).

2.3 Pedagogical challenges introduced by AI

Educational research often struggles to keep up with the pace of the changes introduced by new pedagogies, technologies and societal demands, making it difficult to conduct timely and relevant studies and capture the nuances of new learning environments (Lavicza et al. 2022). Moreover, traditional educational research may not be fully equipped to study AI in education (Chiu 2024). Despite the hype surrounding AI, there is a shortage of evidence for its successful incorporation and validation in real-world educational environments (Yan et al. 2024). Studies indicate that most LLM advancements lack a focus on human-centered design and the involvement of educational stakeholders. Consequently, these tools might impede learning rather than improve it (Yan et al. 2024). The integration of innovative technology into education presents a complex challenge (Obidovna 2024), as learning objectives and teaching activities need meticulous adjustment (Aggarwal 2023). Additionally, the lack of transparency in data collection could result in the misuse of student information and an erosion of trust (Nguyen et al. 2023). Academic integrity may likewise suffer, as students could employ AI to circumvent learning objectives, thereby diminishing the significance of assessments and complicating the evaluation of genuine learning, creating an unfair environment (Kelly and Sullivan 2023). AI's integration into education also profoundly impacts the role of teachers.
Research suggests that educators require comprehensive training to learn the capabilities and limitations of AI, to effectively integrate it into pedagogy, develop innovative assessment strategies, and promote the responsible use of LLMs (Nguyen 2025). Otherwise, AI tools might be misused or not perceived as a valuable asset (Ng et al. 2023). Effective integration of AI into education requires addressing the current challenges and aligning with educational objectives, careful management, teacher training, ethical considerations and data privacy safeguards (Zhang et al. 2024; Idris et al. 2024; Caccavale et al. 2024). Therefore, although the benefits that AI could introduce in education are substantial, it is necessary to review the opportunities, challenges, and future directions that LLMs can generate in teaching and learning.

3 Methods

In this section, we introduce the architecture of the AI assistant presented in this study, as well as the mixed-methods iterative approach used to develop and validate it, including a crossover study to further gather insights into students' interaction with and perception of the chatbot.

3.1 Aim of the exercise

In this study, we investigate the introduction of a custom LLM in the "Good Manufacturing Practice (GMP) and quality in pharmaceutical, biotech and food industry - Theoretical version" (28855) Master's Degree course. We use a chatbot previously developed by the authors (Caccavale et al. 2025), which is based on a pre-trained LLM using RAG. The exercise is created to simulate an audit in a pharmaceutical company. The purpose of the exercise is for the students to evaluate the company's compliance with standard procedures; practically, they prepare and perform a mock audit to gather evidence by asking questions and requesting documents.
This information is required to determine whether the company under review could be a suitable business partner or whether the identified non-conformities are too serious. Teachers are expected to either provide precise answers when the necessary documents are available or to respond more vaguely, encouraging students to think critically about the company's conduct. The deliverable for this exercise is an audit report. This is not the final exam, but a group assignment that must be passed to gain access to the exam. This report is assessed to evaluate the groups' skills in identifying minor and major non-conformities and in linking them to the relevant legislation. In a concluding section, the students must state whether the company is compliant and, if it is not, propose an appropriate action plan. In previous years, the fictional company was represented by teachers; however, due to the high number of students enrolled in the course (maximum uptake of 120 students, while 25% more students apply to enroll) and the few teachers available (3 teachers per semester), this task was considered quite repetitive and time-consuming. Nevertheless, teachers did not want to eliminate it from the course, since it leads to significant learning outcomes. Thus, automating this process would benefit both students, by increasing the yearly uptake and thus giving more students the possibility to enroll, and teachers, by automating a time-consuming and repetitive task. The course is generally discursive, and the regulations and documentation are straightforward to discuss, so it poses a suitable use case for the application of AI. Therefore, the solution was to gradually introduce and validate a chatbot that could substitute the teacher in this exercise. It should be emphasized that this initiative is not intended to create a course without a teacher.
Instead, its purpose is to allow teachers to concentrate on delivering lectures and preparing course materials, rather than spending substantial time on exercises that are highly time-consuming and repetitive. It is also essential to recognize that individual students may have different preferences regarding whether they feel comfortable interacting with a chatbot or prefer engaging with a teacher in person. Therefore, participation in this experiment was entirely voluntary, so that groups could choose whether to carry out the audit with a teacher or with the LLM.

3.2 LLM-enhanced learning system

The full model pipeline, including data collection, pre-processing, modelling and deployment, is explained in detail in Caccavale et al. (2025). The virtual assistant implemented and used in this study is a sequence-to-sequence question-answering (QA) model. Students interact with the system through a graphical user interface (GUI). The LLM component is based on a pre-trained, open-source model, FLAN-T5 base (Chung et al. 2024), which can be downloaded from the Hugging Face library (HuggingFace 2016). A smaller model, compared to the more complex and powerful LLMs currently available, was chosen in order to limit the risk of hallucinations and ensure the correctness of the answers, since the model does not deviate substantially from the provided context. We use retrieval-augmented generation (RAG), a technique that enables LLMs to retrieve and incorporate new information in the responses (Lewis et al. 2020). Therefore, the prompt is enriched with further context, which is retrieved through semantic search by calculating the cosine similarity between the input question (asked by the student) and all questions contained in the database. In this case, the context is the historical answer provided by a teacher. The LLM then generates an answer to the question asked based on the context provided.
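To make the retrieval step concrete, the following is a minimal sketch of RAG-style context retrieval over a question-answer database. The bag-of-words vectors, the example questions and the `retrieve_context` helper are illustrative stand-ins, not the authors' code: the deployed system uses learned sentence embeddings for the semantic search and FLAN-T5 for generation.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "vector"; the deployed system uses learned
    # sentence embeddings, so this only illustrates the retrieval idea.
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical Q&A database: historical questions with teacher answers.
qa_db = [
    ("Where is the batch record for product X stored?",
     "Batch records are archived in the QA document room."),
    ("How often is the cleanroom requalified?",
     "The cleanroom is requalified annually."),
]

def retrieve_context(question, db):
    # Return the historical answer whose stored question is most
    # similar (by cosine similarity) to the student's question.
    q_vec = embed(question)
    best_q, best_a = max(db, key=lambda qa: cosine(q_vec, embed(qa[0])))
    return best_a

context = retrieve_context("How often do you requalify the cleanroom?", qa_db)
# The prompt sent to the LLM is then enriched with this context:
prompt = f"Context: {context}\nQuestion: How often do you requalify the cleanroom?"
```

In the deployed assistant, the retrieved teacher answer plays exactly this context role: it is prepended to the student's question, and the generation model produces the reply from that enriched prompt.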
The back end of the model was developed in Python, using the Flask package for deployment. The front end is in HTML. The model was deployed locally using a single NVIDIA GeForce RTX 2060 with Max-Q Design GPU. During the exercise, students interacted with the virtual assistant through a laptop connected to a monitor. If a specific document was requested by the students, the model was triggered to show the document in a pop-up window. The students then had the opportunity to skim the document before proceeding to the next question. All the documents shown to the students during the audit were later sent to them, together with the transcript of their interaction with the model.

3.3 Mixed-methods iterative approaches

This work adopts mixed-methods iterative approaches that combine quantitative data (e.g., performance, metrics from AI systems) with qualitative insights (e.g., student and teacher experiences). To develop the model, various LLMs were benchmarked in previous studies; we report the performance of a selection of the models used for the comparison in the Appendix, in Table 12. Regarding qualitative data, we collected students' perspectives before and after the experiment. To evaluate the chatbot, we define four principal research questions (RQ), which are presented below.

• RQ1: "How happy are you with your experience?"
• RQ2: "After having performed the audit, what do you think of the quality of the answers?"
• RQ3: "Would you recommend doing the audit with your auditee (teacher or chatbot) to other students?"
• RQ4: "Do you think this is the future of the audit exercise in this course?"

RQ1, RQ2 and RQ3 are scored according to a Likert scale in the range 1-5, from lowest to highest. RQ4 is a categorical question with three possible given answers: "Yes", "I am not sure" and "No".
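The responses to these questions are later compared between the teacher and chatbot groups with a Mann-Whitney U test (RQ1-RQ3) and a Chi-square test (RQ4). As a from-scratch sketch of the two test statistics on invented Likert data (not the study's responses; in practice one would use `scipy.stats.mannwhitneyu` and `scipy.stats.chi2_contingency`, which also return p-values):

```python
def mann_whitney_u(x, y):
    """U statistic for group x: each win over y counts 1, each tie 0.5."""
    return sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0
               for xi in x for yj in y)

def chi_square(table):
    """Pearson's chi-square statistic for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Invented Likert scores (1-5) for one research question, two groups.
teacher = [5, 4, 5, 4]
chatbot = [3, 4, 2, 4]
u_stat = mann_whitney_u(teacher, chatbot)  # 14.0 for this toy data

# Invented RQ4 counts: rows = teacher/chatbot group, cols = Yes/Not sure/No.
rq4 = [[9, 3, 2],
       [11, 5, 2]]
chi2_stat = chi_square(rq4)
```

A useful sanity check on the U statistic is that the two complementary values always sum to n1 * n2; the significance thresholds reported later would come from the tests' reference distributions.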
The model has been used by students on three occasions: (a) in the spring semester of 2024, with 3/21 groups of students (N=18, total=119), i.e., the chatbot was used with 3 groups of students (6 students each) out of the 21 total groups; (b) in the spring semester of 2025, with 9/23 groups of students (N=43, total=113); and, (c) in a crossover study carried out in the fall semester of 2025 with 2/2 groups (N=6, total=6). Students could voluntarily decide whether to fill in the survey; therefore, we collected responses from: (i) 14 students performing the audit with a teacher and 13 with the AI assistant in 2024; (ii) 13 students performing the audit with a teacher and 18 with the AI assistant in 2025; and, (iii) 6 students in the crossover study performing the audit with both the teacher and the AI assistant.

For each cohort, to assess whether the difference between the two groups is significant, two statistical tests were performed. We conducted a Mann-Whitney U test (Mann and Whitney 1947), which is a non-parametric test that compares the distributions of two independent groups, for the first three research questions (RQ1, RQ2 and RQ3). For the last research question (RQ4), we performed a Chi-square test (McHugh 2013), which assesses whether there is a significant association between two categorical variables.

3.4 Crossover study

To further test the model and gain new insights, we employed the model again in the fall semester of 2025. This time, the model was tested with a different cohort of students, and hence we cannot assume that the results will be comparable, due to the differences in experience and background of this group of students. There were six students participating in the crossover study, divided into two groups of three students each.
To collect their perspectives on the different interactions and on their strengths and weaknesses, we asked each group to perform the audit exercise with both the teacher and the chatbot. The experiment lasted a total of 1.5 hours, with the time allocated equally between the chatbot and the teacher. The students were made aware in advance of the use of AI in this study, and informed consent was collected before starting the experiment. As in the previous experiment, the interaction was not evaluated; only the report of the findings is graded.

4 Results

In this section, the results of the three experiments, together with the aggregated grades of the students participating in this initiative, will be presented. Firstly, we will compare how students' perspectives have shifted between 2024, when the chatbot was originally introduced, and 2025. Afterwards, the final exam grades of the students will be introduced (2024 and 2025). Finally, the results of the crossover study will be presented and a preliminary hypothesis of possible factors and implications will be discussed. This section contains opinions directly quoted from the students. As such, the feedback received and incorporated in the various tables was not edited, with the exception of anonymizing the name of the AI assistant. Therefore, the reported students' opinions might include grammatical errors, typos, or syntactic inconsistencies. We chose not to correct these mistakes to preserve the originality of the feedback.

4.1 What motivated students to use the LLM-enhanced assistant?

In the experiments performed in the spring semesters of 2024 and 2025, students could voluntarily decide to perform the exercise with the chatbot or with the teacher. We administered a survey to the students who chose the chatbot in the spring semester of 2025.
In the survey, we asked the respondents to specify why they decided to volunteer to test the chatbot. The answers are presented in Table 1. From a thematic analysis of the open answers displayed in Table 1, the different drivers behind the choice to perform the exercise with the chatbot rather than with the physical teacher could be defined as: curiosity (C), lower stress (LS), the wish to help the authors develop the initiative further (H), and obtaining a transcript of the conversation (T). Most of the students (10 answers) express curiosity about the new technology and recognize that it is an innovative initiative. Some (6 answers) chose to interact with the chatbot because they felt less stressed than when interviewing a teacher; other students also thought that they would get a better experience, since a chatbot does not elaborate as much as a real person and they would not be put in the situation of having to interrupt a teacher to ask to continue to the next question. Three students wanted to help us develop the model so that more students could enroll in the course in the next years (currently, the limit is set to 120 students per year). Others (3 answers) thought that receiving the transcript of the interaction would be easier than having to take notes or listening to an audio recording. One answer presents other drivers (O), stating that the responses of the chatbot are less anecdotal and thus easier to interpret.

Table 1: Students' motivations for choosing the AI assistant. The chatbot's name was replaced by ChatGMP. The identified drivers are: curiosity (C), lower stress (LS), receive transcript (T), willingness to help (H), and others (O).

Why did you decide to do the audit with the AI assistant? | Driver(s)
A teacher said it was less anecdotal, easier to interpret. | O
We were curious of seeing how it would work with the ChatBot | C
The group decided so because we wanted to help in the ChatGMP development and also it was a WIN/WIN situation since we would also have an instructor present who would still be an auditee | H
To try out the technology, and have a more controlled environment for the audit. | C
I thought it would be a new experience to audit with ChatGMP than the regular ones in person. | C
Sounded innovative | C
The produced transcript would likely be easier to process than a verbal one. | T
It sounded interesting and I probably would've felt awkward essentially interviewing my professors. | C, LS
It seemed interesting to test it out. Moreover, it allowed us to take a bit of the pressure off compared to doing the audit with a lecturer/TA | C, LS
Wanted to try it out | C
Because I had heard about the project, and I know the course has a limited number of participants. I want to help remove this limit. | H
More comfortable asking ChatGMP than a teacher. | LS
Wanted to test something new. Also nervous to do the audit face to face with a teacher. | C, LS
Main reason was because it felt nice to have the whole transcript of the conversation, instead of having to listen through a recording to remember what was said during the audit. It was also a way of avoiding the awkwardness that probably would feel if I had to interrupt one of the teachers in the middle of an answer (that was not what we asked for). | T, LS
Because we feel it would be better to document the audit process and maybe less stress compare to audit with professors. | T, LS
To try something new | C
Our group wanted to help develop the chatbot so it can be implemented. We thought we would get a better experience since it doesn't elaborate as much as a real person. | H, C

4.2 Audit results: students' perspectives

4.2.1 Comparison between 2024 and 2025

[Figure 1: four panels, (a)-(d).]
Fig. 1: Responses to surveys investigating the perceptions of students that performed the exercise with a teacher or with the LLM-enhanced assistant. Results are compared between 2024 and 2025. Responses to the research questions (a) "How happy are you with your experience?" (RQ1); (b) "After having performed the audit, what do you think of the quality of the answers?" (RQ2); (c) "Would you recommend doing the audit with your auditee (teacher or chatbot) to other students?" (RQ3); and (d) "Do you think this is the future of the audit exercise in this course?" (RQ4). RQ1, RQ2 and RQ3 are on a Likert scale in the range 1-5, from lowest to highest. One response is excluded because it was not relevant.

The AI assistant was tested in the course with three groups volunteering for the first time in 2024. These preliminary results are presented in Caccavale et al. (2025). Since then, more data was collected and the database of the model was enriched. The results presented in Figure 1 show that there is an improvement in students' perspectives regarding the AI assistant. The mean and standard deviation are summarized in Table 2. The Mann-Whitney U test (Mann and Whitney 1947) results are presented in Table 3. For readability, and according to the year when the students were enrolled in the course, we will refer to the students that performed the exercise with a teacher as "Teacher_2024" or "Teacher_2025", and the students that performed the audit with the AI assistant as "AI_2024" or "AI_2025".

Table 2: Mean and standard deviation of the groups auditing the teacher and the AI assistant for RQ1, RQ2, RQ3 and RQ4. The results are reported for the experiments conducted in the spring semesters of 2024 (N_teacher = 14 vs. N_LLM = 13) and 2025 (N_teacher = 13 vs. N_LLM = 18) and for the crossover study (CO) performed in the fall semester of 2025 (N_teacher = 6 vs. N_LLM = 6).

Group | RQ1 avg.; std. | RQ2 avg.; std. | RQ3 avg.; std. | RQ4 avg.; std.
Teacher 2024 | 4.29; 0.59 | 3.79; 0.77 | 4.50; 0.73 | 4.29; 0.96
AI assistant 2024 | 4.23; 0.58 | 3.54; 0.75 | 4.00; 0.55 | 3.92; 1.00
Teacher 2025 | 4.69; 0.46 | 4.00; 0.55 | 4.92; 0.27 | 3.92; 1.00
AI assistant 2025 | 4.17; 0.83 | 3.67; 0.82 | 4.22; 1.13 | 4.44; 0.77
Teacher 2025 CO | 4.50; 0.50 | 4.33; 0.75 | 4.50; 0.75 | 4.67; 0.75
AI assistant 2025 CO | 3.17; 1.07 | 2.33; 0.75 | 2.50; 0.96 | 3.00; 1.41

Table 3: Mann-Whitney U test of the distributions of the two samples for 2024 (N_teacher = 14 vs. N_LLM = 13) and 2025 (N_teacher = 13 vs. N_LLM = 18). We also report the results for the crossover study (CO) (N_teacher = 6 vs. N_LLM = 6). The Mann-Whitney U test is performed for RQ1, RQ2 and RQ3, and the Chi-square test for RQ4. The differences are not statistically significant, except for RQ3 in 2025, and RQ2 and RQ3 in the 2025 CO.

Cohort | RQ1 U; p-val | RQ2 U; p-val | RQ3 U; p-val | RQ4 Chi2; p-val
2024 | 95.5; .823 | 112.0; .284 | 123.0; .092 | 0.3; .576
2025 | 158.5; .067 | 141.0; .281 | 175.5; .011* | 5.1; .077
2025 CO | 30.0; .055 | 34.5; .008** | 33.0; .014* | 4.5; .105

Regarding RQ1, there is an improvement, meaning that more students report being very happy (score=5) in 2025, with 39% compared to 31% in 2024. However, it appears that more students than in 2024 are also more unhappy (scoring 2 and 3 on the Likert scale). The same trend is also observed for groups that performed the exercise with a teacher, where students reported being happier in 2025 compared to 2024. Comparing the perspectives regarding the AI assistant in 2024 and in 2025 (second and fourth bars), there is a clear shift, where students appear to be more satisfied in 2025 about the quality of the answers (RQ2), whether they would recommend the chatbot as auditee (RQ3), and whether they think auditing the AI assistant is the future of the exercise (RQ4).
Interestingly, students that interacted with the AI assistant in 2025 are very certain (78%) that having an AI assistant is the future of the audit (RQ4). On the other hand, while 92% of the "Teacher_2025" group would very likely recommend their auditee (the teacher), compared to 61% in the "AI_2025" group (RQ3), they are less certain that auditing the teachers is the future of the exercise (RQ4). Table 3 presents the statistical tests performed to compare the distributions of the two groups (auditing the teacher and the chatbot). The only statistically significant result for the spring semesters is for RQ3 in 2025, where more students would highly recommend a teacher as auditee compared to the LLM.

The students were also asked to elaborate on their experience during the audit. Regarding RQ2, the integral feedback of the students who performed the audit with the AI assistant is presented in the Appendix, Table 13; the feedback of the students who performed it with the teachers is presented in Table 14. The "AI_2025" groups highlighted that overall it was a very good experience and the chatbot could answer the majority of their questions. They also stated that the answers seemed "real" and made them feel like they were auditing a real company, helping them to become familiar with the audit process. Other students found prompting to be challenging, since their questions had to be very specific. They also mentioned that sometimes the chatbot would provide repetitive answers. The "Teacher_2025" groups also found the audit experience to be good and realistic overall.

Table 4: Students' perspectives regarding the future of the auditing exercise. RQ4: Do you think this is the future of the audit exercise in this course?

If it increases the throughput then it would be the future, i do not know if i am qualified to answer this as i did not try a audit with a teacher.
i think you lack the human part. in some way it felt like just asking for documents.
I see the potential in using a chatbot to be able to work with more students, but I believe a lot of work needs to be done for it to feel close to a real audit
why, because more and more students will be able to take this very important course
It has a lot of potential, it saves time and it works pretty good. If it gets more training, it will be even better.
I think it is a good alternative and advanced option for the process of auditing.
If the chat gets improved, it could definitely be the future, but at this point then no.
Innovative
The pure efficiency and accuracy that comes with increased data - means that future audit exercises will continue to produce better and better results.
I think having the option to do either is advantageous, though it obviously requires more manpower.
I think it works fine and will open the course up to more students in the future. Also for its purpose as an exercise rather than an exam.
It is still missing the personal interaction part, that you would meet in a real audit.
This is the necessary future, if the course should be free of any participation limits, and I think it is well on the way. It was nice having the teaching assistant there, but not necessary at all.
A teacher may not always be available and reduces teacher's workload
More people can take part in the course if this is used. This could also be something open to students outside of this course in order to test and learn more.
With a little work I really think this could be a great way to do all the audits in the future.
It's good.
I don't think it's efficient
I think it would be really nice that everybody could take the course and people understand why they would have to use the chatbot.
However, they noticed that the teachers sometimes spoke too extensively or answered unasked questions to gain time, and they felt it was rude to interrupt them. They also felt some of the information requested was not provided (since this is part of their learning). Table 4 presents the opinions of the "AI_2025" groups. For comparison, the perspectives of the "Teacher_2025" groups are presented in the Appendix, in Table 15. The students think that the chatbot solution we developed is innovative and has the potential to become the future of the exercise, since this would allow more students to enroll in the course while maintaining the same level of educational quality and rigor. On the other hand, some students point out that the current chatbot version has some limitations that should be addressed in order to be able to fulfill its purpose. Additional feedback is included in Table 16.

4.2.2 Advantages and disadvantages of the LLM-enhanced assistant

Summarizing the feedback presented in the surveys (Tables 4, 13, 16) together with the opinions expressed by the students in the audit reports, we derive two main trends: (i) advantages of using the AI assistant; and (ii) issues related to this specific model. These points are summarized below.

Advantages according to students:

• The initiative is highly innovative, and many students cited "curiosity" as the driver behind their choice to interact with the AI assistant rather than with a teacher.
• All students should have the possibility to enroll in the course, while now there is an uptake limit of 120 people; therefore, the AI assistant could pose a valuable solution and minimize the impact of this limitation.
• Interacting with an LLM reduces the pressure and anxiety typically associated with preparing and performing the exercise for the first time. Students might feel less pressured asking questions to an AI assistant than to a teacher.
• Receiving the full transcript of the conversation. While the teams that audit the chatbot receive a transcript of the conversation, the other groups receive an audio recording, which they might find more cumbersome.
• The chatbot allowed students to use the time between questions to discuss within the team without pressure. During the audit, for each question, the students had time to review the answer and discuss in the group how to proceed. This might spur collaborative learning and team effort.
• The decreased pressure and stress also encourage students to read the answers before asking the next question, while in the audit with the teacher, the students tend to rush to the next question because they rely on the recording.

Given these advantages, they concluded that the LLM-based chatbot has the potential to become a very useful tool and can help save time. On the other hand, students also point out various problems, issues and challenges connected with the LLM-enhanced assistant.

Table 5: Average grade of the students performing the audit exercise with the AI assistant and the ones performing the audit with the teacher. Grades are reported according to the University's grading system, where the minimum passing grade is 2 and the maximum grade is 12.

Year | Avg. with AI assistant | Avg. total
2024 | 6.8 | 7.9
2025 | 8.7 | 9.1

Disadvantages and identified issues:

• Questions have to be phrased as simply as possible, since the LLM is not at the same level as the teacher in comprehending complex or related questions; i.e., it does not understand the similarity between words such as "document" and "procedure". This limitation comes from the relatively small size of the LLM employed, as correctness was prioritized over generative power when developing the tool.
• Since the model bases its knowledge on historical data, it is sometimes unable to answer niche or unseen questions.
• The model should be clearer about the certainty of the information given, instead of providing (factually correct but) unrelated answers.
• Sometimes the model generated identical answers to slightly different questions and repeated answers, providing the same information even when the students rephrased the questions.
• A commonly cited issue is that the AI assistant did not remember the whole conversation, so it was cumbersome for students to have to specify the context again for follow-ups.
• Another issue raised was the fact that the students could not "copy and paste" the questions.

Given these disadvantages and issues, many students concluded that, although the tool has potential, it should be improved before it is properly effective. Furthermore, students also provided some ideas for potential improvements, such as (i) including a virtual tour and images of the facility where the operations take place, and (ii) including voice recognition, so that the AI assistant could process oral questions (speech-to-text model) and read aloud the answers (text-to-speech model).

4.3 Students' performance

Table 5 presents the average final grade for the students performing the exercise with the AI assistant. The marks follow the Danish grading system (DTU.dk 2025b) and are publicly available (DTU.dk 2025a). Of note is that, while in 2024 the report students handed in after interacting with the LLM was graded and accounted for 25% of the grade, in 2025 it was only Pass/Fail; therefore the grade is disentangled from the report and thus from the interaction with the LLM. The results show a slightly lower grade average for the students who interacted with the AI assistant, compared to the total average. Generally, we see that students belonging to the same groups (who did the audit and wrote the report together) tend to receive the same final grade.
However, this is not a strict indication of the grades, since one group presented a student with the maximum grade and one with the minimum. This may suggest that the AI assistant provided all the necessary information for the groups to succeed in the learning objectives of the exercise and in writing a good report; however, the final grade depends on individual study. We do not have other grade metrics that would allow us to characterize the students further, such as an overall grade average for their entire education. Therefore, this might be a pattern that highlights the type of students who choose to perform the audit with the chatbot (e.g., perhaps students who aim for the top grade would tend to select the teacher rather than testing a new tool). However, this hypothesis needs further investigation and no conclusions can be drawn at this stage, neither that the chatbot led to lower grades nor that higher-achieving students tend to choose to audit the teacher.

4.4 Incorporating students' feedback: improving the GUI

Following the feedback received in the spring semester of 2025, to improve the general experience of the students, a few modifications and additional functionalities were implemented: (a) improved the GUI of the model, as presented in Figure 2; (b) embedded a speech-to-text function, so that students could record their questions to the AI assistant instead of typing; (c) embedded a text-to-speech function, so that the response generated by the chatbot would be read out (users can decide whether to turn off this option by pressing the "Speech: OFF" button, and which voice they prefer to hear, in case the speech option is on); and (d) allowed students to "copy and paste" their questions, to save time and avoid penalizing slow-typing students.
4.5 Results of the crossover study

After incorporating some of the students' feedback received from the previous iteration of experiments, we tested the model again in a crossover study. The results of this experiment, conducted in the fall semester of 2025, are presented in Figure 3, as well as summarized in Table 2. The results appear to be quite different from those obtained during the spring semesters of 2024 and 2025. Here, in fact, students seem to prefer the interaction with the teacher, finding the quality of the responses (RQ2) to be noticeably higher (on average 4.50 compared to 3.17 with the AI assistant), and they are less sure that the chatbot can effectively replace the teacher (RQ4) moving forward (4.67 compared to 3.00). Table 3 shows the results of the statistical tests performed. In the crossover study, results are statistically significant for RQ2 (p-value < 0.01) and RQ3 (p-value < 0.05). Overall, Table 6 shows that 66.7% of students preferred the audit with the teacher, while 33.3% stated that both were good. Further opinions clarifying their reflections are provided in Table 7. Here, students find the answers given by ChatGMP to be useful for finding topics and a good tool for students who just want a quick and easy non-confrontational way to get answers. However, they recognize that the audit with the teacher was quite close to a real-life audit.

Fig. 2: New graphical interface of the AI assistant.

Table 6: Perceptions of the students regarding the different auditees.

Did you prefer the audit with the teacher or chatbot?
Prefer the teacher: 66.7% | Prefer the chatbot: 0% | Both were good: 33.3% | Neither was good: 0%

Table 8 presents a comparison of the topics in which each auditee was particularly good or bad.
Students find the teacher to be better at providing rich and more nuanced answers (and at "improvising"), although they find this setting to be quite time-intensive and, as a result, they do not manage to exhaust their list of questions. On the other hand, the chatbot is good at providing documents and fast answers, but sometimes struggles to understand the questions if they are complex or unseen. Table 9 presents a further comparison between the teacher and the chatbot. One of the students states that the "two auditees together actually complement each other well, as the [teachers] provide context, accountability, and access to real records, while the ChatGMP auditee accelerates scoping, and reaching requested documents". This is also supported by another opinion reported in Table 7, where a student suggested that, given the different nature of the experience, both audit experiences should be offered, since many other students might prefer the AI interaction.

Fig. 3: Responses to surveys investigating the perceptions of students that performed the exercise with a teacher or with the LLM-enhanced assistant. Results are reported for the crossover study performed with six students, divided in two groups of three students each, in the fall semester of 2025.

Validating some of the feedback collected in 2024 and 2025, we wanted to test whether students found the audit with the AI assistant less stressful than with a teacher. Figure 4 shows that most students find the audit with the teacher to be more stressful, with an average of 2.00 (standard deviation 0.58) on the Likert scale, compared to 1.83 (standard deviation 1.07) with the AI assistant. However, the results are not statistically significant (p-value = 0.485). Finally, as mentioned in Section 4.4, based on the students' feedback, we have also integrated text-to-speech and speech-to-text into the AI assistant.
We asked students whether they tried the functions and if they were useful; the results are illustrated in Figure 5. It seems that the students preferred to type rather than speak to the AI assistant, thus not finding the function useful. We argue that this might be because they had prepared their questions in advance and could copy and paste them.

Table 7: Students' reflections in the crossover study.

RQ2, on the teacher:
• Some answer where good, some answers a bit too theoretical. We could benefit from a batch report being available as documentation.
• The answers were meant to guide us away from missing or inadequate information, which could be realistic; for our team it was difficult to cut off the answers and go back on track.

RQ2, on the AI assistant:
• The ChatGMP answers was useful to locate topics, but most responses were generic and not suitable as primary evidence. In other words, the answers were informative but incomplete/sufficient to flag risks, not sufficient to demonstrate robust, effective GMP control without follow-up documentation and validation.
• It can clearly be seen that there is a big difference between human interaction and AI. For example there were cases that if asked to elaborate on a question it would get confused or it would not understand a question asked.

RQ4, on the teacher:
• Teacher is better at giving longer answers and explaining the documents. This is also more fun than ChatGMP.
• Gives a more authentic audit experience. I think it should be offered along with ChatGMP for those that want it. I'm sure you will get many that prefer ChatGMP.
• Compared to Chat, the teacher gave more interesting / elaborate answers, but considering the workload, ChatGMP is a better option to open up the course to others
• Overall the audit was quite close to a real life audit and I think that for students that do not know what an audit is it is a very nice introduction to the topic.

RQ4, on the AI assistant:
• The AI assistant is a good tool to use for students who don't need the physical experience or just want a quick and easy non confrontational way of getting answers.
• Using ChatGMP in a time limited situation, makes us just fire off all our questions quickly not reading the responses or asking deeper follow up questions.
• It is able to generate fine answers based on the transcript received from previous audits, but it probably needs some more diverse knowledge / more transcripts to learn from.
• As it is a nice idea to have an AI, I do not believe that this is the future of an audit as it requires human interaction. There are a lot of topics that an AI could not answer or understand and in general it would be difficult to implement real life as some questions might be unexpected and the auditors will not receive the answers they are looking for.

4.6 Teachers' feedback

In addition to performing audits with students, the tool could also be used to train new teachers who join the course and need to prepare for the audits. The teacher was given access to the chatbot, so that they could train with the help of the AI assistant. The teacher provided the following feedback: "I enjoy the tool a lot, and it helped me initially to verify if my answers make sense. The only downside was the waiting time, which wasn't too bad. I also liked how easy it was to install it on my computer! I see it as the future of the audit exercise and would like to see it developing in the future (with videos etc.)".

Table 8: Students' feedback on topics where the auditee (teacher or AI assistant) was either particularly good or bad at handling.

Topics the teacher was particularly good at:
• GMP rules
• Rich context and nuance; there was the possibility to read body language, push back, and pivot in real time. Besides quick access to primary records (screenshares, binders) and on-the-spot walk-throughs.
• Presenting mix of procedures [...], records [...], and Management Review made the exercise feel realistic, and all documents examples were good. Overall the audit interview was really nice and had a nice flow as well as presenting the documentations.

Topics the AI assistant was particularly good at:
• Giving documents.
• was relatively fast and consistent
• was good for scoping, checklists, and clause mapping. good in term of reduce note-taking load. easy to locate topics and draft follow-ups. No scheduling friction
• AI was very good in regards to documentation and I believe that that could be a great focus area if it is planned to be used in companies. It was very nice to receive the documents so fast so a database with all company documents could be built in so when inspectors come they can get any documents they wish on the spot.

Topics the teacher was particularly bad at:
• Process knowledge
• Time-intensive; answers may be long.
• [...] i think the weakest part was evidencing: too many answers were narrative without primary records, some documents were low-resolution, which made clause-level verification hard. The discussion basically leaned on narrative discussion without first modeling what "good evidence" and a defendable finding look like. the discussion missed a couple of "checkpoints" where i think teacher can give introduction before interview to good interview technique.
• Not all information was received stating that it is confidential [...] but in real life I am not aware that we could say to an inspector that we cannot give a certain information as they would consider that the company is hiding something from them causing more problems.

Topics the AI assistant was particularly bad at:
• Understanding things for which it was not specifically programmed.
• Responses sometimes were generic
• there were a lot of questions unanswered or AI did not understand the question.
5 Discussion

This section aims to contextualize the results presented in this study in light of the general landscape of AI in education, encompassing the opportunities, challenges, and future directions that LLM-enhanced teaching and learning could bring.

5.1 AI-assisted collaborative learning: takeaways of the study

This study focuses mainly on LLM-enhanced collaborative learning, discussing a custom AI assistant developed for an M.Sc. course. The results highlight various trends regarding the use of LLMs in Higher Education.

Table 9: Students' feedback on the main differences between performing the audit with the teacher and with the chatbot. Can you elaborate on the main differences between performing the audit with the teacher and the chatbot?

• flow, time consumption from ChatGMP without results
• The teacher part was more fun and was heavier on role play. Easier to find weaknesses in the answers and process. With ChatGMP we had better pace as it had fix answers that we just skimmed through and either followed up or moved on.
• The teacher gave long, more elaborate stories, but it was longer answers therefore we were able to ask less questions.
• I think using the two auditees together actually complement each other well, as the managers provide context, accountability, and access to real records, while the ChatGMP auditee accelerates scoping, and reaching requested documents.
• Easier flow during the interview, more elaborate and clear answers and the answers were understood by the teacher while not all of them were understood by AI.
• there was no flow and the feeling of real audit in the chatbot

Fig. 4: Stress perception of the students when performing the audit with the teacher and AI assistant as auditees. Results are on a Likert scale between 1 (not stressful at all) and 5 (very stressful). Average stress with the teacher is 2, with the AI assistant 1.8 (p-value = 0.485).

Fig. 5: Students' perception of the speech-to-text and text-to-speech implementations in the chatbot. Results of the second graph are presented on a Likert scale from 1 (not useful at all) to 5 (very useful).

Firstly, among the aspects that motivated the students to use AI was curiosity: they thought the initiative was interesting and innovative, and they wanted to test the developed tool first-hand. They found the interaction with an LLM to be less stressful than interviewing a teacher. They also wanted to help us with the development and evaluation of the tool, since this tool would remove the limitation on yearly student uptake for the course. Finally, receiving a written transcript of their conversation rather than an audio recording was also one of the main drivers. Although these motivations are intrinsic to this specific application and cannot necessarily be separated from the course, they do suggest some general trends that may also apply to other domains. Overall, students participating in the course in the spring semesters of 2024 and 2025 were happy with the tool and the quality of the answers, and an increase in satisfaction was observed in 2025. This is also evident in Figure 6, where more students think that performing the exercise with the AI assistant will be the future of the course. On the other hand, the results presented in Table 5 highlight an issue that deserves more attention and further evaluation. Currently, we do not have enough data to decouple performance from AI usage. Whether there is any correlation, or whether the choice of using AI is in itself an indication of performance, should be further investigated.
In this study, we conclude that although there is a tendency for better grades among students who performed the audit exercise with a teacher, the AI assistant provided the necessary information for the groups to succeed in the learning objectives of the exercise and in writing a good report.

The difference between the perceptions of the students who audited a teacher and the AI assistant is shown in Figure 6. Here, a positive bar means that the students' ratings favored the teacher, and a negative one that they preferred the chatbot. The graph shows that, while in the normal course (2024 and 2025) the difference is relatively small, the crossover study leads to more polarized opinions. This shift could be due to different reasons. First, the cohort of students is different. The chatbot was developed for an M.Sc. course according to its specific learning objectives. In this course, students have no previous background in the subject, and therefore the audit is their first practical exercise in this area. This group of students usually likes the novelty introduced by the exercise, and they find it to be an innovative and effective way to test their understanding of the subject (since the teacher or chatbot would not provide them with the right documents if they are unable to ask for them correctly or be specific enough). Although they receive information during the audit, the deeper learning happens afterwards, when students have time to analyze and discuss the documents and detect the non-conformities and errors. On the other hand, the students performing the audit in the fall semester of 2025 attend a more practical and professionally oriented M.Sc. course, where, as part of their studies, they are also employed half-time in a company.
They have previous knowledge of the subject and they have also conducted audits in real-life situations before; therefore, the expectations and prior knowledge of this cohort are much higher compared to the students attending the course in the spring semester. Even the teacher performing the audit admitted to having been more challenged by this group. This is evident from some of their feedback, as presented below.

Fig. 6: Difference between the average evaluation for the four research questions given to the teacher and AI assistant as auditee. The difference is positive if students evaluated the experience with the teacher higher, and negative if they preferred the AI assistant.

• "I did not particularly like the part that this was only 45 minutes. I think it would have been nice to have more time with the teacher to discuss the questions more in depth, as we would think so much about the time that the real life experience of this (such as have a more elaborate discussion about a particular question)."
• "Usually the documentation part is done separately from an audit interview as the auditees will have time to go through the documents received and ask additional questions. Maybe in the future it would be nice that the students learn the meaning behind correct documentation and how to see for example missing data, mistakes in reporting data etc. that they could identify and ask additional questions."
• "Overall, it is a very impressive attempt to use AI for an audit but for real life comparison I do believe that 45 minutes is far from enough. Of course we also had a lot of questions we wanted answered and we got carried away so we would not go fully through the answers while on a real life interview you cannot avoid that. A nice test would be also for example have 10-15 main questions and start with the teacher an interview. Note down all additional questions asked during that interview and then follow exactly the same interview with ChatGMP. Time should not be taken into consideration for the cases and could be one of the things for comparison as well. Then the pressure of time is taken away leaving more focus to the answers received and then getting a more precise evaluation of the whole exercise."
• "Provide more realistic documents, so students judge evidence, not just narratives."

In summary, the crossover study also leads to interesting results: although the AI assistant was not developed for the cohort chosen for this investigation, and therefore we cannot assume that the results will be comparable to those of the other groups (the students enrolling in the spring semesters of 2024 and 2025), the feedback is quite polarized. Thus, it provides many insights into the students' experience that will be incorporated to improve the model. For example, since the students performing the crossover study already had experience with the subject taught and with auditing in general, they were more critical towards the limitations of the AI assistant, citing the fact that it was not able to handle complex questions and that the experience lacked human interaction. This led us to re-evaluate the approach we are currently using; the model will be further improved before being used again in the spring semester of 2026. More details on the improvements we intend to make are discussed in Section 6.

5.2 Maximizing correctness vs. generative power trade-off

A key learning from these three experiments is that choosing a smaller (and therefore less powerful) model to prioritize correctness and reduce hallucinations might not be the best strategy going forward, given the constant release of better models and the increased ease with which students interact with these AI-based tools.
When we first used the model in 2024, students were not as used to interacting with commercially available chatbots such as ChatGPT. They tried them for fun, but it was not until more recently that they started to use them more consistently for educational purposes. In a previous study published in 2024¹, when interviewed about the possibility of having an AI assistant in their course, a student mentioned that the LLM "should definitely be better than what ChatGPT is now, if the level would be the same, the quality and performance would be low, in my opinion". This might suggest that the students did not consider it reliable in its earlier versions. However, reflecting on their experience after the audit with the teacher in 2025, students often mentioned that the AI assistant was not as good as ChatGPT, highlighting its recent improvements. Therefore, students nowadays seem to have become accustomed to interacting with LLMs and might thus have higher expectations regarding the quality of the answers, the speed, and the general language. However, this does not necessarily mean better quality of learning. More powerful LLMs might give the illusion of better knowledge provided, since the language might be more fluent than in previous models and the amount of information provided might be more extensive. However, hallucinations still pose a great risk in, among others, educational settings. Therefore, more research should be done on the safe and correct use of LLMs in education, discussing the learning outcomes of choosing to adopt smaller models that provide only correct information, even if not properly pertinent to the asked question (as done in this study), or more powerful models, with a higher risk of hallucinations.
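The trade-off described above stems from the retrieval-augmented generation (RAG) design mentioned in the abstract: the assistant answers from retrieved course material rather than free generation. The paper does not publish its implementation, so the toy sketch below is only an illustration of the general idea; the documents, the TF-IDF retriever, and the refusal threshold are all our assumptions. The key behavior shown is refusing to answer when nothing sufficiently similar is retrieved, which trades generative power for correctness.

```python
import math
from collections import Counter

def tfidf_index(docs):
    """Build toy TF-IDF vectors over whitespace tokens (illustration only)."""
    n_docs = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log(n_docs / df[w]) + 1 for w in df}
    vecs = [{w: c * idf[w] for w, c in Counter(toks).items()}
            for toks in tokenized]
    return vecs, idf

def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(question, docs, vecs, idf, threshold=0.15):
    """Return the best-matching passage, or None when the best similarity
    is below the threshold, i.e. refuse rather than risk a made-up answer."""
    qvec = {w: c * idf.get(w, 0.0)
            for w, c in Counter(question.lower().split()).items()}
    best_score, best_doc = max(
        ((cosine(qvec, v), d) for v, d in zip(vecs, docs)),
        key=lambda t: t[0])
    return best_doc if best_score >= threshold else None

# Hypothetical audit-transcript snippets, NOT the course's real data:
docs = [
    "cleaning validation procedure for the mixing tank follows sop 12",
    "batch records are reviewed and signed by quality assurance",
    "personnel training logs are updated after every gmp course",
]
vecs, idf = tfidf_index(docs)
```

A real pipeline would use learned embeddings and pass the retrieved passage to an LLM as grounded context. Note that a lexical retriever like this one cannot connect near-synonyms such as "document" and "procedure", echoing the limitation the students observed; the refusal threshold is likewise what produces "correct but not pertinent" answers instead of hallucinations.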
5.3 LLMs for teaching and learning: challenges, opportunities, and future directions

There are many challenges hindering the widespread adoption of LLM-based educational technologies, regarding both teaching and learning. However, these challenges can lead to as many opportunities and future directions. We expand on these in teaching and learning by combining the key takeaways gained from this study with the background provided in Section 2. Table 10 shows what we believe to be the main challenges of using LLMs in teaching, followed, respectively, by the opportunities they introduce and the potential future directions they open up. Table 11 uses the same approach, but focuses on the challenges, opportunities, and future directions of LLMs in learning.

¹ The study will be referenced upon acceptance to avoid including information about the authors.

Among the challenges concerning teaching, Yan et al. (2024) discuss the need for a human-centered approach in the design and implementation of these technologies. Furthermore, they stress that the majority of LLM-based innovations lack transparency beyond AI researchers, with minimal involvement of educational stakeholders in their development and evaluation. Additionally, they argue that LLMs generally demonstrate high performance in simpler classification tasks and promising results in prediction and generation, while their effectiveness in complex educational scenarios is still developing. There is a need to update the curricula to reflect new additions and include AI education, redesign assignments to promote higher-order thinking, and train teachers to be able to face the new challenges introduced by AI (Zhang et al. 2024; Idris et al. 2024; Caccavale et al. 2024).
Other concerns for teaching include the lack of reproducible studies, the fact that state-of-the-art (SOTA) research might not be applicable to the particular educational context, and the low technological readiness of universities.

A major challenge that should be further investigated is the effect that LLMs have on learning outcomes and the potential consequences of overdependence (Abd-Alrazaq et al. 2023). In fact, as also highlighted by Xu et al. (2024), over-reliance on LLMs could harm learning by hindering critical thinking and reasoning skills. As discussed in Section 2, ethical concerns, transparency, privacy, and accountability are fundamental issues to be addressed and regulated before using LLMs in education, as they could have a negative impact on learning and potentially inhibit students' rights (Yan et al. 2024). On the other hand, it is also important to educate students on the safe and ethical use of LLMs in order to prevent them from bending the rules of academic integrity, since these tools introduce the potential for plagiarism and cheating (Perkins 2023; Kovari 2025). Other issues which need to be addressed to safeguard learning concern equality and accessibility, given that the financial burden of LLMs and the limited access to infrastructure and technology could aggravate inequity of learning and access for various student demographics (Jin et al. 2023; Navigli et al. 2023). Finally, the dominance of English-based models could exacerbate inequalities in language proficiency and accessibility (Helm et al. 2024).

6 Future perspectives

The underlying pre-trained LLM will be upgraded to one of the latest released models, to improve the quality of the generated responses and therefore the overall user experience. More data will be curated and added to the dataset, improving the current background knowledge of the model.
One of the most distinctive caveats of this study is the lack of human interaction when using the AI assistant. To address this issue, the next releases of the model will also include the development of a desktop and virtual reality experience in which students will be able to interact with an avatar, giving students the possibility to choose between the avatar and the chatbot medium.

Table 10: Challenges, opportunities, and future directions of LLMs in teaching.

• Challenge: LLMs are continuously improving, so keeping up with AI developments can be hard for educational research.
  Opportunity: Constant improvement of LLMs means that increasingly more tasks (text-to-speech, video generation) could be automated with AI.
  Future direction: Educators should recurrently set aside time to investigate new AI developments and how they can be used to improve teaching.

• Challenge: Ethical concerns, transparency, privacy, and accountability are mandatory issues to be addressed before rolling out LLMs in educational contexts (Yan et al. 2024).
  Opportunity: LLMs can help with a number of tasks, e.g., identifying content gaps, suggesting learning objectives, helping with creating new content, and assisting with assessment and evaluation.
  Future direction: Establish clear guidelines for using LLMs in education, addressing ethical concerns, accountability, privacy, and academic integrity, towards human-centric AI (Yan et al. 2024).

• Challenge: Integrating AI and other innovative technologies into education is a complex challenge; learning objectives and teaching activities need meticulous adjustments (Obidovna 2024; Aggarwal 2023).
  Opportunity: Redesign the curriculum to reflect current advancements and leverage new technologies, automating repetitive tasks, increasing engagement, and saving time.
  Future direction: Update curricula to reflect new additions and include AI education; redesign assignments to promote higher-order thinking; train teachers on ethics in AI (Zhang et al. 2024; Idris et al. 2024; Caccavale et al. 2024).

• Challenge: Accuracy and reliability are major limitations of LLMs. Ensuring correctness of responses can be difficult due to the uncertainty of the information (Augenstein et al. 2024).
  Opportunity: Extract foundational knowledge of LLMs through prompt engineering. Investigate uncertainty in LLMs for deeper understanding.
  Future direction: Use an iterative approach to rigorously validate the models, extract the uncertainty of generations, and increase explainability and fairness.

• Challenge: Lack of replicability, as many studies fail to provide sufficient details and open-source code and data, making it difficult for educators to validate and adapt these innovations.
  Opportunity: Many libraries and tutorials on how to use LLMs in general scenarios can lead to the democratization of LLMs, making it accessible for non-AI experts to leverage these models in educational settings.
  Future direction: The community should share code and data or, if not possible, provide detailed explanations of how to replicate the work and establish similar procedures in their own applications.

• Challenge: Published research and the SOTA might not generalize to other pedagogical applications, due to cultural, institutional, and technological differences of the specific university ecosystem.
  Opportunity: Collecting and analyzing a variety of opinions can lead to a better understanding of social differences, if the data is correctly analyzed.
  Future direction: Perform cross-university and cross-country research to understand generalizable aspects and isolate culture-specific elements.

• Challenge: The analyses used to evaluate LLMs often rely on biased methods or subjective judgments, which can introduce variability in the results due to individual biases.
  Opportunity: Investigate how different factors (e.g., native language, culture) affect AI perception and proficiency. This can limit biases and tailor LLMs to specific needs and cultures.
  Future direction: Focus on developing more robust and well-established metrics, on how to benchmark LLMs, and on limiting subjective biases.

• Challenge: Most innovations remain in early development (Yan et al. 2024).
  Opportunity: Perform new research that aims to contribute to the establishment of LLMs in education.
  Future direction: Establish ways-of-working on how to embed and validate LLMs in authentic educational settings.

• Challenge: The generational gap between educators and students might lead to suboptimal design decisions.
  Opportunity: Involving students as co-designers of the tools can lead to an overall better experience and learning improvement.
  Future direction: Co-participatory design experiments to involve students in the implementation of LLMs.

Table 11: Challenges, opportunities, and future directions of LLMs in learning.

• Challenge: Democratization and increasingly widespread use of AI tools could lead to more individualized learning, since students might start to rely more on AI than on teachers and peers.
  Opportunity: Students can use LLMs to find more content related to their studies, find tutorials and additional material, and in general find more accessible content.
  Future direction: Provide assistance to students on how to safely use AI and embed it in group exercises to help students reflect on benefits and limitations, to be addressed through discussions in the classroom.

• Challenge: Over-reliance on LLMs could harm learning by hindering critical thinking and reasoning skills (Abd-Alrazaq et al. 2023; Xu et al. 2024).
  Opportunity: Ability to use LLMs in various learning settings, such as to quickly simplify complex information and to provide prompt feedback.
  Future direction: Teach students how to use LLMs correctly, emphasizing critical evaluation of AI-generated content and responsible usage (Nguyen 2025).

• Challenge: The AI experience might feel dehumanizing for students if they can only chat with a bot without personalization or human interaction.
  Opportunity: LLMs can be used to engage students through virtual simulations and interactive experiences, and to enable personalized study plans and learning materials tailored to individual student needs.
  Future direction: When using LLMs in the classroom, plan more stimulating scenarios and simulations so that the experience does not feel dehumanizing or impersonal.

• Challenge: LLMs might make it easier for students to bend the rules of academic integrity, introducing potential for plagiarism and cheating.
  Opportunity: Access to information, summarization, and generation of ideas is much more available to students, as well as writing assistance.
  Future direction: Provide students with clear guidelines on how to responsibly use LLMs in their coursework, addressing ethical and moral concerns, accountability, and academic integrity.

• Challenge: Privacy issues are largely unaddressed, particularly concerning informed consent and data protection when fine-tuning LLMs with student data (Nguyen et al. 2023).
  Opportunity: Personalization of learning could lead to deeper knowledge gain and improved performance.
  Future direction: When using LLMs with students, be transparent about what data is collected, who collects it, and how it is used.

• Challenge: The financial burden of commercial LLMs, as well as limited access to infrastructure and technology, could aggravate inequity of learning and access for various student demographics.
  Opportunity: Development of distilled LLMs could facilitate and speed up learning in low-income countries.
  Future direction: Deploy distilled models in lower-income, less technology-proficient countries.

• Challenge: The dominance of English-based models could exacerbate inequalities in language proficiency and accessibility.
  Opportunity: Understand how native speakers of different languages learn and whether there are substantial differences.
  Future direction: Increased focus on multilingual and monolingual language models for non-English languages.

7 Limitations

This study is subject to several potential limitations. A summary of the identified limitations is as follows: (i) the analysis draws on a relatively small sample; (ii) the findings may have limited generalizability because they are tied to a specific teaching context; and (iii) institutional structures and the particular technological ecosystem of the university where the study took place may constrain the extent to which these results can be transferred to other educational settings.

Furthermore, it is noteworthy that rapid advances in generative chatbots and evolving educational policies may influence the future relevance and validity of the results presented in this work. We recognize that the LLM used in this study is not the latest and most powerful and therefore has architectural constraints. Many models introduced in late 2025 can generate higher-quality outputs and are overall more capable.
However, our choice of model was constrained by project resources, including both budget and GPU availability, and thus represented a reasonable compromise between fast inference and the provision of reliable and safe responses for students. Finally, the analysis relied on subjective judgments, which can introduce variability in the results due to personal opinions or preferences, as well as individual biases and levels of expertise. This inherent subjectivity underscores the breadth of possible perspectives, revealing diverse viewpoints and prompting researchers to consider their potential implications.

8 Conclusions

This study examined a customized LLM-based assistant in an M.Sc. course. The model was validated across several experiments employing a mixed-methods iterative approach across three distinct experimental setups (spring 2024, spring 2025, and fall 2025 crossover study). We first compared the performance of the LLM between the spring semester of 2024, when it was initially introduced into coursework, and the spring semester of 2025. After integrating a portion of the feedback gathered during this period, the model was refined and subsequently assessed in greater depth through a crossover study conducted in the fall semester of 2025.

The results provide significant insights into the integration of LLMs in Higher Education, particularly in collaborative learning settings. Our research highlights several critical trends regarding LLM use in education. Students were primarily motivated by curiosity and the desire to use an innovative tool. We found that they viewed the AI interaction as less stressful than interviewing a human teacher, which is a key psychological finding. Practical benefits also strongly drove adoption, including receiving a written conversation transcript and the AI's role in removing course enrollment limits.
These findings suggest that addressing psychological barriers and offering concrete utility are powerful levers for LLM adoption in education.

Overall student satisfaction with the AI assistant and the quality of its answers increased from 2024 to 2025, a trend that aligns with the growing perception that the AI assistant represents the future of the course exercise. A significant finding is the dramatic shift in student expectations regarding LLM performance. Students in later cohorts (2025) began to compare our custom AI assistant against high-performance commercial models (like ChatGPT), finding the custom model lacking in fluency and complexity. This indicates that the required quality threshold for educational LLM tools is constantly being elevated by external technological advancements.

The crossover study, involving more experienced students, yielded polarized and highly critical feedback. These students, possessing prior experience in the subject, exposed the limitations of the chatbot, citing its inability to handle complex questions and the overall lack of human interaction. This critical input was invaluable for identifying specific weaknesses and will guide our model refinement.

The study also identified a crucial pedagogical trade-off between prioritizing smaller, correctness-focused models (which may disappoint users due to lower fluency and limited scope, as we observed) and larger, more powerful models (which satisfy user expectations but carry a higher risk of hallucinations). Future work will focus on model refinement and, crucially, on investigating the learning outcomes associated with different LLM adoption strategies, specifically researching the safest and most effective use of LLMs in educational settings while minimizing the risk posed by hallucinations. We intend to further improve the model before using it again in the spring semester of 2026.
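One simple way to operationalize this correctness-versus-fluency trade-off in a retrieval-augmented assistant is to generate an answer only when the retrieved context is sufficiently similar to the question, and to defer otherwise. The sketch below is illustrative rather than the method used in this study; the threshold value and the `fake_retrieve`/`fake_generate` stand-ins are assumptions.

```python
# Illustrative hallucination guard for a RAG assistant (not this study's
# implementation): generate only when retrieval confidence is high enough.
SIMILARITY_THRESHOLD = 0.35  # assumed value; would need tuning on course data

def answer(query, retrieve, generate, threshold=SIMILARITY_THRESHOLD):
    context, score = retrieve(query)  # best-matching document and its score
    if score < threshold:
        # Defer instead of risking a hallucinated reply.
        return "I am not certain about this; please ask the teacher."
    return generate(query, context)   # generation grounded in retrieved text

# Toy stand-ins to demonstrate the control flow.
def fake_retrieve(q):
    doc = "GMP requires documented training records for all operators."
    return (doc, 0.9 if "training" in q else 0.1)

def fake_generate(q, ctx):
    return f"Based on the course material: {ctx}"

grounded = answer("Where are the training records kept?", fake_retrieve, fake_generate)
deferred = answer("What is the meaning of life?", fake_retrieve, fake_generate)
```

A smaller model behind such a guard trades fluency for safety, which matches the correctness-focused end of the trade-off described above.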
Finally, taking inspiration from our first-hand experience with implementing and testing LLMs in Higher Education, we discuss the challenges, opportunities, and future directions that we see in the foreseeable future of AI in education, reflecting on its implications for both teaching and learning.

Statements on ethics

Informed consent was obtained from all participants and their privacy rights were strictly observed. Before participating, all participants were informed that the collected data would be used for research purposes and would be handled anonymously. Detailed information on the study was provided to the participants before starting the survey, reassuring them of their right to voluntary participation and withdrawal.

Availability of data and materials

This study is committed to open-source principles. The code is freely available on GitHub at https://github.com/FiammettaC/ChatGMP. Here, we also include a demo of the model that readers are free to test. The code is published under the MIT license. Currently, it is compatible with Python 3.x, for Windows and Linux operating systems. It is also recommended to use a conda environment to run the code. Other data can be obtained by sending an email request to the corresponding author.

Funding

Funding for this study was provided by (i) the European Union under the Horizon Europe research and innovation program, grant agreement 101159993, within the HORIZON-WIDERA-2023-ACCESS-02-0 call (Dig4Bio); (ii) the Novo Nordisk Foundation, grant number NNF22OC0080136 - Real-time sustainability analysis for Industry 4.0 (Sustain4.0); and (iii) the Technical University of Denmark (DTU).

Competing interests

The authors have no competing interests to disclose.

Acknowledgments

The authors thank the students who volunteered to test our AI assistant and helped us further develop the initiative.
References

Abd-Alrazaq, A., AlSaad, R., Alhuwail, D., Ahmed, A., Healy, P.M., Latifi, S., Aziz, S., Damseh, R., Alrazak, S.A., Sheikh, J.: Large language models in medical education: opportunities, challenges, and future directions. JMIR Medical Education 9(1), 48291 (2023)

Augenstein, I., Baldwin, T., Cha, M., Chakraborty, T., Ciampaglia, G.L., Corney, D., DiResta, R., Ferrara, E., Hale, S., Halevy, A., Hovy, E.H., Ji, H., Menczer, F., Míguez, R., Nakov, P., Scheufele, D.A., Sharma, S., Zagni, G.: Factuality challenges in the era of large language models and opportunities for fact-checking. Nat. Mach. Intell. 6, 852–863 (2024) https://doi.org/10.1038/s42256-024-00881-z

Aggarwal, D.: Integration of innovative technological developments and AI with education for an adaptive learning pedagogy. China Petroleum Processing and Petrochemical Technology 23(2), 709–714 (2023)

Caccavale, F., Gargalo, C.L., Gernaey, K.V., Krühne, U.: Towards Education 4.0: The role of large language models as virtual tutors in chemical engineering. Education for Chemical Engineers 49, 1–11 (2024)

Caccavale, F., Gargalo, C.L., Kager, J., Larsen, S., Gernaey, K.V., Krühne, U.: ChatGMP: A case of AI chatbots in chemical engineering education towards the automation of repetitive tasks. Computers and Education: Artificial Intelligence 8, 100354 (2025)

Chiu, T.K.: Future research recommendations for transforming higher education with generative AI. Computers and Education: Artificial Intelligence 6, 100197 (2024)

Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., et al.: Scaling instruction-finetuned language models. Journal of Machine Learning Research 25(70), 1–53 (2024)

Cai, Z., Park, S., Nixon, N., Doroudi, S.: Advancing knowledge together: integrating large language model-based conversational AI in small group collaborative learning. In: Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, pp. 1–9 (2024)

DTU.dk: 28855 GMP og kvalitet i farmaceutisk, biotek og fødevareindustri - teoretisk version, Sommer 2025. https://karakterer.dtu.dk/Histogram/1/28855/Summer-2025. [Online; accessed 29-October-2025] (2025)

DTU.dk: Grading system of Danish higher education. https://studyabroad.dtu.dk/english/-/media/subsites/study_abroad/dokumenter/karakterer_videregaaende_en.pdf. [Online; accessed 29-October-2025] (2025)

Gkintoni, E., Antonopoulou, H., Sortwell, A., Halkiopoulos, C.: Challenging cognitive load theory: The role of educational neuroscience and artificial intelligence in redefining learning efficacy. Brain Sciences 15(2), 203 (2025)

Gan, W., Qi, Z., Wu, J., Lin, C.-W.: Large language models in education: Vision and opportunities. 2023 IEEE International Conference on Big Data (BigData), 4776–4785 (2023) https://doi.org/10.1109/bigdata59044.2023.10386291

Helm, P., Bella, G., Koch, G., Giunchiglia, F.: Diversity and language technology: how language modeling bias causes epistemic injustice. Ethics and Information Technology 26(1), 8 (2024)

HuggingFace: Hugging Face. https://huggingface.co/. [Online; accessed 09-October-2025] (2016)

Idris, M.D., Feng, X., Dyo, V.: Revolutionizing higher education: Unleashing the potential of large language models for strategic transformation. IEEE Access 12, 67738–67757 (2024) https://doi.org/10.1109/access.2024.3400164

Jin, Y., Chandra, M., Verma, G., Hu, Y., Choudhury, M.D., Kumar, S.: Better to ask in English: Cross-lingual evaluation of large language models for healthcare queries. Proceedings of the ACM Web Conference 2024 (2023) https://doi.org/10.1145/3589334.3645643

Kovari, A.: Ethical use of ChatGPT in education—best practices to combat AI-induced plagiarism. Frontiers in Education (2025) https://doi.org/10.3389/feduc.2024.1465703

Kelly, A., Sullivan, M.: ChatGPT in higher education: Considerations for academic integrity and student learning. Journal of Applied Learning and Teaching (2023)

Kamalov, F., Santandreu Calonge, D., Gurrib, I.: New era of artificial intelligence in education: Towards a sustainable multifaceted revolution. Sustainability 15(16), 12451 (2023)

Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., et al.: ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences 103, 102274 (2023)

Lewis, A.: Multimodal large language models for inclusive collaboration learning tasks. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop, pp. 202–210 (2022)

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., et al.: Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems 33, 9459–9474 (2020)

Lavicza, Z., Weinhandl, R., Prodromou, T., Anđić, B., Lieban, D., Hohenwarter, M., Fenyvesi, K., Brownell, C., Diego-Mantecón, J.M.: Developing and evaluating educational innovations for STEAM education in rapidly changing digital technology environments. Sustainability 14(12), 7237 (2022)

McHugh, M.L.: The chi-square test of independence. Biochemia Medica 23(2), 143–149 (2013)

Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems 26 (2013)

Mann, H.B., Whitney, D.R.: On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, 50–60 (1947)

Navigli, R., Conia, S., Ross, B.: Biases in large language models: origins, inventory, and discussion. ACM Journal of Data and Information Quality 15(2), 1–21 (2023)

Nguyen, K.V.: The use of generative AI tools in higher education: Ethical and pedagogical principles. Journal of Academic Ethics, 1–21 (2025)

Ng, D.T.K., Leung, J.K.L., Su, J., Ng, R.C.W., Chu, S.K.W.: Teachers' AI digital competencies and twenty-first century skills in the post-pandemic world. Educational Technology Research and Development 71(1), 137–161 (2023)

Nguyen, A., Ngo, H.N., Hong, Y., Dang, B., Nguyen, B.-P.T.: Ethical principles for artificial intelligence in education. Education and Information Technologies 28(4), 4221–4241 (2023)

Naik, A., Yin, J.R., Kamath, A., Ma, Q., Wu, S.T., Murray, C., Bogart, C., Sakr, M., Rose, C.P.: Generating situated reflection triggers about alternative solution paths: A case study of generative AI for computer-supported collaborative learning. In: International Conference on Artificial Intelligence in Education, pp. 46–59 (2024). Springer

Naik, A., Yin, J.R., Kamath, A., Ma, Q., Wu, S.T., Murray, R.C., Bogart, C., Sakr, M., Rose, C.P.: Providing tailored reflection instructions in collaborative learning using large language models. British Journal of Educational Technology 56(2), 531–550 (2025)

Obidovna, D.Z.: The pedagogical-psychological aspects of artificial intelligence technologies in integrative education. International Journal of Literature and Languages 4(03), 13–19 (2024)

Perkins, M.: Academic integrity considerations of AI large language models in the post-pandemic era: ChatGPT and beyond. Journal of University Teaching and Learning Practice (2023) https://doi.org/10.53761/1.20.02.07

Papineni, K., Roukos, S., Ward, T., Zhu, W.-J.: BLEU: a method for automatic evaluation of machine translation.
In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)

Pelaez-Sanchez, I.C., Velarde-Camaqui, D., Glasserman-Morales, L.-D.: The impact of large language models on higher education: exploring the connection between AI and Education 4.0. Frontiers in Education (2024) https://doi.org/10.3389/feduc.2024.1392091

Rodrigues, L., Pereira, F.D., Cabral, L., Gašević, D., Ramalho, G., Mello, R.F.: Assessing the quality of automatically generated short answers using GPT-4. Computers and Education: Artificial Intelligence 7, 100248 (2024)

Xu, X., Chen, Y., Miao, J.: Opportunities, challenges, and future directions of large language models, including ChatGPT in medical education: a systematic scoping review. Journal of Educational Evaluation for Health Professions 21 (2024)

Xu, H., Gan, W., Qi, Z., Wu, J., Yu, P.S.: Large language models for education: A survey. arXiv abs/2405.13001 (2024) https://doi.org/10.48550/arxiv.2405.13001

Yan, L., Sha, L., Zhao, L., Li, Y., Martinez-Maldonado, R., Chen, G., Li, X., Jin, Y., Gašević, D.: Practical and ethical challenges of large language models in education: A systematic scoping review. British Journal of Educational Technology 55(1), 90–112 (2024)

Zhang, X., Zhang, X., Liu, H.: Reflections on enhancing higher education classroom effectiveness through the introduction of large language models. Journal of Modern Educational Research (2024) https://doi.org/10.53964/jmer.2024019

Table 12: Average results of FLAN-T5 and LLaMA 2 models given the prompt (student's question) and context (teacher's answer). The results are calculated over a subset of questions (N=10) and an A100 GPU was used. The metrics reported are the cosine similarity and BLEU score. Cosine measures, such as similarity and distance, are commonly used in NLP to calculate the relationship between words, sentences, or documents, as in Mikolov et al. (2013). BLEU is a precision-based metric originally developed for evaluating machine translation, and it quantifies alignment between the generated and the reference text (Papineni et al. 2002). Adapted from **Anonymized** (reference will be added upon acceptance).

                 Cosine similarity   BLEU   Runtime
  FLAN-T5 base        0.68           0.39   19.0 s
  LLaMA 2             0.72           0.33   38.9 s

Table 13: Students' perspectives regarding the quality of the answers provided by the AI assistant.

RQ2: What do you think of the quality of the answers?

• Some of the answers were the same, though I think this was our fault as we found prompting hard.
• It felt like it had a set of pre-made answers and couldn't really elaborate as if it was a conversation. Many times the answers were unrelated to the question.
• You may structure it in such a way that ChatGMP asks the auditor to be explicit with the question when it doesn't understand the question, rather than giving "Yes, of course".
• The questions need to be very specific, and sometimes it would take time to reformulate the question so we could get the expected answer. In many cases we did not get what we wanted.
• For most of the answers, yes, it was really nice, but for a few it didn't make sense to us. If it was a real audit in person then it wouldn't be acceptable.
• When you actually get what you are asking for, then it is good, but because it was so difficult for the chat to understand the questions, the answers were bad.
• A live person elaborates more than a bot.
• It felt like nuance is lost; slight changes to a question to hone in on specific answers would yield the same response. Other than that, it seemed like the responses to prompts were given character to simulate a human, which made it more personal.
• The transcript is very convenient, making it easy to scan for information from the answers.
• While it was really cool that the answers were good and were more similar to a real person talking rather than ChatGPT, it was really difficult to read the answers during the audit as the text was formulated like speech.
• If our questions were too specific, the bot got very confused.
• In general, the answers were very good. The LLM has been trained well on the topic, and provides very coherent and professional answers. Sometimes we wondered if the chatbot was even more honest than a real person would have been about the issues in Pharma A/S. In a few cases, I think 2 questions, the chatbot had issues understanding what we were asking for. It seems that it might have been trained a bit rigidly, considering that there is a difference between asking for X program or X procedure.
• We had to be really specific on the questions and be precise with the words we were using.
• Sometimes the answers were completely unrelated, but that is to be expected as the virtual tutor is still learning.
• The quality of the answers was ok, but it was quite hard to ask the questions in a way so that the bot would understand what we were looking for.
• The answers seem really real; it feels like doing an audit with real companies. It helped us become more familiar with audit processes and also improve our questions list.
• Not always clear.
• Overall it was a very good experience. It was very honest and the answers were of appropriate length. It didn't understand some of our questions when we didn't use the right words, such as using program instead of protocol, but just asking it again fixed the issues. Sometimes we asked questions which it didn't have an answer to or understand, but it most likely would have been the same with an audit person.

Table 14: Students' perspectives regarding the quality of the answers provided by the teacher in the spring semester of 2025.

RQ2: What do you think of the quality of the answers?

• Very close to reality, I think.
• Answers were designed to highlight mistakes clearly.
• Most of them were answered and documentation was supplied, although it was available a bit later than we expected.
• The teacher had answers for most of our questions and could refer to the documents.
• I was a bit confused about how to handle answers that talked about a document that was not provided. I understand that not all QMS documents for NovoPharma can be created, but it would have been nice to have been explained how to handle documents that the auditee said "existed" but could not provide.
• The teacher did try to talk about some random things, but that was expected, so we focused on the relevant answers. The documents that we received after were very good.
• Also, mock installations could be added, as it makes it more visual and helps the students to have a notion of visual inspections and keeping an eye for details. Maybe high-res pictures that you can zoom in on to see if you find any deviation, or videos with normal/abnormal procedures. It is also more engaging rather than just doing a documentative review/control.
• I understand that the teacher needs to speak to "gain time" for their company, but I felt bad/rude when I cut off his conversation. But I know that he is simulating real life.
• The quality of the answers was good. They didn't have all the documents that we asked for, but it's not a real-world audit so it's fine.
• It was okay.
• Sometimes missing info, but very reasonable.
• I actually also learned something about working with GMP and about procedures in pharmaceutical production during the actual audit.
• I think we got accurate answers to most of our questions. There were some moments where we were asking for some evidence and none was provided, and I wasn't sure if this was just because it was a fake audit, or if I should have registered it as a finding.
Table 15: Students' perspectives regarding the future of the exercise. Responses provided by students who audited a teacher in the spring semester of 2025.

RQ4: Do you think this is the future of the audit exercise in this course?

• The closest experience you can have to the actual job, offering a valuable complement to traditional lectures.
• It was an overall good experience with the teacher; I don't really have a comparison. For me it came to a certain extent close to a real audit and I would recommend others to do it with a teacher as well.
• It forces you to create a list of questions and have a critical view of the answers/documents you receive.
• I think the exercise worked very well.
• The AI assistant has a lot of potential the more refined it gets. But it will never beat a well-acted mock audit with a human auditee.
• I think it is a real and good exercise to prepare us, whether or not we want to work in a QA or audit department. You can learn a lot and it is useful for your future professional life.
• Yes, it was nice to create a real-world audit, so we can learn how the audit process goes. There was sufficient time for each team's work. Guest lectures from the food industry would be nice too.
• Not sure.
• I didn't do it with ChatGMP.
• I haven't tried the chat version.
• Personally, I have never been part of an audit before, and I think that this exercise gave me a good representation of what audits are about: the departments that are involved and a general image of the procedure.

Table 16: Additional reflections of the students regarding the interaction with the AI assistant.

Additional feedback

• It was harder to prompt than what I am used to, as it did not have memory of the last question, so follow-up questions were not seamless.
• It felt very planned and it looks like the model was just trained with a specific set of answers and was just trying to find a best fit for the questions.
• It was very nice to try this out; it felt more relaxed and in control of the next steps. Even though, in some cases, it was difficult to receive the answers we wanted, or the documents, it was still nice to work with it.
• It was fun to try.
• Clarifying questions were not the easiest, and with the machine sometimes producing identical answers it did sound a little robotic, for lack of a better word.
• The text was very uninviting to read in general. The slow responses made it feel more like a computer than a "real company".
• It was communicating in a very professional manner, so except for 2 times where it misunderstood our questions, I had the impression that we could be talking to a real person.
• We spent some time (too much time, in my opinion) rephrasing our questions to the bot several times because it didn't understand what we meant; this made me quite stressed. And I feel like sometimes our question was too narrow and detailed, but sometimes too broad, and so I had a hard time getting the hang of how to phrase the questions.
• It's really nice, but maybe it would be better if we could copy and paste our questions instead of typing.
• Still early-stage development AI.
• It was a more relaxed and positive experience interacting with the chat, which was nice to take the pressure off.
• It was particularly good at providing documents.
