Reading time: 21 minutes
...

📝 Original Info

  • Title:
  • ArXiv ID: 2512.20714
  • Date:
  • Authors: Unknown

📝 Abstract

Generative AI enables personalized computer science education at scale, yet questions remain about whether such personalization supports or undermines learning. This scoping review synthesizes 32 studies (2023-2025) purposively sampled from 259 records to map personalization mechanisms and effectiveness signals in higher-education CS contexts. We identify five application domains-intelligent tutoring, personalized materials, formative feedback, AI-augmented assessment, and code review-and analyze how design choices shape learning outcomes. Designs incorporating explanation-first guidance, solution withholding, graduated hint ladders, and artifact grounding (student code, tests, and rubrics) consistently show more positive learning processes than unconstrained chat interfaces. Successful implementations share four patterns: context-aware tutoring anchored in student artifacts, multi-level hint structures requiring reflection, composition with traditional CS infrastructure (autograders and rubrics), and human-in-the-loop quality assurance. We propose an exploration-first adoption framework emphasizing piloting, instrumentation, learning-preserving defaults, and evidence-based scaling. Four recurrent risks-academic integrity, privacy, bias and equity, and over-reliance-are paired with operational mitigations. Critical evidence gaps include longitudinal effects on skill retention, comparative evaluations of guardrail designs, equity impacts at scale, and standardized replication metrics. The evidence supports generative AI as a mechanism for precision scaffolding when embedded in exploration-first, audit-ready workflows that preserve productive struggle while scaling personalized support.

📄 Full Content

Providing timely, personalized support to novice programmers is a long-standing challenge in computer science education. Novice learners often encounter syntax errors or logic bugs that, without immediate intervention, can lead to frustration and disengagement. While faculty recognize the importance of individualized guidance, delivering it at scale in large introductory courses has historically been infeasible due to resource constraints.

Generative AI now offers a potential solution to this bottleneck, enabling context-aware assistance that can support students precisely when they are struggling.

Large language models (LLMs) such as GPT-4/4o and contemporary systems like Claude and DeepSeek have moved from novelty to infrastructure in higher education [1-4]. Unlike prior rule-based adaptive systems, LLMs synthesize context-aware explanations, worked examples, and code-centric feedback on demand, making personalization feasible at the granularity of a single test failure, a syntax or logic error, or a misconception articulated in dialogue [5,6]. At the same time, GenAI has caused familiar concerns to resurface-integrity, privacy and governance, bias and equity, and over-reliance-that must be managed rather than used to justify paralysis [7-10].

Institutional responses to GenAI have diverged. Some universities initially prohibited or tightly restricted student use (e.g., Sciences Po’s January 2023 ban on ChatGPT for graded work, issued shortly after its public release), while others have adopted enabling, system-wide access models. In California, the California State University (CSU) system has deployed ChatGPT Edu across all 23 campuses-providing no-cost, single-sign-on access for more than 460,000 students and 63,000+ employees-and launched the CSU AI Commons to centralize training and guidance [11-15]. This spectrum of policy choices underscores the need for adoption guidance that is practical, evidence-seeking, and resilient to local constraints.

What we mean by personalization.

We define personalization as the systematic adaptation of content, feedback, or task sequencing to a learner’s evolving state using observable evidence (code, tests, errors, and dialogue) [6,7]. In CS contexts this includes (i) solution-withholding programming assistants that prioritize reasoning; (ii) conversational debugging targeted to concrete error states; (iii) tailored worked examples, Parsons problems, and practice sets aligned to course context; and (iv) assessment workflows in which LLMs generate tests, rubric-aligned comments, and explanation-rich grades under human audit [16-23]. The pedagogical aim is precision scaffolding: keeping students in productive struggle with stepwise hints, tracing, and test-driven guidance rather than answer dumps [16,21,24].

A working definition: exploration-first.

We use exploration-first to denote a deployment stance and workflow in which instructors and institutions pilot small, instrument interactions, and scale in response to evidence, with learning-preserving defaults built into tools and policies. Concretely, exploration-first means:

Help design defaults that preserve productive struggle: Explanation-first hints (pseudocode, tracing, and fault localization), solution withholding by default, and graduated hint ladders supported by short reflection prompts before escalation (a minimal policy sketch follows this list).

Artifact grounding: Tutors and feedback are conditioned on the learner’s current code, failing tests, and assignment specification; assessment is conducted with explicit rubrics and exemplars and unit tests and mutation checks.

Human-in-the-loop audits of any generated tests, items, and grades, with logs retained for pedagogy and moderation (not “detector” policing).

Pilot → measure → scale: Activation for one section or assignment, examining process and outcome metrics, and expanding scope when the combined quantitative and qualitative evidence supports doing so.

Enablement governance: Vetted or enterprise instances, data minimization, and prompt and version change logs; short allow-lists in syllabi plus process evidence (what was asked, hint levels used, and test history) instead of AI detectors.
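To make the first two defaults concrete, the sketch below shows one way a graduated hint ladder with reflection gating and solution withholding could be implemented. It is a minimal illustration, not a system described in the reviewed studies; the level names, word-count threshold, and class names are assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hint levels escalate from strategy-level cues to line-level guidance;
# full solutions are withheld by default.
HINT_LEVELS = [
    "strategy",    # restate the goal, suggest a plan or relevant concept
    "trace",       # prompt the student to trace execution / inspect the failing test
    "localize",    # point to the region of code or the failing assertion
    "pseudocode",  # outline a fix in pseudocode, still no runnable solution
]

@dataclass
class HintLadder:
    """Tracks one student's position on the hint ladder for a single task."""
    level: int = 0
    reflections: list = field(default_factory=list)

    def request_hint(self, reflection: str = "") -> str:
        """Return the next hint level; escalate only after a short written reflection."""
        if self.level == 0:
            self.level = 1
            return HINT_LEVELS[0]
        if len(reflection.split()) < 10:
            # Gate escalation on a brief reflection (what was tried, what the test showed).
            return ("Before the next hint, briefly describe what you tried "
                    "and what the failing test told you.")
        self.reflections.append(reflection)
        self.level = min(self.level + 1, len(HINT_LEVELS))
        return HINT_LEVELS[self.level - 1]

# Example: the second hint is released only after a substantive reflection.
ladder = HintLadder()
print(ladder.request_hint())            # -> "strategy"
print(ladder.request_hint("it fails"))  # -> reflection prompt (too short)
print(ladder.request_hint(
    "I traced the loop and the off-by-one happens when the list is empty, "
    "so the failing test expects an empty result."))  # -> "trace"
```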

Why a scoping review now?

Since 2023, institutions have shifted from ad hoc experimentation to exploration-first adoption: course-integrated assistants that guide rather than answer, explicit but enabling policies, faculty development, and vetted tooling [8,25-32]. The literature is expanding quickly but remains heterogeneous in tasks, measures, and outcomes; many studies are classroom deployments or mixed-method evaluations. A scoping (rather than effect-size) review is appropriate to map applications and mechanisms, surface signals of effectiveness and risk, and distill design and governance guidance that instructors can use now.

Against this backdrop, there remains a lack of integrative work that focuses specifically on GenAI-enabled personalization in higher-education computer science, systematically mapping not just use cases but the underlying mechanisms, reported outcome signals, and recurrent risks. Existing reviews and position papers typically either address AI in education in general or focus on single tools, courses, or outcome types, leaving instructors and departments without a consolidated view of how personalization is actually being implemented, under what conditions it appears to support or hinder learning, and where the evidence remains thin. The objective of this scoping review is therefore to synthesize recent empirical work to (i) characterize personalization mechanisms across application areas, (ii) identify the types of process and outcome signals that are reported, (iii) relate these mechanisms and signals to established learning-theoretic constructs, and (iv) surface gaps and design considerations that can inform both practice and future research.

Focusing on 2023-2025 in higher-education CS, we contribute:

1. A structured map of how GenAI is used to personalize CS learning, emphasizing mechanisms (explanation-first hints, ladders, and rubric and test grounding) over brands;

2. A synthesis of effectiveness signals (time-to-help, error remediation, feedback quality, and grading reliability) and the conditions under which they appear;

3. A consolidation of risks (integrity, privacy, bias and equity, and over-reliance) with actionable mitigations;

4. Design principles and workflow patterns for exploration-first personalization;

5. Department and institution guidance for policy, vendor vetting, and AI-aware assessment [7,8,33].

Research questions.

Which personalization mechanisms (explanation-first help, graduated hints, and code-aware dialogue) are most promising without short-circuiting learning? [16,17,21,24]

Classic Intelligent Tutoring Systems (ITSs) modeled learner knowledge and errors to deliver stepwise hints and mastery-based sequencing [36-40]. Later work refined knowledge tracing and data-driven adaptivity [41,42]. LLMs alter this landscape by generating context-specific explanations, examples, and code-aware feedback through natural-language dialogue [1,6,43]. The result is a different granularity of support (line-level commentary and test-oriented guidance) that can be tuned to individual misconceptions.

We use personalization to denote continuous, evidence-driven tailoring of content, feedback, or task sequencing. Adaptation often refers to real-time adjustments (for example, difficulty and hinting) based on performance signals, whereas individualization can include preference or profile-based configuration without continuous updates [44]. Our scope emphasizes code-centric tutoring, targeted feedback, and sequenced practice that leverage generative models to produce the adapted artifacts themselves [5].

Introductory programming affords fine-grained interventions (syntax and logic errors, test failures, tracing), algorithms emphasize strategy explanations and worked examples, and software engineering invites feedback on design and reviews [38,45]. Productive scaffolds include explanation-first help, graduated hints, and practice items aligned to course context and learner readiness.

Before GenAI, personalization drew on behavior and performance signals to deliver immediate feedback, difficulty adjustment, and sequencing [44,46]. Autograding pipelines, unit tests, and program analysis underpinned scalable feedback, but authored hints and items were costly to produce and maintain. Systematic reviews in programming and medical education foreshadowed GenAI’s promise and limitations [47][48][49].

Recurring application patterns (2023-2025) include AI-assisted code review using curated exemplars and model prompts [62-66].

We followed Arksey-O’Malley, Levac, and JBI guidance and PRISMA-ScR reporting where applicable [67-70]. Eligibility was determined according to JBI population-concept-context (PCC): population = higher-education CS learners and instructors; concept = GenAI-enabled personalization; context = higher-education CS courses and supports. The window was 2023-2025. Sources included the ACM Digital Library (SIGCSE TS, ITiCSE, ICER, TOCE), IEEE Xplore (FIE and the ICSE SEIP track), CHI, CSCW, L@S, LAK, EDM, and indexing via Google Scholar and arXiv. We ran venue-first queries, forward and backward citation chasing, and hand-searched institutional guidance (policy and governance only).

This scoping review was retrospectively registered on the Open Science Framework (OSF); registration details will be provided in the final version. https://osf.io/bge7y (accessed on 16 December 2025).

Inclusion criteria required that studies (i) took place in higher-education computer science contexts; (ii) implemented or enabled GenAI-based personalization rather than generic AI use; (iii) provided empirical evaluation (deployment, design experiment, or prototype study); (iv) were published in English between 2023 and 2025; and (v) reported sufficient methodological detail to characterize personalization mechanisms. Exclusion criteria eliminated K-12 studies, non-GenAI personalization, opinion pieces, patents, and papers lacking empirical evaluation.

Screening proceeded through title/abstract review followed by full-text assessment. Of the 259 screened records, 59 met full-text eligibility. From these, we purposively sampled 32 studies to support a mechanism-focused scoping synthesis. Purposive sampling is consistent with JBI scoping-review guidance when the goal is to map mechanisms rather than enumerate all instances. Our sampling emphasized analytic suitability rather than outcome direction.

A subset of full-text-eligible papers could not meaningfully contribute to mechanism mapping because they lacked operational detail, did not actually implement personalization, or reported outcomes that were uninterpretable for our analytic aims. We therefore prioritized studies that met all three of the following suitability conditions:

(a) Mechanism transparency: Studies that clearly described personalization mechanisms (e.g., hint ladders, explanation-first scaffolding, course-aligned generation, or test- or rubric-grounding) were included. Papers that invoked “ChatGPT support” without detailing intervention logic were excluded.

(b) Interpretable process or learning outcomes: Studies reporting measurable learning, debugging, process, or behavioral outcomes were included. Papers reporting only post hoc satisfaction surveys or generic perceptions without task-linked metrics were excluded because they could not inform mechanism-outcome relationships.

(c) Sufficient intervention detail: Studies that described prompts, constraints, workflows, model grounding, or tutor policies were included. Excluded papers typically lacked enough detail to map how personalization was implemented (e.g., no description of scaffolding, no explanation of input grounding, or insufficient reporting of tasks).

Why 27 full-text studies were excluded.

The 27 excluded papers typically exhibited one or more of the following characteristics:

• Personalization not actually implemented: The system provided static advice or open-ended chat interaction with no evidence of adaptation.

• Insufficient mechanism description: The intervention lacked detail on how hints were generated, how tasks were adapted, or how the model was conditioned.

• Outcomes limited to satisfaction surveys: No behavioral, process, or learning-related data were reported, preventing mechanism mapping.

• Redundant or superseded work: Conference abstracts or short papers from research groups whose work was expanded into more detailed publications; only the fuller versions were retained.

• Negative or null results with no mechanistic insight: Some studies reported poor or null outcomes but provided too little detail to attribute failure to design, prompting, scaffolding, or grounding decisions.

Implications for bias.

Because we prioritized mechanism-rich studies, our final corpus likely overrepresents better-specified and more mature deployments. This introduces a known selection bias toward successful or interpretable implementations. To mitigate misinterpretation, we treat outcome patterns as indicative signals rather than effectiveness estimates and emphasize throughout that the true distribution of results in the broader literature is likely more mixed.

Figure 1 summarizes the screening and purposive sampling process for the included studies.

• Records identified via venue browsing (SIGCSE, ICER, CHI, CSCW, L@S, LAK, EDM, and ICSE SEIP): 126

• Records identified via indexing (Google Scholar and arXiv): 206

• Titles/abstracts screened (inclusion: higher-education CS and GenAI personalization; exclusion: K-12 only, non-GenAI, opinion): 259

• Full texts assessed for eligibility: 59

• Studies included in synthesis (2023-2025): 32

We extracted bibliometrics, instructional context, model/tooling details, personalization mechanisms, implementation, evaluation design, outcomes/effectiveness signals, and risks/governance. We used a hybrid deductive-inductive synthesis, mapping to an a priori schema (tutoring, learning materials, feedback, assessment, code review, and governance) and open-coding emergent mechanisms and success/failure conditions. We did not estimate pooled effects, and, consistent with scoping aims and PRISMA-ScR guidance, we did not undertake formal methodological quality appraisal (for example, JBI tools), as our goal was to map applications and mechanisms rather than to compute aggregate effect-size estimates.

Table 1 presents a taxonomy of GenAI-enabled personalization uses in higher-education computer science, organized by application type, mechanisms, and instructional context.

Screening and purposive sampling yielded 32 studies (2023-2025) implementing or enabling personalization in higher-education CS. Most appear in peer-reviewed computing education, HCI, and software engineering venues (SIGCSE, ITiCSE, ICER, CHI, L@S, LAK, EDM, and ICSE), with the remainder as detailed preprints. The modal context is CS1 and CS2 and software engineering courses, with several large-course deployments (e.g., CS50 and CS61A) [27,71].

We classify studies into five non-overlapping application areas (Table 2); personalization mechanisms manifest as explanation-first tutoring and graduated hints, course-aligned generation of examples and exercises, targeted formative feedback, test- and rubric-driven assessment, and AI-assisted code review. Figure 2 visualizes the distribution of studies across these application areas.

Evaluation constructs included time-to-help; error remediation (fraction of failing tests resolved and next-attempt correctness); perceived understanding and utility; feedback quality (specificity, actionability, and alignment; inter-rater κ); grading reliability (QWK or adjacent agreement); test quality (statement or branch coverage, mutation score, and unique edge cases); item and exam quality (difficulty, discrimination, KR-20, or α); help-seeking behavior (hint versus solution requests and escalation); instructor and TA effort (authoring and audit time); and developer metrics in code review (precision and recall of issue detection and review acceptance). See Table 3.
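As an illustration of how grading-reliability constructs such as QWK and adjacent agreement are typically computed, the snippet below compares hypothetical human and LLM rubric scores. The grade vectors are invented for the example and are not data from any included study.

```python
# Sketch of quantifying agreement between an LLM grader and a human rater
# using quadratic weighted kappa (QWK) and adjacent agreement.
import numpy as np
from sklearn.metrics import cohen_kappa_score

human_grades = np.array([4, 3, 5, 2, 4, 1, 3, 5, 4, 2])   # rubric levels 1-5
llm_grades   = np.array([4, 3, 4, 2, 5, 1, 3, 5, 3, 2])

# QWK penalizes large disagreements more heavily than near-misses.
qwk = cohen_kappa_score(human_grades, llm_grades, weights="quadratic")

# Adjacent agreement: fraction of items where the two grades differ by at most one level.
adjacent = np.mean(np.abs(human_grades - llm_grades) <= 1)

print(f"QWK: {qwk:.2f}, adjacent agreement: {adjacent:.2%}")
```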

Tutoring/assistants. Classroom deployments and observations report faster help and higher perceived understanding when assistance emphasizes explanation, pseudocode, and staged hints while withholding final solutions by default; group-aware facilitation is feasible; and unguarded chat drifts toward answer-seeking [16,17,50-52,89].

Personalized materials. LLM-generated worked examples and practice sets are often rated usable and helpful by novices; quality varies and benefits from instructor review; and on-demand Parsons puzzles can adapt to struggle patterns [18-20,53].

Targeted feedback. Structured, error-specific feedback and feedback ladders improve perceived clarity and actionability; tutor-style hints benefit from test/fix grounding and quality validation; hybrid LLM plus test feedback in MOOCs is promising; and design-oriented formative feedback is emerging [21,24,34,54,55].

Assessment. LLM-generated unit tests can increase coverage and surface edge cases and ambiguities; rubric-guided grading pipelines can approach human-level agreement when explicit rubrics and exemplars, plus calibration, are used; and MCQ and exam generation is viable with expert review and vetting workflows [22,23,56-61].

Code review. Models trained or prompted with high-quality review corpora produce more consistent, personalized critiques; industrial deployments highlight value and pitfalls; and human-in-the-loop processes remain essential [62,63,65,66].

We coded each study by its primary design pattern and the overall outcome valence as reported by the authors (for example, positive, mixed, or negative with respect to stated goals). Recurring patterns included explanation-first, solution-withholding tutoring; graduated hint ladders; test- and rubric-grounded assessment; course-aligned generation of examples and exercises; and unconstrained chat interfaces.

Across the corpus, explanation-first and solution-withholding designs, graduated hints, and test- and rubric-grounded assessment were consistently associated with more positive reported outcomes than unconstrained chat interfaces. Unconstrained chat, especially when it routinely produced complete solutions, appeared more frequently in studies describing mixed or weaker learning benefits, concerns about over-reliance, or integrity risks.

Table 4 summarizes design patterns and high-level observations without attempting formal quantitative comparison.

• Explanation-first + solution withholding (often positive): Supports productive struggle and perceived learning; requires AI literacy to discourage answer-seeking.

• Graduated hint ladders: Aligns with stepwise scaffolding; development cost and tuning are nontrivial.

• Test/rubric-grounded assessment (often positive): Reliability improves when coupled with clear rubrics, exemplars, and audit; hallucinations surface when specs are vague.

• Course-aligned generation of examples and exercises: Helps with practice at scale; quality variance highlights the need for instructor review and curation.

• Unconstrained chat interfaces: Solution dumping, reduced productive struggle, integrity concerns, and over-reliance are recurrent issues.

Studies reporting positive outcomes commonly share several implementation conditions:

Artifact grounding: Tutoring and feedback anchored in students’ current code, failing tests, and assignment specifications.

Quality assurance loops: Human review of generated tests, items, hints, or grades before or alongside student exposure.

Graduated scaffolding: Multi-level hint structures or feedback ladders requiring reflection or effort before escalation.

AI literacy integration: Explicit instruction on effective help-seeking, limitations of tools, and expectations around academic integrity.

Conversely, studies reporting mixed or negative outcomes often exhibit the following:

• Unconstrained access to solutions early in the interaction;

• Grading prompts without explicit rubrics or exemplar calibration;

• Limited or no instructor review of generated content;

• Weak integration with existing course infrastructure (autograders, LMS, and version control).

Figure 3 visualizes relationships between personalization mechanisms, implementation conditions, and learning outcomes at a conceptual level.

Four design patterns recur across the corpus. First, explanation-first, solution-withholding assistance prioritizes tracing, error localization, and pseudocode over finished answers [16,17]. Second, graduated feedback ladders escalate from strategy-level cues to line-level guidance [21,54]. Third, course-aligned generation of examples and exercises (worked examples and Parsons problems) tailors content to local curricula [18,53]. Fourth, test- and rubric-driven pipelines ground assessment and tutor-style hints in explicit specifications and criteria [22-24]. Anti-patterns include unconstrained chat that yields full solutions early and grading prompts without explicit rubrics and calibration.
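The first and fourth patterns hinge on how the model is conditioned. Below is a minimal sketch, under assumptions, of assembling an explanation-first, artifact-grounded tutor prompt from the student's code, failing tests, and assignment specification; the policy wording, function name, and example assignment are invented for illustration, and the actual model call is omitted because provider APIs differ.

```python
# Minimal sketch of artifact grounding for an explanation-first, solution-withholding tutor.
# The assembled prompt would be passed to whichever LLM a course has vetted.

SYSTEM_POLICY = """You are a CS1 tutoring assistant.
- Explain concepts, trace execution, and localize faults.
- Do NOT provide complete solutions or full corrected code.
- Offer pseudocode only if the student has already attempted a fix.
- Keep hints tied to the failing tests and the assignment specification."""

def build_tutor_prompt(spec: str, student_code: str, failing_tests: list, question: str) -> str:
    """Condition the tutor on the assignment spec, current code, and failing tests."""
    tests = "\n".join(f"- {t}" for t in failing_tests) or "- (none reported)"
    return (
        f"{SYSTEM_POLICY}\n\n"
        f"Assignment specification:\n{spec}\n\n"
        f"Student's current code:\n{student_code}\n\n"
        f"Failing tests:\n{tests}\n\n"
        f"Student question:\n{question}\n"
    )

prompt = build_tutor_prompt(
    spec="Write count_vowels(s) returning the number of vowels in s (case-insensitive).",
    student_code="def count_vowels(s):\n    return sum(1 for c in s if c in 'aeiou')",
    failing_tests=["count_vowels('AEIOU') == 5  # currently returns 0"],
    question="Why does the uppercase test fail?",
)
print(prompt)
```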

These mechanisms are consistent with earlier work on ITSs, help-seeking, and feedback quality, where granularity of support, alignment with learner state, and clarity of criteria matter at least as much as raw system accuracy [36,38,86,88].

Improvements tend to arise under conditions that include guardrails preserving student effort (solution withholding and reflection between hint levels); tight task grounding (student code, tests, and specifications); structured evaluation artifacts (unit tests, mutation checks, and rubrics); human-in-the-loop curation (items and examples); and AI literacy plus process evidence to incentivize strategy-seeking [8,16,22,23,91].

Failure modes include answer-seeking drift, variable output quality without review, rubric-poor grading prompts, and equity risks from uneven access and over-reliance [17,20,83,84]. These conditions mirror broader findings on adaptive learning systems and persuasive technology: design choices shape not only learning outcomes but help-seeking habits and motivational dynamics [44,92-94].

Integrity and over-reliance can be mitigated via solution withholding, oral defenses or code walkthroughs, process evidence (commit histories and prompt logs), and a mix of AI-permitted and AI-free assessments [90,91,95]. Privacy and governance are addressed via vendor vetting (retention, training use, and access controls), enterprise instances, data minimization, and consent pathways [8,33,35,96]. Bias and equity concerns are mitigated when institutions provision licensed access, review outputs for bias, accommodate multilingual learners, and avoid unreliable detectors [97-101]. Quality and hallucination risks are reduced by composing LLMs with tests and rubrics, calibrating prompts, and auditing outputs and versions [22,23,61].

Three workflow families stand out: (a) Tutoring: defaults to explanation-first help, ladders hints, requires reflection between levels, and throttles or justifies any code emission, aligning with work on help-seeking, desirable difficulties, and cognitive load [86,88,102,103]. (b) Assessment: specifications feed LLM-generated tests, followed by instructor mutation/coverage audit; rubrics guide exemplar-calibrated grading with moderation [22,23,74,75]. (c) Process-based assessment: grading design rationales, test-first thinking, and revision quality; using vivas or code reviews to assess authorship and understanding [91,95].
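For the assessment family, the audit step can be made concrete with a lightweight check that generated tests pass on the reference solution and fail on seeded buggy variants, a hand-rolled stand-in for the mutation and coverage analysis such pipelines rely on. The reference function, mutants, and assertions below are invented for illustration.

```python
# Lightweight audit of generated unit tests: tests should pass on the reference
# solution and fail ("kill") seeded buggy variants.

def reference_median(xs):
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def buggy_no_sort(xs):              # mutant 1: forgets to sort
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def buggy_wrong_even(xs):           # mutant 2: mishandles even-length input
    s = sorted(xs)
    return s[len(s) // 2]

# Suppose these assertions were generated by an LLM from the assignment spec.
generated_tests = [
    lambda f: f([3, 1, 2]) == 2,
    lambda f: f([4, 1, 3, 2]) == 2.5,
    lambda f: f([5]) == 5,
]

def run_suite(func):
    return [bool(t(func)) for t in generated_tests]

assert all(run_suite(reference_median)), "generated tests must pass on the reference"
for mutant in (buggy_no_sort, buggy_wrong_even):
    killed = not all(run_suite(mutant))
    print(f"{mutant.__name__}: {'killed' if killed else 'SURVIVED - tests too weak'}")
```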

Operational transparency-publishing model, prompt, and policy details; logging for pedagogy and audit; and piloting before scale-supports reliability and trust [8,27,29-31,104-107].

Institutions are converging on policy-backed, centrally supported adoption: AI-use statements on syllabi, faculty development, vetted tools, and privacy-preserving defaults [8,9,25,104,107]. Large-course exemplars (CS50 and CS61A) illustrate assistants that guide rather than answer and embed process-based expectations into course culture [26,27,71,90]. System-level initiatives (e.g., CSU’s ChatGPT Edu deployment and AI Commons; ASU-OpenAI partnership; HKU’s shift from temporary bans to enabling policies) highlight the importance of vendor vetting, training, and governance [12-15,108-111].

Priorities include longitudinal learning effects (retention, transfer, and potential deskilling); comparative effectiveness of guardrails (ladder designs, code throttling, and reflection prompts); equity impacts at scale (stratified analyses by preparation, language, and access); and shared measures for replication (time-to-help, mutation scores, and grading agreement thresholds) [22,23,49,83,97]. Many of these needs echo earlier calls from AI-in-education and digital-education policy communities for more educator-centered, equity-aware research [7,9,10,100].

Having addressed our six research questions empirically, we connect our findings to established learning science principles to understand why certain design patterns succeed.

Desirable difficulties and productive struggle.

The effectiveness of solution-withholding designs (RQ1 and RQ2) connects to desirable difficulties theory [102]: introducing challenges that require effort during learning can improve long-term retention and transfer, even when they slow initial acquisition. GenAI personalization appears most promising when it maintains challenge within the zone of proximal development [112] while reducing unproductive struggle (environment setup, obscure syntax errors, and tooling friction). Graduated hint ladders operationalize this distinction-they provide just-in-time support for unproductive obstacles while preserving the cognitive engagement needed for schema construction.

Worked examples and fading.

The success of course-aligned worked examples and Parsons problems reflects worked-example research showing that novices learn effectively from studying solutions before generating them [113,114]. The key insight is fading: progressively reducing support as competence grows, moving from complete examples through partially completed problems to independent practice. LLMs enable dynamic, individualized fading at finer granularity than cohort-level progressions-adapting to each learner’s demonstrated understanding rather than seat time.
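A minimal sketch of what such individualized fading could look like in code: the support level is chosen from a rolling estimate of recent performance. The thresholds, window size, and activity labels are assumptions for the example, not values reported in the reviewed studies.

```python
# Sketch of individualized fading: the support level is chosen per learner from a
# rolling estimate of recent performance.
from collections import deque

class FadingPolicy:
    def __init__(self, window: int = 5):
        self.recent = deque(maxlen=window)   # 1.0 = correct attempt, 0.0 = incorrect

    def record(self, correct: bool) -> None:
        self.recent.append(1.0 if correct else 0.0)

    def next_activity(self) -> str:
        mastery = sum(self.recent) / len(self.recent) if self.recent else 0.0
        if mastery < 0.4:
            return "worked_example"        # full solution to study
        if mastery < 0.8:
            return "completion_problem"    # partially completed code to finish
        return "independent_practice"      # unscaffolded exercise

policy = FadingPolicy()
for outcome in [False, False, True, True, True]:
    policy.record(outcome)
print(policy.next_activity())   # 3/5 recent attempts correct -> "completion_problem"
```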

Test-driven tutoring and rubric-guided feedback exemplify assessment for learning [115,116]: formative processes that make success criteria explicit, provide actionable feedback, and create opportunities for revision. The effectiveness of test-grounded hints and rubric-anchored grading (RQ4) aligns with the idea that transparency about expectations, paired with specific, timely guidance, supports self-regulation and improvement. GenAI amplifies this by scaling individualized feedback that would be impractical for instructors to provide manually.

Cognitive load management.

The apparent advantages of artifact-grounded assistance (conditioned on student code, tests, and specifications) over generic tutoring align with cognitive load theory [103]: learning is optimized when extraneous load is minimized and germane load (effort building schemas) is maximized. Context-aware hints reduce the load of translating generic advice to specific code, freeing working memory for conceptual understanding. Conversely, unconstrained chat that provides complete solutions risks eliminating germane load-the very processing that drives learning.

Taken together, these connections suggest GenAI personalization is not pedagogically novel so much as a mechanism for implementing evidence-based practices at scale. The central challenge is ensuring designs preserve theoretically grounded features (desirable difficulty, graduated scaffolding, and criterion transparency) rather than optimizing for superficial metrics (task completion speed and satisfaction with solution delivery).

Based on patterns in successful deployments, we propose a phased approach to GenAI personalization adoption (Table 5). A practical instructor deployment checklist, including policy approval and communication requirements, is provided in Appendix A. The aim is not to enforce hard thresholds but to encourage routine monitoring of process and outcome metrics and structured decision-making, consistent with digital-education roadmap guidance from organizations such as EDUCAUSE, OECD, UNESCO, and the World Economic Forum [8-10,104,107].

Key decision points include (1) whether pilot data and stakeholder feedback justify continuing or adjusting the intervention; (2) whether early scaling maintains or erodes benefits; and (3) how policies and tooling should evolve as models and institutional constraints change.
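A minimal sketch of the instrumentation behind these decision points: a pilot's interaction logs can be reduced to a few process metrics (median time-to-help, share of hint versus solution requests) before deciding whether to continue, adjust, or scale. The log schema and field names are hypothetical, not taken from any reviewed system.

```python
# Sketch of instrumentation for the pilot -> measure -> scale loop: compute a few
# process metrics from interaction logs before deciding whether to expand scope.
from statistics import median

logs = [
    {"event": "help_request", "kind": "hint",     "wait_seconds": 40},
    {"event": "help_request", "kind": "hint",     "wait_seconds": 95},
    {"event": "help_request", "kind": "solution", "wait_seconds": 20},
    {"event": "help_request", "kind": "hint",     "wait_seconds": 60},
]

help_events = [e for e in logs if e["event"] == "help_request"]
time_to_help = median(e["wait_seconds"] for e in help_events)
hint_share = sum(e["kind"] == "hint" for e in help_events) / len(help_events)

print(f"median time-to-help: {time_to_help:.0f}s")
print(f"share of hint (vs. solution) requests: {hint_share:.0%}")
```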

Temporal and selection bias.

Our 2023-2025 window captures early adoption; designs and models are evolving rapidly. Within this window, we purposively sampled 32 of 59 full-text-eligible studies to prioritize mechanism transparency and analytic richness. Excluded full-texts commonly relied only on student satisfaction, did not clearly implement personalization, lacked sufficient intervention detail to map mechanisms, duplicated stronger work from the same groups, or reported negative or null outcomes without actionable mechanistic insight. As a result, our synthesis emphasizes mechanism-rich, often successful deployments and may underrepresent less well-specified or unsuccessful attempts.

Publication and outcome bias.

Negative results are underrepresented in the published literature, and combined with our focus on mechanism-rich studies, this likely leads to an optimistic skew in the available evidence. We therefore present effectiveness signals as indicative rather than definitive and caution that the true distribution of outcomes may include more mixed or negative results than the included corpus suggests.

Quality appraisal and study design.

Many included sources are conference papers or preprints. Consistent with the aims of a scoping review and PRISMA-ScR guidance, we did not conduct formal methodological quality assessment (for example, JBI tools) and did not attempt to compute pooled effect sizes. Readers should interpret our conclusions as mapping applications, mechanisms, and reported signals rather than as a formal judgment of study quality.

Studies use different metrics, making cross-study comparison difficult. We therefore refrain from cross-study quantitative synthesis and instead rely on narrative descriptions of patterns in reported measures.

Most studies are single-semester deployments. Effects on long-term retention, transfer, and professional preparation remain unknown.

Few studies stratify by student demographics or prior preparation, limiting equity claims and highlighting the need for equity-focused research.

Longitudinal studies of learning and skill development.

Cohorts should be tracked over 2-4 years to assess (1) retention of concepts learned with GenAI support versus traditional instruction; (2) transfer to advanced courses and professional practice; (3) potential de-skilling effects (reduced debugging ability and over-reliance on suggestions); and (4) career outcomes (internship acquisition and workplace performance). Needed design: multi-institutional cohort studies with matched controls and standardized assessments at graduation and 1-2 years post-graduation.

Comparative effectiveness trials of guardrail designs.

Randomized controlled trials should be conducted to compare, for example, (1) hint ladder configurations (two-level vs. four-level and reflection prompts vs. time delays); (2) code throttling thresholds (no code vs. pseudocode vs. partial snippets); and (3) artifact grounding strategies (tests-only vs. tests + rubrics vs. tests + exemplars). A related priority is the study of vendor lock-in effects and migration costs, which requires institutional data, legal expertise, and partnership with privacy organizations (FPF, EFF, and CDT) [33,35,96].

Labor and instructor impact.

Investigations of (1) changes in instructor workload (time saved on grading vs. time spent on tool management and audit); (2) deskilling concerns (TAs losing grading experience and faculty losing assessment design practice); and (3) power dynamics (algorithmic management of teaching and surveillance of instructors) should be carried out. Methods: Labor study approaches, critical pedagogy frameworks, and ethnography must be employed.

Long-term institutional case studies.

Multi-year documentation of GenAI personalization adoption at diverse institutions should be developed: (1) policy evolution (from prohibition to enablement to normalization); (2) infrastructure development (procurement, support, and training); and (3) cultural change (faculty attitudes and student expectations). Design: Longitudinal ethnography, document analysis, and interviews with administrators and faculty are required.

GenAI can deliver precision scaffolding in CS education-faster help, clearer targeted feedback, and scalable assessment support-when designs emphasize explanation-first tutoring, graduated hints, and test- and rubric-driven workflows under human oversight. Unconstrained, solution-forward use risks eroding learning and exacerbating integrity and equity issues. An exploration-first stance-clear goals, enabling policies, vetted tools, and routine audits-aligns personalization with durable learning and fairness [8,16,22,23].

Actionable takeaways.

• Design for productive struggle: Default to solution withholding and laddered hints; require reflection between hint levels [16,54,102,113].

• Ground feedback in artifacts:

