Improving Code Comprehension through Cognitive-Load Aware Automated Refactoring for Novice Programmers



Subarna Saha, Jahangirnagar University, Dhaka, Bangladesh (subarna.stu2019@juniv.edu)
Alif Al Hasan, Case Western Reserve University, Cleveland, Ohio, USA (alifal.hasan@case.edu)
Fariha Tanjim Shifat, Missouri University of Science and Technology, Rolla, Missouri, USA (fsvfh@mst.edu)
Mia Mohammad Imran, Missouri University of Science and Technology, Rolla, Missouri, USA (imranm@mst.edu)

Abstract

Novice programmers often struggle to comprehend code due to vague naming, deep nesting, and poor structural organization. While explanations may offer partial support, they typically do not restructure the code itself. We propose code refactoring as cognitive scaffolding, where cognitively guided refactoring automatically restructures code to improve clarity. We operationalize this in CDDRefactorER, an automated approach grounded in Cognitive-Driven Development that constrains transformations to reduce control-flow complexity while preserving behavior and structural similarity. We evaluate CDDRefactorER using two benchmark datasets (MBPP and APPS) against two models (gpt-5-nano and kimi-k2), and a controlled human-subject study with novice programmers. Across datasets and models, CDDRefactorER reduces refactoring failures by 54-71% and substantially lowers the likelihood of increased cyclomatic and cognitive complexity during refactoring, compared to unconstrained prompting. Results from the human study show consistent improvements in novice code comprehension, with function identification increasing by 31.3% and structural readability by 22.0%. The findings suggest that cognitively guided refactoring offers a practical and effective mechanism for enhancing novice code comprehension.

CCS Concepts: • Social and professional topics → Computer science education; • Human-centered computing → User studies.
EASE 2026, Glasgow, Scotland, United Kingdom
© 2026 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-X-XXXX-XXXX-X/2026/XX
https://doi.org/10.1145/XXXXXXX.XXXXXXX

[Figure 1: Examples from Reddit where novice programmers describe difficulties in understanding others' code. One r/learnprogramming post reads: "FYI I'm a CS freshman and while I love writing code on my own, I absolutely despise delving into code written by others, and running unit tests/debugging the code of others, especially if it's buggy. It feels like an exercise in frustration [...] How do I teach myself to not hate reading other people's code?" Another asks: "Anyone Else Find It Extremely Difficult To Read Someone Else's Code? I find this very difficult, and no one else seems to struggle with this. When I have to read someone else's code to figure out how their program works... it feels like I'm trying to punch through a brick wall."]

ACM Reference Format: Subarna Saha, Alif Al Hasan, Fariha Tanjim Shifat, and Mia Mohammad Imran. 2026. Improving Code Comprehension through Cognitive-Load Aware Automated Refactoring for Novice Programmers. In Proceedings of The 30th International Conference on Evaluation and Assessment in Software Engineering (EASE 2026).
ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/XXXXXXX.XXXXXXX

1 Introduction

Program comprehension is a central activity in software development and a persistent challenge for novice programmers [18, 34, 47, 62, 63]. Despite acquiring foundational syntactic and semantic knowledge, novice programmers frequently struggle to comprehend existing code, including identifying program purpose, tracing control flow, and recognizing functional decomposition. Critically, prior research does not attribute these difficulties to syntactic unfamiliarity. It attributes them to structure, specifically, to deep nesting, complex control flow, and unclear modular boundaries [11, 22, 48, 61, 62].

Figure 1 illustrates this pattern through examples drawn from public programming forums, where novice programmers consistently report that structural organization, not language syntax, is what defeats their attempts at comprehension. Recent empirical work corroborates this: structural breakdowns are a documented trigger for confusion and frustration among novices, and are closely associated with cognitive overload during learning activities [31].

EASE 2026, 9–12 June, 2026, Glasgow, Scotland, United Kingdom. Subarna Saha, Alif Al Hasan, Fariha Tanjim Shifat, and Mia Mohammad Imran.

The structural account of novice difficulty motivates a structural intervention. Refactoring is commonly used to improve code readability by restructuring code while preserving behavior [26]. If structural characteristics constitute the primary barrier to novice comprehension, then targeted restructuring should, in principle, lower that barrier. Prior empirical work, however, challenges this inference. Refactoring does not consistently improve novice comprehension [61].
Approaches that emphasize structural reorganization or metric reduction without accounting for the cognitive effort required during comprehension [42, 49, 61] may produce transformations that reduce conventional complexity metrics while simultaneously increasing the reasoning burden on novices, disrupting the control-flow paths and data dependencies they had begun to trace [13, 69]. The efficacy of refactoring therefore depends not solely on behavioral preservation but on how structural modifications are constrained.

Cognitive-Driven Development (CDD) provides a principled basis for such constraints [65]. CDD bounds control-flow complexity within individual code units according to working memory capacity limits [41], aligning program structure with the cognitive demands of comprehension. Prior work confirms that CDD constraints improve code readability and support developer reasoning [25, 51, 53]. However, manual application of CDD remains inconsistent in practice, and novice programmers cannot apply these constraints reliably without external scaffolding [13, 67].

This gap motivates CDDRefactorER, an automated refactoring system that encodes CDD-inspired constraints into the prompting strategy of large language models. Unconstrained prompting produces inconsistent and sometimes counterproductive transformations. CDDRefactorER, by contrast, directs the model to identify code units that exceed cognitive thresholds and to apply refactoring strategies grounded in novice comprehension research, including method extraction, nesting reduction, identifier improvement, and sequential flow organization [4, 13, 23, 44], while preserving behavioral fidelity and structural similarity to the original code [26, 33, 69]. CDDRefactorER does not perform semantic transformation [1], bug fixing, or program repair [36].
Whether cognitive constraints improve refactoring safety, structural control, and novice comprehension requires empirical validation. We pursue that validation through the following three research questions:

RQ1: Evaluation of Baseline Unconstrained Prompt. How does an unconstrained prompt perform in refactoring tasks intended to be novice-programmer friendly? We find that unconstrained prompting preserves functional correctness in most cases on novice-oriented benchmarks, but still produces non-trivial refactoring failures. An error analysis reveals that failures commonly arise from unintended logic alterations, injected domain assumptions, and small value discrepancies.

RQ2: Validation of CDDRefactorER. How does CDDRefactorER-guided refactoring differ from unconstrained prompting in correctness and code structure? Across two benchmark datasets and two language models, CDDRefactorER reduces refactoring failures by 54.40% to 71.23% relative to unconstrained prompting. It substantially lowers the likelihood of increases in cyclomatic and cognitive complexity during refactoring, and preserves higher structural similarity to the original code, indicating more controlled and stable transformations.

RQ3: Impact on Comprehension. How does systematic automatic refactoring using CDDRefactorER affect novice programmers' ability to understand code? Results from a controlled between-subjects human study with 20 novice programmers show consistent improvements in self-reported code comprehension after interacting with CDDRefactorER. The largest gains are observed in function identification (+31.3%) and structural readability (+22.0%), while challenges related to unfamiliar programming concepts persist.

Contributions.
This paper makes two primary contributions: (i) it introduces CDDRefactorER, a cognitively constrained automated refactoring approach, and (ii) it provides empirical evidence of its effects on refactoring correctness, code structure, and novice code comprehension. The replication package is publicly available [3].

2 Background and Related Work

This section provides the theoretical and empirical context for our work. We first give an overview of Cognitive-Driven Development (CDD), which forms the basis of our refactoring constraints. We then review literature on cognitive load in programming with an emphasis on novice comprehension, and examine prior work on refactoring for code comprehension.

Cognitive-Driven Development (CDD). CDD [65], grounded in Cognitive Load Theory [64] and cognitive complexity research [12], is a software development approach that constrains code structure based on limits of human working memory [41]. Rather than optimizing for abstract structural metrics alone, CDD emphasizes bounding the cognitive effort required to reason about control flow and nesting within individual code units [12]. CDD quantifies structural complexity through Intrinsic Complexity Points (ICPs), which assign costs to control-flow constructs such as conditionals, loops, and their nesting depth. ICPs are computed by aggregating the contributions of control-flow constructs within a function. Each construct contributes a base cost, and additional cost is incurred through nesting. The resulting ICP value is compared against predefined thresholds to determine whether a code unit exceeds acceptable structural complexity [65]. The following example illustrates how ICPs are assigned.
Consider a function that checks whether a number is prime:

```python
def is_prime(n):
    if n <= 1:              # +1 ICP (branch)
        return False
    else:                   # +1 ICP (branch)
        i = 2
        while i < n:        # +1 ICP (loop)
            if n % i == 0:  # +1 ICP (branch inside loop)
                return False
            else:           # +1 ICP (nested branch)
                i += 1
    return True
```

In this example, conditional branches and loops contribute a total of five ICPs. The same logic can be expressed with fewer control-flow constructs:

```python
def is_prime(n):
    if n <= 1:              # +1 ICP (branch)
        return False
    i = 2
    while i < n:            # +1 ICP (loop)
        if n % i == 0:      # +1 ICP (branch inside loop)
            return False
        i += 1
    return True
```

This version totals three ICPs due to the removal of nested branches, while preserving the original program behavior. CDD defines structural complexity using ICP counts and predefined thresholds. Prior work shows that CDD benefits both professional software development tasks and novice programmers (e.g., picking up a new language) [25]. Researchers reported that manually applying CDD is helpful for improving code readability [7] and refactoring [53].

Cognitive Load in Programming. Programming requires developers to reason about control flow, data dependencies, and intermediate program state, which places demands on working memory [68]. Cognitive Load Theory distinguishes between intrinsic load, extraneous load, and germane load, and has been used to analyze programming tasks and learning outcomes [64].
Prior literature reviews highlight the prevalence of CLT in computing education research and emphasize strategies such as scaffolding to manage high element interactivity in code [10, 19]. Neuroimaging and physiological studies demonstrate that code comprehension activates brain regions associated with working memory and attention, with increasing complexity correlating with higher neural load [22, 23, 27, 57]. Emerging research on AI-assisted tools, such as GitHub Copilot, suggests that these systems can reduce cognitive load by automating repetitive coding tasks, allowing developers to focus on higher-level reasoning [8, 72]. However, Prather et al. show that while such tools reduce syntactic burden for novices, they introduce additional metacognitive demands related to verification and understanding of generated code, indicating a shift rather than a net reduction in cognitive load [54]. Recent studies suggest that when cognitive load remains unmanaged, novices experience persistent confusion and frustration, linking working memory overload to observable affective struggle [31].

Refactoring for Code Comprehension. Refactoring restructures source code to improve clarity and maintainability without altering external behavior [26]. Prior studies associate structural transformations such as decomposition, improved identifiers, and reduced cyclomatic complexity with lower cognitive load and improved program comprehension [11, 59, 61, 62]. Additional work examines finer-grained structural practices, including selective annotation and localized restructuring, with reported benefits for readability [28, 40, 60]. However, empirical evidence shows that readability improvements do not consistently translate to better novice comprehension when refactoring disrupts structural familiarity or existing mental models [69].
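As a concrete (and hypothetical) illustration of the kind of transformation these studies examine, combining reduced nesting with descriptive identifiers while leaving behavior untouched:

```python
# Before: deeply nested control flow and vague names
# (hypothetical example, not drawn from the paper's datasets)
def f(xs):
    r = []
    for x in xs:
        if x is not None:
            if x % 2 == 0:
                r.append(x * x)
    return r

# After: flattened control flow and descriptive names,
# with identical behavior on every input
def square_even_numbers(numbers):
    squares = []
    for number in numbers:
        if number is None or number % 2 != 0:
            continue  # skip non-values and odd numbers early
        squares.append(number * number)
    return squares
```

Both functions return the same results; only the structure and naming differ, which is exactly the kind of behavior-preserving change whose comprehension effects the cited work investigates.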
Code smells are also shown to hinder novice performance, while constrained refactorings can improve learning outcomes [4, 33]. Manual refactoring remains difficult for novices and is associated with semantic errors [67, 69]. Think-aloud and replication studies report that novices reason locally and struggle to apply refactoring strategies without guidance [9, 13]. As a result, prior work has explored external scaffolding, including LLM-generated explanations, which support understanding while leaving code structure unchanged [16, 24, 37, 58]. More recent work investigates guided and LLM-based refactoring, reporting improved refactoring quality alongside sensitivity to prompting and novice over-trust in generated outputs [6, 14, 20, 46, 50, 71].

Although prior research has examined cognitive load in programming and automated refactoring independently, their integration for novice code comprehension remains underexplored, a gap this study aims to address.

3 Methodology

Figure 2 shows the overview of the methodology. It has two components: (i) the design of CDDRefactorER and (ii) its evaluation. We design CDDRefactorER by incorporating CDD principles, as described in Section 2. We then evaluate the approach using two complementary studies. First, we benchmark two prompting strategies on the MBPP [45] and APPS [32] datasets: (i) an unguided zero-shot baseline prompt without structural constraints, and (ii) a CDD-guided prompt, CDDRefactorER. We evaluate these refactorings in terms of functional correctness and structural complexity, measured using cyclomatic and cognitive complexity metrics, in RQ1 and RQ2.

Second, we conduct a controlled human-subject study to assess the impact of CDD-guided refactoring on novice code comprehension (RQ3). The following sections describe the prompt design used in the empirical study and the human study methodology.
3.1 Prompt Engineering

We compare two prompting strategies for automated refactoring.

3.1.1 Unconstrained Zero-shot Prompt (Baseline). The baseline prompt instructs the model to refactor code for readability and maintainability without imposing any explicit structural or cognitive constraints. It serves as a representative unconstrained refactoring approach. The prompt is included in the replication package [3].

Baseline Prompt: "You are an AI assistant specialized in refactoring code for novice programmers. Your goal is to make the code more readable, understandable, and maintainable. [...]"

3.1.2 CDDRefactorER Prompt. The CDDRefactorER prompt operationalizes three CDD principles: defining Intrinsic Complexity Points (ICPs), constraining code complexity to human cognitive capacity, and reducing ICPs through refactoring. Following the original formulation, the prompt assigns ICP values [65]. We assign the ICP values and ICP limits in a code block based on the work by de Souza et al. [65] and their follow-up works [51-53]. The model is instructed to identify code units whose accumulated ICPs exceed acceptable thresholds as per Miller's law [41] and to target these units for refactoring. This emphasis on control flow is motivated by prior empirical evidence showing that nested conditionals and loops are particularly challenging for novice programmers to understand [22, 23, 48, 61, 62]. The prompt further specifies a set of refactoring strategies grounded in prior research on code comprehension for novices. Extract Method decomposes complex functions into smaller, single-purpose units [33, 59], while Reduce Nesting flattens deeply nested control structures [61, 69].
Eliminate Duplication factors out repeated code fragments [33, 69], and Simplify Boolean Returns replaces verbose conditional patterns with direct boolean expressions [69]. Descriptive Naming improves identifier clarity [59, 61], and Sequential Flow encourages chronological ordering and grouping of statements to support comprehension [64]. Each strategy is defined through explicit transformation rules and illustrated with concrete examples in the prompt.

[Figure 2: Overview of the Methodology. CDDRefactorER Design: CDD principles (measure Intrinsic Complexity Points; set complexity limits, ICPs ≤ 7, Miller's Law), refactoring strategies (Extract Method, Eliminate Duplication, Reduce Nesting, ...), and refactoring constraints (exact strings, exact numbers, exact signatures, ...), each with examples. CDDRefactorER Evaluation: MBPP + APPS with gpt-5-nano and kimi-k2, evaluated on correctness, cyclomatic complexity, cognitive complexity, and CodeBLEU. Human Study (between-subjects): 20 first-semester CS students who completed Intro to Programming (CS-101), 3 problems from the MBPP dataset; Phase I (10 participants, any resources) and Phase II (10 participants, CDDRefactorER only, default GPT-5), with pre-test and post-test assessments of program understanding and perceived clarity.]

Finally, we incorporated additional constraints derived from an error analysis of baseline prompt outputs (see Section 4). These constraints are intended to prevent unintended semantic or stylistic alterations introduced by the model. In particular, generative outputs occasionally substitute domain-specific constants, such as replacing literal values like 3.14 with π, or modify string literals by altering capitalization, for example transforming 'fizzbuzz' into 'FizzBuzz'.
The prompt explicitly discourages such changes to preserve functional behavior. The full version of the CDDRefactorER prompt is provided in the replication package [3], and a shortened version is shown below.

CDDRefactorER Prompt (Short Version)

You are CDDRefactorER, an AI that refactors code to reduce cognitive load for novice programmers while preserving exact behavior.

CDD Principles.
(1) Measure ICPs: Control structures (if: +1; [...])
(2) Set Complexity Limits: Keep ICPs ≤ 7 per function (Miller's Law: humans hold 7 ± 2 items in working memory). [...]
(3) Refactor When Exceeded: Decompose complex units into simpler, focused functions. [...] Example [...]

Refactoring Strategies.
• Extract Method: Break complex functions into single-purpose helpers.
• Eliminate Duplication: Factor out repeated code. Example: setup(); (a() if x else b()); cleanup()
• Improve Naming: Use verb_noun for functions, is/has/can for booleans.
• Reduce Nesting: [...] Examples [...]

Constraints (Do not violate): [...]
• Exact strings: "fizzbuzz" must not become "Fizzbuzz" or "FizzBuzz".
• Exact numbers: 3.14 must not become math.pi.
• Exact signatures: Don't change function names, parameters, or order. [...] Examples [...]

Task: Now for the given code snippet, do Code Refactoring using the above guideline. [...]

3.1.3 Model. We evaluated our strategy using one proprietary model and one open-source model:
• gpt-5-nano. We use the gpt-5-nano model to evaluate CDDRefactorER. This model was released on August 7, 2025.
• kimi-k2. kimi-k2 is an open-source Mixture-of-Experts language model with 32 billion active parameters drawn from a larger expert pool. It reported strong performance on competitive coding benchmarks [66].

Both models are evaluated using identical prompts and experimental conditions to isolate the effect of prompting strategy.

3.1.4 Dataset. We evaluated our approach against two datasets:
• MBPP dataset.
The MBPP dataset contains 974 introductory-level Python programs, each paired with a problem description, reference implementation, and test cases. The dataset primarily targets novice-level programming tasks and is well suited for evaluating refactoring correctness and structural changes [45].
• APPS dataset. The APPS dataset is a large-scale benchmark consisting of 10,000 programming problems and over 230k human-written solutions [32]. We restrict our analysis to the introductory subset of APPS and randomly sample 5,000 solutions from this subset to ensure tractability while maintaining diversity.

3.1.5 Metrics. We assess refactoring outcomes using multiple complementary metrics.

Functional correctness. Correctness is measured by running the refactored programs against the original test suites provided with each dataset. A refactored program is considered correct only if it passes all associated test cases.

Cyclomatic Complexity (CC). Cyclomatic complexity measures the number of independent control-flow paths in a program [39]. It is defined on the control-flow graph G as V(G) = E − N + 2P, where E is the number of edges, N is the number of nodes, and P is the number of connected components. We use this metric to capture changes in control-flow structure between the original and refactored code.

Cognitive Complexity (CogC). Cognitive complexity captures how difficult a program's control flow is to understand by accounting for control constructs and their nesting depth [12]. It is defined as CogC = Σ_{i=1}^{n} (1 + d_i), where n is the number of control-flow structures (e.g., if, for, while, catch), and d_i represents the nesting depth of the i-th structure. We use this metric to assess how refactoring affects nesting and control-flow complexity relative to the original code.
Statistical Signicance ( 𝑝 ). T o test whether dierences between baseline and refactored code are statistically signicant, we use non-parametric two-sided Wilcoxon Signed-Rank T est [70]. Eect Size. Eect size measures the magnitude of dierences be- tween conditions. W e report Cli ’s Delta ( 𝛿 ) [ 17 ], a non-parametric eect size measure for comparing two distributions. 𝛿 values are interpreted as negligible ( | 𝛿 | < 0 . 147 ), small ( 0 . 147 ≤ | 𝛿 | < 0 . 33 ), medium ( 0 . 33 ≤ | 𝛿 | < 0 . 474 ), or large ( | 𝛿 | ≥ 0 . 474 ) [38]. CodeBLEU. W e measure syntactic and semantic similarity be- tween the original and refactored code using Co deBLEU [ 56 ]. Code- BLEU extends BLEU by incorporating code-specic features such as 𝑛 -gram overlap, Abstract Syntax T ree (AST) structure, and data- ow information. CodeBLEU produces a weighted similarity score. Higher scores indicate gr eater structural and syntactic similarity between the two programs. 3.2 Human Study Design The goal of the human study is to assess ho w cognitively guided refactoring inuences novice pr ogrammers’ code comprehension. W e focus on understanding pr ogram purpose, logic ow , functional decomposition, and structural readability . 3.2.1 Participants . W e recruited 20 rst-semester computer sci- ence students (6 male, 14 female) who had completed an introduc- tory programming course (CS-101) and had no advance d course- work. Participants had b etween 0 and 2 years of programming experience and represent no vice programmers with foundational but limited exposure. Participation was voluntar y , and informed consent was obtained under Institutional Revie w Board (IRB) ap- proval. 3.2.2 Study Design . The study employed a between-subjects design consisting of two independent groups: a pre-test group and a post-test group [ 55 ]. Each group included 10 participants, and no individual participated in b oth conditions. 
This design was selected to avoid learning, familiarity, and testing effects that may arise from repeated exposure to the same code artifacts or survey instruments [15, 29]. We informed the participants that the study evaluated code comprehension rather than code writing performance, debugging ability, or task completion speed.

3.2.3 Task Sampling and Allocation. Problem Selection. Tasks were drawn from the MBPP dataset using a multi-stage selection process. Three authors independently selected candidate problems spanning three difficulty levels:
• Basic tasks required simple arithmetic or string manipulation with minimal control flow.
• Intermediate tasks involved standard data structures such as lists or dictionaries, or nested loops.
• Advanced tasks required non-trivial algorithmic reasoning or careful edge-case handling.

After merging selections, we obtained a pool of 81 candidate problems. The authors held discussion sessions and reached consensus on task difficulty classifications based on algorithmic structure and required prior knowledge. After reaching unanimous agreement, we randomly selected a final set of 20 tasks consisting of 10 basic, 5 intermediate, and 5 advanced tasks.

We assigned each participant three tasks: one basic, one intermediate, and one advanced. We randomized task assignments while ensuring comparable difficulty distributions across the two experimental groups. We set an upper bound of 15 minutes per task.

3.2.4 Platform and Tooling. We delivered all tasks and surveys through a custom Streamlit web application that controlled task presentation, resource access, and response collection. We deployed CDDRefactorER through the OpenAI platform using the default gpt-5 model [2].

3.2.5 Procedure. The experiment followed a two-phase between-subjects procedure [55]:

Phase 1: Ten randomly selected participants analyzed original, unrefactored code snippets.
Their task was to understand the program's purpose, logic flow, and structure. To reflect realistic novice learning behavior, participants were permitted to consult external online resources, including general web search engines, ChatGPT and other AI tools, and programming-related websites.

Phase 2: The other 10 participants first examined the original, unrefactored code snippets. They then used CDDRefactorER to generate a refactored version of the code. After generating the refactored code, participants analyzed both the original and the refactored versions to understand the program's purpose, logic flow, and structure. During this phase, participants restricted themselves to using CDDRefactorER only. Before beginning, they viewed a short instructional video explaining the purpose of CDDRefactorER and how to use the tool.

3.2.6 Surveys. Both pre-test and post-test conditions included code comprehension assessments administered after each task.

Pre-test surveys consisted of:
• Open-ended questions asking participants to describe what the code does and identify confusing sections.
• A 5-point Likert scale (ranging from Very Low - 1 to Very High - 5) measuring:
  – Perceived problem difficulty.
  – Understanding of overall program purpose.
  – Understanding of program logic flow.
  – Ability to identify key functions and their roles.
  – Perceived structural clarity and readability.

Post-test surveys repeated the Likert-scale comprehension questions after participants reviewed the refactored code. Open-ended questions asked participants to describe the refactored code and identify sections that became clearer after using CDDRefactorER.

Each participant completed surveys for three tasks, yielding 30 pre-test and 30 post-test responses. All collected data were anonymized to protect participant privacy.
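The study reports percentage gains on these Likert items (e.g., +31.3% for function identification). One plausible way such a figure is derived, our assumption here since the exact formula is not stated in this excerpt, is the relative change in mean rating between the pre-test and post-test groups:

```python
def relative_gain(pre_scores, post_scores):
    """Hypothetical computation (not confirmed by the paper): percent
    change in mean Likert rating from the pre-test group to the
    post-test group for one survey item."""
    mean_pre = sum(pre_scores) / len(pre_scores)
    mean_post = sum(post_scores) / len(post_scores)
    return 100.0 * (mean_post - mean_pre) / mean_pre
```

For example, a rise from a mean of 3.0 to 4.0 on a 5-point item would register as a gain of roughly 33%.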
4 Evaluation of Baseline

RQ1: How does an unconstrained prompt perform in refactoring tasks intended to be novice-programmer friendly?

To evaluate the performance of an unguided refactoring prompt on novice-oriented tasks, we first examine functional correctness, defined as whether the refactored program preserves the original behavior as validated by the MBPP test suite. Using a simple unconstrained zero-shot refactoring prompt with the gpt-5 model, we apply this baseline setting to 974 programs from the MBPP dataset. Under this criterion, the baseline successfully refactors 938 programs, corresponding to a high success rate and indicating that unconstrained prompting can often preserve functional correctness for short, well-scoped programs typical of novice benchmarks. However, functional incorrectness still occurs in 36 cases (3.70%). We analyze these 36 failing cases further in detail.

In addition to correctness, we examine how unconstrained refactoring affects code structure (see Table 3). Across the refactored programs, cognitive complexity increases in 229 cases and decreases in 231 cases, resulting in a net change of −2. Similarly, cyclomatic complexity increases in 184 cases and decreases in 232 cases, yielding a net change of −48. These results suggest that, while unconstrained refactoring can simplify structure, it does not consistently reduce complexity and may introduce structural regressions in a non-trivial number of cases.

Table 1: Error categories observed in MBPP refactoring outputs generated by the gpt-5 model under the Baseline approach.

  Error Category               Count
  Logic alteration                19
  Small value discrepancy          7
  Function signature changes       4
  Conditional logic issues         2
  Miscellaneous                    4
  Total                           36

Error Analysis.
To understand the sources of functional correctness failures, one author independently inspected the refactored programs and labeled observed errors using open coding, without relying on a predefined taxonomy, following established qualitative analysis practices [35]. A second author then reviewed the derived error categories and their assignments. Any disagreements were discussed until consensus was reached. Table 1 summarizes the distribution of identified error categories.

The most prevalent category, accounting for 50% of baseline failures, is logic alteration. These errors occur when the language model introduces incorrect logic, often based on ambiguous function naming or external domain knowledge. For example, a function named 'avg(a, b)' that originally returns 'a + b' may be "corrected" by the model to return '(a + b) / 2' due to its prior knowledge, causing the refactored program to fail the test cases despite preserving syntactic correctness. The small value discrepancy category captures errors resulting from changes in numeric constants (e.g., replacing an approximate value of π with math.pi) or from precision drift due to reordering arithmetic operations. The function signature changes category, which appears only under the baseline prompt (4/36 cases), reflects cases where the model incorrectly assumes input parameter types or modifies the function signature. Conditional logic issues involve the introduction of additional input checks during refactoring, such as enforcing constraints on parameter values that were not present

Table 2: Comparison of Incorrect Refactorings Between CDDRefactorER and the Baseline (the gray-shaded row denotes the error analysis from the baseline configuration used to inform the design of CDDRefactorER).
Dataset (N)   Model        CDDRefactorER (Incorrect)   Baseline (Incorrect)   Error Change (Reduction Rate)
MBPP (974)    gpt-5-nano   9 (0.92%)                   36 (3.70%)             −2.78% (75.00%)
              kimi-k2      11 (1.13%)                  39 (4.01%)             −2.87% (71.79%)
APPS (5000)   gpt-5-nano   83 (1.66%)                  182 (3.64%)            −1.98% (54.40%)
              kimi-k2      107 (2.14%)                 372 (7.44%)            −5.30% (71.23%)

in the original implementation (e.g., requiring the parameters of an average(a, b) function to be positive). Finally, the miscellaneous category includes a range of failures, such as syntax errors, parsing errors, and uninitialized variables.

Summary of RQ1. The results show that while functional correctness is preserved in most cases, a non-trivial number of refactorings fail due to logic alterations, injected assumptions, and small value discrepancies. Structurally, complexity reductions and increases largely cancel out, resulting in little net simplification. Overall, unconstrained prompting does not reliably produce refactorings aligned with novice comprehension needs.

5 Evaluation of CDDRefactorER

As discussed in Section 3.1.2, the design of CDDRefactorER is informed by the error patterns observed under unconstrained refactoring as well as CDD principles. Assessing comprehension after automated refactoring for beginners requires more than verifying functional correctness, as refactoring may increase structural complexity, thereby hindering code comprehension even when behavior is preserved. While reductions in code complexity do not guarantee improved understanding, prior work shows that these metrics are associated with increased comprehension effort and perceived mental difficulty [21, 30, 43]. Further, extensive or disruptive structural changes may hinder comprehension by reducing familiarity with the original code structure [33, 69]. Accordingly, we evaluate refactoring quality in terms of i) functional correctness (Table 2), ii) changes in structural complexity (Table 3), and iii) structural similarity (Figure 3).
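The correctness criterion in Table 2 is operational: a refactoring counts as correct only if it still passes the dataset's test suite. The sketch below (our own illustrative harness, not the paper's evaluation code) replays the logic-alteration failure described in the error analysis, where a misleadingly named avg function is "corrected" by the model:

```python
def avg_original(a, b):
    return a + b            # misleading name; the test suite defines the contract

def avg_refactored(a, b):
    return (a + b) / 2      # model "fixes" the logic to match the name

# A hypothetical MBPP-style test suite pinned to the original behavior.
test_suite = [((1, 2), 3), ((0, 5), 5), ((4, 4), 8)]

def passes(fn, suite):
    """A refactoring counts as functionally correct only if every test passes."""
    return all(fn(*args) == expected for args, expected in suite)

print(passes(avg_original, test_suite))    # True
print(passes(avg_refactored, test_suite))  # False: behavior was altered
```

Despite being more "plausible" for the name avg, the refactored version fails the suite, which is exactly how such cases enter the logic alteration category.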
5.1 RQ2.A: Functional Correctness

How does CDDRefactorER-guided refactoring differ from unconstrained prompting in terms of functional correctness?

Table 2 reports the number of refactored programs that fail their associated test suites on the MBPP and APPS datasets under the baseline and CDDRefactorER. Although MBPP with gpt-5 results are reported for completeness, we exclude this configuration from the discussion since the error analysis from this configuration directly informed the design of CDDRefactorER.

Across all settings, CDDRefactorER consistently reduces the number of refactoring failures relative to the baseline. On MBPP using kimi-k2, CDDRefactorER produces 963 functionally correct refactorings and 11 failures, compared to 935 correct refactorings and 39 failures under the baseline prompt, corresponding to a 71.79% reduction in errors. On the APPS dataset with gpt-5, CDDRefactorER yields 4,917 correct refactorings and 83 failures, compared to 4,818 correct refactorings and 182 failures under unconstrained prompting, resulting in a 54.40% reduction. Similarly, for kimi-k2

Table 3: Impact of Baseline and CDDRefactorER Refactoring on Cognitive and Cyclomatic Complexity (NS, *, **, ***, and **** indicate p ≥ 0.05, p < 0.05, p < 0.01, p < 0.001, and p < 0.0001, respectively; ◦, †, ‡, and § indicate negligible, small, medium, and large effect sizes).
Dataset  Model       Metric  Measure     Baseline         CDDRefactorER
MBPP     gpt-5-nano  CogC    Decrease    229 (23.51%)     170 (17.45%)
                             Increase    231 (23.72%)     85 (8.73%)
                             NET (%)     −2 (−0.21%)      85 (8.73%)
                             p-value     NS               **
                             Cliff's δ   0.024 ◦          0.180 †
                     CC      Decrease    184 (18.89%)     193 (19.82%)
                             Increase    232 (23.82%)     42 (4.31%)
                             NET (%)     −48 (−4.93%)     151 (15.50%)
                             p-value     **               ****
                             Cliff's δ   0.159 †          0.613 §
         kimi-k2     CogC    Decrease    223 (22.90%)     217 (22.28%)
                             Increase    139 (14.30%)     43 (4.41%)
                             NET (%)     84 (8.62%)       174 (17.86%)
                             p-value     ***              ****
                             Cliff's δ   0.195 †          0.539 §
                     CC      Decrease    155 (15.90%)     195 (20.02%)
                             Increase    119 (12.20%)     13 (1.33%)
                             NET (%)     36 (3.70%)       182 (18.69%)
                             p-value     *                ****
                             Cliff's δ   0.136 ◦          0.777 §
APPS     gpt-5-nano  CogC    Decrease    1746 (34.92%)    1323 (26.46%)
                             Increase    1454 (29.08%)    616 (12.32%)
                             NET (%)     292 (5.84%)      707 (14.14%)
                             p-value     **               ****
                             Cliff's δ   0.057 ◦          0.234 †
                     CC      Decrease    1272 (25.44%)    1226 (24.52%)
                             Increase    1328 (26.56%)    309 (6.18%)
                             NET (%)     −56 (−1.12%)     917 (18.34%)
                             p-value     *                ****
                             Cliff's δ   0.039 ◦          0.534 §
         kimi-k2     CogC    Decrease    1889 (37.80%)    1571 (31.42%)
                             Increase    1089 (21.80%)    506 (10.12%)
                             NET (%)     800 (16.00%)     1065 (21.30%)
                             p-value     ****             ****
                             Cliff's δ   0.209 †          0.391 ‡
                     CC      Decrease    1409 (28.20%)    1397 (27.94%)
                             Increase    751 (15.00%)     183 (3.66%)
                             NET (%)     658 (13.16%)     1214 (24.28%)
                             p-value     ****             ****
                             Cliff's δ   0.291 †          0.685 §

on APPS, the number of failing refactorings decreases from 372 to 107, corresponding to a 71.23% reduction in errors.

5.2 RQ2.B: Code Structural Change Analysis

How does CDDRefactorER-guided refactoring differ from unconstrained prompting in terms of code structure?

We measure complexity using cognitive and cyclomatic metrics. Increases reflect added structural complexity, while decreases indicate simplification. Structural similarity is assessed using CodeBLEU. It serves as a proxy for the extent of structural change.
Higher CodeBLEU scores indicate closer adherence to the original program structure, while lower scores reflect more reorganization.

5.2.1 Code Complexity Analysis. Table 3 summarizes the impact of refactoring on cognitive and cyclomatic complexity under baseline prompting and CDDRefactorER. For each configuration, we report the proportion of refactorings that decrease or increase complexity, allowing us to assess whether refactoring tends to simplify or complicate program structure relative to the original implementation. The NET effect quantifies the balance between complexity-decreasing and complexity-increasing refactorings, expressed as the percentage difference between the two.

Figure 3: CodeBLEU similarity distributions after refactoring on MBPP (top) and APPS (bottom).

As with functional correctness, we include MBPP results for the gpt-5 model in the table using CDDRefactorER for completeness but exclude them from the analysis.

Baseline behavior. Reductions and increases in complexity largely offset each other, resulting in limited net structural simplification. For example, on the APPS dataset with gpt-5, cognitive complexity decreases in 34.92% of cases and increases in 29.08% of cases, yielding a NET effect of +5.84%. For cyclomatic complexity on the same dataset, decreases occur in 25.44% of cases while increases occur in 26.56%, resulting in a negative NET effect of −1.12%. Similar offsetting patterns are observed across datasets and models, indicating that unconstrained refactoring does not reliably prevent structural regressions.
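The cyclomatic values in Table 3 follow standard decision-point counting (the exact measurement tooling is not specified in this excerpt). A simplified AST-based sketch of our own, which reproduces the CC values reported in Figure 5 for the nth_even example:

```python
import ast

def cyclomatic_complexity(source: str) -> int:
    """Simplified cyclomatic complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.If, ast.For, ast.While,
                             ast.IfExp, ast.ExceptHandler)):
            complexity += 1
        elif isinstance(node, ast.BoolOp):         # each extra and/or adds a branch
            complexity += len(node.values) - 1
        elif isinstance(node, ast.comprehension):  # for-clause plus its if-filters
            complexity += 1 + len(node.ifs)
    return complexity

original = """
def nth_even(n):
    if n == 1:
        return 0
    if n == 2:
        return 2
    if n == 3:
        return 4
    else:
        return n * 2 - 2
"""

refactored = """
def nth_even(n):
    return (n - 1) * 2
"""

print(cyclomatic_complexity(original))    # 4, as reported for the original
print(cyclomatic_complexity(refactored))  # 1, as reported for CDDRefactorER
```

Tools such as radon apply more complete rules (e.g., for async constructs), but this counting scheme matches the CC=4 and CC=1 values in Figure 5's caption.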
While most baseline configurations are statistically significant, all associated effect sizes are negligible to small, indicating limited separation between decreasing and increasing outcomes.

Original code:

    def parallel_lines(line1, line2):
        return line1[0] / line1[1] == line2[0] / line2[1]

Baseline, refactored using gpt-5-nano:

    def parallel_lines(line1, line2):
        n1, d1 = line1
        n2, d2 = line2
        # Both lines have vertical slope (undefined): considered parallel
        if d1 == 0 and d2 == 0:
            return True
        # One vertical, the other not: not parallel
        if d1 == 0 or d2 == 0:
            return False
        # Compare slopes without floating point precision issues: n1/d1 == n2/d2
        return n1 * d2 == n2 * d1

CDDRefactorER, refactored using gpt-5-nano:

    def parallel_lines(line1, line2):
        return line1[0] / line1[1] == line2[0] / line2[1]

Figure 4: Original code (top), erroneous baseline refactoring (middle), and correct CDDRefactorER refactoring (bottom).

CDDRefactorER behavior. In contrast, CDDRefactorER consistently produces positive NET effects across datasets and models by substantially reducing the proportion of complexity-increasing refactorings. On the APPS dataset with gpt-5, cognitive complexity increases drop from 29.08% under the baseline to 12.32% with CDDRefactorER, while decreases occur in 26.46% of cases, yielding a NET effect of +14.14%. For cyclomatic complexity, increases are reduced from 26.56% to 6.18%, while decreases remain comparable (24.52%), resulting in a NET effect of +18.
34%. On APPS with kimi-k2, NET effects reach +21.30% for cognitive complexity and +24.28% for cyclomatic complexity. All CDDRefactorER configurations are statistically significant, with effect sizes ranging from medium to large (except one configuration, which is small).

5.2.2 CodeBLEU Analysis. Figure 3 presents CodeBLEU distributions for both datasets and models. Across all settings, CDDRefactorER consistently produces refactored code that remains closer to the original implementation than baseline refactoring. In terms of central tendency, median CodeBLEU scores increase substantially under CDDRefactorER across both datasets and models. On the MBPP dataset, the median rises from 0.297 to 0.601 for gpt-5, corresponding to a relative increase of 102.6%, and from 0.362 to 0.635 for kimi-k2, an increase of 75.6%. On the APPS dataset, median similarity increases from 0.261 to 0.540 for gpt-5, a gain of 106.7%, and from 0.256 to 0.570 for kimi-k2, a gain of 122.7%. These results indicate that CDDRefactorER consistently preserves a substantially larger fraction of the original program structure during refactoring across both datasets and models.

Distributional analysis further supports this observation. On MBPP, the first quartile (Q1) increases from 0.219 to 0.421 for gpt-5 and from 0.277 to 0.457 for kimi-k2, while the third quartile (Q3) rises from 0.386 to 0.777 and from 0.449 to 0.846, respectively. On APPS, gpt-5 exhibits an increase in Q1 from 0.197 to 0.377 and in Q3 from 0.350 to 0.750, while kimi-k2 shows corresponding increases from 0.198 to 0.413 and from 0.335 to 0.750. The upward shift of both the lower and upper quartiles shows that CDDRefactorER yields higher structural similarity for most refactored programs.

5.2.3 Illustrative Examples. To complement the quantitative analysis, we present two representative examples that highlight qualitative differences between the baseline and CDDRefactorER.
Original code:

    def nth_even(n):
        if n == 1:
            return 0
        if n == 2:
            return 2
        if n == 3:
            return 4
        else:
            return n * 2 - 2

Baseline, refactored using gpt-5-nano:

    def nth_even(n):
        """
        Return the n-th even number (1-indexed).
        Examples: 1 -> 0, 2 -> 2, 3 -> 4
        """
        if n < 1:
            raise ValueError("n must be a positive number")
        return (n - 1) * 2

CDDRefactorER, refactored using gpt-5-nano:

    def nth_even(n):
        return (n - 1) * 2

Figure 5: Original code (CC=4, CogC=4) (top), baseline refactoring (CC=2, CogC=1) (middle), and CDDRefactorER refactoring with lowest complexity (CC=1, CogC=0) (bottom).

Figure 4 shows a function that checks whether two lines are parallel. Under unconstrained prompting, refactoring introduces additional logic based on inferred domain assumptions, altering program behavior and causing test failures. In contrast, CDDRefactorER preserves the original implementation.

Figure 5 presents a case where both approaches preserve functional correctness. The original implementation computes the n-th even number using multiple conditional branches for specific values of n, resulting in unnecessary control-flow complexity. The baseline refactoring improves the implementation by introducing a direct mathematical formula and adding input validation. In contrast, CDDRefactorER further simplifies the code by expressing the same formula in its minimal form, removing the additional checks and producing the lowest complexity among the three versions.

Summary of RQ2. CDDRefactorER consistently produces safer and more stable refactorings than unconstrained prompting.
Across datasets and models, it significantly reduces refactoring failures, limits increases in cognitive and cyclomatic complexity, and preserves greater structural similarity to the original code. These results indicate that CDD principles and the imposed constraints enable safer and more controlled automated refactoring.

6 Human Study

RQ3: How does systematic automatic refactoring using CDDRefactorER affect novice programmers' ability to understand code?

RQ2 established that CDDRefactorER produces structurally more controlled refactorings than unconstrained prompting, reducing complexity-increasing transformations and preserving greater structural similarity to the original code. RQ3 examines whether these structural properties translate into measurable differences in novice comprehension. Specifically, lower cyclomatic and cognitive complexity are hypothesized to reduce the control-flow reasoning burden on novices, while higher CodeBLEU similarity is hypothesized to preserve structural familiarity, together supporting comprehension [12, 43, 69]. We conducted a controlled between-subjects study with 20 first-semester computer science students, as described in Section 3.2, to test this.

Findings. Table 4 summarizes the average code comprehension ratings before and after exposure to CDD-refactored code. Across all four measured dimensions, participants reported higher comprehension after reviewing the refactored versions. The largest improvement was observed in Function Identification, which increased from 2.97 to 3.90 (+31.31%), indicating that refactoring substantially aided participants in recognizing functional roles within the code. Ratings for Code Structure for Readability also improved notably, rising from 3.17 to 3.87 (+22.0%), suggesting clearer structural organization.
More moderate but consistent gains were observed for Purpose Understanding, which increased from 3.23 to 3.80 (+17.65%), and Logic Flow Comprehension, which improved from 2.93 to 3.50 (+19.45%). Overall, this survey feedback suggests that cognitively guided refactoring enhances novice programmers' perceived understanding of code, particularly in terms of functional decomposition and structural clarity.

Qualitative Feedback. Open-ended post-test responses consistently indicated that CDDRefactorER improved novice programmers' perceived clarity and organization of code. Participants frequently attributed these improvements to clearer structural decomposition, stepwise logic, and more informative naming. For example, one student noted, "[...] Clear names and structure make the logic easy to follow [...]" (P09). Several participants emphasized that renaming and organization directly supported readability, reporting that "The structured way and meaningful name make the code more easier to read and understand." (P15). Participants also highlighted the value of the explanations and examples accompanying the refactored code. Many described the refactored solutions as more understandable; for example, P05 reported that the solutions were "easy to understand with example and explanations" and that "the explanation with example is great.". These responses suggest that combining structural refactoring with contextual explanations further supports comprehension beyond code-level changes alone.

For more advanced problems requiring specialized or less familiar programming concepts, responses revealed both improvement and remaining challenges. For one specific advanced task, during the pre-test, both participants (P02 and P04) reported confusion when interpreting compact or non-obvious expressions, noting that certain conditions and operations were difficult to reason about.
For the same problems, in the post-test, one participant indicated substantial improvement, stating, "I was not understanding the code earlier. But now I understood the code fully." (P19). However, not all difficulties were resolved, as the other participant continued to report challenges even after refactoring, explaining, "i don't know why but i cannot understand the part of while loop [...] may be my concept is not clear." (P05). These responses suggest that, while refactoring can alleviate structural and readability issues, it cannot fully resolve gaps in the learners' understanding of advanced programming concepts they may not be familiar with.

Table 4: Code comprehension ratings in the human study.

5-Point Likert Scale Question     Before   After   Change
Function Identification            2.97     3.90    +31.31%
Code Structure for Readability     3.17     3.87    +22.0%
Purpose Understanding              3.23     3.80    +17.65%
Logic Flow Comprehension           2.93     3.50    +19.45%

Summary of RQ3. Results from the human study show that novices report higher code comprehension after interacting with CDDRefactorER, with notable improvements in function identification, structural readability, and understanding of program logic. These findings suggest that cognitively guided refactoring can support novice comprehension by reducing cognitive overload.

7 Implications

Our study demonstrates that cognitively guided automated refactoring can meaningfully support novice code comprehension when structural changes are constrained by cognitive principles. The findings have implications for educational practice, tool design, and future research.

Implications for Educational Practice. Results from the human study indicate that cognitively guided refactoring yields the largest comprehension gains in function identification and structural readability.
This suggests that refactoring can act as an effective instructional scaffold for helping novices recognize functional decomposition and navigate control flow, two areas that are consistently reported as challenging for early learners [11, 48, 61, 62]. However, qualitative feedback shows that refactoring alone does not resolve gaps in conceptual understanding, particularly for unfamiliar programming constructs [69]. Consequently, automated refactoring should complement, rather than replace, foundational instruction [10, 19]. We recommend integrating refactoring tools after an initial manual comprehension phase, where students first attempt to understand code independently [13]. This sequencing encourages active reasoning while allowing refactored code to serve as a confirmatory or corrective artifact rather than a primary source of understanding [37].

We recommend a three-step classroom workflow: (1) students first attempt to understand the original code independently, surfacing genuine points of confusion [54, 64]; (2) instructors use CDDRefactorER to refactor units where confusion is widespread [31]; and (3) refactored and original versions are reviewed side by side, with explicit discussion of structural changes to avoid over-reliance on generated outputs [13, 54, 69].

Implications for Tool Design. Across datasets and models, CDDRefactorER substantially reduces refactoring failures and limits structural regressions compared to unconstrained prompting [5, 50]. These results indicate that cognitive principles should be treated as first-class design constraints in refactoring tools intended for novices [51, 53, 65].
The observed increase in structural similarity, as measured by CodeBLEU [56], suggests that novice-oriented tools should prioritize incremental and localized refactorings over aggressive restructuring [33, 69]. In particular, refactoring strategies that extract well-named helper functions and reduce unnecessary nesting appear especially effective [59, 61], given the significant improvement in function identification reported by participants. Tool designers should therefore emphasize bounded transformations that improve clarity while preserving familiarity with the original code structure [65, 69].

Implications for Researchers. The results show that imposing cognitively motivated structural constraints leads to refactoring outcomes that differ systematically from unconstrained prompting in terms of correctness, structural stability, and comprehension-relevant properties [12, 43]. This indicates that cognitive constraints should be treated as explicit experimental factors when studying automated refactoring systems, rather than as implicit design choices [6, 42]. Finally, the principles demonstrated here motivate future research on applying cognitively guided constraints to related program transformation tasks, such as code smell detection and remediation [33], automated program repair [36], code generation and explanation [37, 58], and other code transformation settings where trade-offs between correctness, structural change, and human understanding are central [22, 61].

8 Threats to Validity

We acknowledge several key threats to the validity of our study and describe them below.

Construct Validity. Our study uses cyclomatic complexity and cognitive complexity as proxies for structural difficulty and cognitive effort during code comprehension.
While these metrics are widely used and theoretically grounded, they are static measures and do not directly capture human cognitive processes. To mitigate this limitation, we complement metric-based analysis with a controlled human study that directly measures novice comprehension across multiple dimensions. For functional correctness, we rely on the test suites provided by the MBPP and APPS datasets. Although these test suites may not exhaustively cover all edge cases, they provide a consistent and widely accepted basis for evaluating behavior preservation across both the baseline and CDD-guided refactoring settings.

Internal Validity. Internal validity concerns whether the observed differences in outcomes can be attributed to the refactoring approach rather than to confounding factors. Because the human study uses separate pre-test and post-test groups composed of different participants, individual differences in prior knowledge and programming ability, as well as cognitive capacity (e.g., working memory and the ability to manage complex control flow), may influence comprehension outcomes independently of the refactoring condition. Although participants were drawn from the same course level and task difficulty was balanced across groups, such differences in knowledge and mental capacity cannot be fully controlled and may pose a threat to internal validity. In addition, variability in how participants interacted with CDDRefactorER, including differences in how refactored code was examined or interpreted, may affect comprehension results. Identical procedures and system settings were used to reduce procedural bias and limit systematic differences between conditions.

External Validity. The findings of this study are grounded in novice-level programming tasks drawn from the MBPP and APPS datasets, which primarily consist of small, self-contained algorithmic problems.
While these tasks are appropriate for studying novice code comprehension, they may not fully represent the complexity of real-world software systems involving larger codebases, multiple files, or domain-specific frameworks. The human study focuses on first-year undergraduate students, which matches the intended target population but limits generalization to more experienced programmers. In addition, results are based on two language models and a specific refactoring configuration, and outcomes may differ with other models, programming languages, or instructional contexts.

Conclusion Validity. The human study involves a relatively small number of participants, which limits statistical power and the ability to detect subtle effects. However, the observed improvements are consistent across multiple comprehension dimensions and are supported by qualitative feedback, increasing confidence in the reported trends. We conducted statistical analyses using appropriate non-parametric tests and reported effect sizes to support interpretation beyond significance testing alone.

9 Conclusion and Future Work

This work shows that cognitively guided automated refactoring improves both refactoring safety and novice code comprehension compared to unconstrained prompting. Across MBPP and APPS, CDDRefactorER reduced refactoring failures by 54.40–71.23% and substantially lowered the rate of structural regressions. On APPS, cognitive complexity increases fell from 29.08% to 12.32% for gpt-5-nano and from 21.8% to 10.12% for kimi-k2, while cyclomatic complexity increases dropped from 26.56% to 6.18% and from 15.0% to 3.66%, respectively. CDDRefactorER also preserved greater structural similarity, with median CodeBLEU scores rising by 75.6–122.7%, reflecting more controlled and stable transformations.
In the human study, cognitively guided refactoring led to higher self-reported comprehension across all dimensions and reduced the cognitive overload arising from code understanding, with the largest gains in function identification (+31.31%) and structural readability (+22.0%), followed by improvements in logic flow (+19.45%) and purpose understanding (+17.65%). This indicates that constraining refactoring with cognitive principles improves comprehension-relevant structure without sacrificing correctness.

Future work should evaluate whether these gains persist in longitudinal settings, scale to larger multi-file codebases, and generalize to other program transformation tasks such as program repair, code translation, and educational code generation. Further studies should also examine how varying cognitive thresholds affects the trade-off between simplification and structural familiarity.

References

[1] 2017. Code Transformation. ScienceDirect Topics, Computer Science. https://www.sciencedirect.com/topics/computer-science/code-transformation (accessed January 11, 2026).
[2] 2025. CDDRefactor GPT: A Custom GPT Tool for Cognitive-Driven Code Refactoring. OpenAI ChatGPT Custom GPT. https://chatgpt.com/g/g-6803de5d95fc81919a4cdbcb210b8200-cddrefactorgpt Accessed: 2026-03-15.
[3] 2025. Replication Package. https://zenodo.org/records/18153415
[4] Felix Adler, Gordon Fraser, Eva Grundinger, Nina Korber, Simon Labrenz, Jonas Lerchenberger, Stephan Lukasczyk, and Sebastian Schweikl. 2021. Improving Readability of Scratch Programs with Search-based Refactoring. In 2021 IEEE 21st International Working Conference on Source Code Analysis and Manipulation (SCAM). IEEE Computer Society, Los Alamitos, CA, USA, 120–130.
[5] Eman Abdullah AlOmar, Mohamed Wiem Mkaouer, and Ali Ouni.
2024. Automating Source Code Refactoring in the Classroom. In Proceedings of the 55th ACM Technical Symposium on Computer Science Education V. 1 (Portland, OR, USA) (SIGCSE 2024). Association for Computing Machinery, New York, NY, USA, 60–66.
[6] Eman Abdullah AlOmar, Luo Xu, Sofia Martinez, Anthony Peruma, Mohamed Wiem Mkaouer, Christian D Newman, and Ali Ouni. 2025. ChatGPT for Code Refactoring: Analyzing Topics, Interaction, and Effective Prompts. 35th IEEE International Conference on Collaborative Advances in Software and Computing (CASCON) (2025).
[7] Leonardo Ferreira Barbosa, Victor Hugo Pinto, Alberto Luiz Oliveira Tavares de Souza, and Gustavo Pinto. 2022. To What Extent Cognitive-Driven Development Improves Code Readability?. In Proceedings of the 16th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (Helsinki, Finland) (ESEM '22). Association for Computing Machinery, New York, NY, USA, 238–248.
[8] Shraddha Barke, Michael B James, and Nadia Polikarpova. 2023. Grounded Copilot: How programmers interact with code-generating models. Proceedings of the ACM on Programming Languages 7, OOPSLA1 (2023), 85–111.
[9] Arie Bennett and Cruz Izu. 2025. Replicating a SOLO Approach to Measure Students' Ability to Improve Code Efficiency. In Proceedings of the ACM Global Computing Education Conference 2025 Vol 1 (Gaborone, Botswana) (CompEd 2025). Association for Computing Machinery, New York, NY, USA, 43–49.
[10] João Henrique Berssanette and Antonio Carlos de Francisco. 2021. Cognitive load theory in the context of teaching and learning computer programming: A systematic literature review. IEEE Transactions on Education 65, 3 (2021), 440–449.
[11] Teresa Busjahn, Carsten Schulte, and Andreas Busjahn. 2011. Analysis of code reading to gain more insight in program comprehension.
In Proceedings of the 11th Koli Calling International Conference on Computing Education Research (Koli, Finland) (Koli Calling '11). Association for Computing Machinery, New York, NY, USA, 1–9.
[12] G Ann Campbell. 2018. Cognitive complexity: An overview and evaluation. In Proceedings of the 2018 International Conference on Technical Debt (Gothenburg, Sweden) (TechDebt '18). Association for Computing Machinery, New York, NY, USA, 57–58.
[13] Eduardo Carneiro Oliveira, Hieke Keuning, and Johan Jeuring. 2024. Investigating student reasoning in method-level code refactoring: A think-aloud study. In Proceedings of the 24th Koli Calling International Conference on Computing Education Research. 1–11.
[14] Eduardo Carneiro Oliveira, Hieke Keuning, and Johan Jeuring. 2025. Uncovering Behavioral Patterns in Student–LLM Conversations during Code Refactoring Tasks. In Proceedings of the 25th Koli Calling International Conference on Computing Education Research (Koli Calling '25). Association for Computing Machinery, New York, NY, USA, Article 39, 11 pages.
[15] Gary Charness, Uri Gneezy, and Michael A Kuhn. 2012. Experimental methods: Between-subject and within-subject design. Journal of Economic Behavior & Organization 81, 1 (2012), 1–8.
[16] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, et al. 2021. Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs.LG]
[17] Norman Cliff. 1993. Dominance statistics: Ordinal analyses to answer ordinal questions. Psychological Bulletin 114, 3 (1993), 494.
[18] Bart Du Bois, Serge Demeyer, and Jan Verelst. 2005. Does the "Refactor to Understand" Reverse Engineering Pattern Improve Program Comprehension?. In Proceedings of the Ninth European Conference on Software Maintenance and Reengineering (CSMR '05). IEEE Computer Society, USA, 334–343.
[19] Rodrigo Duran, Albina Zavgorodniaia, and Juha Sorva. 2022.
Cognitive load theory in computing education research: A review. ACM Transactions on Computing Education (TOCE) 22, 4 (2022), 1–27.
[20] Emma Ericsson. 2023. Evaluating Similarity-Based Refactoring Recommendations. Student Paper.
[21] Matteo Esposito, Andrea Janes, Terhi Kilamo, and Valentina Lenarduzzi. 2025. Early Career Developers' Perceptions of Code Understandability: A Study of Complexity Metrics. IEEE Access 13 (2025), 135027–135042.
[22] Sarah Fakhoury, Yuzhan Ma, Venera Arnaoudova, and Olusola Adesope. 2018. The effect of poor source code lexicon and readability on developers' cognitive load. In Proceedings of the 26th Conference on Program Comprehension (Gothenburg, Sweden) (ICPC '18). Association for Computing Machinery, New York, NY, USA, 286–296.
[23] Sarah Fakhoury, Devjeet Roy, Adnan Hassan, and Venera Arnaoudova. 2019. Improving source code readability: Theory and practice. In 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC). IEEE, 2–12.
[24] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020 (Findings of ACL, Vol. EMNLP 2020). Association for Computational Linguistics, 1536–1547.
[25] Ronivaldo Ferreira, Victor Hugo Santiago C. Pinto, Cleidson R. B. de Souza, and Gustavo Pinto. 2024. Assisting Novice Developers Learning in Flutter Through Cognitive-Driven Development. In Proceedings of the 38th Brazilian Symposium on Software Engineering, SBES 2024, Curitiba, Brazil, September 30 - October 4, 2024. SBC, 367–376.
[26] Martin Fowler. 2018. Refactoring: Improving the Design of Existing Code. Addison-Wesley Professional.
[27] Lucian José Gonçales, Kleinner Farias, and Bruno C. da Silva. 2021.
Measuring the cognitive load of software developers: An extended systematic mapping study. Information and Software Technology 136 (2021), 106563.
[28] Dan Gopstein, Jake Iannacone, Yu Yan, Lois DeLong, Yanyan Zhuang, Martin K.-C. Yeh, and Justin Cappos. 2017. Understanding misunderstandings in source code. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (Paderborn, Germany) (ESEC/FSE 2017). Association for Computing Machinery, New York, NY, USA, 129–139.
[29] Anthony G. Greenwald. 1976. Within-subjects designs: To use or not to use? Psychological Bulletin 83, 2 (1976), 314.
[30] Gao Hao, Haytham Hijazi, João Durães, Júlio Medeiros, Ricardo Couceiro, Chan Tong Lam, César Teixeira, João Castelhano, Miguel Castelo Branco, Paulo Carvalho, et al. 2023. On the accuracy of code complexity metrics: A neuroscience-based guideline for improvement. Frontiers in Neuroscience 16 (2023), 1065366.
[31] Alif Al Hasan, Subarna Saha, and Mia Mohammad Imran. 2026. Learning Programming in Informal Spaces: Using Emotion as a Lens to Understand Novice Struggles on r/learnprogramming. In Proceedings of the 2026 IEEE/ACM 48th International Conference on Software Engineering: Software Engineering Education and Training (ICSE-SEET '26). ACM, Rio de Janeiro, Brazil, 1–12.
[32] Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. 2021. Measuring Coding Challenge Competence With APPS. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, J. Vanschoren and S. Yeung (Eds.), Vol. 1.
[33] Felienne Hermans and Efthimia Aivaloglou. 2016. Do code smells hamper novice programming? A controlled experiment on Scratch programs. In 2016 IEEE 24th International Conference on Program Comprehension (ICPC). IEEE Computer Society, Los Alamitos, CA, USA, 1–10.
[34] John Johnson, Sergio Lubo, Nishitha Yedla, Jairo Aponte, and Bonita Sharif. 2019. An Empirical Study Assessing Source Code Readability in Comprehension. In 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE Computer Society, Los Alamitos, CA, USA, 513–523.
[35] Shahedul Huq Khandkar. 2009. Open coding. University of Calgary 23, 2009 (2009), 2009.
[36] Claire Le Goues, Michael Pradel, and Abhik Roychoudhury. 2019. Automated program repair. Commun. ACM 62, 12 (2019), 56–65.
[37] Stephen MacNeil, Andrew Tran, Arto Hellas, Joanne Kim, Sami Sarsa, Paul Denny, Seth Bernstein, and Juho Leinonen. 2023. Experiences from Using Code Explanations Generated by Large Language Models in a Web Software Development E-Book. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1 (Toronto, ON, Canada) (SIGCSE 2023). Association for Computing Machinery, New York, NY, USA, 931–937.
[38] Philomena Marfo and G. A. Okyere. 2019. The accuracy of effect-size estimates under normals and contaminated normals in meta-analysis. Heliyon 5, 6 (2019), e01838.
[39] T. J. McCabe. 1976. A Complexity Measure. IEEE Transactions on Software Engineering 2, 4 (Dec. 1976), 308–320.
[40] Flavio Medeiros, Marcio Ribeiro, Rohit Gheyi, Sven Apel, Christian Kästner, Bruno Ferreira, Luiz Carvalho, and Baldoino Fonseca. 2018. Discipline Matters: Refactoring of Preprocessor Directives in the #ifdef Hell. IEEE Transactions on Software Engineering 44, 5 (May 2018), 453–469.
[41] G. A. Miller. 1956. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review 63, 2 (1956), 81–97.
[42] Rodrigo Morales, Foutse Khomh, and Giuliano Antoniol. 2020. RePOR: Mimicking humans on refactoring tasks. Are we there yet? Empirical Software Engineering 25, 4 (2020), 2960–2996.
[43] Marvin Muñoz Barón, Marvin Wyrich, and Stefan Wagner. 2020.
An Empirical Validation of Cognitive Complexity as a Measure of Source Code Understandability. In Proceedings of the 14th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) (Bari, Italy) (ESEM '20). Association for Computing Machinery, New York, NY, USA, Article 5, 12 pages.
EASE 2026, 9–12 June, 2026, Glasgow, Scotland, United Kingdom. Subarna Saha, Alif Al Hasan, Fariha Tanjim Shifat, and Mia Mohammad Imran.
[44] Sara Nurollahian, Hieke Keuning, and Eliane Wiese. 2025. Teaching Well-Structured Code: A Literature Review of Instructional Approaches. In 2025 IEEE/ACM 37th International Conference on Software Engineering Education and Training (CSEE&T). IEEE Computer Society, Los Alamitos, CA, USA, 205–216.
[45] Augustus Odena, Charles Sutton, David Martin Dohan, Ellen Jiang, Henryk Michalewski, et al. 2021. Program Synthesis with Large Language Models.
[46] Indranil Palit and Tushar Sharma. 2025. Reinforcement Learning vs Supervised Learning: A tug of war to generate refactored code accurately. In Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering (EASE '25). Association for Computing Machinery, New York, NY, USA, 429–440.
[47] Kang-il Park, Jack Johnson, Cole S. Peterson, Nishitha Yedla, Isaac Baysinger, Jairo Aponte, and Bonita Sharif. 2024. An eye tracking study assessing source code readability rules for program comprehension. Empirical Softw. Engg. 29, 6 (Oct. 2024), 60 pages.
[48] Norman Peitek, Sven Apel, Chris Parnin, André Brechmann, and Janet Siegmund. 2021. Program Comprehension and Code Complexity Metrics: An fMRI Study. In Proceedings of the 43rd International Conference on Software Engineering (Madrid, Spain) (ICSE '21). IEEE Press, NJ, USA, 524–536.
[49] Anthony Peruma, Steven Simmons, Eman Abdullah AlOmar, Christian D. Newman, Mohamed Wiem Mkaouer, and Ali Ouni. 2022. How do I refactor this?
An empirical study on refactoring trends and topics in Stack Overflow. Empirical Software Engineering 27, 1 (2022), 11.
[50] Yonnel Chen Kuang Piao, Jean Carlors Paul, Leuson Da Silva, Arghavan Moradi Dakhel, Mohammad Hamdaqa, and Foutse Khomh. 2025. Refactoring with LLMs: Bridging Human Expertise and Machine Understanding. arXiv:2510.03914 [cs.SE]
[51] Gustavo Pinto and Alberto de Souza. 2023. Cognitive Driven Development helps software teams to keep code units under the limit! Journal of Systems and Software 206 (2023), 111830.
[52] Victor Hugo Santiago C. Pinto and Alberto Luiz Oliveira Tavares de Souza. 2022. Effects of Cognitive-driven Development in the Early Stages of the Software Development Life Cycle. In Proceedings of the 24th International Conference on Enterprise Information Systems - Volume 2: ICEIS.
[53] Victor Hugo Santiago C. Pinto, Alberto Luiz Oliveira Tavares de Souza, Yuri Matheus Barboza de Oliveira, and Danilo Monteiro Ribeiro. 2021. Cognitive-Driven Development: Preliminary Results on Software Refactorings. In Proceedings of the 16th International Conference on Evaluation of Novel Approaches to Software Engineering - ENASE. INSTICC, SciTePress, 92–102. doi:10.5220/0010408100920102
[54] James Prather, Brent N. Reeves, Paul Denny, Brett A. Becker, Juho Leinonen, Andrew Luxton-Reilly, Garrett Powell, James Finnie-Ansley, and Eddie Antonio Santos. 2023. "It's weird that it knows what I want": Usability and interactions with Copilot for novice programmers. ACM Transactions on Computer-Human Interaction 31, 1 (2023), 1–31.
[55] Raluca Budiu. 2023. Between-Subjects vs. Within-Subjects Study Design. https://www.nngroup.com/articles/between-within-subjects/. Accessed: 2026-01-10.
[56] Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis.
arXiv:2009.10297 [cs.SE]
[57] Devjeet Roy, Sarah Fakhoury, John Lee, and Venera Arnaoudova. 2020. A Model to Detect Readability Improvements in Incremental Changes. In Proceedings of the 28th International Conference on Program Comprehension (Seoul, Republic of Korea) (ICPC '20). Association for Computing Machinery, New York, NY, USA, 25–36.
[58] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, et al. 2024. Code Llama: Open Foundation Models for Code. arXiv:2308.12950 [cs.CL]
[59] Simone Scalabrino, Mario Linares-Vasquez, Denys Poshyvanyk, and Rocco Oliveto. 2016. Improving code readability models with textual features. In 2016 IEEE 24th International Conference on Program Comprehension (ICPC). IEEE Computer Society, Los Alamitos, CA, USA, 1–10.
[60] Sandro Schulze, Jörg Liebig, Janet Siegmund, and Sven Apel. 2013. Does the discipline of preprocessor annotations matter? A controlled experiment. In Proceedings of the 12th International Conference on Generative Programming: Concepts & Experiences (Indianapolis, Indiana, USA) (GPCE '13). Association for Computing Machinery, New York, NY, USA, 65–74.
[61] Giulia Sellitto, Emanuele Iannone, Zadia Codabux, Valentina Lenarduzzi, Andrea De Lucia, Fabio Palomba, and Filomena Ferrucci. 2022. Toward Understanding the Impact of Refactoring on Program Comprehension. In IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2022, Honolulu, HI, USA, March 15-18, 2022. IEEE, 731–742.
[62] Janet Siegmund, Norman Peitek, Chris Parnin, Sven Apel, Johannes Hofmeister, Christian Kästner, Andrew Begel, Anja Bethmann, and André Brechmann. 2017. Measuring neural efficiency of program comprehension. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (Paderborn, Germany) (ESEC/FSE 2017). Association for Computing Machinery, New York, NY, USA, 140–150.
[63] José Aldo Silva Da Costa and Rohit Gheyi. 2023. Evaluating the Code Comprehension of Novices with Eye Tracking. In Proceedings of the XXII Brazilian Symposium on Software Quality (Brasília, Brazil) (SBQS '23). Association for Computing Machinery, New York, NY, USA, 332–341.
[64] John Sweller. 1988. Cognitive load during problem solving: Effects on learning. Cognitive Science 12, 2 (1988), 257–285.
[65] Alberto Luiz Oliveira Tavares de Souza and Victor Hugo Santiago Costa Pinto. 2020. Toward a Definition of Cognitive-Driven Development. In 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE Computer Society, Los Alamitos, CA, USA, 776–778.
[66] Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, et al. 2025. Kimi K2: Open Agentic Intelligence. arXiv:2507.20534 [cs.LG]
[67] Peeratham Techapalokul and Eli Tilevich. 2019. Position: Manual Refactoring (by Novice Programmers) Considered Harmful. In 2019 IEEE Blocks and Beyond Workshop (B&B). IEEE Computer Society, Los Alamitos, CA, USA, 79–80.
[68] Garry L. White and Marcos P. Sivitanides. 2002. A theory of the relationships between cognitive requirements of computer programming languages and programmers' cognitive characteristics. Journal of Information Systems Education 13, 1 (2002), 59–66.
[69] Eliane S. Wiese, Anna N. Rafferty, and Armando Fox. 2019. Linking code readability, structure, and comprehension among novices: it's complicated. In Proceedings of the 41st International Conference on Software Engineering: Software Engineering Education and Training (Montreal, Quebec, Canada) (ICSE-SEET '19). IEEE Press, NJ, USA, 84–94.
[70] Frank Wilcoxon. 1945. Individual comparisons by ranking methods. Biometrics Bulletin 1, 6 (1945), 80–83.
[71] Yisen Xu, Feng Lin, Jinqiu Yang, Tse-Hsun Chen, and Nikolaos Tsantalis. 2025.
MANTRA: Enhancing Automated Method-Level Refactoring with Contextual RAG and Multi-Agent LLM Collaboration. arXiv:2503.14340 [cs.SE]
[72] Albert Ziegler, Eirini Kalliamvakou, X. Alice Li, Andrew Rice, Devon Rifkin, Shawn Simister, Ganesh Sittampalam, and Edward Aftandilian. 2024. Measuring GitHub Copilot's impact on productivity. Commun. ACM 67, 3 (2024), 54–63.
