SLOW: Strategic Logical-inference Open Workspace for Cognitive Adaptation in AI Tutoring
Authors: Yuang Wei, Ruijia Li, Bo Jiang
Yuang Wei¹ [0000-0002-8187-4011]⋆, Ruijia Li¹ [0009-0001-9680-7508]⋆, and Bo Jiang¹ [0000-0002-7914-1978]⋆⋆

¹ Shanghai Institute of Artificial Intelligence for Education, Shanghai, China

Abstract. While Large Language Models (LLMs) have demonstrated remarkable fluency in educational dialogues, most generative tutors primarily operate through intuitive, single-pass generation. This reliance on fast thinking precludes a dedicated reasoning workspace, forcing multiple diagnostic and strategic signals to be processed in a conflated manner. As a result, learner cognitive diagnosis, affective perception, and pedagogical decision-making become tightly entangled, which limits the tutoring system's capacity for deliberate instructional adaptation. We propose SLOW, a theory-informed tutoring framework that supports deliberate learner-state reasoning within a transparent decision workspace. Inspired by dual-process accounts of human tutoring, SLOW explicitly separates learner-state inference from instructional action selection. The framework integrates causal evidence parsing from learner language, fuzzy cognitive diagnosis with counterfactual stability analysis, and prospective affective reasoning to anticipate how instructional choices may influence learners' emotional trajectories. These signals are jointly considered to guide pedagogically and affectively aligned tutoring strategies. Evaluation using hybrid human-AI judgments demonstrates significant improvements in personalization, emotional sensitivity, and clarity. Ablation studies further confirm the necessity of each module, showcasing how SLOW enables interpretable and reliable intelligent tutoring through a visualized decision-making process.
This work advances the interpretability and educational validity of LLM-based adaptive instruction.

Keywords: Intelligent Tutoring Systems · Multi-Agent Systems · Learner Modeling · Explainable AI in Education

⋆ Equal contribution (philrain.cs@gmail.com, rj.yuzu.li@gmail.com).
⋆⋆ Corresponding author (bjiang@deit.ecnu.edu.cn).

1 Introduction

The "iron triangle" among scale, personalization, and quality has long been one of the central challenges in education [15]. AI has often been viewed as a possible way to ease this trade-off by enabling more scalable and adaptive forms of instructional support. More recently, the rapid advancement of LLMs has further rekindled such expectations, as their fluent natural-language interaction capabilities seem to offer new possibilities for intelligent tutoring at scale. However, linguistic fluency does not equate to the capacity for educationally meaningful reasoning and instructional decision-making.

In educational technology, efforts to address this tension have traditionally been embodied in Intelligent Tutoring Systems (ITS) [12] and adaptive learning systems (ALS) [6]. These systems typically rely on explicit learner modeling and instructional strategy adjustments to achieve personalized support. As LLMs are introduced into educational contexts, however, generative models are increasingly expected to perform both learner-state diagnosis and instructional response generation within the same conversational agent [9]. This shift blurs the boundary between learner-state diagnosis and pedagogical decision-making, creating a fundamental challenge for LLM-based tutoring: supporting explicit learner-state reasoning that can guide deliberate instructional planning [14].

In most existing LLM-driven tutoring systems, learner-state inference and instructional action decision-making are compressed into a single generation process.
As a result, multiple signals, including cognitive diagnosis, affect perception, and pedagogical intent, are handled in a highly coupled manner, making it difficult to explicitly verify diagnostic hypotheses, compare alternative pedagogical strategies, or maintain alignment between inferred learner needs and generated instructional responses. Figure 1 illustrates a typical failure scenario: when a learner expresses confusion about the chronological order of historical events, generic LLM tutoring tends to provide abstract metaphors or vague encouragement, revealing a clear pedagogical misalignment between the system's response and the learner's actual cognitive difficulty.

To address these issues, existing research has mainly proceeded along two paths. One is the theory-driven paradigm, which improves the normativity of instructional strategies by translating educational or cognitive-psychological theories into model constraints, as exemplified by SocraticLM [8], KELE [13], and PATS [14]. However, such approaches rely heavily on expert-defined rules and often struggle to generalize across domains or long-tail learning behaviors. The other is the simulation-based paradigm, which enhances system perception by constructing virtual learners, such as personality-trait-based learner modeling [9] and dialogue trajectory simulation [5, 23]. Yet because these simulations are largely driven by the probabilistic generation of LLMs, their diagnostic logic remains difficult to verify against educational measurement and cognitive principles, thereby limiting interpretability and pedagogical credibility. Despite their different implementation paths, both paradigms generally retain a tightly coupled generation process in which learner diagnosis and instructional action are resolved simultaneously.
This differs from expert teaching practice, where teachers typically elicit and interpret evidence about learner understanding and then adjust instructional actions iteratively through formative assessment [1]. As a result, existing systems often act on insufficiently verified learner models and struggle to support timely state correction during interaction [5], which can lead to persistent strategy misalignment and instructional inertia across turns.

Fig. 1. An illustrative case motivating the need for deliberate learner-state reasoning. General LLMs (left) often provide abstract metaphors because their "fast-thinking" nature leads to the entanglement of diagnosis and decision-making. Our SLOW framework (right) introduces a transparent open workspace to decouple these processes.
By performing counterfactual cognitive validation and prospective affective simulation, SLOW explicitly weighs cognitive gains against affective risks to generate guidance that is more aligned with the student's needs.

To address this, we propose SLOW, a "slow thinking" intelligent tutoring framework inspired by dual-system theory [2]. Rather than directly generating instructional responses, SLOW explicitly externalizes learner-state reasoning into an open reasoning workspace, structurally decoupling the processes of cognitive diagnosis and instructional action selection. This workspace operates through four collaborative reasoning stages: evidence parsing, cognitive validation, affect prediction, and strategy integration, responsible for extracting causally relevant evidence, validating the stability of learning states, estimating affective evolution risks, and balancing cognitive gains against affective risks to generate instructional actions. By explicitly reasoning about learner states and simulating instructional consequences before response generation, SLOW provides a transparent, traceable, and educationally rational decision path for intelligent tutoring, thereby establishing a more interpretable and trustworthy architectural foundation for LLM-driven tutoring systems.

The main contributions of this paper are as follows:

1. We reveal a structural limitation in current LLM-driven tutoring systems: the lack of explicit reasoning workspaces leads to coupling between cognitive diagnosis and instructional planning, resulting in inexplicable and error-prone instructional behaviors;
2. We propose SLOW, a learner-centered reasoning architecture that simulates expert teachers' deliberate instructional processes through an open reasoning workspace, achieving interpretable learner diagnosis and adaptive instruction within a unified framework;
3.
Through theory-driven empirical evaluation and human-AI collaborative rating, we show that SLOW improves instructional personalization, affective sensitivity, clarity, and operability, demonstrating how its traceable reasoning paths support educational effectiveness and trustworthy deployment.

2 Related Work

2.1 Dual-System Theory and Test-Time Scaling

Dual-System Theory serves as a cornerstone of cognitive psychology, bifurcating human cognition into System 1, which is intuitive and automatic ("fast thinking"), and System 2, which is logical and deliberative ("slow thinking") [2]. As the research focus of LLMs shifts from the "scaling laws" of parameters to test-time scaling [11], which increases computational steps during inference, this theory has been revitalized in the field of artificial intelligence. Frontier systems such as OpenAI o1 have demonstrated that, by incorporating Chain-of-Thought (CoT) [20] and search algorithms, models can significantly enhance their logical deduction capabilities in tasks with objective standards, such as mathematics and programming. However, existing research on test-time scaling is predominantly concentrated on domains with explicit feedback loops [15]. In the context of ITS, which involves high interactive complexity, utilizing test-time computation to enhance the perception of learner states and pedagogical planning remains an under-explored frontier.

2.2 LLM-based ITS

Existing efforts to enhance pedagogical planning in LLM-based ITS can be broadly grouped into two paradigms. The theory-driven paradigm translates educational or cognitive theories into explicit behavioral constraints. For example, SocraticLM [8] and KELE [13] use supervised fine-tuning or multi-agent strategies to strengthen guided questioning, while PATS [14] explicitly maps learner profiles to instructional strategies.
Although these methods improve the normativity of instructional behavior, they often depend heavily on expert-defined rules and struggle to generalize across domains or long-tail learner behaviors. The simulation-based paradigm improves system perception by constructing virtual learners. Prior work has used personality-based modeling to simulate diverse student responses [9], simulated misconceptions to improve error correction, and adopted frameworks such as SimTutor [10] to preview dialogue trajectories. However, because these simulations are largely driven by LLM generation rather than an explicit cognitive diagnosis model (CDM), it remains difficult to verify whether they align with psychometric principles, which limits diagnostic explainability in complex tutoring settings.

2.3 Dialogue-based Learner Modeling

Learner modeling underpins adaptive education and has evolved from static assessment based on controlled testing to dynamic inference from natural interaction. Early neural cognitive diagnosis models, such as NeuralCDM [19], established a basis for learner-state estimation but remained limited in modeling complex knowledge relations. More recent work has sought to improve robustness and transparency. For instance, tFCM [21] combines Temporal Fuzzy
Cognitive Maps with the Markov Blanket principle to support causally interpretable knowledge-state transitions. With the rise of LLMs, researchers have further explored learner modeling from dialogue. Dialogue-KT [17] uses conversational context to track knowledge mastery, but such generative approaches often lack explicit educational-measurement constraints, reducing pedagogical interpretability. DiaCDM [5] further introduces diagnostic signal extraction based on the Initiation–Response–Evaluation framework. However, its design is oriented more toward offline post hoc analysis than online interaction, making it difficult to capture rapid learner-state shifts during tutoring. As a result, existing approaches still struggle to jointly support transparent reasoning and online diagnosis in open-ended dialogue.

Fig. 2. Overview of the Strategic Logical-inference Open Workspace (SLOW) framework. The architecture illustrates the transition from (1) Evidence Parsing, where dialogue is deconstructed into cognitive and affective primitives, to (2) Cognitive Validation and (3) Affective Prediction, which utilize counterfactual simulation and prospective simulation to refine the internal state. Finally, (4) Strategy Integration balances these signals to execute a calibrated tutoring action.

3 Methods

The SLOW (Strategic Logical-inference Open Workspace) framework (Figure 2) facilitates pedagogical reasoning by constructing an explicit inference space within a generative architecture.
This design is grounded in the principles of formative assessment, which, as noted by Sadler [16], necessitates a clear understanding of a learner's current state before taking pedagogical action to close learning gaps. To operationalize this theory, SLOW simulates the internal deliberative mechanism of an expert teacher who performs structured analysis and anticipates potential consequences within a mental workspace rather than responding impulsively. Accordingly, the framework deconstructs this psychological process into four synergistic stages: evidence parsing, cognitive validation, affective prediction, and strategy integration.

3.1 Parsing Evidence

To mitigate the inherent linguistic noise and redundancy present in open-ended learner discourse, the SLOW framework initiates its reasoning process with a structured evidence-extraction stage. Grounded in the Markov Blanket (MB) [3] principle, we identify a minimal sufficient feature set for each Knowledge Component (KC) k. This ensures that, once these diagnostic features are determined, the mastery state of k remains conditionally independent of peripheral dialogue elements. The system decomposes the raw input u_orig into two distinct diagnostic streams: cognitive primitives (k, E_k) and affective triplets (p, ι, E_e). Specifically, each identified KC k and its associated evidence span E_k is transformed into a dense diagnostic encoding z_k:

z_k = Φ_MB(k, E_k; S_MB^default)    (1)

where S_MB^default represents a specialized set of MB-style feature templates designed to capture mastery-relevant indicators directly from natural language. Unlike traditional methods that rely on explicit behavioral logs, the mapping function Φ_MB derives latent proxies of mastery by analyzing the causal relevance of the learner's expressed reasoning within the current interaction context.
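The two diagnostic streams above can be sketched in code. This is a minimal illustration, not the SLOW implementation: in the paper the mapping Φ_MB is realized by an LLM over MB-style feature templates, whereas here simple keyword lexicons stand in for it, and all names (`CognitivePrimitive`, `AffectiveTriplet`, `parse_evidence`, `kc_lexicon`, `affect_lexicon`) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class CognitivePrimitive:
    kc: str            # knowledge component k
    evidence: str      # evidence span E_k
    encoding: list = field(default_factory=list)  # placeholder for z_k

@dataclass
class AffectiveTriplet:
    polarity: str      # p, e.g. "negative"
    intensity: float   # iota in [0, 1]
    evidence: str      # supporting text E_e

def parse_evidence(utterance, kc_lexicon, affect_lexicon):
    """Decompose raw learner input u_orig into cognitive primitives and an
    affective triplet. SLOW uses an LLM with MB-style feature templates;
    keyword matching stands in for that here."""
    text = utterance.lower()
    primitives = [CognitivePrimitive(kc=k, evidence=utterance)
                  for k, cues in kc_lexicon.items()
                  if any(c in text for c in cues)]
    affect = None
    for pol, cues in affect_lexicon.items():
        if any(c in text for c in cues):
            # Intensity fixed here; SLOW estimates it from the utterance.
            affect = AffectiveTriplet(polarity=pol, intensity=0.6,
                                      evidence=utterance)
            break
    return primitives, affect

prims, aff = parse_evidence(
    "I can't remember the order of these events, they are all mixed up.",
    kc_lexicon={"chronological_ordering": ["order", "sequence", "timeline"]},
    affect_lexicon={"negative": ["mixed up", "confused", "can't"]},
)
```

The output pair mirrors the label set T(u_orig): a collection of cognitive features and one initial affective state handed to the downstream stages.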
When the input contains emotional cues, the system extracts the polarity p, intensity ι, and supporting text E_e to form an affective vector e. This vector serves as the initial affective baseline for the subsequent simulation of emotional trajectories. The comprehensive output of this stage is the label set T(u_orig), which consolidates the collection of cognitive features Z = {z_k} and the initial affective state e for downstream deliberative analysis.

3.2 Cognitive Validation

Upon obtaining the structural features z_k, the system initiates an analysis of the learner's cognitive state. Recognizing that knowledge acquisition is a non-discrete evolutionary process, we utilize a Fuzzy Cognitive Discriminator to represent the mastery level μ(C_k) [21]. The state is quantified as a continuous membership distribution across four hierarchical levels:

μ(C_k) = (μ_Un, μ_InK, μ_K, μ_L)    (2)

where the states correspond to Unknown (Un), Insufficiently Known (InK), Known (K), and Learned (L), respectively.

To evaluate the stability of this diagnosis, the system executes a Counterfactual Simulation within the deliberative workspace. This process assesses the robustness of the observed state by constructing a counterfactual hypothesis (e.g., "Assume the learner already possesses state K") and deducing the corresponding typical features F_k^sim. The system evaluates the alignment between the empirical reality F_k^orig and the simulated profile F_k^sim by calculating two contrastive signals:

Δ_sim = Diff(F_k^orig, F_k^sim)    (3)

where Δ_sim identifies the directional mismatch signal. These signals, along with the counterfactual effort Δ_cf required to shift between states, are then processed by Fuzzy tools [21] to iteratively refine the diagnostic score.
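To make the validation loop concrete, here is a toy sketch under stated assumptions (all names hypothetical, and simple set mismatch replacing the paper's fuzzy tools): membership mass over the four levels is repeatedly pulled toward the hypothesis whose simulated feature profile F_sim best matches the observed features F_orig.

```python
# Membership states of Eq. (2); the mismatch below is a crude stand-in
# for the Diff(.) signal of Eq. (3).
STATES = ("Un", "InK", "K", "L")

def feature_mismatch(f_orig, f_sim):
    # Fraction of expected indicator features absent from reality.
    missing = [f for f in f_sim if f not in f_orig]
    return len(missing) / max(len(f_sim), 1)

def validate(mu, f_orig, expected_features, lr=0.5, steps=3):
    """Iteratively refine the membership distribution mu; expected_features
    maps each hypothesized state to its typical indicator set F_sim."""
    mu = dict(mu)
    for _ in range(steps):
        fit = {s: 1.0 - feature_mismatch(f_orig, expected_features[s])
               for s in STATES}
        # Move each membership toward its counterfactual fit score.
        mu = {s: (1 - lr) * mu[s] + lr * fit[s] for s in STATES}
        z = sum(mu.values())
        mu = {s: v / z for s, v in mu.items()}
    return mu

mu0 = {"Un": 0.25, "InK": 0.25, "K": 0.25, "L": 0.25}
expected = {
    "Un":  ["no_recall"],
    "InK": ["partial_recall", "no_structure"],
    "K":   ["full_recall", "structure"],
    "L":   ["full_recall", "structure", "transfer"],
}
mu_final = validate(mu0, f_orig=["partial_recall", "no_structure"],
                    expected_features=expected)
# Membership concentrates on InK: both of its expected indicators are observed.
```

Running the loop from a uniform prior, the distribution stabilizes on InK, which matches the intuition that the learner shows partial recall without structural organization.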
This validation loop ensures that the final output reaches a stable cognitive context C_final, thereby enhancing the transparency and interpretability of the modeling process.

3.3 Affective Prediction

Simultaneously, the system activates prospective affective simulation to anticipate the impact of potential pedagogical interventions. As illustrated in Figure 2, Block 3, the simulator utilizes the emotional triplets extracted during the parsing stage to initialize the current state e_before, representing the learner's baseline polarity and intensity. To evaluate the emotional trajectory, the system conducts a forward rollout by simulating a pool of candidate tutor responses {r^(1), ..., r^(M)} to predict the learner's emotional next state e_after^(m) for each candidate r^(m). A transition score Δ^(m) is calculated to evaluate the affective shift between the current state e_before and the predicted state e_after^(m):

Δ^(m) = Score(e_before, e_after^(m))    (4)

where m denotes the index of the response draft being evaluated. The optimal simulation result is then synthesized into the final Affective Prediction signal, encoded as a control vector F_emo = (emo_cur, int_cur, tgt_cur). In this formulation, emo_cur and int_cur specify the target polarity and intensity, while tgt_cur identifies a prescriptive control target (e.g., "encourage" or "stabilize"). This prospective mechanism ensures that the instructional path is motivationally supportive by mitigating the risk of triggering learner frustration before the response is actually delivered.

3.4 Strategy Integration

In the final stage, the system integrates the simulated cognitive and affective signals to facilitate a multi-criteria decision-making process.
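The prospective rollout of Sec. 3.3 reduces to scoring candidate drafts by their predicted affective shift. In this sketch the LLM-based transition model is replaced by a hypothetical lookup (`predict_after`), and affect is collapsed to a single valence scalar rather than the paper's (polarity, intensity) pair; the candidate names are illustrative only.

```python
def predict_after(e_before, response_style):
    """Stand-in transition model: how much each tutoring style is assumed
    to shift learner valence (bounded to [-1, 1])."""
    shift = {"socratic_probe": -0.2,   # risks frustrating an anxious learner
             "vague_praise":    0.05,  # mild, unspecific reassurance
             "small_step":      0.4}   # concrete, low-load next step
    return max(-1.0, min(1.0, e_before + shift[response_style]))

def score(e_before, e_after):
    # Delta^(m) of Eq. (4): reward positive movement, penalize worsening affect.
    return e_after - e_before

def choose_response(e_before, candidates):
    """Forward rollout over the candidate pool {r^(1), ..., r^(M)}."""
    rollouts = {r: predict_after(e_before, r) for r in candidates}
    best = max(candidates, key=lambda r: score(e_before, rollouts[r]))
    return best, rollouts[best]

best, e_after = choose_response(
    e_before=-0.6,  # confused, mildly frustrated learner
    candidates=["socratic_probe", "vague_praise", "small_step"],
)
```

With a frustrated baseline, the rollout prefers the concrete small-step response, mirroring the motivating example in Figure 1 where Socratic probing is predicted to aggravate anxiety.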
As illustrated in Figure 2, Block 4, the framework identifies the optimal pedagogical focus k* by ranking candidates according to their priority scores:

k* = argmax_{k_i} Priority(s_i, μ_i, E_{k_i})    (5)

This decision logic performs a critical trade-off among three primary dimensions for each candidate knowledge component k_i. Specifically, the integrator evaluates the mastery severity s_i along the pedagogical hierarchy (Un → InK → K → L), the diagnostic confidence μ_i derived from the membership stability and counterfactual analysis, and the richness of supporting evidence E_{k_i} extracted from the dialogue.

The associated state s* of the selected focus k* subsequently determines the instructional stance, ranging from foundational scaffolding for state Un to transfer-oriented extension for state L. To generate the final tutoring response y_resp, the system couples the finalized cognitive diagnosis C_final with the affective control vector F_emo. By balancing instructional necessity with emotional stability on a conceptual scale, the framework ensures that the resulting tutoring action is both cognitively precise and motivationally supportive.

4 Experimental Design

This section details the experimental configuration, covering the construction of the evaluation dataset, the selection and configuration of baseline models, and the evaluation mechanism.

4.1 Dataset

The evaluation dataset is designed to capture representative tutoring challenges in authentic instructional contexts. It combines authentic student–teacher interaction data with model-augmented samples. The source data were drawn from real interaction corpora, reflecting typical cognitive hurdles and affective expressions, such as confusion over circuit current directions, fragmented knowledge acquisition, and learner doubt with affective undertones.
To increase diversity, we used a large language model to generate 200 additional samples conditioned on these authentic inquiries, while preserving the original knowledge topics and difficulty levels and varying surface formulations and affective tones. All candidate instances were then manually reviewed by educational experts to remove ambiguous or pedagogically unrealistic cases, yielding a final set of 100 expert-validated, high-quality instances.

The final dataset spans K1–K12 and covers seven disciplines: Biology (20), Physics (20), Mathematics (20), History (14), Geography (12), Chemistry (10), and English (4). It also balances five scenario types, namely Affective Support (32), Personalized Support (26), Strategic Scaffolding (22), Direct Q&A (12), and Error Correction (8), as well as three emotion categories: Positive (36), Neutral (32), and Negative (32). Together, these properties ensure pedagogical realism, cross-disciplinary coverage, and controlled variation in learner states.

4.2 Baselines

To verify the generalizability of SLOW across different foundation models, we selected representative models from three major families: GPT-4o and GPT-4o-mini from OpenAI, Gemini-1.5-Pro and Gemini-1.5-Flash from Google, and DeepSeek-V3 alongside the reasoning-enhanced DeepSeek-R1.

To ensure a rigorous and fair comparison, all baselines are configured with strong prompting and follow a two-stage pipeline in which the model first explicitly diagnoses the learner's cognitive and affective states before generating a final response. Critically, the baseline prompts explicitly incorporate the full set of evaluation rubrics.
This setup ensures that the baselines are fully aware of the scoring preferences, thereby confirming that any performance gains from SLOW stem from its internal open reasoning workspace and simulation mechanisms rather than mere prompt engineering or information disparity. The baseline prompts have been open-sourced at https://github.com/PhilrainV/SLOW.

4.3 Metrics

Tutoring responses are evaluated using a principle-driven framework grounded in cognitive load theory [18] and formative feedback principles [16]. Rather than relying on reference-based matching, which is ill-suited to open-ended tutoring where effective strategies are inherently non-unique, our framework assesses response quality directly with respect to pedagogical appropriateness and cognitive efficiency.

Each tutoring response is evaluated along seven response-level dimensions. These dimensions operationalize core aspects of effective tutoring, including diagnostic appropriateness, controlled cognitive load, and guidance toward concrete next steps. The framework explicitly penalizes excessive verbosity, redundant explanations, and multiple parallel solution paths that increase cognitive and decision-making load, while favoring concise, focused responses that align with the learner's expressed understanding and instructional needs and propose a minimal actionable step. Table 1 summarizes the complete evaluation rubric.

The evaluation protocol follows a single-blind procedure in which human evaluators remain unaware of the model identity behind each response. Each response is independently rated by two experts and an automated judge (a GPT-5-equivalent model) using the 0–100 scale rubrics. The final score is derived from an equally weighted average of the human and LLM ratings.
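Assuming "equally weighted average of human and LLM ratings" means the two expert scores are first averaged and that mean is then averaged with the automated judge's score (the paper leaves the exact weighting implicit), the aggregation is a one-liner; `aggregate` is an illustrative name, not from the released code.

```python
def aggregate(human_scores, llm_score):
    """Final score = 0.5 * mean(human ratings) + 0.5 * LLM rating,
    under the equal-weighting assumption stated above (0-100 scale)."""
    human_mean = sum(human_scores) / len(human_scores)
    return 0.5 * human_mean + 0.5 * llm_score

final = aggregate(human_scores=[80, 84], llm_score=78)  # -> 80.0
```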
5 Results

Our evaluation analyzes the performance of the SLOW framework from four distinct perspectives: (i) a multi-model comparison evaluating pedagogical gains over baseline models across diverse dimensions; (ii) an ablation study to determine the contribution of individual architectural components; (iii) a computational efficiency analysis assessing the trade-off between reasoning overhead and instructional quality; and (iv) an interpretability analysis demonstrating the transparent reasoning process within the workspace.

5.1 Model Comparison

Table 2 summarizes the performance gains of SLOW over the prompt-based baseline across six tutoring dimensions. Across all model families, SLOW consistently improves response quality on nearly all dimensions, with similar improvement patterns observed for both large and compact models, suggesting

Table 1. Evaluation rubric for personalized tutoring quality.

- Clarity: The response is easy to understand, well-structured, and unambiguous. Excessive verbosity, redundant explanations, or unnecessary enumeration are penalized due to increased cognitive load.
- Goal Clarity: The response makes its instructional intent explicit, enabling the learner to clearly understand the immediate learning objective for the current turn.
- Emotion Sensitivity: The response appropriately attends to emotional cues expressed in the learner's utterance, providing reassurance, encouragement, or neutral guidance when appropriate, without exaggerated or unnecessary affective language.
- Self-comparison: The response frames feedback in terms of the learner's own progress and remaining gaps, emphasizing personal improvement rather than peer comparison or competitive evaluation.
- Personalization: The response is tailored to the learner's expressed difficulty or apparent level of understanding, avoiding generic, template-based, or broadly applicable explanations.
- Actionability: The response provides a specific, minimal, and immediately executable next step. Responses that present multiple parallel options are penalized for increasing decision-making and cognitive load.
- Overall Score: A holistic judgment of tutoring quality, reflecting instructional usefulness, emotional appropriateness, and effective management of cognitive load.

that the gains are attributable to the framework's structural design rather than being solely driven by model capacity. To assess the robustness of these results, we conducted Wilcoxon signed-rank tests for each of the nine backbone models over the evaluation instances. SLOW outperformed the baseline significantly for all nine models, with p < 0.001 for seven models and p < 0.01 for the remaining two. Furthermore, Cliff's δ values ranged from 0.42 to 0.59, indicating medium-to-large effect sizes. Notable improvements are observed in clarity and goal clarity, particularly for DeepSeek-R1 and GPT-4.1, suggesting that SLOW improves instructional focus and response clarity. A small degradation in clarity is observed for DeepSeek-V3 [7], possibly reflecting its tendency toward more verbose intermediate reasoning. In contrast, large gains in actionability (e.g., +62.4 for DeepSeek-R1 [4]) highlight SLOW's effectiveness in translating diagnostic insights into concrete next-step guidance.

An ablation study compares the full SLOW framework with variants removing Cognitive Validation or Affective Prediction. As shown in Table 3, the full system performs best across all models. Removing Cognitive Validation generally causes substantial degradation, especially for DeepSeek-R1, while removing Affective Prediction also leads to clear performance losses across models. These results suggest that cognitive validation and affective prediction provide complementary benefits for effective tutoring.

Table 2. Comparative analysis of performance gains: score differences between SLOW and the baseline model across tutoring dimensions.

Model              ΔClar.   ΔGoal.   ΔEmo.    ΔSelfComp.  ΔPers.   ΔAct.    ΔOverall
deepseek-r1        +41.4    +42.2    +20.4    +47.2       +19.4    +62.4    +38.0
deepseek-v3        -6.80    +32.2    +5.40    +25.4       +7.20    +17.0    +10.6
gemini-2.5-flash   +31.4    +48.6    +28.2    +39.2       +19.0    +36.8    +29.8
gemini-2.5-pro     +35.6    +37.4    +18.8    +25.2       +17.8    +40.6    +30.0
gemini-3-pro       +28.8    +22.0    +14.2    +25.4       +11.0    +21.8    +20.4
gpt-4.1            +42.2    +36.2    +15.2    +32.6       +12.6    +55.0    +20.9
gpt-4.1-mini       +22.2    +33.0    +20.0    +31.2       +9.20    +25.8    +14.2
gpt-4o             +29.4    +33.8    +14.6    +28.2       +10.0    +39.4    +20.8
gpt-4o-mini        +14.8    +20.0    +12.8    +23.0       +4.00    +16.2    +8.6

Table 3. Overall Score in the ablation study. "w/o" denotes "without".

Ablation Setting                 GPT-4.1   Gemini-3-Pro   DeepSeek-R1
Baseline                         59.6      62.0           50.2
SLOW w/o Cognitive Validation    73.0      78.6           72.8
SLOW w/o Affective Prediction    77.2      77.8           66.0
SLOW (Full)                      83.0      88.6           89.6

Because these comparisons rely on rubric-based evaluation, we further examined the reliability of the scoring framework. Specifically, we computed Cronbach's α across rubric dimensions as an index of internal consistency for the scoring framework. For the two human experts, the ratings showed good internal consistency (α = 0.84), while the hybrid human–AI evaluation (including the LLM rater) also maintained good internal consistency (α = 0.81). Inter-rater agreement between the two human experts was measured by ICC(2,1), yielding a value of 0.78.
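For reference, Cronbach's α as used here (items = rubric dimensions, observations = rated responses) can be computed in a few lines; the ratings below are made-up illustrative data, not the study's.

```python
def variance(xs):
    # Population variance.
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(rows):
    """rows: one list of per-dimension scores per rated response.
    alpha = k/(k-1) * (1 - sum(item variances) / variance(totals))."""
    k = len(rows[0])                               # number of rubric dimensions
    items = [[r[i] for r in rows] for i in range(k)]
    totals = [sum(r) for r in rows]
    return k / (k - 1) * (1 - sum(variance(it) for it in items)
                          / variance(totals))

# Four responses scored on four dimensions (illustrative numbers only).
ratings = [
    [80, 70, 85, 75],
    [60, 65, 55, 62],
    [90, 85, 92, 88],
    [70, 75, 68, 73],
]
alpha = cronbach_alpha(ratings)  # high: dimensions move together
```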
In addition, the rank-order alignment between human and LLM judgments reached a Spearman's ρ = 0.72. These results indicate that the scoring framework is sufficiently reliable.

5.2 Computational Efficiency and Cost

To evaluate the trade-off between pedagogical gain and computational overhead, we conducted a cost analysis using GPT-4o-mini as the backbone model with deterministic decoding (temperature = 0). Relative to the standard Baseline (1.0× cost, 65.4 overall score) and the EduPlanner-style framework [22] (3.6× cost, 73.2 overall score), SLOW incurs a computational cost of 6.4× and achieves an overall score of 79.6.

Fig. 3. A case of the SLOW Reasoning Workspace demonstrating interpretability.

Notably, SLOW does not rely on a fixed-length reasoning chain; additional iterations are triggered only when diagnostic inconsistencies are detected, yielding a median of 6 and an 80th percentile of 7 API calls per instance. To test whether these gains are merely due to increased inference-time compute, we compared SLOW against a compute-matched 7-step refinement control (Refine-7). Refine-7 uses seven sequential calls (draft, critique, revision, critique, revision, critique, and final revision) with the same backbone model, decoding setting, and evaluation rubric, resulting in a similar cost of 6.2×. Unlike SLOW, Refine-7 does not maintain explicit learner-state representations or implement explicit cognitive–affective decomposition, but instead relies on unconstrained multi-turn refinement. Despite the closely matched budget (within 5% of SLOW's token budget), Refine-7 achieved an overall score of only 71.2, substantially lower than SLOW's 79.6. These results suggest that SLOW's gains are not explained by additional compute alone, but by its structured pedagogical architecture.
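The Refine-7 control follows a fixed draft–critique–revise schedule of seven sequential calls. A minimal sketch of that protocol, where `call_model` is a hypothetical stub standing in for one backbone LLM API call (the real control would hit the same model endpoint as SLOW):

```python
# Sketch of the compute-matched Refine-7 control: one draft call followed
# by three critique/revision rounds, for seven calls in total.
# `call_model` is a placeholder stub, not a real LLM client.

def call_model(prompt: str) -> str:
    return f"<model response to: {prompt[:40]}...>"  # stub output

def refine_7(question: str) -> tuple[str, int]:
    calls = 0

    def step(prompt: str) -> str:
        nonlocal calls
        calls += 1
        return call_model(prompt)

    answer = step(f"Draft a tutoring response to: {question}")   # call 1
    for _ in range(3):                                           # calls 2-7
        critique = step(f"Critique this tutoring response: {answer}")
        answer = step(f"Revise the response given this critique: {critique}")
    return answer, calls
```

Unlike SLOW, this loop carries no learner-state representation between steps; each round sees only the previous answer and critique, which matches the "unconstrained multi-turn refinement" described above.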
5.3 Interpretability Analysis

To demonstrate the interpretability of SLOW beyond surface-level explanations, we developed an interactive Reasoning Workspace (Figure 3) that externalizes the system's internal deliberation as a human-readable System 2 pedagogical trace. Rather than providing post-hoc rationales for a generated response, the workspace exposes the intermediate diagnostic assumptions, counterfactual evaluations, and strategy trade-offs that shape instructional decisions. As illustrated in the historical sequencing case, the workspace supports interpretability at three complementary levels:

– Diagnostic Traceability. When the learner reports difficulty recalling event order, the workspace reveals how the system parses this utterance into a specific cognitive hypothesis, namely a deficit in chronological structuring rather than factual recall. Through iterative cognitive validation, the learner's mastery profile is refined from an initial Unknown (Un) state to Insufficiently Known (InK), with explicit evidence indicating fragmented knowledge and missing structural anchors. This allows human observers to inspect the final diagnosis and the evidential path leading to it.
– Risk-Aware Strategy Selection. The workspace makes instructional deliberation explicit by displaying candidate strategies that were considered and rejected. In this case, a Socratic questioning approach is discarded because prospective simulation predicts heightened anxiety and cognitive blockage given the learner's current state. By exposing these alternatives, the system clarifies why certain pedagogically plausible actions are intentionally avoided, supporting informed auditing of instructional risk management.
– Calibrated Feedback Rationale. Finally, the workspace provides a plain-language justification for the selected action.
It explains how supplying minimal chronological anchors reduces immediate retrieval load and enables the learner to reorganize fragmented knowledge independently. This rationale connects diagnostic conclusions to concrete instructional choices in a manner accessible to both teachers and learners.

By decomposing tutoring behavior into observable diagnostic, affective, and strategic layers, the SLOW Reasoning Workspace moves beyond black-box automation. It functions as an auditable pedagogical interface that allows stakeholders to understand, evaluate, and trust how instructional decisions are formed, thereby supporting both educational validity and responsible deployment of LLM-based tutors.

6 Discussion & Conclusion

This paper proposes the SLOW framework, which introduces an open reasoning workspace to enable systematic consideration of learners' cognitive and affective factors prior to instructional response generation. Based on this design, we construct a structured reasoning process consisting of evidence parsing, cognitive validation, affective reasoning, and strategy integration. Empirical evaluation demonstrates that SLOW improves instructional specificity, actionability, and pedagogical coherence.

Alongside these improvements, the proposed design also introduces corresponding costs. By incorporating explicit reasoning and simulation during interaction, SLOW may result in longer response latency and could inadvertently amplify internal model biases toward diverse learner profiles. Consequently, future work should explore memory summarization and state-caching mechanisms to reduce redundant reasoning costs, alongside dedicated bias detection and mitigation protocols to enhance pedagogical fairness. Whether this trade-off between
reasoning effort and instructional quality can be stably translated into verifiable teaching effectiveness in real educational settings remains an open question for further empirical investigation. Beyond directly supporting learners, the transparent reasoning process exposed by SLOW may also serve teachers by illustrating how instructional decisions can be formed through systematic consideration of cognition and affect. In addition, the framework can function as a reasoning-aware data synthesis mechanism, providing more structured reasoning data for the training of next-generation educational language models.

Acknowledgements. This work was supported by the National Natural Science Foundation of China (Grant No. 62477012), the Natural Science Foundation of Shanghai, China (Grant No. 23ZR1418500), the AI for Science Program of the Shanghai Municipal Commission of Economy and Informatization, China (Grant No. 2025-GZL-RGZN-BTBX-01014), and the Major Program of Philosophy and Social Sciences Research of the Ministry of Education (Grant No. 2025JZDZ054).

References

1. Black, P., Wiliam, D.: Assessment and classroom learning. Assessment in Education: Principles, Policy & Practice 5(1), 7–74 (1998)
2. Frankish, K.: Dual-process and dual-system theories of reasoning. Philosophy Compass 5(10), 914–926 (2010)
3. Fu, S., Desmarais, M.C.: Markov blanket based feature selection: a review of past decade. In: Proceedings of the World Congress on Engineering. vol. 1, pp. 321–328. Newswood Ltd., Hong Kong, China (2010)
4. Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)
5.
Jia, R., Wei, Y., Li, R., Jiang, Y.H., Xie, X., Shen, Y., Zhang, M., Jiang, B.: DiaCDM: Cognitive diagnosis in teacher-student dialogues using the initiation-response-evaluation framework. arXiv preprint arXiv:2509.24821 (2025)
6. Kabudi, T., Pappas, I., Olsen, D.H.: AI-enabled adaptive learning systems: A systematic mapping of the literature. Computers and Education: Artificial Intelligence 2, 100017 (2021)
7. Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al.: DeepSeek-V3 technical report. arXiv preprint (2024)
8. Liu, J., Huang, Z., Xiao, T., Sha, J., Wu, J., Liu, Q., Wang, S., Chen, E.: SocraticLM: Exploring Socratic personalized teaching with large language models. Advances in Neural Information Processing Systems 37, 85693–85721 (2024)
9. Liu, Z., Yin, S.X., Lin, G., Chen, N.: Personality-aware student simulation for conversational intelligent tutoring systems. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 626–642 (2024)
10. Manh Hung, N., Sebastian, T., Victor-Alexandru, P., Alkis, G., Adish, S.: Synthesizing high-quality programming tasks with LLM-based expert and student agents (2025)
11. Muennighoff, N., Yang, Z., Shi, W., Li, X.L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., Hashimoto, T.B.: s1: Simple test-time scaling. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 20286–20332 (2025)
12. Murray, T.: An overview of intelligent tutoring system authoring tools: Updated analysis of the state of the art. Authoring Tools for Advanced Technology Learning Environments: Toward Cost-Effective Adaptive, Interactive and Intelligent Educational Software, pp. 491–544 (2003)
13.
Peng, X., Yuan, P., Li, D., Cheng, J., Fang, Q., Liu, Z.: KELE: A multi-agent framework for structured Socratic teaching with large language models. In: Findings of the Association for Computational Linguistics: EMNLP 2025. pp. 16342–16362 (2025)
14. Rooein, D., Chowdhury, S.P., Eremeeva, M., Qin, Y., Nozza, D., Sachan, M., Hovy, D.: PATS: Personality-aware teaching strategies with large language model tutors. arXiv preprint arXiv:2601.08402 (2026)
15. Ryan, T., French, S., Kennedy, G.: Beyond the iron triangle: Improving the quality of teaching and learning at scale. Studies in Higher Education 46(7), 1383–1394 (2021)
16. Sadler, D.R.: Formative assessment and the design of instructional systems. Instructional Science 18(2), 119–144 (1989)
17. Scarlatos, A., Baker, R.S., Lan, A.: Exploring knowledge tracing in tutor-student dialogues using LLMs. In: Proceedings of the 15th International Learning Analytics and Knowledge Conference. pp. 249–259 (2025)
18. Sweller, J.: Cognitive load theory. In: Psychology of Learning and Motivation, vol. 55, pp. 37–76. Elsevier, Amsterdam (2011)
19. Wang, F., Liu, Q., Chen, E., Huang, Z., Chen, Y., Yin, Y., Huang, Z., Wang, S.: Neural cognitive diagnosis for intelligent education systems. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 34, pp. 6153–6161 (2020)
20. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837 (2022)
21. Wei, Y., Jiang, B.: Interpretable cognitive state prediction via temporal fuzzy cognitive map. IEEE Transactions on Learning Technologies 17, 514–526 (2023)
22. Zhang, X., Zhang, C., Sun, J., Xiao, J., Yang, Y., Luo, Y.: EduPlanner: LLM-based multi-agent systems for customized and intelligent instructional design.
IEEE Transactions on Learning Technologies (2025)
23. Zhang, Z., Zhang-Li, D., Yu, J., Gong, L., Zhou, J., Hao, Z., Jiang, J., Cao, J., Liu, H., Liu, Z., et al.: Simulating classroom education with LLM-empowered agents. In: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). pp. 10364–10379 (2025)