SLOW: Strategic Logical-inference Open Workspace for Cognitive Adaptation in AI Tutoring
Authors: Yuang Wei, Ruijia Li, Bo Jiang
Yuang Wei¹ [0000-0002-8187-4011]⋆, Ruijia Li¹ [0009-0001-9680-7508]⋆, and Bo Jiang¹ [0000-0002-7914-1978]⋆⋆

¹ Shanghai Institute of Artificial Intelligence for Education, Shanghai, China

Abstract. While Large Language Models (LLMs) have demonstrated remarkable fluency in educational dialogues, most generative tutors primarily operate through intuitive, single-pass generation. This reliance on fast thinking precludes a dedicated reasoning workspace, forcing multiple diagnostic and strategic signals to be processed in a conflated manner. As a result, learner cognitive diagnosis, affective perception, and pedagogical decision-making become tightly entangled, which limits the tutoring system's capacity for deliberate instructional adaptation. We propose SLOW, a theory-informed tutoring framework that supports deliberate learner-state reasoning within a transparent decision workspace. Inspired by dual-process accounts of human tutoring, SLOW explicitly separates learner-state inference from instructional action selection. The framework integrates causal evidence parsing from learner language, fuzzy cognitive diagnosis with counterfactual stability analysis, and prospective affective reasoning to anticipate how instructional choices may influence learners' emotional trajectories. These signals are jointly considered to guide pedagogically and affectively aligned tutoring strategies. Evaluation using hybrid human-AI judgments demonstrates significant improvements in personalization, emotional sensitivity, and clarity. Ablation studies further confirm the necessity of each module, showcasing how SLOW enables interpretable and reliable intelligent tutoring through a visualized decision-making process.
This work advances the interpretability and educational validity of LLM-based adaptive instruction.

Keywords: Intelligent Tutoring Systems · Multi-Agent Systems · Learner Modeling · Explainable AI in Education

⋆ Equal contribution (philrain.cs@gmail.com, rj.yuzu.li@gmail.com).
⋆⋆ Corresponding author (bjiang@deit.ecnu.edu.cn).

1 Introduction

The "iron triangle" among scale, personalization, and quality has long been one of the central challenges in education [15]. AI has often been viewed as a possible way to ease this trade-off by enabling more scalable and adaptive forms of instructional support. More recently, the rapid advancement of LLMs has further rekindled such expectations, as their fluent natural-language interaction capabilities seem to offer new possibilities for intelligent tutoring at scale. However, linguistic fluency does not equate to the capacity for educationally meaningful reasoning and instructional decision-making.

In educational technology, efforts to address this tension have traditionally been embodied in Intelligent Tutoring Systems (ITS) [12] and adaptive learning systems (ALS) [6]. These systems typically rely on explicit learner modeling and instructional strategy adjustments to achieve personalized support. As LLMs are introduced into educational contexts, however, generative models are increasingly expected to perform both learner-state diagnosis and instructional response generation within the same conversational agent [9]. This shift blurs the boundary between learner-state diagnosis and pedagogical decision-making, creating a fundamental challenge for LLM-based tutoring: supporting explicit learner-state reasoning that can guide deliberate instructional planning [14].

In most existing LLM-driven tutoring systems, learner-state inference and instructional action decision-making are compressed into a single generation process.
As a result, multiple signals, including cognitive diagnosis, affect perception, and pedagogical intent, are handled in a highly coupled manner, making it difficult to explicitly verify diagnostic hypotheses, compare alternative pedagogical strategies, or maintain alignment between inferred learner needs and generated instructional responses. Figure 1 illustrates a typical failure scenario: when a learner expresses confusion about the chronological order of historical events, generic LLM tutoring tends to provide abstract metaphors or vague encouragement, revealing a clear pedagogical misalignment between the system's response and the learner's actual cognitive difficulty.

To address these issues, existing research has mainly proceeded along two paths. One is the theory-driven paradigm, which improves the normativity of instructional strategies by translating educational or cognitive-psychological theories into model constraints, as exemplified by SocraticLM [8], KELE [13], and PATS [14]. However, such approaches rely heavily on expert-defined rules and often struggle to generalize across domains or long-tail learning behaviors. The other is the simulation-based paradigm, which enhances system perception by constructing virtual learners, such as personality-trait-based learner modeling [9] and dialogue trajectory simulation [5, 23]. Yet because these simulations are largely driven by the probabilistic generation of LLMs, their diagnostic logic remains difficult to verify against educational measurement and cognitive principles, thereby limiting interpretability and pedagogical credibility. Despite their different implementation paths, both paradigms generally retain a tightly coupled generation process in which learner diagnosis and instructional action are resolved simultaneously.
This differs from expert teaching practice, where teachers typically elicit and interpret evidence about learner understanding and then adjust instructional actions iteratively through formative assessment [1]. As a result, existing systems often act on insufficiently verified learner models and struggle to support timely state correction during interaction [5], which can lead to persistent strategy misalignment and instructional inertia across turns.

Fig. 1. An illustrative case motivating the need for deliberate learner-state reasoning. General LLMs (left) often provide abstract metaphors because their "fast-thinking" nature leads to the entanglement of diagnosis and decision-making. Our SLOW framework (right) introduces a transparent open workspace to decouple these processes.
By performing counterfactual cognitive validation and prospective affective simulation, SLOW explicitly weighs cognitive gains against affective risks to generate guidance that is more aligned with the student's needs.

To address this, we propose SLOW, a "slow thinking" intelligent tutoring framework inspired by dual-system theory [2]. Rather than directly generating instructional responses, SLOW explicitly externalizes learner-state reasoning into an open reasoning workspace, structurally decoupling the processes of cognitive diagnosis and instructional action selection. This workspace operates through four collaborative reasoning stages: evidence parsing, cognitive validation, affect prediction, and strategy integration, responsible for extracting causally relevant evidence, validating the stability of learning states, estimating affective evolution risks, and balancing cognitive gains against affective risks to generate instructional actions. By explicitly reasoning about learner states and simulating instructional consequences before response generation, SLOW provides a transparent, traceable, and educationally rational decision path for intelligent tutoring, thereby establishing a more interpretable and trustworthy architectural foundation for LLM-driven tutoring systems.

The main contributions of this paper are as follows:

1. We reveal a structural limitation in current LLM-driven tutoring systems: the lack of explicit reasoning workspaces leads to coupling between cognitive diagnosis and instructional planning, resulting in inexplicable and error-prone instructional behaviors;
2. We propose SLOW, a learner-centered reasoning architecture that simulates expert teachers' deliberate instructional processes through an open reasoning workspace, achieving interpretable learner diagnosis and adaptive instruction within a unified framework;
3.
Through theory-driven empirical evaluation and human-AI collaborative rating, we show that SLOW improves instructional personalization, affective sensitivity, clarity, and operability, demonstrating how its traceable reasoning paths support educational effectiveness and trustworthy deployment.

2 Related Work

2.1 Dual-System Theory and Test-Time Scaling

Dual-System Theory serves as a cornerstone of cognitive psychology, bifurcating human cognition into System 1, which is intuitive and automatic ("fast thinking"), and System 2, which is logical and deliberative ("slow thinking") [2]. As the research focus of LLMs shifts from the "scaling laws" of parameters to test-time scaling [11], which increases computational steps during inference, this theory has been revitalized in the field of artificial intelligence. Frontier systems such as OpenAI o1 have demonstrated that, by incorporating Chain-of-Thought (CoT) [20] and search algorithms, models can significantly enhance their logical deduction capabilities in tasks with objective standards, such as mathematics and programming. However, existing research on test-time scaling is predominantly concentrated on domains with explicit feedback loops [15]. In the context of ITS, which involves high interactive complexity, utilizing test-time computation to enhance the perception of learner states and pedagogical planning remains an under-explored frontier.

2.2 LLM-based ITS

Existing efforts to enhance pedagogical planning in LLM-based ITS can be broadly grouped into two paradigms. The theory-driven paradigm translates educational or cognitive theories into explicit behavioral constraints. For example, SocraticLM [8] and KELE [13] use supervised fine-tuning or multi-agent strategies to strengthen guided questioning, while PATS [14] explicitly maps learner profiles to instructional strategies.
Although these methods improve the normativity of instructional behavior, they often depend heavily on expert-defined rules and struggle to generalize across domains or long-tail learner behaviors. The simulation-based paradigm improves system perception by constructing virtual learners. Prior work has used personality-based modeling to simulate diverse student responses [9], simulated misconceptions to improve error correction, and adopted frameworks such as SimTutor [10] to preview dialogue trajectories. However, because these simulations are largely driven by LLM generation rather than an explicit cognitive diagnosis model (CDM), it remains difficult to verify whether they align with psychometric principles, which limits diagnostic explainability in complex tutoring settings.

2.3 Dialogue-based Learner Modeling

Learner modeling underpins adaptive education and has evolved from static assessment based on controlled testing to dynamic inference from natural interaction. Early neural cognitive diagnosis models, such as NeuralCDM [19], established a basis for learner-state estimation but remained limited in modeling complex knowledge relations. More recent work has sought to improve robustness and transparency. For instance, tFCM [21] combines Temporal Fuzzy
Cognitive Maps with the Markov Blanket principle to support causally interpretable knowledge-state transitions. With the rise of LLMs, researchers have further explored learner modeling from dialogue. Dialogue-KT [17] uses conversational context to track knowledge mastery, but such generative approaches often lack explicit educational-measurement constraints, reducing pedagogical interpretability. DiaCDM [5] further introduces diagnostic signal extraction based on the Initiation–Response–Evaluation framework. However, its design is oriented more toward offline post hoc analysis than online interaction, making it difficult to capture rapid learner-state shifts during tutoring. As a result, existing approaches still struggle to jointly support transparent reasoning and online diagnosis in open-ended dialogue.

Fig. 2. Overview of the Strategic Logical-inference Open Workspace (SLOW) framework. The architecture illustrates the transition from (1) Evidence Parsing, where dialogue is deconstructed into cognitive and affective primitives, to (2) Cognitive Validation and (3) Affective Prediction, which utilize counterfactual simulation and prospective simulation to refine the internal state. Finally, (4) Strategy Integration balances these signals to execute a calibrated tutoring action.

3 Methods

The SLOW (Strategic Logical-inference Open Workspace) framework (Figure 2) facilitates pedagogical reasoning by constructing an explicit inference space within a generative architecture.
This design is grounded in the principles of formative assessment, which, as noted by Sadler [16], necessitates a clear understanding of a learner's current state before taking pedagogical action to close learning gaps. To operationalize this theory, SLOW simulates the internal deliberative mechanism of an expert teacher who performs structured analysis and anticipates potential consequences within a mental workspace rather than responding impulsively. Accordingly, the framework deconstructs this psychological process into four synergistic stages: evidence parsing, cognitive validation, affective prediction, and strategy integration.

3.1 Parsing Evidence

To mitigate the inherent linguistic noise and redundancy present in open-ended learner discourse, the SLOW framework initiates its reasoning process with a structured evidence-extraction stage. Grounded in the Markov Blanket (MB) [3] principle, we identify a minimal sufficient feature set for each Knowledge Component (KC) k. This ensures that, once these diagnostic features are determined, the mastery state of k remains conditionally independent of peripheral dialogue elements. The system decomposes the raw input u_orig into two distinct diagnostic streams: cognitive primitives (k, E_k) and affective triplets (p, ι, E_e). Specifically, each identified KC k and its associated evidence span E_k is transformed into a dense diagnostic encoding z_k:

z_k = Φ_MB(k, E_k; S_MB^default)    (1)

where S_MB^default represents a specialized set of MB-style feature templates designed to capture mastery-relevant indicators directly from natural language. Unlike traditional methods that rely on explicit behavioral logs, the mapping function Φ_MB derives latent proxies of mastery by analyzing the causal relevance of the learner's expressed reasoning within the current interaction context.
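The two diagnostic streams above can be sketched in code. This is a minimal illustration, not the SLOW implementation: in the paper the mapping Φ_MB is realized by an LLM over MB-style feature templates, whereas here simple keyword lexicons stand in for it, and all names (`CognitivePrimitive`, `AffectiveTriplet`, `parse_evidence`, `kc_lexicon`, `affect_lexicon`) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class CognitivePrimitive:
    kc: str            # knowledge component k
    evidence: str      # evidence span E_k
    encoding: list = field(default_factory=list)  # placeholder for z_k

@dataclass
class AffectiveTriplet:
    polarity: str      # p, e.g. "negative"
    intensity: float   # iota in [0, 1]
    evidence: str      # supporting text E_e

def parse_evidence(utterance, kc_lexicon, affect_lexicon):
    """Decompose raw learner input u_orig into cognitive primitives and an
    affective triplet. SLOW uses an LLM with MB-style feature templates;
    keyword matching stands in for that here."""
    text = utterance.lower()
    primitives = [CognitivePrimitive(kc=k, evidence=utterance)
                  for k, cues in kc_lexicon.items()
                  if any(c in text for c in cues)]
    affect = None
    for pol, cues in affect_lexicon.items():
        if any(c in text for c in cues):
            # Intensity fixed here; SLOW estimates it from the utterance.
            affect = AffectiveTriplet(polarity=pol, intensity=0.6,
                                      evidence=utterance)
            break
    return primitives, affect

prims, aff = parse_evidence(
    "I can't remember the order of these events, they are all mixed up.",
    kc_lexicon={"chronological_ordering": ["order", "sequence", "timeline"]},
    affect_lexicon={"negative": ["mixed up", "confused", "can't"]},
)
```

The output pair mirrors the label set T(u_orig): a collection of cognitive features and one initial affective state handed to the downstream stages.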
When the input contains emotional cues, the system extracts the polarity p, intensity ι, and supporting text E_e to form an affective vector e. This vector serves as the initial affective baseline for the subsequent simulation of emotional trajectories. The comprehensive output of this stage is the label set T(u_orig), which consolidates the collection of cognitive features Z = {z_k} and the initial affective state e for downstream deliberative analysis.

3.2 Cognitive Validation

Upon obtaining the structural features z_k, the system initiates an analysis of the learner's cognitive state. Recognizing that knowledge acquisition is a non-discrete evolutionary process, we utilize a Fuzzy Cognitive Discriminator to represent the mastery level μ(C_k) [21]. The state is quantified as a continuous membership distribution across four hierarchical levels:

μ(C_k) = (μ_Un, μ_InK, μ_K, μ_L)    (2)

where the states correspond to Unknown (Un), Insufficiently Known (InK), Known (K), and Learned (L), respectively.

To evaluate the stability of this diagnosis, the system executes a Counterfactual Simulation within the deliberative workspace. This process assesses the robustness of the observed state by constructing a counterfactual hypothesis (e.g., "Assume the learner already possesses state K") and deducing the corresponding typical features F_k^sim. The system evaluates the alignment between the empirical reality F_k^orig and the simulated profile F_k^sim by calculating two contrastive signals:

Δ_sim = Diff(F_k^orig, F_k^sim)    (3)

where Δ_sim identifies the directional mismatch signal. These signals, along with the counterfactual effort Δ_cf required to shift between states, are then processed by Fuzzy tools [21] to iteratively refine the diagnostic score.
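To make the validation loop concrete, here is a toy sketch under stated assumptions (all names hypothetical, and simple set mismatch replacing the paper's fuzzy tools): membership mass over the four levels is repeatedly pulled toward the hypothesis whose simulated feature profile F_sim best matches the observed features F_orig.

```python
# Membership states of Eq. (2); the mismatch below is a crude stand-in
# for the Diff(.) signal of Eq. (3).
STATES = ("Un", "InK", "K", "L")

def feature_mismatch(f_orig, f_sim):
    # Fraction of expected indicator features absent from reality.
    missing = [f for f in f_sim if f not in f_orig]
    return len(missing) / max(len(f_sim), 1)

def validate(mu, f_orig, expected_features, lr=0.5, steps=3):
    """Iteratively refine the membership distribution mu; expected_features
    maps each hypothesized state to its typical indicator set F_sim."""
    mu = dict(mu)
    for _ in range(steps):
        fit = {s: 1.0 - feature_mismatch(f_orig, expected_features[s])
               for s in STATES}
        # Move each membership toward its counterfactual fit score.
        mu = {s: (1 - lr) * mu[s] + lr * fit[s] for s in STATES}
        z = sum(mu.values())
        mu = {s: v / z for s, v in mu.items()}
    return mu

mu0 = {"Un": 0.25, "InK": 0.25, "K": 0.25, "L": 0.25}
expected = {
    "Un":  ["no_recall"],
    "InK": ["partial_recall", "no_structure"],
    "K":   ["full_recall", "structure"],
    "L":   ["full_recall", "structure", "transfer"],
}
mu_final = validate(mu0, f_orig=["partial_recall", "no_structure"],
                    expected_features=expected)
# Membership concentrates on InK: both of its expected indicators are observed.
```

Running the loop from a uniform prior, the distribution stabilizes on InK, which matches the intuition that the learner shows partial recall without structural organization.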
This validation loop ensures that the final output reaches a stable cognitive context C_final, thereby enhancing the transparency and interpretability of the modeling process.

3.3 Affective Prediction

Simultaneously, the system activates prospective affective simulation to anticipate the impact of potential pedagogical interventions. As illustrated in Figure 2, Block 3, the simulator utilizes the emotional triplets extracted during the parsing stage to initialize the current state e_before, representing the learner's baseline polarity and intensity. To evaluate the emotional trajectory, the system conducts a forward rollout by simulating a pool of candidate tutor responses {r^(1), ..., r^(M)} to predict the learner's emotional next state e_after^(m) for each candidate r^(m). A transition score Δ^(m) is calculated to evaluate the affective shift between the current state e_before and the predicted state e_after^(m):

Δ^(m) = Score(e_before, e_after^(m))    (4)

where m denotes the index of the response draft being evaluated. The optimal simulation result is then synthesized into the final Affective Prediction signal, encoded as a control vector F_emo = (emo_cur, int_cur, tgt_cur). In this formulation, emo_cur and int_cur specify the target polarity and intensity, while tgt_cur identifies a prescriptive control target (e.g., "encourage" or "stabilize"). This prospective mechanism ensures that the instructional path is motivationally supportive by mitigating the risk of triggering learner frustration before the response is actually delivered.

3.4 Strategy Integration

In the final stage, the system integrates the simulated cognitive and affective signals to facilitate a multi-criteria decision-making process.
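The prospective rollout of Sec. 3.3 reduces to scoring candidate drafts by their predicted affective shift. In this sketch the LLM-based transition model is replaced by a hypothetical lookup (`predict_after`), and affect is collapsed to a single valence scalar rather than the paper's (polarity, intensity) pair; the candidate names are illustrative only.

```python
def predict_after(e_before, response_style):
    """Stand-in transition model: how much each tutoring style is assumed
    to shift learner valence (bounded to [-1, 1])."""
    shift = {"socratic_probe": -0.2,   # risks frustrating an anxious learner
             "vague_praise":    0.05,  # mild, unspecific reassurance
             "small_step":      0.4}   # concrete, low-load next step
    return max(-1.0, min(1.0, e_before + shift[response_style]))

def score(e_before, e_after):
    # Delta^(m) of Eq. (4): reward positive movement, penalize worsening affect.
    return e_after - e_before

def choose_response(e_before, candidates):
    """Forward rollout over the candidate pool {r^(1), ..., r^(M)}."""
    rollouts = {r: predict_after(e_before, r) for r in candidates}
    best = max(candidates, key=lambda r: score(e_before, rollouts[r]))
    return best, rollouts[best]

best, e_after = choose_response(
    e_before=-0.6,  # confused, mildly frustrated learner
    candidates=["socratic_probe", "vague_praise", "small_step"],
)
```

With a frustrated baseline, the rollout prefers the concrete small-step response, mirroring the motivating example in Figure 1 where Socratic probing is predicted to aggravate anxiety.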
As illustrated in Figure 2, Block 4, the framework identifies the optimal pedagogical focus k* by ranking candidates according to their priority scores:

k* = argmax_{k_i} Priority(s_i, μ_i, E_{k_i})    (5)

This decision logic performs a critical trade-off among three primary dimensions for each candidate knowledge component k_i. Specifically, the integrator evaluates the mastery severity s_i along the pedagogical hierarchy (Un → InK → K → L), the diagnostic confidence μ_i derived from the membership stability and counterfactual analysis, and the richness of supporting evidence E_{k_i} extracted from the dialogue.

The associated state s* of the selected focus k* subsequently determines the instructional stance, ranging from foundational scaffolding for state Un to transfer-oriented extension for state L. To generate the final tutoring response y_resp, the system couples the finalized cognitive diagnosis C_final with the affective control vector F_emo. By balancing instructional necessity with emotional stability on a conceptual scale, the framework ensures that the resulting tutoring action is both cognitively precise and motivationally supportive.

4 Experimental Design

This section details the experimental configuration, covering the construction of the evaluation dataset, the selection and configuration of baseline models, and the evaluation mechanism.

4.1 Dataset

The evaluation dataset is designed to capture representative tutoring challenges in authentic instructional contexts. It combines authentic student–teacher interaction data with model-augmented samples. The source data were drawn from real interaction corpora, reflecting typical cognitive hurdles and affective expressions, such as confusion over circuit current directions, fragmented knowledge acquisition, and learner doubt with affective undertones.
To increase diversity, we used a large language model to generate 200 additional samples conditioned on these authentic inquiries, while preserving the original knowledge topics and difficulty levels and varying surface formulations and affective tones. All candidate instances were then manually reviewed by educational experts to remove ambiguous or pedagogically unrealistic cases, yielding a final set of 100 expert-validated, high-quality instances.

The final dataset spans K1–K12 and covers seven disciplines: Biology (20), Physics (20), Mathematics (20), History (14), Geography (12), Chemistry (10), and English (4). It also balances five scenario types, namely Affective Support (32), Personalized Support (26), Strategic Scaffolding (22), Direct Q&A (12), and Error Correction (8), as well as three emotion categories: Positive (36), Neutral (32), and Negative (32). Together, these properties ensure pedagogical realism, cross-disciplinary coverage, and controlled variation in learner states.

4.2 Baselines

To verify the generalizability of SLOW across different foundation models, we selected representative models from three major families: GPT-4o and GPT-4o-mini from OpenAI, Gemini-1.5-Pro and Gemini-1.5-Flash from Google, and DeepSeek-V3 alongside the reasoning-enhanced DeepSeek-R1.

To ensure a rigorous and fair comparison, all baselines are configured with strong prompting and follow a two-stage pipeline in which the model first explicitly diagnoses the learner's cognitive and affective states before generating a final response. Critically, the baseline prompts explicitly incorporate the full set of evaluation rubrics.
This setup ensures that the baselines are fully aware of the scoring preferences, thereby confirming that any performance gains from SLOW stem from its internal open reasoning workspace and simulation mechanisms rather than mere prompt engineering or information disparity. The baseline prompts have been open-sourced at https://github.com/PhilrainV/SLOW.

4.3 Metrics

Tutoring responses are evaluated using a principle-driven framework grounded in cognitive load theory [18] and formative feedback principles [16]. Rather than relying on reference-based matching, which is ill-suited to open-ended tutoring where effective strategies are inherently non-unique, our framework assesses response quality directly with respect to pedagogical appropriateness and cognitive efficiency.

Each tutoring response is evaluated along seven response-level dimensions. These dimensions operationalize core aspects of effective tutoring, including diagnostic appropriateness, controlled cognitive load, and guidance toward concrete next steps. The framework explicitly penalizes excessive verbosity, redundant explanations, and multiple parallel solution paths that increase cognitive and decision-making load, while favoring concise, focused responses that align with the learner's expressed understanding and instructional needs and propose a minimal actionable step. Table 1 summarizes the complete evaluation rubric.

The evaluation protocol follows a single-blind procedure in which human evaluators remain unaware of the model identity behind each response. Each response is independently rated by two experts and an automated judge (a GPT-5-equivalent model) using the 0–100 scale rubrics. The final score is derived from an equally weighted average of the human and LLM ratings.
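Assuming "equally weighted average of human and LLM ratings" means the two expert scores are first averaged and that mean is then averaged with the automated judge's score (the paper leaves the exact weighting implicit), the aggregation is a one-liner; `aggregate` is an illustrative name, not from the released code.

```python
def aggregate(human_scores, llm_score):
    """Final score = 0.5 * mean(human ratings) + 0.5 * LLM rating,
    under the equal-weighting assumption stated above (0-100 scale)."""
    human_mean = sum(human_scores) / len(human_scores)
    return 0.5 * human_mean + 0.5 * llm_score

final = aggregate(human_scores=[80, 84], llm_score=78)  # -> 80.0
```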
5 Results

Our evaluation analyzes the performance of the SLOW framework from four distinct perspectives: (i) a multi-model comparison evaluating pedagogical gains over baseline models across diverse dimensions; (ii) an ablation study to determine the contribution of individual architectural components; (iii) a computational efficiency analysis assessing the trade-off between reasoning overhead and instructional quality; and (iv) an interpretability analysis demonstrating the transparent reasoning process within the workspace.

5.1 Model Comparison

Table 2 summarizes the performance gains of SLOW over the prompt-based baseline across six tutoring dimensions. Across all model families, SLOW consistently improves response quality on nearly all dimensions, with similar improvement patterns observed for both large and compact models, suggesting

Table 1. Evaluation rubric for personalized tutoring quality.

- Clarity: The response is easy to understand, well-structured, and unambiguous. Excessive verbosity, redundant explanations, or unnecessary enumeration are penalized due to increased cognitive load.
- Goal Clarity: The response makes its instructional intent explicit, enabling the learner to clearly understand the immediate learning objective for the current turn.
- Emotion Sensitivity: The response appropriately attends to emotional cues expressed in the learner's utterance, providing reassurance, encouragement, or neutral guidance when appropriate, without exaggerated or unnecessary affective language.
- Self-comparison: The response frames feedback in terms of the learner's own progress and remaining gaps, emphasizing personal improvement rather than peer comparison or competitive evaluation.
- Personalization: The response is tailored to the learner's expressed difficulty or apparent level of understanding, avoiding generic, template-based, or broadly applicable explanations.
- Actionability: The response provides a specific, minimal, and immediately executable next step. Responses that present multiple parallel options are penalized for increasing decision-making and cognitive load.
- Overall Score: A holistic judgment of tutoring quality, reflecting instructional usefulness, emotional appropriateness, and effective management of cognitive load.

that the gains are attributable to the framework's structural design rather than being solely driven by model capacity. To assess the robustness of these results, we conducted Wilcoxon signed-rank tests for each of the nine backbone models over the evaluation instances. SLOW outperformed the baseline significantly for all nine models, with p < 0.001 for seven models and p < 0.01 for the remaining two. Furthermore, Cliff's δ values ranged from 0.42 to 0.59, indicating medium-to-large effect sizes. Notable improvements are observed in clarity and goal clarity, particularly for DeepSeek-R1 and GPT-4.1, suggesting that SLOW improves instructional focus and response clarity. A small degradation in clarity is observed for DeepSeek-V3 [7], possibly reflecting its tendency toward more verbose intermediate reasoning. In contrast, large gains in actionability (e.g., +62.4 for DeepSeek-R1 [4]) highlight SLOW's effectiveness in translating diagnostic insights into concrete next-step guidance.

An ablation study compares the full SLOW framework with variants removing Cognitive Validation or Affective Prediction. As shown in Table 3, the full system performs best across all models. Removing Cognitive Validation generally causes substantial degradation, especially for DeepSeek-R1, while removing Affective Prediction also leads to clear performance losses across models. These results suggest that cognitive validation and affective prediction provide complementary benefits for effective tutoring.

Table 2. Comparative analysis of performance gains: score differences between SLOW and the baseline model across tutoring dimensions.

Model              ΔClar.   ΔGoal.   ΔEmo.    ΔSelfComp.  ΔPers.   ΔAct.    ΔOverall
deepseek-r1        +41.4    +42.2    +20.4    +47.2       +19.4    +62.4    +38.0
deepseek-v3        -6.80    +32.2    +5.40    +25.4       +7.20    +17.0    +10.6
gemini-2.5-flash   +31.4    +48.6    +28.2    +39.2       +19.0    +36.8    +29.8
gemini-2.5-pro     +35.6    +37.4    +18.8    +25.2       +17.8    +40.6    +30.0
gemini-3-pro       +28.8    +22.0    +14.2    +25.4       +11.0    +21.8    +20.4
gpt-4.1            +42.2    +36.2    +15.2    +32.6       +12.6    +55.0    +20.9
gpt-4.1-mini       +22.2    +33.0    +20.0    +31.2       +9.20    +25.8    +14.2
gpt-4o             +29.4    +33.8    +14.6    +28.2       +10.0    +39.4    +20.8
gpt-4o-mini        +14.8    +20.0    +12.8    +23.0       +4.00    +16.2    +8.6

Table 3. Overall Score in the ablation study. "w/o" denotes "without".

Ablation Setting                 GPT-4.1   Gemini-3-Pro   DeepSeek-R1
Baseline                         59.6      62.0           50.2
SLOW w/o Cognitive Validation    73.0      78.6           72.8
SLOW w/o Affective Prediction    77.2      77.8           66.0
SLOW (Full)                      83.0      88.6           89.6

Because these comparisons rely on rubric-based evaluation, we further examined the reliability of the scoring framework. Specifically, we computed Cronbach's α across rubric dimensions as an index of internal consistency for the scoring framework. For the two human experts, the ratings showed good internal consistency (α = 0.84), while the hybrid human–AI evaluation (including the LLM rater) also maintained good internal consistency (α = 0.81). Inter-rater agreement between the two human experts was measured by ICC(2,1), yielding a value of 0.78.
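For reference, Cronbach's α as used here (items = rubric dimensions, observations = rated responses) can be computed in a few lines; the ratings below are made-up illustrative data, not the study's.

```python
def variance(xs):
    # Population variance.
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(rows):
    """rows: one list of per-dimension scores per rated response.
    alpha = k/(k-1) * (1 - sum(item variances) / variance(totals))."""
    k = len(rows[0])                               # number of rubric dimensions
    items = [[r[i] for r in rows] for i in range(k)]
    totals = [sum(r) for r in rows]
    return k / (k - 1) * (1 - sum(variance(it) for it in items)
                          / variance(totals))

# Four responses scored on four dimensions (illustrative numbers only).
ratings = [
    [80, 70, 85, 75],
    [60, 65, 55, 62],
    [90, 85, 92, 88],
    [70, 75, 68, 73],
]
alpha = cronbach_alpha(ratings)  # high: dimensions move together
```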
In addition, the rank-order alignment between human and LLM judgments reached a Spearman's ρ = 0.72. These results indicate that the scoring framework is sufficiently reliable.

5.2 Computational Efficiency and Cost

To evaluate the trade-off between pedagogical gain and computational overhead, we conducted a cost analysis using GPT-4o-mini as the backbone model with deterministic decoding (temperature = 0). Relative to the standard Baseline (1.0× cost, 65.4 overall score) and the EduPlanner-style framework [22] (3.6× cost, 73.2 overall score), SLOW incurs a computational cost of 6.4× and achieves an overall score of 79.6.

Fig. 3. A case of the SLOW Reasoning Workspace demonstrating interpretability.

Notably, SLOW does not rely on a fixed-length reasoning chain; additional iterations are triggered only when diagnostic inconsistencies are detected, yielding a median of 6 and an 80th percentile of 7 API calls per instance. To test whether these gains are merely due to increased inference-time compute, we compared SLOW against a compute-matched 7-step refinement control (Refine-7). Refine-7 uses seven sequential calls (draft, critique, revision, critique, revision, critique, and final revision) with the same backbone model, decoding setting, and evaluation rubric, resulting in a similar cost of 6.2×. Unlike SLOW, Refine-7 does not maintain explicit learner-state representations or implement explicit cognitive–affective decomposition, but instead relies on unconstrained multi-turn refinement. Despite the closely matched budget (within 5% of SLOW's token budget), Refine-7 achieved an overall score of only 71.2, substantially lower than SLOW's 79.6. These results suggest that SLOW's gains are not explained by additional compute alone, but by its structured pedagogical architecture.
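The Refine-7 control follows a fixed draft–critique–revise schedule of seven sequential calls. A minimal sketch of that protocol, where `call_model` is a hypothetical stub standing in for one backbone LLM API call (the real control would hit the same model endpoint as SLOW):

```python
# Sketch of the compute-matched Refine-7 control: one draft call followed
# by three critique/revision rounds, for seven calls in total.
# `call_model` is a placeholder stub, not a real LLM client.

def call_model(prompt: str) -> str:
    return f"<model response to: {prompt[:40]}...>"  # stub output

def refine_7(question: str) -> tuple[str, int]:
    calls = 0

    def step(prompt: str) -> str:
        nonlocal calls
        calls += 1
        return call_model(prompt)

    answer = step(f"Draft a tutoring response to: {question}")   # call 1
    for _ in range(3):                                           # calls 2-7
        critique = step(f"Critique this tutoring response: {answer}")
        answer = step(f"Revise the response given this critique: {critique}")
    return answer, calls
```

Unlike SLOW, this loop carries no learner-state representation between steps; each round sees only the previous answer and critique, which matches the "unconstrained multi-turn refinement" described above.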
5.3 Interpretability Analysis

To demonstrate the interpretability of SLOW beyond surface-level explanations, we developed an interactive Reasoning Workspace (Figure 3) that externalizes the system's internal deliberation as a human-readable System 2 pedagogical trace. Rather than providing post-hoc rationales for a generated response, the workspace exposes the intermediate diagnostic assumptions, counterfactual evaluations, and strategy trade-offs that shape instructional decisions. As illustrated in the historical sequencing case, the workspace supports interpretability at three complementary levels:

– Diagnostic Traceability. When the learner reports difficulty recalling event order, the workspace reveals how the system parses this utterance into a specific cognitive hypothesis, namely a deficit in chronological structuring rather than factual recall. Through iterative cognitive validation, the learner's mastery profile is refined from an initial Unknown (Un) state to Insufficiently Known (InK), with explicit evidence indicating fragmented knowledge and missing structural anchors. This allows human observers to inspect the final diagnosis and the evidential path leading to it.
– Risk-Aware Strategy Selection. The workspace makes instructional deliberation explicit by displaying candidate strategies that were considered and rejected. In this case, a Socratic questioning approach is discarded because prospective simulation predicts heightened anxiety and cognitive blockage given the learner's current state. By exposing these alternatives, the system clarifies why certain pedagogically plausible actions are intentionally avoided, supporting informed auditing of instructional risk management.
– Calibrated Feedback Rationale. Finally, the workspace provides a plain-language justification for the selected action.
It explains how supplying minimal chronological anchors reduces immediate retrieval load and enables the learner to reorganize fragmented knowledge independently. This rationale connects diagnostic conclusions to concrete instructional choices in a manner accessible to both teachers and learners.

By decomposing tutoring behavior into observable diagnostic, affective, and strategic layers, the SLOW Reasoning Workspace moves beyond black-box automation. It functions as an auditable pedagogical interface that allows stakeholders to understand, evaluate, and trust how instructional decisions are formed, thereby supporting both educational validity and responsible deployment of LLM-based tutors.

6 Discussion & Conclusion

This paper proposes the SLOW framework, which introduces an open reasoning workspace to enable systematic consideration of learners' cognitive and affective factors prior to instructional response generation. Based on this design, we construct a structured reasoning process consisting of evidence parsing, cognitive validation, affective reasoning, and strategy integration. Empirical evaluation demonstrates that SLOW improves instructional specificity, actionability, and pedagogical coherence.

Alongside these improvements, the proposed design also introduces corresponding costs. By incorporating explicit reasoning and simulation during interaction, SLOW may result in longer response latency and could inadvertently amplify internal model biases toward diverse learner profiles. Consequently, future work should explore memory summarization and state-caching mechanisms to reduce redundant reasoning costs, alongside dedicated bias detection and mitigation protocols to enhance pedagogical fairness. Whether this trade-off between
reasoning effort and instructional quality can be stably translated into verifiable teaching effectiveness in real educational settings remains an open question for further empirical investigation. Beyond directly supporting learners, the transparent reasoning process exposed by SLOW may also serve teachers by illustrating how instructional decisions can be formed through systematic consideration of cognition and affect. In addition, the framework can function as a reasoning-aware data synthesis mechanism, providing more structured reasoning data for the training of next-generation educational language models.

Acknowledgements. This work was supported by the National Natural Science Foundation of China (Grant No. 62477012), the Natural Science Foundation of Shanghai, China (Grant No. 23ZR1418500), the AI for Science Program of the Shanghai Municipal Commission of Economy and Informatization, China (Grant No. 2025-GZL-RGZN-BTBX-01014), and the Major Program of Philosophy and Social Sciences Research of the Ministry of Education (Grant No. 2025JZDZ054).

References

1. Black, P., Wiliam, D.: Assessment and classroom learning. Assessment in Education: Principles, Policy & Practice 5(1), 7–74 (1998)
2. Frankish, K.: Dual-process and dual-system theories of reasoning. Philosophy Compass 5(10), 914–926 (2010)
3. Fu, S., Desmarais, M.C.: Markov blanket based feature selection: a review of past decade. In: Proceedings of the World Congress on Engineering. vol. 1, pp. 321–328. Newswood Ltd., Hong Kong, China (2010)
4. Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)
5.
Jia, R., Wei, Y., Li, R., Jiang, Y.H., Xie, X., Shen, Y., Zhang, M., Jiang, B.: DiaCDM: Cognitive diagnosis in teacher-student dialogues using the initiation-response-evaluation framework. arXiv preprint arXiv:2509.24821 (2025)
6. Kabudi, T., Pappas, I., Olsen, D.H.: AI-enabled adaptive learning systems: A systematic mapping of the literature. Computers and Education: Artificial Intelligence 2, 100017 (2021)
7. Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al.: DeepSeek-V3 technical report. arXiv preprint (2024)
8. Liu, J., Huang, Z., Xiao, T., Sha, J., Wu, J., Liu, Q., Wang, S., Chen, E.: SocraticLM: Exploring Socratic personalized teaching with large language models. Advances in Neural Information Processing Systems 37, 85693–85721 (2024)
9. Liu, Z., Yin, S.X., Lin, G., Chen, N.: Personality-aware student simulation for conversational intelligent tutoring systems. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 626–642 (2024)
10. Manh Hung, N., Sebastian, T., Victor-Alexandru, P., Alkis, G., Adish, S.: Synthesizing high-quality programming tasks with LLM-based expert and student agents (2025)
11. Muennighoff, N., Yang, Z., Shi, W., Li, X.L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., Hashimoto, T.B.: s1: Simple test-time scaling. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 20286–20332 (2025)
12. Murray, T.: An overview of intelligent tutoring system authoring tools: Updated analysis of the state of the art. Authoring Tools for Advanced Technology Learning Environments: Toward Cost-Effective Adaptive, Interactive and Intelligent Educational Software, pp. 491–544 (2003)
13.
Peng, X., Yuan, P., Li, D., Cheng, J., Fang, Q., Liu, Z.: KELE: A multi-agent framework for structured Socratic teaching with large language models. In: Findings of the Association for Computational Linguistics: EMNLP 2025. pp. 16342–16362 (2025)
14. Rooein, D., Chowdhury, S.P., Eremeeva, M., Qin, Y., Nozza, D., Sachan, M., Hovy, D.: PATS: Personality-aware teaching strategies with large language model tutors. arXiv preprint arXiv:2601.08402 (2026)
15. Ryan, T., French, S., Kennedy, G.: Beyond the iron triangle: Improving the quality of teaching and learning at scale. Studies in Higher Education 46(7), 1383–1394 (2021)
16. Sadler, D.R.: Formative assessment and the design of instructional systems. Instructional Science 18(2), 119–144 (1989)
17. Scarlatos, A., Baker, R.S., Lan, A.: Exploring knowledge tracing in tutor-student dialogues using LLMs. In: Proceedings of the 15th International Learning Analytics and Knowledge Conference. pp. 249–259 (2025)
18. Sweller, J.: Cognitive load theory. In: Psychology of Learning and Motivation, vol. 55, pp. 37–76. Elsevier, Amsterdam (2011)
19. Wang, F., Liu, Q., Chen, E., Huang, Z., Chen, Y., Yin, Y., Huang, Z., Wang, S.: Neural cognitive diagnosis for intelligent education systems. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 34, pp. 6153–6161 (2020)
20. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837 (2022)
21. Wei, Y., Jiang, B.: Interpretable cognitive state prediction via temporal fuzzy cognitive map. IEEE Transactions on Learning Technologies 17, 514–526 (2023)
22. Zhang, X., Zhang, C., Sun, J., Xiao, J., Yang, Y., Luo, Y.: EduPlanner: LLM-based multi-agent systems for customized and intelligent instructional design.
IEEE Transactions on Learning Technologies (2025)
23. Zhang, Z., Zhang-Li, D., Yu, J., Gong, L., Zhou, J., Hao, Z., Jiang, J., Cao, J., Liu, H., Liu, Z., et al.: Simulating classroom education with LLM-empowered agents. In: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). pp. 10364–10379 (2025)