A Causal Framework for Estimating Heterogeneous Effects of On-Demand Tutoring

Kirk Vanacore, Cornell University, kpv27@cornell.edu
Danielle R Thomas, Carnegie Mellon University, dchine@andrew.cmu.edu
Digory Smith, Eedi, digory.smith@eedi.com
Bibi Groot, Eedi, bibi.groot@eedi.com
Justin Reich, Massachusetts Institute of Technology, jreich@mit.edu
Rene Kizilcec, Cornell University, kizilcec@cornell.edu

ABSTRACT
This paper introduces a scalable causal inference framework for estimating the immediate, session-level effects of on-demand human tutoring embedded within adaptive learning systems. Because students seek assistance at moments of difficulty, conventional evaluation is confounded by self-selection and time-varying knowledge states. We address these challenges by integrating principled analytic sample construction with Deep Knowledge Tracing (DKT) to estimate latent mastery, followed by doubly robust estimation using Causal Forests. Applying this framework to over 5,000 middle-school mathematics tutoring sessions, we find that requesting human tutoring increases next-problem correctness by approximately 4 percentage points and accuracy on the subsequent skill encountered by approximately 3 percentage points, suggesting that the effects of tutoring have proximal transfer across knowledge components. These effects are robust to various forms of model specification and potential unmeasured confounders. Notably, the effects exhibit significant heterogeneity across sessions and students, with session-level effect estimates ranging from −20.25 pp to +19.91 pp. Our follow-up analyses suggest that typical behavioral indicators, such as student talk time, do not consistently correlate with high-impact sessions. Furthermore, treatment effects are larger for students with lower prior mastery and slightly smaller for low-SES students.
This framework offers a rigorous, practical template for the evaluation and continuous improvement of on-demand human tutoring, with direct applications for emerging AI tutoring systems.

Keywords
Tutoring, Adaptive Learning Systems, Causal Inference.

1. INTRODUCTION
Tutoring is widely recognized as one of the most effective educational interventions, particularly for students struggling academically [11, 16, 22, 24]. However, recent large-scale evidence suggests that the magnitude of K-12 tutoring effects varies widely across contexts, populations, and implementations, complicating efforts to generalize findings and guide policy decisions [21]. Much prior work has focused on "high-dosage" tutoring, which is typically defined as sustained, scheduled sessions delivered multiple times per week [33]. However, a growing class of interventions provides on-demand tutoring: brief support intended to deliver instruction precisely at moments of student difficulty [10, 41]. These interventions are increasingly embedded within adaptive learning platforms and are delivered by human tutors or AI-based agents [30, 41].

These embedded tutoring systems generate rich, fine-grained interaction data, creating unprecedented opportunities for continuous evaluation and improvement. However, they also introduce a fundamental methodological problem: students self-select into tutoring at moments of struggle, creating severe time-varying confounding that undermines naive observational comparisons. As a result, despite the rapid growth of embedded tutoring, the field lacks scalable methods for rigorously estimating the causal impact of individual tutoring interactions in real-world learning platforms.

This paper addresses this gap by introducing a scalable causal inference framework for estimating the effect of on-demand tutoring within an adaptive mathematics learning platform.
Rather than evaluating entire programs or long-term interventions, the framework is designed to estimate the immediate causal effect of individual tutoring interactions on subsequent student performance. Our proposed framework integrates three components into a unified pipeline: (1) principled construction of analytic samples to approximate counterfactual comparisons, (2) latent knowledge state estimation using Deep Knowledge Tracing (DKT; [29]) to address time-varying confounding, and (3) estimation of conditional average treatment effects for local, individual tutoring sessions using Generalized Random Forests (Causal Forests; [3]). In combination, these components provide a practical template for estimating average treatment effects and effect heterogeneity from large-scale learning interaction logs.

We further demonstrate various robustness tests for this framework, showing that the pipeline produces consistent estimates of the tutoring effects with and without contextual variables from outside of the platform (e.g., standardized assessments and demographic variables). Our framework addresses central challenges in observational studies of educational interventions, including self-selection into treatment, unmeasured baseline ability that changes as students learn, and treatment effect heterogeneity. Furthermore, it provides opportunities to better understand when, for whom, and why tutoring is effective. Finally, this method could be used to estimate outcomes for evaluating AI models for tutoring.

Applying our framework to data from an embedded tutoring system, we find that requesting tutoring has a positive impact on immediate performance, increasing the likelihood that students answer the next problem correctly by an average of 4.01 percentage points (pp).
We further find that this effect extends to near transfer of skills: tutoring increases the probability of answering the first problem on the subsequent skill correctly by 2.73 pp. However, there is substantial variability in these effects, with local effect estimates ranging from −20.25 pp to +19.91 pp for immediate performance and from −23.60 pp to +47.70 pp for near transfer. Students with lower estimated prior knowledge tended to benefit more from tutoring, although this relationship was slightly attenuated for lower-income students. We conclude by discussing how causal approaches of this kind can help researchers and practitioners better understand and evaluate the mechanisms that drive effective on-demand tutoring, ultimately informing the design of improved human and AI tutoring systems.

2. BACKGROUND

2.1 Efficacy and Heterogeneity in Tutoring Interventions
Tutoring is widely recognized as one of the most effective academic interventions for improving student learning outcomes [11, 16, 24]. Comprehensive meta-analyses have found that tutoring consistently yields significant positive effects on student learning [22, 24]. These effects hold across grade levels and subject areas, often outperforming other educational interventions such as class-size reduction or extended school days [16, 24]. However, the efficacy of tutoring is rarely uniform; effects are often heterogeneous across different implementations [24]. Impact also varies significantly according to student characteristics such as prior (baseline) achievement and socioeconomic status [11]. Although tutoring often targets lower-performing students to close achievement gaps [16], other studies on voluntary, on-demand models suggest that students with higher baseline engagement or fewer structural barriers are sometimes more likely to access and benefit from support [2, 10].
Understanding this heterogeneity is crucial to determine whether specific tutoring interventions narrow achievement gaps or inadvertently widen them due to differential uptake.

The magnitude of tutoring effects is also inextricably linked to the implementation model. Much of the literature advocates for "high-dosage" tutoring, typically defined as sustained, scheduled sessions occurring three or more times per week, as a primary driver of efficacy [22, 24]. However, scaling such intensive models presents logistical and financial challenges. Consequently, many Adaptive Learning Systems (ALS) have adopted "on-demand" or "just-in-time" models, where brief support is triggered by students making specific errors rather than by a fixed schedule [30, 38, 41]. Although scalable, these embedded interactions complicate the traditional definition of dosage, as exposure to treatment becomes a function of student agency and immediate need rather than administrative assignment.

2.2 Integrating Human Tutoring in Adaptive Learning Systems
To address the scalability and cost constraints of high-dosage human tutoring, recent approaches have integrated human support directly into ALS. "Hybrid" or "human-in-the-loop" models aim to combine the personalized, immediate feedback of AI-driven software with the motivational and complex pedagogical support of human tutors [6, 38]. Research suggests that human tutoring can enhance the benefits of ALS [15]. For instance, human tutors who are guided by real-time dashboards to intervene when students struggle yield statistically significant additional boosts in time-on-task and skill proficiency within an ALS, compared to students working in an ALS alone [15]. These hybrid approaches seem to benefit lower-performing students more than higher-performing ones, suggesting that human intervention may most benefit the students who most need it [38].
Integrated tutoring models often take the form of "on-demand" chat support embedded within the learning environment. In these settings, the ALS identifies a knowledge gap or a specific misconception, and a human tutor enters the loop to provide targeted scaffolding [18]. Evaluations of such systems have shown that students who engage with this hybrid support demonstrate higher learning gains and knowledge transfer than those using the ALS in isolation [18, 41]. Furthermore, interventions explicitly designed to combine human tutoring with ALS personalization have been shown to nearly double math learning gains compared to control groups, providing a scalable mechanism to promote educational equity for marginalized students [6]. By offloading rote instruction to the ALS and reserving human capital for high-value interactions, these systems offer a promising pathway to deliver effective tutoring at scale [6, 38, 42].

2.3 Methodological Challenges in Estimating On-Demand Effects
Although the efficacy of tutoring is well-documented, estimating the causal impact of on-demand help within adaptive systems presents unique methodological hurdles. Unlike scheduled "high-dosage" tutoring, where attendance is often mandated or fixed, "on-demand" support is driven by student agency. This introduces a specific form of confounding: students are more likely to ask for help precisely when they are least likely to succeed without it [2, 33]. Consequently, naive comparisons between tutored and non-tutored problem attempts often yield negative correlations, reflecting the student's struggle rather than the intervention's failure.

To address this, prior EDM research has largely relied on two approaches: experiments/randomized controlled trials (RCTs) and observational matching strategies.
While RCTs remain the gold standard, they are often implemented at the school or student level to evaluate general program access rather than specific interactions [10, 34, 41]. Randomizing individual help requests, which may require denying help to a struggling student for the sake of experimental control, is often ethically prohibitive and disruptive to the learning experience (e.g., wait-list designs where students are randomly assigned to get help right away or later are still disruptive). As a result, granular, problem-level insights must often be derived from observational data. In observational settings, researchers have traditionally employed propensity score matching to balance treatment and control groups on observed covariates such as prior achievement, demographics, and topic difficulty [6, 23]. However, standard matching approaches typically rely on static or slowly changing variables (e.g., pre-test scores, diagnostic assessments). These measures fail to capture the highly dynamic, time-varying nature of a student's knowledge state during a learning session. A student's decision to ask for help is often a function of immediate confusion or a specific misconception, which static or slowly changing covariates cannot detect.

More recent work has attempted to incorporate dynamic measures of mastery into causal estimation. Approaches such as the "Rebar" and "ReLOOP" methods have demonstrated that using predictions from high-dimensional models as covariates can reinforce matching estimators and reduce bias [26, 36]. These methods allow for more precise effect estimation in the context of RCTs but are underutilized in observational studies where conditions are not randomized.
A related, significant yet underdeveloped area in Educational Data Mining (EDM) is the integration of knowledge tracing (KT) into observational causal inference frameworks to account for time-varying confounding rooted in a student's historical performance [1, 25, 27]. In particular, DKT extends KT by using recurrent neural networks to generate high-dimensional latent states that capture complex temporal dependencies in student learning [29]. However, even with these richer representations, standard regression or linear adjustments may not fully account for the complex, nonlinear heterogeneity in how different students respond to tutoring.

Finally, most causal studies in EDM and related fields focus on average treatment effects [26, 36] or subgroup analyses [19, 20, 28, 30]. A growing body of work also examines heterogeneity by how students interact with learning environments [37, 39]. However, the field would benefit from methods that produce local, session-level estimates of effectiveness. These estimates could be leveraged to flexibly explore the circumstances under which ALS elements, like on-demand tutoring, are most effective, including elements of heterogeneity based on how students respond.

3. CURRENT STUDY
The current study proposes an observational approach for evaluating tutoring sessions embedded within digital learning platforms by estimating their causal impact on students' subsequent performance. The approach integrates DKT-based estimates of students' latent knowledge states into a causal inference framework that combines causal forests for counterfactual outcome estimation with augmented inverse probability weighting to obtain doubly robust, session-level treatment effects.
Together, these methods are designed to yield unbiased estimates of individual tutoring session effectiveness and enable downstream analyses that characterize heterogeneity in tutoring impacts across students, contexts, and instructional conditions. We employ this framework on data from an ALS¹ in which students could access on-demand human tutoring through brief text chats.

4. CONTEXT

4.1 Data
The data used for this analysis came from the treatment arm of a multi-year randomized controlled trial (RCT) designed to estimate the causal impact of Eedi on middle school mathematics achievement. Schools were randomly assigned to treatment or control conditions, with the treatment group receiving access to Eedi for two full academic years while control schools continued with business-as-usual instruction. Our current sample is from the 12 schools and 2,585 students who had access to and used Eedi during the two-year study.

4.2 Adaptive Learning System
This study was run in the ALS Eedi, which is a digital mathematics platform designed to support middle school students by identifying and addressing misconceptions in real time through diagnostic assessment. This ALS uses a large bank of carefully engineered multiple-choice questions, where each incorrect option maps to a specific, well-documented mathematical misconception. This design allows Eedi to pinpoint gaps in understanding at a fine grain and respond with targeted support, including short video explanations, structured practice, and follow-up diagnostic questions. Teachers can deploy the system flexibly in lessons or as homework, using its analytics dashboards to see patterns of misunderstanding across individuals and classes and to adapt instruction accordingly. Students work independently on quizzes of five questions assigned by their teachers.
Based on their answers, Eedi responds adaptively with a sequence of hints, videos, and fluency practice to help students overcome misconceptions. Evidence from a large, school-level randomized controlled trial suggests that the platform can produce modest but meaningful improvements in mathematics attainment when implemented over an academic year [?].

4.3 On-Demand Tutoring in Eedi.
Within Eedi, human tutoring was implemented as on-demand, synchronous, one-to-one chat-based support integrated directly into students' ongoing problem-solving activity. At any point, students could initiate a live session that immediately connected them with an expert tutor. Tutors received structured context at the start of each session, including the problem text, the student's answer, and the associated misconception, allowing them to provide targeted, dialog-based support focused on diagnosing and resolving misunderstandings before students returned to independent work.

The intervention involved seventeen tutors, each with at least three years of teaching experience and specific training in the Eedi environment. Sessions were designed to be brief and task-focused, enabling students to move fluidly between independent practice and individualized help. Tutors communicated exclusively through text chat and used a Socratic approach that emphasized questioning and guided reasoning rather than direct answer-giving. Sessions concluded once students demonstrated sufficient understanding or chose to resume their lesson, positioning tutoring as a scalable, just-in-time complement to the broader Eedi learning environment.

¹ The name of this system is omitted for review.

Table 1 summarizes the characteristics of the tutoring sessions. Tutoring sessions were relatively brief, with a median duration of 4.2 minutes and 14 messages exchanged.
Tutors contributed approximately 60% of messages on average, consistent with their role in guiding the dialogue.

Table 1: Tutoring Session Characteristics

Metric                Mean   Median   Q25   Q75
Total Messages        17.7   14       9     23
Tutor Messages        10.5   8        5     14
Student Messages      7.2    6        3     9
Duration (minutes)    5.4    4.2      2.4   6.9

5. METHODS: CAUSAL FRAMEWORK FOR LOCAL EFFECT ESTIMATION
Our causal inference framework consists of three interconnected stages, illustrated in Figure 1. (1) First, we construct an analytic dataset that links each tutoring intervention to its immediate outcome while defining an appropriate control group. (2) Second, we train a DKT model on a held-out sample of students to generate latent representations of student knowledge for both the treatment and control samples. (3) Third, we apply a Causal Forest estimator that jointly estimates propensity scores and treatment effects using the DKT-derived hidden states and predicted probabilities as covariates. We validate our findings through a series of robustness checks that incorporate contextual covariates (e.g., school assignment, standardized assessments, and demographics), alternative sample configurations to address potential participation bias, and a placebo outcome test. This framework addresses three central challenges in observational tutoring research: confounding from baseline knowledge differences, self-selection into treatment, and heterogeneity in treatment effects across students.

5.1 Analytic Sample Construction
Our method requires three samples: treatment, control, and hold-out. Table 2 presents the sample breakdown. When a student received tutoring on a problem, that instance entered the treatment sample. The next problem the student attempted without help served as our immediate performance outcome, and the next problem on a new skill served as our near transfer outcome.
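The outcome-linking rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (dictionaries with `skill`, `correct`, and `tutored` keys) and the function name `link_outcomes` are hypothetical, and the sketch assumes that a single untutored attempt may serve as both outcomes when it is also the first attempt on a new skill.

```python
def link_outcomes(attempts, tutored_index):
    """Return (immediate, near_transfer) correctness for one tutored attempt.

    `attempts` is one student's time-ordered list of dicts with the
    hypothetical keys 'skill', 'correct' (0/1), and 'tutored' (bool).
    Either outcome is None when no qualifying later attempt exists.
    """
    tutored_skill = attempts[tutored_index]["skill"]
    immediate = None
    near_transfer = None
    for attempt in attempts[tutored_index + 1:]:
        if attempt["tutored"]:
            continue  # outcomes are restricted to attempts made without help
        if immediate is None:
            immediate = attempt["correct"]  # first later untutored attempt
        if near_transfer is None and attempt["skill"] != tutored_skill:
            near_transfer = attempt["correct"]  # first untutored attempt on a new skill
        if immediate is not None and near_transfer is not None:
            break
    return immediate, near_transfer


# Illustrative (made-up) history: tutoring on a fractions problem,
# followed by one more fractions problem and then a ratios problem.
history = [
    {"skill": "fractions", "correct": 0, "tutored": True},
    {"skill": "fractions", "correct": 1, "tutored": False},
    {"skill": "ratios", "correct": 0, "tutored": False},
]
immediate, near_transfer = link_outcomes(history, 0)
```

A treated observation whose immediate outcome comes back as `None` has no subsequent attempt to serve as an outcome, which corresponds to one of the exclusion rules applied when building the analytic sample.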
To ensure comparability, only problems for which at least one student received tutoring were included in the causal analysis. In addition, problem attempts from treated students in which they did not receive tutoring were excluded from the primary analysis. Treated observations were also excluded when no subsequent problem attempt was available to serve as an outcome.

Table 2: Analytic Sample Construction from Raw Data to Final Analytic Sample

                              Students   Problem Attempts   Unique Problems
Original Usage Data              2,585            852,274             8,540
Treatment Sample
  All Treatment Usage            1,234            465,619             6,803
  Excluded                        (13)          (460,456)           (5,337)
  Final Analytic Treated         1,221              5,163             1,466
Control Sample
  All Control Usage              1,351            386,655             8,017
  DKT Training Holdout             676            197,332             7,485
  Control Analysis                 675            189,323             5,884
  Excluded                        (10)           (97,874)           (4,418)
  Final Analytic Control           665             91,449             1,466
Total Analytic Sample            1,886             96,612             1,466

To avoid potential Stable Unit Treatment Value Assumption (SUTVA) violations [35], only students who never received tutoring were eligible for the control sample, as prior exposure to tutoring could influence later outcomes. We relax this restriction in a robustness check to assess whether this sample construction meaningfully affects results (see Section 5.4). Students who were never treated were randomly divided between the control and hold-out samples. The hold-out sample was used exclusively for DKT estimation (Section 5.2), while the control sample was reserved for causal effect estimation. The analytic control sample, therefore, consisted of students who (a) never received tutoring and (b) attempted at least one of the same problems as students in the treatment sample. Because not all problems overlapped with tutoring interventions, control-group attempts on non-overlapping problems were excluded.
Students who lacked any problem attempts overlapping with the treatment sample were removed from the final analytic control sample.

5.2 Knowledge State Estimation
We implemented a DKT model using a Long Short-Term Memory (LSTM) architecture to generate student knowledge estimates used as pre-treatment covariates in the causal analysis [29]. The model uses a student's sequential history of question attempts to predict the probability of correctly answering subsequent questions. Critically, the model maintains a latent hidden state that encodes information from prior performance across time, allowing it to summarize a student's evolving knowledge state even when past interactions involve different constructs than the one being predicted.

The use of DKT serves a substantive causal purpose. Student knowledge is a key confounder in this setting: students are more likely to seek or be offered tutoring when they are struggling, and baseline knowledge strongly predicts future performance. By conditioning on a rich, multidimensional representation of students' latent knowledge states, we aim to reduce bias arising from time-varying confounding in tutoring assignment and outcome generation. Furthermore, DKT can provide specific point-in-time estimates of baseline knowledge that account for recent learning, which may be more robust than global pretest measures.

Figure 1: Causal Framework for Estimating Heterogeneous Effects of On-Demand Tutoring. The pipeline integrates ALS and tutor interaction logs to estimate student knowledge states and causal effects. These estimates drive downstream insights into effect variability. While the framework functions independently of contextual data, we incorporate it here to test robustness and identify variables for heterogeneity analysis.

We implemented the DKT model in PyTorch following the architecture first introduced by Piech et al. [29].
The training data included 197,332 problem attempts across 7,485 problems representing 1,598 distinct knowledge components. Item embeddings were passed to an LSTM with a hidden dimension of 50, which updated its hidden state $h_t$ at each time step. The resulting 50-dimensional hidden state vector $(H_1, \ldots, H_{50})$ was used as the primary knowledge representation in our causal analysis. The output layer applied a sigmoid activation function to produce probability estimates for the next construct. The model achieved an acceptable AUC of 0.72 on the analytic control sample (which was not used for training), consistent with typical performance reported for DKT models [1].

5.2.1 Feature Extraction.
For each observation in the treatment and control samples, we extracted the 50-dimensional LSTM hidden state calculated just prior to the intervention. We also use the predicted performance on the current problem, in which the student could receive tutoring support, and on the subsequent problem, which serves as an outcome. Including these as covariates in the causal analysis is modeled after the Rebar method, which has been shown to produce precise, unbiased effect estimates by applying models generated from data outside of the treatment and control samples to those samples [36].

5.3 Causal Forest Estimation
We estimated treatment effects using Generalized Random Forests (GRF) as implemented in the grf R package [4]. We chose to incorporate causal forests into the pipeline because of their nonparametric flexibility and their ability to model high-dimensional treatment effect heterogeneity without functional form assumptions [3].

Causal forests offer three other advantages that are particularly relevant in our setting. First, they have been shown to perform well under targeted selection, when treatment assignment depends on predictors of the outcome itself [17].
In the context of on-demand tutoring, students are more likely to seek or receive tutoring precisely when they anticipate poor performance absent intervention, rendering treatment assignment endogenous and outcome-driven. Causal forests are explicitly designed to accommodate this form of selection. Second, a key feature is honesty, in which separate subsamples are used for tree construction and treatment effect estimation. This sample-splitting reduces adaptive overfitting and supports valid asymptotic inference by producing out-of-sample counterfactual predictions analogous to leave-one-out estimation [43]. Third, causal forests directly estimate Conditional Average Treatment Effects (CATEs) as flexible functions of a high-dimensional covariate space. Given the richness of our covariates, including the 50-dimensional DKT hidden state representation, this approach enables estimation of fine-grained, unit-specific conditional effects and supports downstream analyses of treatment effect heterogeneity across students and contexts.

5.3.1 Model Specification.
The covariate matrix $X$ included 50 DKT hidden state dimensions, cumulative historical accuracy, and DKT-based probability estimates of correct responses. The forest was trained with 500 trees, honesty enabled, and hyperparameters selected via cross-validation. We specified students as clusters in the causal forest, implementing cluster-robust estimation such that all observations from a given student are assigned to the same subsample during tree construction. As a result, treatment effect estimates account for within-student dependence, and predictions for each student are generated without using that student's own data.
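The clustering idea in this specification, keeping every observation from one student in the same subsample, can be illustrated with a small standard-library sketch. This is not the grf implementation; the function name `cluster_half_split` and the half-and-half split are assumptions made purely for illustration.

```python
import random


def cluster_half_split(student_ids, seed=0):
    """Split observation indices into two halves at the student level.

    All observations that share a student id land in the same half, so
    neither half contains partial data from any student.
    """
    rng = random.Random(seed)
    students = sorted(set(student_ids))
    rng.shuffle(students)  # randomize which students go to which half
    first_half_students = set(students[: len(students) // 2])
    half_a = [i for i, s in enumerate(student_ids) if s in first_half_students]
    half_b = [i for i, s in enumerate(student_ids) if s not in first_half_students]
    return half_a, half_b


# Hypothetical ids: six observations from four students.
ids = ["s1", "s1", "s2", "s3", "s3", "s4"]
half_a, half_b = cluster_half_split(ids, seed=42)
```

Because the split is drawn over students rather than rows, a model fit on one half never sees any observation from a student in the other half, which is the property that makes per-student predictions out-of-sample.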
5.3.2 Estimation Engine: Robinson Decomposition and Propensity Modeling
We estimate heterogeneous treatment effects using Causal Forests, which rely on the Robinson decomposition (also known as a residual-on-residual or orthogonalization strategy). Let $Y_i$ denote the observed outcome for unit $i$, $Z_i \in \{0, 1\}$ the treatment indicator, and $X_i$ the vector of pre-treatment covariates. For each outcome definition (next-problem correctness and next-skill correctness), the procedure first fits two functions using separate regression forests: the conditional mean outcome model $\hat{m}(X)$, which estimates $E[Y \mid X]$, and the propensity score model $\hat{e}(X)$, which estimates $P(Z = 1 \mid X)$.

These estimates are then used to construct orthogonalized (residualized) variables:

\[
\tilde{Y}_i = Y_i - \hat{m}(X_i), \qquad \tilde{Z}_i = Z_i - \hat{e}(X_i), \tag{1}
\]

where $\tilde{Y}_i$ represents the deviation of the observed outcome from its covariate-expected value, and $\tilde{Z}_i$ represents the deviation of the observed treatment assignment from its predicted probability. This transformation removes outcome variation explained by baseline covariates and treatment variation explained by selection on observables, yielding Neyman-orthogonal signals for effect estimation.

The Causal Forest is then trained to model the relationship between $\tilde{Y}_i$ and $\tilde{Z}_i$ as a function of $X_i$, which isolates the causal signal after adjusting for baseline outcome differences ($\hat{m}$) and treatment selection ($\hat{e}$).

Propensity scores are therefore estimated internally as part of the Causal Forest procedure via a regression forest for $\hat{e}(X)$. Valid estimation requires overlap (positivity), meaning each unit must have a non-negligible probability of both treatment and control conditional on $X$. In our analytic sample, estimated propensity scores ranged from 0.03 to 0.89, with most observations falling between 0.05 and 0.20.
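To make the residual-on-residual idea concrete, the sketch below residualizes outcome and treatment as in Equation (1) and then computes a single, constant-effect slope of $\tilde{Y}$ on $\tilde{Z}$. It is a simplification under stated assumptions: a causal forest replaces this one global ratio with locally weighted versions that vary with $X$, the nuisance estimates `m_hat` and `e_hat` are taken as given rather than fit by regression forests, and the function names and the overlap thresholds are hypothetical.

```python
def residualize(y, z, m_hat, e_hat):
    """Equation (1): subtract the nuisance predictions from outcome and treatment."""
    y_tilde = [yi - mi for yi, mi in zip(y, m_hat)]
    z_tilde = [zi - ei for zi, ei in zip(z, e_hat)]
    return y_tilde, z_tilde


def constant_effect(y_tilde, z_tilde):
    """Least-squares slope of y_tilde on z_tilde (no intercept)."""
    numerator = sum(yt * zt for yt, zt in zip(y_tilde, z_tilde))
    denominator = sum(zt * zt for zt in z_tilde)
    return numerator / denominator


def overlap_ok(e_hat, lower=0.01, upper=0.99):
    """Crude positivity check; the bounds are illustrative, not from the paper."""
    return all(lower < e < upper for e in e_hat)


# Synthetic data built so that the true effect is exactly 0.04.
z = [1, 0, 0, 1, 0]
e_hat = [0.20, 0.10, 0.15, 0.30, 0.05]
m_hat = [0.50, 0.60, 0.40, 0.70, 0.55]
true_tau = 0.04
y = [m + true_tau * (zi - e) for m, zi, e in zip(m_hat, z, e_hat)]

y_tilde, z_tilde = residualize(y, z, m_hat, e_hat)
tau_hat = constant_effect(y_tilde, z_tilde)
```

On this synthetic input the slope recovers the planted effect up to floating-point error, because the outcome was generated as $\hat{m}(X_i) + \tau\,(Z_i - \hat{e}(X_i))$ with no noise.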
Appendix A.1 presents the full propensity score distribution and associated overlap and robustness diagnostics.

5.3.3 Effect Estimates.
We report three types of effects. The Conditional Average Treatment Effect (CATE) provides unit-level predictions, where $\tau(x)$ represents the difference in expected potential outcomes for an individual with covariates $x$. Specifically, it captures the contrast between the expected outcome if the student receives tutoring, $Y(1)$, and the expected outcome if the student does not receive tutoring, $Y(0)$:

\[
\tau(x) = E[\,Y(1) - Y(0) \mid X = x\,] \tag{2}
\]

The Average Treatment Effect (ATE) is the population-level effect, computed using Augmented Inverse Probability Weighting (AIPW) scores:

\[
\hat{\tau}_{\mathrm{ATE}} = \frac{1}{n} \sum_{i=1}^{n} \left[ \hat{\tau}(X_i) + \frac{Z_i - \hat{e}(X_i)}{\hat{e}(X_i)\,\bigl(1 - \hat{e}(X_i)\bigr)} \Bigl( Y_i - \hat{m}_{Z_i}(X_i) \Bigr) \right] \tag{3}
\]

Here $\hat{m}_{Z_i}(X_i)$ denotes the predicted outcome for unit $i$ under its observed treatment status $Z_i$. The Average Treatment Effect on the Treated (ATT) is determined by restricting this calculation to treated units and represents the average effect for those who chose the treatment. The use of AIPW makes the ATE and ATT "doubly robust," as they combine the outcome predictions with propensity weighting to correct for bias [13]. Thus, if either the outcome model or the propensity score model is misspecified, the estimates remain consistent and asymptotically unbiased, provided the other model is correctly specified.

5.3.4 Heterogeneity Analysis.
To test for moderators of the treatment effect, we treat the estimated unit-level CATEs ($\hat{\tau}_i$) as the outcome variable in a linear mixed (multilevel) model (MLM):

\[
\hat{\tau}_i = \beta_0 + \beta_1 \cdot \text{Moderator} + \mu_s + \epsilon_i \tag{4}
\]

where $\mu_s$ accounts for student-level clustering. This allows us to determine (via significance testing) which moderators explain the variability in treatment effects.

5.4 Robustness and Sensitivity Checks
To assess whether the model is robust to misspecification and potential unobserved confounders, we conducted a series of robustness checks.
Our first set of robustness checks determines whether adding different variables or sample configurations influences the results.

5.4.1 External Variables Added

First, we incorporated additional covariates not derived from platform usage. These included standardized pretest scores of mathematical knowledge (NWEA MAP RIT), which provide an external measure of baseline ability independent of platform performance; a gender indicator with three categories (male, female, or neither); and Pupil Premium Funding (PPF) eligibility, a government designation for students from disadvantaged backgrounds that serves as a proxy for socioeconomic status. We also included school identifiers. The goal of this analysis was to test whether the estimated Average Treatment Effect (ATE) remains stable after adding external covariates, which would suggest that the DKT-based features may adequately capture relevant confounding and that the estimated treatment effect is not driven by unobserved demographic or institutional factors.

5.4.2 Washout Sample Included

The second robustness test evaluates potential selection bias arising from defining the control group as students who have never requested tutoring. These students may differ systematically from treated students in ways not fully captured by DKT features or other observed covariates. Although restricting the control group in this way helps prevent SUTVA violations (see Section 5.1), it may introduce confounding if never-treated students are fundamentally different from those who seek tutoring. To examine this possibility, we re-estimate the model using an expanded control group that includes students who previously received tutoring, provided that the prior session did not occur within the previous two skills (i.e., a washout sample).
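One way to read the washout rule is as a per-observation eligibility filter. The sketch below is a hypothetical illustration, not the sample-construction code used in this study: it assumes each student's history is a chronological list with one Boolean per skill indicating whether tutoring was requested on that skill, and it treats an untutored skill as an eligible control observation only if none of the preceding two skills involved tutoring. The function name `washout_controls`, the `washout` parameter, and the one-flag-per-skill encoding are all assumptions introduced here.

```python
def washout_controls(tutored_by_skill, washout=2):
    """Flag which skills in one student's history could serve as control observations.

    tutored_by_skill: chronological list of bools, True if tutoring was requested
                      on that skill (hypothetical encoding).
    washout:          number of preceding skills that must be tutoring-free.
    """
    eligible = []
    for i, tutored_now in enumerate(tutored_by_skill):
        recent = tutored_by_skill[max(0, i - washout):i]  # the previous `washout` skills
        eligible.append((not tutored_now) and (not any(recent)))
    return eligible

# A student who requested tutoring on the second skill only.
history = [False, True, False, False, False, False]
flags = washout_controls(history)
print(flags)
```

Under this reading, the two skills immediately after a tutoring session are excluded from the control pool and later skills re-enter it, whereas the primary specification would exclude every observation from a student who ever requested tutoring.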
The washout specification allows us to account for potential carryover effects while reducing selection bias by increasing overlap between treatment and control observations.

5.4.3 Pre-Intervention Placebo Test

Next, we conducted a placebo test using an outcome for which no causal effect is possible: student performance on a problem that occurs before the tutoring intervention. Placebo or "negative control" outcomes are widely used in causal inference as a diagnostic tool for detecting residual confounding or model misspecification. If the identification strategy is valid, estimated treatment effects on such outcomes should be null, since they temporally precede treatment and cannot be causally influenced by it [12]. We chose a problem three items before the intervention as our placebo outcome to avoid any interference with the intervention. Finding no significant effect on this pre-treatment outcome would therefore provide additional evidence that the method is not simply capturing baseline differences between students who choose tutoring and those who do not, but is instead identifying a genuine causal effect of tutoring.

5.4.4 Additional Checks

Finally, we employ two additional robustness checks that are common in quasi-experimental analyses. We present these in detail in Appendix A. We tested the sensitivity of our results to propensity score model specification and trimming of the distribution tails to ensure common support [9]. To quantify the robustness of our estimates to omitted variable bias, we calculate the Robustness Value (RV) using the sensemakr framework [8]. The RV represents the minimum strength of association that an unobserved confounder would need to have with both the treatment and the outcome (measured in terms of partial $R^2$) to reduce the estimated treatment effect to zero.
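For a linear regression, Cinelli and Hazlett [7] give the RV in closed form: $RV_q = \tfrac{1}{2}\left(\sqrt{f_q^4 + 4 f_q^2} - f_q^2\right)$ with $f_q = q\,|t| / \sqrt{df}$, where $t$ is the t-statistic of the treatment coefficient, $df$ the residual degrees of freedom, and $q$ the fraction of the estimate to be explained away ($q = 1$ reduces it to zero). The sketch below evaluates that formula. The inputs are made-up numbers chosen for illustration, not values from this study, and how the linear-model formula is applied to the Causal Forest estimates is not specified in this section; the function name `robustness_value` is likewise an assumption.

```python
import math

def robustness_value(t_stat, dof, q=1.0):
    """Closed-form robustness value for a linear-regression treatment coefficient.

    t_stat: t-statistic of the treatment coefficient
    dof:    residual degrees of freedom
    q:      share of the point estimate the confounder must remove (1.0 = all of it)
    """
    f_q = q * abs(t_stat) / math.sqrt(dof)   # scaled partial Cohen's f
    return 0.5 * (math.sqrt(f_q ** 4 + 4.0 * f_q ** 2) - f_q ** 2)

# Hypothetical inputs: t = 5 with 2,500 residual degrees of freedom.
rv = robustness_value(t_stat=5.0, dof=2500)
print(f"RV_q=1 = {rv:.4f}")
```

With these illustrative inputs the RV is roughly 0.095: a confounder would need to explain about 9.5% of the residual variance of both the treatment and the outcome to bring the point estimate to zero.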
The RV allows us to benchmark the required strength of an unmeasured confounder against observed covariates like prior mastery or student SES.

6. RESULTS

6.1 Treatment Effect Estimates

Table 3 presents the doubly robust average treatment effect estimates from the Causal Forest analysis, with p-values adjusted for multiple comparisons. We find a statistically significant positive effect of tutoring on immediate performance and near transfer (Table 3). The estimated average treatment effect (ATE) of tutoring on next-problem correctness was 4.01 percentage points (pp) (95% CI = [2.51, 5.51], p < 0.001), and the ATE on first-attempt accuracy on a problem from the next skill was 2.73 pp (95% CI = [1.12, 4.35]). The effects on the treatment group (ATT) were also significant and similar in magnitude.

6.1.1 Robustness and Sensitivity

Table 3 presents results from several robustness checks designed to evaluate whether specific modeling choices, such as covariate selection and sample construction, influence the stability of the effect estimates. The inclusion of external demographic covariates yielded a nearly identical ATE of 3.93 pp (95% CI = [2.49, 5.37]). This stability suggests that our Deep Knowledge Tracing (DKT)-derived features successfully capture the requisite student heterogeneity, rendering additional demographic data redundant for the purposes of confounding control. The model specification that included the washout sample (in which the control group included students who had been treated in the past, but not within two skills) produced a lower but still statistically significant estimate of 2.17 pp (95% CI = [1.17, 3.16]). The convergence and overlapping confidence intervals across these disparate design assumptions provide robust evidence for a consistent causal effect.
These checks were replicated for the near-transfer outcome with similarly significant results.

The effect estimate for the placebo outcome, a measure theoretically unaffected by the treatment, was negative and non-significant; however, it is non-significant only after applying a Bonferroni correction to account for family-wise error. This result suggests that, if anything, students in the treatment group may be biased toward lower baseline performance; consequently, the observed positive treatment effects are unlikely to be artifacts of model-driven upward bias.

Two additional robustness tests are detailed in Appendix A.1. We utilized paired matching as a diagnostic tool for propensity score estimation, finding that a Random Forest propensity model achieved superior covariate balance compared to linear specifications. Furthermore, we confirmed that trimming the tails of the propensity distribution to ensure stricter overlap did not significantly alter the treatment effect estimates.

Finally, to address the potential for omitted variable bias (e.g., unobserved student motivation), we conducted a formal sensitivity analysis using the sensemakr framework [7]. We calculated the Robustness Value ($RV_{q=1}$) required to reduce the treatment effect to statistical insignificance [8]. Our results indicate that an unmeasured confounder would need to be three times as predictive of both treatment assignment and the outcome as our most influential observed covariates to nullify the current findings.

6.2 Treatment Effect Heterogeneity

While the average effects are positive, we observe substantial heterogeneity in local treatment effects across intervention sessions. Figure 2 displays the distribution of local, session-level treatment effect estimates on students' immediate performance and near transfer. The CATE for immediate performance has a standard deviation of 3.36 pp and ranges from -20.25 pp to 19.91 pp. The CATE for near transfer has a standard deviation of 3.41 pp and ranges from -47.70 pp to 23.60 pp. Notably, there is a strong correlation between these effects (r = 0.69, p < 0.001), suggesting that the immediate performance CATE may be a good surrogate measure for near transfer; if a tutoring session has a positive effect on next-problem correctness, it is likely to have a positive effect on the students' ability to solve problems on other similar skills as well.

Table 3: Average treatment effect estimates in percentage points across different model specifications.

Model Specification                        | ATE (95% CI)          | ATT (95% CI)
Primary Specification                      |                       |
  Immediate Performance (Next Problem)     | 4.01*** (2.51, 5.51)  | 3.98*** (2.43, 5.54)
  Near Transfer (Next Skill)               | 2.73** (1.12, 4.35)   | 2.65* (0.99, 4.31)
Robustness Checks                          |                       |
  External Variables Added (Next Problem)  | 3.93*** (2.49, 5.37)  | 4.71*** (2.92, 6.50)
  Washout Sample Included (Next Problem)   | 2.17*** (1.17, 3.16)  | 2.76*** (1.91, 3.61)
  Placebo Test (Pre-Intervention)          | -1.67 (-3.10, -0.24)  | -0.71 (-2.15, 0.74)
Bonferroni-corrected significance: * p < 0.05, ** p < 0.01, *** p < 0.001

Figure 2: Distribution of Conditional Average Treatment Effects (CATE) on both Immediate Performance and Near Transfer. The solid black line indicates the ATE, while the dashed red line indicates zero effect.

6.3 Heterogeneity by Tutoring Session Characteristics

To illustrate how this method can be used to examine why some tutoring sessions are more effective than others, we analyzed three potential moderators: session length, proportion of student talk, and the number of prior tutoring sessions. We chose these moderators because each of them has been studied in more traditional tutoring environments. Longer and more frequent sessions are consistently pointed to as hallmarks of high-impact tutoring [33].
Eliciting student talk is often associated with higher-quality tutoring sessions [5]. Notably, we view this as a preliminary analysis into understanding what might drive effectiveness in on-demand tutoring, which will lead to future directions for this line of work.

Because these variables were non-normally distributed, each was divided into quartiles and compared using the heterogeneity model (Section 5.3.4) with Tukey HSD post-hoc tests.² Figure 3 presents the differences in effect size on immediate performance by quartile (quartile descriptive statistics and results tables are presented in Appendix C).

² We also ran the models with transformed variables and found similar results, but chose to present the results in terms of quartiles to ease interpretation.

Session length, measured by the number of messages sent between tutors and students, showed the clearest and most consistent pattern. Moderately long sessions (Q3) produced significantly larger effects than very short sessions (Q1; p < 0.001), indicating that brief interactions are often insufficient to support learning. Sessions in the longest quartile (Q4) also outperformed the shortest sessions (p < 0.05), but did not yield reliably greater benefits than moderately long sessions, suggesting diminishing returns beyond a certain point.

As a proxy for student talk, we used the percentage of total words in the session that were sent by the student. This measure exhibited minimal and inconsistent associations with tutoring effectiveness. Only one comparison reached significance, indicating slightly lower effects in sessions with the highest levels of student talk (Q4 vs. Q2; p < 0.01), with all other contrasts non-significant. Students who sent between 14% and 20% of the words during the session had higher estimated effects than those who sent between 29.8% and 85.7%.
This contrasts with much of the work on student talk in tutoring, which often finds that getting students to engage is an important part of tutoring [5]. However, this analysis suggests that student talk may be less important in on-demand sessions that focus on specific problems.

The number of tutoring sessions that preceded the intervention session was strongly and positively associated with impact. Students with more previous sessions (Q3 and Q4) experienced substantially larger gains than first-time or infrequent users (Q1; both p < 0.001), with effect sizes increasing monotonically across quartiles. Taken together, these results suggest that tutoring is most effective when sessions are of at least moderate length and when students have prior experience engaging with on-demand support, whereas simply increasing the amount of student talk does not reliably translate into greater learning gains.

Figure 3: Effect Sizes on Immediate Performance by Tutoring Session Characteristics. Each effect is decomposed by quartiles of the characteristics. Quartile ranges are presented in brackets.

6.4 Heterogeneity by Student Characteristics

The CATE estimations also provide opportunities to evaluate which students benefit more from this form of tutoring. To assess this variability, we estimated a series of regression models predicting session-level CATEs (the full table of the model output can be found in Appendix B). Students with lower baseline ability exhibited larger treatment effects, with DKT mastery showing the strongest association (β = -1.87, p < 0.001), indicating that a one standard deviation decrease in predicted mastery corresponded to a 1.87 percentage point increase in effect size. Once DKT was included, NWEA RIT scores showed only a small positive association (β = 0.08), suggesting that the more proximal DKT-based measure of knowledge was the primary driver of heterogeneity.
This pattern is consistent with both a ceiling effect (where higher-performing students have less room to improve) and the hypothesis that just-in-time tutoring is most valuable for students who are struggling.

Low-SES students benefited slightly less on average (β = -0.21, p < 0.05), and interaction terms revealed that the relationship between prior knowledge and treatment effects was weaker for these students (DKT × PPF: β = 0.25, p < 0.001; RIT × PPF: β = 0.29, p < 0.001). Figure 4 presents a visualization of this interaction. While lower-performing students generally benefited more from tutoring, this gradient was attenuated among PPF-eligible students. One interpretation is that low-SES students may face additional barriers, such as limited access to supplementary resources or greater competing demands on attention, that constrain the potential benefits of tutoring regardless of baseline ability.

7. DISCUSSION & CONCLUSION

We introduced a scalable causal framework for estimating problem-level effects of on-demand human tutoring embedded in an adaptive learning system. Across specifications, requesting tutoring produces modest but reliable gains in immediate performance (next-problem correctness; ATE of 4 percentage points) and near transfer (first attempt on the next skill; ATE of 3 percentage points). These effects are similar in magnitude to those of other adaptive learning system (ALS) features [14, 32, 31], but the large treatment-effect heterogeneity suggests substantial room to optimize tutoring quality. The framework enables session-level effect estimation, supporting feature mining of tutoring interactions associated with higher or lower impact.

A central contribution is methodological. We show how time-varying knowledge representations derived from platform logs can be used as pretreatment covariates to address confounding driven by evolving mastery and task difficulty in observational on-demand tutoring studies.
By integrating Deep Knowledge Tracing features [29] into a doubly robust causal forest estimator [40, 3, 13], the framework targets the core selection problem that tutoring is typically requested when students anticipate poor performance. The stability of ATE estimates after adding external covariates (test scores, SES proxy, school identifiers) suggests that platform-derived learning traces capture much of the relevant confounding structure. A placebo effect that was non-significant after Bonferroni correction further supports the identification strategy.

The framework also supports development and benchmarking of AI tutoring systems. Session-level Conditional Average Treatment Effect (CATE) estimates provide large sets of labeled examples indicating which tutoring interactions are likely to produce meaningful gains. These can serve as supervision signals for training AI tutors to emulate high-impact human strategies and avoid low-impact patterns. CATE distributions also define a principled benchmark: AI systems can be evaluated not only against average human outcomes but relative to the upper range of observed human session effects.

A key substantive result is that the ATE alone masks wide heterogeneity, including strongly positive and negative session-level effects. Although the average tutoring effect is comparable to lower-cost ALS features such as on-demand hints [31], individual effects range from roughly +20 to -20 percentage points. This indicates that some tutoring sessions are highly effective while others are neutral or counterproductive. The proposed framework enables identification and analysis of these high- and low-impact sessions.

Figure 4: Interactions between prior knowledge in terms of predicted mastery (A) or standardized test scores (B) and socioeconomic status (SES) based on Pupil Premium Funding (PPF) eligibility.
Our analysis of heterogeneity provides a preliminary, though not comprehensive, attempt to identify the factors associated with higher-impact tutoring. Evaluating session characteristics highlights several potential drivers of effectiveness. Session length exhibits the clearest pattern: moderately long sessions (Q3) outperform very short sessions (Q1), while the longest sessions do not reliably improve upon moderately long ones. This suggests a diminishing marginal return on tutor time beyond a certain threshold. Furthermore, prior intervention experience is strongly and monotonically associated with larger impacts, indicating that students may "learn how to use" tutoring more effectively over time. Alternatively, students who access tutoring more frequently may possess inherent characteristics, such as higher baseline engagement, that render the tutoring more beneficial. In contrast, the proportion of student talk shows minimal associations with impact. This finding is directionally inconsistent with classic tutoring accounts that emphasize student explanation and constructive engagement as drivers of learning [5], and suggests that in brief, problem-focused, on-demand chats, the quantity of student talk may be less informative than the specificity, timing, and instructional content of tutor moves. However, the crude metric of the proportion of words used by students vs. tutors does not capture the quality of the conversation; more work can be conducted in this area using the effect estimates from our framework.

Together, these patterns point to a practical implication for embedded systems: effectiveness may depend less on maximizing interaction volume and more on ensuring sessions are long enough to resolve a misconception while supporting students in developing productive help-seeking routines.
Student-level heterogeneity shows larger benefits for learners with lower prior mastery, consistent with just-in-time support models and prior tutoring evidence [24]. However, low-SES students benefit slightly less on average and show a weaker mastery-impact gradient, aligning with prior work suggesting that on-demand tutoring may require complementary design supports to ensure equitable benefit [22, 10].

Finally, the robustness checks illustrate both the promise and the challenges of observational evaluation in embedded interventions. Estimates attenuate under alternative control-group constructions that expand overlap while introducing potential carryover, highlighting the difficulty of defining appropriate counterfactuals when treatment is episodic and histories are longitudinal. This reinforces the value of frameworks that (i) explicitly encode time-varying knowledge and (ii) provide diagnostics (e.g., overlap, placebo outcomes) to probe residual bias [35, 13].

8. LIMITATIONS & FUTURE WORK

Several limitations of this study provide productive avenues for future research. First, while our causal identification strategy relies on the conditional ignorability assumption, our sensitivity analysis suggests that unmeasured factors (such as student motivation or teacher practices) would need to be substantially more influential than our platform-derived measures to invalidate the primary findings. Second, modeling tutoring as a binary exposure necessarily marginalizes the granular variations in instructional quality and conversational strategy that likely drive effect heterogeneity. While this work identifies correlates of impact, subsequent inquiries must deconstruct the "black box" of the tutoring exchange by integrating process-level data, such as dialogue transcripts or specific instructional moves, to isolate the latent causal mechanisms mediating effectiveness.
Third, the outcome measures used here are intentionally proximal; while they capture immediate impact and near transfer, they may not reflect long-term persistence or durable achievement. Future research should examine the compounding effects of these just-in-time interventions to determine if session-level gains translate into cumulative improvements in learning trajectories. Finally, as this study examines a single platform and implementation, future research should address generalizability by applying this framework across diverse domains and utilizing experimental or quasi-experimental designs that more directly probe equity implications and pedagogical mechanisms.

APPENDIX

Anonymized digital appendices: https://osf.io/vdf8q/files/bez4n?view_only=e45c14c4cc1a4f9e958eeb8e87535ee3.

A. REFERENCES

[1] G. Abdelrahman, Q. Wang, and B. Nunes. Knowledge tracing: A survey. ACM Computing Surveys, 55(11):224:1–224:37, 2023.
[2] V. Aleven, I. Roll, B. M. McLaren, and K. R. Koedinger. Help helps, but only so much: Research on help seeking with intelligent tutoring systems. International Journal of Artificial Intelligence in Education, 26(1):205–223, 2016.
[3] S. Athey, J. Tibshirani, and S. Wager. Generalized random forests. The Annals of Statistics, 47(2):1148–1178, 2019.
[4] S. Athey and S. Wager. Estimating treatment effects with causal forests: An application. Observational Studies, 5(2):37–51, 2019.
[5] M. T. Chi, S. A. Siler, H. Jeong, T. Yamauchi, and R. G. Hausmann. Learning from human tutoring. Cognitive Science, 25(4):471–533, 2001.
[6] D. R. Chine, C. Brentley, C. Thomas-Browne, J. E. Richey, A. Gul, P. F. Carvalho, and K. R. Koedinger. Educational equity through combined human-AI personalization: A propensity matching evaluation. In International Conference on Artificial Intelligence in Education, pages 366–377. Springer, 2022.
[7] C. Cinelli and C. Hazlett. Making sense of sensitivity: Extending omitted variable bias. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82(1):39–67, 2020.
[8] C. Cinelli and C. Hazlett. sensemakr: Sensitivity analysis tools for omitted variable bias in R. Journal of Statistical Software, 94(11):1–34, 2020.
[9] R. K. Crump, J. V. Hotz, G. W. Imbens, and O. A. Mitnik. Dealing with limited overlap in estimation of average treatment effects. Biometrika, 96(1):187–199, 2009.
[10] G. Deacon and G. Chojnacki. Impacts of UPchieve on-demand tutoring on students' math knowledge and perceptions. Technical report, Mathematica, 2023.
[11] J. Dietrichson, M. Bøg, T. Filges, and A. M. Klint Jørgensen. Academic interventions for elementary and middle school students with low socioeconomic status: A systematic review and meta-analysis. Review of Educational Research, 87(2):243–282, 2017.
[12] A. C. Eggers, G. Tuñón, and A. Dafoe. Placebo tests for causal inference. American Journal of Political Science, 68(3):1106–1121, 2024.
[13] A. N. Glynn and K. M. Quinn. An introduction to the augmented inverse propensity weighted estimator. Political Analysis, 18:36–56, 2010.
[14] A. Gurung, S. Baral, M. P. Lee, A. C. Sales, A. Haim, K. P. Vanacore, A. A. McReynolds, H. Kreisberg, C. Heffernan, and N. T. Heffernan. How common are common wrong answers? Crowdsourcing remediation at scale. In Proceedings of the Tenth ACM Conference on Learning @ Scale, pages 70–80, Copenhagen, Denmark, 2023. ACM.
[15] A. Gurung, J. Lin, J. Gutterman, D. R. Thomas, A. Houk, S. Gupta, and K. Koedinger. Human tutoring improves the impact of AI tutor use on learning outcomes. In International Conference on Artificial Intelligence in Education, pages 393–407. Springer, 2025.
[16] J. Guryan, J. Ludwig, M. P. Bhatt, P. J. Cook, J. M. Davis, K. Dodge, and G. Stoddard. Not too late: Improving academic outcomes among adolescents. American Economic Review, 113(3):738–765, 2023.
[17] P. R. Hahn, J. S. Murray, and C. M. Carvalho. Bayesian regression tree models for causal inference: Regularization, confounding, and heterogeneous effects (with discussion). Bayesian Analysis, 15(3):965–1056, 2020.
[18] D. W. Harrison, D. J. Brown, and S. Higgins. A pilot impact study to evaluate the effectiveness of Eedi on raising attainment in mathematics at KS3. Technical report, What Worked Education, 2023.
[19] R. F. Kizilcec, G. M. Davis, and G. L. Cohen. Towards equal opportunities in MOOCs: Affirmation reduces gender & social-class achievement gaps in China. In Proceedings of the Fourth (2017) ACM Conference on Learning @ Scale, pages 121–130, 2017.
[20] R. F. Kizilcec, A. Saltarelli, P. Bonfert-Taylor, M. Goudzwaard, E. Hamonic, and R. Sharrock. Welcome to the course: Early social cues influence women's persistence in computer science. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pages 1–13, 2020.
[21] M. A. Kraft, D. S. Edwards, and M. Cannata. The scaling dynamics and causal effects of a district-operated tutoring program. EdWorkingPaper No. 24-1030. Annenberg Institute for School Reform at Brown University, 2024.
[22] M. A. Kraft, B. E. Schueler, and G. Falken. What impacts should we expect from tutoring at scale? Exploring meta-analytic generalizability, 2024.
[23] S. Mojarad, A. Essa, S. Mojarad, and R. S. Baker. Studying adaptive learning efficacy using propensity score matching. In Companion Proceedings of the 8th International Conference on Learning Analytics and Knowledge (LAK'18), pages 5–9, 2018.
[24] A. Nickow, P. Oreopoulos, and V. Quan. The promise of tutoring for preK–12 learning: A systematic review and meta-analysis of the experimental evidence. American Educational Research Journal, 61(1):74–107, 2024.
[25] Z. A. Pardos and N. T. Heffernan. Modeling individualization in a Bayesian networks implementation of knowledge tracing. In International Conference on User Modeling, Adaptation, and Personalization, pages 255–266. Springer, 2010.
[26] Y. Pei, A. Sales, and J. Gagnon-Bartsch. Boosting precision in educational A/B tests using auxiliary information and design-based estimators. In Proceedings of the 17th International Conference on Educational Data Mining, pages 990–993, 2024.
[27] R. Pelánek. Bayesian knowledge tracing, logistic models, and beyond: An overview of learner modeling techniques. User Modeling and User-Adapted Interaction, 27(3):313–350, 2017.
[28] D. M. Pham, K. P. Vanacore, A. C. Sales, and J. A. Gagnon-Bartsch. Lool: Towards personalization with flexible & robust estimation of heterogeneous treatment effects. In Proceedings of the 17th International Conference on Educational Data Mining, pages 376–384. International Educational Data Mining Society, 2024.
[29] C. Piech et al. Deep knowledge tracing. In Advances in Neural Information Processing Systems, pages 505–513, 2015.
[30] E. Prihar, A. Moore, and N. Heffernan. Identifying explanations within student-tutor chat logs. In Proceedings of the 15th International Conference on Educational Data Mining, pages 773–777, 2022.
[31] E. Prihar, T. Patikorn, A. Botelho, A. Sales, and N. Heffernan. Toward personalizing students' education with crowdsourced tutoring. Pages 37–45. Association for Computing Machinery, Inc, 2021.
[32] E. Prihar, M. Syed, and K. Ostrow. Exploring common trends in online educational experiments. Page 12, 2022.
[33] C. D. Robinson and S. Loeb. High-impact tutoring: State of the research and priorities for future learning. National Student Support Accelerator, 21(284):1–53, 2021.
[34] C. D. Robinson, C. Pollard, S. Novicoff, S. White, and S. Loeb. The effects of virtual tutoring on young readers: Results from a randomized controlled trial. Educational Evaluation and Policy Analysis, 47(4):1245–1265, 2025.
[35] D. B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66:688–701, 1974.
[36] A. C. Sales, B. B. Hansen, and B. Rowan. Rebar: Reinforcing a matching estimator with predictions from high-dimensional covariates. Journal of Educational and Behavioral Statistics, 43(1):3–31, 2018.
[37] A. C. Sales and J. F. Pane. The effect of teachers reassigning students to new cognitive tutor sections. International Educational Data Mining Society, 2020.
[38] D. R. Thomas, J. Lin, E. Gatz, A. Gurung, S. Gupta, K. Norberg, and K. R. Koedinger. Improving student learning with hybrid human-AI tutoring: A three-study quasi-experimental investigation. In Proceedings of the 14th Learning Analytics and Knowledge Conference, pages 404–415, 2024.
[39] K. Vanacore, A. Sales, A. Liu, and E. Ottmar. Benefit of gamification for persistent learners: Propensity to replay problems moderates algebra-game effectiveness. In Tenth ACM Conference on Learning @ Scale (L@S '23), Copenhagen, Denmark, 2023. ACM.
[40] S. Wager and S. Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242, 2018.
[41] A. Wang, A. Rysbek, A. Huber, A. Nambiar, A. Kenolty, B. Caulfield, B. Lilley-Draper, B. Groot, B. Veprek, C. Burdett, C. Willis, C. Barton, D. Smith, G. Mu, H. Walters, I. Jurenka, I. Hulls, and V. Brazão. AI tutoring can safely and effectively support students: An exploratory RCT in UK classrooms, 2025. arXiv preprint.
[42] R. E. Wang, A. T. Ribeiro, C. D. Robinson, S. Loeb, and D. Demszky. Tutor copilot: A human-AI approach for scaling real-time expertise. arXiv preprint arXiv:2410.03017, 2024.
[43] E. Wu and J. A. Gagnon-Bartsch. The loop estimator: Adjusting for covariates in randomized experiments. Evaluation Review, 42(4):458–488, 2018.
