Cyclic Adaptive Private Synthesis for Sharing Real-World Data in Education
The rapid adoption of digital technologies has greatly increased the volume of real-world data (RWD) in education. While these data offer significant opportunities for advancing learning analytics (LA), secondary use for research is constrained by privacy concerns. Differentially private synthetic data generation is regarded as the gold-standard approach to sharing sensitive data, yet studies on the private synthesis of educational data remain very scarce and rely predominantly on large, low-dimensional open datasets. Educational RWD, however, are typically high-dimensional and small in sample size, leaving the potential of private synthesis underexplored. Moreover, because educational practice is inherently iterative, data sharing is continual rather than one-off, making a traditional one-shot synthesis approach suboptimal. To address these challenges, we propose the Cyclic Adaptive Private Synthesis (CAPS) framework and evaluate it on authentic RWD. By iteratively sharing RWD, CAPS not only fosters open science but also offers rich opportunities for design-based research (DBR), thereby amplifying the impact of LA. Our case study using actual RWD demonstrates that CAPS outperforms a one-shot baseline while highlighting challenges that warrant further investigation. Overall, this work offers a crucial first step towards privacy-preserving sharing of educational RWD and expands the possibilities for open science and DBR in LA.
💡 Research Summary
The paper tackles the growing tension between the abundance of real‑world educational data (RWD) and the strict privacy constraints that limit their secondary use for learning analytics (LA) research. While differentially private (DP) synthetic data generation is widely regarded as the gold‑standard for privacy‑preserving data sharing, most existing work focuses on large, low‑dimensional public datasets. In contrast, educational RWD are typically high‑dimensional, small‑sample, and arrive repeatedly across cohorts, making a one‑shot synthesis approach both inefficient and privacy‑budget intensive.
To address these challenges, the authors introduce the Cyclic Adaptive Private Synthesis (CAPS) framework. CAPS operates in three recurring steps. First, a large unconditional variational auto‑encoder (VAE), denoted M1, is pre‑trained on publicly available data that share the same feature space as the private educational data. Second, for each cycle t, the pre‑trained M1 generates unlabeled synthetic data D′_t, which are combined with the newly collected private dataset D_t (including its labels Y_t). A smaller conditional generative model M2 is then trained on the union D_t ∪ D′_t using semi‑private semi‑supervised learning (SPSSL). DP noise is added only when processing the private portion, preserving the overall privacy budget while treating D′_t as public under DP’s post‑processing property. The resulting stacked model (M1+M2) can be released either as a synthetic dataset or as a model, both retaining the (ε,δ)‑DP guarantee.
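The key accounting idea in step two — DP noise is applied only to computations that touch the private data D_t, while anything derived from the public synthetic pool D′_t is free under post-processing — can be illustrated with a toy Gaussian-mechanism sketch. This is not the paper's SPSSL training procedure; the function, bounds, and budget values below are illustrative assumptions:

```python
import numpy as np

def gaussian_mechanism(value, sensitivity, epsilon, delta, rng):
    """Release `value` with (epsilon, delta)-DP via the Gaussian mechanism.

    Classic calibration: sigma = sensitivity * sqrt(2 ln(1.25/delta)) / epsilon
    (valid for epsilon <= 1)."""
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return value + rng.normal(0.0, sigma, size=np.shape(value))

rng = np.random.default_rng(0)

# Private cohort features, clipped to [0, 1] so the mean has sensitivity 1/n.
D_private = rng.uniform(0, 1, size=(50, 4))
# Synthetic features from the pre-trained M1: treated as public, so no noise.
D_synthetic = rng.uniform(0, 1, size=(200, 4))

n = len(D_private)
noisy_private_mean = gaussian_mechanism(D_private.mean(axis=0),
                                        sensitivity=1.0 / n,
                                        epsilon=0.5, delta=1e-5, rng=rng)

# Mixing the noisy statistic with the public pool is post-processing,
# so the result remains (0.5, 1e-5)-DP with respect to D_private.
combined_mean = 0.5 * (noisy_private_mean + D_synthetic.mean(axis=0))
```

The same logic is what lets CAPS train M2 on the union D_t ∪ D′_t while charging the privacy budget only for the private portion.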
Third, the synthetic features X′_t produced by the just‑trained M1+M2 are fed back into M1 via a continual‑learning update. This step avoids catastrophic forgetting of earlier knowledge and incrementally refines the feature extractor, effectively “learning the private knowledge” from each cycle without consuming additional privacy budget. The updated M1 is then used in the next cycle, creating a closed loop that adapts over time.
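The three recurring steps can be sketched as a minimal loop. The Gaussian "models" below are crude stand-ins for the VAE M1 and the conditional model M2, and every class name, bound, and update rule here is an illustrative assumption rather than the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(42)

class M1:
    """Stand-in for the unconditional generator: features as a diagonal Gaussian."""
    def __init__(self, X_public):
        self.mu, self.sigma = X_public.mean(0), X_public.std(0) + 1e-6

    def sample(self, n):
        return rng.normal(self.mu, self.sigma, size=(n, len(self.mu)))

    def continual_update(self, X_synth, lr=0.2):
        # Crude proxy for a regularized continual-learning step: nudge the
        # statistics toward the cycle's synthetic features instead of
        # overwriting them, to avoid forgetting earlier knowledge.
        self.mu = (1 - lr) * self.mu + lr * X_synth.mean(0)
        self.sigma = (1 - lr) * self.sigma + lr * (X_synth.std(0) + 1e-6)

def train_m2_and_sample(D_t, Y_t, D_prime, n_out, eps, delta):
    # Stand-in for SPSSL training of M2: per-class means from the private
    # data get Gaussian-mechanism noise; the public pool D_prime only
    # shapes the spread, costing no extra budget (post-processing).
    sigma = (1.0 / len(D_t)) * np.sqrt(2 * np.log(1.25 / delta)) / eps
    X_out, Y_out = [], []
    for c in np.unique(Y_t):
        mu_c = D_t[Y_t == c].mean(0) + rng.normal(0, sigma, D_t.shape[1])
        X_out.append(rng.normal(mu_c, D_prime.std(0), size=(n_out, D_t.shape[1])))
        Y_out.append(np.full(n_out, c))
    return np.vstack(X_out), np.concatenate(Y_out)

# Step 1: pre-train M1 on (mock) public data sharing the feature space.
m1 = M1(rng.normal(0, 1, size=(500, 3)))
for t in range(3):
    D_t = rng.normal(t * 0.1, 1, size=(40, 3))   # new private cohort
    Y_t = rng.integers(0, 2, size=40)            # cycle-specific labels
    D_prime = m1.sample(200)                     # step 2a: synthetic pool
    X_synth, Y_synth = train_m2_and_sample(D_t, Y_t, D_prime,
                                           n_out=100, eps=1.0, delta=1e-5)
    m1.continual_update(X_synth)                 # step 3: feed back into M1
```

Each pass through the loop mirrors one CAPS cycle: sample from M1, privately fit and sample the conditional model, then refine M1 on the released synthetic features without spending additional budget.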
The authors validate CAPS with a case study from a Japanese lower‑secondary school. Over three years (2022‑2024), weekly mathematics practice tests and learning‑habit logs were collected from a 7th‑grade class via an e‑book platform (BookRoll) and a goal‑oriented active learning system (GOAL). Each year introduced a different assessment format, resulting in distinct label spaces Y_1, Y_2, Y_3 while keeping the feature space constant. Using CAPS, the authors observed progressive improvements in downstream classification accuracy (up to +5.1% over the baseline) and reductions in reconstruction loss (≈12% average decrease) across cycles, indicating that the cyclic refinement of M1 yields richer latent representations. However, they also identified a modest degradation in certain synthetic‑data quality metrics, coining this phenomenon the “compounding bias effect.” This suggests that bias introduced in early cycles can accumulate, highlighting a need for bias‑mitigation strategies in future work.
In the related‑work discussion, the paper contrasts CAPS with prior privacy‑preserving approaches in education, noting that most earlier studies either focus on DP‑protected predictive modeling or rely on large public datasets for private synthesis. CAPS uniquely targets iterative data sharing, aligns with the design‑based research (DBR) paradigm, and distinguishes itself from longitudinal DP synthesis that repeatedly releases data from the same individuals; instead, CAPS generates synthetic data for distinct cohorts while preserving a consistent educational context.
The authors acknowledge several limitations: the fixed allocation of the privacy budget (ε, δ) may not be optimal for all cycles; the impact of synthetic data on real classroom practice remains untested; and continual‑learning updates could introduce model drift if not carefully regularized. They propose future research directions, including adaptive privacy budgeting, empirical studies of synthetic data in classroom settings, and advanced regularization techniques to curb bias accumulation.
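As a rough illustration of what adaptive privacy budgeting might involve, one simple scheme splits a fixed total ε across cycles in proportion to chosen weights, e.g. granting later cycles a larger share. The function and weighting below are hypothetical, not something the paper proposes:

```python
def adaptive_schedule(eps_total, weights):
    """Split a total epsilon budget across cycles in proportion to `weights`
    (e.g., give later cycles more budget as utility demands grow)."""
    total = sum(weights)
    return [eps_total * w / total for w in weights]

# Three cycles under a total budget of epsilon = 3.0, weighted 1:2:3.
schedule = adaptive_schedule(3.0, [1, 2, 3])
```

Whether such per-cycle allocation is even needed depends on the composition argument: if each cycle draws on a genuinely distinct cohort, the releases do not compose over the same individuals, which is part of what makes the budgeting question open.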
Overall, CAPS offers a practical, theoretically grounded solution for the continual, privacy‑preserving sharing of high‑dimensional, small‑sample educational data. By enabling cyclic data release, it supports open science, facilitates richer design‑based investigations, and paves the way for more collaborative, data‑driven advancements in learning analytics.