TokaMark: A Comprehensive Benchmark for MAST Tokamak Plasma Models
Development and operation of commercially viable fusion energy reactors such as tokamaks require accurate predictions of plasma dynamics from sparse, noisy, and incomplete sensor readings. The complexity of the underlying physics and the heterogeneity of experimental data pose formidable challenges for conventional numerical methods, while simultaneously highlighting the promise of modern data-native AI approaches. However, a major obstacle to realizing this potential is the lack of curated, openly available datasets and standardized benchmarks. Existing fusion datasets are scarce, fragmented across institutions, facility-specific, and inconsistently annotated, which limits reproducibility and prevents a fair and scalable comparison of AI approaches. In this paper, we introduce TokaMark, a structured benchmark to evaluate AI models on real experimental data collected from the Mega Ampere Spherical Tokamak (MAST). TokaMark provides a comprehensive suite of tools designed to (i) unify access to multi-modal heterogeneous fusion data, and (ii) harmonize formats, metadata, temporal alignment, and evaluation protocols to enable consistent cross-model and cross-task comparisons. The benchmark includes a curated suite of 14 tasks spanning a range of physical mechanisms, exploiting a variety of diagnostics, and covering multiple operational use cases. A baseline model is provided to facilitate transparent comparison and validation within a unified framework. By establishing a unified benchmark for both the fusion and AI-for-science communities, TokaMark aims to accelerate progress in data-driven AI-based plasma modeling, contributing to the broader goal of achieving sustainable and stable fusion energy. The benchmark, documentation, and tooling will be fully open sourced upon acceptance to encourage community adoption and contribution.
💡 Research Summary
The paper addresses a critical bottleneck in the development of data‑driven models for tokamak plasma dynamics: the lack of open, well‑curated datasets and standardized benchmarks. Leveraging the publicly available FAIR‑MAST dataset, which contains diagnostic measurements from the Mega Ampere Spherical Tokamak (MAST), the authors construct TokaMark, the first comprehensive benchmark suite for AI models operating on real fusion data.
TokaMark aggregates 39 heterogeneous signals spanning magnetics (flux loops, pickup coils, Mirnov probes), kinetic diagnostics (Thomson scattering, interferometry), radiative measurements (D‑α, soft X‑ray), actuator controls (NBI power, coil currents, voltages), and derived quantities (equilibrium shape parameters, flux maps). These signals are organized along four axes—category (diagnostic, actuator, derived), origin, sampling frequency (0.2 kHz to 500 kHz), and modality (time‑series, 2‑D profiles, 3‑D videos)—forming a clear taxonomy that guides model architecture choices and preprocessing pipelines.
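The four-axis taxonomy can be pictured as a small signal registry. The sketch below is illustrative only: the field names, `SignalSpec` class, and registry entries are hypothetical stand-ins, not the benchmark's actual schema, though the example signals and axis values mirror ones named above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SignalSpec:
    """Hypothetical descriptor for one signal along the four taxonomy axes."""
    name: str
    category: str        # "diagnostic" | "actuator" | "derived"
    origin: str          # subsystem the signal comes from
    sample_khz: float    # sampling frequency in kHz (0.2-500 in the paper)
    modality: str        # "timeseries" | "profile2d" | "video3d"

# Illustrative entries mirroring signals mentioned in the summary.
REGISTRY = [
    SignalSpec("mirnov_probe", "diagnostic", "magnetics",   500.0, "timeseries"),
    SignalSpec("thomson_te",   "diagnostic", "kinetic",       0.2, "profile2d"),
    SignalSpec("nbi_power",    "actuator",   "heating",       1.0, "timeseries"),
    SignalSpec("flux_map",     "derived",    "equilibrium",   1.0, "profile2d"),
]

# The taxonomy guides preprocessing: e.g., group signals by modality so
# each group can be routed to a matching encoder branch.
by_modality = {}
for spec in REGISTRY:
    by_modality.setdefault(spec.modality, []).append(spec.name)
```

Grouping by the modality axis, as here, is one natural way such a taxonomy can drive architecture choices: each modality bucket maps onto a dedicated input branch.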
The benchmark defines 14 downstream tasks grouped into four capability domains: (i) representation learning from multimodal data, (ii) temporal reasoning across fast (sub‑ms) and slow (ms) scales, (iii) robustness to missing or corrupted measurements, and (iv) generalization across disparate operating regimes (different plasma currents, densities, heating powers). Each task follows a window‑based formulation: an input window of length Δ_input precedes a reference time t₀, and an output window of length Δ_output follows (or coincides with) t₀. Tasks are further classified as reconstruction (predicting missing values within the same window) or autoregressive forecasting (predicting future values given past diagnostics and actuator trajectories).
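The window-based formulation above can be sketched in a few lines. `extract_windows` is a hypothetical helper, not part of the released package, and assumes a simple list-based signal with explicit timestamps:

```python
def extract_windows(signal, times, t0, d_input, d_output, forecasting=True):
    """Split a signal into the task's input/output windows around t0.

    Input window:  [t0 - d_input, t0)
    Output window: [t0, t0 + d_output) for forecasting tasks, or the
    input window itself for reconstruction tasks.
    """
    x = [v for v, t in zip(signal, times) if t0 - d_input <= t < t0]
    if forecasting:
        y = [v for v, t in zip(signal, times) if t0 <= t < t0 + d_output]
    else:
        y = list(x)  # reconstruction: the target lives in the same window
    return x, y

# Toy 1 kHz signal, times in ms.
times = [float(i) for i in range(10)]
signal = list(range(10))
x, y = extract_windows(signal, times, t0=5.0, d_input=3.0, d_output=2.0)
# x covers t in [2, 5) -> samples 2, 3, 4; y covers t in [5, 7) -> samples 5, 6
```

For reconstruction tasks the output window coincides with the input window, and the model's job becomes predicting masked values inside it rather than extrapolating forward.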
To enable fair and reproducible evaluation, the authors introduce a hierarchical protocol. Low‑level metrics (RMSE, MAE, spectral distance) assess signal‑wise fidelity; mid‑level metrics evaluate scientific relevance, such as equilibrium reconstruction error, MHD precursor detection rates, and profile prediction accuracy; high‑level metrics quantify operational impact, e.g., ability to maintain target plasma current or avoid disruptions. This multi‑tiered scheme supports both specialized single‑task models and generalist foundation‑model approaches, allowing direct comparison across algorithm families.
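The low-level tier of this protocol is straightforward to sketch. Below, RMSE and MAE follow their standard definitions; the spectral distance is written here as the RMSE between magnitude spectra, which is one plausible reading, and the paper's exact formulation may differ:

```python
import numpy as np

def rmse(pred, true):
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(true)) ** 2)))

def mae(pred, true):
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(true))))

def spectral_distance(pred, true):
    """RMSE between magnitude spectra (an assumed definition)."""
    p = np.abs(np.fft.rfft(pred))
    t = np.abs(np.fft.rfft(true))
    return float(np.sqrt(np.mean((p - t) ** 2)))

t = np.linspace(0.0, 1.0, 256, endpoint=False)
true = np.sin(2 * np.pi * 5 * t)
pred = true + 0.1          # constant offset
print(rmse(pred, true))    # 0.1: RMSE of a constant offset equals the offset
```

The point of separating the tiers is that a model can score well here yet still fail a mid-level check such as equilibrium reconstruction error, which is why signal-wise fidelity alone is not the benchmark's final word.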
A baseline model is provided: a multi‑branch convolutional encoder‑decoder architecture that ingests the various modalities via dedicated 1‑D and 2‑D convolutional streams, merges them with a temporal attention module, and decodes task‑specific targets. The baseline is trained independently for each task, achieving R² scores above 0.85 on most tasks and demonstrating resilience to up to 30 % random data loss with less than 10 % degradation in reconstruction error. Generalization tests on unseen shots retain R² above 0.78, underscoring the importance of diverse training data.
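The multi-branch pattern can be illustrated with a toy numpy sketch: per-modality convolutional streams, concatenated and pooled by temporal attention. This is not the authors' implementation; the kernel sizes are arbitrary and random weights stand in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, kernel):
    """Same-length 1-D convolution for one modality stream."""
    return np.convolve(x, kernel, mode="same")

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Two toy modality streams on a shared time base of T steps.
T = 64
magnetics = rng.standard_normal(T)   # e.g. a Mirnov-probe trace
actuator = rng.standard_normal(T)    # e.g. an NBI power trajectory

# Dedicated convolutional encoders, one per stream (random kernels
# standing in for learned ones).
f1 = conv1d(magnetics, rng.standard_normal(5))
f2 = conv1d(actuator, rng.standard_normal(5))
features = np.stack([f1, f2], axis=1)   # (T, 2) merged feature sequence

# Temporal attention: score each time step, pool into one context vector.
query = rng.standard_normal(2)
weights = softmax(features @ query)     # (T,) attention weights, sum to 1
context = weights @ features            # (2,) pooled representation

# A task-specific decoder head would then map `context` to the target.
```

In the actual baseline the 2-D streams (profiles, flux maps) would use 2-D convolutions, and the decoder is task-specific; this sketch only shows how heterogeneous streams are merged before attention pooling.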
Beyond the benchmark itself, the authors release a Python package (PyTorch‑compatible) that handles data loading, alignment, masking, batching, and metric computation, together with Docker containers for reproducibility. The dataset is frozen, schema‑normalized, and enriched with metadata (units, physical semantics), adhering to FAIR principles.
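Two of the preprocessing steps named above, temporal alignment and masking, can be sketched with numpy. The released package's API is not shown here; `align` and `random_mask` are hypothetical helpers, with alignment done by linear interpolation onto a shared grid:

```python
import numpy as np

def align(times, values, grid):
    """Linearly interpolate a signal onto a shared time grid."""
    return np.interp(grid, times, values)

def random_mask(x, frac_missing, rng):
    """Drop a fraction of samples; return masked data and observation mask."""
    mask = rng.random(x.shape) >= frac_missing   # True = observed
    masked = np.where(mask, x, np.nan)
    return masked, mask

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 101)   # shared 101-point time base

# Two signals at very different rates, echoing the 0.2-500 kHz spread.
t_fast = np.linspace(0.0, 1.0, 1001)
t_slow = np.linspace(0.0, 1.0, 11)
fast = align(t_fast, np.sin(2 * np.pi * t_fast), grid)
slow = align(t_slow, np.cos(2 * np.pi * t_slow), grid)

# Simulate ~30% random data loss, as in the robustness experiments.
masked, mask = random_mask(fast, frac_missing=0.3, rng=rng)
```

Resampling every signal onto one grid is the simplest alignment strategy; a production loader might instead keep native rates and attach per-sample timestamps, but the masked/observed split shown here is the standard way robustness-to-missing-data tasks are posed.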
In summary, TokaMark establishes a unified, open‑source ecosystem for AI‑for‑Science research in magnetic confinement fusion. By standardizing data access, task definitions, and evaluation, it removes a major barrier to systematic progress, facilitates cross‑institutional reproducibility, and paves the way for the development of both specialized and generalist plasma models. Future work can extend TokaMark to other devices (e.g., ITER, DIII‑D), incorporate physics‑informed neural networks or graph neural networks, and explore large‑scale pre‑training on multimodal fusion data to create foundation models that capture the underlying plasma physics.