Accurate and cost-effective quantification of the agroecosystem carbon cycle at decision-relevant scales is essential for climate mitigation and sustainable agriculture. However, both transfer learning and the exploitation of spatial variability in this field are challenging because they involve heterogeneous data and complex cross-scale dependencies. Conventional approaches often rely on location-independent parameterizations and independent training, which underutilizes transfer learning and the spatial heterogeneity of the inputs and limits applicability in regions with substantial variability. We propose FTBSC-KGML (Fine-Tuning-Based Site Calibration-Knowledge-Guided Machine Learning), a spatial-variability-aware, knowledge-guided machine learning framework that augments KGML-ag with a pretraining-fine-tuning process and site-specific parameters. Using remote-sensing GPP, climate, and soil covariates collected across multiple Midwestern sites, FTBSC-KGML estimates land emissions while leveraging transfer learning and spatial heterogeneity. A key component is a spatial-heterogeneity-aware transfer-learning scheme in which a globally pretrained model is fine-tuned at each state or site to learn place-aware representations, improving local accuracy under limited data without sacrificing interpretability. Empirically, FTBSC-KGML achieves lower validation error and greater consistency in explanatory power than a purely global model, thereby better capturing spatial variability across states. This work extends the prior SDSA-KGML framework.
Purely data-driven machine learning (ML) models often achieve limited success in scientific domains due to their high data requirements and inability to produce physically consistent results (Willard et al. 2022). Thus, research communities have begun to explore integrating scientific knowledge with ML in a synergistic manner. The burgeoning field of knowledge-guided machine learning (KGML) offers a promising framework that integrates the strengths of process-based (PB) models, machine learning, and multi-source datasets. KGML has proven effective in spatial prediction tasks, such as land emissions estimation. However, current KGML models use location-independent parameters that overlook spatial heterogeneity across their large footprints, leaving little room for effective global-to-local transfer. As a result, performance and interpretability degrade in settings where processes are spatially variable (Moran 1950). To address these issues, this paper proposes a Fine-Tuning-Based Site Calibration-Knowledge-Guided Machine Learning (FTBSC-KGML) framework as a general schema to enhance current KGML methods by incorporating location-based parameter values and cross-site transfer learning. Building on prior work such as SDSA-KGML (Sharma et al. 2025a), the proposed approach introduces a transfer-learning mechanism in which a globally pretrained, physics-guided model is fine-tuned for each site or state. This design bridges awareness of spatial variability with model calibration efficiency, leveraging knowledge from aggregated multi-state training while retaining site-specific interpretability.
The problem is important for accurately predicting carbon and other emissions from land-use activities (e.g., agriculture, deforestation). Quantifying and controlling these emissions is crucial for climate change mitigation, optimum crop management, and maintaining sustainable agriculture.
Predicting land emissions is challenging due to the heterogeneity of factors that affect them. This spatial variability, as implied by Tobler’s First Law of Geography (Tobler 1970), encompasses variations in soil characteristics, moisture content, and other environmental conditions. Moreover, collecting ground truth data for this task is costly, complicating the training of large deep learning models. These challenges call for methods that both effectively capture spatial variability and are guided by physical knowledge.
Related Work: The most common approach for predicting land emissions is based on process-based models, which use scientific theories that explain the underlying phenomena while obeying principles such as mass and energy conservation. However, these models do not perform well under high spatial heterogeneity and variance, which is common in real-world settings (Gupta et al. 2021). Other approaches, especially for small-area estimation, include data-driven machine learning models. However, these models usually require extensive training data, which can be time-consuming and sometimes impossible to obtain.
Knowledge-Guided Machine Learning (KGML) methods have been further explored, incorporating elements of both process-based models and data-driven machine learning approaches. For instance, KGML-ag (Liu et al. 2024) integrates several pretraining steps with knowledge from ecosys, a process-based model for agroecosystems, into a deep learning architecture. KGML-ag effectively addresses challenges such as spatial autocorrelation and scalability to larger datasets. However, as mentioned earlier, the lack of awareness of spatial variability in KGML-ag limits its performance and interpretability.
To address this limitation, Sharma et al. (2025a) proposed the Spatial Distribution-Shift Aware Knowledge-Guided Machine Learning (SDSA-KGML) framework, which introduced location-dependent parameters to explicitly account for spatial heterogeneity and distribution shifts across regions such as Illinois, Iowa, and Indiana. Their results demonstrated that incorporating region-specific parameters enhanced local accuracy under strong spatial variability. Building upon this direction, our proposed Fine-Tuning-Based Site Calibration (FTBSC-KGML) extends SDSA-KGML by introducing a transfer-learning mechanism in which a globally pretrained model is fine-tuned for each site or state. This approach bridges spatial variability awareness with model calibration efficiency, leveraging knowledge from aggregated multi-state training while retaining site-specific interpretability. Empirically, this mechanism significantly improves local performance in data-limited regions, mitigating overfitting and preserving knowledge-guided physical constraints.
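The global-pretraining and per-site fine-tuning idea described above can be illustrated with a minimal sketch. This is not the FTBSC-KGML implementation: it swaps the knowledge-guided deep network for a plain linear regressor, uses synthetic data, and the site names, weight shifts, learning rate, and step count are all hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_site(true_w, n):
    """Synthetic site data: 3 random covariates plus a bias column (assumption)."""
    X = rng.normal(size=(n, 3))
    X = np.hstack([X, np.ones((n, 1))])
    y = X @ true_w + 0.1 * rng.normal(size=n)
    return X, y

def mse(w, X, y):
    """Mean squared error of a linear model with weights w."""
    return float(np.mean((X @ w - y) ** 2))

def fine_tune(w_init, X, y, lr=0.05, steps=200):
    """Gradient descent on the site MSE, starting from the global weights."""
    w = w_init.copy()
    for _ in range(steps):
        grad = (2.0 / len(y)) * (X.T @ (X @ w - y))
        w -= lr * grad
    return w

# Hypothetical sites: a shared response plus a site-specific shift,
# standing in for spatial heterogeneity.
base_w = np.array([1.0, -0.5, 0.3, 2.0])
site_shift = {
    "site_A": np.array([0.6, 0.0, 0.0, -0.5]),
    "site_B": np.array([-0.6, 0.3, 0.0, 0.5]),
    "site_C": np.array([0.0, -0.3, 0.4, 0.0]),
}
sites = {name: make_site(base_w + shift, n=40) for name, shift in site_shift.items()}

# Step 1: "global pretraining" as a least-squares fit on the pooled data.
X_all = np.vstack([X for X, _ in sites.values()])
y_all = np.concatenate([y for _, y in sites.values()])
w_global, *_ = np.linalg.lstsq(X_all, y_all, rcond=None)

# Step 2: per-site fine-tuning initialized from the global weights.
results = {}
for name, (X, y) in sites.items():
    w_site = fine_tune(w_global, X, y)
    results[name] = (mse(w_global, X, y), mse(w_site, X, y))
    print(f"{name}: global MSE={results[name][0]:.4f}, fine-tuned MSE={results[name][1]:.4f}")
```

The errors here are computed on the same data used for fine-tuning, so the sketch only shows that site-specific calibration can fit local behavior that a single pooled model misses. It says nothing about held-out accuracy or about the physical constraints that the knowledge-guided components are meant to preserve.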
Organization: The paper is organized as follows: Section 2 introduces basic concepts. Section 3 formally defines the problem. Section 4 discusses design decisions. Section 5 presents the proposed approach. Section 6 discusses experimental evaluation. Section 7 concludes the paper and discusse