An Empirical Framework for Evaluating Semantic Preservation Using Hugging Face
Nan Jia
CUNY, the Graduate Center
njia@gradcenter.cuny.edu
Anita Raja
CUNY, Hunter College
CUNY, the Graduate Center
anita.raja@hunter.cuny.edu
Raffi Khatchadourian
CUNY, Hunter College
CUNY, the Graduate Center
khatchad@hunter.cuny.edu
Abstract
As machine learning (ML) becomes an integral part of high-autonomy systems, it is critical to ensure the trustworthiness of learning-enabled software systems (LESS). Yet, the nondeterministic and run-time-defined semantics of ML complicate traditional software refactoring. We define semantic preservation in LESS as the property that optimizations of intelligent components do not alter the system’s overall functional behavior. This paper introduces an empirical framework to evaluate semantic preservation in LESS by mining model evolution data from Hugging Face. We extract commit histories, Model Cards, and performance metrics from a large number of models. To establish baselines, we conducted case studies in three domains, tracing performance changes across versions. Our analysis demonstrates how semantic drift can be detected via evaluation metrics across commits and reveals common refactoring patterns based on commit message analysis. Although API constraints limited our ability to estimate a full-scale threshold, our pipeline offers a foundation for defining community-accepted boundaries for semantic preservation. Our contributions include: (1) a large-scale dataset of ML model evolution, curated from 1.7 million Hugging Face entries via a reproducible pipeline using the native HF hub API, (2) a practical pipeline for evaluating semantic preservation on a subset of 536 models and 4,000+ metrics, and (3) empirical case studies illustrating semantic drift in practice. Together, these contributions advance the foundations for more maintainable and trustworthy ML systems.
Keywords: Refactoring, Semantic Drift, Learning-enabled Software Systems, Hugging Face, Software Evolution
1. Introduction
In traditional software engineering, behavior-preserving system transformation is a well-understood concept. As first introduced by Opdyke (1992), refactoring in object-oriented programming involves systematically restructuring code without altering its external behavior. However, in learning-enabled software systems (LESS), where Machine Learning (ML) models and data drive system behavior, refactoring is far more ambiguous. How can we verify that fine-tuning produces a trustworthy transformation (Ao et al., 2023; Jia et al., 2024) that maintains both the system’s behavioral integrity (Tang et al., 2021) and interpretable decision-making (Molnar et al., 2020)? The uncertainty is especially problematic in high-autonomy domains such as safety-critical infrastructure, finance, and healthcare (Hu et al., 2022; Nahar et al., 2024; Pan & Rajan, 2020; Zhuo et al., 2023), where ML components must remain reliable and explainable under continuous evolution. Yet, despite this need, there is currently no empirical baseline for what counts as a safe or semantics-preserving change when updating models, training data, or documentation. This gap raises risks not only in performance regression but also in trust, reproducibility, and downstream system reliability.
To mitigate these risks, it is increasingly critical to understand how ML models evolve while retaining their original intent. Unlike traditional software artifacts, ML models are not static; they are rapidly updated through fine-tuning, performance optimization, and documentation updates. Our goal in this paper is to uncover patterns and boundaries of semantic preservation during system transformation in widely used pretrained ML models that are hosted on Hugging Face.
Studying this semantic transformation in deployed LESS is difficult due to proprietary constraints (H. V. Nguyen, 2025) and dynamic environments (David, 2020; Hu et al., 2022; Pollano et al., 2023). Fortunately, the Hugging Face platform, often referred to as the “GitHub for ML models” (Ait et al., 2023; Pan et al., 2022), offers a uniquely rich and open environment to observe these dynamics at scale. Each model repository hosted on Hugging Face includes not only model weights and configurations but also version-controlled Model Cards (Mitchell et al., 2019) and commit histories. These artifacts allow researchers to trace intra-repository evolution, that is, how individual models change over time within a single project, not just at a snapshot but across multiple versions.
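To make this concrete, the following is a minimal, illustrative sketch, not the paper's released pipeline, of how commit histories and Model Card metrics can be traced with the huggingface_hub library. The repository id is a hypothetical example and error handling is simplified; only documented huggingface_hub calls (list_repo_commits, hf_hub_download, ModelCard.load) are used.

# Sketch (assumption: `pip install huggingface_hub`): trace the evaluation
# metrics reported in a repository's Model Card across its commit history.
from huggingface_hub import HfApi, ModelCard, hf_hub_download

# Illustrative example repository; any public Hugging Face model id works.
REPO_ID = "distilbert-base-uncased-finetuned-sst-2-english"

api = HfApi()
commits = api.list_repo_commits(REPO_ID)  # commit metadata, newest first

for commit in reversed(commits):  # walk the history oldest -> newest
    try:
        # Fetch the Model Card (README.md) as it existed at this commit
        # and parse its YAML metadata.
        card_path = hf_hub_download(REPO_ID, "README.md",
                                    revision=commit.commit_id)
        card = ModelCard.load(card_path)
    except Exception:
        continue  # commit may predate the Model Card or lack parsable metadata
    for result in card.data.eval_results or []:
        # Each EvalResult carries the dataset, metric name, and reported value,
        # yielding a commit-level metric trajectory for drift analysis.
        print(commit.created_at, commit.commit_id[:7],
              result.dataset_name, result.metric_type, result.metric_value)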
In this paper, we present the first empirical study, to our knowledge, of intra-repository evolution on the Hugging Face (HF) platform by operationally defining semantic preservation via metric stability, which in our work means assessing and visualizing each repository’s commit-specific trajectories over temporal changes in its Model Card. While our current study defines semantic preservation