Textual semantics and machine learning methods for data product pricing

November 27, 2025

Reading time: 5 minutes

...

📝 Original Info

Title: Textual semantics and machine learning methods for data product pricing
ArXiv ID: 2511.22185
Date: 2025-11-27
Authors: Ruize Gao, Feng Xiao, Jinpu Li, Shaoze Cui

📝 Abstract

Reasonable pricing of data products enables data trading platforms to maximize revenue and foster the growth of the data trading market. The textual semantics of data products are vital for pricing and contain significant value that remains largely underexplored. Therefore, to investigate how textual features influence data product pricing, we employ five prevalent text representation techniques to encode the descriptive text of data products. We then employ six machine learning methods to predict data product prices: linear regression, neural networks, decision trees, support vector machines, random forests, and XGBoost. Our empirical design consists of two tasks: a regression task that predicts the continuous price of data products, and a classification task that discretizes price into ordered categories. Furthermore, we conduct feature importance analysis using the mRMR feature selection method and SHAP-based interpretability techniques. Based on empirical data from the AWS Data Exchange, we find that for predicting continuous prices, Word2Vec text representations capturing semantic similarity yield superior performance. In contrast, for price-tier classification tasks, simpler representations that do not rely on semantic similarity, such as Bag-of-Words and TF-IDF, perform better. SHAP analysis reveals that semantic features related to healthcare and demographics tend to increase prices, whereas those associated with weather and environmental topics are linked to lower prices. This analytical framework significantly enhances the interpretability of pricing models.

💡 Deep Analysis

📄 Full Content

In the era of the digital economy, data, often referred to as the "new oil," has become an invaluable resource (Pei, 2020; Zhu et al., 2024). Governments, enterprises, and individuals worldwide are increasingly leveraging data-driven approaches to unlock substantial economic value.

For example, the U.S. healthcare sector could generate more than USD 300 billion in value annually by fully harnessing big data. In the retail sector, the adoption of dynamic pricing algorithms based on consumer behavioral data has been shown to increase revenue by 11% to 19% (Fisher and Raman, 2018). Similarly, the UK’s HM Revenue & Customs (HMRC) recovered GBP 1.4 billion in unpaid taxes in a single year through its Connect system, which integrates 28 data sources to support anomaly detection and risk modeling (Maciejewski, 2017).

The rapid growth in data-driven demand has fueled the rise of data marketplaces (DMs), online platforms that bring together data providers, consumers, and brokers to facilitate the exchange of data products (Zhang et al., 2023). According to a report by Grand View Research (2025), the global DM market was valued at approximately USD 1.49 billion in 2024 and is projected to reach USD 5.73 billion by 2030. In data marketplaces, an effective data pricing model can enhance transaction efficiency, enable platforms to maximize revenue, and foster the sustained growth of the market (Pei, 2020; Zhang et al., 2023). Traditional cost-based pricing models often underestimate the intrinsic value embedded in data (Li, 2024; Wang et al., 2022). Revenue-based pricing models, in contrast, may generate biased prices because different data consumers form heterogeneous expectations about future benefits, making it difficult to establish a fair and consistent price.

In comparison, market-based and data-driven pricing approaches allow data providers to determine more reasonable prices by considering the features of the data itself and the existing prices of comparable datasets. Such mechanisms further incentivize data providers to continuously improve data quality and proactively participate in data trading.

Existing data-driven approaches to data product pricing mainly include linear regression models, machine learning techniques, and explainable learning methods. Linear regression models capture the linear relationships between data product prices and their influencing factors. For example, Han et al. (2022) employed Lasso regression to identify features that are valuable to users and then combined these features with seller-defined unit prices to enable automatic pricing. Machine learning (ML) methods, in turn, help uncover nonlinear relationships between data features and data product prices. For instance, Hao et al. (2024) introduced a heterogeneous ensemble pricing model based on clustering strategies to enhance the accuracy of data asset valuation, and Shang et al. (2025) applied a LightGBM-based model for pricing medical data.
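To make the Lasso idea concrete, the following is a minimal sketch of how an L1 penalty can drive the weight of an uninformative feature to exactly zero. It is a generic coordinate-descent illustration, not the implementation of Han et al. (2022); the function names, the two toy features, and the penalty value `alpha=0.5` are all assumptions made for this example.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero; values inside the band become exactly 0.0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_fit(X, y, alpha, n_iters=100):
    """Coordinate descent for (1/(2n)) * ||y - Xw - b||^2 + alpha * ||w||_1."""
    n, p = len(X), len(X[0])
    # Center features and target so the intercept can be recovered afterwards.
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]

    w = [0.0] * p
    for _ in range(n_iters):
        for j in range(p):
            rho = 0.0      # correlation of feature j with the partial residual
            sq_norm = 0.0  # squared norm of feature j
            for i in range(n):
                partial = sum(Xc[i][k] * w[k] for k in range(p) if k != j)
                rho += Xc[i][j] * (yc[i] - partial)
                sq_norm += Xc[i][j] ** 2
            if sq_norm == 0.0:
                w[j] = 0.0
            else:
                w[j] = soft_threshold(rho / n, alpha) / (sq_norm / n)
    intercept = y_mean - sum(w[j] * col_means[j] for j in range(p))
    return w, intercept


# Hypothetical toy data: column 0 drives the price, column 1 is unrelated to it.
X = [[1.0, 1.0], [2.0, -1.0], [3.0, 0.0], [4.0, 0.0], [5.0, -1.0], [6.0, 1.0]]
y = [3.0 * row[0] for row in X]

weights, intercept = lasso_fit(X, y, alpha=0.5)
selected = [j for j, wj in enumerate(weights) if wj != 0.0]
print(weights, selected)
```

In this toy setting only the first feature keeps a nonzero weight, which is the sense in which Lasso performs feature selection. The surviving weight is also shrunk below its true value of 3, the usual price of the L1 penalty.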

For explainable learning, existing studies primarily adopt methods such as SHAP to analyze the influence of different features on data product prices. For example, Zhu et al. (2024) generated semantic embeddings of dataset descriptions using mBERT and then applied SHAP to evaluate the relative importance of pricing factors.

Although existing studies have made initial progress in applying machine learning methods to data product pricing, most current work still relies primarily on numerical features (e.g., dataset size, update frequency, category) to characterize data products. Few studies extract value-relevant information from textual semantics. In practice, on data marketplaces such as AWS Data Exchange and Datarade, data products are typically presented through natural language descriptions, including data product titles, summaries, detailed descriptions, and application scenarios. The semantic content embedded in this text not only reflects the product’s thematic and domain attributes but also conveys implicit signals about its market positioning and perceived value (Gan et al., 2025). Although Zhu et al. (2024) made an early attempt to incorporate textual features into data product pricing, their approach relied solely on a single mBERT-based textual representation. It did not consider other widely used textual representations such as Bag-of-Words (BOW), TF-IDF, or topic-based models. Moreover, data product pricing commonly takes two forms, i.e., precise (continuous) pricing and tiered (categorical) pricing, yet their study did not investigate how different textual representation methods may influence these two types of pricing tasks.
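As a rough illustration of the simpler representations mentioned above, the sketch below builds Bag-of-Words and TF-IDF vectors for three product descriptions and derives both a continuous and a tiered price target. The descriptions, prices, and tier cut points are invented for this example, and the plain `log(N / df)` weighting is only one of several common TF-IDF variants; none of this is taken from the paper's actual pipeline.

```python
import bisect
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; real pipelines usually do more cleaning."""
    return text.lower().split()


def bag_of_words(docs):
    """Return the sorted vocabulary and one raw term-count vector per document."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    vectors = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tf_idf(docs):
    """Weight term frequency by log(N / document frequency)."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / df) for df in doc_freq]
    vectors = []
    for row in counts:
        total = sum(row)
        vectors.append([(c / total) * w for c, w in zip(row, idf)])
    return vocab, vectors


# Hypothetical product descriptions and prices (USD), for illustration only.
descriptions = [
    "daily weather data for us cities",
    "patient healthcare claims data",
    "us demographics data by county",
]
prices = [500.0, 12000.0, 2500.0]

vocab, bow_vectors = bag_of_words(descriptions)
_, tfidf_vectors = tf_idf(descriptions)

# Continuous target: the price itself. Tiered target: an ordered category index
# obtained from assumed cut points (0 = low, 1 = medium, 2 = high).
tier_cut_points = [1000.0, 10000.0]
tiers = [bisect.bisect_right(tier_cut_points, p) for p in prices]

data_idx = vocab.index("data")
print([row[data_idx] for row in bow_vectors])    # raw counts of "data"
print([row[data_idx] for row in tfidf_vectors])  # TF-IDF weights of "data"
print(tiers)
```

The word "data" occurs in every description, so Bag-of-Words counts it in each vector while TF-IDF assigns it zero weight. Terms such as "healthcare" that appear in only one description keep a positive TF-IDF weight.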

To investigate how different textual feature representation methods influence various types of data product pricing tasks, we conduct a systematic empirical analysis. First, we extract textual information from data products and represent it using five prevalent textual feature engineering methods: Bag-
