Developing synthetic microdata through machine learning for firm-level business surveys


Public-use microdata samples (PUMS) from the United States (US) Census Bureau on individuals have been available for decades. However, large increases in computing power and the greater availability of Big Data have dramatically increased the probability of re-identifying anonymized data, potentially violating the pledge of confidentiality given to survey respondents. Data science tools can be used to produce synthetic data that preserve critical moments of the empirical data but do not contain the records of any existing individual respondent or business. Developing public-use firm data from surveys presents challenges distinct from those of demographic data: firms lack anonymity, and certain industries can be easily identified within a geographic area. This paper briefly describes a machine learning model used to construct a synthetic PUMS based on the Annual Business Survey (ABS) and discusses various quality metrics. Although the ABS PUMS is currently being refined and results are confidential, we present two synthetic PUMS developed for the 2007 Survey of Business Owners, whose business data are similar to those of the ABS. Econometric replication of a high-impact analysis published in Small Business Economics demonstrates the verisimilitude of the synthetic data to the true data and motivates discussion of possible ABS use cases.


💡 Research Summary

This paper addresses the critical challenge of providing public access to sensitive firm-level business survey data collected by U.S. statistical agencies, such as the Annual Business Survey (ABS). While rich microdata is essential for research and evidence-based policymaking, it is currently restricted to secure Federal Statistical Research Data Centers (FSRDCs) due to confidentiality pledges. Traditional de-identification methods (e.g., suppression, noise infusion) often fail for business data because the combination of detailed industry codes (NAICS) and geographic information, coupled with abundant public data on firms, creates high re-identification risks. This results in a trade-off where increasing privacy protection severely degrades data utility.
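To make the re-identification concern concrete, the following minimal sketch counts firms per industry-by-geography cell and flags cells below a threshold `k`. The records, the county labels, the function name `small_cells`, and the threshold are all hypothetical choices for illustration; they are not taken from the paper or from any Census Bureau disclosure rule.

```python
from collections import Counter

# Hypothetical toy records (industry code, county); not real survey data.
firms = [
    ("722511", "County A"),
    ("722511", "County A"),
    ("722511", "County A"),
    ("336411", "County A"),  # a lone firm in its industry within County A
    ("722511", "County B"),
    ("524113", "County B"),
]


def small_cells(records, k=3):
    """Return the industry-by-geography cells that contain fewer than k firms."""
    counts = Counter(records)
    return {cell: n for cell, n in counts.items() if n < k}


flagged = small_cells(firms, k=3)
for cell, n in sorted(flagged.items()):
    print(cell, n)
```

A firm that sits alone in its cell can be matched to public information about that industry and location even after its name is removed, which is why suppressing or perturbing such cells tends to remove much of the detail researchers want.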

As a solution, the authors propose and evaluate the development of a synthetic Public-Use Microdata Sample (PUMS) using machine learning. Synthetic data consist of artificially generated records that replicate the statistical properties and multivariate relationships of the original confidential data without containing the records of any actual respondent or business, which is intended to protect confidentiality while keeping the data useful for analysis.

The paper outlines the unique difficulties of creating a business PUMS compared to demographic data, highlighting the skewness in firm size distributions and the acute re-identification risks. It argues that common de-identification techniques are insufficient for this context. The core methodological contribution is the application of Classification and Regression Tree (CART)-based generative models, specifically the CenSyn and synthpop synthesizers. These models learn the complex structure of the source data by building decision trees and then sample new synthetic records from the learned distribution. This approach is favored for its effectiveness with tabular survey data, interpretability, and proven use in other official statistics projects.
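The general idea of CART-based synthesis can be sketched in a few lines of plain Python. This is a simplified illustration, not the CenSyn or synthpop implementation: it uses two hypothetical variables (`employees`, `receipts`), simulated skewed toy data, a hand-written one-predictor regression tree, and an assumed minimum leaf size. The first variable is resampled from its observed marginal; the second is drawn from the observed "donor" values in the tree leaf that the synthetic record falls into.

```python
import random


def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def build_tree(rows, min_leaf=5):
    """Recursively split (x, y) rows on x to minimise the squared error of y."""
    rows = sorted(rows)
    best = None
    for i in range(min_leaf, len(rows) - min_leaf + 1):
        if rows[i - 1][0] == rows[i][0]:
            continue  # cannot split between tied x values
        cost = sse([y for _, y in rows[:i]]) + sse([y for _, y in rows[i:]])
        if best is None or cost < best[0]:
            best = (cost, i)
    if best is None:
        return {"leaf": [y for _, y in rows]}  # donor pool for this leaf
    i = best[1]
    return {
        "threshold": (rows[i - 1][0] + rows[i][0]) / 2,
        "left": build_tree(rows[:i], min_leaf),
        "right": build_tree(rows[i:], min_leaf),
    }


def donors(tree, x):
    """Walk down the tree and return the observed y values in x's leaf."""
    while "leaf" not in tree:
        tree = tree["left"] if x < tree["threshold"] else tree["right"]
    return tree["leaf"]


def synthesize(rows, n, seed=0, min_leaf=5):
    """Draw x from its marginal, then y from the donors in the matching leaf."""
    rng = random.Random(seed)
    tree = build_tree(rows, min_leaf)
    xs = [x for x, _ in rows]
    out = []
    for _ in range(n):
        x = rng.choice(xs)
        out.append((x, rng.choice(donors(tree, x))))
    return out


# Simulated, right-skewed toy "firms": (employees, receipts). Not real data.
rng = random.Random(42)
observed = []
for _ in range(200):
    employees = int(rng.lognormvariate(2.0, 1.0)) + 1
    receipts = round(employees * 100 * rng.uniform(0.5, 1.5))
    observed.append((employees, receipts))

synthetic = synthesize(observed, n=200, seed=1)
print(synthetic[:5])
```

Each synthetic row pairs a resampled firm size with receipts borrowed from a similar-sized donor, so the joint relationship is approximately preserved while rows are recombined rather than copied wholesale. Because this toy version reuses observed values directly, it offers no formal privacy guarantee; a production synthesizer would need additional safeguards, particularly for the largest firms in a skewed distribution.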

Since the ABS PUMS itself is under development and confidential, the authors demonstrate their framework using similar data from the 2007 Survey of Business Owners (SBO). They create two synthetic versions of the 2007 SBO PUMS. To validate the quality and “verisimilitude” of the synthetic data, they perform an econometric replication of a high-impact study originally published in Small Business Economics using the synthetic datasets. The successful replication shows that the synthetic data preserves the key statistical relationships found in the true, restricted data.
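A replication check of this kind amounts to fitting the same model on both datasets and comparing the estimates. The sketch below is only an illustration under stated assumptions: the data are simulated, the "synthetic" file is a stand-in drawn from the same hypothetical process, and the model (a one-variable regression of log receipts on log employment) is not the specification used in the replicated study.

```python
import random


def ols_slope(pairs):
    """Ordinary least squares slope of y on x for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return sxy / sxx


def simulate(n, seed):
    """Hypothetical data-generating process: log receipts vs. log employment."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        log_emp = rng.gauss(2.0, 1.0)
        log_rec = 4.0 + 0.9 * log_emp + rng.gauss(0.0, 0.5)
        rows.append((log_emp, log_rec))
    return rows


confidential = simulate(500, seed=1)  # stand-in for the restricted file
synthetic = simulate(500, seed=2)     # stand-in for a synthetic PUMS

b_conf = ols_slope(confidential)
b_syn = ols_slope(synthetic)
print(f"confidential slope: {b_conf:.3f}")
print(f"synthetic slope:    {b_syn:.3f}")
print(f"absolute gap:       {abs(b_conf - b_syn):.3f}")
```

In practice the comparison would cover every coefficient of interest, checking sign, magnitude, and statistical significance, since a synthetic file can match some relationships well and others poorly.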

In conclusion, the paper demonstrates that machine learning-generated synthetic microdata presents a viable path forward for privacy-preserving public data dissemination. It balances the competing demands of analytical utility and strong confidentiality protection for business surveys. The successful replication experiment builds confidence that a future synthetic ABS PUMS could support a wide range of use cases, including academic research, policy analysis, pre-testing of analysis code, and increasing public transparency, all while adhering to modern privacy standards and supporting the goals of legislation like the Foundations for Evidence-Based Policymaking Act.

