A Decade of Public Procurement in Spain: A Longitudinal Open Dataset from the BOE (2014-2024)


This paper presents a longitudinal open dataset of Spanish public procurement extracted from the Official State Gazette (BOE) covering the period 2014-2024. The dataset integrates structured information on contracts, contracting authorities, suppliers, amounts, and procedures, enabling large-scale quantitative analysis of public procurement dynamics in Spain. We describe the data extraction and normalization pipeline, provide descriptive statistical analyses of temporal and sectoral trends, and discuss potential applications in transparency research, public policy evaluation, and computational social science. The dataset is released to facilitate reproducible research on public procurement and government contracting.


💡 Research Summary

This paper presents the creation, analysis, and public release of a comprehensive longitudinal dataset detailing central government public procurement in Spain from 2014 to 2024. Sourced from the Official State Gazette (BOE), the primary but semi-structured legal publication for contract announcements, the dataset addresses a significant gap in machine-readable, structured procurement data for Spain.

The core technical contribution is a fully reproducible ETL (Extract, Transform, Load) pipeline built in Python. This pipeline systematically scrapes HTML announcements from Section V-A of the BOE, extracts key information fields, and transforms the heterogeneous data into a structured format aligned with the Open Contracting Data Standard (OCDS). The extracted dataset comprises approximately 97,000 raw observations across 15 variables, including contracting institution, contract category, description, geographic scope, CPV codes, estimated value, awarded value, and awardee name. Cleaning involves type conversions, normalization of categorical variables, handling of missing values inherent to the administrative lifecycle (e.g., awarded value missing in tender notices), and filtering of negligible records. The resulting analytical subset consists of 60,456 awarded contract notices.
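The cleaning step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the field names (`institution`, `category`, `awarded_value`), the Spanish-style amount format ("1.234,56 euros"), and the zero threshold used for filtering are all assumptions made for illustration.

```python
from typing import Optional


def parse_euro_amount(text: Optional[str]) -> Optional[float]:
    """Convert an amount such as '1.234.567,89 euros' to a float.

    Assumes the Spanish convention: '.' groups thousands, ',' marks decimals.
    Returns None when the value is missing or cannot be parsed.
    """
    if text is None:
        return None
    cleaned = text.lower().replace("euros", "").replace("€", "").strip()
    if not cleaned:
        return None
    cleaned = cleaned.replace(".", "").replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return None


def awarded_subset(records, min_value=0.0):
    """Keep only records with a parseable awarded value above min_value."""
    kept = []
    for rec in records:
        value = parse_euro_amount(rec.get("awarded_value"))
        if value is None or value <= min_value:
            continue  # tender notices (no award yet) and negligible records
        kept.append({**rec, "awarded_value": value})
    return kept


# Hypothetical raw records, invented for this example.
records = [
    {"institution": "Ministerio A", "category": "Servicios",
     "awarded_value": "110.436,00 euros"},
    {"institution": "Ministerio B", "category": "Obras",
     "awarded_value": None},  # tender notice: awarded value not yet known
    {"institution": "Ministerio C", "category": "Suministros",
     "awarded_value": "0,00 euros"},
]

clean = awarded_subset(records)
print(len(clean), clean[0]["awarded_value"])
```

Missing awarded values are dropped here rather than imputed, which matches the idea that they reflect the stage of the notice in the administrative lifecycle and not a data error.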

Exploratory analysis reveals fundamental characteristics of Spanish public procurement over the decade. The distribution of awarded contract values exhibits extreme positive skewness, with a mean of approximately €1.24 million but a median of only €110,436, indicating that a small number of very large contracts dominate the total expenditure. Geographically, procurement activity is highly concentrated, with the Community of Madrid accounting for the vast majority of contracts, reflecting the centralization of state administration. Sectorally, “Services” and “Supplies” categories consistently constitute over 85% of all awarded contracts. A clear year-on-year increase in contract count is observed, likely linked to digitalization efforts and extraordinary spending cycles such as COVID-19 emergency procurement (2020) and EU Next Generation funds (2021-2023).
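To get a feel for how skewed a distribution must be for the mean to exceed the median by this much, the reported figures can be plugged into a log-normal model. This is purely an illustrative assumption: the summary does not say that awarded values are log-normal, and the sample below is synthetic, not drawn from the dataset. For a log-normal distribution, the median is exp(mu) and the mean is exp(mu + sigma²/2), so sigma follows from the mean-to-median ratio.

```python
import math
import random
import statistics

# Figures reported in the summary (mean is approximate).
MEAN_REPORTED = 1_240_000.0
MEDIAN_REPORTED = 110_436.0

# Mean-to-median ratio: how far the large contracts pull the average up.
ratio = MEAN_REPORTED / MEDIAN_REPORTED

# Log-normal parameters implied by those two numbers (modelling assumption).
mu = math.log(MEDIAN_REPORTED)
sigma = math.sqrt(2 * math.log(ratio))

# Synthetic sample of the same size as the analytical subset.
rng = random.Random(0)
sample = [rng.lognormvariate(mu, sigma) for _ in range(60_456)]

sample_mean = statistics.fmean(sample)
sample_median = statistics.median(sample)

# Share of total synthetic value held by the largest 1% of contracts.
top_n = len(sample) // 100
top_share = sum(sorted(sample, reverse=True)[:top_n]) / sum(sample)

print(f"ratio={ratio:.1f}, sigma={sigma:.2f}")
print(f"synthetic mean={sample_mean:,.0f}, median={sample_median:,.0f}")
print(f"top 1% share of synthetic total={top_share:.1%}")
```

The concentration share printed at the end describes only the synthetic sample. It is a way to visualize what a mean roughly eleven times the median implies, not an estimate for the real data.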

The authors apply three analytical techniques to derive deeper insights. First, a multiple linear regression model predicts awarded contract value. Its explanatory power is low (R² = 0.014), which is expected given the absence of crucial predictors such as contract duration or number of bidders, but it identifies geographic scope and the “Public Service Management” contract category as the most influential structural factors. Second, K-Means clustering is applied to the population of 16,502 unique awardees, using the log-transformed total number of contracts and total awarded value as features. The analysis reveals a three-tier market structure: a small cluster of “High-Value Operators” who win few but extremely high-value contracts; a large intermediate cluster of “Standard Operators”; and a broad base of “Micro Operators” with occasional, low-value awards. Third, a non-parametric hypothesis test (Wilcoxon-Mann-Whitney) confirms a statistically significant difference in the awarded value distributions between “Works” and “Services” contracts, with “Works” contracts having a substantially higher median value, consistent with the capital-intensive nature of public infrastructure projects.
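The third technique, the Wilcoxon-Mann-Whitney test, can be sketched in a few lines of standard-library Python. This is a minimal version under stated assumptions, not the authors' code: it uses the normal approximation for the p-value, assigns midranks to ties but applies no tie correction to the variance, and the contract values in the example are invented.

```python
import math
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation).

    Returns (U statistic for x, z score, p-value). Ties receive midranks;
    the variance is not tie-corrected, so heavy ties make p approximate.
    """
    n1, n2 = len(x), len(y)
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y],
                      key=lambda t: t[0])

    # Assign 1-based ranks, averaging over runs of equal values.
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = midrank
        i = j + 1

    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2

    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_x - mean_u) / sd_u
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_x, z, p


# Invented awarded values in euros, for illustration only.
works = [950_000, 1_400_000, 2_300_000, 780_000, 5_100_000, 620_000]
services = [45_000, 110_000, 98_000, 210_000, 75_000, 330_000]

u, z, p = mann_whitney_u(works, services)
print(f"U={u}, z={z:.2f}, p={p:.4f}")
```

A rank-based test suits this setting because awarded values are heavily right-skewed, so comparing ranks is more robust than comparing means. With samples as small as the toy lists above, an exact p-value would be preferable to the normal approximation.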

The paper acknowledges limitations, including reporting biases from the BOE source, potential classification errors from complex HTML, and the inherent constraints of modeling without key commercial variables. Despite this, the released dataset (under a CC0 license at a persistent DOI) and the open-source pipeline constitute a significant open resource. They enable reproducible research on Spanish public procurement, facilitate transparency and accountability studies, and provide a foundation for evidence-based policy analysis in public spending and market competition. The work stands as a practical application of administrative data science, transforming legal publications into a resource for computational social science and open government research.

