Big enterprise registration data imputation: Supporting spatiotemporal analysis of industries in China


📝 Original Info

  • Title: Big enterprise registration data imputation: Supporting spatiotemporal analysis of industries in China
  • ArXiv ID: 1804.03562
  • Date: 2023-06-15
  • Authors: Li, Zhang, Chen, Yu, Watkins, Zhu, Parr, Roongpiboonsopit, Karimi, Luengo, García, Herrera, Yang, Huang, Liu, Hu

📝 Abstract

Big, fine-grained enterprise registration data that includes time and location information enables us to quantitatively analyze, visualize, and understand the patterns of industries at multiple scales across time and space. However, data quality issues such as incompleteness and ambiguity hinder such analysis and application. These issues become more challenging when the volume of data is immense and constantly growing. High Performance Computing (HPC) frameworks can tackle big data computational issues, but few studies have systematically investigated imputation methods for enterprise registration data in this type of computing environment. In this paper, we propose a big data imputation workflow, based on Apache Spark and a bare-metal computing cluster, to impute enterprise registration data. We integrated external data sources, employed Natural Language Processing (NLP), and compared several machine-learning methods to address the incompleteness and ambiguity problems found in enterprise registration data. Experimental results illustrate the feasibility, efficiency, and scalability of the proposed HPC-based imputation framework, which also provides a reference for processing other big georeferenced text data. Using these imputation results, we visualize and briefly discuss the spatiotemporal distribution of industries in China, demonstrating the potential applications of such data once quality issues are resolved.


📄 Full Content

Big data with fine-grained street-level locations and coordinates, together with operating period and industrial category information, can deepen and extend the analysis of industrial spatial distributions, thereby promoting a deeper understanding of urban processes. The spatial distribution of economic activities lies at the heart of theories of urban spatial structure and is essential for rational urban and regional economic planning and policymaking (Li, Zhang, Chen, & Yu, 2015; Parr, 2014). However, because complete, fine-grained, micro-level enterprise or firm data is lacking, few studies have fully analyzed the spatial distribution of industries in China at multiple scales, from a temporally sensitive perspective, and across all kinds of enterprises (Watkins, 2014; Zhu & Chen, 2007). The local bureaus of the Administration for Industry and Commerce (AIC) of China are responsible for enterprise registration, supervision and administration, and the protection of consumers' rights and interests (AIC, 2016). These regional bureaus record detailed operating information for each enterprise. Big enterprise registration data, collected from multiple regional AIC bureaus, can enable and support spatiotemporal analysis of industries if the data quality issues are resolved.

Incompleteness and address ambiguity are prominent quality problems in Chinese enterprise registration data. A typical registration record describes an individual enterprise, including its name, address, registration date, industrial category, business scope, postcode, legal representative, and registered capital. These records are usually entered into the system by hand at local AIC offices. In this process, critical information is often overlooked or omitted and is therefore frequently missing from the database. For example, in our study, 43.64% of the records have no industrial category value. This information, however, is imperative for a spatial distribution analysis of industrial categories and industries. Approximately 30% of the records have only a street-level address and do not include the province or city to which the address belongs. We define this address ambiguity as the missing Administrative Division (AD) information problem; it seriously impedes effective geocoding (Roongpiboonsopit & Karimi, 2010). To obtain complete and accurate industrial category values, as well as the multi-scale text address and coordinates of each enterprise, imputation is required to fill in missing values and information (Luengo, García, & Herrera, 2012).
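The two quality problems described above can be made concrete with a small sketch. This is only an illustration under stated assumptions, not the procedure used in the study: the field names (`address`, `category`), the sample records, and the tiny gazetteer of province and city names are all hypothetical, and a substring check is a deliberately crude stand-in for real AD detection.

```python
# Hypothetical gazetteer (assumption): a real one would list every province and city.
PROVINCES = ("湖北省", "广东省")
CITIES = ("武汉市", "深圳市")


def has_admin_division(address: str) -> bool:
    """Return True when the address names a known province or city."""
    return any(name in address for name in PROVINCES + CITIES)


def missing_rate(records, field) -> float:
    """Fraction of records whose `field` is absent, None, or empty."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if not r.get(field))
    return missing / len(records)


def missing_ad_rate(records) -> float:
    """Fraction of records whose address carries no province or city."""
    if not records:
        return 0.0
    ambiguous = sum(1 for r in records if not has_admin_division(r["address"]))
    return ambiguous / len(records)


# Hypothetical sample records; the field names are not the AIC schema.
records = [
    {"address": "湖北省武汉市珞喻路129号", "category": "Education"},
    {"address": "珞喻路129号", "category": None},
    {"address": "深圳市南山区科技园", "category": ""},
    {"address": "广东省深圳市深南大道1号", "category": "Retail"},
]

category_missing = missing_rate(records, "category")
ad_missing = missing_ad_rate(records)
print(f"missing category: {category_missing:.2%}, missing AD: {ad_missing:.2%}")
```

Profiling the data this way before imputation shows which fields need to be filled and how many records are affected by each problem.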

Imputation, however, introduces serious computing challenges when the data volume is large. Enterprise registration data consists of short texts, and text-based data imputation involves Natural Language Processing (NLP) techniques such as short text classification and matching. This process is computationally intensive and, on a stand-alone computer, may run out of memory or take intolerably long when the data volume is large. High Performance Computing (HPC) frameworks are often used to handle the computational issues of big data (Yang, Huang, Li, Liu, & Hu, 2016). Previous research has explored big text data processing based on HPC frameworks. However, few studies have systematically investigated HPC-based imputation for big georeferenced text data that involves short text classification, location imputation, and geocoding. Moreover, the discussion and application of such technologies in regional and social science remain insufficient in the literature.

To fill this research gap and solve the big data quality problems endemic to enterprise registration data, we propose an imputation framework and develop parallel imputation methods based on cutting-edge HPC technologies to make this data more usable. An effective solution to these kinds of data quality problems is relevant in many other domains where the use of big data is impeded by incompleteness and ambiguity, especially for big georeferenced text data classification and location geocoding. Using NLP on Apache Spark, we compare several widely used text classification methods for filling missing industrial category values in terms of accuracy, execution time, memory consumption, and scalability. We also introduce a location imputation method to fill in the missing location information and obtain the coordinates of each enterprise. Using these imputation results, we briefly analyze the spatiotemporal distribution of all industries in China at multiple spatial scales to illustrate potential applications of this data for the analysis of urban spatial structures, urban agglomerations, industrial aggregations, and socioeconomic activities.
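To illustrate the idea of filling a missing industrial category by classifying a short text, here is a minimal single-machine sketch. It is not the Spark-based pipeline the paper evaluates, and this section does not say which classifiers were compared. The sketch assumes a multinomial naive Bayes model over character bigrams, and the `scope` and `category` field names and the sample texts are hypothetical.

```python
import math
from collections import Counter, defaultdict


def char_bigrams(text):
    """Split a short text into overlapping character bigrams."""
    text = text.replace(" ", "").lower()
    return [text[i:i + 2] for i in range(len(text) - 1)] or [text]


class NaiveBayes:
    """Multinomial naive Bayes with Laplace smoothing (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for gram in char_bigrams(text):
                self.feature_counts[label][gram] += 1
                self.vocab.add(gram)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for gram in char_bigrams(text):
                score += math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def impute_categories(records, model):
    """Fill empty 'category' fields from 'scope'; known values are kept."""
    return [
        r if r.get("category") else {**r, "category": model.predict(r["scope"])}
        for r in records
    ]


# Hypothetical labelled records used for training.
train = [
    ("retail of clothing and shoes", "Retail"),
    ("wholesale and retail of food", "Retail"),
    ("software development and IT consulting", "IT"),
    ("computer software and network services", "IT"),
]
model = NaiveBayes().fit([t for t, _ in train], [c for _, c in train])

incomplete = [
    {"scope": "clothing retail store", "category": None},
    {"scope": "network software services", "category": ""},
    {"scope": "food wholesale", "category": "Retail"},
]
completed = impute_categories(incomplete, model)
print([r["category"] for r in completed])
```

Records that already have a category serve as training data, and only empty values are predicted. At the data volumes described above, the same fit-then-predict pattern would need to run on a distributed framework rather than on one machine.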

This article is organized as follows. Section 2 reviews relevant research. Section 3 introduces the data and HPC-based imputation framework. Section 4 describes industrial category and location imputation. Section 5 details the data im


