Regional Development Classification Model using Decision Tree Approach

Regional development classification is one way to examine differences in development outcomes across regions. Frequently used methods include shift-share analysis, the Gain index, the Williamson index, and the Klassen typology. Advances in data mining offer a new way to classify regional development data. This study discusses how a decision tree can be used to classify the level of development based on regional gross domestic product (GDP) indicators. GDP data from Central Java and Banten are used. Before the data are passed to the decision-tree induction algorithm, the GDP data of both provinces are classified using the Klassen typology. Three decision-tree algorithms, namely J48, NBTree, and REPTree, are tested using cross-validation, and the best-performing algorithm is selected. The results show that J48 achieves an accuracy of 85.18%, higher than both NBTree and REPTree. The model was tested on six districts/municipalities in Banten Province, and the results show that two of them, Kota Tangerang and Kabupaten Tangerang, remain in the relatively underdeveloped quadrant. In Central Java Province, Kendal, Magelang, Pemalang, Rembang, Semarang, and Wonosobo also fall in the relatively underdeveloped quadrant. The resulting model classifies the level of development quickly and easily: data are entered directly into the generated decision tree. This study can serve as an alternative decision-support tool for policy makers in determining the future direction of development.

💡 Research Summary

The paper presents a data‑mining approach to classify the development level of sub‑national regions using decision‑tree algorithms. The authors focus on two Indonesian provinces—Central Java (Jawa Tengah) and Banten—and use gross domestic product (GDP) data at the district (kabupaten) and city (kota) level as the sole explanatory variable. Prior to any machine‑learning step, each region is manually labeled according to the well‑known Klassen typology, which divides areas into four quadrants: developed, potentially developed, under‑developed, and potentially under‑developed. These labels serve as ground‑truth for supervised learning.
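The labeling step described above can be sketched in a few lines. This is a minimal illustration, assuming the common formulation of the Klassen typology in which a region's growth rate and per-capita income are each compared with those of a reference area (for example, the province). The `Region` class, the `klassen_quadrant` function, the tie-breaking rule, and the sample figures are hypothetical and are not taken from the paper; the quadrant names follow the usual textbook wording and may differ from the labels used by the authors.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    growth: float      # GDP growth rate, in percent
    per_capita: float  # GDP per capita, in any consistent unit


def klassen_quadrant(region, ref_growth, ref_per_capita):
    """Return the Klassen quadrant of a region relative to a reference area.

    Assumption: values exactly equal to the reference count as "not above".
    """
    fast = region.growth > ref_growth
    rich = region.per_capita > ref_per_capita
    if fast and rich:
        return "rapidly growing and advanced"
    if rich:
        return "advanced but depressed"
    if fast:
        return "rapidly developing"
    return "relatively underdeveloped"


# Made-up figures for illustration only; not data from the paper.
regions = [
    Region("region_a", growth=6.1, per_capita=42.0),
    Region("region_b", growth=4.2, per_capita=45.5),
    Region("region_c", growth=5.9, per_capita=18.3),
    Region("region_d", growth=3.7, per_capita=15.0),
]

# Hypothetical reference values (e.g. provincial averages).
labels = {r.name: klassen_quadrant(r, ref_growth=5.0, ref_per_capita=30.0)
          for r in regions}

for name, quadrant in labels.items():
    print(f"{name}: {quadrant}")
```

Labels produced this way would then serve as the class attribute for the supervised learners.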

Three decision-tree classifiers are evaluated: J48 (the WEKA implementation of C4.5), NBTree (a hybrid that places Naïve Bayes classifiers at the leaves of a decision tree), and REPTree (a fast tree learner that uses reduced-error pruning). The authors employ 10-fold cross-validation on the labeled dataset and compare the models using accuracy, precision, recall, and F1-score. J48 achieves the highest overall accuracy of 85.18 % (precision = 0.86, recall = 0.84, F1 = 0.85), outperforming NBTree (78.4 %) and REPTree (73.9 %). The superior performance of J48 is attributed to its use of gain ratio for attribute selection and to post-pruning, which reduces over-fitting and yields a more interpretable tree structure.
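To make the gain-ratio criterion concrete, the following sketch computes it for a binary split on a numeric attribute. It is a simplified illustration rather than the J48 implementation: C4.5 applies further refinements that are omitted here, and the function names and toy data are assumptions introduced for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold`.

    gain ratio = information gain / split information
    """
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information

    total = len(labels)
    w_left, w_right = len(left) / total, len(right) / total

    # Information gain: parent entropy minus weighted child entropy.
    info_gain = entropy(labels) - (w_left * entropy(left)
                                   + w_right * entropy(right))
    # Split information: entropy of the partition sizes themselves.
    split_info = -(w_left * math.log2(w_left) + w_right * math.log2(w_right))
    return info_gain / split_info


# Toy data (hypothetical): one numeric attribute and two classes.
values = [1.0, 2.0, 3.0, 4.0]
labels = ["low", "low", "high", "high"]

for t in (1.0, 2.0, 3.0):
    print(f"threshold {t}: gain ratio = {gain_ratio(values, labels, t):.4f}")
```

On this toy data the threshold 2.0 separates the two classes perfectly and therefore receives the highest gain ratio. Dividing by the split information penalizes splits that produce very uneven partitions, which is the motivation for using gain ratio instead of raw information gain.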

Having identified J48 as the best model, the authors apply it to a test set consisting of six districts/municipalities in Banten and six in Central Java. The classifier flags Kota Tangerang and Kabupaten Tangerang (Banten) as belonging to the under‑developed quadrant. In Central Java, Kendal, Magelang, Pemalang, Rembang, Semarang, and Wonosobo are similarly classified as under‑developed. These results largely corroborate the original Klassen assignments but also reveal a few discrepancies (e.g., Semarang, which Klassen classifies as potentially developed, is labeled under‑developed by the tree). Such differences highlight the model’s potential to uncover hidden patterns that may be missed by traditional ratio‑based methods.

The study makes several contributions. First, it demonstrates that a single macro‑economic indicator—regional GDP—can be sufficient for a reliable, automated classification when combined with a robust decision‑tree algorithm. Second, it provides a systematic comparison of three tree‑based learners, offering guidance for practitioners in the public‑policy domain about which algorithm to adopt. Third, the resulting model is fast, requires minimal data preprocessing, and produces a visual tree that is easily interpretable by non‑technical policymakers, thereby shortening the decision‑making cycle. Finally, the authors discuss practical implications: the identified under‑developed districts can be prioritized for targeted interventions such as infrastructure investment, skill‑development programs, or fiscal incentives.

Nevertheless, the paper acknowledges important limitations. Relying solely on GDP ignores other socioeconomic dimensions (population density, education, employment, health) that are often critical for nuanced development assessment. The analysis also treats each observation as independent, omitting temporal dynamics and potential spill‑over effects between neighboring regions. Moreover, the dataset is relatively small (12 observations), which may limit the generalizability of the findings. The authors suggest future work should expand the feature set, incorporate time‑series data, and explore ensemble methods such as Random Forests or Gradient Boosting to improve predictive performance. Integrating cost‑benefit analyses and scenario simulations would further enhance the model’s utility as a decision‑support tool.

In conclusion, the research supports the feasibility of decision-tree-based classification for regional development assessment. Because the Klassen labels serve as the ground truth, the reported accuracy measures how well the tree reproduces that typology rather than whether it surpasses it. By reproducing Klassen labels with 85.18 % cross-validated accuracy and delivering an interpretable, rapid classification mechanism, the proposed model offers a useful alternative for policymakers seeking data-driven guidance on where to allocate development resources.
