Lessons Learned and Results from Applying Data-Driven Cost Estimation to Industrial Data Sets

The increasing availability of cost-relevant data in industry allows companies to apply data-intensive estimation methods. However, available data are often inconsistent, invalid, or incomplete, so most existing data-intensive estimation methods cannot be applied. Only a few estimation methods can deal with imperfect data to a certain extent (e.g., Optimized Set Reduction, OSR(c)). Results from evaluating these methods in practical environments are rare. This article describes a case study on the application of OSR(c) at Toshiba Information Systems (Japan) Corporation. An important result of the case study is that estimation accuracy varies significantly with the data sets used and with how those data are preprocessed. The study supports current results in the area of quantitative cost estimation and clearly illustrates typical problems. Experiences, lessons learned, and recommendations concerning data preprocessing and data-intensive cost estimation in general are presented.


💡 Research Summary

The paper investigates the practical application of a data‑driven cost‑estimation technique—Optimized Set Reduction (OSR(c))—to real‑world industrial project data collected at Toshiba Information Systems (Japan). The authors begin by highlighting a fundamental challenge in modern cost estimation: while organizations now possess large repositories of cost‑relevant information, the data are frequently plagued by inconsistencies, missing values, outliers, and non‑standard coding. Traditional statistical or machine learning estimators often assume clean, complete datasets; consequently, they either fail to run or produce unreliable forecasts when confronted with imperfect industrial data.

To address this gap, the authors conducted a case study using a historical dataset of 1,200 completed software development projects. Each project record originally contained roughly 30 attributes covering size, duration, staffing composition, technology stack, requirement change frequency, quality metrics, and other contextual factors. An initial exploratory analysis revealed that about 12 % of the fields were missing, 5 % contained statistical outliers, and 8 % used non‑standard categorical codes. Recognizing that these flaws would severely bias any estimation model, the research team designed a comprehensive data‑preprocessing pipeline consisting of four main steps:

  1. Missing‑value imputation – Multiple Imputation by Chained Equations (MICE) was applied to continuous variables, while the most frequent category was used for categorical gaps.
  2. Code standardization – A master reference list was created for project phases, technology identifiers, and role designations; all non‑standard entries were mapped to this list to ensure uniformity.
  3. Outlier handling – An inter‑quartile‑range (IQR) filter identified extreme observations, which were then Winsorized (capped at the 1st and 99th percentiles) to reduce their influence without discarding data.
  4. Feature selection – Correlation analysis combined with model‑based importance scores (e.g., from random‑forest ensembles) narrowed the set to eight high‑impact predictors. The most influential variables turned out to be “requirement change count,” “average seniority of core staff,” and a composite “project complexity score.”
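
The pipeline itself is not published as code, but steps 1–3 can be sketched in plain Python. This is a minimal illustration under stated assumptions: mode imputation stands in for the categorical branch of step 1 (the MICE branch for continuous variables is omitted), `master` is a hypothetical code-mapping dictionary, and winsorization caps at the 1st/99th percentiles as described.

```python
from collections import Counter

def impute_mode(values):
    """Step 1 (categorical branch): fill None gaps with the most frequent category."""
    mode = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def standardize(values, master):
    """Step 2: map non-standard codes onto a master reference list;
    codes with no mapping fall back to 'OTHER'."""
    return [master.get(v, "OTHER") for v in values]

def winsorize(values, lo_pct=1, hi_pct=99):
    """Step 3: cap extreme observations at the given percentiles
    instead of discarding them."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[max(0, int(n * lo_pct / 100))]
    hi = ordered[min(n - 1, int(n * hi_pct / 100))]
    return [min(max(v, lo), hi) for v in values]
```

In a real pipeline the IQR filter would first flag which observations count as outliers; here the percentile cap alone conveys the idea of reducing their influence without deleting records.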

With the cleaned dataset, the authors applied OSR(c), a technique that differs from conventional regression by iteratively partitioning the data into optimized subsets (or “sets”) and fitting local models within each subset. This approach is designed to be robust against noise and missing information because each local model only needs to explain a relatively homogeneous group of projects. Two experimental scenarios were evaluated:

  • Scenario A (raw data) – OSR(c) was run on the unprocessed dataset. The resulting mean absolute error (MAE) was 28 % and the root mean squared error (RMSE) 35 % (both relative to actual project cost), indicating that the estimator was not usable for operational planning.
  • Scenario B (pre‑processed data) – After the full cleaning pipeline, OSR(c) achieved an MAE of 12 % and an RMSE of 15 %, outperforming a baseline linear regression model (MAE ≈ 18 %). This dramatic improvement underscores the pivotal role of data preparation.
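
OSR(c)'s actual subset optimization is more involved than can be shown here, but the core idea — predict from a small, homogeneous group of similar past projects rather than one global model — and the relative MAE/RMSE metrics used in both scenarios can be sketched as follows. The `size` and `cost` field names are illustrative, not from the paper.

```python
import math
from collections import defaultdict

def fit_local_means(rows, key):
    """Group projects by a partitioning attribute and fit one trivial local
    model (the subset's mean cost) per homogeneous subset."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r["cost"])
    return {k: sum(v) / len(v) for k, v in groups.items()}

def relative_errors(models, rows, key):
    """MAE and RMSE as percentages of actual cost, matching how the
    study reports accuracy."""
    errs = [abs(models[r[key]] - r["cost"]) / r["cost"] for r in rows]
    mae = 100 * sum(errs) / len(errs)
    rmse = 100 * math.sqrt(sum(e * e for e in errs) / len(errs))
    return mae, rmse
```

The robustness claim follows from the structure: a noisy or missing attribute only distorts the subsets it participates in, not a single set of global regression coefficients.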

The study also explored the sensitivity of OSR(c) to its internal hyper‑parameters, such as minimum subset size and maximum iteration count. A grid‑search across plausible ranges identified a configuration (minimum subset size = 30, max iterations = 200) that minimized error metrics. The authors note that without systematic tuning, performance can degrade substantially, reinforcing the need for automated hyper‑parameter optimization (e.g., Bayesian optimization) in production settings.
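
A grid search of this kind is straightforward to sketch. The `train`/`validate` callables and the toy stand-ins below are hypothetical; only the two hyper-parameters and their best-found values (30 and 200) come from the study.

```python
from itertools import product

def grid_search(train, validate, min_sizes=(10, 20, 30, 50), max_iters=(50, 100, 200)):
    """Evaluate every hyper-parameter combination and keep the one with the
    lowest validation error."""
    best = (float("inf"), None)
    for size, iters in product(min_sizes, max_iters):
        model = train(min_subset_size=size, max_iterations=iters)
        best = min(best, (validate(model), (size, iters)))
    return best

# Toy stand-ins: a "model" is just its settings, and the validation error is
# the distance from the configuration the study found best (30, 200).
def toy_train(min_subset_size, max_iterations):
    return (min_subset_size, max_iterations)

def toy_validate(model):
    return abs(model[0] - 30) + abs(model[1] - 200)

best_err, best_config = grid_search(toy_train, toy_validate)
```

In production, the exhaustive `product` loop would be the piece replaced by Bayesian optimization, which samples configurations adaptively instead of enumerating them.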

A further complication addressed in the paper is class imbalance: large‑scale projects (high cost, long duration) comprised only about 10 % of the dataset, yet they contributed disproportionately to total cost variance. When OSR(c) was applied without correction, the error for these “big” projects exceeded 20 %. To mitigate this, the authors experimented with cost‑based weighting (assigning higher loss penalties to large projects) and synthetic oversampling (SMOTE). Both techniques reduced the large‑project MAE from 15 % to 9 % and also improved overall model stability.
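
The two corrections can be sketched independently. This is not the authors' implementation: `weighted_mae` shows cost-based weighting as a weighted loss, and `oversample` shows the SMOTE idea (synthesizing minority examples by interpolating between real ones) in its simplest form, without SMOTE's nearest-neighbor selection.

```python
import random

def weighted_mae(preds, actuals, weights):
    """Cost-based weighting: large projects carry a higher loss penalty,
    so errors on them dominate the objective."""
    total = sum(w * abs(p - a) for p, a, w in zip(preds, actuals, weights))
    return total / sum(weights)

def oversample(minority, factor, rng=random.Random(0)):
    """SMOTE-style augmentation: create synthetic minority examples on the
    line segment between randomly chosen pairs of real ones."""
    synth = []
    for _ in range((factor - 1) * len(minority)):
        a, b = rng.sample(minority, 2)
        t = rng.random()
        synth.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return minority + synth
```

Because the synthetic points lie between real large-project records rather than duplicating them, the local models see a denser but still plausible region of the feature space.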

Interpretability, a critical factor for stakeholder acceptance, was addressed by extracting the decision rules generated for each optimized subset. For example, one rule stated: “If requirement changes ≤ 5 and core staff seniority ≥ 4.5, then cost deviation is within ±10 %.” Such concise, domain‑specific statements were visualized in dashboards and presented to project managers, who reported increased confidence in the forecasts and used the rules to guide risk mitigation strategies.
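
An extracted rule of this form is just a conjunction of threshold tests, which is what makes it easy to surface in a dashboard. A minimal sketch of the quoted rule, with hypothetical field names (`req_changes`, `seniority`):

```python
def rule_applies(project, max_req_changes=5, min_seniority=4.5):
    """The quoted subset rule: requirement changes <= 5 and core-staff
    seniority >= 4.5 implies cost deviation within +/-10 %."""
    return (project["req_changes"] <= max_req_changes
            and project["seniority"] >= min_seniority)

def explain(project):
    """Render the rule outcome as the kind of plain statement shown to
    project managers."""
    band = "within +/-10 %" if rule_applies(project) else "no rule matched"
    return f"predicted cost deviation: {band}"
```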

From these empirical findings, the authors distill several actionable lessons:

  1. Rigorous preprocessing is non‑negotiable – Systematic handling of missing data, outliers, and non‑standard codes can halve estimation error.
  2. Feature engineering must blend statistics with expert knowledge – Purely algorithmic variable selection may overlook nuanced cost drivers; incorporating domain insights yields more robust predictors.
  3. Hyper‑parameter tuning is essential for OSR(c) – The method’s performance is highly contingent on subset size and iteration limits; automated search methods should be part of any deployment pipeline.
  4. Address data imbalance explicitly – Weighting schemes or synthetic sampling are required to prevent bias against high‑impact, low‑frequency project types.
  5. Provide transparent, actionable rules – Stakeholders need understandable explanations; rule extraction from OSR(c) facilitates trust and practical decision‑making.

The paper concludes by positioning OSR(c) as a viable solution for cost estimation in environments where data quality cannot be guaranteed. However, the authors caution that the technique’s success hinges on a disciplined data‑preparation workflow, careful parameter optimization, and mechanisms to handle class imbalance. Future research directions include developing end‑to‑end automated preprocessing pipelines, extending OSR(c) to streaming data contexts, and exploring hybrid models that combine OSR(c) with deep learning embeddings for richer feature representations.