Lung cancer (LC) is a leading cause of cancer-related mortality in the United States. Accurate prediction of LC mortality rates is crucial for guiding targeted interventions and addressing health disparities. Although traditional regression-based models have been commonly used, explainable machine learning models may offer enhanced predictive accuracy and deeper insights into the factors influencing LC mortality. This study applied three models, random forest (RF), gradient boosting regression (GBR), and linear regression (LR), to predict county-level LC mortality rates across the United States. Model performance was evaluated using R-squared (R²) and root mean squared error (RMSE). Shapley Additive Explanations (SHAP) values were used to determine the importance of each variable and the direction of its impact. Geographic disparities in LC mortality were analyzed through Getis-Ord (Gi*) hotspot analysis. The RF model outperformed both GBR and LR, achieving an R² value of 41.9% and an RMSE of 12.8. SHAP analysis identified smoking rate as the most important predictor, followed by median home value and the percentage of the Hispanic ethnic population. Spatial analysis revealed significant clusters of elevated LC mortality in the mid-eastern counties of the United States. The RF model demonstrated superior predictive performance for LC mortality rates, emphasizing the critical roles of smoking prevalence, housing values, and the percentage of the Hispanic ethnic population. These findings offer actionable insights for designing targeted interventions, promoting screening, and addressing health disparities in the regions most affected by LC in the United States.
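For reference, the evaluation metrics and the hotspot statistic named above are commonly written as follows. These are the standard textbook forms, given here as an assumption: this excerpt does not state the exact formulation the study used, and all symbols ($y_i$, $\hat{y}_i$, $w_{ij}$, and so on) are notation introduced for illustration.

```latex
% y_i: observed LC mortality rate in county i; \hat{y}_i: model prediction;
% \bar{y}: mean observed rate; n: number of counties in the evaluation set.
\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2},
\qquad
R^2 = 1-\frac{\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2}
\]

% Getis-Ord Gi* for county i, with x_j the rate in county j and
% w_{ij} the spatial weight between counties i and j (w_{ii} included).
\[
G_i^{*} = \frac{\sum_{j=1}^{n} w_{ij}x_j-\bar{X}\sum_{j=1}^{n} w_{ij}}
{S\sqrt{\dfrac{n\sum_{j=1}^{n} w_{ij}^{2}-\left(\sum_{j=1}^{n} w_{ij}\right)^{2}}{n-1}}},
\qquad
\bar{X}=\frac{1}{n}\sum_{j=1}^{n}x_j,
\quad
S=\sqrt{\frac{1}{n}\sum_{j=1}^{n}x_j^{2}-\bar{X}^{2}}
\]
```

Under these definitions, an R² of 41.9% corresponds to $R^2 = 0.419$. $G_i^{*}$ is read as a z-score: large positive values mark counties surrounded by high rates (hot spots), and large negative values mark cold spots.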
Cancer, with its multifaceted impact on public health and society, remains one of the most significant challenges in contemporary medicine. Among its various forms, lung cancer (LC) stands out as a major contributor to cancer-related morbidity and mortality worldwide. 1 In the United States, LC has been a leading cause of cancer-related mortality in both men and women over several decades. 2 With a mortality rate exceeding that of breast, prostate, and pancreatic cancers combined, and more than 2.5 times higher than colorectal cancer (the second leading cause of cancer mortality in the country), LC demands urgent attention. 3 Although advances in technologies and screening have led to a decline in new LC diagnoses since the early 2010s, the disease remains a critical threat to public health, with a 5-year relative survival rate of just 26.6%, as reported by the American Lung Association. 4 Given these statistics, there is a clear and pressing need for effective predictive models to guide interventions and optimize resource allocation.
Numerous studies have used statistical and geospatial techniques to model the association between risk factors and county-level LC mortality. [5][6][7][8] At the population level, a variety of socioeconomic, environmental, and demographic determinants influence the incidence and mortality rates of LC. [9][10][11] For instance, Li et al 5 applied simultaneous autoregressive models, which account for spatial interdependencies, to assess the associations between the Environmental Quality Index (EQI) and US county-level LC mortality. The EQI incorporates five domain-specific indices (air, water, land, built environment, and sociodemographic factors) into a single composite index. Their findings highlighted a significant correlation between lower EQI and increased LC mortality, with a stronger effect size observed in females compared with males. Similarly, an investigation within the United States 6 found an increased risk of county-level LC mortality associated with higher levels of environmental carcinogen releases, demonstrated through linear regression (LR). Notably, although these associations were observed across gender and racial groups (including Whites and African Americans), the strength of these relationships was more evident among African American cohorts. These studies underscore the importance of considering gender and environmental factors when developing strategies to reduce LC mortality. Moreover, a notable geospatial variation in LC mortality rates is evident, both interstate and intrastate, extending down to the county level. 7,12 However, it is important to note that these studies primarily rely on conventional linear models, which may not adequately capture the complex interactions and nonlinear dynamics between LC mortality and its risk factors. 13,14
Although traditional regression-based models have historically played a fundamental role in cancer outcome prognostication, 15,16 the emergence of explainable machine learning (ML) techniques has introduced new opportunities to improve predictive accuracy and enhance our understanding of complex disease dynamics. 17,18 These advanced techniques have revealed higher-order nonlinear interactions among variables, leading to more robust and reliable predictions. 13 For example, Lee et al 19 conducted a comprehensive analysis using both regression and ML techniques to examine the impacts of background radiation levels, PM2.5 exposure, and various sociobehavioral factors on county-level LC incidence rates in the United States. Their findings showed the superiority of the random forest (RF) algorithm over traditional regression methods. Although cancer research has made significant progress in assessing the impacts of social determinants of health on disease risk, 20,21 ML techniques can further advance our understanding. Given the increasing availability of data and advances in ML techniques, there is a significant opportunity to more accurately predict and identify the factors influencing LC mortality. These insights can inform targeted treatment strategies and interventions tailored to specific geographic areas.
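To make the comparison between tree-based ensembles and linear regression concrete, the following is a minimal sketch using scikit-learn on synthetic data. It is not the study's pipeline: the data are randomly generated, the three predictor names and the outcome formula are hypothetical, and the hyperparameters are library defaults chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a county-level table (NOT the study's data).
rng = np.random.default_rng(0)
n_counties = 1000
smoking_rate = rng.uniform(8, 35, n_counties)         # % adult smokers (hypothetical)
median_home_value = rng.uniform(60, 600, n_counties)  # thousands of USD (hypothetical)
pct_hispanic = rng.uniform(0, 60, n_counties)         # % of population (hypothetical)
X = np.column_stack([smoking_rate, median_home_value, pct_hispanic])

# Made-up outcome with a nonlinear term and an interaction, plus noise.
y = (
    30.0
    + 1.2 * smoking_rate
    - 6.0 * np.log(median_home_value)
    - 0.01 * smoking_rate * pct_hispanic
    + rng.normal(0.0, 8.0, n_counties)
)

# Hold out 20% of counties so the metrics reflect unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "RF": RandomForestRegressor(n_estimators=200, random_state=0),
    "GBR": GradientBoostingRegressor(random_state=0),
    "LR": LinearRegression(),
}

# Fit each model and record R-squared and RMSE on the held-out counties.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "r2": float(r2_score(y_test, pred)),
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
    }

for name, scores in results.items():
    print(f"{name}: R2={scores['r2']:.3f}, RMSE={scores['rmse']:.2f}")
```

The ranking of the models on this toy data says nothing about real county-level data. The sketch only shows the mechanics of fitting the three model families on a common split and scoring them with the same two metrics.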
Ensemble learning (EL), a powerful ML technique, enhances predictive accuracy by leveraging the collective intelligence of multiple models. 13,22 Although some studies suggest that EL models may not outperform logistic regression models in predicting mortality in emergency departments, 18,23 EL proves to be a compelling alternative to traditional LR for forecasting county-level LC mortality rates, for several reasons: (1) The complex interplay of socioeconomic, environmental, and health care factors influencing cancer outcomes requires a nuanced understanding that EL’s capacity to capture complex relationships can provide. (2) County-level health data sets often suffer from sparse, heterogeneous, or inconsistent data, posing challenges for conventional models. EL’s resilience to such data irregularities ena