Robust Joint Modeling for Data with Continuous and Binary Responses
In many supervised learning applications, the response consists of both continuous and binary outcomes. Studies have shown that jointly modeling such mixed-type responses can substantially improve predictive performance compared to separate analyses. However, outliers pose a serious challenge to existing likelihood-based modeling approaches. In this paper, we propose a new robust joint modeling framework for data with both continuous and binary responses, based on the density power divergence (DPD) loss function with $\ell_1$ regularization. The proposed framework leads to a sparse estimator that simultaneously predicts continuous and binary responses in high-dimensional input settings while down-weighting influential outliers and mislabeled samples. We also develop an efficient proximal gradient algorithm with Barzilai-Borwein spectral step size and a robust information criterion (RIC) for data-driven selection of the penalty parameters. Extensive simulation studies under a variety of contamination schemes demonstrate that the proposed method achieves lower prediction error and more accurate parameter estimation than several competing approaches. A real case study on wafer lapping in semiconductor manufacturing further illustrates the practical gains in predictive accuracy, robustness, and interpretability of the proposed framework.
💡 Research Summary
The paper addresses a common yet challenging problem in modern supervised learning: simultaneously predicting a continuous outcome and a binary outcome from a high‑dimensional set of covariates, while being robust to outliers and mislabeled observations. Existing joint‑modeling approaches (e.g., conditional regression, mixed‑effects, Bayesian hierarchical models) improve predictive performance over separate analyses but rely on maximum‑likelihood estimation, which is highly sensitive to contamination. Conversely, robust regression methods based on M‑estimators, S‑estimators, or recent divergence‑based techniques (such as the Density Power Divergence, DPD) handle outliers well for a single response type but have not been extended to mixed‑type joint modeling.
To fill this gap, the authors propose a unified framework that combines the DPD loss with ℓ₁ regularization. The binary response z is modeled by a logistic regression p(x)=exp(xᵀη)/(1+exp(xᵀη)), while the continuous response y conditioned on z and x follows a normal linear model with possibly different regression coefficients for the two binary states: y|z=1,x∼N(xᵀβ,σ²) and y|z=0,x∼N(xᵀω,σ²). The joint density f(y,z|x) is the product of the normal density and the Bernoulli probability.
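The model specification above can be written down directly; this is a minimal sketch of the joint density, with parameter names (`beta`, `omega`, `eta`, `sigma2`) taken from the summary and no claims about the paper's exact implementation:

```python
import numpy as np

def joint_density(y, z, x, beta, omega, eta, sigma2):
    """Joint density f(y, z | x) for the mixed-response model:
    z ~ Bernoulli(p(x)) with logistic p(x) = 1 / (1 + exp(-x'eta)),
    y | z, x ~ N(x'beta, sigma2) if z == 1, else N(x'omega, sigma2)."""
    p = 1.0 / (1.0 + np.exp(-x @ eta))          # logistic probability P(z=1|x)
    mean = x @ beta if z == 1 else x @ omega     # state-dependent linear mean
    normal = np.exp(-(y - mean) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    bern = p if z == 1 else 1.0 - p
    return normal * bern                         # product of normal and Bernoulli parts
```

With η = 0 the logistic probability is 1/2, so the joint density at the conditional mean reduces to half the normal normalizing constant.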
Instead of the usual log‑likelihood, the authors minimize the DPD between the empirical distribution g (and thus the observed data) and the model density f. For a tuning parameter α>0, the DPD loss down‑weights each observation by f(y,z|x)^α, automatically reducing the influence of extreme residuals or mis‑classified binary labels. The loss expands into three components: (1) a constant term involving the normalizing constant of the normal density, (2) a sum over observations with z=1 that contains the exponential of the squared residual scaled by α and multiplied by the predicted logistic probability p(x)^α, and (3) an analogous term for z=0 with factor (1‑p(x))^α.
To achieve sparsity in high‑dimensional settings, ℓ₁ penalties λ₁‖β‖₁ + λ₂‖ω‖₁ + λ₃‖η‖₁ are added, yielding the final objective
h(β,ω,η;σ²)=Q_α(β,ω,η;σ²)+λ₁‖β‖₁+λ₂‖ω‖₁+λ₃‖η‖₁.
The authors assume that the true coefficient vectors are sparse, which justifies the use of Lasso‑type regularization.
Theoretical properties are established by verifying that the DPD estimator satisfies the five regularity conditions of Basu et al. (1998) (identifiability, smoothness, bounded moments, etc.). Under these conditions, the estimator θ̂=(β̂,ω̂,η̂) is consistent and asymptotically normal: √n(θ̂‑θ₀) → MVN(0,J⁻¹KJ⁻¹), where J and K are the expected Hessian and variance of the DPD score, respectively. Closed‑form expressions for J and K are provided in the supplementary material.
Computationally, the problem is a non‑convex, non‑smooth optimization due to the exponential terms in the DPD loss. The authors adopt a proximal‑gradient algorithm with Barzilai‑Borwein (BB) step‑size selection, which adaptively chooses the learning rate based on successive gradient differences, leading to faster convergence than fixed step sizes. The algorithm updates β, ω, and η in a block‑coordinate fashion, applying the soft‑thresholding operator after each gradient step to enforce sparsity.
The variance parameter σ² is treated as a nuisance parameter. Direct joint estimation proved unstable under heavy contamination, so the authors use a robust pilot estimator based on the Pseudo Standard Error (PSE) method: an initial Lasso fit to the continuous response provides residuals, which are trimmed using a multiplier of the median absolute deviation, yielding a robust scale estimate that is then fixed during the main optimization. After convergence, σ² is recomputed from the final residuals.
Model selection (choice of α and the three λ values) is guided by a newly proposed Robust Information Criterion (RIC). RIC augments the DPD loss with a log‑penalty on the effective degrees of freedom (the number of non‑zero coefficients) and is designed to be insensitive to outliers, unlike traditional AIC/BIC. The authors demonstrate that RIC reliably identifies the optimal tuning parameters in simulation studies, avoiding the computational burden of cross‑validation.
Extensive simulations cover three contamination schemes: (i) mixed normal–Bernoulli outliers, (ii) additive heavy‑tailed noise on the continuous outcome, and (iii) random flipping of binary labels. The proposed DPD‑Lasso method consistently achieves lower mean‑squared error for the continuous response and higher area‑under‑the‑ROC curve for the binary response compared with Lasso, SparseLTS, Bayesian hierarchical models (BHQQ), and recent robust GLM approaches. Moreover, the method recovers the true sparse support with higher precision and recall, even when the proportion of contaminated observations reaches 20%.
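The three schemes can be mimicked with a small generator; the contamination magnitudes below (a 10-unit shift on y, t₂ noise) are illustrative assumptions rather than the paper's simulation settings:

```python
import numpy as np

def contaminate(y, z, scheme, frac=0.1, rng=None):
    """Apply one of three illustrative contamination schemes to a
    fraction `frac` of the observations."""
    rng = np.random.default_rng() if rng is None else rng
    y, z = y.copy(), z.copy()
    idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
    if scheme == "mixed":         # (i) perturb y and flip z jointly
        y[idx] += 10.0
        z[idx] = 1 - z[idx]
    elif scheme == "heavy_tail":  # (ii) additive heavy-tailed noise on y
        y[idx] += rng.standard_t(df=2, size=len(idx))
    elif scheme == "label_flip":  # (iii) random flipping of binary labels
        z[idx] = 1 - z[idx]
    return y, z
```

Scheme (iii) leaves the continuous response untouched, which is why methods that are robust only in the regression part can still fail there.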
A real‑world case study involves wafer‑lapping data from semiconductor manufacturing, where 150 process variables are used to predict total thickness variation (TTV, continuous) and a binary site‑total‑indicator‑reading (STIR). The dataset contains sensor failures and occasional mis‑recorded binary labels. Conventional methods produce biased fits and large deviations from the 45‑degree line in a TTV‑vs‑prediction plot. In contrast, the robust joint model yields predictions that align closely with the ideal line, and the selected variables correspond to known physical factors (e.g., polishing pressure, slurry composition), enhancing interpretability for process engineers.
In summary, the paper makes five key contributions: (1) introduces DPD‑based loss to jointly model continuous and binary outcomes, (2) integrates ℓ₁ regularization for high‑dimensional sparse estimation, (3) develops an efficient proximal‑gradient algorithm with BB step‑size, (4) proposes a robust information criterion for data‑driven tuning, and (5) provides rigorous consistency and asymptotic normality results. The methodology bridges a gap between robust statistics and joint mixed‑type modeling, offering a practical tool for industries where data contamination is common. Future work may extend the framework to multinomial or count outcomes, incorporate non‑linear predictors via kernels or neural networks, and adapt the algorithm for streaming data environments.