Robust Estimation of Polychoric Correlation
Polychoric correlation is often an important building block in the analysis of rating data, particularly for structural equation models. However, the commonly employed maximum likelihood (ML) estimator is highly susceptible to misspecification of the polychoric correlation model, for instance through violations of latent normality assumptions. We propose a novel estimator that is designed to be robust against partial misspecification of the polychoric model, that is, when the model is misspecified for an unknown fraction of observations, such as careless respondents. To this end, the estimator minimizes a robust loss function based on the divergence between observed frequencies and theoretical frequencies implied by the polychoric model. In contrast to existing literature, our estimator makes no assumption on the type or degree of model misspecification. It furthermore generalizes ML estimation, is consistent as well as asymptotically normally distributed, and comes at no additional computational cost. We demonstrate the robustness and practical usefulness of our estimator in simulation studies and an empirical application on a Big Five administration. In the latter, the polychoric correlation estimates of our estimator and ML differ substantially, which, after further inspection, is likely due to the presence of careless respondents that the estimator helps identify.
Research Summary
The paper addresses a critical weakness of polychoric correlation estimation: its sensitivity to violations of the latent normality assumption and to contaminated observations such as careless respondents. While the standard maximum‑likelihood (ML) estimator treats every cell of the observed contingency table equally, even cells that are poorly explained by the model can exert a strong influence, leading to severely biased correlation estimates when a modest proportion of the data are misspecified.
To overcome this problem the authors introduce a “partial misspecification” framework. Instead of assuming that the entire sample comes from a non‑normal latent distribution (the usual distributional misspecification setting), they allow that only an unknown fraction of observations may be uninformative for the parameter of interest. Typical examples are respondents who answer randomly, misinterpret items, or otherwise provide data that do not reflect the underlying latent variables.
Within this framework they propose a robust estimator that generalizes ML by minimizing a divergence‑based loss function between the observed cell frequencies \(n_{ij}\) and the model‑implied expected frequencies \(m_{ij}(\theta)\). The loss \(\rho\) is chosen from the family of M‑estimation functions (Huber‑type, Tukey’s biweight, etc.), which automatically down‑weights cells whose observed frequencies are far from the model’s expectations. Formally, the estimator solves
\[
\hat{\theta} \in \arg\min_{\theta} \sum_{i,j} \rho\bigl(n_{ij},\, m_{ij}(\theta)\bigr).
\]
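The minimization described above can be sketched numerically. The following is a minimal illustration and not the paper's exact estimator: it assumes the thresholds are known and fixed, applies a Huber loss to Pearson residuals \((n_{ij} - m_{ij})/\sqrt{m_{ij}}\) as one concrete choice of divergence, and estimates only the correlation of the latent bivariate normal.

```python
# Illustrative robust polychoric estimation: minimize a Huber loss on
# Pearson residuals between observed and model-implied cell frequencies.
# Assumptions (not from the paper): thresholds are known and fixed, the
# loss is Huber applied to Pearson residuals, and only the correlation
# rho is estimated.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import multivariate_normal

def cell_probs(rho, a, b):
    """Model-implied cell probabilities from a bivariate normal with
    correlation rho; a, b are threshold grids whose large finite end
    points stand in for +/- infinity."""
    mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
    F = np.array([[mvn.cdf([x, y]) for y in b] for x in a])
    # Rectangle probabilities via inclusion-exclusion on the joint CDF.
    P = F[1:, 1:] - F[:-1, 1:] - F[1:, :-1] + F[:-1, :-1]
    return np.clip(P, 1e-12, None)  # guard against tiny negative values

def huber(x, k=1.345):
    """Huber rho-function: quadratic near zero, linear in the tails."""
    ax = np.abs(x)
    return np.where(ax <= k, 0.5 * x**2, k * ax - 0.5 * k**2)

def robust_polychoric(counts, a, b, k=1.345):
    """Estimate rho by minimizing the summed Huber loss of the
    Pearson residuals (n_ij - m_ij) / sqrt(m_ij)."""
    n = counts.sum()
    def loss(rho):
        m = n * cell_probs(rho, a, b)  # expected cell frequencies
        return huber((counts - m) / np.sqrt(m), k).sum()
    return minimize_scalar(loss, bounds=(-0.99, 0.99), method="bounded").x

# Demo: simulate ordinal data from a latent bivariate normal, rho = 0.5.
rng = np.random.default_rng(0)
z = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=2000)
cuts = [-0.5, 0.5]                      # 3 response categories per item
counts = np.zeros((3, 3))
np.add.at(counts, (np.digitize(z[:, 0], cuts), np.digitize(z[:, 1], cuts)), 1)
thr = np.array([-8.0, -0.5, 0.5, 8.0])  # thresholds with pseudo-infinite ends
rho_hat = robust_polychoric(counts, thr, thr)
print(rho_hat)  # should be near the true value 0.5
```

Because the Huber loss grows only linearly in the tails, cells whose observed counts deviate sharply from the model's expectations (e.g. cells inflated by careless responding) contribute less to the objective than they would under a quadratic (ML-like) loss.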