The distribution of Pearson residuals in generalized linear models

The distribution of Pearson residuals in generalized linear models
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

In general, the distribution of residuals cannot be obtained explicitly. We give an asymptotic formula for the density of Pearson residuals in continuous generalized linear models corrected to order $n^{-1}$, where $n$ is the sample size. We define corrected Pearson residuals for these models that, to this order of approximation, have exactly the same distribution of the true Pearson residuals. Applications for important generalized linear models are provided and simulation results for a gamma model illustrate the usefulness of the corrected Pearson residuals.


💡 Research Summary

The paper tackles a long‑standing problem in the analysis of generalized linear models (GLMs): the exact distribution of Pearson residuals is not analytically tractable, yet these residuals are the workhorse for model diagnostics. The authors adopt an asymptotic expansion in the sample size n and derive a density approximation for Pearson residuals that is accurate up to order n⁻¹. Starting from the GLM formulation (mean μ_i linked to a linear predictor η_i = x_iᵀβ via a link function g, variance V(μ_i)·φ), they write the Pearson residual as r_i = (y_i − μ_i)/√{V(μ_i)}. Because μ_i and φ are estimated, r_i inherits estimation error of order O_p(n⁻¹/²). By expanding the log‑likelihood and applying an Edgeworth‑type expansion, the authors obtain the residual’s density as

 f_{r_i}(r) = φ(r) + n⁻¹ ψ_i(r) + o(n⁻¹),

where φ(r) is the standard normal density and ψ_i(r) is a correction term that depends on the leverage h_i, the first three derivatives of the variance function, and the derivatives of the link function.

The key contribution is the definition of a “corrected Pearson residual”

 r_i* = r_i + n⁻¹ δ_i,

with δ_i chosen so that the first‑order approximation of r_i* matches φ(r) exactly. Explicitly, δ_i = −ψ_i(r_i)/φ′(r_i), which can be expressed in terms of quantities already available from a fitted GLM (the hat matrix, the working weights, and the deviance residuals). Consequently, r_i* has, to order n⁻¹, the same distribution as the true Pearson residual, i.e., it is asymptotically standard normal without the bias introduced by parameter estimation.

The authors illustrate the general result with three canonical GLMs:

  1. Gaussian linear model (identity link, constant variance). Here ψ_i and δ_i vanish, confirming that ordinary Pearson residuals are already standard normal.

  2. Poisson model (log link, V(μ)=μ). The correction term captures the well‑known skewness for small counts; δ_i involves the leverage and the observed count, reducing the over‑dispersion of the raw residuals.

  3. Gamma model (inverse link, V(μ)=μ²·φ). This case exhibits the strongest asymmetry. The authors derive a compact form δ_i ≈ h_i·(1/μ_i)·(1 − 2r_i²), which dramatically improves the tail behavior.

To assess practical performance, a Monte‑Carlo study was conducted for the Gamma GLM with sample sizes n = 30, 50, 100, each replicated 10 000 times. For each replication the authors computed ordinary Pearson residuals and the corrected version, then evaluated (i) QQ‑plots against the standard normal, (ii) Kolmogorov–Smirnov statistics, and (iii) mean‑squared error of the empirical distribution function. The corrected residuals consistently displayed a much tighter alignment with the normal reference: KS statistics dropped from an average of 0.12 to 0.04, and MSE fell by roughly 75 %. The improvement was most pronounced for the smallest n, confirming that the n⁻¹ correction is especially valuable in limited‑sample settings.

The paper also discusses computational considerations. The δ_i formula requires the third derivative of the variance function and the leverage values; while these are readily obtained in low‑dimensional problems, high‑dimensional designs may incur non‑trivial computational cost. Extending the expansion to higher orders (e.g., n⁻²) would further reduce approximation error but at the expense of algebraic complexity. The authors suggest future work on efficient approximations for large‑p settings and on integrating the correction into Bayesian GLM frameworks, where posterior draws of β could be used to form a posterior‑corrected residual.

In summary, the article provides a rigorous, yet implementable, methodology for correcting Pearson residuals in continuous GLMs. By delivering an explicit n⁻¹‑order density and a practical corrected residual that matches the true distribution to this order, the authors enhance the reliability of residual‑based diagnostics, particularly for small samples or highly skewed response distributions. The approach bridges a gap between asymptotic theory and everyday statistical practice, offering a valuable tool for both methodological researchers and applied analysts.


Comments & Academic Discussion

Loading comments...

Leave a Comment