Generalized least squares can overcome the critical threshold in respondent-driven sampling
In order to sample marginalized and/or hard-to-reach populations, respondent-driven sampling (RDS) and similar techniques reach their participants via peer referral. Under a Markov model for RDS, previous research has shown that if the typical participant refers too many contacts, then the variance of common estimators does not decay like $O(n^{-1})$, where $n$ is the sample size. This implies that confidence intervals will be far wider than under a typical sampling design. Here we show that generalized least squares (GLS) can effectively reduce the variance of RDS estimates. In particular, a theoretical analysis indicates that the variance of the GLS estimator is $O(n^{-1})$. We then derive two classes of feasible GLS estimators. The first class is based upon a Degree Corrected Stochastic Blockmodel for the underlying social network. The second class is based upon a rank-two model. It might be of independent interest that in both model classes, the theoretical results show that it is possible to estimate the spectral properties of the population network from the sampled observations. Simulations on empirical social networks show that the feasible GLS (fGLS) estimators can have drastically smaller error and rarely increase the error. A diagnostic plot helps to identify where fGLS will aid estimation. The fGLS estimators continue to outperform standard estimators even when they are built from a misspecified model and when there is preferential recruitment.
💡 Research Summary
Respondent‑Driven Sampling (RDS) is a widely used network‑based technique for reaching hidden or hard‑to‑reach populations, such as people at risk for HIV. Because participants recruit peers, the resulting sample exhibits strong dependence: adjacent observations are correlated through the underlying social network. Prior work has shown that when the average participant refers many contacts, the variance of standard RDS estimators (e.g., the sample mean or the Volz‑Heckathorn estimator) decays slower than the usual O(1/n) rate, where n is the number of sampled individuals. Consequently, confidence intervals become excessively wide compared to those from simple random sampling.
This paper proposes Generalized Least Squares (GLS) to mitigate the inflated variance. The authors model RDS as a (T,P)‑walk: a Markov chain with transition matrix P on the population graph G, indexed by a sampling tree T (typically a complete binary tree). For each node σ in T, Xσ is the person sampled at that node and the observed value is Yσ = y(Xσ), where y(·) is the attribute of interest (e.g., HIV status). The covariance matrix Σ of the vector Y = (Yσ)σ∈T depends only on the tree distance d(σ,τ), through an autocovariance function γ: Σσ,τ = Cov(Yσ,Yτ) = γ(d(σ,τ)). The GLS estimator is the linear combination of the observations whose weights g* minimize the variance subject to the weights summing to one, ∑σ gσ = 1, which yields the closed‑form solution
μ̂_GLS = (1ᵀ Σ⁻¹ 1)⁻¹ 1ᵀ Σ⁻¹ Y.
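As a minimal sketch of the closed form above, the following Python builds Σ for a small complete binary tree and evaluates the GLS estimate. The heap-style node numbering (root = 1, parent of node i is i // 2), the helper names, the toy data, and the choice γ(d) = 0.8^d are all assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def tree_distance(a: int, b: int) -> int:
    """Number of edges between nodes a and b in a complete binary tree
    with heap-style numbering (root = 1, parent of i is i // 2)."""
    d = 0
    while a != b:
        # The larger index is never an ancestor of the smaller one, so move it up.
        if a > b:
            a //= 2
        else:
            b //= 2
        d += 1
    return d

def covariance_from_autocov(n: int, gamma) -> np.ndarray:
    """Sigma[s, t] = gamma(d(s, t)) for nodes 1..n of the tree."""
    idx = range(1, n + 1)
    return np.array([[gamma(tree_distance(a, b)) for b in idx] for a in idx])

def gls_mean(y: np.ndarray, sigma: np.ndarray) -> float:
    """(1' Sigma^-1 1)^-1 1' Sigma^-1 y, computed with a linear solve."""
    ones = np.ones(len(y))
    w = np.linalg.solve(sigma, ones)  # Sigma^-1 1 (unnormalized weights)
    return float(w @ y / (w @ ones))

# Toy example: 7 nodes (3 levels), illustrative autocovariance 0.8 ** d.
n = 7
sigma = covariance_from_autocov(n, lambda d: 0.8 ** d)
y = np.array([1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0])  # made-up binary outcomes
print("sample mean:", y.mean())
print("GLS estimate:", gls_mean(y, sigma))
```

Using `np.linalg.solve` rather than forming Σ⁻¹ explicitly is a numerical design choice; the two are algebraically equivalent.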
The central theoretical contribution is Theorem 3.1, which proves that, for any reversible, irreducible transition matrix P and a complete binary sampling tree, the variance of μ̂_GLS decays as O(1/n) as n → ∞. This matches the rate attained under independent sampling, thereby “overcoming the critical threshold” identified in earlier work. When the autocovariance simplifies to the rank‑two form γ(d)=β²λ^d (i.e., the feature y lies in the span of the constant eigenfunction of P and one further eigenfunction with eigenvalue λ), Theorem 3.2 gives an explicit asymptotic constant: n·Var(μ̂_GLS) → (1+λ)/(1−λ)·β².
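The contrast in rates can be explored numerically. The sketch below assumes the rank‑two autocovariance γ(d) = β²λ^d on a complete binary tree with heap-style numbering, and compares the exact variance of the sample mean, 1ᵀΣ1/n², with the exact variance of the GLS estimator, 1/(1ᵀΣ⁻¹1). The value λ = 0.9, the tree sizes, and the function names are illustrative assumptions, not settings from the paper.

```python
import numpy as np

def tree_distance(a: int, b: int) -> int:
    """Edges between nodes a and b in a heap-numbered complete binary tree."""
    d = 0
    while a != b:
        if a > b:
            a //= 2
        else:
            b //= 2
        d += 1
    return d

def variances(n: int, lam: float, beta2: float = 1.0):
    """Return (Var of sample mean, Var of GLS estimator) when
    Cov(Y_s, Y_t) = beta2 * lam ** d(s, t) on the first n tree nodes."""
    idx = range(1, n + 1)
    sigma = beta2 * np.array(
        [[lam ** tree_distance(a, b) for b in idx] for a in idx]
    )
    ones = np.ones(n)
    var_mean = ones @ sigma @ ones / n**2                  # equal weights 1/n
    var_gls = 1.0 / (ones @ np.linalg.solve(sigma, ones))  # optimal weights
    return var_mean, var_gls

# Scale both variances by n: a bounded column is consistent with O(1/n) decay.
for depth in range(2, 8):
    n = 2**depth - 1
    vm, vg = variances(n, lam=0.9)
    print(f"n={n:4d}  n*Var(mean)={n * vm:8.3f}  n*Var(GLS)={n * vg:8.3f}")
```

Because GLS minimizes variance over all weights that sum to one, its variance can never exceed that of the equally weighted sample mean; the printed table only shows how large the gap is for this particular choice of λ.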
Because Σ is unknown in practice, the authors develop feasible GLS (fGLS) estimators that replace Σ with an estimate Σ̂. Two families of estimators are introduced:
- Degree‑Corrected Stochastic Blockmodel (DC‑SBM) based fGLS – The population graph is assumed to follow a DC‑SBM with K blocks, block memberships z(i) observed for sampled nodes, and degree‑heterogeneity parameters θi. From the sampled referrals the authors construct an empirical block‑transition matrix Q̂, whose expectation equals B/m (B is the block connectivity matrix, m its average row sum). By normalizing B to B_L = D_B^{-½} B D_B^{-½} and extracting its eigenvalues/eigenvectors, they obtain consistent estimates of the non‑zero eigenvalues of P and the associated eigenfunctions. Plugging these spectral estimates into the formula for γ(d) yields Σ̂, and consequently μ̂_fGLS.
- Rank‑two model based fGLS – When the autocovariance is assumed to follow γ(d)=β²λ^d, only the first‑lag autocorrelation (or equivalently the first and second differences) need to be estimated. The authors propose two estimators: μ̂_auto uses a plug‑in estimate of λ derived from the sample autocorrelation at lag 1; μ̂_Δ uses estimates of E
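The plug‑in idea behind the rank‑two approach can be sketched as follows. This is an illustration under stated assumptions, not the authors' exact μ̂_auto: λ is estimated by the sample correlation between each node's value and its recruiter's value, Σ̂ is built as λ̂^d (the scale β² cancels in the GLS formula, so it does not need to be estimated), and the data come from a hypothetical two‑state Markov chain on a heap‑numbered complete binary tree. All function names and parameter values are made up for the example.

```python
import numpy as np

def tree_distance(a: int, b: int) -> int:
    """Edges between nodes a and b in a heap-numbered complete binary tree."""
    d = 0
    while a != b:
        if a > b:
            a //= 2
        else:
            b //= 2
        d += 1
    return d

def estimate_lambda(y: np.ndarray) -> float:
    """Lag-1 sample autocorrelation between each node and its parent."""
    n = len(y)
    centered = y - y.mean()
    var = np.mean(centered**2)
    if n < 2 or var == 0.0:
        return 0.0  # no variation observed: fall back to independence
    parents = np.arange(2, n + 1) // 2 - 1  # 0-based index of each parent
    cov = np.mean(centered[1:] * centered[parents])
    return float(np.clip(cov / var, -0.99, 0.99))  # keep Sigma-hat invertible

def fgls_rank_two(y: np.ndarray) -> float:
    """Feasible GLS with Sigma-hat[s, t] = lambda-hat ** d(s, t)."""
    n = len(y)
    lam = estimate_lambda(y)
    idx = range(1, n + 1)
    sigma_hat = np.array(
        [[lam ** tree_distance(a, b) for b in idx] for a in idx]
    )
    ones = np.ones(n)
    w = np.linalg.solve(sigma_hat, ones)
    return float(w @ y / (w @ ones))

def simulate_tree_chain(n: int, p_stay: float, rng) -> np.ndarray:
    """Hypothetical data: each child copies its parent's 0/1 state with
    probability p_stay, otherwise takes the other state."""
    x = np.empty(n)
    x[0] = rng.integers(0, 2)
    for i in range(2, n + 1):
        parent = x[i // 2 - 1]
        x[i - 1] = parent if rng.random() < p_stay else 1.0 - parent
    return x

rng = np.random.default_rng(0)
y = simulate_tree_chain(255, p_stay=0.9, rng=rng)
print("lambda-hat:", estimate_lambda(y))
print("sample mean:", y.mean())
print("fGLS estimate:", fgls_rank_two(y))
```

Note that GLS weights are not constrained to be non‑negative, so for a binary outcome the fGLS estimate is not guaranteed to lie in [0, 1]; any truncation applied in practice would be an additional modeling choice.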