Estimating heterozygosity from a low-coverage genome sequence, leveraging data from other individuals sequenced at the same sites

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

High-throughput shotgun sequence data makes it possible in principle to accurately estimate population genetic parameters without confounding by SNP ascertainment bias. One such statistic of interest is the proportion of heterozygous sites within an individual’s genome, which is informative about inbreeding and effective population size. However, in many cases, the available sequence data of an individual is limited to low coverage, preventing the confident calling of genotypes necessary to directly count the proportion of heterozygous sites. Here, we present a method for estimating an individual’s genome-wide rate of heterozygosity from low-coverage sequence data, without an intermediate step calling genotypes. Our method jointly learns the shared allele distribution between the individual and a panel of other individuals, together with the sequencing error distributions and the reference bias. We show our method works well, first by its performance on simulated sequence data, and secondly on real sequence data where we obtain estimates using low coverage data consistent with those from higher coverage. We apply our method to obtain estimates of the rate of heterozygosity for 11 humans from diverse world-wide populations, and through this analysis reveal the complex dependency of local sequencing coverage on the true underlying heterozygosity, which complicates the estimation of heterozygosity from sequence data. We show filters can correct for the confounding by sequencing depth. We find in practice that ratios of heterozygosity are more interpretable than absolute estimates, and show that we obtain excellent conformity of ratios of heterozygosity with previous estimates from higher coverage data.

💡 Research Summary

The paper addresses a fundamental challenge in population genetics: estimating an individual’s genome‑wide heterozygosity from low‑coverage shotgun sequencing data. Traditional approaches rely on high‑coverage data to call genotypes confidently, but many studies—especially those involving ancient DNA, non‑model organisms, or large population screens—are limited to shallow sequencing depths where genotype calling is unreliable. The authors propose a statistical framework that bypasses explicit genotype calling by jointly modeling the allele frequency distribution at each site, the sequencing error profile, and reference‑bias, using both the low‑coverage target individual and a panel of other individuals sequenced at the same genomic positions.

Model Overview
At each genomic locus the observed read counts for the four nucleotides are treated as a multinomial draw from an underlying true allele composition. The true composition is parameterised by a shared allele frequency vector (the proportion of the two true alleles at that site) and a set of error probabilities that describe the chance of observing each incorrect base. A reference‑bias term captures the systematic over‑representation of the reference allele that is common in short‑read mapping pipelines. The key innovation is that these parameters are not estimated separately for the target individual; instead, they are learned jointly across the entire panel using an Expectation–Maximisation (EM) algorithm. In the E‑step, the current parameter estimates are used to compute the posterior probability of each possible genotype at every site for every sample. In the M‑step, the allele‑frequency distribution, error rates, and bias term are updated by maximising the expected complete‑data log‑likelihood. This iterative process converges to a set of parameters that best explain the observed read counts across all samples.

Simulation Experiments
The authors validate the method on synthetic data spanning a range of coverages (1×–10×) and error rates (0.1%–1%). They compare three metrics: (i) absolute heterozygosity error, (ii) bias in heterozygosity estimates, and (iii) variance across replicates. Their approach consistently outperforms a naïve genotype‑calling pipeline (e.g., GATK HaplotypeCaller at low depth) with mean absolute errors below 0.001 even at 2× coverage. The inclusion of the reference‑bias term dramatically reduces systematic over‑estimation of heterozygosity that plagues standard methods when the reference allele is preferentially aligned.

Application to Real Human Data
To demonstrate real‑world utility, the method is applied to 11 individuals from the 1000 Genomes Project representing diverse continental groups. Each individual’s high‑coverage (≈30×) BAM file is down‑sampled to 5×, mimicking a low‑coverage scenario. Heterozygosity estimates derived from the 5× data using the proposed model correlate with the high‑coverage “gold‑standard” estimates at r = 0.98, and the absolute differences are <0.0005. Importantly, the authors observe a non‑linear relationship between local sequencing depth and the observed heterozygosity: regions with low depth tend to under‑report heterozygous sites even after error correction. To mitigate this, they introduce a depth filter (minimum 3×) and a regression‑based correction that normalises heterozygosity estimates across depth bins. After correction, the heterozygosity ratios between populations (e.g., African vs. European) match previously published ratios from high‑coverage studies, confirming that relative comparisons are robust to residual depth effects.

Key Insights and Practical Recommendations

Joint Learning is Essential – By leveraging information from a panel of individuals, the model compensates for the paucity of reads in the target sample, effectively borrowing strength across genomes.
Explicit Error and Bias Modeling – Incorporating per‑base error rates and a reference‑bias parameter eliminates two major sources of systematic error that dominate low‑coverage analyses.
Depth‑Dependent Confounding – The authors highlight that heterozygosity is not a simple function of true genetic variation; local coverage modulates the probability of observing both alleles. Filters and depth‑aware corrections are therefore mandatory.
Ratios Over Absolute Values – While absolute heterozygosity estimates are sensitive to residual biases, ratios between populations remain stable and are more biologically interpretable in comparative studies.

Conclusions and Future Directions
The study demonstrates that accurate genome‑wide heterozygosity can be recovered from shallow sequencing without explicit genotype calls, provided that a suitable reference panel and a well‑specified statistical model are used. This opens the door to cost‑effective population‑scale surveys, ancient DNA projects, and studies of non‑model organisms where deep sequencing is prohibitive. Future work could extend the framework to incorporate structural variants, polyploid genomes, or integrate long‑read data to further improve allele frequency inference. Additionally, adapting the model to handle heterogeneous panels (different sequencing technologies or capture kits) would broaden its applicability. Overall, the paper offers a rigorous, empirically validated solution to a longstanding bottleneck in population genomics.

Estimating heterozygosity from a low-coverage genome sequence, leveraging data from other individuals sequenced at the same sites

💡 Research Summary

Comments & Academic Discussion

Leave a Comment