Improving sequence-based genotype calls with linkage disequilibrium and pedigree information
Whole and targeted sequencing of human genomes is a promising, increasingly feasible tool for discovering genetic contributions to risk of complex diseases. A key step is calling an individual’s genotype from the multiple aligned short read sequences of his DNA, each of which is subject to nucleotide read error. Current methods are designed to call genotypes separately at each locus from the sequence data of unrelated individuals. Here we propose likelihood-based methods that improve calling accuracy by exploiting two features of sequence data. The first is the linkage disequilibrium (LD) between nearby SNPs. The second is the Mendelian pedigree information available when related individuals are sequenced. In both cases the likelihood involves the probabilities of read variant counts given genotypes, summed over the unobserved genotypes. Parameters governing the prior genotype distribution and the read error rates can be estimated either from the sequence data itself or from external reference data. We use simulations and synthetic read data based on the 1000 Genomes Project to evaluate the performance of the proposed methods. An R-program to apply the methods to small families is freely available at http://med.stanford.edu/epidemiology/PHGC/.
💡 Research Summary
The paper addresses the critical problem of genotype calling from short‑read sequencing data, where each read is prone to nucleotide errors. Conventional pipelines treat each variant site independently and ignore the rich information that can be gleaned from population genetics and family relationships. The authors propose a likelihood‑based framework that simultaneously exploits two sources of correlation: (1) linkage disequilibrium (LD) between neighboring single‑nucleotide polymorphisms (SNPs) and (2) Mendelian pedigree constraints when related individuals are sequenced.
In the statistical model, the observed read counts for a locus are modeled as a conditional distribution (P(\text{reads}\mid G,\epsilon)), where (G) denotes the unobserved genotype and (\epsilon) the per‑base error rate. The prior genotype distribution (P(G)) is not a simple Hardy‑Weinberg equilibrium; instead it is constructed from an LD‑aware Markov chain that captures the transition probabilities between genotypes at adjacent SNPs. These transition probabilities can be estimated from external reference panels (e.g., 1000 Genomes) or learned directly from the sequencing data using an Expectation‑Maximization (EM) algorithm.
When pedigree information is available, the model expands to a joint likelihood over all family members. The Mendelian transmission rules are encoded as hard constraints linking parental and offspring genotypes, which dramatically reduces the space of compatible genotype configurations. The EM algorithm is extended to jointly estimate genotype posteriors, LD parameters, and error rates across the family, thereby allowing the data to inform all components of the model.
The authors evaluate the method through extensive simulations that vary sequencing depth (5×–30×), base‑call error rates (0.1%–1%), LD strength, and family structure (parent‑child pairs, sibling trios, small extended families). Results show that LD‑aware priors improve overall genotype accuracy by 5–10% in regions of high LD, while the addition of pedigree constraints reduces the total error rate by more than 30% across all depth settings. The combined approach achieves near‑perfect accuracy (>99.5%) at high depth and still outperforms standard independent callers by >12% at the lowest depth examined.
To validate the approach on realistic data, synthetic read sets were generated from the 1000 Genomes Project, preserving true allele frequencies and LD patterns. Compared with the widely used GATK UnifiedGenotyper, the proposed method yields a 4–6% increase in overall concordance and a >10% gain for rare variants (MAF < 1%).
Implementation is provided as an open‑source R package tailored for small families (2–4 individuals). The computational burden scales with the number of EM iterations and the size of the LD state space; however, the current implementation completes typical analyses on a standard desktop within an hour.
Limitations include the computational cost of constructing a global LD model for very large cohorts, the current focus on biallelic SNPs (excluding indels and structural variants), and reduced benefit when no pedigree data are available. Future work will explore Bayesian network extensions for multi‑locus LD, GPU‑accelerated EM for large datasets, and incorporation of complex variant types.
Overall, the study demonstrates that integrating population‑level LD and family‑based Mendelian constraints into a unified probabilistic framework markedly enhances genotype calling accuracy, especially in low‑coverage or error‑prone sequencing scenarios.
Comments & Academic Discussion
Loading comments...
Leave a Comment