Improved haplotyping of rare variants using next-generation sequence data

Accurate identification of haplotypes in sequenced human genomes can provide invaluable information about population demography and fine-scale correlations along the genome, thus empowering both population genomic and medical association studies. Yet phasing unrelated individuals remains a challenging problem. Incorporating available data from high-throughput sequencing into traditional statistical phasing approaches is a promising avenue to address this difficulty. We present a novel statistical method that expands on an existing graphical haplotype reconstruction method (shapeIT) to incorporate phasing information from paired-end read data. The algorithm harnesses the haplotype graph information estimated by shapeIT from genotypes across the population and refines haplotype likelihoods for a given individual to be compatible with the sequencing data. Applying the method to HapMap individuals genotyped on the Affymetrix Axiom chip at 7,745,081 SNPs and to a trio sequenced by Complete Genomics, we found that the inclusion of paired-end read data significantly improved phasing, with reductions in switch error on the order of 4-15% against shapeIT across all panels. As expected, the improvements were most significant at sites harboring rare variants; furthermore, we found that longer reads and higher throughput translated to greater decreases in switch error, as did higher variance in the size of the insert separating the two reads, suggesting that multi-platform next-generation sequencing may be exploited to yield particularly accurate haplotypes. Overall, the phasing improvements afforded by this new method highlight the power of integrating sequencing read information and population genotype data for reconstructing haplotypes in unrelated individuals.

💡 Research Summary

Accurate haplotype reconstruction is a cornerstone for both population genetics and medical association studies, yet phasing unrelated individuals remains difficult, especially for rare variants. This paper introduces a novel extension of the widely used statistical phasing tool shapeIT that directly incorporates paired‑end next‑generation sequencing (NGS) read information. The method retains shapeIT’s population‑wide haplotype graph, which encodes the linkage structure inferred from genotype data, and refines the individual‑specific haplotype likelihoods by enforcing consistency with observed read‑pair allele combinations.

The algorithm proceeds in two stages. First, shapeIT builds a hidden Markov model (HMM) graph from the genotypes of all samples, representing possible allele configurations at each SNP. Second, for each target individual, the algorithm scans the paired‑end reads. Whenever a read pair spans two heterozygous sites, the observed allele pair imposes a constraint on the HMM path: paths that are compatible receive an increased posterior weight, while incompatible paths are penalized. This Bayesian updating is iterated across the genome, allowing the read‑derived constraints to gradually reshape the posterior distribution of haplotypes. Crucially, the approach leverages two key read characteristics: (1) read length, which determines how many SNPs can be linked in a single fragment, and (2) the distribution of insert sizes, which governs the physical distance over which alleles can be jointly observed. Longer reads and broader insert‑size variance provide more extensive long‑range linkage information, thereby strengthening the constraints.
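The reweighting step described above can be illustrated with a small sketch. It is not shapeIT's implementation: instead of paths through the haplotype graph it enumerates a handful of candidate haplotype pairs, and the function names, the per-base `error_rate`, and the assumption that a fragment is equally likely to come from either haplotype are all hypothetical choices made for illustration.

```python
def fragment_likelihood(hap_pair, observed, error_rate=0.01):
    """P(alleles observed on one fragment | candidate haplotype pair).

    hap_pair: two tuples of 0/1 alleles, one per haplotype.
    observed: dict mapping site index -> allele seen on the read pair.
    Assumption: the fragment comes from either haplotype with probability 1/2,
    and each observed allele is miscalled independently with error_rate.
    """
    total = 0.0
    for hap in hap_pair:
        p = 1.0
        for site, allele in observed.items():
            p *= (1.0 - error_rate) if hap[site] == allele else error_rate
        total += 0.5 * p
    return total


def reweight(candidates, prior, fragments, error_rate=0.01):
    """Bayesian update of candidate phasings using read-pair evidence."""
    weights = []
    for hap_pair, pr in zip(candidates, prior):
        w = pr
        for observed in fragments:
            w *= fragment_likelihood(hap_pair, observed, error_rate)
        weights.append(w)
    norm = sum(weights)
    if norm == 0.0:
        raise ValueError("all candidates have zero weight")
    return [w / norm for w in weights]


# Two heterozygous sites, two possible phasings, equal population-based prior.
candidates = [((0, 0), (1, 1)), ((0, 1), (1, 0))]
prior = [0.5, 0.5]
# One read pair shows allele 0 at both sites, which supports the first phasing.
posterior = reweight(candidates, prior, [{0: 0, 1: 0}])
print([round(p, 4) for p in posterior])
```

With a single supporting read pair, the compatible phasing already receives most of the posterior mass; in the real method the prior would come from the population haplotype graph rather than a flat distribution.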

The authors evaluated the method on two complementary data sets. The first comprised HapMap individuals genotyped on the Affymetrix Axiom platform at 7,745,081 SNPs, providing a dense population reference. The second consisted of a trio sequenced by Complete Genomics, offering high‑coverage whole‑genome paired‑end data with known pedigree phase for validation. Compared with standard shapeIT, the read‑integrated version reduced switch error rates by 4–15% across all panels. The improvement was most pronounced for variants with minor allele frequency (MAF) below 1%, where error reductions frequently exceeded 10%. Moreover, simulations varying read length and insert‑size distribution demonstrated a monotonic relationship: longer reads (≥100 bp) and larger average inserts (≈500 bp) yielded greater error reductions, and higher variance in insert size further amplified the benefit. These trends suggest that multi‑platform sequencing, which combines libraries with different fragment sizes, could be strategically employed to maximize phasing accuracy.
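For readers unfamiliar with the metric, switch error can be computed as in the following sketch. It uses the common definition (the fraction of consecutive heterozygous-site pairs whose relative phase disagrees with the truth); the source does not spell out the exact variant the authors used, and the function name and toy data are hypothetical.

```python
def switch_error_rate(true_hap, inferred_hap, het_sites):
    """Fraction of adjacent heterozygous-site pairs with a phase switch.

    true_hap, inferred_hap: allele sequences (0/1) for one haplotype of the
    same individual, e.g. truth taken from trio phasing.
    het_sites: indices of the individual's heterozygous sites, in genome order.
    """
    # At a heterozygous site the inferred haplotype either matches the true
    # one or carries the opposite allele; a switch is a change between the two.
    matches = [true_hap[i] == inferred_hap[i] for i in het_sites]
    if len(matches) < 2:
        return 0.0
    switches = sum(1 for a, b in zip(matches, matches[1:]) if a != b)
    return switches / (len(matches) - 1)


# Toy example: six heterozygous sites, inferred phase flips after site 1
# and flips back after site 4, giving 2 switches in 5 adjacent pairs.
truth = [0, 1, 0, 1, 1, 0]
inferred = [0, 1, 1, 0, 0, 0]
print(switch_error_rate(truth, inferred, range(6)))
```

Because the measure depends only on changes in relative phase, an inferred haplotype that is the exact complement of the truth scores zero switches, which is the intended behavior: the two haplotype labels are interchangeable.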

In summary, the study provides compelling evidence that integrating NGS read‑pair information into population‑based statistical phasing substantially enhances haplotype inference, especially for rare alleles that are poorly resolved by genotype data alone. The method is computationally efficient, requiring only modest modifications to the existing shapeIT pipeline, and is readily applicable to ongoing large‑scale sequencing projects. By bridging population‑level linkage disequilibrium patterns with direct physical linkage observed in sequencing reads, the approach offers a powerful, scalable solution for generating high‑quality haplotypes in unrelated individuals.