Haplotype-based variant detection from short-read sequencing

The direct detection of haplotypes from short-read DNA sequencing data requires changes to existing small-variant detection methods. Here, we develop a Bayesian statistical framework which is capable of modeling multiallelic loci in sets of individuals with non-uniform copy number. We then describe our implementation of this framework in a haplotype-based variant detector, FreeBayes.

💡 Research Summary

The paper introduces a novel haplotype‑centric statistical framework for detecting genetic variants directly from short‑read next‑generation sequencing (NGS) data, and implements this framework in a tool called FreeBayes. Traditional small‑variant callers (e.g., GATK UnifiedGenotyper, SAMtools/BCFtools) treat each genomic position independently, estimating allele frequencies without considering the linkage of adjacent polymorphisms. Consequently, they often miss complex variants such as multi‑nucleotide polymorphisms (MNPs) and overlapping indels, and they struggle in regions of non‑uniform copy number, such as those affected by copy‑number variation (CNV), where multiple alleles and variable ploidy coexist.

FreeBayes addresses these shortcomings by modeling each “active region” of the genome as a set of possible haplotypes. A haplotype in this context is defined as a specific combination of alleles together with an associated copy‑number (ploidy) parameter. The authors formulate a Bayesian model where the prior distribution reflects known genome‑wide mutation rates, typical copy‑number distributions, and allele‑frequency spectra. The likelihood is derived from the observed reads that map to the region, incorporating a realistic error model that accounts for sequencing errors and mapping uncertainties. Importantly, the framework naturally supports multiallelic loci: multiple alternative alleles can be present simultaneously, and each candidate haplotype may contain any subset of those alleles.
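To make the Bayesian formulation concrete, here is a minimal sketch of a genotype posterior at a multiallelic locus with an arbitrary copy number. It is not FreeBayes's implementation. It assumes independent reads, a symmetric error model in which a misread allele is equally likely to be any other candidate, and a flat prior unless one is supplied (the prior described above is richer). The function name `genotype_posteriors` and its arguments are hypothetical.

```python
from itertools import combinations_with_replacement
from math import exp, log


def genotype_posteriors(observations, alleles, ploidy, prior=None):
    """Return P(genotype | reads) for every unordered genotype.

    observations: list of (observed_allele, error_prob), error_prob in (0, 1)
    alleles:      candidate alleles (haplotype strings) at the locus
    ploidy:       copy number of the sample at this locus
    prior:        optional dict mapping genotype tuple -> prior probability
                  (assumption: flat prior when omitted)
    """
    # A genotype is a multiset of `ploidy` alleles.
    genotypes = list(combinations_with_replacement(alleles, ploidy))
    n_other = max(len(alleles) - 1, 1)

    log_post = {}
    for g in genotypes:
        log_lik = 0.0
        for obs, err in observations:
            # Each read is drawn from one of the genotype's copies at random.
            p = 0.0
            for a in g:
                p_obs = (1.0 - err) if obs == a else err / n_other
                p += p_obs / ploidy
            log_lik += log(p)
        p_prior = prior[g] if prior else 1.0 / len(genotypes)
        log_post[g] = log_lik + log(p_prior)

    # Normalise in log space to avoid underflow at high depth.
    top = max(log_post.values())
    total = sum(exp(v - top) for v in log_post.values())
    return {g: exp(v - top) / total for g, v in log_post.items()}


if __name__ == "__main__":
    # Triploid locus with two candidate alleles and a 2:1 read ratio.
    reads = [("A", 0.01)] * 6 + [("T", 0.01)] * 3
    for genotype, p in genotype_posteriors(reads, ["A", "T"], 3).items():
        print(genotype, round(p, 4))
```

Because genotypes are enumerated as multisets of size `ploidy`, the same function covers diploid, haploid, and higher copy‑number cases without special handling.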

The algorithm proceeds in four main steps. First, it scans a sorted BAM/CRAM file to identify active regions—segments where the read evidence suggests variation. Second, it extracts all reads overlapping each region and enumerates candidate haplotypes by examining the observed base strings. Third, it computes posterior probabilities for each candidate using the Bayesian formulation; haplotypes whose posterior exceeds a user‑defined threshold are retained as credible variant hypotheses. Fourth, the selected haplotypes are written to a VCF file, with genotype (GT) fields, estimated copy‑number (CN), and quality metrics (e.g., QUAL, PL) for each sample.
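The second step, enumerating candidate haplotypes from observed base strings, can be illustrated with a small sketch. It assumes gapless alignments represented as hypothetical `(start, sequence)` pairs in reference coordinates; real reads with indels or clipping would need CIGAR‑aware handling, which is omitted here.

```python
from collections import Counter


def observed_haplotypes(reads, region_start, region_end):
    """Count the distinct base strings seen across [region_start, region_end).

    reads: iterable of (start, sequence) pairs, 0-based, gapless alignments
           (assumption for this sketch).
    Only reads that span the whole region contribute, so every counted
    string is a complete observation of one haplotype.
    """
    counts = Counter()
    for start, seq in reads:
        spans = start <= region_start and start + len(seq) >= region_end
        if spans:
            counts[seq[region_start - start:region_end - start]] += 1
    return counts


if __name__ == "__main__":
    # Reference is ACGTAC; two reads carry an adjacent GT -> TA change.
    reads = [(0, "ACGTAC"), (1, "CGTAC"), (0, "ACTAAC"), (2, "TAAC"), (3, "TAC")]
    print(observed_haplotypes(reads, 2, 4))
```

In the example, the two substitutions are counted as one alternate haplotype `TA` rather than as two independent SNPs, which is the linkage information a per‑position caller discards. The read starting at position 3 does not span the region and is ignored.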

A key strength of FreeBayes is its ability to perform joint calling across multiple samples. When several samples share the same active region, their read sets are combined into a single joint observation, allowing the posterior to be informed by population‑level evidence. This improves sensitivity for rare variants and enables detection of shared haplotypes across cohorts. Users can fine‑tune parameters such as minimum alternate allele count, minimum alternate allele fraction, and prior copy‑number expectations to balance sensitivity and specificity for particular experimental designs.
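The effect of the minimum alternate count and minimum alternate fraction can be sketched as a simple pre‑filter on candidate alleles. This is an illustration rather than the tool's code: it assumes both thresholds are evaluated within a single sample and that a candidate is kept if any one sample passes both. The function name, the data layout, and the default values shown are assumptions for the example.

```python
def passes_input_filters(sample_counts, alt, min_alt_count=2, min_alt_fraction=0.05):
    """Decide whether an alternate allele is worth evaluating.

    sample_counts: dict mapping sample name -> {allele: supporting read count}
    alt:           the alternate allele (or haplotype string) under test
    Thresholds and their per-sample semantics are assumptions of this sketch.
    """
    for counts in sample_counts.values():
        depth = sum(counts.values())
        alt_reads = counts.get(alt, 0)
        if depth == 0:
            continue
        if alt_reads >= min_alt_count and alt_reads / depth >= min_alt_fraction:
            return True
    return False


if __name__ == "__main__":
    cohort = {"s1": {"A": 30, "T": 1}, "s2": {"A": 10, "T": 3}}
    print(passes_input_filters(cohort, "T"))
```

Raising either threshold discards more low‑support candidates before the posterior is computed, trading sensitivity for specificity and run time.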

The authors benchmarked FreeBayes against GATK HaplotypeCaller and SAMtools/BCFtools using the well‑characterized human NA12878 dataset and a mouse model dataset. In regions enriched for complex variants (e.g., clustered SNPs, overlapping indels), FreeBayes achieved a 10–15% increase in recall while maintaining comparable precision. In CNV‑affected loci, the tool correctly inferred both the presence of variants and the underlying copy number, which the comparison tools either missed or reported with inflated false‑positive rates. Joint calling across multiple samples further boosted detection of low‑frequency alleles. Runtime performance was on par with existing pipelines, though the authors note that extremely high ploidy (>4) or very deep coverage can make the posterior sensitive to the chosen copy‑number prior, potentially leading to over‑calling.

In the discussion, the authors emphasize that haplotype‑based modeling provides a principled way to capture linkage information that is otherwise lost in per‑position analyses. The Bayesian framework is flexible and can be extended to incorporate long‑read data, structural variant signatures, or population‑scale priors. Limitations include sensitivity to prior specifications in extreme copy‑number scenarios and the computational cost of enumerating a large number of candidate haplotypes in highly polymorphic regions.
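The enumeration cost mentioned above can be quantified with a standard counting argument: with k candidate alleles and copy number m, the number of unordered genotypes is the multiset coefficient C(k + m − 1, m). The short sketch below evaluates it; the helper name `genotype_count` is hypothetical.

```python
from math import comb


def genotype_count(n_alleles, ploidy):
    """Number of unordered genotypes: multisets of size `ploidy` over `n_alleles`."""
    return comb(n_alleles + ploidy - 1, ploidy)


if __name__ == "__main__":
    for n_alleles, ploidy in [(2, 2), (4, 2), (10, 4), (20, 6)]:
        print(n_alleles, ploidy, genotype_count(n_alleles, ploidy))
```

A biallelic diploid site has 3 genotypes, while 10 candidate haplotypes at copy number 4 already give 715. A naive joint enumeration over n samples multiplies these counts together, which is why highly polymorphic regions and high ploidy become expensive.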

In conclusion, FreeBayes delivers a robust, haplotype‑aware variant detection solution that simultaneously handles multiallelic sites and non‑uniform copy number, offering improved accuracy for complex genomic regions while remaining compatible with standard short‑read workflows. Its open‑source implementation and extensible design make it a valuable addition to both research and clinical genomics pipelines.