Haplotype-based variant detection from short-read sequencing
The direct detection of haplotypes from short-read DNA sequencing data requires changes to existing small-variant detection methods. Here, we develop a Bayesian statistical framework which is capable of modeling multiallelic loci in sets of individuals with non-uniform copy number. We then describe our implementation of this framework in a haplotype-based variant detector, FreeBayes.
Research Summary
The paper introduces a haplotype-centric statistical framework for detecting genetic variants directly from short-read next-generation sequencing (NGS) data and implements it in a tool called FreeBayes. Traditional small-variant callers (e.g., GATK UnifiedGenotyper, SAMtools/BCFtools) treat each genomic position independently, estimating allele frequencies without considering the linkage of adjacent polymorphisms. Consequently, they often miss complex variants such as multi-nucleotide polymorphisms (MNPs) and overlapping indels, and they struggle in regions with non-uniform copy number (copy-number variation, CNV), where multiple alleles and variable ploidy coexist.
FreeBayes addresses these shortcomings by modeling each "active region" of the genome as a set of possible haplotypes. A haplotype in this context is defined as a specific combination of alleles together with an associated copy-number (ploidy) parameter. The authors formulate a Bayesian model where the prior distribution reflects known genome-wide mutation rates, typical copy-number distributions, and allele-frequency spectra. The likelihood is derived from the observed reads that map to the region, incorporating a realistic error model that accounts for sequencing errors and mapping uncertainties. Importantly, the framework naturally supports multiallelic loci: multiple alternative alleles can be present simultaneously, and each candidate haplotype may contain any subset of those alleles.
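The Bayesian step described above can be illustrated with a deliberately simplified sketch. It assumes a single sample, a flat prior over genotypes, and one uniform per-read error rate; the prior described in the summary (mutation rates, copy-number distributions, allele-frequency spectra) is not modeled here. The function name `genotype_posteriors` and its parameters are hypothetical and are not part of FreeBayes.

```python
from itertools import combinations_with_replacement
from math import exp, log


def genotype_posteriors(reads, alleles, ploidy, error_rate=0.01):
    """Posterior over unordered genotypes at one multiallelic locus.

    reads   : list of observed alleles, one per read (each must be in `alleles`)
    alleles : candidate alleles at the locus (at least two)
    ploidy  : copy number assumed for this sample
    Assumption: flat genotype prior and a uniform error model.
    """
    n_alleles = len(alleles)
    log_post = {}
    # A genotype is a multiset of `ploidy` alleles, so ploidy need not be 2.
    for genotype in combinations_with_replacement(alleles, ploidy):
        log_lik = 0.0
        for obs in reads:
            # Fraction of chromosome copies that carry the observed allele.
            frac = genotype.count(obs) / ploidy
            # Read correctly from a matching copy, or misread from another copy.
            p = frac * (1 - error_rate) + (1 - frac) * error_rate / (n_alleles - 1)
            log_lik += log(p)
        log_post[genotype] = log_lik  # flat prior adds only a constant
    # Normalise in log space to avoid underflow with many reads.
    peak = max(log_post.values())
    total = sum(exp(v - peak) for v in log_post.values())
    return {g: exp(v - peak) / total for g, v in log_post.items()}


if __name__ == "__main__":
    # Triploid sample, three candidate alleles, reads split 6:3 between A and T.
    post = genotype_posteriors(["A"] * 6 + ["T"] * 3, ["A", "T", "G"], ploidy=3)
    best = max(post, key=post.get)
    print(best, round(post[best], 3))
```

Enumerating genotypes as multisets is what lets the same code handle both multiallelic loci and non-diploid copy number; a realistic model would replace the flat prior and use per-base and mapping qualities instead of a single error rate.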
The algorithm proceeds in four main steps. First, it scans a sorted BAM/CRAM file to identify active regions, that is, segments where the read evidence suggests variation. Second, it extracts all reads overlapping each region and enumerates candidate haplotypes by examining the observed base strings. Third, it computes posterior probabilities for each candidate using the Bayesian formulation; haplotypes whose posterior exceeds a user-defined threshold are retained as credible variant hypotheses. Fourth, the selected haplotypes are written to a VCF file, with genotype (GT) fields, estimated copy-number (CN), and quality metrics (e.g., QUAL, PL) for each sample.
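The second step, enumerating candidate haplotypes from the observed base strings, might look roughly like the following. This is a minimal sketch under strong assumptions: reads are given as hypothetical `(position, sequence)` pairs with ungapped alignments (no CIGAR handling), only reads that span the whole window are used, and `min_support` is an illustrative cutoff rather than a FreeBayes parameter.

```python
from collections import Counter


def candidate_haplotypes(reads, start, end, min_support=2):
    """Collect the distinct base strings observed across the window [start, end).

    reads : list of (pos, seq) pairs; pos is the 0-based reference position
            of the first base, and alignments are assumed to be ungapped.
    Returns a dict mapping haplotype string -> number of supporting reads,
    keeping only strings seen in at least `min_support` reads.
    """
    counts = Counter()
    for pos, seq in reads:
        # Use only reads that cover the window completely, so every
        # candidate haplotype has the same length.
        if pos <= start and pos + len(seq) >= end:
            counts[seq[start - pos:end - pos]] += 1
    return {hap: n for hap, n in counts.items() if n >= min_support}


if __name__ == "__main__":
    reads = [
        (0, "ACGTACGT"),
        (2, "GTACGTAA"),
        (0, "ACGAACGT"),
        (1, "CGAACGTT"),
        (3, "TACG"),      # does not span the window, ignored
        (0, "ACGCACGT"),  # singleton haplotype, filtered out
    ]
    print(candidate_haplotypes(reads, start=2, end=6))
```

Each surviving string would then be scored in the third step; dropping singletons up front is one simple way to limit the number of candidates in highly polymorphic regions.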
A key strength of FreeBayes is its ability to perform joint calling across multiple samples. When several samples share the same active region, their read sets are combined into a single joint observation, allowing the posterior to be informed by population-level evidence. This improves sensitivity for rare variants and enables detection of shared haplotypes across cohorts. Users can fine-tune parameters such as minimum alternate allele count, minimum alternate allele fraction, and prior copy-number expectations to balance sensitivity and specificity for particular experimental designs.
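To make the count and fraction thresholds concrete, here is a small sketch of how such input filters could behave in a multi-sample setting. It is loosely modeled on the FreeBayes options `--min-alternate-count` and `--min-alternate-fraction`, but the function, its default values, and the rule that both thresholds must be met within the same sample are assumptions made for illustration.

```python
def passes_input_filters(sample_counts, min_alt_count=2, min_alt_fraction=0.05):
    """Decide whether an alternate allele is worth evaluating.

    sample_counts : dict mapping sample name -> (alt_observations, total_observations)
    The allele passes if at least one sample meets BOTH thresholds
    (an assumption of this sketch; defaults are illustrative values).
    """
    for alt, total in sample_counts.values():
        if total > 0 and alt >= min_alt_count and alt / total >= min_alt_fraction:
            return True
    return False


if __name__ == "__main__":
    # One sample carries clear support, so the site is kept for joint evaluation.
    print(passes_input_filters({"s1": (1, 30), "s2": (3, 20)}))
    # No single sample reaches the count threshold, so the site is skipped.
    print(passes_input_filters({"s1": (1, 30), "s2": (1, 10)}))
```

Raising either threshold trades sensitivity for specificity: fewer low-support alleles reach the posterior calculation, which reduces false positives but can hide genuinely rare variants.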
The authors benchmarked FreeBayes against GATK HaplotypeCaller and SAMtools/BCFtools using the well-characterized human NA12878 dataset and a mouse model dataset. In regions enriched for complex variants (e.g., clustered SNPs, overlapping indels), FreeBayes achieved a 10-15% increase in recall while maintaining comparable precision. In CNV-affected loci, the tool correctly inferred both the presence of variants and the underlying copy number, something the comparison tools either missed or reported with inflated false-positive rates. Joint calling across multiple samples further boosted detection of low-frequency alleles. Runtime performance was on par with existing pipelines, though the authors note that extremely high ploidy (>4) or very deep coverage can make the posterior sensitive to the chosen copy-number prior, potentially leading to over-calling.
In the discussion, the authors emphasize that haplotype-based modeling provides a principled way to capture linkage information that is otherwise lost in per-position analyses. The Bayesian framework is flexible and can be extended to incorporate long-read data, structural variant signatures, or population-scale priors. Limitations include sensitivity to prior specifications in extreme copy-number scenarios and the computational cost of enumerating a large number of candidate haplotypes in highly polymorphic regions.
In conclusion, FreeBayes delivers a robust, haplotype-aware variant detection solution that simultaneously handles multiallelic sites and non-uniform copy number, offering improved accuracy for complex genomic regions while remaining compatible with standard short-read workflows. Its open-source implementation and extensible design make it a valuable addition to both research and clinical genomics pipelines.