pHapCompass: Probabilistic Assembly and Uncertainty Quantification of Polyploid Haplotype Phase
Computing haplotypes from sequencing data, i.e., haplotype assembly, is an important component of molecular and population genetics problems, including interpreting the effects of genetic variation on complex traits and reconstructing genealogical relationships. Assembling the haplotypes of polyploid genomes remains a significant challenge due to the exponential search space of haplotype phasings and read assignment ambiguity; the latter challenge is particularly difficult for haplotype assemblers since the information contained within the observed sequence reads is often insufficient for unambiguous haplotype assignment in polyploid genomes. We present pHapCompass, probabilistic haplotype assembly algorithms for diploid and polyploid genomes that explicitly model and propagate read assignment ambiguity to compute a distribution over polyploid haplotype phasings. We develop graph theoretic algorithms to enable statistical inference and uncertainty quantification despite an exponential space of possible phasings. Since prior work evaluates polyploid haplotype assembly on synthetic genomes that do not reflect the realistic genomic complexity of polyploid organisms, we develop a computational workflow for simulating genomes and DNA-seq for auto- and allopolyploids. Additionally, we generalize the vector error rate and minimum error correction evaluation criteria for partially phased haplotypes. Benchmarking of pHapCompass and several existing polyploid haplotype assemblers shows that pHapCompass yields competitive performance across varying genomic complexities and polyploid structures while retaining an accurate quantification of phase uncertainty. The source code for pHapCompass, simulation scripts, and datasets are freely available at https://github.com/bayesomicslab/pHapCompass.
💡 Research Summary
The paper introduces pHapCompass, a probabilistic framework for assembling haplotypes in diploid and polyploid genomes while explicitly quantifying phase uncertainty. The authors identify two fundamental challenges in polyploid haplotype assembly: (1) the exponential number of possible haplotype configurations as ploidy increases, and (2) read‑assignment ambiguity caused by highly similar or duplicated haplotype segments. Existing assemblers either ignore this ambiguity or provide only a single deterministic solution, which can be misleading when the data do not contain enough information to resolve the true phase.
pHapCompass tackles these issues with two complementary statistical models tailored to short‑read and long‑read sequencing technologies.
pHapCompass‑short adopts a SNP‑centric formulation. A SNP graph G connects heterozygous positions that are co‑covered by at least one read. Its line graph Q treats each edge of G (i.e., a SNP pair) as a node, and two nodes are adjacent if they share exactly one SNP. Over Q the authors build a factor graph (the pCompass graph) in which node potentials encode the likelihood of each admissible phasing of a SNP pair, while edge potentials capture consistency across three‑way overlaps. The space of valid phasings for a pair is parameterized by the count of haplotypes carrying each allele pattern, reducing the combinatorial explosion from O(2^K) to O(K). Likelihoods are computed using a simple error model with per‑base error rate ε and Hamming distance between the observed read alleles and a candidate phasing. Inference is performed either with the Viterbi algorithm for a maximum‑a‑posteriori (MAP) haplotype set or with forward‑filtering backward‑sampling (FFBS) to obtain posterior samples and per‑SNP uncertainty estimates.
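The SNP graph G and its line graph Q described above can be sketched in a few lines. This is only an illustration under assumptions: reads are represented as hypothetical dictionaries mapping SNP index to observed allele, and the function names `snp_graph` and `line_graph` are invented here rather than taken from the pHapCompass code base.

```python
from itertools import combinations

def snp_graph(reads):
    """Return the edge set of G: SNP pairs co-covered by at least one read."""
    edges = set()
    for read in reads:
        # Every pair of SNP positions seen on the same read becomes an edge.
        for a, b in combinations(sorted(read), 2):
            edges.add((a, b))
    return edges

def line_graph(edges):
    """Return the edge set of Q: pairs of SNP pairs that share exactly one SNP."""
    adjacent = set()
    for e, f in combinations(sorted(edges), 2):
        if len(set(e) & set(f)) == 1:
            adjacent.add((e, f))
    return adjacent

# Toy input (hypothetical): three reads, each covering two neighbouring SNPs.
reads = [{0: 1, 1: 0}, {1: 0, 2: 1}, {2: 1, 3: 1}]
G = snp_graph(reads)
Q = line_graph(G)
```

In this toy case G is a path over four SNPs, and Q links the pair (0, 1) to (1, 2) and the pair (1, 2) to (2, 3); node and edge potentials would then be attached to the nodes and edges of Q.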
pHapCompass‑long is read‑centric, designed for long, low‑coverage reads that span many variants. Here each read is a node, and explicit latent variables encode the mapping of reads to haplotypes. The resulting hybrid graphical model combines directed edges (read → haplotype) with undirected edges (mutual constraints among overlapping reads). This structure allows the same inference machinery (Viterbi for MAP, FFBS for posterior sampling) to exploit the long‑range information while still handling the ambiguity of read‑to‑haplotype assignment.
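To make the posterior-sampling step concrete, here is a generic forward-filtering backward-sampling routine on a plain discrete chain. It is a sketch under strong assumptions: the hybrid read-to-haplotype model summarized above is not reproduced, and `init`, `trans`, and `emits` are hypothetical inputs standing in for whatever potentials the real model defines.

```python
import random

def ffbs(init, trans, emits, rng):
    """Draw one state path from the posterior of a discrete chain.

    init[s]     : prior weight of state s at step 0
    trans[p][s] : weight of moving from state p to state s
    emits[t][s] : likelihood of the observation at step t under state s
    """
    n = len(init)
    alphas = []
    prev = None
    # Forward filtering: alphas[t] is the normalized filtered distribution.
    for t, e in enumerate(emits):
        if t == 0:
            a = [init[s] * e[s] for s in range(n)]
        else:
            a = [e[s] * sum(prev[p] * trans[p][s] for p in range(n))
                 for s in range(n)]
        z = sum(a)  # assumed positive: some state must explain the data
        prev = [x / z for x in a]
        alphas.append(prev)
    # Backward sampling: draw the last state, then condition on the next one.
    path = [0] * len(emits)
    path[-1] = rng.choices(range(n), weights=alphas[-1])[0]
    for t in range(len(emits) - 2, -1, -1):
        w = [alphas[t][p] * trans[p][path[t + 1]] for p in range(n)]
        path[t] = rng.choices(range(n), weights=w)[0]
    return path

# Hypothetical two-state example.
rng = random.Random(0)
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emits = [[0.8, 0.2], [0.7, 0.3], [0.1, 0.9]]
sample = ffbs(init, trans, emits, rng)
```

Calling `ffbs` repeatedly gives independent posterior samples, and the fraction of samples that agree at a given step is one simple way to turn them into a per-position uncertainty estimate.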
Both models share a common “pCompass graph” construction that limits the factorization to 1‑ and 2‑cliques, keeping the factor graph size at O(L²), where L is the number of SNPs. The authors also provide efficient enumeration of feasible phasings, making the algorithms scalable to realistic genome sizes.
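One way to see why counting haplotypes per allele pattern keeps enumeration small: if the alternate-allele dosages at both SNPs of a pair are known, the number of haplotypes carrying the pattern 11 fixes all other counts. The sketch below assumes biallelic SNPs, known dosages `g1` and `g2`, and a read likelihood that mixes uniformly over the K haplotypes with per-base error rate `eps`; these are illustrative assumptions, and the function names are hypothetical rather than part of pHapCompass.

```python
def feasible_pair_phasings(K, g1, g2):
    """Enumerate phasings of a SNP pair in a K-ploid with dosages g1 and g2.

    Each phasing is a dict of haplotype counts for the patterns 00, 01, 10, 11.
    """
    phasings = []
    # n11 determines every other count, so at most min(g1, g2) + 1 cases exist.
    for n11 in range(max(0, g1 + g2 - K), min(g1, g2) + 1):
        n10 = g1 - n11
        n01 = g2 - n11
        n00 = K - n11 - n10 - n01
        phasings.append({"00": n00, "01": n01, "10": n10, "11": n11})
    return phasings

def read_likelihood(read, counts, eps):
    """P(read | phasing), assuming a uniformly chosen haplotype of origin."""
    K = sum(counts.values())
    total = 0.0
    for pattern, n in counts.items():
        # Hamming distance between the observed alleles and this pattern.
        d = sum(a != b for a, b in zip(read, pattern))
        total += (n / K) * (1 - eps) ** (len(read) - d) * eps ** d
    return total

# Hypothetical triploid example: dosage 1 at the first SNP, 2 at the second.
phasings = feasible_pair_phasings(3, 1, 2)
lik = read_likelihood("01", phasings[0], 0.01)
```

Multiplying `read_likelihood` over all reads covering a SNP pair would give an unnormalized node potential for each candidate phasing of that pair.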
A major contribution is the development of a realistic simulation pipeline for both autopolyploid and allopolyploid genomes. The pipeline generates synthetic chromosomes with duplicated segments, varying mutation rates, and realistic coverage profiles for Illumina‑style short reads and PacBio/Nanopore long reads. This addresses a gap in prior benchmarking, which often relied on overly simplistic synthetic data.
Evaluation metrics are extended from the classic vector error rate (VER) and minimum error correction (MEC) to handle partially phased haplotypes, and a new uncertainty score derived from posterior probabilities is introduced. Benchmarking against state‑of‑the‑art polyploid assemblers (HapTree, Poly‑Harsh, SDhaP, nPhase, etc.) shows that pHapCompass‑short achieves the lowest VER and MEC on high‑coverage short‑read data, while pHapCompass‑long produces more contiguous blocks and comparable error rates on long‑read data. Importantly, the uncertainty quantification reliably flags regions where the data are insufficient, allowing downstream analyses (e.g., GWAS, phylogenetics) to down‑weight or exclude ambiguous loci.
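For reference, the classic MEC score can be computed as the sum, over reads, of the smallest Hamming distance between the read and any assembled haplotype. The sketch below covers only this standard, fully phased definition; the generalization to partially phased haplotypes is not reproduced here, and the data layout (reads as dictionaries from SNP index to allele) is an assumption made for illustration.

```python
def mec_score(reads, haplotypes):
    """Classic minimum error correction score.

    reads      : list of dicts mapping SNP index to observed allele
    haplotypes : list of allele sequences, one per haplotype
    """
    total = 0
    for read in reads:
        # Assign the read to its closest haplotype and count the mismatches.
        total += min(
            sum(h[pos] != allele for pos, allele in read.items())
            for h in haplotypes
        )
    return total

# Hypothetical diploid example with three SNPs.
haplotypes = [[0, 0, 1], [1, 1, 0]]
reads = [{0: 0, 1: 0, 2: 0}, {1: 1, 2: 0}]
score = mec_score(reads, haplotypes)
```

Here the first read needs one correction to match the first haplotype and the second read matches the second haplotype exactly, so the score is 1.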
The authors also demonstrate a real‑world application by assembling an allo‑octoploid strawberry chromosome, achieving higher continuity and lower error than existing tools. All source code, simulation scripts, and benchmark datasets are released publicly on GitHub, facilitating reproducibility.
In summary, pHapCompass provides a unified probabilistic framework that explicitly models read‑assignment ambiguity, scales to both short and long reads, and delivers rigorous uncertainty quantification. This represents a significant advance for polyploid genomics, with immediate relevance to plant breeding, evolutionary studies, and any field where accurate haplotype reconstruction in complex genomes is essential.