LinkedNN: a neural model of linkage disequilibrium decay for recent effective population size inference
Summary: A bioinformatics tool is presented for estimating recent effective population size by using a neural network to automatically compute linkage disequilibrium-related features as a function of genomic distance between polymorphisms. The new method outperforms existing deep learning and summary statistic-based approaches using relatively few sequenced individuals and variant sites, making it particularly valuable for molecular ecology applications with sparse, unphased data. Availability and implementation: The program is available as an easily installable Python package with documentation here: https://pypi.org/project/linkedNN/. The open source code is available from: https://github.com/the-smith-lab/LinkedNN.
💡 Research Summary
LinkedNN introduces a novel neural network layer specifically designed to capture linkage‑disequilibrium (LD) decay as a continuous function of inter‑SNP distance, enabling accurate inference of recent effective population size (Nₑ) from sparse, unphased genotype data. Because the number of possible SNP pairs grows quadratically with the number of SNPs, the authors sample pairs in a log‑uniform fashion, yielding roughly 10 × M pairs (about 50 000 for M = 5 000 SNPs), which keeps computational cost manageable while still covering a wide distance spectrum. For each sampled pair, the per‑individual minor‑allele counts at the two SNPs are passed through a shared position‑wise dense layer (64 output features, ReLU activation). These genotype‑derived features are then combined across the two loci of a pair to produce preliminary genetic features gₚ.
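The pair-sampling step can be illustrated with a minimal sketch. The summary does not spell out the exact procedure, so the scheme below is an assumption: pick a random anchor SNP, draw a target distance log-uniformly between 1 bp and the span of the data, and pair the anchor with the first SNP at or beyond that distance. The function name `sample_pairs_log_uniform` is hypothetical and is not part of the LinkedNN package.

```python
import bisect
import math
import random


def sample_pairs_log_uniform(positions, n_pairs, rng):
    """Sample SNP index pairs (i, j), i < j, with roughly log-uniform distances.

    positions: sorted list of SNP positions in bp (assumed strictly increasing).
    This is an illustrative scheme, not necessarily the one used by LinkedNN.
    """
    max_dist = positions[-1] - positions[0]
    pairs = []
    while len(pairs) < n_pairs:
        i = rng.randrange(len(positions) - 1)  # anchor SNP (never the last one)
        # Draw a target distance log-uniformly in [1, max_dist] bp.
        target = math.exp(rng.uniform(0.0, math.log(max_dist)))
        # First SNP located at or beyond the target distance from the anchor.
        j = bisect.bisect_left(positions, positions[i] + target)
        if j >= len(positions):
            continue  # target runs past the last SNP; redraw
        pairs.append((i, j))
    return pairs


rng = random.Random(0)
positions = sorted(rng.sample(range(1, 1_000_000), 500))  # toy positions
pairs = sample_pairs_log_uniform(positions, 10 * len(positions), rng)
```

Drawing distances in log space gives short-range pairs, where LD changes fastest, about as much weight as long-range pairs, whereas drawing pairs uniformly at random would be dominated by large distances.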
Crucially, the inter‑SNP distances dₚ are transformed using radial basis functions (RBFs) applied in log space. The RBF centers µₖ are log‑uniformly spaced over the chromosome length L, and the number of centers K is set to ⌈log L⌉. This creates a soft, overlapping binning of distances, allowing the network to learn distance‑specific weighting coefficients sₚ for each of the 64 genotype features. The coefficients are generated by a small two‑layer “distance‑mapping” sub‑network that processes the RBF outputs. The final distance‑conditioned genotype features g′ₚ = gₚ ⊙ sₚ are then averaged across all sampled pairs, mirroring the classic Hill (1981) approach of aggregating LD information over the genome. The averaged vector feeds into a regression head consisting of five dense layers that output either Nₑ or any user‑defined target.
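The distance encoding and the final averaging can be sketched in a few lines of plain Python. Several details here are assumptions rather than facts from the paper: the logarithm is taken to be natural, the centers are evenly spaced in log space from 0 to log L, the RBF is Gaussian with a width equal to the center spacing, and the coefficients sₚ are passed in directly instead of being produced by the two-layer distance-mapping sub-network. All function names are hypothetical.

```python
import math


def rbf_centers(chrom_length):
    """K = ceil(log L) centers, evenly spaced in log space (natural log assumed)."""
    k = math.ceil(math.log(chrom_length))
    log_max = math.log(chrom_length)
    return [log_max * idx / (k - 1) for idx in range(k)]  # requires k >= 2


def rbf_features(distance_bp, centers, width):
    """Gaussian RBF activations of log(distance) around each center."""
    log_d = math.log(distance_bp)
    return [math.exp(-((log_d - mu) ** 2) / (2.0 * width ** 2)) for mu in centers]


def average_conditioned_features(g, s):
    """Mean over pairs of the element-wise product g_p * s_p."""
    n_pairs, n_feat = len(g), len(g[0])
    return [
        sum(g[p][f] * s[p][f] for p in range(n_pairs)) / n_pairs
        for f in range(n_feat)
    ]


centers = rbf_centers(67_000_000)        # chromosome length in bp
width = centers[1] - centers[0]          # assumed RBF width
feats = rbf_features(math.exp(centers[3]), centers, width)

# Toy example: two pairs, two genotype features each.
g = [[1.0, 2.0], [3.0, 4.0]]
s = [[1.0, 0.0], [1.0, 0.5]]
pooled = average_conditioned_features(g, s)
```

Because neighboring RBFs overlap, a pair at a given distance activates several centers at once, which is what makes the binning soft rather than a hard histogram of distance classes.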
Training is performed on simulated datasets that mimic the empirical study design (10 individuals, 5 000 SNPs). The model simultaneously estimates a two‑epoch demographic history (recent and ancient Nₑ). Evaluation on 1 000 held‑out simulations shows that the LD layer achieves a mean relative absolute error (MRAE) of 0.380 for recent Nₑ, outperforming a pairwise‑CNN (MRAE = 0.422), a summary‑statistic neural network (0.429), a random‑forest regression on summary statistics (0.456), and a basic CNN (0.511). Visual inspection of the learned distance coefficients reveals that many of them peak between 5 × 10⁵ and 5 × 10⁶ bp, a range that coincides with the inflection point of LD decay in low‑Nₑ simulations, suggesting that the network has captured biologically meaningful LD patterns. Coefficients peaking at shorter distances likely capture fine‑scale LD or non‑LD features such as heterozygosity, while coefficients near zero indicate features that are irrelevant for the inference task.
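For readers unfamiliar with the metric, the following sketch shows one common way to compute a mean relative absolute error. It assumes the definition mean(|predicted − true| / true); the summary does not state the exact formula used in the paper, and the function name `mrae` is hypothetical.

```python
def mrae(true_values, predicted_values):
    """Mean relative absolute error: mean of |pred - true| / true."""
    if len(true_values) != len(predicted_values):
        raise ValueError("inputs must have the same length")
    errors = [
        abs(pred - true) / true
        for true, pred in zip(true_values, predicted_values)
    ]
    return sum(errors) / len(errors)


# Toy values for illustration only, not results from the paper.
score = mrae([100.0, 200.0], [110.0, 150.0])
```

Under this definition, an MRAE of 0.380 means that predictions deviate from the true Nₑ by 38% on average.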
The method is applied to a real dataset of harbor porpoises (Phocoena phocoena) comprising 10 individuals and 5 000 SNPs from the longest contig (max distance ≈ 6.7 × 10⁷ bp). Repeated subsampling (100 replicates) yields a recent Nₑ estimate of ~1 400 (range 1 119–1 659) and an older Nₑ of ~5 900, with the population size change inferred to have occurred ~42 generations ago (≈ 501 years assuming an 11.9‑year generation time). These values are biologically plausible given that the sampled individuals occupy an intermediate geographic zone between the critically endangered Baltic Sea sub‑population and larger Atlantic populations. They are also lower than estimates from a sequentially Markovian coalescent approach, a class of methods that struggles with very recent events.
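The conversion from generations to calendar years is a single multiplication, shown here as a small sketch with a hypothetical helper name. Using the rounded value of 42 generations gives 499.8 years; the reported figure of roughly 501 years presumably comes from the unrounded generation estimate, which is an assumption on our part.

```python
def generations_to_years(generations, generation_time_years):
    """Convert a time in generations to years for a fixed generation time."""
    return generations * generation_time_years


# Rounded values from the text: ~42 generations, 11.9-year generation time.
years_ago = generations_to_years(42, 11.9)
```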
In the discussion, the authors emphasize that the LD layer eliminates the need for manual distance binning and can operate effectively with as few as 10 individuals and 5 000 variants, making it attractive for molecular‑ecology studies that rely on reduced‑representation sequencing. They acknowledge that the utility of LD information depends on recombination rates and demographic complexity; thus, future work may combine the LD layer with other architectures (e.g., geographic distance layers, graph‑convolutional networks) to capture multi‑scale signals and more intricate histories such as bottlenecks, admixture, or migration. They also note that providing a genetic map (cM positions) instead of raw base‑pair distances could further improve performance.
Overall, LinkedNN delivers a practical, open‑source tool that automatically learns distance‑dependent LD features and, in the benchmarks reported, achieves lower error than the CNN and summary‑statistic baselines for recent Nₑ inference from sparse, unphased genomic data. This advance broadens the applicability of LD‑based demographic inference to a wider range of non‑model organisms and conservation genetics scenarios.