Summary: A bioinformatics tool is presented for estimating recent effective population size by using a neural network to automatically compute linkage disequilibrium-related features as a function of genomic distance between polymorphisms. The new method outperforms existing deep learning and summary statistic-based approaches using relatively few sequenced individuals and variant sites, making it particularly valuable for molecular ecology applications with sparse, unphased data. Availability and implementation: The program is available as an easily installable Python package with documentation here: https://pypi.org/project/linkedNN/. The open source code is available from: https://github.com/the-smith-lab/LinkedNN.
Linkage disequilibrium (LD) is the non-random association of alleles at different loci and is shaped by evolutionary forces including recombination and genetic drift. In particular, its magnitude reflects demographic history, where populations with larger effective size, N_e, show lower levels of LD (Hill and Robertson, 1968).
This relationship has been used to develop powerful N_e estimators that leverage correlations between genotypes at different loci as a function of genomic distance (Laurie-Ahlberg and Weir, 1979; Hill, 1981; Waples, 2006; Santiago et al., 2020). LD-based approaches are especially useful for estimating recent N_e, because crossover events occur more frequently than mutations and can therefore shape LD patterns over shorter timescales. For example, GONE (Santiago et al., 2020) models the decay of LD with genomic distance to infer N_e through time, producing reasonable N_e estimates over the past 100 generations. Such LD-based methods are especially valuable in molecular ecology and conservation, where unphased and relatively sparse genomic markers can nonetheless provide signal about recent N_e. Despite these advantages, summarizing genome-wide LD remains burdensome because existing approaches require analysis-specific decisions for binning SNP pairs into arbitrary, discrete distance classes. Consequently, there is a methodological need for tools that automatically extract features related to LD decay directly from polymorphism data.
Many of the deep learning architectures used to automatically extract features from SNPs are convolutional neural networks (CNNs). While CNNs have been effective for demographic inference and other tasks (Flagel et al., 2019; Torada et al., 2019; Gower et al., 2021; Sanchez et al., 2021; Wang et al., 2021; Smith et al., 2023), they have limited capacity to capture LD. This issue arises in part because CNNs perform spatial operations designed for gridded, regularly spaced inputs such as images, whereas SNPs are not uniformly distributed along chromosomes. Furthermore, the genotype information from individual sites is progressively eroded in neural network layers beyond the first convolution, making typical architectures poorly suited for capturing correlations beyond adjacent SNPs in the input array. Capturing precise LD signal requires models that can represent genomic features across a continuum of genomic distances.
Here, a neural network is developed that is capable of learning LD decay-related features from SNPs as a function of genomic distance, rather than relying on grid operations to convey positional information.
The method is evaluated on simulated data for estimating recent N e , benchmarked against current CNN and summary statistic-based regression tools, and demonstrated on empirical data. The new LD layer is implemented in a bioinformatics tool called LinkedNN that can be used for N e estimation or other inference tasks in diverse species.
The LD layer was implemented using PyTorch (Ansel et al., 2024) and consists of the following steps. The initial stage involves sub-sampling pairs of SNPs, as the number of possible pairs scales quadratically with the input size. Beginning with ordered polymorphisms i = 1, …, M, discretized log-uniform index jumps ∆_i = ⌊log-U(1, M)⌋ are drawn to form pairs (i, i + ∆_i), resulting in physical distances d_i within pairs that are roughly log-uniform. The number of sampled pairs is increased by repeating this process ten times, discarding pairs with i + ∆_i > M, to yield a total of P = 10M − n_skip pairs indexed by p = 1, …, P. For example, with M = 5,000 and ten proposed pairs per SNP, the total number of retained pairs is around P ≈ 44,000.
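The pair sub-sampling step above can be sketched as follows. This is an illustrative reimplementation, not the LinkedNN source; the function name and arguments are hypothetical:

```python
import numpy as np

def sample_snp_pairs(positions, reps=10, rng=None):
    """Sub-sample SNP pairs with discretized log-uniform index jumps.

    For each SNP index i, a jump delta = floor(log-U(1, M)) is drawn and
    the pair (i, i + delta) is proposed; the process is repeated `reps`
    times, and pairs running past the last SNP are discarded.
    """
    rng = np.random.default_rng(rng)
    positions = np.asarray(positions)
    M = len(positions)
    i = np.tile(np.arange(M), reps)
    # log-uniform jump sizes on [1, M): exponentiate a uniform draw in log space
    delta = np.floor(np.exp(rng.uniform(0.0, np.log(M), size=i.size))).astype(int)
    j = i + delta
    keep = j < M                       # discard pairs that run off the chromosome
    i, j = i[keep], j[keep]
    distances = positions[j] - positions[i]  # physical distances within pairs
    return np.column_stack([i, j]), distances
```

With M = 5,000 SNPs and ten proposals per SNP, roughly a fraction 1/log(M) of the 50,000 proposals overshoot the chromosome end, leaving approximately the 44,000 retained pairs quoted in the text.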
This strategy surveys a range of genomic distances without enumerating all pairs. Features are initially extracted from the genotype inputs x_1, …, x_P irrespective of genomic position (Figure 1, left). Specifically, unphased genotypes are encoded as the count of the minor allele in each individual and given to a shared-weights, position-wise layer. All trainable layers have a number of output features f = 64 and rectified linear unit activation, except the final layer. The position-wise features in each pair are combined and given to two additional layers to compute preliminary genetic features g_p for each pair. By letting the network automatically extract features instead of manually calculating genotype correlations, it has the potential to extract non-LD features as well.
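A minimal sketch of this position-agnostic feature-extraction stage is given below, assuming the two position-wise embeddings in a pair are combined by concatenation; the class name, combination scheme, and layer count are illustrative assumptions, not the LinkedNN source:

```python
import torch
import torch.nn as nn

class PairFeatureExtractor(nn.Module):
    """Compute preliminary pair features g_p from genotype inputs (a sketch).

    A shared-weights layer embeds each SNP's genotype vector (minor-allele
    counts per individual); the two embeddings in a pair are combined and
    passed through two further layers, all with f = 64 output features
    and ReLU activation, as described in the text.
    """

    def __init__(self, n_individuals, f=64):
        super().__init__()
        # Shared-weights, position-wise embedding applied to each SNP in a pair
        self.site_layer = nn.Sequential(nn.Linear(n_individuals, f), nn.ReLU())
        # Two additional layers computing the preliminary pair features g_p
        self.pair_layers = nn.Sequential(
            nn.Linear(2 * f, f), nn.ReLU(),
            nn.Linear(f, f), nn.ReLU(),
        )

    def forward(self, geno_a, geno_b):
        # geno_a, geno_b: (P, n_individuals) minor-allele counts for each pair member
        h = torch.cat([self.site_layer(geno_a), self.site_layer(geno_b)], dim=-1)
        return self.pair_layers(h)  # (P, f) preliminary genetic features g_p
```

Because the same `site_layer` weights are applied to both SNPs in every pair, the number of parameters is independent of the number of pairs P.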
Next, the inter-SNP distances undergo further processing before being passed to any trainable layers.
The d_p are transformed using radial basis functions, following the approach from Schütt et al. (2017) but applied in log space:

e_k(d_p) = exp(−γ (log d_p − µ_k)²), for k = 1, …, K,

where the K centers, µ_k, are log-uniformly spaced over (1, L), with L being the length of the chromosome, and K is set to log(L), rounded up. The parameter γ controls the width of the radial basis functions and, in this study, is implicitly set as the spacing between adjacent centers.
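This log-space expansion can be sketched as follows; the exact convention for deriving γ from the spacing between adjacent centers is an assumption here, and the function name is hypothetical:

```python
import numpy as np

def log_rbf_expand(d, L):
    """Expand inter-SNP distances with log-space radial basis functions.

    Follows the scheme described in the text: K = ceil(log(L)) centers mu_k
    are log-uniformly spaced over (1, L), and each distance d_p is mapped to
    K features exp(-gamma * (log d_p - mu_k)**2).
    """
    K = int(np.ceil(np.log(L)))
    mu = np.linspace(np.log(1.0), np.log(L), K)  # log-uniformly spaced centers
    spacing = mu[1] - mu[0]
    gamma = 1.0 / spacing**2  # assumed width convention based on center spacing
    log_d = np.log(np.asarray(d, dtype=float))
    return np.exp(-gamma * (log_d[:, None] - mu[None, :]) ** 2)  # (len(d), K)
```

Each distance is thereby represented as a smooth, fixed-length vector, letting downstream trainable layers respond to a continuum of genomic distances rather than discrete distance bins.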