Population-scale Ancestral Recombination Graphs with tskit 1.0


📝 Original Info

  • Title: Population-scale Ancestral Recombination Graphs with tskit 1.0
  • ArXiv ID: 2602.09649
  • Date: 2026-02-10
  • Authors: Author information is not provided in the paper. (The source text does not include an author list.)

📝 Abstract

Ancestral recombination graphs (ARGs) are an increasingly important component of population and statistical genetics. The tskit library has become key infrastructure for the field, providing an expressive and general representation of ARGs together with a suite of efficient fundamental operations. In this note, we announce tskit version 1.0, describe its underlying rationale, and document its stability guarantees. These guarantees provide a foundation for durable computational artefacts and support long-term reproducibility of code and analyses.

📄 Full Content

Ancestral recombination graphs (ARGs) capture the full genetic history of samples from a recombining species. Although ARGs have been a central theoretical object in population genetics for decades, their practical use was constrained by the lack of scalable inference methods, standard interchange formats, and software infrastructure. Recent breakthroughs in simulation and inference have substantially changed this landscape, leading to renewed interest in ARG-based analyses across population and statistical genetics. [1][2][3] The tskit library has played a key enabling role in this shift and has become foundational infrastructure for working with ARGs. This paper marks the release of tskit 1.0, which formalises long-term stability guarantees for its data formats and APIs.

At the core of tskit is the succinct tree sequence data model, which defines a set of nodes (genomes at particular times) and edges (inheritance relationships between nodes spanning genomic intervals) in a simple tabular form. [4] This encoding provides a lossless representation of a general class of ARGs, establishing a precise and machine-readable definition suitable for large-scale computation. [5] The data model also incorporates site, mutation, population, and pedigree information and supports arbitrary metadata associated with each of these components of the ARG. Provenance information is recorded natively, enhancing reproducibility and transparency. Together, these features make the tskit data model a semantically complete and interoperable representation of ARGs that serves as a common foundation across diverse analytical workflows (Figure 1).
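To make the node and edge tables concrete, here is a minimal sketch in plain Python. It does not use the tskit API: the `Node` and `Edge` classes, the toy values, and the `local_trees` helper are all hypothetical names chosen for illustration. It shows how a set of edges, each valid over a genomic interval, implies a sequence of local trees along the genome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    time: float      # how long ago this genome existed
    is_sample: bool  # True for sampled genomes


@dataclass(frozen=True)
class Edge:
    left: float   # start of the genomic interval (inclusive)
    right: float  # end of the genomic interval (exclusive)
    parent: int   # row index into the node table
    child: int    # row index into the node table


# Hypothetical toy ARG: genome of length 10, samples are nodes 0, 1 and 2.
nodes = [
    Node(0.0, True), Node(0.0, True), Node(0.0, True),
    Node(1.0, False),  # 3: ancestor of 0 and 1 along the whole genome
    Node(2.0, False),  # 4: root of the tree covering [0, 5)
    Node(3.0, False),  # 5: root of the tree covering [5, 10)
]
edges = [
    Edge(0, 10, 3, 0),
    Edge(0, 10, 3, 1),
    Edge(0, 5, 4, 2),
    Edge(0, 5, 4, 3),
    Edge(5, 10, 5, 2),
    Edge(5, 10, 5, 3),
]


def local_trees(edges, sequence_length):
    """Yield (left, right, child -> parent map) between consecutive edge breakpoints."""
    breakpoints = sorted(
        {0.0, float(sequence_length)}
        | {float(e.left) for e in edges}
        | {float(e.right) for e in edges}
    )
    for left, right in zip(breakpoints, breakpoints[1:]):
        # An edge contributes to this tree if its interval covers [left, right).
        parent = {e.child: e.parent for e in edges
                  if e.left <= left and right <= e.right}
        yield left, right, parent


trees = list(local_trees(edges, 10))
for left, right, parent in trees:
    print(f"[{left}, {right}): {parent}")
```

The two edges for samples 0 and 1 are stored once even though they appear in both local trees. Sharing edges across adjacent trees in this way is what lets the tabular encoding stay compact.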

Simulation is a fundamental tool in population genomics, and was the first domain in which the tskit data model demonstrated its impact. Introduced initially as part of the msprime simulator, the tskit data model enabled performance improvements of several orders of magnitude over previous coalescent simulation approaches. [6] The same representation later enabled efficient forward-time simulation of ARGs and yielded substantial speedups by avoiding explicit simulation of neutral mutations. [4] Because these forward-time and coalescent simulators share a common underlying representation, their complementary strengths can be combined within a single workflow. This has made it possible to simulate ARGs under complex demographic scenarios involving geography and selection that were previously infeasible, providing essential ground truth for method evaluation. Simulation capabilities have continued to expand, including whole-autosome ARG simulations for nearly 1.5 million individuals based on a large human pedigree. [7] A growing ecosystem of forward-time, coalescent, and hybrid simulation tools builds directly on tskit (Table S1).
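As a rough illustration of how a simulator can emit its result directly as node and edge tables, the sketch below runs a basic Kingman coalescent for a single non-recombining locus in plain Python. This is not msprime's algorithm, which also handles recombination and demography, and the function name `simulate_coalescent` and the tuple layouts are assumptions made for this example only.

```python
import random


def simulate_coalescent(num_samples, sequence_length=1.0, seed=1):
    """Simulate one coalescent tree; return (nodes, edges) as lists of tuples.

    nodes: (time, is_sample); edges: (left, right, parent, child).
    Time is measured in coalescent units.
    """
    rng = random.Random(seed)
    nodes = [(0.0, True) for _ in range(num_samples)]
    edges = []
    lineages = list(range(num_samples))
    time = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        # Waiting time to the next coalescence is exponential
        # with rate k-choose-2.
        time += rng.expovariate(k * (k - 1) / 2)
        a, b = rng.sample(lineages, 2)
        parent = len(nodes)
        nodes.append((time, False))
        # Without recombination, every edge spans the whole sequence.
        edges.append((0.0, sequence_length, parent, a))
        edges.append((0.0, sequence_length, parent, b))
        lineages = [u for u in lineages if u not in (a, b)] + [parent]
    return nodes, edges


nodes, edges = simulate_coalescent(num_samples=5, sequence_length=100.0, seed=42)
print(len(nodes), "nodes,", len(edges), "edges")
```

Because the output is just two tables, any other tool that reads the same layout could pick up where this one stops, which is the property that lets coalescent and forward-time simulators be combined in one workflow.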

The lack of scalable inference methods has been a major obstacle to practical application of ARGs. Although there are many inference methods (see Wong et al. [5] for a review), tsinfer was the first to scale to hundreds of thousands of samples, directly leveraging the tskit data model. [8] Many recent ARG inference methods have chosen to support tskit as an output format in addition to their own native representations (Table S1). This shared output layer enables inferred ARGs to interoperate directly with simulators, facilitating systematic evaluation and benchmarking against known ground truth. It also shifts the burden of format conversion away from downstream users, who can instead rely on inference tools to emit results in a common, well-defined representation. The scalability and flexibility of this approach are illustrated by the recent inference of an ARG for 2.48 million SARS-CoV-2 whole genomes, which occupies 32 MiB of storage and can be loaded into memory in under a second. [9]

Efficient storage and analysis of large genetic datasets is a central design goal of tskit, and the data model has enabled substantial performance gains in downstream analyses. For example, single-site population genetic statistics can be computed orders of magnitude faster than from genotype matrices while using far less memory by operating on the underlying ARG structure. [10] Tskit exposes a large API with a performance-critical core implemented in C and bindings available for Python, Rust, and R. Its vectorised, table-first design allows zero-copy access to underlying arrays, supporting high-performance analysis pipelines. As a result, downstream tools inherit performance and correctness properties from a shared, well-tested core.
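The idea behind computing single-site statistics from the ARG rather than from genotypes can be sketched with one tree and a handful of mutations. The plain-Python example below is an illustration under assumed toy data, not tskit's implementation: the tree, the mutation placements, and all function names are hypothetical. It computes mean pairwise differences once from the tree, using only the number of samples below each mutation, and once from the expanded genotype matrix, and checks that the two agree.

```python
from itertools import combinations

# Hypothetical single tree as a child -> parent map; samples are nodes 0..3.
parent = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6}
samples = [0, 1, 2, 3]
# Hypothetical mutations, one per site, given by the node each sits above.
mutation_nodes = [4, 0, 5, 2]


def samples_below(parent, samples):
    """Count the samples descending from each node by walking up from each sample."""
    count = {}
    for s in samples:
        u = s
        while u is not None:
            count[u] = count.get(u, 0) + 1
            u = parent.get(u)
    return count


def diversity_from_tree(parent, samples, mutation_nodes):
    """Mean pairwise differences using only per-node sample counts."""
    n = len(samples)
    below = samples_below(parent, samples)
    num_pairs = n * (n - 1) / 2
    # A mutation above node u separates below[u] samples from the other n - below[u].
    return sum(below[u] * (n - below[u]) for u in mutation_nodes) / num_pairs


def genotype_matrix(parent, samples, mutation_nodes):
    """Expand the tree and mutations into a site-by-sample 0/1 matrix."""
    rows = []
    for m in mutation_nodes:
        row = []
        for s in samples:
            u = s
            while u is not None and u != m:
                u = parent.get(u)
            row.append(1 if u == m else 0)
        rows.append(row)
    return rows


def diversity_from_genotypes(genotypes):
    """Mean pairwise differences by comparing every pair of samples at every site."""
    n = len(genotypes[0])
    pairs = list(combinations(range(n), 2))
    diffs = sum(row[i] != row[j] for row in genotypes for i, j in pairs)
    return diffs / len(pairs)


tree_pi = diversity_from_tree(parent, samples, mutation_nodes)
matrix_pi = diversity_from_genotypes(genotype_matrix(parent, samples, mutation_nodes))
print(tree_pi, matrix_pi)
```

The tree-based version never materialises the genotype matrix, so its cost depends on the number of mutations and the tree size rather than on sites times sample pairs. A production implementation would also avoid recounting samples for every tree along the genome; this sketch handles only a single tree.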

The goal of tskit is to provide a shared technical foundation, centred on efficient, well-tested, and thoroughly documented primitive operations on ARGs, rather than to directly implement end-user workflows. This design principle has enabled a broad ecosystem of downstream software (spanning simulation, ARG inference, population and statistical genetic inference, analysis, and visualisation), with 62 published tools now using tskit as a core dependency (Table S1). With the
