Barcoding-free BAC Pooling Enables Combinatorial Selective Sequencing of the Barley Gene Space
We propose a new sequencing protocol that combines recent advances in combinatorial pooling design and second-generation sequencing technology to efficiently approach de novo selective genome sequencing. We show that combinatorial pooling is a cost-effective and practical alternative to exhaustive DNA barcoding when dealing with hundreds or thousands of DNA samples, such as genome-tiling gene-rich BAC clones. The novelty of the protocol hinges on the computational ability to efficiently compare hundreds of millions of short reads and assign them to the correct BAC clones so that the assembly can be carried out clone-by-clone. Experimental results on simulated data for the rice genome show that the deconvolution is extremely accurate (99.57% of the deconvoluted reads are assigned to the correct BAC), and the resulting BAC assemblies have very high quality (BACs are covered by contigs over about 77% of their length, on average). Experimental results on real data for a gene-rich subset of the barley genome confirm that the deconvolution is accurate (almost 70% of left/right pairs in paired-end reads are assigned to the same BAC, despite being processed independently) and the BAC assemblies have good quality (the average sum of all assembled contigs is about 88% of the estimated BAC length).
💡 Research Summary
The paper introduces a novel sequencing workflow that eliminates the need for DNA barcodes when handling thousands of bacterial artificial chromosome (BAC) clones, by leveraging combinatorial pooling designs together with next‑generation sequencing (NGS). Traditional multiplexing relies on attaching a unique barcode to each sample, a process that becomes prohibitively expensive, labor‑intensive, and error‑prone at the scale of hundreds to thousands of clones. Moreover, barcode bias can cause highly uneven read distribution across samples. To overcome these limitations, the authors adopt a mathematically grounded pooling scheme in which each BAC is present in a specific combination of pools, and the pattern of pool membership itself encodes a unique “signature” for that BAC.
The workflow begins with the construction of a minimum tiling path (MTP) from a physical map of the target genome (rice and barley are used as test cases). BACs from the MTP are allocated to pools according to a shifted transversal design that is 3‑decodable. In the experiments, the parameters P = 13 (pools per layer), L = 7 (number of layers, and hence pools per BAC), and Γ = 2 (maximum number of pools shared by any two BACs) were chosen, yielding 13 × 7 = 91 pools that together contain 13³ = 2,197 BACs. Each BAC appears in exactly seven pools, one per layer, and each pool holds 13² = 169 BACs. This design guarantees that the set of pools containing a given BAC is unique and that any two BACs co‑occur in at most two pools, providing robustness against errors.
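The pool counts above can be reproduced with a short sketch. It assumes one standard polynomial construction of a shifted transversal design: each BAC index is written as Γ + 1 base‑P digits, and in layer j the BAC goes to the pool given by evaluating the polynomial with those digits as coefficients at x = j, modulo P. Whether the authors' pool assignment follows exactly this indexing is an assumption, and the function name `shifted_transversal_design` is hypothetical.

```python
from collections import Counter
from itertools import product

def shifted_transversal_design(q, layers, gamma):
    """Map each BAC index to a tuple of global pool ids, one pool per layer.

    Assumes q is prime and layers <= q, so that the layers correspond to
    distinct evaluation points modulo q.
    """
    design = {}
    for bac, digits in enumerate(product(range(q), repeat=gamma + 1)):
        pools = []
        for layer in range(layers):
            # Evaluate the polynomial with coefficients `digits` at x = layer.
            local = sum(d * pow(layer, c, q) for c, d in enumerate(digits)) % q
            pools.append(layer * q + local)  # global pool id in 0 .. layers*q - 1
        design[bac] = tuple(pools)
    return design

# Parameters reported in the summary: P = 13, L = 7, Γ = 2.
design = shifted_transversal_design(q=13, layers=7, gamma=2)
pool_sizes = Counter(pool for signature in design.values() for pool in signature)

print(len(design))               # number of BACs
print(len(pool_sizes))           # number of pools
print(set(pool_sizes.values()))  # BACs per pool
```

Two distinct BACs share a pool in layer j only if the difference of their polynomials vanishes at j. A nonzero polynomial of degree at most Γ has at most Γ roots modulo a prime, which is why no two BACs share more than Γ pools in this construction. With L = 7 and Γ = 2 this gives ⌊(L − 1)/Γ⌋ = 3, consistent with the 3‑decodable claim.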
After pooling, the DNA from each pool is sequenced on an Illumina platform. The resulting short reads (100–150 bp) are processed by extracting all 26‑mers and recording in which pools each k‑mer occurs. A hash‑based data structure stores these k‑mer signatures. For each read, the collection of pools that contain its constituent k‑mers forms the read’s “pool signature”. By matching this signature against the pre‑computed BAC signatures, the read is assigned to the most likely BAC (or to overlapping BACs when ambiguity exists). This deconvolution step is computationally intensive; the rice dataset required ~120 GB of RAM and 164 minutes on a single core to build the hash table, followed by 33 minutes on 10 cores for the assignment phase.
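The signature-matching idea can be illustrated on toy data. This is a minimal sketch, not the authors' implementation: it uses k = 4 instead of 26, a plain Python dictionary instead of a memory-efficient hash table, and only accepts k‑mers whose pool set matches exactly one BAC signature (the real method also resolves k‑mers shared by up to three overlapping BACs). All sequences, pool contents, and names such as `build_kmer_index` and `assign_read` are hypothetical.

```python
from collections import Counter, defaultdict

def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def build_kmer_index(pool_reads, k):
    """Map each k-mer to the set of pools in which it was observed."""
    index = defaultdict(set)
    for pool_id, reads in pool_reads.items():
        for read in reads:
            for kmer in kmers(read, k):
                index[kmer].add(pool_id)
    return index

def assign_read(read, index, bac_signatures, k):
    """Vote over the read's k-mers; return the BAC whose signature wins, or None."""
    votes = Counter()
    for kmer in kmers(read, k):
        pools = frozenset(index.get(kmer, ()))
        if pools in bac_signatures:  # k-mer's pool set equals a known BAC signature
            votes[bac_signatures[pools]] += 1
    if not votes:
        return None  # no informative k-mer: the read is discarded
    return votes.most_common(1)[0][0]

# Toy design: three BACs, four pools, each BAC present in two pools.
bac_sequences = {"BAC_A": "ACGTACGGTT", "BAC_B": "TTGACCATGA", "BAC_C": "CAGTCTAGCA"}
bac_signatures = {
    frozenset({0, 2}): "BAC_A",
    frozenset({1, 3}): "BAC_B",
    frozenset({0, 3}): "BAC_C",
}
# For simplicity each pool "read" is the full sequence of a BAC it contains.
pool_reads = defaultdict(list)
for pools, bac in bac_signatures.items():
    for pool_id in pools:
        pool_reads[pool_id].append(bac_sequences[bac])

K = 4
index = build_kmer_index(pool_reads, K)
print(assign_read("CGTACGGT", index, bac_signatures, K))
print(assign_read("GGGGGGGG", index, bac_signatures, K))
```

Voting over all k‑mers of a read, rather than trusting a single k‑mer, is what lets a scheme like this tolerate sequencing errors: an erroneous k‑mer usually maps to no valid signature and simply contributes no vote.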
Simulation on the rice genome (where the true origin of each read is known) demonstrated remarkable performance: 99.57 % of deconvoluted reads were assigned correctly, and only 18.5 % of reads were discarded due to ambiguous signatures (typically highly repetitive k‑mers appearing in many pools). The effective coverage per BAC rose to ~87×, higher than the nominal 56×, because many reads were shared among overlapping BACs. Subsequent assembly of each BAC’s read set using VELVET (with k‑mer size tuned for maximal N50) yielded high‑quality assemblies: average N50 of 47 kb (≈31 % of the average BAC length), average largest contig of 57 kb, and total assembled bases covering 90.7 % of each BAC on average. BLAST validation against the reference rice genome confirmed that the assembled contigs largely recapitulated the original BAC sequences, with an average BAC coverage of 76.8 %.
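For readers unfamiliar with the assembly statistics quoted above, the following sketch computes N50 and the assembled fraction for a single BAC. N50 is taken in its usual sense: the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of all assembled bases. The contig lengths and BAC length below are made-up illustrative values, not data from the paper.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty assembly)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0

def assembled_fraction(contig_lengths, bac_length):
    """Sum of contig lengths divided by the (estimated) BAC length."""
    return sum(contig_lengths) / bac_length

# Hypothetical assembly of one BAC, lengths in kb.
contigs_kb = [80, 70, 50, 40, 30, 20]
print(n50(contigs_kb))
print(assembled_fraction(contigs_kb, bac_length=320))
```

Note that the assembled fraction only sums contig lengths, so it can exceed the fraction of the BAC actually covered when contigs overlap or are misassembled. That is one reason the summed-contig figure and the BLAST-validated coverage figure reported above differ.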
Real data from barley, a larger and more repetitive genome, were processed similarly. From 2,197 BACs distributed across 91 pools, 71.3 % of reads could be assigned to 1–3 BACs, covering roughly 87 % of the total bases. Approximately 70 % of paired‑end reads had both mates assigned to the same BAC, despite independent processing, indicating strong deconvolution accuracy. BAC assemblies showed a lower N50 than rice (reflecting the higher repeat content) but still achieved an average total contig length equal to 88 % of the estimated BAC size. To assess biological relevance, the authors compared assembled contigs to known EST/unigene sequences; only 10 % of BAC assemblies missed the expected unigene, and for the remaining 90 % the unigene coverage averaged 90 % of its length.
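The paired-end check works as a consistency test because the two mates are deconvoluted independently, so agreement between them is evidence that both assignments are right. A minimal sketch of such a metric follows. It assumes each mate is assigned a set of one to three BACs (or an empty set if discarded), counts a pair as concordant when the two sets share at least one BAC, and uses pairs with both mates assigned as the denominator. The exact definition used by the authors may differ, and the data are illustrative.

```python
def pair_concordance(pair_assignments):
    """Fraction of fully assigned read pairs whose mates share at least one BAC."""
    both_assigned = [(left, right) for left, right in pair_assignments if left and right]
    if not both_assigned:
        return 0.0
    concordant = sum(1 for left, right in both_assigned if left & right)
    return concordant / len(both_assigned)

# Hypothetical assignments: (BACs for the left mate, BACs for the right mate).
pairs = [
    ({"B1"}, {"B1"}),        # concordant
    ({"B1", "B2"}, {"B2"}),  # concordant via the overlapping BAC B2
    ({"B3"}, {"B4"}),        # discordant
    (set(), {"B1"}),         # left mate discarded, pair not counted
]
print(pair_concordance(pairs))
```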
The authors also benchmarked three alternative strategies: (1) assembling each pool before deconvolution (169 BACs per pool), (2) assembling the entire set of 2,197 BACs together, and (3) a conventional whole‑genome shotgun assembly of barley at 31× coverage. The BAC‑by‑BAC approach consistently produced the longest N50 and most complete BAC reconstructions, demonstrating its superiority for targeted, high‑resolution assembly.
Key insights from the study include: (i) combinatorial pooling can replace barcoding for large‑scale clone sequencing, eliminating barcode synthesis and associated biases; (ii) the pooling design provides inherent error tolerance and filters out highly repetitive reads that would otherwise confound assembly; (iii) the method scales to complex plant genomes, though memory requirements for k‑mer hashing are substantial and may necessitate high‑performance computing resources. Future work could focus on optimizing memory usage (e.g., using succinct data structures or distributed computing) and automating the selection of pooling parameters for different genome sizes and complexities.
In summary, the paper presents a practical, cost‑effective pipeline that combines combinatorial pooling with NGS to achieve accurate deconvolution and high‑quality BAC assemblies without barcodes, offering a valuable tool for targeted sequencing projects in large and repetitive plant genomes.