Syotti: Scalable Bait Design for DNA Enrichment
Jarno Alanko, University of Helsinki, Finland
Ilya Slizovskiy, Department of Veterinary Population Medicine, College of Veterinary Medicine, University of Minnesota, United States
Daniel Lokshtanov, University of California, Santa Barbara, United States
Travis Gagie, Diego Portales University, Chile
Noelle Noyes, University of Minnesota, United States
Christina Boucher, University of Florida, United States
Motivation: Bait enrichment is a relatively new protocol that is becoming increasingly ubiquitous, as it has been shown to successfully amplify regions of interest in metagenomic samples. In this method, a set of synthetic probes (“baits”) is designed, manufactured, and applied to fragmented metagenomic DNA. The probes bind to the fragmented DNA, and any unbound DNA is rinsed away, leaving the bound fragments to be amplified for sequencing. Most recently, Metsky et al. (Nature Biotech 2019) demonstrated that bait enrichment is capable of detecting a large number of human viral pathogens within metagenomic samples.
Results: We formalize the problem of designing baits by defining the Minimum Bait Cover problem, show that the problem is NP-hard even under very restrictive assumptions, and design an efficient heuristic that takes advantage of succinct data structures. We refer to our method as Syotti. The running time of Syotti scales linearly in practice, running at least an order of magnitude faster than state-of-the-art methods, including the recent method of Metsky et al. At the same time, our method produces bait sets that are smaller than those produced by competing methods, while also leaving fewer positions uncovered. Lastly, we show that Syotti requires only 25 minutes to design baits for a dataset comprising 3 billion nucleotides from 1000 related bacterial substrains, whereas the method of Metsky et al. shows clearly super-linear running time and fails to process even an 8% subset of the data in 24 hours.
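The covering idea behind bait design can be sketched as follows. This is a deliberately simplified toy version (exact coverage only, one greedy left-to-right pass, a hypothetical `bait_len` default of 120); the actual Syotti heuristic additionally reuses baits across sequences via approximate matching backed by succinct data structures, which this sketch omits.

```python
def greedy_bait_cover(seqs, bait_len=120):
    """Toy greedy sketch of bait design: scan each sequence and, at every
    position not yet covered by a previous bait, emit a new bait covering
    the next bait_len bases. Baits near a sequence end are shifted left so
    each emitted bait has full length."""
    baits = []
    for seq in seqs:
        pos = 0
        while pos < len(seq):
            # shift the final bait left so it still spans bait_len bases
            start = min(pos, max(0, len(seq) - bait_len))
            baits.append(seq[start:start + bait_len])
            pos = start + bait_len
    return baits
```

For a single 300-base sequence this emits three (partially overlapping) baits of length 120, covering every position; the NP-hardness result in the abstract concerns the harder problem of choosing a *minimum* such set when baits may bind with mismatches.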
CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices
Shaopeng Liu, Pennsylvania State University, United States
David Koslicki, Pennsylvania State University, United States
K-mer-based methods are used ubiquitously in the field of computational biology. However, the optimal value of k for a specific application is often determined heuristically. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where data sets are large. Here, we introduce a hashing-based technique that leverages a kind of bottom-m sketch as well as a k-mer ternary search tree (KTST) to obtain k-mer-based similarity estimates for a range of k values. By truncating k-mers stored in a pre-built KTST with a large k = k_max value, we can simultaneously obtain k-mer-based estimates for all k values up to k_max. This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time- and space-efficient. For example, we show that when using a KTST to estimate the containment index between a RefSeq-based microbial reference database and simulated metagenome data for 10 values of k, the running time is nearly 10x faster than a classic MinHash approach while using less than one-fifth the space to store the data structure. A Python implementation of this method, CMash, is available at https://github.com/dkoslicki/CMash. The reproduction of all experiments presented herein can be accessed via https://github.com/KoslickiLab/CMASH-reproducibles.
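The truncation idea can be illustrated with a minimal sketch. Here the k_max-mer sets are built once and smaller-k estimates are derived by taking k-prefixes of the stored k_max-mers, rather than re-scanning the sequences for each k. CMash does this inside a sketched ternary search tree; this toy version uses plain Python sets, and the function and parameter names are illustrative, not CMash's API.

```python
def kmers(seq, k):
    """All k-length substrings of seq, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def truncated_containment(seq_a, seq_b, k_max, ks):
    """Estimate the containment index C(A, B) = |A ∩ B| / |A| for several
    k values at once. The sequences are scanned only once (at k = k_max);
    each smaller-k set is obtained by truncating the stored k_max-mers.
    Note: k-mers that start in the last k_max - 1 positions of a sequence
    are not prefixes of any k_max-mer, so this is an approximation."""
    a_max, b_max = kmers(seq_a, k_max), kmers(seq_b, k_max)
    results = {}
    for k in ks:
        a_k = {m[:k] for m in a_max}
        b_k = {m[:k] for m in b_max}
        results[k] = len(a_k & b_k) / len(a_k)
    return results
```

The space saving in the abstract comes from storing one tree of k_max-mers (with shared prefixes compressed) instead of one sketch per k; the sketch above only shows the truncation step, not the bottom-m sketching or the tree itself.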