Identifies sequences

Ying Wang 1,2*, Qi Chen 1, Chao Deng 1, Yiluan Zheng 1 and Fengzhu Sun 3*

1 Department of Automation, Xiamen University, Xiamen, China

2 Xiamen Key Laboratory of Big Data Intelligent Analysis and Decision-Making, Xiamen, China

3 Quantitative and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA, United States

Capturing group-specific sequences between two groups of genomic/metagenomic sequences is critical for the follow-up identification of single nucleotide variants (SNVs), gene families, microbial species, or other elements associated with each group. A sequence that is present, or rich, in one group but absent, or scarce, in the other group is considered a "group-specific" sequence in our study. We developed a user-friendly tool, KmerGO, to identify group-specific sequences between two groups of genomic/metagenomic long sequences or high-throughput sequencing datasets. Compared with other tools, KmerGO captures group-specific k-mers (k up to 40 bp) with much lower computing-resource requirements and in a much shorter running time. For a 1.05 TB dataset (.fasta), KmerGO takes about 21.5 h (including k-mer counting) to return assembled group-specific sequences on a regular stand-alone workstation with no more than 1 GB of memory. Furthermore, KmerGO can also be applied to capture trait-associated sequences for continuous traits. Through multi-process parallel computing, KmerGO is implemented with both a graphical user interface and a command line on Linux and Windows, free from any pre-installed supporting environments, packages, or complex configurations. The group-specific k-mers or sequences output by KmerGO can serve as inputs to other tools for the downstream discovery of biomarkers, such as genetic variants, species, or genes.

Rapid development of high-throughput sequencing technologies has produced large volumes of shotgun genomic/metagenomic data. Comparing high-throughput sequencing data across phenotypes is critical to understanding the mechanisms behind their differences. With short k-mers (k 21), Mash (Ondov et al., 2016), Skmer (Sarmashghi et al., 2019), and Kmer-db (Deorowicz et al., 2018) use MinHash to approximate the Jaccard distance between pairs of sequences based on a small, randomly sampled set of k-mers. However, these measures only return the dissimilarity between two datasets and do not capture the specific biomarkers associated with different phenotypes.

Long k-mers contain richer biological information and are able to depict specific signatures in nucleotide sequences (Wang et al., 2016). Therefore, k-mers with length ≥20 bp have been used to identify biomarkers, such as sequences (Drouin et al., 2016; Wang et al., 2018), genetic variants (Jaillard et al., 2018; Rahman et al., 2018; Standage et al., 2019), and genes (Han et al., 2017), specific to categorical phenotypes.
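The definition of a "group-specific" sequence given above (present or rich in one group, absent or scarce in the other) can be illustrated with a small sketch. This is not KmerGO's implementation. The function names and the `min_frac`/`max_frac` presence thresholds are hypothetical, chosen only for illustration, and the sketch simply counts in how many samples of each group a k-mer occurs.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of distinct k-mers in one sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def presence_counts(samples, k):
    """For each k-mer, count how many samples contain it at least once."""
    counts = Counter()
    for seq in samples:
        counts.update(kmers(seq, k))
    return counts


def group_specific_kmers(group_a, group_b, k, min_frac=0.8, max_frac=0.2):
    """Return k-mers common in group_a and rare in group_b.

    min_frac and max_frac are illustrative thresholds (assumptions),
    not parameters taken from KmerGO.
    """
    if not group_a or not group_b:
        raise ValueError("both groups must contain at least one sample")
    in_a = presence_counts(group_a, k)
    in_b = presence_counts(group_b, k)
    specific = set()
    for kmer, n_a in in_a.items():
        frac_a = n_a / len(group_a)               # share of A samples with the k-mer
        frac_b = in_b.get(kmer, 0) / len(group_b)  # share of B samples with the k-mer
        if frac_a >= min_frac and frac_b <= max_frac:
            specific.add(kmer)
    return specific
```

For example, `group_specific_kmers(["ACGTACGT", "ACGTTTGT"], ["TTTTCCCC", "GGGGCCCC"], k=4, min_frac=1.0, max_frac=0.0)` returns `{"ACGT"}`, the only 4-mer found in every sample of the first group and in no sample of the second. A real tool would work from k-mer counting output rather than in-memory sets, which is what makes long k-mers feasible on terabyte-scale data.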