A sequence that is present, or rich, in one group but absent, or scarce, in another group is considered a “group-specific” sequence in our study. We developed a user-friendly tool, KmerGO, to identify group-specific sequences between two groups of genomic/metagenomic long sequences or high-throughput sequencing datasets. Compared with other tools, KmerGO captures group-specific k-mers (k up to 40 bp) with much lower computing-resource requirements and in much shorter running time. For a 1.05 TB dataset (.fasta), KmerGO takes about 21.5 h (including k-mer counting) to return assembled group-specific sequences on a regular stand-alone workstation with no more than 1 GB of memory. Furthermore, KmerGO can also be applied to capture trait-associated sequences for continuous traits. Using multi-process parallel computing, KmerGO is implemented with both a graphical user interface and a command line on Linux and Windows, free from any pre-installed supporting environments, packages, and complex configurations. The group-specific k-mers or sequences output by KmerGO can serve as inputs to other tools for the downstream discovery of biomarkers, such as genetic variants, species, or genes.

The fast development of high-throughput sequencing technologies produces large volumes of shotgun genomic/metagenomic data. Comparing high-throughput sequencing data across phenotypes is critical to understanding the mechanisms behind their differences. Short k-mer (k 21), Mash (Ondov et al., 2016), Skmer (Sarmashghi et al., 2019), and Kmer-db (Deorowicz et al., 2018) use MinHash to approximate the Jaccard distance between pairs of sequences based on a small, randomly sampled set of k-mers. However, these measures only return the dissimilarity between two data sets and do not capture specific biomarkers associated with different phenotypes. Long k-mers contain richer biological information and are able to depict specific signatures in nucleotide sequences (Wang et al., 2016). Therefore, k-mers of length ≥20 bp have been used to identify biomarkers, such as sequences (Drouin et al., 2016; Wang et al., 2018), genetic variants (Jaillard et al., 2018; Rahman et al., 2018; Standage et al., 2019), and genes (Han et al., 2017) specific to categorical phenotypes.
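To make the notion of a "group-specific" k-mer concrete, the following is a minimal Python sketch, not KmerGO's actual algorithm. It assumes a simple presence-based rule: a k-mer counts as specific to group A if it occurs in at least a chosen fraction of A's samples and in at most a chosen fraction of B's samples. The function names, the thresholds `min_frac_a` and `max_frac_b`, and the toy sequences are all hypothetical, and the toy uses k = 4 for readability rather than the long k-mers discussed in the text.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of distinct k-mers in one sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def group_presence(samples, k):
    """For each k-mer, count how many samples in the group contain it."""
    counts = Counter()
    for seq in samples:
        counts.update(kmers(seq, k))
    return counts


def group_specific_kmers(group_a, group_b, k, min_frac_a=1.0, max_frac_b=0.0):
    """k-mers found in >= min_frac_a of group A and <= max_frac_b of group B.

    The thresholds are illustrative assumptions, not KmerGO parameters.
    """
    in_a = group_presence(group_a, k)
    in_b = group_presence(group_b, k)
    return sorted(
        kmer
        for kmer, n in in_a.items()
        if n / len(group_a) >= min_frac_a
        and in_b.get(kmer, 0) / len(group_b) <= max_frac_b
    )


# Toy data: two samples per group.
group_a = ["ACGTACGGA", "TTACGGAC"]
group_b = ["ACGTACGTT", "GGTTACGT"]

specific = group_specific_kmers(group_a, group_b, k=4)
print(specific)
```

Relaxing `max_frac_b` above zero corresponds to the "scarce" rather than "absent" case in the definition. A practical tool would work from k-mer counts over large read sets rather than in-memory Python sets.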