MersV1, Pmc, Corpus, bibRecord, 000945

Disk-based k-mer counting on a PC

Identifieur interne : 000945 ( Pmc/Corpus ); précédent : 000944; suivant : 000946

Disk-based k-mer counting on a PC

Auteurs : Sebastian Deorowicz ; Agnieszka Debudaj-Grabysz ; Szymon Grabowski

Source :

BMC Bioinformatics [ 1471-2105 ] ; 2013.

RBID : PMC:3680041

Abstract

Background

The k-mer counting problem, which is to build the histogram of occurrences of every k-symbol long substring in a given text, is important for many bioinformatics applications. They include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection.

Results

We propose a simple, yet efficient, parallel disk-based algorithm for counting k-mers. Experiments show that it usually offers the fastest solution to the considered problem, while demanding a relatively small amount of memory. In particular, it is capable of counting the statistics for short-read human genome data, in input gzipped FASTQ file, in less than 40 minutes on a PC with 16 GB of RAM and 6 CPU cores, and for long-read human genome data in less than 70 minutes. On a more powerful machine, using 32 GB of RAM and 32 CPU cores, the tasks are accomplished in less than half the time. No other algorithm for most tested settings of this problem and mammalian-size data can accomplish this task in comparable time. Our solution also belongs to memory-frugal ones; most competitive algorithms cannot efficiently work on a PC with 16 GB of memory for such massive data.

Conclusions

By making use of cheap disk space and exploiting CPU and I/O parallelism we propose a very competitive k-mer counting procedure, called KMC. Our results suggest that judicious resource management may allow to solve at least some bioinformatics problems with massive data on a commodity personal computer.

Url:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3680041

DOI: 10.1186/1471-2105-14-160
PubMed: 23679007
PubMed Central: 3680041

Links to Exploration step

PMC:3680041

Le document en format XML

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Disk-based <italic>k</italic>
-mer counting on a PC</title>
<author><name sortKey="Deorowicz, Sebastian" sort="Deorowicz, Sebastian" uniqKey="Deorowicz S" first="Sebastian" last="Deorowicz">Sebastian Deorowicz</name>
<affiliation><nlm:aff id="I1">, Institute of Informatics, Silesian University of Technology, Akademicka 16, Gliwice, 44-100, Poland</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Debudaj Grabysz, Agnieszka" sort="Debudaj Grabysz, Agnieszka" uniqKey="Debudaj Grabysz A" first="Agnieszka" last="Debudaj-Grabysz">Agnieszka Debudaj-Grabysz</name>
<affiliation><nlm:aff id="I1">, Institute of Informatics, Silesian University of Technology, Akademicka 16, Gliwice, 44-100, Poland</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Grabowski, Szymon" sort="Grabowski, Szymon" uniqKey="Grabowski S" first="Szymon" last="Grabowski">Szymon Grabowski</name>
<affiliation><nlm:aff id="I2">, Institute of Computer Science, Lodz University of Technology, Al. Politechniki 11, Łódź, 90-924, Poland</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">23679007</idno>
<idno type="pmc">3680041</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3680041</idno>
<idno type="RBID">PMC:3680041</idno>
<idno type="doi">10.1186/1471-2105-14-160</idno>
<date when="2013">2013</date>
<idno type="wicri:Area/Pmc/Corpus">000945</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000945</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">Disk-based <italic>k</italic>
-mer counting on a PC</title>
<author><name sortKey="Deorowicz, Sebastian" sort="Deorowicz, Sebastian" uniqKey="Deorowicz S" first="Sebastian" last="Deorowicz">Sebastian Deorowicz</name>
<affiliation><nlm:aff id="I1">, Institute of Informatics, Silesian University of Technology, Akademicka 16, Gliwice, 44-100, Poland</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Debudaj Grabysz, Agnieszka" sort="Debudaj Grabysz, Agnieszka" uniqKey="Debudaj Grabysz A" first="Agnieszka" last="Debudaj-Grabysz">Agnieszka Debudaj-Grabysz</name>
<affiliation><nlm:aff id="I1">, Institute of Informatics, Silesian University of Technology, Akademicka 16, Gliwice, 44-100, Poland</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Grabowski, Szymon" sort="Grabowski, Szymon" uniqKey="Grabowski S" first="Szymon" last="Grabowski">Szymon Grabowski</name>
<affiliation><nlm:aff id="I2">, Institute of Computer Science, Lodz University of Technology, Al. Politechniki 11, Łódź, 90-924, Poland</nlm:aff>
</affiliation>
</author>
</analytic>
<series><title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint><date when="2013">2013</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><sec><title>Background</title>
<p>The <italic>k</italic>
-mer counting problem, which is to build the histogram of occurrences of every <italic>k</italic>
-symbol long substring in a given text, is important for many bioinformatics applications. They include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection.</p>
</sec>
<sec><title>Results</title>
<p>We propose a simple, yet efficient, parallel disk-based algorithm for counting <italic>k</italic>
-mers. Experiments show that it usually offers the fastest solution to the considered problem, while demanding a relatively small amount of memory. In particular, it is capable of counting the statistics for short-read human genome data, in input gzipped FASTQ file, in less than 40 minutes on a PC with 16 GB of RAM and 6 CPU cores, and for long-read human genome data in less than 70 minutes. On a more powerful machine, using 32 GB of RAM and 32 CPU cores, the tasks are accomplished in less than half the time. No other algorithm for most tested settings of this problem and mammalian-size data can accomplish this task in comparable time. Our solution also belongs to memory-frugal ones; most competitive algorithms cannot efficiently work on a PC with 16 GB of memory for such massive data.</p>
</sec>
<sec><title>Conclusions</title>
<p>By making use of cheap disk space and exploiting CPU and I/O parallelism we propose a very competitive <italic>k</italic>
-mer counting procedure, called KMC. Our results suggest that judicious resource management may allow to solve at least some bioinformatics problems with massive data on a commodity personal computer.</p>
</sec>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct><analytic><author><name sortKey="Compeau, Pe" uniqKey="Compeau P">PE Compeau</name>
</author>
<author><name sortKey="Pevzner, Pa" uniqKey="Pevzner P">PA Pevzner</name>
</author>
<author><name sortKey="Tesler, G" uniqKey="Tesler G">G Tesler</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Edgar, Rc" uniqKey="Edgar R">RC Edgar</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Kurtz, S" uniqKey="Kurtz S">S Kurtz</name>
</author>
<author><name sortKey="Narechania, A" uniqKey="Narechania A">A Narechania</name>
</author>
<author><name sortKey="Stein, J" uniqKey="Stein J">J Stein</name>
</author>
<author><name sortKey="Ware, D" uniqKey="Ware D">D Ware</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Kelley, Dr" uniqKey="Kelley D">DR Kelley</name>
</author>
<author><name sortKey="Schatz, Mc" uniqKey="Schatz M">MC Schatz</name>
</author>
<author><name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Miller, Jr" uniqKey="Miller J">JR Miller</name>
</author>
<author><name sortKey="Delcher, Al" uniqKey="Delcher A">AL Delcher</name>
</author>
<author><name sortKey="Koren, S" uniqKey="Koren S">S Koren</name>
</author>
<author><name sortKey="Venter, E" uniqKey="Venter E">E Venter</name>
</author>
<author><name sortKey="Walenz, B" uniqKey="Walenz B">B Walenz</name>
</author>
<author><name sortKey="Brownley, A" uniqKey="Brownley A">A Brownley</name>
</author>
<author><name sortKey="Johnson, J" uniqKey="Johnson J">J Johnson</name>
</author>
<author><name sortKey="Li, K" uniqKey="Li K">K Li</name>
</author>
<author><name sortKey="Mobarry, Cm" uniqKey="Mobarry C">CM Mobarry</name>
</author>
<author><name sortKey="Sutton, Gg" uniqKey="Sutton G">GG Sutton</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Marcais, G" uniqKey="Marcais G">G Marçais</name>
</author>
<author><name sortKey="Kingsford, C" uniqKey="Kingsford C">C Kingsford</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Melsted, P" uniqKey="Melsted P">P Melsted</name>
</author>
<author><name sortKey="Pritchard, Jk" uniqKey="Pritchard J">JK Pritchard</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Pell, J" uniqKey="Pell J">J Pell</name>
</author>
<author><name sortKey="Hintze, A" uniqKey="Hintze A">A Hintze</name>
</author>
<author><name sortKey="Canino Koning, R" uniqKey="Canino Koning R">R Canino-Koning</name>
</author>
<author><name sortKey="Howe, A" uniqKey="Howe A">A Howe</name>
</author>
<author><name sortKey="Tiedje, Jm" uniqKey="Tiedje J">JM Tiedje</name>
</author>
<author><name sortKey="Brown, Ct" uniqKey="Brown C">CT Brown</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Rizk, G" uniqKey="Rizk G">G Rizk</name>
</author>
<author><name sortKey="Lavenier, D" uniqKey="Lavenier D">D Lavenier</name>
</author>
<author><name sortKey="Chikhi, R" uniqKey="Chikhi R">R Chikhi</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Satish, N" uniqKey="Satish N">N Satish</name>
</author>
<author><name sortKey="Kim, C" uniqKey="Kim C">C Kim</name>
</author>
<author><name sortKey="Chhugani, J" uniqKey="Chhugani J">J Chhugani</name>
</author>
<author><name sortKey="Nguyen, Ad" uniqKey="Nguyen A">AD Nguyen</name>
</author>
<author><name sortKey="Lee, Vw" uniqKey="Lee V">VW Lee</name>
</author>
<author><name sortKey="Kim, D" uniqKey="Kim D">D Kim</name>
</author>
<author><name sortKey="Dubey, P" uniqKey="Dubey P">P Dubey</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Dean, J" uniqKey="Dean J">J Dean</name>
</author>
<author><name sortKey="Ghemawat, S" uniqKey="Ghemawat S">S Ghemawat</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article" xml:lang="en"><pmc-dir>properties open_access</pmc-dir>
  <front><journal-meta><journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group><journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher><publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="pmid">23679007</article-id>
<article-id pub-id-type="pmc">3680041</article-id>
<article-id pub-id-type="publisher-id">1471-2105-14-160</article-id>
<article-id pub-id-type="doi">10.1186/1471-2105-14-160</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group><article-title>Disk-based <italic>k</italic>
-mer counting on a PC</article-title>
</title-group>
<contrib-group><contrib contrib-type="author" corresp="yes" id="A1"><name><surname>Deorowicz</surname>
<given-names>Sebastian</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>sebastian.deorowicz@polsl.pl</email>
</contrib>
<contrib contrib-type="author" id="A2"><name><surname>Debudaj-Grabysz</surname>
<given-names>Agnieszka</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>agnieszka.debudaj-grabysz@polsl.pl</email>
</contrib>
<contrib contrib-type="author" id="A3"><name><surname>Grabowski</surname>
<given-names>Szymon</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
<email>sgrabow@kis.p.lodz.pl</email>
</contrib>
</contrib-group>
<aff id="I1"><label>1</label>
, Institute of Informatics, Silesian University of Technology, Akademicka 16, Gliwice, 44-100, Poland</aff>
<aff id="I2"><label>2</label>
, Institute of Computer Science, Lodz University of Technology, Al. Politechniki 11, Łódź, 90-924, Poland</aff>
<pub-date pub-type="collection"><year>2013</year>
</pub-date>
<pub-date pub-type="epub"><day>16</day>
<month>5</month>
<year>2013</year>
</pub-date>
<volume>14</volume>
<fpage>160</fpage>
<lpage>160</lpage>
<history><date date-type="received"><day>29</day>
<month>11</month>
<year>2012</year>
</date>
<date date-type="accepted"><day>3</day>
<month>5</month>
<year>2013</year>
</date>
</history>
<permissions><copyright-statement>Copyright © 2013 Deorowicz et al.; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2013</copyright-year>
<copyright-holder>Deorowicz et al.; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/2.0"><license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License(<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/2.0">http://creativecommons.org/licenses/by/2.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.biomedcentral.com/1471-2105/14/160"></self-uri>
<abstract><sec><title>Background</title>
<p>The <italic>k</italic>
-mer counting problem, which is to build the histogram of occurrences of every <italic>k</italic>
-symbol long substring in a given text, is important for many bioinformatics applications. They include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection.</p>
</sec>
<sec><title>Results</title>
<p>We propose a simple, yet efficient, parallel disk-based algorithm for counting <italic>k</italic>
-mers. Experiments show that it usually offers the fastest solution to the considered problem, while demanding a relatively small amount of memory. In particular, it is capable of counting the statistics for short-read human genome data, in input gzipped FASTQ file, in less than 40 minutes on a PC with 16 GB of RAM and 6 CPU cores, and for long-read human genome data in less than 70 minutes. On a more powerful machine, using 32 GB of RAM and 32 CPU cores, the tasks are accomplished in less than half the time. No other algorithm for most tested settings of this problem and mammalian-size data can accomplish this task in comparable time. Our solution also belongs to memory-frugal ones; most competitive algorithms cannot efficiently work on a PC with 16 GB of memory for such massive data.</p>
</sec>
<sec><title>Conclusions</title>
<p>By making use of cheap disk space and exploiting CPU and I/O parallelism we propose a very competitive <italic>k</italic>
-mer counting procedure, called KMC. Our results suggest that judicious resource management may allow to solve at least some bioinformatics problems with massive data on a commodity personal computer.</p>
</sec>
</abstract>
<kwd-group><kwd><italic>k</italic>
-mer counting</kwd>
<kwd>de Bruijn graph genome assemblers</kwd>
<kwd>Multiple sequence alignment</kwd>
<kwd>Repeat detection</kwd>
</kwd-group>
</article-meta>
</front>
<body><sec><title>Background</title>
<p>Counting the number of occurrences of every substring of length <italic>k</italic>
 (so-called <italic>k</italic>
-mer) in a given string <italic>S</italic>
 is an important procedure in bioinformatics. One prominent application is <italic>de novo</italic> assembly from very large number (typically, a few billions) of short reads, produced by second-generation sequencing instruments. The most popular assembly approach for such data is based on building the de Bruijn graph
[<xref ref-type="bibr" rid="B1">1</xref>
], in which an edge between any pair of <italic>k</italic>
-mers, represented as nodes in the graph, exists if and only if the (<italic>k</italic>
−1)-symbol long suffix of one <italic>k</italic>
-mer is a prefix of another. The current sequencing technology cannot, however, get rid of a relatively large number of errors (mis-detected nucleotides) in sequence reads. These errors can be detected on a statistical basis. The whole genome obtains high coverage (30-fold to 60-fold is typical) on modern sequencing platforms, which means that any “genuine” substring of length, e.g., 25 or 30 is very likely to appear multiple times in the reads collection. If a given <italic>k</italic>
-mer occurs only once, it almost certainly contains at least one (or more probably a few) false symbol. The unique <italic>k</italic>
-mers should be discarded before (or in the process of) building the de Bruijn graph, since they significantly increase the memory and time requirements for graph generation. Other applications of <italic>k</italic>-mer counting include fast multiple sequence alignment
[<xref ref-type="bibr" rid="B2">2</xref>] and repeat detection
[<xref ref-type="bibr" rid="B3">3</xref>
]. Usually we should not distinguish between a <italic>k</italic>
-mer and its reversed complement, and by the “canonical <italic>k</italic>
-mer” we will mean the lexicographically smaller of the two.</p>
<p>Counting <italic>k</italic>
-mers is challenging, because it should be both fast and performed using as little memory as possible. A naïve approach is to use a standard hash table, with <italic>k</italic>
-mers as keys and their counts as values. This solution is both memory consuming and hard to parallelize effectively. Moreover, some of the <italic>k</italic>-mer counting tools cooperate with Quake
[<xref ref-type="bibr" rid="B4">4</xref>
], a widely used package to correct substitution sequencing errors in experiments with deep coverage, which takes <italic>k</italic>
-mer statistics as an important component for its job. Supporting Quake makes <italic>k</italic>
-mer counting even more demanding, both in time and used memory, since instead of plain <italic>k</italic>
-mer counts Quake takes into account also the quality of base calls, i.e., high-quality reads have higher impact. In the following paragraphs we will briefly present several respected <italic>k</italic>
-mer counting solutions.</p>
<p>Tallymer
[<xref ref-type="bibr" rid="B3">3</xref>
] is a <italic>k</italic>
-mer counter based on the suffix array structure. Alas, suffix array construction is a relatively expensive operation, which is worsened for the second-generation sequencing data, where short reads must go together with high coverage.</p>
<p>Meryl, from the <italic>k</italic>-mer package of the Celera assembler
[<xref ref-type="bibr" rid="B5">5</xref>
], sorts the <italic>k</italic>
-mers in memory. More precisely, it distributes all <italic>k</italic>
-mers into (by default) 2<sup>24</sup> bins and then sorts each bin. Finally it removes the unique ones. This approach requires a huge amount of memory to work. For example, for the human NA19238 dataset that we use in the experimental section (cf. the statistics in Table
<xref ref-type="table" rid="T1">1</xref>
) over 350 GB of RAM would be needed, provided 8 bytes per <italic>k</italic>
-mer.</p>
<table-wrap position="float" id="T1"><label>Table 1</label>
<caption><p><bold>Statistics of the datasets used in the experiments for</bold>
<bold><italic>k=25</italic>
</bold>
</p>
</caption>
<table frame="hsides" rules="groups" border="1"><colgroup><col align="left"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
</colgroup>
<thead valign="top"><tr><th align="left" valign="bottom"> <hr></hr>
</th>
<th align="right" valign="bottom"><bold><italic>D. ananassae</italic>
</bold>
<hr></hr>
</th>
<th align="right" valign="bottom"><bold><italic>C. elegans</italic>
</bold>
<hr></hr>
</th>
<th align="right" valign="bottom"><bold><italic>Z. mays</italic>
</bold>
<hr></hr>
</th>
<th align="right" valign="bottom"><bold><italic>H. sapiens</italic>
</bold>
<hr></hr>
</th>
<th align="right" valign="bottom"><bold><italic>H. sapiens</italic>
</bold>
<hr></hr>
</th>
</tr>
<tr><th align="left"> </th>
<th align="left"> </th>
<th align="left"> </th>
<th align="left"> </th>
<th align="right"><bold>NA19238</bold>
</th>
<th align="right"><bold>HG02057</bold>
</th>
</tr>
</thead>
<tbody valign="top"><tr><td align="left" valign="bottom">FASTQ file size [GB]<hr></hr>
</td>
<td align="right" valign="bottom">8.7<hr></hr>
</td>
<td align="right" valign="bottom">16.4<hr></hr>
</td>
<td align="right" valign="bottom">45.9<hr></hr>
</td>
<td align="right" valign="bottom">353<hr></hr>
</td>
<td align="right" valign="bottom">208<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">Total gzipped size [GB]<hr></hr>
</td>
<td align="right" valign="bottom">1.8<hr></hr>
</td>
<td align="right" valign="bottom">4.6<hr></hr>
</td>
<td align="right" valign="bottom">16.3<hr></hr>
</td>
<td align="right" valign="bottom">116.6<hr></hr>
</td>
<td align="right" valign="bottom">65.9<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">No. of gzipped files<hr></hr>
</td>
<td align="right" valign="bottom">6<hr></hr>
</td>
<td align="right" valign="bottom">2<hr></hr>
</td>
<td align="right" valign="bottom">108<hr></hr>
</td>
<td align="right" valign="bottom">463<hr></hr>
</td>
<td align="right" valign="bottom">6<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">No. of reads [ ×10<sup>6</sup>
]<hr></hr>
</td>
<td align="right" valign="bottom">35<hr></hr>
</td>
<td align="right" valign="bottom">68<hr></hr>
</td>
<td align="right" valign="bottom">62<hr></hr>
</td>
<td align="right" valign="bottom">2,662<hr></hr>
</td>
<td align="right" valign="bottom">860<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">Read lengths<hr></hr>
</td>
<td align="right" valign="bottom">75<hr></hr>
</td>
<td align="right" valign="bottom">100<hr></hr>
</td>
<td align="right" valign="bottom">25–2043<hr></hr>
</td>
<td align="right" valign="bottom">36(most)–75<hr></hr>
</td>
<td align="right" valign="bottom">100<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom"> <hr></hr>
</td>
<td align="left" valign="bottom"> <hr></hr>
</td>
<td align="left" valign="bottom"><bold>Statistics of </bold>
<bold><italic>k </italic>
</bold>
<bold>-mers</bold>
<hr></hr>
</td>
<td align="left" valign="bottom"> <hr></hr>
</td>
<td align="left" valign="bottom"> <hr></hr>
</td>
<td align="left" valign="bottom"> <hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">No. of singletons<hr></hr>
</td>
<td align="right" valign="bottom">43<hr></hr>
</td>
<td align="right" valign="bottom">347<hr></hr>
</td>
<td align="right" valign="bottom">1,010<hr></hr>
</td>
<td align="right" valign="bottom">11,823<hr></hr>
</td>
<td align="right" valign="bottom">3,367<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">No. of distinct<hr></hr>
</td>
<td align="right" valign="bottom">63<hr></hr>
</td>
<td align="right" valign="bottom">459<hr></hr>
</td>
<td align="right" valign="bottom">1,916<hr></hr>
</td>
<td align="right" valign="bottom">14,599<hr></hr>
</td>
<td align="right" valign="bottom">6,023<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">No. of distinct non-singletons<hr></hr>
</td>
<td align="right" valign="bottom">20<hr></hr>
</td>
<td align="right" valign="bottom">112<hr></hr>
</td>
<td align="right" valign="bottom">906<hr></hr>
</td>
<td align="right" valign="bottom">2,776<hr></hr>
</td>
<td align="right" valign="bottom">2,657<hr></hr>
</td>
</tr>
<tr><td align="left">Total no.</td>
<td align="right">1,803</td>
<td align="right">5,127</td>
<td align="right">20,214</td>
<td align="right">44,687</td>
<td align="right">65,325</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>Totals in millions.</p>
</table-wrap-foot>
</table-wrap>
<p>Jellyfish
[<xref ref-type="bibr" rid="B6">6</xref>
] is an algorithm designed for shared memory parallel computers with multiple CPUs / cores. It uses several lock-free data structures. Efficient shared access to these structures implies high utilization of concurrent processing units. More precisely, Jellyfish is based on a hash table with quadratic probing collision resolution, where concurrent update operations are possible thanks to its lock-free mechanism, exploiting the ‘compare-and-swap’ (CAS) assembly instruction present in all modern multi-core CPUs. A CAS instruction performs up to three ‘elementary’ operations in an atomic fashion: it reads a memory cell, compares the read value to the second parameter of the instruction, and if the two values are equal it then writes the third CAS parameter to the memory cell. In the considered application this mechanism is much more efficient than traditional serialization of the access to a shared resource with locking. Another interesting feature of Jellyfish is to store only a part (prefix) of the <italic>k</italic>
-mer in the hash table, since its suffix can be deduced from the hash position.</p>
<p>BFCounter
[<xref ref-type="bibr" rid="B7">7</xref>
] ingeniously involves the Bloom filter structure to discard most of the unique <italic>k</italic>
-mers before their statistics are calculated using a standard hash table. Bloom filter (BF) is a classic compact probabilistic data structure for dynamic set membership queries, which allows a low rate of false positives. To count non-unique <italic>k</italic>
-mers, BFCounter uses both a BF and a plain hash table. Initially, the <italic>k</italic>
-mers are inserted into the BF structure. Then, the hash table is populated with all <italic>k</italic>
-mers which have at least two occurrences plus a relatively small number of unique <italic>k</italic>
-mers, those which appeared false positives in the BF. This algorithm is relatively frugal in memory utilization, but only a serial implementation exists.</p>
<p>It should be noted that both Jellyfish and BFCounter require estimation on the number of distinct <italic>k</italic>
-mers. If the user-specified value is far from the real one, these algorithms may work much slower and using much more memory than in the case of appropriate values given. (More precisely, using Jellyfish with bounded memory is possible, but with limitations. This aspect is discussed in more detail in Sect. Results and discussion.)</p>
<p>We mention also khmer
[<xref ref-type="bibr" rid="B8">8</xref>
], a toolkit for <italic>k</italic>
-mer-based dataset analysis, which (among others) can count <italic>k</italic>
-mers and remove reads with low- or high-abundance <italic>k</italic>
-mers. It is reasonably fast and memory frugal, but these benefits are achieved thanks to its probabilistic nature (again, due to the underlying Bloom filter): with a low probability, khmer may report a false <italic>k</italic>
-mer as being “present”. Also the reported counts for genuine <italic>k</italic>
-mers are only approximate. While these features are acceptable in some applications, we can name its drawbacks: no <italic>k</italic>
-mer listing possibility (testing <italic>every</italic>
 possible <italic>k</italic>
-mer is of course prohibitive, even for relatively small <italic>k</italic>
) and no quality score support (e.g., with Quake-compatible counters). For these reasons, we do not compare khmer with our solution in the experimental section, as in our opinion they “play in different leagues”.</p>
<p>Very recently, a disk-based <italic>k</italic>-mer counting algorithm, named DSK, was presented
[<xref ref-type="bibr" rid="B9">9</xref>
]. On the high level, DSK is similar to the solution presented in this paper, but both algorithms were designed independently at about the same time. In DSK, the (multi)set of <italic>k</italic>
-mers from the input reads is partitioned and partitions are sent to disk. Then, each partition is loaded and processed separately in the main memory, using a hash table. The tool provides strict control of the allocated memory and disk space (lower memory usage results in increased processing time due to more iterations and thus increased I/O), which for example allows to process human genome data in 4 GB of RAM only, in reasonable time.</p>
</sec>
<sec sec-type="methods"><title>Methods</title>
<p>In the following subsections we first present our basic idea, on a high level and in a sequential manner, and then the ‘real’ parallel algorithm, involving multiple CPUs / cores and multiple disks (see Figure
<xref ref-type="fig" rid="F1">1</xref>
). The algorithm description is valid for any parameters <italic>k</italic>
 and read length <italic>r</italic>
. In fact, in the current implementation <italic>k</italic>
 can be as large as 256, and <italic>r</italic>
 as large as 10240, provided that 10<<italic>k</italic>
≤<italic>r</italic>
. (These values can be easily increased when needed, as they are compile-time constants.)</p>
<fig id="F1" position="float"><label>Figure 1</label>
<caption><p>A scheme of the parallel KMC algorithm.</p>
</caption>
<graphic xlink:href="1471-2105-14-160-1"></graphic>
</fig>
<p>We assume here that the purpose is to count the <italic>k</italic>
-mers, but our implementation is more flexible: the associated quality scores can be taken into consideration, <italic>k</italic>
-mers with a count below a threshold may be removed, etc. At least from the algorithmic point, however, the core functionality is the most important, hence it is discussed below.</p>
<sec><title>Basic idea</title>
<p>Our goal is to obtain a compact on-disk dictionary structure with <italic>k</italic>
-mers as keys and their counts as values. The structure can then be read sequentially, or individual <italic>k</italic>
-mers (with their associated counts) can be found using the standard binary search technique.<sup>a</sup>
 The algorithm follows the disk-based distribution sort paradigm. In the first, distribution, phase, we scan the reads one by one, extract all the <italic>k</italic>
-mers from each, replace them with their canonical versions when necessary, and send each to one of multiple (several hundred) disk files based on the <italic>k</italic>
-mer prefix of length <italic>p</italic>
<sub>1</sub>
. Actually, the first phase starts with storing the data in buffers in the main memory where another prefix part, of length <italic>p</italic>
<sub>2</sub>
, is removed from each <italic>k</italic>
-mer, and the prefix counts are maintained for further recovery. Once the buffer reaches the predefined capacity, its content is sent to a file.</p>
<p>The latter, sorting phase, is to collect the data from disk in the order of lexicographically sorted prefixes of length <italic>p</italic>
<sub>1</sub>
, recover the <italic>p</italic>
<sub>2</sub>
-symbol long prefixes, then radix-sort the <italic>k</italic>
-mers, count their frequencies (after sorting repeating <italic>k</italic>
-mers are at adjacent positions), and (optionally) remove, e.g., unique <italic>k</italic>
-mers.</p>
</sec>
<sec><title>Parallel algorithm</title>
<p>The detailed description of our algorithm is relatively complex and the reader is advised to consult Figure
<xref ref-type="fig" rid="F1">1</xref>
. The number of components (splitters, bin readers etc.) at each stage is chosen depending on the number of CPUs / cores of the target machine, but optionally these parameters may be user-specified. The distribution and the sort phase are described in the two following subsections.</p>
<sec><title>Distribution phase</title>
<p>The first phase begins with reading blocks of several megabytes (the exact size depends on the available memory), rounded down to a record boundary, from a (possibly compressed) FASTQ dataset.<sup>b</sup>
 One or more threads are used, depending on the number of input data files; the default number of such threads can be overridden with a command-line switch. The blocks are added to a queue.</p>
<p>In the next step, a number of splitter threads remove the blocks from the queue and extract <italic>k</italic>
-mers from their reads, converting the <italic>k</italic>
-mers to canonical form. Every splitter has its multiple (typically hundreds) bins to fulfill, each with capacity of, e.g., 2<sup>15</sup>
 entries. Each <italic>k</italic>
-mer is directed to the bin specified by the <italic>k</italic>
-mer’s prefix of length <italic>p</italic>
<sub>1</sub>
.</p>
<p>The lengths <italic>p</italic>
<sub>1</sub>
 are variable-sized and their minimum size is user-specified. For example, if the program is run with switch <italic>-p3</italic>
, it means that 3, 4, or 5 symbols belong to the prefix, depending on its content. The rationale is that, for example, about 7/16 of all <italic>canonical</italic>
<italic>k</italic>
-mers are those starting with A<sup>c</sup>
, and different prefix lengths allow to have the bin counts more balanced. Due to some technical difficulties we resigned from even more granularity, but it should be stressed that this issue is practically irrelevant for the overall processing time. A more important goal is to limit the largest bin count (which is beyond our full control, since the reads content is not random). The number of resulting bins for parameter <italic>-p3</italic>
 is about 125, for <italic>-p4</italic>
 about 500, and for <italic>-p5</italic>
 about 2000. As a rule of thumb, it is better to use <italic>-p5</italic>
 for large collections (e.g., mammalian genomes with high coverage), <italic>-p3</italic>
 for small collections (e.g., bacterial genomes), and <italic>-p4</italic>
 for middle-sized ones. Once a bin is full, its content is transformed and then flushed to a common queue of bins. The transformation means here: partial sorting and compaction. The former uses counting sort, according to <italic>k</italic>
-mer’s prefix, this time of the length <italic>p</italic>
<sub>2</sub>
. The latter operation, compaction, removes those prefixes and stores their frequencies, to enable to recover the <italic>k</italic>
-mers later. The parameter <italic>p</italic>
<sub>2</sub>
 is chosen dynamically to fit other (possibly user-selected) settings, like <italic>p</italic>
<sub>1</sub>
, hardware configuration, and the values of <italic>k</italic>
 and <italic>r</italic>
, with the idea of minimizing the amount of temporary data on disk (and thus also total I/O). For convenience (byte-aligned data layout) we always have <italic>k</italic>
−<italic>p</italic>
<sub>1</sub>
−<italic>p</italic>
<sub>2</sub> a multiple of 4. The reader is advised to look at Figure
<xref ref-type="fig" rid="F2">2</xref>
 with a 2-stage prefix removal example presenting two cases: one for a <italic>k</italic>
-mer starting with A and one for a <italic>k</italic>-mer starting with another symbol. The whole splitter operation is presented in Figure
<xref ref-type="fig" rid="F3">3</xref>
.</p>
<fig id="F2" position="float"><label>Figure 2</label>
<caption><p><bold>An example of 2-stage </bold>
<bold><italic>k</italic>
</bold>
<bold>-mer prefix removal, for one </bold>
<bold><italic>k</italic>
</bold>
<bold>-mer starting with symbol A and one starting with non-A.</bold>
 In total, 9 starting characters are effectively removed before storing the <italic>k</italic>
-mer on a temporary disk. The part of length <italic>p</italic>
<sub>1</sub>
 stands for the ID of the bin the given <italic>k</italic>
-mer is inserted into, and the part of length <italic>p</italic>
<sub>2</sub>
 is discarded to reduce the temporary storage (a way to recover later the removed part of length <italic>p</italic>
<sub>2</sub>
 is not shown here, for clarity; see text).</p>
</caption>
<graphic xlink:href="1471-2105-14-160-2"></graphic>
</fig>
<fig id="F3" position="float"><label>Figure 3</label>
<caption><p><bold>Algorithm of splitting </bold>
<bold><italic>k</italic>
</bold>
<bold>-mers in reads to bins according to their prefix.</bold>
 A full bin is compacted by sorting on <italic>p</italic>
<sub>2</sub>
-symbol long prefix and removing this prefix.</p>
</caption>
<graphic xlink:href="1471-2105-14-160-3"></graphic>
</fig>
<p>Now a single package maker thread comes into action. It prepares data to be sent to disk. More precisely, it moves the content from the queue of bins to another queue of “bin part packages” (Figure
<xref ref-type="fig" rid="F4">4</xref>
), which is handled by multiple threads. A compactor checks if it pays to strip extra 4 symbols from the prefix of each item in its package; the compaction criterion is satisfied if the prefix counters (statistics) together with the stripped data use less space than before the stripping. Now, the resulting data (possibly more compacted) are formed into one of many compact packages. Once the compact packages reach in total the specified maximum capacity, the largest of them is dumped to a file. Compacting the packages speeds up file handling and reduces file fragmentation. The <italic>k</italic>
-mers scattered over (usually) hundreds of files are the outcome of the first, i.e., distribution phase. Each file corresponds to a unique prefix of length <italic>p</italic>
<sub>1</sub>
. In each file, the <italic>k</italic>
-mers are also grouped by their successive <italic>p</italic>
<sub>2</sub>
 or <italic>p</italic>
<sub>2</sub>
+4 symbols. Assume for presentation clarity that the extra 4 symbols are not removed, and thus what is sent to files are (<italic>k</italic>
−<italic>p</italic>
<sub>1</sub>
−<italic>p</italic>
<sub>2</sub>
)-symbol long suffixes of the <italic>k</italic>-mers, packed into bytes. Additionally, each file contains
<inline-formula><mml:math id="M1" name="1471-2105-14-160-i1" overflow="scroll"><mml:msup><mml:mrow><mml:mn>4</mml:mn>
</mml:mrow>
<mml:mrow><mml:msub><mml:mrow><mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>
 prefix counts (each stored in 2 bytes) which enable to recover the <italic>p</italic>
<sub>2</sub>
-symbol long parts of the prefixes. In total, the used disk space, in the worst case and with classic counters, is approximately </p>
<p><disp-formula><mml:math id="M2" name="1471-2105-14-160-i2" overflow="scroll"><mml:mrow><mml:msub><mml:mrow><mml:mi>n</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>−</mml:mo>
<mml:msub><mml:mrow><mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>−</mml:mo>
<mml:msub><mml:mrow><mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
<mml:mo>/</mml:mo>
<mml:mn>4</mml:mn>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>·</mml:mo>
<mml:msup><mml:mrow><mml:mn>4</mml:mn>
</mml:mrow>
<mml:mrow><mml:msub><mml:mrow><mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msup>
<mml:mo>·</mml:mo>
<mml:msub><mml:mrow><mml:mi>n</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>/</mml:mo>
<mml:msup><mml:mrow><mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow><mml:mn>15</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mtext>bytes,</mml:mtext>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p> where <italic>n</italic>
<sub><italic>k</italic>
</sub>
=<italic>n</italic>
(<italic>r</italic>
−<italic>k</italic>
+1) is the number of <italic>k</italic>
-mers in the input data. Switching to Quake-compatible counters increases this worst-case estimate to </p>
<p><disp-formula><mml:math id="M3" name="1471-2105-14-160-i3" overflow="scroll"><mml:mrow><mml:msub><mml:mrow><mml:mi>n</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>−</mml:mo>
<mml:msub><mml:mrow><mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>−</mml:mo>
<mml:msub><mml:mrow><mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mn>16</mml:mn>
<mml:mo>)</mml:mo>
<mml:mo>/</mml:mo>
<mml:mn>4</mml:mn>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>·</mml:mo>
<mml:msup><mml:mrow><mml:mn>4</mml:mn>
</mml:mrow>
<mml:mrow><mml:msub><mml:mrow><mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msup>
<mml:mo>·</mml:mo>
<mml:msub><mml:mrow><mml:mi>n</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>/</mml:mo>
<mml:msup><mml:mrow><mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow><mml:mn>15</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mtext>bytes.</mml:mtext>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<fig id="F4" position="float"><label>Figure 4</label>
<caption><p>Algorithm of collecting compacted bin chunks from all splitting threads.</p>
</caption>
<graphic xlink:href="1471-2105-14-160-4"></graphic>
</fig>
<p>In practice, removing the prefixes reduces the disk usage by at least a few, and sometimes over 20 percent, depending on the value of <italic>k</italic>
. This has an analogous effect of reducing the I/O, which translates to similar overall performance improvement.</p>
</sec>
<sec><title>Sorting phase</title>
<p>The second phase starts with bin reader threads; there are as many of them as disks for temporary data. The bin readers read the files from disk to a queue.</p>
<p>Now, several sorter threads collect the data from the queue, uncompact the <italic>k</italic>
-mers (their <italic>p</italic>
<sub>2</sub>-symbol long prefix parts are brought back), sort them using multithreaded least significant digit (LSD) radix sort (the sort implementation is partially inspired by
[<xref ref-type="bibr" rid="B10">10</xref>
]), count their frequencies and discard <italic>k</italic>
-mers with out of thresholds counter values (based on default or user-selected settings). The input data to be sorted are divided evenly among threads.</p>
<p>The (remaining) <italic>k</italic>-mers, with their counts, are ready to be sent to disk (cf. Figure
<xref ref-type="fig" rid="F5">5</xref>
), but it is up to the next stage, the completer thread. The sorter threads submit the <italic>k</italic>
-mers with their statistics into a priority queue, with bin ID as the priority, which is then handled by the single completer thread. This thread reorganizes the sorted bins in the order of ascending bin ID. The priority queue is needed to send the statistics in the proper order. As this structure is relatively small, implementing it as a plain unsorted array with several hundred slots and linear scan for finding the minimum was enough for the application.</p>
<fig id="F5" position="float"><label>Figure 5</label>
<caption><p><bold>Algorithm of sorting </bold>
<bold><italic>k</italic>
</bold>
<bold>-mers within bins.</bold>
 Fast radix sort procedure is used here.</p>
</caption>
<graphic xlink:href="1471-2105-14-160-5"></graphic>
</fig>
<p>The scheme presented above depends on a number of parameters. By default, KMC works in an automatic mode, where the parameters are found with respect to the machine it is executed at; the number of CPU cores and the available amount of RAM are all taken into account. More details on the parameter setting are given in Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
: Table S6). (Manual setting is possible as well, though.)</p>
</sec>
</sec>
</sec>
<sec><title>Results and discussion</title>
<p>Our algorithm, called K-mer Counter (KMC), was implemented in C++11, using gcc compiler (version 4.7.1) for the linux build and Microsoft Visual Studio 2012 for the Windows build, and is freely available at <ext-link ext-link-type="uri" xlink:href="http://sun.aei.polsl.pl/kmc">http://sun.aei.polsl.pl/kmc</ext-link> (as well as is given as Additional files
<xref ref-type="supplementary-material" rid="S2">2</xref> and
<xref ref-type="supplementary-material" rid="S3">3</xref>
). The following well-known libraries were used: OpenMP, Boost for managing the threads and filesystem operations, zlib and libbzip for reading compressed FASTQ files, and asmlib (<ext-link ext-link-type="uri" xlink:href="http://www.agner.org/optimize/asmlib-instructions.pdf">http://www.agner.org/optimize/asmlib-instructions.pdf</ext-link>
) for a fast low-level memcpy<sup>d</sup>
 implementation.</p>
<p>Out of the five well-known <italic>k</italic>-mer counting tools, three were taken for the comparative tests, Jellyfish
[<xref ref-type="bibr" rid="B6">6</xref>], BFCounter
[<xref ref-type="bibr" rid="B7">7</xref>], and DSK
[<xref ref-type="bibr" rid="B9">9</xref>]. The other two, Tallymer
[<xref ref-type="bibr" rid="B3">3</xref>] and Meryl from the Celera assembler
[<xref ref-type="bibr" rid="B5">5</xref>], were tested in
[<xref ref-type="bibr" rid="B6">6</xref>
], on a 1 GB turkey genome, and we can find the following statement in the cited work: <italic>Jellyfish is also able to count 22-mers at coverage > 10× where the other programs fail or take over 5 h.</italic>
 Clearly, this makes them hard to use on human genome data, with 30-fold coverage.</p>
<p>Two test machines were used. One was a 4 AMD Opteron™6136 2.4 GHz CPUs (32 cores) server with 128 GB RAM, and fast RAID-0 disk matrix of total size 2.0 TB. The other was a “home” PC, with 6-core AMD Phenom II 1090 3.2 GHz CPU, 16 GB RAM and 3 SATA HDD of sizes 2 TB each. The hard disks at the PC machine were: two Seagate Barracuda Green ST2000DL003 with 5,900 rpm and one Hitachi Deskstar 7K3000 with 7,200 rpm.</p>
<p>The comparison includes total computation time and maximum RAM use. Moreover, the maximum disk use of the disk-based algorithms is recorded. Although the other tested programs for <italic>k</italic>
-mer counting, except for DSK, work only in RAM, we believe that using even several hundreds of gigabytes of disk space during the execution of KMC is a mild price for the achieved high efficiency, as disk space is almost two orders of magnitude cheaper than the RAM memory. KMC was run twelve times for each dataset and each tested <italic>k</italic>
: with 32 GB and 16 GB RAM limitation on the server machine and with 11 GB RAM limitation on the PC, with classic and Quake-compatible counters, and with gzipped and non-compressed input data in each configuration. The classic counters are just integers telling how many times a <italic>k</italic>
-mer occurs in the dataset. The Quake-compatible counters take into account the quality scores and are thus decimal-point numbers. The other programs used in the comparison, except DSK, optionally produce output results in Quake-compatible form.</p>
<p>We performed experiments on five datasets, three of which are presented below and an extra two in Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
.</p>
<p>These three datasets discussed here comprise: </p>
<p>•<italic>Homo sapiens</italic>
 NA19238 from 1000GP (<ext-link ext-link-type="uri" xlink:href="http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA19238/sequence_read/">http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA19238/sequence_read/</ext-link>), used also in
[<xref ref-type="bibr" rid="B7">7</xref>
],</p>
<p>•<italic>Homo sapiens</italic>
 HG02057 from 1000GP (<ext-link ext-link-type="uri" xlink:href="http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG02057/sequence_read/">http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG02057/sequence_read/</ext-link>
),</p>
<p>•<italic>Caenorhabditis elegans</italic>
 (<ext-link ext-link-type="uri" xlink:href="http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR065/SRR065390/">http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR065/SRR065390/</ext-link>
).</p>
<p>Their basic statistics, for <italic>k</italic>=25, are presented in Table
<xref ref-type="table" rid="T1">1</xref>
.</p>
<p>There were several problems to compare exactly the proposed algorithm against its competitors. For example, some datasets contain raw reads with occasional N symbols in the DNA stream. Jellyfish processes such reads but refrains from counting the <italic>k</italic>
-mers containing Ns (in KMC we handle this issue in the same way). To our knowledge, BFCounter treats Ns as As and DSK treats them as Gs. This is of course rather strange from biological point of view but since there are only very few Ns, the misinterpreted <italic>k</italic>
-mers do not affect noticeably the RAM occupancy and computation time. Another problem with BFCounter is that it fails when executed with <italic>k</italic>
>25 in the classic counters mode (the authors claim it should work for <italic>k</italic>
 up to 31, but we were not able to reproduce it). With the Quake-compatible counters it sometimes handles larger <italic>k</italic>
, sometimes fails (details in the tables).</p>
<p>KMC is clearly the most flexible software: it can work over wide ranges of <italic>k</italic>
 and also for both counter modes. Like DSK, but contrary to other solutions, KMC can also read gzipped FASTQ datasets (typically used in genomic repositories) which tends to improve overall processing time (due to reduced I/O and the possibility to read multiple gzip files, even located at multiple disks). It is important to note that KMC space resources are bounded: the RAM usage is user-selected and the upper bound on the amount of disk space can be calculated with the (approximate) formula given in Section Distribution phase, which depends only on standard input parameters (the number of reads, the read length, the value of <italic>k</italic>
). DSK is even better in this aspect: it allows to set the RAM and disk usage quite precisely, but of course choosing tight parameters comes at a price of increased I/O and thus overall processing time. On the other hand, BFCounter and Jellyfish require guessing the number of unique <italic>k</italic>
-mers in the dataset, and a significant deviation from the true value is likely to cost significantly in increased RAM usage and processing time. In fact, it is possible to bound (not very precisely) the RAM requirements for Jellyfish, which, for large enough data, results in two-stage processing. In the first stage the tool produces several hash tables, and then, in the second stage, it merges them. The price is, however, deterioration in speed. Moreover, this regime of work is viable only for classic counters, as for the Quake-compatible counters the amount of space Jellyfish needs in the second stage is huge and we were not able to test it on human datasets.</p>
<p>Based on the results, several observations can be made. We start from the two human datasets. The first, NA19238<sup>e</sup> (Table
<xref ref-type="table" rid="T2">2</xref>
), has variable-length reads, but most of them are short, of length 36 only. This fact poses a restriction on the maximum used value of <italic>k</italic>
 (31). BFCounter was many times slower than its competitors and the amount of RAM it used was not impressive either: up to 46 GB, i.e., less than Jellyfish (run with speed in mind, i.e., with possibly the smallest amount of allocated memory for which no hash table merging is needed) but more than KMC. KMC was consistently faster than Jellyfish (both with classic and Quake-compatible counters) and several times faster than DSK. On the other hand, DSK (run in this and all other tests with default settings) was clearly the most frugal in memory use (6 GB) but required about twice or more disk space than KMC. When the KMC’s input FASTQ data are gzipped (note that the input data consist of 463 gzipped files and several of them can be read in parallel), the gap in speed between KMC and Jellyfish sometimes exceeds factor 2. Although the speedup thanks to compressed input varies, it is often of the order of 20 percent or more. The amount of disk space needed by KMC is up to 141 GB in the classic counters mode and up to 321 GB with the Quake-compatible counters. While certainly not small, this is less than the size of the input (uncompressed) FASTQ file. Reducing the amount of available RAM from 32 GB to 16 GB for KMC results in about 10 percent slow-down.</p>
<table-wrap position="float" id="T2"><label>Table 2</label>
<caption><p><bold><italic>k</italic>
</bold>
<bold>-mers counting results for</bold>
<bold><italic>Homo sapiens</italic>
</bold>
<bold> NA19238 individual (353 GB FASTQ file or 463 gzipped FASTQ files of total size 116.6 GB)</bold>
</p>
</caption>
<table frame="hsides" rules="groups" border="1"><colgroup><col align="left"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
</colgroup>
<thead valign="top"><tr><th align="left" valign="bottom"> <hr></hr>
</th>
<th colspan="2" align="center" valign="bottom"><bold><italic>k=22</italic>
</bold>
<hr></hr>
</th>
<th colspan="2" align="center" valign="bottom"><bold><italic>k=25</italic>
</bold>
<hr></hr>
</th>
<th colspan="2" align="center" valign="bottom"><bold><italic>k=28</italic>
</bold>
<hr></hr>
</th>
<th colspan="2" align="center" valign="bottom"><bold><italic>k=31</italic>
</bold>
<hr></hr>
</th>
</tr>
<tr><th align="left"><bold>Algorithm</bold>
</th>
<th align="center"><bold>Space</bold>
</th>
<th align="center"><bold>Time</bold>
</th>
<th align="center"><bold>Space</bold>
</th>
<th align="center"><bold>Time</bold>
</th>
<th align="center"><bold>Space</bold>
</th>
<th align="center"><bold>Time</bold>
</th>
<th align="center"><bold>Space</bold>
</th>
<th align="center"><bold>Time</bold>
</th>
</tr>
</thead>
<tbody valign="top"><tr><td align="left" valign="bottom"> <hr></hr>
</td>
<td colspan="8" align="center" valign="bottom"><bold>Classic counters</bold>
<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom"> <hr></hr>
</td>
<td colspan="8" align="center" valign="bottom">32-core server<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">BFCounter<hr></hr>
</td>
<td align="center" valign="bottom">46/ 0<hr></hr>
</td>
<td align="center" valign="bottom">114,083<hr></hr>
</td>
<td align="center" valign="bottom">41/ 0<hr></hr>
</td>
<td align="center" valign="bottom">99,468<hr></hr>
</td>
<td colspan="2" align="center" valign="bottom">Failed<hr></hr>
</td>
<td colspan="2" align="center" valign="bottom">Failed<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">Jellyfish<hr></hr>
</td>
<td align="center" valign="bottom">50/ 0<hr></hr>
</td>
<td align="center" valign="bottom">2,303<hr></hr>
</td>
<td align="center" valign="bottom">64/ 0<hr></hr>
</td>
<td align="center" valign="bottom">2,258<hr></hr>
</td>
<td align="center" valign="bottom">75/ 0<hr></hr>
</td>
<td align="center" valign="bottom">2,208<hr></hr>
</td>
<td align="center" valign="bottom">87/ 0<hr></hr>
</td>
<td align="center" valign="bottom">2,107<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">Jellyfish<sup>1</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">27/ 39<hr></hr>
</td>
<td align="center" valign="bottom">2,964<hr></hr>
</td>
<td align="center" valign="bottom">33/ 36<hr></hr>
</td>
<td align="center" valign="bottom">2,769<hr></hr>
</td>
<td align="center" valign="bottom">21/ 27<hr></hr>
</td>
<td align="center" valign="bottom">2,673<hr></hr>
</td>
<td align="center" valign="bottom">24/ 22<hr></hr>
</td>
<td align="center" valign="bottom">2,511<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">DSK<hr></hr>
</td>
<td align="center" valign="bottom">6/200<hr></hr>
</td>
<td align="center" valign="bottom">6,490<hr></hr>
</td>
<td align="center" valign="bottom">6/340<hr></hr>
</td>
<td align="center" valign="bottom">6,020<hr></hr>
</td>
<td align="center" valign="bottom">6/280<hr></hr>
</td>
<td align="center" valign="bottom">5,115<hr></hr>
</td>
<td align="center" valign="bottom">6/221<hr></hr>
</td>
<td align="center" valign="bottom">4,215<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">DSK <sup><italic>g</italic>
<italic>z</italic>
</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">6/200<hr></hr>
</td>
<td align="center" valign="bottom">9,076<hr></hr>
</td>
<td align="center" valign="bottom">6/340<hr></hr>
</td>
<td align="center" valign="bottom">8,029<hr></hr>
</td>
<td align="center" valign="bottom">6/280<hr></hr>
</td>
<td align="center" valign="bottom">7,157<hr></hr>
</td>
<td align="center" valign="bottom">6/221<hr></hr>
</td>
<td align="center" valign="bottom">6,424<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC<hr></hr>
</td>
<td align="center" valign="bottom">32/104<hr></hr>
</td>
<td align="center" valign="bottom">1,405<hr></hr>
</td>
<td align="center" valign="bottom">32/130<hr></hr>
</td>
<td align="center" valign="bottom">1,488<hr></hr>
</td>
<td align="center" valign="bottom">32/133<hr></hr>
</td>
<td align="center" valign="bottom">1,522<hr></hr>
</td>
<td align="center" valign="bottom">32/121<hr></hr>
</td>
<td align="center" valign="bottom">1,471<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC<hr></hr>
</td>
<td align="center" valign="bottom">16/107<hr></hr>
</td>
<td align="center" valign="bottom">1,548<hr></hr>
</td>
<td align="center" valign="bottom">16/131<hr></hr>
</td>
<td align="center" valign="bottom">1,657<hr></hr>
</td>
<td align="center" valign="bottom">16/141<hr></hr>
</td>
<td align="center" valign="bottom">1,684<hr></hr>
</td>
<td align="center" valign="bottom">16/128<hr></hr>
</td>
<td align="center" valign="bottom">1,568<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC <sup><italic>g</italic>
<italic>z</italic>
</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">32/104<hr></hr>
</td>
<td align="center" valign="bottom">1,040<hr></hr>
</td>
<td align="center" valign="bottom">32/129<hr></hr>
</td>
<td align="center" valign="bottom">1,066<hr></hr>
</td>
<td align="center" valign="bottom">32/132<hr></hr>
</td>
<td align="center" valign="bottom">1,055<hr></hr>
</td>
<td align="center" valign="bottom">32/120<hr></hr>
</td>
<td align="center" valign="bottom">989<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC <sup><italic>g</italic>
<italic>z</italic>
</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">16/107<hr></hr>
</td>
<td align="center" valign="bottom">1,278<hr></hr>
</td>
<td align="center" valign="bottom">16/130<hr></hr>
</td>
<td align="center" valign="bottom">1,631<hr></hr>
</td>
<td align="center" valign="bottom">16/141<hr></hr>
</td>
<td align="center" valign="bottom">1,662<hr></hr>
</td>
<td align="center" valign="bottom">16/125<hr></hr>
</td>
<td align="center" valign="bottom">1,307<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom"> <hr></hr>
</td>
<td colspan="8" align="center" valign="bottom">6-core PC<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">DSK<hr></hr>
</td>
<td align="center" valign="bottom">6/200<hr></hr>
</td>
<td align="center" valign="bottom">21,468<hr></hr>
</td>
<td align="center" valign="bottom">6/340<hr></hr>
</td>
<td align="center" valign="bottom">18,774<hr></hr>
</td>
<td align="center" valign="bottom">6/280<hr></hr>
</td>
<td align="center" valign="bottom">15,384<hr></hr>
</td>
<td align="center" valign="bottom">6/221<hr></hr>
</td>
<td align="center" valign="bottom">11,857<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">DSK <sup><italic>g</italic>
<italic>z</italic>
</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">6/200<hr></hr>
</td>
<td align="center" valign="bottom">18,939<hr></hr>
</td>
<td align="center" valign="bottom">6/340<hr></hr>
</td>
<td align="center" valign="bottom">16,818<hr></hr>
</td>
<td align="center" valign="bottom">6/280<hr></hr>
</td>
<td align="center" valign="bottom">14,694<hr></hr>
</td>
<td align="center" valign="bottom">6/221<hr></hr>
</td>
<td align="center" valign="bottom">12,070<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC<hr></hr>
</td>
<td align="center" valign="bottom">11/107<hr></hr>
</td>
<td align="center" valign="bottom">3,482<hr></hr>
</td>
<td align="center" valign="bottom">11/128<hr></hr>
</td>
<td align="center" valign="bottom">3,442<hr></hr>
</td>
<td align="center" valign="bottom">11/138<hr></hr>
</td>
<td align="center" valign="bottom">3,584<hr></hr>
</td>
<td align="center" valign="bottom">11/127<hr></hr>
</td>
<td align="center" valign="bottom">3,515<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC <sup><italic>g</italic>
<italic>z</italic>
</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">11/107<hr></hr>
</td>
<td align="center" valign="bottom">2,198<hr></hr>
</td>
<td align="center" valign="bottom">11/128<hr></hr>
</td>
<td align="center" valign="bottom">2,206<hr></hr>
</td>
<td align="center" valign="bottom">11/138<hr></hr>
</td>
<td align="center" valign="bottom">2,303<hr></hr>
</td>
<td align="center" valign="bottom">11/127<hr></hr>
</td>
<td align="center" valign="bottom">2,365<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom"> <hr></hr>
</td>
<td colspan="8" align="center" valign="bottom"><bold>Quake-compatible counters</bold>
<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom"> <hr></hr>
</td>
<td colspan="8" align="center" valign="bottom">32-core server<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">BFCounter<hr></hr>
</td>
<td align="center" valign="bottom">70/ 0<hr></hr>
</td>
<td align="center" valign="bottom">171,888<hr></hr>
</td>
<td align="center" valign="bottom">72/ 0<hr></hr>
</td>
<td align="center" valign="bottom">180,861<hr></hr>
</td>
<td colspan="2" align="center" valign="bottom">Failed<hr></hr>
</td>
<td colspan="2" align="center" valign="bottom">Failed<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">Jellyfish<hr></hr>
</td>
<td align="center" valign="bottom">100/ 0<hr></hr>
</td>
<td align="center" valign="bottom">4,339<hr></hr>
</td>
<td align="center" valign="bottom">57/230<hr></hr>
</td>
<td align="center" valign="bottom">2,891 <sup>∗</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">64/192<hr></hr>
</td>
<td align="center" valign="bottom">3,246 <sup>∗</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">70/175<hr></hr>
</td>
<td align="center" valign="bottom">3,161 <sup>∗</sup>
<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC<hr></hr>
</td>
<td align="center" valign="bottom">32/311<hr></hr>
</td>
<td align="center" valign="bottom">2,585<hr></hr>
</td>
<td align="center" valign="bottom">32/302<hr></hr>
</td>
<td align="center" valign="bottom">2,467<hr></hr>
</td>
<td align="center" valign="bottom">32/282<hr></hr>
</td>
<td align="center" valign="bottom">2,347<hr></hr>
</td>
<td align="center" valign="bottom">32/237<hr></hr>
</td>
<td align="center" valign="bottom">2,106<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC<hr></hr>
</td>
<td align="center" valign="bottom">16/315<hr></hr>
</td>
<td align="center" valign="bottom">2,615<hr></hr>
</td>
<td align="center" valign="bottom">16/305<hr></hr>
</td>
<td align="center" valign="bottom">2,730<hr></hr>
</td>
<td align="center" valign="bottom">16/283<hr></hr>
</td>
<td align="center" valign="bottom">2,592<hr></hr>
</td>
<td align="center" valign="bottom">16/245<hr></hr>
</td>
<td align="center" valign="bottom">2,284<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC <sup><italic>g</italic>
<italic>z</italic>
</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">32/310<hr></hr>
</td>
<td align="center" valign="bottom">2,071<hr></hr>
</td>
<td align="center" valign="bottom">32/302<hr></hr>
</td>
<td align="center" valign="bottom">1,995<hr></hr>
</td>
<td align="center" valign="bottom">32/282<hr></hr>
</td>
<td align="center" valign="bottom">1,880<hr></hr>
</td>
<td align="center" valign="bottom">32/237<hr></hr>
</td>
<td align="center" valign="bottom">1,690<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC <sup><italic>g</italic>
<italic>z</italic>
</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">16/318<hr></hr>
</td>
<td align="center" valign="bottom">2,273<hr></hr>
</td>
<td align="center" valign="bottom">16/304<hr></hr>
</td>
<td align="center" valign="bottom">2,611<hr></hr>
</td>
<td align="center" valign="bottom">16/283<hr></hr>
</td>
<td align="center" valign="bottom">2,015<hr></hr>
</td>
<td align="center" valign="bottom">16/244<hr></hr>
</td>
<td align="center" valign="bottom">2,707<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom"> <hr></hr>
</td>
<td colspan="8" align="center" valign="bottom">6-core PC<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC<hr></hr>
</td>
<td align="center" valign="bottom">11/316<hr></hr>
</td>
<td align="center" valign="bottom">5,538<hr></hr>
</td>
<td align="center" valign="bottom">11/298<hr></hr>
</td>
<td align="center" valign="bottom">5,533<hr></hr>
</td>
<td align="center" valign="bottom">11/277<hr></hr>
</td>
<td align="center" valign="bottom">5,184<hr></hr>
</td>
<td align="center" valign="bottom">11/242<hr></hr>
</td>
<td align="center" valign="bottom">5,016<hr></hr>
</td>
</tr>
<tr><td align="left">KMC <sup><italic>g</italic>
<italic>z</italic>
</sup>
</td>
<td align="center">11/316</td>
<td align="center">5,370</td>
<td align="center">11/298</td>
<td align="center">5,060</td>
<td align="center">11/277</td>
<td align="center">4,708</td>
<td align="center">11/243</td>
<td align="center">4,643</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>RAM and disk spaces (the first and the second value in the column “Space”, respectively) are in GB (1GB= 2<sup>30</sup>
B). Time is in seconds. The test machines: 32-core server, 6-core PC (see more details in the text). Superscripts denote: <sup>1</sup>
—RAM limited to 36GB, <sup><italic>g</italic>
<italic>z</italic>
</sup>
—input data in gzipped files. The programs were used for the number of threads adjusted to the number of cores to achieve maximum speed. The asterisk signs (for Jellyfish) denote that two separate databases were constructed by Jellyfish due to the memory limit of the machine (128GB RAM) and Jellyfish reported that to merge these databases it needs more RAM, so these times are underestimated.</p>
</table-wrap-foot>
</table-wrap>
<p>The HG02057 data (Table
<xref ref-type="table" rid="T3">3</xref>
) are quite challenging, concerning their sheer volume. BFCounter was not tested here, since the experiments would take literally weeks. Jellyfish suffers from rapidly growing RAM usage for larger <italic>k</italic>
, and for <italic>k</italic>
=31 it needs 86 GB of memory. On the other hand, KMC can handle the dataset in 32 GB or even 16 GB of RAM, no matter the <italic>k</italic>
. In the runs with halved memory, KMC is (again) only by about 10 percent slower than with 32 GB, and requires a few percent more disk space. We admit that Jellyfish is usually faster on this dataset than KMC-32 GB, by about 10–20 percent. The penalty in RAM usage is severe though; in the Quake-compatible mode Jellyfish couldn’t properly finish some runs on the server machine with 128 GB of RAM, i.e., produced two output files and was unable to merge them in the available memory (the corresponding results are marked with an asterisk). The analogous weakness of KMC, using up to 553 GB of temporary disk space, is less painful (not only disk space is much cheaper than internal memory, but also adding a disk to the system is usually easier and not so limited as expanding RAM). We note that Jellyfish’s problems with memory are partly related to its dependence on the estimate of the number of unique <italic>k</italic>
-mers. Jellyfish works fastest when the whole hash table it needs fits the RAM memory. Alas, it requires knowing (approximately) the number of unique <italic>k</italic>
-mers. If the specified parameter is too small, Jellyfish creates several temporary files on disk, which are at the end merged; an operation not only time-consuming, but also demanding in memory, as our experiment showed. Again, for this dataset DSK may be a tool of choice on a less powerful (e.g., laptop) computer, since it uses only 6 GB of RAM and usually less disk space than KMC (which is at least 4 times faster though).</p>
<table-wrap position="float" id="T3"><label>Table 3</label>
<caption><p><bold><italic>k</italic>
</bold>
<bold>-mers counting results for</bold>
<bold><italic>Homo sapiens</italic>
</bold>
<bold> HG02057 individual (208 GB FASTQ file or 6 gzipped FASTQ files of total size 65.9 GB)</bold>
</p>
</caption>
<table frame="hsides" rules="groups" border="1"><colgroup><col align="left"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
</colgroup>
<thead valign="top"><tr><th align="left" valign="bottom"> <hr></hr>
</th>
<th colspan="2" align="center" valign="bottom"><bold><italic>k=22</italic>
</bold>
<hr></hr>
</th>
<th colspan="2" align="center" valign="bottom"><bold><italic>k=28</italic>
</bold>
<hr></hr>
</th>
<th colspan="2" align="center" valign="bottom"><bold><italic>k=40</italic>
</bold>
<hr></hr>
</th>
<th colspan="2" align="center" valign="bottom"><bold><italic>k=55</italic>
</bold>
<hr></hr>
</th>
</tr>
<tr><th align="left"><bold>Algorithm</bold>
</th>
<th align="center"><bold>Space</bold>
</th>
<th align="center"><bold>Time</bold>
</th>
<th align="center"><bold>Space</bold>
</th>
<th align="center"><bold>Time</bold>
</th>
<th align="center"><bold>Space</bold>
</th>
<th align="center"><bold>Time</bold>
</th>
<th align="center"><bold>Space</bold>
</th>
<th align="center"><bold>Time</bold>
</th>
</tr>
</thead>
<tbody valign="top"><tr><td align="left" valign="bottom"> <hr></hr>
</td>
<td colspan="8" align="center" valign="bottom"><bold>Classic counters</bold>
<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom"> <hr></hr>
</td>
<td colspan="8" align="center" valign="bottom">32-core server<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">Jellyfish<hr></hr>
</td>
<td align="center" valign="bottom">27/ 0<hr></hr>
</td>
<td align="center" valign="bottom">1,375<hr></hr>
</td>
<td align="center" valign="bottom">75/ 0<hr></hr>
</td>
<td align="center" valign="bottom">1,433<hr></hr>
</td>
<td colspan="2" align="center" valign="bottom">—<hr></hr>
</td>
<td colspan="2" align="center" valign="bottom">—<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">Jellyfish<sup>1</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">27/ 0<hr></hr>
</td>
<td align="center" valign="bottom">1,375<hr></hr>
</td>
<td align="center" valign="bottom">21/ 53<hr></hr>
</td>
<td align="center" valign="bottom">2,404<hr></hr>
</td>
<td colspan="2" align="center" valign="bottom">—<hr></hr>
</td>
<td colspan="2" align="center" valign="bottom">—<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">DSK<hr></hr>
</td>
<td align="center" valign="bottom">6/168<hr></hr>
</td>
<td align="center" valign="bottom">8,683<hr></hr>
</td>
<td align="center" valign="bottom">6/156<hr></hr>
</td>
<td align="center" valign="bottom">9,073<hr></hr>
</td>
<td align="center" valign="bottom">6/195<hr></hr>
</td>
<td align="center" valign="bottom">13,172<hr></hr>
</td>
<td align="center" valign="bottom">6/197<hr></hr>
</td>
<td align="center" valign="bottom">9,409<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">DSK <sup><italic>g</italic>
<italic>z</italic>
</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">6/168<hr></hr>
</td>
<td align="center" valign="bottom">10,125<hr></hr>
</td>
<td align="center" valign="bottom">6/156<hr></hr>
</td>
<td align="center" valign="bottom">10,579<hr></hr>
</td>
<td align="center" valign="bottom">6/195<hr></hr>
</td>
<td align="center" valign="bottom">14,569<hr></hr>
</td>
<td align="center" valign="bottom">6/197<hr></hr>
</td>
<td align="center" valign="bottom">10,987<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC<hr></hr>
</td>
<td align="center" valign="bottom">32/130<hr></hr>
</td>
<td align="center" valign="bottom">1,221<hr></hr>
</td>
<td align="center" valign="bottom">32/220<hr></hr>
</td>
<td align="center" valign="bottom">1,706<hr></hr>
</td>
<td align="center" valign="bottom">32/341<hr></hr>
</td>
<td align="center" valign="bottom">2,486<hr></hr>
</td>
<td align="center" valign="bottom">32/391<hr></hr>
</td>
<td align="center" valign="bottom">2,722<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC<hr></hr>
</td>
<td align="center" valign="bottom">16/134<hr></hr>
</td>
<td align="center" valign="bottom">1,376<hr></hr>
</td>
<td align="center" valign="bottom">16/234<hr></hr>
</td>
<td align="center" valign="bottom">1,872<hr></hr>
</td>
<td align="center" valign="bottom">16/343<hr></hr>
</td>
<td align="center" valign="bottom">2,664<hr></hr>
</td>
<td align="center" valign="bottom">16/405<hr></hr>
</td>
<td align="center" valign="bottom">2,967<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC <sup><italic>g</italic>
<italic>z</italic>
</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">32/130<hr></hr>
</td>
<td align="center" valign="bottom">1,249<hr></hr>
</td>
<td align="center" valign="bottom">32/219<hr></hr>
</td>
<td align="center" valign="bottom">1,505<hr></hr>
</td>
<td align="center" valign="bottom">32/342<hr></hr>
</td>
<td align="center" valign="bottom">2,304<hr></hr>
</td>
<td align="center" valign="bottom">32/391<hr></hr>
</td>
<td align="center" valign="bottom">2,597<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC <sup><italic>g</italic>
<italic>z</italic>
</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">16/134<hr></hr>
</td>
<td align="center" valign="bottom">1,195<hr></hr>
</td>
<td align="center" valign="bottom">16/234<hr></hr>
</td>
<td align="center" valign="bottom">1,732<hr></hr>
</td>
<td align="center" valign="bottom">16/343<hr></hr>
</td>
<td align="center" valign="bottom">2,479<hr></hr>
</td>
<td align="center" valign="bottom">16/403<hr></hr>
</td>
<td align="center" valign="bottom">2,909<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom"> <hr></hr>
</td>
<td colspan="8" align="center" valign="bottom">6-core PC<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">DSK<hr></hr>
</td>
<td align="center" valign="bottom">6/168<hr></hr>
</td>
<td align="center" valign="bottom">22,963<hr></hr>
</td>
<td align="center" valign="bottom">6/156<hr></hr>
</td>
<td align="center" valign="bottom">23,512<hr></hr>
</td>
<td align="center" valign="bottom">6/195<hr></hr>
</td>
<td align="center" valign="bottom">37,958<hr></hr>
</td>
<td align="center" valign="bottom">6/197<hr></hr>
</td>
<td align="center" valign="bottom">28,681<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">DSK <sup><italic>g</italic>
<italic>z</italic>
</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">6/168<hr></hr>
</td>
<td align="center" valign="bottom">21,688<hr></hr>
</td>
<td align="center" valign="bottom">6/156<hr></hr>
</td>
<td align="center" valign="bottom">22,061<hr></hr>
</td>
<td align="center" valign="bottom">6/195<hr></hr>
</td>
<td align="center" valign="bottom">36,328<hr></hr>
</td>
<td align="center" valign="bottom">6/197<hr></hr>
</td>
<td align="center" valign="bottom">26,584<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC<hr></hr>
</td>
<td align="center" valign="bottom">11/137<hr></hr>
</td>
<td align="center" valign="bottom">2,939<hr></hr>
</td>
<td align="center" valign="bottom">11/234<hr></hr>
</td>
<td align="center" valign="bottom">3,782<hr></hr>
</td>
<td align="center" valign="bottom">11/343<hr></hr>
</td>
<td align="center" valign="bottom">6,133<hr></hr>
</td>
<td align="center" valign="bottom">11/405<hr></hr>
</td>
<td align="center" valign="bottom">7,770<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC <sup><italic>g</italic>
<italic>z</italic>
</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">11/136<hr></hr>
</td>
<td align="center" valign="bottom">2,623<hr></hr>
</td>
<td align="center" valign="bottom">11/235<hr></hr>
</td>
<td align="center" valign="bottom">4,041<hr></hr>
</td>
<td align="center" valign="bottom">11/343<hr></hr>
</td>
<td align="center" valign="bottom">6,306<hr></hr>
</td>
<td align="center" valign="bottom">11/405<hr></hr>
</td>
<td align="center" valign="bottom">7,020<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom"> <hr></hr>
</td>
<td colspan="8" align="center" valign="bottom"><bold>Quake-compatible counters</bold>
<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom"> <hr></hr>
</td>
<td colspan="8" align="center" valign="bottom">32-core server<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">Jellyfish<hr></hr>
</td>
<td align="center" valign="bottom">51/ 0<hr></hr>
</td>
<td align="center" valign="bottom">2,426<hr></hr>
</td>
<td align="center" valign="bottom">59/126<hr></hr>
</td>
<td align="center" valign="bottom">2,503 <sup>∗</sup>
<hr></hr>
</td>
<td colspan="2" align="center" valign="bottom">—<hr></hr>
</td>
<td colspan="2" align="center" valign="bottom">—<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC<hr></hr>
</td>
<td align="center" valign="bottom">32/388<hr></hr>
</td>
<td align="center" valign="bottom">2,612<hr></hr>
</td>
<td align="center" valign="bottom">32/468<hr></hr>
</td>
<td align="center" valign="bottom">3,011<hr></hr>
</td>
<td align="center" valign="bottom">32/537<hr></hr>
</td>
<td align="center" valign="bottom">3,541<hr></hr>
</td>
<td align="center" valign="bottom">32/542<hr></hr>
</td>
<td align="center" valign="bottom">3,546<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC<hr></hr>
</td>
<td align="center" valign="bottom">16/402<hr></hr>
</td>
<td align="center" valign="bottom">2,990<hr></hr>
</td>
<td align="center" valign="bottom">16/468<hr></hr>
</td>
<td align="center" valign="bottom">3,405<hr></hr>
</td>
<td align="center" valign="bottom">16/539<hr></hr>
</td>
<td align="center" valign="bottom">4,300<hr></hr>
</td>
<td align="center" valign="bottom">16/552<hr></hr>
</td>
<td align="center" valign="bottom">4,175<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC <sup><italic>g</italic>
<italic>z</italic>
</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">32/387<hr></hr>
</td>
<td align="center" valign="bottom">2,409<hr></hr>
</td>
<td align="center" valign="bottom">32/468<hr></hr>
</td>
<td align="center" valign="bottom">2,860<hr></hr>
</td>
<td align="center" valign="bottom">32/537<hr></hr>
</td>
<td align="center" valign="bottom">3,370<hr></hr>
</td>
<td align="center" valign="bottom">32/536<hr></hr>
</td>
<td align="center" valign="bottom">3,357<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC <sup><italic>g</italic>
<italic>z</italic>
</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">16/400<hr></hr>
</td>
<td align="center" valign="bottom">2,760<hr></hr>
</td>
<td align="center" valign="bottom">16/468<hr></hr>
</td>
<td align="center" valign="bottom">3,285<hr></hr>
</td>
<td align="center" valign="bottom">16/498<hr></hr>
</td>
<td align="center" valign="bottom">4,083<hr></hr>
</td>
<td align="center" valign="bottom">16/552<hr></hr>
</td>
<td align="center" valign="bottom">4,038<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom"> <hr></hr>
</td>
<td colspan="8" align="center" valign="bottom">6-core PC<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC<hr></hr>
</td>
<td align="center" valign="bottom">11/404<hr></hr>
</td>
<td align="center" valign="bottom">6,625<hr></hr>
</td>
<td align="center" valign="bottom">11/469<hr></hr>
</td>
<td align="center" valign="bottom">7,741<hr></hr>
</td>
<td align="center" valign="bottom">11/539<hr></hr>
</td>
<td align="center" valign="bottom">9,673<hr></hr>
</td>
<td align="center" valign="bottom">11/552<hr></hr>
</td>
<td align="center" valign="bottom">11,135<hr></hr>
</td>
</tr>
<tr><td align="left">KMC <sup><italic>g</italic>
<italic>z</italic>
</sup>
</td>
<td align="center">11/403</td>
<td align="center">6,783</td>
<td align="center">11/468</td>
<td align="center">8,034</td>
<td align="center">11/539</td>
<td align="center">9,764</td>
<td align="center">11/553</td>
<td align="center">9,775</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>Test methodology and column description are just as for Table
<xref ref-type="table" rid="T2">2</xref>
. The asterisk sign (for Jellyfish) denotes that two separate databases were constructed by Jellyfish due to the memory limit of the machine (128GB RAM) and Jellyfish reported that to merge these databases it needs more RAM, so these times are underestimated.</p>
</table-wrap-foot>
</table-wrap>
<p>On the smallest of the tested datasets, <italic>C. elegans</italic> (Table
<xref ref-type="table" rid="T4">4</xref>
), our tool, KMC, cannot fully spread its wings and achieve better results than Jellyfish. In the amount of used RAM memory they are comparable, with 21–26 GB used by Jellyfish, slightly growing with chosen <italic>k</italic>, and (selectable) 16 GB or 32 GB spent by KMC. In speed, Jellyfish was in most cases faster by 10–30 percent than KMC (but the speed varied somewhat; for more results see Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
: Table S2). BFCounter used here the least amount of RAM memory (4 GB) but was about 40 times slower than KMC and Jellyfish. DSK used 5 GB of RAM and less disk space than KMC, but was a few times slower than KMC and Jellyfish.</p>
<table-wrap position="float" id="T4"><label>Table 4</label>
<caption><p><bold><italic>k</italic>
</bold>
<bold>-mers counting results for</bold>
<bold><italic>Caenorhabditis elegans</italic>
</bold>
<bold> (16.4 GB FASTQ file or 2 gzipped FASTQ files of total size 4.6 GB)</bold>
</p>
</caption>
<table frame="hsides" rules="groups" border="1"><colgroup><col align="left"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
</colgroup>
<thead valign="top"><tr><th align="left" valign="bottom"> <hr></hr>
</th>
<th colspan="2" align="center" valign="bottom"><bold><italic>k=22</italic>
</bold>
<hr></hr>
</th>
<th colspan="2" align="center" valign="bottom"><bold><italic>k=28</italic>
</bold>
<hr></hr>
</th>
<th colspan="2" align="center" valign="bottom"><bold><italic>k=40</italic>
</bold>
<hr></hr>
</th>
<th colspan="2" align="center" valign="bottom"><bold><italic>k=55</italic>
</bold>
<hr></hr>
</th>
</tr>
<tr><th align="left"><bold>Algorithm</bold>
</th>
<th align="center"><bold>Space</bold>
</th>
<th align="center"><bold>Time</bold>
</th>
<th align="center"><bold>Space</bold>
</th>
<th align="center"><bold>Time</bold>
</th>
<th align="center"><bold>Space</bold>
</th>
<th align="center"><bold>Time</bold>
</th>
<th align="center"><bold>Space</bold>
</th>
<th align="center"><bold>Time</bold>
</th>
</tr>
</thead>
<tbody valign="top"><tr><td align="left" valign="bottom"> <hr></hr>
</td>
<td colspan="8" align="center" valign="bottom"><bold>Classic counters</bold>
<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom"> <hr></hr>
</td>
<td colspan="8" align="center" valign="bottom">32-core server<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">BFCounter<hr></hr>
</td>
<td align="center" valign="bottom">4/ 0<hr></hr>
</td>
<td align="center" valign="bottom">10,407<hr></hr>
</td>
<td colspan="2" align="center" valign="bottom">Failed<hr></hr>
</td>
<td colspan="2" align="center" valign="bottom">Failed<hr></hr>
</td>
<td colspan="2" align="center" valign="bottom">Failed<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">Jellyfish<hr></hr>
</td>
<td align="center" valign="bottom">21/ 0<hr></hr>
</td>
<td align="center" valign="bottom">88<hr></hr>
</td>
<td align="center" valign="bottom">22/ 0<hr></hr>
</td>
<td align="center" valign="bottom">89<hr></hr>
</td>
<td colspan="2" align="center" valign="bottom">—<hr></hr>
</td>
<td colspan="2" align="center" valign="bottom">—<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">DSK<hr></hr>
</td>
<td align="center" valign="bottom">5/13<hr></hr>
</td>
<td align="center" valign="bottom">493<hr></hr>
</td>
<td align="center" valign="bottom">5/13<hr></hr>
</td>
<td align="center" valign="bottom">458<hr></hr>
</td>
<td align="center" valign="bottom">5/16<hr></hr>
</td>
<td align="center" valign="bottom">647<hr></hr>
</td>
<td align="center" valign="bottom">5/16<hr></hr>
</td>
<td align="center" valign="bottom">481<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">DSK <sup><italic>g</italic>
<italic>z</italic>
</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">5/13<hr></hr>
</td>
<td align="center" valign="bottom">587<hr></hr>
</td>
<td align="center" valign="bottom">5/13<hr></hr>
</td>
<td align="center" valign="bottom">564<hr></hr>
</td>
<td align="center" valign="bottom">5/16<hr></hr>
</td>
<td align="center" valign="bottom">761<hr></hr>
</td>
<td align="center" valign="bottom">5/16<hr></hr>
</td>
<td align="center" valign="bottom">592<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC<hr></hr>
</td>
<td align="center" valign="bottom">32/10<hr></hr>
</td>
<td align="center" valign="bottom">105<hr></hr>
</td>
<td align="center" valign="bottom">32/18<hr></hr>
</td>
<td align="center" valign="bottom">115<hr></hr>
</td>
<td align="center" valign="bottom">32/27<hr></hr>
</td>
<td align="center" valign="bottom">185<hr></hr>
</td>
<td align="center" valign="bottom">32/31<hr></hr>
</td>
<td align="center" valign="bottom">191<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC<hr></hr>
</td>
<td align="center" valign="bottom">16/11<hr></hr>
</td>
<td align="center" valign="bottom">93<hr></hr>
</td>
<td align="center" valign="bottom">16/18<hr></hr>
</td>
<td align="center" valign="bottom">93<hr></hr>
</td>
<td align="center" valign="bottom">16/27<hr></hr>
</td>
<td align="center" valign="bottom">163<hr></hr>
</td>
<td align="center" valign="bottom">16/32<hr></hr>
</td>
<td align="center" valign="bottom">160<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC <sup><italic>g</italic>
<italic>z</italic>
</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">32/10<hr></hr>
</td>
<td align="center" valign="bottom">134<hr></hr>
</td>
<td align="center" valign="bottom">32/18<hr></hr>
</td>
<td align="center" valign="bottom">138<hr></hr>
</td>
<td align="center" valign="bottom">32/27<hr></hr>
</td>
<td align="center" valign="bottom">199<hr></hr>
</td>
<td align="center" valign="bottom">32/31<hr></hr>
</td>
<td align="center" valign="bottom">203<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC <sup><italic>g</italic>
<italic>z</italic>
</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">16/11<hr></hr>
</td>
<td align="center" valign="bottom">119<hr></hr>
</td>
<td align="center" valign="bottom">16/18<hr></hr>
</td>
<td align="center" valign="bottom">121<hr></hr>
</td>
<td align="center" valign="bottom">16/27<hr></hr>
</td>
<td align="center" valign="bottom">195<hr></hr>
</td>
<td align="center" valign="bottom">16/32<hr></hr>
</td>
<td align="center" valign="bottom">182<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom"> <hr></hr>
</td>
<td colspan="8" align="center" valign="bottom">6-core PC<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">DSK<hr></hr>
</td>
<td align="center" valign="bottom">5/13<hr></hr>
</td>
<td align="center" valign="bottom">1,307<hr></hr>
</td>
<td align="center" valign="bottom">5/13<hr></hr>
</td>
<td align="center" valign="bottom">1,215<hr></hr>
</td>
<td align="center" valign="bottom">5/16<hr></hr>
</td>
<td align="center" valign="bottom">2,073<hr></hr>
</td>
<td align="center" valign="bottom">5/16<hr></hr>
</td>
<td align="center" valign="bottom">1,597<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">DSK <sup><italic>g</italic>
<italic>z</italic>
</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">5/13<hr></hr>
</td>
<td align="center" valign="bottom">1,268<hr></hr>
</td>
<td align="center" valign="bottom">5/13<hr></hr>
</td>
<td align="center" valign="bottom">1,151<hr></hr>
</td>
<td align="center" valign="bottom">5/16<hr></hr>
</td>
<td align="center" valign="bottom">2,017<hr></hr>
</td>
<td align="center" valign="bottom">5/16<hr></hr>
</td>
<td align="center" valign="bottom">1,518<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC<hr></hr>
</td>
<td align="center" valign="bottom">11/11<hr></hr>
</td>
<td align="center" valign="bottom">274<hr></hr>
</td>
<td align="center" valign="bottom">11/18<hr></hr>
</td>
<td align="center" valign="bottom">343<hr></hr>
</td>
<td align="center" valign="bottom">11/27<hr></hr>
</td>
<td align="center" valign="bottom">507<hr></hr>
</td>
<td align="center" valign="bottom">11/32<hr></hr>
</td>
<td align="center" valign="bottom">553<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC <sup><italic>g</italic>
<italic>z</italic>
</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">11/11<hr></hr>
</td>
<td align="center" valign="bottom">233<hr></hr>
</td>
<td align="center" valign="bottom">11/18<hr></hr>
</td>
<td align="center" valign="bottom">333<hr></hr>
</td>
<td align="center" valign="bottom">11/27<hr></hr>
</td>
<td align="center" valign="bottom">514<hr></hr>
</td>
<td align="center" valign="bottom">11/32<hr></hr>
</td>
<td align="center" valign="bottom">542<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom"> <hr></hr>
</td>
<td colspan="8" align="center" valign="bottom"><bold>Quake-compatible counters</bold>
<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom"> <hr></hr>
</td>
<td colspan="8" align="center" valign="bottom">32-core server<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">BFCounter<hr></hr>
</td>
<td align="center" valign="bottom">4/ 0<hr></hr>
</td>
<td align="center" valign="bottom">10,349<hr></hr>
</td>
<td align="center" valign="bottom">4/ 0<hr></hr>
</td>
<td align="center" valign="bottom">9,527<hr></hr>
</td>
<td align="center" valign="bottom">4/ 0<hr></hr>
</td>
<td align="center" valign="bottom">8,213<hr></hr>
</td>
<td align="center" valign="bottom">4/ 0<hr></hr>
</td>
<td align="center" valign="bottom">6,689<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">Jellyfish<hr></hr>
</td>
<td align="center" valign="bottom">24/ 0<hr></hr>
</td>
<td align="center" valign="bottom">143<hr></hr>
</td>
<td align="center" valign="bottom">25/ 0<hr></hr>
</td>
<td align="center" valign="bottom">132<hr></hr>
</td>
<td colspan="2" align="center" valign="bottom">—<hr></hr>
</td>
<td colspan="2" align="center" valign="bottom">—<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC<hr></hr>
</td>
<td align="center" valign="bottom">32/31<hr></hr>
</td>
<td align="center" valign="bottom">173<hr></hr>
</td>
<td align="center" valign="bottom">32/37<hr></hr>
</td>
<td align="center" valign="bottom">179<hr></hr>
</td>
<td align="center" valign="bottom">32/42<hr></hr>
</td>
<td align="center" valign="bottom">245<hr></hr>
</td>
<td align="center" valign="bottom">32/43<hr></hr>
</td>
<td align="center" valign="bottom">243<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC<hr></hr>
</td>
<td align="center" valign="bottom">16/32<hr></hr>
</td>
<td align="center" valign="bottom">166<hr></hr>
</td>
<td align="center" valign="bottom">16/37<hr></hr>
</td>
<td align="center" valign="bottom">154<hr></hr>
</td>
<td align="center" valign="bottom">16/43<hr></hr>
</td>
<td align="center" valign="bottom">212<hr></hr>
</td>
<td align="center" valign="bottom">16/44<hr></hr>
</td>
<td align="center" valign="bottom">214<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC <sup><italic>g</italic>
<italic>z</italic>
</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">32/31<hr></hr>
</td>
<td align="center" valign="bottom">184<hr></hr>
</td>
<td align="center" valign="bottom">32/37<hr></hr>
</td>
<td align="center" valign="bottom">188<hr></hr>
</td>
<td align="center" valign="bottom">32/43<hr></hr>
</td>
<td align="center" valign="bottom">247<hr></hr>
</td>
<td align="center" valign="bottom">32/43<hr></hr>
</td>
<td align="center" valign="bottom">249<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC <sup><italic>g</italic>
<italic>z</italic>
</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">16/32<hr></hr>
</td>
<td align="center" valign="bottom">168<hr></hr>
</td>
<td align="center" valign="bottom">16/37<hr></hr>
</td>
<td align="center" valign="bottom">165<hr></hr>
</td>
<td align="center" valign="bottom">16/43<hr></hr>
</td>
<td align="center" valign="bottom">218<hr></hr>
</td>
<td align="center" valign="bottom">16/44<hr></hr>
</td>
<td align="center" valign="bottom">230<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom"> <hr></hr>
</td>
<td colspan="8" align="center" valign="bottom">6-core PC<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">KMC<hr></hr>
</td>
<td align="center" valign="bottom">11/32<hr></hr>
</td>
<td align="center" valign="bottom">562<hr></hr>
</td>
<td align="center" valign="bottom">11/37<hr></hr>
</td>
<td align="center" valign="bottom">635<hr></hr>
</td>
<td align="center" valign="bottom">11/43<hr></hr>
</td>
<td align="center" valign="bottom">750<hr></hr>
</td>
<td align="center" valign="bottom">11/44<hr></hr>
</td>
<td align="center" valign="bottom">784<hr></hr>
</td>
</tr>
<tr><td align="left">KMC <sup><italic>g</italic>
<italic>z</italic>
</sup>
</td>
<td align="center">11/32</td>
<td align="center">555</td>
<td align="center">11/37</td>
<td align="center">627</td>
<td align="center">11/43</td>
<td align="center">749</td>
<td align="center">11/44</td>
<td align="center">772</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>As expected, KMC uses more time and disk space in the Quake-compatible counter mode, but in most cases these differences are by a factor smaller than 2. With growing <italic>k</italic>
, the computational resources increase (until <italic>k</italic> becomes quite close to the read length), in accordance to the pattern demonstrated in Figure
<xref ref-type="fig" rid="F6">6</xref>
.</p>
<fig id="F6" position="float"><label>Figure 6</label>
<caption><p><bold>The KMC processing time (red lines) and used disk space (blue lines) as a function of k for (a) C. elegans dataset, (b) HG02057 dataset.</bold>
 The solid lines are for classic counters, the dashed lines for Quake-compatible counters.</p>
</caption>
<graphic xlink:href="1471-2105-14-160-6"></graphic>
</fig>
<p>The comparisons with other tools were run on our stronger machine, but KMC is also tested on a PC, where 11 GB of RAM was always used. It is about 3 times slower and uses a comparable amount of disk space.</p>
<p>Figure
<xref ref-type="fig" rid="F6">6</xref>
 presents the computation time (red lines) and the used disk space (blue lines) for <italic>C. elegans</italic>
 and HG02057, for varying <italic>k</italic>
. The read length <italic>r</italic>
 for both datasets is 100. On the charts, the solid lines are for the occurrence count mode and the dashed lines for the Quake-compatible mode. The statistics were gathered for counts at least 2 in the former and at least 2.0 in the latter mode. We focus on the case of <italic>C. elegans</italic>
 (a), although similar observations pertain to the other dataset. Not surprisingly, the Quake-related mode is more demanding in computation time and disk space, but the relative gap diminishes with growing <italic>k</italic>
. The time grows suddenly when <italic>k</italic>
 exceeds thresholds 32 and 64, as more machine words are then needed to store a whole <italic>k</italic>
-mer. As <italic>k</italic>
 was approaching <italic>r</italic>
, KMC worked faster and in less space, because the number of <italic>k</italic>
-mers per read was diminishing relatively fast. Another clear observation is that the processing time and used disk space are closely related; more precisely, the overall time and the amount of I/O are closely related in our algorithm.</p>
<p>It may also be an interesting question how the processing time and the amount of used disk space in KMC vary with different specified amounts of RAM (i.e., if it pays to give it more RAM when it is available). The results, presented in Figure
<xref ref-type="fig" rid="F7">7</xref>
, concern the HG02057 dataset in gzipped input representation, for <italic>k</italic>
=40 and in the classic counters mode. Using more than 32 GB of RAM is even detrimental for the KMC speed, although the loss is slight up to 80 GB (becomes somewhat more significant with 96 GB). What is perhaps more important practically, using 24 GB of RAM is almost as good as 32 GB. Another observation is that more RAM translates into less disk space needed, but this effect is mild (about 6 percent difference between the extreme values).</p>
<fig id="F7" position="float"><label>Figure 7</label>
<caption><p><bold>HG02057 (gzipped) dataset, </bold>
<bold><italic>k=40 </italic>
</bold>
<bold>, classic counters.</bold>
 The KMC processing time (red line) and used disk space (blue line) as a function of allowed RAM memory.</p>
</caption>
<graphic xlink:href="1471-2105-14-160-7"></graphic>
</fig>
<p>A related experiment, on the same dataset, concerned the impact of the number of hard disks available in the PC test machine (Figure
<xref ref-type="fig" rid="F8">8</xref>
). While finding a precise formula would be difficult and dependent on many factors (gzipped or non-gzipped input, standard or Quake-compatible counters, etc.), this experiment clearly shows that using more than one disk is beneficial, and 3 disks reduce the overall processing time sometimes by even more than 50% compared to a single disk. This observation confirms that the overall KMC performance strongly depends on the I/O (sub)system and supplying the platform with SSD disk(s), with approximately 3 times faster transfer rates and 2 orders of magnitude shorter access times, should give an extra boost.</p>
<fig id="F8" position="float"><label>Figure 8</label>
<caption><p><bold>HG02057 (gzipped) dataset, classic counters.</bold>
 Influence of the no. of HDDs on the processing time: 3 HDDs (green line), 2 HDDs (blue line), and 1 HDD (red line).</p>
</caption>
<graphic xlink:href="1471-2105-14-160-8"></graphic>
</fig>
<p>The architecture of KMC fits quite well the MapReduce (MR) paradigm
[<xref ref-type="bibr" rid="B11">11</xref>
], widely used in (but not limited to) cloud computing. Using this framework directly would be inefficient for the <italic>k</italic>
-mer counting problem, due to enormous I/O and communication costs. In KMC the <italic>k</italic>
-mer counting threads make use of the available RAM memory, and once their buffers are filled up, they sent statistics onto disk. Hence, these threads resemble the Combiner function in MR, which typically digests (“reduces”) the data produced by the Map function and outputs them to intermediate file(s). The <italic>k</italic>
-mer statistics from disk are read and processed by other, merging, threads, making an analogy to the Reduce function in MR.</p>
</sec>
<sec sec-type="conclusions"><title>Conclusion</title>
<p>High utilization of available resources is the key to obtaining competitive algorithms. Even home computers, equipped with multi-core CPUs, several gigabytes of RAM and a few fast TB-scale hard disks get powerful enough to be applied for real bioinformatics tasks. Our <italic>k</italic>
-mer counter, KMC, being an external and parallel algorithm, is capable to find exact <italic>k</italic>
-mer statistics on short-read human genomic collection with about 30-fold coverage in less than 40 minutes on a standard 6-core PC with 16 GB of RAM and 3 hard disks, and on long-read human genomic collection with a similar coverage in less than 70 minutes, for <italic>k</italic>
=28 in both tests. Using a more powerful machine reduces the times more than twice. Even in a demanding scenario (Quake-compatible counters, <italic>k</italic>
=70) our software works in less than 3 hours on the PC.</p>
</sec>
<sec><title>Endnotes</title>
<p><sup>a</sup>These functionalities of our tool are available via an API, whose detailed description is contained in Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
.</p>
<p><sup>b</sup>
Our tool handles both FASTQ and FASTA input files. For brevity, we however refer throughout the paper only to the (more complicated of the two) FASTQ format.</p>
<p><sup>c</sup>
About 1/4 of all <italic>k</italic>
-mers start with A and also about 1/4 of all <italic>k</italic>
-mers end with T, and it is easy to note that for these (and only these) <italic>k</italic>
-mers their canonical forms start with A. These two groups are not disjoint; their intersection, with exactly the <italic>k</italic>
-mers having A as their first and T as their last symbol, contains about 1/16 of all <italic>k</italic>
-mers. Taking all this into account we immediately obtain the figure 7/16. Similarly, we can show that the distribution of <italic>k</italic>
-mers starting with base C, G and T is 5/16, 3/16 and 1/16. These numbers are different, since (roughly speaking) the lexicographically greater the first <italic>k</italic>
-mer’s symbol, the lesser chance its canonical form will also start with the same symbol.</p>
<p><sup>d</sup>
memcpy is a popular function from the C language standard library, which copies a number of bytes from one memory location to another.</p>
<p><sup>e</sup>Used in the experiments in
[<xref ref-type="bibr" rid="B7">7</xref>
], under a mistaken name NA19240.</p>
</sec>
<sec><title>Competing interests</title>
<p>The authors declare no competing interests whatsoever.</p>
</sec>
<sec><title>Authors’ contributions</title>
<p>SD developed the main concept of the algorithm, was the main architect of the KMC software, wrote most of the implementation code and ran the experiments; he also participated in drafting the manuscript. ADG adapted the parallel radix sort algorithm for the <italic>k</italic>
-mer sorting, implemented the parallel scheme of the main algorithm and the API; she also assisted in drafting the manuscript. SG was a co-author of the main idea of sorting the <italic>k</italic>
-mers and proposed the idea of compacting <italic>k</italic>
-mers stored in temporary files; he was the main author of the text. All authors read and approved the final manuscript.</p>
</sec>
<sec sec-type="supplementary-material"><title>Supplementary Material</title>
<supplementary-material content-type="local-data" id="S1"><caption><title>Additional file 1</title>
<p>1) KMC counter usage. 2) API. 3) Example of API usage. 4) Database format. 5) Experimental results. 6) Automatic setting of parameters in KMC. 7) Selected components of the KMC algorithm (codes not shown in the main part of the paper).</p>
</caption>
<media xlink:href="1471-2105-14-160-S1.pdf"><caption><p>Click here for file</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S2"><caption><title>Additional file 2</title>
<p>Source codes for 64-bit Windows platform (Microsoft Visual C++ 2012 project).</p>
</caption>
<media xlink:href="1471-2105-14-160-S2.zip"><caption><p>Click here for file</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S3"><caption><title>Additional file 3</title>
<p>Source codes for 64-bit Linux platform (gcc project).</p>
</caption>
<media xlink:href="1471-2105-14-160-S3.gz"><caption><p>Click here for file</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back><sec><title>Acknowledgements</title>
<p>The work was supported by the Polish National Science Center upon decision DEC-2011/01/B/ST6/06868 (first author), and by Silesian University of Technology under the project BK-220/RAu2/2012 (second author). The authors thank the anonymous reviewers for remarks helping to improve the preliminary version of the manuscript.</p>
</sec>
<ref-list><ref id="B1"><mixed-citation publication-type="journal"><name><surname>Compeau</surname>
<given-names>PE</given-names>
</name>
<name><surname>Pevzner</surname>
<given-names>PA</given-names>
</name>
<name><surname>Tesler</surname>
<given-names>G</given-names>
</name>
<article-title>How to apply de Bruijn graphs to genome assembly</article-title>
<source>Nature Biotechnol</source>
<year>2011</year>
<volume>29</volume>
<issue>11</issue>
<fpage>987</fpage>
<lpage>991</lpage>
<pub-id pub-id-type="doi">10.1038/nbt.2023</pub-id>
<pub-id pub-id-type="pmid">22068540</pub-id>
</mixed-citation>
</ref>
<ref id="B2"><mixed-citation publication-type="journal"><name><surname>Edgar</surname>
<given-names>RC</given-names>
</name>
<article-title>MUSCLE: multiple sequence alignment with high accuracy and high throughput</article-title>
<source>Nucleic Acids Res</source>
<year>2004</year>
<volume>32</volume>
<issue>5</issue>
<fpage>1792</fpage>
<lpage>1797</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkh340</pub-id>
<pub-id pub-id-type="pmid">15034147</pub-id>
</mixed-citation>
</ref>
<ref id="B3"><mixed-citation publication-type="journal"><name><surname>Kurtz</surname>
<given-names>S</given-names>
</name>
<name><surname>Narechania</surname>
<given-names>A</given-names>
</name>
<name><surname>Stein</surname>
<given-names>J</given-names>
</name>
<name><surname>Ware</surname>
<given-names>D</given-names>
</name>
<article-title>A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes</article-title>
<source>BMC Genomics</source>
<year>2008</year>
<volume>9</volume>
<issue>1</issue>
<fpage>517</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2164-9-517</pub-id>
<pub-id pub-id-type="pmid">18976482</pub-id>
</mixed-citation>
</ref>
<ref id="B4"><mixed-citation publication-type="journal"><name><surname>Kelley</surname>
<given-names>DR</given-names>
</name>
<name><surname>Schatz</surname>
<given-names>MC</given-names>
</name>
<name><surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
<article-title>Quake: quality-aware detection and correction of sequencing errors</article-title>
<source>Genome Biol</source>
<year>2010</year>
<volume>11</volume>
<issue>11</issue>
<fpage>R116</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2010-11-11-r116</pub-id>
<pub-id pub-id-type="pmid">21114842</pub-id>
</mixed-citation>
</ref>
<ref id="B5"><mixed-citation publication-type="journal"><name><surname>Miller</surname>
<given-names>JR</given-names>
</name>
<name><surname>Delcher</surname>
<given-names>AL</given-names>
</name>
<name><surname>Koren</surname>
<given-names>S</given-names>
</name>
<name><surname>Venter</surname>
<given-names>E</given-names>
</name>
<name><surname>Walenz</surname>
<given-names>B</given-names>
</name>
<name><surname>Brownley</surname>
<given-names>A</given-names>
</name>
<name><surname>Johnson</surname>
<given-names>J</given-names>
</name>
<name><surname>Li</surname>
<given-names>K</given-names>
</name>
<name><surname>Mobarry</surname>
<given-names>CM</given-names>
</name>
<name><surname>Sutton</surname>
<given-names>GG</given-names>
</name>
<article-title>Aggressive assembly of pyrosequencing reads with mates</article-title>
<source>Bioinformatics</source>
<year>2008</year>
<volume>24</volume>
<issue>24</issue>
<fpage>2818</fpage>
<lpage>2824</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btn548</pub-id>
<pub-id pub-id-type="pmid">18952627</pub-id>
</mixed-citation>
</ref>
<ref id="B6"><mixed-citation publication-type="journal"><name><surname>Marçais</surname>
<given-names>G</given-names>
</name>
<name><surname>Kingsford</surname>
<given-names>C</given-names>
</name>
<article-title>A fast, lock-free approach for efficient parallel counting of occurrences of k-mers</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>6</issue>
<fpage>764</fpage>
<lpage>770</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr011</pub-id>
<pub-id pub-id-type="pmid">21217122</pub-id>
</mixed-citation>
</ref>
<ref id="B7"><mixed-citation publication-type="journal"><name><surname>Melsted</surname>
<given-names>P</given-names>
</name>
<name><surname>Pritchard</surname>
<given-names>JK</given-names>
</name>
<article-title>Efficient counting of k-mers in DNA sequences using a Bloom Filter</article-title>
<source>BMC Bioinformatics</source>
<year>2011</year>
<volume>12</volume>
<fpage>333</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-12-333</pub-id>
<pub-id pub-id-type="pmid">21831268</pub-id>
</mixed-citation>
</ref>
<ref id="B8"><mixed-citation publication-type="journal"><name><surname>Pell</surname>
<given-names>J</given-names>
</name>
<name><surname>Hintze</surname>
<given-names>A</given-names>
</name>
<name><surname>Canino-Koning</surname>
<given-names>R</given-names>
</name>
<name><surname>Howe</surname>
<given-names>A</given-names>
</name>
<name><surname>Tiedje</surname>
<given-names>JM</given-names>
</name>
<name><surname>Brown</surname>
<given-names>CT</given-names>
</name>
<article-title>Scaling metagenome sequence assembly with probabilistic de Bruijn graphs</article-title>
<source>Proc Natl Acad Sci USA</source>
<year>2012</year>
<volume>109</volume>
<issue>33</issue>
<fpage>13272</fpage>
<lpage>13277</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.1121464109</pub-id>
<pub-id pub-id-type="pmid">22847406</pub-id>
</mixed-citation>
</ref>
<ref id="B9"><mixed-citation publication-type="journal"><name><surname>Rizk</surname>
<given-names>G</given-names>
</name>
<name><surname>Lavenier</surname>
<given-names>D</given-names>
</name>
<name><surname>Chikhi</surname>
<given-names>R</given-names>
</name>
<article-title>DSK: k-mer counting with very low memory usage</article-title>
<source>Bioinformatics</source>
<year>2013</year>
<volume>29</volume>
<issue>5</issue>
<fpage>652</fpage>
<lpage>653</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btt020</pub-id>
<pub-id pub-id-type="pmid">23325618</pub-id>
</mixed-citation>
</ref>
<ref id="B10"><mixed-citation publication-type="book"><name><surname>Satish</surname>
<given-names>N</given-names>
</name>
<name><surname>Kim</surname>
<given-names>C</given-names>
</name>
<name><surname>Chhugani</surname>
<given-names>J</given-names>
</name>
<name><surname>Nguyen</surname>
<given-names>AD</given-names>
</name>
<name><surname>Lee</surname>
<given-names>VW</given-names>
</name>
<name><surname>Kim</surname>
<given-names>D</given-names>
</name>
<name><surname>Dubey</surname>
<given-names>P</given-names>
</name>
<article-title>Fast Sort on CPUs and GPUs. A Case for Bandwidth Oblivious SIMD Sort</article-title>
<source>Proceedings of the 2010 International Conference on Management of Data: 6–11 June 2010</source>
<year>2010</year>
<publisher-name>Indianapolis: ACM</publisher-name>
<fpage>351</fpage>
<lpage>362</lpage>
</mixed-citation>
</ref>
<ref id="B11"><mixed-citation publication-type="journal"><name><surname>Dean</surname>
<given-names>J</given-names>
</name>
<name><surname>Ghemawat</surname>
<given-names>S</given-names>
</name>
<article-title>MapReduce: simplified data processing on large clusters</article-title>
<source>Commun ACM</source>
<year>2008</year>
<volume>51</volume>
<issue>1</issue>
<fpage>107</fpage>
<lpage>113</lpage>
<pub-id pub-id-type="doi">10.1145/1327452.1327492</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000945 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000945 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:3680041
   |texte=   Disk-based k-mer counting on a PC
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:23679007" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021

	Serveur d'exploration MERS
	Attention, ce site est en cours de développement ! Attention, site généré par des moyens informatiques à partir de corpus bruts. Les informations ne sont donc pas validées.

Serveur d'exploration MERS

Disk-based k-mer counting on a PC

Disk-based k-mer counting on a PC

Source :

Abstract

Links to Exploration step

Le document en format XML

Pour manipuler ce document sous Unix (Dilib)

Pour mettre un lien sur cette page dans le réseau Wicri

Pour générer des pages wiki