Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.
***** Acces problem to record *****\

Identifieur interne : 0010889 ( Pmc/Corpus ); précédent : 0010888; suivant : 0010890 ***** probable Xml problem with record *****

Links to Exploration step


Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">These Are Not the K-mers You Are Looking For: Efficient Online K-mer Counting Using a Probabilistic Data Structure</title>
<author>
<name sortKey="Zhang, Qingpeng" sort="Zhang, Qingpeng" uniqKey="Zhang Q" first="Qingpeng" last="Zhang">Qingpeng Zhang</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Pell, Jason" sort="Pell, Jason" uniqKey="Pell J" first="Jason" last="Pell">Jason Pell</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Canino Koning, Rosangela" sort="Canino Koning, Rosangela" uniqKey="Canino Koning R" first="Rosangela" last="Canino-Koning">Rosangela Canino-Koning</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Howe, Adina Chuang" sort="Howe, Adina Chuang" uniqKey="Howe A" first="Adina Chuang" last="Howe">Adina Chuang Howe</name>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, Michigan, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff3">
<addr-line>Department of Plant, Soil, and Microbial Sciences, Michigan State University, East Lansing, Michigan, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Brown, C Titus" sort="Brown, C Titus" uniqKey="Brown C" first="C. Titus" last="Brown">C. Titus Brown</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, Michigan, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">25062443</idno>
<idno type="pmc">4111482</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4111482</idno>
<idno type="RBID">PMC:4111482</idno>
<idno type="doi">10.1371/journal.pone.0101271</idno>
<date when="2014">2014</date>
<idno type="wicri:Area/Pmc/Corpus">001088</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">001088</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">These Are Not the K-mers You Are Looking For: Efficient Online K-mer Counting Using a Probabilistic Data Structure</title>
<author>
<name sortKey="Zhang, Qingpeng" sort="Zhang, Qingpeng" uniqKey="Zhang Q" first="Qingpeng" last="Zhang">Qingpeng Zhang</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Pell, Jason" sort="Pell, Jason" uniqKey="Pell J" first="Jason" last="Pell">Jason Pell</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Canino Koning, Rosangela" sort="Canino Koning, Rosangela" uniqKey="Canino Koning R" first="Rosangela" last="Canino-Koning">Rosangela Canino-Koning</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Howe, Adina Chuang" sort="Howe, Adina Chuang" uniqKey="Howe A" first="Adina Chuang" last="Howe">Adina Chuang Howe</name>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, Michigan, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff3">
<addr-line>Department of Plant, Soil, and Microbial Sciences, Michigan State University, East Lansing, Michigan, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Brown, C Titus" sort="Brown, C Titus" uniqKey="Brown C" first="C. Titus" last="Brown">C. Titus Brown</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, Michigan, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">PLoS ONE</title>
<idno type="eISSN">1932-6203</idno>
<imprint>
<date when="2014">2014</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory efficient
<italic>online</italic>
counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory which is necessary to support online k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle and KAnalyze. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer false positive rates. khmer is implemented in C++ wrapped in a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Marcais, G" uniqKey="Marcais G">G Marçais</name>
</author>
<author>
<name sortKey="Kingsford, C" uniqKey="Kingsford C">C Kingsford</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kurtz, S" uniqKey="Kurtz S">S Kurtz</name>
</author>
<author>
<name sortKey="Narechania, A" uniqKey="Narechania A">A Narechania</name>
</author>
<author>
<name sortKey="Stein, Jc" uniqKey="Stein J">JC Stein</name>
</author>
<author>
<name sortKey="Ware, D" uniqKey="Ware D">D Ware</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Metzker, M" uniqKey="Metzker M">M Metzker</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Conway, Tc" uniqKey="Conway T">TC Conway</name>
</author>
<author>
<name sortKey="Bromage, Aj" uniqKey="Bromage A">AJ Bromage</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Minoche, Ae" uniqKey="Minoche A">AE Minoche</name>
</author>
<author>
<name sortKey="Dohm, Jc" uniqKey="Dohm J">JC Dohm</name>
</author>
<author>
<name sortKey="Himmelbauer, H" uniqKey="Himmelbauer H">H Himmelbauer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Melsted, P" uniqKey="Melsted P">P Melsted</name>
</author>
<author>
<name sortKey="Pritchard, Jk" uniqKey="Pritchard J">JK Pritchard</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rizk, G" uniqKey="Rizk G">G Rizk</name>
</author>
<author>
<name sortKey="Lavenier, D" uniqKey="Lavenier D">D Lavenier</name>
</author>
<author>
<name sortKey="Chikhi, R" uniqKey="Chikhi R">R Chikhi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Deorowicz, S" uniqKey="Deorowicz S">S Deorowicz</name>
</author>
<author>
<name sortKey="Debudaj Grabysz, A" uniqKey="Debudaj Grabysz A">A Debudaj-Grabysz</name>
</author>
<author>
<name sortKey="Grabowski, S" uniqKey="Grabowski S">S Grabowski</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Roy, Rs" uniqKey="Roy R">RS Roy</name>
</author>
<author>
<name sortKey="Bhattacharya, D" uniqKey="Bhattacharya D">D Bhattacharya</name>
</author>
<author>
<name sortKey="Schliep, A" uniqKey="Schliep A">A Schliep</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Audano, P" uniqKey="Audano P">P Audano</name>
</author>
<author>
<name sortKey="Vannberg, F" uniqKey="Vannberg F">F Vannberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Howe, Ac" uniqKey="Howe A">AC Howe</name>
</author>
<author>
<name sortKey="Jansson, Jk" uniqKey="Jansson J">JK Jansson</name>
</author>
<author>
<name sortKey="Malfatti, Sa" uniqKey="Malfatti S">SA Malfatti</name>
</author>
<author>
<name sortKey="Tringe, Sg" uniqKey="Tringe S">SG Tringe</name>
</author>
<author>
<name sortKey="Tiedje, Jm" uniqKey="Tiedje J">JM Tiedje</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cormode, G" uniqKey="Cormode G">G Cormode</name>
</author>
<author>
<name sortKey="Muthukrishnan, S" uniqKey="Muthukrishnan S">S Muthukrishnan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bloom, Bh" uniqKey="Bloom B">BH Bloom</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pell, J" uniqKey="Pell J">J Pell</name>
</author>
<author>
<name sortKey="Hintze, A" uniqKey="Hintze A">A Hintze</name>
</author>
<author>
<name sortKey="Canino Koning, R" uniqKey="Canino Koning R">R Canino-Koning</name>
</author>
<author>
<name sortKey="Howe, A" uniqKey="Howe A">A Howe</name>
</author>
<author>
<name sortKey="Tiedje, Jm" uniqKey="Tiedje J">JM Tiedje</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jones, Dc" uniqKey="Jones D">DC Jones</name>
</author>
<author>
<name sortKey="Ruzzo, Wl" uniqKey="Ruzzo W">WL Ruzzo</name>
</author>
<author>
<name sortKey="Peng, X" uniqKey="Peng X">X Peng</name>
</author>
<author>
<name sortKey="Katze, Mg" uniqKey="Katze M">MG Katze</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Broder, Az" uniqKey="Broder A">AZ Broder</name>
</author>
<author>
<name sortKey="Mitzenmacher, M" uniqKey="Mitzenmacher M">M Mitzenmacher</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fan, L" uniqKey="Fan L">L Fan</name>
</author>
<author>
<name sortKey="Cao, P" uniqKey="Cao P">P Cao</name>
</author>
<author>
<name sortKey="Almeida, J" uniqKey="Almeida J">J Almeida</name>
</author>
<author>
<name sortKey="Broder, Az" uniqKey="Broder A">AZ Broder</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Broder, A" uniqKey="Broder A">A Broder</name>
</author>
<author>
<name sortKey="Mitzenmacher, M" uniqKey="Mitzenmacher M">M Mitzenmacher</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Flajolet, P" uniqKey="Flajolet P">P Flajolet</name>
</author>
<author>
<name sortKey="Fusy, E" uniqKey="Fusy E">É Fusy</name>
</author>
<author>
<name sortKey="Gandouet, O" uniqKey="Gandouet O">O Gandouet</name>
</author>
<author>
<name sortKey="Meunier, F" uniqKey="Meunier F">F Meunier</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chikhi, R" uniqKey="Chikhi R">R Chikhi</name>
</author>
<author>
<name sortKey="Medvedev, P" uniqKey="Medvedev P">P Medvedev</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Medvedev, P" uniqKey="Medvedev P">P Medvedev</name>
</author>
<author>
<name sortKey="Scott, E" uniqKey="Scott E">E Scott</name>
</author>
<author>
<name sortKey="Kakaradov, B" uniqKey="Kakaradov B">B Kakaradov</name>
</author>
<author>
<name sortKey="Pevzner, P" uniqKey="Pevzner P">P Pevzner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pevzner, Pa" uniqKey="Pevzner P">PA Pevzner</name>
</author>
<author>
<name sortKey="Tang, H" uniqKey="Tang H">H Tang</name>
</author>
<author>
<name sortKey="Waterman, Ms" uniqKey="Waterman M">MS Waterman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, X" uniqKey="Li X">X Li</name>
</author>
<author>
<name sortKey="Waterman, Ms" uniqKey="Waterman M">MS Waterman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kelley, Dr" uniqKey="Kelley D">DR Kelley</name>
</author>
<author>
<name sortKey="Schatz, Mc" uniqKey="Schatz M">MC Schatz</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chitsaz, H" uniqKey="Chitsaz H">H Chitsaz</name>
</author>
<author>
<name sortKey="Yee Greenbaum, J" uniqKey="Yee Greenbaum J">J Yee-Greenbaum</name>
</author>
<author>
<name sortKey="Tesler, G" uniqKey="Tesler G">G Tesler</name>
</author>
<author>
<name sortKey="Lombardo, M" uniqKey="Lombardo M">M Lombardo</name>
</author>
<author>
<name sortKey="Dupont, C" uniqKey="Dupont C">C Dupont</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Haas, Bj" uniqKey="Haas B">BJ Haas</name>
</author>
<author>
<name sortKey="Papanicolaou, A" uniqKey="Papanicolaou A">A Papanicolaou</name>
</author>
<author>
<name sortKey="Yassour, M" uniqKey="Yassour M">M Yassour</name>
</author>
<author>
<name sortKey="Grabherr, M" uniqKey="Grabherr M">M Grabherr</name>
</author>
<author>
<name sortKey="Blood, Pd" uniqKey="Blood P">PD Blood</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zerbino, Dr" uniqKey="Zerbino D">DR Zerbino</name>
</author>
<author>
<name sortKey="Birney, E" uniqKey="Birney E">E Birney</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rusu, F" uniqKey="Rusu F">F Rusu</name>
</author>
<author>
<name sortKey="Dobra, A" uniqKey="Dobra A">A Dobra</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Luo, W" uniqKey="Luo W">W Luo</name>
</author>
<author>
<name sortKey="Friedman, Ms" uniqKey="Friedman M">MS Friedman</name>
</author>
<author>
<name sortKey="Shedden, K" uniqKey="Shedden K">K Shedden</name>
</author>
<author>
<name sortKey="Hankenson, Kd" uniqKey="Hankenson K">KD Hankenson</name>
</author>
<author>
<name sortKey="Woolf, Pj" uniqKey="Woolf P">PJ Woolf</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Perez, F" uniqKey="Perez F">F Pérez</name>
</author>
<author>
<name sortKey="Granger, B" uniqKey="Granger B">B Granger</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Qin, J" uniqKey="Qin J">J Qin</name>
</author>
<author>
<name sortKey="Li, R" uniqKey="Li R">R Li</name>
</author>
<author>
<name sortKey="Raes, J" uniqKey="Raes J">J Raes</name>
</author>
<author>
<name sortKey="Arumugam, M" uniqKey="Arumugam M">M Arumugam</name>
</author>
<author>
<name sortKey="Burgdorf, Ks" uniqKey="Burgdorf K">KS Burgdorf</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">PLoS One</journal-id>
<journal-id journal-id-type="iso-abbrev">PLoS ONE</journal-id>
<journal-id journal-id-type="publisher-id">plos</journal-id>
<journal-id journal-id-type="pmc">plosone</journal-id>
<journal-title-group>
<journal-title>PLoS ONE</journal-title>
</journal-title-group>
<issn pub-type="epub">1932-6203</issn>
<publisher>
<publisher-name>Public Library of Science</publisher-name>
<publisher-loc>San Francisco, USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">25062443</article-id>
<article-id pub-id-type="pmc">4111482</article-id>
<article-id pub-id-type="publisher-id">PONE-D-13-45384</article-id>
<article-id pub-id-type="doi">10.1371/journal.pone.0101271</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
<subj-group subj-group-type="Discipline-v2">
<subject>Biology and Life Sciences</subject>
<subj-group>
<subject>Computational Biology</subject>
<subj-group>
<subject>Genome Analysis</subject>
<subj-group>
<subject>Genomic Databases</subject>
</subj-group>
</subj-group>
</subj-group>
<subj-group>
<subject>Genetics</subject>
<subj-group>
<subject>Genomics</subject>
</subj-group>
</subj-group>
<subj-group>
<subject>Molecular Biology</subject>
<subj-group>
<subject>Molecular Biology Techniques</subject>
<subj-group>
<subject>Sequencing Techniques</subject>
<subj-group>
<subject>Genome Sequencing</subject>
<subject>Sequence Analysis</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v2">
<subject>Computer and Information Sciences</subject>
<subj-group>
<subject>Computing Methods</subject>
<subj-group>
<subject>Cloud Computing</subject>
</subj-group>
</subj-group>
<subj-group>
<subject>Software Engineering</subject>
<subj-group>
<subject>Software Tools</subject>
</subj-group>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>These Are Not the K-mers You Are Looking For: Efficient Online K-mer Counting Using a Probabilistic Data Structure</article-title>
<alt-title alt-title-type="running-head">Efficient Probabilistic Online K-mer Counting</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Zhang</surname>
<given-names>Qingpeng</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Pell</surname>
<given-names>Jason</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Canino-Koning</surname>
<given-names>Rosangela</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Howe</surname>
<given-names>Adina Chuang</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<xref ref-type="aff" rid="aff3">
<sup>3</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Brown</surname>
<given-names>C. Titus</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<xref ref-type="corresp" rid="cor1">
<sup>*</sup>
</xref>
</contrib>
</contrib-group>
<aff id="aff1">
<label>1</label>
<addr-line>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America</addr-line>
</aff>
<aff id="aff2">
<label>2</label>
<addr-line>Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, Michigan, United States of America</addr-line>
</aff>
<aff id="aff3">
<label>3</label>
<addr-line>Department of Plant, Soil, and Microbial Sciences, Michigan State University, East Lansing, Michigan, United States of America</addr-line>
</aff>
<contrib-group>
<contrib contrib-type="editor">
<name>
<surname>Zhu</surname>
<given-names>Dongxiao</given-names>
</name>
<role>Editor</role>
<xref ref-type="aff" rid="edit1"></xref>
</contrib>
</contrib-group>
<aff id="edit1">
<addr-line>Wayne State University, United States of America</addr-line>
</aff>
<author-notes>
<corresp id="cor1">* E-mail:
<email>ctb@msu.edu</email>
</corresp>
<fn fn-type="COI-statement">
<p>
<bold>Competing Interests: </bold>
The authors have declared that no competing interests exist.</p>
</fn>
<fn fn-type="con">
<p>Conceived and designed the experiments: CTB. Performed the experiments: QZ CTB. Analyzed the data: QZ CTB. Contributed reagents/materials/analysis tools: QZ JP RCK ACH CTB. Wrote the paper: QZ CTB.</p>
</fn>
</author-notes>
<pub-date pub-type="collection">
<year>2014</year>
</pub-date>
<pub-date pub-type="epub">
<day>25</day>
<month>7</month>
<year>2014</year>
</pub-date>
<volume>9</volume>
<issue>7</issue>
<elocation-id>e101271</elocation-id>
<history>
<date date-type="received">
<day>14</day>
<month>10</month>
<year>2013</year>
</date>
<date date-type="accepted">
<day>4</day>
<month>6</month>
<year>2014</year>
</date>
</history>
<permissions>
<copyright-statement>© 2014 Zhang et al</copyright-statement>
<copyright-year>2014</copyright-year>
<copyright-holder>Zhang et al</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.</license-p>
</license>
</permissions>
<abstract>
<p>K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory efficient
<italic>online</italic>
counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory which is necessary to support online k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle and KAnalyze. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer false positive rates. khmer is implemented in C++ wrapped in a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer.</p>
</abstract>
<funding-group>
<funding-statement>This work was supported in part by the United States Department of Agriculture under grant 2009-03296 from the National Institute of Food and Agriculture, to CTB (
<ext-link ext-link-type="uri" xlink:href="http://www.usda.gov">www.usda.gov</ext-link>
); National Science Foundation grant 09-23812 to CTB (
<ext-link ext-link-type="uri" xlink:href="http://www.nsf.gov">www.nsf.gov</ext-link>
); and the National Institutes of Health grant 1R01HG007513-01 to CTB. AH was supported by NSF Postdoctoral Fellowship Award #0905961. (
<ext-link ext-link-type="uri" xlink:href="http://www.nih.gov">www.nih.gov</ext-link>
). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</funding-statement>
</funding-group>
<counts>
<page-count count="13"></page-count>
</counts>
</article-meta>
</front>
<body>
<sec id="s1">
<title>Introduction</title>
<p>The goal of k-mer counting is to determine the number of occurrences for each fixed-length word of length k in a DNA data set
<xref rid="pone.0101271-Marais1" ref-type="bibr">[1]</xref>
. Efficient k-mer counting plays an important role in many bioinformatics approaches, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation
<xref rid="pone.0101271-Kurtz1" ref-type="bibr">[2]</xref>
.</p>
<p>Short-read shotgun sequencing data is both relatively sparse in k-mers and contains many erroneous k-mers. For typical values of k such as 32 these data sets are sparse, as only a small fraction of the total possible number of k-mers (
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e001.jpg"></inline-graphic>
</inline-formula>
) are actually present in any genome or read data sets derived from the genome. The high error rate (e.g. Illumina has a 0.1–1% per-base error rate
<xref rid="pone.0101271-Metzker1" ref-type="bibr">[3]</xref>
) generates many unique k-mers. As the total number of generated reads increases, the total number of errors grows with it linearly. This leads to data sets where the erroneous k-mers vastly outnumber the true k-mers
<xref rid="pone.0101271-Conway1" ref-type="bibr">[4]</xref>
. Tracking and counting the resulting large number of k-mers, most of which are erroneous, has become an unavoidable and challenging task in sequence analysis
<xref rid="pone.0101271-Minoche1" ref-type="bibr">[5]</xref>
.</p>
<p>A variety of k-mer counting approaches, and standalone software packages implementing them, have emerged in recent years; this includes Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle and KAnalyze
<xref rid="pone.0101271-Marais1" ref-type="bibr">[1]</xref>
,
<xref rid="pone.0101271-Kurtz1" ref-type="bibr">[2]</xref>
,
<xref rid="pone.0101271-Melsted1" ref-type="bibr">[6]</xref>
<xref rid="pone.0101271-Audano1" ref-type="bibr">[10]</xref>
.</p>
<p>These approaches and implementations each offer different algorithmic trade-offs and enable a non-overlapping set of functionality. Tallymer uses a suffix tree to store k-mer counts in memory and on disk
<xref rid="pone.0101271-Kurtz1" ref-type="bibr">[2]</xref>
. Jellyfish stores k-mer counts in in-memory hash tables, and makes use of disk storage to scale to larger data sets
<xref rid="pone.0101271-Marais1" ref-type="bibr">[1]</xref>
. BFCounter uses a Bloom filter as a pre-filter to avoid counting unique k-mers, and is the first published probabilistic approach to k-mer counting
<xref rid="pone.0101271-Melsted1" ref-type="bibr">[6]</xref>
. DSK adopts an approach to k-mer counting that enables time- and memory-efficient k-mer counting with an explicit trade-off between disk and memory usage
<xref rid="pone.0101271-Rizk1" ref-type="bibr">[7]</xref>
. KMC and KAnalyze rely primarily on fast and inexpensive disk access to count k-mers in low memory
<xref rid="pone.0101271-Deorowicz1" ref-type="bibr">[8]</xref>
,
<xref rid="pone.0101271-Audano1" ref-type="bibr">[10]</xref>
. Turtle provides several different containers that offer different false positive and false negative tradeoffs when counting k-mers
<xref rid="pone.0101271-Roy1" ref-type="bibr">[9]</xref>
.</p>
<p>Our motivation for exploring efficient k-mer counting comes from our work with metagenomic data, where we routinely encounter data sets that contain
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e002.jpg"></inline-graphic>
</inline-formula>
bases of DNA and over 50 billion distinct k-mers
<xref rid="pone.0101271-Howe1" ref-type="bibr">[11]</xref>
. To efficiently filter, partition, and assemble these data, we need to store counts for each of these k-mers in main memory, and query and update them in realtime — a set of functionality not readily offered by current packages. Moreover, we wish to enable the use of cloud and desktop computers, which may have poor I/O performance or limited memory. These needs have dictated our exploration of efficient in-memory k-mer counting techniques.</p>
<p>Below, we describe an implementation of a simple probabilistic data structure for k-mer counting. This data structure is based on a Count-Min Sketch
<xref rid="pone.0101271-Cormode1" ref-type="bibr">[12]</xref>
, a generalized probabilistic data structure for storing the frequency distributions of distinct elements. Our implementation extends an earlier implementation of a Bloom filter
<xref rid="pone.0101271-Bloom1" ref-type="bibr">[13]</xref>
, which has been previously used in bioinformatics applications, such as sequence matching
<xref rid="pone.0101271-Malde1" ref-type="bibr">[14]</xref>
, k-mer counting
<xref rid="pone.0101271-Melsted1" ref-type="bibr">[6]</xref>
, and de Bruijn graph storage and traversal
<xref rid="pone.0101271-Pell1" ref-type="bibr">[15]</xref>
,
<xref rid="pone.0101271-Jones1" ref-type="bibr">[16]</xref>
. Many other variations of Bloom filters have been proposed
<xref rid="pone.0101271-Broder1" ref-type="bibr">[17]</xref>
, including counting Bloom filters
<xref rid="pone.0101271-Fan1" ref-type="bibr">[18]</xref>
, multistage filters
<xref rid="pone.0101271-Estan1" ref-type="bibr">[19]</xref>
, and spectral Bloom filters
<xref rid="pone.0101271-Cohen1" ref-type="bibr">[20]</xref>
, which are related to the Count-Min Sketch and our khmer implementation.</p>
<p>Probabilistic approaches can be particularly memory efficient for certain problems, with memory usage significantly lower than any exact data structure
<xref rid="pone.0101271-Pell1" ref-type="bibr">[15]</xref>
. However, their use introduces set membership or counting false positives, which have effects that must be analyzed in the context of specific problems. Moreover, unlike existing techniques, the Count-Min Sketch stores only counts; k-mers must be retrieved from the original data set. In exchange, the low memory footprint enabled by this probabilistic approach enables online updating and retrieval of k-mer counts entirely in memory, which in turn supports streaming applications such as digital normalization
<xref rid="pone.0101271-Brown1" ref-type="bibr">[21]</xref>
.</p>
<p>We use the Amazon cloud to compare time, memory, and disk usage of our k-mer counting implementation with that of other k-mer counting software packages, for two problems. First, we generate a k-mer abundance distribution for large data sets; and second, we query many individual k-mer counts at random from a previously constructed k-mer count database. We show that khmer is competitive in speed, memory, and disk usage for these problems. We also analyze the effects of counting error on calculations of the k-mer count in sequencing data sets, and in particular on metagenomic data sets. Finally, we discuss khmer's miscount performance in the context of two specific applications: low-abundance k-mer trimming of reads, and digital normalization.</p>
<p>The khmer software
<xref rid="pone.0101271-Crusoe1" ref-type="bibr">[22]</xref>
is implemented in C++ in a Python wrapper, enabling flexible use and reuse by users with a wide range of computational expertise. The software package is freely available for academic and commercial use and redistribution under the BSD license at github.com/ged-lab/khmer/. khmer comes with substantial documentation and many tutorials, and contains extensive unit tests. Moreover, we have built several applications on top of khmer, including memory-efficient de Bruijn graph partitioning
<xref rid="pone.0101271-Pell1" ref-type="bibr">[15]</xref>
and lossy compression of short-read data sets for assembly
<xref rid="pone.0101271-Brown1" ref-type="bibr">[21]</xref>
.</p>
</sec>
<sec id="s2">
<title>Results</title>
<sec id="s2a">
<title>Implementing a Count-Min Sketch for k-mers</title>
<p>The two basic operations supported by khmer are increment_count(kmer) and c =  get_count(kmer). Both operate on the data structure in memory, such that neither incrementing a count nor retrieving a count involves disk access.</p>
<p>The implementation details are similar to those of the Bloom filter in
<xref rid="pone.0101271-Pell1" ref-type="bibr">[15]</xref>
, but with the use of 8 bit counters instead of 1 bit counters. Briefly, Z hash tables are allocated, each with a different size of approximately H bytes (
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e003.jpg"></inline-graphic>
</inline-formula>
); the sum of these hash table sizes must fit within available main memory. To increment the count for a particular k-mer, a single hash is computed for the k-mer, and the modulus of that hash with each hash table's size H gives the location for each hash table; the associated count in each hash table is then incremented by 1. We use different sizes for each hash table so as to vary the hash function. Even if two k-mers have the same modulus in one hash table (a collision), they are unlikely to collide in the other hash tables. To retrieve the count for a k-mer, the same hash is computed and the minimum count across all hash tables is computed. While different in implementation detail from the standard Bloom filter, which uses a single hash table with many hash functions, the performance details are identical
<xref rid="pone.0101271-Pell1" ref-type="bibr">[15]</xref>
. One particularly important feature of the Count-Min Sketch is that the counting error is
<italic>one-sided</italic>
<xref rid="pone.0101271-Cormode1" ref-type="bibr">[12]</xref>
. Because counts are only incremented, collisions result in inflated miscounts; if there is no collision for a particular k-mer, the count is correct.</p>
<p>An additional benefit of the Count-Min Sketch is that it is extremely easy to implement correctly, needing only about 3 dozen lines of C++ code for a simple threadsafe implementation. (We have described how khmer scales with multiple threads in
<xref rid="pone.0101271-McDonald1" ref-type="bibr">[23]</xref>
.)</p>
<p>To determine the expected false positive rate — the average frequency with which a given k-mer count will be incorrect when retrieved — we can look at the hash table load. Suppose N distinct k-mers have been counted using Z hash tables, each with size H. The probability that no collisions happened in a specific entry in one hash table is
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e004.jpg"></inline-graphic>
</inline-formula>
, or approximately
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e005.jpg"></inline-graphic>
</inline-formula>
. The individual collision rate in one hash table is then
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e006.jpg"></inline-graphic>
</inline-formula>
. The total collision rate, which is the probability that a collision occurred in each entry where a k-mer maps across all Z hash tables, is
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e007.jpg"></inline-graphic>
</inline-formula>
, which is also the expected false positive rate.</p>
<p>While the false positive rate can easily be calculated from the hash table load, the average
<italic>miscount</italic>
— the degree to which the measured count differs from the true count — depends on the k-mer frequency distribution, which must be determined empirically. We analyze the effects of this below.</p>
</sec>
<sec id="s2b">
<title>Choosing number and size of hash tables used for k-mer counting</title>
<p>The false positive rate depends on the number of distinct k-mers
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e008.jpg"></inline-graphic>
</inline-formula>
, the number of hash tables
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e009.jpg"></inline-graphic>
</inline-formula>
, and the size of the hash tables
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e010.jpg"></inline-graphic>
</inline-formula>
:
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e011.jpg"></inline-graphic>
</inline-formula>
, with an associated memory usage of
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e012.jpg"></inline-graphic>
</inline-formula>
. We face two common scenarios: one in which we have a fixed number of k-mers
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e013.jpg"></inline-graphic>
</inline-formula>
and fixed memory
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e014.jpg"></inline-graphic>
</inline-formula>
and we want to calculate the optimal number of hash tables
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e015.jpg"></inline-graphic>
</inline-formula>
; and one in which we have a desired maximum false positive rate
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e016.jpg"></inline-graphic>
</inline-formula>
and a fixed number of k-mers
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e017.jpg"></inline-graphic>
</inline-formula>
, and we want to calculate the minimum memory usage required to achieve
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e018.jpg"></inline-graphic>
</inline-formula>
.</p>
<p>For fixed memory
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e019.jpg"></inline-graphic>
</inline-formula>
and number of distinct k-mers
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e020.jpg"></inline-graphic>
</inline-formula>
, the optimal number of hash tables can be found by minimizing
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e021.jpg"></inline-graphic>
</inline-formula>
; taking the derivative,
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e022.jpg"></inline-graphic>
</inline-formula>
, with
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e023.jpg"></inline-graphic>
</inline-formula>
and solving for 0, we find that
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e024.jpg"></inline-graphic>
</inline-formula>
is minimized when
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e025.jpg"></inline-graphic>
</inline-formula>
(see
<xref rid="pone.0101271-Broder2" ref-type="bibr">[24]</xref>
for details).</p>
<p>Given a desired false positive rate
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e026.jpg"></inline-graphic>
</inline-formula>
and a fixed number of k-mers
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e027.jpg"></inline-graphic>
</inline-formula>
, the optimal memory usage can be calculated as follows. First, the optimal number of hash tables is determined by the expected false positive rate alone:
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e028.jpg"></inline-graphic>
</inline-formula>
. Using this
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e029.jpg"></inline-graphic>
</inline-formula>
, the minimum average hash table size
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e030.jpg"></inline-graphic>
</inline-formula>
necessary to achieve
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e031.jpg"></inline-graphic>
</inline-formula>
can be calculated as
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e032.jpg"></inline-graphic>
</inline-formula>
(see
<xref rid="pone.0101271-Broder2" ref-type="bibr">[24]</xref>
for details).</p>
<p>A remaining problem is that the number of distinct k-mers
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e033.jpg"></inline-graphic>
</inline-formula>
is typically not known. However, memory- and time-efficient algorithms for calculating
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e034.jpg"></inline-graphic>
</inline-formula>
do exist and we plan to implement this in khmer in the future
<xref rid="pone.0101271-Flajolet1" ref-type="bibr">[25]</xref>
.</p>
</sec>
<sec id="s2c">
<title>khmer efficiently calculates k-mer abundance histograms</title>
<p>We measured time and memory required to calculate k-mer abundance histograms in five soil metagenomic read data sets using khmer, Tallymer, Jellyfish, DSK, KMC, Turtle, and KAnalyze (
<xref rid="pone-0101271-t001" ref-type="table">Table 1</xref>
;
<xref ref-type="fig" rid="pone-0101271-g001">Figures 1</xref>
and
<xref ref-type="fig" rid="pone-0101271-g002">2</xref>
). We chose to benchmark abundance histograms because this functionality is common to all the software packages, and is a common analysis approach for determining assembly parameters
<xref rid="pone.0101271-Chikhi1" ref-type="bibr">[26]</xref>
. We applied each package to increasingly large subsets of a 50 m read soil metagenome data set
<xref rid="pone.0101271-Howe1" ref-type="bibr">[11]</xref>
. For the BFCounter, KMC, Turtle and KAnalyze packages, which do not generate k-mer abundance distribution directly, we output the frequency of each k-mer to a file but do no further analysis.</p>
<fig id="pone-0101271-g001" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0101271.g001</object-id>
<label>Figure 1</label>
<caption>
<title>Comparison of the time it takes for k-mer counting tools to calculate k-mer abundance histograms, with time (y axis, in seconds) against data set size (in number of reads, x axis).</title>
<p>All programs executed in time approximately linear with the number of input reads.</p>
</caption>
<graphic xlink:href="pone.0101271.g001"></graphic>
</fig>
<fig id="pone-0101271-g002" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0101271.g002</object-id>
<label>Figure 2</label>
<caption>
<title>Memory usage of k-mer counting tools when calculating k-mer abundance histograms, with maximum resident program size (y axis, in GB) plotted against the total number of distinct k-mers in the data set (x axis, billions of k-mers).</title>
</caption>
<graphic xlink:href="pone.0101271.g002"></graphic>
</fig>
<table-wrap id="pone-0101271-t001" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0101271.t001</object-id>
<label>Table 1</label>
<caption>
<title>Benchmark soil metagenome data sets for k-mer counting performance, taken from
<xref rid="pone.0101271-Howe1" ref-type="bibr">[11]</xref>
.</title>
</caption>
<alternatives>
<graphic id="pone-0101271-t001-1" xlink:href="pone.0101271.t001"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1">Data set</td>
<td align="left" rowspan="1" colspan="1">size of file (GB)</td>
<td align="left" rowspan="1" colspan="1">number of reads</td>
<td align="left" rowspan="1" colspan="1">number of distinct k-mers</td>
<td align="left" rowspan="1" colspan="1">total number of k-mers</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">subset 1</td>
<td align="left" rowspan="1" colspan="1">1.90</td>
<td align="left" rowspan="1" colspan="1">9,744,399</td>
<td align="left" rowspan="1" colspan="1">561,178,082</td>
<td align="left" rowspan="1" colspan="1">630,207,985</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">subset 2</td>
<td align="left" rowspan="1" colspan="1">2.17</td>
<td align="left" rowspan="1" colspan="1">19,488,798</td>
<td align="left" rowspan="1" colspan="1">1,060,354,144</td>
<td align="left" rowspan="1" colspan="1">1,259,079,821</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">subset 3</td>
<td align="left" rowspan="1" colspan="1">3.14</td>
<td align="left" rowspan="1" colspan="1">29,233,197</td>
<td align="left" rowspan="1" colspan="1">1,445,923,389</td>
<td align="left" rowspan="1" colspan="1">1,771,614,378</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">subset 4</td>
<td align="left" rowspan="1" colspan="1">4.05</td>
<td align="left" rowspan="1" colspan="1">38,977,596</td>
<td align="left" rowspan="1" colspan="1">1,770,589,216</td>
<td align="left" rowspan="1" colspan="1">2,227,756,662</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">entire data set</td>
<td align="left" rowspan="1" colspan="1">5.00</td>
<td align="left" rowspan="1" colspan="1">48,721,995</td>
<td align="left" rowspan="1" colspan="1">2,121,474,237</td>
<td align="left" rowspan="1" colspan="1">2,743,130,683</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<p>
<xref ref-type="fig" rid="pone-0101271-g001">Figure 1</xref>
shows that the time usage of the khmer approach is comparable to DSK and BFCounter, and, as expected, increases linearly with data set size. Tallymer is the slowest of the four tools in this testing, while KMC, Turtle, and Jellyfish are the fastest.</p>
<p>From
<xref ref-type="fig" rid="pone-0101271-g002">Figure 2</xref>
, we see that the memory usage of Jellyfish, Tallymer, BFCounter, and Turtle increases linearly with data set size. Tallymer uses more memory than Jellyfish generally, while BFCounter and Turtle have considerably lower memory usage. DSK, KMC, and KAnalyze use constant memory across the data sets, but at the cost of more limited functionality (discussed below).</p>
<p>The memory usage of khmer also increases linearly with data set size as long as we hold the false positive rate constant. However, the memory usage of khmer varies substantially with the desired false positive rate: we can decrease the memory usage by increasing the false positive rate as shown in
<xref ref-type="fig" rid="pone-0101271-g002">Figure 2</xref>
. We also see that with a low false positive of 1%, the memory usage is competitive with Tallymer and Jellyfish; with a higher 5% false positive rate, the memory usage is lower than all but the disk-based DSK; with an false positive rate as high as 20%, the memory usage is further lower, close to DSK, KAnalyze, and KMC.</p>
<p>We also measured disk usage during counting.
<xref ref-type="fig" rid="pone-0101271-g003">Figure 3</xref>
shows that the disk usage also increases linearly with the number of k-mers in the data set. For a high-diversity metagenomic data set of 5 GB, the disk usage of both Jellyfish and Tallymer is around 30 GB. khmer counts k-mers entirely in working memory and does not rely on any on-disk storage to store or retrieve k-mer counts, although for practicality the hash tables can be saved for later reuse; the uncompressed disk usage for khmer in
<xref ref-type="fig" rid="pone-0101271-g003">Figure 3</xref>
is the same as its memory. At the expense of more time, khmer supports saving and loading gzip-compressed hash tables, which are competitive in size to DSK's on-disk database (
<xref ref-type="fig" rid="pone-0101271-g003">Figure 3</xref>
, dashed line).</p>
<fig id="pone-0101271-g003" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0101271.g003</object-id>
<label>Figure 3</label>
<caption>
<title>Disk storage usage of different k-mer counting tools to calculate k-mer abundance histograms in GB (y axis), plotted against the number of distinct k-mers in the data set (x axis).</title>
<p>
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e035.jpg"></inline-graphic>
</inline-formula>
Note that khmer does not use the disk during counting or retrieval, although its hash tables can be saved for reuse.</p>
</caption>
<graphic xlink:href="pone.0101271.g003"></graphic>
</fig>
</sec>
<sec id="s2d">
<title>khmer accesses k-mer counts efficiently</title>
<p>We measured the time it took to access 9.7 m 22-mers across five different data sets after the initial databases had been built (
<xref ref-type="fig" rid="pone-0101271-g004">Figure 4</xref>
). Note that Tallymer, Jellyfish, and khmer all support random access to k-mer counts, while BFCounter, DSK, KMC, Turtle and KAnalyze do not. Here, khmer performed well, dramatically outperforming Jellyfish and Tallymer. In all three cases, system time dominated the overall time required to retrieve k-mers, suggesting that the primary reason for the increase in retrieval time was due to the increased size of the database on the disk (data not shown). In particular, khmer is independent of the size of the database in retrieval time once the hash tables are loaded into memory.</p>
<fig id="pone-0101271-g004" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0101271.g004</object-id>
<label>Figure 4</label>
<caption>
<title>Time for several k-mer counting tools to retrieve the counts of 9.7 m randomly chosen k-mers (y axis), plotted against the number of distinct k-mers in the data set being queried (x axis).</title>
<p>BFCounter, DSK, Turtle, KAnalyze, and KMC do not support this functionality.</p>
</caption>
<graphic xlink:href="pone.0101271.g004"></graphic>
</fig>
</sec>
<sec id="s2e">
<title>The measured counting error is low on short-read data</title>
<p>Due to the use of Count-Min Sketch and its lack of collision tracking, khmer will report some incorrect counts for k-mers; these counts are always higher than the true counts, up to the bound of 255 (a limit imposed by our use of 8-bit counters). The frequency with which incorrect counts are reported can be estimated from the hash table load. However, the expected
<italic>miscount</italic>
— the difference between the true k-mer frequency and the reported k-mer frequency — cannot be calculated without knowing the distribution of k-mer abundances; in general, the average miscount will be small if the data is left-skewed. As noted by Melsted and Pritchard, a large number of k-mers in short-read data are low-abundance, leading to precisely the skew that would yield low miscounts
<xref rid="pone.0101271-Melsted1" ref-type="bibr">[6]</xref>
. Here we use both real and simulated data sets to evaluate the counting performance in practice.</p>
<p>
<xref ref-type="fig" rid="pone-0101271-g005">Figure 5</xref>
shows the relationship between average miscount and counting false positive rate for five different test data sets with similar numbers of distinct k-mers: one metagenome data set; a simulated set of random k-mers; a simulated set of reads, chosen with 3x coverage and 1% error; a simulated set of reads (3x) with no error; and a set of
<italic>E. coli</italic>
reads (
<xref rid="pone-0101271-t002" ref-type="table">Table 2</xref>
). Even when the counting false positive rate is as high as 0.9 — where 90% of k-mers have an incorrect count — the average miscount is still below 4.</p>
<fig id="pone-0101271-g005" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0101271.g005</object-id>
<label>Figure 5</label>
<caption>
<title>Relation between average miscount — amount by which the count for k-mers is incorrect — on the y axis, plotted against false positive rate (x axis), for five data sets.</title>
<p>The five data sets were chosen to have the same total number of distinct k-mers: one metagenome data set; a set of randomly generated k-mers; a set of reads, chosen with 3x coverage and 1% error, from a randomly generated genome; a simulated set of error-free reads (3x) chosen from a randomly generated genome and a set of
<italic>E. coli</italic>
reads.</p>
</caption>
<graphic xlink:href="pone.0101271.g005"></graphic>
</fig>
<table-wrap id="pone-0101271-t002" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0101271.t002</object-id>
<label>Table 2</label>
<caption>
<title>Data sets used for analyzing miscounts.</title>
</caption>
<alternatives>
<graphic id="pone-0101271-t002-2" xlink:href="pone.0101271.t002"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1">Data set</td>
<td align="left" rowspan="1" colspan="1">Size of data set file</td>
<td align="left" rowspan="1" colspan="1">Number of total k-mers</td>
<td align="left" rowspan="1" colspan="1">Number of distinct k-mers</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">Real metagenomics reads</td>
<td align="left" rowspan="1" colspan="1">7.01 M</td>
<td align="left" rowspan="1" colspan="1">2,917,200</td>
<td align="left" rowspan="1" colspan="1">1,944,996</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Totally random reads with randomly generated k-mers</td>
<td align="left" rowspan="1" colspan="1">3.53 M</td>
<td align="left" rowspan="1" colspan="1">2,250,006</td>
<td align="left" rowspan="1" colspan="1">1,973,059</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Simulated reads from simulated genome with error</td>
<td align="left" rowspan="1" colspan="1">5.92 M</td>
<td align="left" rowspan="1" colspan="1">3,757,479</td>
<td align="left" rowspan="1" colspan="1">2,133,592</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Simulated reads from simulated genome without error</td>
<td align="left" rowspan="1" colspan="1">9.07 M</td>
<td align="left" rowspan="1" colspan="1">5,714,973</td>
<td align="left" rowspan="1" colspan="1">1,989,644</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Real
<italic>E. coli</italic>
reads</td>
<td align="left" rowspan="1" colspan="1">4.85 M</td>
<td align="left" rowspan="1" colspan="1">4,004,911</td>
<td align="left" rowspan="1" colspan="1">2,079,302</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<p>We separately analyzed the average
<italic>percentage</italic>
miscount between true and false k-mers; e.g. an miscount of 4 for a k-mer whose true count is 1 would be 400%.
<xref ref-type="fig" rid="pone-0101271-g006">Figure 6</xref>
shows the relationship between average miscount and counting false positive rate for the same five data sets as in
<xref ref-type="fig" rid="pone-0101271-g005">Figure 5</xref>
. For a false positive rate of 0.1 (10% of k-mer counts are incorrect), the average percentage miscount is less than 10% for all five data sets; this will of course generally be true, because the average miscount is bounded by the product of the false positive rate with k-mer abundance.</p>
<fig id="pone-0101271-g006" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0101271.g006</object-id>
<label>Figure 6</label>
<caption>
<title>Relation between percent miscount — amount by which the count for k-mers is incorrect relative to its true count — on the y axis, plotted against false positive rate (x axis), for five data sets.</title>
<p>The five data sets are the same as in
<xref ref-type="fig" rid="pone-0101271-g005">Figure 5</xref>
.</p>
</caption>
<graphic xlink:href="pone.0101271.g006"></graphic>
</fig>
<p>We see here that for a fixed false positive rate, the simulated reads without error have the highest average miscount, and the randomly generated k-mers have the lowest average miscount. This is because these two abundance distributions have the least and most left-skew, respectively: the simulated reads without error have no abundance-1 k-mers, while the randomly generated k-mers are entirely low abundance.</p>
</sec>
<sec id="s2f">
<title>Sequencing error profiles can be measured with k-mer abundance profiles</title>
<p>One specific use for khmer is detecting random sequencing errors by looking at the k-mer abundance distribution within reads
<xref rid="pone.0101271-Medvedev1" ref-type="bibr">[27]</xref>
. This approach, known also as “k-mer spectral analysis”, was first proposed in by
<xref rid="pone.0101271-Pevzner1" ref-type="bibr">[28]</xref>
and further developed in
<xref rid="pone.0101271-Li1" ref-type="bibr">[29]</xref>
. The essential idea is that low-abundance k-mers contained in a high-coverage data set typically represent random sequencing errors.</p>
<p>A variety of read trimming and error correcting tools use k-mer counting to reduce the error content of the read data set, independent of quality scores or reference genomes
<xref rid="pone.0101271-Kelley1" ref-type="bibr">[30]</xref>
. This is an application where the counting error of the Count-Min Sketch approach used by khmer may be particularly tolerable: it will never falsely call a high-abundance k-mer as low-abundance because khmer never underestimates counts.</p>
<p>In
<xref ref-type="fig" rid="pone-0101271-g007">Figure 7</xref>
, we use khmer to examine the sequencing error pattern of a 5m-read subset of an Illumina reads data set from single-colony sequencing of
<italic>E. coli</italic>
<xref rid="pone.0101271-Chitsaz1" ref-type="bibr">[31]</xref>
. The high rate of occurrence of unique k-mers close to the 3′ end of reads is due to the increased sequencing error rate at the 3′ end of reads.</p>
<fig id="pone-0101271-g007" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0101271.g007</object-id>
<label>Figure 7</label>
<caption>
<title>Number of unique k-mers (y axis) by starting position within read (x axis) in an untrimmed
<italic>E. coli</italic>
100-bp Illumina shotgun data set, for k = 17 and k = 32.</title>
<p>The increasing numbers of unique k-mers are a sign of the increasing sequencing error towards the 3′ end of reads. Note that there are only 69 starting positions for 32-mers in a 100 base read.</p>
</caption>
<graphic xlink:href="pone.0101271.g007"></graphic>
</fig>
</sec>
<sec id="s2g">
<title>khmer can be applied iteratively to read trimming</title>
<p>We next evaluated the effect of false-positive induced miscounts on read trimming, in which reads are truncated at the first low-abundance k-mer. Because the Count-Min Sketch never undercounts k-mers, reads will never be erroneously trimmed at truly high-abundance k-mers; however, reads may not be trimmed correctly when miscounts inflate the count of low-abundance k-mers. In cases where many errors remain, read trimming can potentially be applied multiple times, with each round reducing the total number of k-mers and hence resulting in lower false positive rates for the same memory usage.</p>
<p>We performed six iterations of unique k-mer trimming on 5 million Illumina reads from sequencing of
<italic>E. coli</italic>
, with memory usage less than 30 MB. For each iteration we measured empirical false positive rate compared with number of bases trimmed as well as the total number of k-mers (
<xref rid="pone-0101271-t003" ref-type="table">Table 3</xref>
). In the first round, the estimated false positive rate was 80.0%, and 13.5% of the total bases were removed by trimming reads at low-abundance k-mers; the second iteration had a false positive rate of 37.7%, and removed only 1.5% additional data; and by the fourth iteration the false positive rate was down to 23.2% with 0.0% of the data removed.</p>
<table-wrap id="pone-0101271-t003" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0101271.t003</object-id>
<label>Table 3</label>
<caption>
<title>Iterative low-memory k-mer trimming.</title>
</caption>
<alternatives>
<graphic id="pone-0101271-t003-3" xlink:href="pone.0101271.t003"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">FP rate</td>
<td align="left" rowspan="1" colspan="1">bases trimmed</td>
<td align="left" rowspan="1" colspan="1">distinct k-mers</td>
<td align="left" rowspan="1" colspan="1">unique k-mers</td>
<td align="left" rowspan="1" colspan="1">unique k-mers at 3′ end</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">untrimmed</td>
<td align="left" rowspan="1" colspan="1">-</td>
<td align="left" rowspan="1" colspan="1">-</td>
<td align="left" rowspan="1" colspan="1">41.6 m</td>
<td align="left" rowspan="1" colspan="1">34.1 m</td>
<td align="left" rowspan="1" colspan="1">30.4%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">khmer iteration 1</td>
<td align="left" rowspan="1" colspan="1">80.0%</td>
<td align="left" rowspan="1" colspan="1">13.5%</td>
<td align="left" rowspan="1" colspan="1">13.3 m</td>
<td align="left" rowspan="1" colspan="1">6.5 m</td>
<td align="left" rowspan="1" colspan="1">29.8%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">khmer iteration 2</td>
<td align="left" rowspan="1" colspan="1">40.2%</td>
<td align="left" rowspan="1" colspan="1">1.7%</td>
<td align="left" rowspan="1" colspan="1">7.6 m</td>
<td align="left" rowspan="1" colspan="1">909.9k</td>
<td align="left" rowspan="1" colspan="1">12.3%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">khmer iteration 3</td>
<td align="left" rowspan="1" colspan="1">25.4%</td>
<td align="left" rowspan="1" colspan="1">0.3%</td>
<td align="left" rowspan="1" colspan="1">6.8 m</td>
<td align="left" rowspan="1" colspan="1">168.1k</td>
<td align="left" rowspan="1" colspan="1">3.1%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">khmer iteration 4</td>
<td align="left" rowspan="1" colspan="1">23.2%</td>
<td align="left" rowspan="1" colspan="1">0.1%</td>
<td align="left" rowspan="1" colspan="1">6.7 m</td>
<td align="left" rowspan="1" colspan="1">35.8k</td>
<td align="left" rowspan="1" colspan="1">0.7%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">khmer iteration 5</td>
<td align="left" rowspan="1" colspan="1">22.8%</td>
<td align="left" rowspan="1" colspan="1">0.0%</td>
<td align="left" rowspan="1" colspan="1">6.6 m</td>
<td align="left" rowspan="1" colspan="1">7.9k</td>
<td align="left" rowspan="1" colspan="1">0.2%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">khmer iteration 6</td>
<td align="left" rowspan="1" colspan="1">22.7%</td>
<td align="left" rowspan="1" colspan="1">0.0%</td>
<td align="left" rowspan="1" colspan="1">6.6 m</td>
<td align="left" rowspan="1" colspan="1">1.9k</td>
<td align="left" rowspan="1" colspan="1">0.0%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">filter by FASTX</td>
<td align="left" rowspan="1" colspan="1">-</td>
<td align="left" rowspan="1" colspan="1">9.1%</td>
<td align="left" rowspan="1" colspan="1">26.6 m</td>
<td align="left" rowspan="1" colspan="1">20.3 m</td>
<td align="left" rowspan="1" colspan="1">26.3%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">filter by seqtk(default)</td>
<td align="left" rowspan="1" colspan="1">-</td>
<td align="left" rowspan="1" colspan="1">8.9%</td>
<td align="left" rowspan="1" colspan="1">17.7 m</td>
<td align="left" rowspan="1" colspan="1">12.1 m</td>
<td align="left" rowspan="1" colspan="1">12.3%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">filter by seqtk(-q 0.01)</td>
<td align="left" rowspan="1" colspan="1">-</td>
<td align="left" rowspan="1" colspan="1">15.4%</td>
<td align="left" rowspan="1" colspan="1">9.9 m</td>
<td align="left" rowspan="1" colspan="1">5.1 m</td>
<td align="left" rowspan="1" colspan="1">5.2%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">filter by seqtk(-b 3 -e 5)</td>
<td align="left" rowspan="1" colspan="1">-</td>
<td align="left" rowspan="1" colspan="1">8.0%</td>
<td align="left" rowspan="1" colspan="1">34.5 m</td>
<td align="left" rowspan="1" colspan="1">27.7 m</td>
<td align="left" rowspan="1" colspan="1">25.3%</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot>
<fn id="nt101">
<label></label>
<p>
<bold>The results of trimming reads at unique (erroneous) k-mers from a 5 m read </bold>
<bold>
<italic>E. coli</italic>
</bold>
<bold> data set (1.4 GB) in under 30 MB of RAM. After each iteration, we measured the total number of distinct k-mers in the data set, the total number of unique (and likely erroneous) k-mers remaining, and the number of unique k-mers present at the 3' end of reads.</bold>
</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>The elimination of so many unique k-mers (column 5) in the first pass was unexpected: the high false positive rate should have resulted in fewer k-mers being identified as unique, were the erroneous k-mers independent of each other. Upon examination, we realized that in Illumina data erroneous k-mers typically come from substitution errors that yield runs of up to k erroneous k-mers in a row
<xref rid="pone.0101271-Kelley1" ref-type="bibr">[30]</xref>
. When trimming reads with high false positive rates, these runs are typically trimmed after the first few unique k-mers, leaving unique k-mers at the 3′ end. Because of this we hypothesized that high-FP rate trimming would result in the retention of many unique k-mers at the 3′ end of the read, and this was confirmed upon measurement (
<xref rid="pone-0101271-t003" ref-type="table">Table 3</xref>
, column 6, pass 1 vs pass 2).</p>
<p>In comparison to quality-based trimming software such as seqtk and FASTX, trimming at unique k-mers performed very well: in this data set, all unique k-mers represent errors, and even with an initial false positive rate of 80%, khmer outperformed all but the most stringent seqtk run (
<xref rid="pone-0101271-t003" ref-type="table">Table 3</xref>
). With a lower false positive rate or multiple passes, khmer eliminates more erroneous k-mers than seqtk or FASTX. The tradeoff here is in memory usage: for larger data sets, seqtk and FASTX will consume the same amount of memory as on smaller data sets, while khmer's memory usage will need to grow with the data set size.</p>
</sec>
<sec id="s2h">
<title>Using khmer for digital normalization, a streaming algorithm</title>
<p>Digital normalization is a lossy compression algorithm that discards short reads based on saturating coverage of a de Bruijn graph
<xref rid="pone.0101271-Brown1" ref-type="bibr">[21]</xref>
. While several non-streaming implementations exist, including Trinity's
<italic>in silico</italic>
normalization
<xref rid="pone.0101271-Haas1" ref-type="bibr">[32]</xref>
,
<xref rid="pone.0101271-Brown2" ref-type="bibr">[33]</xref>
, digital normalization can be efficiently implemented as a
<italic>streaming</italic>
algorithm. In the streaming implementation, if a read is not kept, it is not loaded into the Count-Min Sketch structure, and the false positive rate does not increase. For high coverage data sets, the digital normalization algorithm is sublinear in memory because it does not collect the majority of k-mers in those data sets
<xref rid="pone.0101271-Brown1" ref-type="bibr">[21]</xref>
. This has the advantage of enabling low-memory preprocessing of both high-coverage genomic data sets, as well as mRNAseq or metagenomic data sets with high-coverage components
<xref rid="pone.0101271-Howe1" ref-type="bibr">[11]</xref>
,
<xref rid="pone.0101271-Brown1" ref-type="bibr">[21]</xref>
.</p>
<p>While digital normalization is already implemented inside khmer, previous work did not explore the lower bound on memory usage for effective digital normalization. In particular, the effects of high false positive rates have not been examined in any prior work.</p>
<p>We applied digital normalization to the
<italic>E. coli</italic>
data set used above, and chose seven different Count-Min Sketch sizes to yield seven different false positive rates 4. The data set was normalized to a k-mer coverage of 20 and the resulting data were evaluated for retention of true and erroneous k-mers, as in
<xref rid="pone.0101271-Brown1" ref-type="bibr">[21]</xref>
(
<xref rid="pone-0101271-t004" ref-type="table">Table 4</xref>
). The results show that digital normalization retains the same set of underlying “true” k-mers until the highest false positive rate of 100% (
<xref rid="pone-0101271-t004" ref-type="table">Table 4</xref>
, column 5), while discarding only about 2% additional reads (
<xref rid="pone-0101271-t004" ref-type="table">Table 4</xref>
, column 6).</p>
<table-wrap id="pone-0101271-t004" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0101271.t004</object-id>
<label>Table 4</label>
<caption>
<title>Low-memory digital normalization.</title>
</caption>
<alternatives>
<graphic id="pone-0101271-t004-4" xlink:href="pone.0101271.t004"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1">memory</td>
<td align="left" rowspan="1" colspan="1">FP rate</td>
<td align="left" rowspan="1" colspan="1">retained reads</td>
<td align="left" rowspan="1" colspan="1">retained reads %</td>
<td align="left" rowspan="1" colspan="1">true k-mers missing</td>
<td align="left" rowspan="1" colspan="1">total k-mers</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">before diginorm</td>
<td align="left" rowspan="1" colspan="1">-</td>
<td align="left" rowspan="1" colspan="1">5,000,000</td>
<td align="left" rowspan="1" colspan="1">100.0%</td>
<td align="left" rowspan="1" colspan="1">170</td>
<td align="left" rowspan="1" colspan="1">41.6 m</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">2400 MB</td>
<td align="left" rowspan="1" colspan="1">0.0%</td>
<td align="left" rowspan="1" colspan="1">1,656,518</td>
<td align="left" rowspan="1" colspan="1">33.0%</td>
<td align="left" rowspan="1" colspan="1">172</td>
<td align="left" rowspan="1" colspan="1">28.1 m</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">240 MB</td>
<td align="left" rowspan="1" colspan="1">2.8%</td>
<td align="left" rowspan="1" colspan="1">1,655,988</td>
<td align="left" rowspan="1" colspan="1">33.0%</td>
<td align="left" rowspan="1" colspan="1">172</td>
<td align="left" rowspan="1" colspan="1">28.1 m</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">120 MB</td>
<td align="left" rowspan="1" colspan="1">18.0%</td>
<td align="left" rowspan="1" colspan="1">1,652,273</td>
<td align="left" rowspan="1" colspan="1">33.0%</td>
<td align="left" rowspan="1" colspan="1">172</td>
<td align="left" rowspan="1" colspan="1">28.1 m</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">60 MB</td>
<td align="left" rowspan="1" colspan="1">59.1%</td>
<td align="left" rowspan="1" colspan="1">1,633,182</td>
<td align="left" rowspan="1" colspan="1">32.0%</td>
<td align="left" rowspan="1" colspan="1">172</td>
<td align="left" rowspan="1" colspan="1">27.9 m</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">40 MB</td>
<td align="left" rowspan="1" colspan="1">83.2%</td>
<td align="left" rowspan="1" colspan="1">1,602,437</td>
<td align="left" rowspan="1" colspan="1">32.0%</td>
<td align="left" rowspan="1" colspan="1">172</td>
<td align="left" rowspan="1" colspan="1">27.6 m</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">20 MB</td>
<td align="left" rowspan="1" colspan="1">98.8%</td>
<td align="left" rowspan="1" colspan="1">1,460,936</td>
<td align="left" rowspan="1" colspan="1">29.0%</td>
<td align="left" rowspan="1" colspan="1">172</td>
<td align="left" rowspan="1" colspan="1">25.7 m</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">10 MB</td>
<td align="left" rowspan="1" colspan="1">100.0%</td>
<td align="left" rowspan="1" colspan="1">1,076,958</td>
<td align="left" rowspan="1" colspan="1">21.0%</td>
<td align="left" rowspan="1" colspan="1">185</td>
<td align="left" rowspan="1" colspan="1">20.9 m</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot>
<fn id="nt102">
<label></label>
<p>
<bold>The results of digitally normalizing a 5 m read </bold>
<bold>
<italic>E. coli</italic>
</bold>
<bold> data set (1.4 GB) to C = 20 with k = 20 under several memory usage/false positive rates. The false positive rate (column 1) is empirically determined. We measured reads remaining, number of “true” k-mers missing from the data at each step, and the number of total k-mers remaining. Note: at high false positive rates, reads are erroneously removed due to inflation of k-mer counts.</bold>
</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>To evaluate the effect of digital normalization with high false positive rates on actual genome assembly, we next performed normalization to a coverage of 20 with the same range of false positive rates as above. We then assembled this data with Velvet
<xref rid="pone.0101271-Zerbino1" ref-type="bibr">[34]</xref>
and compared the resulting assemblies to the known
<italic>E. coli</italic>
MG1655 genome using QUAST (
<xref rid="pone-0101271-t005" ref-type="table">Table 5</xref>
). To our surprise, we found that even after executing digital normalization with a false positive rate of 83.2%, a nearly complete assembly was generated. No progressive increase in misassemblies (measured against the real genome with QUAST) was seen across the different false positive rates (data not shown). This suggests that below 83.2% FP rate, the false positive rate of digital normalization has little to no effect on assembly quality with Velvet. (Note that the Velvet assembler itself used considerably more memory than digital normalization.)</p>
<table-wrap id="pone-0101271-t005" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0101271.t005</object-id>
<label>Table 5</label>
<caption>
<title>
<italic>E. coli</italic>
genome assembly after low-memory digital normalization.</title>
</caption>
<alternatives>
<graphic id="pone-0101271-t005-5" xlink:href="pone.0101271.t005"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1">memory</td>
<td align="left" rowspan="1" colspan="1">FP rate</td>
<td align="left" rowspan="1" colspan="1">N contigs</td>
<td align="left" rowspan="1" colspan="1">total length(bases)</td>
<td align="left" rowspan="1" colspan="1">% of true genome covered</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">before diginorm</td>
<td align="left" rowspan="1" colspan="1">-</td>
<td align="left" rowspan="1" colspan="1">106</td>
<td align="left" rowspan="1" colspan="1">4,546,051</td>
<td align="left" rowspan="1" colspan="1">97.84%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">2400 MB</td>
<td align="left" rowspan="1" colspan="1">0.0%</td>
<td align="left" rowspan="1" colspan="1">617</td>
<td align="left" rowspan="1" colspan="1">4,549,235</td>
<td align="left" rowspan="1" colspan="1">98.05%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">240 MB</td>
<td align="left" rowspan="1" colspan="1">2.8%</td>
<td align="left" rowspan="1" colspan="1">87</td>
<td align="left" rowspan="1" colspan="1">4,549,253</td>
<td align="left" rowspan="1" colspan="1">98.04%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">120 MB</td>
<td align="left" rowspan="1" colspan="1">18.0%</td>
<td align="left" rowspan="1" colspan="1">86</td>
<td align="left" rowspan="1" colspan="1">4,549,335</td>
<td align="left" rowspan="1" colspan="1">98.04%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">60 MB</td>
<td align="left" rowspan="1" colspan="1">59.1%</td>
<td align="left" rowspan="1" colspan="1">90</td>
<td align="left" rowspan="1" colspan="1">4,548,619</td>
<td align="left" rowspan="1" colspan="1">98.03%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">40 MB</td>
<td align="left" rowspan="1" colspan="1">83.2%</td>
<td align="left" rowspan="1" colspan="1">89</td>
<td align="left" rowspan="1" colspan="1">4,550,599</td>
<td align="left" rowspan="1" colspan="1">98.11%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">20 MB</td>
<td align="left" rowspan="1" colspan="1">98.8%</td>
<td align="left" rowspan="1" colspan="1">85</td>
<td align="left" rowspan="1" colspan="1">4,550,014</td>
<td align="left" rowspan="1" colspan="1">98.04%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">10 MB</td>
<td align="left" rowspan="1" colspan="1">100.0%</td>
<td align="left" rowspan="1" colspan="1">97</td>
<td align="left" rowspan="1" colspan="1">4,545,871</td>
<td align="left" rowspan="1" colspan="1">97.97%</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot>
<fn id="nt103">
<label></label>
<p>
<bold>A comparison of assembling reads digitally normalized with low memory/high false positive rates. The reads were digitally normalized to C = 20 (see
<xref rid="pone.0101271-Brown1" ref-type="bibr">[21]</xref>
for more information) and were assembled using Velvet. We measured total length of assembly, as well as percent of true MG1655 genome covered by the assembly using QUAST.</bold>
</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>While these results are specific to Velvet and the coverage parameters used in digital normalization, they do suggest that no significant information loss occurs due to false positive rates below 80%. Further evaluation of assembly quality in response to different normalization parameters and assemblers is beyond the scope of of this paper.</p>
</sec>
</sec>
<sec id="s3">
<title>Discussion</title>
<sec id="s3a">
<title>khmer enables fast, memory-efficient online counting</title>
<p>khmer enables memory- and time-efficient online counting (
<xref ref-type="fig" rid="pone-0101271-g001">Figures 1</xref>
,
<xref ref-type="fig" rid="pone-0101271-g002">2</xref>
, and
<xref ref-type="fig" rid="pone-0101271-g004">4</xref>
). This is particularly important for the streaming approaches to data analysis needed as data set sizes increase. Because query and updating of k-mer counts can be done directly as data is being loaded, with no need for disk access or an indexing step, khmer can also perform well in situations with poor disk I/O performance. (Note that BFCounter also supports online k-mer counting
<xref rid="pone.0101271-Melsted1" ref-type="bibr">[6]</xref>
.)</p>
</sec>
<sec id="s3b">
<title>khmer is a generally useful k-mer counting approach</title>
<p>In addition to online counting, khmer offers a general range of useful performance tradeoffs for disk I/O, time and memory. From the performance comparison between khmer and other k-mer counting packages in calculating k-mer abundance distributions, khmer is comparable with existing packages. In time, khmer performs competitively with DSK and BFCounter (
<xref ref-type="fig" rid="pone-0101271-g001">Figure 1</xref>
); khmer also provides a way to systematically trade memory for miscounts across a wide range of parameters (
<xref ref-type="fig" rid="pone-0101271-g002">Figure 2</xref>
). khmer's uncompressed disk storage is competitive with Jellyfish, and, in situations where disk space is at a premium, khmer can take advantage of gzip compression to provide storage similar to that of DSK (
<xref ref-type="fig" rid="pone-0101271-g003">Figure 3</xref>
, purple line with boxes).</p>
<p>KMC, DSK, and KAnalyze perform especially well in memory usage for calculating the abundance distribution of k-mers. However, in exchange for this efficiency, retrieving specific k-mer counts at random is likely to be quite slow, as DSK is optimized for iterating across partition sets of k-mers rather than randomly accessing k-mer counts.</p>
<p>For retrieving the counts of individual k-mers, khmer is significantly faster than both Tallymer and Jellyfish. This is not surprising, since this was a primary motivation for the development of khmer.</p>
</sec>
<sec id="s3c">
<title>khmer memory usage is fixed and low</title>
<p>The memory usage of the basic Count-Min Sketch approach is fixed: khmer's memory usage does not increase as data is loaded. While this means that khmer will never crash due to memory limitations, and all operations can be performed in main memory without recourse to disk storage, the false positive rate may grow too high. Therefore the memory size must be chosen in light of the false positive rate and miscount acceptable for a given application. In practice, we recommend choosing the maximum available memory, because the false positive rate decreases with increasing memory and there are no negative effects to minimizing the false positive rate.</p>
<p>For any given data set, the size and number of hash tables will determine the accuracy of k-mer counting with khmer. Thus, the user can control the memory usage based on the desired level of accuracy (
<xref ref-type="fig" rid="pone-0101271-g002">Figure 2</xref>
). The time usage for the first step of k-mer counting, consuming the reads, depends on the total amount of data, since we must traverse every k-mer in every read. The second step, k-mer retrieval, is algorithmically constant for fixed k; however, for practicality, the hash tables are usually saved to and loaded from disk, meaning that k-mer retrieval time depends directly on the size of the database being queried.</p>
<p>The memory usage of khmer is particularly low for sparse data sets, especially since only main memory is used and no disk space is necessary beyond that required for the read data sets. This is no surprise: the information theoretic comparison in
<xref rid="pone.0101271-Pell1" ref-type="bibr">[15]</xref>
shows that, for sparse sequencing data sets, Bloom filters require considerably less memory than any possible exact information storage for a wide range of false positive rates and data set sparseness.</p>
<p>In our implementation we use 1 byte to store the count of each k-mer in the data structure. Thus the maximum count for a k-mer will be 255. In cases where tracking bigger counts is required, khmer also provides an option to use an STL map data structure to store counts above 255, with the trade-off of significantly higher memory usage. In the future, we may extend khmer to counters of arbitrary bit sizes.</p>
</sec>
<sec id="s3d">
<title>False positive rates in k-mer counting are low and predictable</title>
<p>The Count-Min Sketch is a probabilistic data structure with a one-sided error that results in random overestimates of k-mer frequency, but does not generate underestimates.</p>
<p>In the Count-Min Sketch, the total memory usage is fixed; the memory usage, the hash functions, and the total number of distinct objects counted all influence the accuracy of the count. While the probability of an inaccurate count can easily be estimated based on the hash table load, the miscount size is dependent on details of the frequency distribution of k-mers
<xref rid="pone.0101271-Cormode1" ref-type="bibr">[12]</xref>
.</p>
<p>More specifically, in the analysis of the Count-Min Sketch, the difference between the incorrect count and actual count is related to the total number of k-mers in a data set and the size of each hash table
<xref rid="pone.0101271-Cormode1" ref-type="bibr">[12]</xref>
. Further study has shown that the behavior of Count-Min Sketch depends on specific characteristics of the data set under consideration, especially left-skewness
<xref rid="pone.0101271-Rusu1" ref-type="bibr">[35]</xref>
,
<xref rid="pone.0101271-Cormode2" ref-type="bibr">[36]</xref>
. These probabilistic properties suit short reads from next generation sequencing data sets: the miscounts are low because of the highly left-skewed abundance distribution of k-mers in these data sets.</p>
<p>
<xref ref-type="fig" rid="pone-0101271-g005">Figures 5</xref>
and
<xref ref-type="fig" rid="pone-0101271-g006">6</xref>
demonstrate these properties well. We see more correct counting for error-prone reads from a genome than for error-free reads from a genome, with a normal distribution of k-mer abundance. Thus, this counting approach is especially suitable for high diversity data sets, such as metagenomic data, in which a larger proportion of k-mers are low abundance or unique due to sequencing errors.</p>
</sec>
<sec id="s3e">
<title>Real-world applications for khmer</title>
<p>For many applications, an approximate k-mer count is sufficient. For example, when eliminating reads with low abundance k-mers, we can tolerate a certain number of low-frequency k-mers remaining in the resulting data set falsely. If RAM-limited we can do the filtering iteratively so that at each step we are making more effective use of the available memory.</p>
<p>In practice, we have found that a false positive rate of between 1% and 10% offers acceptable miscount performance for a wide range of tasks, including error profiling, digital normalization and low-abundance read-trimming. Somewhat surprisingly, false positive rates of up to 80% can still be used for both read trimming and digital normalization in memory-limited circumstances, although multiple passes across the data may be needed.</p>
<p>For many applications, the fact that khmer does not break an imposed memory bound is extremely useful, since for many data sets — especially metagenomic data sets — high memory demands constrain analysis
<xref rid="pone.0101271-Howe1" ref-type="bibr">[11]</xref>
,
<xref rid="pone.0101271-Luo1" ref-type="bibr">[37]</xref>
. Moreover, because the false positive rate is straightforward to measure, the user can be warned that the results should be invalidated when too little memory is used. When combined with the graceful degradation of performance for both error trimming and digital normalization, khmer readily enables analysis of extremely large and diverse data sets
<xref rid="pone.0101271-Howe2" ref-type="bibr">[38]</xref>
. In an experiment to assemble the reads of a soil metagenomic sample collected from Iowa prairie, the number of reads to assemble drops from 3.3 million to 2.2 million and the size of the data set drops from 245GB to 145GB accordingly after digital normalization
<xref rid="pone.0101271-Howe1" ref-type="bibr">[11]</xref>
. 240GB memory was used in the process. This also shows that khmer works well to analyze large, real-world metagenomic data sets.</p>
</sec>
<sec id="s3f">
<title>Conclusion</title>
<p>K-mer counting is widely used in bioinformatics, and as sequencing data set sizes increase, graceful degradation of data structures in the face of large amounts of data has become important. This is especially true when the theoretical and practical effects of the degradation can be predicted (see e.g.
<xref rid="pone.0101271-Melsted1" ref-type="bibr">[6]</xref>
,
<xref rid="pone.0101271-Roy1" ref-type="bibr">[9]</xref>
,
<xref rid="pone.0101271-Pell1" ref-type="bibr">[15]</xref>
). This is a key property of the Count-Min Sketch approach, and its implementation in khmer.</p>
<p>The khmer software implementation offers good performance, a robust and well-tested Python API, and a number of useful and well-documented scripts. While Jellyfish, DSK, KMC, and Turtle also offer good performance, khmer is competitive, and, because it provides a Python API for online counting, is flexible. In memory-limited situations with poor I/O performance, khmer is particularly useful, because it will not break an imposed memory bound and does not require disk access to store or retrieve k-mer counts. However, in exchange for this memory guarantee, counting becomes increasingly incorrect as less memory is used or as the data set size grows large; in many situations this may be an acceptable tradeoff.</p>
</sec>
<sec id="s3g">
<title>Future considerations</title>
<p>Applying khmer to extremely large data sets with many distinct k-mers requires a large amount of memory: approximately 446 GB of memory is required to achieve an false positive rate of 1% for
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e036.jpg"></inline-graphic>
</inline-formula>
k-mers. It is possible to reduce the required memory by dividing k-mer space into multiple partitions and counting k-mers separately for each partition. Partitioning k-mer space into
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e037.jpg"></inline-graphic>
</inline-formula>
partitions results in a linear decrease in the number of k-mers under consideration, thus reducing the occupancy by a constant factor
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e038.jpg"></inline-graphic>
</inline-formula>
and correspondingly reducing the collision rate. Partitioning k-mer space is a generalization of the systematic prefix filtering approach, where one might first count all k-mers starting with AA, then AC, then AG, AT, CA, etc., which is equivalent to partitioning k-mer space into 16 equal-sized partitions. These partitions can be calculated independently, either across multiple machines or iteratively on a single machine, and the results stored for later comparison or analysis. This is similar to the approach taken by DSK
<xref rid="pone.0101271-Rizk1" ref-type="bibr">[7]</xref>
, and could easily be implemented in khmer.</p>
<p>Further optimization of khmer on single machines, e.g. for multi-core architectures, is unlikely to achieve significantly greater speed. Past a certain point k-mer counting is fundamentally I/O bound
<xref rid="pone.0101271-McDonald1" ref-type="bibr">[23]</xref>
.</p>
<p>Perhaps the most interesting future direction for probabilistic k-mer counting is that taken by Turtle
<xref rid="pone.0101271-Roy1" ref-type="bibr">[9]</xref>
, in which several data structures are provided, each with different tradeoffs, but with a common API. We hope to pursue this direction in the future by integrating such approaches into khmer.</p>
</sec>
</sec>
<sec sec-type="methods" id="s4">
<title>Methods</title>
<sec id="s4a">
<title>Code and data set availability</title>
<p>The version of khmer used to generate the results below is available at
<ext-link ext-link-type="uri" xlink:href="http://github.com/ged-lab/khmer.git">http://github.com/ged-lab/khmer.git</ext-link>
, tag ‘2013-khmer-counting’. Scripts specific to this paper are available in the paper repository at
<ext-link ext-link-type="uri" xlink:href="https://github.com/ged-lab/2013-khmer-counting">https://github.com/ged-lab/2013-khmer-counting</ext-link>
. The IPython
<xref rid="pone.0101271-Prez1" ref-type="bibr">[39]</xref>
notebook file and data analysis to generate the figures are also available in that github repository. Complete instructions to reproduce all of the results in this paper are available in the khmer-counting repository; see README.rst.</p>
</sec>
<sec id="s4b">
<title>Sequence Data</title>
<p>One human gut metagenome reads data set (MH0001) from the MetaHIT (Metagenomics of the Human Intestinal Tract) project
<xref rid="pone.0101271-Qin1" ref-type="bibr">[40]</xref>
was used. It contains approximately 59 million reads, each 44 bp long; it was trimmed to remove low quality sequences.</p>
<p>Five soil metagenomics reads data sets with different size were taken from the GPGC project for benchmark purpose (see
<xref rid="pone-0101271-t001" ref-type="table">Table 1</xref>
). These reads are from soil in Iowa region and they are filtered to make sure there are less than 30% Ns in the read and each read is longer than 30 bp. The exact data sets used for the paper are available on Amazon S3 and the instructions to acquire these data sets are available in the paper repository on github.com.</p>
<p>We also generated four short-read data sets to assess the false positive rate and miscount distribution. One is a subset of a real metagenomics data set from the MH0001 data set, above. The second consists of randomly generated reads. The third and fourth contain reads simulated from a random, 1 Mbp long genome. The third has a substitution error rate of 3%, and the fourth contains no errors. The four data sets were chosen to contain identical numbers of distinct 22-mers. The scripts necessary to regenerate these data are available in the paper repository on github.com.</p>
</sec>
<sec id="s4c">
<title>Count-Min Sketch implementation</title>
<p>We implemented the Count-Min Sketch data structure, a simple probabilistic data structure for counting distinct elements
<xref rid="pone.0101271-Cormode1" ref-type="bibr">[12]</xref>
. Our implementation uses
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e039.jpg"></inline-graphic>
</inline-formula>
independent hash tables, each containing a prime number of counters
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e040.jpg"></inline-graphic>
</inline-formula>
. The hashing function for each hash table is fixed, and reversibly converts each DNA k-mer (for
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e041.jpg"></inline-graphic>
</inline-formula>
) into a 64-bit number to which the modulus of the hash table size is applied. This provides
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e042.jpg"></inline-graphic>
</inline-formula>
distinct hash functions.</p>
<p>To increment the count associated with a k-mer, the counter associated with the hashed k-mer in each of the
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e043.jpg"></inline-graphic>
</inline-formula>
hash tables is incremented. To retrieve the count associated with a k-mer, the minimum count across all
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e044.jpg"></inline-graphic>
</inline-formula>
hash tables is chosen.</p>
<p>In this scheme, collisions are explicitly not handled, so the count associated with a k-mer may not be accurate. Because collisions only falsely
<italic>increment</italic>
counts, however, the retrieved count for any given k-mer is guaranteed to be no less than the correct count. Thus the counting error is one-sided.</p>
</sec>
<sec id="s4d">
<title>Hash function and khmer implementation</title>
<p>The current khmer hash function works only for
<inline-formula>
<inline-graphic xlink:href="pone.0101271.e045.jpg"></inline-graphic>
</inline-formula>
and converts DNA strings exactly into 64-bit numbers. However, any hash function would work. For example, a cyclic hash would enable khmer to count k-mers larger in size than 32; this would not change the scaling behavior of khmer at all, and is a planned extension.</p>
<p>By default khmer counts k-mers in DNA, i.e. strandedness is disregarded by having the hash function choose the lower numerical value for the exact hash of both a k-mer and its reverse complement. This behavior is configurable via a compile-time option.</p>
</sec>
<sec id="s4e">
<title>Comparing with other k-mer counting programs</title>
<p>We generated k-mer abundance distribution from five soil metagenomic reads data sets of different sizes using khmer, Tallymer, Jellyfish, DSK, BFCounter, KMC, Turtle and KAnalyze. If the software does not include function to generate k-mer abundance distribution directly, we output the frequency of each k-mer in an output file. We fixed k at 22 unless otherwise noted.</p>
<sec id="s4e1">
<title>khmer</title>
<p>For khmer, we set hash table sizes to fix the false positive rate at either 1%, 5% or 20%, and used 8 threads in loading the data.</p>
<p>We did the khmer random-access k-mer counting benchmark with a custom-written Python script khmer-count-kmers which loaded the database file and then used the Python API to query each k-mer individually.</p>
</sec>
<sec id="s4e2">
<title>Tallymer</title>
<p>Tallymer is from the genometools package version 1.3.4. For the suffixerator subroutine we used: -dna -pl -tis -suf -lcp.</p>
<p>We use the mkindex subroutine to generate k-mer abundance distribution, we used: -mersize 22.</p>
<p>The Tallymer random access k-mer counting benchmark was done using the ‘tallymer search’ routine against both strands; see the script tallymer-search.sh.</p>
</sec>
<sec id="s4e3">
<title>Jellyfish</title>
<p>The Jellyfish version used was 1.1.10 and the multithreading option is set to 8 threads.</p>
<p>Jellyfish uses a hash table to store the k-mers and the size of the hash table can be modified by the user. When the specified hash table fills up, Jellyfish writes it to the hard disk and initializes a new hash table. Here we use a similar strategy as in
<xref rid="pone.0101271-Melsted1" ref-type="bibr">[6]</xref>
and chose the minimum size of the hash tables for Jellyfish so that all k-mers were stored in memory.</p>
<p>We ran Jellyfish with the options as below:</p>
<p>jellyfish count -m 22 -c 2 -C for k = 22.</p>
<p>The Jellyfish random access k-mer counting benchmark was performed using the ‘query’ routine and querying against both strands; see the script jelly-search.sh.</p>
</sec>
<sec id="s4e4">
<title>DSK</title>
<p>We ran DSK with default parameters with -histo option to generate k-mer abundance distribution. The DSK version used was 1.5031.</p>
</sec>
<sec id="s4e5">
<title>BFCounter</title>
<p>The BFcounter version used was 1.0 and the multithreading option is set to 8 threads.</p>
<p>We ran BFCounter count subroutine with the options as below:</p>
<p>BFCounter count -k 22 -t 8 -c 100000. -n option representing the estimated number of k-mers is adjusted to the different test data sets.</p>
<p>This subroutine produces the actual count of k-mers in input files.</p>
<p>We ran BFCounter dump subroutine with the options as below: BFCounter dump -k 22.</p>
<p>This subroutine can write k-mer occurrences into a tab-separated text file.</p>
</sec>
<sec id="s4e6">
<title>KMC</title>
<p>The KMC version used was 0.3. We ran both kmc and kmc_dump subroutines with default parameters.</p>
</sec>
<sec id="s4e7">
<title>Turtle</title>
<p>The Turtle version used was 0.3. We ran scTurtle32 with the multithreading option set to 8 threads and -n option representing expected number of frequent k-mers is adjusted to different test data sets.</p>
</sec>
<sec id="s4e8">
<title>KAnalyze</title>
<p>The KAnalyze version used was 0.9.3. We ran count subroutine with default parameters.</p>
</sec>
</sec>
</sec>
</body>
<back>
<ack>
<p>We thank Eric McDonald for technical assistance with optimizing the khmer codebase.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="pone.0101271-Marais1">
<label>1</label>
<mixed-citation publication-type="journal">
<name>
<surname>Marçais</surname>
<given-names>G</given-names>
</name>
,
<name>
<surname>Kingsford</surname>
<given-names>C</given-names>
</name>
(
<year>2011</year>
)
<article-title>A fast, lock-free approach for efficient parallel counting of occurrences of k-mers</article-title>
.
<source>Bioinformatics</source>
<volume>27</volume>
:
<fpage>764</fpage>
<lpage>770</lpage>
.
<pub-id pub-id-type="pmid">21217122</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0101271-Kurtz1">
<label>2</label>
<mixed-citation publication-type="journal">
<name>
<surname>Kurtz</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Narechania</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Stein</surname>
<given-names>JC</given-names>
</name>
,
<name>
<surname>Ware</surname>
<given-names>D</given-names>
</name>
(
<year>2008</year>
)
<article-title>A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes</article-title>
.
<source>BMC Genomics</source>
<volume>9</volume>
:
<fpage>517</fpage>
.
<pub-id pub-id-type="pmid">18976482</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0101271-Metzker1">
<label>3</label>
<mixed-citation publication-type="journal">
<name>
<surname>Metzker</surname>
<given-names>M</given-names>
</name>
(
<year>2010</year>
)
<article-title>Sequencing technologies - the next generation</article-title>
.
<source>Nat Rev Genet</source>
<volume>11</volume>
:
<fpage>31</fpage>
<lpage>46</lpage>
.
<pub-id pub-id-type="pmid">19997069</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0101271-Conway1">
<label>4</label>
<mixed-citation publication-type="journal">
<name>
<surname>Conway</surname>
<given-names>TC</given-names>
</name>
,
<name>
<surname>Bromage</surname>
<given-names>AJ</given-names>
</name>
(
<year>2011</year>
)
<article-title>Succinct data structures for assembling large genomes</article-title>
.
<source>Bioinfor-matics</source>
<volume>27</volume>
:
<fpage>479</fpage>
<lpage>86</lpage>
.</mixed-citation>
</ref>
<ref id="pone.0101271-Minoche1">
<label>5</label>
<mixed-citation publication-type="journal">
<name>
<surname>Minoche</surname>
<given-names>AE</given-names>
</name>
,
<name>
<surname>Dohm</surname>
<given-names>JC</given-names>
</name>
,
<name>
<surname>Himmelbauer</surname>
<given-names>H</given-names>
</name>
(
<year>2011</year>
)
<article-title>Evaluation of genomic high-throughput sequencing data generated on illumina hiseq and genome analyzer systems</article-title>
.
<source>Genome Biol</source>
<volume>12</volume>
:
<fpage>R112</fpage>
.
<pub-id pub-id-type="pmid">22067484</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0101271-Melsted1">
<label>6</label>
<mixed-citation publication-type="journal">
<name>
<surname>Melsted</surname>
<given-names>P</given-names>
</name>
,
<name>
<surname>Pritchard</surname>
<given-names>JK</given-names>
</name>
(
<year>2011</year>
)
<article-title>Efficient counting of k-mers in DNA sequences using a bloom filter</article-title>
.
<source>BMC bioinformatics</source>
<volume>12</volume>
:
<fpage>333</fpage>
.
<pub-id pub-id-type="pmid">21831268</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0101271-Rizk1">
<label>7</label>
<mixed-citation publication-type="journal">
<name>
<surname>Rizk</surname>
<given-names>G</given-names>
</name>
,
<name>
<surname>Lavenier</surname>
<given-names>D</given-names>
</name>
,
<name>
<surname>Chikhi</surname>
<given-names>R</given-names>
</name>
(
<year>2013</year>
)
<article-title>Dsk: k-mer counting with very low memory usage</article-title>
.
<source>Bioinfor- matics</source>
<volume>29</volume>
:
<fpage>652</fpage>
<lpage>3</lpage>
.</mixed-citation>
</ref>
<ref id="pone.0101271-Deorowicz1">
<label>8</label>
<mixed-citation publication-type="journal">
<name>
<surname>Deorowicz</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Debudaj-Grabysz</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Grabowski</surname>
<given-names>S</given-names>
</name>
(
<year>2013</year>
)
<article-title>Disk-based k-mer counting on a pc</article-title>
.
<source>BMC Bioinformatics</source>
<volume>14</volume>
:
<fpage>160</fpage>
.
<pub-id pub-id-type="pmid">23679007</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0101271-Roy1">
<label>9</label>
<mixed-citation publication-type="journal">
<name>
<surname>Roy</surname>
<given-names>RS</given-names>
</name>
,
<name>
<surname>Bhattacharya</surname>
<given-names>D</given-names>
</name>
,
<name>
<surname>Schliep</surname>
<given-names>A</given-names>
</name>
(
<year>2014</year>
)
<article-title>Turtle: Identifying frequent k-mers with cache-efficient algorithms</article-title>
.
<source>Bioinformatics: Advance Access published March 10, 2014</source>
:
<pub-id pub-id-type="doi">10.1093/bioinformat-ics/btu132</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0101271-Audano1">
<label>10</label>
<mixed-citation publication-type="journal">
<name>
<surname>Audano</surname>
<given-names>P</given-names>
</name>
,
<name>
<surname>Vannberg</surname>
<given-names>F</given-names>
</name>
(
<year>2014</year>
)
<article-title>Kanalyze: A fast versatile pipelined k-mer toolkit</article-title>
.
<source>Bioinformatics: Advance Access published March 18, 2014</source>
:
<pub-id pub-id-type="doi">10.1093/bioinformatics/btu152</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0101271-Howe1">
<label>11</label>
<mixed-citation publication-type="journal">
<name>
<surname>Howe</surname>
<given-names>AC</given-names>
</name>
,
<name>
<surname>Jansson</surname>
<given-names>JK</given-names>
</name>
,
<name>
<surname>Malfatti</surname>
<given-names>SA</given-names>
</name>
,
<name>
<surname>Tringe</surname>
<given-names>SG</given-names>
</name>
,
<name>
<surname>Tiedje</surname>
<given-names>JM</given-names>
</name>
,
<etal>et al</etal>
(
<year>2014</year>
)
<article-title>Tackling soil diversity with the assembly of large, complex metagenomes</article-title>
.
<source>Proc Natl Acad Sci U S A</source>
<volume>111</volume>
:
<fpage>4904</fpage>
<lpage>9</lpage>
.
<pub-id pub-id-type="pmid">24632729</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0101271-Cormode1">
<label>12</label>
<mixed-citation publication-type="journal">
<name>
<surname>Cormode</surname>
<given-names>G</given-names>
</name>
,
<name>
<surname>Muthukrishnan</surname>
<given-names>S</given-names>
</name>
(
<year>2005</year>
)
<article-title>An improved data stream summary: the count-min sketch and its applications</article-title>
.
<source>Journal of Algorithms</source>
<volume>55</volume>
:
<fpage>58</fpage>
<lpage>75</lpage>
.</mixed-citation>
</ref>
<ref id="pone.0101271-Bloom1">
<label>13</label>
<mixed-citation publication-type="journal">
<name>
<surname>Bloom</surname>
<given-names>BH</given-names>
</name>
(
<year>1970</year>
)
<article-title>Space/time trade-offs in hash coding with allowable errors</article-title>
.
<source>Commun ACM</source>
<volume>13</volume>
:
<fpage>422</fpage>
<lpage>426</lpage>
.</mixed-citation>
</ref>
<ref id="pone.0101271-Malde1">
<label>14</label>
<mixed-citation publication-type="book">Malde K, O'Sullivan B (2009) Using bloom filters for large scale gene sequence analysis in haskell. In: Gill A, Swift T, editors, PADL. Springer, volume 5418 of
<italic>Lecture Notes in Computer Science</italic>
, pp. 183–194.</mixed-citation>
</ref>
<ref id="pone.0101271-Pell1">
<label>15</label>
<mixed-citation publication-type="journal">
<name>
<surname>Pell</surname>
<given-names>J</given-names>
</name>
,
<name>
<surname>Hintze</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Canino-Koning</surname>
<given-names>R</given-names>
</name>
,
<name>
<surname>Howe</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Tiedje</surname>
<given-names>JM</given-names>
</name>
,
<etal>et al</etal>
(
<year>2012</year>
)
<article-title>Scaling metagenome sequence assembly with probabilistic de bruijn graphs</article-title>
.
<source>Proc Natl Acad Sci U S A</source>
<volume>109</volume>
:
<fpage>13272</fpage>
<lpage>7</lpage>
.
<pub-id pub-id-type="pmid">22847406</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0101271-Jones1">
<label>16</label>
<mixed-citation publication-type="journal">
<name>
<surname>Jones</surname>
<given-names>DC</given-names>
</name>
,
<name>
<surname>Ruzzo</surname>
<given-names>WL</given-names>
</name>
,
<name>
<surname>Peng</surname>
<given-names>X</given-names>
</name>
,
<name>
<surname>Katze</surname>
<given-names>MG</given-names>
</name>
(
<year>2012</year>
)
<article-title>Compression of next-generation sequencing reads aided by highly efficient de novo assembly</article-title>
.
<source>Nucleic Acids Res</source>
<volume>40</volume>
:
<fpage>e171</fpage>
.
<pub-id pub-id-type="pmid">22904078</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0101271-Broder1">
<label>17</label>
<mixed-citation publication-type="journal">
<name>
<surname>Broder</surname>
<given-names>AZ</given-names>
</name>
,
<name>
<surname>Mitzenmacher</surname>
<given-names>M</given-names>
</name>
(
<year>2003</year>
)
<article-title>Survey: Network applications of bloom filters: A survey</article-title>
.
<source>Internet Mathematics</source>
<volume>1</volume>
:
<fpage>485</fpage>
<lpage>509</lpage>
.</mixed-citation>
</ref>
<ref id="pone.0101271-Fan1">
<label>18</label>
<mixed-citation publication-type="journal">
<name>
<surname>Fan</surname>
<given-names>L</given-names>
</name>
,
<name>
<surname>Cao</surname>
<given-names>P</given-names>
</name>
,
<name>
<surname>Almeida</surname>
<given-names>J</given-names>
</name>
,
<name>
<surname>Broder</surname>
<given-names>AZ</given-names>
</name>
(
<year>2000</year>
)
<article-title>Summary cache: A scalable wide-area web cache sharing protocol</article-title>
.
<source>IEEE/ACM Trans Netw</source>
<volume>8</volume>
:
<fpage>281</fpage>
<lpage>293</lpage>
.</mixed-citation>
</ref>
<ref id="pone.0101271-Estan1">
<label>19</label>
<mixed-citation publication-type="book">Estan C, Varghese G (2002) New directions in traffic measurement and accounting. In: SIGCOMM. ACM, pp. 323–336.</mixed-citation>
</ref>
<ref id="pone.0101271-Cohen1">
<label>20</label>
<mixed-citation publication-type="book">Cohen S, Matias Y (2003) Spectral bloom filters. In: Halevy AY, Ives ZG, Doan A, editors, SIGMOD Conference. ACM, pp. 241–252.</mixed-citation>
</ref>
<ref id="pone.0101271-Brown1">
<label>21</label>
<mixed-citation publication-type="other">Brown CT, Howe A, Zhang Q, Pyrkosz AB, Brom TH (2012) A reference-free algorithm for com- putational normalization of shotgun sequencing data. arXiv: 1203.4802.</mixed-citation>
</ref>
<ref id="pone.0101271-Crusoe1">
<label>22</label>
<mixed-citation publication-type="other">Crusoe M, Edvenson G, Fish J, Howe A, McDonald E,
<etal>et al</etal>
(2014) The khmer software package: enabling efficient sequence analysis. URL
<pub-id pub-id-type="doi">10.6084/m9.figshare.979190</pub-id>
.</mixed-citation>
</ref>
<ref id="pone.0101271-McDonald1">
<label>23</label>
<mixed-citation publication-type="book">McDonald E, Brown CT (2013) Working with big data in bioinformatics. In: Armstrong T, editor, The Performance of Open Source Applications, lulu.com, chapter 12. p. 151.</mixed-citation>
</ref>
<ref id="pone.0101271-Broder2">
<label>24</label>
<mixed-citation publication-type="journal">
<name>
<surname>Broder</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Mitzenmacher</surname>
<given-names>M</given-names>
</name>
(
<year>2004</year>
)
<article-title>Network applications of bloom filters: A survey</article-title>
.
<source>Internet mathematics</source>
<volume>1</volume>
:
<fpage>485</fpage>
<lpage>509</lpage>
.</mixed-citation>
</ref>
<ref id="pone.0101271-Flajolet1">
<label>25</label>
<mixed-citation publication-type="journal">
<name>
<surname>Flajolet</surname>
<given-names>P</given-names>
</name>
,
<name>
<surname>Fusy</surname>
<given-names>É</given-names>
</name>
,
<name>
<surname>Gandouet</surname>
<given-names>O</given-names>
</name>
,
<name>
<surname>Meunier</surname>
<given-names>F</given-names>
</name>
(
<year>2008</year>
)
<article-title>Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm</article-title>
.
<source>DMTCS Proceedings</source>
.</mixed-citation>
</ref>
<ref id="pone.0101271-Chikhi1">
<label>26</label>
<mixed-citation publication-type="journal">
<name>
<surname>Chikhi</surname>
<given-names>R</given-names>
</name>
,
<name>
<surname>Medvedev</surname>
<given-names>P</given-names>
</name>
(
<year>2014</year>
)
<article-title>Informed and automated k-mer size selection for genome assembly</article-title>
.
<source>Bioinformatics</source>
<volume>30</volume>
:
<fpage>31</fpage>
<lpage>7</lpage>
.
<pub-id pub-id-type="pmid">23732276</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0101271-Medvedev1">
<label>27</label>
<mixed-citation publication-type="journal">
<name>
<surname>Medvedev</surname>
<given-names>P</given-names>
</name>
,
<name>
<surname>Scott</surname>
<given-names>E</given-names>
</name>
,
<name>
<surname>Kakaradov</surname>
<given-names>B</given-names>
</name>
,
<name>
<surname>Pevzner</surname>
<given-names>P</given-names>
</name>
(
<year>2011</year>
)
<article-title>Error correction of high-throughput se- quencing datasets with non-uniform coverage</article-title>
.
<source>Bioinformatics</source>
<volume>27</volume>
:
<fpage>i137</fpage>
<lpage>41</lpage>
.
<pub-id pub-id-type="pmid">21685062</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0101271-Pevzner1">
<label>28</label>
<mixed-citation publication-type="journal">
<name>
<surname>Pevzner</surname>
<given-names>PA</given-names>
</name>
,
<name>
<surname>Tang</surname>
<given-names>H</given-names>
</name>
,
<name>
<surname>Waterman</surname>
<given-names>MS</given-names>
</name>
(
<year>2001</year>
)
<article-title>An eulerian path approach to dna fragment assembly</article-title>
.
<source>Proc Natl Acad Sci U S A</source>
<volume>98</volume>
:
<fpage>9748</fpage>
<lpage>53</lpage>
.
<pub-id pub-id-type="pmid">11504945</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0101271-Li1">
<label>29</label>
<mixed-citation publication-type="journal">
<name>
<surname>Li</surname>
<given-names>X</given-names>
</name>
,
<name>
<surname>Waterman</surname>
<given-names>MS</given-names>
</name>
(
<year>2003</year>
)
<article-title>Estimating the repeat structure and length of dna sequences using l-tuples</article-title>
.
<source>Genome Res</source>
<volume>13</volume>
:
<fpage>1916</fpage>
<lpage>22</lpage>
.
<pub-id pub-id-type="pmid">12902383</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0101271-Kelley1">
<label>30</label>
<mixed-citation publication-type="journal">
<name>
<surname>Kelley</surname>
<given-names>DR</given-names>
</name>
,
<name>
<surname>Schatz</surname>
<given-names>MC</given-names>
</name>
,
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
(
<year>2010</year>
)
<article-title>Quake: quality-aware detection and correction of sequencing errors</article-title>
.
<source>Genome Biol</source>
<volume>11</volume>
:
<fpage>R116</fpage>
.
<pub-id pub-id-type="pmid">21114842</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0101271-Chitsaz1">
<label>31</label>
<mixed-citation publication-type="journal">
<name>
<surname>Chitsaz</surname>
<given-names>H</given-names>
</name>
,
<name>
<surname>Yee-Greenbaum</surname>
<given-names>J</given-names>
</name>
,
<name>
<surname>Tesler</surname>
<given-names>G</given-names>
</name>
,
<name>
<surname>Lombardo</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Dupont</surname>
<given-names>C</given-names>
</name>
,
<etal>et al</etal>
(
<year>2011</year>
)
<article-title>Efficient de novo assembly of single-cell bacterial genomes from short-read data sets</article-title>
.
<source>Nat Biotechnol</source>
<volume>29</volume>
:
<fpage>915</fpage>
<lpage>21</lpage>
.
<pub-id pub-id-type="pmid">21926975</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0101271-Haas1">
<label>32</label>
<mixed-citation publication-type="journal">
<name>
<surname>Haas</surname>
<given-names>BJ</given-names>
</name>
,
<name>
<surname>Papanicolaou</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Yassour</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Grabherr</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Blood</surname>
<given-names>PD</given-names>
</name>
,
<etal>et al</etal>
(
<year>2013</year>
)
<article-title>De novo transcript sequence reconstruction from rna-seq using the trinity platform for reference generation and analysis</article-title>
.
<source>Nat Protoc</source>
<volume>8</volume>
:
<fpage>1494</fpage>
<lpage>512</lpage>
.
<pub-id pub-id-type="pmid">23845962</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0101271-Brown2">
<label>33</label>
<mixed-citation publication-type="other">Brown CT (2012) What does trinity's in silico normalization do? URL
<pub-id pub-id-type="doi">10.6084/m9.figshare.98198</pub-id>
.</mixed-citation>
</ref>
<ref id="pone.0101271-Zerbino1">
<label>34</label>
<mixed-citation publication-type="journal">
<name>
<surname>Zerbino</surname>
<given-names>DR</given-names>
</name>
,
<name>
<surname>Birney</surname>
<given-names>E</given-names>
</name>
(
<year>2008</year>
)
<article-title>Velvet: algorithms for de novo short read assembly using de bruijn graphs</article-title>
.
<source>Genome Res</source>
<volume>18</volume>
:
<fpage>821</fpage>
<lpage>9</lpage>
.
<pub-id pub-id-type="pmid">18349386</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0101271-Rusu1">
<label>35</label>
<mixed-citation publication-type="journal">
<name>
<surname>Rusu</surname>
<given-names>F</given-names>
</name>
,
<name>
<surname>Dobra</surname>
<given-names>A</given-names>
</name>
(
<year>2008</year>
)
<article-title>Sketches for size of join estimation</article-title>
.
<source>ACM Transactions on Database Systems</source>
<volume>33</volume>
:
<fpage>1</fpage>
<lpage>46</lpage>
.</mixed-citation>
</ref>
<ref id="pone.0101271-Cormode2">
<label>36</label>
<mixed-citation publication-type="book">Cormode G, Muthukrishnan S (2005) Summarizing and mining skewed data streams. In: Kargupta H, Srivastava J, Kamath C, Goodman A, editors, SDM. SIAM, pp. 44–55.</mixed-citation>
</ref>
<ref id="pone.0101271-Luo1">
<label>37</label>
<mixed-citation publication-type="journal">
<name>
<surname>Luo</surname>
<given-names>W</given-names>
</name>
,
<name>
<surname>Friedman</surname>
<given-names>MS</given-names>
</name>
,
<name>
<surname>Shedden</surname>
<given-names>K</given-names>
</name>
,
<name>
<surname>Hankenson</surname>
<given-names>KD</given-names>
</name>
,
<name>
<surname>Woolf</surname>
<given-names>PJ</given-names>
</name>
(
<year>2009</year>
)
<article-title>Gage: generally applicable gene set enrichment for pathway analysis</article-title>
.
<source>BMC Bioinformatics</source>
<volume>10</volume>
:
<fpage>161</fpage>
.
<pub-id pub-id-type="pmid">19473525</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0101271-Howe2">
<label>38</label>
<mixed-citation publication-type="other">Howe AC, Pell J, Canino-Koning R, Mackelprang R, Tringe S,
<etal>et al</etal>
. (2012) Illumina sequencing artifacts revealed by connectivity analysis of metagenomic datasets. arXiv: 1212.0159.</mixed-citation>
</ref>
<ref id="pone.0101271-Prez1">
<label>39</label>
<mixed-citation publication-type="journal">
<name>
<surname>Pérez</surname>
<given-names>F</given-names>
</name>
,
<name>
<surname>Granger</surname>
<given-names>B</given-names>
</name>
(
<year>2007</year>
)
<article-title>Ipython: A system for interactive scientific computing</article-title>
.
<source>Computing in Science Engineering</source>
<volume>9</volume>
:
<fpage>21</fpage>
<lpage>29</lpage>
.</mixed-citation>
</ref>
<ref id="pone.0101271-Qin1">
<label>40</label>
<mixed-citation publication-type="journal">
<name>
<surname>Qin</surname>
<given-names>J</given-names>
</name>
,
<name>
<surname>Li</surname>
<given-names>R</given-names>
</name>
,
<name>
<surname>Raes</surname>
<given-names>J</given-names>
</name>
,
<name>
<surname>Arumugam</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Burgdorf</surname>
<given-names>KS</given-names>
</name>
,
<etal>et al</etal>
(
<year>2010</year>
)
<article-title>A human gut microbial gene catalogue established by metagenomic sequencing</article-title>
.
<source>Nature</source>
<volume>464</volume>
:
<fpage>59</fpage>
<lpage>65</lpage>
.
<pub-id pub-id-type="pmid">20203603</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 0010889 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 0010889 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021