MERS Exploration Server


Internal identifier: 000368 (Pmc/Corpus); previous: 0003679; next: 0003690



The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">KrakenUniq: confident and fast metagenomics classification using unique
<italic>k</italic>
-mer counts</title>
<author>
<name sortKey="Breitwieser, F P" sort="Breitwieser, F P" uniqKey="Breitwieser F" first="F. P." last="Breitwieser">F. P. Breitwieser</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2171 9311</institution-id>
<institution-id institution-id-type="GRID">grid.21107.35</institution-id>
<institution>Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine,</institution>
<institution>Johns Hopkins School of Medicine,</institution>
</institution-wrap>
Baltimore, MD USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Baker, D N" sort="Baker, D N" uniqKey="Baker D" first="D. N." last="Baker">D. N. Baker</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2171 9311</institution-id>
<institution-id institution-id-type="GRID">grid.21107.35</institution-id>
<institution>Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine,</institution>
<institution>Johns Hopkins School of Medicine,</institution>
</institution-wrap>
Baltimore, MD USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2171 9311</institution-id>
<institution-id institution-id-type="GRID">grid.21107.35</institution-id>
<institution>Department of Computer Science,</institution>
<institution>Johns Hopkins University,</institution>
</institution-wrap>
Baltimore, MD USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Salzberg, S L" sort="Salzberg, S L" uniqKey="Salzberg S" first="S. L." last="Salzberg">S. L. Salzberg</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2171 9311</institution-id>
<institution-id institution-id-type="GRID">grid.21107.35</institution-id>
<institution>Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine,</institution>
<institution>Johns Hopkins School of Medicine,</institution>
</institution-wrap>
Baltimore, MD USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2171 9311</institution-id>
<institution-id institution-id-type="GRID">grid.21107.35</institution-id>
<institution>Department of Computer Science,</institution>
<institution>Johns Hopkins University,</institution>
</institution-wrap>
Baltimore, MD USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff3">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2171 9311</institution-id>
<institution-id institution-id-type="GRID">grid.21107.35</institution-id>
<institution>Departments of Biomedical Engineering and Biostatistics,</institution>
<institution>Johns Hopkins University,</institution>
</institution-wrap>
Baltimore, MD USA</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">30445993</idno>
<idno type="pmc">6238331</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6238331</idno>
<idno type="RBID">PMC:6238331</idno>
<idno type="doi">10.1186/s13059-018-1568-0</idno>
<date when="2018">2018</date>
<idno type="wicri:Area/Pmc/Corpus">000368</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000368</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">KrakenUniq: confident and fast metagenomics classification using unique
<italic>k</italic>
-mer counts</title>
<author>
<name sortKey="Breitwieser, F P" sort="Breitwieser, F P" uniqKey="Breitwieser F" first="F. P." last="Breitwieser">F. P. Breitwieser</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2171 9311</institution-id>
<institution-id institution-id-type="GRID">grid.21107.35</institution-id>
<institution>Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine,</institution>
<institution>Johns Hopkins School of Medicine,</institution>
</institution-wrap>
Baltimore, MD USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Baker, D N" sort="Baker, D N" uniqKey="Baker D" first="D. N." last="Baker">D. N. Baker</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2171 9311</institution-id>
<institution-id institution-id-type="GRID">grid.21107.35</institution-id>
<institution>Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine,</institution>
<institution>Johns Hopkins School of Medicine,</institution>
</institution-wrap>
Baltimore, MD USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2171 9311</institution-id>
<institution-id institution-id-type="GRID">grid.21107.35</institution-id>
<institution>Department of Computer Science,</institution>
<institution>Johns Hopkins University,</institution>
</institution-wrap>
Baltimore, MD USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Salzberg, S L" sort="Salzberg, S L" uniqKey="Salzberg S" first="S. L." last="Salzberg">S. L. Salzberg</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2171 9311</institution-id>
<institution-id institution-id-type="GRID">grid.21107.35</institution-id>
<institution>Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine,</institution>
<institution>Johns Hopkins School of Medicine,</institution>
</institution-wrap>
Baltimore, MD USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2171 9311</institution-id>
<institution-id institution-id-type="GRID">grid.21107.35</institution-id>
<institution>Department of Computer Science,</institution>
<institution>Johns Hopkins University,</institution>
</institution-wrap>
Baltimore, MD USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff3">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2171 9311</institution-id>
<institution-id institution-id-type="GRID">grid.21107.35</institution-id>
<institution>Departments of Biomedical Engineering and Biostatistics,</institution>
<institution>Johns Hopkins University,</institution>
</institution-wrap>
Baltimore, MD USA</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Genome Biology</title>
<idno type="ISSN">1474-7596</idno>
<idno type="eISSN">1474-760X</idno>
<imprint>
<date when="2018">2018</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p id="Par1">False-positive identifications are a significant problem in metagenomics classification. We present KrakenUniq, a novel metagenomics classifier that combines the fast
<italic>k</italic>
-mer-based classification of Kraken with an efficient algorithm for assessing the coverage of unique
<italic>k</italic>
-mers found in each species in a dataset. On various test datasets, KrakenUniq gives better recall and precision than other methods and effectively classifies and distinguishes pathogens with low abundance from false positives in infectious disease samples. By using the probabilistic cardinality estimator HyperLogLog, KrakenUniq runs as fast as Kraken and requires little additional memory. KrakenUniq is freely available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/fbreitwieser/krakenuniq">https://github.com/fbreitwieser/krakenuniq</ext-link>
.</p>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (10.1186/s13059-018-1568-0) contains supplementary material, which is available to authorized users.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salter, Sj" uniqKey="Salter S">SJ Salter</name>
</author>
<author>
<name sortKey="Cox, Mj" uniqKey="Cox M">MJ Cox</name>
</author>
<author>
<name sortKey="Turek, Em" uniqKey="Turek E">EM Turek</name>
</author>
<author>
<name sortKey="Calus, St" uniqKey="Calus S">ST Calus</name>
</author>
<author>
<name sortKey="Cookson, Wo" uniqKey="Cookson W">WO Cookson</name>
</author>
<author>
<name sortKey="Moffatt, Mf" uniqKey="Moffatt M">MF Moffatt</name>
</author>
<author>
<name sortKey="Turner, P" uniqKey="Turner P">P Turner</name>
</author>
<author>
<name sortKey="Parkhill, J" uniqKey="Parkhill J">J Parkhill</name>
</author>
<author>
<name sortKey="Loman, Nj" uniqKey="Loman N">NJ Loman</name>
</author>
<author>
<name sortKey="Walker, Aw" uniqKey="Walker A">AW Walker</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Thoendel, M" uniqKey="Thoendel M">M Thoendel</name>
</author>
<author>
<name sortKey="Jeraldo, P" uniqKey="Jeraldo P">P Jeraldo</name>
</author>
<author>
<name sortKey="Greenwood Quaintance, Ke" uniqKey="Greenwood Quaintance K">KE Greenwood-Quaintance</name>
</author>
<author>
<name sortKey="Yao, J" uniqKey="Yao J">J Yao</name>
</author>
<author>
<name sortKey="Chia, N" uniqKey="Chia N">N Chia</name>
</author>
<author>
<name sortKey="Hanssen, Ad" uniqKey="Hanssen A">AD Hanssen</name>
</author>
<author>
<name sortKey="Abdel, Mp" uniqKey="Abdel M">MP Abdel</name>
</author>
<author>
<name sortKey="Patel, R" uniqKey="Patel R">R Patel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
<author>
<name sortKey="Breitwieser, Fp" uniqKey="Breitwieser F">FP Breitwieser</name>
</author>
<author>
<name sortKey="Kumar, A" uniqKey="Kumar A">A Kumar</name>
</author>
<author>
<name sortKey="Hao, H" uniqKey="Hao H">H Hao</name>
</author>
<author>
<name sortKey="Burger, P" uniqKey="Burger P">P Burger</name>
</author>
<author>
<name sortKey="Rodriguez, Fj" uniqKey="Rodriguez F">FJ Rodriguez</name>
</author>
<author>
<name sortKey="Lim, M" uniqKey="Lim M">M Lim</name>
</author>
<author>
<name sortKey="Quinones Hinojosa, A" uniqKey="Quinones Hinojosa A">A Quinones-Hinojosa</name>
</author>
<author>
<name sortKey="Gallia, Gl" uniqKey="Gallia G">GL Gallia</name>
</author>
<author>
<name sortKey="Tornheim, Ja" uniqKey="Tornheim J">JA Tornheim</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Brown, Jr" uniqKey="Brown J">JR Brown</name>
</author>
<author>
<name sortKey="Bharucha, T" uniqKey="Bharucha T">T Bharucha</name>
</author>
<author>
<name sortKey="Breuer, J" uniqKey="Breuer J">J Breuer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mukherjee, S" uniqKey="Mukherjee S">S Mukherjee</name>
</author>
<author>
<name sortKey="Huntemann, M" uniqKey="Huntemann M">M Huntemann</name>
</author>
<author>
<name sortKey="Ivanova, N" uniqKey="Ivanova N">N Ivanova</name>
</author>
<author>
<name sortKey="Kyrpides, Nc" uniqKey="Kyrpides N">NC Kyrpides</name>
</author>
<author>
<name sortKey="Pati, A" uniqKey="Pati A">A Pati</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dadi, Th" uniqKey="Dadi T">TH Dadi</name>
</author>
<author>
<name sortKey="Renard, By" uniqKey="Renard B">BY Renard</name>
</author>
<author>
<name sortKey="Wieler, Lh" uniqKey="Wieler L">LH Wieler</name>
</author>
<author>
<name sortKey="Semmler, T" uniqKey="Semmler T">T Semmler</name>
</author>
<author>
<name sortKey="Reinert, K" uniqKey="Reinert K">K Reinert</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Quince, C" uniqKey="Quince C">C Quince</name>
</author>
<author>
<name sortKey="Walker, Aw" uniqKey="Walker A">AW Walker</name>
</author>
<author>
<name sortKey="Simpson, Jt" uniqKey="Simpson J">JT Simpson</name>
</author>
<author>
<name sortKey="Loman, Nj" uniqKey="Loman N">NJ Loman</name>
</author>
<author>
<name sortKey="Segata, N" uniqKey="Segata N">N Segata</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wood, De" uniqKey="Wood D">DE Wood</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Flajolet, P" uniqKey="Flajolet P">P Flajolet</name>
</author>
<author>
<name sortKey="Fusy, E" uniqKey="Fusy E">É Fusy</name>
</author>
<author>
<name sortKey="Gandouet, O" uniqKey="Gandouet O">O Gandouet</name>
</author>
<author>
<name sortKey="Meunier, F" uniqKey="Meunier F">F Meunier</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Brister, Jr" uniqKey="Brister J">JR Brister</name>
</author>
<author>
<name sortKey="Ako Adjei, D" uniqKey="Ako Adjei D">D Ako-Adjei</name>
</author>
<author>
<name sortKey="Bao, Y" uniqKey="Bao Y">Y Bao</name>
</author>
<author>
<name sortKey="Blinkova, O" uniqKey="Blinkova O">O Blinkova</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mcintyre, Abr" uniqKey="Mcintyre A">ABR McIntyre</name>
</author>
<author>
<name sortKey="Ounit, R" uniqKey="Ounit R">R Ounit</name>
</author>
<author>
<name sortKey="Afshinnekoo, E" uniqKey="Afshinnekoo E">E Afshinnekoo</name>
</author>
<author>
<name sortKey="Prill, Rj" uniqKey="Prill R">RJ Prill</name>
</author>
<author>
<name sortKey="Henaff, E" uniqKey="Henaff E">E Henaff</name>
</author>
<author>
<name sortKey="Alexander, N" uniqKey="Alexander N">N Alexander</name>
</author>
<author>
<name sortKey="Minot, Ss" uniqKey="Minot S">SS Minot</name>
</author>
<author>
<name sortKey="Danko, D" uniqKey="Danko D">D Danko</name>
</author>
<author>
<name sortKey="Foox, J" uniqKey="Foox J">J Foox</name>
</author>
<author>
<name sortKey="Ahsanuddin, S" uniqKey="Ahsanuddin S">S Ahsanuddin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Altschul, Sf" uniqKey="Altschul S">SF Altschul</name>
</author>
<author>
<name sortKey="Gish, W" uniqKey="Gish W">W Gish</name>
</author>
<author>
<name sortKey="Miller, W" uniqKey="Miller W">W Miller</name>
</author>
<author>
<name sortKey="Myers, Ew" uniqKey="Myers E">EW Myers</name>
</author>
<author>
<name sortKey="Lipman, Dj" uniqKey="Lipman D">DJ Lipman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huson, Dh" uniqKey="Huson D">DH Huson</name>
</author>
<author>
<name sortKey="Auch, Af" uniqKey="Auch A">AF Auch</name>
</author>
<author>
<name sortKey="Qi, J" uniqKey="Qi J">J Qi</name>
</author>
<author>
<name sortKey="Schuster, Sc" uniqKey="Schuster S">SC Schuster</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Buchfink, B" uniqKey="Buchfink B">B Buchfink</name>
</author>
<author>
<name sortKey="Xie, C" uniqKey="Xie C">C Xie</name>
</author>
<author>
<name sortKey="Huson, Dh" uniqKey="Huson D">DH Huson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sobih, A" uniqKey="Sobih A">A Sobih</name>
</author>
<author>
<name sortKey="Tomescu, Ai" uniqKey="Tomescu A">AI Tomescu</name>
</author>
<author>
<name sortKey="M Kinen, V" uniqKey="M Kinen V">V Mäkinen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ounit, R" uniqKey="Ounit R">R Ounit</name>
</author>
<author>
<name sortKey="Wanamaker, S" uniqKey="Wanamaker S">S Wanamaker</name>
</author>
<author>
<name sortKey="Close, Tj" uniqKey="Close T">TJ Close</name>
</author>
<author>
<name sortKey="Lonardi, S" uniqKey="Lonardi S">S Lonardi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ounit, R" uniqKey="Ounit R">R Ounit</name>
</author>
<author>
<name sortKey="Lonardi, S" uniqKey="Lonardi S">S Lonardi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ames, Sk" uniqKey="Ames S">SK Ames</name>
</author>
<author>
<name sortKey="Hysom, Da" uniqKey="Hysom D">DA Hysom</name>
</author>
<author>
<name sortKey="Gardner, Sn" uniqKey="Gardner S">SN Gardner</name>
</author>
<author>
<name sortKey="Lloyd, Gs" uniqKey="Lloyd G">GS Lloyd</name>
</author>
<author>
<name sortKey="Gokhale, Mb" uniqKey="Gokhale M">MB Gokhale</name>
</author>
<author>
<name sortKey="Allen, Je" uniqKey="Allen J">JE Allen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rosen, Gl" uniqKey="Rosen G">GL Rosen</name>
</author>
<author>
<name sortKey="Reichenberger, Er" uniqKey="Reichenberger E">ER Reichenberger</name>
</author>
<author>
<name sortKey="Rosenfeld, Am" uniqKey="Rosenfeld A">AM Rosenfeld</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Freitas, Ta" uniqKey="Freitas T">TA Freitas</name>
</author>
<author>
<name sortKey="Li, Pe" uniqKey="Li P">PE Li</name>
</author>
<author>
<name sortKey="Scholz, Mb" uniqKey="Scholz M">MB Scholz</name>
</author>
<author>
<name sortKey="Chain, Ps" uniqKey="Chain P">PS Chain</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Truong, Dt" uniqKey="Truong D">DT Truong</name>
</author>
<author>
<name sortKey="Franzosa, Ea" uniqKey="Franzosa E">EA Franzosa</name>
</author>
<author>
<name sortKey="Tickle, Tl" uniqKey="Tickle T">TL Tickle</name>
</author>
<author>
<name sortKey="Scholz, M" uniqKey="Scholz M">M Scholz</name>
</author>
<author>
<name sortKey="Weingart, G" uniqKey="Weingart G">G Weingart</name>
</author>
<author>
<name sortKey="Pasolli, E" uniqKey="Pasolli E">E Pasolli</name>
</author>
<author>
<name sortKey="Tett, A" uniqKey="Tett A">A Tett</name>
</author>
<author>
<name sortKey="Huttenhower, C" uniqKey="Huttenhower C">C Huttenhower</name>
</author>
<author>
<name sortKey="Segata, N" uniqKey="Segata N">N Segata</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Darling, Ae" uniqKey="Darling A">AE Darling</name>
</author>
<author>
<name sortKey="Jospin, G" uniqKey="Jospin G">G Jospin</name>
</author>
<author>
<name sortKey="Lowe, E" uniqKey="Lowe E">E Lowe</name>
</author>
<author>
<name sortKey="Fat, M" uniqKey="Fat M">FA Matsen</name>
</author>
<author>
<name sortKey="Bik, Hm" uniqKey="Bik H">HM Bik</name>
</author>
<author>
<name sortKey="Eisen, Ja" uniqKey="Eisen J">JA Eisen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Simner, Pj" uniqKey="Simner P">PJ Simner</name>
</author>
<author>
<name sortKey="Miller, S" uniqKey="Miller S">S Miller</name>
</author>
<author>
<name sortKey="Carroll, Kc" uniqKey="Carroll K">KC Carroll</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, C" uniqKey="Zhang C">C Zhang</name>
</author>
<author>
<name sortKey="Cleveland, K" uniqKey="Cleveland K">K Cleveland</name>
</author>
<author>
<name sortKey="Schnoll Sussman, F" uniqKey="Schnoll Sussman F">F Schnoll-Sussman</name>
</author>
<author>
<name sortKey="Mcclure, B" uniqKey="Mcclure B">B McClure</name>
</author>
<author>
<name sortKey="Bigg, M" uniqKey="Bigg M">M Bigg</name>
</author>
<author>
<name sortKey="Thakkar, P" uniqKey="Thakkar P">P Thakkar</name>
</author>
<author>
<name sortKey="Schultz, N" uniqKey="Schultz N">N Schultz</name>
</author>
<author>
<name sortKey="Shah, Ma" uniqKey="Shah M">MA Shah</name>
</author>
<author>
<name sortKey="Betel, D" uniqKey="Betel D">D Betel</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Buchfink, B" uniqKey="Buchfink B">B Buchfink</name>
</author>
<author>
<name sortKey="Xie, C" uniqKey="Xie C">C Xie</name>
</author>
<author>
<name sortKey="Huson, Dh" uniqKey="Huson D">DH Huson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huson, Daniel H" uniqKey="Huson D">Daniel H. Huson</name>
</author>
<author>
<name sortKey="Beier, Sina" uniqKey="Beier S">Sina Beier</name>
</author>
<author>
<name sortKey="Flade, Isabell" uniqKey="Flade I">Isabell Flade</name>
</author>
<author>
<name sortKey="G Rska, Anna" uniqKey="G Rska A">Anna Górska</name>
</author>
<author>
<name sortKey="El Hadidi, Mohamed" uniqKey="El Hadidi M">Mohamed El-Hadidi</name>
</author>
<author>
<name sortKey="Mitra, Suparna" uniqKey="Mitra S">Suparna Mitra</name>
</author>
<author>
<name sortKey="Ruscheweyh, Hans Joachim" uniqKey="Ruscheweyh H">Hans-Joachim Ruscheweyh</name>
</author>
<author>
<name sortKey="Tappu, Rewati" uniqKey="Tappu R">Rewati Tappu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Xu, Y" uniqKey="Xu Y">Y Xu</name>
</author>
<author>
<name sortKey="Chen, Y C" uniqKey="Chen Y">Y-C Chen</name>
</author>
<author>
<name sortKey="Liu, T" uniqKey="Liu T">T Liu</name>
</author>
<author>
<name sortKey="Yu, C H" uniqKey="Yu C">C-H Yu</name>
</author>
<author>
<name sortKey="Chiang, T Y" uniqKey="Chiang T">T-Y Chiang</name>
</author>
<author>
<name sortKey="Hwang, C C" uniqKey="Hwang C">C-C Hwang</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Genome Biol</journal-id>
<journal-id journal-id-type="iso-abbrev">Genome Biol</journal-id>
<journal-title-group>
<journal-title>Genome Biology</journal-title>
</journal-title-group>
<issn pub-type="ppub">1474-7596</issn>
<issn pub-type="epub">1474-760X</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">30445993</article-id>
<article-id pub-id-type="pmc">6238331</article-id>
<article-id pub-id-type="publisher-id">1568</article-id>
<article-id pub-id-type="doi">10.1186/s13059-018-1568-0</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Software</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>KrakenUniq: confident and fast metagenomics classification using unique
<italic>k</italic>
-mer counts</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Breitwieser</surname>
<given-names>F. P.</given-names>
</name>
<address>
<email>florian.bw@gmail.com</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Baker</surname>
<given-names>D. N.</given-names>
</name>
<xref ref-type="aff" rid="Aff1">1</xref>
<xref ref-type="aff" rid="Aff2">2</xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Salzberg</surname>
<given-names>S. L.</given-names>
</name>
<address>
<email>salzberg@jhu.edu</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
<xref ref-type="aff" rid="Aff2">2</xref>
<xref ref-type="aff" rid="Aff3">3</xref>
</contrib>
<aff id="Aff1">
<label>1</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2171 9311</institution-id>
<institution-id institution-id-type="GRID">grid.21107.35</institution-id>
<institution>Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine,</institution>
<institution>Johns Hopkins School of Medicine,</institution>
</institution-wrap>
Baltimore, MD USA</aff>
<aff id="Aff2">
<label>2</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2171 9311</institution-id>
<institution-id institution-id-type="GRID">grid.21107.35</institution-id>
<institution>Department of Computer Science,</institution>
<institution>Johns Hopkins University,</institution>
</institution-wrap>
Baltimore, MD USA</aff>
<aff id="Aff3">
<label>3</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2171 9311</institution-id>
<institution-id institution-id-type="GRID">grid.21107.35</institution-id>
<institution>Departments of Biomedical Engineering and Biostatistics,</institution>
<institution>Johns Hopkins University,</institution>
</institution-wrap>
Baltimore, MD USA</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>16</day>
<month>11</month>
<year>2018</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>16</day>
<month>11</month>
<year>2018</year>
</pub-date>
<pub-date pub-type="collection">
<year>2018</year>
</pub-date>
<volume>19</volume>
<elocation-id>198</elocation-id>
<history>
<date date-type="received">
<day>3</day>
<month>4</month>
<year>2018</year>
</date>
<date date-type="accepted">
<day>18</day>
<month>10</month>
<year>2018</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author(s). 2018</copyright-statement>
<license license-type="OpenAccess">
<license-p>
<bold>Open Access</bold>
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<p id="Par1">False-positive identifications are a significant problem in metagenomics classification. We present KrakenUniq, a novel metagenomics classifier that combines the fast
<italic>k</italic>
-mer-based classification of Kraken with an efficient algorithm for assessing the coverage of unique
<italic>k</italic>
-mers found in each species in a dataset. On various test datasets, KrakenUniq gives better recall and precision than other methods and effectively classifies and distinguishes pathogens with low abundance from false positives in infectious disease samples. By using the probabilistic cardinality estimator HyperLogLog, KrakenUniq runs as fast as Kraken and requires little additional memory. KrakenUniq is freely available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/fbreitwieser/krakenuniq">https://github.com/fbreitwieser/krakenuniq</ext-link>
.</p>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (10.1186/s13059-018-1568-0) contains supplementary material, which is available to authorized users.</p>
</sec>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>Metagenomics</kwd>
<kwd>Microbiome</kwd>
<kwd>Metagenomics classification</kwd>
<kwd>Pathogen detection</kwd>
<kwd>Infectious disease diagnosis</kwd>
</kwd-group>
<funding-group>
<award-group>
<funding-source>
<institution>Army Research Office (US)</institution>
</funding-source>
<award-id>W911NF-14-1-0490</award-id>
</award-group>
</funding-group>
<funding-group>
<award-group>
<funding-source>
<institution>National Human Genome Research Institute (US)</institution>
</funding-source>
<award-id>R01-HG006677</award-id>
</award-group>
</funding-group>
<funding-group>
<award-group>
<funding-source>
<institution-wrap>
<institution-id institution-id-type="FundRef">http://dx.doi.org/10.13039/100000057</institution-id>
<institution>National Institute of General Medical Sciences</institution>
</institution-wrap>
</funding-source>
<award-id>R01-GM083873</award-id>
<award-id>R01-GM118568</award-id>
</award-group>
</funding-group>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2018</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="Sec1">
<title>Background</title>
<p id="Par2">Metagenomics classifiers attempt to assign a taxonomic identity to each read in a dataset. Because metagenomics data often contain tens of millions of reads, classification is typically done using exact matching of short words of length
<italic>k</italic>
(
<italic>k</italic>
-mers) rather than alignment, which would be unacceptably slow. The results contain read classifications but not their aligned positions in the genomes (as reviewed by [
<xref ref-type="bibr" rid="CR1">1</xref>
]). However, read counts can be deceiving. Sequence contamination of the samples—introduced from laboratory kits or the environment during sample extraction, handling, or sequencing—can yield high numbers of spurious identifications [
<xref ref-type="bibr" rid="CR2">2</xref>
,
<xref ref-type="bibr" rid="CR3">3</xref>
]. Having only small amounts of input material can further compound the problem of contamination. When using sequencing for clinical diagnosis of infectious diseases, for example, less than 0.1% of the DNA may derive from microbes of interest [
<xref ref-type="bibr" rid="CR4">4</xref>
,
<xref ref-type="bibr" rid="CR5">5</xref>
]. Additional spurious matches can result from low-complexity regions of genomes and from contamination in the database genomes themselves [
<xref ref-type="bibr" rid="CR6">6</xref>
].</p>
<p id="Par3">Such false-positive reads typically match only small portions of a genome. For example, if a species’ genome contains a low-complexity region and the only reads matching that species fall in this region, then the species was probably not present in the sample. Reads from microbes that are truly present should be distributed relatively uniformly across the genome rather than concentrated in one or a few locations. Genome alignment can reveal this information. However, alignment is resource intensive, requiring the construction of indexes for every genome and a relatively slow alignment step to compare all reads against those indexes. Some metagenomics methods do use coverage information to improve mapping or quantification accuracy, but these methods require results from much slower alignment methods as input [
<xref ref-type="bibr" rid="CR7">7</xref>
]. Assembly-based methods also help to avoid false positives, but these are useful only for highly abundant species [
<xref ref-type="bibr" rid="CR8">8</xref>
].</p>
<p id="Par4">Here, we present KrakenUniq, a novel method that combines very fast
<italic>k</italic>
-mer-based classification with a fast
<italic>k</italic>
-mer cardinality estimation. KrakenUniq is based on the Kraken metagenomics classifier [
<xref ref-type="bibr" rid="CR9">9</xref>
], to which it adds a method for counting the number of unique
<italic>k</italic>
-mers identified for each taxon using the efficient cardinality estimation algorithm HyperLogLog [
<xref ref-type="bibr" rid="CR10">10</xref>
–
<xref ref-type="bibr" rid="CR12">12</xref>
]. By counting how many of each genome’s unique
<italic>k</italic>
-mers are covered by reads, KrakenUniq can often discern false-positive from true-positive matches. Furthermore, KrakenUniq implements additional new features to improve metagenomics classification: (a) searches can be done against multiple databases hierarchically; (b) the taxonomy can be extended to include nodes for strains and plasmids, thus enabling their detection; and (c) the database build script allows the addition of > 100,000 viruses from the NCBI Viral Genome Resource [
<xref ref-type="bibr" rid="CR13">13</xref>
]. KrakenUniq provides a superset of the information provided by Kraken while running equally fast or slightly faster and while using very little additional memory during classification.</p>
</sec>
<sec id="Sec2">
<title>Results</title>
<p id="Par5">KrakenUniq was developed to provide efficient
<italic>k</italic>
-mer count information for all taxa identified in a metagenomics experiment. The main workflow is as follows: As reads are processed, each
<italic>k</italic>
-mer is assigned a taxon from the database (Fig. 
<xref rid="Fig1" ref-type="fig">1a</xref>
). KrakenUniq instantiates a HyperLogLog data sketch for each taxon and adds the
<italic>k</italic>
-mers to it (Fig. 
<xref rid="Fig1" ref-type="fig">1b</xref>
and Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Section 1 on the HyperLogLog algorithm). After classification of a read, KrakenUniq traverses up the taxonomic tree and merges the estimators of each taxon with its parent. In its classification report, KrakenUniq includes the number of unique
<italic>k</italic>
-mers and the depth of
<italic>k</italic>
-mer coverage for each taxon that it observed in the input data (Fig. 
<xref rid="Fig1" ref-type="fig">1c</xref>
).
<fig id="Fig1">
<label>Fig. 1</label>
<caption>
<p>Overview of the KrakenUniq algorithm and output.
<bold>a</bold>
An input read is shown as a long gray rectangle, with
<italic>k</italic>
-mers shown as shorter rectangles below it. The taxon mappings for each
<italic>k</italic>
-mer are compared to the database, shown as larger rectangles on the right. For each taxon, a unique
<italic>k</italic>
-mer counter is instantiated, and the observed
<italic>k</italic>
-mers (K7, K8, and K9) are added to the counters.
<bold>b</bold>
Unique
<italic>k</italic>
-mer counting is implemented with the probabilistic estimation method HyperLogLog (HLL) using 16 KB of memory per counter, which limits the error in the cardinality estimate to 1% (see main text).
<bold>c</bold>
The output includes the number of reads, unique
<italic>k</italic>
-mers, duplicity (average number of times each
<italic>k</italic>
-mer has been seen), and coverage for each taxon observed in the input data</p>
</caption>
<graphic xlink:href="13059_2018_1568_Fig1_HTML" id="MO1"></graphic>
</fig>
</p>
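<p>The workflow described above can be illustrated with a minimal sketch. It is not KrakenUniq's implementation: exact Python sets stand in for the HyperLogLog sketches, and the toy database (kmer_to_taxon), the toy taxonomy (parent), and the taxon names are hypothetical. Canonical k-mers and lowest-common-ancestor assignment are left out for brevity.</p>
<preformat>
```python
from collections import defaultdict


def kmers(read, k):
    """Yield every k-mer of the read that contains only A, C, G, or T."""
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if set(kmer).issubset("ACGT"):
            yield kmer


def count_unique_kmers(reads, kmer_to_taxon, parent, k):
    """Return {taxon: (total_kmers, unique_kmers)}, aggregated up the tree."""
    seen = defaultdict(set)   # exact stand-in for one sketch per taxon
    total = defaultdict(int)  # all k-mer hits, duplicates included
    for read in reads:
        for kmer in kmers(read, k):
            taxon = kmer_to_taxon.get(kmer)
            if taxon is not None:
                seen[taxon].add(kmer)
                total[taxon] += 1

    # Propagate each taxon's counter to all of its ancestors.
    merged = defaultdict(set)
    merged_total = defaultdict(int)
    for taxon in list(seen):
        node = taxon
        while node is not None:
            merged[node] |= seen[taxon]
            merged_total[node] += total[taxon]
            node = parent.get(node)
    return {t: (merged_total[t], len(merged[t])) for t in merged}


# Hypothetical toy database and taxonomy (k = 3 for readability).
kmer_to_taxon = {"ACG": "species_A", "CGT": "species_A", "TTT": "species_B"}
parent = {"species_A": "genus_X", "species_B": "genus_X", "genus_X": None}

report = count_unique_kmers(["ACGT", "ACGT", "TTTN"], kmer_to_taxon, parent, k=3)
for taxon, (hits, unique) in sorted(report.items()):
    # duplicity = average number of times each unique k-mer was seen
    print(taxon, hits, unique, round(hits / unique, 2))
```
</preformat>
<p>In this toy input, the two identical reads double the k-mer hits of species_A without adding unique k-mers, which is the situation the duplicity column is meant to expose.</p>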
<sec id="Sec3">
<title>Efficient
<italic>k</italic>
-mer cardinality estimation using the HyperLogLog algorithm</title>
<p id="Par6">Cardinality is the number of elements in a set without duplicates, e.g., the number of distinct words in a text. An exact count can be kept by storing the elements in a sorted list or linear probing hash table, but that requires memory proportional to the number of unique elements. When an accurate estimate of the cardinality is sufficient, however, the computation can be done efficiently with a very small amount of fixed memory. The HyperLogLog algorithm (HLL) [
<xref ref-type="bibr" rid="CR10">10</xref>
], which is well suited for
<italic>k</italic>
-mer counting [
<xref ref-type="bibr" rid="CR14">14</xref>
], keeps a summary or
<italic>sketch</italic>
of the data that is sufficient for precise estimation of the cardinality and requires only a small amount of constant space to estimate cardinalities up to billions. The method centers on the idea that long runs of leading zeros, which can be efficiently computed using machine instructions, are unlikely in random bitstrings. For example, about every fourth bitstring in a random series should start with 01
<sub>2</sub>
(one 0 bit before the first 1 bit), and about every 32nd hash starts with 00001
<sub>2</sub>
. Conversely, if we know the maximum number of leading zeros
<italic>k</italic>
of the members of a random set, we can use 2
<sup>
<italic>k</italic>
 + 1</sup>
as a crude estimate of its cardinality (more details in Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Section 1 on the HLL algorithm). HLL keeps
<italic>m</italic>
 = 2
<sup>
<italic>p</italic>
</sup>
1-byte counts of the maximum number of leading zeros in the data (its data
<italic>sketch</italic>
), with
<italic>p</italic>
, the precision parameter, typically between 10 and 18 (see Fig. 
<xref rid="Fig2" ref-type="fig">2</xref>
). For cardinalities up to about
<italic>m</italic>
/4, we use the sparse representation of the registers suggested by Heule et al. [
<xref ref-type="bibr" rid="CR11">11</xref>
] that has the much higher effective precision
<italic>p</italic>
′ of 25 by encoding each index and count in a vector of 4-byte values (see Fig. 
<xref rid="Fig2" ref-type="fig">2</xref>
). To add a
<italic>k</italic>
-mer to its taxon’s sketch, the
<italic>k</italic>
-mer (with
<italic>k</italic>
up to 31) is first mapped by a hash function to a 64-bit hash value. Note that
<italic>k</italic>
-mers that contain non-A, C, G, or T characters (such as ambiguous IUPAC characters) are ignored by KrakenUniq. The first
<italic>p</italic>
bits of the hash value are used as index
<italic>i</italic>
, and the remaining 64-
<italic>p</italic>
 = 
<italic>q</italic>
bits for counting the number of leading zeros
<italic>k</italic>
. The value of the register
<italic>M</italic>
[
<italic>i</italic>
] in the sketch is updated if
<italic>k</italic>
is larger than the current value of
<italic>M</italic>
[
<italic>i</italic>
].
<fig id="Fig2">
<label>Fig. 2</label>
<caption>
<p>Cardinality estimation using HyperLogLog for randomly sampled
<italic>k</italic>
-mers from microbial genomes. Left: standard deviations of the relative errors of the estimate with precision
<italic>p</italic>
ranging from 10 to 18. No systematic biases are apparent, and, as expected, the errors decrease with higher values of
<italic>p</italic>
. Up to cardinalities of about 2
<sup>
<italic>p</italic>
</sup>
/4, the relative error is near zero. At higher cardinalities, the error boundaries stay near constant. Right: the size of the registers, space requirement, and expected relative error for HyperLogLog cardinality estimates with different values of
<italic>p</italic>
. For example, with a precision
<italic>p</italic>
 = 14, the expected relative error is 0.81%, and the counter only requires 16 KB of space, which is three orders of magnitude less than that of an exact counter (at a cardinality of one million). Up to cardinalities of 2
<sup>
<italic>p</italic>
</sup>
/4, KrakenUniq uses a sparse representation of the counter with a higher precision of 25 and an effective relative error rate of about 0.02%</p>
</caption>
<graphic xlink:href="13059_2018_1568_Fig2_HTML" id="MO2"></graphic>
</fig>
</p>
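<p>The register update described above can be sketched as follows. This is an illustration under stated assumptions, not KrakenUniq's code: the hash function (BLAKE2b truncated to 64 bits) is a stand-in chosen for convenience, the function names are hypothetical, and the register stores the number of leading zeros plus one, as is common in HyperLogLog implementations, so that an empty register (0) can be told apart from a hash with no leading zeros.</p>
<preformat>
```python
import hashlib


def hash64(kmer):
    """Map a k-mer to a 64-bit integer (the hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def split_hash(h, p):
    """Split a 64-bit hash into (register index, rank)."""
    q = 64 - p
    index = h >> q                         # first p bits select the register
    rest = h % (2 ** q)                    # remaining q bits
    leading_zeros = q - rest.bit_length()  # zeros before the first 1 bit
    return index, leading_zeros + 1


def add_kmer(registers, kmer, p):
    """Update the sketch: keep the maximum rank seen for the register."""
    index, rank = split_hash(hash64(kmer), p)
    if rank > registers[index]:
        registers[index] = rank


# p = 14 gives 2**14 = 16,384 one-byte registers, i.e., 16 KB.
registers = [0] * 2 ** 14
for kmer in ["ACGTACGTAC", "ACGTACGTAC", "TTGACCATGA"]:
    add_kmer(registers, kmer, p=14)
```
</preformat>
<p>Adding the same k-mer twice leaves the registers unchanged, which is why the sketch counts distinct k-mers rather than k-mer occurrences.</p>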
<p id="Par7">When the read classification is finished, the taxon sketches are aggregated up the taxonomy tree by taking the maximum of each register value. The resulting sketches are the same as if the
<italic>k</italic>
-mers had been counted at every level of their lineage from the beginning. KrakenUniq then computes cardinality estimates using the formula proposed by Ertl [
<xref ref-type="bibr" rid="CR12">12</xref>
], which has theoretical and practical advantages and does not require empirical bias correction factors [
<xref ref-type="bibr" rid="CR10">10</xref>
,
<xref ref-type="bibr" rid="CR11">11</xref>
]. In our tests, it performed better than Flajolet’s and Heule’s methods (Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Figures S1 and S2).</p>
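<p>The aggregation step can be checked with a small sketch: merging two sketches by the register-wise maximum yields the same registers as building one sketch from the union of the inputs. The hash function, the small precision, and all names are assumptions made for illustration.</p>
<preformat>
```python
import hashlib

P = 4  # deliberately small precision, for illustration only


def update(registers, item):
    """Insert one item into a HyperLogLog-style register array."""
    digest = hashlib.blake2b(item.encode("ascii"), digest_size=8).digest()
    h = int.from_bytes(digest, "big")
    q = 64 - P
    index = h >> q
    rank = q - (h % (2 ** q)).bit_length() + 1
    registers[index] = max(registers[index], rank)


def sketch(items):
    """Build a register array from an iterable of items."""
    registers = [0] * 2 ** P
    for item in items:
        update(registers, item)
    return registers


def merge(a, b):
    """Register-wise maximum of two sketches of equal precision."""
    if len(a) != len(b):
        raise ValueError("sketches must use the same precision")
    return [max(x, y) for x, y in zip(a, b)]


child = sketch(["ACG", "CGT"])             # e.g., a species-level sketch
sibling = sketch(["CGT", "GTA"])           # e.g., another species
parent = merge(child, sibling)             # genus-level sketch
direct = sketch(["ACG", "CGT", "GTA"])     # counted at the genus from the start
print(parent == direct)
```
</preformat>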
<p id="Par8">The expected relative error of the final cardinality estimate is approximately 1.04/sqrt(2
<sup>
<italic>p</italic>
</sup>
) [
<xref ref-type="bibr" rid="CR10">10</xref>
]. With
<italic>p</italic>
 = 14, the sketch uses 2
<sup>14</sup>
1-byte registers, i.e., 16 KB of space, and gives estimates with relative errors of less than 1% (Fig. 
<xref rid="Fig2" ref-type="fig">2</xref>
). Note that KrakenUniq also incorporates an exact counting mode, which, however, uses significantly more memory and runtime without appreciable improvements in classification accuracy (see the “
<xref rid="Sec8" ref-type="sec">Exact counting versus estimated cardinality</xref>
” section).</p>
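<p>As a worked check of these numbers, the following short sketch evaluates the error formula and the register memory for several precisions. The function names are illustrative only.</p>
<preformat>
```python
import math


def hll_expected_error(p):
    """Expected relative error of a sketch with 2**p registers."""
    return 1.04 / math.sqrt(2 ** p)


def hll_memory_bytes(p):
    """Memory of the dense sketch: one 1-byte register per index."""
    return 2 ** p


for p in (10, 12, 14, 16, 18):
    error_pct = 100 * hll_expected_error(p)
    kib = hll_memory_bytes(p) / 1024
    print(f"p={p}: {kib:g} KB, expected relative error {error_pct:.2f}%")
```
</preformat>
<p>For p = 14, the square root of 2^14 is 128, so the expected relative error is 1.04/128, or about 0.81%, with 16,384 bytes (16 KB) of registers.</p>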
</sec>
<sec id="Sec4">
<title>Results on 21 simulated and 10 biological test datasets</title>
<p id="Par9">We assessed KrakenUniq’s performance on the 34 datasets compiled by McIntyre et al. [
<xref ref-type="bibr" rid="CR15">15</xref>
] (see Additional file 
<xref rid="MOESM2" ref-type="media">2</xref>
: Table S3 for details on the datasets). We place greater emphasis on the 11 biological datasets, which contain more realistic laboratory and environmental contamination. In the first part of this section, we show that unique
<italic>k</italic>
-mer counts provide higher classification accuracy than read counts, and in the second part, we compare KrakenUniq with the results of 11 metagenomics classifiers. We ran KrakenUniq on three databases: “orig,” the database used by McIntyre et al.; “std,” which contains all current complete bacterial, archaeal, and viral genomes from RefSeq plus viral neighbor sequences and the human reference genome; and “nt,” which contains all microbial sequences (including fungi and protists) in the non-redundant nucleotide collection nr/nt provided by NCBI (see Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Section 2 for details). The “std” database furthermore includes the UniVec and EmVec sequence sets of synthetic constructs and vector sequences, and low-complexity
<italic>k</italic>
-mers in microbial sequences were masked using NCBI’s dustmasker with default settings. We use two metrics to compare how well methods can separate true positives and false positives: (a) F1 score, i.e., the harmonic mean of precision
<italic>p</italic>
and recall
<italic>r</italic>
, and (b) recall at a maximum false discovery rate (FDR) of 5%. For each method, we compute and select the ideal thresholds based on the read count,
<italic>k</italic>
-mer count or abundance calls. Precision
<italic>p</italic>
is defined as the number of correctly called species (or genera) divided by the number of all called species (or genera) at a given threshold. Recall
<italic>r</italic>
is the proportion of species (or genera) that are in the test dataset and that are called at a given threshold. Higher F1 scores indicate a better separation between true positives and false positives. Higher recall means that more true species can be recovered while controlling the false positives.</p>
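<p>The two metrics can be made concrete with a minimal sketch. The taxon names and scores below are invented for illustration; the score can be a read count, a unique k-mer count, or an abundance call, and the FDR is taken as one minus precision.</p>
<preformat>
```python
def evaluate(calls, truth, threshold):
    """Return (precision, recall, F1) for taxa scoring at least the threshold.

    calls: {taxon: score}; truth: set of taxa truly present in the sample.
    """
    called = {taxon for taxon, score in calls.items() if score >= threshold}
    true_pos = len(called.intersection(truth))
    precision = true_pos / len(called) if called else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


def best_f1_threshold(calls, truth):
    """Pick the observed score that maximizes F1 (lowest one on ties)."""
    candidates = sorted(set(calls.values()))
    return max(candidates, key=lambda t: evaluate(calls, truth, t)[2])


def recall_at_max_fdr(calls, truth, max_fdr=0.05):
    """Best recall over all thresholds whose FDR stays within max_fdr."""
    best = 0.0
    for threshold in set(calls.values()):
        precision, recall, _ = evaluate(calls, truth, threshold)
        if max_fdr >= 1.0 - precision:
            best = max(best, recall)
    return best


# Hypothetical unique k-mer counts per called taxon and a hypothetical truth set.
calls = {"A": 5000, "B": 1200, "C": 30, "D": 900}
truth = {"A", "B", "E"}
print(evaluate(calls, truth, 1000))
print(best_f1_threshold(calls, truth), recall_at_max_fdr(calls, truth))
```
</preformat>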
<p id="Par10">Because the NCBI taxonomy has been updated since the datasets were published, we manually updated the “truth” sets in several datasets (see Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Section 2.3 for details on taxonomy fixes). Any cases that might have been missed would result in a lower apparent performance of KrakenUniq. Note that we exclude the over 10-year-old simulated datasets simHC, simMC, and simLC from Mavromatis et al. (2007), as well as the biological dataset JGI SRR033547, which has only 100 reads.</p>
<sec id="Sec5">
<title>Classification performance using unique
<italic>k</italic>
-mer or read count thresholds</title>
<p id="Par11">We first looked at the performance of the unique
<italic>k</italic>
-mer count thresholds versus read count thresholds (as would be used with Kraken). The
<italic>k</italic>
-mer count thresholds worked very well, particularly for the biological datasets (Table 
<xref rid="Tab1" ref-type="table">1</xref>
and Additional file 
<xref rid="MOESM2" ref-type="media">2</xref>
: Table S3). On the genus level, the average recall in the biological datasets increases by 4–9%, and the average F1 score increases by 2–3%. On the species level, the average increase in recall in the biological sets is between 3 and 12%, and the F1 score increases by 1–2%.
<table-wrap id="Tab1">
<label>Table 1</label>
<caption>
<p>Performance of read count and unique
<italic>k</italic>
-mer thresholds at genus and species rank on 10 biological and 21 simulated datasets against the three databases ‘orig’, ‘std’ and ‘nt’</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th rowspan="2">Data Type</th>
<th rowspan="2">Rank</th>
<th rowspan="2">Statistic</th>
<th colspan="3">orig</th>
<th colspan="3">std</th>
<th colspan="3">nt</th>
</tr>
<tr>
<th>reads</th>
<th>
<italic>k</italic>
-mers</th>
<th>%diff</th>
<th>reads</th>
<th>
<italic>k</italic>
-mers</th>
<th>%diff</th>
<th>reads</th>
<th>
<italic>k</italic>
-mers</th>
<th>%diff</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="4">Bio</td>
<td rowspan="2">Genus</td>
<td>Recall</td>
<td>0.90</td>
<td>
<bold>0.93</bold>
</td>
<td>+4.0%</td>
<td>0.89</td>
<td>
<bold>0.94</bold>
</td>
<td>+6.2%</td>
<td>0.91</td>
<td>
<bold>0.99</bold>
</td>
<td>+8.9%</td>
</tr>
<tr>
<td>F1</td>
<td>0.95</td>
<td>
<bold>0.96</bold>
</td>
<td>+1.8%</td>
<td>0.95</td>
<td>
<bold>0.97</bold>
</td>
<td>+2.6%</td>
<td>0.96</td>
<td>
<bold>0.99</bold>
</td>
<td>+3.4%</td>
</tr>
<tr>
<td rowspan="2">Species</td>
<td>Recall</td>
<td>0.85</td>
<td>
<bold>0.87</bold>
</td>
<td>+2.6%</td>
<td>0.70</td>
<td>
<bold>0.78</bold>
</td>
<td>+11.8%</td>
<td>0.95</td>
<td>
<bold>0.98</bold>
</td>
<td>+3.1%</td>
</tr>
<tr>
<td>F1</td>
<td>0.94</td>
<td>0.94</td>
<td>+0.7%</td>
<td>0.90</td>
<td>
<bold>0.92</bold>
</td>
<td>+2.5%</td>
<td>0.97</td>
<td>
<bold>0.99</bold>
</td>
<td>+1.6%</td>
</tr>
<tr>
<td rowspan="4">Sim</td>
<td rowspan="2">Genus</td>
<td>Recall</td>
<td>
<bold>0.96</bold>
</td>
<td>0.94</td>
<td>-2.1%</td>
<td>0.95</td>
<td>
<bold>0.97</bold>
</td>
<td>+2.5%</td>
<td>0.98</td>
<td>0.99</td>
<td>+0.8%</td>
</tr>
<tr>
<td>F1</td>
<td>0.98</td>
<td>0.98</td>
<td>-0.0%</td>
<td>0.98</td>
<td>0.98</td>
<td>+0.3%</td>
<td>0.99</td>
<td>0.99</td>
<td>+0.3%</td>
</tr>
<tr>
<td rowspan="2">Species</td>
<td>Recall</td>
<td>0.92</td>
<td>0.93</td>
<td>+0.6%</td>
<td>0.88</td>
<td>0.88</td>
<td>+0.3%</td>
<td>0.90</td>
<td>0.90</td>
<td>-0.1%</td>
</tr>
<tr>
<td>F1</td>
<td>0.97</td>
<td>0.97</td>
<td>+0.3%</td>
<td>0.94</td>
<td>0.94</td>
<td>+0.5%</td>
<td>0.96</td>
<td>0.96</td>
<td>-0.1%</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Bold values indicate better performance by at least 1% difference in the test statistic, shown in the %diff column. Unique
<italic>k</italic>
-mer count thresholds give up to 10% better recall and F1 scores, particularly for the biological datasets</p>
</table-wrap-foot>
</table-wrap>
</p>
<p id="Par12">On the simulated datasets, the differences are less pronounced and vary between databases, even though on average the unique
<italic>k</italic>
-mer count is again better. However, the difference exceeds 1% in either direction in only two cases (genus recall on databases “orig” and “std”). We find that simulated datasets often lack false positives with a decent number of reads but a low number of unique
<italic>k</italic>
-mers, which we see in real data. Instead, in most simulated datasets, the number of unique
<italic>k</italic>
-mers is linearly increasing with the number of unique reads in both true and false positives (Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S3). In biological datasets, sequence contamination and lower read counts for the true positives make the task of separating true and false positives harder.</p>
</sec>
<sec id="Sec6">
<title>Comparison of KrakenUniq with 11 other methods</title>
<p id="Par13">Next, we compared KrakenUniq’s unique
<italic>k</italic>
-mer counts with the results of 11 metagenomics classifiers from McIntyre et al. [
<xref ref-type="bibr" rid="CR15">15</xref>
], which include the alignment-based methods Blast + Megan [
<xref ref-type="bibr" rid="CR16">16</xref>
,
<xref ref-type="bibr" rid="CR17">17</xref>
], Diamond + Megan [
<xref ref-type="bibr" rid="CR17">17</xref>
,
<xref ref-type="bibr" rid="CR18">18</xref>
], and MetaFlow [
<xref ref-type="bibr" rid="CR19">19</xref>
]; the
<italic>k</italic>
-mer-based CLARK [
<xref ref-type="bibr" rid="CR20">20</xref>
], CLARK-S [
<xref ref-type="bibr" rid="CR21">21</xref>
], Kraken [
<xref ref-type="bibr" rid="CR9">9</xref>
], LMAT [
<xref ref-type="bibr" rid="CR22">22</xref>
], and NBC [
<xref ref-type="bibr" rid="CR23">23</xref>
]; and the marker-based methods GOTTCHA [
<xref ref-type="bibr" rid="CR24">24</xref>
], MetaPhlAn2 [
<xref ref-type="bibr" rid="CR25">25</xref>
], and PhyloSift [
<xref ref-type="bibr" rid="CR26">26</xref>
]. KrakenUniq with database “nt” has the highest average recall and F1 score across the biological datasets, as shown in Table 
<xref rid="Tab2" ref-type="table">2</xref>
. As seen before, using unique
<italic>k</italic>
-mer counts instead of read counts as thresholds increases the scores. While the database selection proves to be very important (KrakenUniq with database “std” performs 10% worse than KrakenUniq with database “nt”), only Blast has higher average scores than KrakenUniq with
<italic>k</italic>
-mer count thresholds on the original database. On the simulated datasets, KrakenUniq with the “nt” database still ranks at the top, though, as seen previously, there is more variation (Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S4). Notably, CLARK is as good as KrakenUniq, but Blast has much worse scores on the simulated datasets.
<table-wrap id="Tab2">
<label>Table 2</label>
<caption>
<p>Performance of KrakenUniq (with unique
<italic>k</italic>
-mer count thresholds) compared to metagenomics classifiers [
<xref ref-type="bibr" rid="CR15">15</xref>
] on the biological datasets (
<italic>n</italic>
 = 10). F1 and recall show the average values over the datasets. Note that “KrakenUniq reads” would be equivalent to standard Kraken</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th rowspan="2"></th>
<th colspan="2">Genus</th>
<th colspan="2">Species</th>
<th></th>
</tr>
<tr>
<th>F1</th>
<th>Recall</th>
<th>F1</th>
<th>Recall</th>
<th>Avg</th>
</tr>
</thead>
<tbody>
<tr>
<td>KrakenUniq nt
<italic>k</italic>
-mers</td>
<td>
<bold>0.99</bold>
</td>
<td>
<bold>0.99</bold>
</td>
<td>
<bold>0.99</bold>
</td>
<td>
<bold>0.98</bold>
</td>
<td>
<bold>0.99</bold>
</td>
</tr>
<tr>
<td>KrakenUniq nt reads</td>
<td>0.96</td>
<td>0.91</td>
<td>0.97</td>
<td>0.95</td>
<td>0.95</td>
</tr>
<tr>
<td>BlastMeganFilteredLiberal</td>
<td>0.97</td>
<td>0.94</td>
<td>0.97</td>
<td>0.89</td>
<td>0.94</td>
</tr>
<tr>
<td>BlastMeganFiltered</td>
<td>0.97</td>
<td>0.93</td>
<td>0.96</td>
<td>0.87</td>
<td>0.93</td>
</tr>
<tr>
<td>KrakenUniq orig
<italic>k</italic>
-mers</td>
<td>0.96</td>
<td>0.93</td>
<td>0.94</td>
<td>0.87</td>
<td>0.93</td>
</tr>
<tr>
<td>ClarkM4Spaced</td>
<td>0.95</td>
<td>0.90</td>
<td>0.94</td>
<td>0.88</td>
<td>0.92</td>
</tr>
<tr>
<td>KrakenUniq orig reads</td>
<td>0.95</td>
<td>0.90</td>
<td>0.94</td>
<td>0.85</td>
<td>0.91</td>
</tr>
<tr>
<td>Kraken</td>
<td>0.95</td>
<td>0.90</td>
<td>0.94</td>
<td>0.84</td>
<td>0.91</td>
</tr>
<tr>
<td>KrakenUniq std
<italic>k</italic>
-mers</td>
<td>0.97</td>
<td>0.94</td>
<td>0.92</td>
<td>0.78</td>
<td>0.90</td>
</tr>
<tr>
<td>DiamondMegan_sensitive</td>
<td>0.98</td>
<td>0.93</td>
<td>0.92</td>
<td>0.74</td>
<td>0.89</td>
</tr>
<tr>
<td>KrakenFiltered</td>
<td>0.95</td>
<td>0.91</td>
<td>0.90</td>
<td>0.75</td>
<td>0.88</td>
</tr>
<tr>
<td>ClarkM1Default</td>
<td>0.94</td>
<td>0.85</td>
<td>0.91</td>
<td>0.77</td>
<td>0.87</td>
</tr>
<tr>
<td>KrakenUniq std reads</td>
<td>0.95</td>
<td>0.89</td>
<td>0.90</td>
<td>0.70</td>
<td>0.86</td>
</tr>
<tr>
<td>LMAT</td>
<td>0.97</td>
<td>0.93</td>
<td>0.91</td>
<td>0.60</td>
<td>0.85</td>
</tr>
<tr>
<td>DiamondMegan</td>
<td>0.94</td>
<td>0.87</td>
<td>0.91</td>
<td>0.66</td>
<td>0.85</td>
</tr>
<tr>
<td>Gottcha</td>
<td>0.91</td>
<td>0.84</td>
<td>0.87</td>
<td>0.67</td>
<td>0.82</td>
</tr>
<tr>
<td>NBC</td>
<td>0.87</td>
<td>0.76</td>
<td>0.85</td>
<td>0.73</td>
<td>0.80</td>
</tr>
<tr>
<td>Metaphlan</td>
<td>0.94</td>
<td>0.89</td>
<td>0.83</td>
<td>0.55</td>
<td>0.80</td>
</tr>
<tr>
<td>MetaFlow</td>
<td>0.66</td>
<td>0.53</td>
<td>0.65</td>
<td>0.51</td>
<td>0.59</td>
</tr>
<tr>
<td>PhyloSift</td>
<td>0.68</td>
<td>0.29</td>
<td>0.78</td>
<td>0.54</td>
<td>0.57</td>
</tr>
<tr>
<td>PhyloSift90pct</td>
<td>0.68</td>
<td>0.30</td>
<td>0.77</td>
<td>0.52</td>
<td>0.57</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Bold values indicate the highest value in each column</p>
</table-wrap-foot>
</table-wrap>
</p>
</sec>
</sec>
<sec id="Sec7">
<title>Generating a better test dataset and selecting an appropriate
<italic>k</italic>
-mer threshold</title>
<p id="Par14">In the previous section, we demonstrated that KrakenUniq gives better recall and F1 scores than other classifiers on the test datasets, given the correct thresholds. How can the correct thresholds be determined on real data with varying sequencing depths and complex communities? The test datasets are not ideal for this purpose: the biological datasets lack complexity, with a maximum of 25 species in some of the samples, while the simulated samples lack the features of biological datasets.</p>
<p id="Par15">We thus generated a third type of test dataset by sampling reads from real bacterial isolate sequencing runs, of which there are tens of thousands in the Sequence Read Archive (SRA). That way, we created a complex test dataset for which we know the ground truth, with all the features of real sequencing experiments, including lab contaminants and sequencing errors. We selected 280 SRA datasets from 280 different bacterial species that are linked to complete RefSeq genomes (see Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Suppl. Methods Section 2.4). We randomly sampled between one hundred and one million reads (logarithmically distributed) from each experiment, which gave 34 million read pairs in total. Furthermore, we sub-sampled 5 read sets with between 1 and 20 million reads. All read sets were classified with KrakenUniq using the “std” database.</p>
<p id="Par17">Consistent with the results of the previous section, we found that unique
<italic>k</italic>
-mer counts provide better thresholds than read counts both in terms of F1 score and recall in all test datasets (e.g., Fig. 
<xref rid="Fig3" ref-type="fig">3</xref>
on 10 million reads—species recall using
<italic>k</italic>
-mers is 0.85, recall using reads 0.76). With higher sequencing depth, the recall increased slightly—from 0.80 to 0.85 on the species level and from 0.87 to 0.89 on the genus level. The ideal values of the unique
<italic>k</italic>
-mer count thresholds, however, vary widely with different sequencing depths. We found that the ideal thresholds increase by about 2000 unique
<italic>k</italic>
-mers per 1 million reads (see Fig. 
<xref rid="Fig4" ref-type="fig">4</xref>
). McIntyre et al. [
<xref ref-type="bibr" rid="CR15">15</xref>
] found that
<italic>k</italic>
-mer-based methods show a positive relationship between sequencing depths and misclassified reads. Our analysis also shows that with deeper sequencing depths, higher thresholds are required to control the false-positive rate.
<fig id="Fig3">
<label>Fig. 3</label>
<caption>
<p>Unique
<italic>k</italic>
-mer count separates true and false positives better than read counts in a complex dataset with ten million reads sampled from SRA experiments. Each dot represents a species, with true species in orange and false species in black. The dashed and dotted lines show the
<italic>k</italic>
-mer thresholds for the ideal F1 score and recall at a maximum of 5% FDR, respectively. In this dataset, a unique
<italic>k</italic>
-mer count in the range 10,000–20,000 would give the best threshold for selecting true species</p>
</caption>
<graphic xlink:href="13059_2018_1568_Fig3_HTML" id="MO3"></graphic>
</fig>
<fig id="Fig4">
<label>Fig. 4</label>
<caption>
<p>Deeper sequencing depths require higher unique
<italic>k</italic>
-mer count thresholds to control the false-positive rate and achieve the best recall. A minimum threshold of about 2000 unique
<italic>k</italic>
-mers per million reads gives the best results in this dataset (solid line in plot); see Additional file 
<xref rid="MOESM3" ref-type="media">3</xref>
: Table S8 for more details</p>
</caption>
<graphic xlink:href="13059_2018_1568_Fig4_HTML" id="MO4"></graphic>
</fig>
</p>
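<p>The depth dependence reported above can be turned into a simple rule of thumb. The sketch below assumes a linear relationship through the origin with the slope of about 2000 unique k-mers per million reads observed in this dataset with the “std” database; the function name and default are illustrative, and other datasets or databases may call for a different slope.</p>
<preformat>
```python
def kmer_threshold(num_reads, kmers_per_million_reads=2000):
    """Suggested minimum unique k-mer count for a given sequencing depth."""
    return kmers_per_million_reads * num_reads / 1_000_000


for depth in (1_000_000, 5_000_000, 10_000_000, 20_000_000):
    print(f"{depth:,} reads: at least {kmer_threshold(depth):,.0f} unique k-mers")
```
</preformat>
<p>At ten million reads, this gives 20,000 unique k-mers, which lies at the upper end of the 10,000–20,000 range shown in Fig. 3.</p>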
<p id="Par18">In general, we find that for correctly identified species, we obtain up to approximately
<italic>L</italic>
-
<italic>k</italic>
unique
<italic>k</italic>
-mers per read, where
<italic>L</italic>
is the read length, because each read samples a different location in the genome. (Note that once the genome is completely covered, no more unique
<italic>k</italic>
-mers can be detected.) Thus, the
<italic>k</italic>
-mer threshold should always be several times higher than the read count threshold. For the discovery of pathogens in human patients, discussed in the next section, a read count threshold of 10 and unique
<italic>k</italic>
-mer count threshold of 1000 eliminated many background identifications while preserving all true positives, which were discovered from as few as 15 reads.</p>
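<p>The relationship between reads and unique k-mers can be illustrated with a short sketch. A read of length L contains L - k + 1 k-mers, so a set of non-overlapping reads yields at most that many unique k-mers per read. The value k = 31 is an assumption (the text states only that k is up to 31), the function names are hypothetical, and the read and k-mer counts are those reported for the four validated pathogens in Table 4.</p>
<preformat>
```python
def max_unique_kmers(num_reads, read_length, k=31):
    """Upper bound on unique k-mers if every read covers a distinct locus."""
    return num_reads * max(read_length - k + 1, 0)


def passes(reads, unique_kmers, min_reads=10, min_kmers=1000):
    """Apply the read count and unique k-mer count thresholds together."""
    return reads >= min_reads and unique_kmers >= min_kmers


# (reads, unique k-mers) for the validated pathogens listed in Table 4.
table4 = {
    "PT5": (9650, 7129),
    "PT7": (403, 20724),
    "PT8": (15, 1570),
    "PT10": (20, 2084),
}

for sample, (reads, unique_kmers) in table4.items():
    per_read = unique_kmers / reads
    print(sample, passes(reads, unique_kmers), round(per_read, 1))

# Hypothetical background hit: many reads piling up on few k-mers.
print(passes(500, 40))
```
</preformat>
<p>For PT8, 15 reads of 150 bp could contribute at most 15 × 120 = 1800 unique 31-mers, so the observed 1570 indicates that the reads barely overlap.</p>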
</sec>
<sec id="Sec8">
<title>Exact counting versus estimated cardinality</title>
<p id="Par19">KrakenUniq’s unique
<italic>k</italic>
-mer count is an estimate, raising the following question: does using an estimate—instead of the exact count—affect the classification performance?</p>
<p id="Par20">To answer this question, we implemented an exact counting mode in KrakenUniq. As expected, exact counting requires significantly more memory and runtime. On the full test dataset (with 34.3 million paired reads sampled from 280 WGS experiments on bacterial isolates), the more efficient of the two versions of exact counting required 60% more memory and over 200% more runtime. At the same time, we observed virtually no improvement in terms of classification performance (Table 
<xref rid="Tab3" ref-type="table">3</xref>
). A likely explanation for this finding is that over- or underestimation of the true cardinality by a small amount (e.g., 1%) rarely changes the ranking of the identifications. There will be cases, however, where a true species may fall just under a threshold due to the estimation error, and users may choose to use exact counting with KrakenUniq, although this will incur a large penalty in both runtime and memory consumption.
<table-wrap id="Tab3">
<label>Table 3</label>
<caption>
<p>Using cardinality estimates does not decrease classification performance on the test dataset. KrakenUniq in the default mode (using HyperLogLog cardinality estimation with precision 14) classifies reads as accurately as KrakenUniq using exact counting, on both the species and genus level. (Only genus level is shown in the table, which also shows Kraken’s performance for comparison). Note that we tested two versions of exact counting. In version 1, we implemented exact counting using the C++ standard library’s unordered_set. Most time is spent on merging counters at the end for report generation. In version 2, we implemented exact counting using khash from klib (
<ext-link ext-link-type="uri" xlink:href="https://github.com/attractivechaos/klib/">https://github.com/attractivechaos/klib/</ext-link>
). KrakenUniq uses version 2. Both unordered sets and the hash map require heap allocations for updating, which can cause significant performance cost at runtime because of global locks. Wall clock time for KrakenUniq includes report generation (which takes an additional 2m33s for Kraken)</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th rowspan="2"></th>
<th rowspan="2">Kraken</th>
<th colspan="3">KrakenUniq</th>
</tr>
<tr>
<th>Default</th>
<th>Exact(1)</th>
<th>Exact(2)</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="5">Computational performance</td>
</tr>
<tr>
<td> Wall clock time
<sup>3</sup>
</td>
<td>17m38s</td>
<td>
<bold>14m18s</bold>
</td>
<td>3h30m6s</td>
<td>45m30s</td>
</tr>
<tr>
<td> Speed [Mbp/m]</td>
<td>478.4</td>
<td>
<bold>595.4</bold>
</td>
<td>95.9</td>
<td>377.8</td>
</tr>
<tr>
<td> Memory [GB]</td>
<td>
<bold>167.1</bold>
</td>
<td>168.2</td>
<td>466.2</td>
<td>272.4</td>
</tr>
<tr>
<td> Minor page faults × 10
<sup>6</sup>
</td>
<td>203.5</td>
<td>
<bold>192.2</bold>
</td>
<td>272.5</td>
<td>904.6</td>
</tr>
<tr>
<td colspan="5">Classification performance</td>
</tr>
<tr>
<td> Recall</td>
<td>0.827</td>
<td>
<bold>0.888</bold>
</td>
<td>
<bold>0.888</bold>
</td>
<td>
<bold>0.888</bold>
</td>
</tr>
<tr>
<td> F1 score</td>
<td>0.922</td>
<td>
<bold>0.935</bold>
</td>
<td>
<bold>0.935</bold>
</td>
<td>
<bold>0.935</bold>
</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Bold values indicate the highest or lowest values in each row</p>
</table-wrap-foot>
</table-wrap>
</p>
</sec>
<sec id="Sec9">
<title>Results on biological samples for infectious disease diagnosis</title>
<p id="Par21">Metagenomics is increasingly used to find species of low abundance. A special case is the emerging use of metagenomics for the diagnosis of infectious diseases [
<xref ref-type="bibr" rid="CR27">27</xref>
,
<xref ref-type="bibr" rid="CR28">28</xref>
]. In this application, infected human tissues are sequenced directly to find the likely disease organism. Usually, the vast majority of the reads (typically 95–99%) match the host, and sometimes fewer than 100 reads out of many millions of reads are matched to the target species. Common skin bacteria from the patient or lab personnel and other contamination from sample collection or preparation can easily generate a similar number of reads, and thus mask the signal from the pathogen.</p>
<p id="Par22">To assess if the unique
<italic>k</italic>
-mer count metric in KrakenUniq could be used to rank and identify pathogens in human samples, we reanalyzed ten patient samples from a previously described series of neurological infections [
<xref ref-type="bibr" rid="CR4">4</xref>
]. That study sequenced spinal cord mass and brain biopsies from ten hospitalized patients for whom routine tests for pathogens were inconclusive. In four of the ten cases, a likely diagnosis could be made with the help of metagenomics. To confirm the metagenomics classifications, the authors in the original study re-aligned all pathogen reads to individual genomes.</p>
<p id="Par23">Table 
<xref rid="Tab4" ref-type="table">4</xref>
shows the results of our reanalysis of the confirmed pathogens in the four patients, including the number of reads and unique
<italic>k</italic>
-mers from the pathogen, as well as the number of bases covered by re-alignment to the genomes. Even though the read numbers are very low in two cases, the number of unique
<italic>k</italic>
-mers suggests that each read matches a different location in the genome. For example, in PT8, 15 reads contain 1570 unique
<italic>k</italic>
-mers, and re-alignment shows 2201 covered base pairs. In contrast, Table 
<xref rid="Tab5" ref-type="table">5</xref>
shows examples of identifications from the same datasets that are not well-supported by
<italic>k</italic>
-mer counts. We also examined the likely source of the false-positive identifications by blasting the reads against the full nt database, and found rRNA of environmental bacteria, human RNA, and PhiX-174 mis-assignments (see Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Suppl. Methods for details). Notably, the common laboratory and skin contaminants PhiX-174,
<italic>Escherichia coli</italic>
,
<italic>Cutibacterium acnes</italic>
, and
<italic>Delftia</italic>
were detected in most of the samples, too (see Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S6). However, those identifications are solid in terms of their
<italic>k</italic>
-mer counts—the bacteria and PhiX-174 are present in the sample, and the reads cover their genomes rather randomly. To discount them, comparisons against a negative control or between multiple samples are required (e.g., with Pavian [
<xref ref-type="bibr" rid="CR29">29</xref>
]).
<table-wrap id="Tab4">
<label>Table 4</label>
<caption>
<p>Validated pathogen identifications in patients with neurological infections have high numbers of unique
<italic>k</italic>
-mers per read. The pathogens were identified with as few as 15 reads, but the high number of unique
<italic>k</italic>
-mers indicates distinct locations of the reads along their genomes. Re-alignment of mapped reads to their reference genomes (column “Covered bases”) corroborates the finding of the unique
<italic>k</italic>
-mers (see also Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S4). Interestingly, the
<italic>k</italic>
-mer count in PT5 indicates that there might be multiple strains present in the sample since the
<italic>k</italic>
-mers cover more than one genome. Read lengths were 150–250 bp</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th>Sample</th>
<th>Matched microorganism</th>
<th>Reads</th>
<th>
<italic>k</italic>
-mers</th>
<th>Covered bases</th>
</tr>
</thead>
<tbody>
<tr>
<td>PT5</td>
<td>Human polyomavirus 2</td>
<td>9650</td>
<td>7129</td>
<td>5130/5130</td>
</tr>
<tr>
<td>PT7</td>
<td>
<italic>Elizabethkingia</italic>
genomosp. 3</td>
<td>403</td>
<td>20,724</td>
<td>53,256/4,433,522</td>
</tr>
<tr>
<td>PT8</td>
<td>
<italic>Mycobacterium tuberculosis</italic>
</td>
<td>15</td>
<td>1570</td>
<td>2227/4,411,532</td>
</tr>
<tr>
<td>PT10</td>
<td>Human gammaherpesvirus 4</td>
<td>20</td>
<td>2084</td>
<td>2822/172,764</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="Tab5">
<label>Table 5</label>
<caption>
<p>False-positive identifications have few unique
<italic>k</italic>
-mers. Using an extended taxonomy, the identifications in PT4 and PT10 were matched to single accessions (instead of to the species level). The likely true source of the mapped sequences was determined by subsequent BLAST searches and included 16S rRNA present in many uncultured bacteria, human small nucleolar RNAs (snRNAs), and phiX174</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th>Sample</th>
<th>Matched microorganism</th>
<th>Reads</th>
<th>
<italic>k</italic>
-mers</th>
<th>Source</th>
</tr>
</thead>
<tbody>
<tr>
<td>PT3</td>
<td>
<italic>Clostridioides difficile</italic>
</td>
<td>122</td>
<td>126</td>
<td>16S rRNA</td>
</tr>
<tr>
<td>PT4</td>
<td>Hepatitis C virus
<break></break>
JF343788.1 recombinant hepatitis C virus</td>
<td>101</td>
<td>3</td>
<td>Human snRNA</td>
</tr>
<tr>
<td>PT5</td>
<td>
<italic>Akkermansia muciniphila</italic>
</td>
<td>936</td>
<td>136</td>
<td>16S rRNA</td>
</tr>
<tr>
<td>PT10</td>
<td>Human betaherpesvirus 5
<break></break>
JN379815.1 human herpesvirus 5 strain U04, partial genome</td>
<td>63</td>
<td>5</td>
<td>phiX174</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
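<p>The contrast between Tables 4 and 5 can be made concrete by dividing the unique <italic>k</italic>-mer count by the read count for each row. The following Python sketch does this using only the values printed in the two tables; the helper name kmers_per_read is ours and is not part of KrakenUniq. The ratio is a rough indicator rather than a threshold: for PT5, the 5130 bp polyomavirus genome is already fully covered (Table 4), so additional reads contribute few new <italic>k</italic>-mers and the ratio falls below 1 even though the identification was validated.</p>
<preformat>
```python
# Rows copied from Tables 4 and 5: (sample, organism, reads, unique k-mers).
VALIDATED = [
    ("PT5", "Human polyomavirus 2", 9650, 7129),
    ("PT7", "Elizabethkingia genomosp. 3", 403, 20724),
    ("PT8", "Mycobacterium tuberculosis", 15, 1570),
    ("PT10", "Human gammaherpesvirus 4", 20, 2084),
]
FALSE_POSITIVES = [
    ("PT3", "Clostridioides difficile", 122, 126),
    ("PT4", "Hepatitis C virus", 101, 3),
    ("PT5", "Akkermansia muciniphila", 936, 136),
    ("PT10", "Human betaherpesvirus 5", 63, 5),
]


def kmers_per_read(reads, unique_kmers):
    """Average number of unique k-mers contributed per classified read."""
    return unique_kmers / reads


# Print one line per identification with its k-mers-per-read ratio.
for label, rows in (("validated", VALIDATED), ("false positive", FALSE_POSITIVES)):
    for sample, organism, reads, unique_kmers in rows:
        ratio = kmers_per_read(reads, unique_kmers)
        print(f"{label:15} {sample:5} {organism:28} {ratio:8.2f}")
```
</preformat>
<p>With these values, the validated low-read identifications (PT7, PT8, PT10) yield more than 50 unique <italic>k</italic>-mers per read, whereas every row of Table 5 yields fewer than 2.</p>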
<sec id="Sec10">
<title>Further extensions in KrakenUniq</title>
<p id="Par24">KrakenUniq adds three further notable features to the classification engine.
<list list-type="order">
<list-item>
<p id="Par25">Enabling strain identification by extending the taxonomy: The finest level of granularity for Kraken classifications is a node in the NCBI taxonomy. This means that many strains cannot be resolved, because up to hundreds of strains share the same taxonomy ID. KrakenUniq allows the taxonomy to be extended with virtual nodes for genomes, chromosomes, and plasmids, thus enabling identifications at the most specific levels (see Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Suppl. Methods Section 3).</p>
</list-item>
<list-item>
<p id="Par26">Integrating 100,000 viral strain sequences: RefSeq includes only one reference genome for most viral species, which means that much of the variation among viral strains is not covered by a standard RefSeq database. KrakenUniq sources viral strain sequences from the NCBI Viral Genome Resource that are validated as “neighbors” of RefSeq viruses, which leads to up to 20% more read classifications (see Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Suppl. Methods Section 4).</p>
</list-item>
<list-item>
<p id="Par27">Hierarchical classification with multiple databases: Researchers may want to include additional sequence sets, such as draft genomes, in some searches. KrakenUniq allows databases to be chained and matches each
<italic>k</italic>
-mer hierarchically, stopping at the first database in which it finds a match. For example, to mitigate the problem of host contamination in draft genomes, a search may use the host genome as the first database, then complete microbial genomes, and then draft microbial genomes. More details are available in Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Suppl. Methods Section 5.</p>
</list-item>
</list>
</p>
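<p>The hierarchical matching described in the third item can be sketched in a few lines of Python. This is a toy model under stated assumptions, not KrakenUniq's implementation: the databases are plain dictionaries, <italic>k</italic> is set to 4 for readability, and the database names, taxon labels, and function names are all hypothetical.</p>
<preformat>
```python
def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def lookup_hierarchical(kmer, databases):
    """Return (database name, taxon) for the first database that contains kmer."""
    for name, table in databases:
        if kmer in table:
            return name, table[kmer]
    return None  # k-mer is absent from every database


# Hypothetical toy databases with k = 4, listed in search order.
DATABASES = [
    ("host", {"ACGT": "host"}),
    ("complete", {"CGTA": "taxon_A"}),
    # The draft genome carries a host-derived k-mer (ACGT) as contamination.
    ("draft", {"ACGT": "draft_X", "GTAC": "draft_X"}),
]

read = "ACGTAC"
hits = [lookup_hierarchical(kmer, DATABASES) for kmer in kmers(read, 4)]
print(hits)
```
</preformat>
<p>Because the host database is searched first, the contaminating ACGT entry in the draft database is never reached, and that <italic>k</italic>-mer is attributed to the host.</p>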
</sec>
</sec>
<sec id="Sec11">
<title>Timing and memory requirements</title>
<p id="Par28">The additional features of KrakenUniq come without a runtime penalty and with only limited additional memory requirements. In fact, due to code improvements, KrakenUniq often runs faster than Kraken, particularly when most of the reads come from one species. On the test datasets, the mean classification speed, measured in million base pairs per minute (Mbp/m), increased slightly from 410 to 421 Mbp/m (see Additional file 
<xref rid="MOESM2" ref-type="media">2</xref>
: Table S3). When factoring in the time needed to summarize classification results by Kraken-report, which is required for Kraken but part of the classification binary of KrakenUniq, KrakenUniq is on average 50% faster. The memory requirements increase on average by 0.5 GB from 39.5 to 40 GB.</p>
<p id="Par29">On the pathogen ID patient data, where in most cases over 99% of the reads were assigned to either human or synthetic sequences, KrakenUniq was substantially faster than Kraken (Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S5). The classification speed increased from 467 to 733 Mbp/m. The average wall time was about 44% lower, and the average additional memory requirements were less than 1 GB, going from 118.0 to 118.4 GB. All timing comparisons were made after preloading the database and running with 10 parallel threads.</p>
</sec>
</sec>
<sec id="Sec12">
<title>Discussion</title>
<p id="Par30">In our comparison, KrakenUniq performed better in classifying metagenomics data than many existing methods, including the alignment-based methods Blast [
<xref ref-type="bibr" rid="CR16">16</xref>
], Diamond [
<xref ref-type="bibr" rid="CR30">30</xref>
], and MetaFlow [
<xref ref-type="bibr" rid="CR19">19</xref>
]. Blast and Diamond results were post-processed by Megan [
<xref ref-type="bibr" rid="CR17">17</xref>
,
<xref ref-type="bibr" rid="CR31">31</xref>
], which assigns reads to the lowest common ancestor (LCA) but ignores coverage when computing the resulting taxonomic profile. Thus, the taxonomic profile (with read counts as abundance measures) is prone, in the same way as non-alignment-based methods, to over-representing false positives that have coverage spikes in parts of the genome. Coverage spikes may appear due to wrongly matched common sequences (e.g., 16S rRNA), short amplified sequences circulating in the laboratory, and contamination in database sequences. MetaFlow, on the other hand, implements coverage-sensitive mapping, which should give better abundance calls, but it did not perform well in our tests. Going from alignments to a good taxonomic profile is difficult because coverage information cannot easily be computed for the LCA taxon and summarized for higher levels in the taxonomic tree. In comparison, read counts and unique
<italic>k</italic>
-mer counts can be assigned to the LCA taxa and summed to higher levels. Notably, KrakenUniq’s
<italic>k</italic>
-mer counting is affected by GC biases in the sequencing data the same way as other read classifiers and aligners [
<xref ref-type="bibr" rid="CR32">32</xref>
] and may underreport GC-rich or GC-poor genomes.</p>
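<p>To illustrate why counts attached to LCA taxa are easy to summarize, the following Python sketch propagates per-taxon read counts up a toy taxonomy. The parent map and the counts are invented for the example, and the function name clade_counts is ours; the same traversal applies to any per-taxon quantity that can be summed.</p>
<preformat>
```python
from collections import defaultdict

# Hypothetical toy taxonomy: child -> parent (the root has parent None).
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Mycobacterium": "Bacteria",
    "Mycobacterium tuberculosis": "Mycobacterium",
    "Mycobacterium avium": "Mycobacterium",
}


def clade_counts(direct_counts, parent):
    """Add each taxon's direct count to the taxon itself and to all its ancestors."""
    totals = defaultdict(int)
    for taxon, count in direct_counts.items():
        node = taxon
        while node is not None:
            totals[node] += count
            node = parent[node]
    return dict(totals)


# Invented numbers of reads assigned directly to each LCA taxon.
direct = {"Mycobacterium tuberculosis": 10, "Mycobacterium": 4, "Bacteria": 2}
totals = clade_counts(direct, PARENT)
for taxon, count in totals.items():
    print(taxon, count)
```
</preformat>
<p>A read assigned to a genus-level LCA contributes to the genus and everything above it, but not to any single species below it.</p>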
</sec>
<sec id="Sec13">
<title>Conclusions</title>
<p id="Par31">KrakenUniq is a novel method that combines fast
<italic>k</italic>
-mer-based classification with an efficient algorithm for counting the number of unique
<italic>k</italic>
-mers found in each species in a metagenomics dataset. When the reads from a species yield many unique
<italic>k</italic>
-mers, one can be more confident that the taxon is truly present, while a low number of unique
<italic>k</italic>
-mers suggests a possible false-positive identification. We demonstrated that using unique
<italic>k</italic>
-mer counts provides improved accuracy for species identification and that
<italic>k</italic>
-mer counts can help greatly in identifying false positives. In our comparisons with multiple other metagenomics classifiers on multiple metagenomics datasets, we found that KrakenUniq consistently ranked at the top. The strategy of counting unique
<italic>k</italic>
-mer matches allows KrakenUniq to detect that reads are spread across a genome, without the need to align the reads. By using a probabilistic counting algorithm, KrakenUniq is able to match the exceptionally fast classification time of the original Kraken program with only a very small increase in memory. The result is that KrakenUniq gains many of the advantages of alignment at a far lower computational cost.</p>
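<p>The probabilistic counting idea can be illustrated with a minimal HyperLogLog [<xref ref-type="bibr" rid="CR10">10</xref>] sketch in Python. This is the basic estimator with its small-range (linear counting) correction, written for illustration only; it is not KrakenUniq's C++ implementation. The class name, the use of SHA-1 as the hash function, the precision p = 12, and the random test sequence are all assumptions of this sketch.</p>
<preformat>
```python
import hashlib
import math
import random


class HyperLogLog:
    """Toy HyperLogLog sketch with 2**p one-byte registers."""

    def __init__(self, p=12):
        self.p = p
        self.m = 2 ** p
        self.registers = bytearray(self.m)

    def add(self, item):
        # 64-bit hash taken from the first 8 bytes of SHA-1 (an assumption).
        digest = hashlib.sha1(item.encode("ascii")).digest()
        h = int.from_bytes(digest[:8], "big")
        rest_bits = 64 - self.p
        index = h >> rest_bits           # the first p bits select a register
        rest = h % (2 ** rest_bits)      # the remaining bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = rest_bits - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)  # constant valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw > 2.5 * self.m or zeros == 0:
            return raw
        # Small-range correction: linear counting on the empty registers.
        return self.m * math.log(self.m / zeros)


def kmers(seq, k=31):
    """Generate all overlapping k-mers of a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(20000))

sketch = HyperLogLog(p=12)
for _ in range(2):  # every k-mer is observed twice
    for kmer in kmers(genome):
        sketch.add(kmer)

exact = len(set(kmers(genome)))
estimate = sketch.estimate()
print(f"exact distinct k-mers: {exact}, HLL estimate: {estimate:.0f}")
```
</preformat>
<p>The sketch stores 4096 one-byte registers regardless of how many <italic>k</italic>-mers are added, and observing a <italic>k</italic>-mer a second time leaves the registers unchanged, which is why repeated reads from the same locus do not inflate the count.</p>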
</sec>
<sec sec-type="supplementary-material">
<title>Additional files</title>
<sec id="Sec14">
<p>
<supplementary-material content-type="local-data" id="MOESM1">
<media xlink:href="13059_2018_1568_MOESM1_ESM.pdf">
<label>Additional file 1:</label>
<caption>
<p>Supplementary materials. Contains supplementary methods sections 1–7,
<bold>Figures S1–S4</bold>
, and
<bold>Tables S1–S2, S4–S7</bold>
. (PDF 1105 kb)</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="MOESM2">
<media xlink:href="13059_2018_1568_MOESM2_ESM.xlsx">
<label>Additional file 2:</label>
<caption>
<p>
<bold>Table S3</bold>
, which has the description and results of the test datasets from McIntyre et al. (2017). (XLSX 18 kb)</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="MOESM3">
<media xlink:href="13059_2018_1568_MOESM3_ESM.pdf">
<label>Additional file 3:</label>
<caption>
<p>
<bold>Table S8</bold>
, showing the comparison of sequencing depth and
<italic>k</italic>
-mer count threshold from Fig. 
<xref rid="Fig4" ref-type="fig">4</xref>
. (PDF 93 kb)</p>
</caption>
</media>
</supplementary-material>
</p>
</sec>
</sec>
</body>
<back>
<ack>
<title>Acknowledgements</title>
<p>Thanks to Jen Lu, Ales Varabyou, Thomas Mehoke, David Karig, Sharon Bewick, and Peter Thielen for the valuable discussions on the general method and its applicability. Thanks to Alexa McIntyre and Rachid Ounit for providing feedback and for sharing scripts and data from their benchmarking paper. Thanks to Jessica Atwell for proofreading the manuscript.</p>
<sec id="FPar1">
<title>Funding</title>
<p id="Par32">This work was supported in part by grants R01-GM083873, R01-HG006677, and R01GM118568 from the National Institutes of Health and by grant number W911NF-14-1-0490 from the US Army Research Office.</p>
</sec>
<sec id="FPar2">
<title>Availability of data and materials</title>
<p id="Par33">KrakenUniq
<inline-graphic xlink:href="13059_2018_1568_Figa_HTML.gif" id="d29e2327"></inline-graphic>
is implemented in C++ and Perl. Its source code is available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/fbreitwieser/krakenuniq">https://github.com/fbreitwieser/krakenuniq</ext-link>
[
<xref ref-type="bibr" rid="CR33">33</xref>
], licensed under GPL3. The version used in the manuscript is permanently available under 10.5281/zenodo.1412647. Analysis scripts for the results of this manuscript are available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/fbreitwieser/krakenuniq-manuscript-code">https://github.com/fbreitwieser/krakenuniq-manuscript-code</ext-link>
. Additional information and manual are available at
<ext-link ext-link-type="uri" xlink:href="http://ccb.jhu.edu/software/krakenuniq">http://ccb.jhu.edu/software/krakenuniq</ext-link>
[
<xref ref-type="bibr" rid="CR34">34</xref>
].</p>
<p id="Par34">The datasets of McIntyre et al. are available at
<ext-link ext-link-type="uri" xlink:href="https://ftp-private.ncbi.nlm.nih.gov/nist-immsa/IMMSA">https://ftp-private.ncbi.nlm.nih.gov/nist-immsa/IMMSA</ext-link>
[
<xref ref-type="bibr" rid="CR35">35</xref>
]. The sequencing datasets of Salzberg et al. are available under the BioProject accession PRJNA314149 [
<xref ref-type="bibr" rid="CR36">36</xref>
]. Note that human reads have been filtered. The test datasets generated by sampling reads from bacterial isolate SRA experiments are available at
<ext-link ext-link-type="uri" xlink:href="ftp://ftp.ccb.jhu.edu/pub/software/krakenuniq/SraSampledDatasets">ftp://ftp.ccb.jhu.edu/pub/software/krakenuniq/SraSampledDatasets</ext-link>
[
<xref ref-type="bibr" rid="CR37">37</xref>
].</p>
</sec>
</ack>
<notes notes-type="author-contribution">
<title>Authors’ contributions</title>
<p>FPB conceived and implemented the method. DNB helped in the development of the exact counting mode of KrakenUniq and in discussions of the HLL algorithms. FPB and SLS discussed the results and wrote the manuscript. All authors read and approved the final manuscript.</p>
</notes>
<notes notes-type="COI-statement">
<sec id="FPar3">
<title>Ethics approval and consent to participate</title>
<p>Not applicable.</p>
</sec>
<sec id="FPar4">
<title>Consent for publication</title>
<p>Not applicable.</p>
</sec>
<sec id="FPar5">
<title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec id="FPar6">
<title>Publisher’s Note</title>
<p>Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.</p>
</sec>
</notes>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<label>1.</label>
<mixed-citation publication-type="other">Breitwieser FP, Lu J, Salzberg SL. A review of methods and databases for metagenomic classification and assembly. Brief Bioinform. 2017. 10.1093/bib/bbx120.</mixed-citation>
</ref>
<ref id="CR2">
<label>2.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Salter</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Cox</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Turek</surname>
<given-names>EM</given-names>
</name>
<name>
<surname>Calus</surname>
<given-names>ST</given-names>
</name>
<name>
<surname>Cookson</surname>
<given-names>WO</given-names>
</name>
<name>
<surname>Moffatt</surname>
<given-names>MF</given-names>
</name>
<name>
<surname>Turner</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Parkhill</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Loman</surname>
<given-names>NJ</given-names>
</name>
<name>
<surname>Walker</surname>
<given-names>AW</given-names>
</name>
</person-group>
<article-title>Reagent and laboratory contamination can critically impact sequence-based microbiome analyses</article-title>
<source>BMC Biol</source>
<year>2014</year>
<volume>12</volume>
<fpage>87</fpage>
<pub-id pub-id-type="doi">10.1186/s12915-014-0087-z</pub-id>
<pub-id pub-id-type="pmid">25387460</pub-id>
</element-citation>
</ref>
<ref id="CR3">
<label>3.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Thoendel</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Jeraldo</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Greenwood-Quaintance</surname>
<given-names>KE</given-names>
</name>
<name>
<surname>Yao</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Chia</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Hanssen</surname>
<given-names>AD</given-names>
</name>
<name>
<surname>Abdel</surname>
<given-names>MP</given-names>
</name>
<name>
<surname>Patel</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Impact of contaminating DNA in whole-genome amplification kits used for metagenomic shotgun sequencing for infection diagnosis</article-title>
<source>J Clin Microbiol</source>
<year>2017</year>
<volume>55</volume>
<fpage>1789</fpage>
<lpage>1801</lpage>
<pub-id pub-id-type="doi">10.1128/JCM.02402-16</pub-id>
<pub-id pub-id-type="pmid">28356418</pub-id>
</element-citation>
</ref>
<ref id="CR4">
<label>4.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
<name>
<surname>Breitwieser</surname>
<given-names>FP</given-names>
</name>
<name>
<surname>Kumar</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Hao</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Burger</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Rodriguez</surname>
<given-names>FJ</given-names>
</name>
<name>
<surname>Lim</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Quinones-Hinojosa</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Gallia</surname>
<given-names>GL</given-names>
</name>
<name>
<surname>Tornheim</surname>
<given-names>JA</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Next-generation sequencing in neuropathologic diagnosis of infections of the nervous system</article-title>
<source>Neurol Neuroimmunol Neuroinflamm</source>
<year>2016</year>
<volume>3</volume>
<fpage>e251</fpage>
<pub-id pub-id-type="doi">10.1212/NXI.0000000000000251</pub-id>
<pub-id pub-id-type="pmid">27340685</pub-id>
</element-citation>
</ref>
<ref id="CR5">
<label>5.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Brown</surname>
<given-names>JR</given-names>
</name>
<name>
<surname>Bharucha</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Breuer</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Encephalitis diagnosis using metagenomics: application of next generation sequencing for undiagnosed cases</article-title>
<source>J Infect</source>
<year>2018</year>
<volume>76</volume>
<fpage>225</fpage>
<lpage>240</lpage>
</element-citation>
</ref>
<ref id="CR6">
<label>6.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mukherjee</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Huntemann</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Ivanova</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Kyrpides</surname>
<given-names>NC</given-names>
</name>
<name>
<surname>Pati</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Large-scale contamination of microbial isolate genomes by Illumina PhiX control</article-title>
<source>Stand Genomic Sci</source>
<year>2015</year>
<volume>10</volume>
<fpage>18</fpage>
<pub-id pub-id-type="doi">10.1186/1944-3277-10-18</pub-id>
<pub-id pub-id-type="pmid">26203331</pub-id>
</element-citation>
</ref>
<ref id="CR7">
<label>7.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dadi</surname>
<given-names>TH</given-names>
</name>
<name>
<surname>Renard</surname>
<given-names>BY</given-names>
</name>
<name>
<surname>Wieler</surname>
<given-names>LH</given-names>
</name>
<name>
<surname>Semmler</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Reinert</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>SLIMM: species level identification of microorganisms from metagenomes</article-title>
<source>PeerJ</source>
<year>2017</year>
<volume>5</volume>
<fpage>e3138</fpage>
<pub-id pub-id-type="doi">10.7717/peerj.3138</pub-id>
<pub-id pub-id-type="pmid">28367376</pub-id>
</element-citation>
</ref>
<ref id="CR8">
<label>8.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Quince</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Walker</surname>
<given-names>AW</given-names>
</name>
<name>
<surname>Simpson</surname>
<given-names>JT</given-names>
</name>
<name>
<surname>Loman</surname>
<given-names>NJ</given-names>
</name>
<name>
<surname>Segata</surname>
<given-names>N</given-names>
</name>
</person-group>
<article-title>Shotgun metagenomics, from sampling to analysis</article-title>
<source>Nat Biotechnol</source>
<year>2017</year>
<volume>35</volume>
<fpage>833</fpage>
<lpage>844</lpage>
<pub-id pub-id-type="doi">10.1038/nbt.3935</pub-id>
<pub-id pub-id-type="pmid">28898207</pub-id>
</element-citation>
</ref>
<ref id="CR9">
<label>9.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wood</surname>
<given-names>DE</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
</person-group>
<article-title>Kraken: ultrafast metagenomic sequence classification using exact alignments</article-title>
<source>Genome Biol</source>
<year>2014</year>
<volume>15</volume>
<fpage>R46</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2014-15-3-r46</pub-id>
<pub-id pub-id-type="pmid">24580807</pub-id>
</element-citation>
</ref>
<ref id="CR10">
<label>10.</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Flajolet</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Fusy</surname>
<given-names>É</given-names>
</name>
<name>
<surname>Gandouet</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Meunier</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm</article-title>
<source>AofA: analysis of algorithms; 2007-06-17; Juan les Pins</source>
<year>2007</year>
<publisher-loc>France</publisher-loc>
<publisher-name>Discrete mathematics and theoretical computer science</publisher-name>
<fpage>137</fpage>
<lpage>156</lpage>
</element-citation>
</ref>
<ref id="CR11">
<label>11.</label>
<mixed-citation publication-type="other">Heule S, Nunkesser M, Hall A. HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In: Proceedings of the 16th International Conference on Extending Database Technology. ACM; 2013. p. 683–692.</mixed-citation>
</ref>
<ref id="CR12">
<label>12.</label>
<mixed-citation publication-type="other">Ertl O. New cardinality estimation methods for HyperLogLog sketches. arXiv:1706.07290. 2017.</mixed-citation>
</ref>
<ref id="CR13">
<label>13.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Brister</surname>
<given-names>JR</given-names>
</name>
<name>
<surname>Ako-Adjei</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Bao</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Blinkova</surname>
<given-names>O</given-names>
</name>
</person-group>
<article-title>NCBI viral genomes resource</article-title>
<source>Nucleic Acids Res</source>
<year>2015</year>
<volume>43</volume>
<fpage>D571</fpage>
<lpage>D577</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gku1207</pub-id>
<pub-id pub-id-type="pmid">25428358</pub-id>
</element-citation>
</ref>
<ref id="CR14">
<label>14.</label>
<mixed-citation publication-type="other">Irber Junior LC, Brown CT. Efficient cardinality estimation for k-mers in large DNA sequencing data sets. bioRxiv. 2016.</mixed-citation>
</ref>
<ref id="CR15">
<label>15.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>McIntyre</surname>
<given-names>ABR</given-names>
</name>
<name>
<surname>Ounit</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Afshinnekoo</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Prill</surname>
<given-names>RJ</given-names>
</name>
<name>
<surname>Henaff</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Alexander</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Minot</surname>
<given-names>SS</given-names>
</name>
<name>
<surname>Danko</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Foox</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Ahsanuddin</surname>
<given-names>S</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Comprehensive benchmarking and ensemble approaches for metagenomic classifiers</article-title>
<source>Genome Biol</source>
<year>2017</year>
<volume>18</volume>
<fpage>182</fpage>
<pub-id pub-id-type="doi">10.1186/s13059-017-1299-7</pub-id>
<pub-id pub-id-type="pmid">28934964</pub-id>
</element-citation>
</ref>
<ref id="CR16">
<label>16.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Altschul</surname>
<given-names>SF</given-names>
</name>
<name>
<surname>Gish</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Myers</surname>
<given-names>EW</given-names>
</name>
<name>
<surname>Lipman</surname>
<given-names>DJ</given-names>
</name>
</person-group>
<article-title>Basic local alignment search tool</article-title>
<source>J Mol Biol</source>
<year>1990</year>
<volume>215</volume>
<fpage>403</fpage>
<lpage>410</lpage>
<pub-id pub-id-type="doi">10.1016/S0022-2836(05)80360-2</pub-id>
<pub-id pub-id-type="pmid">2231712</pub-id>
</element-citation>
</ref>
<ref id="CR17">
<label>17.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huson</surname>
<given-names>DH</given-names>
</name>
<name>
<surname>Auch</surname>
<given-names>AF</given-names>
</name>
<name>
<surname>Qi</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Schuster</surname>
<given-names>SC</given-names>
</name>
</person-group>
<article-title>MEGAN analysis of metagenomic data</article-title>
<source>Genome Res</source>
<year>2007</year>
<volume>17</volume>
<fpage>377</fpage>
<lpage>386</lpage>
<pub-id pub-id-type="doi">10.1101/gr.5969107</pub-id>
<pub-id pub-id-type="pmid">17255551</pub-id>
</element-citation>
</ref>
<ref id="CR18">
<label>18.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Buchfink</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Xie</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Huson</surname>
<given-names>DH</given-names>
</name>
</person-group>
<article-title>Fast and sensitive protein alignment using DIAMOND</article-title>
<source>Nat Methods</source>
<year>2015</year>
<volume>12</volume>
<fpage>59</fpage>
<lpage>60</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth.3176</pub-id>
<pub-id pub-id-type="pmid">25402007</pub-id>
</element-citation>
</ref>
<ref id="CR19">
<label>19.</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Sobih</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Tomescu</surname>
<given-names>AI</given-names>
</name>
<name>
<surname>Mäkinen</surname>
<given-names>V</given-names>
</name>
</person-group>
<article-title>MetaFlow: metagenomic profiling based on whole-genome coverage analysis with min-cost flows</article-title>
<source>Research in Computational Molecular Biology</source>
<year>2016</year>
<fpage>111</fpage>
<lpage>121</lpage>
</element-citation>
</ref>
<ref id="CR20">
<label>20.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ounit</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Wanamaker</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Close</surname>
<given-names>TJ</given-names>
</name>
<name>
<surname>Lonardi</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers</article-title>
<source>BMC Genomics</source>
<year>2015</year>
<volume>16</volume>
<fpage>236</fpage>
<pub-id pub-id-type="doi">10.1186/s12864-015-1419-2</pub-id>
<pub-id pub-id-type="pmid">25879410</pub-id>
</element-citation>
</ref>
<ref id="CR21">
<label>21.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ounit</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Lonardi</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Higher classification sensitivity of short metagenomic reads with CLARK-S</article-title>
<source>Bioinformatics</source>
<year>2016</year>
<volume>32</volume>
<fpage>3823</fpage>
<lpage>3825</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btw542</pub-id>
<pub-id pub-id-type="pmid">27540266</pub-id>
</element-citation>
</ref>
<ref id="CR22">
<label>22.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ames</surname>
<given-names>SK</given-names>
</name>
<name>
<surname>Hysom</surname>
<given-names>DA</given-names>
</name>
<name>
<surname>Gardner</surname>
<given-names>SN</given-names>
</name>
<name>
<surname>Lloyd</surname>
<given-names>GS</given-names>
</name>
<name>
<surname>Gokhale</surname>
<given-names>MB</given-names>
</name>
<name>
<surname>Allen</surname>
<given-names>JE</given-names>
</name>
</person-group>
<article-title>Scalable metagenomic taxonomy classification using a reference genome database</article-title>
<source>Bioinformatics</source>
<year>2013</year>
<volume>29</volume>
<fpage>2253</fpage>
<lpage>2260</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btt389</pub-id>
<pub-id pub-id-type="pmid">23828782</pub-id>
</element-citation>
</ref>
<ref id="CR23">
<label>23.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rosen</surname>
<given-names>GL</given-names>
</name>
<name>
<surname>Reichenberger</surname>
<given-names>ER</given-names>
</name>
<name>
<surname>Rosenfeld</surname>
<given-names>AM</given-names>
</name>
</person-group>
<article-title>NBC: the naive Bayes classification tool webserver for taxonomic classification of metagenomic reads</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<fpage>127</fpage>
<lpage>129</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btq619</pub-id>
<pub-id pub-id-type="pmid">21062764</pub-id>
</element-citation>
</ref>
<ref id="CR24">
<label>24.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Freitas</surname>
<given-names>TA</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>PE</given-names>
</name>
<name>
<surname>Scholz</surname>
<given-names>MB</given-names>
</name>
<name>
<surname>Chain</surname>
<given-names>PS</given-names>
</name>
</person-group>
<article-title>Accurate read-based metagenome characterization using a hierarchical suite of unique signatures</article-title>
<source>Nucleic Acids Res</source>
<year>2015</year>
<volume>43</volume>
<fpage>e69</fpage>
<pub-id pub-id-type="doi">10.1093/nar/gkv180</pub-id>
<pub-id pub-id-type="pmid">25765641</pub-id>
</element-citation>
</ref>
<ref id="CR25">
<label>25.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Truong</surname>
<given-names>DT</given-names>
</name>
<name>
<surname>Franzosa</surname>
<given-names>EA</given-names>
</name>
<name>
<surname>Tickle</surname>
<given-names>TL</given-names>
</name>
<name>
<surname>Scholz</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Weingart</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Pasolli</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Tett</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Huttenhower</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Segata</surname>
<given-names>N</given-names>
</name>
</person-group>
<article-title>MetaPhlAn2 for enhanced metagenomic taxonomic profiling</article-title>
<source>Nat Methods</source>
<year>2015</year>
<volume>12</volume>
<fpage>902</fpage>
<lpage>903</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth.3589</pub-id>
<pub-id pub-id-type="pmid">26418763</pub-id>
</element-citation>
</ref>
<ref id="CR26">
<label>26.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Darling</surname>
<given-names>AE</given-names>
</name>
<name>
<surname>Jospin</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Lowe</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Matsen</surname>
<given-names>FA</given-names>
</name>
<name>
<surname>Bik</surname>
<given-names>HM</given-names>
</name>
<name>
<surname>Eisen</surname>
<given-names>JA</given-names>
</name>
</person-group>
<article-title>PhyloSift: phylogenetic analysis of genomes and metagenomes</article-title>
<source>PeerJ</source>
<year>2014</year>
<volume>2</volume>
<fpage>e243</fpage>
<pub-id pub-id-type="doi">10.7717/peerj.243</pub-id>
<pub-id pub-id-type="pmid">24482762</pub-id>
</element-citation>
</ref>
<ref id="CR27">
<label>27.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Simner</surname>
<given-names>PJ</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Carroll</surname>
<given-names>KC</given-names>
</name>
</person-group>
<article-title>Understanding the promises and hurdles of metagenomic next-generation sequencing as a diagnostic tool for infectious diseases</article-title>
<source>Clin Infect Dis</source>
<year>2018</year>
<volume>66</volume>
<fpage>778</fpage>
<lpage>788</lpage>
<pub-id pub-id-type="doi">10.1093/cid/cix881</pub-id>
<pub-id pub-id-type="pmid">29040428</pub-id>
</element-citation>
</ref>
<ref id="CR28">
<label>28.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Cleveland</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Schnoll-Sussman</surname>
<given-names>F</given-names>
</name>
<name>
<surname>McClure</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Bigg</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Thakkar</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Schultz</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Shah</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Betel</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Identification of low abundance microbiome in clinical samples using whole genome sequencing</article-title>
<source>Genome Biol</source>
<year>2015</year>
<volume>16</volume>
<fpage>265</fpage>
<pub-id pub-id-type="doi">10.1186/s13059-015-0821-z</pub-id>
<pub-id pub-id-type="pmid">26614063</pub-id>
</element-citation>
</ref>
<ref id="CR29">
<label>29.</label>
<mixed-citation publication-type="other">Breitwieser FP, Salzberg SL. Pavian: interactive analysis of metagenomics data for microbiomics and pathogen identification. BioRxiv. 2016.</mixed-citation>
</ref>
<ref id="CR30">
<label>30.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Buchfink</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Xie</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Huson</surname>
<given-names>DH</given-names>
</name>
</person-group>
<article-title>Fast and sensitive protein alignment using DIAMOND</article-title>
<source>Nat Methods</source>
<year>2015</year>
<volume>12</volume>
<fpage>59</fpage>
<lpage>60</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth.3176</pub-id>
<pub-id pub-id-type="pmid">25402007</pub-id>
</element-citation>
</ref>
<ref id="CR31">
<label>31.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huson</surname>
<given-names>DH</given-names>
</name>
<name>
<surname>Beier</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Flade</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Górska</surname>
<given-names>A</given-names>
</name>
<name>
<surname>El-Hadidi</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Mitra</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Ruscheweyh</surname>
<given-names>H-J</given-names>
</name>
<name>
<surname>Tappu</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>MEGAN Community Edition - interactive exploration and analysis of large-scale microbiome sequencing data</article-title>
<source>PLoS Comput Biol</source>
<year>2016</year>
<volume>12</volume>
<issue>6</issue>
<fpage>e1004957</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pcbi.1004957</pub-id>
<pub-id pub-id-type="pmid">27327495</pub-id>
</element-citation>
</ref>
<ref id="CR32">
<label>32.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Xu</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Y-C</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>C-H</given-names>
</name>
<name>
<surname>Chiang</surname>
<given-names>T-Y</given-names>
</name>
<name>
<surname>Hwang</surname>
<given-names>C-C</given-names>
</name>
</person-group>
<article-title>Effects of GC bias in next-generation-sequencing data on de novo genome assembly</article-title>
<source>PLoS One</source>
<year>2013</year>
<volume>8</volume>
<issue>4</issue>
<fpage>e62856</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pone.0062856</pub-id>
<pub-id pub-id-type="pmid">23638157</pub-id>
</element-citation>
</ref>
<ref id="CR33">
<label>33.</label>
<mixed-citation publication-type="other">Breitwieser FP, Baker DN, Salzberg SL. GitHub repository of KrakenUniq.
<ext-link ext-link-type="uri" xlink:href="https://github.com/fbreitwieser/krakenuniq">https://github.com/fbreitwieser/krakenuniq</ext-link>
. Accessed 18 Oct 2018.</mixed-citation>
</ref>
<ref id="CR34">
<label>34.</label>
<mixed-citation publication-type="other">Breitwieser FP, Baker DN, Salzberg SL. GitHub repository of KrakenUniq manuscript code.
<ext-link ext-link-type="uri" xlink:href="https://github.com/fbreitwieser/krakenuniq-manuscript-code">https://github.com/fbreitwieser/krakenuniq-manuscript-code</ext-link>
. Accessed 18 Oct 2018.</mixed-citation>
</ref>
<ref id="CR35">
<label>35.</label>
<mixed-citation publication-type="other">McIntyre ABR, Ounit R, Afshinnekoo E, Prill RJ, Hénaff E, Alexander N, Minot SS, Danko D, Foox J, Ahsanuddin S, et al. IMMSA datasets used in McIntyre et al.
<ext-link ext-link-type="uri" xlink:href="https://ftp-private.ncbi.nlm.nih.gov/nist-immsa/IMMSA/">https://ftp-private.ncbi.nlm.nih.gov/nist-immsa/IMMSA/</ext-link>
. Accessed 18 Oct 2018.</mixed-citation>
</ref>
<ref id="CR36">
<label>36.</label>
<mixed-citation publication-type="other">Salzberg SL, Breitwieser FP, Kumar A, Hao H, Burger P, Rodriguez FJ, Lim M, Quinones-Hinojosa A, Gallia GL, Tornheim JA, et al. Next-generation sequencing in neuropathologic diagnosis of infections of the nervous system; BioProject
<ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/bioproject/PRJNA314149/">https://www.ncbi.nlm.nih.gov/bioproject/PRJNA314149/</ext-link>
. Accessed 18 Oct 2018.</mixed-citation>
</ref>
<ref id="CR37">
<label>37.</label>
<mixed-citation publication-type="other">Breitwieser FP, Baker DN, Salzberg SL. Datasets generated from reads sampled from experiments in SRA linked to bacterial RefSeq genomes.
<ext-link ext-link-type="uri" xlink:href="ftp://ftp.ccb.jhu.edu/pub/software/krakenuniq/SraSampledDatasets">ftp://ftp.ccb.jhu.edu/pub/software/krakenuniq/SraSampledDatasets</ext-link>
. Accessed 18 Oct 2018.</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000368  | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000368  | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021