MERS exploration server

Internal identifier: 000292 (Pmc/Corpus); previous: 0002919; next: 0002930

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative
<italic>k</italic>
-mers</title>
<author>
<name sortKey="Ounit, Rachid" sort="Ounit, Rachid" uniqKey="Ounit R" first="Rachid" last="Ounit">Rachid Ounit</name>
<affiliation>
<nlm:aff id="Aff1">Department of Computer Science &amp; Engineering, University of California, 900 University Avenue, CA, 92521 Riverside USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Wanamaker, Steve" sort="Wanamaker, Steve" uniqKey="Wanamaker S" first="Steve" last="Wanamaker">Steve Wanamaker</name>
<affiliation>
<nlm:aff id="Aff2">Department of Plant &amp; Botanic Sciences, University of California, 900 University Avenue, CA, 92521 Riverside USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Close, Timothy J" sort="Close, Timothy J" uniqKey="Close T" first="Timothy J" last="Close">Timothy J. Close</name>
<affiliation>
<nlm:aff id="Aff2">Department of Plant &amp; Botanic Sciences, University of California, 900 University Avenue, CA, 92521 Riverside USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Lonardi, Stefano" sort="Lonardi, Stefano" uniqKey="Lonardi S" first="Stefano" last="Lonardi">Stefano Lonardi</name>
<affiliation>
<nlm:aff id="Aff1">Department of Computer Science &amp; Engineering, University of California, 900 University Avenue, CA, 92521 Riverside USA</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">25879410</idno>
<idno type="pmc">4428112</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4428112</idno>
<idno type="RBID">PMC:4428112</idno>
<idno type="doi">10.1186/s12864-015-1419-2</idno>
<date when="2015">2015</date>
<idno type="wicri:Area/Pmc/Corpus">000292</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000292</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative
<italic>k</italic>
-mers</title>
<author>
<name sortKey="Ounit, Rachid" sort="Ounit, Rachid" uniqKey="Ounit R" first="Rachid" last="Ounit">Rachid Ounit</name>
<affiliation>
<nlm:aff id="Aff1">Department of Computer Science &amp; Engineering, University of California, 900 University Avenue, CA, 92521 Riverside USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Wanamaker, Steve" sort="Wanamaker, Steve" uniqKey="Wanamaker S" first="Steve" last="Wanamaker">Steve Wanamaker</name>
<affiliation>
<nlm:aff id="Aff2">Department of Plant &amp; Botanic Sciences, University of California, 900 University Avenue, CA, 92521 Riverside USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Close, Timothy J" sort="Close, Timothy J" uniqKey="Close T" first="Timothy J" last="Close">Timothy J. Close</name>
<affiliation>
<nlm:aff id="Aff2">Department of Plant &amp; Botanic Sciences, University of California, 900 University Avenue, CA, 92521 Riverside USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Lonardi, Stefano" sort="Lonardi, Stefano" uniqKey="Lonardi S" first="Stefano" last="Lonardi">Stefano Lonardi</name>
<affiliation>
<nlm:aff id="Aff1">Department of Computer Science &amp; Engineering, University of California, 900 University Avenue, CA, 92521 Riverside USA</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Genomics</title>
<idno type="eISSN">1471-2164</idno>
<imprint>
<date when="2015">2015</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>The problem of supervised DNA sequence classification arises in several fields of computational molecular biology. Although this problem has been extensively studied, it is still computationally challenging due to the size of the datasets that modern sequencing technologies can produce.</p>
</sec>
<sec>
<title>Results</title>
<p>We introduce
<sc>Clark</sc>
, a novel approach to classify metagenomic reads at the species or genus level with high accuracy and high speed. Extensive experimental results on various metagenomic samples show that the classification accuracy of
<sc>Clark</sc>
is better than or comparable to that of the best state-of-the-art tools and that it is significantly faster than any of its competitors. In its fastest single-threaded mode,
<sc>Clark</sc>
classifies, with high accuracy, about 32 million metagenomic short reads per minute.
<sc>Clark</sc>
can also classify BAC clones or transcripts to chromosome arms and centromeric regions.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>
<sc>Clark</sc>
is a versatile, fast and accurate sequence classification method, especially useful for metagenomics and genomics applications. It is freely available at
<ext-link ext-link-type="uri" xlink:href="http://clark.cs.ucr.edu/">http://clark.cs.ucr.edu/</ext-link>
.</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (doi:10.1186/s12864-015-1419-2) contains supplementary material, which is available to authorized users.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Venter, Jc" uniqKey="Venter J">JC Venter</name>
</author>
<author>
<name sortKey="Remington, K" uniqKey="Remington K">K Remington</name>
</author>
<author>
<name sortKey="Heidelberg, Jf" uniqKey="Heidelberg J">JF Heidelberg</name>
</author>
<author>
<name sortKey="Halpern, Al" uniqKey="Halpern A">AL Halpern</name>
</author>
<author>
<name sortKey="Rusch, D" uniqKey="Rusch D">D Rusch</name>
</author>
<author>
<name sortKey="Eisen, Ja" uniqKey="Eisen J">JA Eisen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huttenhower, C" uniqKey="Huttenhower C">C Huttenhower</name>
</author>
<author>
<name sortKey="Gevers, D" uniqKey="Gevers D">D Gevers</name>
</author>
<author>
<name sortKey="Knight, R" uniqKey="Knight R">R Knight</name>
</author>
<author>
<name sortKey="Abubucker, S" uniqKey="Abubucker S">S Abubucker</name>
</author>
<author>
<name sortKey="Badger, Jh" uniqKey="Badger J">JH Badger</name>
</author>
<author>
<name sortKey="Chinwalla, At" uniqKey="Chinwalla A">AT Chinwalla</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huson, Dh" uniqKey="Huson D">DH Huson</name>
</author>
<author>
<name sortKey="Auch, Af" uniqKey="Auch A">AF Auch</name>
</author>
<author>
<name sortKey="Qi, J" uniqKey="Qi J">J Qi</name>
</author>
<author>
<name sortKey="Schuster, Sc" uniqKey="Schuster S">SC Schuster</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Brady, A" uniqKey="Brady A">A Brady</name>
</author>
<author>
<name sortKey="Salzberg, S" uniqKey="Salzberg S">S Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, B" uniqKey="Liu B">B Liu</name>
</author>
<author>
<name sortKey="Gibbons, T" uniqKey="Gibbons T">T Gibbons</name>
</author>
<author>
<name sortKey="Ghodsi, M" uniqKey="Ghodsi M">M Ghodsi</name>
</author>
<author>
<name sortKey="Treangen, T" uniqKey="Treangen T">T Treangen</name>
</author>
<author>
<name sortKey="Pop, M" uniqKey="Pop M">M Pop</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Segata, N" uniqKey="Segata N">N Segata</name>
</author>
<author>
<name sortKey="Waldron, L" uniqKey="Waldron L">L Waldron</name>
</author>
<author>
<name sortKey="Ballarini, A" uniqKey="Ballarini A">A Ballarini</name>
</author>
<author>
<name sortKey="Narasimhan, V" uniqKey="Narasimhan V">V Narasimhan</name>
</author>
<author>
<name sortKey="Jousson, O" uniqKey="Jousson O">O Jousson</name>
</author>
<author>
<name sortKey="Huttenhower, C" uniqKey="Huttenhower C">C Huttenhower</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rosen, Gl" uniqKey="Rosen G">GL Rosen</name>
</author>
<author>
<name sortKey="Reichenberger, Er" uniqKey="Reichenberger E">ER Reichenberger</name>
</author>
<author>
<name sortKey="Rosenfeld, Am" uniqKey="Rosenfeld A">AM Rosenfeld</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Patil, Kr" uniqKey="Patil K">KR Patil</name>
</author>
<author>
<name sortKey="Haider, P" uniqKey="Haider P">P Haider</name>
</author>
<author>
<name sortKey="Pope, Pb" uniqKey="Pope P">PB Pope</name>
</author>
<author>
<name sortKey="Turnbaugh, Pj" uniqKey="Turnbaugh P">PJ Turnbaugh</name>
</author>
<author>
<name sortKey="Morrison, M" uniqKey="Morrison M">M Morrison</name>
</author>
<author>
<name sortKey="Scheffer, T" uniqKey="Scheffer T">T Scheffer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ames, Sk" uniqKey="Ames S">SK Ames</name>
</author>
<author>
<name sortKey="Hysom, Da" uniqKey="Hysom D">DA Hysom</name>
</author>
<author>
<name sortKey="Gardner, Sn" uniqKey="Gardner S">SN Gardner</name>
</author>
<author>
<name sortKey="Lloyd, Gs" uniqKey="Lloyd G">GS Lloyd</name>
</author>
<author>
<name sortKey="Gokhale, Mb" uniqKey="Gokhale M">MB Gokhale</name>
</author>
<author>
<name sortKey="Allen, Je" uniqKey="Allen J">JE Allen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wood, D" uniqKey="Wood D">D Wood</name>
</author>
<author>
<name sortKey="Salzberg, S" uniqKey="Salzberg S">S Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bazinet, Al" uniqKey="Bazinet A">AL Bazinet</name>
</author>
<author>
<name sortKey="Cummings, Mp" uniqKey="Cummings M">MP Cummings</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Koslicki, D" uniqKey="Koslicki D">D Koslicki</name>
</author>
<author>
<name sortKey="Foucart, S" uniqKey="Foucart S">S Foucart</name>
</author>
<author>
<name sortKey="Rosen, G" uniqKey="Rosen G">G Rosen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Altschul, Sf" uniqKey="Altschul S">SF Altschul</name>
</author>
<author>
<name sortKey="Gish, W" uniqKey="Gish W">W Gish</name>
</author>
<author>
<name sortKey="Miller, W" uniqKey="Miller W">W Miller</name>
</author>
<author>
<name sortKey="Myers, Ew" uniqKey="Myers E">EW Myers</name>
</author>
<author>
<name sortKey="Lipman, Dj" uniqKey="Lipman D">DJ Lipman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kent, Wj" uniqKey="Kent W">WJ Kent</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vinga, S" uniqKey="Vinga S">S Vinga</name>
</author>
<author>
<name sortKey="Almeida, J" uniqKey="Almeida J">J Almeida</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mavromatis, K" uniqKey="Mavromatis K">K Mavromatis</name>
</author>
<author>
<name sortKey="Ivanova, N" uniqKey="Ivanova N">N Ivanova</name>
</author>
<author>
<name sortKey="Barry, K" uniqKey="Barry K">K Barry</name>
</author>
<author>
<name sortKey="Shapiro, H" uniqKey="Shapiro H">H Shapiro</name>
</author>
<author>
<name sortKey="Goltsman, E" uniqKey="Goltsman E">E Goltsman</name>
</author>
<author>
<name sortKey="Mchardy, Ac" uniqKey="Mchardy A">AC McHardy</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Magoc, T" uniqKey="Magoc T">T Magoc</name>
</author>
<author>
<name sortKey="Pabinger, S" uniqKey="Pabinger S">S Pabinger</name>
</author>
<author>
<name sortKey="Canzar, S" uniqKey="Canzar S">S Canzar</name>
</author>
<author>
<name sortKey="Liu, X" uniqKey="Liu X">X Liu</name>
</author>
<author>
<name sortKey="Su, Q" uniqKey="Su Q">Q Su</name>
</author>
<author>
<name sortKey="Puiu, D" uniqKey="Puiu D">D Puiu</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Antonio, Ma" uniqKey="Antonio M">MA Antonio</name>
</author>
<author>
<name sortKey="Hawes, Se" uniqKey="Hawes S">SE Hawes</name>
</author>
<author>
<name sortKey="Hillier, Sl" uniqKey="Hillier S">SL Hillier</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hyman, Rw" uniqKey="Hyman R">RW Hyman</name>
</author>
<author>
<name sortKey="Fukushima, M" uniqKey="Fukushima M">M Fukushima</name>
</author>
<author>
<name sortKey="Diamond, L" uniqKey="Diamond L">L Diamond</name>
</author>
<author>
<name sortKey="Kumm, J" uniqKey="Kumm J">J Kumm</name>
</author>
<author>
<name sortKey="Giudice, Lc" uniqKey="Giudice L">LC Giudice</name>
</author>
<author>
<name sortKey="Davis, Rw" uniqKey="Davis R">RW Davis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dolezel, J" uniqKey="Dolezel J">J Doležel</name>
</author>
<author>
<name sortKey="Vrana, J" uniqKey="Vrana J">J Vrána</name>
</author>
<author>
<name sortKey="Safa, J" uniqKey="Safa J">J Šafář</name>
</author>
<author>
<name sortKey="Bartos, J" uniqKey="Bartos J">J Bartoš</name>
</author>
<author>
<name sortKey="Kubalakova, M" uniqKey="Kubalakova M">M Kubaláková</name>
</author>
<author>
<name sortKey="Simkova, H" uniqKey="Simkova H">H Šimková</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lonardi, S" uniqKey="Lonardi S">S Lonardi</name>
</author>
<author>
<name sortKey="Duma, D" uniqKey="Duma D">D Duma</name>
</author>
<author>
<name sortKey="Alpert, M" uniqKey="Alpert M">M Alpert</name>
</author>
<author>
<name sortKey="Cordero, F" uniqKey="Cordero F">F Cordero</name>
</author>
<author>
<name sortKey="Beccuti, M" uniqKey="Beccuti M">M Beccuti</name>
</author>
<author>
<name sortKey="Bhat, Pr" uniqKey="Bhat P">PR Bhat</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Luo, R" uniqKey="Luo R">R Luo</name>
</author>
<author>
<name sortKey="Liu, B" uniqKey="Liu B">B Liu</name>
</author>
<author>
<name sortKey="Xie, Y" uniqKey="Xie Y">Y Xie</name>
</author>
<author>
<name sortKey="Li, Z" uniqKey="Li Z">Z Li</name>
</author>
<author>
<name sortKey="Huang, W" uniqKey="Huang W">W Huang</name>
</author>
<author>
<name sortKey="Yuan, J" uniqKey="Yuan J">J Yuan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Close, Tj" uniqKey="Close T">TJ Close</name>
</author>
<author>
<name sortKey="Wanamaker, S" uniqKey="Wanamaker S">S Wanamaker</name>
</author>
<author>
<name sortKey="Roose, Ml" uniqKey="Roose M">ML Roose</name>
</author>
<author>
<name sortKey="Lyon, M" uniqKey="Lyon M">M Lyon</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Close, Tj" uniqKey="Close T">TJ Close</name>
</author>
<author>
<name sortKey="Bhat, Pr" uniqKey="Bhat P">PR Bhat</name>
</author>
<author>
<name sortKey="Lonardi, S" uniqKey="Lonardi S">S Lonardi</name>
</author>
<author>
<name sortKey="Wu, Y" uniqKey="Wu Y">Y Wu</name>
</author>
<author>
<name sortKey="Rostoks, N" uniqKey="Rostoks N">N Rostoks</name>
</author>
<author>
<name sortKey="Ramsay, L" uniqKey="Ramsay L">L Ramsay</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mascher, M" uniqKey="Mascher M">M Mascher</name>
</author>
<author>
<name sortKey="Muehlbauer, Gj" uniqKey="Muehlbauer G">GJ Muehlbauer</name>
</author>
<author>
<name sortKey="Rokhsar, Ds" uniqKey="Rokhsar D">DS Rokhsar</name>
</author>
<author>
<name sortKey="Chapman, J" uniqKey="Chapman J">J Chapman</name>
</author>
<author>
<name sortKey="Schmutz, J" uniqKey="Schmutz J">J Schmutz</name>
</author>
<author>
<name sortKey="Barry, K" uniqKey="Barry K">K Barry</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tu, Q" uniqKey="Tu Q">Q Tu</name>
</author>
<author>
<name sortKey="He, Z" uniqKey="He Z">Z He</name>
</author>
<author>
<name sortKey="Zhou, J" uniqKey="Zhou J">J Zhou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, Z" uniqKey="Zhang Z">Z Zhang</name>
</author>
<author>
<name sortKey="Schwartz, S" uniqKey="Schwartz S">S Schwartz</name>
</author>
<author>
<name sortKey="Wagner, L" uniqKey="Wagner L">L Wagner</name>
</author>
<author>
<name sortKey="Miller, W" uniqKey="Miller W">W Miller</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Genomics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Genomics</journal-id>
<journal-title-group>
<journal-title>BMC Genomics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2164</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">25879410</article-id>
<article-id pub-id-type="pmc">4428112</article-id>
<article-id pub-id-type="publisher-id">1419</article-id>
<article-id pub-id-type="doi">10.1186/s12864-015-1419-2</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative
<italic>k</italic>
-mers</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Ounit</surname>
<given-names>Rachid</given-names>
</name>
<address>
<email>rouni001@cs.ucr.edu</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Wanamaker</surname>
<given-names>Steve</given-names>
</name>
<address>
<email>steve.wanamaker@ucr.edu</email>
</address>
<xref ref-type="aff" rid="Aff2"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Close</surname>
<given-names>Timothy J</given-names>
</name>
<address>
<email>timothy.close@ucr.edu</email>
</address>
<xref ref-type="aff" rid="Aff2"></xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Lonardi</surname>
<given-names>Stefano</given-names>
</name>
<address>
<email>stelo@cs.ucr.edu</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
</contrib>
<aff id="Aff1">
<label></label>
Department of Computer Science &amp; Engineering, University of California, 900 University Avenue, CA, 92521 Riverside USA</aff>
<aff id="Aff2">
<label></label>
Department of Plant &amp; Botanic Sciences, University of California, 900 University Avenue, CA, 92521 Riverside USA</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>25</day>
<month>3</month>
<year>2015</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>25</day>
<month>3</month>
<year>2015</year>
</pub-date>
<pub-date pub-type="collection">
<year>2015</year>
</pub-date>
<volume>16</volume>
<issue>1</issue>
<elocation-id>236</elocation-id>
<history>
<date date-type="received">
<day>6</day>
<month>1</month>
<year>2015</year>
</date>
<date date-type="accepted">
<day>28</day>
<month>2</month>
<year>2015</year>
</date>
</history>
<permissions>
<copyright-statement>© Ounit et al.; licensee BioMed Central. 2015</copyright-statement>
<license license-type="open-access">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/2.0">http://creativecommons.org/licenses/by/2.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<sec>
<title>Background</title>
<p>The problem of supervised DNA sequence classification arises in several fields of computational molecular biology. Although this problem has been extensively studied, it is still computationally challenging due to the size of the datasets that modern sequencing technologies can produce.</p>
</sec>
<sec>
<title>Results</title>
<p>We introduce
<sc>Clark</sc>
, a novel approach to classify metagenomic reads at the species or genus level with high accuracy and high speed. Extensive experimental results on various metagenomic samples show that the classification accuracy of
<sc>Clark</sc>
is better than or comparable to that of the best state-of-the-art tools and that it is significantly faster than any of its competitors. In its fastest single-threaded mode,
<sc>Clark</sc>
classifies, with high accuracy, about 32 million metagenomic short reads per minute.
<sc>Clark</sc>
can also classify BAC clones or transcripts to chromosome arms and centromeric regions.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>
<sc>Clark</sc>
is a versatile, fast and accurate sequence classification method, especially useful for metagenomics and genomics applications. It is freely available at
<ext-link ext-link-type="uri" xlink:href="http://clark.cs.ucr.edu/">http://clark.cs.ucr.edu/</ext-link>
.</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (doi:10.1186/s12864-015-1419-2) contains supplementary material, which is available to authorized users.</p>
</sec>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>Metagenomics</kwd>
<kwd>Genomics</kwd>
<kwd>Arm/chromosome assignments</kwd>
<kwd>Discriminative
<italic>k</italic>
-mers</kwd>
<kwd>Sequence-specific
<italic>k</italic>
-mers</kwd>
<kwd>Chromosome arm</kwd>
<kwd>Centromere</kwd>
</kwd-group>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2015</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="Sec1">
<title>Background</title>
<p>The classification problem of determining the origin of a given DNA sequence (e.g., a read or a transcript) in a given set of target sequences (e.g., a set of known genomes) is common to several fields of computational molecular biology. Here, we focus our attention on two applications related to metagenomics and genomics.</p>
<p>In metagenomics, the objective is to study the composition of the microbial community in an environmental sample. For example, sequencing of seawater samples has enabled discoveries in microbial diversity in the marine environment [
<xref ref-type="bibr" rid="CR1">1</xref>
]. Similarly, the study of samples from the human body has elucidated the symbiotic relationships between the human microbiome and human health [
<xref ref-type="bibr" rid="CR2">2</xref>
,
<xref ref-type="bibr" rid="CR3">3</xref>
]. Once a metagenomic sample is sequenced, the first task is to determine the identities of the microbial species present in the sample. Several tools are available to classify metagenomic reads against known bacterial genomes via alignment (e.g., [
<xref ref-type="bibr" rid="CR4">4</xref>
-
<xref ref-type="bibr" rid="CR7">7</xref>
]) or sequence composition (e.g., [
<xref ref-type="bibr" rid="CR8">8</xref>
-
<xref ref-type="bibr" rid="CR11">11</xref>
]). A recent comparative evaluation of these tools [
<xref ref-type="bibr" rid="CR12">12</xref>
] demonstrated that
<sc>NBC</sc>
[
<xref ref-type="bibr" rid="CR8">8</xref>
] exhibits the highest accuracy and sensitivity at the genus level among the tools of [
<xref ref-type="bibr" rid="CR4">4</xref>
-
<xref ref-type="bibr" rid="CR6">6</xref>
,
<xref ref-type="bibr" rid="CR9">9</xref>
]. This study also showed that
<sc>NBC</sc>
and other probabilistic methods (e.g.,
<sc>PHYMMBL</sc>
[
<xref ref-type="bibr" rid="CR5">5</xref>
]) as well as BLAST-based methods (e.g.,
<sc>MEGAN</sc>
[
<xref ref-type="bibr" rid="CR4">4</xref>
],
<sc>METAPHYLER</sc>
[
<xref ref-type="bibr" rid="CR6">6</xref>
]) are computationally expensive. Recently, new faster methods have been introduced (e.g.,
<sc>KRAKEN</sc>
[
<xref ref-type="bibr" rid="CR11">11</xref>
]) but their performance still does not meet
<sc>NBC</sc>
’s sensitivity. To the best of our knowledge, there is no tool yet that has both a sensitivity comparable to
<sc>NBC</sc>
and a speed comparable to
<sc>KRAKEN</sc>
. A related group of metagenomic tools, such as
<sc>METAPHLAN</sc>
[
<xref ref-type="bibr" rid="CR7">7</xref>
] and
<sc>WGSQUIKR</sc>
[
<xref ref-type="bibr" rid="CR13">13</xref>
], addresses the abundance estimation problem; that is, these tools estimate from the reads the proportion of each organism present in the sample.</p>
<p>The second application is associated with
<italic>de novo</italic>
clone-by-clone sequencing and assembly. Given a BAC clone (or a transcript), the objective is sometimes to determine which chromosome (or arm) is the most likely origin of that clone/transcript. The problem assumes that reads for each BAC/transcript as well as reads for each chromosome arm are available, but that the fully assembled reference genome is not. This is the situation in barley, which we have used for this work, and for many other organisms. In the past, the BAC/transcript assignment problem was addressed using general-purpose alignment tools (e.g.,
<sc>BLAST</sc>
[
<xref ref-type="bibr" rid="CR14">14</xref>
] or
<sc>BLAT</sc>
[
<xref ref-type="bibr" rid="CR15">15</xref>
]), as in [
<xref ref-type="bibr" rid="CR16">16</xref>
].</p>
<p>In both of these applications the computational problem is the same: given a set of DNA sequences to be classified (henceforth called “objects”) and a set of reference sequences (e.g., genus-level sequences, chromosome arms, etc., henceforth called “targets”), identify which target is the most likely origin of each object based on sequence similarity. Although this problem has been extensively studied, it is still computationally challenging due to the rapid advances in sequencing technologies: cheaper, faster sequencing instruments can now generate billions of reads in a few days. As the number of objects grows, so does the number of targets, as demonstrated by the exponential growth of GenBank [
<xref ref-type="bibr" rid="CR17">17</xref>
]. Given these demands, it is critical for software tools to minimize the computational resources (time, memory, I/O, etc.) required for analysis.</p>
<p>Here we present
<sc>CLARK</sc>
(CLAssifier based on Reduced K-mers), a new tool that can accurately and efficiently classify objects to targets, based on reduced sets of
<italic>k</italic>
-mers (i.e., DNA words of length
<italic>k</italic>
).
<sc>CLARK</sc>
is the first method able to perform classification of short metagenomic reads at the genus/species level with a sensitivity comparable to that of
<sc>NBC</sc>
, while achieving a speed comparable to that of
<sc>KRAKEN</sc>
. In some situations,
<sc>CLARK</sc>
can be faster and more precise than
<sc>KRAKEN</sc>
at the genus/species level. Unlike tools like
<sc>LMAT</sc>
[
<xref ref-type="bibr" rid="CR10">10</xref>
],
<sc>METAPHLAN</sc>
,
<sc>PHYLOPYTHIAS</sc>
[
<xref ref-type="bibr" rid="CR9">9</xref>
],
<sc>METAPHYLER</sc>
[
<xref ref-type="bibr" rid="CR6">6</xref>
], or
<sc>NBC</sc>
,
<sc>CLARK</sc>
produces assignments with confidence scores, which are critical for post-processing assignments in downstream analyses. Additionally,
<sc>CLARK</sc>
is designed to be user-friendly, self-contained (i.e., does not depend on any other tool or library), and multi-core-friendly.
<sc>CLARK</sc>
does not need as much disk space as
<sc>KRAKEN</sc>
or
<sc>PHYMMBL</sc>
. Finally, a “RAM-light” version of
<sc>CLARK</sc>
can be run on a memory-limited architecture (such as a 4 GB RAM laptop).</p>
</sec>
<sec id="Sec2">
<title>Results and discussion</title>
<p>We briefly review
<sc>CLARK</sc>
’s algorithm before reporting experimental results.</p>
<sec id="Sec3">
<title>Target-specific
<italic>k</italic>
-mers and Classification</title>
<p>During preprocessing,
<sc>CLARK</sc>
builds a large index containing the
<italic>k</italic>
-spectrums of all target sequences. We recall that a
<italic>k</italic>
-mer is a DNA word of fixed length
<italic>k</italic>
, and that the
<italic>k</italic>
-spectrum of a string
<italic>x</italic>
is the vector of dimension 4
<sup>
<italic>k</italic>
</sup>
that collects the number of occurrences of all possible
<italic>k</italic>
-mers in
<italic>x</italic>
. The
<italic>k</italic>
-spectrum is a succinct (lossy) representation of
<italic>x</italic>
, which allows sequence comparison (see e.g., [
<xref ref-type="bibr" rid="CR18">18</xref>
]). Once all
<italic>k</italic>
-spectrums of target sequences have been collected in the index,
<sc>CLARK</sc>
removes any common
<italic>k</italic>
-mers between targets (see
<xref rid="Sec10" ref-type="sec">Methods</xref>
section).</p>
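<p>The preprocessing described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the tool's implementation: the function names <monospace>k_spectrum</monospace> and <monospace>discriminative_kmers</monospace> are hypothetical, the spectrum is stored sparsely (only observed <italic>k</italic>-mers) rather than as a dense vector of dimension 4<sup><italic>k</italic></sup>, and reverse complements are ignored.</p>

```python
from collections import Counter


def k_spectrum(sequence, k):
    """Count every k-mer of `sequence` (sparse form of the 4**k vector)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def discriminative_kmers(targets, k):
    """Map each target name to the k-mers that occur in that target only."""
    spectra = {name: k_spectrum(seq, k) for name, seq in targets.items()}
    # For each k-mer, count the number of distinct targets that contain it.
    owners = Counter(kmer for spectrum in spectra.values() for kmer in spectrum)
    return {
        name: {kmer for kmer in spectrum if owners[kmer] == 1}
        for name, spectrum in spectra.items()
    }


# Toy sequences for illustration only, not real genomes.
targets = {"A": "ACGTAC", "B": "CGTACC"}
print(k_spectrum(targets["A"], 2))
print(discriminative_kmers(targets, 3))
```

<p>In this toy example the 3-mers CGT, GTA and TAC occur in both targets and are discarded, so each target keeps a single target-specific 3-mer.</p>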
<p>Henceforth, we call the remaining
<italic>k</italic>
-mers either
<italic>target-specific</italic>
or
<italic>discriminative</italic>
, because they represent genomic regions that uniquely characterize each target. Finally, an object is assigned to the target with which it shares the highest number of
<italic>k</italic>
-mers.</p>
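<p>The assignment rule can be illustrated with a short Python sketch. It assumes a precomputed index of discriminative <italic>k</italic>-mers; the names <monospace>build_lookup</monospace> and <monospace>classify</monospace>, the toy index, and the arbitrary handling of ties are assumptions made for illustration and are not taken from the tool.</p>

```python
from collections import Counter


def build_lookup(index):
    """Invert {target: set of k-mers} into {k-mer: target}.

    Each discriminative k-mer belongs to exactly one target,
    so the inversion loses no information.
    """
    return {kmer: name for name, kmers in index.items() for kmer in kmers}


def classify(read, lookup, k):
    """Return (best_target, hits); best_target is None if nothing matches."""
    hits = Counter()
    for i in range(len(read) - k + 1):
        target = lookup.get(read[i:i + k])
        if target is not None:
            hits[target] += 1
    if not hits:
        return None, hits
    # Ties are resolved arbitrarily in this sketch.
    return hits.most_common(1)[0][0], hits


# Hypothetical index of discriminative 3-mers for two toy targets.
lookup = build_lookup({"A": {"ACG", "TTT"}, "B": {"ACC", "GGG"}})
print(classify("TACGTTT", lookup, 3))
```

<p>Storing one entry per discriminative <italic>k</italic>-mer means each <italic>k</italic>-mer of the object costs a single dictionary lookup, regardless of the number of targets.</p>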
<p>
<sc>CLARK</sc>
offers two modes of execution. The first mode (henceforth named “full”) outputs for each object the number of hits against all the targets and the confidence score of the assignment (a number between 0.5 and 1.0). The second mode (“default”) employs sampling to reduce the number of target-specific
<italic>k</italic>
-mers for classification, and outputs assignments without any detailed statistics so that the output size is significantly reduced (see
<xref rid="Sec10" ref-type="sec">Methods</xref>
section for more details). The default mode is slightly less accurate, but it is faster.</p>
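<p>This passage gives only the range of the confidence score and defers its definition to the Methods section. One formula consistent with that range, assumed here purely for illustration, divides the highest hit count by the sum of the two highest hit counts. The function name <monospace>confidence</monospace> is hypothetical.</p>

```python
def confidence(hits):
    """Top hit count divided by the sum of the two highest hit counts.

    ASSUMPTION: this formula is chosen because it yields values between
    0.5 and 1.0; it is not quoted from the passage above.
    """
    ranked = sorted(hits.values(), reverse=True)
    if not ranked or ranked[0] == 0:
        return None  # no discriminative k-mer matched the object
    second = ranked[1] if len(ranked) != 1 else 0
    return ranked[0] / (ranked[0] + second)


print(confidence({"A": 9, "B": 3, "C": 1}))  # 9 / (9 + 3)
```

<p>Under this assumption a score of 0.5 means the two best targets are tied, while 1.0 means that every hit supports a single target.</p>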
</sec>
<sec id="Sec4">
<title>Metagenomics classification</title>
<p>Inputs to this classification task are (1) NCBI/RefSeq databases of known bacterial genomes (targets) and either (2A) the set of metagenomic reads used in [
<xref ref-type="bibr" rid="CR11">11</xref>
] and the set of simulated long reads from “simHC” [
<xref ref-type="bibr" rid="CR19">19</xref>
], or (2B) the set of real metagenomic reads from the Human Microbiome Project (objects). The Human Microbiome Project data are freely accessible [
<xref ref-type="bibr" rid="CR2">2</xref>
,
<xref ref-type="bibr" rid="CR3">3</xref>
].</p>
<p>At the time we carried out the experiments the NCBI/RefSeq database was composed of 2,752 complete bacterial genomes, distributed into 695 distinct genera, or 1,473 species. The total length of all these bacterial genomes was about 9.5 Gbp. The average size of a genome was about 3.5 Mbp.</p>
<p>In the first experiment, we used three microbial metagenomics datasets called “HiSeq”, “MiSeq” and “simBA-5” that were introduced in [
<xref ref-type="bibr" rid="CR11">11</xref>
]. According to [
<xref ref-type="bibr" rid="CR11">11</xref>
], “the HiSeq and MiSeq metagenomes were built using twenty sets of bacterial whole-genome shotgun reads. These reads were found either as part of the GAGE-B project [
<xref ref-type="bibr" rid="CR20">20</xref>
] or in the NCBI Sequence Read Archive. Each metagenome contains sequences from ten genomes (see Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S1 in [
<xref ref-type="bibr" rid="CR11">11</xref>
] for the list of genomes). For these metagenomes, 10% of their sequences were selected from each of the ten component genome data sets (i.e., each genome had equal sequence abundance)”. The set “simBA-5” included “simulated bacterial and archaeal reads, and was created with an error rate five times higher than” the default (see [
<xref ref-type="bibr" rid="CR11">11</xref>
]). We also analyzed the set “simHC” of synthetic reads [
<xref ref-type="bibr" rid="CR19">19</xref>
], which simulates high complexity communities lacking dominant populations. SimHC contains 113 sets of reads from various microbial genomes. From simHC, we arbitrarily selected twenty distinct genomes and extracted the first 500 reads of each genome to build a total of 10,000 reads (see Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S4). We called this latter dataset “simHC.20.500”.</p>
<p>For the experiments below we used the “HiSeq”, “MiSeq” (which can be considered sets of reads of low/medium complexity), “simBA-5” from [
<xref ref-type="bibr" rid="CR11">11</xref>
] and “simHC.20.500” (which can be considered a set of reads of high complexity). Each of these sets contains 10,000 reads. The average read length was 92 bp in HiSeq, 156 bp in MiSeq, and 951 bp in simHC.20.500. In simBA-5, all reads are 100 bp long.</p>
<p>In the second experiment, we arbitrarily chose three metagenomic samples from the Human Microbiome Project [
<xref ref-type="bibr" rid="CR2">2</xref>
,
<xref ref-type="bibr" rid="CR3">3</xref>
]. The three samples we used were SRS015072 (mid-vagina) containing 572 thousand paired-end reads, SRS019120 (saliva) containing 4.3 million paired-end reads, and SRS023847 (nose) containing 5.2 million paired-end reads.</p>
<sec id="Sec5">
<title>HiSeq, MiSeq, simBA-5 and simHC.20.500</title>
<p>We used
<sc>CLARK</sc>
to classify the reads in the four datasets described above and compared its classification results against the state-of-the-art methods, namely
<sc>NBC</sc>
[
<xref ref-type="bibr" rid="CR8">8</xref>
], which we chose for its high accuracy (currently the most sensitive metagenomics classifier, according to [
<xref ref-type="bibr" rid="CR12">12</xref>
]), and
<sc>KRAKEN</sc>
, which we chose due to its high speed (currently the fastest metagenomics classifier, according to [
<xref ref-type="bibr" rid="CR11">11</xref>
]) and its high precision at the genus level.</p>
<p>We classified the reads (i) against 695 genus-level targets (Table
<xref rid="Tab1" ref-type="table">1</xref>
) and (ii) against 1473 species-level targets (Table
<xref rid="Tab2" ref-type="table">2</xref>
).
<table-wrap id="Tab1">
<label>Table 1</label>
<caption>
<p>
<bold>Genus-level classification accuracy and speed of </bold>
<sc>CLARK</sc>
,
<sc>KRAKEN</sc>
<bold>, and </bold>
<sc>NBC</sc>
<bold> for four simulated metagenomes and several</bold>
<bold>
<italic>k</italic>
</bold>
<bold>-mer lengths</bold>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left"></th>
<th align="left">
<bold>
<italic>k</italic>
</bold>
</th>
<th align="left" colspan="3">
<bold>HiSeq</bold>
</th>
<th align="left" colspan="3">
<bold>MiSeq</bold>
</th>
<th align="left" colspan="3">
<bold>simBA-5</bold>
</th>
<th align="left" colspan="3">
<bold>simHC.20.500</bold>
</th>
</tr>
<tr>
<th align="left"></th>
<th align="left"></th>
<th align="left">
<bold>
<italic>Prec</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Sens</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Speed</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Prec</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Sens</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Speed</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Prec</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Sens</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Speed</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Prec</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Sens</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Speed</italic>
</bold>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">
<sc>NBC</sc>
</td>
<td align="center">15
<sup></sup>
</td>
<td align="center">
<bold>82.57</bold>
</td>
<td align="center">
<bold>82.57</bold>
</td>
<td align="right">0.008</td>
<td align="center">
<bold>81.00</bold>
</td>
<td align="center">
<bold>81.00</bold>
</td>
<td align="right">0.007</td>
<td align="center">
<bold>97.69</bold>
</td>
<td align="center">
<bold>97.69</bold>
</td>
<td align="right">0.007</td>
<td align="center">
<bold>99.40</bold>
</td>
<td align="center">
<bold>99.40</bold>
</td>
<td align="right">0.005</td>
</tr>
<tr>
<td align="left"></td>
<td align="center">13
<sup></sup>
</td>
<td align="center">78.85</td>
<td align="center">78.85</td>
<td align="right">0.011</td>
<td align="center">77.70</td>
<td align="center">77.70</td>
<td align="right">0.009</td>
<td align="center">92.41</td>
<td align="center">92.41</td>
<td align="right">0.010</td>
<td align="center">98.57</td>
<td align="center">98.57</td>
<td align="right">0.006</td>
</tr>
<tr>
<td align="left"></td>
<td align="center">11
<sup></sup>
</td>
<td align="center">58.97</td>
<td align="center">58.97</td>
<td align="right">
<bold>0.020</bold>
</td>
<td align="center">64.43</td>
<td align="center">64.43</td>
<td align="right">
<bold>0.016</bold>
</td>
<td align="center">46.10</td>
<td align="center">46.10</td>
<td align="right">
<bold>0.017</bold>
</td>
<td align="center">86.83</td>
<td align="center">86.83</td>
<td align="right">
<bold>0.008</bold>
</td>
</tr>
<tr>
<td align="left">
<sc>Clark</sc>
(full)</td>
<td align="center">31</td>
<td align="center">
<bold>99.26</bold>
</td>
<td align="center">77.78</td>
<td align="right">
<bold>541</bold>
</td>
<td align="center">
<bold>95.33</bold>
</td>
<td align="center">77.69</td>
<td align="right">
<bold>435</bold>
</td>
<td align="center">98.88</td>
<td align="center">89.67</td>
<td align="right">
<bold>591</bold>
</td>
<td align="center">
<bold>99.68</bold>
</td>
<td align="center">
<bold>99.42</bold>
</td>
<td align="right">121</td>
</tr>
<tr>
<td align="left"></td>
<td align="center">27</td>
<td align="center">98.98</td>
<td align="center">79.88</td>
<td align="right">538</td>
<td align="center">93.50</td>
<td align="center">78.57</td>
<td align="right">433</td>
<td align="center">
<bold>98.90</bold>
</td>
<td align="center">93.09</td>
<td align="right">585</td>
<td align="center">99.67</td>
<td align="center">
<bold>99.42</bold>
</td>
<td align="right">
<bold>122</bold>
</td>
</tr>
<tr>
<td align="left"></td>
<td align="center">23</td>
<td align="center">97.33</td>
<td align="center">81.97</td>
<td align="right">530</td>
<td align="center">90.06</td>
<td align="center">80.02</td>
<td align="right">426</td>
<td align="center">98.71</td>
<td align="center">94.54</td>
<td align="right">559</td>
<td align="center">99.59</td>
<td align="center">
<bold>99.42</bold>
</td>
<td align="right">119</td>
</tr>
<tr>
<td align="left"></td>
<td align="center">20</td>
<td align="center">87.00</td>
<td align="center">
<bold>82.87</bold>
</td>
<td align="right">532</td>
<td align="center">82.45</td>
<td align="center">
<bold>80.19</bold>
</td>
<td align="right">420</td>
<td align="center">97.38</td>
<td align="center">
<bold>94.80</bold>
</td>
<td align="right">549</td>
<td align="center">99.43</td>
<td align="center">99.41</td>
<td align="right">115</td>
</tr>
<tr>
<td align="left">
<sc>Kraken</sc>
</td>
<td align="center">31</td>
<td align="center">
<bold>99.26</bold>
</td>
<td align="center">77.76</td>
<td align="right">
<bold>2,332</bold>
</td>
<td align="center">
<bold>95.50</bold>
</td>
<td align="center">77.59</td>
<td align="right">
<bold>1,361</bold>
</td>
<td align="center">98.28</td>
<td align="center">89.35</td>
<td align="right">
<bold>1,976</bold>
</td>
<td align="center">96.83</td>
<td align="center">96.55</td>
<td align="right">
<bold>237</bold>
</td>
</tr>
<tr>
<td align="left"></td>
<td align="center">27</td>
<td align="center">99.01</td>
<td align="center">79.85</td>
<td align="right">2,048</td>
<td align="center">93.91</td>
<td align="center">78.47</td>
<td align="right">1,240</td>
<td align="center">
<bold>98.31</bold>
</td>
<td align="center">92.73</td>
<td align="right">1,917</td>
<td align="center">
<bold>96.85</bold>
</td>
<td align="center">96.57</td>
<td align="right">231</td>
</tr>
<tr>
<td align="left"></td>
<td align="center">23</td>
<td align="center">97.45</td>
<td align="center">81.89</td>
<td align="right">1,923</td>
<td align="center">90.56</td>
<td align="center">79.75</td>
<td align="right">1,186</td>
<td align="center">98.25</td>
<td align="center">94.18</td>
<td align="right">1,824</td>
<td align="center">96.80</td>
<td align="center">96.57</td>
<td align="right">228</td>
</tr>
<tr>
<td align="left"></td>
<td align="center">20</td>
<td align="center">90.22</td>
<td align="center">
<bold>82.67</bold>
</td>
<td align="right">1,546</td>
<td align="center">86.28</td>
<td align="center">
<bold>79.99</bold>
</td>
<td align="right">965</td>
<td align="center">98.07</td>
<td align="center">
<bold>94.44</bold>
</td>
<td align="right">1,478</td>
<td align="center">96.71</td>
<td align="center">
<bold>96.59</bold>
</td>
<td align="right">211</td>
</tr>
<tr>
<td align="left">
<sc>Clark</sc>
</td>
<td align="center">31</td>
<td align="center">
<bold>99.31</bold>
</td>
<td align="center">77.25</td>
<td align="right">
<bold>3,116</bold>
</td>
<td align="center">
<bold>95.66</bold>
</td>
<td align="center">77.44</td>
<td align="right">
<bold>1,670</bold>
</td>
<td align="center">
<bold>98.91</bold>
</td>
<td align="center">88.62</td>
<td align="right">
<bold>2,855</bold>
</td>
<td align="center">
<bold>99.68</bold>
</td>
<td align="center">
<bold>99.42</bold>
</td>
<td align="right">
<bold>251</bold>
</td>
</tr>
<tr>
<td align="left"></td>
<td align="center">27</td>
<td align="center">99.07</td>
<td align="center">79.37</td>
<td align="right">2,796</td>
<td align="center">93.90</td>
<td align="center">78.29</td>
<td align="right">1,522</td>
<td align="center">98.90</td>
<td align="center">92.26</td>
<td align="right">2,554</td>
<td align="center">99.67</td>
<td align="center">
<bold>99.42</bold>
</td>
<td align="right">241</td>
</tr>
<tr>
<td align="left"></td>
<td align="center">23</td>
<td align="center">97.85</td>
<td align="center">81.36</td>
<td align="right">2,679</td>
<td align="center">90.98</td>
<td align="center">79.57</td>
<td align="right">1,482</td>
<td align="center">98.75</td>
<td align="center">94.26</td>
<td align="right">2,394</td>
<td align="center">99.60</td>
<td align="center">
<bold>99.42</bold>
</td>
<td align="right">244</td>
</tr>
<tr>
<td align="left"></td>
<td align="center">20</td>
<td align="center">88.60</td>
<td align="center">
<bold>82.26</bold>
</td>
<td align="right">2,567</td>
<td align="center">83.35</td>
<td align="center">
<bold>79.77</bold>
</td>
<td align="right">1,456</td>
<td align="center">97.73</td>
<td align="center">
<bold>94.49</bold>
</td>
<td align="right">2,306</td>
<td align="center">99.43</td>
<td align="center">99.41</td>
<td align="right">239</td>
</tr>
<tr>
<td align="left">
<sc>Kraken</sc>
-Q</td>
<td align="center">31</td>
<td align="center">
<bold>99.20</bold>
</td>
<td align="center">76.84</td>
<td align="right">6,224</td>
<td align="center">
<bold>95.81</bold>
</td>
<td align="center">
<bold>74.13</bold>
</td>
<td align="right">5,308</td>
<td align="center">
<bold>98.17</bold>
</td>
<td align="center">87.46</td>
<td align="right">7,023</td>
<td align="center">
<bold>91.17</bold>
</td>
<td align="center">
<bold>85.79</bold>
</td>
<td align="right">3,809</td>
</tr>
<tr>
<td align="left"></td>
<td align="center">27</td>
<td align="center">98.79</td>
<td align="center">78.19</td>
<td align="right">6,410</td>
<td align="center">94.12</td>
<td align="center">73.73</td>
<td align="right">5,555</td>
<td align="center">98.11</td>
<td align="center">
<bold>89.89</bold>
</td>
<td align="right">7,992</td>
<td align="center">90.99</td>
<td align="center">83.71</td>
<td align="right">4,196</td>
</tr>
<tr>
<td align="left"></td>
<td align="center">23</td>
<td align="center">96.67</td>
<td align="center">
<bold>78.48</bold>
</td>
<td align="right">7,015</td>
<td align="center">90.57</td>
<td align="center">72.35</td>
<td align="right">6,329</td>
<td align="center">97.21</td>
<td align="center">89.07</td>
<td align="right">8,989</td>
<td align="center">90.46</td>
<td align="center">79.27</td>
<td align="right">4,574</td>
</tr>
<tr>
<td align="left"></td>
<td align="center">20</td>
<td align="center">82.07</td>
<td align="center">70.11</td>
<td align="right">
<bold>9,437</bold>
</td>
<td align="center">80.05</td>
<td align="center">65.25</td>
<td align="right">
<bold>9,537</bold>
</td>
<td align="center">90.02</td>
<td align="center">77.04</td>
<td align="right">
<bold>10,961</bold>
</td>
<td align="center">70.86</td>
<td align="center">57.40</td>
<td align="right">
<bold>5,819</bold>
</td>
</tr>
<tr>
<td align="left">
<sc>Clark</sc>
-
<italic>E</italic>
</td>
<td align="center">31</td>
<td align="center">
<bold>99.55</bold>
</td>
<td align="center">72.72</td>
<td align="right">
<bold>32,450</bold>
</td>
<td align="center">
<bold>98.11</bold>
</td>
<td align="center">
<bold>74.58</bold>
</td>
<td align="right">
<bold>28,988</bold>
</td>
<td align="center">
<bold>99.00</bold>
</td>
<td align="center">77.85</td>
<td align="right">26,171</td>
<td align="center">97.63</td>
<td align="center">97.31</td>
<td align="right">15,426</td>
</tr>
<tr>
<td align="left"></td>
<td align="center">27</td>
<td align="center">99.43</td>
<td align="center">74.67</td>
<td align="right">29,897</td>
<td align="center">96.93</td>
<td align="center">75.68</td>
<td align="right">28,459</td>
<td align="center">98.93</td>
<td align="center">84.86</td>
<td align="right">
<bold>27,451</bold>
</td>
<td align="center">97.47</td>
<td align="center">97.18</td>
<td align="right">
<bold>16,124</bold>
</td>
</tr>
<tr>
<td align="left"></td>
<td align="center">23</td>
<td align="center">98.93</td>
<td align="center">78.20</td>
<td align="right">31,112</td>
<td align="center">95.01</td>
<td align="center">76.88</td>
<td align="right">26,747</td>
<td align="center">98.34</td>
<td align="center">90.20</td>
<td align="right">26,647</td>
<td align="center">
<bold>98.56</bold>
</td>
<td align="center">
<bold>98.32</bold>
</td>
<td align="right">15,408</td>
</tr>
<tr>
<td align="left"></td>
<td align="center">20</td>
<td align="center">94.74</td>
<td align="center">
<bold>78.46</bold>
</td>
<td align="right">30,029</td>
<td align="center">90.57</td>
<td align="center">76.60</td>
<td align="right">25,789</td>
<td align="center">96.61</td>
<td align="center">
<bold>89.98</bold>
</td>
<td align="right">26,545</td>
<td align="center">93.94</td>
<td align="center">93.82</td>
<td align="right">15,587</td>
</tr>
<tr>
<td align="left">
<sc>Clark</sc>
-
<italic>l</italic>
</td>
<td align="center">27</td>
<td align="center">98.45</td>
<td align="center">62.30</td>
<td align="right">1,525</td>
<td align="center">92.11</td>
<td align="center">69.64</td>
<td align="right">861</td>
<td align="center">95.96</td>
<td align="center">52.00</td>
<td align="right">1,705</td>
<td align="center">99.49</td>
<td align="center">98.94</td>
<td align="right">143</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Performance statistics for several choices of the
<italic>k</italic>
-mer length for
<sc>NBC</sc>
,
<sc>KRAKEN</sc>
,
<sc>CLARK</sc>
and their fast variants on the classification of “HiSeq”, “MiSeq”, “simBA-5” and “simHC.20.500” metagenomic datasets against the 695 genus-level targets; precision and sensitivity are expressed as percentages, while speed is expressed in 10
<sup>3</sup>
reads per minute;
<sc>KRAKEN</sc>
-Q and
<sc>CLARK</sc>
-
<italic>E</italic>
are faster, but less accurate, variants of these tools;
<sc>CLARK</sc>
-
<italic>l</italic>
is a less memory-intensive version of
<sc>CLARK</sc>
which runs only for
<italic>k</italic>
= 27; experiments were carried out in single-threaded mode;
<sup></sup>
parameter
<italic>k</italic>
is referred to as
<italic>N</italic>
in the
<sc>NBC</sc>
manuscript.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap id="Tab2">
<label>Table 2</label>
<caption>
<p>
<bold>Species-level classification accuracy and speed of </bold>
<sc>CLARK</sc>
,
<sc>KRAKEN</sc>
<bold>, and </bold>
<sc>NBC</sc>
<bold> for four simulated metagenomes</bold>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left"></th>
<th align="left" colspan="3">
<bold>HiSeq</bold>
</th>
<th align="left" colspan="3">
<bold>MiSeq</bold>
</th>
<th align="left" colspan="3">
<bold>simBA-5</bold>
</th>
<th align="left" colspan="3">
<bold>simHC.20.500</bold>
</th>
</tr>
<tr>
<th align="left"></th>
<th align="left">
<bold>
<italic>Prec</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Sens</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Speed</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Prec</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Sens</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Speed</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Prec</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Sens</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Speed</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Prec</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Sens</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Speed</italic>
</bold>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">
<sc>NBC</sc>
(
<italic>k</italic>
=15)</td>
<td align="left">68.67</td>
<td align="left">68.70</td>
<td align="left">0.008</td>
<td align="left">68.33</td>
<td align="left">68.33</td>
<td align="left">0.007</td>
<td align="left">91.74</td>
<td align="left">91.74</td>
<td align="left">0.007</td>
<td align="left">94.32</td>
<td align="left">94.32</td>
<td align="left">0.005</td>
</tr>
<tr>
<td align="left">
<sc>Clark</sc>
(
<italic>k</italic>
=20)</td>
<td align="left">69.44</td>
<td align="left">61.46</td>
<td align="left">272</td>
<td align="left">70.72</td>
<td align="left">62.45</td>
<td align="left">239</td>
<td align="left">91.32</td>
<td align="left">82.48</td>
<td align="left">269</td>
<td align="left">94.34</td>
<td align="left">94.32</td>
<td align="left">96</td>
</tr>
<tr>
<td align="left">
<sc>Kraken</sc>
(
<italic>k</italic>
=31)</td>
<td align="left">74.00</td>
<td align="left">53.49</td>
<td align="left">2,332</td>
<td align="left">77.72</td>
<td align="left">58.72</td>
<td align="left">1,361</td>
<td align="left">92.99</td>
<td align="left">78.70</td>
<td align="left">1,976</td>
<td align="left">84.67</td>
<td align="left">84.31</td>
<td align="left">237</td>
</tr>
<tr>
<td align="left">
<sc>Clark</sc>
(
<italic>k</italic>
=31)</td>
<td align="left">86.74</td>
<td align="left">58.59</td>
<td align="left">3,011</td>
<td align="left">89.49</td>
<td align="left">61.84</td>
<td align="left">1,566</td>
<td align="left">98.85</td>
<td align="left">76.80</td>
<td align="left">2,855</td>
<td align="left">94.67</td>
<td align="left">94.26</td>
<td align="left">251</td>
</tr>
<tr>
<td align="left">
<sc>Kraken</sc>
-Q (
<italic>k</italic>
=31)</td>
<td align="left">75.88</td>
<td align="left">50.78</td>
<td align="left">6,224</td>
<td align="left">78.07</td>
<td align="left">53.68</td>
<td align="left">5,308</td>
<td align="left">92.67</td>
<td align="left">74.39</td>
<td align="left">7,023</td>
<td align="left">82.40</td>
<td align="left">74.84</td>
<td align="left">3,809</td>
</tr>
<tr>
<td align="left">
<sc>Clark</sc>
-
<italic>E</italic>
(
<italic>k</italic>
=31)</td>
<td align="left">90.08</td>
<td align="left">55.18</td>
<td align="left">30,976</td>
<td align="left">94.31</td>
<td align="left">58.36</td>
<td align="left">24,029</td>
<td align="left">98.92</td>
<td align="left">66.02</td>
<td align="left">24,996</td>
<td align="left">92.78</td>
<td align="left">92.38</td>
<td align="left">15,583</td>
</tr>
<tr>
<td align="left">
<sc>Clark</sc>
-
<italic>l</italic>
(
<italic>k</italic>
=27)</td>
<td align="left">85.35</td>
<td align="left">53.95</td>
<td align="left">1,676</td>
<td align="left">85.89</td>
<td align="left">64.91</td>
<td align="left">904</td>
<td align="left">85.55</td>
<td align="left">46.28</td>
<td align="left">1,702</td>
<td align="left">94.06</td>
<td align="left">93.53</td>
<td align="left">141</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Precision and sensitivity are expressed as percentages, while speed is expressed in 10
<sup>3</sup>
reads per minute for
<sc>NBC</sc>
,
<sc>KRAKEN</sc>
, and
<sc>CLARK</sc>
on the classification of “HiSeq”, “MiSeq”, “simBA-5” and “simHC.20.500” metagenome datasets against the 1473 species-level targets, in single-threaded mode.</p>
</table-wrap-foot>
</table-wrap>
</p>
<p>For a given level in the taxonomy tree (e.g., genus), we define
<italic>precision</italic>
as the fraction of correct assignments over the total number of assignments, and
<italic>sensitivity</italic>
as the ratio between the number of correct assignments and the number of objects to be classified. In order to have a fair comparison against
<sc>KRAKEN</sc>
’s assignments, when
<sc>KRAKEN</sc>
produces an assignment that is not available at or below the genus or species level, it is then considered as not assigned.</p>
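<p>As a worked illustration of these two definitions, the following Python sketch computes both measures from a set of assignments. The function name, the read identifiers and the genus labels are hypothetical; an unassigned object is represented by None.</p>

```python
def precision_sensitivity(assignments, truth):
    """Compute (precision, sensitivity) at one taxonomic level.

    assignments: dict mapping object id to predicted label, or None if unassigned
    truth:       dict mapping object id to its true label
    """
    total = len(truth)
    assigned = sum(1 for obj in truth if assignments.get(obj) is not None)
    correct = sum(1 for obj in truth
                  if assignments.get(obj) is not None
                  and assignments[obj] == truth[obj])
    # Precision: correct assignments over all assignments made.
    precision = correct / assigned if assigned else 0.0
    # Sensitivity: correct assignments over all objects to be classified.
    sensitivity = correct / total if total else 0.0
    return precision, sensitivity


# Hypothetical example: four reads, three assigned, two assigned correctly.
truth = {"r1": "Escherichia", "r2": "Bacillus",
         "r3": "Escherichia", "r4": "Bacillus"}
assignments = {"r1": "Escherichia", "r2": "Bacillus",
               "r3": "Bacillus", "r4": None}
precision, sensitivity = precision_sensitivity(assignments, truth)
print(round(precision, 4), sensitivity)
```

<p>Under these definitions, a tool that assigns every object has equal precision and sensitivity, which is consistent with the NBC rows of Table 1.</p>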
<p>Table
<xref rid="Tab1" ref-type="table">1</xref>
reports precision, sensitivity and processing speeds (in 10
<sup>3</sup>
reads per minute) obtained by
<sc>NBC</sc>
,
<sc>KRAKEN</sc>
and
<sc>CLARK</sc>
on the HiSeq, MiSeq, simBA-5 and simHC.20.500 datasets, for several values of the
<italic>k</italic>
-mer length. The table illustrates how the performance of these tools is affected by the choice of
<italic>k</italic>
. By increasing
<italic>k</italic>
one generally increases precision, but can lower sensitivity (also see Figure
<xref rid="Fig1" ref-type="fig">1</xref>
). To carry out a fair comparison between tools, we decided to first determine
<sc>NBC</sc>
’s and
<sc>KRAKEN</sc>
’s optimal
<italic>k</italic>
-mer length, and then run
<sc>CLARK</sc>
with a value of
<italic>k</italic>
that would match either sensitivity or precision.
<fig id="Fig1">
<label>Figure 1</label>
<caption>
<p>Classification performance of
<sc>CLARK</sc>
for several
<italic>k</italic>
-mer lengths and for various datasets.
<sc>CLARK</sc>
’s precision, sensitivity, assignment rate, average confidence scores and precision of high confidence assignments (HC) for several choices of the
<italic>k</italic>
-mer length on the “HiSeq” metagenomic dataset
<bold>(a)</bold>
, the “MiSeq” metagenomic dataset
<bold>(b)</bold>
, the “simBA-5” metagenomic dataset
<bold>(c)</bold>
, the “simHC.20.500” metagenomic dataset
<bold>(d)</bold>
, and barley unigenes
<bold>(e)</bold>
.
<bold>(a)</bold>
–
<bold>(d)</bold>
are the results of the classification against the 695 genus-level targets.</p>
</caption>
<graphic xlink:href="12864_2015_1419_Fig1_HTML" id="MO1"></graphic>
</fig>
</p>
<p>
<sc>NBC</sc>
was tested with
<italic>k</italic>
=11,13,15. We observed that
<italic>k</italic>
=15 produced the highest sensitivity on all datasets. The value
<italic>k</italic>
=15 is the highest possible value, which is recommended by the authors of [
<xref ref-type="bibr" rid="CR8">8</xref>
] for datasets composed of short reads. Since
<sc>NBC</sc>
produces detailed statistics on the assignments, we executed
<sc>CLARK</sc>
in “full” mode for a fair comparison. Using
<italic>k</italic>
=20 for
<sc>CLARK</sc>
(full mode) we obtained a similar sensitivity to
<sc>NBC</sc>
(
<sc>CLARK</sc>
is actually more sensitive than
<sc>NBC</sc>
on HiSeq and simHC.20.500). At the same level of sensitivity as
<sc>NBC</sc>
,
<sc>CLARK</sc>
achieves a higher precision and it is thousands of times faster.</p>
<p>In the case of
<sc>KRAKEN</sc>
,
<italic>k</italic>
=31 was the value used in [
<xref ref-type="bibr" rid="CR11">11</xref>
] for HiSeq, MiSeq and simBA-5 and it is supposed to achieve the highest precision. Nonetheless, we tried to run
<sc>KRAKEN</sc>
for other values of
<italic>k</italic>
. As expected, Table
<xref rid="Tab1" ref-type="table">1</xref>
shows that
<italic>k</italic>
=31 produces the best precision for all the datasets. For this comparison, we also ran
<sc>CLARK</sc>
with
<italic>k</italic>
=31. Observe that
<sc>CLARK</sc>
(default mode) is slightly less sensitive than
<sc>KRAKEN</sc>
but is more precise and faster. The difference in speed is significant for all datasets of short reads (300−800 thousand additional reads/min). On simHC.20.500,
<sc>KRAKEN</sc>
and
<sc>CLARK</sc>
achieve comparable speeds because this dataset contains longer reads. Finally,
<sc>CLARK</sc>
has better sensitivity than
<sc>KRAKEN</sc>
on simHC.20.500.</p>
<p>The same comparisons were carried out between the two variants of
<sc>KRAKEN</sc>
and
<sc>CLARK</sc>
optimized for speed, called
<sc>KRAKEN</sc>
-Q and
<sc>CLARK</sc>
-
<italic>E</italic>
(
<italic>E</italic>
for “Express”, see
<xref rid="Sec10" ref-type="sec">Methods</xref>
section). As indicated in Table
<xref rid="Tab1" ref-type="table">1</xref>
,
<sc>KRAKEN</sc>
-Q achieves the best precision for all the datasets when
<italic>k</italic>
=31, which is consistent with [
<xref ref-type="bibr" rid="CR11">11</xref>
]. However, when
<italic>k</italic>
=31
<sc>CLARK</sc>
-
<italic>E</italic>
runs four–five times faster than
<sc>KRAKEN</sc>
-Q and is also more precise. In addition, observe that as we decrease
<italic>k</italic>
, both variants get faster but
<sc>CLARK</sc>
-
<italic>E</italic>
maintains a precision above 90% while
<sc>KRAKEN</sc>
-Q produces progressively lower precisions.</p>
<p>In the last row of Table
<xref rid="Tab1" ref-type="table">1</xref>
, we report the performance of
<sc>CLARK</sc>
-
<italic>l</italic>
, another variant of
<sc>CLARK</sc>
designed for low RAM architectures that runs only for
<italic>k</italic>
=27 (see
<xref rid="Sec10" ref-type="sec">Methods</xref>
section).
<sc>CLARK</sc>
-
<italic>l</italic>
performs assignments with a lower precision than
<sc>CLARK</sc>
(the difference is at most 3.5% in these experiments) but can process more than 1.5 million reads per minute on HiSeq or simBA-5, and uses only about 4% of the memory used by
<sc>CLARK</sc>
(see Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S1).</p>
<p>All experimental results reported so far were obtained in single-threaded mode. If a multi-core architecture is available,
<sc>CLARK</sc>
and
<sc>KRAKEN</sc>
can take advantage of it. In Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S2, we summarize the classification speed of the two tools using 1, 2, 4 or 8 threads for
<italic>k</italic>
=31. Observe that using eight threads,
<sc>CLARK</sc>
achieves a speed-up of 5.2x compared to one thread, while
<sc>KRAKEN</sc>
only achieves a speed-up of 1.2x. When comparing
<sc>CLARK</sc>
-
<italic>E</italic>
to
<sc>KRAKEN-Q</sc>
, we can make similar observations. In general, note that
<sc>CLARK</sc>
-
<italic>E</italic>
is at least five times faster than
<sc>KRAKEN-Q</sc>
, independently of the number of threads used.</p>
<p>For the analysis at the species level, we repeated the classification of the objects in the four datasets described above against species-level targets. This time we used values of
<italic>k</italic>
that allowed best sensitivity for
<sc>NBC</sc>
(
<italic>k</italic>
=15) and best precision for
<sc>KRAKEN</sc>
(
<italic>k</italic>
=31). Observe in Table
<xref rid="Tab2" ref-type="table">2</xref>
that
<sc>NBC</sc>
achieves the best sensitivity on all datasets. However, when
<sc>CLARK</sc>
is run in full mode using
<italic>k</italic>
=20, it achieves a higher precision than
<sc>NBC</sc>
on HiSeq, MiSeq and simHC.20.500, and is several orders of magnitude faster. In addition,
<sc>CLARK</sc>
in default mode using
<italic>k</italic>
=31 achieves higher precision than
<sc>KRAKEN</sc>
on all datasets (as much as 10% higher on HiSeq and MiSeq) when
<italic>k</italic>
=31.
<sc>CLARK</sc>
also outperforms the speed of
<sc>KRAKEN</sc>
on HiSeq, MiSeq and simBA-5. On simHC.20.500, since the reads are much longer, the speeds of
<sc>KRAKEN</sc>
and
<sc>CLARK</sc>
are comparable. However,
<sc>CLARK</sc>
has higher sensitivity than
<sc>KRAKEN</sc>
on HiSeq, MiSeq and simHC.20.500. Finally, the fast variant
<sc>CLARK</sc>
-
<italic>E</italic>
, as previously observed for the experiments at the genus level, outperforms
<sc>KRAKEN</sc>
-
<italic>Q</italic>
in both speed and precision.</p>
</sec>
<sec id="Sec6">
<title>Human microbiome samples</title>
<p>In the second experiment, we used
<sc>CLARK</sc>
to classify Human Microbiome Project reads against the 695 genus-level targets described above. This time, however, the “ground truth” was not available.</p>
<p>Using
<italic>k</italic>
=31,
<sc>CLARK</sc>
was able to assign 42.1% of the reads in SRS015072 (mid-vagina), 30.8% of the reads in SRS019120 (saliva) and 49.8% of the reads in SRS023847 (nose).
<sc>KRAKEN</sc>
achieved similar rates of assigned reads using
<italic>k</italic>
=31. Reducing
<italic>k</italic>
would increase the number of assignments, at the cost of increasing the probability of misclassification. We investigated whether we could take advantage of
<sc>CLARK</sc>
’s confidence scores to compensate for a smaller value of
<italic>k</italic>
, and improve the fraction of assigned reads.</p>
<p>Figure
<xref rid="Fig1" ref-type="fig">1</xref>
a to Figure
<xref rid="Fig1" ref-type="fig">1</xref>
d show that
<sc>CLARK</sc>
’s sensitivity on the four datasets is the highest for
<italic>k</italic>
=20 or
<italic>k</italic>
=21. However, the precision for
<italic>k</italic>
=20 and
<italic>k</italic>
=21 is about 15% lower than for
<italic>k</italic>
=31, which implies that a large proportion of assignments may be incorrect. We have strong experimental evidence that the higher
<sc>CLARK</sc>
’s confidence score for an assignment, the more likely that assignment is correct (see Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Supplementary Note 2). In addition, we observe in Figure
<xref rid="Fig1" ref-type="fig">1</xref>
a to Figure
<xref rid="Fig1" ref-type="fig">1</xref>
d that the precision of high confidence assignments is higher than the average precision of all assignments, and is relatively constant for all
<italic>k</italic>
-mer lengths. The idea is to use
<italic>k</italic>
=20 to maximize the number of assigned reads, but only consider high confidence assignments to increase the precision. We call an assignment
<italic>high confidence</italic>
if the confidence score is higher than 0.75,
<italic>low confidence</italic>
otherwise.</p>
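<p>The high/low confidence rule can be sketched in Python as follows. The 0.75 threshold comes from the text above. The score formula is an assumption: this section only states that the score lies between 0.5 and 1.0, and here it is taken to be h1/(h1+h2), where h1 and h2 are the highest and second-highest hit counts of an object. The function names and hit counts are hypothetical.</p>

```python
HIGH_CONFIDENCE_THRESHOLD = 0.75  # threshold stated in the text


def confidence_score(hit_counts):
    """Assumed score h1 / (h1 + h2) from the two largest hit counts.

    Returns None when the object has no hit against any target.
    """
    ranked = sorted(hit_counts.values(), reverse=True)
    if not ranked or ranked[0] == 0:
        return None
    h1 = ranked[0]
    h2 = ranked[1] if len(ranked) > 1 else 0
    # h1 is at least h2, so the score always lies between 0.5 and 1.0.
    return h1 / (h1 + h2)


def confidence_label(hit_counts):
    """Label an assignment as high confidence, low confidence or unassigned."""
    score = confidence_score(hit_counts)
    if score is None:
        return "no assignment"
    # "Higher than 0.75" is read as a strict inequality.
    if score > HIGH_CONFIDENCE_THRESHOLD:
        return "high confidence"
    return "low confidence"


# Hypothetical hit counts for three reads.
print(confidence_label({"Streptococcus": 9, "Prevotella": 1}))  # score 0.9
print(confidence_label({"Streptococcus": 3, "Prevotella": 1}))  # score 0.75
print(confidence_label({}))
```

<p>With this rule, a read whose score is exactly 0.75 is treated as low confidence, because the threshold is applied as a strict inequality.</p>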
<p>Observe in Table
<xref rid="Tab3" ref-type="table">3</xref>
that the number of high confidence assignments for
<italic>k</italic>
=20 is significantly higher than for
<italic>k</italic>
=31. The relative increase in assignments is about 40% (from 42.1% to 62.3% in SRS015072, 30.8% to 55.1% in SRS019120, and 49.8% to 68.3% in SRS023847). Table
<xref rid="Tab3" ref-type="table">3</xref>
also reports the five most frequent genera among high confidence assignments. For the saliva sample, the dominance of
<italic>Streptococcus</italic>
,
<italic>Haemophilus</italic>
and
<italic>Prevotella</italic>
is consistent with findings in [
<xref ref-type="bibr" rid="CR2">2</xref>
] and [
<xref ref-type="bibr" rid="CR11">11</xref>
]. Study [
<xref ref-type="bibr" rid="CR21">21</xref>
], which focused on salivary microbiota of 35 inflammatory bowel disease patients, also reports
<italic>Streptococcus</italic>
,
<italic>Prevotella</italic>
,
<italic>Neisseria</italic>
,
<italic>Haemophilus</italic>
and
<italic>Veillonella</italic>
as dominant genera. Concerning the mid-vagina sample, we have found that
<italic>Lactobacillus</italic>
is the dominant genus, in agreement with findings reported in [
<xref ref-type="bibr" rid="CR2">2</xref>
,
<xref ref-type="bibr" rid="CR22">22</xref>
,
<xref ref-type="bibr" rid="CR23">23</xref>
]. The proportion of
<italic>Lactobacillus</italic>
we have identified (64.7%) is very close to the reported proportion (69%–71%) in [
<xref ref-type="bibr" rid="CR22">22</xref>
,
<xref ref-type="bibr" rid="CR23">23</xref>
]. The presence of
<italic>Pseudomonas</italic>
and
<italic>Gardnerella</italic>
is expected because some individuals who lack
<italic>Lactobacillus</italic>
have instead
<italic>Gardnerella</italic>
or
<italic>Pseudomonas</italic>
as the predominant bacteria [
<xref ref-type="bibr" rid="CR22">22</xref>
,
<xref ref-type="bibr" rid="CR23">23</xref>
]. In the nose sample, the high presence of
<italic>Propionibacterium</italic>
and
<italic>Staphylococcus</italic>
is consistent with the results in [
<xref ref-type="bibr" rid="CR2">2</xref>
].
<table-wrap id="Tab3">
<label>Table 3</label>
<caption>
<p>
<bold>Summary of the genus-level classification for three Human Microbiome Project datasets (</bold>
<bold>
<italic>k</italic>
</bold>
<bold>=20)</bold>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">
<bold>
<italic>SRS ID</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>High confidence assignments (%)</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Low confidence assignments (%)</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>No assignment (%)</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Average confidence score</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Most frequent genera (high confidence assignments)</italic>
</bold>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">015072</td>
<td align="left">62.3%</td>
<td align="left">25.9%</td>
<td align="left">11.8%</td>
<td align="left">0.868</td>
<td align="left">
<italic>Lactobacillus</italic>
(64.7%)</td>
</tr>
<tr>
<td align="left">(vagina)</td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left">
<italic>Pseudomonas</italic>
(7.3%)</td>
</tr>
<tr>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left">
<italic>Desulfosporosinus</italic>
(4.4%)</td>
</tr>
<tr>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left">
<italic>Clostridium</italic>
(1.7%)</td>
</tr>
<tr>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left">
<italic>Gardnerella</italic>
(1.2%)</td>
</tr>
<tr>
<td align="left">019120</td>
<td align="left">55.1%</td>
<td align="left">28.2%</td>
<td align="left">16.7%</td>
<td align="left">0.842</td>
<td align="left">
<italic>Streptococcus</italic>
(27.2%)</td>
</tr>
<tr>
<td align="left">(mouth)</td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left">
<italic>Haemophilus</italic>
(15.0%)</td>
</tr>
<tr>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left">
<italic>Prevotella</italic>
(11.4%)</td>
</tr>
<tr>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left">
<italic>Neisseria</italic>
(5.0%)</td>
</tr>
<tr>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left">
<italic>Veillonella</italic>
(2.9%)</td>
</tr>
<tr>
<td align="left">023847</td>
<td align="left">68.3%</td>
<td align="left">23.8%</td>
<td align="left">7.9%</td>
<td align="left">0.954</td>
<td align="left">
<italic>Propionibacterium</italic>
(61.5%)</td>
</tr>
<tr>
<td align="left">(nose)</td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left">
<italic>Staphylococcus</italic>
(8.5%)</td>
</tr>
<tr>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left">
<italic>Achromobacter</italic>
(7.5%)</td>
</tr>
<tr>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left">
<italic>Alteromonas</italic>
(6.3%)</td>
</tr>
<tr>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left">
<italic>Desulfosporosinus</italic>
(5.0%)</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Columns: (1) short read sample ID; (2) percentage of high confidence assignments; (3) percentage of low confidence assignments; (4) percentage of unassigned reads; (5) average confidence score for all assignments; (6) five most frequent genera in high confidence assignments (listed in decreasing order). An assignment is
<italic>high confidence</italic>
if the confidence score is higher than 0.75,
<italic>low confidence</italic>
otherwise.</p>
</table-wrap-foot>
</table-wrap>
</p>
</sec>
</sec>
<sec id="Sec7">
<title>Classification of barley BACs and unigenes to chromosome arms and centromeres</title>
<p>Inputs to this classification task were (1) barley chromosome arms (targets) and (2) barley BACs or unigenes (objects). Samples of each barley chromosome arm were obtained using flow-sorting [
<xref ref-type="bibr" rid="CR24">24</xref>
]. The procedure to obtain gene-rich barley BACs was described in [
<xref ref-type="bibr" rid="CR25">25</xref>
]. Sequences for chromosome arms and BACs were generated on an Illumina HiSeq 2000 instrument by J. Weger at UC Riverside.</p>
<p>For the targets, we processed thirteen datasets of shotgun-sequenced reads: one for barley chromosome 1H and twelve for barley chromosome arms (namely, 2HL, 2HS, 3HL, 3HS, 4HL, 4HS, 5HL, 5HS, 6HL, 6HS, 7HL, and 7HS). After quality-trimming the reads, we had a total of about 181 Gbp of sequence data. The barley chromosome arms assembled with
<sc>SOAPDENOVO</sc>
[
<xref ref-type="bibr" rid="CR26">26</xref>
] had a cumulative size of about 2 Gbp (about 40% of the barley genome).</p>
<p>The objects were 50,938 barley unigenes (transcript assembly from ESTs) obtained from [
<xref ref-type="bibr" rid="CR27">27</xref>
] for a total of about 222.4 Mbp. Additionally, we trimmed short reads for 15,721 BACs obtained from [
<xref ref-type="bibr" rid="CR25">25</xref>
], for a total of about 1.73 Gbp. We also had access to 15,697 BAC assemblies (not all BACs had a sufficient number of reads for an assembly) for a total of about 1.80 Gbp. While the genomic location for the majority of these “objects” was unknown, we had 1,652 unigenes for which a location was derived from the Golden Gate oligonucleotide pool assay (OPA) [
<xref ref-type="bibr" rid="CR28">28</xref>
], which allowed us to determine a presumed location of 2,252 BACs [
<xref ref-type="bibr" rid="CR25">25</xref>
]. We should point out that although we have used these locations as the “ground truth” to establish the accuracy of the classification, our observations indicate about 5% errors in these OPA assignments [
<xref ref-type="bibr" rid="CR25">25</xref>
].</p>
<p>As stated above, the most critical parameter in
<sc>CLARK</sc>
is the length of the
<italic>k</italic>
-mer used for classification. By assuming that the OPA-derived locations for this subset of unigenes are correct, we were able to estimate
<sc>CLARK</sc>
’s precision and sensitivity for various choices of
<italic>k</italic>
. Figure
<xref rid="Fig1" ref-type="fig">1</xref>
e shows these statistics, along with the assignment rate (fraction of unigenes assigned) and the average confidence score for all assignments. Observe that as
<italic>k</italic>
increases, the number of assignments decreases but the precision/sensitivity increases. Based on this analysis we determined that
<italic>k</italic>
=19 represents a good tradeoff for this dataset.</p>
<p>Table
<xref rid="Tab4" ref-type="table">4</xref>
summarizes
<sc>CLARK</sc>
’s assignment of barley unigenes (assemblies) to barley chromosome arms (assemblies) using
<italic>k</italic>
=19. When both targets and objects are assemblies, we call this an “A2A” assignment. Observe that most of the assignments have high confidence and they are relatively evenly distributed among barley chromosome arms (the seven barley chromosomes are believed to be relatively similar in length). Observe in Figure
<xref rid="Fig1" ref-type="fig">1</xref>
e that
<sc>CLARK</sc>
’s precision and sensitivity for this classification task are very high (both at 98.49%), while the average confidence score is above 0.96 and 99.44% of unigenes are assigned.
<table-wrap id="Tab4">
<label>Table 4</label>
<caption>
<p>
<bold>Summary of </bold>
<sc>CLARK</sc>
<bold>’s assignment of 50,646 unigenes (EST assemblies) to barley chromosome arms (assemblies) and centromeres (</bold>
<bold>
<italic>k</italic>
</bold>
<bold>=19)</bold>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">
<bold>
<italic>Targets</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>19-mers</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Discriminative 19-mers</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Assignments</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Low confidence</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>High confidence</italic>
</bold>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">1H</td>
<td align="left">180,176,713</td>
<td align="left">108,894,740</td>
<td align="left">8,197</td>
<td align="left">21.1%</td>
<td align="left">78.9%</td>
</tr>
<tr>
<td align="left">2HC</td>
<td align="left">-</td>
<td align="left">814,357</td>
<td align="left">15</td>
<td align="left">93.3%</td>
<td align="left">6.7%</td>
</tr>
<tr>
<td align="left">2HL</td>
<td align="left">103,679,920</td>
<td align="left">64,700,161</td>
<td align="left">4,776</td>
<td align="left">15.8%</td>
<td align="left">84.2%</td>
</tr>
<tr>
<td align="left">2HS</td>
<td align="left">90,912,314</td>
<td align="left">54,449,430</td>
<td align="left">3,334</td>
<td align="left">17.3%</td>
<td align="left">82.7%</td>
</tr>
<tr>
<td align="left">3HC</td>
<td align="left">-</td>
<td align="left">1,532,968</td>
<td align="left">29</td>
<td align="left">79.3%</td>
<td align="left">20.7%</td>
</tr>
<tr>
<td align="left">3HL</td>
<td align="left">123,140,951</td>
<td align="left">78,158,244</td>
<td align="left">4,726</td>
<td align="left">16.7%</td>
<td align="left">83.3%</td>
</tr>
<tr>
<td align="left">3HS</td>
<td align="left">111,951,787</td>
<td align="left">70,473,478</td>
<td align="left">3,159</td>
<td align="left">20.4%</td>
<td align="left">79.6%</td>
</tr>
<tr>
<td align="left">4HC</td>
<td align="left">-</td>
<td align="left">3,105,047</td>
<td align="left">54</td>
<td align="left">50.0%</td>
<td align="left">50.0%</td>
</tr>
<tr>
<td align="left">4HL</td>
<td align="left">106,999,773</td>
<td align="left">64,749,958</td>
<td align="left">3,531</td>
<td align="left">14.4%</td>
<td align="left">85.6%</td>
</tr>
<tr>
<td align="left">4HS</td>
<td align="left">89,027,872</td>
<td align="left">51,612,790</td>
<td align="left">2,468</td>
<td align="left">16.4%</td>
<td align="left">83.6%</td>
</tr>
<tr>
<td align="left">5HC</td>
<td align="left">-</td>
<td align="left">604,030</td>
<td align="left">9</td>
<td align="left">88.9%</td>
<td align="left">11.1%</td>
</tr>
<tr>
<td align="left">5HL</td>
<td align="left">117,915,094</td>
<td align="left">77,128,375</td>
<td align="left">6,111</td>
<td align="left">12.2%</td>
<td align="left">87.8%</td>
</tr>
<tr>
<td align="left">5HS</td>
<td align="left">58,067,400</td>
<td align="left">34,037,607</td>
<td align="left">1,619</td>
<td align="left">17.8%</td>
<td align="left">82.2%</td>
</tr>
<tr>
<td align="left">6HC</td>
<td align="left">-</td>
<td align="left">469,530</td>
<td align="left">9</td>
<td align="left">100.0%</td>
<td align="left">0.0%</td>
</tr>
<tr>
<td align="left">6HL</td>
<td align="left">74,485,223</td>
<td align="left">44,221,184</td>
<td align="left">2,973</td>
<td align="left">12.4%</td>
<td align="left">87.6%</td>
</tr>
<tr>
<td align="left">6HS</td>
<td align="left">111,834,123</td>
<td align="left">83,957,421</td>
<td align="left">2,721</td>
<td align="left">24.4%</td>
<td align="left">75.6%</td>
</tr>
<tr>
<td align="left">7HC</td>
<td align="left">-</td>
<td align="left">795,923</td>
<td align="left">9</td>
<td align="left">88.9%</td>
<td align="left">11.1%</td>
</tr>
<tr>
<td align="left">7HL</td>
<td align="left">92,603,503</td>
<td align="left">58,159,248</td>
<td align="left">3,556</td>
<td align="left">10.9%</td>
<td align="left">89.1%</td>
</tr>
<tr>
<td align="left">7HS</td>
<td align="left">90,217,777</td>
<td align="left">55,276,671</td>
<td align="left">3,350</td>
<td align="left">12.6%</td>
<td align="left">87.4%</td>
</tr>
<tr>
<td align="left">
<italic>Total</italic>
</td>
<td align="left">1,351,012,450</td>
<td align="left">853,141,162</td>
<td align="left">50,646</td>
<td align="left">16.5%</td>
<td align="left">83.5%</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Columns: (1) barley chromosome 1H, twelve chromosome arms, and six centromeres; (2) number of distinct
<italic>k</italic>
-mers in each target; (3) number of discriminative
<italic>k</italic>
-mers present in target sequences (must occur at least once); (4) number of assigned objects per target; (5) percentage of low confidence assignments (as a fraction of the total number of assigned objects per target); (6) percentage of high confidence assignments (as a fraction of the total number of assigned objects per target).</p>
</table-wrap-foot>
</table-wrap>
</p>
<p>Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S3 presents a summary of
<sc>CLARK</sc>
’s assignment of barley BACs (assemblies) to arms (assemblies), while Table
<xref rid="Tab5" ref-type="table">5</xref>
refers to the same assignments based on the reads instead of the assemblies (“R2R” assignment). The consistency between these results (same distribution of BAC assignments over chromosome arms, and similar proportions of high and low confidence assignments) demonstrates the robustness of our approach. The agreement with OPA-based locations is 92.9% for R2R assignments, and 93.2% for A2A assignments. Observe that the agreement for BAC/arm assignments is lower than for unigene/arm assignments (98.49%).
<table-wrap id="Tab5">
<label>Table 5</label>
<caption>
<p>
<bold>Summary of </bold>
<sc>CLARK</sc>
<bold>’s assignment of 15,665 BACs (represented as reads) to barley chromosome arms (reads) and centromeres (</bold>
<bold>
<italic>k</italic>
</bold>
<bold>=19)</bold>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">
<bold>
<italic>Targets</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>19-mers</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Discriminative 19-mers</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Assignments</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>Low confidence</italic>
</bold>
</th>
<th align="left">
<bold>
<italic>High confidence</italic>
</bold>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">1H</td>
<td align="left">448,768,897</td>
<td align="left">126,997,864</td>
<td align="left">2,068</td>
<td align="left">4.2%</td>
<td align="left">95.8%</td>
</tr>
<tr>
<td align="left">2HC</td>
<td align="left">-</td>
<td align="left">1,738,722</td>
<td align="left">0</td>
<td align="left">-</td>
<td align="left">-</td>
</tr>
<tr>
<td align="left">2HL</td>
<td align="left">451,729,142</td>
<td align="left">102,959,160</td>
<td align="left">1,417</td>
<td align="left">2.1%</td>
<td align="left">97.9%</td>
</tr>
<tr>
<td align="left">2HS</td>
<td align="left">401,605,473</td>
<td align="left">79,225,936</td>
<td align="left">1,071</td>
<td align="left">2.4%</td>
<td align="left">97.6%</td>
</tr>
<tr>
<td align="left">3HC</td>
<td align="left">-</td>
<td align="left">4,631,639</td>
<td align="left">0</td>
<td align="left">-</td>
<td align="left">-</td>
</tr>
<tr>
<td align="left">3HL</td>
<td align="left">553,420,081</td>
<td align="left">138,939,217</td>
<td align="left">1,423</td>
<td align="left">2.2%</td>
<td align="left">97.8%</td>
</tr>
<tr>
<td align="left">3HS</td>
<td align="left">538,777,930</td>
<td align="left">113,354,224</td>
<td align="left">892</td>
<td align="left">3.5%</td>
<td align="left">96.5%</td>
</tr>
<tr>
<td align="left">4HC</td>
<td align="left">-</td>
<td align="left">6,428,726</td>
<td align="left">70</td>
<td align="left">14.3%</td>
<td align="left">85.7%</td>
</tr>
<tr>
<td align="left">4HL</td>
<td align="left">494,923,209</td>
<td align="left">106,930,230</td>
<td align="left">1,127</td>
<td align="left">2.3%</td>
<td align="left">97.7%</td>
</tr>
<tr>
<td align="left">4HS</td>
<td align="left">462,144,322</td>
<td align="left">85,650,765</td>
<td align="left">888</td>
<td align="left">3.4%</td>
<td align="left">96.6%</td>
</tr>
<tr>
<td align="left">5HC</td>
<td align="left">-</td>
<td align="left">1,643,194</td>
<td align="left">0</td>
<td align="left">-</td>
<td align="left">-</td>
</tr>
<tr>
<td align="left">5HL</td>
<td align="left">558,710,983</td>
<td align="left">121,491,586</td>
<td align="left">1,657</td>
<td align="left">2.3%</td>
<td align="left">97.7%</td>
</tr>
<tr>
<td align="left">5HS</td>
<td align="left">281,062,766</td>
<td align="left">57,181,745</td>
<td align="left">658</td>
<td align="left">2.4%</td>
<td align="left">97.6%</td>
</tr>
<tr>
<td align="left">6HC</td>
<td align="left">-</td>
<td align="left">1,287,133</td>
<td align="left">0</td>
<td align="left">-</td>
<td align="left">-</td>
</tr>
<tr>
<td align="left">6HL</td>
<td align="left">311,443,157</td>
<td align="left">70,856,097</td>
<td align="left">1,136</td>
<td align="left">2.0%</td>
<td align="left">98.0%</td>
</tr>
<tr>
<td align="left">6HS</td>
<td align="left">877,169,677</td>
<td align="left">255,819,549</td>
<td align="left">850</td>
<td align="left">2.9%</td>
<td align="left">97.1%</td>
</tr>
<tr>
<td align="left">7HC</td>
<td align="left">-</td>
<td align="left">1,697,991</td>
<td align="left">0</td>
<td align="left">-</td>
<td align="left">-</td>
</tr>
<tr>
<td align="left">7HL</td>
<td align="left">366,612,780</td>
<td align="left">82,987,499</td>
<td align="left">1,175</td>
<td align="left">2.0%</td>
<td align="left">98.0%</td>
</tr>
<tr>
<td align="left">7HS</td>
<td align="left">365,475,556</td>
<td align="left">83,848,867</td>
<td align="left">1,233</td>
<td align="left">2.8%</td>
<td align="left">97.2%</td>
</tr>
<tr>
<td align="left">
<italic>Total</italic>
</td>
<td align="left">6,111,843,973</td>
<td align="left">1,443,670,144</td>
<td align="left">15,665</td>
<td align="left">2.7%</td>
<td align="left">97.3%</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Columns: (1) barley chromosome 1H, twelve chromosome arms, and six centromeres; (2) number of distinct
<italic>k</italic>
-mers in each target; (3) number of discriminative
<italic>k</italic>
-mers present in target sequences (must occur at least twice); (4) number of assigned objects per target; (5) percentage of low confidence assignments (as a fraction of the total number of assigned objects per target); (6) percentage of high confidence assignments (as a fraction of the total number of assigned objects per target).</p>
</table-wrap-foot>
</table-wrap>
</p>
<p>Finally, we compared
<sc>CLARK</sc>
against (1) the
<sc>BLAST</sc>
-based method used in [
<xref ref-type="bibr" rid="CR25">25</xref>
] for BAC-arm assignment (A2A); and (2) the assignments provided in [
<xref ref-type="bibr" rid="CR16">16</xref>
,
<xref ref-type="bibr" rid="CR29">29</xref>
]. For (1),
<sc>CLARK</sc>
assigned 13,706 BACs (of which 2,252 have a prior OPA-based location) while the
<sc>BLAST</sc>
-based method assigned 13,583 BACs (of which 2,238 have a prior OPA-based location).
<sc>CLARK</sc>
’s precision and sensitivity were both 93.2%, while the precision and sensitivity of the
<sc>BLAST</sc>
-based method were 92.4% and 91.9%, respectively. The
<sc>BLAST</sc>
-based method and
<sc>CLARK</sc>
disagreed on 19 assignments; among these 19 disagreements,
<sc>CLARK</sc>
agreed with the GoldenGate assays in seven cases, and the
<sc>BLAST</sc>
-based method agreed in four cases. For (2), we examined the assignment of the 1,037 BACs that were sequenced both by our group and by the Leibniz-Institut für Pflanzengenetik und Kulturpflanzenforschung, Gatersleben, Germany (IPK) [
<xref ref-type="bibr" rid="CR16">16</xref>
] and we identified only 42 disagreements (4% of the total); among these disagreements, 19 had an independent assignment via POP-seq [
<xref ref-type="bibr" rid="CR29">29</xref>
]. In 15 cases out of 19, our assignment agreed with the POP-seq assignment. For the 23 disagreements for which there was no POP-seq assignment, we compared the assembled BACs and we discovered 6 cases in which the sequences were less than 30% similar, suggesting a naming error. In summary, there were only a handful of cases where the disagreement could not be readily explained.</p>
</sec>
<sec id="Sec8">
<title>Performance dependency on the
<italic>k</italic>
-mer length</title>
<p>To determine the optimal value of
<italic>k</italic>
for a particular dataset one can take advantage of prior knowledge, as we did in the case of unigene/BAC assignment to chromosomes. In that case, we had 1,657 unigenes for which the correct BAC assignment (approximately 95% accuracy) was experimentally determined via Illumina GoldenGate assay (BOPA1 and BOPA2). Given these known assignments, we estimated precision and sensitivity, as well as the average confidence score for all assignments and the assignment rate (see Figure
<xref rid="Fig1" ref-type="fig">1</xref>
e). Observe that
<italic>k</italic>
=19 offers the best overall balance of the four measurements. Higher precision and average confidence score can be achieved by using a higher
<italic>k</italic>
but at the cost of decreasing sensitivity and assignment rate.</p>
<p>Similar evaluations were carried out on the metagenomic datasets. Figure
<xref rid="Fig1" ref-type="fig">1</xref>
a to Figure
<xref rid="Fig1" ref-type="fig">1</xref>
d show precision, sensitivity, as well as assignment rate and average confidence score as a function of
<italic>k</italic>
. In both cases we observe that as we increase
<italic>k</italic>
, precision and the average confidence score are increasing, while the sensitivity is decreasing. We observe that the maximum sensitivity is achieved for
<italic>k</italic>
in the range 19–22 for all metagenomic datasets, independently of read length or complexity.</p>
<p>As a consequence, for high sensitivity (or high number of assignments) one must choose
<italic>k</italic>
between 19 and 22. For high precision (or high confidence score) one must choose
<italic>k</italic>
higher than 26. The current implementation supports
<italic>k</italic>
up to 32.</p>
</sec>
</sec>
<sec id="Sec9" sec-type="conclusion">
<title>Conclusions</title>
<p>We have presented
<sc>CLARK</sc>
, a new method for metagenomic sequence classification and chromosome/arm assignments of DNA sequences.</p>
<p>Experimental results demonstrate that
<sc>CLARK</sc>
has several advantages over alternative methods. (i)
<sc>CLARK</sc>
is able to classify short metagenomic reads with high accuracy at multiple taxonomic ranks (i.e., species and genus level) and its assignments on real metagenomic samples are consistent with findings published in the literature. (ii)
<sc>CLARK</sc>
can achieve the same or better accuracy than the state-of-the-art metagenomic classifiers. (iii) The classification speed of
<sc>CLARK</sc>
, in the context of metagenomics, is unmatched.
<sc>CLARK</sc>
can classify 32 million metagenomic short reads per minute, which is five times faster than
<sc>KRAKEN</sc>
. In addition,
<sc>CLARK</sc>
“scales” better on multi-core architectures: the speed-up one can obtain by adding more threads is higher than with
<sc>KRAKEN</sc>
. (iv)
<sc>CLARK</sc>
is able to output confidence scores, is user-friendly and self-contained (unlike most other classifiers, it does not require external tools such as
<sc>BLAST</sc>
or
<sc>MEGABLAST</sc>
). (v)
<sc>CLARK</sc>
can be executed with relatively small amounts of RAM (unlike
<sc>LMAT</sc>
) or disk space (unlike
<sc>PHYMMBL</sc>
or
<sc>KRAKEN</sc>
). Indeed,
<sc>LMAT</sc>
can use about 500 GB of RAM, while the maximum amount of RAM needed by
<sc>CLARK</sc>
is less than 165 GB (see Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S1).
<sc>PHYMMBL</sc>
and
<sc>KRAKEN</sc>
require about 120 GB and 140 GB of disk space, respectively, to run, while
<sc>CLARK</sc>
requires 40–42 GB for classification. (vi) In the context of genomics,
<sc>CLARK</sc>
can classify BACs and transcripts with better accuracy than the previously used BLAST-based method [
<xref ref-type="bibr" rid="CR25">25</xref>
], and can infer centromeric regions.</p>
<p>Even though in this manuscript we focus on genus- and species-level classification,
<sc>CLARK</sc>
is expected to work also at higher taxonomic levels such as phylum, family or class. As it is now, however,
<sc>CLARK</sc>
cannot take advantage of taxonomic tree structures. We believe that
<sc>CLARK</sc>
will be useful in a variety of applications in metagenomics and genomics. For instance, we have used
<sc>CLARK</sc>
to identify chimerism and vector contamination in sequenced BACs.</p>
</sec>
<sec id="Sec10" sec-type="methods">
<title>Methods</title>
<sec id="Sec11">
<title>Building target-specific
<italic>k</italic>
-mer sets</title>
<p>
<sc>CLARK</sc>
accepts inputs in FASTA/FASTQ format; alternatively, the input can be given as a text file containing the
<italic>k</italic>
-mer distribution (i.e., each line contains a
<italic>k</italic>
-mer and its number of occurrences).
<sc>CLARK</sc>
first builds an index from the target sequences, unless one already exists for the specified input files. To classify objects at the genus level (or another taxonomic rank), the user is expected to generate targets by grouping genomes of the same genus (or with the same taxonomic label). This strategy represents a major difference from other tools (such as
<sc>LMAT</sc>
, or
<sc>KRAKEN</sc>
). The index is a hash-table storing, for each distinct
<italic>k</italic>
-mer
<italic>w</italic>
(1) the ID for the target containing
<italic>w</italic>
, (2) the number of distinct targets containing
<italic>w</italic>
, and (3) the number of occurrences of
<italic>w</italic>
in all the targets. This hash-table uses separate chaining to resolve collisions (at each bucket).
<sc>CLARK</sc>
then removes any
<italic>k</italic>
-mer that appears in more than one target, except in the case of chromosome arm assignment. In the latter case,
<italic>k</italic>
-mers shared by the two arms of the same chromosome are used to define centromeric regions of overlap. Also,
<italic>k</italic>
-mers in the index may be removed based on their number of occurrences if the user has specified a minimum number of occurrences. These rare
<italic>k</italic>
-mers tend to be spurious artifacts of sequencing errors. Other metagenomic classifiers like
<sc>KRAKEN</sc>
and
<sc>LMAT</sc>
do not offer this protection against noise, which is very useful when target sequences are reads (or low-quality assemblies). Then, the resulting sets of target-specific
<italic>k</italic>
-mers are stored on disk for the next phase. The time and memory needed to create the index (for
<italic>k</italic>
=31) are given in Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S1. This table also contains the time and memory required by
<sc>NBC</sc>
and
<sc>KRAKEN</sc>
. Observe that
<sc>CLARK</sc>
is faster than
<sc>NBC</sc>
and
<sc>KRAKEN</sc>
at creating the index, and it uses less RAM and disk space than
<sc>KRAKEN</sc>
for classifying objects.</p>
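<p>The index construction described above can be illustrated with a small Python sketch. It is a simplified illustration under stated assumptions, not <sc>CLARK</sc>’s implementation: a plain Python dictionary stands in for the hash table with separate chaining, only the forward strand is indexed, the centromere exception for chromosome arms is not modeled, and the names <italic>build_target_specific_kmers</italic> and <italic>min_occurrences</italic> are hypothetical.</p>
<preformat>
```python
from collections import defaultdict


def kmers(sequence: str, k: int):
    """Yield every k-mer of a sequence (forward strand only, for simplicity)."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def build_target_specific_kmers(targets: dict, k: int, min_occurrences: int = 1) -> dict:
    """Map each discriminative k-mer to the single target that contains it.

    targets: hypothetical mapping target_id -> list of sequences.
    """
    # For each k-mer, count its occurrences per target.
    counts = defaultdict(lambda: defaultdict(int))
    for target_id, sequences in targets.items():
        for sequence in sequences:
            for kmer in kmers(sequence, k):
                counts[kmer][target_id] += 1

    index = {}
    for kmer, per_target in counts.items():
        if len(per_target) != 1:
            continue  # shared by several targets: not discriminative
        (target_id, occurrences), = per_target.items()
        # Optional noise filter: drop k-mers seen fewer than min_occurrences times.
        if occurrences >= min_occurrences:
            index[kmer] = target_id
    return index
```
</preformat>
<p>For example, with targets A = ACGTAC and B = CGTACC and <italic>k</italic>=3, the 3-mers CGT, GTA and TAC are shared and discarded, leaving ACG for A and ACC for B.</p>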
<p>The concept of “target-specific
<italic>k</italic>
-mers” is similar to the notion of “clade-specific marker genes” proposed in [
<xref ref-type="bibr" rid="CR7">7</xref>
] or “genome-specific markers” recently proposed in [
<xref ref-type="bibr" rid="CR30">30</xref>
]. While
<sc>CLARK</sc>
uses exact matching to identify the target-specific
<italic>k</italic>
-mers derived from any region in the genome, the authors in [
<xref ref-type="bibr" rid="CR7">7</xref>
] disregard intergenic regions. The authors of [
<xref ref-type="bibr" rid="CR30">30</xref>
] focus on strain-specific markers identified by approximate string matching, while
<sc>CLARK</sc>
uses exact matching. Another important difference is that the method presented in [
<xref ref-type="bibr" rid="CR30">30</xref>
] relies on
<sc>MEGABLAST</sc>
[
<xref ref-type="bibr" rid="CR31">31</xref>
] to perform the classification, which is several orders of magnitude slower than
<sc>KRAKEN</sc>
[
<xref ref-type="bibr" rid="CR11">11</xref>
].</p>
<p>For users who want to run
<sc>CLARK</sc>
on workstations with limited amounts of RAM, we have designed
<sc>CLARK</sc>
-
<italic>l</italic>
(“light”).
<sc>CLARK</sc>
-
<italic>l</italic>
is a variant of
<sc>CLARK</sc>
that has a much smaller RAM footprint but can classify objects with similar speed and accuracy. The reduction in RAM can be achieved by constructing a hash-table of smaller size and by constructing smaller sets of discriminative
<italic>k</italic>
-mers. Instead of considering all
<italic>k</italic>
-mers in a target,
<sc>CLARK</sc>
-
<italic>l</italic>
samples a fraction of them.
<sc>CLARK</sc>
-
<italic>l</italic>
uses 27-mers (27-mers appeared to be a good tradeoff between speed, low memory usage and precision) and skips four consecutive/non-overlapping 27-mers. As a result,
<sc>CLARK</sc>
-
<italic>l</italic>
’s peak RAM usage is about 3.8 GB during the index creation, and 2.8 GB when computing the classification (see Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S1).
<sc>CLARK</sc>
-
<italic>l</italic>
also has the advantage of being very fast at building the hash table. Table
<xref rid="Tab1" ref-type="table">1</xref>
includes the performance of
<sc>CLARK</sc>
-
<italic>l</italic>
. While the precision and sensitivity are lower compared to
<sc>CLARK</sc>
,
<sc>CLARK</sc>
-
<italic>l</italic>
still achieves high precision and high speed.</p>
</sec>
<sec id="Sec12">
<title>Sequence classification</title>
<p>In the full mode, once the index containing target-specific
<italic>k</italic>
-mers has been created,
<sc>CLARK</sc>
creates a “dictionary” that associates
<italic>k</italic>
-mers to targets. Then,
<sc>CLARK</sc>
iteratively processes each object: for each object sequence
<italic>o</italic>
,
<sc>CLARK</sc>
queries the index to fetch the set of
<italic>k</italic>
-mers in
<italic>o</italic>
. A “hit” is obtained when a
<italic>k</italic>
-mer (either forward or reverse complement) matches a target-specific
<italic>k</italic>
-mer set. Object
<italic>o</italic>
is assigned to the target that has the highest number of hits (see algorithmic details in Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Supplementary Note 1 and Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S5). The confidence score is computed as
<italic>h</italic>
<sub>1</sub>
/(
<italic>h</italic>
<sub>1</sub>
+
<italic>h</italic>
<sub>2</sub>
), where
<italic>h</italic>
<sub>1</sub>
is the number of hits for the top-ranked target, and
<italic>h</italic>
<sub>2</sub>
is the number of hits for the second-highest target.</p>
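<p>The hit counting and the confidence score can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the dictionary is modeled as a plain <monospace>dict</monospace> mapping each target-specific <italic>k</italic>-mer to a target name, and the names <monospace>classify</monospace> and <monospace>reverse_complement</monospace> are hypothetical.</p>
<preformat><![CDATA[
```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of a DNA k-mer (A, C, G, T only)."""
    return kmer.translate(_COMPLEMENT)[::-1]


def classify(obj, dictionary, k):
    """Return (target, confidence, hits) for one object sequence.

    `dictionary` maps target-specific k-mers to target names (assumed layout).
    """
    hits = Counter()
    for i in range(len(obj) - k + 1):
        kmer = obj[i:i + k]
        # A hit is a match of the k-mer or of its reverse complement.
        target = dictionary.get(kmer)
        if target is None:
            target = dictionary.get(reverse_complement(kmer))
        if target is not None:
            hits[target] += 1
    if not hits:
        return None, 0.0, hits
    ranked = hits.most_common(2)
    best, h1 = ranked[0]
    h2 = ranked[1][1] if len(ranked) > 1 else 0
    return best, h1 / (h1 + h2), hits
```
]]></preformat>
<p>For example, with the toy dictionary <monospace>{"ACG": "A", "TTT": "B"}</monospace> and <italic>k</italic> = 3, the object <monospace>ACGTTT</monospace> yields two hits for target A (one through a reverse complement) and one for target B, so it is assigned to A with a confidence score of 2/3.</p>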
<p>The rationale for removing
<italic>k</italic>
-mers that are common to several targets (at any taxonomy level defined by the user) is that they increase the “noise” in the classification process. If they were present, more targets could obtain the same number of hits, which would complicate the assignment. If such conflicts can be avoided, then there is no need to query the taxonomy tree and find, for example, the lowest common ancestor taxa of “conflicting nodes” to resolve them, as is done in other tools (e.g.,
<sc>KRAKEN</sc>
or
<sc>LMAT</sc>
). Observe in Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S1 that most of
<sc>CLARK</sc>
’s assignments have high confidence scores. Observe that at least 95% of all assignments in HiSeq, MiSeq, simBA-5 and simHC.20.500 made by
<sc>CLARK</sc>
in the full mode have confidence scores equal to 1 (i.e., exactly one target gets hits), and the average confidence score over all these assignments is 0.997. This implies that, on average, the number of hits for the top target (which receives the assignment) is about 336 times higher than that for the second. Thus,
<sc>CLARK</sc>
, unlike
<sc>LMAT</sc>
or
<sc>KRAKEN</sc>
, does not need the taxonomy tree to classify objects; instead, one “flat” level is clearly sufficient.</p>
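<p>The link between a confidence score and the corresponding hit ratio follows from inverting the score definition. The worked check below is an illustration, not a derivation taken from the Supplementary Material; it writes the confidence score as <italic>c</italic> and assumes a non-zero <italic>h</italic><sub>2</sub>.</p>
<preformat><![CDATA[
```latex
\[
c = \frac{h_1}{h_1 + h_2}
\quad\Longrightarrow\quad
\frac{h_1}{h_2} = \frac{c}{1 - c},
\qquad
\left.\frac{c}{1 - c}\right|_{c = 0.997} = \frac{0.997}{0.003} \approx 332,
\qquad
\left.\frac{c}{1 - c}\right|_{c = 336/337} = 336 .
\]
```
]]></preformat>
<p>The rounded score 0.997 thus gives a ratio of roughly 332, while a ratio of 336 corresponds to a score of 336/337, or about 0.99703, which also rounds to 0.997. The quoted factor is therefore presumably computed from the unrounded average. Applying the formula to an average score is itself an approximation, since assignments with a score of exactly 1 have no second target at all.</p>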
<p>If users are not interested in collecting confidence scores and all hit counts, then it is recommended to use the default mode of
<sc>CLARK</sc>
. In this mode,
<sc>CLARK</sc>
stops querying
<italic>k</italic>
-mers for an object as soon as at least one target collects at least half of the total possible hits. This mode also loads into main memory only about half of the target-specific
<italic>k</italic>
-mers. This is done by alternately loading and skipping target-specific
<italic>k</italic>
-mers based on their index positions.
<sc>CLARK</sc>
runs significantly faster in default mode (2–5 times faster in our experiments) with negligible degradation of sensitivity and assignment rate. RAM usage is also significantly lower than in the full mode (up to 50% lower in our experiments). For cases where speed is the primary concern, we have designed an “express” variant of
<sc>CLARK</sc>
called
<sc>CLARK</sc>
-
<italic>E</italic>
.
<sc>CLARK</sc>
-
<italic>E</italic>
is based upon Theorem 1 (see Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Supplementary Note 1), which states that if an object originates from one of the targets, then either one target or none will be hit by the
<italic>k</italic>
-mers in the object. Since we use target-specific
<italic>k</italic>
-mer sets, at most one target can be associated to the
<italic>k</italic>
-mers of an object. In addition, we reduce the number of queries to the database by considering a sample of the
<italic>k</italic>
-mers in the object. Thus,
<sc>CLARK</sc>
-
<italic>E</italic>
only queries non-overlapping
<italic>k</italic>
-mers, and the object is assigned to the first target that obtains a hit. This optimization allows
<sc>CLARK</sc>
-
<italic>E</italic>
to be extremely fast compared to
<sc>CLARK</sc>
/
<sc>KRAKEN</sc>
(see Table
<xref rid="Tab1" ref-type="table">1</xref>
), while maintaining high precision and sensitivity.</p>
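<p>The two shortcuts described above can be sketched as follows. This is a simplified illustration, not the authors' code: the dictionary is a plain <monospace>dict</monospace> from target-specific <italic>k</italic>-mers to target names, only the forward strand is queried for brevity, the “total possible hits” is assumed to be the number of <italic>k</italic>-mers in the object, and the function names are hypothetical.</p>
<preformat><![CDATA[
```python
from collections import Counter


def classify_default(obj, dictionary, k):
    """Early-stopping sketch: stop once a target holds half of the possible hits."""
    total = len(obj) - k + 1  # assumed meaning of "total possible hits"
    hits = Counter()
    for i in range(max(total, 0)):
        target = dictionary.get(obj[i:i + k])
        if target is None:
            continue
        hits[target] += 1
        if 2 * hits[target] >= total:
            return target  # no other target can overtake this one
    return hits.most_common(1)[0][0] if hits else None


def classify_express(obj, dictionary, k):
    """Express sketch: query non-overlapping k-mers, return the first target hit."""
    for i in range(0, len(obj) - k + 1, k):
        target = dictionary.get(obj[i:i + k])
        if target is not None:
            return target
    return None
```
]]></preformat>
<p>The express variant relies on the property stated in the text: with target-specific <italic>k</italic>-mer sets, an object that originates from one of the targets can hit at most one target, so the first hit already identifies it. The sketch issues at most one query per <italic>k</italic> bases, whereas the default-mode sketch may issue one per base.</p>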
</sec>
<sec id="Sec13">
<title>Running time analysis</title>
<p>All experiments presented in this study were run on a Dell PowerEdge T710 server (dual Intel Xeon X5660 2.8 GHz, 12 cores, 192 GB of RAM).
<sc>CLARK</sc>
-
<italic>l</italic>
was also run on a Mac running OS X version 10.9.5 (2.53 GHz Intel Core 2 Duo, 4 GB of RAM). When comparing
<sc>KRAKEN</sc>
to
<sc>CLARK</sc>
in their default mode, and
<sc>KRAKEN</sc>
-Q to
<sc>CLARK</sc>
-
<italic>E</italic>
, we always set
<sc>KRAKEN</sc>
to “preload” its database in main memory and print results to a file (instead of the standard output) to achieve the highest speed. For consistency,
<sc>CLARK</sc>
was also run under the same conditions. For the results in Table
<xref rid="Tab1" ref-type="table">1</xref>
and Table
<xref rid="Tab2" ref-type="table">2</xref>
,
<sc>CLARK</sc>
(v1.0),
<sc>NBC</sc>
(v1.1), and
<sc>KRAKEN</sc>
(v0.10.4-beta) were run in single-threaded mode, three times on the same inputs, in order to smooth out fluctuations due to I/O and cache effects (the reported numbers are the best values). We also ran the latest version of Kraken (v0.10.5-beta) and did not observe a significant change in accuracy or RAM usage. However, we observed a 15% decrease in classification speed compared to v0.10.4-beta. The software tool
<sc>CLARK</sc>
is available for download at
<ext-link ext-link-type="uri" xlink:href="http://clark.cs.ucr.edu/">http://clark.cs.ucr.edu/</ext-link>
.</p>
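<p>The “best of three runs” protocol can be reproduced with a small timing helper. The sketch below is only an illustration of that protocol: the name <monospace>best_of</monospace> is hypothetical, and it assumes that the “best value” of a running time is the shortest wall-clock time observed.</p>
<preformat><![CDATA[
```python
import time


def best_of(run, repeats=3):
    """Call `run()` `repeats` times; return the shortest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        timings.append(time.perf_counter() - start)
    return min(timings)
```
]]></preformat>
<p>Taking the minimum rather than the mean discards runs that were slowed down by cold caches or competing I/O, which is the kind of fluctuation the repeated runs are meant to smooth out.</p>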
</sec>
</sec>
<sec id="Sec14">
<title>Ethics statement</title>
<p>All human data used in this study are from the Human Microbiome Project [
<xref ref-type="bibr" rid="CR2">2</xref>
,
<xref ref-type="bibr" rid="CR3">3</xref>
], which is a free and publicly available database.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Additional file</title>
<sec id="Sec15">
<supplementary-material content-type="local-data" id="MOESM1">
<media xlink:href="12864_2015_1419_MOESM1_ESM.pdf">
<label>Additional file 1</label>
<caption>
<p>
<bold>Supplementary Material.</bold>
Details of the mathematical modeling, the impact of the
<italic>k</italic>
-mer length on results, the analysis of the confidence scores, and the software implementation.</p>
</caption>
</media>
</supplementary-material>
</sec>
</sec>
</body>
<back>
<fn-group>
<fn>
<p>
<bold>Competing interests</bold>
</p>
<p>The authors declare that they have no competing financial interests.</p>
</fn>
<fn>
<p>
<bold>Authors’ contributions</bold>
</p>
<p>RO designed, implemented, tested, and optimized
<sc>Clark</sc>
; RO also collected experimental results and wrote the draft of the manuscript; SW helped to carry out the comparison between
<sc>Clark</sc>
and the
<sc>Blast</sc>
-based method that he wrote for [
<xref ref-type="bibr" rid="CR25">25</xref>
]; SL and TJC proposed and supervised the project; RO, SL and TJC edited the manuscript. All authors read and approved the final manuscript.</p>
</fn>
</fn-group>
<ack>
<title>Acknowledgements</title>
<p>This project was funded in part by USDA, “Advancing the Barley Genome” (2009-65300-05645) and by NSF, “ABI Innovation: Barcoding-Free Multiplexing: Leveraging Combinatorial Pooling for High-Throughput Sequencing” (DBI-1062301), and “III: Algorithms and Software Tools for Epigenetics Research” (IIS-1302134). We are thankful to the authors of
<sc>NBC</sc>
and
<sc>Kraken</sc>
for their useful advice on running their tools. We thank Dr. Gail Rosen and the anonymous reviewers for constructive comments on the manuscript.</p>
</ack>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<label>1</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Venter</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Remington</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Heidelberg</surname>
<given-names>JF</given-names>
</name>
<name>
<surname>Halpern</surname>
<given-names>AL</given-names>
</name>
<name>
<surname>Rusch</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Eisen</surname>
<given-names>JA</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Environmental genome shotgun sequencing of the Sargasso Sea</article-title>
<source>Science</source>
<year>2004</year>
<volume>304</volume>
<issue>5667</issue>
<fpage>66</fpage>
<lpage>74</lpage>
<pub-id pub-id-type="doi">10.1126/science.1093857</pub-id>
<pub-id pub-id-type="pmid">15001713</pub-id>
</element-citation>
</ref>
<ref id="CR2">
<label>2</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huttenhower</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Gevers</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Knight</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Abubucker</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Badger</surname>
<given-names>JH</given-names>
</name>
<name>
<surname>Chinwalla</surname>
<given-names>AT</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Structure, function and diversity of the healthy human microbiome</article-title>
<source>Nature</source>
<year>2012</year>
<volume>486</volume>
<issue>7402</issue>
<fpage>207</fpage>
<lpage>14</lpage>
<pub-id pub-id-type="doi">10.1038/nature11234</pub-id>
<pub-id pub-id-type="pmid">22699609</pub-id>
</element-citation>
</ref>
<ref id="CR3">
<label>3</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<collab>The Human Microbiome Project Consortium</collab>
</person-group>
<article-title>A framework for human microbiome research</article-title>
<source>Nature</source>
<year>2012</year>
<volume>486</volume>
<issue>7402</issue>
<fpage>215</fpage>
<lpage>21</lpage>
<pub-id pub-id-type="doi">10.1038/nature11209</pub-id>
<pub-id pub-id-type="pmid">22699610</pub-id>
</element-citation>
</ref>
<ref id="CR4">
<label>4</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huson</surname>
<given-names>DH</given-names>
</name>
<name>
<surname>Auch</surname>
<given-names>AF</given-names>
</name>
<name>
<surname>Qi</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Schuster</surname>
<given-names>SC</given-names>
</name>
</person-group>
<article-title>MEGAN analysis of metagenomic data</article-title>
<source>Genome Res.</source>
<year>2007</year>
<volume>17</volume>
<issue>3</issue>
<fpage>377</fpage>
<lpage>86</lpage>
<pub-id pub-id-type="doi">10.1101/gr.5969107</pub-id>
<pub-id pub-id-type="pmid">17255551</pub-id>
</element-citation>
</ref>
<ref id="CR5">
<label>5</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Brady</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>PhymmBL expanded: confidence scores, custom databases, parallelization and more</article-title>
<source>Nat Methods</source>
<year>2011</year>
<volume>8</volume>
<issue>5</issue>
<fpage>367</fpage>
<pub-id pub-id-type="doi">10.1038/nmeth0511-367</pub-id>
<pub-id pub-id-type="pmid">21527926</pub-id>
</element-citation>
</ref>
<ref id="CR6">
<label>6</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Gibbons</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Ghodsi</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Treangen</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Pop</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences</article-title>
<source>BMC Genomics</source>
<year>2011</year>
<volume>12</volume>
<issue>Suppl 2</issue>
<fpage>4</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2164-12-S2-S4</pub-id>
<pub-id pub-id-type="pmid">21205322</pub-id>
</element-citation>
</ref>
<ref id="CR7">
<label>7</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Segata</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Waldron</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Ballarini</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Narasimhan</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Jousson</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Huttenhower</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Metagenomic microbial community profiling using unique clade-specific marker genes</article-title>
<source>Nat Methods</source>
<year>2012</year>
<volume>9</volume>
<issue>8</issue>
<fpage>811</fpage>
<lpage>4</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth.2066</pub-id>
<pub-id pub-id-type="pmid">22688413</pub-id>
</element-citation>
</ref>
<ref id="CR8">
<label>8</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rosen</surname>
<given-names>GL</given-names>
</name>
<name>
<surname>Reichenberger</surname>
<given-names>ER</given-names>
</name>
<name>
<surname>Rosenfeld</surname>
<given-names>AM</given-names>
</name>
</person-group>
<article-title>NBC: the naive bayes classification tool webserver for taxonomic classification of metagenomic reads</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>1</issue>
<fpage>127</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btq619</pub-id>
<pub-id pub-id-type="pmid">21062764</pub-id>
</element-citation>
</ref>
<ref id="CR9">
<label>9</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Patil</surname>
<given-names>KR</given-names>
</name>
<name>
<surname>Haider</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Pope</surname>
<given-names>PB</given-names>
</name>
<name>
<surname>Turnbaugh</surname>
<given-names>PJ</given-names>
</name>
<name>
<surname>Morrison</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Scheffer</surname>
<given-names>T</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Taxonomic metagenome sequence assignment with structured output models</article-title>
<source>Nat Methods</source>
<year>2011</year>
<volume>8</volume>
<issue>3</issue>
<fpage>191</fpage>
<lpage>2</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth0311-191</pub-id>
<pub-id pub-id-type="pmid">21358620</pub-id>
</element-citation>
</ref>
<ref id="CR10">
<label>10</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ames</surname>
<given-names>SK</given-names>
</name>
<name>
<surname>Hysom</surname>
<given-names>DA</given-names>
</name>
<name>
<surname>Gardner</surname>
<given-names>SN</given-names>
</name>
<name>
<surname>Lloyd</surname>
<given-names>GS</given-names>
</name>
<name>
<surname>Gokhale</surname>
<given-names>MB</given-names>
</name>
<name>
<surname>Allen</surname>
<given-names>JE</given-names>
</name>
</person-group>
<article-title>Scalable metagenomic taxonomy classification using a reference genome database</article-title>
<source>Bioinformatics</source>
<year>2013</year>
<volume>29</volume>
<issue>18</issue>
<fpage>2253</fpage>
<lpage>60</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btt389</pub-id>
<pub-id pub-id-type="pmid">23828782</pub-id>
</element-citation>
</ref>
<ref id="CR11">
<label>11</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wood</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Kraken: ultrafast metagenomic sequence classification using exact alignments</article-title>
<source>Genome Biol.</source>
<year>2014</year>
<volume>15</volume>
<issue>3</issue>
<fpage>46</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2014-15-3-r46</pub-id>
</element-citation>
</ref>
<ref id="CR12">
<label>12</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bazinet</surname>
<given-names>AL</given-names>
</name>
<name>
<surname>Cummings</surname>
<given-names>MP</given-names>
</name>
</person-group>
<article-title>A comparative evaluation of sequence classification programs</article-title>
<source>BMC Bioinf.</source>
<year>2012</year>
<volume>13</volume>
<issue>1</issue>
<fpage>92</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-13-92</pub-id>
</element-citation>
</ref>
<ref id="CR13">
<label>13</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Koslicki</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Foucart</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Rosen</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>WGSQuikr: Fast whole-genome shotgun metagenomic classification</article-title>
<source>PLoS One</source>
<year>2014</year>
<volume>9</volume>
<issue>3</issue>
<fpage>91784</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pone.0091784</pub-id>
</element-citation>
</ref>
<ref id="CR14">
<label>14</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Altschul</surname>
<given-names>SF</given-names>
</name>
<name>
<surname>Gish</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Myers</surname>
<given-names>EW</given-names>
</name>
<name>
<surname>Lipman</surname>
<given-names>DJ</given-names>
</name>
</person-group>
<article-title>Basic local alignment search tool</article-title>
<source>J Mol Biol.</source>
<year>1990</year>
<volume>215</volume>
<issue>3</issue>
<fpage>403</fpage>
<lpage>10</lpage>
<pub-id pub-id-type="doi">10.1016/S0022-2836(05)80360-2</pub-id>
<pub-id pub-id-type="pmid">2231712</pub-id>
</element-citation>
</ref>
<ref id="CR15">
<label>15</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kent</surname>
<given-names>WJ</given-names>
</name>
</person-group>
<article-title>BLAT: the BLAST-like alignment tool</article-title>
<source>Genome Res.</source>
<year>2002</year>
<volume>12</volume>
<issue>4</issue>
<fpage>656</fpage>
<lpage>64</lpage>
<pub-id pub-id-type="doi">10.1101/gr.229202</pub-id>
<pub-id pub-id-type="pmid">11932250</pub-id>
</element-citation>
</ref>
<ref id="CR16">
<label>16</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<collab>International Barley Genome Sequencing Consortium</collab>
</person-group>
<article-title>A physical, genetic and functional sequence assembly of the barley genome</article-title>
<source>Nature</source>
<year>2012</year>
<volume>491</volume>
<issue>7426</issue>
<fpage>711</fpage>
<lpage>6</lpage>
<pub-id pub-id-type="pmid">23075845</pub-id>
</element-citation>
</ref>
<ref id="CR17">
<label>17</label>
<mixed-citation publication-type="other">Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, et al. GenBank. Nucleic Acids Res. 2012:1195.</mixed-citation>
</ref>
<ref id="CR18">
<label>18</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vinga</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Almeida</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Alignment-free sequence comparison: a review</article-title>
<source>Bioinformatics</source>
<year>2003</year>
<volume>19</volume>
<issue>4</issue>
<fpage>513</fpage>
<lpage>23</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btg005</pub-id>
<pub-id pub-id-type="pmid">12611807</pub-id>
</element-citation>
</ref>
<ref id="CR19">
<label>19</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mavromatis</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Ivanova</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Barry</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Shapiro</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Goltsman</surname>
<given-names>E</given-names>
</name>
<name>
<surname>McHardy</surname>
<given-names>AC</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Use of simulated data sets to evaluate the fidelity of metagenomic processing methods</article-title>
<source>Nat Methods</source>
<year>2007</year>
<volume>4</volume>
<issue>6</issue>
<fpage>495</fpage>
<lpage>500</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth1043</pub-id>
<pub-id pub-id-type="pmid">17468765</pub-id>
</element-citation>
</ref>
<ref id="CR20">
<label>20</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Magoc</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Pabinger</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Canzar</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Su</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Puiu</surname>
<given-names>D</given-names>
</name>
<etal></etal>
</person-group>
<article-title>GAGE-B: an evaluation of genome assemblers for bacterial organisms</article-title>
<source>Bioinformatics</source>
<year>2013</year>
<volume>29</volume>
<issue>14</issue>
<fpage>1718</fpage>
<lpage>25</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btt273</pub-id>
<pub-id pub-id-type="pmid">23665771</pub-id>
</element-citation>
</ref>
<ref id="CR21">
<label>21</label>
<mixed-citation publication-type="other">Said HS, Suda W, Nakagome S, Chinen H, Oshima K, Kim S, et al. Dysbiosis of salivary microbiota in inflammatory bowel disease and its association with oral immunological biomarkers. DNA Res. 2013:037.</mixed-citation>
</ref>
<ref id="CR22">
<label>22</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Antonio</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Hawes</surname>
<given-names>SE</given-names>
</name>
<name>
<surname>Hillier</surname>
<given-names>SL</given-names>
</name>
</person-group>
<article-title>The identification of vaginal lactobacillus species and the demographic and microbiologic characteristics of women colonized by these species</article-title>
<source>J Infectious Diseases</source>
<year>1999</year>
<volume>180</volume>
<issue>6</issue>
<fpage>1950</fpage>
<lpage>6</lpage>
<pub-id pub-id-type="doi">10.1086/315109</pub-id>
<pub-id pub-id-type="pmid">10558952</pub-id>
</element-citation>
</ref>
<ref id="CR23">
<label>23</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hyman</surname>
<given-names>RW</given-names>
</name>
<name>
<surname>Fukushima</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Diamond</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Kumm</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Giudice</surname>
<given-names>LC</given-names>
</name>
<name>
<surname>Davis</surname>
<given-names>RW</given-names>
</name>
</person-group>
<article-title>Microbes on the human vaginal epithelium</article-title>
<source>Proc Nat Acad Sci.</source>
<year>2005</year>
<volume>102</volume>
<issue>22</issue>
<fpage>7952</fpage>
<lpage>7</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.0503236102</pub-id>
<pub-id pub-id-type="pmid">15911771</pub-id>
</element-citation>
</ref>
<ref id="CR24">
<label>24</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Doležel</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Vrána</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Šafář</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Bartoš</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Kubaláková</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Šimková</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>Chromosomes in the flow to simplify genome analysis</article-title>
<source>Funct Integr Genomics</source>
<year>2012</year>
<volume>12</volume>
<issue>3</issue>
<fpage>397</fpage>
<lpage>416</lpage>
<pub-id pub-id-type="doi">10.1007/s10142-012-0293-0</pub-id>
<pub-id pub-id-type="pmid">22895700</pub-id>
</element-citation>
</ref>
<ref id="CR25">
<label>25</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lonardi</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Duma</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Alpert</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Cordero</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Beccuti</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Bhat</surname>
<given-names>PR</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Combinatorial pooling enables selective sequencing of the barley gene space</article-title>
<source>PLoS Comput Biol.</source>
<year>2013</year>
<volume>9</volume>
<issue>4</issue>
<fpage>1003010</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pcbi.1003010</pub-id>
</element-citation>
</ref>
<ref id="CR26">
<label>26</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Luo</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Xie</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Yuan</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<article-title>SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler</article-title>
<source>Gigascience</source>
<year>2012</year>
<volume>1</volume>
<issue>1</issue>
<fpage>18</fpage>
<pub-id pub-id-type="doi">10.1186/2047-217X-1-18</pub-id>
<pub-id pub-id-type="pmid">23587118</pub-id>
</element-citation>
</ref>
<ref id="CR27">
<label>27</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Close</surname>
<given-names>TJ</given-names>
</name>
<name>
<surname>Wanamaker</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Roose</surname>
<given-names>ML</given-names>
</name>
<name>
<surname>Lyon</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>HarvEST</article-title>
<source>Methods Mol Biol.</source>
<year>2006</year>
<volume>406</volume>
<fpage>161</fpage>
<lpage>77</lpage>
<pub-id pub-id-type="pmid">18287692</pub-id>
</element-citation>
</ref>
<ref id="CR28">
<label>28</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Close</surname>
<given-names>TJ</given-names>
</name>
<name>
<surname>Bhat</surname>
<given-names>PR</given-names>
</name>
<name>
<surname>Lonardi</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Rostoks</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Ramsay</surname>
<given-names>L</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Development and implementation of high-throughput SNP genotyping in barley</article-title>
<source>BMC Genomics</source>
<year>2009</year>
<volume>10</volume>
<issue>1</issue>
<fpage>582</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2164-10-582</pub-id>
<pub-id pub-id-type="pmid">19961604</pub-id>
</element-citation>
</ref>
<ref id="CR29">
<label>29</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mascher</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Muehlbauer</surname>
<given-names>GJ</given-names>
</name>
<name>
<surname>Rokhsar</surname>
<given-names>DS</given-names>
</name>
<name>
<surname>Chapman</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Schmutz</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Barry</surname>
<given-names>K</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Anchoring and ordering NGS contig assemblies by population sequencing (Popseq)</article-title>
<source>Plant J.</source>
<year>2013</year>
<volume>76</volume>
<issue>4</issue>
<fpage>718</fpage>
<lpage>27</lpage>
<pub-id pub-id-type="doi">10.1111/tpj.12319</pub-id>
<pub-id pub-id-type="pmid">23998490</pub-id>
</element-citation>
</ref>
<ref id="CR30">
<label>30</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tu</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>He</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Strain/species identification in metagenomes using genome-specific markers</article-title>
<source>Nucleic Acids Res.</source>
<year>2014</year>
<volume>42</volume>
<issue>8</issue>
<fpage>67</fpage>
<pub-id pub-id-type="doi">10.1093/nar/gku138</pub-id>
</element-citation>
</ref>
<ref id="CR31">
<label>31</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Schwartz</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Wagner</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>W</given-names>
</name>
</person-group>
<article-title>A greedy algorithm for aligning DNA sequences</article-title>
<source>J Comput Biol.</source>
<year>2000</year>
<volume>7</volume>
<issue>1-2</issue>
<fpage>203</fpage>
<lpage>14</lpage>
<pub-id pub-id-type="doi">10.1089/10665270050081478</pub-id>
<pub-id pub-id-type="pmid">10890397</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000292  | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000292  | SxmlIndent | more


This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021