CoMeta: Classification of Metagenomes Using k-mers

Internal identifier: 001006 (Pmc/Corpus); previous: 001005; next: 001007

Authors: Jolanta Kawulok; Sebastian Deorowicz

Source:

PLoS ONE [1932-6203]; 2015.

RBID : PMC:4401624

Abstract

The study of environmental samples has been developing rapidly. Characterizing the composition of an environment broadens our knowledge of the relationship between species composition and environmental conditions. An important step in determining the composition of a sample is to compare the extracted DNA fragments with sequences derived from known organisms. In this paper, we introduce an algorithm called CoMeta (Classification of metagenomes), which assigns a query read (a DNA fragment) to one of the groups previously prepared by the user. Typically, the groups correspond to a taxonomic rank (e.g., phylum, genus); however, the prepared groups may instead contain sequences with various functions. CoMeta uses an exact method for read classification based on short subsequences (k-mers), together with a fast program for indexing large sets of k-mers. In contrast to the most popular methods based on BLAST, where the query is compared with each reference sequence, we begin the classification from the top of the taxonomy tree to reduce the number of comparisons. The presented experimental study confirms that CoMeta outperforms other programs used in this context. CoMeta is available at https://github.com/jkawulok/cometa under a free GNU GPL 2 license.
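
The idea described in the abstract (exact k-mer matching combined with a top-down walk of the taxonomy tree) can be illustrated with a minimal sketch. This is not the CoMeta implementation: the function names, the toy tree, the reference sequences, the scoring rule (fraction of a read's k-mers found in a group) and the 0.5 threshold are all assumptions made for illustration.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(groups, k):
    """Map each group name to the set of k-mers in its reference sequences."""
    return {name: set().union(*(kmers(s, k) for s in seqs))
            for name, seqs in groups.items()}


def score(read, group_kmers, k):
    """Fraction of the read's k-mers that occur exactly in the group (assumed scoring rule)."""
    read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    if not read_kmers:
        return 0.0
    return sum(km in group_kmers for km in read_kmers) / len(read_kmers)


def classify(read, tree, index, k, threshold=0.5, root="root"):
    """Walk the tree from the top, following the best-scoring child at each rank.

    Only the children of the current node are compared, so most groups
    are never examined. Returns the list of groups along the chosen path.
    """
    path, node = [], root
    while tree.get(node):
        scored = [(score(read, index[child], k), child) for child in tree[node]]
        best_score, best = max(scored, key=lambda pair: pair[0])
        if best_score < threshold:  # hypothetical cut-off: leave the read unassigned below here
            break
        path.append(best)
        node = best
    return path


# Hypothetical toy data: two phyla, one of which has two genera.
K = 4
tree = {"root": ["phylumA", "phylumB"], "phylumA": ["genusA1", "genusA2"]}
groups = {
    "phylumA": ["ACGTACGTGG", "TTTTACGTAA"],
    "phylumB": ["GGGGCCCCGG"],
    "genusA1": ["ACGTACGTGG"],
    "genusA2": ["TTTTACGTAA"],
}
index = build_index(groups, K)
print(classify("ACGTACGT", tree, index, K))
```

In this toy run the read is scored against the two phyla first, and the genera of phylumB would never be compared against it; that pruning is the reduction in comparisons the abstract refers to.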

URL:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4401624

DOI: 10.1371/journal.pone.0121453
PubMed: 25884504
PubMed Central: 4401624

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">CoMeta: Classification of Metagenomes Using <italic>k</italic>-mers</title>
<author><name sortKey="Kawulok, Jolanta" sort="Kawulok, Jolanta" uniqKey="Kawulok J" first="Jolanta" last="Kawulok">Jolanta Kawulok</name>
<affiliation><nlm:aff id="aff001"></nlm:aff>
</affiliation>
</author>
<author><name sortKey="Deorowicz, Sebastian" sort="Deorowicz, Sebastian" uniqKey="Deorowicz S" first="Sebastian" last="Deorowicz">Sebastian Deorowicz</name>
<affiliation><nlm:aff id="aff001"></nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">25884504</idno>
<idno type="pmc">4401624</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4401624</idno>
<idno type="RBID">PMC:4401624</idno>
<idno type="doi">10.1371/journal.pone.0121453</idno>
<date when="2015">2015</date>
<idno type="wicri:Area/Pmc/Corpus">001006</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">001006</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">CoMeta: Classification of Metagenomes Using <italic>k</italic>-mers</title>
<author><name sortKey="Kawulok, Jolanta" sort="Kawulok, Jolanta" uniqKey="Kawulok J" first="Jolanta" last="Kawulok">Jolanta Kawulok</name>
<affiliation><nlm:aff id="aff001"></nlm:aff>
</affiliation>
</author>
<author><name sortKey="Deorowicz, Sebastian" sort="Deorowicz, Sebastian" uniqKey="Deorowicz S" first="Sebastian" last="Deorowicz">Sebastian Deorowicz</name>
<affiliation><nlm:aff id="aff001"></nlm:aff>
</affiliation>
</author>
</analytic>
<series><title level="j">PLoS ONE</title>
<idno type="eISSN">1932-6203</idno>
<imprint><date when="2015">2015</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><p>The study of environmental samples has been developing rapidly. Characterizing the composition of an environment broadens our knowledge of the relationship between species composition and environmental conditions. An important step in determining the composition of a sample is to compare the extracted DNA fragments with sequences derived from known organisms. In this paper, we introduce an algorithm called CoMeta (<underline>C</underline>lassification <underline>o</underline>f <underline>meta</underline>genomes), which assigns a query read (a DNA fragment) to one of the groups previously prepared by the user. Typically, the groups correspond to a taxonomic rank (e.g., phylum, genus); however, the prepared groups may instead contain sequences with various functions. CoMeta uses an exact method for read classification based on short subsequences (<italic>k</italic>-mers), together with a fast program for indexing large sets of <italic>k</italic>-mers. In contrast to the most popular methods based on BLAST, where the query is compared with each reference sequence, we begin the classification from the top of the taxonomy tree to reduce the number of comparisons. The presented experimental study confirms that CoMeta outperforms other programs used in this context. CoMeta is available at <ext-link ext-link-type="uri" xlink:href="https://github.com/jkawulok/cometa">https://github.com/jkawulok/cometa</ext-link> under a free GNU GPL 2 license.</p>
</div>
</front>
</TEI>
<pmc article-type="research-article"><pmc-dir>properties open_access</pmc-dir>
  <front><journal-meta><journal-id journal-id-type="nlm-ta">PLoS One</journal-id>
<journal-id journal-id-type="iso-abbrev">PLoS ONE</journal-id>
<journal-id journal-id-type="publisher-id">plos</journal-id>
<journal-id journal-id-type="pmc">plosone</journal-id>
<journal-title-group><journal-title>PLoS ONE</journal-title>
</journal-title-group>
<issn pub-type="epub">1932-6203</issn>
<publisher><publisher-name>Public Library of Science</publisher-name>
<publisher-loc>San Francisco, CA USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="pmid">25884504</article-id>
<article-id pub-id-type="pmc">4401624</article-id>
<article-id pub-id-type="publisher-id">PONE-D-14-28208</article-id>
<article-id pub-id-type="doi">10.1371/journal.pone.0121453</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group><article-title>CoMeta: Classification of Metagenomes Using <italic>k</italic>-mers</article-title>
<alt-title alt-title-type="running-head">CoMeta: Classification of Metagenomes Using <italic>k</italic>-mers</alt-title>
</title-group>
<contrib-group><contrib contrib-type="author"><name><surname>Kawulok</surname>
<given-names>Jolanta</given-names>
</name>
<xref ref-type="corresp" rid="cor001">*</xref>
<xref ref-type="aff" rid="aff001"></xref>
</contrib>
<contrib contrib-type="author"><name><surname>Deorowicz</surname>
<given-names>Sebastian</given-names>
</name>
<xref ref-type="aff" rid="aff001"></xref>
</contrib>
</contrib-group>
<aff id="aff001"><addr-line>Institute of Informatics, Silesian University of Technology, Gliwice, Poland</addr-line>
</aff>
<contrib-group><contrib contrib-type="editor"><name><surname>Golden</surname>
<given-names>Aaron Alain-Jon</given-names>
</name>
<role>Academic Editor</role>
<xref ref-type="aff" rid="edit1"></xref>
</contrib>
</contrib-group>
<aff id="edit1"><addr-line>Albert Einstein College of Medicine, UNITED STATES</addr-line>
</aff>
<author-notes><fn fn-type="COI-statement" id="coi001"><p><bold>Competing Interests: </bold>
The authors have declared that no competing interests exist.</p>
</fn>
<fn fn-type="con" id="contrib001"><p>Conceived and designed the experiments: JK SD. Performed the experiments: JK. Analyzed the data: JK. Contributed reagents/materials/analysis tools: JK SD. Wrote the paper: JK SD. Designed the software: JK SD.</p>
</fn>
<corresp id="cor001">* E-mail: <email>jolanta.kawulok@polsl.pl</email>
</corresp>
</author-notes>
<pub-date pub-type="collection"><year>2015</year>
</pub-date>
<pub-date pub-type="epub"><day>17</day>
<month>4</month>
<year>2015</year>
</pub-date>
<volume>10</volume>
<issue>4</issue>
<elocation-id>e0121453</elocation-id>
<history><date date-type="received"><day>24</day>
<month>6</month>
<year>2014</year>
</date>
<date date-type="accepted"><day>15</day>
<month>2</month>
<year>2015</year>
</date>
</history>
<permissions><copyright-statement>© 2015 Kawulok, Deorowicz</copyright-statement>
<copyright-year>2015</copyright-year>
<copyright-holder>Kawulok, Deorowicz</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><license-p>This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:type="simple" xlink:href="pone.0121453.pdf"></self-uri>
<abstract><p>Nowadays, the study of environmental samples has been developing rapidly. Characterization of the environment composition broadens the knowledge about the relationship between species composition and environmental conditions. An important element of extracting the knowledge of the sample composition is to compare the extracted fragments of DNA with sequences derived from known organisms. In the presented paper, we introduce an algorithm called CoMeta (<underline>C</underline>
lassification <underline>o</underline>
f <underline>meta</underline>
genomes), which assigns a query read (a DNA fragment) to one of the groups previously prepared by the user. Typically, a group is one of the taxonomic ranks (e.g., phylum, genus); however, the prepared groups may also contain sequences having various functions. In CoMeta, we use an exact method for read classification based on short subsequences (<italic>k</italic>
-mers) and a fast program for indexing large sets of <italic>k</italic>
-mers. In contrast to the most popular methods based on BLAST, where the query is compared with each reference sequence, we begin the classification from the top of the taxonomy tree to reduce the number of comparisons. The presented experimental study confirms that CoMeta outperforms other programs used in this context. CoMeta is available at <ext-link ext-link-type="uri" xlink:href="https://github.com/jkawulok/cometa">https://github.com/jkawulok/cometa</ext-link>
 under a free GNU GPL 2 license.</p>
</abstract>
<funding-group><funding-statement>This work was supported by the Polish National Science Centre under the project DEC-2012/05/B/ST6/03148 and the European Union from the European Social Fund (grant agreement number: UDA-POKL.04.01.01-00-106/09). The work was performed using the infrastructure supported by POIG.02.03.01-24-099/13 grant: "GeCONiI---Upper Silesian Center for Computational Science and Engineering". This research was supported in part by PL-Grid Infrastructure. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</funding-statement>
</funding-group>
<counts><fig-count count="5"></fig-count>
<table-count count="6"></table-count>
<page-count count="23"></page-count>
</counts>
<custom-meta-group><custom-meta id="data-availability"><meta-name>Data Availability</meta-name>
<meta-value>The package and documentation of the program are freely available at <ext-link ext-link-type="uri" xlink:href="https://github.com/jkawulok/cometa">https://github.com/jkawulok/cometa</ext-link>
, all the data used in this paper are available at <ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.7910/DVN/29265">http://dx.doi.org/10.7910/DVN/29265</ext-link>
.</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
<notes><title>Data Availability</title>
<p>The package and documentation of the program are freely available at <ext-link ext-link-type="uri" xlink:href="https://github.com/jkawulok/cometa">https://github.com/jkawulok/cometa</ext-link>
, all the data used in this paper are available at <ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.7910/DVN/29265">http://dx.doi.org/10.7910/DVN/29265</ext-link>
.</p>
</notes>
</front>
<body><sec sec-type="intro" id="sec001"><title>Introduction</title>
<p>Comprehensive and complete analysis of the microbes’ genomes, performed in their original environment, usually called metagenomics [<xref rid="pone.0121453.ref001" ref-type="bibr">1</xref>
] or environmental and community genomics, has become a popular field of research in recent years. Its origins can be found in the work of Pace <italic>et al.</italic>
[<xref rid="pone.0121453.ref002" ref-type="bibr">2</xref>
], which formulated the first proposal for cloning environmental DNA by Polymerase Chain Reaction (PCR) to explore the diversity of ribosomal RNA sequences. In metagenomics, the isolation and culture of organisms are unnecessary. Therefore, it is possible to investigate species that were previously neglected due to the lack of laboratory-grown cultures. Moreover, a large number of unknown enzymes and metabolic capabilities are encoded in the genomes of uncultured species. Ultimately, metagenomics allows for discovering thousands of new microorganisms and their potentially useful functions [<xref rid="pone.0121453.ref003" ref-type="bibr">3</xref>
, <xref rid="pone.0121453.ref004" ref-type="bibr">4</xref>
].</p>
<p>Metagenomic analyses can help in solving numerous practical challenges in medicine, engineering, agriculture, and ecology [<xref rid="pone.0121453.ref005" ref-type="bibr">5</xref>
]. Currently, many projects are carried out which are aimed at understanding biocenosis coming from various environments, such as soil [<xref rid="pone.0121453.ref006" ref-type="bibr">6</xref>
, <xref rid="pone.0121453.ref007" ref-type="bibr">7</xref>
], water (e.g., groundwater [<xref rid="pone.0121453.ref008" ref-type="bibr">8</xref>
], seawater [<xref rid="pone.0121453.ref009" ref-type="bibr">9</xref>
, <xref rid="pone.0121453.ref010" ref-type="bibr">10</xref>
], rivers [<xref rid="pone.0121453.ref011" ref-type="bibr">11</xref>
]), or places with extreme conditions, like hot springs and mud holes in solfataric fields [<xref rid="pone.0121453.ref012" ref-type="bibr">12</xref>
], glacier ice [<xref rid="pone.0121453.ref013" ref-type="bibr">13</xref>
], or Antarctic desert soil [<xref rid="pone.0121453.ref014" ref-type="bibr">14</xref>
]. Samples are also collected from other organisms, for example from the rumens of buffalo [<xref rid="pone.0121453.ref015" ref-type="bibr">15</xref>
] or cow [<xref rid="pone.0121453.ref016" ref-type="bibr">16</xref>
].</p>
<p>The fact that the human body carries a hundred times more bacterial genes than the inherited human genome has been the main reason for the growing interest in the microorganisms living in the human body [<xref rid="pone.0121453.ref017" ref-type="bibr">17</xref>
–<xref rid="pone.0121453.ref019" ref-type="bibr">19</xref>
]. The main aim of the Human Microbiome Project [<xref rid="pone.0121453.ref020" ref-type="bibr">20</xref>
], started in 2009, lies in characterizing the human microbiome communities found at several different sites in the human body, including nasal passages, oral cavities, skin, gastrointestinal, and urogenital tracts. Furthermore, the project is aimed at analyzing the role of these microbes in human health and disease.</p>
<sec id="sec002"><title>Metagenomic processing</title>
<p>The metagenomic analysis is a multi-stage process [<xref rid="pone.0121453.ref004" ref-type="bibr">4</xref>
, <xref rid="pone.0121453.ref021" ref-type="bibr">21</xref>
, <xref rid="pone.0121453.ref022" ref-type="bibr">22</xref>
]. First, the genetic material is isolated from the environmental sample containing a mixture of various types of microorganisms. Subsequently, the DNA material is extracted and sequenced. Finally, the reads (short fragments of genomes obtained in sequencing) are binned and annotated.</p>
<p>Over the last decade, DNA sequencing methods have become cheaper and faster. The first sequencing method was invented by Sanger [<xref rid="pone.0121453.ref023" ref-type="bibr">23</xref>
] in 1977, and it dominated for almost two subsequent decades. In spite of many improvements proposed to this technique, it is inferior to the recent methods, referred to as Next Generation Sequencing (NGS) [<xref rid="pone.0121453.ref024" ref-type="bibr">24</xref>
]. The most popular among them are the 454/Roche and Illumina/Solexa systems, and nowadays they are extensively applied to the analysis of metagenomic samples [<xref rid="pone.0121453.ref021" ref-type="bibr">21</xref>
]. For example, the 454 sequencing has been used to study the metagenomes contained in kefir grains [<xref rid="pone.0121453.ref025" ref-type="bibr">25</xref>
] and waste water [<xref rid="pone.0121453.ref026" ref-type="bibr">26</xref>
], whereas the infant gut [<xref rid="pone.0121453.ref027" ref-type="bibr">27</xref>
] and cystic fibrosis lung [<xref rid="pone.0121453.ref028" ref-type="bibr">28</xref>
] metagenomes have been sequenced with Illumina. In a single experiment, the 454/Roche sequencers produce millions of long reads (600–900 bp), while the Illumina/Solexa sequencers deliver hundreds of millions of shorter reads (36–200 bp).</p>
</sec>
<sec id="sec003"><title>Classification of metagenomic data</title>
<p>Sequencing produces a huge set of reads coming from the genomes of organisms living in the investigated environment. As mentioned earlier, an important aim of the metagenomic study is to determine the qualitative and quantitative composition of the environmental sample, which is achieved by solving two important tasks, namely binning and annotation. The latter requires classification of the reads to a set of known sequences. The reads may be compared with annotated sequences stored in a number of databases (e.g., GenBank [<xref rid="pone.0121453.ref029" ref-type="bibr">29</xref>
]), and associated with a species or a gene function. In general, the questions raised are: “who is there?”, “how much of each?”, and “what are they doing?”. The answers to the first two questions may be obtained relying on taxonomic classification, while the third one can be answered using functional classification.</p>
<p>During the study of the environmental community, the obtained reads derived from a set of various organisms are assigned to taxa. The assignment may be either independent or dependent on the taxonomy. In the latter case, the reads are directly assigned to taxa on the basis of the reference sequences, where the taxon can range from the superkingdom to the species rank. During the taxonomy independent analysis, the reads are grouped into operational taxonomic units (OTUs) based on their similarity to each other in the sample. OTU is usually delineated with a 3% sequence dissimilarity, which corresponds to the taxonomical rank of species [<xref rid="pone.0121453.ref030" ref-type="bibr">30</xref>
, <xref rid="pone.0121453.ref031" ref-type="bibr">31</xref>
]. Obviously, the acceptance threshold may be set to a different value [<xref rid="pone.0121453.ref032" ref-type="bibr">32</xref>
]. Using the taxonomy dependent analysis, OTUs can be assigned to taxonomic names. In a single habitat, the organisms belonging to various groups appear together. Even though a microbial sample contains microbial eukaryotes, bacteria, archaea, and also viruses, metagenomic studies focus primarily on the prokaryotic species. Moreover, sequencing of eukaryotic DNA is unprofitable due to the large genome size and low gene coding densities. Therefore, in some studies, the eukaryotic cells are eliminated by filtering the samples [<xref rid="pone.0121453.ref010" ref-type="bibr">10</xref>
].</p>
<p>There are several computer programs for read-to-taxa classification. They can be separated into two main groups, namely composition-based and similarity search methods. Using the former, reference sequence features are first extracted and subsequently compared, whilst using the latter, the reads are compared to some reference sequences. The hybrids of these two approaches may also include elements of phylogenetic analysis.</p>
<p>The composition-based methods follow the three-stage strategy [<xref rid="pone.0121453.ref033" ref-type="bibr">33</xref>
–<xref rid="pone.0121453.ref038" ref-type="bibr">38</xref>
]: 1) machine learning-based modeling of features extracted from reference sequences (e.g., distribution of short nucleotide subsequences, <italic>k</italic>
-mers); 2) modeling of the unknown set of reads (performed in the same way as for the set of reference sequences); 3) comparison of the read and reference sequence models to assign taxonomic ranks to each read. Among the machine learning methods, it is worth mentioning interpolated Markov models [<xref rid="pone.0121453.ref034" ref-type="bibr">34</xref>
], support vector machines (SVMs) [<xref rid="pone.0121453.ref037" ref-type="bibr">37</xref>
, <xref rid="pone.0121453.ref038" ref-type="bibr">38</xref>
], <italic>k</italic>
-nearest neighbors [<xref rid="pone.0121453.ref035" ref-type="bibr">35</xref>
] or naive Bayesian classifier [<xref rid="pone.0121453.ref036" ref-type="bibr">36</xref>
]. For SVMs, training on large datasets may be problematic; however, the training set can be effectively selected using various techniques [<xref rid="pone.0121453.ref039" ref-type="bibr">39</xref>
–<xref rid="pone.0121453.ref041" ref-type="bibr">41</xref>
].</p>
<p>The similarity search methods rely on sequence homology. They use a database containing nucleotide or protein reference sequences. For detecting remote homologies, it is better to use protein sequences, as they are better conserved across greater evolutionary distances. However, in order to use a protein database, the reads have to be translated into amino acid sequences. Taking into account all three possible start sites for encoding amino acids on both strands (the main sequence and its reverse-complement counterpart), each read has to be translated in all six reading frames, which negatively influences the computation time. In addition, reads consisting of non-coding DNA cannot be processed by such a translation-to-protein method.</p>
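<p>The six-frame translation mentioned above can be illustrated with a minimal sketch. It is not taken from any of the cited tools; it assumes the standard genetic code, reads containing only A, C, G and T, and uses hypothetical helper names (<monospace>translate</monospace>, <monospace>six_frames</monospace>). Codons containing other symbols are reported as X.</p>
<preformat>
```python
# Standard genetic code, codons ordered by base in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {a + b + c: AMINO[16 * i + 4 * j + n]
         for i, a in enumerate(BASES)
         for j, b in enumerate(BASES)
         for n, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate complete codons of `seq`; unknown codons become 'X'."""
    return "".join(CODON.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(read):
    """Return the three forward and three reverse-complement translations."""
    reverse = read.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (read, reverse)
            for offset in range(3)]


if __name__ == "__main__":
    for frame in six_frames("ATGGCC"):
        print(frame)
```
</preformat>
<p>Every read yields six protein candidates that must each be searched, which is why protein-level comparison costs more than a single nucleotide-level search.</p>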
<p>In most cases, the similarity search methods employ BLAST to obtain alignments of reads to a set of reference sequences. Subsequently, these alignments are used for taxonomic classification. Some programs, like MEGAN [<xref rid="pone.0121453.ref042" ref-type="bibr">42</xref>
], MTR [<xref rid="pone.0121453.ref043" ref-type="bibr">43</xref>
], SOrt-ITEMS [<xref rid="pone.0121453.ref044" ref-type="bibr">44</xref>
], CARMA3 [<xref rid="pone.0121453.ref045" ref-type="bibr">45</xref>
], use the lowest common ancestor (LCA) algorithm for assigning the taxonomic labels. After performing the BLAST search for each read, the BLAST hits, whose bit scores are above the threshold, are selected for further analysis. LCA is computed for all species that were reported by best BLAST hits for a read. If BLAST hits are ambiguous (the hits are similar for reference sequences derived from different species), then the read is assigned to a higher taxonomic level.</p>
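<p>The LCA step described above can be sketched as follows. This is only an illustration under stated assumptions: the taxonomy is a toy child-to-parent map invented for the example, and the function names are hypothetical rather than taken from MEGAN, MTR, SOrt-ITEMS or CARMA3.</p>
<preformat>
```python
def lineage(taxon, parent):
    """Return the path from `taxon` up to the root (taxon first)."""
    path = [taxon]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def lowest_common_ancestor(taxa, parent):
    """Return the deepest node shared by the lineages of all `taxa`."""
    taxa = list(taxa)
    common = lineage(taxa[0], parent)
    for taxon in taxa[1:]:
        other = set(lineage(taxon, parent))
        common = [node for node in common if node in other]
    return common[0]  # lineages are ordered from leaf to root


# Toy taxonomy (hypothetical, intermediate ranks omitted): child -> parent.
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Escherichia": "Proteobacteria",
    "Salmonella": "Proteobacteria",
    "Firmicutes": "Bacteria",
}

if __name__ == "__main__":
    print(lowest_common_ancestor(["Escherichia", "Salmonella"], PARENT))
```
</preformat>
<p>When the selected hits point to taxa from distant branches, the common ancestor moves towards the root, which corresponds to assigning an ambiguous read to a higher taxonomic level.</p>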
<p>Furthermore, the marker genes can also be used to facilitate reads classification. These genes help to identify a particular species, e.g., 16S rRNA occurs in the prokaryote genomes. MG-RAST [<xref rid="pone.0121453.ref046" ref-type="bibr">46</xref>
] relies on the chloroplast, mitochondrial, and ACLAME (including mobile genetic elements) databases. MetaPhyler [<xref rid="pone.0121453.ref047" ref-type="bibr">47</xref>
] uses 31 phylogenetic marker genes as the taxonomic references. One of CARMA3 variants [<xref rid="pone.0121453.ref045" ref-type="bibr">45</xref>
] and Treephyler [<xref rid="pone.0121453.ref048" ref-type="bibr">48</xref>
] use hidden Markov models (instead of BLAST) to search for the homologies against the Pfam database—protein domains contained in the Pfam are here used as the markers.</p>
<p>As discussed earlier, the composition-based classification methods compare the <italic>k</italic>
-mer distribution of a read with those which come from different taxa. In the FACS [<xref rid="pone.0121453.ref049" ref-type="bibr">49</xref>
] program, instead of determining the full distribution of <italic>k</italic>
-mers, their appearance in a reference sequence is taken into account (1 if a <italic>k</italic>
-mer from a read appears in a reference sequence, 0 otherwise). FACS can be regarded as a similarity search method, because it aligns the reads to the reference sequences, represented by <italic>k</italic>
-mers indexed using the Bloom filters. The original FACS algorithm was implemented in the Perl language, but the latest version has been reimplemented in C (available at <ext-link ext-link-type="uri" xlink:href="https://github.com/SciLifeLab/facs">https://github.com/SciLifeLab/facs</ext-link>
). Actually, the new version is not intended for metagenomic data classification, but it checks how many reads might be contaminated in a particular sample.</p>
<p>The Livermore Metagenomics Analysis Toolkit (LMAT) also maps <italic>k</italic>
-mers without using information about their positions and quantity [<xref rid="pone.0121453.ref050" ref-type="bibr">50</xref>
]. When constructing a <italic>k</italic>
-mer database, each canonical <italic>k</italic>
-mer (i.e., the <italic>k</italic>
-mer or its reverse complement, if the latter is lexicographically smaller), derived from the reference sequence, is assigned to a group of reference sequences which contain that <italic>k</italic>
-mer. Hence, the <italic>k</italic>
-mers are grouped together in such a way that each group contains those <italic>k</italic>
-mers which occur in every reference sequence in the group and do not occur in any sequence outside the group. LMAT, like the programs discussed earlier, also computes the LCA—the created groups are linked together in a taxonomic tree. During classification, the canonical <italic>k</italic>
-mers of each read are compared to the <italic>k</italic>
-mers located in every group. The similarity score is increased for each matching <italic>k</italic>
-mer, and cumulated for the whole taxon. Similarly to other LCA-employing methods, in case of conflicts (i.e., situations, in which the scores for several taxa are high and identical) the read is classified to the level above. This helps in selecting the most specific taxonomic label, whose lineage has no conflicts with another taxonomic label.</p>
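<p>The canonical representation used above follows directly from its definition: of a <italic>k</italic>-mer and its reverse complement, the lexicographically smaller string is kept. A minimal sketch (the function names are hypothetical and not part of LMAT):</p>
<preformat>
```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base and reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


if __name__ == "__main__":
    print(canonical("TTGA"), canonical("TCAA"))
```
</preformat>
<p>Both strands of the same locus map to one key, so a read can be matched regardless of the strand it was sequenced from.</p>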
<p>Very recently, the Kraken algorithm [<xref rid="pone.0121453.ref051" ref-type="bibr">51</xref>
] using the <italic>k</italic>
-mer indexing scheme similar to LMAT, has been proposed. These methods differ, however, in classification and database construction strategy. In the algorithm used in Kraken, each <italic>k</italic>
-mer from a reference sequence stores the taxonomic ID number of the <italic>k</italic>
-mers’ LCA values. Like in LMAT, the Kraken database contains the <italic>k</italic>
-mers in the canonical representation. However, these <italic>k</italic>
-mers are first sorted according to the minimizer, a very popular idea in recent years in bioinformatics [<xref rid="pone.0121453.ref052" ref-type="bibr">52</xref>
–<xref rid="pone.0121453.ref054" ref-type="bibr">54</xref>
], (i.e., the lexicographically smallest <italic>M</italic>
-mer in each <italic>k</italic>
-mer), and the <italic>k</italic>
-mers containing the same minimizer are sorted in the lexicographical order in the database. This strategy substantially accelerates the queries. A taxonomic node cumulates points for every match of a <italic>k</italic>
-mer extracted from the given read. The read is classified to the node, which has obtained the largest number of points cumulated along the path leading from the root to that node.</p>
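<p>The minimizer-based ordering can be sketched from the definition given above (the lexicographically smallest <italic>M</italic>-mer of each <italic>k</italic>-mer). The code below only illustrates that definition; it makes no claim about how Kraken actually lays out its database, and the function names are hypothetical.</p>
<preformat>
```python
def minimizer(kmer, m):
    """Return the lexicographically smallest substring of length m."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def sort_by_minimizer(kmers, m):
    """Order k-mers by minimizer first, then lexicographically."""
    return sorted(kmers, key=lambda kmer: (minimizer(kmer, m), kmer))


if __name__ == "__main__":
    print(minimizer("GATTACA", 3))
    print(sort_by_minimizer(["TTAC", "GATT", "ATTA"], 2))
```
</preformat>
<p>Consecutive <italic>k</italic>-mers of a read overlap in all but one base and therefore often share a minimizer, so they tend to fall into the same region of a database ordered this way.</p>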
<p>Neither LMAT nor Kraken uses the cumulative distribution of <italic>k</italic>
-mers, nor do they exploit alignment searching. Thus, they can be regarded as hybrid methods, combining two different strategies: the composition-based and similarity search approaches.</p>
</sec>
<sec id="sec004"><title>Contribution</title>
<p>In this paper, we present CoMeta—a new fast and accurate algorithm for <underline>c</underline>
lassification <underline>o</underline>
f <underline>meta</underline>
genomes (metagenomic reads). We determine the similarity (termed the match score) between the query read and a group of the reference sequences by counting the number of nucleotides in those <italic>k</italic>
-mers, which occur both in the read and in the group. The read is classified to that group, for which the match score is the largest. The group is defined as a set of sequences of specific attribution. Typically, this is one of the taxonomic ranks (e.g., phylum, genus). CoMeta employs an efficient <italic>k</italic>
-mer counting and indexing algorithm [<xref rid="pone.0121453.ref055" ref-type="bibr">55</xref>
]. Its low memory requirements allow us to create the indexes even at high taxonomy tree levels that embrace large groups of sequences. In this way, after having built the indexes, we can quickly search the tree from the root to the leaves, and find the closest match for a given query read. This classification scheme (i.e., analysis of the taxonomy tree from the top) is in contrast to the existing BLAST-based methods, which require the query read to be compared with every reference sequence.</p>
<p>The main idea of the proposed method is similar to the one used in FACS. However, CoMeta does not impose any restrictions on the size of the data. We are able to classify sequences derived from both bacteria and big eukaryotes. The details of our algorithm are given in Section Methods. An extensive experimental study, whose results are reported in Section Results and Discussion, confirms that our algorithm is competitive, offering high speed and accuracy, compared with the state-of-the-art methods.</p>
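<p>The match score introduced in this section (the number of read nucleotides lying in <italic>k</italic>-mers shared with a group) can be sketched as below. This is a simplified illustration rather than the CoMeta implementation: the <italic>k</italic>-mer database of a group is assumed to be a plain Python set, and the function names are hypothetical.</p>
<preformat>
```python
def match_score(read, group_kmers, k):
    """Count read bases covered by k-mers that also occur in the group."""
    covered_until = 0  # read positions before this index are already counted
    score = 0
    for start in range(len(read) - k + 1):
        if read[start:start + k] in group_kmers:
            end = start + k
            # Only bases not counted by an earlier, overlapping k-mer are added.
            score += end - max(start, covered_until)
            covered_until = end
    return score


def match_rate(read, group_kmers, k):
    """Match score as a percentage of the read length."""
    return 100.0 * match_score(read, group_kmers, k) / len(read)


def best_group(read, groups, k):
    """Return the name of the group with the largest match score."""
    return max(groups, key=lambda name: match_score(read, groups[name], k))


if __name__ == "__main__":
    groups = {"G1": {"ACG", "CGT"}, "G2": {"TAC"}}
    print(best_group("ACGTAC", groups, 3))
```
</preformat>
<p>Counting covered bases instead of matching <italic>k</italic>-mers keeps the score bounded by the read length, so it can be expressed as a percentage of the read.</p>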
</sec>
</sec>
<sec sec-type="materials|methods" id="sec005"><title>Methods</title>
<sec id="sec006"><title>Introduction</title>
<p>In the following description of our algorithm, several symbols will be used. For clarity, we gathered them in <xref rid="pone.0121453.t001" ref-type="table">Table 1</xref>
.</p>
<table-wrap id="pone.0121453.t001" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0121453.t001</object-id>
<label>Table 1</label>
<caption><title>Dictionary of symbols and acronyms used in the description of the classification.</title>
</caption>
<alternatives><graphic id="pone.0121453.t001g" xlink:href="pone.0121453.t001"></graphic>
<table frame="box" rules="all" border="0"><colgroup span="1"><col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
</colgroup>
<tbody><tr><td align="left" rowspan="1" colspan="1"><italic>ξ</italic>
</td>
<td align="left" rowspan="1" colspan="1">–</td>
<td align="left" rowspan="1" colspan="1">match score, similarity between query read and the set of the reference sequences</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Ξ</td>
<td align="left" rowspan="1" colspan="1">–</td>
<td align="left" rowspan="1" colspan="1">match rate score, percentage ratio of the match score to the read length</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"><italic>D</italic>
<sub><italic>i</italic>
</sub>
</td>
<td align="left" rowspan="1" colspan="1">–</td>
<td align="left" rowspan="1" colspan="1"><italic>k</italic>
-mer database for an <italic>i-th</italic>
 group</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"><italic>f</italic>
</td>
<td align="left" rowspan="1" colspan="1">–</td>
<td align="left" rowspan="1" colspan="1">number of various groups to which the reads were classified</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"><italic>F</italic>
</td>
<td align="left" rowspan="1" colspan="1">–</td>
<td align="left" rowspan="1" colspan="1">output files after assignment to the best group</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"><italic>FP</italic>
</td>
<td align="left" rowspan="1" colspan="1">–</td>
<td align="left" rowspan="1" colspan="1">number of incorrectly classified reads</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"><italic>G</italic>
<sup>0</sup>
</td>
<td align="left" rowspan="1" colspan="1">–</td>
<td align="left" rowspan="1" colspan="1">set of all reference sequences</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"><inline-formula id="pone.0121453.e001"><mml:math id="M1"><mml:mrow><mml:msubsup><mml:mi>G</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">–</td>
<td align="left" rowspan="1" colspan="1">set of reference sequences for the <italic>i-th</italic>
 group at the <italic>j-th</italic>
 level</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"><italic>k</italic>
</td>
<td align="left" rowspan="1" colspan="1">–</td>
<td align="left" rowspan="1" colspan="1">subsequence (<italic>k</italic>
-mer) length</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"><italic>M</italic>
</td>
<td align="left" rowspan="1" colspan="1">–</td>
<td align="left" rowspan="1" colspan="1">dataset of reads</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"><italic>MC</italic>
</td>
<td align="left" rowspan="1" colspan="1">–</td>
<td align="left" rowspan="1" colspan="1">match cut-off value of sequence identity</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"><italic>n</italic>
<sup><italic>j</italic>
</sup>
</td>
<td align="left" rowspan="1" colspan="1">–</td>
<td align="left" rowspan="1" colspan="1">number of various sets (groups) of reference sequences, with which the query read is compared at the <italic>j-th</italic>
 taxonomic rank</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"><italic>NC</italic>
</td>
<td align="left" rowspan="1" colspan="1">–</td>
<td align="left" rowspan="1" colspan="1">number of reads not classified to any group</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"><italic>R</italic>
</td>
<td align="left" rowspan="1" colspan="1">–</td>
<td align="left" rowspan="1" colspan="1">query read</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"><italic>S</italic>
</td>
<td align="left" rowspan="1" colspan="1">–</td>
<td align="left" rowspan="1" colspan="1">reference sequence</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"><italic>TP</italic>
</td>
<td align="left" rowspan="1" colspan="1">–</td>
<td align="left" rowspan="1" colspan="1">number of correctly classified reads</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<p>The proposed method consists of two major stages outlined in Figs <xref ref-type="fig" rid="pone.0121453.g001">1</xref>
 and <xref ref-type="fig" rid="pone.0121453.g002">2</xref>
. Firstly (in the <italic>database construction</italic>
 stage), the indexed <italic>k</italic>
-mer databases of clustered reference sequences are constructed. Subsequently (in the <italic>classification</italic>
 stage), the reads are classified to various groups with the use of the databases. The second stage is composed of two steps. In the <italic>comparison</italic>
 step, the input reads are scored according to a number of databases ({<italic>D</italic>
<sub><italic>i</italic>
</sub>
}). In the <italic>assignment</italic>
 step, the reads are assigned to the best group. Importantly, the classification stage is performed iteratively (for taxonomic classification) to search the taxonomy tree downwards.</p>
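<p>The iterative, top-down use of the classification stage can be sketched as follows. The sketch rests on several assumptions that are not taken from the paper: the tree is a dictionary of child lists, every group database is a plain set of <italic>k</italic>-mers, ties are resolved in favour of the first child, and a read stops descending as soon as no child matches. All names and the toy data are hypothetical.</p>
<preformat>
```python
def covered_bases(read, kmers, k):
    """Count read positions covered by at least one k-mer present in `kmers`."""
    covered = set()
    for start in range(len(read) - k + 1):
        if read[start:start + k] in kmers:
            covered.update(range(start, start + k))
    return len(covered)


def classify_down_tree(read, children, kmer_sets, k, root="root"):
    """Follow the best-scoring child at each level; stop when nothing matches."""
    path = []
    node = root
    while children.get(node):
        scores = {child: covered_bases(read, kmer_sets[child], k)
                  for child in children[node]}
        best = max(scores, key=scores.get)
        if scores[best] == 0:
            break  # the read stays unclassified below this level
        path.append(best)
        node = best
    return path


if __name__ == "__main__":
    children = {"root": ["A", "B"], "A": ["A1", "A2"]}
    kmer_sets = {"A": {"ACG", "CGT"}, "B": {"TTT"},
                 "A1": {"ACG"}, "A2": {"CGT", "GTA"}}
    print(classify_down_tree("ACGTA", children, kmer_sets, 3))
```
</preformat>
<p>At each level the read is compared only with the children of the group chosen one level above, so the number of comparisons grows with the depth of the tree and the branching factor rather than with the total number of reference sequences.</p>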
<fig id="pone.0121453.g001" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0121453.g001</object-id>
<label>Fig 1</label>
<caption><title>The processing pipeline for metagenomic reads classification for a single rank.</title>
<p>To avoid cluttering the schema, the upper index <italic>j</italic>
, which indicates the <italic>j</italic>
-th level of taxonomic classification, is omitted from the symbols.</p>
</caption>
<graphic xlink:href="pone.0121453.g001"></graphic>
</fig>
<fig id="pone.0121453.g002" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0121453.g002</object-id>
<label>Fig 2</label>
<caption><title>Taxonomy tree-based classification.</title>
<p>Iterative execution of stage II (Classification) in <xref ref-type="fig" rid="pone.0121453.g001">Fig 1</xref>
.</p>
</caption>
<graphic xlink:href="pone.0121453.g002"></graphic>
</fig>
<p>The files with the input reads and reference sequences must be given in the FASTA format. The reference sequences and reads sometimes contain unknown nucleotides (Ns). The <italic>k</italic>
-mers with such symbols are skipped.</p>
</sec>
<sec id="sec007"><title>Database construction</title>
<p>Before classification, the reference sequences have to be grouped into <italic>n</italic>
 categories, with which we want to compare the metagenomic data. For example, the sequences can be grouped according to a phylum, so that a single group contains all the reference sequences belonging to Actinobacteria, Proteobacteria, Thermotogae, etc.</p>
<p>In order to classify the reads into a taxon, the nucleotide sequence database (nt data with entries from all traditional divisions of GenBank, EMBL, and DDBJ) has to be downloaded from the NCBI website. After that, the <italic>tax</italic>
 number (Taxonomic Identification ID) should be added to each reference sequence using the <italic>gi</italic>
 number (Sequence Identification ID). The <italic>tax</italic>
 number is necessary to categorize the sequences into groups. Hence, <italic>gi_taxid_nucl.dmp</italic>
 file, which contains the links between the <italic>gi</italic>
 and <italic>tax</italic>
 number, should also be downloaded from the NCBI website. This file is very large, so we created an auxiliary program to avoid loading the entire file into RAM. This program splits the input file into smaller ones, then each of them is read sequentially, and finally the program extracts information about the <italic>tax</italic>
 number. Detailed support on how to prepare the data is given in <ext-link ext-link-type="uri" xlink:href="https://github.com/jkawulok/CoMeta/blob/master/readme.txt">readme.txt</ext-link>
 file in the CoMeta package.</p>
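<p>The idea of extracting <italic>tax</italic> numbers without holding the whole mapping file in memory can be sketched as below. This is not the auxiliary program from the CoMeta package: it streams the file line by line instead of splitting it, and it assumes that each line holds a <italic>gi</italic> number and a <italic>tax</italic> number separated by whitespace. The function name and the sample values are made up for the example.</p>
<preformat>
```python
import io


def load_taxids(lines, wanted_gis):
    """Stream `gi taxid` lines, keeping only the GIs listed in `wanted_gis`."""
    mapping = {}
    for line in lines:
        gi, taxid = line.split()
        gi = int(gi)
        if gi in wanted_gis:
            mapping[gi] = int(taxid)
    return mapping


if __name__ == "__main__":
    # In practice `lines` would be an open file handle; a small in-memory
    # sample with invented values is used here.
    sample = io.StringIO("2\t9913\n3\t9913\n4\t9646\n")
    print(load_taxids(sample, {2, 4}))
```
</preformat>
<p>Only the entries for sequences that are actually present in the reference set are retained, so memory use depends on the number of reference sequences rather than on the size of the mapping file.</p>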
<p>The <italic>k</italic>
-mer database <italic>D</italic>
<sub><italic>i</italic>
</sub>
 for each group <italic>G</italic>
<sub><italic>i</italic>
</sub>
 is created using a parallel disk-based algorithm, which we derived from our earlier <italic>k</italic>
-mer counting software [<xref rid="pone.0121453.ref055" ref-type="bibr">55</xref>
]. First, every reference sequence from the group is scanned symbol by symbol to extract all <italic>k</italic>
-mers. Subsequently, the <italic>k</italic>
-mers are collected and sorted lexicographically. This makes it possible to create the set of all <italic>k</italic>
-mers, occurring at least once in the reference sequences (after sorting, the repeating <italic>k</italic>
-mers are at adjacent positions, so we can store only a single copy of each one).</p>
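<p>The extract, sort and deduplicate procedure described above can be sketched in memory as follows. The actual CoMeta database is built by a parallel disk-based algorithm; this simplified version only shows the logic, and the function names are hypothetical. Following the remark on unknown nucleotides, <italic>k</italic>-mers containing symbols other than A, C, G and T are skipped.</p>
<preformat>
```python
def extract_kmers(seq, k):
    """Yield every k-mer of `seq` that contains only A, C, G or T."""
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if set(kmer).issubset("ACGT"):
            yield kmer


def build_kmer_list(sequences, k):
    """Collect, sort and deduplicate the k-mers of a group of sequences."""
    collected = sorted(kmer for seq in sequences
                       for kmer in extract_kmers(seq, k))
    unique = []
    for kmer in collected:
        # After sorting, repeated k-mers are adjacent, so comparing with the
        # last stored entry is enough to keep a single copy.
        if not unique or unique[-1] != kmer:
            unique.append(kmer)
    return unique


if __name__ == "__main__":
    print(build_kmer_list(["ACGTNACG", "CGTA"], 3))
```
</preformat>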
<p>The database is stored to the disk in a compact way (compact database). Each nucleotide is encoded using 2 bits. Instead of writing whole <italic>k</italic>
-mers to the file, the <italic>k</italic>
-mers sharing a common prefix are broken down into two parts, i.e., a four-nucleotide prefix and a suffix, thus, each suffix is saved on 2(<italic>k</italic>
−4) bits. The prefix is written once, and it is followed by a list of the suffixes with the number of occurrences of each.</p>
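<p>The compact layout can be illustrated with a short sketch. It assumes the mapping A=0, C=1, G=2, T=3 for the 2-bit code (the paper does not state which mapping is used), stores the groups in a dictionary instead of a file, and leaves out the occurrence counts. The function names are hypothetical.</p>
<preformat>
```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit mapping


def encode(seq):
    """Pack a nucleotide string into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = value * 4 + CODE[base]
    return value


def group_by_prefix(sorted_kmers, prefix_len=4):
    """Map each encoded prefix to the encoded suffixes that follow it."""
    groups = {}
    for kmer in sorted_kmers:
        prefix, suffix = kmer[:prefix_len], kmer[prefix_len:]
        groups.setdefault(encode(prefix), []).append(encode(suffix))
    return groups


if __name__ == "__main__":
    print(group_by_prefix(["ACGTAA", "ACGTAC", "TTTTGG"]))
```
</preformat>
<p>With a four-nucleotide prefix, every stored suffix takes 2(<italic>k</italic>−4) bits, and the 8 bits of the prefix are written once per group rather than once per <italic>k</italic>-mer.</p>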
<p>For classification purposes, CoMeta mainly uses two lists: 1) a buffer that contains the sorted suffixes (stored on 1 byte) remaining after the eight-nucleotide prefix is cut off; 2) a list of 65,536 (= 4<sup>8</sup>
) entries indicating where the list of suffixes for each prefix begins. These lists are built at the beginning of the classification process. However, in order to accelerate the loading of the database during the classification (which is crucial if the same database is used many times), the compact database can be converted into a slightly larger file (the non-compact database), which contains, among other data, the two lists. This file is loaded into the program once, and the size of this file is equal to the size of the memory that the <italic>k</italic>
-mer database occupies during the classification.</p>
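<p>How the two lists support a lookup can be sketched as below. This is an illustration, not CoMeta's data structure: a short prefix is used in the example instead of eight nucleotides, the offset list carries one extra sentinel entry so that every prefix has an explicit end, the 2-bit mapping A=0, C=1, G=2, T=3 is assumed, and the search inside a prefix block uses Python's <monospace>bisect</monospace> module. The function names are hypothetical.</p>
<preformat>
```python
from bisect import bisect_left

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit mapping


def encode(seq):
    """Pack a nucleotide string into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = value * 4 + CODE[base]
    return value


def build_index(sorted_kmers, prefix_len):
    """Return (suffixes, starts); the suffixes of prefix p are stored in
    suffixes[starts[p]:starts[p + 1]]."""
    suffixes = []
    counts = [0] * (4 ** prefix_len)
    for kmer in sorted_kmers:
        counts[encode(kmer[:prefix_len])] += 1
        suffixes.append(encode(kmer[prefix_len:]))
    starts = [0]
    for count in counts:
        starts.append(starts[-1] + count)
    return suffixes, starts


def contains(kmer, suffixes, starts, prefix_len):
    """Check whether `kmer` is in the index using a binary search."""
    prefix = encode(kmer[:prefix_len])
    lo, hi = starts[prefix], starts[prefix + 1]
    target = encode(kmer[prefix_len:])
    pos = bisect_left(suffixes, target, lo, hi)
    return hi > pos and suffixes[pos] == target


if __name__ == "__main__":
    kmers = sorted(["ACGT", "ACGA", "TTAA", "ACTT"])
    suffixes, starts = build_index(kmers, 2)
    print(contains("ACGT", suffixes, starts, 2))
```
</preformat>
<p>The prefix selects a block of the suffix buffer in constant time, and only that block is searched, which is why the <italic>k</italic>-mers must be sorted lexicographically before the index is built.</p>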
</sec>
<sec id="sec008"><title>Classification</title>
<p>As mentioned earlier, the classification of the reads at a single level <italic>j</italic>
 (e.g., the order) consists of two steps: comparison and assignment. In the following subsections on the taxonomic classification, these steps are described for the <italic>j</italic>
-th taxonomic rank. In order to avoid obfuscating the notation, the upper index <italic>j</italic>
 is omitted.</p>
<sec id="sec009"><title>Comparison step</title>
<p>In the comparison step, the set of reads <italic>M</italic>
 is compared with all <italic>n</italic>
<italic>k</italic>
-mer databases that have been created beforehand. Each database <italic>D</italic>
<sub><italic>i</italic>
</sub>
 is loaded into RAM. When comparing each read from <italic>M</italic>
 against <italic>G</italic>
<sub><italic>i</italic>
</sub>
 group (1 ≤ <italic>i</italic>
 ≤ <italic>n</italic>
), the match score (<italic>ξ</italic>
) is obtained by accumulating the similarity between the <italic>k</italic>
-mers extracted from the read and from the reference sequences in <italic>G</italic>
<sub><italic>i</italic>
</sub>
. For a given read <italic>R</italic>
, the successive <italic>k</italic>
-mers are obtained using the 1-base sliding window. All possible subsequent <italic>k</italic>
-mers from <italic>R</italic>
 are checked for occurrence in <italic>D</italic>
<sub><italic>i</italic>
</sub>
. For each <italic>j</italic>
-th <italic>k</italic>
-mer of <italic>R</italic>
 found in <italic>D</italic>
<sub><italic>i</italic>
</sub>
, the match score <italic>ξ</italic>
 is increased by <italic>ξ</italic>
<sub><italic>j</italic>
</sub>
, which is the number of bases in the <italic>k</italic>
-mer that have not yet contributed to the match score (i.e., <inline-formula id="pone.0121453.e010"><mml:math id="M10"><mml:mrow><mml:msub><mml:mi>ξ</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>−</mml:mo>
<mml:msub><mml:mi mathvariant="fraktur">o</mml:mi>
<mml:mi mathvariant="fraktur">b</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
, where <inline-formula id="pone.0121453.e011"><mml:math id="M11"><mml:mrow><mml:msub><mml:mi mathvariant="fraktur">o</mml:mi>
<mml:mi mathvariant="fraktur">b</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
 is the number of overlapping bases between the <italic>j</italic>
-th <italic>k</italic>
-mer and the previous <italic>k</italic>
-mer found in <italic>D</italic>
<sub><italic>i</italic>
</sub>
). Due to the 1-base sliding window, two subsequent read <italic>k</italic>
-mers have <italic>k</italic>
−1 overlapping bases, and the intention is to avoid inflating the match score when both exist in <italic>D</italic>
<sub><italic>i</italic>
</sub>
. The number of the overlapping bases between the <italic>p</italic>
-th and <italic>q</italic>
-th <italic>k</italic>
-mers (<italic>p</italic>
 &lt; <italic>q</italic>
) is <inline-formula id="pone.0121453.e012"><mml:math id="M12"><mml:msub><mml:mi mathvariant="fraktur">o</mml:mi>
<mml:mi mathvariant="fraktur">b</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi>max</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo>,</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
</mml:math>
</inline-formula>
. An example of how the match score is calculated is presented in <xref ref-type="fig" rid="pone.0121453.g003">Fig 3</xref>
 for <italic>k</italic>
 = 5. For simplicity, we assume that <italic>G</italic>
<sub><italic>i</italic>
</sub>
 group contains only one reference sequence, <italic>S</italic>
. In the “<italic>k</italic>
-mers” column, the <italic>k</italic>
-mers that occur in the query read are sequentially listed. Those <italic>k</italic>
-mers, which are found in <italic>D</italic>
<sub><italic>i</italic>
</sub>
 database, are marked in bold (a sorted list of the <italic>k</italic>
-mers from <italic>G</italic>
<sub><italic>i</italic>
</sub>
 group is shown in the left part of the figure). The final match score for the sample read is 12, and the match rate score (Ξ), which is the percentage ratio of the match score to the read length, is 85.7% (Ξ = 12/14 ⋅ 100%). For better illustration, the sequence matching is also shown at the top of the figure.</p>
<fig id="pone.0121453.g003" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0121453.g003</object-id>
<label>Fig 3</label>
<caption><title>An example of comparing the query read with the reference sequence.</title>
</caption>
<graphic xlink:href="pone.0121453.g003"></graphic>
</fig>
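<p>The scoring rule can be written as a short Python sketch. The read and the <italic>k</italic>-mer set below are hypothetical (they are not the sequences of Fig 3), chosen so that the read also has length 14 and <italic>k</italic> = 5; the function names are assumptions, and a Python set stands in for the <italic>k</italic>-mer database.</p>
<preformat>
```python
def match_score(read, database, k):
    """Sum, over k-mers found in the database, the bases not yet counted."""
    score = 0
    previous = None  # start position of the previous k-mer found
    for start in range(len(read) - k + 1):
        if read[start:start + k] not in database:
            continue
        if previous is None:
            overlap = 0
        else:
            # Overlap between the p-th and q-th k-mers: max(k - q + p, 0).
            overlap = max(k - (start - previous), 0)
        score += k - overlap
        previous = start
    return score


def match_rate(read, database, k):
    """Match score as a percentage of the read length."""
    return 100.0 * match_score(read, database, k) / len(read)


if __name__ == "__main__":
    database = {"AAACC", "AACCC", "GGTTT", "GTTTA"}  # hypothetical k-mers
    read = "AAACCCGGGTTTAC"                          # hypothetical read
    print(match_score(read, database, 5))
    print(round(match_rate(read, database, 5), 1))
```
</preformat>
<p>In this example the <italic>k</italic>-mers found at positions 0, 1, 7, and 8 contribute 5, 1, 5, and 1 bases, respectively, which gives a match score of 12 and a match rate of 85.7%.</p>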
<p>In order to quickly decide whether a read can obtain a significant score for each group <italic>G</italic>
<sub><italic>i</italic>
</sub>
, we perform simple filtering. We use the <italic>k</italic>
′-base offset sliding window to scan the query read (1 &lt; <italic>k</italic>
′ ≤ <italic>k</italic>
, for <italic>k</italic>
′ = 1 this step is skipped). If none of these <italic>k</italic>
-mers exists in <italic>D</italic>
<sub><italic>i</italic>
</sub>
, we skip scoring the read against <italic>D</italic>
<sub><italic>i</italic>
</sub>
. <italic>R</italic>
 is pre-assigned to the <italic>G</italic>
<sub><italic>i</italic>
</sub>
 group, if it (or its reverse complement) accumulates a match rate score exceeding a chosen match cut-off value (<italic>MC</italic>
). After the comparisons, for each group, we obtain two output files with the preliminary assignments, namely: 1) the <italic>match</italic>
 file that contains the reads, which accumulated a sufficient match rate score (Ξ ≥ <italic>MC</italic>
), and 2) the <italic>mismatch</italic>
 file which contains the remaining reads. Thus, 2<italic>n</italic>
 output intermediary assignment files are obtained after the first step of the classification. These files do not contain the nucleotide sequences, but only the single-line description of each read in the FASTA format, along with the obtained match scores. The corresponding nucleotide sequences are added after completing the classification stage.</p>
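<p>The filtering and pre-assignment logic can be sketched as below. This is an illustration under assumptions: the filter is applied to each strand separately, the threshold test uses Ξ ≥ <italic>MC</italic>, and the names <italic>passes_filter</italic>, <italic>preassign</italic>, and <italic>step</italic> (standing for <italic>k</italic>′) are hypothetical. File handling is omitted.</p>
<preformat>
```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(read):
    return "".join(COMPLEMENT[base] for base in reversed(read))


def match_rate(read, database, k):
    """Match rate in percent, counting each matched base once."""
    score, previous = 0, None
    for start in range(len(read) - k + 1):
        if read[start:start + k] in database:
            overlap = 0 if previous is None else max(k - (start - previous), 0)
            score += k - overlap
            previous = start
    return 100.0 * score / len(read)


def passes_filter(read, database, k, step):
    """True if any k-mer sampled every `step` bases occurs in the database."""
    if step == 1:
        return True  # filtering is skipped; the read is always scored
    return any(read[s:s + k] in database
               for s in range(0, len(read) - k + 1, step))


def preassign(read, database, k, step, mc):
    """Return ('match' or 'mismatch', best match rate over both strands)."""
    best = 0.0
    for strand in (read, reverse_complement(read)):
        if passes_filter(strand, database, k, step):
            best = max(best, match_rate(strand, database, k))
    return ("match" if best >= mc else "mismatch"), best


if __name__ == "__main__":
    database = {"AAACC", "AACCC", "GGTTT", "GTTTA"}  # hypothetical k-mers
    print(preassign("AAACCCGGGTTTAC", database, k=5, step=5, mc=30))
```
</preformat>
<p>Because only every <italic>k</italic>′-th <italic>k</italic>-mer is probed, the filter is a heuristic: it trades a small risk of skipping a read that would have scored for fewer database lookups.</p>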
<p>The idea of this step is similar to that used in the FACS algorithm. However, in FACS, Bloom filters, which have a limited capacity, are used to store the <italic>k</italic>
-mers. For each reference sequence, a separate Bloom filter is created. In addition, long sequences (≳ 200 Mbp) have to be split into a few subsequences, and then Bloom filters are created separately for each of them. Furthermore, the use of Bloom filters may produce false-positive <italic>k</italic>
-mer matches. In FACS (the Perl implementation), the reads that have been classified as belonging to some reference sequence are withdrawn from further querying (the sequences are analyzed in some arbitrary order). This approach may classify a read to an incorrect reference if its match score exceeds the cut-off value for more than one reference sequence and the correct one is not examined first.</p>
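<p>To make the false-positive issue concrete, the following is a generic Bloom filter sketch in Python. It is not the FACS implementation: the hashing scheme (SHA-256 with a seed prefix), the class name, and the parameters are assumptions chosen for illustration.</p>
<preformat>
```python
import hashlib


class BloomFilter:
    """Generic Bloom filter: no false negatives, possible false positives."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


if __name__ == "__main__":
    roomy = BloomFilter(n_bits=1024, n_hashes=3)
    roomy.add("AAACC")
    print("AAACC" in roomy)   # an inserted k-mer is always reported

    tiny = BloomFilter(n_bits=1, n_hashes=1)
    tiny.add("AAACC")
    print("TTTTT" in tiny)    # a saturated filter reports k-mers never added
```
</preformat>
<p>An exact structure, such as a sorted <italic>k</italic>-mer list, never reports a <italic>k</italic>-mer that was not inserted, whereas a Bloom filter does so with a probability that grows as the filter fills up.</p>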
</sec>
<sec id="sec010"><title>Assignment to the best group</title>
<p>The second step of the classification stage analyzes the intermediary assignment files, and the query read is classified to the group for which the match score (<italic>ξ</italic>
) is the highest. When multiple groups obtain the same highest match score, the read can be assigned to: 1) all of these groups; 2) any group; 3) a random group.</p>
<p>To increase the sensitivity of our method, in this step not only <italic>match</italic>
 but also <italic>mismatch</italic>
 files can be used. Using the latter, a larger percentage of reads is classified, but in some cases this is achieved at the expense of precision. When the <italic>mismatch</italic>
 file is taken into account, the read is classified to the group with the highest match score, even if its match rate score is below <italic>MC</italic>
. However, the match must contain at least one matching <italic>k</italic>
-mer (<italic>ξ</italic>
 ≥ <italic>k</italic>
).</p>
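<p>The assignment step can be sketched as follows. The input format (a dictionary mapping a group name to its match score and match rate), the function name <italic>assign</italic>, and the parameter names are assumptions. Only two tie-handling policies are shown (all tied groups, or one chosen at random).</p>
<preformat>
```python
import random


def assign(scores, k, mc, use_mismatch=False, tie="all", rng=random):
    """scores maps group name -> (match score, match rate in percent)."""
    if use_mismatch:
        # Any group with at least one matching k-mer (score >= k) is eligible.
        eligible = {g: s for g, (s, r) in scores.items() if s >= k}
    else:
        # Only groups from the match files (rate >= MC) are eligible.
        eligible = {g: s for g, (s, r) in scores.items() if r >= mc}
    if not eligible:
        return []  # the read stays unclassified
    top = max(eligible.values())
    winners = sorted(g for g, s in eligible.items() if s == top)
    if tie == "random":
        return [rng.choice(winners)]
    return winners


if __name__ == "__main__":
    scores = {  # hypothetical values for one read
        "Proteobacteria": (12, 85.7),
        "Firmicutes": (12, 85.7),
        "Cyanobacteria": (5, 35.7),
    }
    print(assign(scores, k=5, mc=30))
    print(assign({"Chlamydiae": (5, 20.0)}, k=5, mc=30))
    print(assign({"Chlamydiae": (5, 20.0)}, k=5, mc=30, use_mismatch=True))
```
</preformat>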
<p>After this step, the classification is completed for a single taxonomic rank, and <italic>f</italic>
+1 output files (<italic>F</italic>
) are obtained. Apart from the classified reads, those reads which have not been assigned to any group, are stored in the additional <italic>F</italic>
<sub><italic>nc</italic>
</sub>
 file. The number <italic>f</italic>
 is equal to the number of groups, to which the reads from <italic>M</italic>
 were classified. For the groups without any reads preassigned, the files are not generated at all (hence, <italic>f</italic>
 ≤ <italic>n</italic>
).</p>
<p>For classification to a lower rank, the classification stage has to be repeated, as described in the following subsection.</p>
</sec>
</sec>
<sec id="sec011"><title>Taxonomy tree-based classification</title>
<p>Our taxonomic classification method starts from some high taxonomic rank, and then, if necessary, classifies reads to the lower levels. The search may start from the superkingdom rank; however, because the sequence collections at that rank are very large and span various groups, we suggest beginning from the phylum rank.</p>
<p>For the <italic>j</italic>
-th taxonomic rank, each read is compared to <italic>n</italic>
<sup><italic>j</italic>
</sup>
 groups and it is classified to that group (<inline-formula id="pone.0121453.e002"><mml:math id="M2"><mml:mrow><mml:msubsup><mml:mi>G</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>
), for which the match score is the highest. Next, the read is compared with those groups at a lower rank (<italic>j</italic>
+1), which are subgroups of <inline-formula id="pone.0121453.e003"><mml:math id="M3"><mml:mrow><mml:msubsup><mml:mi>G</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>
 (<inline-formula id="pone.0121453.e004"><mml:math id="M4"><mml:mrow><mml:msubsup><mml:mi>G</mml:mi>
<mml:mi>i</mml:mi>
<mml:mrow><mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo>⊆</mml:mo>
<mml:msubsup><mml:mi>G</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>
, 1 ≤ <italic>i</italic>
 ≤ <italic>n</italic>
<sup><italic>j</italic>
+1</sup>
). <xref ref-type="fig" rid="pone.0121453.g002">Fig 2</xref>
 shows the taxonomy tree-based classification scheme with an example of the classification path (solid lines). The gray shading indicates the set of reference sequences to which a query read was classified (<inline-formula id="pone.0121453.e005"><mml:math id="M5"><mml:mrow><mml:msubsup><mml:mi>G</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>
). Only the six basic taxonomic ranks are presented in the tree; however, the process may include other ranks, such as subphylum or superclass.</p>
<p>During the classification of the <italic>M</italic>
<sup><italic>j</italic>
</sup>
 set of reads (at the <italic>j</italic>
-th taxonomic rank), the files <inline-formula id="pone.0121453.e006"><mml:math id="M6"><mml:mrow><mml:mo stretchy="false">{</mml:mo>
<mml:msubsup><mml:mi>F</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
 (<italic>i</italic>
 = 1,2,…,<italic>f</italic>
<sup><italic>j</italic>
</sup>
) are obtained, each of which contains the reads assigned to a particular <italic>i</italic>
-th group. In the classification at the next level (<italic>j</italic>
+1), the output file from the previous step (<italic>F</italic>
<sub><italic>i</italic>
</sub>
) is used as the input file, i.e., <italic>M</italic>
<sup><italic>j</italic>
+1</sup>
 = <italic>F</italic>
<sub><italic>i</italic>
</sub>
.</p>
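<p>The top-down descent can be illustrated with the sketch below. It follows a single read through a toy tree, whereas the method described here processes whole files of reads at each rank. The nested-dictionary tree, the group contents, and the function names are assumptions made for illustration.</p>
<preformat>
```python
def match_score(read, database, k):
    """Match score counting each matched base once."""
    score, previous = 0, None
    for start in range(len(read) - k + 1):
        if read[start:start + k] in database:
            overlap = 0 if previous is None else max(k - (start - previous), 0)
            score += k - overlap
            previous = start
    return score


def classify_down(read, node, k, path=()):
    """Follow the best-scoring subgroup at each rank; return the path taken."""
    best_name, best_score = None, 0
    for name, child in node["children"].items():
        score = match_score(read, child["kmers"], k)
        if score > best_score:
            best_name, best_score = name, score
    if best_name is None:
        return list(path)  # no subgroup matched; stop at the current rank
    # Only the subgroups of the winning group are examined at the next rank.
    return classify_down(read, node["children"][best_name], k,
                         path + (best_name,))


if __name__ == "__main__":
    tree = {"children": {  # hypothetical two-rank taxonomy
        "Proteobacteria": {
            "kmers": {"AAACC", "AACCC", "GGTTT"},
            "children": {
                "Gammaproteobacteria": {"kmers": {"AAACC", "AACCC"},
                                        "children": {}},
                "Alphaproteobacteria": {"kmers": {"GGTTT"}, "children": {}},
            },
        },
        "Firmicutes": {"kmers": {"CCGGG"}, "children": {}},
    }}
    print(classify_down("AAACCCGGGTTTAC", tree, k=5))
```
</preformat>
<p>Restricting each step to the subgroups of the previous winner is what reduces the number of comparisons relative to scoring the read against every reference group.</p>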
</sec>
</sec>
<sec id="sec012"><title>Results and Discussion</title>
<sec id="sec013"><title>Implementation and test setup</title>
<p>The algorithms proposed in this paper are implemented in C++. The only exception is the tool that groups the reference sequences according to taxonomic rank, which is implemented in Perl and based on the Perl module <italic>Bio::LITE::Taxonomy::NCBI</italic>
 from the <italic>Comprehensive Perl Archive Network</italic>. The CoMeta package contains programs for the following tasks:
<list list-type="order"><list-item><p>Adding the <italic>tax</italic>
 number to the single-line description of each reference sequence.</p>
</list-item>
<list-item><p>Building <italic>k</italic>
-mer databases.</p>
</list-item>
<list-item><p>Two steps of the classification.</p>
</list-item>
</list>
The package and documentation are freely available at <ext-link ext-link-type="uri" xlink:href="https://github.com/jkawulok/cometa">https://github.com/jkawulok/cometa</ext-link>
; all the data used in this paper are available at <ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.7910/DVN/29265">http://dx.doi.org/10.7910/DVN/29265</ext-link>
.</p>
<p>The experiments were conducted on a computer equipped with a 12-core Intel Xeon processor clocked at 2.67 GHz and 96 GB of RAM.</p>
<p>CoMeta is a similarity search method, thus we compare it with four other programs from this category. We also examine LMAT and Kraken, which are hybrids of composition-based and similarity search methods also using <italic>k</italic>
-mers.</p>
<p>The experiments are divided into two major parts. In the first one, our program was compared to FACS and each read was classified directly to a single reference sequence. This means that each <italic>group</italic>
 (cf. <xref ref-type="fig" rid="pone.0121453.g001">Fig 1</xref>
) contained only one reference sequence (e.g., <italic>group</italic>
 = ‘Escherichia coli str. K-12 substr. DH10B’) and there was only one level. In the second part of our experiments, the reads were classified to the taxonomic ranks, thus the <italic>level</italic>
 was taxonomic rank and the <italic>group</italic>
 was one of the groups at the taxonomic rank, e.g., <italic>level</italic>
 = ‘phylum’ and <italic>group</italic>
 = ‘proteobacteria’. The classification results for CARMA (command line version 3.0), MEGAN (4.61.5), MG-RAST (3.0), and MetaPhyler (1.13) were taken from the paper by Bazinet and Cummings [<xref rid="pone.0121453.ref056" ref-type="bibr">56</xref>
] because of their long computation time (approximately 34,000 CPU hours in total). The experiments for LMAT (1.2.1) and Kraken (0.10.4b) were run by us.</p>
<p>We assessed the quality of the read classification taking into account the following criteria:
<list list-type="bullet"><list-item><p><bold>Time</bold>
: CPU classification time.</p>
</list-item>
<list-item><p><bold>Memory</bold>
: the maximal memory usage during the classification.</p>
</list-item>
<list-item><p><bold>Classified</bold>
: the overall percentage of reads that were classified (<inline-formula id="pone.0121453.e007"><mml:math id="M7"><mml:mrow><mml:mfrac><mml:mrow><mml:mtext mathvariant="italic">TP</mml:mtext>
<mml:mo>+</mml:mo>
<mml:mtext mathvariant="italic">FP</mml:mtext>
</mml:mrow>
<mml:mtext mathvariant="italic">all</mml:mtext>
</mml:mfrac>
</mml:mrow>
</mml:math>
</inline-formula>
), where <italic>TP</italic>
 and <italic>FP</italic>
 are numbers of correctly and incorrectly classified reads, respectively.</p>
</list-item>
<list-item><p><bold>Sensitivity</bold>
: the fraction of the correctly classified reads (<inline-formula id="pone.0121453.e008"><mml:math id="M8"><mml:mrow><mml:mfrac><mml:mtext mathvariant="italic">TP</mml:mtext>
<mml:mtext mathvariant="italic">all</mml:mtext>
</mml:mfrac>
</mml:mrow>
</mml:math>
</inline-formula>
).</p>
</list-item>
<list-item><p><bold>Precision</bold>
: the percentage of correctly classified reads among all classified reads (<inline-formula id="pone.0121453.e009"><mml:math id="M9"><mml:mrow><mml:mfrac><mml:mtext mathvariant="italic">TP</mml:mtext>
<mml:mrow><mml:mtext mathvariant="italic">TP</mml:mtext>
<mml:mo>+</mml:mo>
<mml:mtext mathvariant="italic">FP</mml:mtext>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</inline-formula>
).</p>
</list-item>
</list>
</p>
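<p>The three accuracy criteria follow directly from the counts <italic>TP</italic>, <italic>FP</italic>, and <italic>all</italic>. The sketch below uses hypothetical counts; the function name <italic>metrics</italic> is an assumption.</p>
<preformat>
```python
def metrics(tp, fp, total):
    """Return (classified, sensitivity, precision), each in percent."""
    classified = 100.0 * (tp + fp) / total
    sensitivity = 100.0 * tp / total
    # Precision is undefined when nothing was classified; report 0 then.
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    return classified, sensitivity, precision


if __name__ == "__main__":
    # Hypothetical counts: 100 reads, 90 classified correctly, 5 incorrectly.
    classified, sensitivity, precision = metrics(tp=90, fp=5, total=100)
    print(classified, sensitivity, round(precision, 2))
```
</preformat>
<p>By these definitions, sensitivity equals the classified percentage multiplied by the precision (divided by 100), which gives a quick consistency check on reported results.</p>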
</sec>
<sec id="sec014"><title>Datasets</title>
<p>The experiments were made for the following datasets:
<list list-type="order"><list-item><p><italic>FACS 269 bp</italic>
—simulated 454 metagenomic dataset containing 100,000 reads of an average length 269 bp. This dataset was proposed by Stranneheim et al. [<xref rid="pone.0121453.ref049" ref-type="bibr">49</xref>
] and we downloaded it from the FACS website. The reads come from 17 bacterial genomes (from four different phyla), three archaeal genomes (from two different phyla), three viral genomes, and two human chromosomes. After removing reads containing more than 50% unknown nucleotides, a dataset of 93,653 reads was obtained, which we call <italic>reduced FACS 269 bp</italic>
.</p>
</list-item>
<list-item><p><italic>MetaPhyler 300 bp</italic>
—simulated metagenomic dataset containing 73,086 reads of length 300 bp. This dataset, proposed by Liu et al. [<xref rid="pone.0121453.ref047" ref-type="bibr">47</xref>
], was obtained from 31 phylogenetic markers. Unfortunately, some reads carried no information about their origin, so it would be impossible to verify whether they were classified correctly; we therefore filtered them out. The remaining 66,841 reads were used in our experiments. The reads are derived from organisms belonging to 17 different phyla. The majority originate from Proteobacteria (51%) and Firmicutes (21%).</p>
</list-item>
<list-item><p><italic>CARMA 265 bp</italic>
—simulated 454 metagenomic dataset containing 25,000 reads of an average length 265 bp. This dataset was proposed by Gerlach and Stoye [<xref rid="pone.0121453.ref045" ref-type="bibr">45</xref>
]. We downloaded it from WebCARMA website. The distribution of the reads in the bacterial phyla is: Proteobacteria—73.02%; Firmicutes—12.92%; Cyanobacteria—7.83%; Actinobacteria—5.22%; Chlamydiae—1.01%.</p>
</list-item>
<list-item><p><italic>PhyloPythia 961 bp</italic>
—dataset containing 124,941 random reads of an average length 961 bp from 113 isolate microbial genomes, proposed by Patil et al. [<xref rid="pone.0121453.ref037" ref-type="bibr">37</xref>
]. Some reads are repeated in this dataset and only 114,457 reads are unique. The majority of them (81%) come from Proteobacteria. These reads were classified to the genus rank (Rhodopseudomonas—21.00%; Bradyrhizobium—20.06%; Xylella—9.16%; the rest—each one below 6%).</p>
</list-item>
<list-item><p><italic>HiSeq 92 bp</italic>
—dataset containing 10,000 reads of an average length 92 bp, proposed by Wood and Salzberg [<xref rid="pone.0121453.ref051" ref-type="bibr">51</xref>
]. It was built using 20 sets of bacterial whole-genome shotgun reads generated by the Illumina HiSeq sequencing platform.</p>
</list-item>
<list-item><p><italic>MiSeq 156 bp</italic>
—dataset containing 10,000 reads of an average length 156 bp, proposed by Wood and Salzberg [<xref rid="pone.0121453.ref051" ref-type="bibr">51</xref>
]. It was built using 20 sets of bacterial whole-genome shotgun reads generated by the Illumina MiSeq sequencing platform.</p>
</list-item>
</list>
</p>
<p>The 2nd–6th datasets contain reads from bacterial genomes only. Both <italic>FACS 269 bp</italic>
 and <italic>reduced FACS 269 bp</italic>
 datasets also contain reads from human, viral, and archaeal species.</p>
</sec>
<sec id="sec015"><title>Experiment One</title>
<p>In the first experiment, we compared CoMeta with the FACS 2.1 algorithm implemented in Perl [<xref rid="pone.0121453.ref049" ref-type="bibr">49</xref>
], and with FACS implemented in C. We tried to reproduce the results reported by Stranneheim et al. [<xref rid="pone.0121453.ref049" ref-type="bibr">49</xref>
] (FACS in Perl). Unfortunately, we obtained different scores, despite using their scripts, the same set of parameters, and the same set of 25 reference sequences.</p>
<p>Stranneheim et al. verified false positives using MegaBLAST for <italic>k</italic>
-mer lengths equal to 17, 21, 25, and 35. To speed up this process, we constructed a homology map for comparing reads to the reference sequences. Assuming the same criteria as in FACS, if a read obtains 500 hits with E-values &lt; 10<sup>−50</sup>
 using MegaBLAST, then it is considered a homologue. In this way, the classification results can be quickly checked for large sets of false positives, such as those created for short <italic>k</italic>
-mers. The resulting map contained 17 homologues.</p>
<p>As discussed in the previous section, the <italic>FACS 269 bp</italic>
 dataset includes many reads that consist mostly of unknown nucleotides. Therefore, to provide a fair comparison, we removed them and used the <italic>reduced FACS 269 bp</italic> dataset. The comparison was performed using the following variants of FACS and CoMeta:
<list list-type="order"><list-item><p><italic>FACS-P</italic>
: the FACS 2.1 algorithm in Perl. The false-positive probability parameter (<italic>p</italic>
<sub><italic>f</italic>
</sub>
) of the Bloom filter (used by FACS) was set to 0.0005.</p>
</list-item>
<list-item><p><italic>FACS-C</italic>
: the FACS algorithm in C, whose sources were downloaded on 5 February 2014 from <ext-link ext-link-type="uri" xlink:href="https://github.com/SciLifeLab/facs">https://github.com/SciLifeLab/facs</ext-link>
. The reads are classified to each reference sequence to which their similarity is higher than <italic>MC</italic>
. The false-positive probability parameter of the Bloom filter was set to the same value as for <italic>FACS-P</italic>
.</p>
</list-item>
<list-item><p><italic>pre-CoMeta</italic>
: Only the comparison step of the CoMeta algorithm (without assignment). This strategy is similar to the one implemented in <italic>FACS-C</italic>
.</p>
</list-item>
<list-item><p><italic>CoMeta</italic>
: The complete proposed algorithm, which compares a read with all reference sequences and assigns it to the best-scoring one (presented in <xref ref-type="fig" rid="pone.0121453.g001">Fig 1</xref>
).</p>
</list-item>
</list>
</p>
<p><italic>FACS-P</italic>
, <italic>FACS-C</italic>
, and <italic>pre-CoMeta</italic>
 were run using various values of <italic>k</italic>
 and <italic>MC</italic>
. In <italic>CoMeta</italic>
, we used <italic>MC</italic>
 = 30% in the “comparison” step, and then the reads were classified to the reference sequence according to the highest score. When <italic>FACS-P</italic>
 classifies a read to some <italic>G</italic>
<sub><italic>i</italic>
</sub>
-th reference sequence, it does not compare the read with any further reference sequence (<italic>G</italic>
<sub><italic>i</italic>
+<italic>j</italic>
</sub>
, <italic>j</italic>
 > 0). Since in <italic>FACS-C</italic>
 and <italic>pre-CoMeta</italic>
 the reads are compared with each reference sequence, their FP values can be larger than for <italic>FACS-P</italic>
.</p>
<p>In <xref rid="pone.0121453.t002" ref-type="table">Table 2</xref>
, we report the best classification results obtained using the four aforementioned methods. The results for <italic>CoMeta</italic>
 were obtained taking into account the <italic>mismatch</italic>
 files. When we stopped the algorithm after the “comparison” step (<italic>pre-CoMeta</italic>
), the sensitivity was the highest but, unfortunately, at the expense of a large number of false positives. <italic>pre-CoMeta</italic>
 gave a slightly better precision score than <italic>FACS-C</italic>
. The precision is high for <italic>FACS-P</italic>
; however, the sensitivity is the lowest here. In general, the best results were obtained by <italic>CoMeta</italic>
, which was able to classify almost every read with a small number of false positives.</p>
<table-wrap id="pone.0121453.t002" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0121453.t002</object-id>
<label>Table 2</label>
<caption><title>Comparison of FACS algorithms with CoMeta.</title>
</caption>
<alternatives><graphic id="pone.0121453.t002g" xlink:href="pone.0121453.t002"></graphic>
<table frame="box" rules="all" border="0"><colgroup span="1"><col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
</colgroup>
<thead><tr><th align="left" rowspan="1" colspan="1"><italic>k</italic>
</th>
<th align="left" rowspan="1" colspan="1"><italic>MC</italic>
</th>
<th align="left" rowspan="1" colspan="1">Sensitivity</th>
<th align="left" rowspan="1" colspan="1">Precision</th>
<th align="left" rowspan="1" colspan="1">Classified</th>
<th align="left" rowspan="1" colspan="1"><italic>t</italic>
</th>
</tr>
<tr><th align="left" rowspan="1" colspan="1"></th>
<th align="left" rowspan="1" colspan="1">[%]</th>
<th align="left" rowspan="1" colspan="1">[%]</th>
<th align="left" rowspan="1" colspan="1">[%]</th>
<th align="left" rowspan="1" colspan="1">[%]</th>
<th align="left" rowspan="1" colspan="1">[hh:mm:ss]</th>
</tr>
</thead>
<tbody><tr><td colspan="6" align="left" rowspan="1"><italic>FACS-P</italic>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">18</td>
<td align="left" rowspan="1" colspan="1">80</td>
<td align="char" char="." rowspan="1" colspan="1">97.62</td>
<td align="char" char="." rowspan="1" colspan="1">97.86</td>
<td align="char" char="." rowspan="1" colspan="1">99.76</td>
<td align="left" rowspan="1" colspan="1">00:03:14</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">21</td>
<td align="left" rowspan="1" colspan="1">65</td>
<td align="char" char="." rowspan="1" colspan="1">97.86</td>
<td align="char" char="." rowspan="1" colspan="1">98.08</td>
<td align="char" char="." rowspan="1" colspan="1">99.78</td>
<td align="left" rowspan="1" colspan="1">00:02:49</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">21</td>
<td align="left" rowspan="1" colspan="1">70</td>
<td align="char" char="." rowspan="1" colspan="1">97.82</td>
<td align="char" char="." rowspan="1" colspan="1">98.27</td>
<td align="char" char="." rowspan="1" colspan="1">99.55</td>
<td align="left" rowspan="1" colspan="1">00:02:49</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">24</td>
<td align="left" rowspan="1" colspan="1">55</td>
<td align="char" char="." rowspan="1" colspan="1">97.77</td>
<td align="char" char="." rowspan="1" colspan="1">98.12</td>
<td align="char" char="." rowspan="1" colspan="1">99.64</td>
<td align="left" rowspan="1" colspan="1">00:02:36</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">27</td>
<td align="left" rowspan="1" colspan="1">45</td>
<td align="char" char="." rowspan="1" colspan="1">97.65</td>
<td align="char" char="." rowspan="1" colspan="1">98.07</td>
<td align="char" char="." rowspan="1" colspan="1">99.58</td>
<td align="left" rowspan="1" colspan="1">00:02:27</td>
</tr>
<tr><td colspan="6" align="left" rowspan="1"><italic>FACS-C</italic>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">17</td>
<td align="left" rowspan="1" colspan="1">30</td>
<td align="char" char="." rowspan="1" colspan="1"><bold>99.92</bold>
</td>
<td align="char" char="." rowspan="1" colspan="1">90.20</td>
<td align="char" char="." rowspan="1" colspan="1">99.93</td>
<td align="left" rowspan="1" colspan="1">00:01:08</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">17</td>
<td align="left" rowspan="1" colspan="1">40</td>
<td align="char" char="." rowspan="1" colspan="1">98.78</td>
<td align="char" char="." rowspan="1" colspan="1">93.25</td>
<td align="char" char="." rowspan="1" colspan="1">98.78</td>
<td align="left" rowspan="1" colspan="1">00:01:12</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">19</td>
<td align="left" rowspan="1" colspan="1">30</td>
<td align="char" char="." rowspan="1" colspan="1">99.48</td>
<td align="char" char="." rowspan="1" colspan="1">92.65</td>
<td align="char" char="." rowspan="1" colspan="1">99.48</td>
<td align="left" rowspan="1" colspan="1"><bold>00:00:49</bold>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">21</td>
<td align="left" rowspan="1" colspan="1">30</td>
<td align="char" char="." rowspan="1" colspan="1">98.26</td>
<td align="char" char="." rowspan="1" colspan="1">94.27</td>
<td align="char" char="." rowspan="1" colspan="1">98.27</td>
<td align="left" rowspan="1" colspan="1"><bold>00:00:43</bold>
</td>
</tr>
<tr><td colspan="6" align="left" rowspan="1"><italic>pre-CoMeta</italic>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">15</td>
<td align="left" rowspan="1" colspan="1">55</td>
<td align="char" char="." rowspan="1" colspan="1">99.30</td>
<td align="char" char="." rowspan="1" colspan="1">93.56</td>
<td align="char" char="." rowspan="1" colspan="1">99.31</td>
<td align="left" rowspan="1" colspan="1">00:01:52</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">18</td>
<td align="left" rowspan="1" colspan="1">45</td>
<td align="char" char="." rowspan="1" colspan="1">99.42</td>
<td align="char" char="." rowspan="1" colspan="1">93.36</td>
<td align="char" char="." rowspan="1" colspan="1">99.43</td>
<td align="left" rowspan="1" colspan="1">00:01:21</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">21</td>
<td align="left" rowspan="1" colspan="1">45</td>
<td align="char" char="." rowspan="1" colspan="1">99.05</td>
<td align="char" char="." rowspan="1" colspan="1">93.93</td>
<td align="char" char="." rowspan="1" colspan="1">99.06</td>
<td align="left" rowspan="1" colspan="1">00:01:08</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">25</td>
<td align="left" rowspan="1" colspan="1">30</td>
<td align="char" char="." rowspan="1" colspan="1">99.56</td>
<td align="char" char="." rowspan="1" colspan="1">92.05</td>
<td align="char" char="." rowspan="1" colspan="1">99.57</td>
<td align="left" rowspan="1" colspan="1">00:01:09</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">27</td>
<td align="left" rowspan="1" colspan="1">35</td>
<td align="char" char="." rowspan="1" colspan="1">99.36</td>
<td align="char" char="." rowspan="1" colspan="1">93.07</td>
<td align="char" char="." rowspan="1" colspan="1">99.37</td>
<td align="left" rowspan="1" colspan="1">00:01:16</td>
</tr>
<tr><td colspan="6" align="left" rowspan="1"><italic>CoMeta</italic>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">18</td>
<td align="left" rowspan="1" colspan="1">–</td>
<td align="char" char="." rowspan="1" colspan="1">97.91</td>
<td align="char" char="." rowspan="1" colspan="1">97.91</td>
<td align="char" char="." rowspan="1" colspan="1"><bold>100.00</bold>
</td>
<td align="left" rowspan="1" colspan="1">00:01:37</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">21</td>
<td align="left" rowspan="1" colspan="1">–</td>
<td align="char" char="." rowspan="1" colspan="1">98.40</td>
<td align="char" char="." rowspan="1" colspan="1">98.41</td>
<td align="char" char="." rowspan="1" colspan="1">99.99</td>
<td align="left" rowspan="1" colspan="1">00:01:36</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">24</td>
<td align="left" rowspan="1" colspan="1">–</td>
<td align="char" char="." rowspan="1" colspan="1">98.69</td>
<td align="char" char="." rowspan="1" colspan="1">98.75</td>
<td align="char" char="." rowspan="1" colspan="1">99.93</td>
<td align="left" rowspan="1" colspan="1">00:01:37</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">27</td>
<td align="left" rowspan="1" colspan="1">–</td>
<td align="char" char="." rowspan="1" colspan="1">98.71</td>
<td align="char" char="." rowspan="1" colspan="1"><bold>99.08</bold>
</td>
<td align="char" char="." rowspan="1" colspan="1">99.63</td>
<td align="left" rowspan="1" colspan="1">00:01:30</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot><fn id="t002fn001"><p>Comparison of the best classification results obtained using four methods (bold values indicate the best score for each column):</p>
</fn>
<fn id="t002fn002"><p><italic>FACS-P</italic>
: the FACS 2.1 program in Perl [<xref rid="pone.0121453.ref049" ref-type="bibr">49</xref>
]. When a read is classified to some <italic>G</italic>
<sub><italic>i</italic>
</sub>
-th reference sequence, it is not compared with any further reference sequence;</p>
</fn>
<fn id="t002fn003"><p><italic>FACS-C</italic>
: the FACS program in C, which was downloaded from <ext-link ext-link-type="uri" xlink:href="https://github.com/SciLifeLab/facs">https://github.com/SciLifeLab/facs</ext-link>
. Reads are classified to each reference sequence to which their similarity is higher than <italic>MC</italic>
;</p>
</fn>
<fn id="t002fn004"><p><italic>pre-CoMeta</italic>
: only the comparison step of the CoMeta algorithm (without assignment). This strategy is similar to the one implemented in <italic>FACS-C</italic>
.</p>
</fn>
<fn id="t002fn005"><p><italic>CoMeta</italic>
: the full proposed algorithm; each read is classified to the reference sequence with the highest score.</p>
</fn>
</table-wrap-foot>
</table-wrap>
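<p>The strategies described in the footnotes of Table 2 can be contrasted with a minimal Python sketch. It assumes that a similarity score (in percent) between a read and each reference sequence has already been computed. The function names, the dictionary layout, and the rule that a read whose best score is zero stays unclassified are illustrative assumptions and are not taken from any of the compared programs.</p>
<preformat>
```python
from typing import Dict, List, Optional


def first_hit(scores: Dict[str, float], mc: float) -> Optional[str]:
    """FACS-P-like: stop at the first reference whose score exceeds MC."""
    for ref, score in scores.items():  # dicts preserve insertion order
        if score > mc:
            return ref
    return None


def all_above(scores: Dict[str, float], mc: float) -> List[str]:
    """FACS-C / pre-CoMeta-like: every reference whose score exceeds MC."""
    return [ref for ref, score in scores.items() if score > mc]


def best_score(scores: Dict[str, float]) -> Optional[str]:
    """CoMeta-like: the single reference with the highest score.

    Assumption: a read with no positive score is left unclassified.
    """
    if not scores or max(scores.values()) == 0:
        return None
    return max(scores, key=scores.get)


# Hypothetical similarity scores (percent) of one read against three references.
example = {"G1": 45.0, "G2": 80.0, "G3": 10.0}
```
</preformat>
<p>With the hypothetical scores above and <italic>MC</italic> = 30, the first-hit rule stops at G1, the threshold rule reports both G1 and G2, and the best-score rule selects G2 only.</p>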
<p>The precisions and sensitivities for <italic>CoMeta</italic>
, depending on <italic>k</italic>
, are shown in <xref ref-type="fig" rid="pone.0121453.g004">Fig 4</xref>
. The results are presented with and without taking into account the <italic>mismatch</italic>
 files (<italic>MM</italic>
). As <italic>k</italic>
 grows up to <italic>k</italic>
 = 25, both precision and sensitivity increase; beyond that point, sensitivity falls. The reason is that, as <italic>k</italic>
 increases, the number of unclassified sequences also increases.</p>
<fig id="pone.0121453.g004" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0121453.g004</object-id>
<label>Fig 4</label>
<caption><title>Classification accuracy for <italic>CoMeta</italic>
 in Experiment One.</title>
<p>Classification accuracy is shown when taking into account only the <italic>match</italic>
 files (dotted line with square marks) and when additionally considering the <italic>mismatch</italic>
 files (solid line with circle marks). Points along each curve correspond to different <italic>k</italic>
-mer lengths.</p>
</caption>
<graphic xlink:href="pone.0121453.g004"></graphic>
</fig>
<p>The sensitivity and precision for <italic>FACS-P</italic>
, <italic>FACS-C</italic>
, and <italic>pre-CoMeta</italic>
 for various <italic>k</italic>
 are presented in Fig <xref ref-type="fig" rid="pone.0121453.g005">5A</xref>
–<xref ref-type="fig" rid="pone.0121453.g005">5C</xref>
. Each series shows the results for 11 different threshold values, in sequence starting from the left part of each figure: <italic>MC</italic>
 = 30, 35, 40, …, 80 [%]. Plot A shows that only for a small value of <italic>k</italic>
 in <italic>FACS-P</italic>
 does the sensitivity not drop as the threshold increases, while in the other cases the sensitivity for a large <italic>MC</italic>
 declines. A detailed analysis of the impact of the parameters <italic>k</italic>
, <italic>MC</italic>
 and <italic>p</italic>
<sub><italic>f</italic>
</sub>
 (for building the Bloom filters) on the accuracy of <italic>FACS-P</italic>
 was presented in our earlier study [<xref rid="pone.0121453.ref057" ref-type="bibr">57</xref>
].</p>
<fig id="pone.0121453.g005" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0121453.g005</object-id>
<label>Fig 5</label>
<caption><title>Classification accuracy for Experiment One using various values of the <italic>k</italic>
 parameter.</title>
<p>Plot A shows the scores after classification using <italic>FACS-P</italic>
, plot B using <italic>FACS-C</italic>
, and plot C using <italic>pre-CoMeta</italic>
. Each series shows the results for 11 different threshold values, in sequence starting from the left part of each figure: <italic>MC</italic>
 = 30, 35, 40, …, 80 [%].</p>
</caption>
<graphic xlink:href="pone.0121453.g005"></graphic>
</fig>
<p>The processing times of the examined methods are given in <xref rid="pone.0121453.t002" ref-type="table">Table 2</xref>
. It can be seen that <italic>FACS-C</italic>
 is usually the fastest; however, <italic>CoMeta</italic>
 is slower only by a factor of two.</p>
</sec>
<sec id="sec016"><title>Experiment Two</title>
<p>The second experiment consisted of classifying reads into taxonomic groups. We compared our method with all the examined programs except for FACS.</p>
<p>The programs were evaluated on the first four metagenomic datasets (from 454 sequencing). As noted earlier, the results for CARMA, MEGAN, MG-RAST, and MetaPhyler were taken directly from the Bazinet–Cummings paper [<xref rid="pone.0121453.ref056" ref-type="bibr">56</xref>
]. Bazinet and Cummings classified <italic>PhyloPythia 961 bp</italic>
 at the genus rank, <italic>FACS 269 bp</italic>
 at the superkingdom rank, and the other two datasets at the phylum rank. When running CoMeta, Kraken, and LMAT, we likewise classified <italic>PhyloPythia 961 bp</italic>
 at the genus rank, but the three other datasets at the phylum rank.</p>
<p>For the Illumina datasets (<italic>HiSeq 92 bp</italic>
 and <italic>MiSeq 156 bp</italic>
) we examined Kraken, LMAT, and CoMeta. The classification level was set to the genus rank here.</p>
<p>LMAT was tested with two databases downloaded from the LMAT website: the “full” <italic>k</italic>
-mer/taxonomy database (<italic>kFull</italic>
) and a smaller database built from the “marker library” (<italic>kML</italic>
). These databases were constructed from the complete and partial microbial genome sequences in the NCBI genome database from 2011. The <italic>kFull</italic>
 database contains 20-mers, while <italic>kML</italic>
 contains 18-mers.</p>
<p>Kraken was evaluated using the MiniKraken database (the only one available) downloaded from the Kraken website. Unfortunately, Kraken failed to construct a database from our set of reference sequences (probably due to the huge memory requirements of the Jellyfish tool used to collect <italic>k</italic>
-mer statistics). We were also not able to obtain the larger databases from the authors.</p>
<p>For CoMeta, we built <italic>k</italic>
-mer databases using all reference sequences from the NCBI genome database from 2012. We divided the sequences into several groups so that, during classification, we could easily select the groups into which reads were to be classified. Therefore, in some experiments we used all sequences (<italic>allDb</italic>
 database), while in the rest only those from bacteria, viruses, and archaea (<italic>micDb</italic>
 database). The databases were constructed using various <italic>k</italic>
-mer lengths (15, 18, 21, 24, 27, and 30).</p>
<p>We conducted a large number of preliminary experiments for different parameters. Some of them are described in <xref ref-type="supplementary-material" rid="pone.0121453.s001">S1 Supporting Information</xref>
. The most important results of our experiments are summarized in Tables <xref rid="pone.0121453.t003" ref-type="table">3</xref>
 and <xref rid="pone.0121453.t004" ref-type="table">4</xref>
. LMAT results are for the “minimum score” (<italic>ms</italic>
) set to 0 (the optimal value according to the preliminary experiments). The results for CoMeta <italic>allDb</italic>
 were calculated in such a way that if a read was classified to several groups, then it was assigned to all of them. Hence, in some cases, the sum of TP, FP, and NC was higher than the number of all reads in the dataset. For better comparison of CoMeta and Kraken, the results for CoMeta <italic>micDb</italic>
 were computed using the same strategy as in Kraken, so if a read was classified to multiple groups we did not assign it to any group.</p>
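<p>The two assignment policies described above lead to different TP, FP, and NC counts. The following Python sketch is illustrative only: the function name, the input layout (a true group plus a list of candidate groups per read), and the choice to count every assigned group separately as TP or FP are assumptions made for the example, not a description of the evaluation scripts used in the study.</p>
<preformat>
```python
from collections import Counter
from typing import Iterable, List, Tuple


def tally(reads: Iterable[Tuple[str, List[str]]], assign_all: bool) -> Counter:
    """Count TP, FP and NC over (true_group, candidate_groups) pairs.

    assign_all=True  : a read hitting several groups is assigned to all of
                       them (the policy described for CoMeta allDb).
    assign_all=False : a read hitting several groups is left unassigned
                       (the Kraken-like policy described for CoMeta micDb).
    """
    counts = Counter(TP=0, FP=0, NC=0)
    for truth, candidates in reads:
        if not candidates or (len(candidates) > 1 and not assign_all):
            counts["NC"] += 1  # unclassified read
            continue
        for group in candidates:
            counts["TP" if group == truth else "FP"] += 1
    return counts


# Hypothetical toy data: three reads with their true group and candidate groups.
toy = [("A", ["A"]), ("A", ["A", "B"]), ("B", [])]
```
</preformat>
<p>For the toy data, assigning a read to all matching groups yields four counted outcomes for three reads, which illustrates why the sum of TP, FP, and NC can exceed the number of reads; the stricter policy yields exactly three.</p>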
<table-wrap id="pone.0121453.t003" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0121453.t003</object-id>
<label>Table 3</label>
<caption><title>Comparison of programs using 454 reads.</title>
</caption>
<alternatives><graphic id="pone.0121453.t003g" xlink:href="pone.0121453.t003"></graphic>
<table frame="box" rules="all" border="0"><colgroup span="1"><col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
</colgroup>
<thead><tr><th align="left" rowspan="1" colspan="1">Program</th>
<th align="left" rowspan="1" colspan="1">FACS 269 bp</th>
<th align="left" rowspan="1" colspan="1">MetaPhyler 300 bp</th>
<th align="left" rowspan="1" colspan="1">CARMA 265 bp</th>
<th align="left" rowspan="1" colspan="1">PhyloPythia 961 bp</th>
</tr>
</thead>
<tbody><tr><td colspan="5" align="left" rowspan="1">Percentage of classified reads</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CARMA<xref ref-type="table-fn" rid="t003fn001"><sup>a</sup>
</xref>
</td>
<td align="left" rowspan="1" colspan="1">29.0</td>
<td align="char" char="." rowspan="1" colspan="1">93.6</td>
<td align="char" char="." rowspan="1" colspan="1">68.7</td>
<td align="char" char="." rowspan="1" colspan="1">61.3</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">MEGAN<xref ref-type="table-fn" rid="t003fn001"><sup>a</sup>
</xref>
</td>
<td align="left" rowspan="1" colspan="1">48.4</td>
<td align="char" char="." rowspan="1" colspan="1">88.2</td>
<td align="char" char="." rowspan="1" colspan="1">90.5</td>
<td align="char" char="." rowspan="1" colspan="1">62.2</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">MetaPhyler<xref ref-type="table-fn" rid="t003fn001"><sup>a</sup>
</xref>
</td>
<td align="left" rowspan="1" colspan="1">0.2</td>
<td align="char" char="." rowspan="1" colspan="1">80.9</td>
<td align="char" char="." rowspan="1" colspan="1">0.5</td>
<td align="char" char="." rowspan="1" colspan="1">0.6</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">MG-RAST<xref ref-type="table-fn" rid="t003fn001"><sup>a</sup>
</xref>
</td>
<td align="left" rowspan="1" colspan="1">27.1</td>
<td align="char" char="." rowspan="1" colspan="1">29.8</td>
<td align="char" char="." rowspan="1" colspan="1">80.2</td>
<td align="char" char="." rowspan="1" colspan="1">70.5</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">LMAT <italic>kML</italic>
</td>
<td align="left" rowspan="1" colspan="1">24.7(26.4<xref ref-type="table-fn" rid="t003fn002"><sup>b</sup>
</xref>
)</td>
<td align="char" char="." rowspan="1" colspan="1">96.5</td>
<td align="char" char="." rowspan="1" colspan="1">80.4</td>
<td align="char" char="." rowspan="1" colspan="1">98.3</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">LMAT <italic>kFull</italic>
</td>
<td align="left" rowspan="1" colspan="1">92.5(98.8<xref ref-type="table-fn" rid="t003fn002"><sup>b</sup>
</xref>
)</td>
<td align="char" char="." rowspan="1" colspan="1">99.3</td>
<td align="char" char="." rowspan="1" colspan="1">86.0</td>
<td align="char" char="." rowspan="1" colspan="1">82.7</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">MiniKraken</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="char" char="." rowspan="1" colspan="1">100.0</td>
<td align="char" char="." rowspan="1" colspan="1">96.7</td>
<td align="char" char="." rowspan="1" colspan="1">98.0</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CoMeta <italic>allDb</italic>
</td>
<td align="left" rowspan="1" colspan="1">93.6(100.0<xref ref-type="table-fn" rid="t003fn002"><sup>b</sup>
</xref>
)</td>
<td align="char" char="." rowspan="1" colspan="1">100.0</td>
<td align="char" char="." rowspan="1" colspan="1">99.9</td>
<td align="char" char="." rowspan="1" colspan="1">94.7</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CoMeta <italic>micDb</italic>
</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="char" char="." rowspan="1" colspan="1">100.0</td>
<td align="char" char="." rowspan="1" colspan="1">98.9</td>
<td align="char" char="." rowspan="1" colspan="1">97.4</td>
</tr>
<tr><td colspan="5" align="left" rowspan="1">Sensitivity (percentage)</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CARMA<xref ref-type="table-fn" rid="t003fn001"><sup>a</sup>
</xref>
</td>
<td align="left" rowspan="1" colspan="1">26.7</td>
<td align="char" char="." rowspan="1" colspan="1">93.4</td>
<td align="char" char="." rowspan="1" colspan="1">68.5</td>
<td align="char" char="." rowspan="1" colspan="1">59.8</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">MEGAN<xref ref-type="table-fn" rid="t003fn001"><sup>a</sup>
</xref>
</td>
<td align="left" rowspan="1" colspan="1">42.5</td>
<td align="char" char="." rowspan="1" colspan="1">87.9</td>
<td align="char" char="." rowspan="1" colspan="1">90.3</td>
<td align="char" char="." rowspan="1" colspan="1">61.0</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">MetaPhyler<xref ref-type="table-fn" rid="t003fn001"><sup>a</sup>
</xref>
</td>
<td align="left" rowspan="1" colspan="1">0.1</td>
<td align="char" char="." rowspan="1" colspan="1">80.7</td>
<td align="char" char="." rowspan="1" colspan="1">0.5</td>
<td align="char" char="." rowspan="1" colspan="1">0.5</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">MG-RAST<xref ref-type="table-fn" rid="t003fn001"><sup>a</sup>
</xref>
</td>
<td align="left" rowspan="1" colspan="1">25.0</td>
<td align="char" char="." rowspan="1" colspan="1">29.7</td>
<td align="char" char="." rowspan="1" colspan="1">80.1</td>
<td align="char" char="." rowspan="1" colspan="1">67.2</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">LMAT <italic>kML</italic>
</td>
<td align="left" rowspan="1" colspan="1">24.7(26.3<xref ref-type="table-fn" rid="t003fn002"><sup>b</sup>
</xref>
)</td>
<td align="char" char="." rowspan="1" colspan="1">95.7</td>
<td align="char" char="." rowspan="1" colspan="1">80.4</td>
<td align="char" char="." rowspan="1" colspan="1">98.1</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">LMAT <italic>kFull</italic>
</td>
<td align="left" rowspan="1" colspan="1">92.5(98.7<xref ref-type="table-fn" rid="t003fn002"><sup>b</sup>
</xref>
)</td>
<td align="char" char="." rowspan="1" colspan="1">98.5</td>
<td align="char" char="." rowspan="1" colspan="1">86.0</td>
<td align="char" char="." rowspan="1" colspan="1">82.5</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">MiniKraken</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="char" char="." rowspan="1" colspan="1">99.9</td>
<td align="char" char="." rowspan="1" colspan="1">96.7</td>
<td align="char" char="." rowspan="1" colspan="1">97.7</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CoMeta <italic>allDb</italic>
</td>
<td align="left" rowspan="1" colspan="1">93.4(99.7<xref ref-type="table-fn" rid="t003fn002"><sup>b</sup>
</xref>
)</td>
<td align="char" char="." rowspan="1" colspan="1">99.6</td>
<td align="char" char="." rowspan="1" colspan="1">99.1</td>
<td align="char" char="." rowspan="1" colspan="1">94.1</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CoMeta <italic>micDb</italic>
</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="char" char="." rowspan="1" colspan="1">99.8</td>
<td align="char" char="." rowspan="1" colspan="1">98.9</td>
<td align="char" char="." rowspan="1" colspan="1">96.2</td>
</tr>
<tr><td colspan="5" align="left" rowspan="1">Precision (percentage)</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CARMA<xref ref-type="table-fn" rid="t003fn001"><sup>a</sup>
</xref>
</td>
<td align="left" rowspan="1" colspan="1">92.0</td>
<td align="char" char="." rowspan="1" colspan="1">99.7</td>
<td align="char" char="." rowspan="1" colspan="1">99.7</td>
<td align="char" char="." rowspan="1" colspan="1">97.4</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">MEGAN<xref ref-type="table-fn" rid="t003fn001"><sup>a</sup>
</xref>
</td>
<td align="left" rowspan="1" colspan="1">78.1</td>
<td align="char" char="." rowspan="1" colspan="1">99.7</td>
<td align="char" char="." rowspan="1" colspan="1">99.8</td>
<td align="char" char="." rowspan="1" colspan="1">98.1</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">MetaPhyler<xref ref-type="table-fn" rid="t003fn001"><sup>a</sup>
</xref>
</td>
<td align="left" rowspan="1" colspan="1">84.0</td>
<td align="char" char="." rowspan="1" colspan="1">99.7</td>
<td align="char" char="." rowspan="1" colspan="1">100.0</td>
<td align="char" char="." rowspan="1" colspan="1">83.8</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">MG-RAST<xref ref-type="table-fn" rid="t003fn001"><sup>a</sup>
</xref>
</td>
<td align="left" rowspan="1" colspan="1">92.4</td>
<td align="char" char="." rowspan="1" colspan="1">99.8</td>
<td align="char" char="." rowspan="1" colspan="1">99.9</td>
<td align="char" char="." rowspan="1" colspan="1">95.3</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">LMAT <italic>kML</italic>
</td>
<td align="left" rowspan="1" colspan="1">99.9(99.9<xref ref-type="table-fn" rid="t003fn002"><sup>b</sup>
</xref>
)</td>
<td align="char" char="." rowspan="1" colspan="1">97.8</td>
<td align="char" char="." rowspan="1" colspan="1">100.0</td>
<td align="char" char="." rowspan="1" colspan="1">99.8</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">LMAT <italic>kFull</italic>
</td>
<td align="left" rowspan="1" colspan="1">100.0(100.0<xref ref-type="table-fn" rid="t003fn002"><sup>b</sup>
</xref>
)</td>
<td align="char" char="." rowspan="1" colspan="1">97.8</td>
<td align="char" char="." rowspan="1" colspan="1">100.0</td>
<td align="char" char="." rowspan="1" colspan="1">99.8</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">MiniKraken</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="char" char="." rowspan="1" colspan="1">99.9</td>
<td align="char" char="." rowspan="1" colspan="1">100.0</td>
<td align="char" char="." rowspan="1" colspan="1">99.7</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CoMeta <italic>allDb</italic>
</td>
<td align="left" rowspan="1" colspan="1">99.8(99.8<xref ref-type="table-fn" rid="t003fn002"><sup>b</sup>
</xref>
)</td>
<td align="char" char="." rowspan="1" colspan="1">99.6</td>
<td align="char" char="." rowspan="1" colspan="1">99.1</td>
<td align="char" char="." rowspan="1" colspan="1">99.3</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CoMeta <italic>micDb</italic>
</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="char" char="." rowspan="1" colspan="1">99.8</td>
<td align="char" char="." rowspan="1" colspan="1">99.9</td>
<td align="char" char="." rowspan="1" colspan="1">98.8</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot><fn id="t003fn001"><p><sup>a</sup>
 The results for this program are taken from the Bazinet–Cummings paper [<xref rid="pone.0121453.ref056" ref-type="bibr">56</xref>
].</p>
</fn>
<fn id="t003fn002"><p><sup>b</sup>
 Results for the <italic>FACS 269 bp</italic>
 dataset after filtering out reads in which more than 50% of the nucleotides are unknown (Ns). The values outside the brackets are for the whole dataset.</p>
</fn>
<fn id="t003fn003"><p>CoMeta <italic>allDb</italic>
 parameters: <italic>MC</italic>
 = 30%, <italic>k</italic>
 = 24.</p>
</fn>
<fn id="t003fn004"><p>CoMeta <italic>micDb</italic>
 parameters: <italic>MC</italic>
 = 5%, <italic>k</italic>
 = 30.</p>
</fn>
<fn id="t003fn005"><p>LMAT <italic>kML</italic>
 and <italic>kFull</italic>
 parameter: <italic>ms</italic>
 = 0.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<table-wrap id="pone.0121453.t004" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0121453.t004</object-id>
<label>Table 4</label>
<caption><title>Comparison of programs at various classification levels using Illumina reads.</title>
</caption>
<alternatives><graphic id="pone.0121453.t004g" xlink:href="pone.0121453.t004"></graphic>
<table frame="box" rules="all" border="0"><colgroup span="1"><col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
</colgroup>
<thead><tr><th align="left" rowspan="1" colspan="1">Programs</th>
<th colspan="3" align="center" rowspan="1">HiSeq 92 bp</th>
<th colspan="3" align="center" rowspan="1">MiSeq 156 bp</th>
</tr>
<tr><th align="left" rowspan="1" colspan="1"></th>
<th align="left" rowspan="1" colspan="1">Sensitivity</th>
<th align="left" rowspan="1" colspan="1">Precision</th>
<th align="left" rowspan="1" colspan="1">Classified</th>
<th align="left" rowspan="1" colspan="1">Sensitivity</th>
<th align="left" rowspan="1" colspan="1">Precision</th>
<th align="left" rowspan="1" colspan="1">Classified</th>
</tr>
</thead>
<tbody><tr><td colspan="7" align="left" rowspan="1">PHYLUM</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">LMAT <italic>kFull</italic>
</td>
<td align="char" char="." rowspan="1" colspan="1">89.89</td>
<td align="char" char="." rowspan="1" colspan="1">99.74</td>
<td align="char" char="." rowspan="1" colspan="1">90.12</td>
<td align="char" char="." rowspan="1" colspan="1">88.23</td>
<td align="char" char="." rowspan="1" colspan="1">99.47</td>
<td align="char" char="." rowspan="1" colspan="1">88.70</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">MiniKraken<xref ref-type="table-fn" rid="t004fn001"><sup>a</sup>
</xref>
</td>
<td align="char" char="." rowspan="1" colspan="1">65.34</td>
<td align="char" char="." rowspan="1" colspan="1">99.79</td>
<td align="char" char="." rowspan="1" colspan="1">65.48</td>
<td align="char" char="." rowspan="1" colspan="1">75.88</td>
<td align="char" char="." rowspan="1" colspan="1">99.93</td>
<td align="char" char="." rowspan="1" colspan="1">75.93</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CoMeta <italic>micDb</italic>
</td>
<td align="char" char="." rowspan="1" colspan="1">81.64</td>
<td align="char" char="." rowspan="1" colspan="1">98.97</td>
<td align="char" char="." rowspan="1" colspan="1">82.49</td>
<td align="char" char="." rowspan="1" colspan="1">86.71</td>
<td align="char" char="." rowspan="1" colspan="1">99.11</td>
<td align="char" char="." rowspan="1" colspan="1">87.49</td>
</tr>
<tr><td colspan="7" align="left" rowspan="1">CLASS</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">LMAT <italic>kFull</italic>
</td>
<td align="char" char="." rowspan="1" colspan="1">88.06</td>
<td align="char" char="." rowspan="1" colspan="1">99.66</td>
<td align="char" char="." rowspan="1" colspan="1">88.36</td>
<td align="char" char="." rowspan="1" colspan="1">85.79</td>
<td align="char" char="." rowspan="1" colspan="1">99.65</td>
<td align="char" char="." rowspan="1" colspan="1">86.09</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">MiniKraken<xref ref-type="table-fn" rid="t004fn001"><sup>a</sup>
</xref>
</td>
<td align="char" char="." rowspan="1" colspan="1">65.16</td>
<td align="char" char="." rowspan="1" colspan="1">99.65</td>
<td align="char" char="." rowspan="1" colspan="1">65.39</td>
<td align="char" char="." rowspan="1" colspan="1">75.73</td>
<td align="char" char="." rowspan="1" colspan="1">99.91</td>
<td align="char" char="." rowspan="1" colspan="1">75.80</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CoMeta <italic>micDb</italic>
</td>
<td align="char" char="." rowspan="1" colspan="1">80.87</td>
<td align="char" char="." rowspan="1" colspan="1">98.14</td>
<td align="char" char="." rowspan="1" colspan="1">82.40</td>
<td align="char" char="." rowspan="1" colspan="1">86.34</td>
<td align="char" char="." rowspan="1" colspan="1">98.83</td>
<td align="char" char="." rowspan="1" colspan="1">87.36</td>
</tr>
<tr><td colspan="7" align="left" rowspan="1">ORDER</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">LMAT <italic>kFull</italic>
</td>
<td align="char" char="." rowspan="1" colspan="1">86.48</td>
<td align="char" char="." rowspan="1" colspan="1">99.80</td>
<td align="char" char="." rowspan="1" colspan="1">86.65</td>
<td align="char" char="." rowspan="1" colspan="1">81.00</td>
<td align="char" char="." rowspan="1" colspan="1">99.63</td>
<td align="char" char="." rowspan="1" colspan="1">81.30</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">MiniKraken<xref ref-type="table-fn" rid="t004fn001"><sup>a</sup>
</xref>
</td>
<td align="char" char="." rowspan="1" colspan="1">64.89</td>
<td align="char" char="." rowspan="1" colspan="1">99.51</td>
<td align="char" char="." rowspan="1" colspan="1">65.21</td>
<td align="char" char="." rowspan="1" colspan="1">75.52</td>
<td align="char" char="." rowspan="1" colspan="1">99.87</td>
<td align="char" char="." rowspan="1" colspan="1">75.62</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CoMeta <italic>micDb</italic>
</td>
<td align="char" char="." rowspan="1" colspan="1">80.34</td>
<td align="char" char="." rowspan="1" colspan="1">97.73</td>
<td align="char" char="." rowspan="1" colspan="1">82.21</td>
<td align="char" char="." rowspan="1" colspan="1">85.39</td>
<td align="char" char="." rowspan="1" colspan="1">98.01</td>
<td align="char" char="." rowspan="1" colspan="1">87.12</td>
</tr>
<tr><td colspan="7" align="left" rowspan="1">FAMILY</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">LMAT <italic>kFull</italic>
</td>
<td align="char" char="." rowspan="1" colspan="1">84.96</td>
<td align="char" char="." rowspan="1" colspan="1">99.79</td>
<td align="char" char="." rowspan="1" colspan="1">85.14</td>
<td align="char" char="." rowspan="1" colspan="1">79.40</td>
<td align="char" char="." rowspan="1" colspan="1">99.72</td>
<td align="char" char="." rowspan="1" colspan="1">79.62</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">MiniKraken<xref ref-type="table-fn" rid="t004fn001"><sup>a</sup>
</xref>
</td>
<td align="char" char="." rowspan="1" colspan="1">64.75</td>
<td align="char" char="." rowspan="1" colspan="1">99.46</td>
<td align="char" char="." rowspan="1" colspan="1">65.10</td>
<td align="char" char="." rowspan="1" colspan="1">75.43</td>
<td align="char" char="." rowspan="1" colspan="1">99.81</td>
<td align="char" char="." rowspan="1" colspan="1">75.57</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CoMeta <italic>micDb</italic>
</td>
<td align="char" char="." rowspan="1" colspan="1">80.13</td>
<td align="char" char="." rowspan="1" colspan="1">97.61</td>
<td align="char" char="." rowspan="1" colspan="1">82.09</td>
<td align="char" char="." rowspan="1" colspan="1">85.05</td>
<td align="char" char="." rowspan="1" colspan="1">97.76</td>
<td align="char" char="." rowspan="1" colspan="1">87.00</td>
</tr>
<tr><td colspan="7" align="left" rowspan="1">GENUS</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">LMAT <italic>kFull</italic>
</td>
<td align="char" char="." rowspan="1" colspan="1">84.74</td>
<td align="char" char="." rowspan="1" colspan="1">99.80</td>
<td align="char" char="." rowspan="1" colspan="1">84.91</td>
<td align="char" char="." rowspan="1" colspan="1">73.75</td>
<td align="char" char="." rowspan="1" colspan="1">99.53</td>
<td align="char" char="." rowspan="1" colspan="1">74.10</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">MiniKraken<xref ref-type="table-fn" rid="t004fn001"><sup>a</sup>
</xref>
</td>
<td align="char" char="." rowspan="1" colspan="1">64.54</td>
<td align="char" char="." rowspan="1" colspan="1">99.45</td>
<td align="char" char="." rowspan="1" colspan="1">64.90</td>
<td align="char" char="." rowspan="1" colspan="1">71.95</td>
<td align="char" char="." rowspan="1" colspan="1">98.04</td>
<td align="char" char="." rowspan="1" colspan="1">73.39</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">MiniKraken<xref ref-type="table-fn" rid="t004fn002"><sup>b</sup>
</xref>
</td>
<td align="char" char="." rowspan="1" colspan="1">66.12</td>
<td align="char" char="." rowspan="1" colspan="1">99.44</td>
<td align="char" char="." rowspan="1" colspan="1">—</td>
<td align="char" char="." rowspan="1" colspan="1">67.95</td>
<td align="char" char="." rowspan="1" colspan="1">97.41</td>
<td align="char" char="." rowspan="1" colspan="1">—</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Kraken<xref ref-type="table-fn" rid="t004fn002"><sup>b</sup>
</xref>
</td>
<td align="char" char="." rowspan="1" colspan="1">77.15</td>
<td align="char" char="." rowspan="1" colspan="1">99.20</td>
<td align="char" char="." rowspan="1" colspan="1">—</td>
<td align="char" char="." rowspan="1" colspan="1">73.46</td>
<td align="char" char="." rowspan="1" colspan="1">94.71</td>
<td align="char" char="." rowspan="1" colspan="1">—</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Kraken-GB<xref ref-type="table-fn" rid="t004fn002"><sup>b</sup>
</xref>
</td>
<td align="char" char="." rowspan="1" colspan="1">93.75</td>
<td align="char" char="." rowspan="1" colspan="1">99.51</td>
<td align="char" char="." rowspan="1" colspan="1">—</td>
<td align="char" char="." rowspan="1" colspan="1">86.23</td>
<td align="char" char="." rowspan="1" colspan="1">98.48</td>
<td align="char" char="." rowspan="1" colspan="1">—</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CoMeta <italic>micDb</italic>
</td>
<td align="char" char="." rowspan="1" colspan="1">79.82</td>
<td align="char" char="." rowspan="1" colspan="1">97.44</td>
<td align="char" char="." rowspan="1" colspan="1">81.92</td>
<td align="char" char="." rowspan="1" colspan="1">77.50</td>
<td align="char" char="." rowspan="1" colspan="1">90.83</td>
<td align="char" char="." rowspan="1" colspan="1">85.32</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot><fn id="t004fn001"><p><sup>a</sup>
 The results for this program were computed by us.</p>
</fn>
<fn id="t004fn002"><p><sup>b</sup>
 The results for this program are taken from the Wood–Salzberg paper [<xref rid="pone.0121453.ref051" ref-type="bibr">51</xref>
].</p>
</fn>
<fn id="t004fn003"><p>CoMeta <italic>micDb</italic>
 parameters: <italic>MC</italic>
 = 5%, <italic>k</italic>
 = 24. LMAT <italic>kFull</italic>
 parameter: <italic>ms</italic>
 = 0.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>In both variants of CoMeta (<italic>allDb</italic>
 and <italic>micDb</italic>
), the <italic>mismatch</italic>
 files were taken into account when the reads were assigned to the best groups. Depending on the dataset and database, the best classification results were obtained for different values of <italic>k</italic>
. Using <italic>micDb</italic>
, the best accuracy for the Illumina reads (which are short) was obtained using shorter <italic>k</italic>
-mers (i.e., <italic>k</italic>
 ≈ 24). For long reads (from 454 sequencing), the most accurate classification scores were obtained for <italic>k</italic>
 ≈ 30. However, using <italic>allDb</italic>
, where reads were assigned to many groups, the best classification results were obtained for <italic>k</italic>
 = 24.</p>
<p>The difference in the number of reads between the <italic>reduced FACS 269 bp</italic>
 and the original dataset is 6,347 (these are the reads containing more than 50% of unknown nucleotides). The classification results for the original and <italic>reduced FACS 269 bp</italic>
 datasets using CoMeta and LMAT differed only in the number of unclassified reads, by 6,346 and 6,347 reads, respectively. Real reads may of course contain unknown nucleotides; however, in our opinion, when validating classifiers, ambiguous reads should not be treated in the same way as reads consisting entirely of known nucleotides. Therefore, the classification results in <xref rid="pone.0121453.t003" ref-type="table">Table 3</xref>
 (using CoMeta and LMAT) are given both for the <italic>FACS 269 bp</italic>
 and the <italic>reduced FACS 269 bp</italic>
 datasets.</p>
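<p>The filtering rule used to obtain the reduced dataset can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: reads are plain strings, unknown nucleotides are encoded as the letter N, and the function names are hypothetical; it is not the script used to prepare the dataset.</p>
<preformat>
```python
def unknown_fraction(read: str) -> float:
    """Fraction of bases in the read that are the unknown symbol N."""
    if not read:
        return 0.0
    return read.upper().count("N") / len(read)


def reduce_dataset(reads, max_unknown=0.5):
    """Drop reads in which more than max_unknown of the bases are N."""
    kept = []
    for read in reads:
        if unknown_fraction(read) > max_unknown:
            continue  # mostly unknown read: filtered out
        kept.append(read)
    return kept
```
</preformat>
<p>A read with exactly 50% unknown nucleotides is kept under this rule, because only reads with more than 50% Ns are removed.</p>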
<p>The greatest differences in the classification results between the tested programs were observed for the <italic>FACS 269 bp</italic>
 dataset, which includes 72,951 reads derived from a human chromosome. CoMeta <italic>allDb</italic>
 and LMAT <italic>kFull</italic>
 classified the majority of reads, significantly outperforming other programs. The databases used by MetaPhyler, MG-RAST, LMAT <italic>kML</italic>
, as well as CoMeta <italic>micDb</italic>
 do not contain human sequences, or contain only specific marker genes, so it is understandable that the results are rather poor. Although the databases in CARMA and MEGAN contain human sequences, the results obtained on these metagenomic datasets were also poor. To investigate this problem, we tried to align a few reads from this dataset using BLASTX (both programs employ it), and BLASTX failed to classify some reads, which explains the weak results for CARMA and MEGAN. LMAT <italic>kFull</italic>
 incorrectly classified fewer reads than CoMeta <italic>allDb</italic>
, but it also classified fewer reads correctly; hence, the total number of classified reads was smaller for LMAT than for CoMeta.</p>
<p>For three other datasets, the results of MetaPhyler, MG-RAST, CARMA, and MEGAN were better than those achieved for <italic>FACS 269 bp</italic>
; however, LMAT, CoMeta, and Kraken were able to classify more reads. MetaPhyler is very fast because it uses only the “marker genes”, but only reads containing them can be classified correctly. Thus, this algorithm performs well only for the dataset created by the program’s authors. Only a certain percentage of sequenced reads contain marker genes, so in many cases MetaPhyler does not correctly recognize the origin of the reads. The best results for the <italic>MetaPhyler 300 bp</italic>
 dataset were obtained by Kraken and CoMeta, which outperformed LMAT. For the <italic>CARMA 265 bp</italic>
 dataset, CoMeta performed best; Kraken returned slightly worse scores, and LMAT much worse ones. However, for the <italic>PhyloPythia 961 bp</italic>
 dataset, LMAT <italic>kML</italic>
 achieved the best score. Nevertheless, it is worth noting that the results of LMAT <italic>kFull</italic>
 were significantly worse (comparing only those three programs), whereas for the remaining datasets the classification results were better with the <italic>kFull</italic>
 database than with the <italic>kML</italic>
 database.</p>
<p><xref rid="pone.0121453.t004" ref-type="table">Table 4</xref>
 summarizes the evaluation of CoMeta, LMAT, and Kraken for the Illumina reads. Here we show results for five classification levels: phylum, class, order, family, and genus. As mentioned earlier, we ran Kraken using only the MiniKraken database downloaded from the Kraken website, because we managed neither to build the larger database nor to obtain it from Kraken’s authors. Therefore, in addition to the results obtained in our experiment, we also present the results quoted from Wood and Salzberg’s paper [<xref rid="pone.0121453.ref051" ref-type="bibr">51</xref>
] (that work reports results only for the genus level). Although we carefully followed the instructions when running Kraken, our results for two datasets with the MiniKraken database differ from those reported in [<xref rid="pone.0121453.ref051" ref-type="bibr">51</xref>
]. The precision values were similar, but the difference in sensitivity was greater. For the <italic>HiSeq 92 bp</italic>
 dataset, the sensitivity we obtained was 1.58% lower than reported in [<xref rid="pone.0121453.ref051" ref-type="bibr">51</xref>
], and for the <italic>MiSeq 156 bp</italic>
 dataset it was 4% higher. The differences in precision could be due to the fact that Kraken’s authors took into account the reads incorrectly classified to the levels above the analyzed rank, whereas we consider such reads unclassified. However, we cannot explain the cause of the difference in the sensitivity values. The best classification results for both datasets at the genus level were obtained using Kraken-GB. This database, according to its authors, contains GenBank’s draft and completed genomes for bacteria and archaea. Considering only the results obtained in our own experiments, the <italic>HiSeq 92 bp</italic>
 dataset was classified best by LMAT and CoMeta. For the <italic>MiSeq 156 bp</italic>
 dataset, LMAT was better than CoMeta only at the phylum level, while CoMeta correctly classified many more reads at lower levels.</p>
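<p>The effect of the two counting conventions on precision can be sketched as follows. The definitions used here (sensitivity as correct reads over all reads, precision as correct reads over classified reads) and the outcome labels are assumptions introduced for illustration; the toy counts are invented and are not results from this study.</p>
<preformat>
```python
from collections import Counter


def rank_metrics(outcomes, count_wrong_above=False):
    """Return (sensitivity, precision) at one taxonomic rank.

    Each outcome is one of the hypothetical labels:
      "correct"       read assigned to the right group at the analyzed rank
      "wrong"         read assigned to a wrong group at the analyzed rank
      "wrong_above"   read assigned only above the rank, and incorrectly
      "unclassified"  everything else
    With count_wrong_above=False, "wrong_above" reads are treated as
    unclassified; with True, they are counted as classified errors.
    """
    counts = Counter(outcomes)
    total = len(outcomes)
    correct = counts["correct"]
    classified = correct + counts["wrong"]
    if count_wrong_above:
        classified += counts["wrong_above"]
    sensitivity = correct / total if total else 0.0
    precision = correct / classified if classified else 0.0
    return sensitivity, precision


# Invented toy data: 100 reads.
outcomes = (["correct"] * 80 + ["wrong"] * 5
            + ["wrong_above"] * 5 + ["unclassified"] * 10)
sens_a, prec_a = rank_metrics(outcomes)
sens_b, prec_b = rank_metrics(outcomes, count_wrong_above=True)
print(f"sensitivity: {sens_a:.3f} vs {sens_b:.3f}")
print(f"precision:   {prec_a:.3f} vs {prec_b:.3f}")
```
</preformat>
<p>Under these assumed definitions the convention changes only the precision, which is consistent with the observation that it cannot account for the difference in sensitivity.</p>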
<p>In <xref rid="pone.0121453.t005" ref-type="table">Table 5</xref>
 we present the classification times and memory usage. The programs that use <italic>k</italic>
-mer databases require a lot of memory. Using all available reference sequences (<italic>allDb</italic>
), CoMeta consumed about 70 GB of RAM. This was reduced to about 20 GB when only bacteria, viruses, and archaea were taken into account (<italic>micDb</italic>
). CoMeta <italic>allDb</italic>
 is 1.5–2 times slower than CoMeta <italic>micDb</italic>
. The MiniKraken database contains only a fraction of the <italic>k</italic>
-mers from the complete reference genomes of bacteria, viruses, and archaea; it consumed between 1.5 GB and 4 GB of RAM. With the complete database without eukaryotes, Kraken needs 74 GB (according to its authors).</p>
<table-wrap id="pone.0121453.t005" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0121453.t005</object-id>
<label>Table 5</label>
<caption><title>Comparison of RAM usage and CPU times.</title>
</caption>
<alternatives><graphic id="pone.0121453.t005g" xlink:href="pone.0121453.t005"></graphic>
<table frame="box" rules="all" border="0"><colgroup span="1"><col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
</colgroup>
<thead><tr><th align="left" rowspan="1" colspan="1">Program</th>
<th align="left" rowspan="1" colspan="1">FACS</th>
<th align="left" rowspan="1" colspan="1">MetaPhyler</th>
<th align="left" rowspan="1" colspan="1">CARMA</th>
<th align="left" rowspan="1" colspan="1">PhyloPythia</th>
<th align="left" rowspan="1" colspan="1">HiSeq</th>
<th align="left" rowspan="1" colspan="1">MiSeq</th>
</tr>
<tr><th align="left" rowspan="1" colspan="1"></th>
<th align="left" rowspan="1" colspan="1">269bp</th>
<th align="left" rowspan="1" colspan="1">300bp</th>
<th align="left" rowspan="1" colspan="1">265bp</th>
<th align="left" rowspan="1" colspan="1">961bp</th>
<th align="left" rowspan="1" colspan="1">92bp</th>
<th align="left" rowspan="1" colspan="1">156bp</th>
</tr>
</thead>
<tbody><tr><td colspan="7" align="left" rowspan="1">CPU Runtime (minutes)</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CARMA<xref ref-type="table-fn" rid="t005fn001"><sup>a</sup>
</xref>
</td>
<td align="left" rowspan="1" colspan="1">290880</td>
<td align="left" rowspan="1" colspan="1">77340</td>
<td align="left" rowspan="1" colspan="1">74950</td>
<td align="left" rowspan="1" colspan="1">360107</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="left" rowspan="1" colspan="1">—</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">MEGAN<xref ref-type="table-fn" rid="t005fn001"><sup>a</sup>
</xref>
</td>
<td align="left" rowspan="1" colspan="1">288020</td>
<td align="left" rowspan="1" colspan="1">72060</td>
<td align="left" rowspan="1" colspan="1">72010</td>
<td align="left" rowspan="1" colspan="1">351060</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="left" rowspan="1" colspan="1">—</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">MetaPhyler<xref ref-type="table-fn" rid="t005fn001"><sup>a</sup>
</xref>
</td>
<td align="left" rowspan="1" colspan="1">10</td>
<td align="left" rowspan="1" colspan="1">20</td>
<td align="left" rowspan="1" colspan="1">2</td>
<td align="left" rowspan="1" colspan="1">28</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="left" rowspan="1" colspan="1">—</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">MG-RAST<xref ref-type="table-fn" rid="t005fn001"><sup>a</sup>
</xref>
</td>
<td align="left" rowspan="1" colspan="1">60</td>
<td align="left" rowspan="1" colspan="1">10080</td>
<td align="left" rowspan="1" colspan="1">20160</td>
<td align="left" rowspan="1" colspan="1">12960</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="left" rowspan="1" colspan="1">—</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">LMAT <italic>kML</italic>
</td>
<td align="left" rowspan="1" colspan="1">36(60<xref ref-type="table-fn" rid="t005fn002"><sup>b</sup>
</xref>
)</td>
<td align="left" rowspan="1" colspan="1">58</td>
<td align="left" rowspan="1" colspan="1">43</td>
<td align="left" rowspan="1" colspan="1">348</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="left" rowspan="1" colspan="1">—</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">LMAT <italic>kFull</italic>
</td>
<td align="left" rowspan="1" colspan="1">54(93<xref ref-type="table-fn" rid="t005fn002"><sup>b</sup>
</xref>
)</td>
<td align="left" rowspan="1" colspan="1">213</td>
<td align="left" rowspan="1" colspan="1">38</td>
<td align="left" rowspan="1" colspan="1">772</td>
<td align="left" rowspan="1" colspan="1">15</td>
<td align="left" rowspan="1" colspan="1">33</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">MiniKraken</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="left" rowspan="1" colspan="1">1.22</td>
<td align="left" rowspan="1" colspan="1">1.07</td>
<td align="left" rowspan="1" colspan="1">2.95</td>
<td align="left" rowspan="1" colspan="1">1.3</td>
<td align="left" rowspan="1" colspan="1">1.2</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CoMeta <italic>allDb</italic>
</td>
<td align="left" rowspan="1" colspan="1">41(76<xref ref-type="table-fn" rid="t005fn002"><sup>b</sup>
</xref>
)</td>
<td align="left" rowspan="1" colspan="1">14</td>
<td align="left" rowspan="1" colspan="1">28</td>
<td align="left" rowspan="1" colspan="1">144</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="left" rowspan="1" colspan="1">—</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CoMeta <italic>micDb</italic>
 (ph)</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="left" rowspan="1" colspan="1">9</td>
<td align="left" rowspan="1" colspan="1">14</td>
<td align="left" rowspan="1" colspan="1">35</td>
<td align="left" rowspan="1" colspan="1">8</td>
<td align="left" rowspan="1" colspan="1">9</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CoMeta <italic>micDb</italic>
 (ge)</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="left" rowspan="1" colspan="1">79</td>
<td align="left" rowspan="1" colspan="1">42</td>
<td align="left" rowspan="1" colspan="1">68</td>
</tr>
<tr><td colspan="7" align="left" rowspan="1">Memory Usage (Megabytes of RAM)</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CARMA<xref ref-type="table-fn" rid="t005fn001"><sup>a</sup>
</xref>
</td>
<td align="left" rowspan="1" colspan="1">100</td>
<td align="left" rowspan="1" colspan="1">100</td>
<td align="left" rowspan="1" colspan="1">100</td>
<td align="left" rowspan="1" colspan="1">120</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="left" rowspan="1" colspan="1">—</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">MEGAN<xref ref-type="table-fn" rid="t005fn001"><sup>a</sup>
</xref>
</td>
<td align="left" rowspan="1" colspan="1">1024</td>
<td align="left" rowspan="1" colspan="1">1024</td>
<td align="left" rowspan="1" colspan="1">1024</td>
<td align="left" rowspan="1" colspan="1">1410</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="left" rowspan="1" colspan="1">—</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">MetaPhyler<xref ref-type="table-fn" rid="t005fn001"><sup>a</sup>
</xref>
</td>
<td align="left" rowspan="1" colspan="1">5734</td>
<td align="left" rowspan="1" colspan="1">5734</td>
<td align="left" rowspan="1" colspan="1">5734</td>
<td align="left" rowspan="1" colspan="1">5734</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="left" rowspan="1" colspan="1">—</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">MG-RAST<xref ref-type="table-fn" rid="t005fn001"><sup>a</sup>
</xref>
</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="left" rowspan="1" colspan="1">—</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">LMAT <italic>kML</italic>
</td>
<td align="left" rowspan="1" colspan="1">17000(17284<xref ref-type="table-fn" rid="t005fn002"><sup>b</sup>
</xref>
)</td>
<td align="left" rowspan="1" colspan="1">17019</td>
<td align="left" rowspan="1" colspan="1">2128</td>
<td align="left" rowspan="1" colspan="1">13311</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="left" rowspan="1" colspan="1">—</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">LMAT <italic>kFull</italic>
</td>
<td align="left" rowspan="1" colspan="1">9295(9481<xref ref-type="table-fn" rid="t005fn002"><sup>b</sup>
</xref>
)</td>
<td align="left" rowspan="1" colspan="1">13247</td>
<td align="left" rowspan="1" colspan="1">13286</td>
<td align="left" rowspan="1" colspan="1">15092</td>
<td align="left" rowspan="1" colspan="1">5807</td>
<td align="left" rowspan="1" colspan="1">12392</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">MiniKraken</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="left" rowspan="1" colspan="1">4098</td>
<td align="left" rowspan="1" colspan="1">3210</td>
<td align="left" rowspan="1" colspan="1">4100</td>
<td align="left" rowspan="1" colspan="1">1317</td>
<td align="left" rowspan="1" colspan="1">1449</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CoMeta <italic>allDb</italic>
</td>
<td align="left" rowspan="1" colspan="1">71260(71903<xref ref-type="table-fn" rid="t005fn002"><sup>b</sup>
</xref>
)</td>
<td align="left" rowspan="1" colspan="1">70743</td>
<td align="left" rowspan="1" colspan="1">71313</td>
<td align="left" rowspan="1" colspan="1">69508</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="left" rowspan="1" colspan="1">—</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">CoMeta <italic>micDb</italic>
</td>
<td align="left" rowspan="1" colspan="1">—</td>
<td align="left" rowspan="1" colspan="1">19552</td>
<td align="left" rowspan="1" colspan="1">19320</td>
<td align="left" rowspan="1" colspan="1">19552</td>
<td align="left" rowspan="1" colspan="1">10297</td>
<td align="left" rowspan="1" colspan="1">17689</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot><fn id="t005fn001"><p><sup>a</sup>
—The results of the program are taken from the Bazinet–Cummings’ paper [<xref rid="pone.0121453.ref056" ref-type="bibr">56</xref>
].</p>
</fn>
<fn id="t005fn002"><p><sup>b</sup>
—The results for <italic>FACS 269bp</italic>
 dataset, where reads with more than 50% of unknown nucleotides (Ns) are filtered out. The values outside the brackets are for the whole dataset.</p>
</fn>
<fn id="t005fn003"><p><italic>FACS 269 bp</italic>
, <italic>MetaPhyler 300 bp</italic>
, and <italic>CARMA 265 bp</italic>
 datasets were classified to phylum level, whilst <italic>PhyloPythia 961 bp</italic>
, <italic>HiSeq 92 bp</italic>
, and <italic>MiSeq 156 bp</italic>
 datasets to the genus level. For CoMeta <italic>micDb</italic>
, the table shows the classification times both to the genus level (ge) and to the earlier phylum level (ph).</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>The running time of CoMeta <italic>micDb</italic>
 when classifying to the genus level for the <italic>PhyloPythia 961 bp</italic>
 dataset, compared with the <italic>HiSeq 92 bp</italic>
 dataset, was only about twice as long, although both the number of reads and their lengths are about ten times larger (hence, the file size is over 100 times larger). This is because loading the <italic>k</italic>
-mer database takes much more time than classifying the reads. Kraken is the fastest of the examined programs. Compared to LMAT, CoMeta was faster when classifying to the phylum level. For classification to the genus level, CoMeta was faster only for the large dataset (<italic>PhyloPythia 961 bp</italic>
), while the small datasets with short reads (<italic>HiSeq 92 bp</italic>
 and <italic>MiSeq 156 bp</italic>
) were classified faster by LMAT.</p>
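<p>The dominance of database loading can be made concrete with a rough back-of-the-envelope sketch. It assumes, purely for illustration, that the running time is a fixed cost plus a cost proportional to the input size, and that the size ratio is exactly 100; the two times (42 and 79 minutes) are the CoMeta <italic>micDb</italic> (ge) entries of Table 5. The function name is hypothetical.</p>
<preformat>
```python
def fit_load_and_rate(time_small, time_large, size_ratio):
    """Solve T = load + rate * size for two runs whose input sizes are
    1 unit and `size_ratio` units (an assumed linear cost model)."""
    rate = (time_large - time_small) / (size_ratio - 1.0)
    load = time_small - rate
    return load, rate


# HiSeq 92 bp: 42 min; PhyloPythia 961 bp: 79 min; assumed size ratio 100.
load, rate = fit_load_and_rate(42.0, 79.0, 100.0)
print(f"fixed cost ~ {load:.1f} min, per-unit cost ~ {rate:.2f} min")
```
</preformat>
<p>Under this simplified model almost all of the time for the small dataset is a fixed cost, which agrees qualitatively with the explanation given above; the model ignores every other factor that may influence the running time.</p>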
</sec>
<sec id="sec017"><title>Databases building</title>
<p>The <italic>k</italic>
-mer/taxonomy databases are built from all reference sequences downloaded from the NCBI website. As discussed earlier, we suggest starting the read classification from the phylum rank. The “raw” genome database used in this study was downloaded in July 2012. The 13 nt files included 261,295 sequences from Archaea, 4,036,205 from Bacteria, 10,205,401 from Eukaryota, 3,127 from Viroids, and 1,175,053 from Viruses. Apart from the 15,681,081 sequences of known origin and defined superkingdom, 509,677 sequences were undefined (for example plasmids, artificial sequences, or environmental samples).</p>
<p>Each sequence had a sequence identifier (<italic>gi</italic>
), which was used to determine its taxonomic identifier (<italic>tax</italic>
). The sequences were divided into groups according to the phylum rank, plus one group each for Viruses and Viroids. Overall, 99 groups were established (cf. <xref rid="pone.0121453.t006" ref-type="table">Table 6</xref>
, row “num groups”).</p>
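<p>The grouping step can be sketched as a walk up the taxonomy tree. This is an illustrative sketch only: the dictionary-based representation, the function names, and all identifiers in the toy data are hypothetical and do not reproduce the actual NCBI files or the scripts used in this study.</p>
<preformat>
```python
from collections import defaultdict


def group_for_taxon(tax_id, parent, rank, name):
    """Walk towards the root; return 'Viruses' or 'Viroids' if the lineage
    passes through one of them, otherwise the phylum name (or None)."""
    phylum = None
    seen = set()  # guards against cycles such as a root that is its own parent
    while tax_id is not None and tax_id not in seen:
        seen.add(tax_id)
        if name.get(tax_id) in ("Viruses", "Viroids"):
            return name[tax_id]
        if phylum is None and rank.get(tax_id) == "phylum":
            phylum = name.get(tax_id)
        tax_id = parent.get(tax_id)
    return phylum


def group_sequences(gi_to_tax, parent, rank, name):
    """Map each group name to the list of sequence identifiers it contains."""
    groups = defaultdict(list)
    for gi, tax_id in gi_to_tax.items():
        group = group_for_taxon(tax_id, parent, rank, name)
        groups[group if group is not None else "undefined"].append(gi)
    return dict(groups)


# Toy taxonomy with made-up identifiers and a shortened lineage.
parent = {1: None, 2: 1, 3: 2, 4: 3, 5: 1, 6: 5}
rank = {2: "superkingdom", 3: "phylum", 4: "species",
        5: "superkingdom", 6: "species"}
name = {1: "root", 2: "Bacteria", 3: "Proteobacteria",
        4: "Escherichia coli", 5: "Viruses", 6: "Influenza A virus"}
gi_to_tax = {101: 4, 102: 4, 103: 6, 104: 1}

groups = group_sequences(gi_to_tax, parent, rank, name)
print(groups)
```
</preformat>
<p>Sequences whose lineage contains neither a phylum nor the Viruses or Viroids node fall into an "undefined" bucket in this sketch, by analogy with the undefined sequences mentioned above.</p>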
<table-wrap id="pone.0121453.t006" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0121453.t006</object-id>
<label>Table 6</label>
<caption><title>Compact <italic>k</italic>
-mer database, where the reads are classified into the phylum rank.</title>
</caption>
<alternatives><graphic id="pone.0121453.t006g" xlink:href="pone.0121453.t006"></graphic>
<table frame="box" rules="all" border="0"><colgroup span="1"><col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
<col align="left" valign="top" span="1"></col>
</colgroup>
<thead><tr><th align="left" rowspan="1" colspan="1"></th>
<th align="left" rowspan="1" colspan="1">Archaea</th>
<th align="left" rowspan="1" colspan="1">Bacteria</th>
<th align="left" rowspan="1" colspan="1">Eukaryota</th>
<th align="left" rowspan="1" colspan="1">Viroids</th>
<th align="left" rowspan="1" colspan="1">Viruses</th>
<th align="left" rowspan="1" colspan="1">Total</th>
</tr>
</thead>
<tbody><tr><td align="left" rowspan="1" colspan="1">num groups</td>
<td align="left" rowspan="1" colspan="1">6</td>
<td align="left" rowspan="1" colspan="1">36</td>
<td align="left" rowspan="1" colspan="1">55</td>
<td align="left" rowspan="1" colspan="1">1</td>
<td align="left" rowspan="1" colspan="1">1</td>
<td align="left" rowspan="1" colspan="1">99</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">num seq</td>
<td align="left" rowspan="1" colspan="1">261,295</td>
<td align="left" rowspan="1" colspan="1">4,036,205</td>
<td align="left" rowspan="1" colspan="1">10,205,401</td>
<td align="left" rowspan="1" colspan="1">3,127</td>
<td align="left" rowspan="1" colspan="1">1,175,053</td>
<td align="left" rowspan="1" colspan="1">15,681,081</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"><italic>k</italic>
 = 15</td>
<td align="left" rowspan="1" colspan="1">1.9 GB</td>
<td align="left" rowspan="1" colspan="1">17.0 GB</td>
<td align="left" rowspan="1" colspan="1">29.9 GB</td>
<td align="left" rowspan="1" colspan="1">1.1 MB</td>
<td align="left" rowspan="1" colspan="1">1.1 GB</td>
<td align="left" rowspan="1" colspan="1">49.9 GB</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"><italic>k</italic>
 = 18</td>
<td align="left" rowspan="1" colspan="1">2.2 GB</td>
<td align="left" rowspan="1" colspan="1">34.4 GB</td>
<td align="left" rowspan="1" colspan="1">93.7 GB</td>
<td align="left" rowspan="1" colspan="1">1.1 MB</td>
<td align="left" rowspan="1" colspan="1">1.4 GB</td>
<td align="left" rowspan="1" colspan="1">131.7 GB</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"><italic>k</italic>
 = 21</td>
<td align="left" rowspan="1" colspan="1">2.3 GB</td>
<td align="left" rowspan="1" colspan="1">37.6 GB</td>
<td align="left" rowspan="1" colspan="1">111.9 GB</td>
<td align="left" rowspan="1" colspan="1">1.2 MB</td>
<td align="left" rowspan="1" colspan="1">1.5 GB</td>
<td align="left" rowspan="1" colspan="1">153.3 GB</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"><italic>k</italic>
 = 24</td>
<td align="left" rowspan="1" colspan="1">2.3 GB</td>
<td align="left" rowspan="1" colspan="1">39.0 GB</td>
<td align="left" rowspan="1" colspan="1">117.4 GB</td>
<td align="left" rowspan="1" colspan="1">1.3 MB</td>
<td align="left" rowspan="1" colspan="1">1.6 GB</td>
<td align="left" rowspan="1" colspan="1">160.4 GB</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"><italic>k</italic>
 = 27</td>
<td align="left" rowspan="1" colspan="1">2.4 GB</td>
<td align="left" rowspan="1" colspan="1">39.3 GB</td>
<td align="left" rowspan="1" colspan="1">120.9 GB</td>
<td align="left" rowspan="1" colspan="1">1.4 MB</td>
<td align="left" rowspan="1" colspan="1">1.7 GB</td>
<td align="left" rowspan="1" colspan="1">164.2 GB</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"><italic>k</italic>
 = 30</td>
<td align="left" rowspan="1" colspan="1">2.4 GB</td>
<td align="left" rowspan="1" colspan="1">39.6 GB</td>
<td align="left" rowspan="1" colspan="1">123.3 GB</td>
<td align="left" rowspan="1" colspan="1">1.4 MB</td>
<td align="left" rowspan="1" colspan="1">1.8 GB</td>
<td align="left" rowspan="1" colspan="1">167.0 GB</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot><fn id="t006fn001"><p>The total size of the compact <italic>k</italic>
-mer databases for groups of the phylum rank at various lengths of <italic>k</italic>
-mer. The number of groups belonging to each superkingdom is given in the first row, and the number of sequences in the second. The sizes of each dataset are provided in <xref ref-type="supplementary-material" rid="pone.0121453.s001">S1 Supporting Information</xref>
.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>In the reported experiments, we divided the sequences into overlapping <italic>k</italic>
-mers of different lengths, <italic>k</italic>
 = 15, 18, 21, 24, 27, 30; hence, we obtained six different database setups. To accelerate loading of the database during classification, we used non-compact databases. The overall sizes of the databases for classification at the phylum rank are presented in <xref rid="pone.0121453.t006" ref-type="table">Table 6</xref>
, together with the number of groups belonging to each superkingdom. The sizes of all non-compact databases that are loaded into RAM during the “Comparison” step (cf. <xref ref-type="fig" rid="pone.0121453.g001">Fig 1</xref>
) are provided in <xref ref-type="supplementary-material" rid="pone.0121453.s001">S1 Supporting Information</xref>
. The largest <italic>k</italic>
-mer database is for the “Chordata” phylum (up to 73 GB for <italic>k</italic>
 = 30); however, in many metagenomic studies the eukaryotes are not investigated at all. For bacteria, the Proteobacteria <italic>k</italic>
-mer database is the largest one (almost 20 GB of RAM is necessary).</p>
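<p>Dividing a group of sequences into overlapping <italic>k</italic>-mers for several values of <italic>k</italic> can be sketched as below. This in-memory version with Python sets only illustrates the idea and is not the indexing program used in the study; skipping <italic>k</italic>-mers that contain unknown nucleotides and ignoring strand orientation are assumptions of the sketch.</p>
<preformat>
```python
def kmers(sequence, k):
    """Yield every overlapping k-mer of `sequence` made only of A, C, G, T."""
    sequence = sequence.upper()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if set(kmer).issubset("ACGT"):  # assumption: skip ambiguous k-mers
            yield kmer


def build_databases(sequences, k_values=(15, 18, 21, 24, 27, 30)):
    """Return {k: set of unique k-mers} for one group of reference sequences."""
    return {k: {kmer for seq in sequences for kmer in kmers(seq, k)}
            for k in k_values}


# Toy sequences and small k values so that the result is easy to inspect.
dbs = build_databases(["ACGTACGT", "TTACGN"], k_values=(3, 4))
sizes = {k: len(db) for k, db in dbs.items()}
print(sizes)
```
</preformat>
<p>Each set holds only the unique <italic>k</italic>-mers of a group, so a <italic>k</italic>-mer that occurs many times in the reference sequences is stored once.</p>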
<p>The dependence of the database size on the number of unique <italic>k</italic>
-mers (which appeared at least once in the <italic>G</italic>
<sub><italic>i</italic>
</sub>
 group) is shown in <xref ref-type="supplementary-material" rid="pone.0121453.s001">S1 Supporting Information</xref>
. Approximately, the relationship between <italic>k</italic>
 and the database size is linear. The size of the non-compact database is approximately equal to the compact one for <italic>k</italic>
 = 30.</p>
</sec>
</sec>
<sec id="sec018"><title>Conclusions and future work</title>
<p>In this paper, we proposed a new method for classification of reads to a taxonomic rank. First, the groups of reference sequences (each derived from a single taxon) are divided into overlapping <italic>k</italic>
-mers (short substrings), from which the databases are built. Each database is subsequently used to check the similarity between the query read and the group that the database represents. We classify reads from the root towards the leaves of the taxonomic tree, which accelerates program execution, since the read does not have to be compared with each reference sequence. The presented experimental results show that our approach is competitive and outperforms many popular alternative programs. The results also indicate how important it is to select the length of <italic>k</italic>
-mers properly. When <italic>k</italic>
 is too small, too many reads are misclassified, while too large a <italic>k</italic>
 increases the number of unclassified reads. The downside of our method is that it needs a lot of RAM when large <italic>k</italic>
-mer databases are used. For classification at the phylum level, using the largest set of <italic>k</italic>
-mers for Proteobacteria, about 20 GB is required. CoMeta is slower than the very recently published Kraken program. However, when a conflict occurs and a read is classified to several groups, CoMeta returns information about all of them, unlike Kraken and LMAT, which cut off the branch and classify the read to a higher level.</p>
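<p>The top-down strategy summarized above can be sketched in a few lines. This is a simplified illustration rather than the CoMeta implementation: the score (the number of read positions covered by matching <italic>k</italic>-mers), the <monospace>min_fraction</monospace> threshold, the tie rule, and the toy tree and databases are all assumptions introduced here.</p>
<preformat>
```python
def match_score(read, kmer_set, k):
    """Count read positions covered by at least one k-mer found in kmer_set."""
    covered = [False] * len(read)
    for start in range(len(read) - k + 1):
        if read[start:start + k] in kmer_set:
            for pos in range(start, start + k):
                covered[pos] = True
    return sum(covered)


def classify(read, node, children, databases, k, min_fraction=0.5):
    """Descend from `node` and return every deepest group reached.

    Only the best-scoring child groups are followed; when several children
    tie, all of them are followed, so a conflict yields several groups.
    """
    scores = {child: match_score(read, databases[child], k)
              for child in children.get(node, [])}
    best = max(scores.values(), default=0)
    if best == 0 or min_fraction * len(read) > best:
        return [node]  # no child is similar enough; stop at this level
    results = []
    for child, score in scores.items():
        if score == best:
            results.extend(
                classify(read, child, children, databases, k, min_fraction))
    return results


# Toy taxonomy and k-mer sets (hypothetical content, k = 4).
children = {"root": ["Proteobacteria", "Firmicutes"],
            "Proteobacteria": ["Escherichia", "Salmonella"]}
shared = {"ACGT", "CGTA", "GTAC", "TACG", "ACGA"}
databases = {"Proteobacteria": shared, "Firmicutes": {"ACGT"},
             "Escherichia": shared, "Salmonella": shared}

print(classify("ACGTACGA", "root", children, databases, k=4))
print(classify("TTTTTTTT", "root", children, databases, k=4))
```
</preformat>
<p>In the toy example the first read matches two genera equally well and both are reported, whereas the second read matches nothing and stays at the root, which this sketch uses to denote an unclassified read. Only the children of the selected groups are examined at each level, which is why the read is never compared with every reference group.</p>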
<p>Our ongoing research includes examining the influence of the length of the reference sequences (derived from one group) on the best value of the <italic>k</italic>
 parameter, so that it can be selected automatically. Furthermore, we intend to take into consideration not only the number of matched nucleotides (match scores), but also the number of deletions and insertions.</p>
</sec>
<sec sec-type="supplementary-material" id="sec019"><title>Supporting Information</title>
<supplementary-material content-type="local-data" id="pone.0121453.s001"><label>S1 Supporting Information</label>
<caption><title>Additional tables and figures of the experimental results.</title>
<p>(PDF)</p>
</caption>
<media xlink:href="pone.0121453.s001.pdf"><caption><p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back><ref-list><title>References</title>
<ref id="pone.0121453.ref001"><label>1</label>
<mixed-citation publication-type="journal"><name><surname>Handelsman</surname>
<given-names>J</given-names>
</name>
, <name><surname>Rondon</surname>
<given-names>MR</given-names>
</name>
, <name><surname>Brady</surname>
<given-names>SF</given-names>
</name>
, <name><surname>Clardy</surname>
<given-names>J</given-names>
</name>
, <name><surname>Goodman</surname>
<given-names>RM</given-names>
</name>
. <article-title>Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products</article-title>
. <source>Chemistry &amp; Biology</source>
. <year>1998</year>
;<volume>5</volume>
(<issue>10</issue>
).</mixed-citation>
</ref>
<ref id="pone.0121453.ref002"><label>2</label>
<mixed-citation publication-type="journal"><name><surname>Pace</surname>
<given-names>NR</given-names>
</name>
, <name><surname>Stahl</surname>
<given-names>DA</given-names>
</name>
, <name><surname>Olsen</surname>
<given-names>GJ</given-names>
</name>
. <article-title>Analyzing natural microbial populations by rRNA sequences</article-title>
. <source>ASM News</source>
. <year>1985</year>
;<volume>51</volume>
:<fpage>4</fpage>
–<lpage>12</lpage>
.</mixed-citation>
</ref>
<ref id="pone.0121453.ref003"><label>3</label>
<mixed-citation publication-type="journal"><name><surname>Handelsman</surname>
<given-names>J</given-names>
</name>
. <article-title>Metagenomics: application of genomics to uncultured microorganisms</article-title>
. <source>Microbiology and Molecular Biology Reviews</source>
. <year>2004</year>
;<volume>68</volume>
(<issue>4</issue>
):<fpage>669</fpage>
–<lpage>685</lpage>
. <pub-id pub-id-type="doi">10.1128/MMBR.68.4.669-685.2004</pub-id>
<pub-id pub-id-type="pmid">15590779</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref004"><label>4</label>
<mixed-citation publication-type="journal"><name><surname>Simon</surname>
<given-names>C</given-names>
</name>
, <name><surname>Daniel</surname>
<given-names>R</given-names>
</name>
. <article-title>Metagenomic Analyses: Past and Future Trends</article-title>
. <source>Applied and Environmental Microbiology</source>
. <year>2011</year>
;<volume>77</volume>
(<issue>4</issue>
):<fpage>1153</fpage>
–<lpage>1161</lpage>
. <pub-id pub-id-type="doi">10.1128/AEM.02345-10</pub-id>
<pub-id pub-id-type="pmid">21169428</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref005"><label>5</label>
<mixed-citation publication-type="book"><collab>Committee on Metagenomics: Challenges and Functional Applications NRC</collab>
. <source>The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet</source>
. <publisher-name>The National Academies Press</publisher-name>
; <year>2007</year>
.</mixed-citation>
</ref>
<ref id="pone.0121453.ref006"><label>6</label>
<mixed-citation publication-type="journal"><name><surname>Rousk</surname>
<given-names>J</given-names>
</name>
, <name><surname>Baath</surname>
<given-names>E</given-names>
</name>
, <name><surname>Brookes</surname>
<given-names>PC</given-names>
</name>
, <name><surname>Lauber</surname>
<given-names>CL</given-names>
</name>
, <name><surname>Lozupone</surname>
<given-names>C</given-names>
</name>
, <name><surname>Caporaso</surname>
<given-names>JG</given-names>
</name>
, <etal>et al</etal>
. <article-title>Soil bacterial and fungal communities across a pH gradient in an arable soil</article-title>
. <source>The ISME Journal</source>
. <year>2010</year>
;<volume>4</volume>
:<fpage>1340</fpage>
–<lpage>1351</lpage>
. <pub-id pub-id-type="doi">10.1038/ismej.2010.58</pub-id>
<pub-id pub-id-type="pmid">20445636</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref007"><label>7</label>
<mixed-citation publication-type="journal"><name><surname>Fierer</surname>
<given-names>N</given-names>
</name>
, <name><surname>Leff</surname>
<given-names>J</given-names>
</name>
, <name><surname>Adams</surname>
<given-names>B</given-names>
</name>
, <name><surname>Nielsen</surname>
<given-names>U</given-names>
</name>
, <name><surname>Bates</surname>
<given-names>S</given-names>
</name>
, <name><surname>Lauber</surname>
<given-names>C</given-names>
</name>
, <etal>et al</etal>
. <article-title>Cross-biome metagenomic analyses of soil microbial communities and their functional attributes</article-title>
. <source>Proceedings of the National Academy of Sciences of the United States of America</source>
. <year>2012</year>
;<volume>109</volume>
(<issue>52</issue>
). <pub-id pub-id-type="doi">10.1073/pnas.1215210110</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref008"><label>8</label>
<mixed-citation publication-type="journal"><name><surname>Abbai</surname>
<given-names>N</given-names>
</name>
, <name><surname>Govender</surname>
<given-names>A</given-names>
</name>
, <name><surname>Shaik</surname>
<given-names>R</given-names>
</name>
, <name><surname>Pillay</surname>
<given-names>B</given-names>
</name>
. <article-title>Pyrosequence analysis of unamplified and whole genome amplified DNA from hydrocarbon-contaminated groundwater</article-title>
. <source>Mol Biotechnol</source>
. <year>2011</year>
;<volume>50</volume>
:<fpage>39</fpage>
–<lpage>48</lpage>
. <pub-id pub-id-type="doi">10.1007/s12033-011-9412-8</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref009"><label>9</label>
<mixed-citation publication-type="journal"><name><surname>Kennedy</surname>
<given-names>J</given-names>
</name>
, <name><surname>O’Leary</surname>
<given-names>ND</given-names>
</name>
, <name><surname>Kiran</surname>
<given-names>GS</given-names>
</name>
, <name><surname>Morrissey</surname>
<given-names>JP</given-names>
</name>
, <name><surname>O’Gara</surname>
<given-names>F</given-names>
</name>
, <name><surname>Selvin</surname>
<given-names>J</given-names>
</name>
, <etal>et al</etal>
. <article-title>Functional metagenomic strategies for the discovery of novel enzymes and biosurfactants with biotechnological applications from marine ecosystems</article-title>
. <source>Journal of Applied Microbiology</source>
. <year>2011</year>
;<volume>111</volume>
(<issue>4</issue>
):<fpage>787</fpage>
–<lpage>799</lpage>
. <pub-id pub-id-type="doi">10.1111/j.1365-2672.2011.05106.x</pub-id>
<pub-id pub-id-type="pmid">21777355</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref010"><label>10</label>
<mixed-citation publication-type="journal"><name><surname>Gilbert</surname>
<given-names>J</given-names>
</name>
, <name><surname>Field</surname>
<given-names>D</given-names>
</name>
, <name><surname>Huang</surname>
<given-names>Y</given-names>
</name>
, <name><surname>Edwards</surname>
<given-names>R</given-names>
</name>
, <name><surname>Li</surname>
<given-names>W</given-names>
</name>
, <name><surname>Gilna</surname>
<given-names>P</given-names>
</name>
, <etal>et al</etal>
. <article-title>Detection of large numbers of novel sequences in the metatranscriptomes of complex marine microbial communities</article-title>
. <source>PLoS ONE</source>
. <year>2008</year>
;<volume>3</volume>
(<issue>8</issue>
). <pub-id pub-id-type="doi">10.1371/journal.pone.0003042</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref011"><label>11</label>
<mixed-citation publication-type="journal"><name><surname>Yergeau</surname>
<given-names>E</given-names>
</name>
, <name><surname>Lawrence</surname>
<given-names>JR</given-names>
</name>
, <name><surname>Waiser</surname>
<given-names>MJ</given-names>
</name>
, <name><surname>Korber</surname>
<given-names>DR</given-names>
</name>
, <name><surname>Greer</surname>
<given-names>CW</given-names>
</name>
. <article-title>Metatranscriptomic analysis of the response of river biofilms to pharmaceutical products, using anonymous DNA microarrays</article-title>
. <source>Applied and Environmental Microbiology</source>
. <year>2010</year>
;<volume>76</volume>
(<issue>16</issue>
):<fpage>5432</fpage>
–<lpage>5439</lpage>
. <pub-id pub-id-type="doi">10.1128/AEM.00873-10</pub-id>
<pub-id pub-id-type="pmid">20562274</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref012"><label>12</label>
<mixed-citation publication-type="journal"><name><surname>Rhee</surname>
<given-names>JK</given-names>
</name>
, <name><surname>Ahn</surname>
<given-names>DG</given-names>
</name>
, <name><surname>Kim</surname>
<given-names>YG</given-names>
</name>
, <name><surname>Oh</surname>
<given-names>JW</given-names>
</name>
. <article-title>New thermophilic and thermostable esterase with sequence similarity to the hormone-sensitive lipase family, cloned from a metagenomic library</article-title>
. <source>Applied and Environmental Microbiology</source>
. <year>2005</year>
;<volume>71</volume>
(<issue>2</issue>
):<fpage>817</fpage>
–<lpage>825</lpage>
. <pub-id pub-id-type="doi">10.1128/AEM.71.2.817-825.2005</pub-id>
<pub-id pub-id-type="pmid">15691936</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref013"><label>13</label>
<mixed-citation publication-type="journal"><name><surname>Simon</surname>
<given-names>C</given-names>
</name>
, <name><surname>Wiezer</surname>
<given-names>A</given-names>
</name>
, <name><surname>Strittmatter</surname>
<given-names>AW</given-names>
</name>
, <name><surname>Daniel</surname>
<given-names>R</given-names>
</name>
. <article-title>Phylogenetic diversity and metabolic potential revealed in a glacier ice metagenome</article-title>
. <source>Applied and Environmental Microbiology</source>
. <year>2009</year>
;<volume>75</volume>
(<issue>23</issue>
):<fpage>7519</fpage>
–<lpage>7526</lpage>
. <pub-id pub-id-type="doi">10.1128/AEM.00946-09</pub-id>
<pub-id pub-id-type="pmid">19801459</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref014"><label>14</label>
<mixed-citation publication-type="journal"><name><surname>Heath</surname>
<given-names>C</given-names>
</name>
, <name><surname>Hu</surname>
<given-names>XPP</given-names>
</name>
, <name><surname>Cary</surname>
<given-names>SC</given-names>
</name>
, <name><surname>Cowan</surname>
<given-names>D</given-names>
</name>
. <article-title>Identification of a novel alkaliphilic esterase active at low temperatures by screening a metagenomic library from Antarctic desert soil</article-title>
. <source>Applied and Environmental Microbiology</source>
. <year>2009</year>
;<volume>75</volume>
(<issue>13</issue>
):<fpage>4657</fpage>
–<lpage>4659</lpage>
. <pub-id pub-id-type="doi">10.1128/AEM.02597-08</pub-id>
<pub-id pub-id-type="pmid">19411411</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref015"><label>15</label>
<mixed-citation publication-type="journal"><name><surname>Nguyen</surname>
<given-names>NH</given-names>
</name>
, <name><surname>Maruset</surname>
<given-names>L</given-names>
</name>
, <name><surname>Uengwetwanit</surname>
<given-names>T</given-names>
</name>
, <name><surname>Mhuantong</surname>
<given-names>W</given-names>
</name>
, <name><surname>Harnpicharnchai</surname>
<given-names>P</given-names>
</name>
, <name><surname>Champreda</surname>
<given-names>V</given-names>
</name>
, <etal>et al</etal>
. <article-title>Identification and characterization of a cellulase-encoding gene from the buffalo rumen metagenomic library</article-title>
. <source>Bioscience, Biotechnology and Biochemistry</source>
. <year>2012</year>
;<volume>76</volume>
(<issue>6</issue>
):<fpage>1075</fpage>
–<lpage>1084</lpage>
.</mixed-citation>
</ref>
<ref id="pone.0121453.ref016"><label>16</label>
<mixed-citation publication-type="journal"><name><surname>Hess</surname>
<given-names>M</given-names>
</name>
, <name><surname>Sczyrba</surname>
<given-names>A</given-names>
</name>
, <name><surname>Egan</surname>
<given-names>R</given-names>
</name>
, <name><surname>Kim</surname>
<given-names>T</given-names>
</name>
, <name><surname>Chokhawala</surname>
<given-names>H</given-names>
</name>
, <name><surname>Schroth</surname>
<given-names>G</given-names>
</name>
, <etal>et al</etal>
. <article-title>Metagenomic discovery of biomass-degrading genes and genomes from cow rumen</article-title>
. <source>Science</source>
. <year>2011</year>
;<volume>331</volume>
(<issue>6016</issue>
):<fpage>463</fpage>
–<lpage>467</lpage>
. <pub-id pub-id-type="doi">10.1126/science.1200387</pub-id>
<pub-id pub-id-type="pmid">21273488</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref017"><label>17</label>
<mixed-citation publication-type="journal"><name><surname>Qin</surname>
<given-names>J</given-names>
</name>
, <name><surname>Li</surname>
<given-names>R</given-names>
</name>
, <name><surname>Raes</surname>
<given-names>J</given-names>
</name>
, <name><surname>Arumugam</surname>
<given-names>M</given-names>
</name>
, <name><surname>Burgdorf</surname>
<given-names>K</given-names>
</name>
, <name><surname>Manichanh</surname>
<given-names>C</given-names>
</name>
, <etal>et al</etal>
. <article-title>A human gut microbial gene catalogue established by metagenomic sequencing</article-title>
. <source>Nature</source>
. <year>2010</year>
;<volume>464</volume>
(<issue>7285</issue>
):<fpage>59</fpage>
–<lpage>65</lpage>
. <pub-id pub-id-type="doi">10.1038/nature08821</pub-id>
<pub-id pub-id-type="pmid">20203603</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref018"><label>18</label>
<mixed-citation publication-type="journal"><name><surname>Kuczynski</surname>
<given-names>J</given-names>
</name>
, <name><surname>Lauber</surname>
<given-names>CL</given-names>
</name>
, <name><surname>Walters</surname>
<given-names>WA</given-names>
</name>
, <name><surname>Parfrey</surname>
<given-names>LW</given-names>
</name>
, <name><surname>Clemente</surname>
<given-names>JC</given-names>
</name>
, <name><surname>Gevers</surname>
<given-names>D</given-names>
</name>
, <etal>et al</etal>
. <article-title>Experimental and analytical tools for studying the human microbiome</article-title>
. <source>Nat Rev Genet</source>
. <year>2012</year>
<month>1</month>
;<volume>13</volume>
(<issue>1</issue>
):<fpage>47</fpage>
–<lpage>58</lpage>
. <pub-id pub-id-type="doi">10.1038/nrg3129</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref019"><label>19</label>
<mixed-citation publication-type="journal"><name><surname>Bruls</surname>
<given-names>T</given-names>
</name>
, <name><surname>Weissenbach</surname>
<given-names>J</given-names>
</name>
. <article-title>The human metagenome: our other genome?</article-title>
<source>Human Molecular Genetics</source>
. <year>2011</year>
;<volume>20</volume>
:<fpage>142</fpage>
–<lpage>148</lpage>
. <pub-id pub-id-type="doi">10.1093/hmg/ddr353</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref020"><label>20</label>
<mixed-citation publication-type="journal"><collab>NIH HMP Working Group</collab>
, <name><surname>Peterson</surname>
<given-names>J</given-names>
</name>
, <name><surname>Garges</surname>
<given-names>S</given-names>
</name>
, <name><surname>Giovanni</surname>
<given-names>M</given-names>
</name>
, <name><surname>McInnes</surname>
<given-names>P</given-names>
</name>
, <name><surname>Wang</surname>
<given-names>L</given-names>
</name>
, <etal>et al</etal>
. <article-title>The NIH Human Microbiome Project</article-title>
. <source>Genome Research</source>
. <year>2009</year>
;<volume>19</volume>
(<issue>12</issue>
):<fpage>2317</fpage>
–<lpage>2323</lpage>
. <pub-id pub-id-type="doi">10.1101/gr.096651.109</pub-id>
<pub-id pub-id-type="pmid">19819907</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref021"><label>21</label>
<mixed-citation publication-type="journal"><name><surname>Thomas</surname>
<given-names>T</given-names>
</name>
, <name><surname>Gilbert</surname>
<given-names>J</given-names>
</name>
, <name><surname>Meyer</surname>
<given-names>F</given-names>
</name>
. <article-title>Metagenomics–a guide from sampling to data analysis</article-title>
. <source>Microbial Informatics and Experimentation</source>
. <year>2012</year>
;<volume>2</volume>
(<issue>1</issue>
):<fpage>3</fpage>
. <pub-id pub-id-type="doi">10.1186/2042-5783-2-3</pub-id>
<pub-id pub-id-type="pmid">22587947</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref022"><label>22</label>
<mixed-citation publication-type="journal"><name><surname>Kunin</surname>
<given-names>V</given-names>
</name>
, <name><surname>Copeland</surname>
<given-names>A</given-names>
</name>
, <name><surname>Lapidus</surname>
<given-names>A</given-names>
</name>
, <name><surname>Mavromatis</surname>
<given-names>K</given-names>
</name>
, <name><surname>Hugenholtz</surname>
<given-names>P</given-names>
</name>
. <article-title>A Bioinformatician’s Guide to Metagenomics</article-title>
. <source>Microbiol Mol Biol Rev</source>
. <year>2008</year>
;<volume>72</volume>
(<issue>4</issue>
):<fpage>557</fpage>
–<lpage>578</lpage>
. <pub-id pub-id-type="doi">10.1128/MMBR.00009-08</pub-id>
<pub-id pub-id-type="pmid">19052320</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref023"><label>23</label>
<mixed-citation publication-type="journal"><name><surname>Sanger</surname>
<given-names>F</given-names>
</name>
, <name><surname>Nicklen</surname>
<given-names>S</given-names>
</name>
, <name><surname>Coulson</surname>
<given-names>AR</given-names>
</name>
. <article-title>DNA sequencing with chain-terminating inhibitors</article-title>
. <source>Proceedings of the National Academy of Sciences of the United States of America</source>
. <year>1977</year>
;<volume>74</volume>
(<issue>12</issue>
):<fpage>5463</fpage>
–<lpage>5467</lpage>
. <pub-id pub-id-type="doi">10.1073/pnas.74.12.5463</pub-id>
<pub-id pub-id-type="pmid">271968</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref024"><label>24</label>
<mixed-citation publication-type="journal"><name><surname>Metzker</surname>
<given-names>ML</given-names>
</name>
. <article-title>Sequencing technologies – the next generation</article-title>
. <source>Nature Reviews Genetics</source>
. <year>2010</year>
;<volume>11</volume>
(<issue>1</issue>
):<fpage>31</fpage>
–<lpage>46</lpage>
. <pub-id pub-id-type="doi">10.1038/nrg2626</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref025"><label>25</label>
<mixed-citation publication-type="journal"><name><surname>Nalbantoglu</surname>
<given-names>U</given-names>
</name>
, <name><surname>Cakar</surname>
<given-names>A</given-names>
</name>
, <name><surname>Dogan</surname>
<given-names>H</given-names>
</name>
, <name><surname>Abaci</surname>
<given-names>N</given-names>
</name>
, <name><surname>Ustek</surname>
<given-names>D</given-names>
</name>
, <name><surname>Sayood</surname>
<given-names>K</given-names>
</name>
, <etal>et al</etal>
. <article-title>Metagenomic analysis of the microbial community in kefir grains</article-title>
. <source>Food Microbiology</source>
. <year>2014</year>
;<volume>41</volume>
:<fpage>42</fpage>
–<lpage>51</lpage>
. <pub-id pub-id-type="doi">10.1016/j.fm.2014.01.014</pub-id>
<pub-id pub-id-type="pmid">24750812</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref026"><label>26</label>
<mixed-citation publication-type="journal"><name><surname>Wang</surname>
<given-names>Z</given-names>
</name>
, <name><surname>Yang</surname>
<given-names>J</given-names>
</name>
, <name><surname>Zhou</surname>
<given-names>J</given-names>
</name>
, <name><surname>Zhang</surname>
<given-names>C</given-names>
</name>
, <name><surname>Su</surname>
<given-names>X</given-names>
</name>
, <name><surname>Li</surname>
<given-names>T</given-names>
</name>
. <article-title>Composition and structure of bacterial communities in waste water of aquatic products processing factories</article-title>
. <source>Research Journal of Biotechnology</source>
. <year>2014</year>
;<volume>9</volume>
(<issue>2</issue>
):<fpage>65</fpage>
–<lpage>70</lpage>
.</mixed-citation>
</ref>
<ref id="pone.0121453.ref027"><label>27</label>
<mixed-citation publication-type="journal"><name><surname>Shafquat</surname>
<given-names>A</given-names>
</name>
, <name><surname>Joice</surname>
<given-names>R</given-names>
</name>
, <name><surname>Simmons</surname>
<given-names>SL</given-names>
</name>
, <name><surname>Huttenhower</surname>
<given-names>C</given-names>
</name>
. <article-title>Functional and phylogenetic assembly of microbial communities in the human microbiome</article-title>
. <source>Trends in Microbiology</source>
. <year>2014</year>
;<volume>22</volume>
(<issue>5</issue>
):<fpage>261</fpage>
–<lpage>266</lpage>
. <pub-id pub-id-type="doi">10.1016/j.tim.2014.01.011</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref028"><label>28</label>
<mixed-citation publication-type="journal"><name><surname>Hauser</surname>
<given-names>PM</given-names>
</name>
, <name><surname>Bernard</surname>
<given-names>T</given-names>
</name>
, <name><surname>Greub</surname>
<given-names>G</given-names>
</name>
, <name><surname>Jaton</surname>
<given-names>K</given-names>
</name>
, <name><surname>Pagni</surname>
<given-names>M</given-names>
</name>
, <name><surname>Hafen</surname>
<given-names>GM</given-names>
</name>
. <article-title>Microbiota present in cystic fibrosis lungs as revealed by whole genome sequencing</article-title>
. <source>PLoS ONE</source>
. <year>2014</year>
;<volume>9</volume>
(<issue>3</issue>
). <pub-id pub-id-type="doi">10.1371/journal.pone.0090934</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref029"><label>29</label>
<mixed-citation publication-type="journal"><name><surname>Benson</surname>
<given-names>DA</given-names>
</name>
, <name><surname>Cavanaugh</surname>
<given-names>M</given-names>
</name>
, <name><surname>Clark</surname>
<given-names>K</given-names>
</name>
, <name><surname>Karsch-Mizrachi</surname>
<given-names>I</given-names>
</name>
, <name><surname>Lipman</surname>
<given-names>DJ</given-names>
</name>
, <name><surname>Ostell</surname>
<given-names>J</given-names>
</name>
, <etal>et al</etal>
. <article-title>GenBank</article-title>
. <source>Nucleic Acids Research</source>
. <year>2013</year>
;<volume>41</volume>
(<issue>D1</issue>
):<fpage>D36</fpage>
–<lpage>D42</lpage>
. <pub-id pub-id-type="doi">10.1093/nar/gks1195</pub-id>
<pub-id pub-id-type="pmid">23193287</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref030"><label>30</label>
<mixed-citation publication-type="journal"><name><surname>Fierer</surname>
<given-names>N</given-names>
</name>
, <name><surname>Breitbart</surname>
<given-names>M</given-names>
</name>
, <name><surname>Nulton</surname>
<given-names>J</given-names>
</name>
, <name><surname>Salamon</surname>
<given-names>P</given-names>
</name>
, <name><surname>Lozupone</surname>
<given-names>C</given-names>
</name>
, <name><surname>Jones</surname>
<given-names>R</given-names>
</name>
, <etal>et al</etal>
. <article-title>Metagenomic and small-subunit rRNA analyses reveal the genetic diversity of bacteria, archaea, fungi, and viruses in soil</article-title>
. <source>Applied and Environmental Microbiology</source>
. <year>2007</year>
;<volume>73</volume>
(<issue>21</issue>
):<fpage>7059</fpage>
–<lpage>7066</lpage>
. <pub-id pub-id-type="doi">10.1128/AEM.00358-07</pub-id>
<pub-id pub-id-type="pmid">17827313</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref031"><label>31</label>
<mixed-citation publication-type="journal"><name><surname>Simister</surname>
<given-names>R</given-names>
</name>
, <name><surname>Taylor</surname>
<given-names>MW</given-names>
</name>
, <name><surname>Tsai</surname>
<given-names>P</given-names>
</name>
, <name><surname>Fan</surname>
<given-names>L</given-names>
</name>
, <name><surname>Bruxner</surname>
<given-names>TJ</given-names>
</name>
, <name><surname>Crowe</surname>
<given-names>ML</given-names>
</name>
, <etal>et al</etal>
. <article-title>Thermal stress responses in the bacterial biosphere of the Great Barrier Reef sponge, Rhopaloeides odorabile</article-title>
. <source>Environmental Microbiology</source>
. <year>2012</year>
;<volume>14</volume>
(<issue>12</issue>
):<fpage>3232</fpage>
–<lpage>3246</lpage>
. <pub-id pub-id-type="doi">10.1111/1462-2920.12010</pub-id>
<pub-id pub-id-type="pmid">23106937</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref032"><label>32</label>
<mixed-citation publication-type="journal"><name><surname>Krogius-Kurikka</surname>
<given-names>L</given-names>
</name>
, <name><surname>Kassinen</surname>
<given-names>A</given-names>
</name>
, <name><surname>Paulin</surname>
<given-names>L</given-names>
</name>
, <name><surname>Corander</surname>
<given-names>J</given-names>
</name>
, <name><surname>Makivuokko</surname>
<given-names>H</given-names>
</name>
, <name><surname>Tuimala</surname>
<given-names>J</given-names>
</name>
, <etal>et al</etal>
. <article-title>Sequence analysis of percent G+C fraction libraries of human faecal bacterial DNA reveals a high number of Actinobacteria</article-title>
. <source>BMC Microbiology</source>
. <year>2009</year>
;<volume>9</volume>
. <pub-id pub-id-type="doi">10.1186/1471-2180-9-68</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref033"><label>33</label>
<mixed-citation publication-type="journal"><name><surname>Wang</surname>
<given-names>J</given-names>
</name>
, <name><surname>McLenachan</surname>
<given-names>PA</given-names>
</name>
, <name><surname>Biggs</surname>
<given-names>PJ</given-names>
</name>
, <name><surname>Winder</surname>
<given-names>LH</given-names>
</name>
, <name><surname>Schoenfeld</surname>
<given-names>BIK</given-names>
</name>
, <name><surname>Narayan</surname>
<given-names>VV</given-names>
</name>
, <etal>et al</etal>
. <article-title>Environmental bio-monitoring with high-throughput sequencing</article-title>
. <source>Briefings in Bioinformatics</source>
. <year>2013</year>
;<volume>14</volume>
(<issue>5</issue>
):<fpage>575</fpage>
–<lpage>588</lpage>
. <pub-id pub-id-type="doi">10.1093/bib/bbt032</pub-id>
<pub-id pub-id-type="pmid">23677899</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref034"><label>34</label>
<mixed-citation publication-type="journal"><name><surname>Brady</surname>
<given-names>A</given-names>
</name>
, <name><surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
. <article-title>Phymm and PhymmBL: Metagenomic phylogenetic classification with interpolated Markov models</article-title>
. <source>Nature Methods</source>
. <year>2009</year>
;<volume>6</volume>
(<issue>9</issue>
):<fpage>673</fpage>
–<lpage>676</lpage>
. <pub-id pub-id-type="doi">10.1038/nmeth.1358</pub-id>
<pub-id pub-id-type="pmid">19648916</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref035"><label>35</label>
<mixed-citation publication-type="journal"><name><surname>Diaz</surname>
<given-names>NN</given-names>
</name>
, <name><surname>Krause</surname>
<given-names>L</given-names>
</name>
, <name><surname>Goesmann</surname>
<given-names>A</given-names>
</name>
, <name><surname>Niehaus</surname>
<given-names>K</given-names>
</name>
, <name><surname>Nattkemper</surname>
<given-names>TW</given-names>
</name>
. <article-title>TACOA–Taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach</article-title>
. <source>BMC Bioinformatics</source>
. <year>2009</year>
;<volume>10</volume>
. <pub-id pub-id-type="doi">10.1186/1471-2105-10-56</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref036"><label>36</label>
<mixed-citation publication-type="journal"><name><surname>Rosen</surname>
<given-names>GL</given-names>
</name>
, <name><surname>Reichenberger</surname>
<given-names>ER</given-names>
</name>
, <name><surname>Rosenfeld</surname>
<given-names>AM</given-names>
</name>
. <article-title>NBC: The naive Bayes classification tool webserver for taxonomic classification of metagenomic reads</article-title>
. <source>Bioinformatics</source>
. <year>2011</year>
;<volume>27</volume>
(<issue>1</issue>
):<fpage>127</fpage>
–<lpage>129</lpage>
. <pub-id pub-id-type="doi">10.1093/bioinformatics/btq619</pub-id>
<pub-id pub-id-type="pmid">21062764</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref037"><label>37</label>
<mixed-citation publication-type="journal"><name><surname>Patil</surname>
<given-names>KR</given-names>
</name>
, <name><surname>Haider</surname>
<given-names>P</given-names>
</name>
, <name><surname>Pope</surname>
<given-names>PB</given-names>
</name>
, <name><surname>Turnbaugh</surname>
<given-names>PJ</given-names>
</name>
, <name><surname>Morrison</surname>
<given-names>M</given-names>
</name>
, <name><surname>Scheffer</surname>
<given-names>T</given-names>
</name>
, <etal>et al</etal>
. <article-title>Taxonomic metagenome sequence assignment with structured output models</article-title>
. <source>Nature Methods</source>
. <year>2011</year>
;<volume>8</volume>
(<issue>3</issue>
):<fpage>191</fpage>
–<lpage>192</lpage>
. <pub-id pub-id-type="doi">10.1038/nmeth0311-191</pub-id>
<pub-id pub-id-type="pmid">21358620</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref038"><label>38</label>
<mixed-citation publication-type="journal"><name><surname>Cui</surname>
<given-names>H</given-names>
</name>
, <name><surname>Zhang</surname>
<given-names>X</given-names>
</name>
. <article-title>Alignment-free supervised classification of metagenomes by recursive SVM</article-title>
. <source>BMC Genomics</source>
. <year>2013</year>
;<volume>14</volume>
(<issue>1</issue>
). <pub-id pub-id-type="doi">10.1186/1471-2164-14-641</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref039"><label>39</label>
<mixed-citation publication-type="book"><name><surname>Kawulok</surname>
<given-names>M</given-names>
</name>
, <name><surname>Nalepa</surname>
<given-names>J</given-names>
</name>
. <chapter-title>Support Vector Machines Training Data Selection Using a Genetic Algorithm</chapter-title>
 In: <name><surname>Gimel’farb</surname>
<given-names>G</given-names>
</name>
, <name><surname>Hancock</surname>
<given-names>E</given-names>
</name>
, <name><surname>Imiya</surname>
<given-names>A</given-names>
</name>
, <name><surname>Kuijper</surname>
<given-names>A</given-names>
</name>
, <name><surname>Kudo</surname>
<given-names>M</given-names>
</name>
, <name><surname>Omachi</surname>
<given-names>S</given-names>
</name>
, <etal>et al</etal>
, editors. <source>Structural, Syntactic, and Statistical Pattern Recognition. vol. 7626 of Lecture Notes in Computer Science</source>
. <publisher-name>Springer Berlin Heidelberg</publisher-name>
; <year>2012</year>
 p. <fpage>557</fpage>
–<lpage>565</lpage>
.</mixed-citation>
</ref>
<ref id="pone.0121453.ref040"><label>40</label>
<mixed-citation publication-type="book"><name><surname>Cyran</surname>
<given-names>KA</given-names>
</name>
, <name><surname>Kawulok</surname>
<given-names>J</given-names>
</name>
, <name><surname>Kawulok</surname>
<given-names>M</given-names>
</name>
, <name><surname>Stawarz</surname>
<given-names>M</given-names>
</name>
, <name><surname>Michalak</surname>
<given-names>M</given-names>
</name>
, <name><surname>Pietrowska</surname>
<given-names>M</given-names>
</name>
, <etal>et al</etal>
. <chapter-title>Support Vector Machines in Biomedical and Biometrical Applications</chapter-title>
 In: <name><surname>Ramanna</surname>
<given-names>S</given-names>
</name>
, <name><surname>Jain</surname>
<given-names>LC</given-names>
</name>
, <name><surname>Howlett</surname>
<given-names>RJ</given-names>
</name>
, editors. <source>Emerging Paradigms in Machine Learning. vol. 13 of Smart Innovation, Systems and Technologies</source>
. <publisher-name>Springer Berlin Heidelberg</publisher-name>
; <year>2013</year>
 p. <fpage>379</fpage>
–<lpage>417</lpage>
.</mixed-citation>
</ref>
<ref id="pone.0121453.ref041"><label>41</label>
<mixed-citation publication-type="journal"><name><surname>Wang</surname>
<given-names>D</given-names>
</name>
, <name><surname>Shi</surname>
<given-names>L</given-names>
</name>
. <article-title>Selecting valuable training samples for SVMs via data structure analysis</article-title>
. <source>Neurocomputing</source>
. <year>2008</year>
;<volume>71</volume>
:<fpage>2772</fpage>
–<lpage>2781</lpage>
.</mixed-citation>
</ref>
<ref id="pone.0121453.ref042"><label>42</label>
<mixed-citation publication-type="journal"><name><surname>Huson</surname>
<given-names>DH</given-names>
</name>
, <name><surname>Auch</surname>
<given-names>AF</given-names>
</name>
, <name><surname>Qi</surname>
<given-names>J</given-names>
</name>
, <name><surname>Schuster</surname>
<given-names>SC</given-names>
</name>
. <article-title>MEGAN analysis of metagenomic data</article-title>
. <source>Genome Research</source>
. <year>2007</year>
;<volume>17</volume>
(<issue>3</issue>
):<fpage>377</fpage>
–<lpage>386</lpage>
. <pub-id pub-id-type="doi">10.1101/gr.5969107</pub-id>
<pub-id pub-id-type="pmid">17255551</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref043"><label>43</label>
<mixed-citation publication-type="journal"><name><surname>Gori</surname>
<given-names>F</given-names>
</name>
, <name><surname>Folino</surname>
<given-names>G</given-names>
</name>
, <name><surname>Jetten</surname>
<given-names>MSM</given-names>
</name>
, <name><surname>Marchiori</surname>
<given-names>E</given-names>
</name>
. <article-title>MTR: Taxonomic annotation of short metagenomic reads using clustering at multiple taxonomic ranks</article-title>
. <source>Bioinformatics</source>
. <year>2011</year>
;<volume>27</volume>
(<issue>2</issue>
):<fpage>196</fpage>
–<lpage>203</lpage>
. <pub-id pub-id-type="doi">10.1093/bioinformatics/btq649</pub-id>
<pub-id pub-id-type="pmid">21127032</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref044"><label>44</label>
<mixed-citation publication-type="journal"><name><surname>Monzoorul Haque</surname>
<given-names>M</given-names>
</name>
, <name><surname>Ghosh</surname>
<given-names>TS</given-names>
</name>
, <name><surname>Komanduri</surname>
<given-names>D</given-names>
</name>
, <name><surname>Mande</surname>
<given-names>SS</given-names>
</name>
. <article-title>SOrt-ITEMS: Sequence orthology based approach for improved taxonomic estimation of metagenomic sequences</article-title>
. <source>Bioinformatics</source>
. <year>2009</year>
;<volume>25</volume>
(<issue>14</issue>
):<fpage>1722</fpage>
–<lpage>1730</lpage>
. <pub-id pub-id-type="doi">10.1093/bioinformatics/btp317</pub-id>
<pub-id pub-id-type="pmid">19439565</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref045"><label>45</label>
<mixed-citation publication-type="journal"><name><surname>Gerlach</surname>
<given-names>W</given-names>
</name>
, <name><surname>Stoye</surname>
<given-names>J</given-names>
</name>
. <article-title>Taxonomic classification of metagenomic shotgun sequences with CARMA3</article-title>
. <source>Nucleic Acids Research</source>
. <year>2011</year>
;<volume>39</volume>
(<issue>14</issue>
). <pub-id pub-id-type="doi">10.1093/nar/gkr225</pub-id>
<pub-id pub-id-type="pmid">21586583</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref046"><label>46</label>
<mixed-citation publication-type="journal"><name><surname>Meyer</surname>
<given-names>F</given-names>
</name>
, <name><surname>Paarmann</surname>
<given-names>D</given-names>
</name>
, <name><surname>D’Souza</surname>
<given-names>M</given-names>
</name>
, <name><surname>Olson</surname>
<given-names>R</given-names>
</name>
, <name><surname>Glass</surname>
<given-names>E</given-names>
</name>
, <name><surname>Kubal</surname>
<given-names>M</given-names>
</name>
, <etal>et al</etal>
. <article-title>The metagenomics RAST server–a public resource for the automatic phylogenetic and functional analysis of metagenomes</article-title>
. <source>BMC Bioinformatics</source>
. <year>2008</year>
;<volume>9</volume>
(<issue>1</issue>
):<fpage>386</fpage>
. <pub-id pub-id-type="doi">10.1186/1471-2105-9-386</pub-id>
<pub-id pub-id-type="pmid">18803844</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref047"><label>47</label>
<mixed-citation publication-type="other">Liu B, Gibbons T, Ghodsi M, Pop M. MetaPhyler: Taxonomic profiling for metagenomic sequences. In: Proceedings of the 2010 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2010; 2010. p. 95–100.</mixed-citation>
</ref>
<ref id="pone.0121453.ref048"><label>48</label>
<mixed-citation publication-type="journal"><name><surname>Schreiber</surname>
<given-names>F</given-names>
</name>
, <name><surname>Gumrich</surname>
<given-names>P</given-names>
</name>
, <name><surname>Daniel</surname>
<given-names>R</given-names>
</name>
, <name><surname>Meinicke</surname>
<given-names>P</given-names>
</name>
. <article-title>Treephyler: Fast taxonomic profiling of metagenomes</article-title>
. <source>Bioinformatics</source>
. <year>2010</year>
;<volume>26</volume>
(<issue>7</issue>
):<fpage>960</fpage>
–<lpage>961</lpage>
. <pub-id pub-id-type="doi">10.1093/bioinformatics/btq070</pub-id>
<pub-id pub-id-type="pmid">20172941</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref049"><label>49</label>
<mixed-citation publication-type="journal"><name><surname>Stranneheim</surname>
<given-names>H</given-names>
</name>
, <name><surname>Kaller</surname>
<given-names>M</given-names>
</name>
, <name><surname>Allander</surname>
<given-names>T</given-names>
</name>
, <name><surname>Andersson</surname>
<given-names>B</given-names>
</name>
, <name><surname>Arvestad</surname>
<given-names>L</given-names>
</name>
, <name><surname>Lundeberg</surname>
<given-names>J</given-names>
</name>
. <article-title>Classification of DNA sequences using Bloom filters</article-title>
. <source>Bioinformatics</source>
. <year>2010</year>
;<volume>26</volume>
(<issue>13</issue>
):<fpage>1595</fpage>
–<lpage>1600</lpage>
. <pub-id pub-id-type="doi">10.1093/bioinformatics/btq230</pub-id>
<pub-id pub-id-type="pmid">20472541</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref050"><label>50</label>
<mixed-citation publication-type="journal"><name><surname>Ames</surname>
<given-names>S</given-names>
</name>
, <name><surname>Hysom</surname>
<given-names>DA</given-names>
</name>
, <name><surname>Gardner</surname>
<given-names>SN</given-names>
</name>
, <name><surname>Lloyd</surname>
<given-names>GS</given-names>
</name>
, <name><surname>Gokhale</surname>
<given-names>MB</given-names>
</name>
, <name><surname>Allen</surname>
<given-names>JE</given-names>
</name>
. <article-title>Scalable metagenomic taxonomy classification using a reference genome database</article-title>
. <source>Bioinformatics</source>
. <year>2013</year>
;<volume>29</volume>
(<issue>18</issue>
):<fpage>2253</fpage>
–<lpage>2260</lpage>
. <pub-id pub-id-type="doi">10.1093/bioinformatics/btt389</pub-id>
<pub-id pub-id-type="pmid">23828782</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref051"><label>51</label>
<mixed-citation publication-type="journal"><name><surname>Wood</surname>
<given-names>DE</given-names>
</name>
, <name><surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
. <article-title>Kraken: Ultrafast metagenomic sequence classification using exact alignments</article-title>
. <source>Genome biology</source>
. <year>2014</year>
;<volume>15</volume>
(<issue>3</issue>
). <pub-id pub-id-type="doi">10.1186/gb-2014-15-3-r46</pub-id>
<pub-id pub-id-type="pmid">24580807</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref052"><label>52</label>
<mixed-citation publication-type="journal"><name><surname>Roberts</surname>
<given-names>M</given-names>
</name>
, <name><surname>Hayes</surname>
<given-names>W</given-names>
</name>
, <name><surname>Hunt</surname>
<given-names>BR</given-names>
</name>
, <name><surname>Mount</surname>
<given-names>SM</given-names>
</name>
, <name><surname>Yorke</surname>
<given-names>JA</given-names>
</name>
. <article-title>Reducing storage requirements for biological sequence comparison</article-title>
. <source>Bioinformatics</source>
. <year>2004</year>
;<volume>20</volume>
(<issue>18</issue>
):<fpage>3363</fpage>
–<lpage>3369</lpage>
. <pub-id pub-id-type="doi">10.1093/bioinformatics/bth408</pub-id>
<pub-id pub-id-type="pmid">15256412</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref053"><label>53</label>
<mixed-citation publication-type="journal"><name><surname>Deorowicz</surname>
<given-names>S</given-names>
</name>
, <name><surname>Kokot</surname>
<given-names>M</given-names>
</name>
, <name><surname>Grabowski</surname>
<given-names>S</given-names>
</name>
, <name><surname>Debudaj-Grabysz</surname>
<given-names>A</given-names>
</name>
. <article-title>KMC 2: Fast and resource-frugal k-mer counting</article-title>
. <source>Bioinformatics</source>
. <year>2015</year>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btv022</pub-id>
<pub-id pub-id-type="pmid">25609798</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref054"><label>54</label>
<mixed-citation publication-type="journal"><name><surname>Movahedi</surname>
<given-names>NS</given-names>
</name>
, <name><surname>Forouzmand</surname>
<given-names>E</given-names>
</name>
, <name><surname>Chitsaz</surname>
<given-names>H</given-names>
</name>
. <article-title>De novo co-assembly of bacterial genomes from multiple single cells</article-title>
. In: <source>BIBM</source>
; <year>2012</year>
 p. <fpage>1</fpage>
–<lpage>5</lpage>
.</mixed-citation>
</ref>
<ref id="pone.0121453.ref055"><label>55</label>
<mixed-citation publication-type="journal"><name><surname>Deorowicz</surname>
<given-names>S</given-names>
</name>
, <name><surname>Debudaj-Grabysz</surname>
<given-names>A</given-names>
</name>
, <name><surname>Grabowski</surname>
<given-names>S</given-names>
</name>
. <article-title>Disk-based k-mer counting on a PC</article-title>
. <source>BMC Bioinformatics</source>
. <year>2013</year>
;<volume>14</volume>
:<fpage>160</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-14-160</pub-id>
<pub-id pub-id-type="pmid">23679007</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref056"><label>56</label>
<mixed-citation publication-type="journal"><name><surname>Bazinet</surname>
<given-names>AL</given-names>
</name>
, <name><surname>Cummings</surname>
<given-names>MP</given-names>
</name>
. <article-title>A comparative evaluation of sequence classification programs</article-title>
. <source>BMC Bioinformatics</source>
. <year>2012</year>
;<volume>13</volume>
(<issue>1</issue>
):<fpage>1</fpage>
–<lpage>13</lpage>
. <pub-id pub-id-type="doi">10.1186/1471-2105-13-92</pub-id>
<pub-id pub-id-type="pmid">22214541</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0121453.ref057"><label>57</label>
<mixed-citation publication-type="book"><name><surname>Kawulok</surname>
<given-names>J</given-names>
</name>
, <name><surname>Deorowicz</surname>
<given-names>S</given-names>
</name>
. <chapter-title>An Improved Algorithm for Fast and Accurate Classification of Sequences</chapter-title>
 In: <name><surname>Kozielski</surname>
<given-names>S</given-names>
</name>
, <name><surname>Mrozek</surname>
<given-names>D</given-names>
</name>
, <name><surname>Kasprowski</surname>
<given-names>P</given-names>
</name>
, <name><surname>Małysiak-Mrozek</surname>
<given-names>B</given-names>
</name>
, <name><surname>Kostrzewa</surname>
<given-names>D</given-names>
</name>
, editors. <source>Beyond Databases, Architectures, and Structures. vol. 424 of Communications in Computer and Information Science</source>
. <publisher-name>Springer International Publishing</publisher-name>
; <year>2014</year>
 p. <fpage>335</fpage>
–<lpage>344</lpage>
.</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001006 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 001006 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:4401624
   |texte=   CoMeta: Classification of Metagenomes Using k-mers
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:25884504" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021

	MERS Exploration Server
	Warning: this site is under development. It was generated by automated means from raw corpora, so the information has not been validated.
