MERS exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Analysis of composition-based metagenomic classification

Internal identifier: 000955 (Pmc/Corpus); previous: 000954; next: 000956

Authors: Susan Higashi; André da Motta Salles Barreto; Maurício Egidio Cantão; Ana Tereza Ribeiro de Vasconcelos

Source:

RBID : PMC:3477002

Abstract

Background

An essential step of a metagenomic study is the taxonomic classification, that is, the identification of the taxonomic lineage of the organisms in a given sample. The taxonomic classification process involves a series of decisions. Currently, in the context of metagenomics, such decisions are usually based on empirical studies that consider one specific type of classifier. In this study we propose a general framework for analyzing the impact that several decisions can have on the classification problem. Instead of focusing on any specific classifier, we define a generic score function that provides a measure of the difficulty of the classification task. Using this framework, we analyze the impact of the following parameters on the taxonomic classification problem: (i) the length of n-mers used to encode the metagenomic sequences, (ii) the similarity measure used to compare sequences, and (iii) the type of taxonomic classification, which can be conventional or hierarchical, depending on whether the classification process occurs in a single shot or in several steps according to the taxonomic tree.

Results

We defined a score function that measures the degree of separability of the taxonomic classes under a given configuration induced by the parameters above. We conducted an extensive computational experiment and found that reasonable choices for the parameters of interest are (i) intermediate values of n, the length of the n-mers; (ii) any similarity measure, because all of them resulted in similar scores; and (iii) the hierarchical strategy, which performed better in all cases.

Conclusions

As expected, short n-mers generate lower configuration scores because they give rise to frequency vectors that represent distinct sequences in a similar way. On the other hand, large values of n result in sparse frequency vectors that represent similar metagenomic fragments differently, also leading to low configuration scores. Regarding the similarity measure, contrary to our expectations, varying the measure did not change the configuration scores significantly. Finally, the hierarchical strategy was more effective than the conventional strategy, which suggests that, instead of using a single classifier, one should adopt multiple classifiers organized as a hierarchy.
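
The trade-off described above can be illustrated with a minimal sketch. It assumes overlapping n-mers over the alphabet ACGT; the helper names nmer_frequencies and nonzero_fraction and the example fragment are hypothetical and do not come from the paper. Because the vector has 4^n entries, a fragment of fixed length fills a rapidly shrinking fraction of them as n grows.

```python
from collections import Counter
from itertools import product


def nmer_frequencies(seq, n):
    """Return the relative frequency of every possible n-mer over ACGT.

    Hypothetical helper for illustration only; windows containing
    other symbols (for example N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    alphabet = ["".join(p) for p in product("ACGT", repeat=n)]
    total = sum(counts[w] for w in alphabet)
    return [counts[w] / total if total else 0.0 for w in alphabet]


def nonzero_fraction(vec):
    """Fraction of entries that are non-zero (1.0 means fully dense)."""
    return sum(1 for v in vec if v > 0) / len(vec)


if __name__ == "__main__":
    # Made-up 102 bp fragment, not taken from any real genome.
    fragment = "ACGTTGCAAGCTAGCTAGGATCCGATCGATTACG" * 3
    for n in (1, 2, 4, 6):
        vec = nmer_frequencies(fragment, n)
        # Prints n, the vector length 4**n, and the share of non-zero entries.
        print(n, len(vec), round(nonzero_fraction(vec), 3))
```

For very small n almost every entry is non-zero for any fragment, so different organisms look alike; for large n most entries are zero, so even related fragments share few non-zero positions.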


Url:
DOI: 10.1186/1471-2164-13-S5-S1
PubMed: 23095761
PubMed Central: 3477002

Links to Exploration step

PMC:3477002

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Analysis of composition-based metagenomic classification</title>
<author>
<name sortKey="Higashi, Susan" sort="Higashi, Susan" uniqKey="Higashi S" first="Susan" last="Higashi">Susan Higashi</name>
<affiliation>
<nlm:aff id="I1">Laboratório Nacional de Computação Científica (LNCC), Petrópolis, RJ, Brazil</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Barreto, Andre Da Motta Salles" sort="Barreto, Andre Da Motta Salles" uniqKey="Barreto A" first="André Da Motta Salles" last="Barreto">André Da Motta Salles Barreto</name>
<affiliation>
<nlm:aff id="I1">Laboratório Nacional de Computação Científica (LNCC), Petrópolis, RJ, Brazil</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Cantao, Mauricio Egidio" sort="Cantao, Mauricio Egidio" uniqKey="Cantao M" first="Maurício Egidio" last="Cantão">Maurício Egidio Cantão</name>
<affiliation>
<nlm:aff id="I1">Laboratório Nacional de Computação Científica (LNCC), Petrópolis, RJ, Brazil</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I2">Embrapa Suínos e Aves, Concórdia, SC, Brazil</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="De Vasconcelos, Ana Tereza Ribeiro" sort="De Vasconcelos, Ana Tereza Ribeiro" uniqKey="De Vasconcelos A" first="Ana Tereza Ribeiro" last="De Vasconcelos">Ana Tereza Ribeiro De Vasconcelos</name>
<affiliation>
<nlm:aff id="I1">Laboratório Nacional de Computação Científica (LNCC), Petrópolis, RJ, Brazil</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">23095761</idno>
<idno type="pmc">3477002</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3477002</idno>
<idno type="RBID">PMC:3477002</idno>
<idno type="doi">10.1186/1471-2164-13-S5-S1</idno>
<date when="2012">2012</date>
<idno type="wicri:Area/Pmc/Corpus">000955</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000955</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Analysis of composition-based metagenomic classification</title>
<author>
<name sortKey="Higashi, Susan" sort="Higashi, Susan" uniqKey="Higashi S" first="Susan" last="Higashi">Susan Higashi</name>
<affiliation>
<nlm:aff id="I1">Laboratório Nacional de Computação Científica (LNCC), Petrópolis, RJ, Brazil</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Barreto, Andre Da Motta Salles" sort="Barreto, Andre Da Motta Salles" uniqKey="Barreto A" first="André Da Motta Salles" last="Barreto">André Da Motta Salles Barreto</name>
<affiliation>
<nlm:aff id="I1">Laboratório Nacional de Computação Científica (LNCC), Petrópolis, RJ, Brazil</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Cantao, Mauricio Egidio" sort="Cantao, Mauricio Egidio" uniqKey="Cantao M" first="Maurício Egidio" last="Cantão">Maurício Egidio Cantão</name>
<affiliation>
<nlm:aff id="I1">Laboratório Nacional de Computação Científica (LNCC), Petrópolis, RJ, Brazil</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I2">Embrapa Suínos e Aves, Concórdia, SC, Brazil</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="De Vasconcelos, Ana Tereza Ribeiro" sort="De Vasconcelos, Ana Tereza Ribeiro" uniqKey="De Vasconcelos A" first="Ana Tereza Ribeiro" last="De Vasconcelos">Ana Tereza Ribeiro De Vasconcelos</name>
<affiliation>
<nlm:aff id="I1">Laboratório Nacional de Computação Científica (LNCC), Petrópolis, RJ, Brazil</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Genomics</title>
<idno type="eISSN">1471-2164</idno>
<imprint>
<date when="2012">2012</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>An essential step of a metagenomic study is the taxonomic classification, that is, the identification of the taxonomic lineage of the organisms in a given sample. The taxonomic classification process involves a series of decisions. Currently, in the context of metagenomics, such decisions are usually based on empirical studies that consider one specific type of classifier. In this study we propose a general framework for analyzing the impact that several decisions can have on the
<italic>classification problem</italic>
. Instead of focusing on any specific classifier, we define a generic score function that provides a measure of the difficulty of the classification task. Using this framework, we analyze the impact of the following parameters on the taxonomic classification problem: (i) the length of
<italic>n</italic>
-mers used to encode the metagenomic sequences, (ii) the similarity measure used to compare sequences, and (iii) the type of taxonomic classification, which can be conventional or hierarchical, depending on whether the classification process occurs in a single shot or in several steps according to the taxonomic tree.</p>
</sec>
<sec>
<title>Results</title>
<p>We defined a score function that measures the degree of separability of the taxonomic classes under a given configuration induced by the parameters above. We conducted an extensive computational experiment and found that reasonable choices for the parameters of interest are (i) intermediate values of n, the length of the
<italic>n</italic>
-mers; (ii) any similarity measure, because all of them resulted in similar scores; and (iii) the hierarchical strategy, which performed better in all cases.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>As expected, short
<italic>n</italic>
-mers generate lower configuration scores because they give rise to frequency vectors that represent distinct sequences in a similar way. On the other hand, large values of n result in sparse frequency vectors that represent similar metagenomic fragments differently, also leading to low configuration scores. Regarding the similarity measure, contrary to our expectations, varying the measure did not change the configuration scores significantly. Finally, the hierarchical strategy was more effective than the conventional strategy, which suggests that, instead of using a single classifier, one should adopt multiple classifiers organized as a hierarchy.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Handelsman, J" uniqKey="Handelsman J">J Handelsman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tringe, Sg" uniqKey="Tringe S">SG Tringe</name>
</author>
<author>
<name sortKey="Rubin, Em" uniqKey="Rubin E">EM Rubin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schreiber, F" uniqKey="Schreiber F">F Schreiber</name>
</author>
<author>
<name sortKey="Gumrich, P" uniqKey="Gumrich P">P Gumrich</name>
</author>
<author>
<name sortKey="Daniel, R" uniqKey="Daniel R">R Daniel</name>
</author>
<author>
<name sortKey="Meinicke, P" uniqKey="Meinicke P">P Meinicke</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mchardy, Ac" uniqKey="Mchardy A">AC McHardy</name>
</author>
<author>
<name sortKey="Martin, Hg" uniqKey="Martin H">HG Martín</name>
</author>
<author>
<name sortKey="Tsirigos, A" uniqKey="Tsirigos A">A Tsirigos</name>
</author>
<author>
<name sortKey="Hugenholtz, P" uniqKey="Hugenholtz P">P Hugenholtz</name>
</author>
<author>
<name sortKey="Rigoutsos, I" uniqKey="Rigoutsos I">I Rigoutsos</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Altschul, Sf" uniqKey="Altschul S">SF Altschul</name>
</author>
<author>
<name sortKey="Gish, W" uniqKey="Gish W">W Gish</name>
</author>
<author>
<name sortKey="Miller, W" uniqKey="Miller W">W Miller</name>
</author>
<author>
<name sortKey="Eugene W Myers, Ew" uniqKey="Eugene W Myers E">EW Myers</name>
</author>
<author>
<name sortKey="Lipman, Dj" uniqKey="Lipman D">DJ Lipman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Krause, L" uniqKey="Krause L">L Krause</name>
</author>
<author>
<name sortKey="Diaz, Nn" uniqKey="Diaz N">NN Diaz</name>
</author>
<author>
<name sortKey="Goesmann, A" uniqKey="Goesmann A">A Goesmann</name>
</author>
<author>
<name sortKey="Kelley, S" uniqKey="Kelley S">S Kelley</name>
</author>
<author>
<name sortKey="Nattkemper, Tw" uniqKey="Nattkemper T">TW Nattkemper</name>
</author>
<author>
<name sortKey="Rohwer, F" uniqKey="Rohwer F">F Rohwer</name>
</author>
<author>
<name sortKey="Edwards, Ra" uniqKey="Edwards R">RA Edwards</name>
</author>
<author>
<name sortKey="Stoye, J" uniqKey="Stoye J">J Stoye</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huson, Dh" uniqKey="Huson D">DH Huson</name>
</author>
<author>
<name sortKey="Auch, Af" uniqKey="Auch A">AF Auch</name>
</author>
<author>
<name sortKey="Qi, J" uniqKey="Qi J">J Qi</name>
</author>
<author>
<name sortKey="Schuster, Sc" uniqKey="Schuster S">SC Schuster</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Diaz, Nn" uniqKey="Diaz N">NN Diaz</name>
</author>
<author>
<name sortKey="Krause, L" uniqKey="Krause L">L Krause</name>
</author>
<author>
<name sortKey="Goesmann, A" uniqKey="Goesmann A">A Goesmann</name>
</author>
<author>
<name sortKey="Niehaus, K" uniqKey="Niehaus K">K Niehaus</name>
</author>
<author>
<name sortKey="Nattkemper, Tw" uniqKey="Nattkemper T">TW Nattkemper</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Saeed, I" uniqKey="Saeed I">I Saeed</name>
</author>
<author>
<name sortKey="Halgamuge, Sk" uniqKey="Halgamuge S">SK Halgamuge</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Karlin, S" uniqKey="Karlin S">S Karlin</name>
</author>
<author>
<name sortKey="Mrazek, J" uniqKey="Mrazek J">J Mrázek</name>
</author>
<author>
<name sortKey="Campbell, Am" uniqKey="Campbell A">AM Campbell</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Karlin, Ac" uniqKey="Karlin A">AC Karlin</name>
</author>
<author>
<name sortKey="Mraazek, J" uniqKey="Mraazek J">J Mrázek</name>
</author>
<author>
<name sortKey="Samuel" uniqKey="Samuel">Samuel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Brady, A" uniqKey="Brady A">A Brady</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rosen, G" uniqKey="Rosen G">G Rosen</name>
</author>
<author>
<name sortKey="Garbarine, E" uniqKey="Garbarine E">E Garbarine</name>
</author>
<author>
<name sortKey="Caseiro, D" uniqKey="Caseiro D">D Caseiro</name>
</author>
<author>
<name sortKey="Polikar, R" uniqKey="Polikar R">R Polikar</name>
</author>
<author>
<name sortKey="Sokhansanj, B" uniqKey="Sokhansanj B">B Sokhansanj</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Benson, Da" uniqKey="Benson D">DA Benson</name>
</author>
<author>
<name sortKey="Karsch Mizrachi, I" uniqKey="Karsch Mizrachi I">I Karsch-Mizrachi</name>
</author>
<author>
<name sortKey="Lipman, Dj" uniqKey="Lipman D">DJ Lipman</name>
</author>
<author>
<name sortKey="Ostell, J" uniqKey="Ostell J">J Ostell</name>
</author>
<author>
<name sortKey="Sayers, Ew" uniqKey="Sayers E">EW Sayers</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Doolittle, Wf" uniqKey="Doolittle W">WF Doolittle</name>
</author>
<author>
<name sortKey="Zhaxybayeva, O" uniqKey="Zhaxybayeva O">O Zhaxybayeva</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Richter, Dc" uniqKey="Richter D">DC Richter</name>
</author>
<author>
<name sortKey="Ott, F" uniqKey="Ott F">F Ott</name>
</author>
<author>
<name sortKey="Auch, Af" uniqKey="Auch A">AF Auch</name>
</author>
<author>
<name sortKey="Schmid, R" uniqKey="Schmid R">R Schmid</name>
</author>
<author>
<name sortKey="Huson, Dh" uniqKey="Huson D">DH Huson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Margulies, M" uniqKey="Margulies M">M Margulies</name>
</author>
<author>
<name sortKey="Egholm, M" uniqKey="Egholm M">M Egholm</name>
</author>
<author>
<name sortKey="Altman, We" uniqKey="Altman W">WE Altman</name>
</author>
<author>
<name sortKey="Attiya, S" uniqKey="Attiya S">S Attiya</name>
</author>
<author>
<name sortKey="Bader, Js" uniqKey="Bader J">JS Bader</name>
</author>
<author>
<name sortKey="Bemben, La" uniqKey="Bemben L">LA Bemben</name>
</author>
<author>
<name sortKey="Berka, J" uniqKey="Berka J">J Berka</name>
</author>
<author>
<name sortKey="Braverman, Ms" uniqKey="Braverman M">MS Braverman</name>
</author>
<author>
<name sortKey="Chen, Yj" uniqKey="Chen Y">YJ Chen</name>
</author>
<author>
<name sortKey="Chen, Z" uniqKey="Chen Z">Z Chen</name>
</author>
<author>
<name sortKey="Dewell, B" uniqKey="Dewell B">B Dewell</name>
</author>
<author>
<name sortKey="Du, L" uniqKey="Du L">L Du</name>
</author>
<author>
<name sortKey="Fierro, Jm" uniqKey="Fierro J">JM Fierro</name>
</author>
<author>
<name sortKey="Gomes, Xv" uniqKey="Gomes X">XV Gomes</name>
</author>
<author>
<name sortKey="Goodwin, Bc" uniqKey="Goodwin B">BC Goodwin</name>
</author>
<author>
<name sortKey="He, W" uniqKey="He W">W He</name>
</author>
<author>
<name sortKey="Helgesen, S" uniqKey="Helgesen S">S Helgesen</name>
</author>
<author>
<name sortKey="Ho, Ch" uniqKey="Ho C">CH Ho</name>
</author>
<author>
<name sortKey="Irzyk, Gp" uniqKey="Irzyk G">GP Irzyk</name>
</author>
<author>
<name sortKey="Jando, Sc" uniqKey="Jando S">SC Jando</name>
</author>
<author>
<name sortKey="I, Ml" uniqKey="I M">ML I</name>
</author>
<author>
<name sortKey="Jarvie, Tp" uniqKey="Jarvie T">TP Jarvie</name>
</author>
<author>
<name sortKey="Jirage, Kb" uniqKey="Jirage K">KB Jirage</name>
</author>
<author>
<name sortKey="Kim, Jb" uniqKey="Kim J">JB Kim</name>
</author>
<author>
<name sortKey="Knight, Jr" uniqKey="Knight J">JR Knight</name>
</author>
<author>
<name sortKey="Lanza, R" uniqKey="Lanza R">R Lanza</name>
</author>
<author>
<name sortKey="Leamon, Jh" uniqKey="Leamon J">JH Leamon</name>
</author>
<author>
<name sortKey="Lefkowitz, Sm" uniqKey="Lefkowitz S">SM Lefkowitz</name>
</author>
<author>
<name sortKey="Lei, M" uniqKey="Lei M">M Lei</name>
</author>
<author>
<name sortKey="Li, J" uniqKey="Li J">J Li</name>
</author>
<author>
<name sortKey="L, K" uniqKey="L K">K L</name>
</author>
<author>
<name sortKey="Lu, H" uniqKey="Lu H">H Lu</name>
</author>
<author>
<name sortKey="Makhijani, Vb" uniqKey="Makhijani V">VB Makhijani</name>
</author>
<author>
<name sortKey="Mcdade, Ke" uniqKey="Mcdade K">KE McDade</name>
</author>
<author>
<name sortKey="Mckenna, Mp" uniqKey="Mckenna M">MP McKenna</name>
</author>
<author>
<name sortKey="Myers, W" uniqKey="Myers W">W Myers</name>
</author>
<author>
<name sortKey="Nickerson, E" uniqKey="Nickerson E">E Nickerson</name>
</author>
<author>
<name sortKey="Nobile, Jr" uniqKey="Nobile J">JR Nobile</name>
</author>
<author>
<name sortKey="Plant, R" uniqKey="Plant R">R Plant</name>
</author>
<author>
<name sortKey="Puc, Bp" uniqKey="Puc B">BP Puc</name>
</author>
<author>
<name sortKey="Ronan, T" uniqKey="Ronan T">T Ronan</name>
</author>
<author>
<name sortKey="Roth, Gt" uniqKey="Roth G">GT Roth</name>
</author>
<author>
<name sortKey="Sarkis, Gj" uniqKey="Sarkis G">GJ Sarkis</name>
</author>
<author>
<name sortKey="Simons, Jf" uniqKey="Simons J">JF Simons</name>
</author>
<author>
<name sortKey="Simpson, Jw" uniqKey="Simpson J">JW Simpson</name>
</author>
<author>
<name sortKey="Srinivasan, M" uniqKey="Srinivasan M">M Srinivasan</name>
</author>
<author>
<name sortKey="Tartaro, Kr" uniqKey="Tartaro K">KR Tartaro</name>
</author>
<author>
<name sortKey="Tomasz, A" uniqKey="Tomasz A">A Tomasz</name>
</author>
<author>
<name sortKey="Vogt, Ka" uniqKey="Vogt K">KA Vogt</name>
</author>
<author>
<name sortKey="A, G" uniqKey="A G">G A</name>
</author>
<author>
<name sortKey="Wang, Sh" uniqKey="Wang S">SH Wang</name>
</author>
<author>
<name sortKey="Wang, Y" uniqKey="Wang Y">Y Wang</name>
</author>
<author>
<name sortKey="Weiner, Mp" uniqKey="Weiner M">MP Weiner</name>
</author>
<author>
<name sortKey="Yu, P" uniqKey="Yu P">P Yu</name>
</author>
<author>
<name sortKey="F, R" uniqKey="F R">R F</name>
</author>
<author>
<name sortKey="Rothberg, Jm" uniqKey="Rothberg J">JM Rothberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Russell, Sj" uniqKey="Russell S">SJ Russell</name>
</author>
<author>
<name sortKey="Norvig, P" uniqKey="Norvig P">P Norvig</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Stajich, J" uniqKey="Stajich J">J Stajich</name>
</author>
<author>
<name sortKey="Block, D" uniqKey="Block D">D Block</name>
</author>
<author>
<name sortKey="Boulez, K" uniqKey="Boulez K">K Boulez</name>
</author>
<author>
<name sortKey="Brenner, S" uniqKey="Brenner S">S Brenner</name>
</author>
<author>
<name sortKey="Chervitz, S" uniqKey="Chervitz S">S Chervitz</name>
</author>
<author>
<name sortKey="Dagdigian, C" uniqKey="Dagdigian C">C Dagdigian</name>
</author>
<author>
<name sortKey="Fuellen, G" uniqKey="Fuellen G">G Fuellen</name>
</author>
<author>
<name sortKey="Gilbert, J" uniqKey="Gilbert J">J Gilbert</name>
</author>
<author>
<name sortKey="Korf, I" uniqKey="Korf I">I Korf</name>
</author>
<author>
<name sortKey="Lapp, H" uniqKey="Lapp H">H Lapp</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hastie, T" uniqKey="Hastie T">T Hastie</name>
</author>
<author>
<name sortKey="Tibshirani, R" uniqKey="Tibshirani R">R Tibshirani</name>
</author>
<author>
<name sortKey="Friedman, J" uniqKey="Friedman J">J Friedman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Scholkopf, B" uniqKey="Scholkopf B">B Schölkopf</name>
</author>
<author>
<name sortKey="Smola, Aj" uniqKey="Smola A">AJ Smola</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cover, T" uniqKey="Cover T">T Cover</name>
</author>
<author>
<name sortKey="Hart, P" uniqKey="Hart P">P Hart</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zheng, H" uniqKey="Zheng H">H Zheng</name>
</author>
<author>
<name sortKey="Wu, H" uniqKey="Wu H">H Wu</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article" xml:lang="en">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Genomics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Genomics</journal-id>
<journal-title-group>
<journal-title>BMC Genomics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2164</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">23095761</article-id>
<article-id pub-id-type="pmc">3477002</article-id>
<article-id pub-id-type="publisher-id">1471-2164-13-S5-S1</article-id>
<article-id pub-id-type="doi">10.1186/1471-2164-13-S5-S1</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Analysis of composition-based metagenomic classification</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes" id="A1">
<name>
<surname>Higashi</surname>
<given-names>Susan</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>susan.higashi@inria.fr</email>
</contrib>
<contrib contrib-type="author" id="A2">
<name>
<surname>Barreto</surname>
<given-names>André da Motta Salles</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>amsb@lncc.br</email>
</contrib>
<contrib contrib-type="author" id="A3">
<name>
<surname>Cantão</surname>
<given-names>Maurício Egidio</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<xref ref-type="aff" rid="I2">2</xref>
<email>cantao@lncc.br</email>
</contrib>
<contrib contrib-type="author" id="A4">
<name>
<surname>de Vasconcelos</surname>
<given-names>Ana Tereza Ribeiro</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>atrv@lncc.br</email>
</contrib>
</contrib-group>
<aff id="I1">
<label>1</label>
Laboratório Nacional de Computação Científica (LNCC), Petrópolis, RJ, Brazil</aff>
<aff id="I2">
<label>2</label>
Embrapa Suínos e Aves, Concórdia, SC, Brazil</aff>
<pub-date pub-type="collection">
<year>2012</year>
</pub-date>
<pub-date pub-type="epub">
<day>19</day>
<month>10</month>
<year>2012</year>
</pub-date>
<volume>13</volume>
<issue>Suppl 5</issue>
<supplement>
<named-content content-type="supplement-title">Proceedings of the International Conference of the Brazilian Association for Bioinformatics and Computational Biology (X-meeting 2011)</named-content>
<named-content content-type="supplement-editor">Ronaldo Nagem, Thiago Venancio, Ricardo De Marco, Lucas Bleicher, Gerald Weber, Adriano Barbosa-Silva, Liza Felicori, Wagner Arbex, Javier De Las Rivas and Alan Durham</named-content>
<named-content content-type="supplement-sponsor">This supplement has not been supported by sponsorship or other external funding.</named-content>
</supplement>
<fpage>S1</fpage>
<lpage>S1</lpage>
<permissions>
<copyright-statement>Copyright ©2012 Higashi et al.; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2012</copyright-year>
<copyright-holder>Higashi et al.; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/2.0">
<license-p>This is an open access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/2.0">http://creativecommons.org/licenses/by/2.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.biomedcentral.com/1471-2164/13/S5/S1"></self-uri>
<abstract>
<sec>
<title>Background</title>
<p>An essential step of a metagenomic study is the taxonomic classification, that is, the identification of the taxonomic lineage of the organisms in a given sample. The taxonomic classification process involves a series of decisions. Currently, in the context of metagenomics, such decisions are usually based on empirical studies that consider one specific type of classifier. In this study we propose a general framework for analyzing the impact that several decisions can have on the
<italic>classification problem</italic>
. Instead of focusing on any specific classifier, we define a generic score function that provides a measure of the difficulty of the classification task. Using this framework, we analyze the impact of the following parameters on the taxonomic classification problem: (i) the length of
<italic>n</italic>
-mers used to encode the metagenomic sequences, (ii) the similarity measure used to compare sequences, and (iii) the type of taxonomic classification, which can be conventional or hierarchical, depending on whether the classification process occurs in a single shot or in several steps according to the taxonomic tree.</p>
</sec>
<sec>
<title>Results</title>
<p>We defined a score function that measures the degree of separability of the taxonomic classes under a given configuration induced by the parameters above. We conducted an extensive computational experiment and found that reasonable choices for the parameters of interest are (i) intermediate values of n, the length of the
<italic>n</italic>
-mers; (ii) any similarity measure, because all of them resulted in similar scores; and (iii) the hierarchical strategy, which performed better in all cases.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>As expected, short
<italic>n</italic>
-mers generate lower configuration scores because they give rise to frequency vectors that represent distinct sequences in a similar way. On the other hand, large values of n result in sparse frequency vectors that represent similar metagenomic fragments differently, also leading to low configuration scores. Regarding the similarity measure, contrary to our expectations, varying the measure did not change the configuration scores significantly. Finally, the hierarchical strategy was more effective than the conventional strategy, which suggests that, instead of using a single classifier, one should adopt multiple classifiers organized as a hierarchy.</p>
</sec>
</abstract>
<kwd-group>
<kwd>Metagenomics</kwd>
<kwd>classification problem</kwd>
<kwd>taxonomic classification</kwd>
</kwd-group>
<conference>
<conf-date>12-15 October 2011</conf-date>
<conf-name>X-meeting 2011 - International Conference on the Brazilian Association for Bioinformatics and Computational Biology</conf-name>
<conf-loc>Florianópolis, Brazil</conf-loc>
</conference>
</article-meta>
</front>
<body>
<sec>
<title>Background</title>
<p>Rather than considering a single species in pure culture, metagenomics explores entire microbial communities [
<xref ref-type="bibr" rid="B1">1</xref>
]. This focus has become possible only because of recent improvements in sequencing technology. As is typical of new concepts, this paradigm has brought new challenges. Among them, the manipulation and analysis of short reads deserve special attention.</p>
<p>In some cases, the phylogenetic diversity of a microbial community is not well covered and, as a consequence, only a few reads can be assembled [
<xref ref-type="bibr" rid="B2">2</xref>
]. Hence, one of the first steps of a large-scale metagenomic analysis is to estimate the phylogenetic distribution of the sample. One approach to perform this task is the taxonomic classification of the reads, which is the assignment of these reads into phylogenetic categories [
<xref ref-type="bibr" rid="B3">3</xref>
].</p>
<p>Essentially, there are three approaches to classifying sequences into taxonomic categories. One possibility is to focus on conserved gene markers (such as 16S rRNA) to identify the source organism of the read. Because rRNA is well conserved, this approach produces an accurate taxonomic classification of the reads. Nevertheless, because only a small fraction of the sequences contain these gene markers, most of the reads of a metagenomic sample cannot be classified using this approach [
<xref ref-type="bibr" rid="B4">4</xref>
].</p>
<p>Taxonomic classification can also be based on sequence similarity, that is, the alignment of metagenomic reads to a reference dataset (for example using BLAST [
<xref ref-type="bibr" rid="B5">5</xref>
]). This approach is accurate as long as a similar sequence is present in the database, which is not always the case in metagenomic projects [
<xref ref-type="bibr" rid="B3">3</xref>
]. Some examples of off-the-shelf software for metagenomic analysis based on sequence similarity are CARMA [
<xref ref-type="bibr" rid="B6">6</xref>
] and MEGAN [
<xref ref-type="bibr" rid="B7">7</xref>
].</p>
<p>Yet another way to perform the taxonomic classification is to rely on a set of features that is induced by the sequences of nucleotides, producing the so-called
<italic>composition-based classification</italic>
[
<xref ref-type="bibr" rid="B8">8</xref>
]. Some features employed in this case are: codon usage, GC content, and oligonucleotide frequency (henceforth
<italic>n</italic>
-mer frequency). The latter is usually considered to be a good choice, because the
<italic>n</italic>
-mer frequencies carry phylogenetic signals that are useful for extracting common patterns between organisms at different taxonomic levels [
<xref ref-type="bibr" rid="B9">9</xref>
-
<xref ref-type="bibr" rid="B11">11</xref>
]. The following are some examples of software for taxonomic classification based on
<italic>n</italic>
-mer frequencies: PhyloPythia [
<xref ref-type="bibr" rid="B4">4</xref>
] implements a support vector machine for classifying sequences that are longer than 3 kbp, Phymm [
<xref ref-type="bibr" rid="B12">12</xref>
] uses interpolated Markov models (IMM) to classify reads with at least 100 bp, TACOA [
<xref ref-type="bibr" rid="B8">8</xref>
] merges the k-nearest-neighbor (k-NN) algorithm with kernelized learning strategies to handle sequences from 800 bp to 50 kbp, and Treephyler [
<xref ref-type="bibr" rid="B3">3</xref>
] uses hidden Markov models (HMM) to classify reads of 200 bp.</p>
<p>This work focuses on composition-based classification using
<italic>n</italic>
-mer frequencies to encode genomic sequences. Such an approach involves a series of decisions, regardless of the specific classifier chosen to perform the task. Usually, these decisions are based on a set of preliminary experiments that account for one particular type of classifier [
<xref ref-type="bibr" rid="B4">4</xref>
,
<xref ref-type="bibr" rid="B13">13</xref>
]. These studies provide valuable information regarding the performance of a given category of classifier; however, because they are biased by the peculiarities of the classifier of choice, they provide little insight about the characteristics of the classification problem itself. This paper presents a general framework for the empirical assessment of the impact that several decisions have on the degree of separability of taxonomic classes. Thus, instead of focusing on any classifier in particular, we focus our study on the classification problem.</p>
<p>Here we refer to a specific configuration of the classification problem as the setting induced by the following three features: (i) the length of the
<italic>n</italic>
-mer word used to encode the DNA sequences; (ii) the similarity measure adopted to compare the sequences; and (iii) the strategy used to assign sequences to taxonomic classes, which can be the conventional approach, in which the sequences are considered independently, or the hierarchical approach, in which the taxonomic context of each DNA fragment is accounted for. The goal of the current work is to serve as a guideline for the development of composition-based metagenomic classifiers by providing some intuition as to how the difficulty of the taxonomic classification problem changes with respect to the variation in the features described above.</p>
</sec>
<sec sec-type="methods">
<title>Methods</title>
<sec>
<title>Acquisition of datasets</title>
<p>We used two types of data: (i) complete genomes; and (ii) synthetic metagenomic fragments. These datasets are described in the following sections.</p>
<sec>
<title>Complete genomes</title>
<p>The genomes were obtained from GenBank, the NCBI database of genetic sequences [
<xref ref-type="bibr" rid="B14">14</xref>
]. We used only microbial sequences, because the majority of metagenomic studies are focused on this type of organism [
<xref ref-type="bibr" rid="B15">15</xref>
]. We considered all 1,032 microbial genomes sequenced up to January 2010. Among these, 497 sequences had to be removed because they had an incomplete taxonomic lineage or undefined nucleotides. Hence, the actual number of genomes used was 535, which encompassed the domains Bacteria and Archaea.</p>
</sec>
<sec>
<title>Synthetic metagenomic fragments</title>
<p>The synthetic fragments were generated by the program MetaSim [
<xref ref-type="bibr" rid="B16">16</xref>
] using the genomes described above. MetaSim is a metagenomic sequence simulator that can be used to create sets of synthetic fragments reflecting the taxonomic composition of typical metagenomic scenarios. A total of 23,000 fragments of ~400
<italic>bp </italic>
were generated under the sequencing conditions of Roche's 454 pyrosequencer [
<xref ref-type="bibr" rid="B17">17</xref>
].</p>
</sec>
</sec>
<sec>
<title>Preprocessing of datasets</title>
<p>We now describe how we preprocessed the data to perform our analysis.</p>
<sec>
<title>Calculating n-mer frequencies</title>
<p>To encode the nucleotide sequences we calculated the
<italic>n</italic>
-mer frequencies in each (meta)genomic sequence. To do so, we counted the number of occurrences of all possible
<italic>n</italic>
-mers in a given sequence, considering an overlap of
<italic>n - </italic>
1 nucleotides (that is, the first window covers positions 1 to n, the second covers positions 2 to n+1, and so on). This strategy gives rise to a 4
<italic>
<sup>n</sup>
</italic>
-dimensional vector whose elements represent the number of occurrences of each possible
<italic>n</italic>
-mer. We then divided the elements of such a vector by the total number of
<italic>n</italic>
-mers contained in the sequence. For the experiments with Kullback-Leibler (KL) divergence we used a slightly different approach to count the
<italic>n</italic>
-mers, because the strategy above could lead to a division by zero (see Equation 4). In particular, we assumed that each
<italic>n</italic>
-mer had occurred at least once, a method usually referred to in the literature as "add-one smoothing" [
<xref ref-type="bibr" rid="B18">18</xref>
]. In the end, each sequence is represented by a vector of
<italic>n</italic>
-mer frequencies (hereafter, "vector of frequencies"). We will sometimes refer to a vector of frequencies as simply a "sequence" when there is no risk of misinterpretation. Figure
<xref ref-type="fig" rid="F1">1</xref>
illustrates the process described above.</p>
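<p>The counting procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name nmer_frequencies, the lexicographic ordering of the n-mers, and the choice to skip windows that contain characters other than A, C, G, and T are assumptions made for the example. With smoothing enabled, every n-mer starts with a count of one, and the normalization divides by the smoothed total (also an assumption).</p>

```python
from itertools import product


def nmer_frequencies(sequence, n, smoothing=False):
    """Return the 4**n-dimensional vector of n-mer frequencies of `sequence`.

    Hypothetical helper for illustration only. With smoothing=True every
    n-mer starts with a count of 1 (add-one smoothing), so no entry is zero.
    """
    alphabet = "ACGT"
    # Fixed (lexicographic) ordering of all 4**n possible n-mers.
    words = ["".join(p) for p in product(alphabet, repeat=n)]
    start = 1 if smoothing else 0
    counts = {w: start for w in words}
    # Slide a window of width n one position at a time (overlap of n - 1).
    for i in range(len(sequence) - n + 1):
        word = sequence[i:i + n]
        if word in counts:  # assumption: windows with other symbols are skipped
            counts[word] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no valid n-mers")
    # Normalize the counts so that the vector sums to 1.
    return [counts[w] / total for w in words]


freqs = nmer_frequencies("ACGTAC", 2)                   # plain frequencies
smoothed = nmer_frequencies("ACGTAC", 2, smoothing=True)  # strictly positive entries
```

<p>For the toy sequence ACGTAC and n = 2, the five overlapping windows are AC, CG, GT, TA, and AC, so the entry for AC is 2/5 in the unsmoothed vector.</p>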
<fig id="F1" position="float">
<label>Figure 1</label>
<caption>
<p>
<bold>Process of counting n-mer frequencies</bold>
. Given a value for
<italic>n</italic>
, the first step is generating all of the
<italic>n</italic>
-mer words that are possible. In the next step, we count the number of times that each word appears in the sequence. Finally, we normalize the frequency vector by dividing each number of occurrences by the total number of
<italic>n</italic>
-mers.</p>
</caption>
<graphic xlink:href="1471-2164-13-S5-S1-1"></graphic>
</fig>
</sec>
<sec>
<title>Determining taxonomic lineage</title>
<p>To associate the sequence with its corresponding taxonomic lineage we used the information available at
<italic>NCBI Taxonomy </italic>
and BioPerl, a toolkit for the manipulation of genomic data [
<xref ref-type="bibr" rid="B19">19</xref>
]. The result of this process was a vector comprising seven positions that were filled out with NCBI
<italic>taxids </italic>
(taxonomy identifiers) corresponding to each one of the seven taxa: domain, phylum, class, order, family, genus, and species.</p>
</sec>
</sec>
<sec>
<title>Score functions</title>
<p>The next step is implementing a score function, which provides, under a specific configuration, a score for the degree of separability of the taxonomic classes. To formally define this function, we will adopt the following notation.
<italic>D </italic>
= {
<italic>G</italic>
,
<italic>F</italic>
} is the dataset, in which
<italic>G </italic>
represents the genomic sequences and
<italic>F </italic>
represents the synthetic metagenomic fragments.
<italic>T </italic>
= {
<italic>do</italic>
,
<italic>ph</italic>
,
<italic>cl</italic>
,
<italic>or</italic>
,
<italic>fa</italic>
,
<italic>ge</italic>
,
<italic>sp</italic>
} is the taxon set, which represents the sequence's taxonomic lineage.
<italic>N </italic>
= {1, 2, . . . , 10} is the set of lengths of
<italic>n</italic>
-mers and
<italic>S </italic>
= {1, 2, ∞,
<italic>kl</italic>
} represents the set of similarity measures, where 1 is the 1-norm distance (Equation 1), 2 represents the 2-norm (Euclidean) distance (Equation 2), and ∞ is the ∞-norm distance (Equation 3);
<italic>kl </italic>
is the Kullback-Leibler divergence (Equation 4).</p>
<p>
<disp-formula id="bmcM1">
<label>(1)</label>
<mml:math id="M1" name="1471-2164-13-S5-S1-i1" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo mathsize="big">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:munderover>
<mml:mo class="MathClass-rel">|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-rel">|</mml:mo>
<mml:mo class="MathClass-punc">,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>
<disp-formula id="bmcM2">
<label>(2)</label>
<mml:math id="M2" name="1471-2164-13-S5-S1-i2" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mfenced close=")" open="(">
<mml:mrow>
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo mathsize="big">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:munderover>
<mml:mo class="MathClass-rel">|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mo class="MathClass-rel">|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo class="MathClass-bin">/</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo class="MathClass-punc">,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>
<disp-formula id="bmcM3">
<label>(3)</label>
<mml:math id="M3" name="1471-2164-13-S5-S1-i3" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>∞</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:munder class="msub">
<mml:mrow>
<mml:mtext>max</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mi>.</mml:mi>
<mml:mi>.</mml:mi>
<mml:mi>.</mml:mi>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:munder>
<mml:mo class="MathClass-rel">|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-rel">|</mml:mo>
<mml:mo class="MathClass-punc">,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>
<disp-formula id="bmcM4">
<label>(4)</label>
<mml:math id="M4" name="1471-2164-13-S5-S1-i4" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mstyle class="text">
<mml:mtext class="textsf" mathvariant="sans-serif">k</mml:mtext>
</mml:mstyle>
<mml:mi mathvariant="normal">l</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo mathsize="big">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mtext>ln</mml:mtext>
<mml:mfenced close=")" open="(">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mfenced>
<mml:mi>.</mml:mi>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
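<p>Equations 1 to 4 translate directly into code. The sketch below is illustrative only; the function names s1, s2, s_inf, and s_kl are chosen here to mirror the notation of the equations, and the Kullback-Leibler function assumes that both vectors have strictly positive entries, as guaranteed by add-one smoothing.</p>

```python
import math


def s1(x, y):
    """1-norm distance (Equation 1)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def s2(x, y):
    """2-norm (Euclidean) distance (Equation 2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def s_inf(x, y):
    """Infinity-norm distance (Equation 3): largest coordinate-wise gap."""
    return max(abs(a - b) for a, b in zip(x, y))


def s_kl(x, y):
    """Kullback-Leibler divergence (Equation 4).

    Assumes every entry of x and y is strictly positive.
    """
    return sum(a * math.log(a / b) for a, b in zip(x, y))


# Two toy 1-mer frequency vectors (4**1 = 4 entries each).
x = [0.4, 0.3, 0.2, 0.1]
y = [0.1, 0.2, 0.3, 0.4]
distances = {"1": s1(x, y), "2": s2(x, y), "inf": s_inf(x, y), "kl": s_kl(x, y)}
```

<p>Unlike the three norms, the Kullback-Leibler divergence is not symmetric in general, so the order of the arguments matters when it is used to search for a nearest neighbor.</p>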
<p>
<italic>A </italic>
={
<italic>c</italic>
,
<italic>h</italic>
} is the set of score measures. The element
<italic>c </italic>
represents the conventional score measure, in which the configuration is scored considering the sequence separately, and the element
<italic>h </italic>
is the hierarchical score measure, in which the configuration is scored with respect to the sequence's taxonomic context (see below). Considering this notation the score function is defined as follows:</p>
<p>
<disp-formula id="bmcM5">
<label>(5)</label>
<mml:math id="M5" name="1471-2164-13-S5-S1-i5" overflow="scroll">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mo class="MathClass-rel">:</mml:mo>
<mml:mi>D</mml:mi>
<mml:mo class="MathClass-bin">×</mml:mo>
<mml:mi>T</mml:mi>
<mml:mo class="MathClass-bin">×</mml:mo>
<mml:mi>N</mml:mi>
<mml:mo class="MathClass-bin">×</mml:mo>
<mml:mi>S</mml:mi>
<mml:mo class="MathClass-bin">×</mml:mo>
<mml:mi>A</mml:mi>
<mml:mo class="MathClass-rel">→</mml:mo>
<mml:mrow>
<mml:mo class="MathClass-open">[</mml:mo>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo class="MathClass-close">]</mml:mo>
</mml:mrow>
<mml:mi>.</mml:mi>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>Thus,
<italic>f </italic>
(
<italic>d</italic>
,
<italic>t</italic>
,
<italic>n</italic>
,
<italic>s</italic>
,
<italic>a</italic>
) =
<italic>y </italic>
assigns a score
<italic>y </italic>
to the dataset
<italic>d</italic>
, considering the taxon
<italic>t</italic>
, using an
<italic>n</italic>
-mer length of
<italic>n </italic>
to encode the sequences and the similarity measure
<italic>s </italic>
to check how similar the sequences are and, finally, using the score measure
<italic>a</italic>
. In other words, the score
<italic>y </italic>
is a measure of the degree of separability of the taxonomic classes in
<italic>d </italic>
at level
<italic>t </italic>
under the specific configuration induced by
<italic>n</italic>
,
<italic>s</italic>
, and
<italic>a</italic>
.</p>
<p>We now describe how we defined the score measures that were used to evaluate the classification problem.</p>
<sec>
<title>Conventional score measure</title>
<p>We want to assess the "separability" of the taxonomic classes under a given configuration. A straightforward way to do so would be to choose a specific type of classifier and then measure its classification accuracy for each possible combination of values for (
<italic>d</italic>
,
<italic>t</italic>
,
<italic>n</italic>
,
<italic>s</italic>
,
<italic>a</italic>
) (using cross-validation, for example [
<xref ref-type="bibr" rid="B20">20</xref>
]). Note that in this case we would be measuring the difficulty of the problem under the assumptions made by that specific classifier. For example, if we adopted a linear model such as the Naive Bayes classifier, then we would be measuring how well classes can be separated by a hyperplane [
<xref ref-type="bibr" rid="B20">20</xref>
]. Therefore, if we want to make no assumptions regarding the "shape" of the classes, the correct approach would be to use a nonlinear model capable of representing any boundary between the classes (such as a support vector machine using an appropriate kernel [
<xref ref-type="bibr" rid="B21">21</xref>
]). However, such an approach would require an expensive cross-validation process to determine the correct level of complexity of the model under each configuration (using, for example, regularization [
<xref ref-type="bibr" rid="B20">20</xref>
]).</p>
<p>We want a measure of the separability of the classes that can be efficiently computed and at the same time makes no strong assumptions regarding the shapes of the classes. A possible way of solving this problem is to base our measure on this simple observation: given a set of objects that belong to different classes, the level of separability of the classes can be assessed by the fraction of objects whose closest neighbor belongs to the same class. Note that, under this criterion, if the boundaries between the classes are well defined, then the set of objects will usually be considered to be separable, regardless of the shape of the classes. Therefore, this simple measure is an efficient way of assessing the degree of overlap between classes.</p>
<p>Algorithm 1 presents a detailed description of the computation of the proposed separability measure. Given a configuration (
<italic>d</italic>
,
<italic>t</italic>
,
<italic>n</italic>
,
<italic>s</italic>
), for each sequence in
<italic>d</italic>
, we calculate its nearest neighbor (NN) and check whether both sequences belong to the same class at the taxonomic level
<italic>t</italic>
. If so, then we add 1 to the configuration score. The result is then normalized to fall in the interval [0,1]. We call this approach the
<italic>conventional score </italic>
measure.</p>
<p>
<bold>Algorithm 1</bold>
: conventional_score(
<italic>d</italic>
,
<italic>t</italic>
,
<italic>n</italic>
,
<italic>s</italic>
)</p>
<p>/* Computes the conventional score for a given set of DNA sequences */</p>
<p>
<bold>Input</bold>
:
<italic>d </italic>
∈
<italic>D</italic>
,
<italic>t </italic>
∈
<italic>T</italic>
,
<italic>n </italic>
∈
<italic>N</italic>
,
<italic>s </italic>
∈
<italic>S</italic>
</p>
<p>
<bold>Output</bold>
: Conventional score</p>
<p>1 score ← 0</p>
<p>2
<italic>m </italic>
← 0</p>
<p>3
<bold>foreach </bold>
<italic>sequence d
<sub>i </sub>
</italic>
∈ <italic>d </italic>
<bold>do</bold>
</p>
<p>4
<bold>if </bold>
<italic>d
<sub>i </sub>
is not the only representative of its class in d at level t </italic>
<bold>then</bold>
</p>
<p>5
<italic>m </italic>
← <italic>m </italic>
+ 1</p>
<p>6
<italic>d
<sub>j </sub>
</italic>
← <italic>NN</italic>
(
<italic>d
<sub>i</sub>
</italic>
,
<italic>d</italic>
,
<italic>n</italic>
,
<italic>s</italic>
) ;/* nearest neighbor of
<italic>d
<sub>i </sub>
</italic>
in
<italic>d </italic>
using
<italic>n</italic>
-mers and measure
<italic>s </italic>
*/</p>
<p>7
<bold>if </bold>
class(
<italic>d
<sub>i</sub>
</italic>
) = class(
<italic>d
<sub>j</sub>
</italic>
) at taxonomic level
<italic>t </italic>
<bold>then </bold>
score ← score + 1</p>
<p>8
<bold>return </bold>
score/
<italic>m</italic>
</p>
<p>Note that, if a genome is the only representative of its taxonomic group, then its nearest neighbor will necessarily belong to another class, which biases the score measure shown in Algorithm 1 downward. For this reason, we classify a genome only if it is not the unique example of its taxonomic class (line 4 of Algorithm 1). In the dataset used in our experiments, classes with a single member occur only at the taxonomic level of species. Specifically, out of 535 genomes used in the experiments, 328 were the unique representatives of their species.</p>
<p>As shown in Algorithm 1, the conventional score is the fraction of sequences that have the same lineage as their nearest neighbors at a given taxonomic level. Incidentally, this approach is similar to using the
<italic>k</italic>
-Nearest-Neighbor (
<italic>k</italic>
-NN) classifier with
<italic>k </italic>
= 1 (except that in the latter case we would not eliminate classes with a single representative) [
<xref ref-type="bibr" rid="B22">22</xref>
]. This approach is in accordance with our objective of focusing our analysis on the classification problem, because the 1-NN classifier does not make strong assumptions regarding the shape of the classes [
<xref ref-type="bibr" rid="B20">20</xref>
].</p>
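<p>Algorithm 1 can be sketched in Python as follows. The sketch is an illustration under stated assumptions rather than the authors' code: it takes precomputed frequency vectors and the class labels at a single taxonomic level instead of the tuple (d, t, n, s), the helper names are hypothetical, and returning 0.0 when no sequence is eligible is a choice made here.</p>

```python
from collections import Counter


def one_norm(x, y):
    """1-norm distance, used here as the similarity measure s."""
    return sum(abs(a - b) for a, b in zip(x, y))


def nearest_neighbor(i, vectors, dist):
    """Index of the sequence closest to vectors[i], excluding i itself."""
    others = [j for j in range(len(vectors)) if j != i]
    return min(others, key=lambda j: dist(vectors[i], vectors[j]))


def conventional_score(vectors, labels, dist):
    """Fraction of sequences whose nearest neighbor has the same label.

    Sequences that are the only representative of their class are skipped
    (line 4 of Algorithm 1).
    """
    class_sizes = Counter(labels)
    score, m = 0, 0
    for i, label in enumerate(labels):
        if class_sizes[label] == 1:
            continue  # unique representative: not counted
        m += 1
        j = nearest_neighbor(i, vectors, dist)
        if labels[j] == label:
            score += 1
    # Assumption: an empty evaluation set yields a score of 0.0.
    return score / m if m else 0.0


# Toy data: one-dimensional "frequency vectors" with labels at one level.
vectors = [[0.0], [0.1], [0.9], [1.0], [0.25], [5.0]]
labels = ["a", "a", "b", "b", "b", "c"]
toy_score = conventional_score(vectors, labels, one_norm)
```

<p>In the toy data, the sequence at 0.25 belongs to class b but lies closest to a member of class a, and the single member of class c is skipped, so four of the five eligible sequences are counted as correct.</p>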
</sec>
<sec>
<title>Hierarchical score measure</title>
<p>Given the hierarchical structure of the taxonomic classification task, one might wonder whether it is a good strategy to decompose the problem into simpler sub-problems that are defined at each taxonomic level. More specifically, instead of using a single classifier, one would have a hierarchy of classifiers that are organized according to the taxonomic tree. In this case, a given DNA sequence would be classified as follows: first, a classifier at the highest hierarchical level would determine the domain to which the sequence belongs. Then, the DNA sequence would be classified at the next hierarchical level, the phylum, with the particular classifier used to do so determined by the domain the sequence was assigned to at one level above. Following the same reasoning, the sequence would then be passed on to the classifier that is responsible for the specific phylum that it was assigned to, and so on, until the desired taxonomic level had been reached. This classification strategy has been used before in the literature [
<xref ref-type="bibr" rid="B4">4</xref>
,
<xref ref-type="bibr" rid="B23">23</xref>
].</p>
<p>Note that, to compare the hierarchical scoring measure with the conventional measure, we cannot simply apply Algorithm 1 at each taxonomic level, because the nearest neighbor of a given sequence defines its classification at all of the taxonomic levels (and thus the hierarchical score would coincide with the conventional score). Since we do not want to introduce any bias in our analysis, we must define a score measure that is compatible with our strategy of measuring the separability between classes. This goal can be accomplished as follows. Suppose that a given sequence
<italic>d
<sub>i </sub>
</italic>
has been correctly classified at taxonomic level
<italic>t</italic>
. Then, to classify it one level below in the taxonomic tree,
<italic>t </italic>
+ 1, we can eliminate all of the
sequences that do not belong to the same class as
<italic>d
<sub>i </sub>
</italic>
at level
<italic>t</italic>
. This procedure corresponds to selecting a specific classifier in the hierarchical scheme described above. Next, if we remove our initial assumption that
<italic>d
<sub>i </sub>
</italic>
was correctly classified at level
<italic>t</italic>
, it is clear that, by eliminating the appropriate sequences of the dataset, an incorrect classification at level
<italic>t </italic>
can be followed by a correct classification at level
<italic>t </italic>
+ 1. This strategy is precisely what allows us to evaluate the hierarchical classification using nothing but the nearest neighbor of each DNA sequence.</p>
<p>Observe that eliminating the sequences that do not belong to the same class as
<italic>d
<sub>i </sub>
</italic>
at level
<italic>t </italic>
corresponds to assuming that
<italic>d
<sub>i </sub>
</italic>
was correctly classified at that level. Of course, to have an accurate score function at level
<italic>t </italic>
+ 1, we must account for the possibility that the sequence was incorrectly classified at level
<italic>t</italic>
. Clearly, a straightforward way to estimate the probability of a misclassification at level
<italic>t </italic>
is to use the score function at that level. Therefore, we define the
<italic>hierarchical score </italic>
measure recursively: roughly speaking, the hierarchical score at level
<italic>t </italic>
corresponds to the product of the nearest-neighbor score computed at the same level (with the search restricted to the sequences that share the class of each sequence one level above) and the hierarchical score one level above. Algorithm 2 provides a step-by-step description of how to compute the proposed hierarchical score measure.</p>
<p>
<bold>Algorithm 2</bold>
: hierarchical_score(
<italic>d</italic>
,
<italic>t</italic>
,
<italic>n</italic>
,
<italic>s</italic>
)</p>
<p>/* Computes the hierarchical score for a given set of DNA sequences */</p>
<p>
<bold>Input</bold>
:
<italic>d </italic>
∈
<italic>D</italic>
,
<italic>t </italic>
∈
<italic>T</italic>
,
<italic>n </italic>
∈
<italic>N</italic>
,
<italic>s </italic>
∈
<italic>S</italic>
</p>
<p>
<bold>Output</bold>
: Hierarchical score</p>
<p>1
<bold> if </bold>
<italic>t </italic>
= 1
<bold>then return </bold>
conventional_score(
<italic>d</italic>
,
<italic>t</italic>
,
<italic>n</italic>
,
<italic>s</italic>
) ;/*
<italic>i.e</italic>
., if
<italic>t </italic>
is "domain" */</p>
<p>2
<bold>else</bold>
</p>
<p>3 score ← 0</p>
<p>4
<italic>m </italic>
← 0</p>
<p>5
<bold>foreach </bold>
<italic>sequence d
<sub>i </sub>
</italic>
∈ <italic>d </italic>
<bold>do</bold>
</p>
<p>6
<italic>d</italic>
' ←
<italic>d </italic>
with only sequences
<italic>d
<sub>k </sub>
</italic>
which belong to the same class as
<italic>d
<sub>i </sub>
</italic>
at level
<italic>t </italic>
- 1</p>
<p>7
<bold>if </bold>
|
<italic>d</italic>
'| > 1
<italic>and d
<sub>i </sub>
is not the only representative of its class at level t </italic>
<bold>then</bold>
</p>
<p>8
<italic>m </italic>
← <italic>m </italic>
+ 1</p>
<p>9
<italic>d
<sub>j </sub>
</italic>
← NN(
<italic>d
<sub>i</sub>
</italic>
,
<italic>d</italic>
',
<italic>n</italic>
,
<italic>s</italic>
)</p>
<p>10
<bold>if </bold>
class(
<italic>d
<sub>i</sub>
</italic>
) = class(
<italic>d
<sub>j</sub>
</italic>
) at taxonomic level
<italic>t </italic>
<bold>then </bold>
score ← score + 1</p>
<p>11
<bold>return </bold>
score/
<italic>m </italic>
* hierarchical_score(
<italic>d</italic>
,
<italic>t </italic>
- 1,
<italic>n</italic>
,
<italic>s</italic>
)</p>
<p>Using Algorithm 2, one can assess the degree of separability of taxonomic classes under a hierarchical classification scheme without making any strong assumptions regarding the shape of the classes. Therefore, the result of such an analysis applies to any set of classifiers, including a heterogeneous hierarchy composed of classifiers of different types.</p>
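<p>The recursion of Algorithm 2 can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: lineages are assumed to be tuples of class identifiers ordered from domain (index 0) downward, the helper names are hypothetical, and returning 0.0 when no sequence is eligible at a level is a choice made here because Algorithm 2 does not cover that case.</p>

```python
def one_norm(x, y):
    """1-norm distance, used here as the similarity measure s."""
    return sum(abs(a - b) for a, b in zip(x, y))


def level_score(vectors, lineages, t, dist):
    """Nearest-neighbor score at level t, searching only inside the parent class."""
    score, m = 0, 0
    indices = range(len(vectors))
    for i in indices:
        if t == 0:
            pool = list(indices)  # domain level: whole dataset, as in Algorithm 1
        else:
            # d': sequences sharing the class of d_i at level t - 1
            pool = [k for k in indices if lineages[k][t - 1] == lineages[i][t - 1]]
        same_class = [k for k in pool if lineages[k][t] == lineages[i][t]]
        if len(pool) > 1 and len(same_class) > 1:
            m += 1
            j = min((k for k in pool if k != i),
                    key=lambda k: dist(vectors[i], vectors[k]))
            if lineages[j][t] == lineages[i][t]:
                score += 1
    # Assumption: a level with no eligible sequence scores 0.0.
    return score / m if m else 0.0


def hierarchical_score(vectors, lineages, t, dist):
    """Product of the restricted score at level t and the score one level above."""
    local = level_score(vectors, lineages, t, dist)
    if t == 0:
        return local
    return local * hierarchical_score(vectors, lineages, t - 1, dist)


# Toy data with two levels: (domain, phylum).
vectors = [[0.0], [0.1], [0.3], [0.4], [0.22], [1.0]]
lineages = [("A", "x"), ("A", "x"), ("A", "y"), ("A", "y"), ("B", "z"), ("B", "z")]
top = hierarchical_score(vectors, lineages, 0, one_norm)
bottom = hierarchical_score(vectors, lineages, 1, one_norm)
```

<p>In this toy example, half of the sequences have a nearest neighbor in the wrong domain, whereas every sequence is matched correctly once the search is restricted to its own domain, so the score at the second level equals the score at the first.</p>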
</sec>
</sec>
</sec>
<sec>
<title>Results and discussion</title>
<p>As described above, in this work we assume that a given configuration of the taxonomic classification problem is defined by: (i)
<italic>n</italic>
, the length of
<italic>n</italic>
-mers used to encode the sequences; (ii)
<italic>s</italic>
, the similarity measure used; and (iii)
<italic>a</italic>
, the score measure, which can be the conventional measure or the hierarchical measure (Algorithms 1 and 2, respectively). To provide an empirical basis for the development of composition-based metagenomic classifiers, we analyzed the separability of taxonomic classes under different configurations of the classification task.</p>
<p>We performed 10 × 4 × 2 = 80 experiments with the genomic dataset and 8 × 4 × 2 = 64 experiments with the synthetic metagenomic fragments data (in both cases the three numbers correspond to
<italic>|N|</italic>
,
<italic>|S|</italic>
, and
<italic>|A|</italic>
, respectively; see Equation (5)). In total, we performed 80 + 64 = 144 experiments. Our analysis addresses the impact of parameters
<italic>n</italic>
,
<italic>s</italic>
, and
<italic>a </italic>
on the configuration scores. Although we also discuss other taxonomic levels, we focus our analysis on the classification problem at the species level.</p>
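<p>The experiment counts above follow from enumerating the Cartesian product of the parameter sets. The short sketch below reproduces the arithmetic; the text states only that eight n-mer lengths were used for the fragments, so the range 1 to 8 is an assumption made for illustration.</p>

```python
from itertools import product

N_GENOMES = range(1, 11)    # n = 1, ..., 10 for the complete genomes
N_FRAGMENTS = range(1, 9)   # assumption: the eight lengths are n = 1, ..., 8
S = ["1", "2", "inf", "kl"]  # similarity measures
A = ["c", "h"]               # conventional and hierarchical score measures

# One experiment per (n, s, a) combination and dataset.
genome_configs = list(product(N_GENOMES, S, A))
fragment_configs = list(product(N_FRAGMENTS, S, A))
total_experiments = len(genome_configs) + len(fragment_configs)
```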
<sec>
<title>Complete genomes</title>
<p>The genomic dataset comprises 535 genomes encompassing 386 different species. Considering the conventional score measure, the configuration scores for this type of data at the level of species varied from
<italic>f </italic>
(
<italic>G</italic>
,
<italic>sp</italic>
, 1,
<italic>kl</italic>
,
<italic>c</italic>
) = 0.275, for the worst configuration, to
<italic>f </italic>
(
<italic>G</italic>
,
<italic>sp</italic>
, 5, 2,
<italic>c</italic>
) = 0.512, for the best configuration. The hierarchical scores varied between
<italic>f </italic>
(
<italic>G</italic>
,
<italic>sp</italic>
, 1, 2,
<italic>h</italic>
) = 0.378 and
<italic>f </italic>
(
<italic>G</italic>
,
<italic>sp</italic>
, 7,
<italic>kl</italic>
,
<italic>h</italic>
) = 0.532.</p>
<p>Figure
<xref ref-type="fig" rid="F2">2</xref>
presents the configuration scores that were generated on the genomic dataset over the different taxa for
<italic>n </italic>
= 5 (the value of
<italic>n </italic>
that generated the highest conventional scores). As shown in the figure, as we go downward in the taxonomic tree (t → species), the configuration score decreases. This decrease is expected, because a correct nearest-neighbor classification at level
<italic>t </italic>
implies a correct classification at level
<italic>t - </italic>
1 (but not the converse). Observe that in the left graph in Figure
<xref ref-type="fig" rid="F2">2</xref>
the score function actually increases when one moves from the taxon genus to the species. This increase is due to the removal of unique representatives of some species, as explained above. Surprisingly, varying the similarity measure
<italic>s </italic>
did not result in remarkable differences in the scores. As shown in Figure
<xref ref-type="fig" rid="F2">2</xref>
, the scores referring to
<italic>s </italic>
= 1,
<italic>s </italic>
= 2, and
<italic>s </italic>
=
<italic>kl </italic>
are very similar, and the scores computed with
<italic>s </italic>
= ∞ differ only slightly from the others. This phenomenon was observed across all configurations. Thus, from this point on, we will fix the similarity measure at
<italic>s </italic>
=
<italic>kl </italic>
and study the impact of the other variables over the scores.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption>
<p>
<bold>Configuration scores per taxon for a genomic dataset (
<italic>d </italic>
=
<italic>G</italic>
)</bold>
. The graph on the left presents the scores for the configuration (
<italic>G</italic>
,
<italic>-</italic>
, 5,
<italic>-</italic>
,
<italic>c</italic>
) and the graph on the right presents the scores for the configuration (
<italic>G</italic>
,
<italic>-</italic>
, 5,
<italic>-</italic>
,
<italic>h</italic>
).</p>
</caption>
<graphic xlink:href="1471-2164-13-S5-S1-2"></graphic>
</fig>
<p>Figure
<xref ref-type="fig" rid="F3">3</xref>
shows the genomic scores per
<italic>n</italic>
-mer length for the different taxa. In Figure
<xref ref-type="fig" rid="F3">3</xref>
, it is difficult to identify the value of
<italic>n </italic>
that produces the best score, because from
<italic>n </italic>
= 2 to
<italic>n </italic>
= 8 the score curve is almost flat. From
<italic>n </italic>
= 1 to
<italic>n </italic>
= 2 there is a sharp increase in the scores. This increase was expected, because
<italic>n </italic>
= 1 means counting the frequencies of the nucleotides A, T, C, and G, which does not provide sufficient information about the sequences to discriminate between the classes. In general, with a small value of
<italic>n </italic>
, distinct sequences tend to be mapped to similar frequency vectors. As an example, consider the taxonomic tree shown in Figure
<xref ref-type="fig" rid="F4">4</xref>
, which includes the phyla Crenarchaeota, Actinobacteria, Bacteroidetes, Thermotogae, and Chlamydiae. Although these phyla are distant from a taxonomic point of view, some of their members give rise to very similar frequency vectors, as shown in Table
<xref ref-type="table" rid="T1">1</xref>
.</p>
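<p>To make this near-coincidence concrete, the 1-mer vectors of three of the sequences listed in Table 1 can be compared with the 1-norm distance of Equation 1. The snippet below is only a worked illustration of that arithmetic; the helper name one_norm is hypothetical, and the result does not depend on the order of the four entries, which the table does not state.</p>

```python
def one_norm(x, y):
    """1-norm distance between two frequency vectors (Equation 1)."""
    return sum(abs(a - b) for a, b in zip(x, y))


# 1-mer frequency vectors copied from Table 1.
d1 = [0.2613, 0.2611, 0.2379, 0.2397]  # Bacteroidetes
d2 = [0.2606, 0.2612, 0.2390, 0.2392]  # Actinobacteria
d3 = [0.2445, 0.2443, 0.2557, 0.2554]  # Thermotogae

dist_12 = one_norm(d1, d2)
dist_13 = one_norm(d1, d3)
```

<p>The distance between d1 and d2 is about 0.0024 and the distance between d1 and d3 is about 0.0671, whereas the largest possible 1-norm distance between two frequency vectors is 2. Sequences from different phyla are therefore almost indistinguishable when n = 1.</p>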
<fig id="F3" position="float">
<label>Figure 3</label>
<caption>
<p>
<bold>Configuration scores per
<bold>
<italic>n</italic>
</bold>
-mer word length for the genomic dataset</bold>
. The graph on the left was generated under the configuration (
<italic>G</italic>
,
<italic>-</italic>
,
<italic>-</italic>
,
<italic>kl</italic>
,
<italic>c</italic>
), and the graph on the right was generated under (
<italic>G</italic>
,
<italic>-</italic>
,
<italic>-</italic>
,
<italic>kl</italic>
,
<italic>h</italic>
). All of the seven taxonomic levels are considered.</p>
</caption>
<graphic xlink:href="1471-2164-13-S5-S1-3"></graphic>
</fig>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption>
<p>
<bold>Taxonomic tree</bold>
. Taxonomic tree for phyla, including Crenarchaeota, Actinobacteria, Bacteroidetes, Thermotogae, and Chlamydiae.</p>
</caption>
<graphic xlink:href="1471-2164-13-S5-S1-4"></graphic>
</fig>
<table-wrap id="T1" position="float">
<label>Table 1</label>
<caption>
<p>1
<bold>-</bold>
mer frequencies for sequences in five different phyla.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="center">Sequence</th>
<th align="center" colspan="4">Sequence representation (1-mer frequencies)</th>
<th align="center">Phylum</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">
<italic>d</italic>
<sub>1</sub>
</td>
<td align="center">0.2613</td>
<td align="center">0.2611</td>
<td align="center">0.2379</td>
<td align="center">0.2397</td>
<td align="center">Bacteroidetes</td>
</tr>
<tr>
<td align="center">
<italic>d</italic>
<sub>2</sub>
</td>
<td align="center">0.2606</td>
<td align="center">0.2612</td>
<td align="center">0.2390</td>
<td align="center">0.2392</td>
<td align="center">Actinobacteria</td>
</tr>
<tr>
<td align="center">
<italic>d</italic>
<sub>3</sub>
</td>
<td align="center">0.2445</td>
<td align="center">0.2443</td>
<td align="center">0.2557</td>
<td align="center">0.2554</td>
<td align="center">Thermotogae</td>
</tr>
<tr>
<td align="center">
<italic>d</italic>
<sub>4</sub>
</td>
<td align="center">0.2430</td>
<td align="center">0.2584</td>
<td align="center">0.2439</td>
<td align="center">0.2547</td>
<td align="center">Chlamydiae</td>
</tr>
<tr>
<td align="center">
<italic>d</italic>
<sub>5</sub>
</td>
<td align="center">0.2690</td>
<td align="center">0.2678</td>
<td align="center">0.2317</td>
<td align="center">0.2315</td>
<td align="center">Crenarchaeota</td>
</tr>
</tbody>
</table>
</table-wrap>
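<p>To make the point of Table 1 concrete, the following minimal Python sketch computes the pairwise distances between the five 1-mer frequency vectors listed there. The L1 distance is used purely for illustration and is an assumption of this sketch, as are the names <monospace>table1</monospace> and <monospace>l1_distance</monospace>; the values are copied from Table 1 in the column order given there.</p>
<preformat><![CDATA[
```python
from itertools import combinations

# 1-mer frequency vectors copied from Table 1 (column order as in the table).
table1 = {
    "d1 (Bacteroidetes)": (0.2613, 0.2611, 0.2379, 0.2397),
    "d2 (Actinobacteria)": (0.2606, 0.2612, 0.2390, 0.2392),
    "d3 (Thermotogae)": (0.2445, 0.2443, 0.2557, 0.2554),
    "d4 (Chlamydiae)": (0.2430, 0.2584, 0.2439, 0.2547),
    "d5 (Crenarchaeota)": (0.2690, 0.2678, 0.2317, 0.2315),
}


def l1_distance(u, v):
    """Sum of absolute coordinate differences between two vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Distance for every unordered pair of sequences (illustrative measure only).
pairwise = {
    (a, b): l1_distance(table1[a], table1[b])
    for a, b in combinations(table1, 2)
}

# Print the pairs from most similar to least similar.
for (a, b), dist in sorted(pairwise.items(), key=lambda kv: kv[1]):
    print(f"{a} vs {b}: {dist:.4f}")
```
]]></preformat>
<p>Two frequency vectors that each sum to one can be at most 2 apart in L1 distance. Every pair above is less than 0.1 apart, and the closest pair (d1 and d2) comes from two different phyla, which is consistent with the low scores observed for very small values of n.</p>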
<p>Observe also that from
<italic>n </italic>
= 8 to
<italic>n </italic>
= 10 the scores decrease slightly. This decrease is a consequence of the fact that, when
<italic>n </italic>
≥ 8, the number of possible
<italic>n</italic>
-mer sequences is very large, which results in sparse frequency vectors with a low discriminative power. For example, if the similar sequences
<italic>d<sub>i</sub> </italic>
=
<italic>A</italic>
<bold>A</bold>
<italic>ATGGTA </italic>
and
<italic>d
<sub>j </sub>
</italic>
=
<italic>A</italic>
<bold>G</bold>
<italic>ATGGTA </italic>
are encoded with
<italic>n </italic>
= 8, the result is two vectors with 65,536 positions filled with zeros in all but one position, which would contain a "1" representing the words
<italic>d
<sub>i </sub>
</italic>
and
<italic>d
<sub>j</sub>
</italic>
. Hence, we have two extremely similar sequences represented by two frequency vectors that share no nonzero entry, which clearly distorts the score function
<italic>f</italic>
.</p>
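<p>The effect described in this example can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: overlapping n-mers are counted, the vectors are stored sparsely, and the helper names <monospace>nmer_counts</monospace> and <monospace>shared_fraction</monospace> are hypothetical and not part of the framework described in this paper.</p>
<preformat><![CDATA[
```python
from collections import Counter


def nmer_counts(seq, n):
    """Count every overlapping n-mer in seq (a sparse frequency vector)."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def shared_fraction(seq_a, seq_b, n):
    """Fraction of n-mer occurrences that the two sequences have in common."""
    a, b = nmer_counts(seq_a, n), nmer_counts(seq_b, n)
    # Multiset intersection: a Counter returns 0 for n-mers it has not seen.
    shared = sum(min(a[k], b[k]) for k in a)
    return shared / max(sum(a.values()), sum(b.values()))


d_i = "AAATGGTA"
d_j = "AGATGGTA"  # differs from d_i only in the second nucleotide

print("vector dimension at n = 8:", 4 ** 8)
for n in (2, 4, 8):
    print("n =", n, "shared fraction =", shared_fraction(d_i, d_j, n))
```
]]></preformat>
<p>With n = 2 or n = 4 the two sequences still share most of their n-mers, whereas with n = 8 each sequence is a single word and the two vectors have nothing in common, even though the sequences differ in only one position.</p>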
<p>Concerning the two score measures, the hierarchical score was slightly higher than the conventional score, as shown in Figures
<xref ref-type="fig" rid="F2">2</xref>
and
<xref ref-type="fig" rid="F3">3</xref>
. This relationship suggests that decomposing the classification task into smaller sub-problems does indeed make the problem easier.</p>
</sec>
<sec>
<title>Synthetic metagenomic fragments</title>
<p>The synthetic dataset comprises 23,000 fragments of approximately 400
<italic>bp</italic>
each. As mentioned previously, these sequences were generated with the sequence simulator MetaSim [
<xref ref-type="bibr" rid="B16">16</xref>
] under the sequencing conditions of the 454 pyrosequencer. The configuration scores at the level of species varied between
<italic>f</italic>
(
<italic>F</italic>
,
<italic>sp</italic>
, 8, ∞,
<italic>c</italic>
) = 0.007 and
<italic>f</italic>
(
<italic>F</italic>
,
<italic>sp</italic>
, 4,
<italic>kl</italic>
,
<italic>c</italic>
) = 0.112 for the conventional score function, and between
<italic>f</italic>
(
<italic>F</italic>
,
<italic>sp</italic>
, 7, 1,
<italic>h</italic>
) = 0.113 and
<italic>f</italic>
(
<italic>F</italic>
,
<italic>sp</italic>
, 4, 1,
<italic>h</italic>
) = 0.5 when the hierarchical measure was considered.</p>
<p>Figure
<xref ref-type="fig" rid="F5">5</xref>
shows the value of the score as a function of the taxonomic level
<italic>t </italic>
when
<italic>n </italic>
= 4. The first thing that stands out in this figure is that, for the synthetic data, the advantage of the hierarchical score measure over the conventional measure is much more pronounced than with complete genomes. This result indicates that, when sequences are short, the overlap between the classes is less correlated with the taxonomic tree. In other words, the overlap between two classes at level
<italic>t </italic>
is not strongly affected by the fact that they belong to the same class at level
<italic>t </italic>
- 1. A possible explanation for this phenomenon is that shorter sequences give rise to higher variability within each class.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption>
<p>
<bold>Configuration scores per taxon for the metagenomic synthetic fragments dataset
<bold>(
<italic>d </italic>
=
<italic>F</italic>
)</bold>
</bold>
. The graph on the left was generated under configuration (
<italic>F</italic>
,
<italic>-</italic>
, 4,
<italic>-</italic>
,
<italic>c</italic>
) and the graph on the right was generated under configuration (
<italic>F</italic>
,
<italic>-</italic>
, 4,
<italic>-</italic>
,
<italic>h</italic>
).</p>
</caption>
<graphic xlink:href="1471-2164-13-S5-S1-5"></graphic>
</fig>
<p>Again, changing the similarity measure
<italic>s </italic>
did not have a significant impact on the scores. Note, however, that with metagenomic fragments the use of
<italic>s </italic>
= ∞ degrades the scores more noticeably than in the case of complete genomes (compare Figures
<xref ref-type="fig" rid="F2">2</xref>
and
<xref ref-type="fig" rid="F5">5</xref>
). Figure
<xref ref-type="fig" rid="F6">6</xref>
shows the conventional and hierarchical scores as a function of
<italic>n </italic>
when the KL divergence is adopted as the similarity measure. Here we observe curves similar to the curves shown in Figure
<xref ref-type="fig" rid="F3">3</xref>
, with the peak of each curve shifted slightly to the left. This change makes sense, because with shorter sequences the "sparsification" of frequency vectors discussed in the previous section occurs at smaller values of
<italic>n</italic>
. Additionally, note how the conventional scores of the metagenomic dataset are low at the taxonomic level of order and below. This trend suggests that using a single classifier in this case might not be the best alternative. Such an observation could be particularly helpful in the future development of composition-based classifiers, because one of the major problems with real metagenomic projects is the difficulty of obtaining accurate classification at lower taxonomic levels [
<xref ref-type="bibr" rid="B8">8</xref>
,
<xref ref-type="bibr" rid="B12">12</xref>
].</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption>
<p>
<bold>Configuration scores per
<bold>
<italic>n</italic>
</bold>
-mer word length for the metagenomic dataset</bold>
. The graph on the left was generated under the configuration (
<italic>F</italic>
,
<italic>-</italic>
,
<italic>-</italic>
,
<italic>kl</italic>
,
<italic>c</italic>
) and the graph on the right was generated under the configuration (
<italic>F</italic>
,
<italic>-</italic>
,
<italic>-</italic>
,
<italic>kl</italic>
,
<italic>h</italic>
). All of the seven taxonomic levels are considered.</p>
</caption>
<graphic xlink:href="1471-2164-13-S5-S1-6"></graphic>
</fig>
<p>In summary, we observed that the scores associated with metagenomic data are in general smaller than the scores generated with genomic data, and using a hierarchical classification approach in this case appears to be even more beneficial. Moreover, the value of
<italic>n </italic>
that generated the best results decreased from
<italic>n </italic>
≈ 7 to
<italic>n </italic>
≈ 4, which indicates that, when dealing with metagenomic fragments with approximately 400
<italic>bp</italic>
, there is no point in using frequency vectors that have a dimension much higher than 256.</p>
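<p>The arithmetic behind this observation can be sketched as follows. The sketch assumes an error-free fragment of exactly 400 bp and overlapping n-mers; <monospace>windows_per_dimension</monospace> and <monospace>FRAGMENT_LENGTH</monospace> are hypothetical names introduced here for illustration only.</p>
<preformat><![CDATA[
```python
def windows_per_dimension(fragment_length, n):
    """Average number of observed n-mers per coordinate of the frequency vector."""
    windows = fragment_length - n + 1   # overlapping n-mers in one fragment
    dimension = 4 ** n                  # possible n-mers over {A, C, G, T}
    return windows / dimension


FRAGMENT_LENGTH = 400  # assumed: the approximate fragment length quoted in the text

for n in range(1, 11):
    ratio = windows_per_dimension(FRAGMENT_LENGTH, n)
    print(f"n = {n:2d}  dimension = {4 ** n:7d}  windows per dimension = {ratio:.4f}")
```
]]></preformat>
<p>Under these assumptions, n = 4 is the largest value for which a single fragment contains more n-mers (397) than the vector has coordinates (256). From n = 5 onward most coordinates of a fragment's vector are necessarily zero, which is consistent with the shift of the best n toward smaller values for short fragments.</p>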
</sec>
<sec>
<title>Discussion</title>
<p>In this section we summarize the results presented in the previous sections and provide an overview of our analysis. To accomplish those goals, we show in Figure
<xref ref-type="fig" rid="F7">7</xref>
the scores that were generated with the genomic data at the level of species as a function of
<italic>n </italic>
and
<italic>s</italic>
, and in Figure
<xref ref-type="fig" rid="F8">8</xref>
we show the same information for the scores generated with the metagenomic dataset. From examining these figures, we arrive at the following conclusions:</p>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption>
<p>
<bold>Scores as a function of
<italic>n </italic>
and
<italic>s </italic>
for a genomic dataset (
<italic>d </italic>
=
<italic>G</italic>
)</bold>
. The
<italic>x</italic>
-axis represents the length of the
<italic>n</italic>
-mer sequences. The top graph is the conventional score function (
<italic>G</italic>
,
<italic>sp</italic>
,
<italic>-</italic>
,
<italic>-</italic>
,
<italic>c</italic>
), and the bottom graph is the hierarchical score function (
<italic>G</italic>
,
<italic>sp</italic>
,
<italic>-</italic>
,
<italic>-</italic>
,
<italic>h</italic>
).</p>
</caption>
<graphic xlink:href="1471-2164-13-S5-S1-7"></graphic>
</fig>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption>
<p>
<bold>Scores as a function of
<italic>n </italic>
and
<italic>s </italic>
for synthetic metagenomic dataset
<bold>(
<italic>d </italic>
=
<italic>F</italic>
)</bold>
</bold>
. The
<italic>x</italic>
-axis represents the length of the
<italic>n</italic>
-mer sequences. The top graph is the conventional score function (
<italic>F</italic>
,
<italic>sp</italic>
,
<italic>-</italic>
,
<italic>-</italic>
,
<italic>c</italic>
), and the bottom graph is the hierarchical score function (
<italic>F</italic>
,
<italic>sp</italic>
,
<italic>-</italic>
,
<italic>-</italic>
,
<italic>h</italic>
).</p>
</caption>
<graphic xlink:href="1471-2164-13-S5-S1-8"></graphic>
</fig>
<p>
The scores are an approximately concave function of
<italic>n </italic>
with a maximum value that is between 4 and 7; the "optimal" value of
<italic>n </italic>
is smaller for the metagenomic dataset.</p>
<p>
Changing the similarity measure
<italic>s </italic>
does not have a strong effect on the scores.</p>
<p>
The hierarchical classification scheme appears to be a better alternative for both genomic and metagenomic data; however, in the latter case, its advantage over the conventional classification approach is more evident.</p>
<p>
In general, the scores associated with the metagenomic data are smaller than the scores associated with the genomic data, and the difference is more pronounced under the conventional classification scheme.</p>
<p>In conclusion, we show in Table
<xref ref-type="table" rid="T2">2</xref>
the configurations that produced the best results in both datasets. The values shown in Table
<xref ref-type="table" rid="T2">2</xref>
can serve as a starting point in the development of composition-based metagenomic classifiers.</p>
<table-wrap id="T2" position="float">
<label>Table 2</label>
<caption>
<p>Configuration scores referring to the best
<italic>n </italic>
and
<italic>s </italic>
values.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th></th>
<th align="center" colspan="3">Conventional</th>
<th align="center" colspan="3">Hierarchical</th>
</tr>
<tr>
<th></th>
<th align="center">
<italic>n</italic>
</th>
<th align="center">
<italic>s</italic>
</th>
<th align="center">
<italic>score</italic>
</th>
<th align="center">
<italic>n</italic>
</th>
<th align="center">
<italic>s</italic>
</th>
<th align="center">
<italic>score</italic>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">Genomes</td>
<td align="center">5</td>
<td align="center">2</td>
<td align="center">0.512</td>
<td align="center">7</td>
<td align="center">kl</td>
<td align="center">0.532</td>
</tr>
<tr>
<td align="center">Fragments</td>
<td align="center">4</td>
<td align="center">kl</td>
<td align="center">0.112</td>
<td align="center">4</td>
<td align="center">1</td>
<td align="center">0.501</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The scores were produced considering both score measures and both datasets, at the species taxonomic level.</p>
</table-wrap-foot>
</table-wrap>
</sec>
</sec>
<sec sec-type="conclusions">
<title>Conclusions</title>
<p>Taxonomic classification is an essential step within a metagenomic study, because it is the first step of the analysis and its result serves as the basis for subsequent investigations. Usually, composition-based metagenomic classifiers are configured based on preliminary experiments that account for a specific type of classifier. In this work we proposed to shift the focus of the analysis to the classification task itself. To make this shift, we presented a general framework that can be used to study the impact of several decisions on the difficulty of the classification problem (that is, how "separable" the classes are under different configurations of the task).</p>
<p>In this work we focused the analysis on the impact of three factors in particular: (i) the length of the
<italic>n</italic>
-mers used to encode the DNA sequences; (ii) the similarity measure used to compare frequency vectors; and (iii) the underlying classification scheme (hierarchical or conventional). The results presented provide some intuition on how the difficulty of the classification problem changes as a function of the factors above. Because our analysis does not assume any structure of the classification problem, it can be used as a guideline for the development of composition-based metagenomic classifiers of any type. Moreover, the framework presented in this work can be used to analyze the impact of other factors on the taxonomic classification task.</p>
</sec>
<sec>
<title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec>
<title>Authors' contributions</title>
<p>SH performed the experiments, helped in the interpretation of the results, and wrote the manuscript. AMSB analyzed and interpreted the results, and helped in writing the manuscript. MEC helped in the experiments, and reviewed the manuscript. ATRV reviewed the manuscript, and conceived the project. All of the authors have approved the final version to be published.</p>
</sec>
</body>
<back>
<sec>
<title>Acknowledgements</title>
<p>We thank Douglas Adriano Augusto for important help with the computational experiments. We also thank Carlos Henrique Brandt for granting permission to run some of the experiments on the clusters of CESUP (Centro Nacional de Supercomputação). This work was funded by CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior).</p>
<p>This article has been published as part of
<italic>BMC Genomics </italic>
Volume 13 Supplement 5, 2012: Proceedings of the International Conference of the Brazilian Association for Bioinformatics and Computational Biology (X-meeting 2011). The full contents of the supplement are available online at
<ext-link ext-link-type="uri" xlink:href="http://www.biomedcentral.com/bmcgenomics/supplements/13/S5">http://www.biomedcentral.com/bmcgenomics/supplements/13/S5</ext-link>
.</p>
</sec>
<ref-list>
<ref id="B1">
<mixed-citation publication-type="journal">
<name>
<surname>Handelsman</surname>
<given-names>J</given-names>
</name>
<article-title>Metagenomics: Application of Genomics to Uncultured Microorganisms</article-title>
<source>Microbiology and molecular biology reviews: MMBR</source>
<year>2004</year>
<volume>68</volume>
<issue>4</issue>
<fpage>669</fpage>
<lpage>685</lpage>
<pub-id pub-id-type="doi">10.1128/MMBR.68.4.669-685.2004</pub-id>
<pub-id pub-id-type="pmid">15590779</pub-id>
</mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="other">
<name>
<surname>Tringe</surname>
<given-names>SG</given-names>
</name>
<name>
<surname>Rubin</surname>
<given-names>EM</given-names>
</name>
<article-title>Metagenomics: DNA sequencing of environmental samples</article-title>
<source>eScholarship</source>
<year>2005</year>
</mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="journal">
<name>
<surname>Schreiber</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Gumrich</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Daniel</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Meinicke</surname>
<given-names>P</given-names>
</name>
<article-title>Treephyler: fast taxonomic profiling of metagenomes</article-title>
<source>Bioinformatics (Oxford, England)</source>
<year>2010</year>
<volume>26</volume>
<issue>7</issue>
<fpage>960</fpage>
<lpage>961</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btq070</pub-id>
</mixed-citation>
</ref>
<ref id="B4">
<mixed-citation publication-type="journal">
<name>
<surname>McHardy</surname>
<given-names>AC</given-names>
</name>
<name>
<surname>Martín</surname>
<given-names>HG</given-names>
</name>
<name>
<surname>Tsirigos</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Hugenholtz</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Rigoutsos</surname>
<given-names>I</given-names>
</name>
<article-title>Accurate phylogenetic classification of variable-length DNA fragments</article-title>
<source>Nature Methods</source>
<year>2007</year>
<volume>4</volume>
<fpage>63</fpage>
<lpage>72</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth976</pub-id>
<pub-id pub-id-type="pmid">17179938</pub-id>
</mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="journal">
<name>
<surname>Altschul</surname>
<given-names>SF</given-names>
</name>
<name>
<surname>Gish</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Myers</surname>
<given-names>EW</given-names>
</name>
<name>
<surname>Lipman</surname>
<given-names>DJ</given-names>
</name>
<article-title>Basic Local Alignment Search Tool</article-title>
<source>Journal of Molecular Biology</source>
<year>1990</year>
<volume>215</volume>
<issue>3</issue>
<fpage>403</fpage>
<lpage>410</lpage>
<pub-id pub-id-type="pmid">2231712</pub-id>
</mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="journal">
<name>
<surname>Krause</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Diaz</surname>
<given-names>NN</given-names>
</name>
<name>
<surname>Goesmann</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Kelley</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Nattkemper</surname>
<given-names>TW</given-names>
</name>
<name>
<surname>Rohwer</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Edwards</surname>
<given-names>RA</given-names>
</name>
<name>
<surname>Stoye</surname>
<given-names>J</given-names>
</name>
<article-title>Phylogenetic classification of short environmental DNA fragments</article-title>
<source>Nucleic acids research</source>
<year>2008</year>
<volume>36</volume>
<issue>7</issue>
<fpage>2230</fpage>
<lpage>2239</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkn038</pub-id>
<pub-id pub-id-type="pmid">18285365</pub-id>
</mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="journal">
<name>
<surname>Huson</surname>
<given-names>DH</given-names>
</name>
<name>
<surname>Auch</surname>
<given-names>AF</given-names>
</name>
<name>
<surname>Qi</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Schuster</surname>
<given-names>SC</given-names>
</name>
<article-title>MEGAN analysis of metagenomic data</article-title>
<source>Genome research</source>
<year>2007</year>
<volume>17</volume>
<issue>3</issue>
<fpage>377</fpage>
<lpage>386</lpage>
<pub-id pub-id-type="doi">10.1101/gr.5969107</pub-id>
<pub-id pub-id-type="pmid">17255551</pub-id>
</mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="journal">
<name>
<surname>Diaz</surname>
<given-names>NN</given-names>
</name>
<name>
<surname>Krause</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Goesmann</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Niehaus</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Nattkemper</surname>
<given-names>TW</given-names>
</name>
<article-title>TACOA: taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach</article-title>
<source>BMC bioinformatics</source>
<year>2009</year>
<volume>10</volume>
<fpage>56</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-10-56</pub-id>
<pub-id pub-id-type="pmid">19210774</pub-id>
</mixed-citation>
</ref>
<ref id="B9">
<mixed-citation publication-type="journal">
<name>
<surname>Saeed</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Halgamuge</surname>
<given-names>SK</given-names>
</name>
<article-title>The oligonucleotide frequency derived error gradient and its application to the binning of metagenome fragments</article-title>
<source>BMC genomics</source>
<year>2009</year>
<volume>10</volume>
<issue>Suppl 3</issue>
<fpage>S10</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2164-10-S3-S10</pub-id>
<pub-id pub-id-type="pmid">19958473</pub-id>
</mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="journal">
<name>
<surname>Karlin</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Mrázek</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Campbell</surname>
<given-names>AM</given-names>
</name>
<article-title>Compositional biases of bacterial genomes and evolutionary implications</article-title>
<source>Journal of bacteriology</source>
<year>1997</year>
<volume>179</volume>
<issue>12</issue>
<fpage>3899</fpage>
<lpage>3913</lpage>
<pub-id pub-id-type="pmid">9190805</pub-id>
</mixed-citation>
</ref>
<ref id="B11">
<mixed-citation publication-type="journal">
<name>
<surname>Karlin</surname>
<given-names>AC</given-names>
</name>
<name>
<surname>Mrázek</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Samuel</surname>
</name>
<article-title>Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA</article-title>
<source>Molecular Phylogenetics and Evolution</source>
<year>1999</year>
<volume>11</volume>
<issue>3</issue>
<fpage>343</fpage>
<lpage>350</lpage>
<pub-id pub-id-type="doi">10.1006/mpev.1998.0567</pub-id>
<pub-id pub-id-type="pmid">10196076</pub-id>
</mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="journal">
<name>
<surname>Brady</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
<article-title>Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models</article-title>
<source>Nature Methods</source>
<year>2009</year>
<volume>6</volume>
<issue>9</issue>
</mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="journal">
<name>
<surname>Rosen</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Garbarine</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Caseiro</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Polikar</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Sokhansanj</surname>
<given-names>B</given-names>
</name>
<article-title>Metagenome fragment classification using N-mer frequency profiles</article-title>
<source>Advances in bioinformatics</source>
<year>2008</year>
<volume>2008</volume>
<fpage>205969</fpage>
<pub-id pub-id-type="pmid">19956701</pub-id>
</mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="other">
<name>
<surname>Benson</surname>
<given-names>DA</given-names>
</name>
<name>
<surname>Karsch-Mizrachi</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Lipman</surname>
<given-names>DJ</given-names>
</name>
<name>
<surname>Ostell</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Sayers</surname>
<given-names>EW</given-names>
</name>
<article-title>GenBank</article-title>
<source>Nucleic acids research</source>
<year>2010</year>
<issue>38 Database</issue>
<fpage>46</fpage>
<lpage>51</lpage>
</mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="journal">
<name>
<surname>Doolittle</surname>
<given-names>WF</given-names>
</name>
<name>
<surname>Zhaxybayeva</surname>
<given-names>O</given-names>
</name>
<article-title>Metagenomics and the Units of Biological Organization</article-title>
<source>BioScience</source>
<year>2010</year>
<volume>60</volume>
<issue>2</issue>
<fpage>102</fpage>
<lpage>112</lpage>
<pub-id pub-id-type="doi">10.1525/bio.2010.60.2.5</pub-id>
</mixed-citation>
</ref>
<ref id="B16">
<mixed-citation publication-type="journal">
<name>
<surname>Richter</surname>
<given-names>DC</given-names>
</name>
<name>
<surname>Ott</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Auch</surname>
<given-names>AF</given-names>
</name>
<name>
<surname>Schmid</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Huson</surname>
<given-names>DH</given-names>
</name>
<article-title>MetaSim: a sequencing simulator for genomics and metagenomics</article-title>
<source>PloS one</source>
<year>2008</year>
<volume>3</volume>
<issue>10</issue>
<fpage>.</fpage>
</mixed-citation>
</ref>
<ref id="B17">
<mixed-citation publication-type="journal">
<name>
<surname>Margulies</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Egholm</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Altman</surname>
<given-names>WE</given-names>
</name>
<name>
<surname>Attiya</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Bader</surname>
<given-names>JS</given-names>
</name>
<name>
<surname>Bemben</surname>
<given-names>LA</given-names>
</name>
<name>
<surname>Berka</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Braverman</surname>
<given-names>MS</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>YJ</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Dewell</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Du</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Fierro</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Gomes</surname>
<given-names>XV</given-names>
</name>
<name>
<surname>Goodwin</surname>
<given-names>BC</given-names>
</name>
<name>
<surname>He</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Helgesen</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Ho</surname>
<given-names>CH</given-names>
</name>
<name>
<surname>Irzyk</surname>
<given-names>GP</given-names>
</name>
<name>
<surname>Jando</surname>
<given-names>SC</given-names>
</name>
<name>
<surname>I</surname>
<given-names>ML</given-names>
</name>
<name>
<surname>Jarvie</surname>
<given-names>TP</given-names>
</name>
<name>
<surname>Jirage</surname>
<given-names>KB</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>JB</given-names>
</name>
<name>
<surname>Knight</surname>
<given-names>JR</given-names>
</name>
<name>
<surname>Lanza</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Leamon</surname>
<given-names>JH</given-names>
</name>
<name>
<surname>Lefkowitz</surname>
<given-names>SM</given-names>
</name>
<name>
<surname>Lei</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>J</given-names>
</name>
<name>
<surname>L</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Makhijani</surname>
<given-names>VB</given-names>
</name>
<name>
<surname>Mcdade</surname>
<given-names>KE</given-names>
</name>
<name>
<surname>Mckenna</surname>
<given-names>MP</given-names>
</name>
<name>
<surname>Myers</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Nickerson</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Nobile</surname>
<given-names>JR</given-names>
</name>
<name>
<surname>Plant</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Puc</surname>
<given-names>BP</given-names>
</name>
<name>
<surname>Ronan</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Roth</surname>
<given-names>GT</given-names>
</name>
<name>
<surname>Sarkis</surname>
<given-names>GJ</given-names>
</name>
<name>
<surname>Simons</surname>
<given-names>JF</given-names>
</name>
<name>
<surname>Simpson</surname>
<given-names>JW</given-names>
</name>
<name>
<surname>Srinivasan</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Tartaro</surname>
<given-names>KR</given-names>
</name>
<name>
<surname>Tomasz</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Vogt</surname>
<given-names>KA</given-names>
</name>
<name>
<surname>A</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>SH</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Weiner</surname>
<given-names>MP</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>P</given-names>
</name>
<name>
<surname>F</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Rothberg</surname>
<given-names>JM</given-names>
</name>
<article-title>Genome Sequencing in Open Microfabricated High Density Picoliter Reactors</article-title>
<source>Nature</source>
<year>2005</year>
<volume>437</volume>
<issue>7057</issue>
<fpage>376</fpage>
<lpage>380</lpage>
</mixed-citation>
</ref>
<ref id="B18">
<mixed-citation publication-type="book">
<name>
<surname>Russell</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Norvig</surname>
<given-names>P</given-names>
</name>
<source>Artificial Intelligence: A Modern Approach</source>
<year>2003</year>
<publisher-name>Prentice Hall Series in Artificial Intelligence</publisher-name>
</mixed-citation>
</ref>
<ref id="B19">
<mixed-citation publication-type="journal">
<name>
<surname>Stajich</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Block</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Boulez</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Brenner</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Chervitz</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Dagdigian</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Fuellen</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Gilbert</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Korf</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Lapp</surname>
<given-names>H</given-names>
</name>
<etal></etal>
<article-title>The Bioperl toolkit: Perl modules for the life sciences</article-title>
<source>Genome research</source>
<year>2002</year>
<volume>12</volume>
<issue>10</issue>
<fpage>1611</fpage>
<pub-id pub-id-type="doi">10.1101/gr.361602</pub-id>
<pub-id pub-id-type="pmid">12368254</pub-id>
</mixed-citation>
</ref>
<ref id="B20">
<mixed-citation publication-type="other">
<name>
<surname>Hastie</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Tibshirani</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Friedman</surname>
<given-names>J</given-names>
</name>
<source>The Elements of Statistical Learning</source>
<year>2009</year>
</mixed-citation>
</ref>
<ref id="B21">
<mixed-citation publication-type="book">
<name>
<surname>Schölkopf</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Smola</surname>
<given-names>AJ</given-names>
</name>
<source>Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond</source>
<year>2001</year>
<publisher-name>MIT Press</publisher-name>
</mixed-citation>
</ref>
<ref id="B22">
<mixed-citation publication-type="journal">
<name>
<surname>Cover</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Hart</surname>
<given-names>P</given-names>
</name>
<article-title>Nearest neighbor pattern classification</article-title>
<source>IEEE Transactions on Information Theory</source>
<year>1967</year>
<volume>13</volume>
<fpage>21</fpage>
<lpage>27</lpage>
</mixed-citation>
</ref>
<ref id="B23">
<mixed-citation publication-type="other">
<name>
<surname>Zheng</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>H</given-names>
</name>
<article-title>A novel LDA and PCA-based hierarchical scheme for metagenomic fragment binning</article-title>
<source>2009 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology</source>
<year>2009</year>
<fpage>53</fpage>
<lpage>59</lpage>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000955 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000955 | SxmlIndent | more

To add a link to this page in the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:3477002
   |texte=   Analysis of composition-based metagenomic classification
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:23095761" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021